Research Article

Encoding Retina Image to Words using Ensemble of Vision Transformers for Diabetic Retinopathy Grading

[version 1; peer review: 1 not approved]
PUBLISHED 21 Sep 2021

Abstract

Diabetes is one of the top ten causes of death among adults worldwide. People with diabetes are prone to eye diseases such as diabetic retinopathy (DR). DR damages the blood vessels in the retina and can result in vision loss. DR grading is an essential step for early diagnosis, effective treatment, and slowing the progression to vision impairment. Existing automatic solutions are mostly based on traditional image processing and machine learning techniques, so there is still a considerable gap in generic detection and grading of DR. Various deep learning models such as convolutional neural networks (CNNs) have previously been utilized for this purpose. To enhance DR grading, this paper proposes a novel solution based on an ensemble of state-of-the-art deep learning models called vision transformers. A challenging public DR dataset proposed in a 2015 Kaggle challenge was used for training and evaluation of the proposed method. This dataset includes highly imbalanced data with five levels of severity: No DR, Mild, Moderate, Severe, and Proliferative DR. The experiments conducted showed that the proposed solution outperforms existing methods in terms of precision (47%), recall (45%), F1 score (42%), and quadratic weighted kappa (QWK) (60.2%), and it runs with a low inference time (1.12 seconds). For these reasons, the proposed solution can help examiners grade DR more accurately than manual means.

Keywords

Diabetic Retinopathy Grading, Ensemble Learning, Imbalanced Data, Vision Transformer, Self-attention Mechanism

Introduction

Background

Diabetes mellitus (DM) is a group of metabolic disorders that are characterized by high levels of blood glucose and are caused by either the deficient secretion of the hormone insulin, its inaction, or both. Chronically high levels of glucose in the blood that come with DM may bring about long-term damage to several different organs, such as the eyes.1,2 DM is a pandemic of great concern3-6 as approximately 463 million adults were living with DM in 2019. This number is expected to rise to about 700 million by the year 2045.4

High levels of glucose in the blood damage the capillaries of the retina (diabetic retinopathy [DR]) or the optic nerve (glaucoma), cloud the lens (cataract), or cause fluid to build up in the macula (diabetic macular edema), thereby causing diabetic eye disease.6-11 DR is the leading cause of blindness among working-age adults12 and brings about several personal and socioeconomic consequences,13 as well as a greater risk of developing other complications of DM and of dying.14 According to a meta-analysis that reviewed 35 studies worldwide from 1980 to 2008, 34.6% of all patients with DM globally have DR of some form, while 10.2% of all patients with DM have vision-threatening DR.15

A study found that screening for DR and the early treatment thereof could lower the risk of vision loss by about 56%,16 proving that blindness due to DR is highly preventable. Moreover, the World Health Organization (WHO) Universal Eye Health: A Global Action Plan 2014–2019 advocated for efforts to reduce the prevalence of preventable visual impairments and blindness including those that arise as complications of DM.

Many tests can be used for the screening of DR. While sensitivity and specificity are certainly important, the reported performance of DR screening tests varies. Researchers employ different outcomes to measure sensitivity, e.g., the ability of a screening test to detect any form of retinopathy, or the ability to detect vision-threatening DR. Additionally, according to the World Health Organization's Diabetic retinopathy screening: a short guide (WHO Regional Office for Europe, Copenhagen), some tests may detect diabetic macular edema better than the different grades of DR. The examiner's skill is also a source of variation in test results. A systematic review found that the sensitivity of direct ophthalmoscopy (DO) varies greatly when performed by general practitioners (25%–66%) and by ophthalmologists (43%–79%).17

DR grading is an essential step in the early diagnosis and effective treatment of the disease. Manual grading is based on high-resolution retinal images examined by a clinician. However, the process is time-consuming and is prone to misdiagnosis. This paper aims to address the matter by developing a fast and accurate automated DR grading system. Here, a novel solution that is based on an ensemble of vision transformers was proposed to enhance grading. Moreover, a public DR dataset proposed in a 2015 Kaggle challenge was used for the training and evaluation.

Related work

Traditional machine learning (ML) methods have been used to detect DR. Typically, these ML methods require hand-tuned features extracted from small datasets to aid in classification. These traditional methods may involve ensemble learning18; the calculation of the mean, standard deviation, and edge strength19; and the segmentation of hard macular exudates.20,21 However, these methods require tedious and time-consuming feature engineering steps that are sensitive to the chosen set of features. Works that employ traditional ML methods to detect DR usually yield favorable results on one dataset but fail to achieve similar success when another dataset is used.18,19 This is a common limitation of hand-crafted features.

Deep neural networks, such as CNNs, trained on much larger datasets have also been used for classification tasks in the diagnosis and grading of DR. These methods include CNNs developed from scratch to grade the disease using images of the retinal fundus22; transfer learning based on the Inception-v3 network to perform multiple binary classifications (moderate or worse DR, and severe or worse DR)23; and segmentation prior to detection by pixel classification24 or patch classification.25 A deep learning (DL)-based framework that uses advanced image processing and a boosting algorithm for DR grading has also been proposed.26 This is one of only a handful of works that have effectively employed transfer learning to train large neural networks for this purpose. Recently, ResNet, a deep CNN, was proposed to address the problem of imbalanced datasets in DR grading.27 Additionally, a bagging ensemble of three CNNs (a shallow CNN, VGG16, and InceptionV3) was used to classify images as DR, glaucoma, myopia, or normal.28

The transformer was originally proposed by Vaswani et al.29 for natural language processing (NLP) tasks, especially machine translation. Inspired by the success of transformers in NLP, they have since been transferred to computer vision tasks such as image classification.

Methods

In this section, the DR detection dataset is explored. Additionally, the vision transformer, a DL model that was used on these data, is discussed in detail.

Dataset overview

The DR detection dataset is highly imbalanced and consists of high-resolution images with five levels of severity: No_DR, Mild, Moderate, Severe, and Proliferative_DR. It has significantly more samples for the negative (No_DR) category than for the four positive categories. Table 1 shows the class distribution of the training and testing sets, while Figure 1 shows a few samples from each class. The images were captured under different conditions and are labeled with subject IDs, with left- and right-eye fields provided for every subject. The images were captured by different cameras, which affects the visual appearance of the left- and right-eye images.


Figure 1. A few samples of each class in the EyePACS Diabetic Retinopathy Detection dataset: “No_DR” (red borders), “Mild” (blue), “Moderate” (green), “Severe” (yellow), “Proliferative_DR” (violet).

The images have various sizes but were resized uniformly.

Table 1. Training and testing class distribution in the EyePACS Diabetic Retinopathy Detection dataset.

Class               Training   Testing
No_DR               25810      39533
Mild                2443       3762
Moderate            5292       7861
Severe              873        1214
Proliferative_DR    708        1206

The samples of the training set were rescaled to the range [0, 1], cropped to remove their black borders, and augmented by randomly flipping the samples horizontally and vertically and by randomly rotating them by up to 360°. The samples of the test set were only cropped and rescaled. Figure 2 shows a few augmented samples from the training set.
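As a rough illustration of this preprocessing pipeline, the sketch below uses TensorFlow (the framework reported later in the paper). The function name and the use of a Keras preprocessing layer are assumptions for illustration, not the authors' exact code, and the border-cropping step is assumed to happen beforehand.

```python
import tensorflow as tf

def preprocess(image, training):
    """Rescale a retina image to [0, 1]; during training, also flip and rotate it randomly."""
    image = tf.image.convert_image_dtype(image, tf.float32)  # rescale pixel values to [0, 1]
    image = tf.image.resize(image, (256, 256))                # uniform image size
    if training:
        image = tf.image.random_flip_left_right(image)        # random horizontal flip
        image = tf.image.random_flip_up_down(image)           # random vertical flip
        # Random rotation by any angle up to a full turn; the Keras layer here is an
        # illustrative stand-in for the rotation augmentation described in the text.
        image = tf.keras.layers.RandomRotation(factor=1.0)(image[None], training=True)[0]
    return image
```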


Figure 2. A few samples cropped and augmented randomly.

Vision transformer

A vision transformer is a state-of-the-art DL model used for image classification and was inspired by Dosovitskiy et al.30 Figure 3 shows the architecture of the vision transformer. In this paper, a retinal image was encoded as a sequence of patches, treated as a set of words, and fed to the transformer encoder, as shown in Figure 3. The image was divided into $N = HW/P^2$ patches of fixed size $P \times P$, where P = 16, W is the image width, H is the image height, and N is the number of patches. Each extracted patch was flattened into a vector $x_p \in \mathbb{R}^{P^2 \cdot C}$, where C is the number of channels.


Figure 3. The vision transformer architecture.30

As a result, the 2D image was converted into a sequence of patches $x \in \mathbb{R}^{N \times (P^2 \cdot C)}$. Each patch in the sequence $x$ was mapped to a latent vector with hidden size D = 768. A learnable class embedding $z_0^0 = x_{\mathrm{class}}$ was prepended to the embedded patches; its state at the output of the transformer encoder, $z_L^0$, serves as the representation $y$ of the image. A classifier was then attached to the image representation $y$. Additionally, a position embedding $E_{pos}$ was added to the patch embeddings to capture the order of the patches fed into the transformer encoder. Figure 4 illustrates the architecture of the transformer encoder with L blocks, each containing alternating multi-head self-attention (MSA)29 and multi-layer perceptron (MLP) layers. Layer normalization (LN)31 was applied before every block, and residual connections were applied after every block.30
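A minimal sketch of this patch-and-embed step as a Keras layer, using the dimensions reported in the experimental setup (256×256 images, P = 16, D = 768); this is an illustrative reconstruction, not the authors' implementation.

```python
import tensorflow as tf

class PatchEmbedding(tf.keras.layers.Layer):
    """Split an image into P x P patches, project each to D, prepend the class token,
    and add position embeddings, as described above."""

    def __init__(self, image_size=256, patch_size=16, hidden_dim=768):
        super().__init__()
        self.p = patch_size
        self.n = (image_size // patch_size) ** 2                 # number of patches N
        self.proj = tf.keras.layers.Dense(hidden_dim)            # linear patch projection
        self.cls_token = self.add_weight(
            name="cls_token", shape=(1, 1, hidden_dim), initializer="zeros")
        self.pos_emb = self.add_weight(
            name="pos_emb", shape=(1, self.n + 1, hidden_dim), initializer="zeros")

    def call(self, images):                                      # images: (batch, H, W, C)
        batch = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images, sizes=[1, self.p, self.p, 1], strides=[1, self.p, self.p, 1],
            rates=[1, 1, 1, 1], padding="VALID")
        patches = tf.reshape(patches, [batch, self.n, -1])       # flatten each P x P x C patch
        tokens = self.proj(patches)                              # (batch, N, D)
        cls = tf.repeat(self.cls_token, batch, axis=0)           # learnable class embedding
        tokens = tf.concat([cls, tokens], axis=1)                # (batch, N + 1, D)
        return tokens + self.pos_emb                             # add position embedding
```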


Figure 4. Encoder Architecture of the Transformer.30
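For concreteness, a single pre-norm encoder block of this kind can be sketched as follows, assuming the hyperparameters reported later (12 heads, D = 768, MLP size 3072); the dropout rate is an illustrative assumption.

```python
import tensorflow as tf

def encoder_block(x, num_heads=12, hidden_dim=768, mlp_dim=3072, dropout=0.1):
    """One pre-norm encoder block: LN -> MSA -> residual, then LN -> MLP -> residual."""
    # Multi-head self-attention sub-block, with layer normalization applied first.
    h = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    h = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=hidden_dim // num_heads, dropout=dropout)(h, h)
    x = x + h                                               # residual connection
    # MLP sub-block (two dense layers with GELU activation, as in the original ViT).
    h = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    h = tf.keras.layers.Dense(mlp_dim, activation="gelu")(h)
    h = tf.keras.layers.Dropout(dropout)(h)
    h = tf.keras.layers.Dense(hidden_dim)(h)
    return x + h                                            # residual connection
```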

Ensemble learning of vision transformers

Ensemble learning is an ML meta-algorithm that combines several base models. Bagging (bootstrap aggregating) is a type of ensemble learning that uses “majority voting” to combine the outputs of different base models into one optimal predictive model, improving stability and accuracy.32

The advantage of bagging several transformers is that the aggregation of several transformers, each trained on a subset of the dataset, outperforms a single transformer trained over the entire set. In other words, it reduces overfitting by lowering the variance of high-variance, low-bias base models. To speed up training, each transformer can be trained in parallel on its own data prior to result aggregation, as shown in Figure 5.
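A minimal sketch of this majority-voting aggregation, assuming each base model exposes a `predict` method that returns integer class labels (an illustrative interface, not the authors' API); ties are broken toward the lower class id.

```python
import numpy as np

def bagging_predict(models, images):
    """Aggregate the class predicted by each independently trained model by majority vote."""
    votes = np.stack([np.asarray(m.predict(images), dtype=int) for m in models])  # (n_models, n_samples)
    # For each sample, count the votes per class and keep the most frequent one.
    return np.array([np.bincount(col, minlength=5).argmax() for col in votes.T])
```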


Figure 5. Ensemble learning of vision transformers.

Experimental setup and protocol

The images in this dataset were resized to H = 256, W = 256; the latent vector hidden size was set to D = 768, the number of transformer layers to L = 12, the MLP size to 3072, the number of MSA heads to 12, and the patch size to its default value P = 16. Thus, the sequence length was $N = HW/P^2 = 256$.

In the experiments conducted, 20% from each class in the training set were selected for validation. All transformers were fine-tuned using the weights of the transformer pre-trained on ImageNet-21K.33
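A per-class 20% hold-out of this kind corresponds to a stratified split; a minimal sketch using scikit-learn, where the variable names and the random seed are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split

# image_paths and labels hold the training metadata (illustrative names).
train_paths, val_paths, train_labels, val_labels = train_test_split(
    image_paths, labels, test_size=0.20, stratify=labels, random_state=42)
```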

For optimization, the ADAM algorithm34 was utilized with a batch size of 8. Furthermore, the mean squared error loss function was used. The training process for each transformer consists of two stages:

  • 1) All layers in the transformer backbone were frozen, while the randomly initialized regression head was left trainable. The regression head was then trained for five epochs.

  • 2) The entire model (transformer backbone + regression head) was unfrozen and trained for 40 epochs. A sketch of this two-stage schedule is given below.
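The following is a hedged sketch of this two-stage schedule in Keras; the `backbone` object (a pre-trained ViT returning the image representation y), the early-stopping patience, and the other names are assumptions for illustration, and the batch size of 8 would be set when building `train_ds`.

```python
import tensorflow as tf

def fine_tune(backbone, train_ds, val_ds):
    """Stage 1: train only the regression head. Stage 2: unfreeze and train everything."""
    inputs = tf.keras.Input(shape=(256, 256, 3))
    outputs = tf.keras.layers.Dense(1)(backbone(inputs))   # single-node regression head
    model = tf.keras.Model(inputs, outputs)

    # Stage 1: freeze the pre-trained backbone, train the randomly initialized head.
    backbone.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mean_squared_error")
    model.fit(train_ds, validation_data=val_ds, epochs=5)

    # Stage 2: unfreeze the whole model and fine-tune it end to end.
    backbone.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mean_squared_error")
    model.fit(train_ds, validation_data=val_ds, epochs=40,
              callbacks=[tf.keras.callbacks.EarlyStopping(patience=5,
                                                          restore_best_weights=True)])
    return model
```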

Data augmentation, early stopping, dropout, and learning rate schedules were used to prevent overfitting and loss divergence. Figure 6 shows the attention map of a few samples extracted from the transformer.


Figure 6. The Attention Map of samples A) No DR, B) Mild, C) Moderate, D) Severe, E) Proliferative DR.

The classification head of each transformer was removed and replaced by a regression head with a single node instead of logits. The regression output of a transformer was then converted into a category as shown in Table 2.

Table 2. Pseudocode for the transformer regression output interpretation.

Algorithm: Regression output interpretation
function classify(x_sample) {
  inputs:  x_sample, the floating-point output of the transformer's regression head
  outputs: y_sample, the class of the presented sample
  if x_sample < 0.8 then
    return y_sample = No_DR
  else if x_sample < 1.5 then
    return y_sample = Mild
  else if x_sample < 2.5 then
    return y_sample = Moderate
  else if x_sample < 3.5 then
    return y_sample = Severe
  else
    return y_sample = Proliferative_DR
}

An ensemble of ten transformers with identical architectures and hyperparameters was used. The training samples were divided randomly into ten subsets, and each transformer was trained on one of them. After interpreting the regression output of each transformer, the predicted classes of the ten transformers were aggregated by “majority voting” to predict the final class.
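Putting Table 2 and the voting step together, a minimal sketch follows; the names are illustrative, and each transformer is assumed to expose a `predict` method returning its scalar regression outputs.

```python
import numpy as np

THRESHOLDS = [0.8, 1.5, 2.5, 3.5]   # class boundaries from Table 2

def regression_to_class(x):
    """Map one scalar regression output to a class id 0..4 using the Table 2 thresholds."""
    return int(np.searchsorted(THRESHOLDS, x, side="right"))

def ensemble_grade(transformers, images):
    """Interpret each transformer's regression output, then take the per-sample majority vote."""
    votes = np.array([[regression_to_class(x) for x in np.ravel(t.predict(images))]
                      for t in transformers])               # shape (n_models, n_samples)
    return np.array([np.bincount(col, minlength=5).argmax() for col in votes.T])
```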

Training, validation, and testing were carried out using the TensorFlow framework on an NVIDIA Tesla T4 GPU.

Results and discussion

Performance metrics

In this section, the results of the proposed ensemble of transformers are discussed. Performance metrics such as precision, recall, and F1 score were calculated. Additionally, the quadratic weighted kappa (QWK) metric was used because the images in this dataset had to be labeled manually by specialists, as the small differences among the classes can only be recognized by specialist physicians. QWK, which lies in the range [−1, +1], measures the agreement between two sets of ratings; here it was calculated between the scores assigned by human raters (doctors) and the scores predicted by the models, and its interpretation is shown in Table 3. The dataset has five ratings: 0, 1, 2, 3, 4.

Table 3. QWK interpretation.

Kappa          Agreement
< 0            No
0.01 – 0.20    Slight
0.21 – 0.40    Fair
0.41 – 0.60    Moderate
0.61 – 0.80    Substantial
0.81 – 0.99    Almost perfect

QWK was calculated as follows:

  • 1) The confusion matrix O between the predicted and actual ratings was calculated.

  • 2) A histogram vector was computed for the ratings in the predictions and another for the actual ratings.

  • 3) The N×N matrix E, which is the outer product of the two histogram vectors, was calculated.

  • 4) The N×N weight matrix W, which represents the difference between the ratings, was constructed as shown in Table 4.35

    (1)
    $W_{i,j} = \dfrac{(i-j)^2}{(N-1)^2}$

    where 1 ≤ i ≤ 5 and 1 ≤ j ≤ 5.

  • 5) QWK was defined as follows35:

    (2)
    $\mathrm{QWK} = 1 - \dfrac{\sum_{i}^{N}\sum_{j}^{N} W_{i,j}\,O_{i,j}}{\sum_{i}^{N}\sum_{j}^{N} W_{i,j}\,E_{i,j}}$

    where N is the number of classes.

Table 4. The Weight Matrix W represents the difference between the classes.

                    No_DR    Mild     Moderate   Severe   Proliferative_DR
No_DR               0        0.0625   0.25       0.5625   1
Mild                0.0625   0        0.0625     0.25     0.5625
Moderate            0.25     0.0625   0          0.0625   0.25
Severe              0.5625   0.25     0.0625     0        0.0625
Proliferative_DR    1        0.5625   0.25       0.0625   0
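A compact sketch of this computation (steps 1–5 and Eqs. (1)–(2)) is given below; scaling E so that it has the same total count as O follows the standard QWK convention, which the text does not spell out.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK between actual and predicted ratings encoded as integers 0..n_classes-1."""
    # Step 1: observed confusion matrix O.
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Step 2: histogram vectors of the actual and predicted ratings.
    hist_true = np.bincount(y_true, minlength=n_classes)
    hist_pred = np.bincount(y_pred, minlength=n_classes)
    # Step 3: expected matrix E = outer product of the histograms,
    # scaled so that E and O have the same total count (standard QWK convention).
    E = np.outer(hist_true, hist_pred).astype(float)
    E *= O.sum() / E.sum()
    # Step 4: quadratic weight matrix W_ij = (i - j)^2 / (N - 1)^2, as in Table 4.
    i, j = np.meshgrid(np.arange(n_classes), np.arange(n_classes), indexing="ij")
    W = (i - j) ** 2 / (n_classes - 1) ** 2
    # Step 5: QWK = 1 - sum(W * O) / sum(W * E).
    return 1.0 - (W * O).sum() / (W * E).sum()
```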

Experimental results

Table 5 shows the performance metrics of the ten transformers, each trained on a subset of the data. There is a large variation in performance among these individual transformers: Transformer_1 yielded a Kappa of 55.1%, whereas Transformer_10 yielded a Kappa of only 30.9%. Ensembles of various numbers of transformers, including all ten transformers, four transformers (1, 3, 8, 9), and other configurations, were also evaluated. The best model was an ensemble of two transformers (1, 3), which yielded a Kappa of 60.2%.

Table 5. Performance metrics for various ensemble models.

Model                            Precision %   Recall %   F1 Score %   QWK %
Transformer_1                    43            47         40           55.1
Transformer_2                    37            46         31           40.2
Transformer_3                    39            46         37           52.0
Transformer_4                    45            41         30           43.3
Transformer_5                    41            45         35           48.6
Transformer_6                    39            45         33           44.9
Transformer_7                    44            42         33           43.3
Transformer_8                    40            47         39           51.3
Transformer_9                    40            46         38           51.0
Transformer_10                   35            36         26           30.9
Ensemble of ten transformers     45            47         38           53.0
Ensemble of four transformers    44            47         41           57.5
Ensemble of two transformers     47            45         42           60.2

This Kappa is at the boundary between moderate and substantial agreement. These results show that the ensemble of transformers (1, 3), trained with fewer training images, outperformed the ensemble of ten transformers trained with five times as many images. Table 6 compares the performance of the ensemble of transformers with that of ensembles of ResNet50 CNNs. The ResNet50 CNN was transferred from ImageNet-1K, and its top layers were replaced by a support vector machine tuned on this dataset. The proposed ensemble of transformers outperformed the ensemble of ResNet50 CNNs significantly, by more than 18 percentage points of QWK.

Table 6. Comparison between the ensemble of transformers and the ensemble of ResNet50 CNNs.

Model                            Precision %   Recall %   F1 Score %   QWK %
Ensemble of two transformers     47            45         42           60.2
Ensemble of ten ResNet50         32            44         32           36.97
Ensemble of two ResNet50         35            40         35           41.52

The confusion matrices of each configuration, including the ensembles of ten, four, and two transformers and the ensemble of two ResNet50 CNNs, are shown in Figure 7. Confusion matrix (C), which corresponds to the best Kappa of 60.2%, shows that the model was able to separate the severe and proliferative DR categories on one side from the No DR and mild DR categories on the other.


Figure 7. The Confusion Matrix for the ensemble bagging of A) ten transformers, B) four transformers, C) two transformers, D) two ResNet50 CNN.

Conclusion

This study is a new attempt to demonstrate the capability of an ensemble bagging of vision transformers applied to retinal image classification for grading DR into five levels of severity. The experiments conducted showed that, even though the dataset was challenging, the proposed method was able to yield promising performance in terms of precision (47%), recall (45%), F1 score (42%), and QWK (60.2%), with a low inference time of 1.12 seconds. In future work, we intend to utilize a collection of various DR datasets to increase the size and variety of the training data, which would make it possible to train the proposed model from scratch instead of starting from weights pre-trained on ImageNet-21K, and thereby further enhance performance.

Author contributions

Conceptualization by N.A., M.A.M.; Data Curation by N.A.; Formal Analysis by N.A., H.A.K., M.A.M.; Funding Acquisition by H.A.K.; Investigation by N.A., J.L.F; Methodology by N.A., H.A.K., M.A.M.; Project Administration by H.A.K.; Software by N.A., M.A.M.; Validation by N.A., M.J.T.T.; Visualization by N.A.; Writing – Original Draft Preparation by N.A., M.A.M, J.L.F.; Writing – Review & Editing by N.A., H.A.K., M.J.T.T., M.A.M, J.L.F.

Ethics and consent

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

The retinal images are a public third-party dataset provided by EyePACS, a free platform for retinopathy screening.

Competing interests

None of the authors declare any competing interests.

Grant information

This research project was funded by Multimedia University, Malaysia.

Data availability

The dataset used in this work is accessible to the public on the Kaggle website. It was created in 2015 for the Kaggle Diabetic Retinopathy Detection competition. This competition is sponsored by the California Healthcare Foundation. Retinal images were provided by EyePACS, a free platform for retinopathy screening.

Open Peer Review

Reviewer Report 25 Nov 2021
Shruti Jain, Department of Electronics and Communication Engineering, Jaypee University of Information Technology, Solan, Himachal Pradesh, India
Status: Not Approved
  1. Add a section highlighting the main contributions of your methodology, with detailed reference to, and comparison with, existing work.
  2. I find it difficult to understand, from the abstract, the proposed methodology by which …