Pancreatic cancer grading in pathological images using deep learning convolutional neural networks

Background: Pancreatic cancer is one of the deadliest forms of cancer. Cancer grades define how aggressively the cancer will spread and give doctors an indication for making a proper prognosis and treatment plan. The current method of pancreatic cancer grading, by means of manual examination of the cancerous tissue following a biopsy, is time-consuming and often results in misdiagnosis and thus incorrect treatment. This paper presents an automated grading system for pancreatic cancer from pathology images, developed by comparing deep learning models on two different pathological stains. Methods: A transfer-learning technique was adopted by testing the method on 14 different ImageNet pre-trained models. The models were fine-tuned to be trained on our dataset. Results: From the experiment, the DenseNet models appeared to be the best at classifying the validation set, with up to 95.61% accuracy in grading pancreatic cancer despite the small sample set. Conclusions: To the best of our knowledge, this is the first work on grading pancreatic cancer based on pathology images. Previous works have either focused only on detection (benign or malignant), or on radiology images (computerized tomography [CT], magnetic resonance imaging [MRI] etc.). The proposed system can be very useful to pathologists in facilitating an automated or semi-automated cancer grading system, which can address the problems found in manual grading.


Introduction
Pancreatic cancer is one of the most lethal malignant neoplasms in the world,1 developed when cells multiply and grow out of control in the pancreas,2 forming cancer cells caused by mutations in their genes.3 Doctors commonly perform a biopsy to diagnose cancer when physical examination or imaging tests such as magnetic resonance imaging (MRI) and computerized tomography (CT) scans are insufficient. In pancreatic cancer, grading is essential for planning treatment but is currently done through meticulous microscopic examination.4 Limited work has been reported on the analysis of pathological images for pancreatic cancer. Niazi et al.5 presented a deep learning method to differentiate between pancreatic neuroendocrine tumor and non-tumor regions based on Ki67-stained biopsies, with the purpose of quantifying positive tumor cells in a hotspot. Up to now, there has been no successful implementation of artificial intelligence (AI) for classifying pancreatic cancer grade. The absence of such AI work motivates this research to use transfer-learning to grade pathological pancreatic cancer images using 14 deep learning (DL) models. This work can facilitate an automated cancer grading system to address the exhaustive work of manual grading.

Contributions
This work presents an automated grading system focusing on pancreatic cancer from pathology images, which, to the best of our knowledge, has not been done before. The work also contributes a comparison of the performance of 14 DL models on two different pathological stains, namely May-Grünwald-Giemsa and haematoxylin and eosin.

Pancreatic cancer and digital pathology
Pancreatic cancer is considered to be under-studied, and improvements in the diagnosis and prognosis of pancreatic cancer have therefore been minor.6 Digital pathology is an image-based environment obtained by scanning tissue samples from glass slides. Staining, usually using May-Grünwald-Giemsa (MGG) and haematoxylin and eosin (H&E) stains, is carried out on the tissue samples before digitization into whole-slide images. The cancer grade is identified by the degree of differentiation of the tumour cells,7 ranging from well to poorly differentiated as described in Table 1.

Deep learning and related works
The convolutional neural network (CNN) is a widely used deep learning (DL) algorithm in medical image-based classification and prediction.8 Several methods use CNNs in cancer detection and diagnosis,9 such as the Gleason grading of prostate cancer,10-12 colon cancer grading,13 breast cancer detection,14,15 and pancreatic cancer detection16-19 and classification.20 AI has been proven to assist clinicians with better prediction and faster diagnosis in breast cancer screening.21 However, grading of pancreatic cancer with DL still needs comprehensive study.

Methodology
The research work was undertaken at Multimedia University, Cyberjaya, from June 2020 to May 2021. The overall methodology of this research is illustrated in Figure 1, with two major stages. In the data preparation stage, pathology images of pancreas tissue samples were obtained from our collaborator and pre-classified by a pathologist into four classes. In the DL model development stage, the images were used to train the DL models, which were then evaluated accordingly. All stages were carried out using Jupyter notebooks in Google Colab. The source code is available from GitHub and archived with Zenodo.26

Ethical approval
This work was approved by the Research Ethics Committee of Multimedia University with approval number EA2102021. This article does not contain any studies with human participants or animals performed by any of the authors. Only pathology images were used, and the patients' personal data were anonymized.

REVISED Amendments from Version 1
This version addresses the comments from the reviewers as follows:
1) Revised the "Introduction" and "Deep learning and related works" sections.
2) Added a "Contributions" section after the "Introduction".
3) Revised the "Effect of data augmentation" and "Comparison between the best and the worst performing model" sections.
Any further responses from the reviewers can be found at the end of the article.

Dataset preparation Pathology image procurement
A total of 138 high-resolution images with varying dimensions (1600 × 1200, 1807 × 835 and 1807 × 896) were obtained and pre-classified by the collaborators (see Acknowledgements). Four classes were identified (as shown in Table 2): Normal, Grade-I, Grade-II and Grade-III. Each image consisted of a tissue sample stained with MGG or H&E. The image distribution across classes was unequal, with Grade-II having 58 images and Normal only 20. To better capture the cells' characteristics, which are paramount in determining their grade, and to match the lower-resolution setting of the network's input, the images were pre-processed into small non-overlapping patches.

Image pre-processing
The pre-trained models require low-dimension, square images for training and prediction. A squared slicing method was used, in which smaller patches of approximately 200 × 200 pixels were sampled from non-overlapping regions of the original images. Further processing was done to remove unwanted patches, as shown in Figure 2.
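The slicing step can be sketched as follows. The exact border-handling policy is not stated in the paper, so discarding partial edge patches is an assumption of this sketch:

```python
import numpy as np

def slice_into_patches(image, patch_size=200):
    """Slice an (H, W, C) image into non-overlapping square patches.

    Border regions smaller than patch_size are discarded so that every
    patch has exactly the requested dimensions (an assumption; the paper
    does not state how image borders were handled).
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

# A 1600 x 1200 image yields 8 x 6 = 48 non-overlapping 200 x 200 patches.
image = np.zeros((1200, 1600, 3), dtype=np.uint8)
print(len(slice_into_patches(image)))  # 48
```

At roughly 48 patches per 1600 × 1200 image, this is consistent with the 6468 patches reported for the 138 originals.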

Image dataset
A total of 6468 patches were generated by slicing the 138 original images, an almost 47-fold increase in the number of images. Overall, 50.5% (3267) of the patches, containing background and non-tissue information, were discarded, and the remainder are listed in Table 3. Examples of MGG-stained and H&E-stained pathology images are shown in Table 1, with the mixed dataset combining all images from the MGG and H&E stains. From the numbers in Table 3, these datasets still had an imbalanced number of patch images in each class, but this can be mitigated by employing a weighted average to evaluate the model.
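The paper's discarding criterion is shown only in Figure 2; one common heuristic for this kind of filtering, given purely as an illustrative assumption (the threshold values below are ours, not the paper's), is to drop patches whose pixels are mostly near-white slide background:

```python
import numpy as np

def is_background(patch, white_thresh=220, frac=0.8):
    """Flag a patch as background if most pixels are near-white.

    This intensity heuristic is an illustrative assumption; the paper's
    actual discarding criterion is shown only in Figure 2.
    """
    gray = patch.mean(axis=2)                 # rough per-pixel luminance
    return float((gray > white_thresh).mean()) > frac

tissue = np.full((200, 200, 3), 120, dtype=np.uint8)  # stained tissue tone
blank = np.full((200, 200, 3), 245, dtype=np.uint8)   # empty slide region
print(is_background(tissue), is_background(blank))  # False True
```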

Training-validation splitting and K-fold cross-validation
To evaluate the DL models, images in each dataset were split into training and validation sets with an 80:20 ratio. K-fold cross-validation with K = 5 was used by splitting each of the MGG, H&E and mixed datasets into five parts, producing cross-validation sets with five new copies of the MGG, H&E and mixed datasets, labelled accordingly (e.g. MGG Set 1 to MGG Set 5 for MGG). Each set had a different subset of images used for training (80%) and validation (20%). The average value over the five training iterations was calculated to evaluate the performance.
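The 5-fold scheme can be sketched without any DL framework (this mirrors what `sklearn.model_selection.KFold` does; the function name and seed below are ours):

```python
import numpy as np

def kfold_splits(n_samples, k=5, seed=42):
    """Return k (train_idx, val_idx) pairs; each sample is validated once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    return [
        (np.concatenate([folds[j] for j in range(k) if j != i]), folds[i])
        for i in range(k)
    ]

# Five cross-validation sets, each with an 80:20 training/validation split.
for i, (train_idx, val_idx) in enumerate(kfold_splits(100), start=1):
    print(f"Set {i}: train={len(train_idx)} val={len(val_idx)}")
```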

Image data augmentation and normalisation
Image data augmentation was implemented to virtually expand the training set, but was not applied to the validation set. The transformation parameters involved were horizontal flip, vertical flip and a rotation range of −90° to 90°. Image data normalisation was used to rescale image pixels from a range of [0, 255] to [0, 1] so that the input pixels have a similar data distribution.
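In Keras these settings map to `ImageDataGenerator(rescale=1/255, horizontal_flip=True, vertical_flip=True, rotation_range=90)`, which is most likely what was used; as a framework-free illustration (the helper names are ours, and random rotation is omitted), the flips and rescaling look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch):
    """Randomly flip a patch horizontally and/or vertically (the -90 to 90
    degree rotations used in the paper are omitted from this sketch)."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1, :]   # vertical flip
    return patch

def normalise(patch):
    """Rescale uint8 pixels from [0, 255] to float [0, 1]."""
    return patch.astype(np.float32) / 255.0

patch = rng.integers(0, 256, size=(200, 200, 3), dtype=np.uint8)
x = normalise(augment(patch))
print(x.dtype, float(x.min()) >= 0.0 and float(x.max()) <= 1.0)  # float32 True
```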

CNN deep learning model development
A deep CNN algorithm was used to develop a model for classifying pancreatic cancer grades from pathology images.

Transfer-learning
A total of 14 CNN pre-trained models, originally trained to recognize 1000 classes in ImageNet, were selected from the Keras API22 to find the best model for classifying the four grade classes of pancreatic cancer. The proposed pre-trained models are listed in Table 4 along with each original model's image input shape and its respective top-1 accuracy on the ImageNet validation set.

Fine-tuning
All 14 models were fine-tuned with four newly added layers to extract the features from pathology images: a flatten layer to form a 1D fully connected layer; a dense layer with 256 nodes and a ReLU activation function; a dropout layer with a rate of 0.4 to regularise the network; and lastly another dense layer with four nodes and a softmax activation function to normalize the prediction probabilities.
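In Keras, attaching this head to a pre-trained base might look like the sketch below. DenseNet121 stands in for any of the 14 bases, and `weights=None` is used here only to avoid a weight download; the experiments would load `weights="imagenet"`:

```python
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model

def build_model(input_shape=(224, 224, 3), n_classes=4):
    """Attach the four new layers described above to a pre-trained base."""
    base = DenseNet121(include_top=False, weights=None, input_shape=input_shape)
    x = Flatten()(base.output)                       # 1D fully connected layer
    x = Dense(256, activation="relu")(x)             # dense layer, 256 nodes
    x = Dropout(0.4)(x)                              # regularisation
    out = Dense(n_classes, activation="softmax")(x)  # 4-class probabilities
    return Model(base.input, out)

model = build_model()
print(model.output_shape)  # (None, 4)
```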

Setup and evaluation parameters
A batch size of 64 was chosen to allow the computer to train and validate 64 patch samples at the same time. The Adam optimizer with an initial learning rate of α = 0.01 and moment decay rates of β1 = 0.9 and β2 = 0.999 was used. The loss is calculated using categorical cross-entropy for the 4-class classification task. With this setup, the models were compiled and trained for 100 epochs.
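For reference, categorical cross-entropy over one-hot labels reduces to the mean negative log-probability assigned to the true class; a minimal sketch (the example probabilities are made up):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two patches over the four classes (Normal, Grade-I, Grade-II, Grade-III):
# the true classes are Grade-II and Normal.
y_true = np.array([[0, 0, 1, 0], [1, 0, 0, 0]], dtype=float)
y_pred = np.array([[0.1, 0.1, 0.7, 0.1], [0.6, 0.2, 0.1, 0.1]])
print(round(categorical_cross_entropy(y_true, y_pred), 4))  # 0.4338
```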
The confusion matrix, precision, recall, f1-score and weighted average were used to evaluate the models' performance.
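These metrics all derive from the confusion matrix; a minimal sketch (the confusion matrix below is hypothetical, not taken from the paper's results):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and f1 per class from a confusion matrix
    (rows = true class, columns = predicted class)."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)
    recall = tp / cm.sum(axis=1)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion matrix over (Normal, Grade-I, Grade-II, Grade-III).
cm = np.array([[ 8, 0,  2,  0],
               [ 0, 5,  5,  0],
               [ 1, 1, 48,  8],
               [ 0, 0,  4, 18]])
precision, recall, f1 = per_class_metrics(cm)
support = cm.sum(axis=1)  # number of true samples per class
weighted_f1 = float((f1 * support).sum() / support.sum())
print(round(weighted_f1, 3))  # 0.788
```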
The weighted average was used to calculate the performance of each individual cross-validation set and is suitable for imbalanced datasets. The equation for the weighted average is:
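The equation did not survive typesetting; the standard support-weighted average (the form computed by common classification-report tools) can be reconstructed as:

```latex
\mathrm{weighted\ average} = \frac{\sum_{i=1}^{C} n_i \, m_i}{\sum_{i=1}^{C} n_i}
```

where \(C = 4\) is the number of classes, \(m_i\) is the per-class metric (precision, recall or f1-score) and \(n_i\) is the support, i.e. the number of validation patches in class \(i\).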

Effect of data augmentation
This experiment was done with the first cross-validation set of the mixed dataset, to observe how data augmentation impacts model training performance.25 Table 5 and Table 6 display the final accuracy and loss of the training and validation sets after 100 epochs. Without data augmentation (Table 5), it is evident that overfitting occurred, because the models performed very well on the training set but not on the validation set. With data augmentation, the validation accuracy improved, most notably for the VGG19 model (54.83% to 77.22%). The training accuracies of the other models were slightly reduced with data augmentation (except for VGG19), but this is normal as the models were learning newly transformed images. The validation loss also shows a reduction, as in Table 6, such as on the NASNetLarge model from 3.36376 to 0.68587. Overall, these results show that data augmentation may reduce overfitting and improve model performance, as reported in Refs. 11, 14 and 15. The reason is that the model becomes more robust through data augmentation, having learned various transformed images from the limited-size dataset. Such image variations are highly likely to occur in real-world applications, especially when it comes to unique human cells.

Comparison analysis of model performance
The overall performance results of all 14 transfer-learning models proposed for this experiment are presented below. Each model was trained on the three datasets with 5-fold cross-validation. Figure 3 illustrates the overall performance in terms of mean f1-score.

Comparison between MGG, H&E and the mixed dataset
This comparison shows how a DL model learns from a single stain. In Figure 3, all models trained with H&E obtained the highest f1-scores compared to MGG and mixed. Most models scored above 0.9, except for VGG19 (0.87).
When trained with MGG, models other than VGG16 and VGG19 performed the lowest compared to H&E and mixed.
The performance on mixed is as expected, because it contains a mixture of both datasets. The VGG16 and VGG19 models, however, performed better on MGG than on mixed, possibly because the smaller VGG network architecture and its fully-connected layers are unable to learn the complex features and patterns in pathology images. The trend in Figure 3 indicates that image patches in H&E are easier to learn, with better prediction, than MGG.

Comparison between pre-trained models
From the results, the DenseNet network architecture was the best at classifying pathology images, with all three variations trained on MGG, H&E and mixed taking the top spots among the 14 models. The ResNet models on the mixed dataset were ranked, in ascending order, ResNet101V2, ResNet50V2 and ResNet152V2, just below the three DenseNet models. This supports the work of Huang et al., in which DenseNet was designed to improve on the ResNet architecture. DenseNet201, which is much deeper than the other two DenseNet models, managed to achieve the highest f1-scores of 0.88, 0.96 and 0.89 for MGG, H&E and mixed, respectively. The DenseNet121 and DenseNet169 scores on the three datasets were marginally lower at 0.87, 0.95, 0.89 and 0.87, 0.95, 0.88, respectively. This shows that a deeper DenseNet can produce more accurate predictions.
The Xception23 and InceptionResNetV224 architectures are improvements on InceptionV3. The f1-scores for Xception trained on MGG, H&E and mixed are 0.85, 0.94 and 0.86, compared to 0.80, 0.92 and 0.83 for InceptionV3, respectively. However, the InceptionResNetV2 scores are only slightly higher than InceptionV3's (0.93 and 0.83 for H&E and mixed) and lower for MGG (0.80). The VGG models did not perform well compared to the other models. VGG19, which is supposed to be an improvement on VGG16, failed to achieve a greater f1-score, with 0.74, 0.87 and 0.65 for MGG, H&E and mixed, respectively, while VGG16 was higher at 0.80, 0.93 and 0.78. The results indicate that VGG19 was the worst performing model for our datasets.
This experiment applied transfer-learning to 14 ImageNet pre-trained models to classify pancreatic cancer grades. From the comparisons, the DenseNet201 model is suggested for practical application in a pancreatic grading system for MGG or H&E stains.
Comparison between the best and the worst performing model
Table 7 and Table 8 show the precision and recall of VGG19 (the worst) and DenseNet201 (the best) for the three datasets. VGG19 struggled to make predictions for Grade-I patches in MGG, where the precision and recall are 0.00 for CV sets 3, 4 and 5. A similar pattern is noticed for Grade-III patches, and from our observation this is because most of the Grade-I and Grade-III patches were wrongly predicted as Grade-II. This is due to the imbalanced classes in MGG, where Grade-II patches account for 52.9% of the total images whereas Grade-I consists of only 5% and Grade-III 19.7%. This class imbalance caused the VGG19 model to struggle at recalling classes with less data. For H&E images, however, class imbalance did not affect the performance of VGG19. The recall and precision for the Normal class rank among the highest despite its having the smallest number (10%) of patches. Looking back at Table 1, the H&E Normal images have a quite different stain colour compared to the other classes, which explains the good prediction for both models. This could be seen as a problem, where limited image variation can cause bias. The precision for the Normal class would score poorly if the model were tested on a different variation of H&E stain images even with the same ground truth, but this could be mitigated if the class contained many different variations of stain colour.
For the mixed dataset, VGG19 also struggled to predict the Grade-III class, especially on CV sets 4 and 5, where it scored 0.00 for both metrics. The reason could be that the Grade-III patches are difficult for the VGG19 model to learn. This is why cross-validation should be performed to rigorously evaluate a DL model. DenseNet201 managed to achieve good recall for Grade-III patches on both CV sets, confirming its ability to learn complex features in pathology images.
From the study, we can see that integrating AI into the diagnosis system can assist the pathologist by suggesting a grading based on the prediction. It is, however, meant to assist, not to make the decision. The future aim of this study is to provide a platform for screening pancreatic cancer biopsies.

Conclusion
This paper presents the development of several deep learning models through transfer-learning for classifying pancreatic cancer grade from pathology images. The datasets were trained on a total of 14 ImageNet pre-trained models. Image data augmentation was performed to counter the low number of images and was shown to improve the validation accuracies of the pre-trained models by up to 40%. The evaluation of the 14 pre-trained models shows that the DenseNet models performed best. Most of the models trained on H&E achieved f1-scores above 0.9. The MGG dataset scored lower f1-scores than the mixed dataset. The highest f1-scores were achieved by DenseNet201, with 0.8786, 0.9561 and 0.8915 for MGG, H&E and mixed, respectively. To the best of our knowledge, no similar work on pancreatic cancer grading has been reported in the literature. With these promising early results, this work can aid pathologists by facilitating an automated pancreatic cancer grading system for better cancer diagnosis and prognosis. This study has not been tested on whole-slide images (WSI), but similar approaches can be applied. Further improvements to the system can potentially be achieved by using future state-of-the-art DL models.

Data availability
Underlying data
Open Science Framework: Dataset for Pancreatic Cancer Grading in Pathological Images using Deep Learning Convolutional Neural Networks. https://doi.org/10.17605/OSF.IO/WC4U9.25
This project contains the following underlying data:
- Dataset PCGIPI-Original.zip (pancreatic pathological image patches used for our analysis; the stain types are May-Grünwald-Giemsa (MGG) and haematoxylin and eosin (H&E))
- Dataset PCGIPI-sliced.zip
- PCGIPI Results.xlsx
- Slicing Process for Table 3.docx
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Summary:
The paper reports on novel work undertaken in automated grading of pancreatic cancer using pathology images and deep learning. The classification is based on the grade of the cancer, instead of just detecting the presence or type of cancer, and uses pathology images as opposed to the conventional radiology images prevalent in most prior research efforts. The main advantage of the proposed system is that it overcomes the very tedious and error-prone conventional pancreatic cancer grading method that requires manual microscopic examination of the cancerous tissues after biopsy. The reduced misdiagnosis promotes correct treatment planning for patients. Although using only a limited-size training set, the proposed system was tested using 14 different ImageNet pre-trained models and achieved high accuracies of up to 95.61% for two different pathological stains.

Comments:
The paper is well written, and the flow is both logical and guides the reader through the work in an easy-to-understand manner. The technical content is good, with appropriate use of vocabulary and terminology. The subject matter is of current interest, and the publication is a worthwhile contribution to the body of knowledge.

Recommendations for Improvement:
Some points of consideration for improvement of the paper are listed below:

1. Abstract - Results: It would be advisable to change "small sample set" to "small training set", to better represent the effectiveness even when training was limited. This is especially so as deep learning is employed for the grading system.
2. Introduction: Consider changing "paper" in "The absence of such AI work motivates this paper to" to "project" or "research".
3. References: Although the references are fairly recent, the paper would benefit by additionally citing the most recent relevant publications from the current year. It should also be checked whether "et al." can be used in the References or the names of all authors should be listed in the References section.
4. Methodology: In terms of language, it is better to change "The methodology of this work was done at" to "The research work was undertaken at".
5. Figure 1: Do use a consistent style of verbs in the flowchart, e.g. change "Fine tuning" to "Fine tune" and "Training" to "Train".
6. Pathology image procurement: Consider changing "which is" to "that are".
7. Image dataset: Consider adding a comma before "which" in "images which is".
8. Image data augmentation and normalization: It is stated that the images were rescaled from [0,255] to [0,1]. However, was this as a grayscale intensity image or for each of the RGB (or other colour space) components? Do clarify, as there are no further images shown, making it difficult to infer.
9. Effect of data augmentation: Citation 23 appears to be out of place. Consider changing "is doing very well on the training set" to "performed well on the training set". The citations at the end of the last sentence should be moved to before the full stop.
10. Table 5: Consider simplifying the caption to "Model accuracy after 100 epochs" or removing "for". Similarly for the Table 6 caption.
11. Figure 3: The names of the datasets do not tally with the description given earlier in the text. For consistency, "Blue" and "Purple" should be changed to "MGG" and "H&E", respectively. The y-axis of the graph should be labelled ("Mean f1-score"). Also ensure consistency of spelling for "f1-score" in the caption and text. Similarly, ensure consistency of the spelling "Mixed" dataset (cf. "mixed").
12. Comparison between pre-trained models: Consider deleting ", which performs better than their ancestor."
13. Comparison between the best and the worst performing model: Consider changing "imbalance classes" to "imbalanced classes". Also, "recalling class with fewer data" to "recalling classes with less data".

Data availability:
The authors have made their data and results available to other researchers. This is useful as it allows verification of the findings, should the need arise, and can propagate further research in this area by helping overcome the scarcity of labelled data. The code used is also provided to assist other researchers.

Conclusion:
The paper presents useful findings that would be beneficial for quick and accurate pancreatic cancer grading and correct, timely treatment planning. The findings may also open avenues of research and implementation for other types of cancers and tissue analysis. This work is a useful contribution to the body of knowledge in this domain.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Lisbon, Lisbon, Portugal

In this manuscript, the authors present the development of several DL (deep learning) models through transfer-learning for classifying pancreatic cancer grade from pathology images. Specifically, the authors performed data augmentation across a dataset of images to counter the low number of images and improve the validation accuracies of all pre-trained models. Overall, this is an exciting line of work, but the current draft of this manuscript has some minor concerns that must be addressed before publication. Contributions must be described clearly. Additionally, the findings are also important attributes of such significant research work. Further comments detailing improvements for the next iteration of the manuscript follow.

Requested Changes
There is much to appreciate in this manuscript, as it investigates several DL models for classifying pancreatic cancer. However, the provided findings and work implications are not enough, and the discussion section would need to be considerably strengthened by a more robust engagement with the cited literature. I would encourage the authors to think about how their findings extend or add to both the DL and medical literatures in this space in order to refine their contribution to both fields. For instance, it would be interesting to understand the application of these DL models to a real scenario [1, 4]. At the least, include a small discussion on how these techniques could help real problems, such as those in different medical domains [2, 3, 5].
The paper needs a brief proofreading. Be aware of tenses also. It would be wise to proofread the paper and correct these, as well as other spelling errors.

Added last paragraph before Conclusion section:
From the study, we can see that integrating AI into the diagnosis system can assist the pathologist by suggesting a grading based on the prediction. It is, however, meant to assist, not to make the decision. The future aim of this study is to provide a platform for screening pancreatic cancer biopsies.
We look forward to getting this article indexed.Thank you very much.
Competing Interests: No competing interests were disclosed.

Figure 1 .
Figure 1. Flowchart of the research work.

Figure 2 .
Figure 2. Process of slicing an image and discarding unwanted non-tissue patches.

Strengths
1.1. The manuscript addresses key concerns about transfer-learning for classifying pancreatic cancer;
1.2. The manuscript brings forward novel perspectives for DL developments in a specific medical domain;
Weaknesses
2.1. Contributions and impact for this scientific community are not clear;
2.2. Findings and implications of the work are merely discussed;

2. Calisto FM, Santiago C, Nunes N, Nascimento JC: BreastScreening-AI: Evaluating medical intelligent agents for human-AI interactions. Artif Intell Med. 127: 102285. PubMed Abstract | Publisher Full Text
3. Budak C, Mençik V: Detection of ring cell cancer in histopathological images with region of interest determined by SLIC superpixels method. Neural Computing and Applications. 2022. Publisher Full Text
4. Calisto F, Santiago C, Nunes N, Nascimento J: Introduction of human-centric AI assistant to aid radiologists for multimodal breast image classification. International Journal of Human-Computer Studies. 2021; 150. Publisher Full Text
5. Shinde S, Kulkarni U, Mane D, Sapkal A: Deep Learning-Based Medical Image Analysis Using Transfer Learning. 932: 19-42. Publisher Full Text
6. Calisto FM, Nunes N, Nascimento JC: BreastScreening: On the Use of Multi-Modality in Medical Imaging Diagnosis. AVI '20: Proceedings of the International Conference on Advanced Visual Interfaces. 2020. Publisher Full Text | Reference Source
7. Niazi MKK, Tavolara TE, Arole V, Hartman DJ, et al.: Identifying tumor in pancreatic neuroendocrine neoplasms from Ki67 images using transfer learning. PLoS One. 2018; 13(4): e0195621. PubMed Abstract | Publisher Full Text
8. Calisto FM, Ferreria A, Nascimento JC, Gonçalves D: Towards Touch-Based Medical Image Diagnosis Annotation. ISS '17: Proceedings of the 2017 ACM International Conference on Interactive Surfaces and Spaces. 2020. Publisher Full Text | Reference Source

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Human-Computer Interaction, Health Informatics, Artificial Intelligence
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Table 5 .
Model accuracy without and with data augmentation after 100 epochs.

Table 6 .
Model loss without and with data augmentation after 100 epochs.

Table 7 .
Precision rate of VGG19 and DenseNet201.

Table 8 .
Recall rate of VGG19 and DenseNet201.