A deep learning segmentation strategy that minimizes the amount of manually annotated images

Deep learning has revolutionized the automatic processing of images. While deep convolutional neural networks have demonstrated astonishing segmentation results for many biological objects acquired with microscopy, their good performance relies on large training datasets. In this paper, we present a strategy to minimize the time spent manually annotating images for segmentation. It involves using an efficient and open source annotation tool, artificially increasing the training dataset with data augmentation, creating an artificial dataset with a conditional generative adversarial network, and combining semantic and instance segmentations. We evaluate the impact of each of these approaches for the segmentation of nuclei in 2D widefield images of human precancerous polyp biopsies in order to define an optimal strategy.


Introduction
Over the last decade, deep learning approaches have outperformed all existing methods for image segmentation 1-4 . Semantic segmentation, the estimation of a label at each pixel, and instance segmentation, the identification of individual objects, have been successfully applied to spatially characterize biological entities in microscopic images 5-8 . However, these powerful approaches rely on large annotated datasets. While more and more datasets become publicly available 9,10 , annotated data are far from covering every combination of modalities, tissues and biological objects. If the number of images to be segmented is low, a fully manual workflow might be the most time efficient option. Otherwise, procedures to efficiently build training datasets are required to exploit the full potential of deep learning-based segmentation at the scale of a single biology lab.
In this paper, we propose a strategy to minimize the amount of time dedicated to manually annotating images and investigate several approaches to maximize accuracy when using only one annotated image. We apply this strategy to segment nuclei stained with DAPI in widefield images of human colorectal adenomas (i.e. precancerous polyps) as follows. First, we take advantage of existing training datasets 11,12 and massive data augmentation to obtain a preliminary segmentation. We then use an open source annotation software 12 to manually correct this segmentation and consequently define the training dataset. Next, we simulate synthetic images using a conditional generative adversarial network (GAN) 13 to increase the size of the training dataset. Finally, we combine U-Net 14,15 , a semantic segmentation approach, and Mask R-CNN 16 , an instance segmentation approach, to improve the nuclear segmentation accuracy.

Sample preparation
In this study, we used the Medical University of South Carolina (MUSC) pathology laboratory information system CoPath (Cerner Corporation, Kansas City, MO) to identify a convenience sample of colorectal adenomas excised from patients who underwent a sigmoidoscopy or colonoscopy with polypectomy between October 2012 and May 2016. For each patient, we obtained a formalin-fixed, paraffin-embedded (FFPE) tissue block and prepared one H&E section and five 5-micron sections for immunofluorescence (IF) on FFPE tissue. Prior to the start of the IF procedures, all antibodies were optimized and reviewed by the study immunologist, the pathologist, the epidemiologist, and laboratory personnel to ensure agreement and proper staining. The MUSC Institutional Review Board approved this research study (IRB # PRO-00007139).
Image acquisition
DAPI was used for nuclear counterstaining. Stained slides were mounted with ProLong™ Gold Antifade Reagent (Cat. # P36934, ThermoFisher) and imaged using the Akoya Vectra® Polaris™ Automated Imaging system (Akoya Biosciences, Marlborough, MA). Whole slide scans were acquired at 20X magnification and regions of interest were chosen randomly.

Training dataset
The training dataset consisted of three 1868 × 1400 images manually annotated with Annotater 12 . For most of the study, only one of these images was used to train U-Net, Mask R-CNN and pix2pix (conditional GAN), in addition to publicly available datasets (image set BBBC039v1 available from the Broad Bioimage Benchmark Collection 9 and a mouse intestinal epithelium dataset 12 ). The two other images were only added to the training dataset in the last section, to compare models trained on more annotated data with the combination of results obtained with U-Net and Mask R-CNN (see Figure 3a).

U-Net training
The annotated 1868 × 1400 image was divided into six 622 × 700 images for training: five of these images were included in the training dataset while the last one defined the validation dataset. As U-Net is a semantic segmentation approach, three classes were defined to allow separating nuclei, as proposed in 22: inner nuclei, nuclei contours and background. To facilitate nuclei separation, the nuclei contours between touching cells were dilated 22 . To limit over-fitting, the imaging field for images in the training dataset was set to 256 × 256 by randomly cropping the 622 × 700 input images. These cropped images were then normalized to obtain intensity values between 0 and 1. The RMSprop optimizer was used to estimate the parameters of the deep convolutional neural network by minimizing a weighted cross-entropy loss (to handle class imbalance) for 100 epochs without data augmentation and 25 epochs with data augmentation. The weights associated with each class were defined as their inverse proportions in the training dataset. Data augmentation, increasing the training dataset by a factor of 100, was applied after normalization with the imgaug Python library 23 and included flipping, rotation, pixel dropout, blurring, noise addition and contrast modifications. In Figure 2 and Figure 3, augmented simulated images were obtained by applying the same modifications with the imgaug Python library to images simulated with pix2pix. When combining the annotated image from this study with simulated images and/or existing datasets, the number of augmented images was defined so as to be balanced between the different data sources.
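As an illustration, the sketch below shows how such an augmentation pipeline can be assembled with the imgaug library, pairing each image with its label map so that geometric transforms stay consistent. The operator list follows the description above, but the parameter ranges are illustrative assumptions, not the values used in the study.

```python
import imgaug.augmenters as iaa
from imgaug.augmentables.segmaps import SegmentationMapsOnImage

# Augmentation operators mirroring the list above; parameter ranges are assumptions
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                # horizontal flipping
    iaa.Flipud(0.5),                                # vertical flipping
    iaa.Affine(rotate=(-90, 90)),                   # rotation
    iaa.Dropout(p=(0.0, 0.05)),                     # pixel dropout
    iaa.GaussianBlur(sigma=(0.0, 1.5)),             # blurring
    iaa.AdditiveGaussianNoise(scale=(0.0, 0.02)),   # noise addition (images in [0, 1])
    iaa.LinearContrast((0.8, 1.2)),                 # contrast modification
], random_order=True)

# image: normalized 256 x 256 crop; mask: integer label map with the three classes
# segmap = SegmentationMapsOnImage(mask, shape=image.shape)
# image_aug, segmap_aug = augmenter(image=image, segmentation_maps=segmap)
```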

Amendments from Version 1
In this new version, we have changed the first section by: 1. Adding a comparison with a Stardist model trained on the 2018 Data Science Bowl, which is available in Fiji, 2. Better explaining the way we trained U-Net and Mask R-CNN to obtain the results shown in Figure 1.
We have also toned down the benefit of using a conditional GAN to expand the size of the training dataset, as it only marginally improves the segmentation accuracy.
Finally, we have completely rewritten the discussion to present observations made in the manuscript rather than a universal guideline. Mainly: 1. The use of publicly available datasets and massive data augmentation is beneficial to build a training dataset and is now common practice in the field, 2. The conditional GAN approach does not drastically improve the segmentation accuracy, 3. Combining instance and semantic segmentations leads to a substantial increase in segmentation accuracy and has the potential to be widely adopted in the field.
Any further responses from the reviewers can be found at the end of the article.
U-Net post-processing
An ImageJ macro 24,25 was used to convert the three classes obtained with U-Net into individual nuclei. More specifically, individual nuclei were identified by thresholding the subtraction of the nuclei contours component from the inner nuclei component with a threshold equal to 0.35. A 3D Voronoi tessellation 26 was then applied to assign each pixel to a nucleus. The object component was defined as all pixels whose background component was below 0.95. This object component was then multiplied by the Voronoi tessellation to obtain individual nuclei. The Voronoi tessellation implies that a 1-pixel-wide area between nuclei is not assigned to any nucleus. To address this problem, the location of these pixels was obtained by subtracting the binary thresholding of the individual nuclei from the object component. The individual nuclei were then dilated 27 , multiplied by this subtraction, and the result was added to the individual nuclei. Finally, nuclei with fewer than 35 pixels were removed.
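The post-processing itself was implemented as an ImageJ macro. For readers working in Python, the sketch below is an approximate translation using scikit-image, in which the Voronoi-like assignment is emulated with a distance-transform watershed; the thresholds follow the text above, everything else is an assumption rather than the macro used in the study.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import label
from skimage.segmentation import watershed

def postprocess_unet(inner, contours, background,
                     seed_thr=0.35, bg_thr=0.95, min_size=35):
    """Convert the three U-Net class maps into labeled individual nuclei."""
    # Seeds: inner-nuclei probability minus contour probability, thresholded at 0.35
    seeds = label((inner - contours) > seed_thr)
    # Object component: every pixel whose background probability is below 0.95
    objects = background < bg_thr
    # Assign each object pixel to the nearest seed (Voronoi-like partition,
    # emulated here with a distance-transform watershed)
    distance = ndi.distance_transform_edt(seeds == 0)
    nuclei = watershed(distance, markers=seeds, mask=objects)
    # Remove nuclei with fewer than min_size pixels
    for lab in np.unique(nuclei):
        if lab != 0 and np.sum(nuclei == lab) < min_size:
            nuclei[nuclei == lab] = 0
    return nuclei
```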

Mask R-CNN
The annotated 1868 × 1400 image was divided into thirty-five 266 × 280 images for training: thirty of these images were included in the training dataset while the last five defined the validation dataset. Version 2.1 of Mask R-CNN 16 was used in this study. The backbone network was the ResNet-101 deep convolutional neural network 28 . We used the code in 5 to define the only class in this study, i.e. the nuclei. Data augmentation, increasing the training dataset by a factor of 100, was applied before normalization with the imgaug Python library 23 and included resizing, cropping, flipping, rotation, shearing, pixel dropout, blurring, sharpness and brightness modifications, noise addition and contrast modifications. Transfer learning with fine-tuning from a network trained on the COCO dataset 29 was also applied: in the first epoch, only the region proposal network and the classifier and mask heads were trained; the whole network was then trained for the next three epochs. In Figure 2 and Figure 3, augmented simulated images were obtained by applying the same modifications with the imgaug Python library to images simulated with pix2pix. When combining the annotated image from this study with simulated images and/or existing datasets, the number of augmented images was defined so as to be balanced between the different data sources. The maximum image size used by Mask R-CNN was set to 512 rather than 256, as resizing and cropping were applied during data augmentation; it was set to 1024 when other existing datasets were included for training, as the magnification in these images is higher.
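The two-stage fine-tuning described above can be expressed as follows with the matterport Mask R-CNN implementation. This is a minimal sketch under stated assumptions: NucleiConfig is a hypothetical configuration, dataset_train and dataset_val are placeholder Dataset instances to be prepared by the user, and the reduced learning rate for the second stage is an assumption rather than the value used in the study.

```python
from mrcnn.config import Config
import mrcnn.model as modellib

class NucleiConfig(Config):            # hypothetical configuration for this study
    NAME = "nuclei"
    NUM_CLASSES = 1 + 1                # background + nucleus
    IMAGE_MIN_DIM = 256
    IMAGE_MAX_DIM = 512                # 1024 when the CC/MIE datasets are included

config = NucleiConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Transfer learning: start from COCO weights, excluding the class-specific heads
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val: mrcnn.utils.Dataset instances prepared by the user
# Epoch 1: train only the region proposal network and the classifier/mask heads
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
            epochs=1, layers="heads")

# Epochs 2-4: fine-tune the whole network (the reduced learning rate is an assumption)
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE / 10,
            epochs=4, layers="all")
```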

Evaluation
One 1868 × 1400 and one 934 × 1400 manually annotated images were used for evaluation. As proposed in 11, we used the F1 score with respect to the Intersection over Union (IoU) to evaluate the different nuclei segmentation approaches.
Let O_GT(e_1) denote a ground truth nucleus and O_E(e_2) an estimated nucleus. Their Intersection over Union is defined as

IoU(O_GT(e_1), O_E(e_2)) = |O_GT(e_1) ∩ O_E(e_2)| / |O_GT(e_1) ∪ O_E(e_2)|.

An IoU(O_GT(e_1), O_E(e_2)) equal to 0 means that O_GT(e_1) and O_E(e_2) do not share any pixel, while an IoU(O_GT(e_1), O_E(e_2)) equal to 1 means that O_GT(e_1) and O_E(e_2) are identical. To ensure that one ground truth nucleus is not associated with multiple estimated nuclei and conversely, we use the following definition for the IoU:

IoU*(O_GT(e_1), O_E(e_2)) = IoU(O_GT(e_1), O_E(e_2)) if O_GT(e_1) and O_E(e_2) are each other's best match (largest mutual IoU), and 0 otherwise.

The F1 score for a given IoU* threshold t > 0 can then be defined as

F1(t) = 2 TP(t) / (2 TP(t) + FP(t) + FN(t)),

where TP(t) is the number of matched pairs with IoU* > t, FP(t) the number of estimated nuclei left unmatched and FN(t) the number of ground truth nuclei left unmatched. With a threshold t = 0.05, this metric gives the accuracy of a method to identify the correct number of nuclei, while with thresholds in the range 0.05 − 0.9, it evaluates the localization accuracy of the identified nuclear contours.
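A minimal sketch of this evaluation is given below, assuming labeled ground truth and estimated masks and a mutual best-match rule for IoU*; it is meant to illustrate the metric, not to reproduce the exact implementation used in the study.

```python
import numpy as np

def f1_at_threshold(gt_labels, est_labels, t):
    """F1 score at IoU* threshold t for two labeled (integer) segmentation maps."""
    gt_ids = [i for i in np.unique(gt_labels) if i != 0]
    est_ids = [j for j in np.unique(est_labels) if j != 0]
    iou = np.zeros((len(gt_ids), len(est_ids)))
    for a, i in enumerate(gt_ids):
        gt_mask = gt_labels == i
        for b, j in enumerate(est_ids):
            est_mask = est_labels == j
            inter = np.logical_and(gt_mask, est_mask).sum()
            union = np.logical_or(gt_mask, est_mask).sum()
            iou[a, b] = inter / union if union else 0.0
    # IoU*: keep a pair only if each object is the other's best match
    iou_star = np.zeros_like(iou)
    for a in range(len(gt_ids)):
        for b in range(len(est_ids)):
            if iou[a, b] > 0 and iou[a, :].argmax() == b and iou[:, b].argmax() == a:
                iou_star[a, b] = iou[a, b]
    tp = int((iou_star > t).sum())   # matched pairs above threshold
    fp = len(est_ids) - tp           # estimated nuclei without a match
    fn = len(gt_ids) - tp            # ground truth nuclei without a match
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
```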

Conditional GAN
The annotated 1868 × 1400 image was divided into thirty-five 256 × 256 images for training. As defined in 13, U-Net 14 was used for the generator and a convolutional PatchGAN classifier was used for the discriminator. Once the network was trained, nuclei masks had to be generated in order to simulate images. Distributions for the number of nuclei per image and the size of nuclei were estimated from the training dataset: the number of nuclei per image was modeled as a Gaussian distribution, while the size of nuclei was modeled by a Gumbel distribution to reflect the heavy-tailed distribution observed in the training dataset. Nuclei masks were then defined as ellipses randomly generated with these distributions, with random orientation and a ratio between the two axes defined according to a Gaussian distribution of average s/π and standard deviation of 0.2s/π, where s is the area of the ellipse. One thousand 256 × 256 nuclei images were simulated by considering the generated ellipses as nuclei masks.
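The mask-generation step can be sketched as follows. The distribution parameters are placeholders to be estimated from the training dataset, and the axis-ratio model is a simplified assumption, so this is an illustration of the idea rather than the exact generator used for pix2pix in the study.

```python
import numpy as np
from skimage.draw import ellipse

def simulate_mask(shape=(256, 256), n_mu=40, n_sigma=8,
                  area_loc=250.0, area_scale=80.0, rng=None):
    """Generate one labeled nuclei mask made of random ellipses."""
    if rng is None:
        rng = np.random.default_rng()
    mask = np.zeros(shape, dtype=np.uint16)
    n_nuclei = max(1, int(rng.normal(n_mu, n_sigma)))      # Gaussian nucleus count
    for lab in range(1, n_nuclei + 1):
        area = max(35.0, rng.gumbel(area_loc, area_scale))  # heavy-tailed nucleus area
        radius = np.sqrt(area / np.pi)                      # radius of an equal-area disk
        ratio = max(0.3, rng.normal(1.0, 0.2))              # axis ratio (simplified assumption)
        r_major, r_minor = radius * np.sqrt(ratio), radius / np.sqrt(ratio)
        cy, cx = rng.uniform(0, shape[0]), rng.uniform(0, shape[1])
        rr, cc = ellipse(cy, cx, r_major, r_minor, shape=shape,
                         rotation=rng.uniform(0, np.pi))
        mask[rr, cc] = lab                                  # overlapping ellipses simply overwrite
    return mask
```

These masks can then be fed to the trained pix2pix generator to obtain the simulated intensity images.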

Combination of instance and semantic segmentations
The combination of results obtained with instance and semantic segmentations was initialized as the nuclei segmented with Mask R-CNN. To prevent hallucinations, nuclei identified with Mask R-CNN for which the area overlapping with nuclei obtained with U-Net was less than 20% were discarded. Then, nuclei identified with U-Net whose area overlapping with nuclei obtained with Mask R-CNN was less than 33% were added as new nuclei to the final segmentation. Finally, nuclei with an area smaller than 35 pixels were discarded.
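A minimal sketch of this combination rule is shown below, assuming both segmentations are provided as labeled images and that overlaps are measured relative to the area of each candidate nucleus (an assumption on our part).

```python
import numpy as np

def combine_segmentations(maskrcnn_labels, unet_labels, min_size=35):
    """Merge Mask R-CNN (instance) and U-Net (semantic-derived) nuclei labels."""
    combined = maskrcnn_labels.copy()
    unet_fg = unet_labels > 0
    maskrcnn_fg = maskrcnn_labels > 0
    # 1) Discard Mask R-CNN nuclei insufficiently supported by U-Net (hallucinations)
    for lab in np.unique(combined):
        if lab == 0:
            continue
        region = combined == lab
        if np.logical_and(region, unet_fg).sum() < 0.2 * region.sum():
            combined[region] = 0
    # 2) Add U-Net nuclei that Mask R-CNN missed
    next_label = combined.max() + 1
    for lab in np.unique(unet_labels):
        if lab == 0:
            continue
        region = unet_labels == lab
        if np.logical_and(region, maskrcnn_fg).sum() < 0.33 * region.sum():
            combined[np.logical_and(region, combined == 0)] = next_label
            next_label += 1
    # 3) Remove nuclei smaller than min_size pixels
    for lab in np.unique(combined):
        if lab != 0 and np.sum(combined == lab) < min_size:
            combined[combined == lab] = 0
    return combined
```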

Results
Deep learning-based instance segmentation with existing datasets and massive data augmentation is used to initialize the training dataset
A training dataset is required to train a deep learning method for object segmentation. While new approaches such as interactive machine learning 30 emerge for this task, users most often start by manually annotating objects of interest with existing annotation tools 31,32 . As shown in Figure 1a, this task is particularly challenging in our case due to the wide range of morphologies and the high density of nuclei in polyps. We use the ImageJ plugin Annotater 12 to efficiently annotate nuclei, a task that takes approximately 30 hours for the image shown in Figure 1a. To avoid a fully manual annotation and save time, it is possible to use the same plugin to correct a nuclei segmentation obtained with an existing method. The watershed method 33 , probably the most widely used method for nuclei segmentation in fluorescence microscopy images, correctly identifies a high number of nuclei (high F1 score for a low IoU threshold in Figure 1). Unfortunately, under- and over-segmentations, a well-known limitation of this approach, lead to a poor segmentation localization (rapidly decreasing F1 score with increasing IoU thresholds in Figure 1). Alternatively, pre-trained deep learning models for nuclei segmentation are available. Stardist 7 , one of the most popular approaches in microscopy, can be run as a Fiji plugin 25 with a model trained on the 2018 Data Science Bowl 10 . While the number of nuclei correctly identified is lower than with the watershed method (lower F1 score for low IoU thresholds in Figure 1b-c), their localization accuracy is much higher (higher F1 score for high IoU thresholds in Figure 1b-c). Another possibility is to train deep learning approaches with existing training datasets. We propose to train a U-Net model and a Mask R-CNN model with a high-throughput chemical screen on U2OS cells dataset (CC) (image set BBBC039v1 available from the Broad Bioimage Benchmark Collection 9 ) and a widefield mouse intestinal epithelium dataset (MIE) 12 . These models are then used to segment the image shown in Figure 1a. While U-Net demonstrates poor performance (Figure 1b), Mask R-CNN clearly surpasses the watershed approach and the pre-trained Stardist model (Figure 1c). When compared to the latter, the good performance of Mask R-CNN is explained by the fact that the MIE dataset includes epithelial nuclei, even though they come from mice. Correcting this segmentation with Annotater takes about 15-20 hours, which is clearly faster than an annotation from scratch. Training the Mask R-CNN model with the CC and MIE datasets takes about 12 hours but has the great advantage of not requiring human interaction. For both U-Net and Mask R-CNN, a massive data augmentation (100 times) clearly improves the performance.
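For readers who want to reproduce a comparable baseline in Python rather than Fiji, a pretrained Stardist model can be run as follows. The '2D_versatile_fluo' model is a publicly available pretrained model similar to the one shipped with the Fiji plugin, and the file name is a placeholder; this is an illustrative sketch, not the exact setup used for Figure 1.

```python
from csbdeep.utils import normalize
from skimage.io import imread
from stardist.models import StarDist2D

img = imread("dapi_image.tif")                            # placeholder file name
model = StarDist2D.from_pretrained("2D_versatile_fluo")   # pretrained fluorescence model
# Percentile normalization (1-99.8%) as used in the StarDist examples
labels, details = model.predict_instances(normalize(img, 1, 99.8))
```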
Increasing the training dataset by using a conditional GAN improves nuclear segmentation accuracy
When only considering the annotated image in Figure 1a in the training dataset, U-Net leads to a higher segmentation accuracy than Mask R-CNN (Figure 2a-b). To increase the training dataset, we use the same annotated image to train a conditional generative adversarial network (GAN) 13 and simulate images showing nuclei from masks defined as random ellipses generated with the distributions of nuclei size and nuclei number observed in the training dataset (see Figure 2c and Methods). Using only simulated images leads to a lower accuracy for both deep learning approaches, even though applying mathematical operations to these synthetic images (augmented simulated training dataset, see Methods) improves the segmentation accuracy. However, pooling together augmented simulated images and the annotated image from Figure 1a slightly improves U-Net performance and distinctly increases the number of accurately identified nuclei with Mask R-CNN, while decreasing the segmentation localization precision. Finally, adding existing datasets clearly leads to the optimal results for Mask R-CNN while degrading the accuracy for U-Net, which is consistent with the inability of this approach to generalize nuclear segmentation across different data, as shown in Figure 1b. Overall, U-Net marginally benefits from using simulated images (red curve versus black curve in Figure 2a) while the main gain for Mask R-CNN comes from the use of the CC/MIE datasets and data augmentation (orange curve versus red and black curves in Figure 2b).

Combining semantic and instance segmentations improves nuclear segmentation accuracy
Nuclei segmented with Mask R-CNN show a higher localization precision than those obtained with U-Net, as shown in Figure 2a-b. However, nuclei that are harder to delineate are missed by Mask R-CNN, while U-Net accurately identifies pixels that belong to nuclei, even though the separation between individual nuclei might not be precise. To get the best of both worlds, we propose to combine the results obtained with U-Net trained with one annotated image with data augmentation and augmented simulated images, and the results obtained with Mask R-CNN trained with one annotated image with data augmentation, augmented simulated images and existing datasets with data augmentation (see Methods). As shown in Figure 3, these results demonstrate a higher F1 score for any IoU threshold than those obtained with U-Net or Mask R-CNN trained with three times more annotated images. The corresponding segmented nuclei are shown in Figure 4.

Data availability
The five annotated images are available at https://github.com/tpecot/DeepLearningBasedSegmentationForBiologists/tree/main/Data/AnnotatedNuclei. The images generated with pix2pix and used for training U-Net and Mask R-CNN in Figure 2 and Figure 3 are available at https://github.com/tpecot/NucleiSimulationWithConditionalGAN/tree/main/datasets/Nuclei_polyps_1image.

Software availability
The code with the parameters used to train and process all experiments presented in this manuscript with U-Net and

Discussion
This study explores several strategies to minimize the amount of manually annotated data required to successfully train a deep learning model for instance segmentation. As already established in the field, using existing training datasets, even when modalities and/or tissues differ, makes it possible to train instance segmentation models and obtain results on the targeted data that can then be manually corrected to initialize a new training dataset. Massive data augmentation is another well-known approach to drastically increase segmentation accuracy. While the use of conditional GANs to expand the size of the training dataset seems promising, the gain in accuracy shown in this study is modest; the simulation pipeline used to generate the masks might have been too simplistic, and in particular the variety of nuclei shapes could be enriched. Finally, combining semantic and instance segmentation results leads to a substantial increase in segmentation accuracy. While unusual in the field, we believe that this method has the potential to become more common in the community. Combining these strategies makes it possible to remarkably reduce the amount of data to be manually annotated, while waiting for methods that promise to eliminate this time-consuming task altogether, such as the self- and partially supervised approaches currently in development.