Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.131775.2

Research Article

Articles

Novel Deep Learning Application: Recognizing Inconsistent Characters on Pharmaceutical Packaging

[version 2; peer review: 1 approved with reservations, 1 not approved]

Jarmo Koponen (a, 1): Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0000-0002-8963-9315

Keijo Haataja (1): Funding Acquisition, Project Administration

Pekka Toivanen (1): Supervision, Validation

1 School of Computing, Kuopio campus, University of Eastern Finland, Kuopio, Pohjois-Savo, FI-70211, Finland

a jarmo.koponen@uef.fi

No competing interests were disclosed.

23 7 2024

2023

427

24 6 2024

2024

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Machine vision faces significant challenges when applied to text recognition on cardboard packaging, particularly due to multiple printing methods, irregular character shapes, and curved packaging surfaces.

Methods

This research introduces a novel deep learning application for recognizing binarized expiration date and batch code characters printed using multiple printing methods. The method, based on Region-based Convolutional Neural Networks (R-CNN), enables character recognition directly from the images without the need to extract handcrafted features. In detail, the approach uses the whole image as input, extracting and learning salient character features directly from the packaging surface images.

Results

The R-CNN model, with a precision of 91.1% and an F1 score of 80.9%, effectively recognizes manufacturing markings on pharmaceutical packages, with inconsistencies in the characters’ shapes. In a comparative experiment using the same dataset of images, the R-CNN model significantly outperformed Tesseract OCR, achieving much higher precision, recall, and F1 scores.

Conclusions

The results of this study show that the deep learning method outperforms a well-established optical character recognition method in recognizing text characters printed with different printing methods. The deep learning method presented here recognizes text characters with high precision and is also suitable for recognizing text printed on curved surfaces, provided proper preprocessing is applied. The problem investigated differs from previous research in the field in that it focuses on the recognition of texts printed with different printing methods, and the research thus fills a gap in the text recognition literature. Furthermore, the study presents new ideas that will be utilized in our future research.

Keywords: Text recognition, Machine Vision, Deep Learning, Regions with Convolutional Neural Networks, R-CNN, Character Recognition, Printing Methods, Expiration Date, Batch Codes, Handcrafted Features, Image Recognition, Multi-Object Recognition, OCR, Tesseract OCR, Precision, Recall, Accuracy, F-Measure, Curved Surfaces, Preprocessing

The author(s) declared that no grants were involved in supporting this work.

Revised Amendments from Version 1

The title has been revised from "new method" to "new application." The English language usage has been improved. Details about the role of AlexNet have been added to the method description, and AlexNet has been included in the description of the Region CNN method. The performance curves obtained over all threshold values have been added. The key performance metrics have been updated to be more relevant for measuring text character recognition performance.

Introduction

By the end of 2023, the market value of the pharmaceutical packaging sector is expected to reach USD 101 billion. ¹ Recognizing product codes using machine vision enables the storage and processing of package-specific manufacturing data, as well as the electronic search and extraction of codes and dates, which is important in the development of intelligent product handling systems.

Cardboard is typically used in pharmaceutical product packaging, and boxes made of cardboard have curved surfaces. When text printed on such surfaces must be recognized accurately using 2D machine vision, the curvature is a difficulty in itself because it causes uneven illumination of the packaging surface, as presented in Figure 1. Several printing methods are used to print the expiration dates and manufacturing batch codes that are important for the usage and handling of pharmaceutical packages. Changes in physical imaging conditions, as well as printing methods that produce low-contrast and irregularly shaped characters, make code recognition more challenging. Recognition is especially challenging because the codes vary widely across packages and printing methods in form, structure, regularity, and color. In the past ten years, product code recognition techniques have advanced, and more researchers have become interested in this area of study.

Figure 1. Two cardboard pharmaceutical packages with the expiration date and batch codes printed on the curved packaging surfaces.

The purpose of this research is to demonstrate that, despite difficulties in the field, text characters with imperfections and inconsistencies in their forms on pharmaceutical packaging can be accurately recognized using appropriate pre-processing and deep learning techniques. In the experimental part of the study, a deep learning model trained on a real-life image set was used to evaluate the recognition of dot matrix and ink-printed characters, employing three key metrics: True Positives (TP), False Positives (FP), and False Negatives (FN). Subsequently, the performance of this deep learning model was compared with that of the Tesseract OCR method using the same metrics, which are relevant for text character recognition. Information related to expiration dates and manufacturing batch codes on pharmaceutical packages is important for safety and must be accurate and reliable. The rest of the paper is organized as follows: the next section discusses related work found in the literature. The Methods section details the approach to using deep learning for recognizing manufacturing marking texts on pharmaceutical packaging. Deep learning, with a focus on R-CNN and its use for expiration date and batch code recognition, is then presented to the extent that it serves this research. The experimental results are then analyzed, and the final section presents the conclusions and future work.

Related works

Optical character recognition (OCR) for document images is a well-studied field. ² In OCR, characters are recognized by comparing groups of pixels detected in the source image with the model patterns that the system has been trained on. OCR software is used to recognize manufacturing markings printed on product surfaces, where it is highly effective for regularly shaped characters with good contrast against a simple background. ³ Product text recognition is performed on digital images produced by an imaging system that captures energy reflected from the product surface in the scene onto its image plane. ³ Varying positions and orientations of the text in the camera scene, changes in illumination on the package surface, low contrast between text and background, irregular fonts, and motion blur in images acquired from moving packages reduce the usability of OCR methods in this high-accuracy task. ³ Tesseract OCR is a widely used open-source optical character recognition program. It has been in use since 1985 and has evolved significantly over the years: numerous languages are supported, image binarization is included, and Tesseract 4 added an OCR engine based on a long short-term memory (LSTM) neural network, which processes an input image line by line into boxes and feeds them to the LSTM network, which produces the recognition result. ⁴

The capability of different methods to recognize product texts printed with irregular characters on product packaging varies. ³ Detecting text as binary large objects using K-nearest neighbor (KNN) classification ⁵ for opaque labels, and traditional OCR methods ⁶ for consumer product cardboard packages, have difficulties in recognizing irregular fonts. Although these techniques are fundamental, they often struggle to adapt to diverse real-life conditions. Gabor filter-based methods ⁷, ⁸ are robust against variations in surface shape and uneven illumination but face challenges in recognizing irregular characters. A Histograms of Oriented Gradients (HOG) feature-based neural network, ⁹ combined with an imaging environment that uses multi-directional illumination of the package surface and sub-image fusion, alleviates problems related to low contrast and uneven illumination. However, the method's performance on irregular font shapes and motion blur is still limited. In controlled imaging environments, a deep learning method integrating a Connectionist Text Proposal Network (CTPN) for text area detection and a capsule network (CapsNet) for text recognition ¹⁰ tolerates packaging surface shape changes, uneven packaging surface illumination, and motion blur, and is capable of handling character irregularities, including irregular fonts. Integrations of two deep neural networks, FCN with R-CNN ¹¹ and Faster-CNN with LeNet DNN, ¹² effectively detect and recognize text on various food and beverage packaging, even with irregular fonts. As is characteristic of deep learning text recognition, these methods detect and learn the most important character features directly from source images, enabling generalization to various conditions, such as irregular characters, which pose challenges to traditional pattern-matching-based text recognition techniques.

Methods

This research focuses on using deep learning to recognize manufacturing marking texts printed on pharmaceutical packaging surfaces, where character shapes contain irregularities. Our previous research into recognizing manufacturing markings on cardboard pharmaceutical packages led to the development of a novel method for binarizing texts on curved packaging surfaces. ¹³ That method employs source images acquired in a controlled imaging environment, enabling accurate and robust binarization of texts while effectively addressing challenges associated with surface curvature. The text recognition ability of the trained R-CNN model is first evaluated using three key metrics. The performance of this deep learning network is then compared with that of Tesseract OCR using the same three metrics. Recognition of texts pressed without ink, stamped, or laser printed is excluded from the study. In the experimental recognition tests, we used a set of images from a Finnish health technology company, taken from real medical packages with dot matrix printed and pressed ink-marked texts. Text recognition was performed on the production batch and expiration date markings of the packaging and on their titles. An example of the binarized images used for text recognition is shown in Figure 2. Real pharmaceutical packages were used in the experimental tests to ensure that the results of our research are reliable. Since text recognition on pharmaceutical packages is a new area of research, the number of source images was limited. However, this limitation is only temporary, and research will continue as more source images become available.

Figure 2. Batch code and expiration date texts in binarized images, printed with the pressed ink-marking (left) and dot matrix (right) printing methods.

Deep Learning

Deep Learning (DL) is a subset of machine learning in which neural networks are used to learn from data and generate predictions or decisions. It has been described as teaching a computer to learn from data, much like a child does. Because of its ability to interpret and analyze complex patterns in data, deep learning can be used for a variety of tasks, including image classification, object detection, speech recognition, and natural language processing.

Regions with CNN features (R-CNN)

In the paper ‘Rich feature hierarchies for accurate object detection and semantic segmentation,’ Girshick et al. ¹⁴ introduce the original R-CNN method for object detection from images, based on the use of region proposals. The paper details the operation of the detection layer, offering insights on how high-capacity convolutional neural networks (CNNs) are applied to bottom-up region proposals to localize and segment objects. It also discusses the strategy of using supervised pre-training for an auxiliary task when labeled training data is scarce, followed by domain-specific fine-tuning, which yields a significant performance boost. ¹⁴

The object detection algorithm in the Region CNN model is based on using high-capacity convolutional neural networks to find and segment objects in a bottom-up manner. Training the CNN model with a limited labeled training set is achieved by training in two steps: first with supervised learning using a large dataset, and then fine-tuned with a targeted, limited dataset. ¹⁴

In Region-CNN, the selective search method is used to produce region proposals. Selective search segments the image into multiple parts at various scales and iteratively combines these segments based on similarities in color, texture, or size, generating region proposals for potential object locations in the image. This process results in hundreds of region proposals (regions of interest, ROIs) that potentially contain objects. Each region proposal is scaled to a fixed size and fed into a convolutional network, which extracts features and classifies the region. The CNN uses a trained model to predict whether each region contains an object and, if so, which class of object it is. The output of Region-CNN contains a classification layer, which predicts the class of each region proposal by producing probabilities that the region belongs to each class defined in the training data. The bounding box regression layer produces adjustment values for the bounding boxes of the region proposals, allowing the creation of accurate, class-specific bounding boxes. By combining the results of these two layers, Region-CNN produces class-specific bounding boxes that are more accurate than the original region proposals. ¹⁴

As a result, Region-CNN produces a set of bounding boxes classified according to objects, along with a confidence score for each detection. This information determines the locations of the objects in the image as well as their classes. In multi-object detection, the Region-CNN model handles multiple classes and region proposals, with a separate SVM classifier for each class. Each region proposal found by the selective search method is scaled and fed into a CNN, which extracts its features. These features are then fed into every SVM classifier, each trained to recognize one class. Each SVM produces a score that reflects its confidence that the region belongs to the class it represents. Thus, for example, if the model has 20 classes, each region receives 20 scores, one from each SVM. After obtaining the scores for a region from all SVM classifiers, R-CNN selects the class with the highest score as the final classification, so each region is assigned the class that the model predicts is most likely at that location. ¹⁴
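The class selection step described above can be illustrated with a minimal sketch. The function name `classify_region`, the character labels, and the score values are hypothetical and chosen only for illustration; they are not taken from the study's implementation.

```python
def classify_region(svm_scores):
    """Pick the class whose SVM gave the highest score for one region.

    svm_scores: dict mapping a class label to the score produced by the
    SVM trained for that class (hypothetical values in the example below).
    """
    label = max(svm_scores, key=svm_scores.get)
    return label, svm_scores[label]


# One region proposal scored by three per-class SVMs (illustrative numbers).
scores = {"0": -1.2, "8": 0.7, "B": 0.4}
print(classify_region(scores))  # ('8', 0.7)
```

With 43 character classes, as in this study, the dictionary would simply hold 43 scores per region.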

Since many region proposals may overlap with the same object, R-CNN uses non-maximum suppression (NMS) to prune overlapping proposals. From each group of overlapping regions, NMS selects the one with the highest score and discards the others, ensuring that each detected object is represented in the results only once, by the proposal with the highest score. The result is a set of bounding boxes, each classified into the class with the highest score for that region, with overlaps pruned by NMS. This enables the identification and classification of multiple objects in the same image. ¹⁴
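A minimal sketch of greedy non-maximum suppression for the detections of a single class is given below. The box format `(x1, y1, x2, y2)`, the overlap threshold of 0.5, and the example detections are assumptions made for illustration; the source does not state the values used in the study.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs of one class.

    Keeps the highest-scoring detection, drops every remaining detection
    that overlaps it by more than iou_threshold, and repeats.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


# Two heavily overlapping proposals for one character and one separate proposal.
detections = [
    ((10, 10, 30, 40), 0.9),
    ((12, 11, 31, 41), 0.8),
    ((50, 10, 70, 40), 0.7),
]
for box, score in non_max_suppression(detections):
    print(box, score)
```

In this example the second proposal is suppressed because it overlaps the first, higher-scoring one, while the third is kept because it covers a different character.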

The R-CNN model allows for accurate and efficient multi-object detection. The model classifies each proposed area into the class that receives the highest score from the separate SVM classifiers. Each area is assigned the class it most likely belongs to, allowing for the reliable detection of multiple objects from the same image. ¹⁴

R-CNN for expiration date and batch codes recognition

In text character recognition, the Regions with Convolutional Neural Networks (R-CNN) architecture combines rectangular region proposals with convolutional neural network features for character detection and classification. As illustrated in Figure 3, the process is initiated by extracting multiple region proposals from the source image using the selective search method (Figure 3a). Each proposed region, potentially containing text, is passed to the convolutional neural network to create a fixed-size feature vector (Figure 3b). This feature vector serves as the input to a set of linear SVMs that are trained to classify character classes, producing the initial classification (Figure 3c). Detection is further refined using a bounding box refinement layer, which adjusts the coordinates of each proposed bounding box to better align with the actual text characters in the image (Figure 3d). ¹⁴, ¹⁵

Figure 3. Region-CNN model working principle.

R-CNN’s training process, which combines a large, pre-trained network with a smaller, domain-specific network for fine-tuning, is well-suited for handling the limited datasets and complex labels typical of cardboard pharmaceutical packaging.

Training details

The dataset used for training the model contains 27 binarized source images: 16 with dot matrix characters and 11 with pressed ink-marked characters. With these images, the model was trained to recognize characters whose shapes are inconsistent due to the different printing methods used on pharmaceutical packaging. The images were resized to 227 × 227 pixels to meet the input requirements of the pre-trained AlexNet model. The deep learning model employs a dual architecture, combining AlexNet and a region-based convolutional neural network (R-CNN). AlexNet, pre-trained on a large-scale image dataset, serves as the feature extractor: the resized source images undergo convolution-based feature extraction. The model follows the method proposed by Girshick et al., ¹⁴ applying high-capacity CNNs to region proposals to accurately locate and segment objects. Training was conducted using stochastic gradient descent with momentum, starting with a learning rate of 1e-4, which was reduced by 0.02 after every eighth epoch to improve convergence. Other training parameters include 400 epochs and a mini-batch size of 58. The final layers of AlexNet were modified to match the 43 object classes in the training dataset. To prevent overfitting, data augmentation techniques such as rotation, scaling, and translation were used. The augmented images were then used to train the Region-CNN, fine-tuning it for the specific task of detecting and recognizing the varied text character features of pharmaceutical packaging. The dataset was randomly divided into a 70% training and 30% validation split. This method utilizes both AlexNet’s strengths for extensive initial learning and R-CNN’s strengths for targeted fine-tuning.
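The random 70%/30% division can be sketched as follows. This is an illustration only: the image identifiers, the fixed seed, and the rounding of the training share to a whole number of images are assumptions, since the source does not report how many of the 27 images fell into each part.

```python
import random


def split_dataset(items, train_fraction=0.7, seed=0):
    """Randomly split items into a training part and a validation part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(train_fraction * len(shuffled))  # assumed rounding rule
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers standing in for the 27 binarized source images.
image_ids = [f"image_{i:02d}" for i in range(27)]
train_ids, val_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids))  # 19 8
```

Under this rounding assumption, 27 images yield 19 training and 8 validation images.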

Key metrics for evaluating text character recognition models

In the evaluation of text recognition models, such as R-CNN, the most important metrics are True Positives (TP), False Positives (FP), and False Negatives (FN). In this domain, True Negatives (TN) are not considered in the evaluation process, as the concept of ‘non-character’ regions is inapplicable. ¹⁶ Intersection over Union (IoU) is a common object detection evaluation metric that measures the overlap between the ground truth bounding box and the bounding box detected by the algorithm. ¹⁷ As defined in Equation (1), IoU is calculated as the area of overlap between the ground truth bounding box and the predicted bounding box divided by the area of their union. ¹⁷

\[ \mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{\operatorname{area}(\text{ground truth bounding box} \cap \text{predicted bounding box})}{\operatorname{area}(\text{ground truth bounding box} \cup \text{predicted bounding box})} \tag{1} \]
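Equation (1) can be computed for axis-aligned boxes as in the following sketch. The `(x1, y1, x2, y2)` corner format and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10, 10, 30, 40)  # 20 x 30 pixels
predicted = (15, 10, 35, 40)     # same size, shifted 5 pixels to the right
print(iou(ground_truth, predicted))  # 0.6
```

Here the overlap is 15 × 30 = 450 pixels and the union is 600 + 600 − 450 = 750 pixels, giving an IoU of 0.6.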

Positive samples are used in object recognition for supervised learning models, such as R-CNN’s application for text recognition, where the model classifies different areas as containing text. After training and testing, the sample group in such scenarios can be classified into three main states, as shown in Table 1. ¹⁸

Table 1. Prediction result of the sample set.

	Actual positive	Actual negative
Predicted positive	True positives (TP)	False positives (FP)
Predicted negative	False negatives (FN)	(Not applicable)

Positive samples may create the following scenarios:

1. TP (true positive), in which a region is correctly identified by the model as containing text, IoU ≥ α.

2. FN (false negative), in which the model fails to detect a text region that is actually present.

Negative samples, which are areas not containing text, can create the following scenario:

3. FP (false positive), in which the model identifies a region without text as containing text, IoU < α.
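The three scenarios above can be counted with a simple matching procedure, sketched here under stated assumptions: each prediction is matched greedily to the best-overlapping unmatched ground truth box, class labels are ignored, and α = 0.5 is used only as an example value. The matching rule actually used in the study is not described in the source.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(ground_truth, predictions, alpha=0.5):
    """Return (TP, FP, FN) for one image using greedy one-to-one matching."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for pred in predictions:
        # Best-overlapping ground truth box that has not been matched yet.
        best = max(unmatched, key=lambda gt: iou(gt, pred), default=None)
        if best is not None and iou(best, pred) >= alpha:
            tp += 1                 # detection overlaps a character: IoU >= alpha
            unmatched.remove(best)
        else:
            fp += 1                 # detection without sufficient overlap
    fn = len(unmatched)             # characters that were never detected
    return tp, fp, fn


truth = [(10, 10, 30, 40), (40, 10, 60, 40)]
preds = [(11, 10, 31, 40), (80, 10, 100, 40)]
print(count_matches(truth, preds))  # (1, 1, 1)
```

The first prediction matches the first character, the second prediction lies on the background, and the second character is missed.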

These formulas are used to calculate precision and recall:

Precision is the proportion of correctly retrieved samples (TP) among all retrieved samples (TP + FP). ¹⁸ It is given by Equation (2):

\[ \text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{2} \]

Recall is the proportion of correctly retrieved samples (TP) among all objects that should be retrieved (TP + FN). ¹⁸ It is given by Equation (3):

\[ \text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{3} \]

The F1 Score (F-Measure), as defined in Equation (4), is the harmonic mean of precision and recall. ¹⁸

\[ \text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
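Equations (2) to (4) translate directly into code. The counts in the example are hypothetical and serve only to show the calculation; they are not results from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 characters found, 2 spurious detections, 4 missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The guards against zero denominators return 0.0 when there are no detections or no ground truth characters, which is one common convention rather than a rule stated in the source.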

Results

The experiments aimed to evaluate the recognition accuracy of the deep learning method for text recognition, using images of real-world pharmaceutical packages. The experiments were carried out on the available dot matrix printed and ink-marked texts. The recognition results were obtained by comparing the detections from the test image set with the ground truth data.

R-CNN evaluation in MATLAB

First, the text recognition performance of the R-CNN model was evaluated on the dataset using precision, recall, and F1 score. These metrics measure the performance of a multi-object detector. Precision considers both the number of correctly recognized characters and the number of incorrectly recognized characters. Recall measures the ability of the model to recognize all characters in the image, considering both the correctly recognized characters and those that were not recognized. The F1 Score combines precision and recall into a single metric, balancing the rate of correct recognitions against the model’s ability to find all characters in an image. Figure 4 illustrates the precision (top), recall (middle), and F1 Score (bottom) of the proposed text recognition model at varying thresholds.

Figure 4. Results of the proposed text recognition model’s precision (top), recall (middle), and F1 Score (bottom) at varying thresholds.

The results in the graph, where the threshold values are determined by the overlap between the bounding boxes produced by the model and the ground truth, show that the model’s precision remains high across different IoU thresholds. A decreasing recall at higher IoU thresholds indicates the model’s increasing specificity at the expense of sensitivity. The F1 score, a combination of precision and recall, reaches its optimum at an IoU threshold of 0.6. This threshold indicates the most balanced performance between detecting true positives and avoiding false positives, highlighting the model’s accuracy in distinguishing relevant text characters from the background in complex image datasets.

Comparison to Tesseract OCR

In the second experiment, both the R-CNN deep learning model and Tesseract OCR were used to recognize text characters from the same data set. Initially, the total number of characters in each image was counted. The performance evaluation included counting the number of characters that were correctly identified, incorrectly recognized, unrecognized, and those identified more than once. This provided detailed insights into the character recognition accuracy of each method. Notably, the threshold value (α) for the R-CNN model was set to 0.99 to compare its recognition precision against the widely used Tesseract OCR.

Using the same three metrics as in the previous experiment, the text character recognition performance of the R-CNN deep learning model and the Tesseract OCR algorithms is compared in Table 2, with the targets being characters printed on the images.

Table 2. Text recognition performance comparison table based on three metrics.

	Region-CNN (%)	Tesseract OCR (%)
Precision	91.1	38.3
Recall	72.7	61.3
F1 Score	80.9	47.1

The comparison reveals that the Region-CNN model achieved a precision of 91.1%, whereas the OCR method’s precision was 38.3%, indicating significantly fewer false positive detections by the Region-CNN model. The recall of 72.7% indicates that the R-CNN model correctly identifies a significant portion of the characters on pharmaceutical packaging but fails to recognize 27.3% of them (false negatives). The OCR recall of 61.3% means that about 38.7% of the manufacturing marking characters in the images remain unrecognized. With an F1 score of 80.9%, the R-CNN model combines high precision with a recall that covers a significant proportion of all characters in the dataset.
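As a consistency check, the F1 scores in Table 2 can be reproduced from the reported precision and recall values using Equation (4). The arithmetic below uses only the rounded percentages from the table, so small rounding differences would be expected.

```latex
\begin{align*}
F_{1}^{\text{R-CNN}} &= \frac{2 \cdot 0.911 \cdot 0.727}{0.911 + 0.727}
  = \frac{1.324594}{1.638} \approx 0.809 \\
F_{1}^{\text{Tesseract}} &= \frac{2 \cdot 0.383 \cdot 0.613}{0.383 + 0.613}
  = \frac{0.469558}{0.996} \approx 0.471
\end{align*}
```

Both values agree with the 80.9% and 47.1% reported in Table 2.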

The deep learning model has much higher precision and higher recall than Tesseract OCR. By balancing these two factors, the F1 Score illustrates how much better the model performed in this test. Specifically, the deep learning model proves to be the better text recognition method, especially in precisely recognizing characters with more irregular shapes. Figure 5 illustrates the results of text recognition using the Region-CNN method.

Figure 5. The result of deep text recognition in the whole image.

Left: recognized printed ink-marked packaging text. Right: recognized dot matrix printed packaging text.

Conclusions

In this paper, a novel application of deep learning for text recognition on pharmaceutical packages is presented. A method combining a pretrained Convolutional Neural Network (CNN) with Region-based CNN (R-CNN) is employed for comprehensive feature extraction from each proposed character region. This approach has achieved promising results in the recognition of manufacturing markings and texts with inconsistent character shapes.

Various methods for recognizing text printed on products have been developed. Commercial OCR engines are available that can recognize high-contrast, regularly shaped texts printed on flat surfaces. In industry, large amounts of perishable food and pharmaceutical packages are produced daily, and the texts printed on their surfaces are read by people several times during handling. Text is printed on product surfaces using various printing methods to achieve cost-effective production. The pharmaceutical industry employs printing methods such as laser printing, pressing without ink, and pressing with ink, which result in consistent character shapes. However, characters printed using dot matrix and manual stamping methods may have inconsistent shapes. A targeted method is therefore required in places where pharmaceutical product packaging is handled, such as pharmacies and drug storage facilities, where it is essential to recognize texts printed with various methods.

In this study, we compared the text recognition performance of Tesseract OCR and the R-CNN-based method on real-life pharmaceutical packaging. We found that the deep learning method produces considerably more accurate recognition results than Tesseract OCR, indicating the limitations of traditional optical character recognition for the text recognition needs of this research.

Although text recognition for pharmaceutical packaging texts printed with different methods is an essential task, it is still in its early stages due to the limited availability of source images of actual pharmaceutical packages for research purposes.

To the best of our knowledge, this is the first study in which deep learning is used to recognize texts with inconsistent character shapes printed using different printing methods. We are confident that with more training data, the proposed method’s performance would improve further.

In our next study, we will focus on the domain-specific contextual processing of recognized text characters. This will result in a three-phase text recognition pipeline targeted at recognizing manufacturing markings on pharmaceutical packaging. With this approach, we aim to improve the understanding of how deep learning can be applied in this field.

Data availability

Underlying data

Open Science Framework: The underlying data for A R-CNN deep learning method for recognizing texts printed with multiple different printing methods, https://doi.org/10.17605/OSF.IO/MP3RB. ¹⁹

This project contains the following underlying data:

• R-CNN text recognition methods source code for the Octave software.

• RCNN and OCR recognition methods metrics spreadsheet.xlsx (An Excel spreadsheet containing the achieved recognition results).

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

References

1. Global Packaging Market Size. 2020. Reference Source

2. Althobaiti: A survey on Arabic optical character recognition and an isolated handwritten Arabic character recognition algorithm using encoded Freeman chain code. Paper presented at the 2017 51st Annual Conference on Information Sciences and Systems, CISS 2017. 2017.

3. Koponen, Haataja, Toivanen: Recent advancements in machine vision methods for product code recognition: A systematic review. F1000Res. 2022;11(1099):1099. 10.12688/f1000research.124796.1

4. Tesseract User Manual | tessdoc (tesseract-ocr.github.io).

5. Mishra, Jain: A system on chip based serial number identification using computer vision. Paper presented at the 2016 IEEE International Conference on Recent Trends in Electronics, Information and Communication Technology, RTEICT 2016 - Proceedings. 2016; pp.278–283.

6. Peng, Peursum: Product barcode and expiry date detection for the visually impaired using a smartphone. Paper presented at the 2012 International Conference on Digital Image Computing Techniques and Applications, DICTA 2012. 2012.

7. Zaafouri, Sayadi, Fnaiech: A new method for expiration code detection and recognition using Gabor features based collaborative representation. Adv. Eng. Inform. 2015;29(4):1072–1082.

8. Zaafouri, Sayadi, Fnaiech: A vision approach for expiry date recognition using stretched Gabor features. Int. Arab. J. Inf. Technol. 2015;12(5):448–455.

9. Xiang, You, Qian: Metal stamping character recognition algorithm based on multi-directional illumination image fusion enhancement technology. EURASIP J. Image Video Process. 2018;2018(1). 10.1186/s13640-018-0321-7

10. Singh, Gangwar, Singh: Deep capsule network based automatic batch code identification pipeline for a real-life industrial application. Paper presented at the Proceedings of the International Joint Conference on Neural Networks. 2019-July.

11. Gong, Thota: A novel unified deep neural networks methodology for use by date recognition in retail food package image. SIViP. 2020;15(3):449–457. 10.1007/s11760-020-01764-7

12. Ashino, Takeuchi: Expiry-date recognition system using combination of deep neural networks for visually impaired. 2020. 10.1007/978-3-030-58796-3_58

13. Koponen, Haataja, Toivanen: Text Recognition of Cardboard Pharmaceutical Packages by Utilizing Machine Vision. Electron. Imaging. 2021;33:235-1–235-7. 10.2352/ISSN.2470-1173.2021.10.IPAS-235

14. Girshick, Donahue, Darrell: Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the 2014 Conference on Computer Vision and Pattern Recognition. 2014.

15. Wang, Liu, Tang: An R-CNN based method to localize speech balloons in comics. MultiMedia Modeling: 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part I 22. 2016; pp.444–453.

16. Murphy: Machine learning: a probabilistic perspective. MIT Press; 2012.

17. Nachappa, Rani, Pati: Adaptive dewarping of severely warped camera-captured document images based on document map generation. Int. J. Doc. Anal. Recognit. 2023;1–21. 10.1007/s10032-022-00425-4 PMID: 36687334; PMCID: PMC9838515

18. Gong, Liu: Advanced image and video processing using MATLAB. Vol.12. Springer; 2018; p.581.

19. Koponen: Novel Deep Learning Application: Recognizing Inconsistent Characters on Pharmaceutical Packaging. 13 March 2023. 10.17605/OSF.IO/MP3RB

10.5256/f1000research.144650.r201970

Reviewer response for version 1

Mridul Ghosh, Referee¹

¹Computer Science, Shyampur Siddheswari Mahavidyalaya, Ajodhya, West Bengal, India

Competing interests: No competing interests were disclosed.

12 December 2023

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: reject

1. The title of the paper refers to a "novel deep-learning method for recognizing texts with multiple different printing methods", but the novelty is questionable. The authors should clearly explain the novelty. They claim to have designed the R-CNN framework. The name R-CNN stands for Region-based CNN, but the authors present it as "regions with CNN". The paper should explain the difference between the two.

2. The novelty in the framework is not seen. The authors should explain that.

3. The R-CNN framework is not explained in this paper. The authors should explain in detail.

4. The dataset section is weak. The number of images is very low. How does deep learning work with so few images? The authors claim that they used data augmentation, but the techniques used and the number of images in the final dataset are not described.

5. The evaluation metrics are generally described in the experimental section before the results.

6. In the training the authors stated that they used Alexnet but it is not the R-CNN diagram. What are the requirements of these two deep learning methods?

7. In the training details, the train-test ratios are not mentioned.

8. The comparison with the state of the art is not performed.

9. The reference section is weak.

Is the work clearly and accurately presented and does it cite the current literature?

If applicable, is the statistical analysis and its interpretation appropriate?

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Reviewer Expertise:

Artificial intelligence, Deep learning, image processing

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Jarmo Koponen

University of Eastern Finland, Finland

Competing interests: No

20 June 2024

Thank you for your peer review feedback on my publication. I have revised the publication considering your comments. Here are detailed responses to the points raised:

1. The title of the paper refers to a "novel deep-learning method for recognizing texts with multiple different printing methods", but the novelty is questionable. The authors should clearly explain the novelty. They claim to have designed the R-CNN framework. The name R-CNN stands for Region-based CNN, but the authors present it as "regions with CNN". The paper should explain the difference between the two.

This is valuable feedback. I have not developed the Region CNN method, but I have created a new application that uses the Region CNN method.

2. The novelty in the framework is not seen. The authors should explain that.

Referring to the previous response, I would like to add that the application is in a new area that has been scarcely studied.

3. The R-CNN framework is not explained in this paper. The authors should explain in detail.

The paper has been revised to enhance the explanation of the R-CNN framework. Although the original version included an explanation through both text and an illustrative figure, we have now provided a more detailed description to clarify the framework further.

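4. The dataset section is weak. The number of images is very low. How does deep learning work with so few images? The authors claim that they used data augmentation, but the techniques used and the number of images in the final dataset are not described.

The training details now include that information. In addition, the 'Underlying Data' section provides more information on the matter.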

5. The evaluation metrics are generally described in the experimental section before the results.

I have thoroughly revised the evaluation metrics to ensure they are relevant to the research domain. The metrics have been updated and are now specifically tailored to better align with the study’s objectives and context.

6. In the training the authors stated that they used Alexnet but it is not the R-CNN diagram. What are the requirements of these two deep learning methods?

I have clarified the requirements and differences between AlexNet and the R-CNN methods in the revised manuscript, ensuring that their respective roles and implementations are clearly explained.

7. In the training details, the train-test ratios are not mentioned.

This information is provided both in the paper and in the 'Underlying Data' section.

8. The comparison with the state of the art is not performed.

The comparison has been performed against both a method obtained from a pharmaceutical packaging publication (Kumar, G. P., & Prasad, P. B. (2014). Machine vision based quality control: importance in pharmaceutical Industry. International Journal of Computer Applications, 975, 8887.) and an industry-validated method. Additionally, a systematic review article (Koponen, J., Haataja, K., & Toivanen, P. (2022). Recent advancements in machine vision methods for product code recognition: A systematic review. F1000Research, 11.) did not find an existing solution but identified a research gap regarding the state-of-the-art method in the pharmaceutical packaging field.

9. The reference section is weak

The related work section has been strengthened, particularly in relation to the context of the study.

10.5256/f1000research.144650.r174859

Reviewer response for version 1

Asghar Ali Chandio, Referee¹ (https://orcid.org/0000-0001-8821-2355)

¹Quaid-e-Awam University of Engineering, Nawabshah, Pakistan

Competing interests: No competing interests were disclosed.

12 December 2023

Recommendation: approve-with-reservations

The authors have worked on text recognition from pharmaceutical images. This is a complex and challenging problem; however, the methods used by the authors are not state-of-the-art. Moreover, no comparison is made with the related work.

1. The paper lacks the novelty of the work.

2. A comparison has been made between a CNN model and Tesseract OCR; however, Tesseract OCR works for scanned text documents. The authors should compare their work with methods for text recognition from images (synthetic or natural scenes).

3. The authors may use instance or semantic segmentation, which will give better accuracy than the RCNN.

4. Why are the Recall and Accuracy values much smaller than the Precision?

5. The Alex-Net is a very old model and is not preferred nowadays. The authors may use a state-of-the-art pretrained model.

6. The number of images in the dataset is very low. With so little data, the RCNN model will not give better results.

7. There are several English grammar mistakes.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

I cannot comment. A qualified statistician is required.

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

I am a PhD in Computer Vision and Image Processing. My area of research is related to text detection and recognition from natural scene images, document analysis and OCR.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Jarmo Koponen

University of Eastern Finland, Finland

Competing interests: No

20 June 2024

Thank you for your peer review feedback on my publication. I have revised the publication considering your comments. Here are detailed responses to the points raised:

1. The paper lacks the novelty of the work.

This is valuable feedback. The title of the entire publication has been reconsidered and adapted. I have not developed the Region CNN method, but I have created a new application that uses the Region CNN method.

2. A comparison has been made between a CNN model and Tesseract OCR; however, Tesseract OCR works for scanned text documents. The authors should compare their work with methods for text recognition from images (synthetic or natural scenes).

The comparison has been performed against both a method obtained from a pharmaceutical packaging publication (Kumar, G. P., & Prasad, P. B. (2014). Machine vision based quality control: importance in pharmaceutical Industry. International Journal of Computer Applications, 975, 8887.) and an industry-validated method. Additionally, a systematic review article (Koponen, J., Haataja, K., & Toivanen, P. (2022). Recent advancements in machine vision methods for product code recognition: A systematic review. F1000Research, 11.) did not find an existing solution but identified a research gap regarding the state-of-the-art method in the pharmaceutical packaging field.

3. The authors may use instance or semantic segmentation, which will give better accuracy than the RCNN.

Since OCR is proven to be in use in the pharmaceutical industry, we have not compared two new methods, but rather compared the new application to the previously used one. However, in a future study, we can compare this application using the Region CNN model to another application based on a different deep learning model.

4. Why are the Recall and Accuracy values much smaller than the Precision?

This feedback led to a thorough review and adaptation of the performance metrics. The curve computed across all threshold values now enables evaluation with three relevant metrics.
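
For reference, the way precision, recall, and the F1 score change with the detection threshold can be illustrated with a minimal sketch. It is not the code used in the study (which is written for Octave). The function names, the detection list, and the ground-truth count are hypothetical, and the sketch assumes that each detection carries a confidence score and a flag saying whether it matches a ground-truth character.

```python
from typing import List, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep_thresholds(
    detections: List[Tuple[float, bool]],
    n_ground_truth: int,
    thresholds: List[float],
) -> List[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, f1) for each confidence threshold."""
    rows = []
    for t in thresholds:
        # Keep only detections whose confidence reaches the threshold.
        kept = [is_correct for score, is_correct in detections if score >= t]
        tp = sum(kept)                 # correct detections that were kept
        fp = len(kept) - tp            # kept detections that match no character
        fn = n_ground_truth - tp       # ground-truth characters left undetected
        rows.append((t, *precision_recall_f1(tp, fp, fn)))
    return rows


# Hypothetical data, NOT taken from the paper:
# (confidence score, does the detection match a ground-truth character?)
detections = [(0.95, True), (0.90, True), (0.80, True),
              (0.60, False), (0.55, True), (0.30, False)]

for t, p, r, f1 in sweep_thresholds(detections, n_ground_truth=6,
                                    thresholds=[0.5, 0.7, 0.9]):
    print(f"threshold={t:.1f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy example, raising the threshold removes false positives, so precision rises, but it also discards correct detections, so recall falls. This is one common reason for precision to be noticeably higher than recall at a single operating point.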

5. The Alex-Net is a very old model and is not preferred nowadays. The authors may use a state-of-the-art pretrained model.

This is valuable feedback. However, although AlexNet and the Region CNN model are considered older, they contain beneficial features that are not present in newer members of the R-CNN family (Fast R-CNN, Faster R-CNN), such as a fixed-size, 4096-dimensional feature vector for each region proposal.

6. The number of images in the dataset is very low. With so little data, the RCNN model will not give better results.

This field of study, being new, is still in the development phase. The diagram of metrics added in the paper presents insights into the model's performance.

7. There are several English grammar mistakes.

I apologize for the errors. I have now corrected them.

Thank you again for your valuable feedback. I have taken it into account, and it has helped improve my skills.