Keywords
Deep Learning Neural Network, Bacteriocin, Lactic Acid Bacteria, K-mers, Embedding Vectors
The rise of antibiotic-resistant bacteria presents a pressing need to explore new natural compounds with innovative mechanisms to replace existing antibiotics. Bacteriocins offer promising alternatives for developing therapeutic and preventive strategies in livestock, aquaculture, and human health. Specifically, those produced by lactic acid bacteria (LAB) are recognized as Generally Recognized as Safe (GRAS) and hold Qualified Presumption of Safety (QPS) status. This study aims to develop a deep learning model specifically designed to classify bacteriocins by their LAB origin, using interpretable k-mer features and embedding vectors to enable applications in antimicrobial discovery.
We developed a deep learning neural network for binary classification of bacteriocin amino acid sequences (BacLAB vs. Non-BacLAB). Features were extracted using k-mers (k=3, 5, 7, 15, 20) and embedding vectors (EV). Ten feature combinations were tested (e.g., EV, EV+5-mers+7-mers). Sequences were filtered by length (50–2000 AA) to ensure uniformity, and class balance was maintained (24,964 BacLAB vs. 25,000 Non-BacLAB). The model was trained on Google Colab, demonstrating computational accessibility without specialized hardware.
The ‘5-mers+7-mers+EV’ group achieved the best performance, with k-fold cross-validation (k=30) showing: 9.90% loss, 90.14% accuracy, 90.30% precision, and 90.10% recall and F1 score. Fold 22 stood out with 8.50% loss, 91.47% accuracy, and 91.00% precision, recall, and F1 score. Five sets of 100 LAB-specific k-mers were identified, revealing conserved motifs. Despite high accuracy, sequence length variation (50–2000 AA) may bias k-mer representation, favoring longer sequences. Additionally, experimental validation is required to confirm the biological activity of predicted bacteriocins. These aspects highlight directions for future research.
The model developed in this study achieved results consistent with those reported in the reviewed literature, outperforming some studies by 3–10%. Its implementation in resource-limited settings is feasible via cloud platforms like Google Colab. The identified k-mers could guide the design of synthetic antimicrobials, pending further in vitro validation.
We have implemented substantial revisions in response to the reviewers' constructive feedback. In the abstract, we now mention limitations related to computational cost and class imbalance to better reflect the model’s real-world feasibility. The introduction has been expanded to discuss shortcomings of existing deep learning models and potential biases due to underrepresentation of certain LAB genera. Suggested references were added, and the study’s aim was clarified at the end of the section.
In the methods, we clarified hyperparameter tuning and justified k-mer lengths based on conserved bacteriocin motifs (e.g., the "pediocin box" for k = 14–19), citing prior studies. We also added information on scalability, indicating that experiments were conducted on Google Colab.
The results section includes comparison with alternative models, which is developed further in the discussion. There, we highlight that while our model shows strong performance on the current dataset, broader scalability and real-world application require further validation. We also discuss how computational costs scale linearly with sequence length and emphasize the need for experimental validation in future work.
In the conclusion, we moderated our claims by explicitly addressing limitations, including reliance on public datasets and the lack of biological testing.
Additional revisions include: an updated and improved description of Figure 2; clarification of Figure 6; relocation of Table 8 to the discussion; removal of Table 2; and clarification of subfigures (a) and (b) in Figure 8. We appreciate the reviewers’ detailed suggestions and confirm that all points have been addressed accordingly.
The emergence of antibiotic-resistant bacteria and the rise of new diseases are critical challenges that demand the search for new natural compounds with innovative mechanisms of action to support or replace current antibiotics in use.1,2 Some bacteria have the ability to produce antimicrobial proteins to inhibit or kill other nearby bacteria. This serves as a form of microbial competition and defense.3 These antimicrobial proteins, known as bacteriocins, are effective against related or similar bacteria to those that produce them, but generally do not affect other organisms such as human or animal cells.4,5 Bacteriocins have emerged as alternatives for treating urinary tract, skin, respiratory, gastrointestinal infections, among others. They provide additional or alternative treatment options compared to conventional antibiotics.6–8
A summary of the classification of bacteriocins can be seen in Table 1.
This table summarizes the different classes of bacteriocins, detailing their molecular mass, properties, structural characteristics, and examples.
Classification | Subclasses | Characteristics | Examples | Reference
---|---|---|---|---
Class I (lantibiotics) | Subclass Ia, Subclass Ib | Molecular mass: <5 kDa. Properties: resistant to proteolysis, thermostable, and resistant to pH. Structure: intramolecular cyclic, providing rigidity and resistance to the action of proteases. | Nisin, Subtilin, Mersacidin | 9–13
Class II (non-lantibiotics) | Subclasses IIa, IIb, IIc, IId | Molecular mass: <10 kDa. Properties: thermostable, pH resistant, and able to depolarize bacterial cell membranes. Structure: amphipathic helical with disulfide bridges that increase peptide stability. | Pediocin, Plantaricin, Lactococcin A | 9–12,14
Class III | Subclass IIIa, Subclass IIIb | Molecular mass: >30 kDa. Properties: thermolabile and unmodified; two mechanisms of action, lytic and non-lytic. Structure: large proteins. | Helveticin J, Millericin B | 9,10,14
Class IV | – | Molecular mass: not specified. Properties: thermostable and resistant to pH. Structure: large peptides with complex structure. | Lactocin S, Enterocin AS-48, Circularin | 10
A common type of bacteria known to produce bacteriocins is Lactic Acid Bacteria (LAB).15 Additionally, LABs are particularly intriguing due to the long history of safe use of some strains and their status as “Generally Recognized as Safe” (GRAS), along with the “Qualified Presumption of Safety” (QPS) that most LAB strains possess.16,17 Typically, LABs are either cocci or rods and encompass over 60 genera. The major genera include Aerococcus, Carnobacterium, Enterococcus, Lactobacillus, Lactococcus, Leuconostoc, Oenococcus, Pediococcus, Streptococcus, Tetragenococcus, Vagococcus, Propionibacterium, Bifidobacterium, and Weissella.2,18
Although these genera include the main producers of bacteriocins,15 their uneven representation in public databases may introduce bias. For example, in UniProt, the genus Lactobacillus accounts for over 60% of LAB bacteriocin sequences,19 while genera such as Weissella or Vagococcus are underrepresented (less than 5% each). To mitigate this risk, our study employs stratified cross-validation that preserves taxonomic proportions and includes sequences from all selected genera, even the less common ones (see Methods). However, we acknowledge that the full diversity of bacteriocin-producing LAB is still not captured in the available databases.
Bacteriocins produced by LAB have gained popularity due to their promising applications in the food industry as natural preservatives. This reduces the need for adding chemical preservatives or applying physical treatments during food production.20,21 Additionally, they can be used within the pharmaceutical and medical industry, serving as therapeutic agents or alternatives to traditional antibiotics.22 Bacteriocins derived from LABs are colorless, tasteless, and odorless. Moreover, they possess several crucial metabolic traits such as strong tolerance to low pH, the ability to produce acid and aroma, protein hydrolysis, production of viscous exopolysaccharides, and resilience to high thermal stress.12,23,24
On the other hand, the development of machine learning and artificial intelligence techniques, coupled with the availability of sequenced bacterial genomes, has enabled the use of new techniques in bioinformatics. In the context of bacteriocins, employing neural networks allows for the identification of patterns in amino acid sequences (aa), providing an advantage in discovering new bacteriocins that remain uncharacterized.25,26 This research is based on the need to efficiently identify bacteriocin sequences produced by LAB,27,28 as the genetic and structural diversity of these peptides poses a challenge.29 Therefore, a deep learning neural network was developed for the binary classification of bacteriocin amino acid sequences, distinguishing between those produced by lactic acid bacteria (BacLAB) and non-BacLAB. Feature extraction using the k-mer method and vector embedding was employed.
Some microorganisms contaminate food and beverages, leading to their deterioration, a constant concern in the food industry because contamination can spoil taste and cause foodborne illnesses in humans.30,31 Bacterial pathogens transmitted through food are the primary cause of food poisoning. Chemical additives have been widely used for food preservation; however, their toxicity may raise human health issues, and many commercially used preservatives are synthetic chemicals.32,33 Currently, there is a negative public perception of chemical preservatives, leading consumers to prefer alternatives considered more “natural”.34
In response to this demand for natural preservatives, bacteriocins show significant potential for use in the food industry, aiming to prevent food spoilage and hinder disease transmission by inhibiting the growth of pathogenic bacteria.34,35 Certain LAB-derived bacteriocins, such as nisin, pediocin, enterocin, and leucocin, have been employed for this purpose.36–38 They can be used in the preservation of dairy products, meats, vegetables, sourdough bread, wine, among others.2 Furthermore, using bacteriocins as preservatives leads to the creation of tastier, less acidic, lower salt content, and higher nutritional value food products. Additionally, these bacteriocins can be used as antimicrobial films in food packaging to extend the shelf life and expiration dates of these products.39,40
However, it’s important to note that while bacteriocins are a promising tool, their application is still under development and study, and they do not completely replace traditional antibiotics in all cases. Further research is needed to fully understand their potential and limitations.34
Currently, the growing resistance of bacterial pathogens poses a serious challenge to global public health, impacting not only humans but also animals, plants, and the environmental ecosystem.41 Drug resistance is on the rise worldwide due to the excessive and uncontrolled use of antimicrobial substances. According to the WHO, superbugs represent one of the most significant threats to public health, causing millions of deaths each year.42 It is projected that by 2060, at least 20 new types of antibiotics will be needed to effectively address the problem of bacterial drug resistance. However, developing new antibiotics involves a long and complex process, posing a significant barrier. Therefore, it is imperative to explore and develop new therapeutic strategies capable of effectively combating antibiotic-resistant microorganisms.7,18
In clinical applications, some bacteriocins have demonstrated efficacy in treating infections, especially those caused by multidrug-resistant strains. Being produced by non-pathogenic bacteria that typically colonize the human body, they are of interest in the medical field.43–45 Some identified bacteriocins applicable in the treatment of infectious diseases include nisin, lacticin, salivaricin, subtilosin, mersacidin, enterocin, gallidermin, epidermin, and fermentin.30 Furthermore, bacteriocins have been explored for potential use in treating conditions such as diarrhea, dental caries, mastitis, and cancer.46–48
Livestock, comprising domestic animals raised in agricultural settings, play a crucial role in providing labor and a wide range of products such as milk, meat, eggs, hides, and leather. Maintaining livestock health and improving the economy through optimal production requires proper feeding and effective hygiene practices. However, farm animals remain susceptible to infections caused by viruses and bacteria despite these measures.49–51
In the quest to safeguard animal health on farms, novel techniques are being explored as alternatives to antibiotics. This search becomes especially relevant due to various infectious diseases caused by bacteria in cattle, including conditions like mastitis, post-weaning diarrhea, meningitis, arthritis, endocarditis, pneumonia, and septicemia. Despite this pressing need, the range of bacteriocins evaluated for maintaining livestock health is limited, primarily focusing on nisin, lacticin, garvicin, and macedocin.52–54
The application of bacteriocins in livestock food or water has ensured food safety by reducing the presence of foodborne pathogens in the gastrointestinal tract.55,56 Bacteriocins have not only been used to improve cattle productivity; probiotic strains capable of producing bacteriocins have also been explored to increase the growth rate of pigs. Furthermore, efforts have been made in the poultry industry to control Salmonella.57 Maintaining a diet with bacteriocin-producing bacteria can reduce existing populations of foodborne pathogens such as Salmonella and Escherichia coli and prevent the reintroduction of these pathogenic bacteria.55 Additionally, they can be used in other forms, such as intra-mammary formulations for mastitis, which act as germicidal preparations applied to cows’ udders.58,59
Aquatic cultures face similar challenges to livestock, dealing with potential pathogenic risks and requiring preventive measures such as various breeding techniques, vaccination, and antibiotic use.55,60 Bacteriocins function as probiotics, leveraging the interconnected ecosystem shared by animals and microorganisms within the aquatic environment. This interaction promotes probiotic competition against pathogenic bacteria, facilitating the production of inhibitory compounds. As a result, it improves water quality, strengthens the immune response of host species, and enhances species nutrition by producing additional digestive enzymes.61–63
Studies involving photosynthetic bacteria like Rhodobacter sphaeroides and bacteriocins derived from Bacillus spp. have investigated their impact as probiotics on shrimp growth and digestive enzyme activity.64,65 Likewise, experiments with nutrient-enriched water using Alchem Poseidon, a blend of Bacillus subtilis, L. acidophilus, Clostridium butyricum, and Saccharomyces cerevisiae, have shown potential for preventing infections, as the administered bacteria successfully colonized both the host and the aquatic environment.66,67
Among the works carried out using deep learning neural networks to analyze large datasets and achieve accurate classification of bacteriocins is the article by Poorinmohammad et al. (2018).68 In this study, peptide sequence analysis is conducted using machine learning alongside feature selection, and a Sequential Minimal Optimization (SMO)-based classifier is developed to predict lantibiotics, achieving precision and specificity values of 88.5% and 94%, respectively. However, this approach was limited to lantibiotics (Class I bacteriocins) and did not address the structural diversity of other bacteriocin classes.
Furthermore, in the work of Yount et al. (2020),69 the BACIIα algorithm was created to identify and classify bacteriocin sequences. This algorithm integrates a consensus signature sequence, physicochemical elements, and genomic patterns within a high-dimensional query tool to select peptides resembling bacteriocins. It accurately retrieved and distinguished almost all known class II bacteriocin families, achieving a specificity of 86%. While innovative, BACII’s reliance on predefined class II motifs limits its applicability to novel or atypical bacteriocin families. In the article by Akhter and Miller (2022), a similar approach was taken, where a machine learning-based software tool was developed to extract potential features from bacteriocin and non-bacteriocin sequences, considering their physicochemical and structural properties. Support Vector Machine (SVM) and Random Forest (RF) algorithms were employed. In this article, a precision of 95.54% was achieved.70 Notably, this tool used small datasets (<1,000 sequences), which may restrict its generalization to broader bacteriocin diversity.
Various methods have also been used to identify bacteriocins from bacterial genomes based on bacteriocin precursor genes or contextual genes. For instance, BAGEL71 and BACTIBASE72 are online tools that analyze experimentally validated and annotated bacteriocins, similar to the BLASTP protein search tool. These tools identify potential bacteriocin sequences based on their homology to known bacteriocins. However, such similarity-based approaches suffer from two critical limitations: they inherently exclude bacteriocins with low homology to known sequences, and their databases are biased toward well-studied LAB genera (e.g., Lactobacillus), underrepresenting rare producers like Weissella or Vagococcus. This issue led to the development of the BOA software,73 which attempts to address the problem by integrating prediction tools based on the conservation of contextual genes from the bacteriocin operon. Nevertheless, it still relies on homology-based genomic searches.
A taxonomic bias exists in current tools. A recurring challenge in bacteriocin prediction is the overrepresentation of certain LAB genera (e.g., Lactobacillus, Enterococcus) in public databases, which may skew models toward recognizing features specific to these groups. For example, in UniProt, >60% of annotated bacteriocin sequences derive from just three genera, potentially marginalizing structurally unique peptides from less-studied LAB. This bias could lead to false negatives in ecological or industrial applications where microbial diversity is crucial.
In addition, the study by Nguyen et al. (2019) utilized a different technique from the previous methods by applying word embeddings of protein sequences to represent bacteriocins. This approach takes into account the amino acid order in protein sequences to predict new bacteriocins from sequences without relying on sequence similarity. While promising for novel bacteriocin discovery, their model was trained on limited data and did not account for taxonomic imbalances in sequence sources. This method even enables the prediction of potentially unknown bacteriocins with high probability. Overall, representing sequences with word embeddings that preserve information about the sequence order can be applied to peptide and protein classification problems where sequence similarity cannot be used.74
Similarly, in the work by Hamid and Friedberg (2019),75 word embedding was used to identify bacteriocins, representing protein sequences using Word2vec. These representations were used as inputs for various deep recurrent neural networks (RNNs) to distinguish between bacteriocin and non-bacteriocin sequences. This technique addresses challenges such as diversity among bacteriocin sequences. Though effective, their RNN architecture required manual tuning for different bacteriocin classes, reducing scalability. Meanwhile, Fields et al. (2020) developed a process for designing and testing bacteriocin-derived compounds. They employed machine learning and a filter of biophysical features to generate an algorithm that predicts bacteriocins. This involved generating characteristic sequences of 20-mers.26 A key limitation was their focus on short peptides (≤50 AA), excluding larger bacteriocins like Class III.
Current bacteriocin prediction tools, such as BAGEL71 and BACII,69 fail to address two critical needs in the field. First, they lack taxonomic resolution, making them unable to distinguish bacteriocins produced by lactic acid bacteria (LAB) from those synthesized by other bacterial groups. Second, they depend heavily on sequence homology, which limits their ability to detect structurally novel bacteriocins, particularly those originating from understudied LAB genera. This gap significantly restricts their usefulness in industrial and therapeutic contexts, where the specific taxonomy of the bacteriocin-producing organism is essential for applications such as probiotic development, targeted antimicrobial design, and food safety strategies.
To overcome these limitations, our study uses a balanced dataset spanning all major bacteriocin classes and LAB genera (without genus-specific evaluation), employs k-mer features independent of sequence homology, and validates performance on a generalized LAB group to ensure broad applicability (see Methods).
Additionally, there are other works that use antimicrobial peptide (AMP) sequences. However, it’s important to note that all bacteriocins are antimicrobial peptides, but not all antimicrobial peptides are bacteriocins. For example, in the study by Li et al. (2022),76 they present a deep learning model called AMPlify for antimicrobial peptide prediction. The cross-validation results for the model achieve 91.70% accuracy, 91.40% sensitivity, 92.00% specificity, and 91.68% F1 score.
Similarly, in Wang et al. (2023),77 they developed a bidirectional short and long-term memory deep learning network called AMP-EBiLSTM with an accuracy of 92.39%. This approach employs a binary profile function and a pseudo-amino acid composition to capture local sequences and extract amino acid information. In another study, a model known as AMP-BERT was developed. This network uses a bidirectional transformer encoder (BERT) architecture to extract structural and functional information from input peptides, categorizing each input as AMP or non-AMP. Notably, this network achieved a correct prediction rate of 76% for external test sequences selected in this research.78
Similarly, a system called AMPs-Net was introduced, an algorithm designed to streamline experimentation and improve the efficiency of discovering potent AMPs. It exhibited good prediction of the antibacterial capabilities of numerous peptides, with an average accuracy ranging from 80.98% to 91.2% and precision varying from 75.77% to 94.26%.79 In the study by Gull et al. (2019), they achieved 97% accuracy for an algorithm that identifies biologically active and antimicrobial peptides.80 Similarly, in the study by Redshaw et al. (2023), a neural network was developed to predict the antimicrobial activity of sequences. It was trained on two different databases, achieving a precision result of 86-92% for one database and 72-77% for the other.81
In another work, an application used for predicting antimicrobial peptides based on properties achieved an accuracy exceeding 80% and sensitivity above 90%.82 In the study by Yan et al. (2020), a method for predicting short-length antimicrobial peptides (≤ 30 aa) is presented. Their convolutional neural network, called Deep-AmPEP30, demonstrated a 77% accuracy rate.83 Additionally, in the study by Veltri et al. (2018), a deep learning neural network using embedding vectors to reduce weights when processing sequences was developed. It was shown that antimicrobial peptides could be constructed using only nine amino acids, achieved through the k-mers method. The network achieved an accuracy of 90.55%.84
The primary aim of this study is to develop a deep learning model that accurately distinguishes bacteriocin sequences produced by lactic acid bacteria (LAB) from non-LAB bacteriocins. Unlike existing tools that classify bacteriocins generically, our approach specifically targets the LAB/non-LAB dichotomy, enabling applications in probiotic development and food safety. It uses k-mer signatures and embedding vectors to overcome the limitations of homology-based methods and provides interpretable features (100 characteristic k-mers per length) to guide synthetic peptide design.
The general flow of the method used is illustrated in Figure 1. In section a), the input of the AA sequences is shown; there are two groups, BacLAB and Non-BacLAB. Subsequently, feature extraction is performed for each sequence using two methods. In b), k-mers are used to obtain vectors of 0s and 1s representing the presence or absence of representative k-mer groups; the resulting vectors have a length of 100. In c), a 128-feature embedding vector is obtained by passing the sequence through an RNN. These features are concatenated in d), and the resulting concatenation serves as input for the DNN in step e). Finally, in f), a prediction is made for the AA sequences entered into the trained model. Training and validation were performed on Google Colab (a cloud-based environment with free GPUs), confirming that the model is computationally efficient and replicable without investment in expensive infrastructure.
This figure illustrates the comprehensive flow of the method used to predict bacteriocin amino acid sequences in BacLAB and Non-BacLAB groups.
The AA sequences from both BacLAB and Non-BacLAB were obtained using the publicly accessible UniProt database, downloaded in xlsx format using the Excel option on the platform.19 The search on this platform was conducted using the keyword “bacteriocin.” The retrieved parameters for each bacteriocin include: Entry, Organism, Length, and Sequence. Additionally, considering the binary classification, a column was added to label the sequences. The BacLAB dataset was labeled as 1, while the Non-BacLAB sequences were labeled as 0.
To classify which sequences correspond to BacLAB and which ones to Non-BacLAB, the parameter “organism” was considered to identify the species that produce the bacteriocin. The LAB genera included for classification encompassed Lactobacillus, Lactococcus, Leuconostoc, Pediococcus, Streptococcus, Aerococcus, Alloiococcus, Carnobacterium, Dolosigranulum, Enterococcus, Oenococcus, Tetragenococcus, Vagococcus, and Weissella.85
Sequences with lengths between 50 and 2000 amino acids were selected to ensure consistency. After filtering, the BacLAB dataset contained 24,964 sequences. For the Non-BacLAB dataset, which originally had a larger number of sequences, a random subset of 25,000 sequences was selected to prevent class imbalance in subsequent analyses. Figure 2 illustrates the length of each individual sequence (y-axis) plotted against its position in the ordered dataset (x-axis), allowing a direct comparison of length trends between BacLAB and Non-BacLAB sequences.
The curves display the length (in amino acids) of each BacLAB and Non-BacLAB sequence, plotted according to their original position in the dataset. The x-axis represents the sequence index (1 to 25,000), and the y-axis shows the corresponding sequence length. Sequences were filtered to retain lengths between 50 and 2000 amino acids.
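The filtering and balancing steps above can be sketched with Pandas; the column name `Sequence` follows the UniProt download described earlier, while the function name and default seed are ours:

```python
import pandas as pd

def filter_and_balance(bac_df, non_df, min_len=50, max_len=2000,
                       n_non=25000, seed=42):
    """Keep sequences of min_len-max_len AA and subsample the larger class."""
    bac = bac_df[bac_df["Sequence"].str.len().between(min_len, max_len)]
    non = non_df[non_df["Sequence"].str.len().between(min_len, max_len)]
    # Randomly subsample Non-BacLAB so both classes stay roughly balanced
    non = non.sample(n=min(n_non, len(non)), random_state=seed)
    return bac.reset_index(drop=True), non.reset_index(drop=True)
```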
K-mers
In the realm of amino acid sequence processing (or biological sequences in general) using neural networks, a ‘k-mer’ refers to subsequences of length ‘k’.86 These subsequences are formed by dividing a longer sequence into specific-sized fragments, where ‘k’ represents the size of each fragment.87 For example, a k-mer of size 5 would involve splitting the sequence into all possible subsequences of length 5, as illustrated in Figure 3. The k-mer features of a set of sequences enable the discovery of hidden patterns within that sequence population. Additionally, k-mers are useful for representing sequences in a more manageable way.88
On the left side are shown the k-mers that would be obtained from a sequence if k=5 is set. On the right side the same sequence is used, but in this case k=7.
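The sliding-window extraction illustrated in Figure 3 can be sketched in a few lines (the function name is illustrative):

```python
def get_kmers(sequence, k):
    """Slide a window of size k over the sequence to get all overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# An 8-residue sequence yields 8 - 5 + 1 = 4 overlapping 5-mers
print(get_kmers("MKLVFSTA", 5))  # ['MKLVF', 'KLVFS', 'LVFST', 'VFSTA']
```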
At this stage, a list of the 100 most common k-mers within the BacLAB data set was generated. For this, several values of k were selected (k=3, 5, 7, 15, and 20). The k-mers of each BacLAB sequence were generated. Once all the k-mers were obtained, the frequency of each of them was counted. The 100 k-mers with the highest frequency were selected; this was done for each value of k, resulting in five different lists.
After compiling the lists, feature vectors of ‘0’ and ‘1’ were extracted for each sequence, both for those in the BacLAB and Non-BacLAB groups. The k-mers obtained from each sequence were compared with the list of k-mers. A ‘1’ was assigned if the listed k-mer was present in the analyzed sequence, while a ‘0’ was assigned if the k-mer was not found. This process produced a vector of length 100. Figure 4 illustrates the process.
The list of 100 selected k-mers is compared with the k-mers of the input sequence. If one of the k-mers of the sequence is found in the list, '1' is added; if it is not found, a '0' is added. This process generates a representative vector for the sequence with 100 features in length. In this example, k=5 is used.
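The two steps above, ranking the 100 most frequent k-mers in the BacLAB set and building the presence/absence vector for each sequence, can be sketched as follows (function names are ours):

```python
from collections import Counter

def top_kmers(sequences, k, top_n=100):
    """Count k-mers across all sequences and keep the top_n most frequent."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [kmer for kmer, _ in counts.most_common(top_n)]

def presence_vector(sequence, kmer_list, k):
    """Binary feature vector: 1 if a listed k-mer occurs in the sequence, else 0."""
    seq_kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return [1 if kmer in seq_kmers else 0 for kmer in kmer_list]
```

Running `top_kmers` once per value of k yields the five lists of 100 k-mers; `presence_vector` then turns any BacLAB or Non-BacLAB sequence into its 100-feature representation.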
Word embeddings are numerical representations of amino acids, where each letter denoting an amino acid receives a unique, discrete index. Each amino acid is treated as a distinct input token, and the set of 20 amino acids forms a specific dictionary.
For example, ‘A’ (Alanine) is assigned index 1; consequently, each occurrence of ‘A’ in a sequence is denoted by the value 1. Figure 5 clarifies the process of generating the index vector. Letters that appear in the sequence but are not found in the amino acid list are represented as zero. These indices are used to encode sequences before introducing them into the neural network that generates the embedding vectors.
a) The index number corresponding to each aa is assigned. b) Shows how the sequence is encoded with the indices that correspond to each AA. c) Given that there can be letters or numbers in the sequence that do not exist in the aa list, a value of 0 is assigned as an index. This way, errors are avoided when processing the sequence.
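A minimal sketch of this index encoding, assuming an alphabetically ordered 20-AA alphabet (which yields ‘A’ → 1 as in the example; the exact ordering used in the study is not specified):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 'A' -> 1 ... 'Y' -> 20

def encode_sequence(sequence):
    """Map each residue to its index; characters outside the list map to 0."""
    return [AA_INDEX.get(ch, 0) for ch in sequence]

print(encode_sequence("ACX"))  # [1, 2, 0] -- 'X' is not in the 20-AA alphabet
```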
Once the index-encoded vectors are obtained, the embedding vectors are extracted. To derive these features, a recurrent neural network (RNN) is applied using the Gated Recurrent Unit (GRU) cell. RNNs with GRUs can handle sequences of varying lengths due to their inherent sequential processing nature and the specific architecture of GRUs. This makes GRU-based RNNs particularly useful in applications where sequence lengths are variable, as they can efficiently handle input length variability without losing learning capacity.90–92
The embedding layer in the network acts as a lookup table or a weight matrix where each row represents, in our case, a vectorized representation of a specific amino acid.93 The number of rows equals the count of unique elements in the vocabulary: the number of amino acids plus one, with index zero reserved for characters not found in the amino acid list. The number of columns represents the embedding dimension, a model hyperparameter set to 128 in this case. Consequently, the embedding vector obtained for each sequence also has length 128. Normally, the weight matrix is initialized randomly along with all the network parameters before training begins. However, for this step, a pre-trained network is used, loading its weights into the model. Figure 6 illustrates the structure of the RNN model.
Input: Amino acid sequence (top) and its integer encoding (middle). Embedding Layer: Converts encoded indices into dense vectors (128-D) via a pre-trained weight matrix. GRU Layer: Processes sequential data (arrows indicate flow direction), capturing contextual relationships between amino acids. Linear Layer: Final transformation (LogSoftmax) for classification.
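To make the lookup-table behavior concrete, the following sketch performs the row lookup with a randomly initialized matrix. In the actual pipeline the weights come from the pre-trained GRU network (rnn_gru.pt in the repository); the random initialization here only illustrates the mechanics.

```python
import random

EMBED_DIM = 128      # embedding dimension (model hyperparameter)
VOCAB_SIZE = 20 + 1  # 20 amino acids plus index 0 for unknown characters

random.seed(0)
# Placeholder weight matrix; in the paper these weights are loaded
# from a pre-trained network rather than initialized randomly.
weight_matrix = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
                 for _ in range(VOCAB_SIZE)]

def embed(index_vector):
    """Map each amino acid index to its 128-D row of the weight matrix."""
    return [weight_matrix[i] for i in index_vector]

vectors = embed([20, 6, 12, 6, 18])   # index-encoded 'YGNGV'
print(len(vectors), len(vectors[0]))  # -> 5 128
```

Because the layer is a pure lookup, identical amino acids (here the two ‘G’ residues) receive identical vectors; it is the GRU layer downstream that injects sequential context.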
Different datasets will be used to train the neural network and determine which combination of parameters produces the best results. For the selection of k-mers, values of k=3, k=5, k=7, k=15, and k=20 will be used, as shown in Table 2.
This table presents the different feature combinations of embedding vectors and k-mers used for training the neural network.
| Concatenation groups |
| --- |
| EV |
| EV + 3-mers |
| EV + 5-mers |
| EV + 7-mers |
| EV + 15-mers |
| EV + 20-mers |
| EV + 3-mers + 5-mers |
| EV + 3-mers + 7-mers |
| EV + 5-mers + 7-mers |
| EV + 15-mers + 20-mers |
These specific values for k (k = 3, 5, 7, 15, 20) were chosen to align with conserved motifs reported in bacteriocin literature, ensuring coverage of both short functional domains and longer structural regions:
• k = 3–7: These values target small but critical motifs, such as the 5-AA sequences YGNGV/YDNGI in class IIa bacteriocins, and extended variants (e.g., 7-AA YGNGVXC) associated with antimicrobial activity13,94–96
• k = 15–20: Longer k-mers were selected to encapsulate the “pediocin box” (e.g., YGNGVXCXXXXCXV, 14 AA; or YGNGVXCXXXXCXVXWXXA, 19 AA), a hallmark of bacteriocin tertiary structure and functionality.97–99 This range also accommodates similarities in the N-terminal half of sequences (17–19 AA) linked to target specificity.98
By incorporating this spectrum of k-values, our approach balances granularity (capturing short motifs) with context (preserving structural dependencies), a strategy validated in prior studies on peptide classification.98,99
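The k-mer extraction and frequency-based selection underlying these feature groups can be sketched as follows. The toy sequences are hypothetical and merely embed the YGNGV motif discussed above; the real pipeline selects the 100 most frequent k-mers per value of k over the full dataset.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping substrings of length k in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def top_kmers(sequences, k, n=100):
    """The n most frequent k-mers across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return [km for km, _ in counts.most_common(n)]

# Toy example with hypothetical sequences containing the YGNGV motif:
seqs = ["KYYGNGVTCGK", "KYYGNGVSCNK"]
print(top_kmers(seqs, 5, n=3))  # 'YGNGV' appears among the top 3
```

A sequence of length L yields L−k+1 overlapping k-mers, which is why longer sequences contribute disproportionately many candidates, a bias discussed later in the limitations.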
To predict amino acid sequences, a Deep Neural Network (DNN) was employed following the structure described in Jeff et al.’s article.99 This type of network was chosen for its ability to learn complex patterns and representations from data. Additionally, they can efficiently handle large datasets.100 The construction of this neural network used Python 3.10.12 in Google Colab along with several libraries: i) Pandas (RRID:SCR_018214),101 ii) Keras, iii) Scikit-learn (RRID:SCR_002577),102 iv) NumPy (RRID:SCR_008633),103 and v) Matplotlib.
The network architecture consists of four blocks. The input for each sequence is a vector, which corresponds to the concatenation of the results described in the k-mers section and the embedding features. Therefore, the length of the input depends on the number of concatenated features. In Figure 7, a representation is used where the extracted results using k-mers for k=5 and k=7, and the embedding features are concatenated. Since the result in k-mers corresponds to a vector of length 100, while the embedding features provide a vector length of 128, the input corresponds to a vector length of 328 for each sequence. The output of the neural network is the class of each sequence, where 1 denotes BacLAB and 0 represents non-BacLAB.
The model established the number of neurons in each defined layer block, with 128 neurons for the first two layers, 64 neurons for the next four layers in the second block, followed by 32 neurons for the five subsequent layers in the third block, and finally, two neurons in the last two layers in the fourth block. The number of neurons was determined based on the input parameters and the DNN architecture.104 Out of the total thirteen layers in the model (excluding input and output layers), four layers are dense, three layers are activation layers, three layers are dropout layers, two layers are normalization layers, and one layer is a flattening layer. Table 3 provides a summary of the layers in the proposed DNN model.
Additionally, among the hyperparameters used, 75 epochs were set, with a batch size of 40 and a learning rate of 2.5×10−5 for the Adam optimizer. Mean absolute error (“mean_absolute_error” in Keras) was used as the loss function. For training and testing the neural network, the k-fold cross-validation technique was employed, with k=30.
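The k=30 split can be illustrated with a plain-Python index generator. The authors presumably used a library implementation (such as scikit-learn’s KFold, possibly with shuffling); this sketch only shows how test folds of roughly 1,665 sequences arise from the 49,964-sequence dataset.

```python
def kfold_indices(n_samples, k=30):
    """Yield (train_idx, test_idx) index pairs for k-fold cross-validation."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(49964, k=30))  # 24,964 BacLAB + 25,000 non-BacLAB
print(len(folds), len(folds[0][1]))       # 30 folds, ~1,665 test sequences each
```

Each sequence appears in exactly one test fold, so every metric in Table 4 is an average over 30 disjoint test sets.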
For hyperparameter tuning, an iterative approach based on cross-validation (k=30) was employed, where values were progressively optimized through empirical evaluation of key metrics (loss, accuracy, and F1-score). While this method does not follow an automated search (such as grid search), it allowed for flexible adaptation to the dataset characteristics, prioritizing the balance between model stability and computational efficiency. The final hyperparameters (learning rate=2.5×10−5, batch size=40, epochs=75) were selected when consistent convergence was observed across the evaluation metrics. Given the modular nature of the model (a combination of k-mers and embeddings) and the size of the dataset, manual optimization allowed us to prioritize biologically relevant hyperparameter combinations, reducing the computational cost compared to exhaustive methods (like grid search or random search).
In this study, an ANOVA test along with a Tukey test was used to assess significant differences among multiple groups for the parameters of interest: accuracy, loss, precision, recall, and F1 score. These parameters are critical for evaluating the performance of the implemented neural network.
A 95% confidence level was selected to ensure that the differences identified between the groups are statistically significant, providing greater certainty about the conclusions drawn from the analysis. RStudio Cloud was used as the statistical analysis tool to conduct these evaluations.
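Although the statistical analysis was run in R via RStudio Cloud, the one-way ANOVA F statistic (the quantity behind the Pr(>F) values reported later) can be reproduced in a few lines of Python. The per-fold accuracies below are hypothetical values for illustration only.

```python
def one_way_anova_F(*groups):
    """F statistic for a one-way ANOVA over two or more groups."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # Between-group sum of squares (Sum Sq of the factor)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group (residual) sum of squares
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)  # Mean Sq, factor
    ms_within = ss_within / (n - k)    # Mean Sq, residuals
    return ms_between / ms_within

# Hypothetical per-fold accuracies for two feature groups:
ev = [0.893, 0.889, 0.891, 0.890]
ev_5_7 = [0.901, 0.903, 0.900, 0.902]
print(one_way_anova_F(ev, ev_5_7) > 1)  # large F -> groups differ
```

A large F (small Pr(>F)) rejects the null hypothesis that all group means are equal; the Tukey test then localizes which pairs differ.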
The lists of k-mers were obtained for values of k=3, k=5, k=7, k=15, and k=20. For each value of k, the 100 most frequent k-mers among the sequences were selected. The list can be found in an .xlsx file in the repository.105 Through k-fold cross-validation, various performance metrics of the neural network were obtained. These metrics include loss, precision, recall, F1 score, and accuracy. They were evaluated for each group with different feature concatenations. Since thirty iterations were performed for each set, Table 4 presents the metrics averaged per group.
The initial evaluation was conducted using only features extracted from EV. The results obtained for each metric demonstrate notable performance, as both precision and F1 score reached approximately 89%, while the loss was around 10%. Nevertheless, more features were then included to examine whether the metrics could be improved. Therefore, a concatenation of EV features with various k-mers was implemented.
To determine whether there are significant differences between the metrics of each group, an Analysis of Variance (ANOVA) was conducted for each metric. Table 5 shows the results. This analysis revealed substantial differences between the groups, as the Pr(>F) values are less than α=0.05. Therefore, the null hypothesis is rejected and the alternative hypothesis is accepted.
The parameters in the table indicate: Df: degrees of freedom, Sum Sq: sum of squares, Mean Sq: mean square, Pr(>F): p-value.
To discern the differences between groups, a Tukey post hoc test was conducted. This test allows paired comparisons of the group means, since the aim is to determine whether using concatenated features yields better results than using EV exclusively. Table 6 presents the results of the Tukey test for the groups that show significant differences between using EV exclusively and the concatenation of EV with k-mers. The complete table can be found on the GitHub page.
The parameters in the table indicate: diff: difference in the means of the compared groups, lwr: lower limit of the confidence interval, upr: upper limit of the confidence interval, p adj: adjusted p-value.
In the accuracy parameter, there is a significant difference for the groups ‘3-mers + EV’, ‘5-mers + 7-mers + EV’, and ‘5-mers + EV’. These show ‘p adj’ values lower than α=0.05. The difference between the mean values of the EV group and the ‘3-mers + EV’ group in the ‘diff’ parameter yields a positive value, indicating that the results of the EV group are superior compared to ‘3-mers + EV’. Conversely, the differences of the ‘5-mers + 7-mers + EV’ and ‘5-mers + EV’ groups are negative. This indicates that using these two concatenation groups of k-mers and EV produces better accuracy results than using only EV.
For the precision parameter, the mean values of the EV group surpassed those of ‘3-mers + EV’, showing a positive difference. Similarly to accuracy, the exclusive use of EV yields superior precision. However, the groups ‘5-mers + 7-mers + EV’ and ‘5-mers + EV’ exhibited higher mean values than EV, displaying negative differences, indicating that these groups produce better precision than the exclusive use of EV. Regarding the loss parameter, significant differences were observed only between EV and the ‘EV + 5-mers + 7-mers’ group. In contrast to accuracy, the mean values of EV were higher than those of the ‘EV + 5-mers + 7-mers’ group, favoring the concatenated feature group, considering that lower loss percentages are desired in a neural network.
Results for the Recall and F1 scores showed significant differences between the EV and ‘3-mers + EV’ groups for both parameters. However, in both cases, the mean values for EV outperformed ‘3-mers + EV’. These results indicate that optimal Recall and F1 scores are generated for the EV group. The Tukey test results indicated that the ‘5-mers + 7-mers + EV’ group produces the best result. Among the cross-validation folds of this group, fold k=22 demonstrated the best result, recording a loss of 8.500%, an accuracy of 91.471%, and a precision, recall, and F1 score of 91.000%. Due to its performance, this methodology was chosen for implementation as the model’s classifier and for incorporating the weights generated in the neural network.
Notably, this accuracy surpasses or is comparable to values reported for: i) general bacteriocin classifiers (e.g., 88.5% in SMO-based models68 and 95.54% in SVM/RF approaches70), and ii) broader antimicrobial peptide predictors (e.g., 91.7% in AMPlify76), despite addressing the more complex LAB/non-LAB distinction. While direct comparisons are limited by the absence of prior taxonomy-aware models, these benchmarks contextualize our model’s competitive edge. Full methodological comparisons are detailed in the Discussion.
Figure 8 illustrates the progress of the loss and accuracy metrics during the 75 epochs of fold 22. The measurements indicated adequate convergence during training. Initially, accuracy revealed low values that progressively increased over epochs, both in training and validation ( Figure 8a). In contrast, the loss was high during the initial stages of training, decreasing as the training and validation processes progressed ( Figure 8b). Although a larger number of epochs was tested, no increase in accuracy or decrease in loss was observed beyond the maximum reached at epoch 70, so this parameter was set at 75, as increasing it would imply greater computational expense without any benefit.
Visualization of the training metrics for the classifier after 75 epochs of fold 22, which yielded superior results by employing the concatenation of 5-mers, 7-mers and embedding vectors. (a) Accuracy progression during training and validation; (b) Loss progression during training and validation.
The efficiency of the neural network was assessed using a confusion matrix. The data from the main diagonal were presented, indicating the number of correct predictions made by the model ( Figure 9). A total of 732 sequences were correctly classified as non-BacLAB, while 791 were classified as true BacLAB proteins. Values below the main diagonal represent false negatives, where 39 cases were incorrectly classified as non-BacLAB. On the other hand, values above the main diagonal reflect false positives, where 103 cases were incorrectly classified as BacLAB.
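As a sanity check, the headline metrics can be recomputed directly from these confusion-matrix counts; the result is consistent with the 91.47% accuracy reported for fold 22 and with the sensitivity/specificity imbalance discussed below.

```python
# Counts reported in Figure 9 for fold 22:
# TP = 791 (BacLAB correct), TN = 732 (non-BacLAB correct),
# FN = 39 (BacLAB missed), FP = 103 (non-BacLAB called BacLAB).
TP, TN, FN, FP = 791, 732, 39, 103

accuracy = (TP + TN) / (TP + TN + FN + FP)
sensitivity = TP / (TP + FN)  # recall for the BacLAB class
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.4f}, "
      f"sensitivity={sensitivity:.4f}, specificity={specificity:.4f}")
# accuracy ~0.9147; sensitivity (~0.953) exceeds specificity (~0.877)
```

The 1,665 sequences in this matrix correspond to one test fold of the 30-fold split over the 49,964-sequence dataset.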
According to the results of the Tukey test, the concatenation of EV and k-mers did not improve the evaluation metrics for all combinations. When comparing EV with ‘3-mers + EV’, decreases in metrics such as accuracy, precision, and loss were observed for the latter group. This could be caused by using a very short k-mer, which increases the probability of finding these k-mers in non-BacLAB sequences, resulting in more false positives. On the other hand, other combinations like ‘7-mers + EV’, ‘15-mers + EV’, ‘20-mers + EV’, ‘3-mers + 5-mers + EV’, ‘3-mers + 7-mers + EV’, and ‘15-mers + 20-mers + EV’ did not show a statistically significant difference for any metric. Finally, the ‘5-mers + 7-mers + EV’ group was found to produce the best results.
The superior performance of the ‘5-mers + 7-mers + EV’ group can be attributed to the selected lengths of k-mers. In several studies, characteristic peptide sequences produced by lactic acid bacteria with lengths of 5 and 7 AA have been identified. Bacteriocins of subclass IIa contain the consensus sequence YGNGVXC at the N-terminal end that characterizes them. Similarly, sequences of leucocin A-UAL 187, sakacin P, and curvacin A had this same 7 AA pattern in their N-terminal region.9,97 However, other articles consider only the highly conserved part, the first 5 AA excluding the variable AA. This characteristic sequence is YGNGV or YGNGL.13,106,107 Therefore, given the precedent that certain characteristic sequences of length 5 and 7 exist among bacteriocins, this could explain why the combination of these groups yields better results.
On the other hand, the confusion matrix showed a higher sensitivity than specificity. Improving specificity could be considered in future work since, for this study, higher specificity would be preferable over sensitivity: misclassifying a non-BacLAB sequence as BacLAB could result in wasted resources if experimental laboratory tests are implemented.
Regarding computational efficiency, the model was designed to balance performance and accessibility. All stages (training, validation, and testing) were run on Google Colab using free GPUs (T4/K80), demonstrating that the system does not require specialized hardware for implementation. This choice ensures that the methodology is reproducible in resource-limited academic or industrial environments, without compromising the accuracy of the results. However, while the model is viable in standard environments such as Google Colab, its performance on massive datasets (e.g. >1 million sequences) may require architectural adjustments to maintain reasonable training times.
The model developed in this study achieved results within the range reported in the literature. However, direct benchmarking against existing models is challenging, as no previous studies have specifically addressed binary classification of bacteriocins produced by LAB vs. non-LAB. For example, the BAGEL software can detect putative gene clusters of bacteriocins in new bacterial genomes and has demonstrated an ROC (Receiver Operating Characteristic) analysis value of 0.99.108 Comparable to the BLASTP protein search tool, these applications use techniques to help recognize potential bacteriocin sequences by evaluating their similarity to known bacteriocins.109
Similarly, there is the Bacteriocin Operon and Gene Block Associator (BOA) software, which, unlike other models, identifies homologous gene blocks associated with bacteriocins to predict new ones.73 The Bacteriocin-Diversity Assessment software (v1.2) also performs similar operations. Although these studies mention achieving high accuracy, the specific percentages reached are not reported.110 Additionally, a comparison was made with studies using machine learning and deep learning techniques in Table 7. In this comparison, as mentioned earlier, the present study achieves accuracy within the range reported in the literature, surpassing by 3% the work of Poorinmohammad et al. (2018)68 and by 4% the results obtained by Redshaw et al. (2023).81
| Method | Purpose | Database | Metrics evaluated | Reference |
| --- | --- | --- | --- | --- |
| Generation of physicochemical characteristics, support vector machine (SVM) and random forest (RF) model | Predict bacteriocin protein sequences | 283 bacteriocins and 283 non-bacteriocins | Accuracy: 95.54% | 70 |
| Word embedding with deep recurrent neural networks (RNN) | Predict new bacteriocins from protein sequences without using sequence similarity | 346 bacteriocins and 346 non-bacteriocins | Accuracy: 99% | 75 |
| Sequential minimal optimization (SMO)-based classifier | Search for relevant characteristics of lantibiotics, which can be used in lantibiotic bioengineering | 280 lantibiotics and 190 non-lantibiotics | Accuracy: 88.5%; Specificity: 94% | 68 |
| Word-embedding algorithm using biophysical properties | Design and testing of compounds derived from bacteriocins to generate 20 AA peptides that can be synthesized and their activity evaluated | 346 bacteriocins and 346 non-bacteriocins | - | 26 |
| Support vector machines (SVM) | Identification of biologically active and antimicrobial peptides | 2704 in total | Accuracy: 97% | 80 |
| Krein-support-vector machine (SVM) | Predict the overall antimicrobial activity of sequences | Two datasets: 3556 and 3246 | 1st dataset’s accuracy: 86-92%; 2nd dataset’s accuracy: 72-77% | 79 |
| Embedding vectors and deep learning neural network (DNN) using k-mers | Identification of bacteriocins produced by LAB | 24,964 BacLAB and 25,000 non-BacLAB | Accuracy: 91.47%; Loss: 8.50%; Precision: 91.47%; Recall: 87.66%; F1 score: 91% | This work |
This work also demonstrated superior performance compared to the BACIIα algorithm, which identifies and classifies bacteriocin sequences; by integrating physicochemical and genomic patterns from known Class II bacteriocin families, it achieved 86% specificity.34 Similarly, a better outcome was observed compared to using sequence composition as features: in a study where this feature was used, an accuracy of 90.55% was achieved.84 A similar result was observed compared to the work of Dua et al. (2020), which achieved an accuracy of 91.7%.111 However, it is important to consider that each study uses different amounts of data.
Although the model has demonstrated strong performance in its results, it is important to consider that the sequence filtering step (50 ≤ length ≤ 2000 amino acids), while ensuring a manageable range for training, introduces two main limitations. First, there is a length bias in k-mer representation. Longer sequences naturally contain more subfragments (k-mers), which increases the likelihood of matching characteristic k-mers from the feature list—even if those matches are not biologically relevant. This can lead to a higher chance of false positives in longer sequences, potentially compromising the accuracy of the classification.
Second, standardizing variable-length sequences into fixed-size k-mer vectors (100 features) results in the loss of structural information that depends on the original sequence length. While k-mers are effective at capturing local motifs, they do not preserve information about the relative position of those motifs within the full sequence. As a result, important structural patterns, such as domain arrangements in distant regions, may be lost during the vectorization process.
On the other hand, our model was trained using the best-characterized LAB genera (Lactobacillus, Enterococcus, etc.), which are the most abundant in public databases. For example, in UniProt (the database used in this study), 62% of LAB bacteriocin sequences correspond to Lactobacillus, while genera such as Weissella represent only 3.5%. Although we employed stratified cross-validation to reduce bias, this disparity could affect the detection of atypical bacteriocins in rare genera. Future studies could enrich the dataset with experimental isolates from underrepresented taxa.
Future iterations of the model could address these limitations by incorporating normalization weights based on sequence length to correct for bias, or by including positional k-mers—such as dividing the sequence into segments and extracting k-mers from each region independently. Additionally, the validation was limited to computational data. While the model demonstrated high precision (91.47%), its real-world applicability would require in vitro experimental testing to confirm whether the sequences classified as BacLAB actually produce functional bacteriocins. Furthermore, it is necessary to verify whether the identified k-mers are truly associated with antimicrobial activity. These experiments, although crucial, fall outside the scope of this study and represent a valuable direction for future research.
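The positional k-mer idea mentioned above could, for instance, take the following shape. This is a purely illustrative sketch, not part of the published model; the segment count and the choice to overlap segments by k-1 residues (so no boundary-spanning k-mer is lost) are assumptions.

```python
def positional_kmers(seq, k, n_segments=4):
    """Extract k-mers separately from each of n_segments regions,
    tagging each k-mer with its segment index to retain coarse
    positional information."""
    seg_len = max(len(seq) // n_segments, k)
    tagged = []
    for s in range(n_segments):
        # Extend each segment by k-1 residues to keep boundary k-mers.
        segment = seq[s * seg_len:(s + 1) * seg_len + k - 1]
        tagged += [(s, segment[i:i + k])
                   for i in range(len(segment) - k + 1)]
    return tagged

# The same 5-mer found in the N-terminal vs C-terminal region now
# yields distinct features because of the segment tag.
print(positional_kmers("KYYGNGVTCGKNGVTCAAAA", 5, n_segments=2)[:2])
```

Counting (segment, k-mer) pairs instead of plain k-mers would let the feature vector distinguish, for example, an N-terminal YGNGV motif from the same motif occurring elsewhere.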
In this study, we developed a deep learning neural network for binary classification of bacteriocin sequences, successfully distinguishing LAB-produced bacteriocins from non-LAB sequences. Our approach combining k-mer features (k=3,5,7,15,20) and embedding vectors achieved optimal performance with the ‘5-mers+7-mers+EV’ configuration, demonstrating 91.47% accuracy and 8.50% loss in the best fold (k=22). These results compare favorably with existing bacteriocin classification tools, outperforming some by 3-10%, despite addressing the more challenging LAB/non-LAB distinction.
Key strengths of our approach include the identification of 500 characteristic k-mers that may serve as signatures for LAB bacteriocins, validation on a large, balanced dataset (≈25,000 sequences per class), and computational efficiency via implementation on Google Colab.
However, we acknowledge important limitations. First, taxonomic bias: public databases overrepresent certain LAB genera (e.g., Lactobacillus), potentially affecting model generalizability to rare producers. Second, sequence length constraints: our 50-2000 AA filter may exclude structurally important extremes. Third, the lack of experimental validation: predicted bacteriocins require in vitro confirmation of biological activity.
Future work could: expand taxonomic diversity through targeted sequencing of underrepresented LAB, investigate k-mer positional conservation within full-length sequences, and validate top predictions through antimicrobial assays. These advances would strengthen the model’s utility for developing targeted antimicrobials in food safety and therapeutic applications.
Zenodo: Deep Learning Neural Network Development for the Classification of Bacteriocin Sequences Produced by Lactic Acid Bacteria: Repository. https://doi.org/10.5281/zenodo.13279718.105
This project contains the following underlying data:
Software-Related Files:
• BacLABNet_script.ipynb (Deep Learning Neural Network for classification of Bacteriocin Sequences)
• embed_proteins.py (Recurrent neural network used to obtain the embedding vectors)
• model_I22.h5 (This file contains the trained weights of the model)
• model_I22.json (This file contains the structure of the trained model)
• rnn_gru.pt (Initial weights of the Recurrent Neural Network to obtain embedding vectors)
• List_kmers.csv (List of 5-mers and 7-mers obtained from the dataset after filtering out sequences shorter than 50 AA or longer than 2000 AA)
Files Used for Training, Testing, and Validation of the Neural Network
• data_nonBacLAB.csv (25,000 non-BacLAB amino acid sequences retrieved from UniProt)
• data_BacLAB.csv (24,964 BacLAB amino acid sequences retrieved from UniProt)
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Zenodo: Deep Learning Neural Network Development for the Classification of Bacteriocin Sequences Produced by Lactic Acid Bacteria: Repository. https://doi.org/10.5281/zenodo.13279718.105
• data_BacLAB_and_nonBacLAB.csv (Combination of sequences from data_BacLAB.csv and data_nonBacLAB.csv)
• all k.mers list.xlsx (Table of all k-mers obtained for k=3,5,7,15,20)
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Source code available from: https://github.com/lady1004/BacLAB-Deep-Learning-Neural-Network.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.13279718.
License: CC0 1.0 Universal.