WIND (Workflow for pIRNAs aNd beyonD): a strategy for in-depth analysis of small RNA-seq data

Current bioinformatics workflows for PIWI-interacting RNA (piRNA) analysis focus primarily on germline-derived piRNAs and piRNA-clusters. Frequently, they suffer from outdated piRNA databases, questionable quantification methods, and lack of reproducibility. Often, pipelines specific to miRNA analysis are used for the piRNA research in silico. Furthermore, the absence of a well-established database for piRNA annotation, as for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND ( Workflow for p IRNAs a Nd beyon D), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small non-coding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs combining information available in RNAcentral, with piRNA sequences from piRNABank, the first database dedicated to piRNA annotation. WIND was built with Docker containers for reproducibility and integrates widely used bioinformatics tools for sequence alignment and quantification. In addition, it includes Bioconductor packages for exploratory data and differential expression analysis. Moreover, WIND implements a "dual" approach for the evaluation of sncRNAs expression level quantifying the aligned reads to the annotated genome and carrying out an alignment-free transcript quantification using reads mapped to the transcriptome. Therefore, a broader range of piRNAs can be annotated, improving their quantification and easing the subsequent downstream analysis. WIND performance has been tested with several small RNA-seq datasets, demonstrating how our approach can be a useful and comprehensive resource to analyse piRNAs and other classes of sncRNAs.

Often, pipelines specific to miRNA analysis are used for the piRNA research in silico. Furthermore, the absence of a well-established database for piRNA annotation, as for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND (Workflow for pIRNAs aN d beyonD), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small noncoding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs combining information available in RNAcentral, with piRNA sequences from piRNABank, the first database

Reviewer Status
Invited Reviewers

Introduction
Advances in the field of Next-Generation Sequencing and big data analysis have led to the identification of several small non-coding RNA (sncRNA) classes, some of which are still poorly characterised 1,2 . Among others, the most investigated include microRNAs (miRNAs), small interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), small nuclear (snRNAs) and small nucleolar RNAs (snoRNAs). Increasing evidence demonstrates that the different sncRNAs constitute interconnected networks of molecules with key-regulatory functions in multiple biological processes, including physiological events, organism development or even disease 3 .
piRNAs represent an heterogeneous group, ubiquitous in most animal's germline cells, with lack of conserved sequences and few common structural features in the various species, due to the highly adaptive nature of the piRNA pathway 4 . Germline piRNAs typically have a 21-35 nt length, a strong bias for 5'-end uridine signature and a 2'-O-methyl group at their 3'-end 5 . Most of them are transcribed by either mono-directional or bidirectional genomic clusters, specific regions ranging from <1 kb to >100 kb, giving rise to a long, single-stranded precursor and further processed in multiple mature piRNAs through enzymatic cleavage. A subset of piRNAs present an adenosine bias at position 10, a feature indicating their biogenesis through the ping-pong cycle, a mechanism by which the cleavage of the target RNA is coupled with the production of a second population of target-specific piRNAs. They interact with PIWI proteins of the Argonaute (AGO) family, forming a silencing complex able to suppress transposable elements, regulate target's gene expression at both epigenetic and post-transcriptional level and defend from viral infections 6 . These piRNA functions are well studied in the animal germline, however in somatic cells, their role needs to be further elucidated. Additional studies have revealed that piRNA dysregulation can contribute to the onset of several diseases 7 . Notably in cancer, the abnormal expression of piRNAs has been associated with tumour initiation, progression, and metastasis formation and these molecules have shown the potential to be useful diagnostic tools and therapeutic targets as well as biomarkers for cancer prognosis 8 .
A limitation in understanding their function and use in clinical practice is the lack of a comprehensive and reliable method for their identification in tissues others than germline. A common strategy for piRNAs identification is based on mapping the reads obtained from high-throughput small RNA libraries to the genome and then annotate to small RNA databases for quantification. Most of the piRNA sequences identified so far have been deposited in databases such as piRNABank 9 , piRNAdb (https://www.pirnadb.org/), piRNAclusterDB 10 and others. However, data collected in these repositories mainly include germline piRNAs, while somatic piRNAs represent a minor fraction. In addition, piRNAs in somatic tissues and human cancers are less abundant than in germline, thereby leading to a more difficult identification and characterisation. Although piRNAs were initially confounded with fragments of longer RNAs, functional piRNAs have been identified to derive from fragments of rRNAs, tRNAs, snoRNAs, and post-transcriptionally processed mRNA [11][12][13] . Another level of complexity is represented by their genomic origin(s) and their actual amount, since identical sequences of piRNA can be produced by multiple genomic loci, resulting in very low precision and sensitivity.
For all the reasons stated above, and since existing workflows and tools for piRNA-analysis, usually, focus on the identification and quantification of piRNA clusters (PILFER 14 , unitas 15 ) or use outdated piRNA databases, we decided to implement a useful workflow for small RNA sequencing data analysis, able to analyse all classes of sncRNAs but especially designed for piRNA identification. We created a workflow that provides a quick method to integrate different piRNA databases in one annotation track, a two-method approach for small RNA identification, annotation and quantification, and an output with several ready-to-publish plots and statistics. Additionally, we packaged the entire workflow in several Docker 16 containers avoiding the annoying problems related to the installation and libraries dependencies. Finally, we applied it to different small RNA datasets, highlighting that piRNAs are dysregulated in breast cancer tissues and may play an important role in maintaining the stemness of MCF7 spheroid-enriched cancer stem cells (CSCs).

Methods
In this study, we implemented a workflow for small RNA sequencing data analysis, defined WIND (Workflow for p IRNAs a Nd beyon D) 17 , designed for a comprehensive identification and quantification of small-RNAs and especially of piRNAs. We deployed it exploiting the Docker containerisation approach, allowing us to integrate multiple bioinformatics tools. In detail, we created two Docker images where we adopted broadly used tools for pre-processing, reads alignment, identification and quantification of sncRNAs, and all downstream analyses. We also integrated the already available container made for Salmon 18 for transcriptome analysis. This solution takes into account best practices for reproducibility, versatility and ease of use, as the software deployment is fast and efficient. It can be used in various operating systems with only the requirements of the Docker engine and some minimum adjustments for processing power and RAM for the most memory demanding tools.

Amendments from Version 2
As suggested by the reviewer we added a clarification in the article. The WIND workflow solves many issues and discrepancies present in the original piRNA databases and it recategorizes piRNAs using sequences and genomic ranges information. However, it is still possible to find overlap between piRNAs and other ncRNAs, due to sequence similarity, ab initio categorization of a sequence as piRNA without further wet-lab validation, or real sncRNAs derived piRNAs. However, this information is stored in the final GTF. The purpose is to give complete information to allow studying piRNAs and other sncRNAs, highlighting possible ambiguities of the in silico analysis.
Furthermore, we have modified Supplementary Table 2, which in the previous version had a missing part.
Any further responses from the reviewers can be found at the end of the article

Workflow
The workflow consists of three significant steps ( Figure 1): 1. Annotation forging: the generation of the annotation files for small RNA sequences used in the next quantification step.
2. Pre-processing and quantification: pre-processing, alignment and quantification of the reads assigned to sncRNAs (using a dual approach: genomic and transcriptomic analysis).
3. Exploratory data analysis: result exploration of both quantification methods in parallel and Differential Expression (DE) with two different methodologies (edgeR 19,20 and limma-voom 21 ). The Annotation forging step, represented in blue, is the creation of a GTF file, where the two input databases (piRNABank and RNAcentral) are merged to produce the new small RNA annotation track, that together with the Fasta files constitute the inputs of the following step. In Pre-processing & Quantification step (light blue area), the user's fastq files undergo through the quality check, and the adapter removal followed by the two quantification approaches (completed by Salmon, and STAR with FeatureCounts software) that perform in parallel alignment and the quantification of reads. In the green box, representing the Exploratory data analysis phase, are displayed all the possible results produced by the workflow. The data analyst could also pursue differential expression analysis if that is the desirable outcome.
obtained sequences are finally exported to Fasta and GTF (Gene Transfer Format) file format. Eventually, we decided to provide the information available on piRNACLusterDB 10 as metadata in the final GTF file.
These tracks, for human and mouse species, have been included in the GitHub repository of the workflow and are available for users (https://github.com/ConYel/wind); therefore, the annotation forging step can be skipped. The workflow can also be used for any other species, but in this case, it would be necessary for the user to run the Annotation forging step with the specific genome and small RNA sequences, respectively.
The mapping of these piRNA sequences to the genome has revealed that the piRNAs can derive from two types of genomic locations: discrete genomic loci (the piRNA clusters) and protein-coding genes (e.g. UTRs, introns) 28-30 . Using bumphunter 31 package in the workflow, we were able to obtain piRNA origin information and provide it as the first additional file of the annotation. Since piRNAs are involved in the maintenance of genome stability through the silencing of transposable elements 32 , in this step, we also report a GTF file with the intersection between the genomic positions of small RNAs, the various categories of Transposable Elements (TEs) and information about the TE class, family and gene. Briefly, the GTF file is created with the related Fasta file; then the Genomic_Region_info, Multimapping_piRNA_info and Trans-posable_Elements_info files are reported which carry information about the genomic topology for the sequences in the GTF file. Likewise, these files are available in the GitHub repository (GRCh38 and GRCm38), for future usage by the data analyst.
Pre-processing and quantification. The second step of the workflow, Pre-processing and quantification (light blue box in Figure 1), consists of a quality control check of the small RNA-seq data, carried out using the FastQC tool 33 , followed by the adapter removal using Cutadapt 34 and by another quality control check using FastQC again. After these initial steps, the workflow exploits two different approaches for the quantification of sncRNAs. In particular, one uses alignment to a reference genome with STAR and then quantification of aligned reads with FeatureCounts 35 ; the other one uses Salmon (an aligner-free method) for the estimation of transcript-level abundance. We named the two approaches "genomic" and "transcriptomic" based on how the two methods work. Both approaches have positive and negative features. Undoubtedly, with STAR, the reads are aligned on a reference genome, Salmon instead is an alignment-free quantification method, able to prioritise the association of a feature with a specific site on a transcriptome. On the one hand, STAR could associate a read on multiple sites creating a complete list of identified regions, but this makes it more difficult to determine the genomic locations of origin, thus requiring more computational work. On the other hand, Salmon is a transcriptome quantifier able to correct for fragment GC-content and positional bias, which improves the accuracy of abundance estimates and potentially the sensitivity of subsequent DE analysis.
For Salmon, to be suitable for small RNA reads, the following options were applied: --seqBias, --gcBias, --numBootstraps 100, and --validateMappings as was suggested from the work of Wu et al. 37 . Resulting files, from the previous step, are imported to R using the Bioconductor packages: tximport (for Salmon) and Rsubread (for FeatureCounts), as DGEList objects (edgeR). After reads count, a FeatureCounts object is reported as an R object (.RDS) for an easy and fast way to import it in R. Moreover, we decided to record the assigned reads from both Salmon and FeatureCounts as BAM files.
Exploratory data analysis. The last step of the workflow is the Exploratory data analysis (EDA), which includes the filtering of low expressed small RNAs, the normalisation procedures performed in parallel for both FeatureCounts and Salmon, and then the visualisation of the results according to the suggested EDA workflows 38-41 from Bioconductor 42 . Finally, the workflow provides several useful output files: text and RDS files with filtered or normalised reads, information about the filtering step, Multidimensional Scaling (MDS) plots, biodetection plots, expression per small RNA category plot (countsbio), distance-matrix plot, hierarchical clustering plots with various normalisation methods, Principal Components Analysis (PCA) plots, Relative Log Expression (RLE) plots 43,44 , voomderived plots, sequence length barplots, and piRNA sequence logos. Briefly, for each dataset analysed, 9 RDS files, 17 tab-delimited files with all the statistics from alignment, filtering and annotation plus the filtered and normalised reads in counts per million (CPM), and 24 PDF files with several exploratory data analysis plots for each of the two methods used are generated. Furthermore, we also provided a script for the creation of ping-pong and strand coverage plots exploiting ssviz 45 and ggbio 46 R packages from Bioconductor.
Eventually, the gene expression data can be further compared using the DE analysis module, which allows calculating logarithmic fold change values using limma-voom or edgeR methods, and finally, both results (Salmon and FeatureCounts) can be merged and visualised together using heatmaps 47 . Then, the data analyst can choose to use the union of the results, and either consider all the molecules identified by at least one of the two methods, or use the intersection of the results and consider only the molecules supported by two methods. The differentially expressed molecules can be further used for piRNA target prediction analysis (included in the code) which was inspired by the similar module of iSMART tool 48 .
This workflow is structured to provide maximum flexibility to the user, who can modify several elements. In each step, alterations can be made regarding the tools or the databases used according to the needs of the data analyst, while the workflow strategy remains the same. Specifically, in the first step, the GTF file can be enriched with more sequences of interest or a completely new GTF file could be created for any species.
Currently, the first step has been performed on human and mouse sncRNA sequences for the generation of the GTF files (included in the GitHub repository), but the same approach could be utilised for any well-annotated genome that has enough small RNA sequences reported. In the second step, it is possible to use different tools for quality control, adapter trimming, aligning of the reads, e.g. Subread 49 or HISAT2 50 or a different "alignment-free" RNA-seq quantification method, as Kallisto 51 .

Validation and datasets
The complete workflow has been tested on several datasets to evaluate whether this worked in the identification of known piRNAs, low abundant molecules and in different species. Specifically, we have evaluated the performance of the transcriptomic approach on sncRNA identification, and particularly on piRNAs, for which this method has been tested here for the first time. We created a small dataset where spike-in sequences of piRNA-like molecules were added to the input RNA. COLO205_Dil_B, COLO205_Dil_C). Furthermore, we also exploited the samples processed with sodium periodate/β-elimination (samples are available on ArrayExpress, Accession number E-MTAB-8115: Treated_ COLO205_1, Treated_COLO205_2, Treated_COLO205_3, Treated_testis_1) as an additional control for the quantification algorithms. Indeed, sodium periodate oxidation strongly reduces the non-methylated molecules allowing to see a drastic change in non-methylated spike-ins concentration.
To test the performance of WIND, in both high-piRNA and low-piRNA expression conditions, we used Human Testis RNAs (BioChain Institute Inc Furthermore, we also exploited two public datasets to test our workflow thoroughly including the differential expression module dataset, consisting of two experimental conditions in triplicates, MCF-7 enriched CSCs spheroids and monolayer cultures (Accession number GSE68246 56,57 ); and a subset of 18 samples from TCGA-BRCA 58,59 , using 9 Primary Solid Tumour versus 9 Solid Tissue Normal corresponding samples.

Results
The goal of this study was to create a robust workflow for the identification and quantification of piRNA sequences in small RNA sequencing data. It focuses on elucidating and solving one of the most challenging issues of this kind of analysis, the annotation controversies of piRNAs, thus providing relatively accurate detection of the piRNA expression patterns. As a first point, a unique GTF file was generated for human and mouse species, starting from sequences obtained from the two widely used databases (piRNABank and RNAcentral) for piRNAs and sncRNAs, respectively. The GTF file was created as described in the Methods obtaining 149,549 different genomic locations corresponding to 39,812 sequences in human and 925,759 distinct genomic locations corresponding to 95,205 sequences in mouse for all small RNA types (see Extended data: Supplementary Table 1). Furthermore, in humans, from the 39,812 sequences coding for small RNAs, 28,000 were classified as piRNA, and 19,203 of them were found in common between RNAcentral and piRNABank; instead, 8,444 were found only in RNAcentral and 353 only in piRNABank. Additionally, in the mouse genome, from 95,205 sequences of small RNAs, 65,632 were categorised as piRNA, 34,306 were in common between the two databases and on the contrary, 29,114 were unique to RNACentral, and 2,213 were exclusive to piRNABank.
To test WIND thoroughly, we used several datasets with different characteristics: data produced in house, data available in a public repository, samples which include internal controls (spike-ins), datasets from different species (human and mouse), a dataset including different experimental conditions, and a dataset of tumour tissues (for more details see Methods and Data availability). First, we compared the quantification capability of the two methods implemented in the workflow. In particular, we evaluated the performance of the transcriptomic approach on piRNA quantification, as this method to our knowledge, has not yet been used to analyse this sncRNA class. For this reason, we decided to apply the workflow on an own-made dataset, in which spike-in sequences of four piRNA-like molecules, two non-methylated and two methylated at three different concentrations, were included (see Methods for details). Exploiting this feature, we were able to assess the high efficiency of both (genomic and transcriptomic) approaches in quantifying the spike-ins, as demonstrated by the very similar results obtained by the different methods. Supplementary Table 2 (Extended data) summarises the results obtained for the 4 piRNA-like molecules calculated using three methods: iSMART, FeatureCounts, and Salmon. The results show that all approaches can identify and quantify all the types of piRNA-like sequences (methylated, non-methylated, treated and not treated, and of different length) correctly.
For a long time, piRNAs have been considered exclusively expressed in germline cells, but recently, it has been reported by several studies their presence also in somatic and pathologic tissues 5,52,60-62 . Germinal cells generally show the most significant number and a higher level of expression of piRNAs. Starting from this knowledge, we tested the workflow on small RNA data obtained from human testis samples and tumour cell line (COLO205) to assess the capability to detect piRNAs in high and low concentration. Using WIND, we analysed the dataset and represented the results as plots of biodetection (Figure 2A and B) and countsBio ( Figure 2C and D) per sample from NOISeq 63 package. Biodetection plots are made from raw data in order to explore: a) the percentages of each small RNA type (named "biotypes") in the genome (referred to the whole set of small RNAs provided); b) the proportion detected in each sample; c) the percentage of each biotype within the sample. The countsBio plots, instead, show the count distribution for each biotype displayed as box plots, and the number of sncRNAs detected per biotype. Here, the two biodetection plots show, as expected, the presence of higher percentage of piRNA in testis sample respect to the COLO 205 cells (∼75% in testis and ∼20% in COLO 205; Figure 2A and B). Moreover, considering the countsBio plots ( Figure 2C and D), it is also possible to assess piRNAs higher median expression in testis if compared to the COLO 205 cells. Finally, we also produced sequence logos for the expressed piRNAs in the two sample types. These plots indicate if the bias for uridine at first position base or the adenine at the 10 th position of the sequence exists and if there are other biases in the 15 first bases of the sequences 64 . As expected, both groups showed a strong bias for uridine at the first position (drawn as thymine in the plots), in accordance with the preferential binding of PIWI proteins to transcripts starting with U. A bias, albeit with a much weaker signal, was also evident toward adenine at position 10 in testis group, a hallmark of piRNAs generated by the ping-pong cycle and typical feature of germinal cells ( Figure 2E and F).
For this analysis, we applied a stringent approach; thus, we considered as expressed only those molecules that were identified by both methods (genomic and transcriptomic). Then, we found that 7324 piRNAs were identified in testis and 223 in COLO 205 cells. Therefore, this workflow was able to efficiently identify a good number of piRNAs in somatic cells, where low levels of expression make the procedure more complicated, even when very stringent analysis parameters are used.
It is worth mentioning that, as detailed in the Methods, this workflow operates using two methods in parallel, each of which is able to identify sncRNAs with different performance. Applying the two algorithms together (considering the union of the results) allows the identification of an enriched number of molecules. The final user can decide, based on specific interests, which results should take into consideration, the union of the two approaches, only one, or the intersection.
We also evaluated the performance of the workflow for piRNA identification in the mouse. Specifically, we analysed small RNA-seq samples from mouse adult cardiac myocyte (aCM). In these samples, we were able to identify, considering the union of the genomic and transcriptomic approach, ∼290 different piRNAs per samples (see Extended data: Supplementary Table 3 for the details of the two analysis in comparison). We found that the piRNA population identified in aCM represents 12% of all reads assigned to small RNAs, and the top 100 expressed molecules are listed in Supplementary  Table 3 (Extended data).
Moreover, to test the accuracy of the workflow across diverse sets of data, we moved to a public dataset. Recent findings have indicated that the role of piRNAs may not be only limited to germ cells, but may be extended to the regulation of cancer, promoting a stem-like state of tumour cells 65 . Therefore, we selected a dataset (GSE68246) to compare the piRNA profiles of breast spheroid-enriched CSCs against parental MCF7 cells and also generated in this case files, statistics and plots with WIND that are all available on the GitHub repository. On the expression data, filtering for low-expressed features was first carried out, then two of the NOIseq filters (1 count per million, and proportional filtering) or the EdgeR were applied, filtering by group with and without the specific batch. The resulting objects were reported as RDS files and, for all the analysed sequences, a histogram ( Figure 3A and B) with the average log 2 CPM before and after filtering of the counts was made using the edgeR filtering. Finally, the normalisation of all the counts was carried out with multiple methods: TMM 66 , TMMwsp (TMM with singleton pairing), RLE 67 , limma-Voom, with and without quality weights quantile 68 , Voom, with and without quality weights using the TMM normalisation, and Voom with and without quality weights exploiting the TMMwsp normalisation. To visualise the unforeseen sources of variation and to control whether the normalisation applied was correct, RLE plots ( Figure 3C and D) were generated for all the sequenced data, for each normalisation method and for the not normalised, filtered data. An RDS object was also exported with the list of all normalised objects, and hierarchical clustering ( Figure 3E and F) was then performed on previous data with various normalised methods. We applied the Euclidean distance and the methods of Ward's, complete and average linkage. Furthermore, a correlation plot ( Figure 4A) with sample-to-sample distances was made to show the similarities and dissimilarities between samples on all sncRNA data. In order to check for batch effects and get the summarised effects of the experimental categories, MDS plots ( Figure 4B) and the first two Principal Components on a PCA plot were reported ( Figure 4C). From the GTF file, sequences' lengths were extracted and combined with information about the expressed molecules to draw  Clustering plots, exploiting all the sequenced sncRNA data, with multiple clustering methods and different normalisation methods. As an example, only the first two plots (with TMM (D) and without normalisation (C) for the filtered counts derived from FeatureCounts). In black and brown are shown the two different groups (monolayer and spheroid). the barplots (Figure 5), allowing to underline the differences between the two methods or between the groups of interest. Alongside, in this case, the sequences' logos for only the expressed piRNAs were generated. Moreover, we reported a tab-delimited file with the mean CPM per biological group, as it is useful to know these values for further studying or visualisation. All the files, plots and statistics are available in the GitHub repository.
Ultimately, we performed the differential expression analysis on the results of both methods (genomic and transcriptomic), and the union of the comparisons was reported (Extended data: Supplementary Table 4    piRNAs in the stemness of cancer cells, independently from the tissue type. As a final test, we exploited sncRNA data of 18 samples from TCGA-BRCA, 9 Primary Solid Tumor versus 9 Solid Tissue Normal (Extended data: Supplementary Table 5). We identified 10 piRNAs DE out of 235 DE sncRNA molecules with both approaches (genomic and transcriptomic). In the heatmap ( Figure 6B), it is possible to note that the two approaches obtained equivalent results, and the clustering approach showed a good clustering between tumour and normal samples. Interestingly, some of the identified piRNAs have been previously described as related to cancer progression in tissues like kidney and lung (DQ581033 70 , DQ593398 -DQ592932 71 ). In addition, from the 235 DE sncRNA, 64 are reported as miRNA and most importantly, we found the cancer-specific MIR-8 (now reported as mir-141 and mir-200) upregulated as previously reported by Hoadley et al. 59 . In order to acquire possible functional information about the DE piRNAs, we predicted their possible target RNAs (using the code included in the workflow), and we identified 11 protein-coding genes (Extended data: Supplementary Table 5). Most of them were predicted to bind their targets at the 3' UTR and 4 at the 5' UTR. The functional enrichment analysis of the 11 predicted piRNA targets, using the EnrichR online tool 72 , revealed that they might be involved in regulating "signal transduction that contributes to a DNA damage checkpoint" (GO:0072422), a biological process that has a vital role in cancer progression.

Conclusions
In this paper, we describe a novel bioinformatics workflow, WIND 17 , for the identification and analysis of piRNA from small RNA sequencing data. The main innovations of WIND are: a Docker containerisation approach for the complete analysis, the integration of two databases for piRNA annotation, a dual-method for detection and quantification of piRNAs (named as "genomic" and "transcriptomic" in this article), and the creation of ready-to-use plots and statistics for the interpretation of the results. The idea was born in order to cope with the absence of a gold-standard pipeline for piRNA identification and annotation. We tried to solve many issues related to small RNA sequencing data analysis and, in particular, piRNA identification and quantification. For this reason, the first step was to deploy multiple Docker containers set up to run all the steps of the workflow without installing tools, software or libraries. After this, we focused on the creation of an easy method to integrate data from distinct databases (RNACentral and piRNABank). As described in methods, we were able to assess the deep diversity between the databases. Indeed, it was possible to notice not only differences in numbers of piRNAs annotated between the two databases (both in human and mouse genome) but also inconsistencies in the annotation or in the classification (e.g. the same molecule is classified as piRNA in one and as miRNA in the other). Actually, combining databases usually produces discrepancies and working with sncRNA sequences that have multiple annotations is troublesome. However, with this step, it is possible to obtain a unique GTF file that merges this information (all ids and genomic locations associated with that specific molecule) that can be used for piRNA identification and annotation. The main part of our workflow consists of two detection methods for piRNAs described above as "genomic" and "transcriptomic". For the genomic part, we decided to perform an alignment using STAR. STAR is a well-known genomic aligner that uses a reference genome to compute read alignments. For the transcriptomic part, we used Salmon to produce accurate transcript-level quantification estimates from sncRNA sequencing data. Salmon's main innovation is the use of quasi-mapping (accurate and very fast-to-compute read alignments). However, even if the transcriptomic approach proved to be working well, it has been demonstrated that for the identification of some sncRNAs might not be as efficient as the genomic approach 37 . For this reason, we set the methods to improve sncRNAs identification, following the suggestions of previous works 36,37 . Our idea was to combine the two approaches in order to evaluate the similarities between the results obtained and then ameliorate the identification of piRNAs. The last step was to create a Differential Expression module and, most importantly, the automatic creation of plots and statistics useful for the interpretation of data and results.
To test WIND, we applied it to several sncRNA datasets. Working on the first dataset, were we used the spike-in approach, we found a good consistency between the different methods in the detection of piRNA-like molecules, highlighting the efficiency of both approaches in piRNA quantification. Furthermore, the test on germline and somatic tissues revealed that the two methods, even when a stringent filter is applied, are able to assess the presence of piRNAs also in tissues where they are not abundant. In addition, the workflow is also functional in different species, as shown by the results obtained on the mouse genome. Finally, we also tested WIND on two published datasets, comprising tumour cell lines and tissues. Our workflow, also in this instance, was able to identify efficiently piRNAs and find differentially expressed molecules (not previously investigated) and to recognise, in general, a significant number of sncRNAs.
It is important to clarify that WIND deals with issues and discrepancies present in the original piRNA databases and it recategorizes piRNAs using sequences and genomic ranges information. However, it is still possible to find overlap between piRNAs and other ncRNAs, due to sequence similarity, ab initio categorization of a sequence as piRNA without further wet-lab validation, or real sncRNAs derived piRNAs. However, this information is stored in the final GTF. The purpose is to give complete information to allow studying piRNAs and other sncRNAs, highlighting possible ambiguities of the in silico analysis, which researchers will be able to evaluate and eventually resolve with wet-lab experiments, such as immunoprecipitation (IP) of PIWI-containing ribonucleoprotein complexes and/or the RNA periodate treatment 73,74 .
In conclusion, WIND is a complete dockerised workflow, usable by bioinformaticians and data analysts who want to explore small RNA sequencing data globally, but specifically designed and optimised for piRNAs. WIND allows going from raw data to plots and statistics ready for publication thanks to fast and efficient software implementation, making it very useful in the field of small RNA research.

Open Peer Review
being removed from their GTF files, fix the problem and redo the analysis.
Minor: mSS28 seems to be resistant to periodate treatment, but mSS22 is not. Is my interpretation correct (Supp Table S2 Q1. This is a method article. As such, I think the method is designed in a thoughtful manner and tested with relevant controls, so it could constitute a useful resource to other researchers interested in piRNA annotation. However, although the abstract and discussion are method-oriented, the result section is strongly supportive of widespread somatic piRNA expression in mammals, which is controversial. A1. The aim of this article was to develop a workflow for sncRNA identification, focused on piRNAs, with the intention of detecting piRNAs (and not only them) independently of the tissue of origin. Indeed, we tested the WIND performance in detecting the presence of piRNAs in both somatic and germline tissues, but demonstrating the somatic expression of piRNAs is beyond this article's aim. We used these types of data as proof of the sensing power of our workflow in identifying piRNAs even in the presence of low expressed molecules.
Q2. I give the authors the credit to address this relevant research question by removing as much confounders from their GTF files as possible, and for their willingness to redo most of their analysis after my initial concerns. But an analysis of the sequence of many of the top 100 expressed piRNAs in mouse cardiomyocytes (Supplementary Table 3) shows that they still have 100% identity to rRNAs (e.g., mmu_piR_039147, mmu_piR_005109), snoRNAs (mmu_piR_032974, mmu_piR_034793, mmu_piR_025007), etc. Thus, although the design of the workflow intends to remove all piRNAs with overlap to mRNAs or ncRNA fragments, there seems to be a problem in the filtering step, and some of the article's claims are not fullfilled. My intention is not to further delay acceptance of this manuscript. But the fact that most remaining piRNAs are still non-coding RNA fragments should be at least analyzed and acknowledged. The sequence of mmu_piR_039147 is: CCTCGACACAAGGGTTTGTAA and the sequence that is found to have a partial match with it is CCTCGACACAAGGGTTTGT, which is 2 nucleotides less. The genomic ranges of mmu_piR_039147 in the GTF file are chr6 87997228-87997248 -chr9 118674951-118674971 -In the GTF file (gencode.vM25.primary_assembly.annotation.gtf.gz) that we have used to search if these ranges are included in other sncRNAs, there are no overlaps with any sncRNAs for mmu_piR_039147. As we have seen from BLASTN, a part of the sequence is identical to "Mus musculus large subunit ribosomal RNA gene, partial sequence" but that doesn't mean that it is from the same genomic location and is, therefore, an rRNA fragment. Of course, this is just one case and other piRNA sequences match 100% with a part of another sncRNA, but they can also come from a completely different genomic region.
We agree with the reviewer that some molecules can be fragments or other sncRNAs incorrectly annotated. However, it is not possible at present, without wet-lab experiments, to distinguish fragmentation products from molecules that can come from other ncRNAs such as rRNAs, tRNAs, and snoRNAs, as recently proposed by Barreñada et al.

("Diversification of piRNAs expressed in PGCs and somatic cells during embryonic gonadal development", Barreñada et al. 2020).
We believe that WIND workflow must show all the possible reproducible results and all the useful information and then the data analyst, with the help of the biologist and eventually experimental approaches, will be able to deepen the study and possibly focus on the dubious sequences, to establish whether they are or no piRNA when ambiguity exists. As we already know that these issues exist, we have included as a metadata column on the GTF file, for all the piRNA genomic regions that fall inside other sncRNAs, this information, thus the data analyst can directly check if what identified as expressed or differentially expressed has multiple annotations. In other words, for example, if the piRNA identified by the workflow is inside a miRNA precursor or tRNA, this information can be found in the metadata column "GENCODE_annot" of the GTF. Moreover, it is possible to obtain a file with the detail of all mapped reads assigned to a specific piRNA and verify if they are multi-mapped or not. The data analyst can use this file to see if most of the reads are associated with a complete sequence of a piRNA or if some (for example shorter reads) also map to another sncRNA.
In conclusion, the workflow tries to deal with the issues of piRNA annotation that the databases have until now; however, in our opinion, to completely solve these issues, it is not conceivable to use only an in-silico approach but it is necessary to combine it with wet laboratory techniques, as immunoprecipitation (IP) of PIWI-containing ribonucleoprotein complexes and/or the RNA periodate treatment, to define which sequence is truly a piRNA. Our contribution, with this workflow, is to mitigate the already existing problems and to use it as a backbone for potential future strategies for sncRNA analysis.
In order to be clear about these concerns and to clarify to the users the type of results provided by WIND, we added these sentences in the conclusion section:

"It is important to clarify that WIND deals with issues and discrepancies present in the original piRNA databases and it recategorizes piRNAs using sequences and genomic ranges information. However, it is still possible to find overlap between piRNAs and other ncRNAs, due to sequence similarity, ab initio categorization of a sequence as piRNA without further wet-lab validation, or real sncRNAs derived piRNAs. However, this information is stored in the final GTF. The purpose is to give complete information to allow studying piRNAs and other sncRNAs, highlighting possible ambiguities of the in silico analysis, which researchers will be able to evaluate and eventually resolve with wet-lab experiments, such as immunoprecipitation (IP) of PIWI-containing ribonucleoprotein complexes and/or the RNA periodate treatment".
Q3. Supplementary tables should contain a list of the most abundant piRNAs in all the cases analyzed (including COLO 205 cells, etc), and sequences should be provided. Ideally, the authors could try to understand why many piRNAs with an overlap to ncRNA fragments are not being removed from their GTF files, fix the problem and redo the analysis. A3. All resulting files (the top 100 expressed piRNAs, along with other 28 info files and 21 PDF files for each analyzed dataset) are published in the GitHub repository and can be found in this https://github.com/ConYel/wind/tree/main/Datasets_analysis folder. All the information about the sequences of each sncRNA is included in the GTF files.
Furthermore, during the 2nd revision, our main point was to remove any incongruencies between piRNA genomic positions that fall inside gene coding regions. The existence of piRNAs with an overlap to sncRNA fragments is a well-known issue that our workflow tried to mitigate, however, despite our best efforts, it is not possible to completely remove all these molecules without further wet-lab investigation. For now, we prefer to keep the sequences that are inside sncRNAs and give that information to the users as metadata. In this way, the researchers have all the information and can decide if, for the molecules of interest, it is necessary to confirm the molecule's nature with the help of wet-lab experiments. Moreover, the workflow is under constant development and better ways of solving this kind of issue could be found in the future and integrated with the workflow. Any data analyst or biologist that has a specific request, about the workflow, is able to make a suggestion or even improve it by himself.
Q4.Minor: mSS28 seems to be resistant to periodate treatment, but mSS22 is not. Is my interpretation correct (Supp Table S2) A4. Thanks to the reviewer, we noticed a missing part of the table in the submitted Supplementary table 2. We have fixed the error and we re-uploaded it.
In the "Annotation forging" step, I noticed two length thresholds (100 bp and 69 bp) were used when building the "final Fasta file". So why 100 and 69?

4.
Minor: Figure 5 is bar plot rather than histogram. 1. Figure 6B -part of the legend at the bottom is obscured. 2.
category of all small RNA species. For example, miRNA is ignored by WIND.

A.1
The most problematic part of this workflow was, without any doubt, creating a good GTF file and removing all the possible noise from piRNA sequences. We agree that, in this paper, we focused our attention mostly on piRNAs but our workflow is made for analysing all species of sncRNAs as shown in the figures. A user can easily focus his attention on another small RNA species and perform all the analyses exploiting the genomic and the transcriptomic approach. For this reason, we choose to write in the title "piRNAs and beyond", to indicate that our workflow is suggested not only for the study of piRNAs but can also be used to analyse other sncRNAs molecules.
Q.2 I agree with the author that the key to piRNA analysis is the comprehensive and accurate identification of piRNAs and piRNA precursor. However, only piRNABank and RNAcentral were used in the "Annotation forging" step of WIND while updated databases of piRNA and piRNA cluster had been published. I suggest the authors integrating piRBase and piRNA cluster database in the WIND pipeline.
A.2 Thank you for your valuable comment. Initially, we have incorporated piRNAbank in our WIND pipeline as the first building block for the piRNAs annotation track. After that, we incorporated RNACentral to include also sncRNAs sequences in the annotation track in order to have a complete view of the sncRNAs species (known until now). However, regarding the additional piRBase database, we are facing a major bottleneck in including it. In detail, the database provides the largest and more comprehensive resource of piRNA sequences annotation, including more than 8 million sequences for human and about 60 million sequences for mouse. A major problem is that many of these sequences are located in the same genomic locus, and they differ only in one or a few nucleotides opening the possibility, not yet demonstrated, that they could be piRNA isoforms produced by variations on 5' or 3' end, including nucleotides extension, addition or trimming. This situation raises the question of how these molecules should be quantified and including all of them as independent molecules ("In piRBase, if a piRNA sequence is a subsequence of another piRNA, both of them were considered as different sequences and were assigned distinct piRBase names." cited from Wang et al. (2019)) would significantly falsify the abundance ratio in the quantification step when measuring gene expression with Featurecounts or Salmon. Indeed, this could be an additional methodological constraint that would generate biased counts for any further downstream analysis (e.g. differential expression analysis). On this premise, we are considering including piRBase in a future version of WIND, but some issues about the quantification need to be solved. Nevertheless, about piRNAClusterDB (piRNAcldb), as suggested, we integrated the information included in piRNAcldb as metadata in the final GTF file produced by the annotation forging step. We updated the workflow shown in Figure 1 and we added the following sentence in the Method section: Q.3 Figure 1. "small RNA Fastq files" were not used in the "Annotation forging" step, and should be placed in the "Pre-processing and quantification" step.
A. 3 We modified the figure as suggested.
to explain below.
One of the motivations of the authors was to develop a reliable package for piRNA analysis that can be also applied for piRNA identification in somatic cells. The authors cite our 2018 study (ref. 25) so they are aware that piRNA databases contain a small percentage of contaminating entries that are probably not piRNAs. They considered this information in the design of their workflow and removed all piRNA reads in piRNABank that also have an alternative annotation in RNA Central. However, it is not clear to me whether they did the same with the piRNA sequences in RNA Central that also have an alternative annotation in RNA Central. Figure 2A shows that roughly 80% of the sequences in testis are piRNAs, and also 80% of the sequences in testis are annotated as tRNAs. So this is the proof that, if the authors really intended to depurate their GTF file from piRNAs also having an annotation in RNA Central, they were not effective in doing so. And this completely alters the author's conclusions regarding non-germinal piRNAs.
Another concern is that RNA Central is explicitly a database of non-coding RNAs. Thus, by removing entries in piRNABank that have an alternative annotation in RNA Central, they are not removing those piRNAs in piRNABank that are mRNA-fragments. How to distinguish secondary piRNAs generated from cleavage of coding sequences from mRNA fragments contaminating piRNA databases? It is not surprising, therefore, that the authors found that the remaing piRNAs in their GTF file are either derived from piRNA clusters or from protein-coding genes. This is a methodological bias and does not seem to be a deliberate decision based on the biology of the piRNA pathway. Again, this can completely alter the authors' results and conclusions when analyzing RNA-seq data obtained in somatic cells using WIND.
The authors affirm that they focused in "solving on of the most challenging issues of (small RNA) analysis, the annotation controversies of piRNAs". I'm afraid that, in my opinion, this controversy is still not solved. I would suggest the authors to read our last contribution in this topic (Tosar et al. 2020 1 ) and reconsider the design of their workflow based on what we discussed in that paper. My suggestion is to take the union of piRNAs from piRNABank and RNA Central, and remove those sequences that have alternative annotations in RNA Central. This can be used to construct GTF file 1 containing "canonical" piRNAs derived from piRNA clusters and also mRNA fragments (whether truly piRNAs or not). Then, remove sequences matching to human or mouse mRNAs from RefSeq to make GTF file 2, containing sequences that can only be classified as piRNAs. Repeat their analysis and compare the results shown in this manuscript with what they see based on my suggested approach.

Minor comments:
Consider adding "testis" and "COLO 205 cells" as a headline in Figure 2. Why 69 nt as a cut-off?
RNA Central. Figure 2A shows that roughly 80% of the sequences in testis are piRNAs, and also 80% of the sequences in testis are annotated as tRNAs. So this is the proof that, if the authors really intended to depurate their GTF file from piRNAs also having an annotation in RNA Central, they were not effective in doing so. And this completely alters the author's conclusions regarding non-germinal piRNAs.
A. 1 We thank the reviewer for the helpful suggestion. We removed all piRNA sequences in piRNABank that also have an alternative annotation in RNA Central. However, checking the sequences in RNA Central that also have an alternative annotation in RNA Central, we found only six molecules and none of them is a piRNA. These sequences have been deduplicated in the final GTF, and corresponded to snoRNAs.
We apologize to the reviewer for the misunderstanding, in Figure 2, the categories on the right of the green dashed line should be referred to the axis on the right. In this specific case, the piRNAs in the testis sample are ~75% while tRNA are ~4% as shown on the axis on the right of the plot. To clarify this point, we specified this in the legend as follows: << The biotypes on the right side of the green dashed line are the least abundant, and the reference values are reported on the Y right axis.>> Q.2 Another concern is that RNA Central is explicitly a database of non-coding RNAs. Thus, by removing entries in piRNABank that have an alternative annotation in RNA Central, they are not removing those piRNAs in piRNABank that are mRNA-fragments. How to distinguish secondary piRNAs generated from cleavage of coding sequences from mRNA fragments contaminating piRNA databases? It is not surprising, therefore, that the authors found that the remaining piRNAs in their GTF file are either derived from piRNA clusters or from protein-coding genes. This is a methodological bias and does not seem to be a deliberate decision based on the biology of the piRNA pathway. Again, this can completely alter the authors' results and conclusions when analyzing RNA-seq data obtained in somatic cells using WIND.
A.2 We want to thank the reviewer for this challenging question. As suggested in question 3, we have created a new GTF that takes this into account by removing the sequences matching to human or mouse mRNAs. For more details see "answer 3".

Q.3
The authors affirm that they focused in "solving one of the most challenging issues of (small RNA) analysis, the annotation controversies of piRNAs". I'm afraid that, in my opinion, this controversy is still not solved. I would suggest the authors to read our last contribution in this topic (Tosar et al. 20201) and reconsider the design of their workflow based on what we discussed in that paper. My suggestion is to take the union of piRNAs from piRNABank and RNACentral, and remove those sequences that have alternative annotations in RNACentral. This can be used to construct GTF file 1 containing "canonical" piRNAs derived from piRNA clusters and also mRNA fragments (whether truly piRNAs or not). Then, remove sequences matching to human or mouse mRNAs from RefSeq to make GTF file 2, containing sequences that can only be classified as piRNAs. Repeat their analysis and compare the results shown in this manuscript with what they see based on my suggested approach.
A. 3 We want to thank the reviewer for this very illuminating article regarding the piRNA annotation challenges. We agree with the reviewer that the problem has not been completely resolved, but we are trying to move in that direction and above all, we are trying to highlight that this problem must be considered and addressed in order to study the piRNAs correctly. Using the suggestions proposed, we created a new GTF file removing the sequences matching to human or mouse mRNAs. Using this approach, it is possible to note that some differences exist between the previous and the new results. For this reason, we modified all the tables, figures and data in the text accordingly. We are confident that now our workflow is stronger and more robust. However, it is possible to obtain the previous GTF from GitHub or change the code in order to produce the preferred GTF. Finally, we have also added on the method section the details about the creation of this new GTF file: <> Minor comments: Q.4 Consider adding "testis" and "COLO 205 cells" as a headline in Figure 2.
A. 4 We modified the figure as suggested.

Q.5
The authors refer that the problem of detecting piRNAs in COLO 205 cells is their low expression. However, there are some sequences which are highly expressed according to Figure 2, D. Are these sequences really piRNAs?
A.5 When we talk about the low expression of piRNAs in COLO205 we are referring to an average expression of all identified molecules, however, among these, there are also molecules with a higher expression. In any case, after the creation of the new GTF file, as described before, we reanalysed all the datasets and we were still able to identify some molecules highly expressed and classified as piRNAs. Obviously, WIND exploits the previous knowledge about sncRNAs, this means that the obtained results, even if more accurate thanks to these new improvements, are still limited due to the primary databases used. However, as always suggested, a wet-lab validation should be necessary to confirm the in silico results and to really establish the correct identification and function of the discovered molecules, but this is beyond the scope of this article.
Q.6 A brief description of the sequencing library preparation should be supplied. If the authors spiked in methylated RNAs, treating the samples with sodium periodate before NGS could have been an interesting control.
A.6 As suggested by the reviewer, we added three samples treated with sodium periodate before library preparation and now, in Supplementary Table 2, is available the "Treated" table with the percentages of all the spike-ins used. Furthermore, we modified the method section accordingly: Q.7 Sequence logos are nice and can be informative, but the workflow could be more powerful if it included plots showing ping-pong signals.

A.7
We added in the GitHub repository a code called "pinp_pong.Rmd" that creates a pingpong plot or a coverage plot for both the strands from a BAM file selected for piRNAs. Finally, we modified the Methods section as following: