YODEL : Peak calling software for HITS-CLIP data

YODEL is a peak calling software for analyzing RNA sequencing data generated by High-Throughput Sequencing of RNA isolated by Crosslinking Immunoprecipitation (HITS-CLIP; also known as CLIP-SEQ), a method to identify RNA-protein interactions genome-wide. We designed YODEL to analyze HITS-CLIP experiments, in which Argonaute proteins are immunoprecipitated, followed by sequencing of the associated RNA in order to identify bound microRNAs and their mRNA targets. The HITS-CLIP sequenced reads are mapped to the genome, and then read peaks are visualized where clustered sets of reads map to the same region. Several peak calling algorithms have been developed to define the boundaries of these peaks. In contrast to other peak callers for HITS-CLIP data, such as Piranha, YODEL does not map the starts of reads to fixed interval bins, but instead uses a heuristic approach to iteratively find the tallest point within a set clustered reads and examine bases upstream and downstream of that point until a peak has been determined. This allows the peak boundary to be defined more precisely than coordinates that are multiples of the bin size. Per-sample peak counts are also generated by YODEL, which quickly enables downstream differential representation analysis. YODEL is available at . https://github.com/LancePalmerStJude/YODEL/


Introduction
A peak caller that could accurately define a single peak amongst several samples was required to analyze High-Throughput Sequencing of RNA isolated by Crosslinking Immunoprecipitation (HITS-CLIP) (Darnell, 2010) data from fetal liver red blood cell precursors of miR-144/451 -/-and wild-type mice (Paralkar et al., 2014; Paralkar VR, Palmer LE, Xu P, Lechauve C, Zhao G, Yao Y, Luan J, Wu G, Vourekas A, Mourelatos Z, Scheutz JD and Weiss MJ; unpublished study).Piranha (Uren et al., 2012) is one such software that is commonly used to identify peaks generated by HITS-CLIP.However, Piranha bins the starts of reads and does not fully define a peak.Consequently, a large bin size may result in multiple peaks being combined into one peak.We found that the identification and resolution of peaks using Piranha was highly dependent on the background threshold (-a) and binSize (-b) parameters, and it was unclear how these parameters should be set in order to obtain the most biologically relevant information.We also found that running Piranha on multiple samples separately results in peak boundaries that may be quite different from sample to sample.Generating initial peak calls from a combined sample dataset creates a single standard set of peak boundaries for all samples, which simplifies downstream analysis.We therefore developed a peak calling algorithm, named YODEL, with the following properties: 1) Incorporate strand specificity (Piranha does this, but many other CHIP-SEQ peak callers do not); 2) Generate per-sample read counts for each peak; 3) Have parameters that have easily understandable implications when changed.

Input
The main input for the peak caller is a BED file generated by clusterBed from the BEDtools suite (Quinlan & Hall, 2010) with the -s option used (see Supplementary material: Input file formats).If multiple samples are to be analyzed simultaneously, the name field must contain the sample name or ID before the first colon, followed by the read ID or other descriptive text.In addition, a sample list must be designated (with the YODEL parameter -sampleList) (see Supplementary material: Input file formats).The sample list will identify which samples are to be included for peak calling.After peak calling, read counts for each peak in all samples will be calculated.If no sample list is provided, all reads will be treated as one sample.As an example of how to process HITS-CLIP FASTQ files to generate the input clustered BED file, see Supplementary material: Analysis of HITS-CLIP data from Chi et al., 2009.Implementation YODEL was written in Python and tested with Python version 2.7.YODEL processes each read cluster as it is encountered within the input clustered BED file.For each cluster, the base coverage at each position for all samples under examination for peak calling is determined.Position specific counts for all individual samples are calculated as well.Once the counts are tallied, the program iteratively identifies peaks until no additional peaks are found.The program identifies the position with the highest read count.If the read count is less than the minimum peak height (mph) than no further peak calling for the cluster is performed.From the position with the highest read count, bases upstream are analyzed one at a time.The lowest read count (lowestPoint) up to the current base is tracked.Where dipTolerance (dt) and peakDipBuffer (pdb) are input parameters, if at any position the count is 0 or the count >= (lowestPoint+ peakDipBuffer)* dipTolerance, then the peak start has been determined and is recorded as the base position where the count was 0 or the base position of lowestPoint.This is repeated for bases downstream of the highest point to find the peak end.The peak summit is defined as the median of all the positions with the highest count.Two sets of peak boundaries (25% and 50%) are defined as the positions where the coverage is 25% and 50% of the highest point, and the maximum peak heights per sample are determined.The numbers of reads per peak are calculated (at both the 25% and 50% range) by determining the number of reads that overlap the peak by at least the input parameter binSize (bs).Typically, we have used the 25% peak boundary for downstream calculations.The peak counts are determined on a per sample basis, and for all the samples used in peak calling combined.Once a peak has been determined, the base coverage for positions within the full peak is set to 0 so that no further peaks are called overlapping it.The peak finding process (starting with finding the position with highest read count) is repeated until no more peaks are identified.After peaks are identified by YODEL, several filtering steps can be applied to remove low quality peaks (see Supplementary material: Peak filtering, Supplementary Figure 2-Supplementary Figure 6).
Operation YODEL has been tested on both Windows and Linux running Python 2.7 with standard libraries.Some of the tools (e.g.Bedtools) used to generate input files are Linux or OS X specific.There is no minimum memory requirement for YODEL, but the size of any BED files sorted with the Linux sort command may be dependent on system memory.See 'Supplementary material: Analysis of HITS-CLIP data from Chi et al., 2009' for instructions on how to preprocess data and run YODEL.

Use case
The output of YODEL is described in the Table 1. Figure 1 shows a comparison of YODEL and Piranha output from HITS-CLIP  analysis of wild type and miR144/451 -/-fetal liver erythroblasts.
Results from two different YODEL parameter settings are shown in blue in the lower half of the figure.Cab39 (Figure 1A) (Godlewski et al., 2010) and Ywhaz (Figure 1B) (Yu et al., 2010) are two known miR-451a target mRNAs.For Cab39, the largest peak contains a miR-451a seed match that is not present in the knockout sample.Because Piranha bins the start of reads, the peak defined by Piranha may not actually include the seed match location, as is observed with the Cab39 seed match.The failure of a peak to cover a seed match may prevent a microRNA from being assigned to a peak and subsequently interfere with downstream analysis.Also, the peak calling of Piranha is greatly influenced by bin size.A bin size of 32 (default parameter) is not able to resolve many individual peaks.In Ywhaz mRNA (panel B) YODEL detects three peaks around the miR-451a seed match.One Piranha setting (a=0.98,b=16) identified the three individual peaks, but compromised detection of smaller peaks upstream of the predicted miR-451a binding site.Therefore for our miR144/451 -/-data set, YODEL was superior to Piranha in defining HITS-CLIP peaks.
We have also tested the YODEL software on a publically available HITS-CLIP data set.HITS-CLIP reads from mouse neocortex Argonaute immunoprecipitations (Chi et al. 2009) was retrieved from http://ago.rockefeller.edu/rawdata.php.The reads were pre-processed and run through YODEL, as described in the Supplementary material: Analysis of HITS-CLIP data from Chi et al., 2009.We examined the first four genes identified by Table 1 in Chi et al.
as potential target of microRNAs.Figure 2 shows a comparison of peak calling in the 3' UTR of the Plod3 gene.Again, it is seen that binning starts of reads will cause potential microRNA seeds to be missed.Figure 3 shows the 3' UTR of the Cd164 gene.See Supplementary Figure 1 for browser images of Ctdsp1 and Itgb1.

Conclusions
We have designed a new peak-caller, termed YODEL, for analysis of RNA-seq data generated by HITS-CLIP-type experiments.
Advantages of YODEL compared to Piranha, a program commonly used for the same purpose, include standardization of peak calls for comparative analysis of multiple samples, improved resolution of peak boundaries, and more consistent overlap between peak calls and microRNA seed matches.

License: GPLv3
Input BED files for peak calling with the miR-144/451 KO HITS-CLIP can be found in the Cab39_Ywhaz.allSamples.fullCollapsed.clusters.bedfile within the archived source code listed above.This file contains the clustered BED file used for YODEL input (only around Cab39 and Ywhaz).
Sample HITS-CLIP data from the 130kD band from mouse neocortex samples was downloaded from http://ago.rockefeller.edu/rawdata.php(130kD Brain A-E samples).
The pre-processing BASH pipeline used to generate a clustered BED files, as well as some accessory scripts, can be found in the Ch2009 directory within the archived source code link listed above.

Open Peer Review
Current Peer Review Status:

Neelanjan Mukherjee
Berlin Institute for Medical Systems Biology (BIMSB), Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany The authors present an alternative peak caller for HITS-CLIP data.While the idea is interesting and the examples are compelling, there is not sufficient analysis presented to determine the utility of YODEL.

Major:
The bench-marking is insufficient to evaluate the difference between YODEL and PIRANHA.The primary figures only have single examples.There needs to be a transcriptome-wide analysis to evaluate the performance.
The analysis should include some type of specificity/sensitivity analysis.It would be instructive to design "true" positives and "false" positive.Generally the "TRUE" positives could be thought of as miRNAs that are expressed in that system vs those that are not.Additionally, one can design 'decoy' seeds that are di-nucleotide shuffled seeds if the expressed miRNAs (that don't match the expressed miRNA seeds) and evaluate the number of counts relative to the actual expressed miRNA seed.
In the case of the KO, one could examine if the peaks called from WT data that contain seed matches to the KO miRNAs change in coverage (WT vs KO), particularly relative to compared to the peaks that contain seed matches to the expressed non-KO miRNAs.Comparing YODEL and PIRANHA in this analysis would be quite instructive. Minor: In the intro the authors describe three properties of YODEL.The first was: "Incorporate strand specificity (Piranha does this, but many other CHIP-SEQ peak callers do not)" I think this should be removed.Any peak finder for RNA interactions needs to be strand-specific.I do not know why CHIP-SEQ peak callers are even mentioned, unless the authors believe this do not know why CHIP-SEQ peak callers are even mentioned, unless the authors believe this could be beneficial for CHIP-seq data.If so, they would need to compare to common CHIP-seq peak finders, though that would be a distraction in my opinion.No competing interests were disclosed.

Competing Interests:
I have read this submission.I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
The authors developed a new tool called YODEL to identify peaks from Ago2 HITS-CLIP data using a novel approach based on the identification of the peak summit of reads cluster and estimation of the size of the peaks based on read coverage.The work is sound and interesting, however we have some concerns about the benchmarking of this new tool.Our major questions mainly refer to Bottini et al. (2017) .That should be cited.
Major suggestions/concerns: Benchmarking: The authors showed just for few selected targets that YODEL identifies peaks that include miRNAs seed matches, whereas Piranha did not.This should be shown at the genome-wide scale.
Inclusion of seed match sequences do not assure a better performance.In fact, seed per se matches can be included just by chance due to an overestimation of the peaks size.To rule out the possibility, the authors should show the distribution of the peaks length by the two programs on both entire datasets and calculate the correlation between peak length and number of seed matches.
For Ago2 CLIP-seq peak calling programs it is expected that miRNA-binding sites and cross-linked-dependent mutations position at the peak centers.How does YODEL perform compare to Piranha?
The authors defined two thresholds to assess the peak boundaries, namely 25% and 50% of the coverage of the highest point: how these two threshold have been assessed?Since the peaks boundaries is a primary concern for the authors, it should be explained and supported by analysis how they chose these two percentages.Therefore, we recommend to benchmark the thresholds by adding intermediate percentages and calculate the sensitivity toward seed match identified at the genome-wide scale.

Other major points:
The introduction needs improvement: a brief overview of the software/pipelines available to perform CLIP-seq data analysis and cite some reviews that explain all the steps of the data analysis, including Bottini et al. (2017) and Uhl et al. (2017) .
It should be clearly stated whether YODEL is able to find peaks enriched when comparing two conditions (differential CLIP) and/or only one condition.
In the Methods section all the parameters should be clearly stated and explained in the main text and not in the Supplementary Information.Furthermore, it should be added which kind of input data (not only the file format) are needed to run YODEL (replicates, IgG, control/KO …).
Finally, it should be made clear whether YODEL can be applied to analyze only Ago2 HITS-CLIP data or also other RNA-binding proteins and why.
Minor points: Supplemental figure 2 is missing.
About the first sentence on page 2 "A peak caller that could accurately define…." is odd.A peak caller tool is always needed to identify peaks from CLIP-seq experiments, and not just for the specific case mentioned by the authors.
The sentence on page 2 "Where dipTolerance (dt) and peakDipBuffer (pbd) are input

Figure 1 .
Figure 1.YODEL and Piranha peak calling comparisons in the Cab39 and Ywhaz genes.IGV browser (Thorvaldsdóttir et al., 2013) images showing YODEL output for HITS-CLIP data analyzing the 3' untranslated regions of Cab39 (A) and Ywhaz (B) mRNAs in wildtype and miR-451a/miR-144 -/-mouse fetal livers at embryonic day 14.5.The coverage tracks (WT in blue, KO in red, and combined coverage in magenta) show combined sequencing reads from three animals of each genotype mapped to the mouse mm10 genome.The seedMatches track shows microRNA seed matches for miR-451a, miR-144-3p and the three most abundant erythroid microRNAs besides miR-451a (miR-16-5p, miR-486a-5p and miR-122-5p) (Paralkar VR, Palmer LE, Xu P, Lechauve C, Zhao G, Yao Y, Luan J, Wu G, Vourekas A, Mourelatos Z, Scheutz JD and Weiss MJ; unpublished results).Average peak counts per sample are shown for wild-type and miR-451a/ miR-144 -/-erythroblasts using the YODEL more sensitive parameters (see below).The lower panels show YODEL (blue) and Piranha (green) peak boundaries with the indicated parameter settings.For YODEL, full boundaries are shown by thin lines and 25% boundaries by thick lines.YODEL less sensitive parameter settings: dipTolerance=2, peakDipBuffer=2; more sensitive settings: dipTolerance=1.5,peakDipBuffer=1; Both parameter settings: binSize=16, minPeakHeight=5.Note that the ability of Piranha to resolve different peaks representing distinct microRNA binding sites is highly dependent on parameter settings.Piranha parameters: a=background, threshold b=binSize.The default for Piranha is a=0.99 and b=32.

Figure 3 .
Figure 3. YODEL and Piranha peak calling comparisons in the Cd164 gene.IGV browser image of a portion of the Cd164 3' UTR.Mouse neocortex HITS-CLIP read coverage is shown along with peak calls from YODEL (lower panel blue bars) and Piranha (lower panel green bars).Relevant microRNA seeds (For seeds mapped see Supplementary material: Seeds mapped for brain neocortex data) are shown in black.For YODEL, full boundaries are shown by thin lines and 25% boundaries by thick lines.YODEL parameters: dipTolerance=1.5,peakDipBuffer=1, binSize=16, minPeakHeight=5.Piranha parameters: a=background, threshold b=binSize.The default for Piranha is a=0.99 and b=32.

Figure 2 .
Figure 2. YODEL and Piranha peak calling comparisons in the Plod3 gene.IGV browser image of a portion of the Plod3 3' UTR.Mouse neocortex HITS-CLIP read coverage is shown along with peak calls from YODEL (lower panel blue bars) and Piranha (lower panel green bars).Relevant microRNA seeds (for seeds mapped see Supplementary material: Seeds mapped for brain neocortex data) are shown in black.For YODEL, full boundaries are shown by thin lines and 25% boundaries by thick lines.YODEL parameters: dipTolerance=1.5,peakDipBuffer=1, binSize=16, minPeakHeight=5.Piranha parameters: a=background, threshold b=binSize.The default for Piranha is a=0.99 and b=32.
/doi.org/10.5256/f1000research.12817.r24815© 2017 Mukherjee N.This is an open access peer review report distributed under the terms of the Creative Commons , which permits unrestricted use, distribution, and reproduction in any medium, provided the original Attribution Licence work is properly cited.
the rationale for developing the new software tool clearly explained?Partly Is the description of the software tool technically sound?Yes Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?Yes Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?No Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?No

Table 1 . Descriptions of output files.
prefix.binCountsRC25.txtRead counts per peak defined by 25% peak prefix.binCountsRC50.txtRead counts per peak defined by 50% peak prefix.binCountsPH.txtRead counts covering position of maximum peak height for each individual sample