A cloud-based learning environment for comparing RNA-seq aligners

The rapid rise of high-throughput, data intensive experimental techniques has thrust many biologists into the role of data analyst – a role many biologists feel ill equipped to fill. Novices often struggle to find the resources and expertise they need to analyze their experimental results in a wet-lab environment. To fill this need, we developed an educational resource as part of a National Center for Biotechnology Information (NCBI) hackathon. Using RNA-seq as a model, our tutorial guides new users through the steps of data analysis, while placing an emphasis on understanding the motivation behind choices made in the process. To advance the goal of providing a deeper understanding of the analysis process, we developed a new tool, bamDiff. bamDiff allows users to compare the performance of multiple RNA-seq aligners, allowing users to select the most appropriate aligner for the data in question and experimental end-goal. Our tutorial is accessible via a GitHub wiki, with associated data and software provided on an Amazon Machine Image (AMI), which can be completed at no cost to the user through the Amazon Educate Program. Following the hackathon, our tutorial was integrated into the October 2015 offering of NCBI NOW (Next Generation Sequencing (NGS) Online Workshop) a free online experience targeting individuals new to NGS analysis.

With the rise of RNA-seq for exploring biological hypotheses has come an increase in the number of algorithms for aligning RNA-sequences to the genome. The burden of selecting and properly using these algorithms often falls on biologists. However, many biologists do not have training or experience in the skills needed to select the proper tools. To provide an introduction for this audience, the Educational Environment Team sought to develop an interactive learning environment where students with a novice's background in Unix could follow a series of alignment pipelines step-by-step.
The team identified two major goals that could be undertaken within the scope of the hackathon. The first was to produce a user-friendly tutorial that would walk novice computational biologists through the process of aligning RNA-seq data to the genome. The second was to supplement the tutorial with novel methods to compare the results of different RNA-alignment mappers.
Implementing this tutorial on an Amazon Machine Image (AMI) -a virtual server in the cloud, pre-programmed with all the necessary packages --allows students to initially bypass the intimidating task of installing software and dependencies, and immediately start performing alignments using a panel of different algorithms. All tasks in the tutorial and the AMI are designed to fall within the $35 per student provided for free through the Amazon Educate Program (http://aws.amazon.com/education/awseducate). By allowing students to run these alignments in succession, we hope to naturally showcase how these aligners vary in their outputs. Here we introduce both a read-based and position-based approach for identifying and evaluating regions of differential performance across the genome.

Team composition
The team was composed of five members ranging in experience from graduate student to research faculty member. Each member of the team shared the experience of interfacing with noncomputational biologists during their daily work and sought to provide a welcoming introduction to mapping tools. All members contributed to the goals of the team, with each selecting tasks to tackle according to their strengths.

Materials and methods
As part of the tutorial, two team members dedicated their time to developing novel methods for comparing the results of RNA-alignment. After each of the town-hall style meetings conducted with the larger hackathon group, we received recommendations from members of other teams for software packages to help streamline the proposed analysis.

Results
The tutorial was constructed as a GitHub Pages wiki --each new wiki page represents a new step in the workflow 1 . We guide students through registration with the Amazon Educate Program, obtainment of data, alignment, and comparison of the performance of four different aligners. The comparison of aligners is achieved with our own custom written program, bamDiff, in concert with the R (v3.2.1) packages edgeR (v3.10) and csaw (v1.2.1).
The example dataset is sample NA12878 from the Genome in a Bottle Consortium, an extensively curated human-standard 2 . To give a flavor for the alignment workflow without burdening the learner with expensive computational requirements, we limited the genome for alignment to chromosome 20 of the latest human genome reference from RefSeq, GRCh38 (GCF_000001405.30). For the tutorial, we describe the use of four popular aligners: BWA v0.7.12-r1039 3 , HISAT v0.1.6-beta 4 , STAR v2.4.0j 5 , and BLASTmapper (in preparation). For each, we provide the commands needed to construct an index and align the given data, outlining the expected screen output, files to be created, and time required for each task.
The R package csaw provides an elegant framework for identifying regions of differential expression between RNA-seq experiments, and in this case was extended to identifying regions of differential alignment between mappers 6 . To implement read counting, we used csaw to bin reads into 1KB windows, then filtered windows by count size 7 . Only bins with greater than 10 reads were kept. Windows were then filtered by exons, and only windows/bins containing exons were kept for further analysis (see Figure 1c). edgeR was used to identify windows/regions/bins with significant differences between the four mappers (see Figure 1d) 8 .
In order to systematically compare the performance of the various aligners, we wrote a new program called bamDiff. This Python script takes the output from csaw and edgeR as a CSV file as well as the outputs from each aligner in BAM format. bamDiff will report simple summaries for each BAM file, such as the total number of reads, overall number of alignments, proportion of reported reads that were unmapped by the aligner, as well as proportion of aligned reads that were mapped only once (uniquely).
The real strength of bamDiff comes in its ability to go beyond simple summarization to direct comparisons between BAM files. Internally bamDiff uses SAMtools 9 to rapidly extract only the reads mapping to a region of interest identified by csaw from one BAM file. These reads are then checked against the other BAM files to see whether they are mapped at all, and if so, whether they are mapped to the same region in the genome as in the first BAM file (see Figure 1e). If they map to a conflicting region outside the region of interest, bamDiff will report the top ten regions reads are mapping to, by agglomerating reads mapping within 1kb of each other. These results are reported as text tables. Example usage and output can be viewed at the associated page in our tutorial: https://github.com/NCBI-Hackathons/RNA_mapping/wiki/7.-Compare-alignments-with-bamDiff.
Following the hackathon, our tutorial was integrated into NCBI NOW (National Center for Biotechnology Information Next Generation Sequencing (NGS) Online Workshop) a free online experience targeting individuals new to NGS analysis. The first offering of the workshop occurred October 13-23, 2015. Our tutorial was offered as a "take-home" exercise following the sixth lecture, which focused on the analysis of RNA-seq data.

Conclusions and next steps
We achieved two distinct goals during the hackathon: first, the development of a novel method for comparing the results of RNAalignment and second, the creation of a tutorial to guide users through not only undertaking, but also understanding, RNA-seq alignment. Notably, we do not over-simplify the work of data analysis by simply selecting a single aligner for use in our tutorial. Instead, by encouraging the comparison of many different aligners we draw attention to the decision-making process intrinsic to any bioinformatic work. For our user, this is the selection of an RNAseq aligner from many frustratingly similar options.
However, both the novel software bamDiff and the tutorial can be further improved.
Currently, bamDiff is best suited to analyzing a small region of the genome/BAM file, rather than undertaking the entire genome/BAM file. Additionally, a graphical summary would make outputs easier to understand. In the tutorial we assume prior knowledge of the contents, purpose, and format of FASTQ, SAM, and BAM files as we assumed that our tutorial would be used in conjunction with a class or workshop that covers these introductory topics, such as NCBI NOW. The tutorial could be expanded in a number of ways, such as generalizing to pair-end reads or alignments guided by an annotation. Finally, there may be room for improvements in usability, such as streamlining use on platforms such as Google Cloud, Microsoft Azure, or iPlant/CyVerse.

Author contributions
All of the authors participated in designing the study, carrying out the research, and preparing the manuscript. All authors were involved in the revision of the draft manuscript and have agreed to the final content.

Competing interests
No competing interests were disclosed.

Grant information
The work on this project by Ben Busby was supported by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM), and NCBI. Elizabeth Baskin was supported by the Intramural Research Program of the National Institute of Arthritis and Musculoskeletal and Skin Diseases (Z01-AR041198). Peter DeFord was supported by NIH Training Grant GM007231.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 1.

2.
3. The manuscript summarizes two achievements from NCBI's hackathon of 2015. First is a tutorial to introduce RNAseq mapping. Second is bamDiff, a program for comparing different RNA-seq aligner mapping. This reviewer believes the current state of the tutorial is on the thin side and could benefit from additional expansion. The functionality of bamDiff program is intriguing but based on the current state of the program the reviewer feels the program should be expanded to incorporate other QC metrics. Overall, this work makes great stride for guiding biologist to their first hands-on-experience on NGS.

Major points:
The tutorial provides a step by step tutorial from downloading to mapping and some mapping evaluation. The tutorial can be useful to users that find difficulty working on a Unix environment. In the current form, the author introduces basic commands for performing mapping; however, the author should caution and educate the reader that additional vetting of the raw RNAseq mapping is necessary. While mapping evaluation is important, that is just one of the many QC metric necessary in RNAseq data that contribute to the decision making. Here is an incomplete list of RNAseq related issues that should be included in the tutorial: Whether the RNAseq sample require trimming of low quality or adaptor sequences. The different parameter tuning that increases the coverage of these hard to map region (i.e.

STAR 2-pass).
At what sequencing depth is the RNAseq sample is deep enough for expression analysis, gene fusion detection, splicing detection, and whether additional sequencing is necessary.
(The reviewer recognize that the above comments might not be suitable within the tutorial, but the author should make some attempt to inform the user of these caveats.) Regarding the bam comparison program, the author might want to automatically include output of coverage Bed files that can be displayed in UCSC genome browser or IGV. In the tutorial, the author included some examples of coverage differences; however, the discussions from the tutorial appear to be incomplete. The author failed to discuss the reason that contributes to these differences in coverage such as the parameters or algorithm design. Another discussion point differences in coverage such as the parameters or algorithm design. Another discussion point should be to examine where the same read is being mapped to in different program? Here I present a couple factors that might impact the mapping the author might want to consider: GC content Highly repetitive region Paralogous gene bodies such as some ribosome genes and mitochondrial genes or histone genes might have varying coverage in different programs.
Within the tutorial, the author mentions "is every mapper allowing reads to map to the intronic region? After all, this is an RNA-seq experiment --there should be minimal intronic genetic material." While the statement is largely true; however, this reviewer believe the author should also mention that there are different type of RNA-seq library specifically TruSeq Stranded Total RNA prep would include contain intronic reads. Poly-A enriched RNAseq experiment could also contain intronic reads for intron retention events.
Since the author appear to discuss splicing region in the tutorial, a more detailed analysis on how different programs deals with the splicing region could be of tremendous interest to certain readers. The author might also want to consider the impact of RNA-seq library protocols on the variation of splice site mapping.
A number of tools have been developed for assessing RNAseq alignment quality. A review tools compared to bamDiff might make a stronger case for the novelty of the program.
A comprehensive table in the tutorial or within the manuscript summarizing all the advantages and disadvantages of different mapping programs will definitely enhance the manuscript.
Minor points: Perhaps include other popular aligner such as Tophat as an option.
While it is great that the pipeline being on the amazon cloud allows users to bypss installation. Perhaps the author could include installation procedure or reference to individual software if the user wants to install the program on their private server.
Make sure all program provide sufficient commenting. Most of the shell script lack sufficient commenting for what the program will do and why the author chose those parameters.
Sentence that might benefit from rephrasing "With the rise of RNA-seq for exploring biological hypotheses has come an increase in the number of algorithms for aligning RNA-sequences to the genome." Particularly the choice of "has come" is a bit awkward of a wording.
Another sentence that might benefit from rephrasing "If they map to a conflicting region outside the region of interest, bamDiff will report the top ten regions reads are mapping to, by agglomerating reads mapping within 1kb of each other." For certain details on RNAseq analysis, the author want to refer the reader to the following paper for additional details on RNAseq analysis. "https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8" Python script that produces a text summary of a few metrics extracted from RNA-seq BAMs from multiple aligners. Similar to the previous point, additional development would be needed before the results of this tool would be readily useful to most prospective users.
Considerable resources/ tools for performing quality assessment of BAM files (including RNA-seq alignments) already exist. The authors could provide an overview of these, either in the paper or as an additional section in the tutorial Wiki.
In the tutorial. The section for each aligner considered (BWA, HISAT, STAR, and blastmapper) should provide a basic description of the aligner, references, link to the aligner documentation, etc.
The alignment comparisons focus on the number of reads aligned, and how aligners differ in the alignment of particular reads, or reads aligning to particular regions. What other ways might the aligners be different? For example, in their ability to correctly map RNA-seq reads across exon-exon junctions, align reads containing single base sequencing errors or polymorphisms, correctly handle reads containing small insertions or deletions relative to the reference genome, etc.
Minor points: Perhaps the abstract should include a URL for the tutorial mentioned in the title.
The authors have created an AMI to "allow students to initially bypass the intimidating task of installing software and dependencies". This is reasonable, but perhaps the installation task could be provided (with detailed instructions) as an optional exercise.
On a related note, it would be ideal to have detailed documentation on how the AMI (ami-3590de50) was configured (including all dependencies that were installed).
In addition, this tutorial could include a "resources/pre-reading" section that referred the reader to additional helpful materials on RNA-seq sequencing and analysis principles (in addition to the hands on pre-requisites already listed in section 1).
More details on the example RNA-seq data set used in the hands on exercises would be helpful.
Are there similar efforts for comparison of DNA aligners that could be referenced by this tutorial?
Other RNA-seq educational pieces that cover many topics relevant to new NGS users (with less focus on aligner comparison specifically) could be cited by this paper (e.g. Griffith M , et al. www.rnaseq.wiki).
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias You can publish traditional articles, null/negative results, case reports, data notes and more The peer review process is transparent and collaborative Your article is indexed in PubMed after passing peer review Dedicated customer support at every stage For pre-submission enquiries, contact research@f1000.com