recount workflow: Accessing over 70,000 human RNA-seq samples with Bioconductor

The recount2 resource is composed of over 70,000 uniformly processed human RNA-seq samples spanning TCGA and SRA, including GTEx. The processed data can be accessed via the recount2 website and the recount Bioconductor package. This workflow explains in detail how to use the recount package and how to integrate it with other Bioconductor packages for several analyses that can be carried out with the recount2 resource. In particular, we describe how the coverage count matrices were computed in recount2 as well as different ways of obtaining public metadata, which can facilitate downstream analyses. Step-by-step directions show how to do a gene-level differential expression analysis, visualize base-level genome coverage data, and perform an analysis at multiple feature levels. This workflow thus provides further information to understand the data in recount2 and a compendium of R code to use the data.


Introduction
RNA sequencing (RNA-seq) is now the most widely used high-throughput assay for measuring gene expression. In a typical RNA-seq experiment, several million reads are sequenced per sample. The reads are often aligned to the reference genome using a splice-aware aligner to identify where reads originated. Resulting alignment files are then used to compute count matrices for several analyses such as identifying differentially expressed genes. The Bioconductor project 1 has many contributed packages that specialize in analyzing this type of data and previous workflows have explained how to use them 2-4 . Initial steps are typically focused on generating the count matrices. Some pre-computed matrices have been made available via the ReCount project 5 or Bioconductor Experiment data packages such as the airway dataset 6 . The pre-computed count matrices in ReCount have been useful to RNA-seq methods developers and to researchers seeking to avoid the computationally intensive process of creating these matrices. In the years since ReCount was published, hundreds of new RNA-seq projects have been carried out, and researchers have shared the data publicly.
We recently uniformly processed over 70,000 publicly available human RNA-seq samples, and made the data available via the recount2 resource at jhubiostatistics.shinyapps.io/recount/ 7 . Samples in recount2 are grouped by project (over 2,000) originating from the Sequence Read Archive, the Genotype-Tissue Expression study (GTEx) and the Cancer Genome Atlas (TCGA). The processed data can be accessed via the recount Bioconductor package available at bioconductor.org/packages/recount. Together, recount2 and the recount Bioconductor package should be considered a successor to ReCount.
Due to space constraints, the recount2 publication 7 did not cover how to use the recount package and other useful information for carrying out analyses with recount2 data. We describe how the count matrices in recount2 were generated. We also review the R code necessary for using the recount2 data, whose details are important because some of this code involves multiple Bioconductor packages and changing default options. We further show: a) how to augment the metadata that comes with datasets with metadata learned from natural language processing of associated papers as well as from expression data, b) how to perform differential expression analyses, and c) how to visualize the base-pair data available from recount2.

Analysis of RNA-seq data available at recount2

recount2 overview
The recount2 resource provides expression data summarized at different feature levels to enable novel cross-study analyses. Generally when investigators use the term expression, they think about gene expression. But more information can be extracted from RNA-seq data. Once RNA-seq reads have been aligned to the reference genome it is possible to determine the number of aligned reads overlapping each base-pair, resulting in the genome base-pair coverage curve as shown in Figure 1. In the example shown in Figure 1, most of the reads overlap known exons from a gene. Those reads can be used to compute a count matrix at the exon or gene feature levels. Some reads span exon-exon junctions (jx) and while most match the annotation, some do not (jx 3 and 4). An exon-exon junction count matrix can be used to identify differentially expressed junctions, which can show which isoforms are differentially expressed given sufficient coverage. For example, junctions 2 and 5 are unique to isoform 2, while junction 6 is unique to isoform 1. The genome base-pair coverage data can be used with derfinder 8 to identify expressed regions; some of these could be unannotated exons, which together with the exon-exon junction data could help establish new isoforms. recount2 provides gene, exon, and exon-exon junction count matrices both in text format and RangedSummarizedExperiment objects (rse) 9 as shown in Figure 2. These rse objects provide information about the expression features (for example gene IDs) and the samples. In this workflow we will explain how to add metadata to the rse objects in recount2 in order to ask biological questions. recount2 also provides coverage data in the form of bigWig files. All four features can be accessed with the recount Bioconductor package 7. recount also allows sending queries to snaptron 10 to search for specific exon-exon junctions.
Figure 1. Overview of the data available in recount2. Reads (pink boxes) aligned to the reference genome can be used to compute a base-pair coverage curve and identify exon-exon junctions (split reads). Gene and exon count matrices are generated using annotation information providing the gene (green boxes) and exon (blue boxes) coordinates together with the base-level coverage curve. The reads spanning exon-exon junctions (jx) are used to compute a third count matrix that might include unannotated junctions (jx 3 and 4). Without using annotation information, expressed regions (orange box) can be determined from the base-level coverage curve to then construct data-driven count matrices.

Packages used in the workflow
In this workflow we will use several Bioconductor packages. To reproduce the entirety of this workflow, install the packages using the following code after installing R 3.4.x from CRAN in order to use Bioconductor version 3.5 or newer.
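As a minimal sketch of the installation step, the code below lists only the packages named in this workflow and checks which of them are missing. It assumes a current R release where BiocManager is the supported installer (the R 3.4.x and Bioconductor 3.5 setup described above predates it); `workflow_pkgs` and `missing_packages()` are hypothetical names introduced here for illustration, and the package list may be incomplete.

```r
## Packages named in this workflow (assumed list; extend as needed)
workflow_pkgs <- c(
    "recount", "GenomicRanges", "limma", "edgeR", "DESeq2",
    "regionReport", "ideal", "clusterProfiler", "gplots",
    "derfinder", "rtracklayer"
)

## Hypothetical helper: return the packages that are not installed yet.
## system.file() returns "" when a package cannot be found.
missing_packages <- function(pkgs) {
    paths <- vapply(pkgs, function(p) system.file(package = p), character(1))
    pkgs[!nzchar(paths)]
}

## Install only what is missing (uncomment to run; needs internet access)
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#     install.packages("BiocManager")
# }
# BiocManager::install(missing_packages(workflow_pkgs))
```

Checking first avoids reinstalling packages that are already available, which can matter on shared systems.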

Coverage counts provided by recount2
The most accessible features are the gene, exon and exon-exon junction count matrices. This section explains them in greater detail. Figure 3 shows 16 RNA-seq reads, each 3 base-pairs long, and a reference genome.
Reads in the recount2 resource were aligned with the splice-aware Rail-RNA aligner 11. Figure 4 shows the reads aligned to the reference genome. Some of the reads are split as they span an exon-exon junction. Two of the reads were soft clipped, meaning that just a portion of the reads aligned (top left in purple).
In order to compute the gene and exon count matrices we first have to process the annotation, which for recount2 is Gencode v25 (CHR regions) with hg38 coordinates, although recount can generate count matrices for other annotations using hg38 coordinates. Figure 5 shows two isoforms for a gene composed of 3 different exons.
Figure 4. Splice-aware RNA-seq aligners such as Rail-RNA are able to find the coordinates to which the reads map, even if they span exon-exon junctions (connected boxes). Rail-RNA soft clips some reads (purple boxes with rough edges) such that a portion of these reads align to the reference genome.
The coverage curve is at base-pair resolution so if we are interested in gene counts we have to be careful not to double count base-pairs 1 through 5 that are shared by exons 1 and 3 (Figure 5). Using the function disjoin() from GenomicRanges 12 we identified the distinct exonic sequences (disjoint exons). The following code defines the exon coordinates that match Figure 5 and the resulting disjoint exons for our example gene. The resulting disjoint exons are shown in Figure 6.
library("GenomicRanges")
## Exon coordinates matching Figure 5
exons <- GRanges("seq", IRanges(start = c(1, 1, 13), end = c(5, 8, 15)))
exons
## Disjoint exons
disjoin(exons)
Now that we have disjoint exons, we can compute the base-pair coverage for each of them as shown in Figure 7. That is, for each base-pair that corresponds to exonic sequence, we compute the number of reads overlapping that given base-pair. For example, the first base-pair is covered by 3 different reads and it does not matter whether the reads themselves were soft clipped. Not all reads or bases of a read contribute information to this step, as some do not overlap known exonic sequence (light pink in Figure 7).
With base-pair coverage for the exonic sequences computed, the coverage count for each distinct exon is simply the sum of the base-pair coverage for each base in a given distinct exon. For example, the coverage count for disjoint exon 2 is 2 + 2 + 3 = 7 as shown in Figure 8. The gene coverage count is then $\sum_{i=1}^{n} \text{coverage}_i$, where n is the number of exonic base-pairs for the gene, and is equal to the sum of the coverage counts for its disjoint exons as shown in Figure 8.
Figure 7. Base-pair coverage counting for exonic base-pairs. At each exonic base-pair we compute the number of reads overlapping that given base-pair. The first base (orange arrow) has 3 reads overlapping that base-pair. Base-pair 11 has a coverage of 3 but does not overlap known exonic sequence, so that information is not used for the gene and exon count matrices (grey arrow). If a read partially overlaps exonic sequence, only the portion that overlaps is used in the computation (see right most read).
For the exons, recount2 provides the disjoint exons coverage count matrix. It is possible to reconstruct the exon coverage count matrix by summing the coverage count for the disjoint exons that compose each exon. For example, the coverage count for exon 1 would be the sum of the coverage counts for disjoint exons 1 and 2, that is 19 + 7 = 26. Some methods might assume that double counting of the shared base-pairs was performed while others assume or recommend the opposite.
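The counting rules above can be sketched in base R on a toy coverage vector. Only the coverage of the first base (3), the total for positions 1 to 5 (19), positions 6 to 8 (2, 2, 3) and position 11 (3) are taken from the text; every other value, and all object names, are assumptions made for illustration.

```r
## Hypothetical per-base coverage for positions 1 to 15 of the toy gene.
## Values not quoted in the text are made up.
coverage <- c(3, 4, 4, 4, 4, 2, 2, 3, 0, 1, 3, 1, 2, 2, 1)

## Disjoint exons obtained from the exons spanning 1-5, 1-8 and 13-15
disjoint_exons <- data.frame(start = c(1, 6, 13), end = c(5, 8, 15))

## Coverage count per disjoint exon: sum of the per-base coverage
disjoint_counts <- mapply(
    function(start, end) sum(coverage[start:end]),
    disjoint_exons$start, disjoint_exons$end
)
disjoint_counts

## Gene coverage count: sum over exonic bases only, so position 11
## (coverage 3, not exonic) does not contribute
gene_count <- sum(disjoint_counts)

## Reconstruct the count for the annotated exon spanning positions 1-8
## (exon 1 in the text) from disjoint exons 1 and 2
exon_1_to_8 <- sum(disjoint_counts[1:2])
```

With these toy values the first two disjoint exons reproduce the 19 and 7 quoted above, and their sum gives the 26 reported for exon 1.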
## Check that the number of reads is less than or equal to 40 million
## after scaling.
library("recount")
rse_scaled <- scale_counts(rse_gene_SRP009615, round = FALSE)
summary(colSums(assays(rse_scaled)$counts)) / 1e6

Enriching the annotation
Data in recount2 can be used for annotation-agnostic analyses and enriching the known annotation. Just like exon and gene coverage count matrices, recount2 provides exon-exon junction count matrices. These matrices can be used to identify new isoforms (Figure 10) or identify differentially expressed isoforms. For example, exon-exon junctions 2, 5 and 6 in Figure 1 are only present in one annotated isoform. Snaptron 10 allows programmatic and high-level queries of the exon-exon junction information and its graphical user interface is especially useful for visualizing this data. Inside R, the recount function snaptron_query() can be used for searching specific exon-exon junctions in recount2.
The base-pair coverage data from recount2 can be used together with derfinder 8 to identify expressed regions of the genome, providing another annotation-agnostic analysis of the expression data. Using the function expressed_regions() we can identify regions of expression based on a given data set in recount2. These regions might overlap known exons but can also provide information about intron retention events (Figure 11), improve detection of exon boundaries (Figure 12), and help identify new exons (Figure 1) or expressed sequences in intergenic regions. Using coverage_matrix() we can compute a coverage matrix based on the expressed regions or another set of genomic intervals. The resulting matrix can then be used for a DE analysis, just like the exon, gene and exon-exon junction matrices.

Gene level analysis
Having reviewed how the coverage counts in recount2 were produced, we can now do a DE analysis. We will use data from 72 individuals spanning the human lifespan, split into 6 age groups with SRA accession SRP045638 13. The function download_study() requires an SRA accession, which can be found using abstract_search(). download_study() can then be used to download the gene coverage count data as well as other expression features. The files are saved in a directory named after the SRA accession, in this case SRP045638.
## Download the data if it is not there
if(!file.exists(file.path("SRP045638", "rse_gene.Rdata"))) {
    download_study("SRP045638", type = "rse-gene")
}
## 2017-07-30 10:11:16 downloading file rse_gene.Rdata to SRP045638
## Check that the file was downloaded
file.exists(file.path("SRP045638", "rse_gene.Rdata"))
## [1] TRUE
## Load the data
load(file.path("SRP045638", "rse_gene.Rdata"))
The coverage count matrices are provided as RangedSummarizedExperiment objects (rse) 9. These objects store information at the feature level, the samples and the actual count matrix as shown in Figure 1 of Love et al., 2016 3. Figure 2 shows the actual rse objects provided by recount2 and how to access the different portions of the data. Using a unique sample ID such as the SRA Run ID it is possible to expand the sample metadata. This can be done using the predicted phenotype provided by add_predictions() 14, pulling information from GEO via find_geo() and geo_characteristics(), or with custom code.

Metadata
Using the colData() function we can access sample metadata. More information on these metadata is provided in the Supplementary material of the recount2 paper 7 , and we provide a brief review here. The rse objects for SRA data sets include 21 columns with mostly technical information. The GTEx and TCGA rse objects include additional metadata as available from the raw sources. In particular, we compiled metadata for GTEx using the v6 phenotype information available at gtexportal.org, and we put together a large table of TCGA case and sample information by combining information accumulated across Seven Bridges' Cancer Genomics Cloud and TCGAbiolinks 15 .
## SHARQ tissue predictions: not present for all studies
head(colData(rse_gene)$sharq_beta_tissue)
## [1] "blood" "blood" "blood" "blood" "blood" "blood"
For some data sets we were able to find the GEO accession IDs, which we then used to create the title and characteristics variables. If present, the characteristics information can be used to create additional metadata variables by parsing the CharacterList in which it is stored. Since the input is free text, sometimes more than one type of wording is used to describe the same information, meaning that we might have to process that information in order to build a more convenient variable, such as a factor vector.
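To illustrate turning free-text characteristics into a factor, here is a small base R sketch. The input strings, the age break points and the group labels are all hypothetical; the real characteristics for a study may be worded differently and would need their own parsing rules.

```r
## Hypothetical free-text characteristics, one string per sample, with
## inconsistent wording for the same field
characteristics <- c("age: 67.78", "Age: 40.42", "age (years): 8.5",
    "age: -0.38", "AGE: 0.5", "age: 15")

## Drop everything up to the colon and keep the numeric value
## (negative values stand for prenatal samples in this toy example)
age <- as.numeric(sub("^[^:]*:[[:space:]]*", "", characteristics))

## Bin into six ordered groups with a single cut() call instead of nested
## ifelse() statements; the break points are illustrative assumptions
age_group <- cut(age,
    breaks = c(-Inf, 0, 1, 10, 20, 50, Inf),
    labels = c("prenatal", "infant", "child", "teen", "adult", "late life"),
    right = FALSE
)
table(age_group)
```

Using cut() keeps the group definitions in one place, which makes it easier to check the boundaries than a chain of conditionals.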

## Scale counts
rse_gene_scaled <- scale_counts(rse_gene)
## To highlight that we scaled the counts
rm(rse_gene)
Having scaled the counts, we then filter out genes that are lowly expressed and extract the count matrix.
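The scaling and filtering steps can be sketched on a toy matrix in base R. This assumes scaling by the total coverage (AUC) to a target of 40 million reads, which is our understanding of the default behaviour of scale_counts(); the matrix, the AUC values and the 0.5 mean-count cutoff are made-up illustrations rather than values from the study.

```r
## Toy coverage counts: 3 genes (rows) by 2 samples (columns)
coverage_counts <- matrix(c(5e6, 2e4, 10, 1e7, 6e4, 0),
    nrow = 3, dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

## Hypothetical total coverage (area under the coverage curve) per sample
auc <- c(s1 = 2e9, s2 = 5e9)

## Scale every sample to a target library size of 40 million reads.
## When scaling by AUC the read length cancels out (assumption noted above).
target_size <- 4e7
scaled <- round(sweep(coverage_counts, 2, target_size / auc, `*`))

## Library sizes after scaling, in millions
colSums(scaled) / 1e6

## Keep genes whose mean scaled count exceeds an illustrative cutoff
keep <- rowMeans(scaled) > 0.5
filtered <- scaled[keep, , drop = FALSE]
```

Because the gene counts only cover part of the total coverage, the scaled library sizes stay at or below the 40 million target.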

DE analysis
Now that we have scaled the counts, there are multiple DE packages we could use, as described elsewhere 2,3. Since we have 12 samples per group, which is a moderate number, we will use limma-voom 16 due to its speed. The model we use tests for DE between prenatal and postnatal samples adjusting for sex and RIN, which is a measure of quality of the input sample. We check the data with multi-dimensional scaling plots (Figure 13 and Figure 14) as well as the mean-variance plot (Figure 15). In a real use case we might have to explore the results with different models and perform sensitivity analyses. Having run the DE analysis, we can explore some of the top results with an MA plot (Figure 16) and a volcano plot.
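As a sketch of the model described above, the design matrix can be built in base R before handing it to limma. The metadata below is hypothetical (6 samples rather than 72) and the column names `prenatal`, `sex` and `RIN` are assumptions for illustration.

```r
## Hypothetical sample metadata
pheno <- data.frame(
    prenatal = factor(
        c("prenatal", "prenatal", "postnatal",
            "postnatal", "postnatal", "prenatal"),
        levels = c("postnatal", "prenatal")
    ),
    sex = factor(c("female", "male", "male", "female", "male", "female")),
    RIN = c(8.4, 9.1, 7.9, 8.8, 6.5, 9.6)
)

## One column per coefficient: intercept, prenatal vs postnatal, sex, RIN
design <- model.matrix(~ prenatal + sex + RIN, data = pheno)
colnames(design)
```

In this parameterization the second coefficient captures the prenatal versus postnatal difference, adjusted for sex and RIN, and would be the one to test after fitting the model with limma-voom.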

DE report
Now that we have the DE results, we can use some of the tools with the biocView ReportWriting to create a report. One of them is regionReport 17, which can create reports from DESeq2 18 and edgeR 19 results. It can also handle limma-voom 16 results by making them look like DESeq2 results. To do so, we need to extract the relevant information from the limma-voom objects using topTable() and build DESeqDataSet and DESeqResults objects as shown below. A similar conversion is needed to use ideal 20, which is another package in the ReportWriting biocView category.
## Some extra information used by the report function
mcols(dds) <- limma_res
mcols(mcols(dds)) <- DataFrame(type = "results", description = "manual incomplete conversion from limma-voom to DESeq2")
Having converted our limma-voom results to DESeq2 results, we can now create the report, which should open automatically in a browser.

GO enrichment
Using clusterProfiler 21 we can then perform several enrichment analyses using the Ensembl gene IDs. Here we show how to perform an enrichment analysis using the biological process ontology (Figure 18). Several other analyses can be performed with the resulting list of differentially expressed genes as described previously 2,3, although that is beyond the scope of this workflow.
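To give a rough idea of what such an enrichment analysis tests, the sketch below runs a single over-representation test in base R. It assumes a hypergeometric test, which is our understanding of the approach behind enrichGO() in clusterProfiler; all counts are made up and no multiple testing correction is shown.

```r
## Hypothetical counts for one GO term
N <- 10000  # genes in the background (universe)
K <- 200    # background genes annotated to the term
n <- 500    # differentially expressed genes
k <- 25     # DE genes annotated to the term

## Number of annotated DE genes expected by chance
expected <- n * K / N

## Probability of observing k or more annotated genes among the DE genes
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_value
```

In practice this test is repeated for every term in the ontology, so the resulting p-values need to be adjusted for multiple testing.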

Other features
As described in Figure 1, recount2 provides data for expression features beyond genes. In this section we perform a DE analysis using exon data as well as the base-pair resolution information.

Exon and exon-exon junctions
The exon and exon-exon junction coverage count matrices are similar to the gene level one and can also be downloaded with download_study(). However, these coverage count matrices are much larger than the gene one. Aggressive filtering of lowly expressed exons or exon-exon junctions can reduce the matrix dimensions if this impacts the performance of the DE software used.
Below we repeat the gene level analysis for the disjoint exon data. We first download the exon data, add the expanded metadata we constructed for the gene analysis, explore the data (Figure 19), and then perform the DE analysis using limma-voom.
## Download the data if it is not there
if(!file.exists(file.path("SRP045638", "rse_exon.Rdata"))) {
    download_study("SRP045638", type = "rse-exon")
}
## 2017-07-30 10:37:11 downloading file rse_exon.Rdata to SRP045638
## Load the data
load(file.path("SRP045638", "rse_exon.Rdata"))
## Scale and add the metadata (it is in the same order)
identical(colData(rse_exon)$run, colData(rse_gene_scaled)$run)
Just like at the gene level, we see many exons differentially expressed between prenatal and postnatal samples (Figure 20). As a first step to integrate the results from the two features, we can compare the list of genes that are differentially expressed versus the genes that have at least one exon differentially expressed.
## Make a venn diagram
library("gplots")
vinfo <- venn(list("genes" = genes_de, "exons" = genes_w_de_exon), names = c("genes", "exons"), show.plot = FALSE)
plot(vinfo)
title("Genes/exons with DE signal")
Not all differentially expressed genes have differentially expressed exons. Moreover, genes with at least one differentially expressed exon are not necessarily differentially expressed (Figure 21). This is in line with what was described in Figure 2B of Soneson et al., 2015 22.
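The comparison behind the Venn diagram can also be expressed with base R set operations. The gene IDs below are hypothetical placeholders, and the object names simply mirror the ones used in the workflow code.

```r
## Hypothetical gene IDs
genes_de <- c("geneA", "geneB", "geneC", "geneD")
genes_w_de_exon <- c("geneB", "geneC", "geneE")

## Genes with DE signal at both feature levels
both <- intersect(genes_de, genes_w_de_exon)

## DE genes without any DE exon
gene_only <- setdiff(genes_de, genes_w_de_exon)

## Genes with at least one DE exon that are not DE at the gene level
exon_only <- setdiff(genes_w_de_exon, genes_de)

## Sizes of the three sections of the Venn diagram
c(gene_only = length(gene_only), both = length(both),
    exon_only = length(exon_only))
```

Keeping the three sets as vectors makes it straightforward to follow up on any one group, for example the genes that are only detected at the exon level.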
This was just a quick example of how we can perform DE analyses at the gene and exon feature levels. We envision that more involved pipelines could be developed that leverage both feature levels, such as in Jaffe et al., 2017 23. For instance, we could focus on the differentially expressed genes with at least one differentially expressed exon, and compare the direction of the DE signal versus the gene level signal as shown in Figure 22.
## Make the plot
plot(top_gene_de$logFC, exon_max_fc, pch = 20, col = adjustcolor("black", 1/5), ylab = "Most extreme exon log FC", xlab = "Gene log FC", main = "DE genes with at least one DE exon")
abline(a = 0, b = 1, col = "red")
abline(h = 0, col = "grey80")
abline(v = 0, col = "grey80")
The fold change for most exons shown in Figure 22 agrees with the gene level fold change. However, some of them have opposite directions and could be interesting to study further.
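One way to obtain the most extreme exon fold change per gene, and to flag genes where it disagrees in sign with the gene level result, is sketched below in base R. The table, the gene IDs and the fold changes are hypothetical, and this is not necessarily how exon_max_fc was computed for Figure 22.

```r
## Hypothetical exon-level results: one row per DE exon
exon_tab <- data.frame(
    gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB"),
    logFC = c(0.5, -2.1, 1.0, 0.3, 0.8)
)

## For each gene keep the exon log fold change with the largest magnitude
exon_max_fc <- vapply(
    split(exon_tab$logFC, exon_tab$gene_id),
    function(x) x[which.max(abs(x))],
    numeric(1)
)

## Hypothetical gene-level log fold changes for the same genes
gene_logFC <- c(geneA = 1.2, geneB = 0.6)

## Genes whose most extreme exon points in the opposite direction
opposite <- sign(exon_max_fc[names(gene_logFC)]) != sign(gene_logFC)
names(which(opposite))
```

Genes flagged this way are candidates for a closer look at isoform-level changes.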

Base-pair resolution
recount2 provides bigWig coverage files (unscaled) for all samples, as well as a mean bigWig coverage file per project where each sample was scaled to 40 million 100 base-pair reads. The mean bigWig files are exactly what is needed to start an expressed regions analysis with derfinder 8. recount provides two related functions: expressed_regions(), which is used to define a set of regions based on the mean bigWig file for a given project, and coverage_matrix(), which based on a set of regions builds a count coverage matrix in a RangedSummarizedExperiment object just like the ones that are provided for genes and exons. Both functions ultimately use import.bw() from rtracklayer 24, which currently is not supported on Windows machines. While this presents a portability disadvantage, on the other hand it allows reading portions of bigWig files from the web without having to fully download them. download_study() with type = "mean" or type = "samples" can be used to download the bigWig files, which we recommend doing when working with them extensively.
For illustrative purposes, we will use the data from chromosome 21 for the SRP045638 project. First, we obtain the expressed regions using a relatively high mean cutoff of 5. We then filter the regions to keep only the ones longer than 100 base-pairs to shorten the time needed for running coverage_matrix().
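The length filter can be sketched in base R on a plain data frame of hypothetical regions. With a GRanges object, such as the one returned by expressed_regions(), the equivalent filter would use width() on the object directly.

```r
## Hypothetical expressed regions on chromosome 21 (1-based, inclusive)
regions_df <- data.frame(
    chr = "chr21",
    start = c(1000, 5000, 9000),
    end = c(1050, 5400, 9100)
)

## Region width in base-pairs, counting both ends
regions_df$width <- regions_df$end - regions_df$start + 1

## Keep only the regions longer than 100 base-pairs
long_regions <- regions_df[regions_df$width > 100, ]
long_regions
```

Dropping short regions early reduces the number of intervals for which coverage has to be read from the bigWig files.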
## Define expressed regions for study SRP045638, only for chromosome 21
regions <- expressed_regions("SRP045638", "chr21", cutoff = 5L, maxClusterGap =
Now that we have a scaled count matrix for the expressed regions, we can proceed with the DE analysis just like we did at the gene and exon feature levels (Figure 23, Figure 24, Figure 25, and Figure 26).
Next, we obtain the base-pair coverage data for each DER and scale the data to a library size of 40 million 100 base-pair reads, using the coverage AUC information we have in the metadata.
To create a TxDb object for Gencode v25, first we need to import the data. Since we are working only with chromosome 21 for this example, we can subset it. Next we need to add the relevant chromosome information. Some of the annotation functions we use can handle Entrez or Ensembl IDs, but not Gencode IDs. So we will make sure that we are working with Ensembl IDs before finally creating the Gencode v25 TxDb object.
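Converting Gencode IDs to Ensembl IDs amounts to dropping the version suffix, which can be sketched in base R as follows. The IDs below are illustrative examples of the Gencode format and the object names are assumptions.

```r
## Illustrative Gencode-style gene IDs: Ensembl ID plus a version suffix
gencode_ids <- c("ENSG00000000003.14", "ENSG00000000419.12",
    "ENSG00000000457.13")

## Remove the first dot and everything after it to get Ensembl IDs
ensembl_ids <- sub("\\..*$", "", gencode_ids)
ensembl_ids
```

The same substitution can be applied to the gene ID column of the imported annotation before building the TxDb object.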
## 'select()' returned 1:many mapping between keys and columns
## Annotate the regions of interest
## Note that we are using the original regions, not the resized ones
nearest_ann <- matchGenes(regions_by_padj[1:10], ann_gencode_v25_hg38)
The final piece we need to run plotRegionCoverage() is information about which base-pairs are exonic, intronic, etc. This is done via the annotateRegions() function in derfinder, which itself requires prior processing of the TxDb information by makeGenomicState().
## Create the genomic state object using the gencode TxDb object
gs_gencode_v25_hg38 <- makeGenomicState(gencode_v25_hg38_txdb, chrs = seqlevels(regions))
## 'select()'
In these plots we can see that some DERs match known exons (Figure 28, Figure 34, Figure 36), some are longer than known exons (Figure 27, Figure 33, Figure 35), and others are exon fragments (Figure 29-Figure 32) which could be due to the cutoff used. Note that Figure 33 could be shorter than a known exon due to a coverage dip.

Summary
In this workflow we described in detail the available data in recount2, how the coverage count matrices were computed, the metadata included in recount2 and how to get new phenotypic information from other sources. We showed how to perform a DE analysis at the gene and exon levels as well as use an annotation-agnostic approach. Finally, we explained how to visualize the base-pair information for a given set of regions. This workflow constitutes a strong basis to leverage the recount2 data for human RNA-seq analyses.

expressed exons." Surely this is the definition of a DE gene? I absolutely agree that "Moreover, genes with at least one differentially expressed exon are not necessarily differentially expressed" (differential transcript usage is a prime example of where this can happen), but if a gene is DE I'm pretty sure that it must have a DE exon.
4. I don't see the need for figs 28-36; a single example of the plot type should be sufficient, I think, for an example workflow.
5. It would be nice if recount2 could also provide information at the transcript level. Have the authors considered augmenting recount2 with salmon quantifications for all the data? (big job and more of a 'feature request' really).

RNA-seq experiments. The accompanying R/Bioconductor package 'recount' gives programmatic access to download read counts per gene and to estimate read counts for genomic regions of interest. In an RNA-seq pipeline, processing raw data into the formats available through recount2 involves the most time-consuming steps. Thus, recount2 will save many researchers a lot of time.
The workflow describes how to programmatically access data from recount2 and describes different analyses that can be done using these data. However, I think the authors need to improve and clarify some aspects of the workflow, which I summarize below.
Major comments:
The authors use the formula in equation 1 to scale read counts. While I agree that the read counts will be approximately equal to the sum of the coverage divided by the read length, it was not clear why the additional rescaling is needed. I recommend that the authors include a more extensive justification. Also, if the experiments were paired-end, wouldn't this formula be counting reads instead of sequenced RNA fragments (i.e. double counting)?
The section "Enriching annotation" describes several functions and analyses but does not provide any code or examples. Currently, since it is incomplete, it is more distracting than informative. I suggest that the authors either expand this section and add code or drop it.
I don't understand the biological question behind a differential expression analysis at the exon level. Could the authors clarify what the biological question is? If the aim is to find differential exon usage, wouldn't it be better to use either DEXSeq, DRIMseq, or similar packages that are specifically designed for this analysis?
Minor comments:
The first three sentences of the introduction need references.
The sentence "generally, when investigators use the term expression, they refer to gene expression" is not entirely true. For example, developmental or cell biologists often interpret "expression" as protein expression.
For full reproducibility, it would be useful to download the data within R using the SRAdb package instead of downloading it manually.
The code that creates the age groups is too complicated (4 embedded 'ifelse' statements). I have submitted a pull request with a simplified version of it (https://github.com/LieberInstitute/recountWorkflow/pull/1).
Figures 13 and 14 could be merged into a single plot, using shapes and colors to distinguish the different annotations. The same holds for figures 23 and 24.

Is the description of the method technically sound? Partly
Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

The authors present a workflow that describes how to analyze the datasets available through the Recount2 project with Bioconductor. Since many of the state-of-the-art methods for the analysis of RNA-seq data are implemented in R and available through the Bioconductor project, this contribution is an important resource for researchers interested in reanalyzing the impressive amount of data that the authors have processed in the Recount2 project.
I have a few comments that hopefully will help improve the workflow.
1. I was a bit confused by the rationale of the scaled coverage counts, and especially by the need for a target library size and the use of scaled counts. Wouldn't it be simpler to divide the coverage by read length (without rescaling)? Wouldn't that result in the actual reads mapped to each region (exon, gene, ...)? I understand that for the derfinder analysis, some rescaling is needed for normalization purposes, but for more "classic" analysis (such as gene- or exon-level differential expression) where the counts are normalized later in the workflow, wouldn't starting from 'coverage/readlength' be a more sensible choice?
2. Sex prediction. This is a really interesting part of the analysis, even though it's not the focus of this workflow. It would be interesting to get the authors' opinion of how to best use this feature on real analyses. For instance, are the 8 misclassified samples likely to be false positives from the classifiers or are they mislabeled samples? What is the recommendation of the authors in such cases? Should these samples be discarded or are there any diagnostics that can be run to make sure that the quality of these samples is not compromised?
3. I think that the authors should give more details on the design matrix. For instance, why did they decide to include RIN and sex? Why is it important to include these variables in the model? More generally, the workflow lacks details on the limma pipeline. I understand that this is not the focus of the authors' work, but it may be confusing for beginners that don't have a direct experience with limma or voom. The authors could for instance refer the reader to the limma workflow for details.
4. Similarly, there is a lack of details on the GO enrichment analysis. Since there are many types of gene-set enrichment analysis, a paragraph could be added with more details and perhaps some references to explain what enrichment analysis is and what types of hypotheses are tested.
5. One important advantage of exon-level differential expression is that it can be used to infer alternative splicing. This can be done with the functions 'diffSplice()' and 'topSplice()' in limma or with the DEXSeq package. It would be nice to showcase these functions or at least to mention that they exist.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.