restfulSE: A semantically rich interface for cloud-scale genomics with Bioconductor

Bioconductor's SummarizedExperiment class unites numerical assay quantifications with sample- and experiment-level metadata. SummarizedExperiment is the standard Bioconductor class for assays that produce matrix-like data, used by over 200 packages. We describe the restfulSE package, a deployment of this data model that supports remote storage. We illustrate use of SummarizedExperiment with remote HDF5 and Google BigQuery back ends, with two applications in cancer genomics. Our intent is to allow the use of familiar and semantically meaningful programmatic idioms to query genomic data, while abstracting the remote interface from end users and developers.

Analyses of multiomic archives like The Cancer Genome Atlas (TCGA) and single-cell transcriptomic experiments such as the 10x 1.3 million mouse neuron dataset typically begin with downloads of large files and conversion of file contents into formats based on local preferences. In this paper we consider how targeted queries of large remote genomic data resources can be conducted using methods available for Bioconductor's SummarizedExperiment class. For large data archives that have been centralized in cloud storage, use of this approach can help diminish effort required to manage local storage, and can facilitate interactive analysis of data subsets in familiar programming idioms, without downloading entire datasets. Clients for HDF5 or Google BigQuery are available in numerous languages; our Bioconductor interface permits access to remote archives of genomic data with familiar and semantically meaningful programmatic idioms, while abstracting the remote interface from end users and developers.

Methods: Data structures and remote back ends
The SummarizedExperiment class and related methods
Let Q denote a matrix of quantifications arising from a genome-scale assay with G assay features measured on N experimental samples. The elements of Q are the numbers q_ij, i = 1, …, G, j = 1, …, N. Bioconductor's SummarizedExperiment structure manages feature quantifications with associated metadata about assay features and samples.
In the 10x mouse neuron dataset, G = 27998 and N = 1.3 million. Each of the G features is a gene, and it is useful to have at hand a number of feature annotations such as gene name, location, and functional role; suppose F such annotations are recorded for each gene. When these quantifications and associated annotations are managed in a Bioconductor SummarizedExperiment X, the matrix Q is programmatically bound to a G × F table of feature-level metadata accessible by the rowData method, and to an N × R table of sample-level metadata accessible by colData, where R denotes the number of sample-level metadata features recorded (Huber et al. 1). See Figure 1.
In the context of R programming, let K denote a vector of feature identifiers and let S denote a vector of sample identifiers. The standard subsetting idiom X[K,S] expresses filtering of all the information in Q and the associated metadata to features K and samples S. A GRanges instance (Lawrence et al. 2) defining genomic coordinates for features may be bound to X, facilitating queries defined by genomic location (using, for example, subsetByOverlaps) to isolate features coincident with or near the elements of a set of query genomic ranges (e.g., binding peaks). This outline of genomic data representation and analysis is characteristic of Bioconductor.
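The idioms just described can be sketched on a toy example. This is a minimal illustration, not code from the restfulSE package: the dimensions, identifiers, genomic coordinates, and metadata values are all invented for the example, and the assay is an ordinary in-memory matrix.

```r
# Toy illustration only: every name and value below is made up.
library(SummarizedExperiment)  # also attaches GenomicRanges, IRanges, S4Vectors

set.seed(1)
G <- 6; N <- 4
Q <- matrix(rpois(G * N, lambda = 10), nrow = G,
            dimnames = list(paste0("gene", 1:G), paste0("cell", 1:N)))

# Hypothetical genomic coordinates for the G features, with one
# feature-level annotation column (so F = 1 here)
ranges <- GRanges("chr1",
                  IRanges(start = seq(100, by = 100, length.out = G), width = 50))
names(ranges) <- rownames(Q)
mcols(ranges)$symbol <- toupper(rownames(Q))

# Sample-level metadata (so R = 1 here)
samples <- DataFrame(condition = rep(c("ctl", "trt"), each = 2),
                     row.names = colnames(Q))

X <- SummarizedExperiment(assays = list(counts = Q),
                          rowRanges = ranges, colData = samples)

# X[K, S] filters the assay and both metadata tables together
K <- c("gene2", "gene5")
S <- c("cell1", "cell4")
sub <- X[K, S]
rowData(sub)   # feature-level metadata for K only
colData(sub)   # sample-level metadata for S only

# Query by genomic location: features overlapping a query range
query <- GRanges("chr1", IRanges(start = 180, end = 320))
hits <- subsetByOverlaps(X, query)
rownames(hits)
```

In this sketch the subset `sub` carries along exactly the metadata rows for the selected features and samples, which is the coordination that X[K,S] is meant to guarantee.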

Examples of remote back ends
Google BigQuery.
The Institute for Systems Biology Cancer Genomics Cloud project (ISB-CGC) (ISB 3) uses Google BigQuery to provide access to various public cancer genomics resources including TCGA and the PanCancer Atlas (Hoadley et al. 4). The pancan_SE function of restfulSE constructs queries that derive SummarizedExperiment instances using quantifications and annotations for PanCancer Atlas experiments managed in BigQuery tables.

HDF Scalable Data Service (HSDS).
An AWS S3-based distributed data object model for HDF5 datasets, including a RESTful API to structure, populate, and query HDF5 archives, has been implemented by the HDF Group. A number of datasets of interest in bioinformatics are served through HDF Kita Lab in the /shared/bioconductor folder.

Lazy data retrieval via DelayedArray
The restfulSE package provides interfaces to BigQuery and HSDS so that the numerical content housed in these services satisfies the API of the Bioconductor DelayedArray (Pagès and Hickey 5). Any DelayedArray instance can serve as the assay component of a SummarizedExperiment instance. Thus the capacities of SummarizedExperiment to bind semantically rich metadata to genome-scale assays are extended implicitly to data resources for which no standards exist for associating substantive metadata.
In conjunction with the rhdf5client and bigrquery packages, restfulSE functions translate filtering and selection operations, which are readily defined using rowData, rowRanges, and colData, into formal queries resolvable by the HDF5 and BigQuery services. Numerical results are transmitted from server to client only when needed.
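The deferred-evaluation behavior can be illustrated with a small sketch. Here an in-memory matrix wrapped by DelayedArray stands in for a remote back end (an assumption made so the example runs without network access or credentials); the object names are invented, and no restfulSE, rhdf5client, or bigrquery calls are shown.

```r
# Sketch only: a local matrix wrapped as a DelayedArray stands in for
# a remote HSDS or BigQuery source.
library(SummarizedExperiment)
library(DelayedArray)

Q <- matrix(as.numeric(1:20), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("s", 1:4)))
D <- DelayedArray(Q)

# Any DelayedArray can serve as the assay component
X <- SummarizedExperiment(assays = list(counts = D))

# Subsetting and arithmetic are recorded as delayed operations;
# no values are computed at this point
lazy <- log2(assay(X[c("gene2", "gene4"), c("s1", "s3")]) + 1)
is(lazy, "DelayedMatrix")

# Values are materialized only on explicit realization
vals <- as.matrix(lazy)
vals
```

With a remote-backed array, the realization step is the point at which numerical data would actually travel from server to client, and only for the selected features and samples.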

Results
The RESTful SummarizedExperiment representation allows complicated research queries to be posed and resolved in a concise, fast, convenient and robust fashion, as illustrated by the following examples.

Hybrid data/annotation strategy for integrative analysis
The following code chunk, which generates Figure 2, illustrates the use of the restfulSE protocol with the ISB-CGC BigQuery back end.

Performance
We consider three aspects of performance with restfulSE: reliability, expressivity, and scalability.

Reliability:
The restfulSE, rhdf5client and BiocOncoTK packages are accompanied by detailed unit tests that compare retrievals to known values. In the case of BigQuery table queries, the test suite composes random queries in both BigQuery SQL and in the SummarizedExperiment idiom. Results are checked for elementwise equality.
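The testing strategy of answering the same random query in two idioms and comparing the results can be sketched in base R. This is not the packages' actual test suite: a long-format data frame stands in for a back-end table, the helper names query_long and check_random_query are hypothetical, and no SQL is issued.

```r
# Sketch of the two-idiom comparison; all names here are hypothetical.
set.seed(42)
G <- 8; N <- 5
Q <- matrix(rnorm(G * N), nrow = G,
            dimnames = list(paste0("gene", 1:G), paste0("s", 1:N)))

# Long-format stand-in for a back-end table: one row per (feature, sample)
long <- data.frame(feature = rep(rownames(Q), times = N),
                   sample  = rep(colnames(Q), each = G),
                   value   = as.vector(Q),
                   stringsAsFactors = FALSE)

# Resolve a query by filtering the long table and reshaping to a matrix
query_long <- function(tbl, K, S) {
  hit <- tbl[tbl$feature %in% K & tbl$sample %in% S, ]
  out <- matrix(NA_real_, nrow = length(K), ncol = length(S),
                dimnames = list(K, S))
  out[cbind(hit$feature, hit$sample)] <- hit$value
  out
}

# Compose a random query, answer it both ways, and compare the results
check_random_query <- function() {
  K <- sample(rownames(Q), 3)
  S <- sample(colnames(Q), 2)
  isTRUE(all.equal(query_long(long, K, S), Q[K, S, drop = FALSE]))
}

all_ok <- all(replicate(20, check_random_query()))
stopifnot(all_ok)
```

Because the queries are drawn at random, repeated runs exercise many different feature and sample combinations rather than a fixed set of cases.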
Expressivity:
The code segments for Figure 2 and Figure 3 are complex but easy to break down. The joining and reshaping of pancan-atlas tables in BigQuery corresponding to the code in Figure 2 can be checked through the query history in the BigQuery interface. The acquisition of expression values employed five nested SELECT statements; the query for assay quantifications was 6000 characters in length. The R code is less than 500 characters including comments.
Scalability:
BigQuery is intrinsically auto-scaling, but charges accrue with the amount of data scanned, so query design can affect both throughput and cost. We rely on the bigrquery (Wickham 10) and dbplyr (Wickham and Ruiz 11) packages for efficient translation of R-oriented data manipulations to BigQuery SQL. Throughput with the HDF Scalable Data Service depends on the configuration of the object server, the relationship of numerical data layout to prevalent access patterns, and the degree to which queries capitalize on API efficiencies like chunk-based retrieval. For both back ends, proper design and deployment of the querying client can lead to throughput that scales with client-side resources.

Conclusions
Cloud-scale storage and retrieval strategies are of significant interest for genome science. The SummarizedExperiment class unifies assay data with substantive sample- and experiment-level metadata, and its API for managing and interrogating genome-scale experiment archives is used in numerous analytic packages. The restfulSE package exposes high-performance cloud-resident data stores to users and algorithms as SummarizedExperiments.
Continued improvements in efficiency of representation and query resolution for assay data and metadata will help to achieve the potential of a federated data ecosystem for enhanced discovery in biology through interactive genome-scale analysis.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.