Software Tool Article

AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines

[version 1; peer review: 2 approved]
PUBLISHED 21 Oct 2024

This article is included in the Bioconductor gateway.

Abstract

Advancements in sequencing technologies and the development of new data collection methods produce large volumes of biological data. The Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based platform for democratizing access to large-scale genomics data and analysis tools. However, utilizing the full capabilities of AnVIL can be challenging for researchers without extensive bioinformatics expertise, especially for executing complex workflows. We present the AnVILWorkflow R package, which enables the convenient execution of bioinformatics workflows hosted on AnVIL directly from an R environment. AnVILWorkflow simplifies the setup of the cloud computing environment, input data formatting, workflow submission, and retrieval of results through intuitive functions. We demonstrate the utility of AnVILWorkflow for three use cases: bulk RNA-seq analysis with Salmon, metagenomics analysis with bioBakery, and digital pathology image processing with PathML. The key features of AnVILWorkflow include user-friendly browsing of available data and workflows, seamless integration of R and non-R tools within a reproducible analysis pipeline, and accessibility to scalable computing resources without direct management overhead. AnVILWorkflow lowers the barrier to utilizing AnVIL’s resources, especially for exploratory analyses or bulk processing with established workflows. This empowers a broader community of researchers to leverage the latest genomics tools and datasets using familiar R syntax. This package is distributed through the Bioconductor project (https://bioconductor.org/packages/AnVILWorkflow), and the source code is available through GitHub (https://github.com/shbrief/AnVILWorkflow).

Keywords

Cloud computing, Genomics, Workflows, R/Bioconductor, AnVIL

Introduction

The NHGRI’s Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) consortium was launched in 2018 with the aim of democratizing genomics data.1 AnVIL enables easy sharing of genomics data by organizing databases, bioinformatics pipelines for large-scale data processing, and interactive downstream analysis in one Cloud-based platform. AnVIL,2 which is also the name of the platform produced by the AnVIL project, implements the FAIR data-sharing philosophy and provides a graphical user interface (GUI, supported by Terra3), making it more accessible for researchers without programming backgrounds. However, a GUI tends to be slower and less efficient than a command line interface (CLI), especially for bulk analyses. It also requires learning a new platform and does not support version control or text-based workflows, which are often listed among the best practices for reproducible computational research.4

Bioconductor’s AnVIL package is an AnVIL API wrapper that provides R-friendly, programmatic access to the flexible and scalable cloud-based resources implemented in the AnVIL platform. With the AnVIL package, users can easily access workflows, data, and Cloud-based computing resources managed by AnVIL. However, the AnVIL package is not customized for workflow execution tasks. Instead, it covers all the resources related to the AnVIL platform, such as interaction with the repository for Docker-based genomic analysis tools and workflows (Dockstore5), leveraging cloud resources (Leonardo6), and data search and digestion (Gen37). Many functions in the AnVIL package also expose API commands directly, so using them for workflow execution requires a deep understanding of the underlying AnVIL workspace structures and data models. In addition, it is a general-purpose package that offers no workspace-specific support and no metadata curation. Because most Bioconductor users focus on data analysis, a convenient R-friendly way of accessing and utilizing AnVIL resources is needed. Here, we present the AnVILWorkflow package to meet this need. The AnVILWorkflow package is a convenient, fit-for-purpose wrapper around the AnVIL package with the following features optimized for workflow execution:

  • Intuitive function names and returned values

  • Workflow-specific documentation

  • Cloud environment setup with a single function call

  • Error messages that are easy to interpret and actionable

  • Essential metadata curation for more efficient data browsing

Users can apply AnVILWorkflow to any workspace they can access, including 347 public workspaces (snapshot taken on August 28, 2023) available to anyone with an AnVIL account. We present three use cases where we ran non-R-based bioinformatics analysis tools using conventional R syntax: Salmon,8 bioBakery,9 and PathML.10 Salmon is a widely used command-line RNA sequencing analysis tool for quantifying the expression of transcripts. Its downstream analysis involves many R/Bioconductor packages, such as DESeq2, edgeR, and limma. bioBakery is a widely used whole metagenomic shotgun (WMS) sequencing data analysis environment, mainly relying on Python. PathML is a general-purpose research toolkit for computational pathology, including many functionalities in digital pathology data analysis, such as stain normalization, nucleus segmentation, and tissue detection. PathML takes raw image files and returns the processed image data in an hdf5 format for further downstream analysis, including machine learning methods.

AnVIL provides comprehensive resources for biomedical data analysis, including data (e.g., genomics), workflows for bulk analysis, and interactive analysis apps (i.e., Galaxy, Jupyter Notebooks, and RStudio) under the workspace. Workflows are often a limiting factor in bioinformatics analysis due to computing demands and the bioinformatics expertise required. Thus, the AnVILWorkflow package makes the workflow-related resources from AnVIL more accessible and easier to use, especially for R users (Figure 1).

Figure 1. Overview of AnVILWorkflow package.

AnVIL workflows are written in the Workflow Description Language (WDL), which specifies the runtime environment, including the appropriate Docker containers for existing analysis tools and the computing resources. Cromwell, a scientific workflow execution and management system, runs WDL workflows on the cloud. AnVILWorkflow users can easily run established workflows developed by experts and use the cloud resources without having to configure or maintain them.

While AnVIL manages workflow orchestration and workspace metadata and provides default setups that simplify decision-making, users still need to manage their data storage and cloud costs. Genomics data, especially in raw and intermediate forms, are very large, so storage becomes costly as the sample size increases. Data-related costs arise, and can be managed, in two areas: storage itself and transfer. For example, using regional storage instead of multi-region storage, cleaning up intermediate results, and keeping infrequently accessed data in low-cost storage (e.g., Nearline or Coldline storage from Google Cloud) can reduce per-sample costs. Analyzing data stored in one region using Virtual Machine (VM) compute resources in a different region incurs data transfer charges, so centralizing all storage and computing in a single region can be more cost-efficient, both reducing the storage cost and avoiding data transfer charges. Currently, AnVIL workspaces use us-central1 as the default region, and any artifacts generated by workflow execution are saved, unless otherwise specified, in the same-region bucket linked to the workspace. Users who keep the default region configured by AnVIL can avoid data transfer charges by storing their own data in us-central1 as well. Additionally, open- and controlled-access genomic datasets hosted in AnVIL are stored in the US multi-region, so users with the default workspace configuration pay no storage or transfer charges for them. Downloading data to the user’s workstation or laptop is subject to charges, currently $0.08 to $0.12 per GB depending on the amount of data11 and the geography of the transfer; transfers from the US to another continent are more expensive than transfers within the US.
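As a rough illustration of the download charges quoted above, the following R sketch bounds the cost of moving a dataset to a local machine. Only the $0.08 and $0.12 per-GB rates come from the text; the function name egress_cost_range and the 500 GB dataset size are hypothetical, and the actual rate depends on data volume and geography, so the result should be read as a range rather than a quote.

```r
# Bound the download (egress) cost for a dataset of a given size.
# The per-GB rates are the range quoted in the text; everything else
# (function name, example size) is an assumption for illustration.
egress_cost_range <- function(size_gb, rate_low = 0.08, rate_high = 0.12) {
  stopifnot(is.numeric(size_gb), length(size_gb) == 1, size_gb >= 0)
  c(low = size_gb * rate_low, high = size_gb * rate_high)
}

# Hypothetical example: a 500 GB set of raw sequencing files
example_cost <- egress_cost_range(500)
print(example_cost)
```

Keeping the analysis in the cloud and downloading only the summarized outputs avoids most of this charge, which is the motivation for the single-region setup described above.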

While browsing existing resources through AnVILWorkflow is free, running workflows incurs computing costs. AnVILWorkflow is designed to use existing workflows, which usually predefine computing resources optimized for the type of analysis, simplifying computing-related cost management. Users can further reduce the run cost with call caching and preemptible instances. For example, because a preemptible VM lasts at most 24 hours, a workflow that completes in under 24 hours can save up to 80% by running on preemptible VMs.

Costs for a group of users can be managed efficiently through the AnVIL billing project. One billing account can be shared by adding email addresses under the billing project. The billing project offers details on each workspace, including the workspace owner and spending reports, so we can easily identify ‘who’ uses ‘how much’ for ‘what’. In addition to the workspace-level expense reports, users can further enhance cost monitoring by configuring spending reporting.12 This allows users to closely monitor the expenditure associated with each workflow execution.

Methods

Implementation

The AnVILWorkflow package provides all the functionalities required to run workflows available in AnVIL from a local R session, from the environment setup to the output download. One prerequisite is to create an AnVIL account from the AnVIL web portal. An AnVIL account provides the two inputs required to run workflows remotely: 1) the email address associated with the user’s account and 2) the billing project name to cover the computing cost.

AnVIL-hosted workflows can be run using four main functions: setCloudEnv, cloneWorkspace, runWorkflow, and getOutput. The setCloudEnv function accepts the AnVIL account email and billing project name and sets up the local R environment for access to AnVIL and Cloud-computing resources. The cloneWorkspace function creates the user’s copy of a ‘template’ workspace, and the runWorkflow function executes the workflow. The getOutput function lists the outputs from successfully executed workflows and downloads user-specified files to a local computer.

User input can be provided through the updateInput function, which accepts one of two forms of tables depending on the workflow: AnVIL’s data model or URLs pointing to data files stored in Google Cloud buckets. The input data formats are already specified in the workflow scripts (Workflow Description Language, WDL13). Other accessory functions are available to monitor submission progress (monitorWorkflow), stop a submitted workflow (stopWorkflow), and get Dashboard content (getDashboard).

The AnVILBrowse function allows users to browse AnVIL resources using keywords. This function runs instantaneously because the AnVILWorkflow package includes a metadata snapshot of all publicly accessible AnVIL workspaces and their workflows and data. It performs basic metadata harmonization, allowing more efficient browsing and filtering, such as selecting workspaces based on the study size or participants’ ages. Users can also browse non-public workspaces they have access to using the getMetaTables function; however, this process can take a while, depending on the number of workspaces a user has access to.
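To illustrate why a bundled metadata snapshot makes keyword browsing fast, here is a minimal sketch of keyword filtering over a small in-memory table. It is not the package's implementation: the table contents, its column names, and the helper browse_snapshot are all hypothetical, and the only point is that the search is a local table scan with no network request.

```r
# Hypothetical metadata snapshot; workspace names, columns, and rows are
# invented for illustration and do not reflect the package's actual tables.
snapshot <- data.frame(
  workspace = c("demo-rnaseq", "demo-metagenomics", "demo-pathology"),
  description = c("Bulk RNA-seq quantification with Salmon",
                  "Whole metagenome shotgun profiling with bioBakery",
                  "Stain normalization of pathology images with PathML"),
  stringsAsFactors = FALSE
)

# Case-insensitive keyword match across every column of the snapshot;
# a row is kept if the keyword appears in any of its fields.
browse_snapshot <- function(meta, keyword) {
  hit <- Reduce(`|`, lapply(meta, function(col) {
    grepl(keyword, col, ignore.case = TRUE)
  }))
  meta[hit, , drop = FALSE]
}

hits <- browse_snapshot(snapshot, "salmon")
print(hits$workspace)
```

Because the table lives in memory, the lookup cost depends only on the number of rows, which is consistent with the contrast drawn above between browsing the public snapshot and querying non-public workspaces one by one.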

Operation

The use cases demonstrated below include demo input data in the template workspaces, so the R scripts below can run the listed use cases from the local computer. Ready-to-run examples that can be used to test the process on the user’s own AnVIL account are available in the AnVILWorkflow package vignette. Genome Analysis Toolkit Variant Discovery (GATK) best-practice pipelines14 are not demonstrated here but are also available as AnVIL workspaces.

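## Load the package
library(AnVILWorkflow)
## Set up the account (replace the placeholders with your own values)
accountEmail <- "<AnVIL account email>"
billingProjectName <- "<AnVIL billing project name>"
setCloudEnv(accountEmail = accountEmail,
            billingProjectName = billingProjectName)
## Clone the workspace of interest
templateName <- "<name of the template workspace>"
newName <- "<unique name for your copy of the workspace>"
cloneWorkspace(workspaceName = newName, templateName = templateName)
## Run the workflow (workflowName is needed only if the workspace
## of interest contains more than one workflow)
runWorkflow(workspaceName = newName,
            workflowName = "<name of the workflow>")
## Get workflow outputs
getOutput(workspaceName = newName)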

The main features of the demo workspaces and their workflow-specific input data preparation process are described below.

Results

Use cases

Bulk RNA sequencing data analysis

The Salmon workflow uses AnVIL’s data model and requires four essential inputs: fastq1, fastq2, fasta, and transcriptome index name. This workflow can be easily applied to the consortium data hosted in AnVIL, which follows AnVIL’s data model. With the default runtime environment configured for this workflow (1 CPU, 2 GB memory, and 10 GB SSD disk), processing 16 demo samples (32 fastq files, ~1 GB per file) took about 30 minutes and cost $0.12.

Whole metagenomic shotgun data analysis

bioBakery is a metagenome analysis environment composed of Python-based tools, reference databases, and command-line-based workflows. It processes raw shotgun sequencing data into microbial community feature profiles, summary reports, and figures.9 bioBakery’s whole metagenome shotgun (wmgx) and visualization (wmgx_vis) workflows are implemented as an AnVIL workspace. The current version of AnVILWorkflow supports version 3 of bioBakery.15 While users can customize this workflow to a great degree, six inputs are sufficient to run a standard, optimized version of this workflow. Those six inputs are:

  • Name of the Trimmomatic adaptor type (for demo data, NexteraPE)

  • Your project name

  • Extension of input files (for demo data, .fastq.gz)

  • A table of your sequencing file (fastq) names stored in the Google Cloud Storage bucket

  • Input file identifier for paired-end sequencing (for demo data, _R1 and _R2)

The seven required databases are already linked to this workflow, and nine additional optional inputs are available for further customization, such as bypassing functional profiling (default is false) and setting the maximum memory usage for different tasks (defaults are 32 GB for functional profiling by HUMAnN, 8 GB for quality control by KneadData, and 24 GB for taxonomic profiling by MetaPhlAn). This workflow uses call caching and preemptible instances by default for cost efficiency. Processing six paired-end demo samples (mean file size ~380 MB) with the optimized default settings, but without preemptible instances, took about 5 hours and cost around $6.50. With preemptible instances, the run can take longer but costs less. Compared to existing options, such as Nephele,16 AnVILWorkflow allows a programmatic approach and more flexible customization.
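The file-name table, extension, and paired-end identifiers listed above are related, and it can help to check them against each other before submitting a run. The sketch below builds a small table from a vector of object names and verifies that every sample has both mates. The bucket name, file names, and table layout are hypothetical; the exact table format expected by the workflow is defined in its WDL and is not reproduced here.

```r
# Hypothetical object names in a Google Cloud Storage bucket (illustrative only)
files <- c("gs://example-bucket/sampleA_R1.fastq.gz",
           "gs://example-bucket/sampleA_R2.fastq.gz",
           "gs://example-bucket/sampleB_R1.fastq.gz",
           "gs://example-bucket/sampleB_R2.fastq.gz")

ext <- ".fastq.gz"           # input file extension used by the demo data
pair_ids <- c("_R1", "_R2")  # paired-end identifiers used by the demo data

# Strip the directory and extension, then split off the pair identifier
base <- sub(ext, "", basename(files), fixed = TRUE)
mate <- ifelse(endsWith(base, pair_ids[1]), pair_ids[1], pair_ids[2])
sample <- substr(base, 1, nchar(base) - nchar(mate))

# One row per file (assumed layout, for checking only)
input_table <- data.frame(sample = sample, mate = mate, file = files,
                          stringsAsFactors = FALSE)

# TRUE for each sample that has exactly the two expected mates
complete <- tapply(input_table$mate, input_table$sample,
                   function(m) setequal(m, pair_ids))
print(input_table)
print(complete)
```

A sample flagged FALSE here would indicate a missing or misnamed mate file, which is cheaper to catch locally than after a multi-hour cloud run.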

Histopathology image processing using PathML

We implemented the hematoxylin-eosin (HE) stain normalization process of PathML as an AnVIL workspace. This workflow accepts an SVS file as input and returns the original and normalized images as PNG files. There are two required inputs: the Google Cloud Storage URI where the input SVS image file is stored and the sample name. Processing one publicly available image (CMU-1_Small_Region.svs, 1.8 MB)17 with the default runtime (4 CPU, 16 GB memory) took about 8 minutes and cost $0.01. This simple but robust analysis setup can support clinical use cases, such as pathologists who process a large number of images in a short time, by offering guidance and cross-validation options.
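The benchmark figures reported for the three use cases can be put on a common per-sample footing with a few lines of R. The sample counts and total costs below are taken from the text; the projection helper project_cost is hypothetical and assumes that cost scales linearly with the number of samples, which the benchmarks themselves do not establish.

```r
# Benchmarks reported in the three use cases above
benchmarks <- data.frame(
  workflow = c("Salmon", "bioBakery", "PathML"),
  n_samples = c(16, 6, 1),
  total_cost_usd = c(0.12, 6.50, 0.01),
  stringsAsFactors = FALSE
)
benchmarks$cost_per_sample_usd <-
  benchmarks$total_cost_usd / benchmarks$n_samples

# Assumption: cost grows linearly with sample count (not shown in the source)
project_cost <- function(workflow, n) {
  rate <- benchmarks$cost_per_sample_usd[benchmarks$workflow == workflow]
  stopifnot(length(rate) == 1)
  rate * n
}

print(benchmarks)
print(project_cost("Salmon", 100))
```

Under that linear assumption, input size, call caching, and preemptible instances are ignored, so the projection is best treated as an order-of-magnitude planning figure.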

Discussion

The AnVILWorkflow package enables users to conduct complex and computationally intensive analyses with minimal bioinformatics expertise, using well-established workflows within AnVIL and versatile cloud resources directly from standard laptops with familiar R syntax. The major advantages AnVILWorkflow provides over existing approaches include 1) a minimal entry barrier, removing the need for software installation, preparation of properly versioned reference data, or construction and oversight of workflows, 2) use of flexible cloud computing resources without the need to learn or handle them directly, 3) user-friendly functions that provide enhanced information, and 4) improved reproducibility and interoperability by seamlessly linking multiple analysis steps, conducted in both R and non-R-based tools, within a single R vignette. However, some limitations remain. For instance, certain customizations of the workflows are limited or require a deeper understanding of the workflows. Although cloud computing is not inherently more costly than an in-house server, the pay-per-use structure requires careful planning and management. The absence of an integrated versioning system in AnVIL workspaces requires users to manually monitor new versions. In conclusion, AnVILWorkflow is most advantageous for bulk analysis of samples with relatively simple workflows (i.e., single-stage workflow procedures) or for exploratory data analysis by non-technical users, particularly when employing well-established analysis workflows.

Ethics and consent

Ethical approval and consent were not required.

Authors’ contributions

SO, LW, and VC conceived the research idea. SO, KG, MR, and MM developed the software. SO and KG performed the benchmarking analyses. SO, LW, and KG wrote the manuscript. LW, SD, MS, and MR reviewed the manuscript.

How to cite this article
Oh S, Gravel-Pucillo K, Ramos M et al. AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines [version 1; peer review: 2 approved]. F1000Research 2024, 13:1257 (https://doi.org/10.12688/f1000research.155449.1)

Open Peer Review

Reviewer Report 11 Dec 2024
Zachary S.L Foster, USDA Agricultural Research Service, Corvallis, OR, USA 
Approved
The authors describe an R package for submitting and interacting with pipelines hosted on the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL). The goal of this package is to make it easier for people primarily familiar with R ...
Foster ZSL. Reviewer Report For: AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines [version 1; peer review: 2 approved]. F1000Research 2024, 13:1257 (https://doi.org/10.5256/f1000research.170635.r342030)
Reviewer Report 04 Dec 2024
Fabricio Almeida-Silva, VIB Center for Plant Systems Biology, Ghent University, Ghent, Belgium 
Approved
In this manuscript, Oh et al present AnVILWorkflow, an R package that allows running and managing AnVIL workflows directly from an R session. This package is a valuable resource for the bioinformatics community, especially to novice bioinformaticians or researchers who ...
Almeida-Silva F. Reviewer Report For: AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines [version 1; peer review: 2 approved]. F1000Research 2024, 13:1257 (https://doi.org/10.5256/f1000research.170635.r342032)