multiomics: A user-friendly multi-omics data harmonisation R pipeline [version 1; peer review: 2 not approved]

Data from multiple omics layers of a biological system are growing in quantity, heterogeneity and dimensionality. Simultaneous multi-omics data integration is a growing field of research, as it has strong potential to unlock information on previously hidden biological relationships, leading to early diagnosis, prognosis and expedited treatments. Many tools for multi-omics data integration are being developed. However, these tools are often restricted to highly specific experimental designs and types of omics data. While some general methods do exist, they require specific data formats and experimental conditions. A major limitation in the field is the lack of a single-omics or multi-omics pipeline which can accept data in an unrefined, information-rich form before integration and subsequently generate output for further investigation. There is an increasing demand for a generic multi-omics pipeline to facilitate general-purpose data exploration and analysis of heterogeneous data. Therefore, we present our R multiomics pipeline as an easy-to-use and flexible pipeline that takes unrefined multi-omics data, sample information and user-specified parameters as input and generates a list of output plots and data tables for quality control and downstream analysis. We have demonstrated application of the pipeline on two separate COVID-19 case studies. We enabled limited checkpointing, where intermediate output is staged to allow continuation after errors or interruptions in the pipeline, and the pipeline generates a script for reproducing the analysis. Seamless integration with the mixOmics R package is achieved, as the R data object can be loaded and manipulated with mixOmics functions. Our pipeline can be installed as an R package or from the git repository, and is accompanied by detailed documentation with walkthroughs on two case studies. The pipeline is also available as Docker and Singularity containers.
In this article, Chen and colleagues present an R pipeline for multi-omics data analysis that can potentially accept unrefined data and produce convenient outputs. The pipeline is available as an R package and as Docker/Singularity containers. It is built on top of and closely integrated with the popular mixOmics package. Though this work could be useful, in its current form it is unfortunately unclear what the new contributions are.
In the present study, "multiomics: A user-friendly multi-omics data harmonisation R pipeline", the authors tried to develop a tool for multiple omics integration and analysis. The problem is of utmost importance. However, the tool needs more work to be suitable for publication. The major problem is that installation is not easy at all for non-expert users. I asked several biological researchers to install the tool, but they could not. In addition, the code below produces errors:


Introduction
A biological phenotype is an emergent property of a complex network of biological interactions. Since relying on a single layer of omics data to test a biological hypothesis results in an incomplete perspective of a biological system, interest in multi-omics data integration is steadily increasing as a means to decipher complex biological phenotypes. 1 We illustrate these points with a hypothetical case of measuring protein and transcript levels in the same set of matched samples. Each of these omics data layers contains independent information. A correlation score is then obtained between expression levels of the two blocks of omics data, resulting in an interpretable association measure. While correlation scores are a primitive metric, especially in this context of protein and transcript, 2 they represent an additional layer of data summarising valuable relationships. Identifying highly correlated features across independent blocks of omics data could potentially reinforce the validity of the result, while highlighting interesting features (strong positive or negative correlations) for further investigation [Figure 1]. Hence, exploiting such parallel measurements from a multi-omics perspective allows a more comprehensive and cohesive view of such complex and often dynamic systems, and this resolution would be expected to improve as more omics layers are added. Published multi-omics studies reporting novel biological insights that are not possible with single-omics data further support our points. [3][4][5][6][7][8][9] With the increasing volume of multi-omics data present in publicly accessible biological data repositories, [10][11][12] multi-omics data integration is expected to be the core strategy of modern and future biological data analyses.
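The hypothetical protein and transcript case above can be sketched in a few lines of base R. This is only an illustration on simulated data: the feature names, the noise level and the 0.7 threshold are assumptions chosen for the example, not values used by the pipeline.

```r
# Hypothetical matched samples: rows are samples, columns are features.
set.seed(1)
n_samples <- 20
transcript <- matrix(rnorm(n_samples * 3), nrow = n_samples,
                     dimnames = list(NULL, c("gene_A", "gene_B", "gene_C")))

# Simulated protein levels that partly track their transcripts.
protein <- transcript + matrix(rnorm(n_samples * 3, sd = 0.5), nrow = n_samples)
colnames(protein) <- c("prot_A", "prot_B", "prot_C")

# Cross-omics correlation: one score per transcript/protein pair.
cross_cor <- cor(transcript, protein, method = "pearson")

# Flag strongly associated pairs (the threshold is arbitrary, for illustration).
strong_pairs <- which(abs(cross_cor) > 0.7, arr.ind = TRUE)

print(round(cross_cor, 2))
print(strong_pairs)
```

Each cell of cross_cor is the association between one transcript and one protein across the matched samples, which is the "additional layer of data" the paragraph refers to.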
As a result, methods have been developed to leverage the multitude of data modalities in characterising biological systems. While many tools are available, most of these methods are heavily customised to fit a specific experimental design, and are not generic enough to handle most use cases. 1 Furthermore, many tools that claim to perform data integration actually perform high-level data aggregation, where datasets are processed individually and only summarised, high-level information is analysed together. 1 Of these algorithms, few perform data integration of multiple layers of omics data simultaneously, which we refer to specifically as "data harmonisation" to distinguish it from the more general term of "data integration". 1 While some "data harmonisation" algorithms exist, it is important to note that at this time, no end-to-end pipeline or framework exists which allows the user to quickly and easily input unrefined data, run a pipeline and export output data which can be used for downstream analyses. Therefore, to facilitate this, we developed multiomics, a flexible, easy-to-install and easy-to-use pipeline.

Figure 1. An illustration of a hypothetical multi-omics perspective on a simple biological system. The rectangles represent different layers of omics data (e.g. proteome, transcriptome and lipidome) while the circles represent features within their respective omics data layer. Black single-line arrows show correlation between features within the omics data (e.g. a regulatory factor) while blue double-lines show correlation between features across different omics data layers. A powerful abstraction of the system under study can be obtained by reviewing multiple layers of omics data holistically. (Figure labels: Omics data layer 1; Within-omics correlation; Cross-omics correlation.)
We present multiomics, 13 a pipeline targeted at bioinformaticians, which implements one of the state-of-the-art data harmonisation tools from the mixOmics R package. 14 It is portable, and can be installed as an R 15 package or used by cloning the associated git repository. 16 A series of diagnostic plots are generated automatically and compiled into a PDF file. There is seamless integration with mixOmics, where data generated by the pipeline is exported automatically as an R data object of mixOmics classes. As a form of checkpointing, the R data object is updated at every major stage of the pipeline, and can be loaded directly into the mixOmics suite of tools for further investigation or plot customisation. To increase reproducibility, command line arguments are also exported as a script file which can be rerun directly to reproduce the output. To improve usability, the option to provide command line arguments as a JSON file is also available.
Detailed documentation is provided both within the source git repository and as vignettes in the R package. Multiple installation methods are shown in the git repository to maximise accessibility of our pipeline for users. Additionally, walkthroughs of two case studies are included. Complete and detailed examples of input data format are also provided, including a sample dataset which can be loaded directly from the R package. In this manuscript, we summarise this information and show a minimum working example to highlight some of the features of our pipeline.

Implementation

Quick install
You can install this directly as an R package from GitLab:
install.packages("devtools")
library("devtools")
install_gitlab("tyagilab/sars-cov-2", subdir="multiomics")

Docker and singularity containers
Docker 17 and Singularity 18,19 images are also available if the user prefers to use containers directly. Note that you typically need root access to run Docker; if this is not possible, try Singularity.
# download the Docker image
docker pull tyronechen/multiomics:1.0.0
# check that it works correctly
docker run --rm -it tyronechen/multiomics:1.0.0 Rscript -e 'packageVersion("multiomics")'
# this opens a bash shell where you can use run_pipeline.R
docker run --rm -it --entrypoint bash tyronechen/multiomics:1.0.0
# copy the script from install location or repository as shown in the previous section
# once you have a copy of the script in your current working directory, you can run this command
Rscript run_pipeline.R -h

If you don't have root access, you can try Singularity. The Singularity image file is large and you may need to set $SINGULARITY_TMPDIR to a custom location with at least 1 GB of free space.
# set singularity tmpdir to a location of your choice
# if you are not in a HPC you can usually skip this
export SINGULARITY_TMPDIR=/path/to/directory
singularity pull multiomics.sif docker://tyronechen/multiomics:1.0.0
# copy the script from install location or repository as shown above and run
singularity exec multiomics.sif Rscript run_pipeline.R -h

Manual install
If the above automated install steps do not work, detailed manual installation instructions are available in the source git repository at https://gitlab.com/tyagilab/sars-cov-2/-/tree/master for conda and R.
You may need to install mixOmics from source. Follow the installation instructions on https://github.com/aljabadi/mixOmics#installation:
install_github("mixOmicsTeam/mixOmics")

The actual script used to run the pipeline is not directly callable but provided as a separate script. Running the following command will show you the path to the script. A copy of this is also available in the source git repository.
# this will show you the path to the script
system.file("scripts", "run_pipeline.R", package="multiomics")

Example input
Three elements are the minimum required input for the pipeline [Figure 2]. First, at least two files corresponding to omics data blocks are required. Next, a file containing biological class information is required. Finally, a list of unique names labelling each data block is required. Examples of these input files and their internal data structure as they appear in the pipeline are shown.
Examples of these data and class files for two case studies are included in the source git repository.

Running the pipeline
The pipeline is run with the command Rscript run_pipeline.R, passing a list of command line arguments either as strings of text or in a JSON file (recommended). Running the actual pipeline can take some time. The main bottleneck is parameter tuning, which scales exponentially with the number of omics data blocks, but it is possible to disable this if the user wants to perform a test run or already knows the parameters. A secondary bottleneck is data imputation, which scales with the number of components used and the dimensions of the input data. If needed, it is possible to impute and export this imputed data either with the pipeline or with the underlying mixOmics function, and then substitute that as input. Data imputation can be skipped if it is not required. The user can adjust the number of CPUs if needed to speed up the process. We note that R data objects are periodically exported, allowing seamless integration with functions in the underlying mixOmics package when needed.
Code for the pipeline can be examined in detail from the git repository or individual functions can be inspected directly after loading the R multiomics package.
Examples of these output files for two case studies are included in the source git repository.

Use cases
We demonstrate a sample use case of our pipeline with reference to an earlier re-analysis of a published dataset. 13,26 Our tool takes as input at least two data files, each a table of quantitative information with samples as rows and features as columns. A list of names, one for each of these data blocks, is required. A file containing class information is also required as a list of newline-separated values. Examples of these data and class files for two case studies are included in the source git repository. Other command line arguments are also available, covering the distance metrics used for prediction, the number of features to select and others. A full description of these can be obtained by running Rscript run_pipeline.R -h, which will list every flag in detail. Because of the number of command line arguments, an option is provided to pass these parameters as a JSON file to the pipeline. Examples of these JSON files for two case studies are included in the source git repository.
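Before launching a long run, it can help to confirm that the three inputs agree with each other. The following is a minimal sketch under stated assumptions: check_inputs is a hypothetical helper that is not part of the multiomics package, the files are assumed to be tab-separated with sample identifiers in the first column, and the toy file names are invented for the example.

```r
# Hypothetical helper (not part of the multiomics package): check that
# data blocks, block names and class labels are mutually consistent.
check_inputs <- function(blocks, block_names, classes) {
  stopifnot(length(blocks) >= 2)                    # at least two omics blocks
  stopifnot(length(block_names) == length(blocks))  # one name per block
  stopifnot(!anyDuplicated(block_names))            # names must be unique
  n_rows <- vapply(blocks, nrow, integer(1))
  stopifnot(all(n_rows == length(classes)))         # one class label per sample
  invisible(TRUE)
}

# Write toy files to a temporary directory to mimic the on-disk layout.
tmp <- tempdir()
toy_block <- function(n_features) {
  matrix(rnorm(6 * n_features), nrow = 6,
         dimnames = list(paste0("s", 1:6), paste0("f", seq_len(n_features))))
}
paths <- file.path(tmp, c("block_a.tsv", "block_b.tsv"))
write.table(toy_block(4), paths[1], sep = "\t", quote = FALSE, col.names = NA)
write.table(toy_block(3), paths[2], sep = "\t", quote = FALSE, col.names = NA)
class_path <- file.path(tmp, "classes.tsv")
writeLines(c("A", "A", "A", "B", "B", "B"), class_path)

# Read them back: samples as rows, features as columns, one class per line.
blocks <- lapply(paths, read.table, sep = "\t", header = TRUE, row.names = 1)
names(blocks) <- c("block_a", "block_b")
classes <- readLines(class_path)

check_inputs(blocks, names(blocks), classes)
```

A mismatch, such as a class file with fewer lines than there are samples, stops with an error at the failing check rather than partway through the pipeline.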

Example data included within the multiomics package
Regarding input data, some example data 27 is provided as part of our R package. Alternatively, you may download this from our git repository directly. This is a subset of anonymised clinical data provided in a separate publication. 27

Example processing workflow
We provide a fully processed dataset as a guide for the user. The steps below can be reproduced by downloading the R data object with the following command:
url <- paste(
  "https://gitlab.com/tyagilab/sars-cov-2/",
  "/raw/master/results/case_study_2/RData.RData",
  sep="-"
)
download.file(url, "RData.RData")
load("RData.RData")
ls()
# [1] "argv" "classes" "data"
# [4] "data_imp" "data_pca_multilevel" "data_plsda"
# [7] "data_splsda" "diablo" "dist_diablo"
# [10] "dist_plsda" "dist_splsda" "linkage"
# [13] "mappings" "pca_impute" "pca_withna"
# [16] "pch" "perf_diablo" "tuned_diablo"
# [19] "tuned_splsda"

Inspecting the minimum required input (classes and data) reveals the following:

Imputing missing values is optional as PLS-derived methods can function without this step. However, we include this information in case the user would like to perform this step manually. Remaining missing values can be imputed by setting the user-specified --icomp flag. Imputation is effective when the quantity of missing values is <20% of the data. To investigate if the data has been significantly changed, the user can generate a correlation plot of the principal components before and after imputation. Since imputation can take a long time, especially for large datasets, the imputed data is saved by default and the user can load it in directly as input if desired.
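The two checks described above, the fraction of missing values and the agreement of principal components, can be sketched on simulated data. Note the assumptions: column-mean imputation is used here purely as a simple stand-in and is not the imputation method used by the pipeline, and the comparison is made against the complete simulated data, which is only available because the missing values were introduced artificially.

```r
set.seed(42)

# Toy block: 30 samples x 8 features driven by a shared latent factor.
latent <- rnorm(30)
full <- sapply(1:8, function(j) latent * runif(1, 0.5, 1.5) + rnorm(30, sd = 0.3))
colnames(full) <- paste0("feature_", 1:8)

# Knock out 10% of entries at random to mimic missing measurements.
with_na <- full
with_na[sample(length(full), size = round(0.10 * length(full)))] <- NA

# Fraction of missing values; the text suggests staying below 20%.
frac_missing <- mean(is.na(with_na))

# Stand-in imputation: replace each missing value with its column mean.
imputed <- with_na
for (j in seq_len(ncol(imputed))) {
  imputed[is.na(imputed[, j]), j] <- mean(imputed[, j], na.rm = TRUE)
}

# Compare the leading principal component of reference and imputed data.
pc_ref <- prcomp(full, scale. = TRUE)$x[, 1]
pc_imp <- prcomp(imputed, scale. = TRUE)$x[, 1]
pc_cor <- abs(cor(pc_ref, pc_imp))  # the sign of a component is arbitrary

cat("missing fraction:", frac_missing, "\n")
cat("|correlation| of PC1 scores:", round(pc_cor, 3), "\n")
```

A correlation close to 1 suggests the imputation has not substantially altered the main structure of the data, while a low value is a prompt to inspect the imputed block more closely.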
If the study design is longitudinal (e.g. has repeated measurements on the same sample), then the --pch flag should be enabled by the user. The user should pass in a file with the same format as the classes file, but containing information regarding the repeated measurements. 23,28 Providing this information allows the pipeline to adjust for this internally.

Method parameters
Most of the parameters for the machine learning algorithms are specified by the user. These cover the three methods PLSDA (partial least squares discriminant analysis), sPLSDA (sparse PLSDA) and multi-block sPLSDA (also known as DIABLO). The underlying methods are implemented within the mixOmics software package and more information is available on their website http://mixomics.org/. For each method, a distance metric is specified, either "max.dist", "centroids.dist" or "mahalanobis.dist". Unlike PLSDA, sPLSDA and multi-block sPLSDA focus on selecting a subset of the most relevant features and therefore require a user-specified list describing the quantity of features to be selected from the data. The number of components to derive for each method is also provided. For this section, several exploratory runs with a wide range can be carried out to find the optimal configuration of features, e.g. starting at 5,10,30,50,100, inspecting subsequent output and further narrowing the range. The user can specify a few additional special parameters to the multi-block sPLSDA (block.splsda) function. The linkage parameter is a continuous value from 0 to 1, and describes the type of analysis, with a value closer to 0 prioritising class discrimination and a value closer to 1 prioritising correlation between data sets. Meanwhile, setting the number of multi-block sPLSDA components to 0 causes the pipeline to perform parameter tuning internally. Note that this can take a long time, and scales exponentially per added block of omics data. The user can also specify the number of CPUs to be used for parallel processing, which mainly affects parameter tuning. Using our example, these arguments are provided here: 5,6,7,8,9,10, 5,6,7,8,9,10,30"
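Two of the points above lend themselves to a short worked example: how a single linkage value can be expressed as a block-by-block matrix, and why tuning cost grows exponentially with the number of blocks. This is a sketch under assumptions: the mixOmics block.splsda function accepts a design matrix, but how the pipeline maps its linkage value onto that matrix is assumed here, and the block names are taken from the case study 2 file names for illustration only.

```r
block_names <- c("lipidome", "metabolome")
linkage <- 0.1  # closer to 0: class discrimination; closer to 1: correlation

# Square matrix with `linkage` between distinct blocks and 0 on the diagonal.
design <- matrix(linkage, nrow = length(block_names), ncol = length(block_names),
                 dimnames = list(block_names, block_names))
diag(design) <- 0

# Candidate numbers of features to keep per block (starting range from the text).
keep <- list(lipidome = c(5, 10, 30, 50, 100),
             metabolome = c(5, 10, 30, 50, 100))

# Every combination across blocks is a candidate, so the count is k^blocks.
grid <- expand.grid(keep)
n_two_blocks <- nrow(grid)

# Adding a third block with the same candidates multiplies the grid again.
keep$proteome <- c(5, 10, 30, 50, 100)
n_three_blocks <- nrow(expand.grid(keep))

print(design)
cat("combinations with 2 blocks:", n_two_blocks, "\n")
cat("combinations with 3 blocks:", n_three_blocks, "\n")
```

With five candidate values per block, two blocks give 25 combinations and three give 125, which is why narrowing the range after an exploratory run pays off quickly.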

Performance metrics
To examine the performance of each method, "M-fold" or "leave-one-out" cross-validation is performed to generate error rate plots. To account for cases where sample classes are imbalanced, balanced error rates, which simply average the class-wise error rates, are also calculated and shown [Figure 3].
[Figure 3 legend: Overall.BER, Overall.ER; distance metrics: max.dist, centroids.dist, mahalanobis.dist]
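The difference between the overall and balanced error rates is easiest to see on a small imbalanced example. The labels below are invented for illustration and do not come from either case study.

```r
# Toy imbalanced problem: 8 controls and 2 cases, one case misclassified.
truth <- factor(c(rep("control", 8), rep("case", 2)))
pred  <- factor(c(rep("control", 8), "control", "case"), levels = levels(truth))

# Overall error rate: fraction of misclassified samples.
overall_er <- mean(pred != truth)

# Class-wise error rates, then their unweighted mean (balanced error rate).
classwise_er <- tapply(pred != truth, truth, mean)
ber <- mean(classwise_er)

print(classwise_er)
cat("overall error rate:", overall_er, "\n")
cat("balanced error rate:", ber, "\n")
```

Here the overall error rate is 0.1, which looks good, while the balanced error rate is 0.25 because half of the minority class is misclassified.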

Result visualisation
Results are exported in a series of plots and compiled into a pdf [Figure 4]. They can also be accessed internally from our provided R data object.

Output control
Pipeline output can be controlled by specifying a number of flags. By default, the pipeline deposits data in the current working directory. This behaviour can be easily modified. Setting outfile_dir specifies the master output directory. An R data object containing objects shown in the loaded RData file can be renamed with the rdata option, generating a file similar to the one used in this example. The plot flag defines the pdf file containing all graphical output as a multipage pdf of all plots generated in the pipeline. A reproducible script is generated and named by the user with the args flag (this defaults to Rscript.sh).

Reproducibility and integration with mixOmics
Finally, the pipeline has limited checkpointing built in. At each milestone in the pipeline, the relevant output is saved and written out as an RData file, similar to the one presented above. This allows the user to manually inspect the data and adjust it to their needs where needed. In the case of completed output, the user can further customise plots and data exports for publication or downstream analysis. Importantly, data objects are compatible with core mixOmics functions, allowing seamless integration with the mixOmics suite of tools if the user intends to extend or perform their own custom analysis workflows.
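One cautious way to inspect a checkpoint is to load it into its own environment, so that objects in the current workspace are not overwritten. The sketch below writes a toy file first so that it is self-contained; the object names classes and data match those listed in the example RData file, but their contents here are placeholders rather than pipeline output.

```r
# Stand-in for a checkpoint written by the pipeline (toy objects only).
checkpoint <- file.path(tempdir(), "RData.RData")
classes <- c("A", "A", "B", "B")
data <- list(block_1 = matrix(1:8, nrow = 4), block_2 = matrix(8:1, nrow = 4))
save(classes, data, file = checkpoint)

# Load into a dedicated environment to avoid overwriting workspace objects.
ckpt <- new.env()
loaded <- load(checkpoint, envir = ckpt)

# Inspect what is available before deciding how to continue.
print(sort(loaded))
str(ckpt$data, max.level = 1)
```

Objects retrieved this way, for example ckpt$data, can then be passed to other functions without affecting anything else in the session.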

Data availability
Source data
Primary data was generated by third parties and is publicly available. 27,29 For case study 1, translatome data is available from the source publication as Supplementary Table 1 and proteome data is available as Supplementary Table 2. For case study 2, the authors provided their data in an SQL database.
• Documentation in markdown format describing pipeline usage on two case studies.
• Input data files in plain text (see Source Data for more information).
• Graphical output as pdf files and feature weights as text files.
• Source code, including code to reproduce figures in this article and source code for the R package.
• Docker file specifications for use with Docker and singularity images.
The following underlying data is used in this article:
• data_lipidome.tsv (Text file as raw input data (lipidomics) for case study 2.)
• data_metabolome.tsv (Text file as raw input data (metabolomics) for case study 2.)
• classes_diablo.tsv (Text file as raw input data (biological classes) for case study 2.)
• RData.RData (R data object containing all input, intermediate and output data for case study 2.)
• manuscript_figures (Example output plots that can be generated by the pipeline.) 27,29
Code and data is available under the MIT license. Documentation is available under the CC-BY-3.0 AU license.

Extended data
The following extended data is available in the same repository:
• data/case_study_1 (All raw input data for case study 1.)
• data/case_study_2 (All raw input data for case study 2.)
• results/case_study_1 (Example output data for case study 1.)
• results/case_study_2 (Example output data for case study 2.)
Similar to underlying data, extended code and data is available under the MIT license. Documentation is available under the CC-BY-3.0 AU license.
# this will show you the path to the script
system.file("scripts", "run_pipeline.R", package="multiomics")

• Source code available from: https://gitlab.com/tyagilab/sars-cov-2
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.4562009
• License: MIT License. Documentation provided under a CC-BY-3.0 AU license.

The specific version numbers of the packages used are shown below, along with the version of the R installation.
> library(multiomics)
> sessionInfo()
# R version 4.0.3 (2020-10-10)

the practicalities of multi-omics data analysis can be brought up soon. Figure 1 is not contributing to the exposition and can be removed.

What do the following statements on Page 4 at the beginning of passage 3 mean?
○ "implementing one of the state of the art tools in data harmonisation from the mixOmics R package".
○ "It is portable with multiple implementations"
○ Fix: "run a pipeline and export output data which can be used for downstream analyses and further downstream analyses".