netDx: Software for building interpretable patient classifiers by multi-'omic data integration using patient similarity networks

Patient classification based on clinical and genomic data will further the goal of precision medicine. Interpretability is of particular relevance for models based on genomic data, where sample sizes are relatively small (in the hundreds), increasing overfitting risk. netDx is a machine learning method to integrate multi-modal patient data and build a patient classifier. Patient data are converted into networks of patient similarity, a representation that is intuitive to clinicians, who also use patient similarity for medical diagnosis. Features passing selection are integrated, and new patients are assigned to the class with the greatest profile similarity. netDx has excellent performance, outperforming most machine-learning methods in binary cancer survival prediction. It handles missing data – a common problem in real-world data – without requiring imputation. netDx also has excellent interpretability, with native support to group genes into pathways for mechanistic insight into predictive features. The netDx Bioconductor package provides multiple workflows for users to build custom patient classifiers. It provides turnkey functions for one-step predictor generation from multi-modal data, including feature selection over multiple train/test data splits. Workflows offer versatility with custom feature design and choice of similarity metric; speed is improved by parallel execution. Built-in functions and examples allow users to compute model performance metrics such as AUROC, AUPR, and accuracy. netDx uses RCy3 to visualize top-scoring pathways and the final integrated patient network in Cytoscape. Advanced users can build more complex predictor designs with the functional building blocks used in the default design. Finally, the netDx Bioconductor package provides a novel workflow for pathway-based patient classification from sparse genetic data.


Introduction
Supervised learning methods are useful in clinical genomics for disease diagnosis, risk stratification for prognosis, and evaluating treatment response. Machine learning is a powerful analytic approach that can identify patterns separating patient groups, but interpreting models remains an active area of research 1. Interpretability is desirable to better understand the biological mechanism underlying the phenotype and for rational treatment design. It is also important for genomic applications, where most contemporary datasets have fewer than a thousand samples, increasing the risk of overfit models that do not independently replicate. Separately, most machine learning methods do not handle missing data – a common feature of real-world datasets – without prior data imputation or filtering. netDx is a supervised learning algorithm that classifies patients by integrating multimodal patient data 2. It is notable among machine learning methods for handling missing data without imputation, and excels at interpretability by enabling users to create biologically-meaningful groupings of features, such as grouping genes into pathway-level features. netDx integrates multi-modal data by converting each layer into a patient similarity network and then integrating these networks (Figure 1a).

Figure 1. (a) Patient similarity networks. Nodes are patients and edge weights measure pairwise similarity. The example shows a two-class problem (high- and low-risk patients), with four features shown as patient similarity networks: similarity for clinical (red), gene expression (green), metabolomic (blue) and mutation (orange) data. (b) Conceptual workflow for the netDx predictor. Samples are split into train and test samples, and training samples are subjected to feature selection (blue flow). Feature selection uses regularized regression to integrate networks, such that networks with non-zero regression weights have their feature score increased. This process is repeated with different subsamples of training data, up to a user-provided maximum number of times (featScoreMax), and is repeated for each patient label. Features passing a user-specified threshold are used to classify held-out samples. Patient networks that combine training and test patients are then integrated; only networks from features passing selection are used for this step. Label propagation is used to compute the similarity of each test patient to training samples from each label; a given patient is assigned to the class with the highest similarity. Average model performance is computed by running this whole process over several train/test splits. Features with consistently high scores can be used to classify an independent validation set.
This paper provides an introduction to the R-based software implementation of netDx for the Bioconductor system 3 and showcases common use cases. The details of the netDx algorithm and its performance have been previously published 2, though we provide a brief conceptual summary here (Figure 1b). As input, the user provides multiple sets of data measured on the same set of labelled patients. The user additionally provides functions to compute pairwise patient similarity and, optionally, rules to group measures from each data type into features (feature design). For example, gene expression measures could be grouped into features representing known biological pathways. As with other machine learning methods, the user specifies parameters for training the model, such as the threshold scores for feature selection. Consider an application to predict good or poor patient survival, using tumour-derived gene expression, DNA methylation and proteomic data. In this scenario, netDx is provided with three data tables, one per 'omic data type, a table with patient identifiers and known labels, and the grouping rule that one feature is to be created per data layer. Patients are then automatically split into training and test samples, and feature selection is performed using training samples. netDx uses the given feature-processing rules to convert data from different modalities into a common space of patient similarity networks 1,2. Feature selection is performed once per patient label, and features passing selection are used to classify patients from the held-out test data. Performance robustness is evaluated by repeating this feature selection and classification exercise for multiple train/test splits. The final model is created from features that scored highly in feature selection, a step that uses only training samples.
A feature may comprise an entire data layer, a single variable, or specified groupings; one example of the last is grouping gene-level measures into pathways, so that each pathway is a separate feature. Interpretability is aided by the pathway-level predictor design, which identifies cellular processes with predictive value.
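To make the grouping concrete, such rules can be expressed in R as a list-of-lists; the sketch below uses made-up pathway, gene, and variable names purely for illustration (it is not a real netDx input):

```r
# Illustrative grouping rules: one outer entry per data layer,
# one inner entry per feature. All names below are made up.
groupList <- list(
  rna = list(                      # gene expression: one feature per pathway
    CELL_CYCLE = c("CCND1", "CDK4", "RB1"),
    APOPTOSIS  = c("TP53", "BAX", "BCL2")
  ),
  clinical = list(                 # clinical data: one feature per variable
    age   = "patient.age_at_initial_pathologic_diagnosis",
    stage = "STAGE"
  )
)
# each inner entry later becomes one patient similarity network (feature)
length(groupList$rna)   # two pathway-level features
```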

Methods
Implementation
netDx 3 is integrated into the Bioconductor system, a high-quality computational biology software framework for genomic data analysis in the statistical programming language R 4. Figure 2 shows the workflow for building a model using the netDx software package; Table 1 describes the major function calls. netDx uses Bioconductor data structures and mechanisms for fetching, storing, and representing input data.

Operation
Running netDx requires a machine with one or more cores (Intel 1.7 GHz i7 or later, or equivalent), a minimum of 1 Gb RAM per thread, and 1 Gb of disk space. Feature selection is an embarrassingly parallel problem, and we highly recommend running the software on a multi-core machine to reduce compute time. netDx is currently supported on machines running OS X or Unix-based operating systems. The software requires the Java executable (v1.8 or higher) to be available on the system path, and will not work on recent Windows-based operating systems that lack this type of installation. Windows users can access netDx via a Docker image provided at https://hub.docker.com/repository/docker/shraddhapai/netdx. netDx v1.1.4 requires R >= 3.6 and Bioconductor release 3.11 or higher.
When building a predictor, patient data is provided to netDx as a MultiAssayExperiment object, a Bioconductor data structure used to represent multi-'omics experiments associated with a given set of samples. In our usage, data types are collected in the assays slot of the object; the sole exception is clinical data, which is provided as part of the sample metadata, using the colData slot. Grouping rules are provided as a nested list object, or list-of-lists (groupList). The outer list consists of one entry per data type, with corresponding groupings in the inner list. Assay names must be identical in the assays slot and in groupList.

Figure 2. netDx software workflow. The yellow box shows data provided to netDx to build the predictor. See use cases for examples. Patient data is provided as a MultiAssayExperiment object, with the patient metadata in the colData slot. Variable grouping rules, used for feature design, are provided as a list of lists. Each entry of the outer list corresponds to a given assay; each entry in the corresponding inner list corresponds to one group of measures and is used to create a single feature. For instance, in a pathway-based design, one entry in the outer list would be "gene expression", with each inner list containing genes grouped by pathways. In this scenario, each of these gene groupings generates a single pathway-level feature. Pathway definitions can be automatically fetched using the fetchPathwayDefinitions() function, or custom definitions can be provided in the GMT file format. For the workflow using sparse genetic data (see Use Case 3), such as CNVs, patient CNVs are provided to netDx as a GRanges object. In this instance, pathways are provided in GRangesList format.
The easiest way to build classifiers is to use the wrapper function, buildPredictor(). This function runs feature selection and classification over a specified number of train/test splits, and returns all associated feature scores and detailed classification results in a list object. Advanced users can create custom predictor designs by combining the individual steps used in buildPredictor() (Table 1).

Use cases
This section describes four use cases for building predictors with netDx. The first uses pathway-level features based on gene expression data to generate a binary classifier of breast cancer subtype. The second performs three-way classification of breast cancer subtype by integrating gene expression, DNA methylation and proteomic assays. The third builds a binary classifier of autism spectrum disorder diagnosis from sparse genetic mutations. The fourth involves prediction of tumour stage from somatic mutations that have been desparsified using prior knowledge of gene interaction networks.
Use case 1: Binary classifier from clinical and transcriptomic data, using pathway-level features
Introduction. In this example, we will build a binary classifier of the breast tumour Luminal A subtype from clinical data and gene expression data. We will use different rules to create features for each assay. Specifically:
• Clinical measures (e.g. age, stage): features are defined at the level of individual variables; similarity is defined as the normalized difference.
• Gene expression: features are defined at the level of pathways; similarity is defined as the pairwise Pearson correlation.
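The two similarity metrics above can be sketched in plain R; this is a toy illustration of the idea, not the netDx implementation (e.g. it assumes a non-constant clinical variable):

```r
# Normalized difference for a single clinical variable: similarity of 1 for
# identical values, 0 for the most dissimilar pair in the cohort.
normDiff <- function(x) {
  rng <- max(x) - min(x)              # assumes a non-constant variable
  1 - abs(outer(x, x, "-")) / rng
}

# Pathway-level similarity from expression: Pearson correlation between
# patients (columns) over the pathway's genes (rows).
pathwaySim <- function(exprMat) cor(exprMat, method = "pearson")

age <- c(30, 40, 50)
s <- normDiff(age)
s[1, 2]   # 0.5: patients aged 30 and 40, given a cohort range of 20 years
```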
Feature scoring is automatically performed over multiple random splits of the data into train and blind test partitions. Feature-selected networks are those that consistently score highly across the multiple splits (e.g. those that score 9 out of 10 in ≥70% of splits).

Data.
In this example, we use curated data from The Cancer Genome Atlas, through the Bioconductor curatedTCGAData package. The goal is to classify a breast tumour into either a Luminal A subtype or otherwise. The predictor integrates clinical variables selected by the user, along with gene expression data.
Here we load the required packages and download clinical and gene expression data.

We will work only with the gene expression data in this example:

brca <- suppressMessages(curatedTCGAData("BRCA", c("mRNAArray"), FALSE))

This next code block prepares the TCGA data. In practice you would do this once and save the data before running netDx, but we run it here in full to show an end-to-end example.
When running the predictor (next section), the user simply passes this custom function as an input variable; i.e. the makeNetFunc parameter when calling buildPredictor().
Note: While netDx supports flexible experimental design, the user must ensure that the design – i.e. the similarity metric and variable groupings – is appropriate for a given application. Domain knowledge is recommended to support good design.
netDx requires that the makeNetFunc function take some generic parameters as input. These include:
• dataList: the patient data, provided as a MultiAssayExperiment object. Refer to online tutorials for MultiAssayExperiment to see how to construct these objects from data.
• groupList: sets of input data that will define individual networks (e.g. genes grouped into pathways).
• netDir: the directory where the resulting patient similarity networks will be stored.
The function must accept dataList, groupList, and netDir as input variables. The residual ... parameter is used to pass additional variables to makePSN_NamedMatrix(), notably numCores (the number of parallel jobs).
In this example, the custom similarity function does the following:
1. Creates pathway-level networks from RNA data using the default Pearson correlation measure, via makePSN_NamedMatrix(writeProfiles=TRUE, ...).
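Putting these pieces together, a custom similarity function might look like the sketch below, modelled on the package vignette; the assay name used here is an assumption and should be replaced with the names in your own MultiAssayExperiment object:

```r
# Sketch of a makeNetFunc: builds one Pearson-correlation network per
# pathway from the expression assay. Requires the netDx package for
# makePSN_NamedMatrix(); the assay name is an assumption.
makeNets <- function(dataList, groupList, netDir, ...) {
  netList <- c()
  rna <- "BRCA_mRNAArray-20160128"   # assumed assay name
  if (!is.null(groupList[[rna]])) {
    netList <- makePSN_NamedMatrix(
      dataList[[rna]], rownames(dataList[[rna]]),
      groupList[[rna]], netDir,
      verbose = FALSE, writeProfiles = TRUE, ...)
  }
  unlist(netList)
}
```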
Build predictor. Finally, we call the function that runs the netDx predictor. We provide:
• the number of train/test splits over which to collect feature scores and average performance (numSplits),
• the maximum score for features in one round of feature selection (featScoreMax),
• the threshold for calling feature-selected networks in each train/test split (featSelCutoff); only features scoring this value or higher will be used to classify test patients, and
• the information to create the PSN, including patient data (dataList), how variables are to be grouped into networks (groupList), and the custom function to generate features (makeNetFunc).
Change numCores to match the number of cores available on your machine for parallel processing.
The call below runs two train/test splits. Within each split, it:
• splits data into train/test sets using the default 80:20 split,
• scores networks from 0 to 2 (i.e. featScoreMax=2), and
• uses networks that score ≥1 out of 2 (featSelCutoff) to classify test samples for that split.
These are unrealistically low values, set so that the example runs quickly. In practice a good starting point is featScoreMax=10, featSelCutoff=9 and numSplits=100, but these parameters may need to be tuned to the sample size and heterogeneity of the dataset. Datasets with high levels of heterogeneity or small sample sizes may benefit from increased sampling, i.e. a higher numSplits value. Increasing this setting increases the time needed to train the model but identifies patterns that generalize over a larger set of random subsamples.

Examine output
The results are stored in the list object returned by the buildPredictor() call. This list contains:
• inputNets: all input networks that the model started with.
• Split<i>: a list with results for each train/test split:
  - predictions: real and predicted labels for test patients.
  - accuracy: percent accuracy of predictions.
  - featureScores: feature scores for each label (a list with g entries, where g is the number of patient labels). Each entry contains the feature selection scores for the corresponding label.
  - featureSelected: features that pass feature selection; a list of length g, with one entry per label.
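To illustrate navigating this structure, the sketch below builds a mock object with the same shape as the buildPredictor() output (field names as listed above; all values invented) and averages test accuracy across splits:

```r
# Mock of the buildPredictor() return value: two splits, invented values.
out <- list(
  inputNets = c("PATHWAY_A", "PATHWAY_B"),
  Split1 = list(accuracy = 80),
  Split2 = list(accuracy = 75)
)
splits <- grep("^Split", names(out), value = TRUE)
acc <- vapply(splits, function(s) out[[s]]$accuracy, numeric(1))
mean(acc)   # average test accuracy across splits: 77.5
```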

Reformat results for further analysis
This code collects different components of model output to examine the results.

Compute model performance
After compiling the data above, plot the accuracy for each train/test split. Then create a ROC curve and a precision-recall curve, and plot the average AUROC and AUPR (Figure 3).

Examine feature scores and consistently high-scoring features. Use getNetConsensus() to convert the list data structure into a single table, one per patient label. Rows show train/test splits and columns show features, making it easy to identify features that consistently perform well.
We then use callFeatSel() to identify features that consistently perform well across the various train/test splits. Because this is a toy example, we set the bar low to get some features: here we accept a feature if it scores 1 or higher (fsCutoff=1) in even one split, setting fsPctPass to a low positive fraction (fsPctPass=0.05). Where features are scored out of 10, a reasonable setting is fsCutoff=9 and fsPctPass=0.7. This setting keeps features that score a minimum of 9 in at least 70% of the train/test splits.
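The consensus rule applied by callFeatSel() can be sketched as a toy reimplementation (not the package's code): keep features scoring at least fsCutoff in at least fsPctPass of the splits.

```r
# Toy reimplementation of the consensus rule behind callFeatSel().
# scoreTable: rows = train/test splits, columns = features.
consensusFeatures <- function(scoreTable, fsCutoff, fsPctPass) {
  passFrac <- colMeans(scoreTable >= fsCutoff)
  colnames(scoreTable)[passFrac >= fsPctPass]
}

scores <- cbind(PATH_A = c(10, 9, 9), PATH_B = c(9, 2, 1))
consensusFeatures(scores, fsCutoff = 9, fsPctPass = 0.7)   # "PATH_A"
```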
Visualize pathway features as an enrichment map. An enrichment map is a network-based visualization of pathway connectivity and is used in netDx to visualize themes in predictive pathway-based features 5 . It is used in conjunction with the AutoAnnotate Cytoscape app to identify clusters, and apply auto-generated labels to these 6 .
Use getEMapInput_many() to create the input that helps generate the enrichment map in Cytoscape.
Emap_res <- getEMapInput_many(featScores2, pathList,
    minScore=1, maxScore=2, pctPass=0, out$inputNets, verbose=FALSE)

Write the results to files that Cytoscape can read in:

gmtFiles <- list()
nodeAttrFiles <- list()
for (g in names(Emap_res)) {
    # write node attributes (feature scores) for this class
    outFile <- sprintf("%s/%s_nodeAttrs.txt", outDir, g)
    write.table(Emap_res[[g]][["nodeAttrs"]], file=outFile,
        sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
    nodeAttrFiles[[g]] <- outFile
    # write pathway definitions for this class in GMT format
    outFile <- sprintf("%s/%s.gmt", outDir, g)
    conn <- file(outFile, "w")
    for (cur in names(Emap_res[[g]][["featureSets"]])) {
        writeLines(sprintf("%s\t%s\t%s", cur, cur,
            paste(Emap_res[[g]][["featureSets"]][[cur]], collapse="\t")), con=conn)
    }
    close(conn)
    gmtFiles[[g]] <- outFile
}

Finally, plot the enrichment map. This step requires Cytoscape to be installed, along with the EnrichmentMap and AutoAnnotate apps, and the Cytoscape application must be open and running on the machine executing the code. This block is commented out for automated builds on Bioconductor, but a screenshot of the intended result is shown below (Figure 4).

plotEmap(gmtFiles[[1]], nodeAttrFiles[[1]], groupClusters=TRUE,hideNodeLabels=TRUE)
This example enrichment map is not terribly exciting because of the low number of pathway features permitted, the upper bound on feature-selection scores, and the low number of train/test splits in this demonstration example.
Here is an example of an enrichment map generated by running the above predictor with more real-world parameter values, and all available pathways ( Figure 5):

Visualize integrated patient similarity network based on top features.
We apply a threshold to define the most predictive features, and integrate these into a single patient similarity network. Such a network is useful for downstream operations such as ascertaining whether or not classes are significantly separated, and for visualization of results.
Here we define predictive features as those scoring 2 out of 2 in all train/test splits.
featScores2 <- lapply(featScores, getNetConsensus)
featSelNet <- lapply(featScores2, function(x) {
    callFeatSel(x, fsCutoff=2, fsPctPass=1)
})

Figure 5. Enrichment map of consistently high-scoring pathway features when running the breast tumour binary classifier with real-world parameters. This network is generated by the plotEmap() function, which uses the RCy3 Bioconductor package to programmatically call the Cytoscape network visualization software from within R and run the EnrichmentMap app 5-7. Nodes show pathway features that scored a minimum of 9 out of 10 in feature selection, in at least 70% of train/test splits; node fill indicates feature score. Edges connect pathways with shared genes. The larger yellow bubbles are auto-generated by the AutoAnnotate Cytoscape app 6,8; these thematically group top pathways, by clustering and word-frequency-based cluster annotation. See Figure 5 for an example of a more informative enrichment map produced by running a real-world example.
By setting calcShortestPath=TRUE, the function will also compute the pairwise shortest path for within- and across-group nodes. The result is shown as a set of violin plots, and a one-sided Wilcoxon-Mann-Whitney test is used to assign significance.
As with plotEmap(), this method must be run on a computer with Cytoscape installed and running. To bypass plotting the PSN in Cytoscape, set plotCytoscape to FALSE. This function call computes shortest-path distances within and among clusters (Figure 6) and plots the integrated PSN (Figure 7). This visualization and statistic are useful to ascertain whether patients of the same label are more similar in the integrated network; a within-class distance significantly smaller than the across-class distance is indicative of good class separation. The graph is generated using the plotIntegratedPatientNetwork() function. From left to right, it shows pairwise patient shortest-path distances: within patients of class "LumA"; between the two class labels; within patients of the residual class "nonLumA"; and between all patients in the network.
The integrated PSN can also be visualized as a tSNE plot (Figure 8).

Use case 2: Three-way classifier with clinical and three types of 'omic data
Introduction. In this example, we will use clinical data and three types of 'omic data – gene expression, DNA methylation and proteomic data – to classify breast tumours as one of three subtypes: Luminal A, Luminal B, or Basal. This example extends the one used to build a binary classifier (see Use Case 1).
We also use several strategies and definitions of similarity to create features: • Clinical variables: Each variable (e.g. age) is its own feature; similarity is defined as normalized difference.
• Gene expression: Features are defined at the level of pathways; i.e. a feature groups genes within the pathway. Similarity is defined as pairwise Pearson correlation.
• Proteomic and methylation data: Features are defined at the level of the entire data layer; a single feature is created for all of the proteomic data, and likewise for the methylation data. Similarity is defined as pairwise Pearson correlation.

Setup. Load the netDx package.

suppressWarnings(suppressMessages(require(netDx)))
Data. For this example, we download data from The Cancer Genome Atlas through the Bioconductor curatedTCGAData package. The fetch command automatically creates a MultiAssayExperiment object containing the data.

suppressMessages(library(curatedTCGAData))
We use the curatedTCGAData() command to explore available data types in the breast cancer dataset. In this call we fetch only the gene expression, proteomic and methylation data; setting dry.run=FALSE initiates the fetching of the data.

Rules to create features (patient similarity networks).
We will group gene expression data by pathways and clinical data by single variables. We will treat methylation and proteomic data each as a single feature, so each of those groups will contain the entire input table for those corresponding data types.
In the code below, we fetch pathway definitions from January 2018 from a source that auto-compiles these from curated pathway databases (http://download.baderlab.org/EM_Genesets). We choose the January 2018 source to be consistent with earlier published work, but normally the latest source would be downloaded. We group gene expression measures by pathways.
Grouping rules are accordingly created for the clinical, methylation and proteomic data.

##      Luminal_A Basal-like Luminal_B
## [1,]     57.14        100     28.57
## [2,]     58.62        100     45.00

On examining the confusion matrix above, we can see that the model perfectly classifies basal tumours, but performs poorly at distinguishing between the two types of luminal tumours. This performance is unsurprising because luminal and basal tumours have different molecular characteristics, with the latter being ER- tumours; in contrast, Luminal A and B are both types of ER+ tumours 9.
res <- out$Split1$predictions
print(head(res))  # the original call is truncated in the source; head() shown for illustration

Use case 3: Binary classifier using sparse genetic data and pathway-level features
netDx natively handles missing data, making it suitable for building predictors from sparse genetic data, such as the somatic DNA mutations frequently seen in cancer, and DNA copy number variants (CNVs). netDx handles missing data at two levels. First, netDx uses patient similarity networks, not input data, as its features, and missing data can be handled by the similarity metric used to make this conversion. For example, if similarity is defined as the Pearson correlation between gene expression measures at the pathway level, then omitting missing genes from the correlation calculation still allows the correlations, and thus the pathway-level network, to be computed. Second, where patients are missing a particular feature, the network integration step uses the information it does have. For example, in a scenario where the data consist of transcriptomic and proteomic measures, if a patient is missing transcriptomic data, the integration step will use only the proteomic data (network edges) for that patient.
This example demonstrates how to use netDx to build a predictor from sparse genetic data. Here we build a case/control classifier for autism spectrum disorder (ASD) diagnosis, starting from rare CNVs; for this, we use data from Pinto et al. 10. The design for this predictor is shown in Figure 9.
Design and adapting the algorithm for sparse event data. In this design, we group CNVs by pathways. The logic behind the grouping is prior evidence showing that genetic events in disease tend to converge on cellular processes relevant to the pathophysiology of that disease 10. CNVs are grouped into pathway-level features and patient similarity is binary; i.e. two patients have a similarity of one if they share CNVs in genes from the same pathway. Feature selection is iteratively performed on independent thirds of the sample set. This design uses an additional label-enrichment step that precedes feature selection. Label enrichment filters out networks with insufficient bias towards case-case edges, using a label-based permutation approach. Networks with significant label enrichment are used in feature selection. Scores from all three feature-selection splits are added to get a final score for each feature, with a maximum attainable score of 30. Test patients are classified as cases if they carry a CNV in a pathway that passes feature selection.

Binary similarity and label enrichment
In this design, similarity is defined as a binary function, a strategy that has advantages and drawbacks. In plain terms, if two patients share a mutation in a pathway, their similarity for that pathway is 1; otherwise it is zero.
This binary definition, while conceptually intuitive, increases the false positive rate in the netDx feature selection step. That is, networks with even a single case sample will get a high feature score, regardless of whether that network is enriched for case samples.
To counter this problem, we introduce a label-enrichment step in the feature selection. A bias measure is first computed for each network, such that a network with only cases scores +1, one with only controls scores -1, and one with an equal number of both scores zero. Label enrichment compares the bias in each real network to the bias in that network under label-permuted data, and assigns an empirical p-value as the proportion of permutations in which the label-permuted network has a bias as high as the real network. Only networks with a p-value below a user-assigned threshold (default: 0.07) pass label enrichment, and feature selection is limited to these networks. In netDx, label enrichment is enabled by setting enrichLabels=TRUE in the call to buildPredictor_sparseGenetic().
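The bias score and permutation p-value can be sketched with a toy reimplementation (not the netDx internals); here a network containing only cases is tested against cohort-wide label permutations:

```r
# Bias of a network toward cases: +1 if all members are cases,
# -1 if all are controls, 0 for an equal mix.
networkBias <- function(lab) {
  (sum(lab == "case") - sum(lab == "control")) / length(lab)
}

# Empirical p-value: fraction of label permutations whose bias is
# at least as high as the observed bias for this network's members.
enrichPval <- function(cohortLabels, members, nPerm = 1000) {
  obs <- networkBias(cohortLabels[members])
  perm <- replicate(nPerm, networkBias(sample(cohortLabels)[members]))
  mean(perm >= obs)
}

set.seed(1)
cohort <- c(rep("case", 50), rep("control", 50))
members <- 1:10                     # a network containing only cases
networkBias(cohort[members])        # +1: maximal bias toward cases
```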

Cumulative feature scoring
The other difference between this design and those for non-sparse data is the method of scoring features (Figure 9). The user specifies the number of times to split the data and run feature selection (numSplits). The algorithm then runs feature selection numSplits times, each time leaving 1/numSplits of the samples out. In each split, features are scored between zero and featScoreMax, using the same approach as for continuous-valued input. Feature scores are then added across the splits, so that a feature can score as high as numSplits*featScoreMax.
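A worked toy example of the cumulative score (invented numbers):

```r
# Scores add across splits, so the maximum attainable feature score
# is numSplits * featScoreMax.
numSplits <- 3; featScoreMax <- 10
splitScores <- rbind(c(10, 2), c(9, 0), c(10, 1))   # rows = splits
colnames(splitScores) <- c("PATH_A", "PATH_B")
totalScore <- colSums(splitScores)
totalScore[["PATH_A"]]        # 29, out of a possible 30
numSplits * featScoreMax      # 30
```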

Evaluating model performance
For a given feature-score cutoff, a patient is called a "case" if they have a genetic event in at least one pathway passing feature selection at that cutoff; otherwise, at that cutoff, they are labelled a "control". These calls are used to compute the false positive and true positive rates across the range of cutoffs, which ultimately generates a ROC curve.
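This construction can be sketched with toy data; for each patient, the score below stands for the best feature score among pathways in which that patient has a genetic event (all values invented):

```r
# TPR/FPR at each feature-score cutoff, from per-patient best scores.
rocPoints <- function(patientMaxScore, trueLabel, cutoffs) {
  t(sapply(cutoffs, function(k) {
    pred <- ifelse(patientMaxScore >= k, "case", "control")
    c(cutoff = k,
      tpr = mean(pred[trueLabel == "case"] == "case"),
      fpr = mean(pred[trueLabel == "control"] == "case"))
  }))
}

score <- c(9, 7, 2, 8, 1, 0)
label <- c("case", "case", "case", "control", "control", "control")
rocPoints(score, label, cutoffs = c(3, 8))  # one (FPR, TPR) point per cutoff
```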

suppressMessages(require(netDx)) suppressMessages(require(GenomicRanges))
Data. CNV coordinates are read in and converted into a GRanges object. As always, the sample metadata table – here the pheno object – must have ID and STATUS columns.

phenoFile <- sprintf("%s/extdata/AGP1_CNV.txt", path.package("netDx"))
pheno <- read.delim(phenoFile, sep="\t", header=TRUE, as.is=TRUE)
colnames(pheno)[1] <- "ID"  # statement truncated in source; this completion renames the patient identifier column

Group CNVs by pathways. The fetchPathwayDefinitions() function downloads pathway definitions from baderlab.org, but users may provide custom .gmt files as well. We use the BiocFileCache package to download gene definitions for the hg18 genome build and convert these to a GRanges object. The function mapNamedRangesToSets() then groups these gene extents into pathway-based sets, which effectively creates the grouping rules for netDx; it returns a GRangesList object.

path_GRList <-mapNamedRangesToSets(gene_GR,pathwayList)
Run predictor. Once the phenotype matrix and grouping rules are set up, the predictor is called using buildPredictor_sparseGenetic(). Note that, unlike with non-sparse data, the user does not provide a custom similarity function in this application; currently, the only option available is the binary similarity defined above. As discussed above, setting enrichLabels=TRUE to enable label enrichment is highly recommended to reduce the false positive rate.

Plot results. Feature selection identifies pathways that are consistently enriched for the label of interest; here, "case" status. From the diagnostic point of view, a patient with a genetic event in a selected feature – here, a CNV in a feature-selected pathway – is labelled a "case". "True positives" are therefore cases with CNVs in feature-selected pathways, while "false positives" are controls with CNVs in feature-selected pathways. These definitions are used to compute the ROC curve below (Figure 10).

dat <- out$performance_denEnrichedNets
plot(0, 0, type="n", xlim=c(0,100), ylim=c(0,100), las=1,
    xlab="False Positive Rate (%)", ylab="True Positive Rate (%)",
    bty='n', cex.axis=1.5, cex.lab=1.3,
    main="ROC curve - Patients in label-enriched pathways")
points(dat$other_pct, dat$pred_pct,
    col="red", type="o", pch=16, cex=1.3, lwd=2)

Figure 10. ROC curve for case/control classification in autism, using rare copy number variants (CNVs) in pathway genes. In this design, patients are classified as cases if they carry a CNV in pathways passing feature selection, and controls otherwise. Each dot in the graph shows the sensitivity/specificity for a given feature-selection cutoff.
We can also compute the AUROC and AUPR.
Pathway scores are also added across the splits, for a maximum possible total of 9 across the 3 splits (3 + 3 + 3). As before, running the predictor with all possible pathway-related features and realistic training parameters, such as numPermsEnrich=200L, featScoreMax=10L and numSplits=3L, identifies a much richer set of themes related to synaptic transmission and cell proliferation, consistent with the known biology of ASD as well as with the features identified in the original publication 10.
The nodes in Figure 11 have been reorganized to group clusters sharing a broader theme. Terms related to neurotransmission and synaptic plasticity are in the bottom-left, those related to the cell cycle and proliferation are in the top-right, and those related to immune function are in the bottom-right.
The dynamic range of feature scores is much larger as well, here ranging from 0 to 30. The resulting ROC curve in Figure 12 accordingly has 30 cutoffs at which specificity and sensitivity are evaluated, evidenced by 30 datapoints in that curve. This is in contrast to 9 cutoffs in the ROC curve shown in Figure 10.
Use case 4: Mutation-based classifier using indirect mutations inferred from known interaction networks

netDx provides the option of reducing the sparsity of mutation data by inferring "indirect mutations" using prior knowledge of gene-gene interaction networks. Conceptually, the logic is that if a patient has a mutation in a given gene, the mutation indirectly impacts interacting genes. Indirect mutations are inferred by propagating patient mutation labels over a gene-gene interaction network onto neighbouring genes. The resulting smoothed network is then used for downstream applications. This network-based smoothing improved mutation-based tumour class discovery in four types of cancer 12. For label propagation, we use an R-based implementation of random walk with restart, a popular strategy in bioinformatic applications [12][13][14][15]. The result of applying this strategy to a patient's binary somatic mutation profile is a non-sparse profile in which each gene is assigned a continuous score between zero and one, reflecting its network proximity to the patient's mutations. The propagation values are then ranked and binarized, with the top-ranked fraction set to one; this fraction defaults to 3% and is tunable. The binarization serves to limit inferred mutations to genes closest to the known mutations. For instance, genes distant from the patient's mutations would get a low propagation value and would be thresholded to zero, i.e. not considered to be mutated. The result of this step is a less sparse binary matrix, which serves as input data to the predictor.
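The label propagation step above can be sketched as follows. This is a minimal, illustrative implementation of random walk with restart, not the netDx code: `adj` is assumed to be a gene-gene adjacency matrix and `p0` a patient's binary mutation profile.

```r
# Hedged sketch of random walk with restart (RWR) over a gene-gene
# network, as described in the text. Illustrative only; parameter
# names and defaults are assumptions, not the netDx implementation.
rwr <- function(adj, p0, restart = 0.5, tol = 1e-6, maxIter = 100) {
  # column-normalize the adjacency matrix (guard against empty columns)
  W <- sweep(adj, 2, pmax(colSums(adj), 1), "/")
  p <- p0 / sum(p0)
  for (i in seq_len(maxIter)) {
    # walk one step with probability (1 - restart), else restart at p0
    pNew <- as.vector((1 - restart) * (W %*% p)) + restart * p0 / sum(p0)
    if (sum(abs(pNew - p)) < tol) break
    p <- pNew
  }
  p  # smoothed profile: each gene's proximity to the patient's mutations
}
```

The smoothed profile is then ranked and binarized, as described above.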
In this example, we use direct and inferred somatic mutations to classify Testicular Germ Cell Tumours (TGCT) 16 by binarized pathologic stage. As with the previous use case, we create pathway-level features to reflect that cancer progression occurs through a combination of genes acting in molecular networks corresponding to cancer hallmark processes such as cell proliferation and apoptosis 13,17. As in Use Case 3, the similarity measure used is the binary function: if two patients share a mutation in a pathway, their similarity for that pathway is one; otherwise it is zero.
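The binary pathway-level similarity can be sketched as below. This is an illustrative helper, not the netDx API; `mutMat` is assumed to be a genes x patients 0/1 matrix.

```r
# Hedged sketch of the binary pathway similarity described above: two
# patients are maximally similar for a pathway if each carries a
# mutation in at least one of its genes. Names are illustrative.
binPathwaySim <- function(mutMat, pathwayGenes, p1, p2) {
  g <- intersect(pathwayGenes, rownames(mutMat))
  as.numeric(any(mutMat[g, p1] == 1) && any(mutMat[g, p2] == 1))
}
```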

set.seed(8)
suppressWarnings(suppressMessages(require(netDx)))
suppressWarnings(suppressMessages(require(MultiAssayExperiment)))
Data. Clinical and genetic data are downloaded using the Bioconductor package curatedTCGAData. Mutations are converted to a binary matrix format where rows represent genes and columns represent patients; entry [i,j] is set to one if gene i has a somatic mutation in patient j, and zero otherwise.
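As a hedged sketch of this conversion, a long-format mutation table can be turned into the genes x patients binary matrix as below. The column names `gene` and `patient` and the function name are assumptions for illustration, not the netDx code.

```r
# Illustrative sketch: convert a long-format mutation table into a
# genes x patients 0/1 matrix, as described in the text.
mutListToMatrix <- function(df) {
  genes <- unique(df$gene)
  pats  <- unique(df$patient)
  m <- matrix(0L, nrow = length(genes), ncol = length(pats),
              dimnames = list(genes, pats))
  # set entry [i,j] to 1 for each observed (gene, patient) mutation
  m[cbind(match(df$gene, genes), match(df$patient, pats))] <- 1L
  m
}
```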

# download example nets from remote location for vignette
require(BiocFileCache)

netFileURL <- paste("http://download.baderlab.org/netDx/",
    "supporting_data/CancerNets.txt",sep="")
cache <- rappdirs::user_cache_dir(appname = "netDx")
bfc <- BiocFileCache::BiocFileCache(cache,ask=FALSE)
netFile <- bfcrpath(bfc,netFileURL)
cancerNets <- read.delim(netFile,sep="\t",header=TRUE,as.is=TRUE)

Finally, the smoothed matrix is binarized. Genes with a propagation value greater than a specified cutoff are set to one, and the rest to zero. This step ensures that genes with a low propagation value are not used: lower smoothed values correspond to genes farther from the original mutations, and setting these to zero signifies a lack of confidence that they were impacted. Thereafter, we define pathway-level patient similarity to be binary; i.e. if two patients share a mutation in genes from the same pathway, their mutual similarity is one; otherwise it is zero. The individual steps below use functions identical to those in the first use case above.

# Setup to build the predictor
pathwayList <- readPathways(fetchPathwayDefinitions("January",2018))

Plot the AUROC and AUPR curves (Figure 13):

predPerf <- plotPerf(predList, predClasses=st)

Examine features with the highest scores. Here, these are pathways with somatic mutations that best predict vital status.

more intuitive (Table 1). The current version of netDx includes a novel workflow for building a classifier from sparse genetic data (see Use Case 3), using the function buildPredictor_sparseGenetic(). We also added functionality to generate an integrated patient similarity network from features passing selection. The plotIntegratedPatientNetwork() function generates this network, computes statistics on pairwise shortest-path distances (Dijkstra distance) within and across labels, and automatically generates a network visualization in Cytoscape.
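The rank-and-binarize step described above can be sketched as follows; the function name and the 3% default follow the text, but the code itself is illustrative, not the netDx implementation.

```r
# Hedged sketch of binarizing a smoothed propagation profile: the
# top-ranked fraction of values (3% by default, per the text) is set
# to one, the rest to zero.
binarizeSmoothed <- function(propVec, topFrac = 0.03) {
  n <- ceiling(topFrac * length(propVec))
  as.integer(rank(-propVec, ties.method = "first") <= n)
}
```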
A number of software updates were made as part of Bioconductor integration. Unlike the previous version, where all user output was written to a specific output directory, all predictor output is now returned to users as R objects, and intermediate work is written to temporary directories by default. The turnkey predictor-building function no longer automatically generates a log file; rather, users are required to create their own log files using the R sink() function. Functions computing model performance and plotting no longer assume a directory structure created by the model-building step. Users now set random number generator seeds at the outset, instead of providing a seed as an input parameter to various functions. Automated network visualization in Cytoscape now uses RCy3, for programmatic access to Cytoscape from R.
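A minimal sketch of creating such a log file with sink() is shown below; the file name and the printed message are placeholders standing in for the real predictor calls.

```r
# Hedged sketch: capture console output in a log file with sink(),
# since the turnkey function no longer writes one automatically.
logFile <- file.path(tempdir(), "netDx_run.log")
con <- file(logFile, open = "wt")
sink(con)                         # redirect console output to the file
print("predictor run started")    # placeholder for buildPredictor() etc.
sink()                            # restore normal console output
close(con)
```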
Memory improvements were made to the Java implementation of the underlying GeneMANIA network integration algorithm 19,20, creating a modified version specifically for netDx. netDx incurs a relatively high memory footprint because each feature internally generates a similarity network of pairwise similarity measures, and network integration, a step in feature selection, requires keeping all these networks in memory. Certain grouping rules also incur a greater memory footprint than others. Notably, a model with pathway-level features converts one gene expression data matrix into ~2,000 pathway-level patient similarity networks; such a design is less scalable in the number of nodes than one that creates a single feature from all gene expression. We optimized netDx memory usage by customizing the underlying GeneMANIA Java application used for network integration. netDx uses a modified version of the GeneMANIA implementation that bypasses steps not required for the netDx pipeline, such as identifier conversion and steps involving file input/output.

Memory and computational time improvements were benchmarked by building binary classifiers for breast tumour subtype and schizophrenia case-control classification. The CommonMind Consortium 21 dataset (downloaded from Synapse: syn5607607) included 279 controls and 258 cases, for a total of 537 patients, with gene expression data from the prefrontal cortex organized into pathway-level features (1,735 pathways). The breast cancer data were part of the TCGA project 9, with tumour gene expression for 348 patients, including 154 Luminal A and 194 tumours of other subtypes, also organized into pathway-level features (1,706 pathways). In the benchmark, an approximately 70:30 split of samples was used for cross-validation; we measured training time for the predictor using the 70% training partition. All tests were performed on an Intel Xeon @ 2.6GHz machine with 126 GB of available RAM and 12 cores.
During benchmarking, threads had a fixed amount of RAM available, in discrete steps of 4 GB, 6 GB and 8 GB, and each predictor was built using only a single core. Benchmarking runs were parallelized using GNU parallel 22, and performance was averaged over four runs of the 10 queries for each dataset. Following the improvements, memory use dropped to one-third of the original amount.
With the updated software, the CommonMind dataset also required two-thirds of the time to build the predictor compared with the original version (Table 2).
Finally, the feature selection step now provides the option of using a Monte Carlo resampling strategy to select samples for iterative feature scoring. The previous version of the software required a fraction of samples to be held out, with that fraction directly tied to the maximum feature score. The Monte Carlo resampling approach should allow users to increase the upper bound of feature scores, even with smaller sample sizes.
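The idea can be sketched as follows: each scoring round draws an independent random training subset, so the number of rounds (and hence the maximum feature score) is no longer tied to a fixed held-out fraction. Function and parameter names here are illustrative, not the netDx API.

```r
# Hedged sketch of Monte Carlo resampling for feature scoring: draw
# featScoreMax independent random training subsets of the samples.
mcSplits <- function(sampleIDs, featScoreMax = 10L, trainProp = 0.8) {
  lapply(seq_len(featScoreMax), function(i) {
    # each round subsamples the cohort independently, with replacement
    # of samples across (not within) rounds
    sample(sampleIDs, size = round(trainProp * length(sampleIDs)))
  })
}
```

A feature's score is then the number of rounds in which it passes the scoring criterion.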

Conclusions
The updated netDx software provides an improved user experience for clinical research applications pertaining to risk stratification, treatment response, and the identification of biomarkers associated with patient subtypes. Method extensions will be required for further feature additions, such as the ability to predict continuous outcomes. Classification of borderline patients could be better controlled, perhaps via a user-specified margin. Similar to pathway-level grouping for gene expression data, other grouping strategies will be required for other types of genomic data, such as miRNA, single nucleotide polymorphisms, and brain imaging data.

Data availability
Underlying data

Data for the autism case/control classification 5 is provided as part of the netDx package.
Data for the breast cancer example are from The Cancer Genome Atlas 9,23, fetched via the curatedTCGAData package maintained in the Bioconductor repository.

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: We interacted with the authors during the review process to correct warnings and errors obtained while executing the protocols of the 4 use cases.

Reviewer Expertise: Systems and Network Biology, Bioinformatics, Computational Biology
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

Responses are shown in bold beneath each reviewer comment, the latter shown in italics.
The paper entitled "netDx: Software for building interpretable patient classifiers by multi-'omic data integration using patient similarity networks" is a companion paper to the article "netDx: interpretable patient classification using integrated patient similarity networks" published in PLoS Comp Bio in 2019. Its goal is to present an updated version of the R software implementation of netDx as a Bioconductor package.
The netDx tool proposes an approach to building a patient classifier from heterogeneous patient data, from clinical to omics. The availability as an R package is of interest to the community. In addition, the manuscript details 4 different use cases that could help interested readers apply the tool. We however had difficulties running the code provided in the use cases and obtained various errors and warnings, so have been in contact with the authors to try to solve the problems. However, debugging code necessitates many exchanges, and hundreds of lines of error output cannot go in a peer review. These problems are not yet solved.
Response: We thank the reviewers for taking the time and effort to work with us to resolve these issues. There were three main sources of errors that the reviewers encountered. The resolution for each is described below.
To better support Windows users we now provide Docker images of working environments with the latest version of netDx. The following text has been added to the "Operation" section: "Windows users can access netDx via a Docker image provided at https://hub.docker.com/repository/docker/shraddhapai/netdx."

Incompatibility with French locale: The reviewers tested netDx on a system with a French locale, which identified an unforeseen international incompatibility. netDx uses Java-based network integration software during feature selection. Parts of this software would break when provided with numbers using a comma as the decimal separator; as such, they were incompatible with several non-English locales. We have now fixed the issue in netDx v1.3.1 to ensure that all files passed from R to Java use a period as the decimal separator.
© 2020 Lê Cao K. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Kim-Anh Lê Cao
University of Melbourne, Melbourne, Australia

The authors illustrate full R workflows to apply netDx using four case studies with various ranges of classification difficulty and data analysis settings. The netDx package proposes various ways of grouping features, for example using known biological pathways, and several graphical outputs. I appreciated that netDx builds on MultiAssayExperiment for easier handling of multi omics data sets. The manuscript will be useful for readers eager to get started with netDx. Below are some suggestions for improvement of the manuscript.
Methodological aspects: I acknowledge that the original algorithm has been detailed in reference [2], however, the present manuscript gives some emphasis on the ability of netDx to handle missing values. A short statement describing how this is done would be helpful.
○ I suggest rewriting the sentence 'The final model is created by choosing features that consistently score highly.' in the introduction. On a first read, it appeared as if there was selection bias during the process.
○ I have some reservations regarding the representation of SEM in the AUROC figures; why not use SD?

○ Implementation aspects: While much effort and many improvements have gone into the netDx package v1.1.4, I believe that additional functions could be created to make it more user friendly. For example, many custom functions (e.g. makeNets, the code proposed to reformat the results and calculate the accuracy or to extract the results to Cytoscape, to list a few) could be recoded into more generic functions. I would encourage the authors to revisit the code they propose in these workflows and improve it where possible.
○ I am not sure Table 1 column 3 (function name in v1.0.23) and part of the software update paragraph is useful here. Presumably this could appear on the GitHub page and the NEWS files, unless the objective of this manuscript is also to update the users of the latest changes.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

"...feature, the network integration step uses what information it has. For example, in a scenario where the data consist of transcriptomic and proteomic measures, if a patient is missing transcriptomic data, the integration step will use only the proteomic data (network edges) for that patient."

I suggest rewriting the sentence 'The final model is created by choosing features that consistently score highly.' in the introduction. On a first read, it appeared as if there was selection bias during the process.