Identification of predictive cytokine biomarkers of scleroderma via local causal neighborhood methods

Scleroderma is an autoimmune disease with established Background relationship between immune cytokines and prognosis. Therefore, it is necessary to identify and investigate the causal relationship between cytokines and scleroderma diagnosis and to use this information to identify predictive biomarkers of scleroderma status. : Forty scleroderma positive patients and twenty-four healthy Methods controls have been included in this study. Twenty-nine cytokines implicated in scleroderma have been measured in the bronchoalveolar lavage fluid of these patients, and eight have been found to be univariately associated with scleroderma status. : Using local causal neighborhood learning methods, we have Results found two cytokines, Osteoprotegerin (OPG), also known as osteoclast genesis inhibitory factor, or tumor necrosis factor receptor superfamily member 11B and macrophage inflammatory protein-1 delta, to be causally related to the scleroderma status. Logistic regression predictor based on these cytokines achieves 73% AUC for the task of identifying the scleroderma status. : Our results demonstrate the feasibility of developing Conclusions predictive local causal neighborhood biomarkers of scleroderma status based on bronchoalveolar lavage fluid.

Introduction Scleroderma, also known as systemic sclerosis (SSc), is an autoimmune disease of the connective tissues.The disease is characterized by accumulation of collagen and other connective tissue macromolecules in skin as well as internal organs.The disease is not considered heritable, although it is argued that genetic predisposition plays an important role in its development 1 .
The mechanisms that cause interstitial lung disease in scleroderma (SSc-ILD) remain poorly understood.It is well documented that fibrosis is associated with increased expression of the profibrotic cytokines and chemokines.Several cytokines such as transforming growth factor (TGF) 2 , connective tissue growth factor (CTGF) 3 , interleukins, tumor necrosis factor (TNF) 4 , oncostatin M (OSM) 5 and others have been reported to be increased in bronchoalveolar lavage fluid (BALF) or serum from scleroderma patients when compared to healthy controls.It has been postulated that production of profibrotic cytokines and chemokines is likely to be key components in the pathology that leads to progressive pulmonary fibrosis in scleroderma 6 .
It has been suggested that T-helper (Th) cells and their associated cytokines may play a role in the pathophysiology of scleroderma 7 .Different subsets of Th cells play prominent roles in the progression of scleroderma 7 .Although heavily implicated in disease pathogenesis, immunoproteomic biomarkers do not currently provide an effective way to predict SSc-ILD 8 .
In this paper, we use computational local causal neighborhood (LCN) discovery techniques 9,10 to identify the cytokines that have direct causal relationship to scleroderma status.Rather than reconstructing a detailed causal graph, LCN methods identify variables (cytokines) that are causally 'close' to a target variable (SSc-ILD).Based on theoretical results and empirical studies, this task can be achieved even from observational data under very common assumptions about the distribution of the data 9,10 .The causal nature of the identified variables is complemented by provably advantageous properties with respect to the predictive value of these variables.Specifically, the set of biomarkers in the LCN of a target node make all other biomarkers conditionally independent.This implies that LCN yield the most parsimonious yet maximally predictive diagnostic biomarkers.This has been demonstrated in a number of empirical studies with real biological data 11,12 .Therefore, we seek to develop a biomarker of SSc-ILD by discovering its LCN in cytokine data measured from the bronchoalveolar lavage fluid (BALF).

Study cohort
Forty patients with SSc-ILD (20 African American: Cytokine Array.Lyophilized BALF powder was recovered in 5 mM Tris, pH 7.4 to a protein concentration of 1 mg/ml, and analyzed using Human Cytokine Antibody Array V from RayBiotech, Inc. (Norcross, GA) according to the manufacturer's instructions.Briefly, 500 µg of BALF samples were incubated with array support at room temperature for 2 h followed by the incubation with cocktail of biotin conjugated antibodies at 4°C overnight.The arrays were then incubated with horseradish peroxidase-conjugated streptavidin at room temperature for 2 h, and developed by using enhanced chemiluminescence-type solution.The images were scanned and analyzed with the NIH Image software.The net optical density (OD) was obtained by subtracting a background measurement of the same size negative area from the OD measurement of the signal area.Positive controls (six per membrane) were used to normalize the results from different membranes being compared.The variation between two identical cytokine spots ranged from 0 to 10% in duplicated experiments.
The final dataset consisted of 28 measured cytokines.Figure 1 shows the description of the cytokine array map.

Data handling and preparation
The raw cytokine measurements have been log transformed.We have imputed the missing cytokine values from their 10 nearest neighbors identified by K-nearest neighbors (KNN) algorithm using Bioconductor package impute, version 3.4 13 .Adjustment for sex and race have been performed by fitting linear regression model to individual cytokine values and extracting the residuals for further analysis.All analyses have been performed in R, version 3.2.5 14 .

Univariate analyses
We have performed Welch t-test 15 to find cytokines univariately associated with scleroderma status.We have adjusted for multiple comparisons using False Discovery Rate correction (FDR) 16 with significance threshold set at 0.05.Table 2.

Local causal neighborhood biomarker selection
We define LCN of a variable as a set of variables in its vicinity.
LCNs can be constructed in such a way as to maximize their utility for biomarker development.For instance, Markov blanket (MB) of a variable is an LCN defined as the most compact set of variables (cytokines) rendering other variables independent of a target variable (SS) 17 .Intuitively, MB is the optimal solution of variable selection problem for predictive purposes.In practice, this means that the selected biomarkers contain all essential predictive information about the target node and are causally close to it (causal parents, children or spouses).Inference of MB requires inference of causal directionality, which is infeasible in most biomedical datasets due to their limited sample size and complexity.A closely related LCN that preserves much of the predictive utility of MB is the parent-child set (PC-set), which only consists of direct causal parents and children of the target node.As opposed to MB the PC-set does not include common confounders of the effects of the target node, and thus can be slightly less predictive.Nonetheless, PC-sets are much easier to discover computationally.Overall, the major advantage LCNs for biomarker discovery is that the size of the biomarker is typically small and has a causal interpretation.
In order to infer LCN of SS, we have used the HITON-PC algorithm 18 from causal explorer toolbox 19 in MATLAB Release 2016b 20 .This algorithm performs a series of conditional independence tests (Fisher's Z test with 0.05 significance threshold) to infer the PC-set of the target variable (SS).

Development of a predictive model based on the local causal neighborhood
We use logistic regression model for predictive analysis.To ensure accurate and unbiased estimates of predictive power of the local causal neighborhood, we have performed 1,000 cross validation (CV) runs of the predictive analysis.First, we have randomly split the sample into training (70% of the data) and testing sets preserving case-control balance in each of the splits.Next, we have identified the PC-set of SSc-ILD using HITON-PC algorithm.The cytokines in the PC-set served as the predictors in building the logistic regression model of SSc-ILD from the training set data.Finally, we estimated the performance of the logistic regression model on the data in the test set.The performance estimates from each of the 1,000 CV runs have been used to determine the overall predictive power of the PC-set.
Our main metric of predictive performance is area under receiver operating characteristic (ROC) curve (AUC).Curve specifies the relationship between the true positive rate (TPR) and the false positive rate (FPR) of the model under all possible model parameter thresholds.The value of AUC from 0.5 (no predictive signal) to 1 (perfect prediction).We have extracted the following additional performance metrics from the ROC: (i) mean classification success rate, defined as the average rate of correctly identifying the scleroderma status on the test set, over the entire simulation runs; (ii) mean optimal sensitivity, the average value of the optimal sensitivity of the predictive model over the entire simulation runs.
(ii) mean optimal specificity, the average of the optimal specificity of the predictive model over the entire simulation runs.

Cytokine array
Cytokine arrays were utilized to explore and compare the expression levels of multiple cytokines in BALF samples.The array revealed elevated expression of 28 cytokines in BALF from scleroderma patients when compared to controls, which were selected for further analysis.Figure 2.
Scleroderma status is univariately associated with 8 out of 28 cytokines measured in BALF From the 28 cytokines measured, 8 are significantly associated with scleroderma status at p-value of 0.05 (Table 1).After correction for multiple comparisons using FDR, 3 cytokines, MIP-1delta, VEGF and Osteoprotegerin,remain significantly associated.The same cytokines are significant after adjustment for subject sex and race.Local causal neighborhood of SSc-ILD consists of OPG and MIP1delta HITON-PC algorithm identifies two cytokines, OPG and MIP-1delta, as the members of the PC-set LCN of SS in both adjusted and unadjusted data.Figure 3 shows the inferred LCN for the scleroderma status.
In order to assess the stability of the PC-Set estimate, we performed a re-sampling analysis as described in the methods section.
The histogram of the appearance of the different cytokines in the causal neighborhood of scleroderma status is shown in Figure 4.
Osteoprot and MIP-1delta appeared most frequently in the PC-set LCN of SS.In the 1000 simulation runs, either one or both cytokines appeared in the PC-set of scleroderma status 847 times.The scatterplot of MIP1-delta against Osteoprot adjusted values (Figure 5) shows apparent concentration of scleroderma cases in the top right quadrant, indicating association of higher expression values of these cytokines with disease.

LCN cytokines predict SS with 73% AUC
We first fit the logistic regression model using all cytokines as predictors.The resulted model attains cross-validated predictivity indistinguishable from random predictor on the testing data and almost perfect predictivity in the training data.This indicates high levels of noise in the weakly predictive data, which requires feature selection before attempting to build predictive model in order to obtain accurate models and estimates of model predictivity.To estimate the predictive utility of PC-set LCN in an unbiased way, we have performed 1,000 iterations of repeated crossvalidated predictive model building using logistic regression.In data adjusted for sex and race, the mean classification success rate is 68.46% (st.dev.6.66%).The mean optimal sensitivity of the model is 65.49%( st.dev.11.38%).The mean optimal specificity of the model is 73.44% (± 11.88% st.dev.).We estimate the AUC of the cross-validated model to be 73% (Figure 6), indicating non-negligible power of LCN biomarker for SSc prediction.

Discussion
ILD is the leading cause of morbidity and mortality in scleroderma patients.The mechanisms leading to SSc lung disease remain unknown.However, a variety of cytokines and growth factors have been reported to be increased in BALF from SSc patients.It is known fact that, Osteoprotegerin acts as a competitive receptor for the membrane-bound receptor activator of nuclear factor-kappaB 21 .It is also known that, osteoprotegerin-deficient mice exhibit osteoporosis at early stage of life span with the activation of RANK ligand (RANKL) 22 .Macrophage inflammatory protein (MIP) has been previously shown to be associated with scleroderma diagnosis 23 .OPG in primary myelofibrosis (PMF) expressed significantly higher (up to 71-fold) when compared with prefibrotic cellular PMF and control cases 24 .
In this paper, we present a model for inferring the cytokines predictive of SSc-ILD using local causal neighborhood framework.The model identified osteoprotegerin (OPG) and macrophage inflammatory protein-1 delta (MIP-1delta), to be closely related to the scleroderma status.We tested the hypothesis that the identified cytokines can be used as predictive biomarkers.
We compared the biomarkers selected by LCN with those resulting from LASSO regression 25 .In our data, the regression model selects the same features as LCN.This suggests that LCN approach can result in equivalent inference to LASSO.However, the major advantage of an LCN feature selection approach is in the theoretical causal guarantees it provides.LCN identifies features based on their causal proximity to the target variable, something that LASSO regression feature selection cannot match.
In this study, clinical disease characterization and demographic information, other than sex and race, have not been incorporated in the predictive model of SS.In practice, we maintain that combining molecular biomarkers with clinical and demographic data should improve the quality of the predictor.Such a biomarker, however, will require a larger cohort of patients to be developed.
Understanding the role of BALF's cytokines in pathogenesis of SSc-ILD may identify new targets for the development of diagnostic biomarkers predicting the biological behavior of the disease.

Figure 2 .
Figure 2. A comparison of RayBiotech human cytokine array map for different.Select cytokine expression locations are shown by arrows.In this instance Caucasian and African American controls are brought against Caucasian and African American with positive scleroderma status.

Figure 3 .
Figure 3. Local causal neighborhood of the scleroderma status.HITON-PC algorithm identified two cytokines, Osteoprot and MIP-1delta, to be in the PC-set of the scleroderma status (SS).6 other cytokines are found to be univariately associated with SS.These additional cytokines as well as every other cytokine not in PC-set, however, are conditionally independent of the SS, given the PC-set.The causal connectivity of the remaining cytokines with SS is unknown.

Figure 4 .
Figure 4. Osteoprot and MIP-1delta LCN biomarkers are robust to repeated cross-validation.We performed 1,000 iterations of repeated cross validation to determine stability of the two cytokine biomarkers.The number of times each cytokine (adjusted for sex and race) appears in the PC-set LCN of SS is depicted on this plot.Osteoprot and MIP-1delta appear in the LCN of SS with high frequency, indicating high likelihood of their causal proximity to scleroderma status.These results are similar if no adjustments for sex and race are made (data not shown).

Figure 5 .
Figure 5. Scatterplot of Osteoprot vs. MIP-1delta.The scatterplot shows the distribution of the two cytokines against each other in data adjusted by sex and race.Red and black points indicate cases and controls, respectively.The concentration of the red points in the upper right quadrant indicates potential presence of predictive signal towards disease status in these cytokines.

Figure 6 .
Figure 6.The receiver operating characteristic curve of the logistic regression model for SS prediction with sex and race adjusted cytokines.The ROC curve shows the estimated values of true positive rate against the false positive rate for the predictive model.The estimated value for the area under curve is 73%.The significant difference of the AUC value from 0.5 confirms the strength of the identified biomarkers in predicting the SS.