ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article

Identification of predictive cytokine biomarkers of scleroderma via local causal neighborhood methods

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 24 Oct 2017
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Background: Scleroderma is an autoimmune disease with established relationship between immune cytokines and prognosis. Therefore, it is necessary to identify and investigate the causal relationship between cytokines and scleroderma diagnosis and to use this information to identify predictive biomarkers of scleroderma status.
Methods: Forty scleroderma positive patients and twenty-four healthy controls have been included in this study. Twenty-nine cytokines implicated in scleroderma have been measured in the bronchoalveolar lavage fluid of these patients, and eight have been found to be univariately associated with scleroderma status.
Results: Using local causal neighborhood learning methods, we have found two cytokines, Osteoprotegerin (OPG), also known as osteoclast genesis inhibitory factor, or tumor necrosis factor receptor superfamily member 11B and macrophage inflammatory protein-1 delta, to be causally related to the scleroderma status. Logistic regression predictor based on these cytokines achieves 73% AUC for the task of identifying the scleroderma status.
Conclusions: Our results demonstrate the feasibility of developing predictive local causal neighborhood biomarkers of scleroderma status based on bronchoalveolar lavage fluid.

Keywords

Local causal neighborhood, Predictive modeling, Scleroderma, Cytokine biomarkers.

Introduction

Scleroderma, also known as systemic sclerosis (SSc), is an autoimmune disease of the connective tissues. The disease is characterized by accumulation of collagen and other connective tissue macromolecules in skin as well as internal organs. The disease is not considered heritable, although it is argued that genetic predisposition plays an important role in its development1.

The mechanisms that cause interstitial lung disease in scleroderma (SSc-ILD) remain poorly understood. It is well documented that fibrosis is associated with increased expression of the profibrotic cytokines and chemokines. Several cytokines such as transforming growth factor (TGF)2, connective tissue growth factor (CTGF)3, interleukins, tumor necrosis factor (TNF)4, oncostatin M (OSM)5 and others have been reported to be increased in bronchoalveolar lavage fluid (BALF) or serum from scleroderma patients when compared to healthy controls. It has been postulated that production of profibrotic cytokines and chemokines is likely to be key components in the pathology that leads to progressive pulmonary fibrosis in scleroderma6.

It has been suggested that T-helper (Th) cells and their associated cytokines may play a role in the pathophysiology of scleroderma7. Different subsets of Th cells play prominent roles in the progression of scleroderma7. Although heavily implicated in disease pathogenesis, immunoproteomic biomarkers do not currently provide an effective way to predict SSc-ILD8.

In this paper, we use computational local causal neighborhood (LCN) discovery techniques9,10 to identify the cytokines that have direct causal relationship to scleroderma status. Rather than reconstructing a detailed causal graph, LCN methods identify variables (cytokines) that are causally ‘close’ to a target variable (SSc-ILD). Based on theoretical results and empirical studies, this task can be achieved even from observational data under very common assumptions about the distribution of the data9,10. The causal nature of the identified variables is complemented by provably advantageous properties with respect to the predictive value of these variables. Specifically, the set of biomarkers in the LCN of a target node make all other biomarkers conditionally independent. This implies that LCN yield the most parsimonious yet maximally predictive diagnostic biomarkers. This has been demonstrated in a number of empirical studies with real biological data11,12. Therefore, we seek to develop a biomarker of SSc-ILD by discovering its LCN in cytokine data measured from the bronchoalveolar lavage fluid (BALF).

Materials and methods

Study cohort

Forty patients with SSc-ILD (20 African American: 5 males, 15 females, mean age 43.7 ± 11.0; 20 Caucasian: 10 males, 10 females, mean age 53.8 ± 11.6) and 24 healthy subjects (12 African American: 5 males, 7 females, mean age 32.0 ± 9.5; 12 Caucasian: 4 males, 8 females, mean age 29.6 ± 9.9), all nonsmokers, were examined.

BALF Preparation. Bronchoalveolar lavage was performed as a part of standard care after informed consent was obtained under a protocol approved by the Institutional Review Board for Human Research of Medical University of South Carolina. Recovered BALF was centrifuged at 500 g for 10 minutes at 4°C. Pellet was removed and subjected to total and differential cell counts. Supernatant was dialyzed against sterile distilled H2O overnight at 4°C, lyophilized, and stored at -80°C until assayed.

Cytokine Array. Lyophilized BALF powder was recovered in 5 mM Tris, pH 7.4 to a protein concentration of 1 mg/ml, and analyzed using Human Cytokine Antibody Array V from RayBiotech, Inc. (Norcross, GA) according to the manufacturer’s instructions. Briefly, 500 µg of BALF samples were incubated with array support at room temperature for 2 h followed by the incubation with cocktail of biotin conjugated antibodies at 4°C overnight. The arrays were then incubated with horseradish peroxidase-conjugated streptavidin at room temperature for 2 h, and developed by using enhanced chemiluminescence-type solution. The images were scanned and analyzed with the NIH Image software. The net optical density (OD) was obtained by subtracting a background measurement of the same size negative area from the OD measurement of the signal area. Positive controls (six per membrane) were used to normalize the results from different membranes being compared. The variation between two identical cytokine spots ranged from 0 to 10% in duplicated experiments.

The final dataset consisted of 28 measured cytokines. Figure 1 shows the description of the cytokine array map.

fe33d638-85ec-4af1-b73d-ee53f5128b59_figure1.gif

Figure 1. RayBiotech human cytokine array map.

Pos, positive; Neg, negative; ENA, epithelial cell-derived neutrophil-activating peptide; GCSF, granulocyte-colony stimulating factor; GM-CSF, granulocyte/macrophage-colony stimulating factor; GRO, growth-regulated oncogene; IL, interleukin; IFN, interferon; MCP, monocyte chemotactic protein; MCSF, macrophage colony stimulating factor; MDC, macrophage-derived chemokine; MIG, monokine induced by gamma interferon; MIP, macrophage inflammatory protein; RANTES, regulated upon activation in normal T cells, expressed, and secreted; SCF, stem cell factor; SDF, stromal cell-derived factor; TARC, thymus and activation regulated chemokine; TNF, tumor necrosis factor; EGF, epidermal growth factor; IGF, insulin-like growth factor; Ang, angiogenin; OSM, oncostatin; TPO, thrombopoietin; VEGF, vascular endothelial growth factor; PDGF, platelet-derived growth factor; BDNF, brain-derived neurotrophic factor; BLC, B-lymphocyte chemoattractant; IGFBP, insulin-like growth factor binding protein; IP-10, interferon-inducible protein 10; LIF, leukemia inhibitory factor; LIGHT, lymphotoxins, inducible expression, competes with HSV glycoprotein for HVEM, a receptor expressed on T-lymphocytes; MIF, microphage migration inhibitory factor; NAP, neutrophil activating peptide; NT, neurotrophin; PARC, pulmonary and activation regulatory chemokine; PIGF, placenta growth factor; TIMP, tissue inhibitors of metalloproteinases.

Data handling and preparation

The raw cytokine measurements have been log transformed. We have imputed the missing cytokine values from their 10 nearest neighbors identified by K-nearest neighbors (KNN) algorithm using Bioconductor package impute, version 3.413. Adjustment for sex and race have been performed by fitting linear regression model to individual cytokine values and extracting the residuals for further analysis. All analyses have been performed in R, version 3.2.514.

Univariate analyses

We have performed Welch t-test15 to find cytokines univariately associated with scleroderma status. We have adjusted for multiple comparisons using False Discovery Rate correction (FDR)16 with significance threshold set at 0.05. Table 2.

Local causal neighborhood biomarker selection

We define LCN of a variable as a set of variables in its vicinity. LCNs can be constructed in such a way as to maximize their utility for biomarker development. For instance, Markov blanket (MB) of a variable is an LCN defined as the most compact set of variables (cytokines) rendering other variables independent of a target variable (SS)17. Intuitively, MB is the optimal solution of variable selection problem for predictive purposes. In practice, this means that the selected biomarkers contain all essential predictive information about the target node and are causally close to it (causal parents, children or spouses). Inference of MB requires inference of causal directionality, which is infeasible in most biomedical datasets due to their limited sample size and complexity. A closely related LCN that preserves much of the predictive utility of MB is the parent-child set (PC-set), which only consists of direct causal parents and children of the target node. As opposed to MB the PC-set does not include common confounders of the effects of the target node, and thus can be slightly less predictive. Nonetheless, PC-sets are much easier to discover computationally. Overall, the major advantage LCNs for biomarker discovery is that the size of the biomarker is typically small and has a causal interpretation.

In order to infer LCN of SS, we have used the HITON-PC algorithm18 from causal explorer toolbox19 in MATLAB Release 2016b20. This algorithm performs a series of conditional independence tests (Fisher’s Z test with 0.05 significance threshold) to infer the PC-set of the target variable (SS).

Development of a predictive model based on the local causal neighborhood

We use logistic regression model for predictive analysis. To ensure accurate and unbiased estimates of predictive power of the local causal neighborhood, we have performed 1,000 cross validation (CV) runs of the predictive analysis. First, we have randomly split the sample into training (70% of the data) and testing sets preserving case-control balance in each of the splits. Next, we have identified the PC-set of SSc-ILD using HITON-PC algorithm. The cytokines in the PC-set served as the predictors in building the logistic regression model of SSc-ILD from the training set data. Finally, we estimated the performance of the logistic regression model on the data in the test set. The performance estimates from each of the 1,000 CV runs have been used to determine the overall predictive power of the PC-set.

Our main metric of predictive performance is area under receiver operating characteristic (ROC) curve (AUC). Curve specifies the relationship between the true positive rate (TPR) and the false positive rate (FPR) of the model under all possible model parameter thresholds. The value of AUC from 0.5 (no predictive signal) to 1 (perfect prediction). We have extracted the following additional performance metrics from the ROC: (i) mean classification success rate, defined as the average rate of correctly identifying the scleroderma status on the test set, over the entire simulation runs; (ii) mean optimal sensitivity, the average value of the optimal sensitivity of the predictive model over the entire simulation runs. (ii) mean optimal specificity, the average of the optimal specificity of the predictive model over the entire simulation runs.

Results

Cytokine array

Cytokine arrays were utilized to explore and compare the expression levels of multiple cytokines in BALF samples. The array revealed elevated expression of 28 cytokines in BALF from scleroderma patients when compared to controls, which were selected for further analysis. Figure 2.

fe33d638-85ec-4af1-b73d-ee53f5128b59_figure2.gif

Figure 2. A comparison of RayBiotech human cytokine array map for different.

Select cytokine expression locations are shown by arrows. In this instance Caucasian and African American controls are brought against Caucasian and African American with positive scleroderma status.

Scleroderma status is univariately associated with 8 out of 28 cytokines measured in BALF

From the 28 cytokines measured, 8 are significantly associated with scleroderma status at p-value of 0.05 (Table 1). After correction for multiple comparisons using FDR, 3 cytokines, MIP-1delta, VEGF and Osteoprotegerin,remain significantly associated. The same cytokines are significant after adjustment for subject sex and race.

Table 1. Demographic information of the cohort.

Cases (N=40)Controls (N=24)
SexMaleFemaleMaleFemale
1525915
RaceAfrican AmericanWhiteAfrican AmericanWhite
20201212

Table 2. Results of univariate screening of cytokines for association with SS.

MeanUnadjustedAdjusted
Healthy(sd)Scleroderma(sd)P-value*FDRP-value*FDR
MCP.12.002(1.85)3.010(1.48)0.0290.1100.0280.107
MIP.1delta2.887(1.34)3.833(0.74)0.0030.0300.0020.022
SCF1.184(1.14)1.900(0.85)0.0110.0750.0100.067
ANG3.934(0.79)4.390(0.50)0.0160.0800.0150.077
TPO2.286(1.03)2.863(1.07)0.0380.1280.0350.120
VEGF3.223(0.46)3.627(0.47)0.0010.0180.0000.006
Osteoprot2.439(0.91)3.186(0.57)0.0010.0180.0000.006
PARC4.130(0.87)4.588(0.24)0.0180.0800.0170.077

* Welch t-test, p-values significant at 0.05 are shown in bold.

Local causal neighborhood of SSc-ILD consists of OPG and MIP1delta

HITON-PC algorithm identifies two cytokines, OPG and MIP-1delta, as the members of the PC-set LCN of SS in both adjusted and unadjusted data. Figure 3 shows the inferred LCN for the scleroderma status.

fe33d638-85ec-4af1-b73d-ee53f5128b59_figure3.gif

Figure 3. Local causal neighborhood of the scleroderma status.

HITON-PC algorithm identified two cytokines, Osteoprot and MIP-1delta, to be in the PC-set of the scleroderma status (SS). 6 other cytokines are found to be univariately associated with SS. These additional cytokines as well as every other cytokine not in PC-set, however, are conditionally independent of the SS, given the PC-set. The causal connectivity of the remaining cytokines with SS is unknown.

In order to assess the stability of the PC-Set estimate, we performed a re-sampling analysis as described in the methods section. The histogram of the appearance of the different cytokines in the causal neighborhood of scleroderma status is shown in Figure 4. Osteoprot and MIP-1delta appeared most frequently in the PC-set LCN of SS. In the 1000 simulation runs, either one or both cytokines appeared in the PC-set of scleroderma status 847 times. The scatterplot of MIP1-delta against Osteoprot adjusted values (Figure 5) shows apparent concentration of scleroderma cases in the top right quadrant, indicating association of higher expression values of these cytokines with disease.

fe33d638-85ec-4af1-b73d-ee53f5128b59_figure4.gif

Figure 4. Osteoprot and MIP-1delta LCN biomarkers are robust to repeated cross-validation.

We performed 1,000 iterations of repeated cross validation to determine stability of the two cytokine biomarkers. The number of times each cytokine (adjusted for sex and race) appears in the PC-set LCN of SS is depicted on this plot. Osteoprot and MIP-1delta appear in the LCN of SS with high frequency, indicating high likelihood of their causal proximity to scleroderma status. These results are similar if no adjustments for sex and race are made (data not shown).

fe33d638-85ec-4af1-b73d-ee53f5128b59_figure5.gif

Figure 5. Scatterplot of Osteoprot vs. MIP-1delta.

The scatterplot shows the distribution of the two cytokines against each other in data adjusted by sex and race. Red and black points indicate cases and controls, respectively. The concentration of the red points in the upper right quadrant indicates potential presence of predictive signal towards disease status in these cytokines.

LCN cytokines predict SS with 73% AUC

We first fit the logistic regression model using all cytokines as predictors. The resulted model attains cross-validated predictivity indistinguishable from random predictor on the testing data and almost perfect predictivity in the training data. This indicates high levels of noise in the weakly predictive data, which requires feature selection before attempting to build predictive model in order to obtain accurate models and estimates of model predictivity. To estimate the predictive utility of PC-set LCN in an unbiased way, we have performed 1,000 iterations of repeated cross-validated predictive model building using logistic regression. In data adjusted for sex and race, the mean classification success rate is 68.46% (st. dev. 6.66%). The mean optimal sensitivity of the model is 65.49%( st. dev. 11.38%). The mean optimal specificity of the model is 73.44% (± 11.88% st. dev.). We estimate the AUC of the cross-validated model to be 73% (Figure 6), indicating non-negligible power of LCN biomarker for SSc prediction.

fe33d638-85ec-4af1-b73d-ee53f5128b59_figure6.gif

Figure 6. The receiver operating characteristic curve of the logistic regression model for SS prediction with sex and race adjusted cytokines.

The ROC curve shows the estimated values of true positive rate against the false positive rate for the predictive model. The estimated value for the area under curve is 73%. The significant difference of the AUC value from 0.5 confirms the strength of the identified biomarkers in predicting the SS.

Discussion

ILD is the leading cause of morbidity and mortality in scleroderma patients. The mechanisms leading to SSc lung disease remain unknown. However, a variety of cytokines and growth factors have been reported to be increased in BALF from SSc patients. It is known fact that, Osteoprotegerin acts as a competitive receptor for the membrane-bound receptor activator of nuclear factor-kappaB21. It is also known that, osteoprotegerin-deficient mice exhibit osteoporosis at early stage of life span with the activation of RANK ligand (RANKL)22. Macrophage inflammatory protein (MIP) has been previously shown to be associated with scleroderma diagnosis23. OPG in primary myelofibrosis (PMF) expressed significantly higher (up to 71-fold) when compared with prefibrotic cellular PMF and control cases24.

In this paper, we present a model for inferring the cytokines predictive of SSc-ILD using local causal neighborhood framework. The model identified osteoprotegerin (OPG) and macrophage inflammatory protein-1 delta (MIP-1delta), to be closely related to the scleroderma status. We tested the hypothesis that the identified cytokines can be used as predictive biomarkers.

We compared the biomarkers selected by LCN with those resulting from LASSO regression25. In our data, the regression model selects the same features as LCN. This suggests that LCN approach can result in equivalent inference to LASSO. However, the major advantage of an LCN feature selection approach is in the theoretical causal guarantees it provides. LCN identifies features based on their causal proximity to the target variable, something that LASSO regression feature selection cannot match.

In this study, clinical disease characterization and demographic information, other than sex and race, have not been incorporated in the predictive model of SS. In practice, we maintain that combining molecular biomarkers with clinical and demographic data should improve the quality of the predictor. Such a biomarker, however, will require a larger cohort of patients to be developed.

Understanding the role of BALF’s cytokines in pathogenesis of SSc-ILD may identify new targets for the development of diagnostic biomarkers predicting the biological behavior of the disease.

Data and code availability

ZENODO: SLE_cytokines - Identification of predictive cytokine biomarkers of scleroderma via local causal neighborhood methods data

Raw and imputed cytokine array expressions, relevant code provided in Matlab and R format, and the results of applying LASSO regression on the BALF cytokine arrays are provided in the repository. https://zenodo.org/record/1001044; DOI: 10.5281/zenodo.100104426

Ethics and consent

All procedures were performed as a part of patient standard care after informed consent was obtained under a protocol approved by the Institutional Review Board for Human Research of Medical University of South Carolina.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 24 Oct 2017
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Shojaee Bakhtiari A, S. Bogatkevich G and Alekseyenko AV. Identification of predictive cytokine biomarkers of scleroderma via local causal neighborhood methods [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2017, 6:1875 (https://doi.org/10.12688/f1000research.12563.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 24 Oct 2017
Views
13
Cite
Reviewer Report 11 Dec 2017
Kathryn Torok, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, USA 
Approved with Reservations
VIEWS 13
The authors present an interesting paper regarding the study of cytokines derived from BALF which offers the possibility to study local disease mechanisms and potential biomarkers associated with disease propagation. Most of the paper and references were dedicated to the ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Torok K. Reviewer Report For: Identification of predictive cytokine biomarkers of scleroderma via local causal neighborhood methods [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2017, 6:1875 (https://doi.org/10.5256/f1000research.13606.r28122)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
20
Cite
Reviewer Report 01 Dec 2017
Mark M Wurfel, Department of Medicine, University of Washington, Seattle, WA, USA 
Not Approved
VIEWS 20
This an interesting report that addresses an important clinical problem. However, the methods and presentation potentially obscure what the key findings from these data might be. The following should be addressed.
  1. There is very little data
... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Wurfel MM. Reviewer Report For: Identification of predictive cytokine biomarkers of scleroderma via local causal neighborhood methods [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2017, 6:1875 (https://doi.org/10.5256/f1000research.13606.r27316)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 24 Oct 2017
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.