Two-step feature selection for predicting survival time of patients with metastatic castrate resistant prostate cancer

Motoki Shiga

doi:10.12688/f1000research.8201.1

Method Article

[version 1; peer review: 2 approved]

PUBLISHED 16 Nov 2016

Department of Electrical, Electronic and Computer Engineering, Gifu University, Gifu, Japan

This article is included in the Bioinformatics gateway.

Abstract

Metastatic castrate resistant prostate cancer (mCRPC) is the major cause of death in prostate cancer patients. Even though some treatment options for mCRPC have been developed, the most effective therapies remain unclear. Finding key clinical variables related to mCRPC is therefore important for understanding the mechanism of disease progression and for clinical decision making for these patients. The Prostate Cancer DREAM Challenge is a crowd-based competition to tackle this problem using new large clinical datasets. This paper proposes a procedure for predicting global risks and survival times of these patients, aimed at sub-challenges 1a and 1b of the Prostate Cancer DREAM Challenge. The procedure uses a two-step feature selection: the first step applies sparse feature selection to numerical clinical variables and statistical hypothesis testing of differences between survival curves to categorical clinical variables, and the second step applies forward feature selection to narrow the list of informative features. A Cox proportional hazards model fitted with the selected features was used to predict the global risk of each patient, and survival time was predicted by a linear model whose input is the median time computed from the hazard model. The challenge results demonstrated that the proposed procedure outperforms the state of the art model by correctly selecting more informative features on both the global risk prediction and the survival time prediction.

Keywords

Survival analysis, Cox proportional hazards model, feature selection

Corresponding author: Motoki Shiga

Competing interests: No competing interests were disclosed.

Grant information: This work is partially supported by JSPS KAKENHI 25870322.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2016 Shiga M. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

How to cite: Shiga M. Two-step feature selection for predicting survival time of patients with metastatic castrate resistant prostate cancer [version 1; peer review: 2 approved]. F1000Research 2016, 5:2678 (https://doi.org/10.12688/f1000research.8201.1) First published: 16 Nov 2016, 5:2678 (https://doi.org/10.12688/f1000research.8201.1) Latest published: 16 Nov 2016, 5:2678 (https://doi.org/10.12688/f1000research.8201.1)

Introduction

Prostate cancer is the most common malignant tumor among men and ranks third in terms of mortality after lung cancer and colorectal cancer. The major clinical treatment for prostate cancer is anti-androgen therapy, which inhibits the supply of male hormones to prostate cancer cells. However, the therapy cannot inhibit cancer cell growth for long because these cells can develop resistance to androgen deprivation. Prostate cancer that has progressed in this way is called metastatic castrate resistant prostate cancer (mCRPC), which is the major cause of death in prostate cancer patients¹,². Even though some treatment options for mCRPC have been developed, the most effective therapies remain unclear³. Finding key clinical variables related to mCRPC is an important first step for understanding the mechanism of disease progression and for clinical decision making for these patients. Halabi et al.⁴ identified key factors of mCRPC from many clinical variables by feature selection based on a Cox’s proportional hazards model with an L₁ penalty, i.e. a variant of Lasso for survival analysis⁶,⁷, and built an mCRPC prognostic model. This data-driven approach is important for correctly predicting patient health status when choosing treatments. To validate and improve such prediction models for mCRPC patients, larger scale clinical datasets collected from several clinical institutes are useful. The Prostate Cancer DREAM challenge in DREAM 9.5 (https://www.synapse.org/ProstateCancerChallenge) provided such datasets and an opportunity to tackle this problem using the wisdom of the crowd: participating teams were required to submit prediction models based on clinical variables from the comparator arms of four phase III clinical trials with over 2,000 mCRPC patients treated with first-line docetaxel.
My method for this challenge consists of a two-step feature selection procedure, which first performs both sparse feature selection⁷ and statistical hypothesis testing⁸, and then performs a forward feature selection⁹ to screen out non-informative features. The selected clinical variables were used to build a prognostic model to predict global risks of patients. For survival time prediction, my method further fitted a linear model to the median survival times⁵ computed from the fitted prognostic model. The final result of this DREAM challenge demonstrated that, in sub-challenge 1a, the proposed procedure outperforms Halabi’s model⁴ by correctly selecting more informative features for global risk prediction. In sub-challenge 1b, my method using these selected features predicted the survival time more accurately than most of the other teams’ methods.

Methods

Dataset and pre-process

Data across comparator arms of four phase III clinical trials were compiled, annotated, cleaned and made available through the Challenge, and remain available on the web site⁷. These datasets include over 150 clinical variables and over 2,000 mCRPC patients treated with first-line docetaxel. The output value to be predicted for unknown new patients is the survival time. The survival times of patients are not always observed because some patients are still alive when they are lost to follow-up or when the study ends. Thus the observed survival times are right-censored. For the training dataset, three of the clinical trial cohorts were provided, which include data for 476, 598, and 526 patients from clinical trials ASCENT-2 (Novacea, provided by Memorial Sloan Kettering Cancer Center)¹⁰, VENICE (Sanofi)¹¹, and MAINSAIL (Celgene)¹², respectively. For the test dataset, data for 470 patients were provided from clinical trial ENTHUSE-33 (AstraZeneca)¹³. The goal of this challenge was to correctly predict the global risk of death and the survival time of patients in the test dataset. In these datasets, clinical variables for some patients were missing. Missing values were imputed by the median of each numerical variable and by the most frequent value of each categorical variable.
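The imputation rule described above can be sketched in a few lines. This is a minimal Python illustration (the analysis in this paper was done in R), assuming that missing entries are encoded as `None`; the helper name `impute_column` and the toy values are hypothetical and are not taken from the Challenge data.

```python
from collections import Counter
from statistics import median


def impute_column(values, is_numeric):
    """Fill None entries with the median (numeric) or the most frequent value (categorical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from; leave the column unchanged
    if is_numeric:
        fill = median(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Toy columns; the values and the "YES"/"NO" coding are invented for illustration.
numeric_column = [12.1, None, 13.4, 11.0]
categorical_column = ["NO", "NO", None, "YES"]

numeric_filled = impute_column(numeric_column, is_numeric=True)
categorical_filled = impute_column(categorical_column, is_numeric=False)
```

In this sketch the fill value is computed from the column it is applied to; whether the original analysis computed it per cohort or over the pooled data is not stated in the text.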

Hazard model

A Cox proportional hazards model is assumed for the relationship between the clinical variables (input variables) of a patient and the survival time (an output variable)⁵. Let x be the clinical variables of a patient. The hazard function of the patient at time t is given by

$h (t | x) = h_{0} (t) \exp (β^{T} x),$

where h₀(t) is a baseline hazard function and β is a weight vector to be optimized from training data. When the weight β_d of the d-th clinical variable is large in absolute value, the variable is informative for predicting the survival time. On the other hand, when β_d = 0, the d-th clinical variable has no effect on the hazard. Thus correctly estimating β is the most important task in survival analysis. A common estimate is obtained by maximizing the partial log-likelihood function of N patients given by

$L (β) = \sum_{n = 1}^{N} δ_{n} [β^{T} x_{n} - \log {\sum_{j \in R_{n}} \exp (β^{T} x_{j})}],$

where x_n is a vector of clinical variables of the n-th patient, t_n is the observed time (survival time or censoring time) of the n-th patient, and δ_n is a binary indicator with δ_n = 1 for patients who died at time t_n and δ_n = 0 for patients who were right-censored at time t_n. R_n is the risk set at time t_n, i.e. the set of patients still under observation at that time. This estimation is affected by non-informative clinical variables (noise variables) because the training data are limited: the number of clinical variables is large relative to the number of patients. Before estimating the weight vector β in the hazard function, my method implemented a two-step feature selection to screen out non-informative clinical variables.
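To make the partial log-likelihood concrete, the following is a minimal Python sketch that evaluates L(β) for a small dataset. It is an illustration only, not the implementation used in this work (which relied on R packages). The function name `cox_partial_log_likelihood` and the toy data are hypothetical, and the risk set is assumed to contain every patient j with t_j ≥ t_n, so tied event times are handled in the simplest possible way.

```python
import math


def cox_partial_log_likelihood(beta, X, times, events):
    """L(beta) = sum_n delta_n * (beta^T x_n - log sum_{j in R_n} exp(beta^T x_j))."""
    # Linear predictor beta^T x_n for every patient.
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for n, (t_n, delta_n) in enumerate(zip(times, events)):
        if delta_n == 0:
            continue  # censored patients contribute only through the risk sets
        # Risk set R_n: patients still under observation at time t_n.
        risk_scores = [scores[j] for j, t_j in enumerate(times) if t_j >= t_n]
        total += scores[n] - math.log(sum(math.exp(s) for s in risk_scores))
    return total


# Toy data (invented): one covariate, three patients, the second one censored.
X = [[1.0], [0.0], [2.0]]
times = [5.0, 8.0, 3.0]
events = [1, 0, 1]

ll_at_zero = cox_partial_log_likelihood([0.0], X, times, events)
ll_at_one = cox_partial_log_likelihood([1.0], X, times, events)
```

With β = 0 every patient has the same score, so each death contributes minus the logarithm of the size of its risk set; here that gives −log 2 − log 3 = −log 6.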

Feature selection

The goal of feature selection is to divide the set of all clinical variables into a set of informative variables and a set of non-informative variables by optimizing the final scoring metric. However, this optimization is NP-hard, i.e. intractable in general. Thus my procedure performed this task heuristically in two steps: 1) screening numerical features by L₁ sparse penalized regression and categorical features by statistical hypothesis testing, and then 2) a forward sequential feature selection that narrows the list of selected features by optimizing the final scoring metric. For the first step, my procedure used a variant of LASSO for Cox’s proportional hazards model⁷ provided by R package glmpath¹¹. This approach requires choosing the weight of the L₁ penalty term. My method chose it automatically by minimizing the Akaike information criterion (AIC), which is a criterion for estimating the generalization error. Because this implementation is computationally expensive when there are many clinical variables, my procedure applied the sparse feature selection only to numerical variables. Categorical variables were evaluated using rank-based statistical hypothesis testing⁵,⁸. This method tests whether there is a significant difference between two or more survival curves that correspond to different values of a categorical variable. If the difference between the curves is statistically significant, the categorical variable might be related to the survival times of patients. Therefore, such variables should be selected for a survival time prediction model.
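As an illustration of the test applied to categorical variables, the sketch below computes a log-rank statistic for a variable with exactly two levels. It is a simplified Python stand-in and not the code used here: the actual analysis compared two or more curves with an R function, whereas the hypothetical helper `logrank_test` handles only two groups and assumes the unweighted log-rank statistic. The toy data are invented.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    first = labels[0]
    observed_minus_expected = 0.0
    variance = 0.0
    # Loop over the distinct times at which at least one death occurred.
    for t in sorted({ti for ti, di in zip(times, events) if di == 1}):
        n = sum(1 for ti in times if ti >= t)  # patients at risk
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == first)
        d = sum(1 for ti, di in zip(times, events) if ti == t and di == 1)
        d1 = sum(1 for ti, di, g in zip(times, events, groups)
                 if ti == t and di == 1 and g == first)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1.0 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0  # degenerate case: no information to separate the groups
    stat = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data (invented): all deaths observed, group "A" dies earlier than group "B".
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1] * 8
stat_separated, p_separated = logrank_test(times, events, ["A"] * 4 + ["B"] * 4)
stat_mixed, p_mixed = logrank_test(times, events, ["A", "B"] * 4)
```

A variable would pass the screening step when its p-value falls below the chosen significance level; in the toy data this happens for the separated grouping but not for the alternating one.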

Among the features selected as described above, my procedure further implemented a forward feature selection⁹ to narrow the list of clinical variables. At each step, the feature that most increases the integrated time-dependent AUC (iAUC)¹⁴, which is the final scoring metric in sub-challenge 1a, is added, and this is repeated until all variables have been added. After that, the set of clinical variables along this sequence that maximizes iAUC is selected. iAUCs were estimated by cross-validation (CV), which was performed by randomly splitting all training data into 90% training data and 10% test data. iAUC was estimated as the median of the ten calculated iAUC values.
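The forward selection loop can be sketched as follows. This is a minimal Python illustration under stated assumptions: `score` stands in for the cross-validated iAUC of a Cox model fitted on a feature subset, and `median_split_score` assumes ten independent random 90%/10% splits (the text could also be read as 10-fold cross-validation). All names, the feature labels "A" to "D" and the toy gains are hypothetical.

```python
import random
from statistics import median


def forward_selection(candidates, score):
    """Greedy forward selection; returns the best (subset, score) along the greedy path."""
    selected, remaining = [], list(candidates)
    path = []  # (feature subset, score) after each greedy addition
    while remaining:
        # Add the feature whose inclusion gives the highest score.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [best]
        remaining.remove(best)
        path.append((list(selected), score(selected)))
    # Keep the step of the path with the maximum score.
    return max(path, key=lambda step: step[1])


def median_split_score(n_patients, evaluate, n_repeats=10, test_fraction=0.1, seed=0):
    """Median of evaluate(train_idx, test_idx) over repeated random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n_patients))
        rng.shuffle(idx)
        n_test = max(1, round(n_patients * test_fraction))
        scores.append(evaluate(idx[n_test:], idx[:n_test]))
    return median(scores)


# Toy score (invented): each feature has a fixed gain, with a small cost per feature.
gains = {"A": 0.05, "B": 0.04, "C": 0.0, "D": -0.01}


def toy_score(subset):
    return 0.6 + sum(gains[f] for f in subset) - 0.005 * len(subset)


best_subset, best_score = forward_selection(["D", "C", "B", "A"], toy_score)
```

In a real run, `score` would call something like `median_split_score` with an `evaluate` function that fits the hazard model on the training indices and computes iAUC on the held-out indices.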

Prediction of global risk of death and survival time

After selecting informative features, the parameter β in the Cox proportional hazards function was optimized using only the selected clinical variables. Next, the hazard function was used to predict the global risk of death for each patient⁵. The survival time of each patient can be predicted as the median time, i.e. the time at which the estimated survival probability equals 0.5, computed from the hazard function⁵. However, the root mean squared error of this prediction was still large and the prediction was biased because of the right-censoring setting, as will be demonstrated experimentally later. To address this problem, my method fitted a linear model from the computed median times to the observed survival times in the training dataset. Survival time was then predicted by this linear regression model, whose input is the estimated median time of each patient.
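The correction step can be illustrated with a short Python sketch: read the median time off an estimated survival curve, then fit a straight line from median times to observed survival times and compare the root mean squared error before and after. The helper names and all numbers below are hypothetical, and the sketch does not address how censored patients were treated in the fit, which the text does not specify.

```python
import math


def median_survival_time(times, survival):
    """First time at which the estimated survival probability is 0.5 or lower."""
    for t, s in zip(times, survival):
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5 within the observed range


def fit_line(x, y):
    """Ordinary least squares for y ~ a * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rmse(predicted, observed):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Toy survival curve (invented): the probability drops to 0.5 at day 400.
curve_median = median_survival_time([100, 200, 400, 600], [0.9, 0.7, 0.5, 0.3])

# Toy training data (invented): median times overestimate the observed times.
median_times = [300.0, 500.0, 700.0, 900.0]
observed_times = [200.0, 300.0, 400.0, 500.0]

a, b = fit_line(median_times, observed_times)
corrected = [a * m + b for m in median_times]
rmse_before = rmse(median_times, observed_times)
rmse_after = rmse(corrected, observed_times)
```

Because the identity line is itself a linear model, a least-squares line fitted on the training data cannot have a larger training RMSE than the raw median times; the gain on new patients still has to be checked on held-out data.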

Results

Selected clinical variables

My method removed clinical variables having a lot of missing values and used only the 14 numerical clinical variables and 56 categorical clinical variables with fewer missing values. Feature selection for numerical clinical variables was first implemented using the L₁ penalized approach⁷ by function coxpath in R package glmpath (https://cran.r-project.org/web/packages/glmpath/glmpath.pdf). This function computes the entire regularization path for the L₁ penalized model by varying the weight of the penalty; only the steps of the path at which the weight parameter of a clinical variable becomes non-zero were checked. Table 1 shows the first 20 steps and the sequence of added clinical variables. Figure 1 shows the computed AIC scores of these steps. The best feature set (step) was selected by minimizing the AIC score. This procedure chose the 14th step and thus selected nine clinical covariates (ENTRT_PC, ALP, HB, AST, ECOG_C, NEU, PLT, PSA and LDH) as informative clinical variables.

Table 1. Selected clinical variables at each step of the regularization path.

Step	Clinical variable
1	ENTRT_PC
2	ALP
3	HB
6	AST
8	ECOG_C
11	NEU
12	PLT
13	PSA
14	LDH
15	CA
16	CREAT
18	ALT
19	WBC
20	TBILI

Figure 1. AICs of steps in the L₁ regularization path.
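The step selection by AIC can be sketched as below. This is a hedged Python illustration, assuming AIC = −2 log L + 2k with k taken as the number of non-zero coefficients at a step; how the R function used here counts degrees of freedom is not stated in the text. The function name `select_step_by_aic` and the toy path values are hypothetical and do not reproduce Figure 1.

```python
def select_step_by_aic(path):
    """path: list of (step, partial_log_likelihood, n_nonzero_coefficients)."""
    def aic(entry):
        _, log_likelihood, k = entry
        return -2.0 * log_likelihood + 2.0 * k
    # The chosen step is the one with the smallest AIC.
    return min(path, key=aic)


# Toy path (invented numbers): the fit improves quickly, then levels off.
toy_path = [
    (1, -1000.0, 1),  # AIC = 2002.0
    (2, -990.0, 2),   # AIC = 1984.0
    (3, -989.5, 3),   # AIC = 1985.0
    (4, -989.4, 4),   # AIC = 1986.8
]
best_step = select_step_by_aic(toy_path)
```

The penalty term 2k is what stops the path early: once an extra variable improves the log-likelihood by less than 1, the AIC starts to increase.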

On the other hand, differences between survival curves for categorical clinical variables were statistically tested using function survdiff in R package survival (https://cran.r-project.org/web/packages/survival/survival.pdf). Table 2 shows the ranking of clinical variables by p-value. The significance level was set to 0.05, and the procedure selected the categorical features ANALGESICS, MHGEN, MI, TURP, MHCARD, ACE_INHIBITORS, MHPSYCH and PROSTATECTOMY.

Table 2. p-value of statistical hypothesis testing for categorical clinical variables.

Rank	Clinical variable	p-value
1	ANALGESICS	9.8e-08
2	MHGEN	8.5e-03
3	MI	1.0e-02
4	TURP	1.2e-02
5	MHCARD	1.3e-02
6	ACE_INHIBITORS	2.6e-02
7	MHPSYCH	3.9e-02
8	PROSTATECTOMY	4.3e-02

Figure 2. iAUC at each step of the forward feature selection.

For the 17 clinical variables selected by these two feature selections, my procedure further implemented the forward feature selection described in the previous section. Figure 2 shows iAUC at each step of the forward feature selection. This figure shows that the step maximizing iAUC is the sixth step, which includes six clinical variables: ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to predict global risks and survival times of patients.

Prediction performance

The parameter vector β of a Cox proportional hazards model with the six selected clinical variables was optimized by maximizing the partial log-likelihood function. The global risks of death of patients in the test dataset were then predicted from the optimized model. The iAUC of the proposed method is 0.7671, whereas the iAUC of Halabi’s model is 0.7429; both values can be found in the ranking result of sub-challenge 1a on the web site of the Prostate Cancer DREAM Challenge (https://www.synapse.org/ProstateCancerChallenge). This result demonstrated that the proposed prediction outperforms Halabi’s method by correctly selecting informative features.

Furthermore, survival times of patients were predicted using median times computed from the optimized hazard model. Figure 3(a) shows predicted values and observed values in the training dataset. This result demonstrates that the estimation variance is large and that the center of the plotted data lies to the upper left of the diagonal line, meaning that the predicted values are biased. To reduce these prediction errors, the median survival times were transformed by a linear model. Figure 3(b) shows the prediction result after this transformation. These figures demonstrate that the proposed prediction reduces both the estimation bias and the variance. As a result, the root mean square error (RMSE) between true values and predictions is drastically improved, from 281.3 with median survival times to 198.7 with the proposed method. This prediction result in sub-challenge 1b of the Prostate Cancer DREAM Challenge was ranked in the group of top performers, even though the global risk prediction result in sub-challenge 1a ranked below the best 10 performers.

Figure 3. Predicted survival times in the training dataset.

Conclusions

This paper outlines a method for predicting the global risks of mCRPC patients for sub-challenge 1a and their survival times for sub-challenge 1b in the Prostate Cancer DREAM Challenge. The challenge result in sub-challenge 1b demonstrated that this procedure, which is based on the two-step feature selection and the correction of naïve survival time predictions from the optimized hazard model, outperformed most of the other teams’ methods. In particular, for survival time prediction, the correction method based on centering and reducing estimation variance works well to improve RMSE, the scoring metric of sub-challenge 1b. This analysis demonstrates that a naïve prediction from a basic model (Cox’s proportional hazards model) is not always optimal for a given evaluation metric. Thus a suitable transformation is necessary to optimize the metric.

This paper also provides a two-step feature selection procedure because using only a single feature selection method leaves a lot of non-informative features. By carefully selecting features with this two-step procedure, the global risk prediction outperformed Halabi’s model⁴ in sub-challenge 1a. This result demonstrated that multiple feature selection procedures are necessary to screen out non-informative features. Future work includes the validation of informative clinical variables selected not only by the method proposed here, but also by other top-performing methods. Table 3 compares the clinical variables selected by the proposed model with those selected by Halabi’s model⁴. Both models selected ALP, ECOG_C and HB, whereas each of the other eight clinical variables was selected by only one of the two models. Although selection results depend on the datasets used, the importance of these clinical variables should be further investigated using knowledge from clinical and biological research.

Table 3. Selected clinical variables by the proposed model and Halabi’s model⁴ (○: selected; ×: not selected).

Clinical Variables	Proposed Model	Halabi’s Model
ALB	×	○
ALP	○	○
ANALGESICS	×	○
AST	○	×
ECOG_C	○	○
HB	○	○
LDH	×	○
LIVER	×	○
MI	○	×
PLT	○	×
PSA	×	○

Data availability

The Challenge datasets can be accessed at: https://www.projectdatasphere.org/projectdatasphere/html/pcdc

Challenge documentation, including the detailed description of the Challenge design, overall results, scoring scripts, and the clinical trials data dictionary can be found at: https://www.synapse.org/ProstateCancerChallenge

The code and documentation underlying the method presented in this paper can be found at: http://dx.doi.org/10.7303/syn4229266¹⁵

Acknowledgement

Datasets were kindly provided by Celgene, Sanofi, Memorial Sloan Kettering Cancer Center and AstraZeneca, and were compiled in the Project Data Sphere® platform. I thank Sage Bionetworks and the organizers of the Prostate Cancer DREAM Challenge for providing the opportunity to carry out this interesting clinical data analysis.

This publication is based on research using information obtained from www.projectdatasphere.org, which is maintained by Project Data Sphere, LLC. Neither Project Data Sphere, LLC nor the owner(s) of any information from the web site have contributed to, approved or are in any way responsible for the contents of this publication.

References

1. Jemal A, Siegel R, Ward E, et al.: Cancer statistics, 2009. CA Cancer J Clin. 2009; 59(4): 225–249.
2. Ryan CJ, Smith MR, de Bono JS, et al.: Abiraterone in metastatic prostate cancer without previous chemotherapy. N Engl J Med. 2013; 368(2): 138–148.
3. Wu JN, Fish KM, Evans CP, et al.: No improvement noted in overall or cause-specific survival for men presenting with metastatic prostate cancer over a 20-year period. Cancer. 2014; 120(6): 818–823.
4. Halabi S, Lin CY, Kelly WK, et al.: Updated prognostic model for predicting overall survival in first-line chemotherapy for patients with metastatic castration-resistant prostate cancer. J Clin Oncol. 2014; 32(7): 671–677.
5. Kleinbaum DG, Klein M: Survival Analysis: A Self-Learning Text, Third Edition. Springer. 2012.
6. Zhang HH, Lu W: Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007; 94(3): 691–703.
7. Park MY, Hastie T: L₁-regularization path algorithm for generalized linear models. J R Statist Soc. 2007; 69(4): 659–677.
8. Harrington DP, Fleming TR: A class of rank test procedures for censored survival data. Biometrika. 1982; 69(3): 553–566.
9. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer. 2009.
10. Scher HI, Jia X, Chi K, et al.: Randomized, open-label phase III trial of docetaxel plus high-dose calcitriol versus docetaxel plus prednisone for patients with castration-resistant prostate cancer. J Clin Oncol. 2011; 29(16): 2191–2198.
11. Tannock IF, Fizazi K, Ivanov S, et al.: Aflibercept versus placebo in combination with docetaxel and prednisone for treatment of men with metastatic castration-resistant prostate cancer (VENICE): a phase 3, double-blind randomised trial. Lancet Oncol. 2013; 14(8): 760–768.
12. Petrylak DP, Vogelzang NJ, Budnik N, et al.: Docetaxel and prednisone with or without lenalidomide in chemotherapy-naive patients with metastatic castration-resistant prostate cancer (MAINSAIL): a randomised, double-blind, placebo-controlled phase 3 trial. Lancet Oncol. 2015; 16(4): 417–425.
13. Fizazi K, Higano CS, Nelson JB, et al.: Phase III, randomized, placebo-controlled study of docetaxel in combination with zibotentan in patients with metastatic castration-resistant prostate cancer. J Clin Oncol. 2013; 31(14): 1740–1747.
14. Hung H, Chiang CT: Estimation methods for time-dependent AUC models with survival data. Can J Stat. 2010; 38(1): 8–26.
15. Shiga M: Write-up for DREAM 9.5 Prostate Cancer DREAM Challenge, Synapse Storage. 2016.

Open Peer Review

Reviewer Report 28 Nov 2016

Ka Yee Yeung, Institute of Technology, University of Washington, Tacoma, WA, USA

Approved

https://doi.org/10.5256/f1000research.8821.r17685

This paper is generally well-written, with a clear and concise description of the problem and challenge. The author adopted a two-step feature selection procedure: a penalized L₁ regression for Cox PH model (R package "glmpath") in the first step, and forward selection in the second step. Features are selected to optimize the iAUC (integrated time-dependent AUC) in 10-fold cross validation.

Major comments:

I am confused about how the two-step feature selection procedure works. The author mentioned the following

"Among selected features described above, my procedure further implemented a forward feature selection to narrow the list of clinical variables."
"This figure shows that the step maximizing AUC is the sixth step which includes six clinical variables ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to predict global risks and survival times of patients."

Therefore, I assume the second step starts with the features selected from the first step. However, the features shown in Table 2 don't appear to be a subset of the features shown in Table 1. Also, the feature "MI" doesn't appear to be in Table 1.

Minor comments:

The difference between sub-challenge 1a and sub-challenge 1b is not documented in the Introduction. Please explain that in sub-challenge 1a, the submissions consist of the risk scores, while in sub-challenge 1b, the submissions consist of the predicted survival time.
Under Results and "Selected clinical variables", the author mentioned that "My method removed clinical variables having a lot of missing values and then it used only 14 numerical clinical variables and 56 categorical clinical variables with less number of missing values.". What are the exact criteria for filtering clinical variables given that there are 150+ clinical variables to start with?
Please explain what the clinical variables mean (e.g. ENTRTPC, ALP, HB, AST, ECOGC, NEU, PLT, PSA and LDH in Table 1).
Please expand the captions for Table 1 and Table 2 to put these tables in the context of the 2-step feature selection procedure.
In Table 3, I assume the circle means "yes" and the cross means "no". Please add a legend to the caption.

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report 25 Nov 2016

Niels Richard Hansen, Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

Søren Wengel Mogensen, Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

Approved

https://doi.org/10.5256/f1000research.8821.r17683

This paper offers methods to calculate patient risk scores and predict survival times from proportional hazard models in the context of the Prostate Cancer DREAM Challenge. The author used a two-step feature selection procedure by first using a combination of the LASSO and significance testing and then using a forward selection method.

The challenge consisted of two parts. In one part the contestants were to assign global risk scores to patients and in the other they were to predict survival times. The author states that the results of the methods in question for the former outcome did not make it into the top-10 of the challenge. However, the paper seems to conclude that the two-step feature selection is superior to one-step feature selection. This is possibly based on a comparison with the DREAM benchmark model only. In this case, the paper would benefit from a more specific statement.

For the feature selection it seems unclear if the LASSO variable selection was done conditionally on the categorical predictors (without penalizing their coefficients) or marginally on only the continuous predictors.

Cross-validation seems to have been carried out incorrectly in the sense that only the second step (the forward selection) and not the first step was cross-validated. Whether this has consequences for the quality of the selection is unclear, but the estimated iAUC-values reported in Figure 2 are suspiciously large – and they definitely overestimate the validation iAUC.

For predicting survival times, the author first used a fitted proportional hazards model to estimate median survival times. Then observed survival times were regressed linearly on the predicted medians. This estimated a linear transformation, which could be used to transform predicted medians to means. The paper would benefit from a brief discussion of the motivation behind this approach. It is stated that the linear transformation “reduces both the estimation bias and variance”, which is unclear as it is not stated what we’re aiming to estimate. Arguably, estimating the means from the medians should improve the performance as the RMSE is used to score the predictions.

Minor comments:
p. 5: 1-b→1b

Table 3: Please add information to the caption about what the symbols mean. It is clear from reading the paper that “open circle” means “selected”, but that is not self-evident.

Competing Interests: No competing interests were disclosed.

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

CITE

Report a concern

Respond or Comment

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 16 Nov 2016

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2
Version 1 16 Nov 16	read	read

Niels Richard Hansen, University of Copenhagen, Copenhagen, Denmark

Søren Wengel Mogensen, University of Copenhagen, Copenhagen, Denmark
Ka Yee Yeung, University of Washington, Tacoma, USA

Comments on this article

All Comments(0)

Add a comment

Sign up for content alerts

Browse by related subjects

Back to all reports

Reviewer Report

15 Views

28 Nov 2016 | for Version 1

Ka Yee Yeung, Institute of Technology, University of Washington, Tacoma, WA, USA

15 Views Cite this report Responses(0)

Approved

This paper is generally well-written, with a clear and concise description of the problem and challenge. The author adopted a two-step feature selection procedure: a penalized L₁ regression for Cox PH model (R package "glmpath") in the first step, and forward selection in the second step. Features are selected to optimize the iAUC (integrated time-dependent AUC) in 10-fold cross validation.

Major comments:

I am confused about how the two-step feature selection procedure works. The author states the following:

"Among selected features described above, my procedure further implemented a forward feature selection to narrow the list of clinical variables."
"This figure shows that the step maximizing AUC is the sixth step which includes six clinical variables ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to predict global risks and survival times of patients."

Therefore, I assume the second step starts with the features selected from the first step. However, the features shown in Table 2 don't appear to be a subset of the features shown in Table 1. Also, the feature "MI" doesn't appear to be in Table 1.

Minor comments:

The difference between sub-challenge 1a and sub-challenge 1b is not documented in the Introduction. Please explain that in sub-challenge 1a, the submissions consist of the risk scores, while in sub-challenge 1b, the submissions consist of the predicted survival time.
Under Results, in "Selected clinical variables", the author states that "My method removed clinical variables having a lot of missing values and then it used only 14 numerical clinical variables and 56 categorical clinical variables with less number of missing values." What are the exact criteria for filtering clinical variables, given that there are 150+ clinical variables to start with?
Please explain what the clinical variables mean (e.g. ENTRTPC, ALP, HB, AST, ECOGC, NEU, PLT, PSA and LDH in Table 1).
Please expand the captions for Table 1 and Table 2 to put these tables in the context of the 2-step feature selection procedure.
In Table 3, I assume the circle means "yes" and the cross means "no". Please add a legend to the caption.

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

25 Nov 2016 | for Version 1

Niels Richard Hansen, Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

Søren Wengel Mogensen, Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

Approved

This paper offers methods to calculate patient risk scores and predict survival times from proportional hazards models in the context of the Prostate Cancer DREAM Challenge. The author used a two-step feature selection procedure, first combining the LASSO with significance testing and then applying a forward selection method.

The challenge consisted of two parts. In one part the contestants were to assign global risk scores to patients and in the other they were to predict survival times. The author states that the results of the methods in question for the former outcome did not make it into the top-10 of the challenge. However, the paper seems to conclude that the two-step feature selection is superior to one-step feature selection. This is possibly based on a comparison with the DREAM benchmark model only. In this case, the paper would benefit from a more specific statement.

For the feature selection, it is unclear whether the LASSO variable selection was done conditionally on the categorical predictors (without penalizing their coefficients) or marginally on only the continuous predictors.

Cross-validation seems to have been carried out incorrectly in the sense that only the second step (the forward selection) and not the first step was cross-validated. Whether this has consequences for the quality of the selection is unclear, but the estimated iAUC-values reported in Figure 2 are suspiciously large – and they definitely overestimate the validation iAUC.
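The point about cross-validation can be illustrated with a minimal sketch in which the selection step is repeated inside every fold, so that held-out rows never influence which features are chosen. Everything here is an assumption for illustration: the data are made up, the single-column covariance rule stands in for the full two-step selection, and mean squared error on a straight-line fit stands in for the iAUC of a Cox model.

```python
def k_fold_indices(n, k):
    """Split range(n) into k disjoint held-out folds."""
    return [list(range(start, n, k)) for start in range(k)]


def covariance_sum(xs, ys):
    """Sum of centred cross-products (unnormalised covariance)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def select_feature(rows, targets):
    """Toy selection step: the column with the largest |covariance| with the target."""
    columns = range(len(rows[0]))
    return max(columns, key=lambda j: abs(covariance_sum([r[j] for r in rows], targets)))


def cross_validate(rows, targets, k):
    """Mean held-out squared error, with selection repeated inside every fold."""
    selected_per_fold, fold_errors = [], []
    for held_out in k_fold_indices(len(rows), k):
        train = [i for i in range(len(rows)) if i not in held_out]
        train_rows = [rows[i] for i in train]
        train_targets = [targets[i] for i in train]

        # Selection sees training rows only, so held-out rows cannot leak into it.
        j = select_feature(train_rows, train_targets)
        xs = [r[j] for r in train_rows]
        slope = covariance_sum(xs, train_targets) / covariance_sum(xs, xs)
        intercept = sum(train_targets) / len(train_targets) - slope * sum(xs) / len(xs)

        errors = [(targets[i] - (intercept + slope * rows[i][j])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
        selected_per_fold.append(j)
    return sum(fold_errors) / k, selected_per_fold


# Made-up data: the target is exactly twice column 0; column 1 is uninformative.
rows = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
targets = [0, 2, 4, 6, 8, 10]
mean_error, selected_per_fold = cross_validate(rows, targets, k=3)
```

If `select_feature` were instead called once on all rows before the loop, each held-out fold would already have influenced the selection, which is the source of the optimistic bias described above.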

For predicting survival times, the author first used a fitted proportional hazards model to estimate median survival times. Then observed survival times were regressed linearly on the predicted medians. This estimated a linear transformation, which could be used to transform predicted medians to means. The paper would benefit from a brief discussion of the motivation behind this approach. It is stated that the linear transformation “reduces both the estimation bias and variance”, which is unclear as it is not stated what we’re aiming to estimate. Arguably, estimating the means from the medians should improve the performance as the RMSE is used to score the predictions.
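The calibration step described above can be sketched as an ordinary least-squares fit of observed survival times on predicted medians. This is a minimal illustration, not the author's code: the numbers are made up, censoring is ignored, and the comparison is in-sample, where a least-squares line cannot have a larger RMSE than the uncalibrated medians.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Hypothetical values in months (illustrative only, not challenge data).
predicted_medians = [6.0, 9.0, 12.0, 15.0, 18.0, 24.0]
observed_times = [8.0, 14.0, 13.0, 22.0, 25.0, 38.0]

slope, intercept = fit_line(predicted_medians, observed_times)
calibrated = [intercept + slope * m for m in predicted_medians]

rmse_raw = rmse(predicted_medians, observed_times)
rmse_calibrated = rmse(calibrated, observed_times)
```

Because squared error is minimized by the conditional mean rather than the median, a fitted line of this kind moves the predictions toward the quantity that the RMSE rewards.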

Minor comments:
p. 5: 1-b→1b

Table 3: Please add information to the caption about what the symbols mean. It is clear from reading the paper that “open circle” means “selected”, but that is not self-evident.

Competing Interests

No competing interests were disclosed.

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

[1] Jemal A, Siegel R, Ward E, et al.: Cancer statistics, 2009. CA Cancer J Clin. 2009; 59(4): 225–249.

[2] Ryan CJ, Smith MR, de Bono JS, et al.: Abiraterone in metastatic prostate cancer without previous chemotherapy. N Engl J Med. 2013; 368(2): 138–148.

[3] Wu JN, Fish KM, Evans CP, et al.: No improvement noted in overall or cause-specific survival for men presenting with metastatic prostate cancer over a 20-year period. Cancer. 2014; 120(6): 818–823.

[4] Halabi S, Lin CY, Kelly WK, et al.: Updated prognostic model for predicting overall survival in first-line chemotherapy for patients with metastatic castration-resistant prostate cancer. J Clin Oncol. 2014; 32(7): 671–677.

[5] Kleinbaum DG, Klein M: Survival Analysis: A Self-Learning Text, Third Edition. Springer. 2012.

[6] Zhang HH, Lu W: Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007; 94(3): 691–703.

[7] Park MY, Hastie T: L₁-regularization path algorithm for generalized linear models. J R Statist Soc. 2007; 69(4): 659–677.

[8] Harrington DP, Fleming TR: A class of rank test procedures for censored survival data. Biometrika. 1982; 69(3): 553–566.

[9] Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer. 2009.

[10] Scher HI, Jia X, Chi K, et al.: Randomized, open-label phase III trial of docetaxel plus high-dose calcitriol versus docetaxel plus prednisone for patients with castration-resistant prostate cancer. J Clin Oncol. 2011; 29(16): 2191–2198.

[11] Tannock IF, Fizazi K, Ivanov S, et al.: Aflibercept versus placebo in combination with docetaxel and prednisone for treatment of men with metastatic castration-resistant prostate cancer (VENICE): a phase 3, double-blind randomised trial. Lancet Oncol. 2013; 14(8): 760–768.

[12] Petrylak DP, Vogelzang NJ, Budnik N, et al.: Docetaxel and prednisone with or without lenalidomide in chemotherapy-naive patients with metastatic castration-resistant prostate cancer (MAINSAIL): a randomised, double-blind, placebo-controlled phase 3 trial. Lancet Oncol. 2015; 16(4): 417–425.

[13] Fizazi K, Higano CS, Nelson JB, et al.: Phase III, randomized, placebo-controlled study of docetaxel in combination with zibotentan in patients with metastatic castration-resistant prostate cancer. J Clin Oncol. 2013; 31(14): 1740–1747.

[14] Hung H, Chiang CT: Estimation methods for time-dependent AUC models with survival data. Can J Stat. 2010; 38(1): 8–26.

[15] Shiga M: Write-up for DREAM 9.5 Prostate Cancer DREAM Challenge, Synapse Storage. 2016.
