Two-step feature selection for predicting survival time of patients with metastatic castrate resistant prostate cancer

Metastatic castrate resistant prostate cancer (mCRPC) is the major cause of death in prostate cancer patients. Even though some options for treatment of mCRPC have been developed, the most effective therapies remain unclear. Thus finding key patient clinical variables related with mCRPC is an important issue for understanding the disease progression mechanism of mCRPC and clinical decision making for these patients. The Prostate Cancer DREAM Challenge is a crowd-based competition to tackle this essential challenge using new large clinical datasets. This paper proposes an effective procedure for predicting global risks and survival times of these patients, aimed at sub-challenge 1a and 1b of the Prostate Cancer DREAM challenge. The procedure implements a two-step feature selection procedure, which first implements sparse feature selection for numerical clinical variables and statistical hypothesis testing of differences between survival curves caused by categorical clinical variables, and then implements a forward feature selection to narrow the list of informative features. Using Cox’s proportional hazards model with these selected features, this method predicted global risk and survival time of patients using a linear model whose input is a median time computed from the hazard model. The challenge results demonstrated that the proposed procedure outperforms the state of the art model by correctly selecting more informative features on both the global risk prediction and the survival time prediction.

Prostate cancer is the most common malignant tumor among men and ranks third in terms of mortality after lung cancer and colorectal cancer. The major clinical treatment against prostate cancer is an anti-androgen therapy to inhibit male hormones providing to prostate cancer cells. However, the therapy cannot inhibit the cancer cell growth for long because these cells can develop the resistance against the androgen absence condition. This developed prostate cancer is called metastatic castrate resistant prostate cancer (mCRPC), which is the major cause of death in prostate cancer patients 1,2 . Even though some options for treatment of mCRPC have been developed, the most effective therapies remain unclear 3 .
Finding key clinical variables related with mCRPC is an important first step for understanding the disease progression mechanism and clinical decision making for these patients. Halabi et al. 4 identified key factors of mCRPC from a lot of clinical variables by feature selection based on a Cox's proportional hazards model with a L 1 penalty, i.e. a variant of Lasso for survival analysis 6,7 and built a mCRPC prognostic model. This data-driven approach is important to correctly predict patient health status for treatment choices. To validate and improve such prediction models of mCRPC patients, larger scale clinical datasets collected from several clinical institutes are useful. The Prostate Cancer DREAM challenge in DREAM 9.5 (https://www.synapse.org/ProstateCancerChallenge) provided such datasets and an opportunity to tackle this essential challenge using the wisdom of the crowd, in which participating teams were required to submit prediction models based on clinical variables from the comparator arms of four phase III clinical trials with over 2,000 mCRPC patients treated with first-line docetaxel. My method for this challenge consists of a two-step feature selection procedure, which first performs both sparse feature selection 7 and statistical hypothesis testing 8 , and then performs a forward feature selection 9 to screen out non-informative features. A common estimation is performed by maximizing a partial log likelihood function of N patients given by where x n is a vector of clinical variables of the n-th patient, δ n is a binary variable. δ n = 1 for died patients and δ n = 0 for rightcensored patients at time t n when is the survival time of the n-th patient. R n is the risk set at time t n . This estimation is of course affected by non-informative clinical variables (noise variables) because the size of the training data is limited, where the number of clinical variable is large but the number of patients is small. Before estimating weight vector β in the hazard function, my method implemented a two-step feature selection to screen out noninformative clinical variables.

Feature selection
The goal of feature selection is to divide the set of all clinical variables into a set of informative variables and non-informative variables by optimizing the final scoring metric. However, this optimization is NP-hard, i.e. intractable in general. Thus my procedure implemented this task in a heuristic manner; 1) screening numerical features by a L 1 sparse penalized regression and categorical features by a statistical hypothesis testing, and then 2) a forward sequential feature selection to narrow the list of informative selected features by optimizing the final scoring metric. For the first procedure, my procedure used a variant of LASSO for a Cox's proportional hazards model 7 provided by R package glmpath 11 . This approach should choose the weight of the L 1 penalty term. My method automatically chose it by minimizing an information criterion (AIC), which is a criterion to estimate the generalized error. Because the computational cost of this implementation with a lot of clinical variables is expensive, my procedure used this sparse feature selection for only numerical variables to reduce the computational cost. Categorical variables were evaluated using rank statistical hypothesis testing 5,8 . This method tests if there is a significant difference between two or more survival curves with different values of a categorical variable. If the difference of curves is statistically significant, the categorical variable might be related with survival times of patients. Therefore, such variables should be selected for a survival time prediction model. Among selected features described above, my procedure further implemented a forward feature selection 9 to narrow the list of clinical variables. In my procedure, the most useful feature that maximally increases an integrated time-dependent AUC (iAUC) 14 , which is the final scoring metric in sub-challenge 1a, is sequentially added one by one until all variables are selected. After that, the optimal set of clinical variables is selected by maximizing iAUC. iAUCs were estimated by cross-validation (CV), which was performed by randomly splitting all training data into 90% training data and 10% test data. iAUC was estimated as the median among ten calculated iAUC values.

Prediction of global risk of death and survival time
After selecting informative features, parameter β in the Cox proportional hazard function was optimized using only the selected clinical variables. Next, the hazard function was used to predict the global risk of death for each patient 5 . The survival time of each patient can be predicted based on the median time when an estimated survival probability is equal to 0.5, computed from the hazard function 5 . However the root mean squared error of this prediction method was still large and an estimation bias was included because of the right censoring setting, which will be experimentally demonstrated later. Against this problem, my method used a linear model fitting from computed median times to observed survival times in the training dataset. Survival time was predicted by the liner regression model whose input is the estimated median time of each patient. Feature selection for numerical clinical variables was first implemented using the L 1 penalized approach 7 by function coxpath in R package glmpath (https://cran.r-project.org/web/packages/glmpath/glmpath.pdf). This function can compute the entire regularization path for the L 1 penalized model by increasing the weight of the penalty and check only steps of the path when a weight parameter of a clinical variable becomes greater than zero. Table 1 shows the first 20 steps and the sequence of added clinical variables. Figure 1 shows  Step AIC feature selection. This figure shows that the step maximizing AUC is the sixth step which includes six clinical variables ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to predict global risks and survival times of patients.

Prediction performance
The parameter vector β of a Cox-proportional hazards model with six selected clinical variables was optimized by maximizing the partial log-likelihood function. Furthermore, survival times of patients were predicted using median times computed from the optimized hazard model. Figure 3(a) shows predicted values and observed values in the training dataset. This result demonstrates that the estimation of variance is large and the center of plotted data is located to the upper-left from the diagonal line, meaning that predicted values are biased. To improve these prediction errors, the median survival times were transformed by a linear model. Figure 3(b) shows the prediction result after this transformation. These figures demonstrate that the proposed prediction reduces both the estimation bias and variance. As a result, the root mean square error (RMSE) between true values and predictions is drastically improved, from 281.3 by median survival times to 198.7 by the proposed method. This prediction result in sub-challenge 1-b in the Prostate Cancer DREAM Challenge was ranked in the group of top-performers even though the global risk prediction result in sub-challenge 1a was worse than the best 10 performers.
computed AIC scores of these steps. The best feature set (step) was selected by minimizing an AIC score. This procedure chose the 14th step and then selected nine clinical covariates (ENTRTPC, ALP, HB, AST, ECOGC, NEU, PLT, PSA and LDH) as informative clinical variables.
On the other hand, differences of survival curves by categorical clinical variables were statistically tested using function survdiff in R package survival (https://cran.r-project.org/web/packages/ survival/survival.pdf). Table 2  For these 17 selected clinical variables by two feature selections, we further implemented the forward feature selection described in the previous section. Figure 2 shows iAUC at each step of the forward  Step iAUC

Conclusions
This paper outlines a prediction method of global risks of mCRPC patients for sub-challenge 1a and that of survival time for subchallenge 1b in the Prostate Cancer DREAM Challenge. The challenge result in sub-challenge 1b demonstrated that this procedure, which is based on the two-step feature selection and the correction of naïve survival time predictions from the optimized hazard model, outperformed the other teams' methods. Especially, for survival time prediction, this correction method based on centering and reducing estimation variance works well to improve RMSE, the scoring metric of sub-challenge 1b. This analysis demonstrates that a naïve prediction from a basic model (Cox's proportional hazards model) is not always optimal for an evaluation metric. Thus a suitable transformation is necessary to optimize the metric.
This paper also provides a two-step feature selection procedure because using only a single feature selection method leaves a lot of non-informative features. By carefully selecting features by this two-step procedure, the global risk prediction outperformed Halabi's model 4 in sub-challenge 1a. This result demonstrated that multiple feature selection procedures are necessary to screen out non-informative features. Future work includes the validation of informative clinical variables selected by not only of the method proposed here, but also other top-performing methods. Table 3 shows the comparison of our selected clinical variables with Halabi's selected variables 4 . Both models selected ALP, ECOG_C and HB but neither our model nor Halabi's model selected the other eight clinical variables. Although selection results depend on the datasets used, we should further investigate the importance of these clinical variables using knowledge in clinical and biological research areas.

Data availability
The Challenge datasets can be accessed at: https://www.projectdatasphere.org/projectdatasphere/html/pcdc Challenge documentation, including the detailed description of the Challenge design, overall results, scoring scripts, and the clinical trials data dictionary can be found at: https://www.synapse.org/ ProstateCancerChallenge The author adopted a two-step feature selection procedure: a penalized regression for Cox PH model L (R package "glmpath") in the first step, and forward selection in the second step. Features are selected to optimize the iAUC (integrated time-dependent AUC) in 10-fold cross validation.

Major comments:
I am confused about how the two-step feature selection procedure works. The author mentioned the following "Among selected features described above, my procedure further implemented a forward feature ." selection to narrow the list of clinical variables "This figure shows that the step maximizing AUC is the sixth step which includes six clinical variables ALP, AST, ECOG_C, HB, MI and PLT. These clinical variables were finally selected to ." predict global risks and survival times of patients Therefore, I assume the second step starts with the features selected from the first step. However, the features shown in Table 2 don't appear to be a subset of the features shown in Table 1. Also, the feature "MI" doesn't appear to be in Table 1.

Minor comments:
The difference between sub-challenge 1a and sub-challenge 1b is not documented in the Introduction. Please explain that in sub-challenge 1a, the submissions consist of the risk scores, while in sub-challenge 1b, the submissions consist of the predicted survival time. Please explain what the clinical variables mean (e.g. ENTRTPC, ALP, HB, AST, ECOGC, NEU, PLT, PSA and LDH in Table 1).
Please expand the captions for Table 1 and Table 2 to put these tables in the context of the 2-step 1 Please expand the captions for Table 1 and Table 2 to put these tables in the context of the 2-step feature selection procedure.
In Table 3, I assume the circle means "yes" and the cross means "no". Please add a legend to the caption.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed. Competing Interests: 25  This paper offers methods to calculate patient risk scores and predict survival times from proportional hazard models in the context of the Prostate Cancer DREAM Challenge. The author used a two-step feature selection procedure by first using a combination of the LASSO and significance testing and then using a forward selection method.
In one part the contestants were to assign global risk scores to The challenge consisted of two parts. patients and in the other they were to predict survival times. The author states that the results of the methods in question for the former outcome did not make it into the top-10 of the challenge. However, the paper seems to conclude that the two-step feature selection is superior to one-step feature selection. This is possibly based on a comparison with the DREAM benchmark model only. In this case, the paper would benefit from a more specific statement.
For the feature selection it seems unclear if the LASSO variable selection was done conditionally on the categorical predictors (without penalizing their coefficients) or marginally on only the continuous predictors. seems to have been carried out incorrectly in the sense that only the second step (the Cross-validation forward selection) and not the first step was cross-validated. Whether this has consequences for the quality of the selection is unclear, but the estimated iAUC-values reported in Figure 2 are suspiciously large -and they definitely overestimate the validation iAUC.
, the author first used a fitted proportional hazards model to estimate For predicting survival times median survival times. Then observed survival times were regressed linearly on the predicted medians. This estimated a linear transformation, which could be used to transform predicted medians to means. The paper would benefit from a brief discussion of the motivation behind this approach. It is stated that the linear transformation "reduces both the estimation bias and variance", which is unclear as it is not stated what we're aiming to estimate. Arguably, estimating the means from the medians should improve the performance as the RMSE is used to score the predictions.
Minor comments: p. 5: 1-b→1b Table 3: Please add information to the caption about what the symbols mean. It is clear from reading the paper that "open circle" means "selected", but that is not self-evident.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed. Competing Interests: