ProSGPV: an R package for variable selection with second- generation p-values [version 1; peer review: 1 approved with reservations]

We introduce the ProSGPV R package, which implements a variable selection algorithm based on second-generation p-values (SGPV) instead of traditional p-values. Most variable selection algorithms shrink point estimates to arrive at a sparse solution. In contrast, the ProSGPV algorithm accounts for the estimation uncertainty – via confidence intervals – in the selection process. This additional information leads to better inference and prediction performance in finite sample sizes. ProSGPV maintains good performance even in the high dimensional case where p > n, or when explanatory variables are highly correlated. Moreover, ProSGPV is a unifying algorithm that works with continuous, binary, count, and time-to-event outcomes. No cross-validation or iterative processes are needed, so ProSGPV is very fast to compute. Visualization tools are available in this package for assessing the variable selection process. Here we present simulation studies and a real-world example to demonstrate ProSGPV's inference and prediction performance in relation to the current standards in variable selection procedures.


Introduction
As the sheer volume of data grows at an astronomical rate, variable selection plays an increasingly crucial role in research. This is particularly true in the high dimensional setting where p > n and classical statistical methods exploiting the full feature space no longer work. An ideal variable selection procedure would recover the underlying true support with high probability, yield parameter estimates with low bias, and achieve good prediction performance. While it is hard for a statistical procedure to strike a balance between inference and prediction tasks, 1,2 the ProSGPV algorithm is remarkably able in this sense. 3,4

One natural approach to variable selection is best subset selection (BSS). BSS chooses the k out of p total variables, for each k ∈ {1, 2, …, p}, that optimize a chosen loss function. It can be thought of as ℓ_0-penalized regression, and the BSS problem has been shown to be nonconvex and NP-hard. 5 With recent advancements, 6-8 solving the BSS routine with thousands of features is no longer infeasible. In particular, an efficient R package called BeSS 8 can identify the best submodel in seconds or a few minutes when p is around 10,000.

ℓ_1-penalized likelihood procedures are also used for variable selection. Lasso, for example, produces models with strong predictive ability. 9 However, lasso is not always variable selection consistent. 1,10 To address this inconsistency, the adaptive lasso was proposed, which introduces weights in the ℓ_1 penalty. 11 Despite the oracle variable selection properties of the adaptive lasso, in practice it is often hard to find a tuple of tuning parameters that achieves those properties. Both lasso and adaptive lasso can be implemented in the glmnet package. 12,13 Smoothly clipped absolute deviation (SCAD) 14 and minimax concave penalty with penalized linear unbiased selection (MC+) 15 were proposed to bridge the gap between the ℓ_0 and ℓ_1 penalties. These two algorithms are largely distinguished by their piecewise linear thresholding functions. Sure independence screening (SIS), 16,17 implemented in the SIS package, 18 ranks the maximum marginal likelihood estimates and can greatly reduce the dimensionality of the feature space by keeping the top-ranked variables, even when p ≫ n. Iterative SIS (ISIS) can improve its performance in finite sample sizes. Note that all the aforementioned algorithms shrink point estimates to derive a sparse solution, leaving room for improvement in inference and prediction properties in finite sample sizes.
Many R packages have been developed to address particular data types. For example, clogitL1 19 performs variable selection with lasso and elastic net penalties in conditional logistic regression; pogit 20 performs Bayesian variable selection with spike-and-slab priors in Poisson and logistic regressions; and penPHcure 21 performs variable selection in the Cox proportional hazards cure model with time-varying covariates. The ideal R package would have performance superior or comparable to the current algorithms and would work with each of continuous, binary, count, and time-to-event outcomes.
Recently, we 3,4 developed penalized regression with second-generation p-values (ProSGPV) for variable selection in both low-dimensional (n > p) and high-dimensional (p > n) settings. Unlike traditional algorithms, ProSGPV incorporates estimation uncertainty via confidence intervals in the variable selection process. This addition often leads to better support recovery, parameter estimation, and prediction performance. This paper describes an R package named ProSGPV, which implements the ProSGPV algorithm. 29 Here, we extend the algorithm to work with data from logistic regression, Poisson regression, and Cox proportional hazards regression. We also provide visualization tools for the variable selection process, which were not discussed in the original papers. 3,4 Simulation studies below compare the inference and prediction performance of ProSGPV against that of glmnet, BeSS, and ISIS in scenarios not discussed in 3,4. A real-world example compares the sparsity of solutions and the prediction accuracy of all algorithms.

Methods
What is a second-generation p-value? Second-generation p-values (SGPVs) were proposed to address some of the well-known flaws of traditional p-values. 22,23 The basic idea is to replace the point null hypothesis with a pre-specified interval null. The SGPV is denoted by p_δ, where δ represents the half-width of the interval null. The interval null represents the set of effect sizes that are scientifically indistinguishable from the point null hypothesis, due to limited precision or practicality.
SGPVs are essentially the fraction of data-supported hypotheses that are also null hypotheses. See Figure 1 for an illustration of how SGPVs work. Their formal definition is as follows: let θ be the parameter of interest, and let I = [θ_l, θ_u] be an interval estimate of θ whose length is given by |I| = θ_u − θ_l. Here, I can be a confidence interval (we use 95% CIs in this paper), a likelihood support interval, or a credible interval. The coverage probability of the interval estimate I largely drives the frequency properties of the SGPV upon which it is based.
Denote the length of the interval null by |H_0|. Then the SGPV, p_δ, is defined as

p_δ = (|I ∩ H_0| / |I|) × max{ |I| / (2|H_0|), 1 }

where |I ∩ H_0| is the length of the intersection of the two intervals. 22,23 Notice how the SGPV indicates when the data are compatible with null hypotheses (p_δ = 1), or with alternative hypotheses (p_δ = 0), or when the data are inconclusive (0 < p_δ < 1).
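The definition above is short enough to sketch directly. The following Python snippet (for illustration only; the package itself is written in R, and the function name sgpv is our own) computes p_δ from the endpoints of the interval estimate I and the interval null H_0:

```python
def sgpv(ci_lo, ci_hi, null_lo, null_hi):
    """Second-generation p-value for interval estimate I = [ci_lo, ci_hi]
    and interval null H0 = [null_lo, null_hi]."""
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))  # |I ∩ H0|
    len_i = ci_hi - ci_lo        # |I|
    len_h0 = null_hi - null_lo   # |H0|
    # The correction term max{|I|/(2|H0|), 1} deflates p_delta toward 1/2
    # when I is too wide to be informative.
    return overlap / len_i * max(len_i / (2 * len_h0), 1.0)

# CI entirely outside the null zone -> p_delta = 0 (data support the alternative)
print(sgpv(0.5, 1.5, -0.2, 0.2))   # 0.0
# CI entirely inside the null zone -> p_delta = 1 (data compatible with the null)
print(sgpv(-0.1, 0.1, -0.2, 0.2))  # 1.0
```

A CI that partially overlaps the null zone yields a value strictly between 0 and 1, i.e. the inconclusive case.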
Notice also that the correction term max{|I|/(2|H_0|), 1} resolves any problems that arise when the interval estimate I is too wide to be useful or reliable, in which case the data are effectively deemed inconclusive. In this way, SGPVs emphasize effects that are clinically meaningful over effects that are small and near the null hypothesis. Empirical studies have shown how SGPVs can be used to identify feature importance in high dimensional settings. 3,4,22,23 The null bound in the SGPVs is typically the smallest effect that would be clinically relevant, or the effect magnitude that can be distinguished from noise on average. In 3,4 we proposed a generic null interval for regression coefficients that shrinks to zero and is based on the observed level of noise in the data; this extends the constant null bound used in 22,23. The interval is easily obtained in the variable selection step and promotes good statistical properties in the selection algorithm.

The ProSGPV algorithm
The ProSGPV algorithm is a two-stage algorithm. In the first stage, the algorithm identifies a candidate set of variables using a broad regularization scheme. In the second stage, the algorithm applies SGPV regularization to the model based on the candidate set identified in the first stage. The successive regularization approach is easy and fast to implement, and quite accurate for screening out false features. The steps of the ProSGPV algorithm are shown in Algorithm 1.
Algorithm 1: The ProSGPV algorithm

1: procedure ProSGPV(X, Y)
2:   Stage one: find a candidate set
3:   Fit a lasso and evaluate it at λ_gic
4:   Fit OLS/GLM/Cox models on the lasso active set
5:   Stage two: SGPV screening
6:   Extract the confidence intervals of all variables from the previous step
7:   Calculate the mean coefficient standard error SĒ
8:   Calculate the SGPV for each variable j, where I_j = β̂_j ± 1.96 × SE_j and H_0 = [−SĒ, SĒ]
9:   Keep variables with an SGPV of zero
10:  Refit the OLS/GLM/Cox model with the selected variables
11: end procedure

By default, data are scaled and centered in linear regression but are not transformed in GLM and Cox regression; no notable difference is observed when data are standardized in GLM and Cox regression. In the first stage, lasso is used to reduce the feature space to a candidate set that is very likely to contain the true signals. This pre-screening is crucial for high dimensional data (n < p) and improves support recovery and parameter estimation in low dimensional data. 3 The lasso is evaluated at λ_gic, but the algorithm is robust with respect to the choice of λ. 3 In the second stage, the null bound is set to the mean standard error of all coefficient estimates; the algorithm is insensitive to any reasonable scale change in the null bound. 3,4 When the data are highly correlated, a generalized variance inflation factor (GVIF) 24 adjusted null bound can be used to improve the inference and prediction performance. 4 In essence, ProSGPV is a hard thresholding algorithm: it shrinks small effects to zero and preserves large effects, yielding unbiased estimates when the true support is successfully recovered. Notation-wise, the ProSGPV solution β̂|S is a vector of length p whose non-zero elements are the OLS/GLM/Cox coefficient estimates from the model with variables only in the set S, where S is the final selection set and C is the candidate set from the first-stage screening. Note that the effective cutoff for variable k is λ_k = 1.96 × SE_k + SĒ.
When λ_gic in Algorithm 1 is replaced with zero in the first stage, ProSGPV reduces to a one-stage algorithm, which amounts to calculating SGPVs for each variable in the full model and selecting the variables whose effects are above the threshold. However, the support recovery and parameter estimation performance of the one-stage algorithm is slightly worse than that of the two-stage algorithm. 3 Moreover, the one-stage algorithm is not applicable when p > n, i.e., when the full OLS/GLM/Cox model is not identifiable.

Implementation
The ProSGPV package is publicly available from the Comprehensive R Archive Network (CRAN); a development version is available on GitHub, and the package is archived with Zenodo. 29 To install from CRAN, please run

install.packages("ProSGPV")

To install the development version, please run

library(devtools)
devtools::install_github("zuoyi93/ProSGPV")

The main function pro.sgpv implements the default two-stage and the optional one-stage ProSGPV algorithm. User-friendly print, coef, summary, predict, and plot functions accompany pro.sgpv for both the one- and two-stage algorithms. Jeffreys-prior penalized logistic regression 25 is used when the outcome is binary to stabilize coefficient estimates in the case of complete or quasi-complete separation. In the next section, we demonstrate how pro.sgpv works with simulated continuous-outcome data.

Operation
The ProSGPV package works across platforms (Windows, macOS, and Linux) and requires R version 3.5.0 or later. Once the package is installed, the workflow is as follows. We first present an example applying the ProSGPV algorithm to a dataset simulated with the gen.sim.data function. With sample size n = 100, number of variables p = 20, number of true signals s = 4, smallest effect size β_min = 1, largest effect size β_max = 5, autoregressive correlation ρ = 0.2, variance σ² = 1, and signal-to-noise ratio (SNR) ν = 2 as defined in 26, we generate outcomes Y following a Gaussian distribution. gen.sim.data outputs X, Y, the indices of the true signals, and the vector of true coefficients.
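To make the simulation setup concrete, here is a rough Python sketch of this data-generating process (an assumption on our part, not the package's R code; in particular, the real gen.sim.data calibrates the noise to the target SNR ν, while this sketch simply fixes σ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, rho, sigma = 100, 20, 4, 0.2, 1.0

# AR(1) correlation among predictors: Cov(X_j, X_k) = rho^|j - k|
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
x = rng.multivariate_normal(np.zeros(p), cov, size=n)

# s true signals with effects spaced between beta_min = 1 and beta_max = 5
beta = np.zeros(p)
signal_idx = rng.choice(p, size=s, replace=False)
beta[signal_idx] = np.linspace(1, 5, s)

# Gaussian outcome
y = x @ beta + rng.normal(0, sigma, size=n)
print(x.shape, y.shape)  # (100, 20) (100,)
```

The outputs mirror what gen.sim.data returns: the design matrix, the outcome, the signal indices, and the true coefficient vector.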
> sgpv.out.2 <- pro.sgpv(x, y)
> sgpv.out.2
Selected variables are V2 V4 V5 V7

The variable selection process can be visualized using the plot function. Figure 2 shows the fully relaxed lasso path along a range of λs. The shaded area is the null zone; any effect whose 95% confidence interval overlaps the null region is considered irrelevant or insignificant. ProSGPV is evaluated at λ_gic. The lpv argument chooses between displaying one line per variable (the confidence bound that is closer to the null region) or three lines per variable (a point estimate plus both confidence bounds, the default). The lambda.max argument controls the limit of the x-axis in the plot. The summary function outputs the regression summary of the selected model; when the outcome is continuous, an OLS model is used. The predict function predicts outcomes using the selected model: in-sample predictions are made by calling predict(sgpv.out.2), and out-of-sample predictions by feeding a new data set to the newdata argument.
In addition to the two-stage algorithm, the one-stage algorithm can also be used to select variables when n > p. Its computation time is shorter at the expense of a slightly reduced support recovery rate in the limit, as shown in 3. Figure 3 shows the variable selection result of the one-stage algorithm on the same data. The one-stage algorithm missed V4 and selected only three variables: the lower confidence bound of the estimate for V4 barely fails to exclude the null region, and the variable was dropped from the final model because of that. The one-stage algorithm illustrates how estimation uncertainty via confidence intervals (horizontal segments) is incorporated into variable selection. Examples with binary, count, and time-to-event data can be found in the package vignettes.

Simulation
The simulation design is adapted from 3,4, and below we present scenarios not discussed in those two papers. The setup can be found in Table 1 below. The scale and shape parameters are used to generate time-to-event outcomes from a Weibull distribution. ProSGPV is compared against lasso, BeSS, and ISIS, and results are aggregated over 1000 simulations. Evaluation metrics include the support recovery rate, the parameter estimation mean absolute error (MAE), and the prediction root mean square error (RMSE) or area under the curve in a separate test set. Support recovery is defined as capturing the exact true support, not merely containing it. The MAE is estimated as (1/p) Σ_{j=1}^{p} |β̂_j − β_0,j|, where β_0,j is the jth true coefficient. Simulation results are summarized in Figure 4.
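As a small worked example of the MAE metric just defined (with toy numbers of our own, not simulation output):

```python
# Parameter-estimation MAE: (1/p) * sum_j |beta_hat_j - beta_0_j|
beta_hat  = [2.0, 0.0, -1.2, 0.4]   # hypothetical estimates
beta_true = [2.1, 0.0, -1.5, 0.0]   # hypothetical true coefficients
mae = sum(abs(b - b0) for b, b0 in zip(beta_hat, beta_true)) / len(beta_true)
print(round(mae, 10))  # 0.2
```

Note that the sum runs over all p coefficients, so falsely selected variables (like the last one above) are penalized alongside estimation error on the true signals.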
From Figure 4, we observe results similar to those in 3,4. ProSGPV often has the highest capture rate of the exact true model, the lowest parameter estimation bias, and one of the lowest prediction errors, and it is the fastest to compute. Note that the GVIF-adjusted null bound is used in the logistic regression because of the high correlation in the design matrix.

Real-world data example
In this section, we compare the performance of ProSGPV with the lasso, BeSS, and ISIS algorithms using a real-world financial data set. 28 The close price of the Dow Jones Industrial Average (DJIA) was documented from January 1, 2010 to November 15, 2017, and eight groups of features (primitive and technical indicators, big U.S. companies, commodities, exchange rates of currencies, future contracts, world stock indices, and other sources of information) 27 were collected to predict the DJIA close price. The analyzed data with complete records contain 1114 observations and 82 predictors. We randomly sampled 614 observations as a fixed test set to evaluate the prediction performance of models built on the training set, and allowed the training set size n to vary from 40 to 300. At each n, we recorded the distribution of the training model size for each algorithm as well as the distribution of the prediction RMSE over 1000 repetitions. Results are summarized in Figure 5. ProSGPV and lasso produced sparser models than BeSS and ISIS, while ProSGPV had much better prediction performance than lasso in the test set. BeSS and ISIS had better prediction performance at the cost of including many more variables in the final model. The tradeoff between sparsity of solutions and prediction accuracy is well illustrated in this example. Variables frequently selected by ProSGPV include the 5-, 10-, and 15-day rates of change and the 10-day exponential moving average of the DJIA. Technical indicators appear more predictive than the other groups (world indices, commodities, exchange rates, futures, etc.).

Conclusions
We introduced an R package implementing the ProSGPV algorithm, a variable selection algorithm that incorporates estimation uncertainty via confidence intervals. This novel addition often leads to better support recovery, parameter estimation, and prediction performance. The package is user-friendly, offers visualization tools for the variable selection process, facilitates subsequent inference and prediction, and is very fast computationally. The efficacy of the package is demonstrated on both simulated and real-world data sets.

Data availability

Source data
The real-world data are from a Kaggle data challenge, available at https://www.kaggle.com/ehoseinz/cnnpredstock-market-prediction. A detailed description of the data elements can be found in 27.

Underlying data
Zenodo: zuoyi93/r-code-prosgpv-r: v1.0.0. https://doi.org/10.5281/zenodo.5655772. 28
• Processed_DJI.csv (real-world data)

Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Open peer review

The paper describes an R package with which a recently proposed method of variable selection, ProSGPV, can be applied to data sets. I am not fully convinced of the method. It would need a neutral comparison study (one not conducted by the authors who invented the method) to prove whether it is useful. But to conduct such a study, an efficient implementation of the method is needed, which is introduced in this report. So the paper has some relevance.

Introduction: 'classical statistical methods exploiting the full feature space no longer work': but any variable selection algorithm would have to exploit the full feature space when searching for predictive variables? One may be interested in getting a sparse solution because one assumes that only a small subset of features is really predictive, or if a smaller prediction model has other advantages, e.g. costs or practicability. But there are many modeling techniques that do not sparsify the feature space and work very well with high-dimensional data, such as ridge regression. So more convincing arguments for the need for data-driven variable selection should be given.

Figure 1 may need some more explanation in the caption.

At step 9 of the algorithm, only variables with an SGPV of zero are kept. What if two variables have a high correlation and both of their SGPVs are greater than zero, but omitting them both from the model would reduce the information significantly? Isn't this a drawback of this procedure?
Why not sequentially (one-by-one) eliminate variables and refit the model? In the definition of the null interval, the authors speak of 'minimum clinical relevance' (so having medical applications in mind), but the example is from the financial world. Could the authors give some idea of how to define the null interval for other areas of application? It seems that the SGPV depends on the confidence level, because with lower confidence levels the intervals are shorter and the probability of getting an SGPV of 0 gets larger. The empirical null interval definition seems a bit arbitrary (mean of standardized standard errors).
Why not the root of the mean of standardized variances? Why take the mean at all? Does this imply any assumptions on an equal a-priori importance of the variables? What about correlated variables?
In the simulation study, the authors confounded the type of regression with varying n or p. This simulation can only be seen as proof that the package works, but I am not yet convinced that it also proves the advantages of the new method, so the results still have to be interpreted with caution. We do not know whether the simulation scenarios were pre-specified or selected after the fact, nor whether the simulated joint covariate-outcome distributions are representative of any real data sets one encounters in practice.
Is the rationale for developing the new software tool clearly explained? Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly