Cross-validation for sail

Does k-fold cross-validation for sail and determines the optimal tuning parameter \(\lambda\).

cv.sail(x, y, e, ..., weights, lambda = NULL, type.measure = c("mse",
  "deviance", "class", "auc", "mae"), nfolds = 10, foldid,
  grouped = TRUE, keep = FALSE, parallel = FALSE)

Arguments

x	input matrix of dimension `n x p`, where `n` is the number of subjects and `p` is the number of X variables. Each row is an observation vector. Can be a high-dimensional (n < p) matrix. Can be a user-defined design matrix of main effects only (without intercept) if `expand=FALSE`.
y	response variable. For `family="gaussian"` should be a 1 column matrix or numeric vector. For `family="binomial"`, should be a 1 column matrix or numeric vector with -1 for failure and 1 for success.
e	exposure or environment vector. Must be a numeric vector. Factors must be converted to numeric.
...	other arguments that can be passed to `sail`
weights	observation weights. Default is 1 for each observation. Currently NOT IMPLEMENTED.
lambda	Optional user-supplied lambda sequence; default is NULL, and `sail` chooses its own sequence
type.measure	loss to use for cross-validation. Currently only 3 options are implemented. The default is `type.measure="deviance"`, which uses squared-error for gaussian models (and is equivalent to `type.measure="mse"` there). `type.measure="mae"` (mean absolute error) can also be used; it measures the absolute deviation from the fitted mean to the response (\(|y-\hat{y}|\)).
nfolds	number of folds. Although `nfolds` can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is `nfolds=3`. Default: 10
foldid	an optional vector of values between 1 and `nfolds` identifying what fold each observation is in. If supplied, `nfolds` can be missing. Often used when wanting to tune the second tuning parameter (\(\alpha\)) as well (see details).
grouped	This is an experimental argument, with default `TRUE`, and can be ignored by most users. This refers to computing `nfolds` separate statistics, and then using their mean and estimated standard error to describe the CV curve. If `grouped=FALSE`, an error matrix is built up at the observation level from the predictions from the `nfolds` fits, and then summarized (does not apply to `type.measure="auc"`). Default: TRUE.
keep	If `keep=TRUE`, a prevalidated array is returned containing fitted values for each observation and each value of `lambda`. This means these fits are computed with this observation and the rest of its fold omitted. The `foldid` vector is also returned. Default: FALSE
parallel	If `TRUE`, use parallel `foreach` to fit each fold. Must register a parallel backend beforehand using the `registerDoParallel` function from the `doParallel` package. See the example below for details. Default: FALSE
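
The difference between the two settings of `grouped` can be illustrated with a small base R sketch. This is not the package's internal code: the per-observation losses are simulated, the variable names are hypothetical, and the standard error formulas are the plain unweighted ones, which is an assumption.

```r
# Hypothetical per-observation losses for 12 observations in 3 equal folds
set.seed(1)
n <- 12
nfolds <- 3
foldid <- rep(seq_len(nfolds), length.out = n)  # fold label per observation
sq_err <- rnorm(n)^2                            # stand-in for squared errors

# Analogue of grouped = TRUE: one statistic per fold,
# then the mean and standard error across the folds
fold_means <- tapply(sq_err, foldid, mean)
cvm_grouped <- mean(fold_means)
cvsd_grouped <- sd(fold_means) / sqrt(nfolds)

# Analogue of grouped = FALSE: summarize at the observation level
cvm_ungrouped <- mean(sq_err)
cvsd_ungrouped <- sd(sq_err) / sqrt(n)
```

With equal fold sizes the two means coincide; what changes is the standard error estimate, which is based on `nfolds` values in the first case and on `n` values in the second.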

Value

An object of class "cv.sail" is returned, which is a list with the ingredients of the cross-validation fit.

lambda: the values of converged lambda used in the fits.
cvm: The mean cross-validated error - a vector of length length(lambda).
cvsd: estimate of standard error of cvm.
cvup: upper curve = cvm+cvsd.
cvlo: lower curve = cvm-cvsd.
nzero: number of non-zero coefficients at each lambda. This is the sum of the total non-zero main effects and interactions. Note that when expand=TRUE, a variable is counted only once in the calculation of nzero, i.e., if a variable is expanded to three columns, it is counted once even though all three coefficients are estimated to be non-zero.
name: a text string indicating type of measure (for plotting purposes).
sail.fit: a fitted sail object for the full data.
lambda.min: value of lambda that gives minimum cvm.
lambda.1se: largest value of lambda such that error is within 1 standard error of the minimum.
fit.preval: if keep=TRUE, this is the array of prevalidated fits. Some entries can be NA, if that and subsequent values of lambda are not reached for that fold.
foldid: if keep=TRUE, the fold assignments used
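
The relationship between `cvm`, `cvsd`, `lambda.min` and `lambda.1se` described above can be sketched in base R. The numbers below are made up for illustration, and the code is a minimal reading of the definitions in this list rather than the package's own implementation.

```r
# Hypothetical CV summary; in practice these come from a cv.sail object
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(4.0, 2.2, 2.0, 2.1, 2.4)   # mean cross-validated error
cvsd   <- c(0.4, 0.3, 0.3, 0.3, 0.4)   # standard error of cvm

# lambda.min: the lambda with the smallest mean CV error
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within
# one standard error of the minimum
threshold <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])
```

Because a larger lambda means more shrinkage, `lambda.1se` typically gives a sparser model than `lambda.min` at a statistically comparable error.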

Details

The function runs sail nfolds+1 times: the first run gets the lambda sequence, and the remaining runs compute the fit with each of the folds omitted. Note that a new lambda sequence is computed for each of the folds, and the predict method is then used to get the solution path at each value of the original lambda sequence. The error is accumulated, and the average error and standard deviation over the folds are computed.

Note that cv.sail does NOT search for values for alpha. A specific value should be supplied, else alpha=0.5 is assumed by default. If users would like to cross-validate alpha as well, they should call cv.sail with a pre-computed vector foldid, and then use this same fold vector in separate calls to cv.sail with different values of alpha.

Note also that the results of cv.sail are random, since the folds are selected at random. Users can reduce this randomness by running cv.sail many times, and averaging the error curves.
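
One way to reuse a single fold assignment across several values of alpha is sketched below. The helper `make_foldid` and the alpha grid are hypothetical names introduced for this illustration, and the sketch assumes that `alpha` is forwarded to `sail` through `...`. The part that calls cv.sail is wrapped in `if (FALSE)` because it needs the sail package and can take a while to run.

```r
# Hypothetical helper: balanced random assignment of n observations to folds
make_foldid <- function(n, nfolds = 10) {
  sample(rep(seq_len(nfolds), length.out = n))
}

set.seed(123)                    # makes the otherwise random folds reproducible
alphas <- c(0.1, 0.5, 0.9)       # hypothetical grid of alpha values

if (FALSE) {
  library(sail)
  data("sailsim")
  foldid <- make_foldid(nrow(sailsim$x), nfolds = 10)

  # Same foldid in every call, so the error curves are comparable across alpha
  cvfits <- lapply(alphas, function(a) {
    cv.sail(x = sailsim$x, y = sailsim$y, e = sailsim$e,
            alpha = a, foldid = foldid)
  })

  # Smallest mean cross-validated error reached for each alpha
  min_cvm <- vapply(cvfits, function(f) min(f$cvm), numeric(1))
  best_alpha <- alphas[which.min(min_cvm)]
}
```

Fixing the seed before building `foldid` also removes the run-to-run randomness mentioned above, at the cost of tying the result to one particular split.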

Note

The skeleton of this function and the documentation were taken straight from the glmnet package. See references for details.

References

Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. http://www.jstatsoft.org/v33/i01/.

Bhatnagar SR, Yang Y, Greenwood CMT. Sparse additive interaction models with the strong heredity property (2018+). Preprint.

Examples

if (FALSE) {
if(interactive()){
f.basis <- function(i) splines::bs(i, degree = 5)
data("sailsim")
cvfit <- cv.sail(x = sailsim$x, y = sailsim$y, e = sailsim$e,
                 basis = f.basis, nfolds = 10)

# Parallel
library(doParallel)
registerDoParallel(cores = 4)
cvfit <- cv.sail(x = sailsim$x, y = sailsim$y, e = sailsim$e,
                 parallel = TRUE, nlambda = 100, nfolds = 10)
# plot cross validated curve
plot(cvfit)
# plot solution path
plot(cvfit$sail.fit)

# solution at lambda.min
coef(cvfit, s = "lambda.min")
# solution at lambda.1se
coef(cvfit, s = "lambda.1se")
# non-zero coefficients at lambda.min
predict(cvfit, s = "lambda.min", type = "nonzero")

# predicted response
predict(cvfit, s = "lambda.min")
predict(cvfit, s = "lambda.1se")
# predict response at any value for lambda
predict(cvfit, s = 0.457)

# predict response for new data set
newx <- sailsim$x * 1.10
newe <- sailsim$e * 2
predict(cvfit, newx = newx, newe = newe, s = "lambda.min")
}
}