Generates data for the different simulation studies presented in the accompanying paper. This function requires the truncnorm package to be installed.

gendata(n, p, corr, E = truncnorm::rtruncnorm(n, a = -1, b = 1), betaE,
  SNR, parameterIndex)

Arguments

n

number of observations

p

number of main effect variables (X)

corr

correlation between predictors

E

simulated environment vector of length n. Can be continuous or integer valued. Factors must be converted to numeric. Default: truncnorm::rtruncnorm(n, a = -1, b = 1)

betaE

exposure effect size

SNR

signal to noise ratio

parameterIndex

simulation scenario index. See details for more information.
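The note on E that factors must be converted to numeric can be sketched as follows. The variable names and the 0/1 coding are assumptions made for illustration only; any numeric coding of the exposure could be passed as E.

```r
# Hypothetical two-level exposure stored as a factor
exposure <- factor(c("unexposed", "exposed", "exposed", "unexposed"),
                   levels = c("unexposed", "exposed"))

# as.numeric() on a factor returns the integer level codes (1, 2, ...);
# subtracting 1 gives a 0/1 coding (an assumed choice, not a requirement)
E <- as.numeric(exposure) - 1

E
```

A vector built this way has length n and can be supplied through the E argument in place of the default truncated normal draw.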

Value

A list with the following elements:

x

matrix of dimension n x p of simulated main effects

y

simulated response vector of length n

e

simulated exposure vector of length n

Y.star

linear predictor vector of length n

f1

the function f1 evaluated at x_1 (f1(X1))

f2

the function f2 evaluated at x_2 (f2(X2))

f3

the function f3 evaluated at x_3 (f3(X3))

f4

the function f4 evaluated at x_4 (f4(X4))

betaE

the value for \(\beta_E\)

f1.f

the function f1

f2.f

the function f2

f3.f

the function f3

f4.f

the function f4

X1

an n length vector of the first predictor

X2

an n length vector of the second predictor

X3

an n length vector of the third predictor

X4

an n length vector of the fourth predictor

scenario

a character representing the simulation scenario identifier as described in Bhatnagar et al. (2018+)

causal

character vector of causal variable names

not_causal

character vector of noise variables

Details

We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects and 3) interactions.

Heredity Property

Truth obeys weak hierarchy (parameterIndex = 2) $$Y* = f_1(X_{1}) + f_2(X_{2}) + \beta_E * X_{E} + X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$

Truth only has interactions (parameterIndex = 3) $$Y* = X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$

Non-linearity

Truth is linear (parameterIndex = 4) $$Y* = \sum_{j=1}^{4}\beta_j X_{j} + \beta_E * X_{E} + X_{E} * X_{3} + X_{E} * X_{4} $$

Interactions

Truth only has main effects (parameterIndex = 5) $$Y* = \sum_{j=1}^{4} f_j(X_{j}) + \beta_E * X_{E} $$


The functions are from the paper by Lin and Zhang (2006):

f2

f2 <- function(t) 3 * (2 * t - 1)^2

f3

f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))

f4

f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) + 0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)
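As an illustration of how these functions enter the linear predictor, the interaction-only scenario (parameterIndex = 3) can be sketched as below. The predictors and exposure here are stand-in uniform draws chosen for brevity, not the truncated normal covariates that gendata simulates, so this is a sketch of the formula rather than of the function's internals.

```r
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) +
                         0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 +
                         0.5 * sin(2 * pi * t)^3)

set.seed(123)
n <- 100

# Stand-in inputs (assumptions for this sketch only)
X3 <- runif(n)          # third predictor on [0, 1]
X4 <- runif(n)          # fourth predictor on [0, 1]
E  <- runif(n, -1, 1)   # exposure on [-1, 1]

# Linear predictor when the truth only has interactions:
# Y* = X_E * f3(X3) + X_E * f4(X4)
Y.star <- E * f3(X3) + E * f4(X4)

head(Y.star)
```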

The response is generated as $$Y = Y* + k*error$$ where Y* is the linear predictor, the error term is generated from a standard normal distribution, and k is chosen such that the signal-to-noise ratio is SNR = Var(Y*)/Var(k*error), i.e., the variance of the response variable Y due to error is 1/SNR of the variance of Y due to Y*.
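The scaling of the error term can be sketched as follows. This is a minimal illustration, assuming k is solved from the requirement that the variance of the scaled error equals Var(Y*)/SNR; the stand-in linear predictor is an arbitrary normal draw, not one of the scenarios above.

```r
set.seed(42)
n <- 1000
SNR <- 2

Y.star <- rnorm(n, sd = 3)  # stand-in linear predictor (assumption)
error  <- rnorm(n)          # standard normal error

# Choose k so that Var(Y.star) / Var(k * error) equals SNR
k <- sqrt(var(Y.star) / (SNR * var(error)))

Y <- Y.star + k * error

# Empirical signal-to-noise ratio of the generated response
var(Y.star) / var(k * error)
```

Because k is computed from the sample variances, the ratio on the last line matches SNR up to floating point error for any draw.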

The covariates are simulated as described in Huang et al. (2010). First, we generate \(w_1,\ldots, w_p, u, v\) independently from \(Normal(0,1)\) truncated to the interval [0,1], each for observations \(i=1,\ldots,n\). Then we set \(x_j = (w_j + t*u)/(1 + t)\) for \(j = 1,\ldots, 4\) and \(x_j = (w_j + t*v)/(1 + t)\) for \(j = 5,\ldots, p\), where the parameter \(t\) controls the amount of correlation among predictors. This leads to a compound symmetry correlation structure where \(Corr(x_j,x_k) = t^2/(1+t^2)\) for \(j \ne k\) within the first block (\(1 \le j, k \le 4\)) and within the second block (\(5 \le j, k \le p\)), while the covariates of the nonzero and zero components are independent of each other.
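The covariate construction above can be sketched in base R. The helper names rtnorm01 and sim_covariates are hypothetical and not part of the package, and the truncated normal draws use inverse-CDF sampling instead of truncnorm::rtruncnorm so that the sketch has no dependencies. Here t is treated as a free parameter of the sketch.

```r
# Hypothetical helper: Normal(0, 1) truncated to [a, b] via inverse-CDF sampling
rtnorm01 <- function(n, a = 0, b = 1) {
  stats::qnorm(stats::runif(n, stats::pnorm(a), stats::pnorm(b)))
}

# Hypothetical helper: n x p covariate matrix with two correlated blocks
sim_covariates <- function(n, p, t) {
  stopifnot(p >= 5)
  W <- matrix(rtnorm01(n * p), nrow = n, ncol = p)
  u <- rtnorm01(n)  # shared component for x_1, ..., x_4
  v <- rtnorm01(n)  # shared component for x_5, ..., x_p
  X <- W
  X[, 1:4] <- (W[, 1:4] + t * u) / (1 + t)
  X[, 5:p] <- (W[, 5:p] + t * v) / (1 + t)
  colnames(X) <- paste0("X", seq_len(p))
  X
}

set.seed(1)
t <- 1
X <- sim_covariates(n = 5000, p = 10, t = t)

theory      <- t^2 / (1 + t^2)      # within-block correlation from the text
emp_within  <- cor(X[, 1], X[, 2])  # same block
emp_between <- cor(X[, 1], X[, 5])  # different blocks, expected near 0
```

With a large n, emp_within should be close to theory and emp_between close to zero, which is the block structure described above.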

References

Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272-2297.

Huang, J., Horowitz, J. L., & Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics, 38(4), 2282.

Bhatnagar, S. R., Yang, Y., & Greenwood, C. M. T. (2018+). Sparse additive interaction models with the strong heredity property. Preprint.

Examples

DT <- gendata(n = 75, p = 100, corr = 0, betaE = 2, SNR = 1, parameterIndex = 1)