gendata.Rd
Function that generates data for the different simulation studies
presented in the accompanying paper. This function requires the
truncnorm package to be installed.
gendata(n, p, corr, E = truncnorm::rtruncnorm(n, a = -1, b = 1), betaE, SNR, parameterIndex)
Argument | Description |
---|---|
n | number of observations |
p | number of main effect variables (X) |
corr | correlation between predictors |
E | simulated environment vector of length n |
betaE | exposure effect size |
SNR | signal to noise ratio |
parameterIndex | simulation scenario index. See details for more information. |
A list with the following elements:
matrix of dimension n x p of simulated main effects
simulated response vector of length n
simulated exposure vector of length n
linear predictor vector of length n
the function f1 evaluated at x_1 (f1(X1))
the function f2 evaluated at x_2 (f2(X2))
the function f3 evaluated at x_3 (f3(X3))
the function f4 evaluated at x_4 (f4(X4))
the value for \(\beta_E\)
the function f1
the function f2
the function f3
the function f4
an n length vector of the first predictor
an n length vector of the second predictor
an n length vector of the third predictor
an n length vector of the fourth predictor
a character representing the simulation scenario identifier as described in Bhatnagar et al. (2018+)
character vector of causal variable names
character vector of noise variables
We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects and 3) interactions.
Truth obeys strong hierarchy (parameterIndex = 1)
$$Y* = \sum_{j=1}^{4} f_j(X_{j}) + \beta_E * X_{E} + X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth obeys weak hierarchy (parameterIndex = 2)
$$Y* = f_1(X_{1}) + f_2(X_{2}) + \beta_E * X_{E} + X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth only has interactions (parameterIndex = 3)
$$Y* = X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth is linear (parameterIndex = 4)
$$Y* = \sum_{j=1}^{4}\beta_j X_{j} + \beta_E * X_{E} + X_{E} * X_{3} + X_{E} * X_{4} $$
Truth only has main effects (parameterIndex = 5)
$$Y* = \sum_{j=1}^{4} f_j(X_{j}) + \beta_E * X_{E} $$
The functions are from the paper by Lin and Zhang (2006):
f2 <- function(t) 3 * (2 * t - 1)^2
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) + 0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)
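As an illustration of how these functions enter a scenario, the following minimal sketch evaluates the interactions-only linear predictor (parameterIndex = 3) by hand. The inputs `X3`, `X4` and `XE` are hypothetical uniform draws chosen only for this example; they are an assumption and do not reproduce the covariate scheme that gendata() itself uses.

```r
# f3 and f4 copied from the definitions above
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) +
  0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)

set.seed(1)
n <- 10

# Hypothetical stand-in inputs (assumption): plain uniform draws
X3 <- runif(n)
X4 <- runif(n)
XE <- runif(n, min = -1, max = 1)

# Linear predictor for the interactions-only scenario:
# Y* = X_E * f3(X3) + X_E * f4(X4)
Y_star <- XE * f3(X3) + XE * f4(X4)
```

Because neither `f3(X3)` nor `f4(X4)` appears on its own, the main effects of the third and fourth predictors are absent from this truth, which is what distinguishes it from the hierarchy scenarios.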
The response is generated as $$Y = Y* + k*error$$ where Y* is the linear predictor, the error term is generated from a standard normal distribution, and k is chosen such that the signal-to-noise ratio is SNR = Var(Y*)/Var(k*error), i.e., the variance of the response variable Y due to error is 1/SNR of the variance of Y due to Y*.
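The scaling described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's internal code: the linear predictor `Y_star` is a hypothetical stand-in, and the use of sample variances to compute `k` is an assumption.

```r
set.seed(42)
n <- 1000
SNR <- 2

# Hypothetical linear predictor (assumption): any numeric vector works here
Y_star <- rnorm(n, mean = 1, sd = 3)

# Standard normal error term
error <- rnorm(n)

# Choose k so that the scaled noise k * error has variance Var(Y_star) / SNR
k <- sqrt(var(Y_star) / (SNR * var(error)))

# Simulated response
Y <- Y_star + k * error

# Empirical signal-to-noise ratio; matches SNR up to floating-point error
snr_hat <- var(Y_star) / var(k * error)
```

A larger SNR shrinks `k`, so the response tracks the linear predictor more closely.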
The covariates are simulated as described in Huang et al. (2010). First,
we generate \(w_1,\ldots, w_p, u, v\) independently from \(Normal(0,1)\)
truncated to the interval [0,1] for each observation \(i=1,\ldots,n\).
Then we set \(x_j = (w_j + t*u)/(1 + t)\) for \(j = 1,\ldots, 4\) and
\(x_j = (w_j + t*v)/(1 + t)\) for \(j = 5,\ldots, p\), where the
parameter \(t\) controls the amount of correlation among predictors. This
leads to a compound symmetry correlation structure where
\(Corr(x_j,x_k) = t^2/(1+t^2)\) for \(j \ne k\) with \(1 \le j, k \le 4\),
and likewise for \(j \ne k\) with \(5 \le j, k \le p\), but the covariates
of the nonzero and zero components are independent.
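The construction above can be checked numerically. The sketch below is an illustration under stated assumptions: it draws the truncated normals by inverse-CDF sampling in base R (the helper `rtnorm01` is hypothetical and stands in for truncnorm::rtruncnorm), and it uses a small p and a large n so that empirical correlations are stable.

```r
set.seed(123)
n <- 1e5      # large n so empirical correlations are close to their targets
p <- 6        # four covariates in the first block, two in the second
t_corr <- 1   # the parameter t from the text

# Standard normal truncated to [0, 1], drawn by inverse-CDF sampling
# (hypothetical base R stand-in for truncnorm::rtruncnorm)
rtnorm01 <- function(n) qnorm(runif(n, min = pnorm(0), max = pnorm(1)))

W <- matrix(rtnorm01(n * p), nrow = n, ncol = p)
u <- rtnorm01(n)
v <- rtnorm01(n)

X <- W
X[, 1:4] <- (W[, 1:4] + t_corr * u) / (1 + t_corr)  # first four share u
X[, 5:p] <- (W[, 5:p] + t_corr * v) / (1 + t_corr)  # the rest share v

expected <- t_corr^2 / (1 + t_corr^2)  # theoretical within-block correlation
within_first <- cor(X[, 1], X[, 2])    # close to expected
within_rest <- cor(X[, 5], X[, 6])     # close to expected
across <- cor(X[, 1], X[, 5])          # close to zero
```

Setting `t_corr <- 0` removes the shared components, so all covariates become independent.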
Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272-2297.
Huang, J., Horowitz, J. L., & Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics, 38(4), 2282.
Bhatnagar, S. R., Yang, Y., & Greenwood, C. M. T. (2018+). Sparse additive interaction models with the strong heredity property. Preprint.
DT <- gendata(n = 75, p = 100, corr = 0, betaE = 2, SNR = 1, parameterIndex = 1)