gendata.Rd
Function that generates data for the different simulation studies presented in the accompanying paper. This function requires the truncnorm package to be installed.
gendata(n, p, corr, E = truncnorm::rtruncnorm(n, a = -1, b = 1), betaE, SNR, parameterIndex)
Argument | Description |
---|---|
n | number of observations |
p | number of main effect variables (X) |
corr | correlation between predictors |
E | simulated environment vector of length n |
betaE | exposure effect size |
SNR | signal to noise ratio |
parameterIndex | simulation scenario index. See details for more information. |
A list with the following elements:
matrix of dimension n x p of simulated main effects
simulated response vector of length n
simulated exposure vector of length n
linear predictor vector of length n
the function f1 evaluated at x_1 (f1(X1))
the function f2 evaluated at x_2 (f2(X2))
the function f3 evaluated at x_3 (f3(X3))
the function f4 evaluated at x_4 (f4(X4))
the value for βE
the function f1
the function f2
the function f3
the function f4
a vector of length n containing the first predictor
a vector of length n containing the second predictor
a vector of length n containing the third predictor
a vector of length n containing the fourth predictor
a character representing the simulation scenario identifier as described in Bhatnagar et al. (2018+)
character vector of causal variable names
character vector of noise variables
We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects and 3) interactions.
Truth obeys weak hierarchy (parameterIndex = 2): $Y^* = f_1(X_1) + f_2(X_2) + \beta_E X_E + X_E f_3(X_3) + X_E f_4(X_4)$
Truth only has interactions (parameterIndex = 3): $Y^* = X_E f_3(X_3) + X_E f_4(X_4)$
Truth is linear (parameterIndex = 4): $Y^* = \sum_{j=1}^{4} \beta_j X_j + \beta_E X_E + X_E X_3 + X_E X_4$
Truth only has main effects (parameterIndex = 5): $Y^* = \sum_{j=1}^{4} f_j(X_j) + \beta_E X_E$
The functions are from the paper by Lin and Zhang (2006):
f2 <- function(t) 3 * (2 * t - 1)^2
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) + 0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)
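As a minimal sketch of how these component functions enter a linear predictor, the snippet below evaluates the interactions-only truth (parameterIndex = 3) by hand. The uniform draws for `X3`, `X4` and `XE` are illustrative stand-ins chosen for this example, not the covariates that `gendata` simulates.

```r
# Component functions copied from the definitions above
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) +
  0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)

set.seed(1)
n <- 50

# Assumption: uniform inputs used purely for illustration
X3 <- runif(n)
X4 <- runif(n)
XE <- runif(n, min = -1, max = 1)

# Interactions-only truth: Y* = XE * f3(X3) + XE * f4(X4)
Y_star <- XE * f3(X3) + XE * f4(X4)
head(Y_star)
```

Because the exposure multiplies both terms, `Y_star` is zero wherever `XE` is zero, which is what distinguishes this scenario from the ones with main effects.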
The response is generated as $Y = Y^* + k \cdot \text{error}$, where $Y^*$ is the linear predictor, the error term is drawn from a standard normal distribution, and $k$ is chosen such that the signal-to-noise ratio is $\text{SNR} = \mathrm{Var}(Y^*) / \mathrm{Var}(k \cdot \text{error})$, i.e., the variance of the response variable $Y$ due to error is 1/SNR of the variance of $Y$ due to $Y^*$.
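One way to choose k from sample variances can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact computation inside `gendata`; the linear predictor `Y_star` here is a hypothetical placeholder drawn at random.

```r
set.seed(123)
n <- 200
SNR <- 2

# Assumption: a made-up linear predictor, used only to demonstrate the scaling
Y_star <- rnorm(n, mean = 1, sd = 3)
error <- rnorm(n)  # standard normal noise

# Solve Var(Y*) / Var(k * error) = SNR for k, using sample variances
k <- sqrt(var(Y_star) / (SNR * var(error)))
Y <- Y_star + k * error

# Empirical signal-to-noise ratio of the scaled noise
var(Y_star) / var(k * error)
```

Since `var(k * error)` equals `k^2 * var(error)`, the final ratio matches the requested `SNR` up to floating-point error.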
The covariates are simulated as described in Huang et al. (2010). First, for each observation $i = 1, \ldots, n$, we generate $w_1, \ldots, w_p, u, v$ independently from Normal(0,1) truncated to the interval [0,1]. Then we set $x_j = (w_j + t u)/(1 + t)$ for $j = 1, \ldots, 4$ and $x_j = (w_j + t v)/(1 + t)$ for $j = 5, \ldots, p$, where the parameter $t$ controls the amount of correlation among predictors. This leads to a compound symmetry correlation structure where $\mathrm{Corr}(x_j, x_k) = t^2/(1 + t^2)$ for $1 \le j \le 4$, $1 \le k \le 4$, $j \ne k$, and $\mathrm{Corr}(x_j, x_k) = t^2/(1 + t^2)$ for $5 \le j \le p$, $5 \le k \le p$, $j \ne k$, but the covariates of the nonzero and zero components are independent.
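The construction above can be sketched in base R to check the stated correlation structure empirically. The helper `rtnorm01` is a hypothetical inverse-CDF sampler written for this example as a stand-in for `truncnorm::rtruncnorm`, and `t_corr` plays the role of the parameter t; neither name comes from the package.

```r
# Hypothetical helper: draws from Normal(0, 1) truncated to [a, b] by inverse CDF
rtnorm01 <- function(n, a = 0, b = 1) qnorm(runif(n, pnorm(a), pnorm(b)))

set.seed(42)
n <- 5000
p <- 8
t_corr <- 1  # the parameter t in the text

W <- matrix(rtnorm01(n * p), nrow = n, ncol = p)
u <- rtnorm01(n)
v <- rtnorm01(n)

# First four covariates share u; the remaining ones share v
X <- W
X[, 1:4] <- (W[, 1:4] + t_corr * u) / (1 + t_corr)
X[, 5:p] <- (W[, 5:p] + t_corr * v) / (1 + t_corr)

cor(X[, 1], X[, 2])        # within-block: should be near the value below
t_corr^2 / (1 + t_corr^2)  # theoretical within-block correlation
cor(X[, 1], X[, 5])        # across blocks: should be near zero
```

A large `n` is used here only so that the sample correlations sit close to their theoretical values.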
Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272-2297.
Huang, J., Horowitz, J. L., & Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics, 38(4), 2282.
Bhatnagar, S. R., Yang, Y., & Greenwood, C. M. T. (2018+). Sparse additive interaction models with the strong heredity property. Preprint.
DT <- gendata(n = 75, p = 100, corr = 0, betaE = 2, SNR = 1, parameterIndex = 1)