R/utils.R
gen_structured_model.Rd
Function that generates data for the different simulation studies
presented in the accompanying paper. This function requires the
popkin and bnpsd packages to be installed.
gen_structured_model(
  n,
  p_design,
  p_kinship,
  k,
  s,
  Fst,
  b0,
  nPC = 10,
  eta,
  sigma2,
  geography = c("ind", "1d", "circ"),
  percent_causal,
  percent_overlap,
  train_tune_test = c(0.6, 0.2, 0.2)
)
Argument | Description |
---|---|
n | number of observations to simulate |
p_design | number of variables in X_test, i.e., the design matrix |
p_kinship | number of variables in X_kinship, i.e., the matrix used to calculate kinship |
k | number of intermediate subpopulations. |
s | the desired bias coefficient, which specifies sigma indirectly. Required if sigma is missing |
Fst | The desired final FST of the admixed individuals. Required if sigma is missing |
b0 | the true intercept parameter |
nPC | number of principal components to include in the design matrix used for regression adjustment for population structure via principal components. This matrix is used as the input in a standard lasso regression routine, where there are no random effects. |
eta | the true eta parameter, which has to be |
sigma2 | the true sigma2 parameter |
geography | the type of geography for simulating the kinship matrix. "ind" is independent populations where every individual is actually unadmixed, "1d" is a 1D geography and "circ" is circular geography. Default: "ind". See the functions in the |
percent_causal | percentage of |
percent_overlap | this represents the percentage of causal SNPs that will also be included in the calculation of the kinship matrix |
train_tune_test | the proportions of the sample used for training, tuning parameter selection, and testing. The default is a 60/20/20 split |
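To make the train_tune_test argument concrete, the following base R sketch shows one way a sample of size n could be divided by three proportions. The helper split_indices is hypothetical and is not part of the package; how gen_structured_model assigns individuals internally is not documented here, so the random permutation and the rounding rule are assumptions.

```r
# Hypothetical helper (not from the package): split 1..n into
# training, tuning and test index sets by the given proportions.
split_indices <- function(n, train_tune_test = c(0.6, 0.2, 0.2)) {
  stopifnot(length(train_tune_test) == 3,
            abs(sum(train_tune_test) - 1) < 1e-8)
  n_train <- round(n * train_tune_test[1])
  n_tune  <- round(n * train_tune_test[2])
  n_test  <- n - n_train - n_tune          # remainder goes to the test set
  shuffled <- sample(seq_len(n))           # assumption: random assignment
  list(
    train = shuffled[seq_len(n_train)],
    tune  = shuffled[n_train + seq_len(n_tune)],
    test  = shuffled[n_train + n_tune + seq_len(n_test)]
  )
}

set.seed(1)
idx <- split_indices(100, c(0.8, 0.1, 0.1))
lengths(idx)
```

With n = 100 and proportions c(0.8, 0.1, 0.1), the three sets hold 80, 10 and 10 indices, and every observation appears in exactly one set.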
A list with the following elements:
Element | Description |
---|---|
ytrain | simulated response vector for training set |
ytune | simulated response vector for tuning parameter selection set |
ytest | simulated response vector for test set |
xtrain | simulated design matrix for training set |
xtune | simulated design matrix for tuning parameter selection set |
xtest | simulated design matrix for testing set |
xtrain_lasso | simulated design matrix for training set for lasso model. This is the same as xtrain, but also includes the nPC principal components |
xtune_lasso | simulated design matrix for tuning parameter selection set for lasso model. This is the same as xtune, but also includes the nPC principal components |
xtest_lasso | simulated design matrix for testing set for lasso model. This is the same as xtest, but also includes the nPC principal components |
causal | character vector of the names of the causal SNPs |
beta | the vector of true regression coefficients |
kin_train | 2 times the estimated kinship for the training set individuals |
kin_tune_train | the covariance matrix between the tuning set and the training set individuals |
kin_test_train | the covariance matrix between the test set and the training set individuals |
Xkinship | the matrix of SNPs used to estimate the kinship matrix |
not_causal | character vector of the non-causal SNPs |
PC | the principal components for population structure adjustment |
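The relationship between the plain design matrices and their lasso counterparts can be illustrated with a small base R sketch. It assumes that the principal components are the leading eigenvectors of a relatedness matrix built from centered genotypes; the package's actual computation is not described on this page, and the toy objects below (G, K, the column names) are invented for illustration.

```r
set.seed(42)
n <- 20; p <- 5; nPC <- 2

# Toy design matrix standing in for xtrain
xtrain <- matrix(rnorm(n * p), n, p,
                 dimnames = list(NULL, paste0("X", seq_len(p))))

# Toy genotype matrix (0/1/2 allele counts) standing in for the kinship SNPs
G  <- matrix(rbinom(n * 50, size = 2, prob = 0.3), n, 50)
Gc <- scale(G, center = TRUE, scale = FALSE)   # center each SNP
K  <- tcrossprod(Gc) / ncol(G)                 # assumption: simple relatedness matrix

# Leading nPC eigenvectors serve as principal components
PC <- eigen(K, symmetric = TRUE)$vectors[, seq_len(nPC), drop = FALSE]
colnames(PC) <- paste0("PC", seq_len(nPC))

# Lasso design matrix: original columns plus the nPC principal components
xtrain_lasso <- cbind(xtrain, PC)
dim(xtrain_lasso)
```

The resulting matrix has the p original columns followed by nPC extra columns, which is the sense in which xtrain_lasso is "the same as xtrain, but also includes the nPC principal components".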
The kinship is estimated using the popkin function from the popkin
package. This function then multiplies that kinship matrix by 2 to give
the expected covariance matrix, which is subsequently used in the
linear mixed models.
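As a rough illustration of the factor of 2, the sketch below uses a hand-written toy kinship matrix rather than a popkin estimate, doubles it, and extracts blocks named after the list elements kin_train, kin_tune_train and kin_test_train. The index sets and the orientation of the cross blocks (rows for the tuning or test individuals, columns for the training individuals) are assumptions made for this example.

```r
# Toy kinship matrix for 4 individuals (self-kinship 0.5 for non-inbred
# individuals); individuals 1-2 and 3-4 are related pairs.
kinship <- matrix(c(0.50, 0.25, 0.00,  0.00,
                    0.25, 0.50, 0.00,  0.00,
                    0.00, 0.00, 0.50,  0.125,
                    0.00, 0.00, 0.125, 0.50),
                  nrow = 4, byrow = TRUE)

# Expected covariance matrix: 2 times the kinship matrix
Phi <- 2 * kinship

# Assumed split of individuals into training, tuning and test sets
train <- c(1, 3); tune <- 2; test <- 4

kin_train      <- Phi[train, train, drop = FALSE]  # training covariance
kin_tune_train <- Phi[tune,  train, drop = FALSE]  # tuning vs. training
kin_test_train <- Phi[test,  train, drop = FALSE]  # test vs. training

diag(Phi)
```

After doubling, non-inbred individuals have unit variance on the diagonal, which is why the doubled matrix is convenient as a covariance matrix for the random effect.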
admixed <- gen_structured_model(n = 100,
                                p_design = 50,
                                p_kinship = 5e2,
                                geography = "1d",
                                percent_causal = 0.10,
                                percent_overlap = "100",
                                k = 5, s = 0.5, Fst = 0.1,
                                b0 = 0, nPC = 10,
                                eta = 0.1, sigma2 = 1,
                                train_tune_test = c(0.8, 0.1, 0.1))
names(admixed)
#>  [1] "ytrain"         "ytune"          "ytest"          "xtrain"
#>  [5] "xtune"          "xtest"          "xtrain_lasso"   "xtune_lasso"
#>  [9] "xtest_lasso"    "Xkinship"       "kin_train"      "kin_tune_train"
#> [13] "kin_test_train" "mu_train"       "causal"         "beta"
#> [17] "not_causal"     "kinship"        "coancestry"     "PC"
#> [21] "subpops"