In this short post, I show how to calculate polygenic risk scores (PRS) using the data.table
package in R
. I will show an example on a small dataset, but can be easily extended to much larger datasets. The PRS based on $p$
SNPs is given by:
$$ PRS_i = \sum_{j=1}^{p}\beta_j \times SNP_{ij} $$
where $\beta_j$
is the beta coefficient for the $j^{th}$
SNP, and $SNP_{ij}$
is the value of $j^{th}$
SNP for the $i^{th}$
individual.
First we load the data.table
package:
pacman::p_load(data.table)
Next we create some sample data, where the rows are SNPs and the columns are individuals. This data.table
also contains a column with the beta coefficients for each SNP.
# sample data, rows are SNPs, columns are individuals
DT <- data.table(ID1 = sample(0:2, 10, replace = T),
ID2 = sample(0:2, 10, replace = T),
ID3 = sample(0:2, 10, replace = T),
beta = rnorm(10))
DT
## ID1 ID2 ID3 beta
## 1: 2 0 1 2.35674416
## 2: 2 0 1 0.22178574
## 3: 2 0 2 -0.15026475
## 4: 0 1 0 -0.97745574
## 5: 0 0 1 -0.39511441
## 6: 0 0 1 1.10683879
## 7: 2 2 2 0.42516701
## 8: 2 2 2 -0.93244438
## 9: 0 0 1 0.09155816
## 10: 0 2 1 -1.42107331
In situations where there are many individuals, this can be a tedious calculation because standard methods would require to type out the name of all of the columns to be multiplied by. Instead, using the .SDcols
argument, greatly simplifies this process by allowing columns to be called on dynamically.
We first create a character vector of the column names that we want to multiply beta by as well as the new column names that we want to store the results in:
# the names of the columns to be multiplied by beta
id_names <- paste0("ID",1:3)
# the names of the new columns that have been multiplied by beta
new_id_names <- paste0("new_",id_names)
Now with a simple command we get the desired multiplication:
DT[, (new_id_names) := .SD * beta, .SDcols = id_names]
DT
## ID1 ID2 ID3 beta new_ID1 new_ID2 new_ID3
## 1: 2 0 1 2.35674416 4.7134883 0.0000000 2.35674416
## 2: 2 0 1 0.22178574 0.4435715 0.0000000 0.22178574
## 3: 2 0 2 -0.15026475 -0.3005295 0.0000000 -0.30052949
## 4: 0 1 0 -0.97745574 0.0000000 -0.9774557 0.00000000
## 5: 0 0 1 -0.39511441 0.0000000 0.0000000 -0.39511441
## 6: 0 0 1 1.10683879 0.0000000 0.0000000 1.10683879
## 7: 2 2 2 0.42516701 0.8503340 0.8503340 0.85033403
## 8: 2 2 2 -0.93244438 -1.8648888 -1.8648888 -1.86488876
## 9: 0 0 1 0.09155816 0.0000000 0.0000000 0.09155816
## 10: 0 2 1 -1.42107331 0.0000000 -2.8421466 -1.42107331
The PRS for each individual can be calculated using the colSums
function in base R
:
DT[, colSums(.SD), .SDcols = new_id_names]
## new_ID1 new_ID2 new_ID3
## 3.8419756 -4.8341571 0.6456549