In this post I discuss how to create custom contrasts for factor variables in R. First, let's create some simulated data and store Disease status as a factor:
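The original simulation code is not shown here, so the following is only a minimal sketch of what such data could look like. The data frame name `DT`, the response `y`, the group size, and the distributions are all illustrative assumptions; only the five disease groups and the age and sex covariates come from the text.

```r
# Hypothetical simulated data: names, sizes and distributions are assumptions
set.seed(123)
groups <- c("Control", "Myositis", "RA", "Scleroderma", "SLE")
n_per_group <- 20
n_total <- length(groups) * n_per_group

DT <- data.frame(
  Disease = rep(groups, each = n_per_group),
  age     = round(runif(n_total, min = 20, max = 80)),
  sex     = sample(c("F", "M"), n_total, replace = TRUE),
  # response with a different (made-up) mean in each group
  y       = rnorm(n_total, mean = rep(c(10, 12, 15, 11, 13), each = n_per_group))
)

# Store Disease as an unordered factor with Control as the first (reference) level
DT$Disease <- factor(DT$Disease, levels = groups)

levels(DT$Disease)
table(DT$Disease)
```

Putting `Control` first in `levels` matters: with R's default coding, the first level is the reference group.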

We want the following contrasts:

• Control versus all 4 diseases combined
• RA versus the combination of (SLE, Scleroderma, Myositis), leaving out the Controls

Default settings

Let $x_1,x_2,x_3,x_4$ be the indicators for Myositis, RA, Scleroderma and SLE, respectively. The standard linear model R will fit is given by (for simplicity I am ignoring age and sex; adding them to the model does not change the argument):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$$

This corresponds to R's default contrast matrix for unordered factors (treatment contrasts):

| | Myositis | RA | Scleroderma | SLE |
|---|---|---|---|---|
| Control | 0 | 0 | 0 | 0 |
| Myositis | 1 | 0 | 0 | 0 |
| RA | 0 | 1 | 0 | 0 |
| Scleroderma | 0 | 0 | 1 | 0 |
| SLE | 0 | 0 | 0 | 1 |

This compares the mean response of each disease group (Myositis, RA, Scleroderma and SLE) separately with the mean response of the Controls. The table is read column by column: each column holds the weight that one regression coefficient receives in each group's mean, e.g. the first column compares Myositis with Control.
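To inspect the default coding directly, `contrasts()` can be called on any unordered factor. A minimal sketch, using a toy `Disease` factor that is an illustrative assumption rather than the post's data:

```r
# Toy factor with the five groups; Control is the first (reference) level
groups <- c("Control", "Myositis", "RA", "Scleroderma", "SLE")
Disease <- factor(rep(groups, each = 2), levels = groups)

# Default coding for unordered factors is treatment (dummy) coding
getOption("contrasts")

# Rows are factor levels, columns are the indicator variables R creates
default_contrasts <- contrasts(Disease)
default_contrasts

# The same matrix, built explicitly
contr.treatment(groups)
```

Each column here becomes one column of the design matrix, which is why the coefficients are differences from the Control mean.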

Custom Contrasts

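Since we want only two contrasts, we want R to fit the following model, writing $c_1$ and $c_2$ for the two contrast variables whose values in each group are given by the columns of the contrast matrix:

$$y = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \varepsilon$$

where $\beta_1$ represents the contrast estimate for the comparison between Controls and all other diseases, and $\beta_2$ represents the contrast estimate of RA versus the combination of SLE, Scleroderma and Myositis.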

To create custom contrasts, we must specify the contrast matrix as follows:

| | Control_vs_All | RA_vs_Myos_Scle_SLE |
|---|---|---|
| Control | 0.8 | 0.0000000 |
| Myositis | -0.2 | -0.3333333 |
| RA | -0.2 | 1.0000000 |
| Scleroderma | -0.2 | -0.3333333 |
| SLE | -0.2 | -0.3333333 |

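Again we read the table column by column. The groups being contrasted should have opposite signs, and each column should sum to 0. This contrast matrix leads to the following mean response equations for each of the groups:

$$
\begin{aligned}
\mu_{control} &= \beta_0 + 0.8\,\beta_1 \\
\mu_{myos} &= \beta_0 - 0.2\,\beta_1 - \tfrac{1}{3}\beta_2 \\
\mu_{ra} &= \beta_0 - 0.2\,\beta_1 + \beta_2 \\
\mu_{scler} &= \beta_0 - 0.2\,\beta_1 - \tfrac{1}{3}\beta_2 \\
\mu_{sle} &= \beta_0 - 0.2\,\beta_1 - \tfrac{1}{3}\beta_2
\end{aligned}
$$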

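To solve for $\beta_0$ we can add up the equations for all five groups. Because each column of the contrast matrix sums to 0, the $\beta_1$ and $\beta_2$ terms cancel, and we get

$$\beta_0 = \frac{\mu_{control}+\mu_{myos}+\mu_{ra}+\mu_{scler}+\mu_{sle}}{5}$$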

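To solve for $\beta_1$ we subtract the combined mean of $\mu_{myos},\mu_{ra},\mu_{scler}$ and $\mu_{sle}$ from $\mu_{control}$ (Control carries the positive weight in the first column), which gives:

$$\beta_1 = \mu_{control} - \frac{\mu_{myos}+\mu_{ra}+\mu_{scler}+\mu_{sle}}{4}$$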

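To solve for $\beta_2$ we subtract the combined mean of $\mu_{myos},\mu_{scler}$ and $\mu_{sle}$ from $\mu_{ra}$ (RA carries the positive weight in the second column), which gives:

$$\frac{4}{3}\,\beta_2 = \mu_{ra} - \frac{\mu_{myos}+\mu_{scler}+\mu_{sle}}{3}$$

With the weights $1$ and $-\tfrac{1}{3}$ used here, $\beta_2$ is therefore $\tfrac{3}{4}$ of the difference between the RA mean and the combined mean of the other three diseases.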

First we create the contrast matrix with appropriate row and column names for clarity:
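A minimal sketch of this step, using the weights from the contrast matrix shown earlier. The object name `contrast_matrix` is an assumption chosen for illustration.

```r
groups <- c("Control", "Myositis", "RA", "Scleroderma", "SLE")

# One column per contrast; rows follow the order of the factor levels
contrast_matrix <- cbind(
  Control_vs_All      = c(4/5, -1/5, -1/5, -1/5, -1/5),
  RA_vs_Myos_Scle_SLE = c(0,   -1/3,  1,   -1/3, -1/3)
)
rownames(contrast_matrix) <- groups

contrast_matrix

# Each column should sum to 0 (up to floating-point error)
colSums(contrast_matrix)
```

The row order has to match `levels(Disease)`, since R matches rows to levels by position.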

Then we assign the contrast matrix to the Disease variable through its contrasts attribute. The how.many argument specifies how many contrasts we want, so it should equal the number of columns in the contrast matrix.
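A sketch of the assignment, again with a toy `Disease` factor and the hypothetical name `contrast_matrix`. It also shows what happens when `how.many` is left out: R extends the matrix to one column fewer than the number of levels.

```r
groups <- c("Control", "Myositis", "RA", "Scleroderma", "SLE")
Disease <- factor(rep(groups, each = 2), levels = groups)

contrast_matrix <- cbind(
  Control_vs_All      = c(4/5, -1/5, -1/5, -1/5, -1/5),
  RA_vs_Myos_Scle_SLE = c(0,   -1/3,  1,   -1/3, -1/3)
)
rownames(contrast_matrix) <- groups

# Keep exactly two contrasts
contrasts(Disease, how.many = 2) <- contrast_matrix
contrasts(Disease)

# Without how.many, R pads the matrix with extra columns (4 in total here)
Disease_default <- factor(rep(groups, each = 2), levels = groups)
contrasts(Disease_default) <- contrast_matrix
ncol(contrasts(Disease_default))
```

With only two columns, the model has three mean parameters, so Myositis, Scleroderma and SLE share a single fitted mean.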

Here we check that the lm fit gives the same results as the formulas derived above:
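A self-contained sketch of such a check, under stated assumptions: the data are freshly simulated with equal group sizes, age and sex are left out, and the names `y`, `contrast_matrix` and `by_hand` are illustrative. The comparison with plain group means relies on the groups being balanced; with unequal group sizes the pooled mean of Myositis, Scleroderma and SLE would be a weighted average instead.

```r
set.seed(1)
groups <- c("Control", "Myositis", "RA", "Scleroderma", "SLE")
n <- 30  # equal group sizes (balanced design)
Disease <- factor(rep(groups, each = n), levels = groups)
y <- rnorm(5 * n, mean = rep(c(10, 12, 15, 11, 13), each = n))

contrast_matrix <- cbind(
  Control_vs_All      = c(4/5, -1/5, -1/5, -1/5, -1/5),
  RA_vs_Myos_Scle_SLE = c(0,   -1/3,  1,   -1/3, -1/3)
)
rownames(contrast_matrix) <- groups
contrasts(Disease, how.many = 2) <- contrast_matrix

# Fit the model with the two custom contrasts
fit <- lm(y ~ Disease)
coef(fit)

# Sample mean of the response in each group
m <- sapply(split(y, Disease), mean)

# Coefficients computed from the group means
by_hand <- c(
  mean(m),                                                           # beta_0
  m[["Control"]] - mean(m[c("Myositis", "RA", "Scleroderma", "SLE")]),  # beta_1
  (3/4) * (m[["RA"]] - mean(m[c("Myositis", "Scleroderma", "SLE")]))    # beta_2
)

all.equal(unname(coef(fit)), by_hand)
```

The factor of 3/4 in the last line comes from the weights 1 and -1/3 in the second column of the contrast matrix.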