# Introduction

## casebase: An Alternative Framework for Survival Analysis

This vignette introduces the main functions in the `casebase` package.
The methods implemented in this package are based on the method
developed in Fitting Smooth-in-Time Prognostic Risk Functions via
Logistic Regression (Hanley and Miettinen, 2009).
A rigorous treatment of the theory is developed in A case-base sampling
method for estimating recurrent event intensities (Saarela, 2015)
and Non-parametric Bayesian Hazard Regression for Chronic Disease Risk
Assessment.
The motivation for this work is nicely summarised by Cox:

### Why another package for survival analysis?

The purpose of the `casebase` package is to provide practitioners with
an easy-to-use software tool to predict the risk (or cumulative
incidence (CI)) of an event for a particular patient. The following
points should be noted:

- Time matching/risk set sampling (including Cox partial likelihood) eliminates the baseline hazard from the likelihood expression for the hazard ratios
- If, however, the absolute risks are of interest, they have to be recovered using the semi-parametric Breslow estimator
- Is there an alternative approach that fits flexible hazard models and estimates absolute risks directly, without this two-step approach? Yes: Hanley and Miettinen, 2009

Hanley and Miettinen, 2009 propose a fully parametric hazard model that can be fit via logistic regression. From the fitted hazard function, cumulative incidence and, thus, risk functions of time, treatment and profile can be easily derived.
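The sampling idea behind this approach can be illustrated with a minimal, self-contained sketch on simulated data (hypothetical values throughout; `casebase::fitSmoothHazard` performs this sampling internally). For a constant hazard, we combine the event person-moments (the case series) with a base series of person-moments sampled from the total follow-up time \( B \), and fit a logistic regression with an offset of \( \log(B/b) \):

```r
# Minimal sketch of case-base sampling for a constant hazard.
# Illustration only -- casebase::fitSmoothHazard() handles this internally.
set.seed(42)
h_true <- 0.2; n <- 5000; tau <- 5       # true hazard, subjects, admin. censoring
surv_time <- rexp(n, rate = h_true)
time  <- pmin(surv_time, tau)
event <- as.numeric(surv_time <= tau)

B <- sum(time)            # total person-time (the study base)
d <- sum(event)           # number of events (the case series)
b <- 100 * d              # size of the base series (~100 moments per case)

# One row per event (case series) plus b sampled person-moments (base series).
# With covariates, each base row would also carry that moment's x and t.
status <- c(rep(1, d), rep(0, b))
fit <- glm(status ~ 1, family = binomial, offset = rep(log(B / b), d + b))
exp(coef(fit))            # estimate of the constant hazard; close to 0.2
```

The intercept of this logistic fit recovers the log hazard because the offset accounts for the rate at which the base series was sampled.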

### Parametric family of hazard functions

The `casebase` package fits the family of hazard functions of the form

\[ \log h(x, t; \alpha, \beta) = g(t; \alpha) + \beta x, \]

where \( t \) denotes the numerical value (number of units) of a point in prognostic/prospective time and \( x \) is the realization of the vector \( X \) of variates based on the patient's profile and intervention (if any). Different functions of \( t \) lead to different parametric hazard models.

The simplest of these models is the one-parameter exponential distribution, which is obtained by taking the hazard function to be constant over the range of \( t \):

\[ \log h(x, t; \alpha, \beta) = \alpha + \beta x. \]

The instantaneous failure rate is independent of \( t \), so that the conditional chance of failure in a time interval of specified length is the same regardless of how long the individual has been on study; this is the memoryless property (Kalbfleisch and Prentice, 2002).
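The memoryless property can be checked numerically with base R's exponential distribution functions (a small sketch with arbitrary hazard and time values):

```r
# Memoryless property: P(T > s + t | T > s) = P(T > t) for a constant hazard.
h <- 0.5; s <- 2; t <- 3
p_cond <- pexp(s + t, rate = h, lower.tail = FALSE) /
          pexp(s,     rate = h, lower.tail = FALSE)
p_marg <- pexp(t, rate = h, lower.tail = FALSE)
all.equal(p_cond, p_marg)  # TRUE
```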

The Gompertz hazard model is obtained by including a linear term for time:

\[ \log h(x, t; \alpha, \beta) = \alpha_0 + \alpha_1 t + \beta x. \]

Use of \( \log(t) \) yields the Weibull hazard, which allows for a power dependence of the hazard on time (Kalbfleisch and Prentice, 2002):

\[ \log h(x, t; \alpha, \beta) = \alpha_0 + \alpha_1 \log(t) + \beta x. \]
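The power dependence on time can be verified against base R's Weibull distribution functions, since the hazard is the density divided by the survival function (a small sketch with arbitrary shape and scale values):

```r
# Weibull hazard: h(t) = (k / lambda) * (t / lambda)^(k - 1),
# so log h(t) is linear in log(t).
k <- 1.5; lambda <- 2; t <- c(0.5, 1, 2, 4)
h_num    <- dweibull(t, shape = k, scale = lambda) /
            pweibull(t, shape = k, scale = lambda, lower.tail = FALSE)
h_closed <- (k / lambda) * (t / lambda)^(k - 1)
all.equal(h_num, h_closed)  # TRUE
```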

Recall the relative risk model (Cox, 1972):

\[ \lambda(t \mid x) = \lambda_0(t) \exp(\beta x), \]

where \( \lambda_0(\cdot) \) is an arbitrary, unspecified baseline hazard function for continuous time.

## Cox Model vs. Case-base Sampling

In the following table we provide a comparison between the Cox model and case-base sampling:

| Feature | Cox model | Case-base sampling |
|---|---|---|
| model type | semi-parametric | fully parametric (logistic/multinomial regression) |
| time | left-hand side of the equation | right-hand side; allows flexible modeling of time |
| cumulative incidence | step function | smooth-in-time curve |
| non-proportional hazards | interaction of covariates with time | interaction of covariates with time |
| model testing | | makes use of the GLM framework (LRT, AIC, BIC) |
| competing risks | difficult | cause-specific cumulative incidence functions (CIFs) obtained directly via multinomial regression |
| prediction | Kaplan-Meier-based | ROC, AUC, risk reclassification probabilities |

## Intuition Behind Case-base Sampling

## Load Required Packages

We first load the required packages:

```r
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(survival)
pacman::p_load(casebase)
pacman::p_load(splines)
```

## European Randomized Study of Prostate Cancer Screening Data

Throughout this vignette, we make use of the European Randomized Study
of Prostate Cancer Screening data, which ships with the `casebase`
package:

```r
data("ERSPC")
head(ERSPC)
```

```
##   ScrArm Follow.Up.Time DeadOfPrCa
## 1      1         0.0027          0
## 2      1         0.0027          0
## 3      1         0.0027          0
## 4      0         0.0027          0
## 5      0         0.0027          0
## 6      0         0.0027          0
```

```r
ERSPC$ScrArm <- factor(ERSPC$ScrArm, levels = c(0, 1),
                       labels = c("Control group", "Screening group"))
```

The results of this study were published by Schroder FH, et al. N Engl
J Med 2009. There's a really interesting story on how this data was
obtained. See `help(ERSPC)` and Liu Z, Rich B, Hanley JA, Recovering
the raw data behind a non-parametric survival curve. Systematic Reviews
2014 for further details.

## Population Time Plot

Population time plots can be extremely informative graphical displays of
survival data. They should be the first step in your exploratory data
analyses. We facilitate this task in the `casebase` package using the
`popTime` function. We first create the necessary dataset for producing
the population time plots:

```r
pt_object <- casebase::popTime(ERSPC, event = "DeadOfPrCa")
```

```
## 'Follow.Up.Time' will be used as the time variable
```

We can see its contents and its class:

```r
head(pt_object)
```

```
##             ScrArm   time event original.time original.event event status
## 1: Screening group 0.0027     0        0.0027              0     censored
## 2: Screening group 0.0027     0        0.0027              0     censored
## 3: Screening group 0.0027     0        0.0027              0     censored
## 4:   Control group 0.0027     0        0.0027              0     censored
## 5:   Control group 0.0027     0        0.0027              0     censored
## 6:   Control group 0.0027     0        0.0027              0     censored
##    ycoord yc n_available
## 1: 159893  0           0
## 2: 159892  0           0
## 3: 159891  0           0
## 4: 159890  0           0
## 5: 159889  0           0
## 6: 159888  0           0
```

```r
class(pt_object)
```

```
## [1] "popTime"    "data.table" "data.frame"
```

The `casebase` package has a `plot` method for objects of class
`popTime`:

```r
plot(pt_object)
```

Can you explain the distinct shape of the grey area?

## Exposure Stratified Population Time Plot

We can also create exposure stratified plots by specifying the
`exposure` argument in the `popTime` function:

```r
pt_object_strat <- casebase::popTime(ERSPC, event = "DeadOfPrCa", exposure = "ScrArm")
```

```
## 'Follow.Up.Time' will be used as the time variable
```

We can see its contents and its class:

```r
head(pt_object_strat)
```

```
## $data
##                  ScrArm    time event original.time original.event
##      1:   Control group  0.0027     0        0.0027              0
##      2:   Control group  0.0027     0        0.0027              0
##      3:   Control group  0.0027     0        0.0027              0
##      4:   Control group  0.0027     0        0.0027              0
##      5:   Control group  0.0137     0        0.0137              0
##     ---
## 159889: Screening group 14.9405     0       14.9405              0
## 159890: Screening group 14.9405     0       14.9405              0
## 159891: Screening group 14.9405     0       14.9405              0
## 159892: Screening group 14.9405     0       14.9405              0
## 159893: Screening group 14.9405     0       14.9405              0
##         event status ycoord yc n_available
##      1:     censored  88232  0           0
##      2:     censored  88231  0           0
##      3:     censored  88230  0           0
##      4:     censored  88229  0           0
##      5:     censored  88228  0           0
##     ---
## 159889:     censored      5  0           0
## 159890:     censored      4  0           0
## 159891:     censored      3  0           0
## 159892:     censored      2  0           0
## 159893:     censored      1  0           0
##
## $exposure
## [1] "ScrArm"
```

```r
class(pt_object_strat)
```

```
## [1] "popTimeExposure" "list"
```

The `casebase` package also has a `plot` method for objects of class
`popTimeExposure`:

```r
plot(pt_object_strat)
```

We can also plot them side-by-side using the `ncol` argument:

```r
plot(pt_object_strat, ncol = 2)
```

## Cox Model

We first fit a Cox model, examine the hazard ratio for the screening group (relative to the control group), and plot the cumulative incidence function (CIF).

```r
cox_model <- survival::coxph(Surv(Follow.Up.Time, DeadOfPrCa) ~ ScrArm, data = ERSPC)
(sum_cox_model <- summary(cox_model))
```

```
## Call:
## survival::coxph(formula = Surv(Follow.Up.Time, DeadOfPrCa) ~
##     ScrArm, data = ERSPC)
##
##   n= 159893, number of events= 540
##
##                         coef exp(coef) se(coef)     z Pr(>|z|)
## ScrArmScreening group -0.222     0.801    0.088 -2.52    0.012 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##                       exp(coef) exp(-coef) lower .95 upper .95
## ScrArmScreening group     0.801       1.25     0.674     0.952
##
## Concordance= 0.519 (se = 0.011 )
## Rsquare= 0 (max possible= 0.075 )
## Likelihood ratio test= 6.45 on 1 df, p=0.0111
## Wald test            = 6.37 on 1 df, p=0.0116
## Score (logrank) test = 6.39 on 1 df, p=0.0115
```

We can plot the CIF for each group:

```r
new_data <- data.frame(ScrArm = c("Control group", "Screening group"), ignore = 99)
plot(survfit(cox_model, newdata = new_data),
     xlab = "Years since Randomization",
     ylab = "Cumulative Incidence",
     fun = "event",
     xlim = c(0, 15), conf.int = FALSE, col = c("red", "blue"),
     main = sprintf("Estimated Cumulative Incidence (risk) of Death from Prostate Cancer Screening group Hazard Ratio: %.2g (%.2g, %.2g)",
                    sum_cox_model$conf.int[, "exp(coef)"],
                    sum_cox_model$conf.int[, "lower .95"],
                    sum_cox_model$conf.int[, "upper .95"]))
legend("topleft",
       legend = c("Control group", "Screening group"),
       col = c("red", "blue"),
       lty = c(1, 1),
       bg = "gray90")
```

We compare it to the figure in Schroder FH, et al. N Engl J Med 2009 and see that the plots are very similar, as are the hazard ratio and 95% confidence interval.

## Case-base Sampling

Next we fit several models using case-base sampling. The models we fit differ in how we choose to model time.

The `fitSmoothHazard` function provides an estimate of the hazard
function \( h(x, t) \), where \( t \) denotes the numerical value
(number of units) of a point in prognostic/prospective time and
\( x \) is the realization of the vector \( X \) of variates based
on the patient's profile and intervention (if any).

```r
# set the seed for reproducible output
set.seed(1234)
casebase_exponential <- casebase::fitSmoothHazard(DeadOfPrCa ~ ScrArm,
                                                  data = ERSPC, ratio = 100)
```

```
## 'Follow.Up.Time' will be used as the time variable
```

```r
summary(casebase_exponential)
```

```
##
## Call:
## glm(formula = formula, family = binomial, data = sampleData)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -0.148 -0.148 -0.148 -0.132  3.080
##
## Coefficients:
##                       Estimate Std. Error z value            Pr(>|z|)
## (Intercept)            -7.7608     0.0557 -139.36 <0.0000000000000002 ***
## ScrArmScreening group  -0.2261     0.0884   -2.56               0.011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6059.0 on 54539 degrees of freedom
## Residual deviance: 6052.3 on 54538 degrees of freedom
## AIC: 6056
##
## Number of Fisher Scoring iterations: 7
```

```r
exp(casebase_exponential$coefficients[2])
```

```
## ScrArmScreening group
##                   0.8
```

```r
exp(confint(casebase_exponential)[2, ])
```

```
## Waiting for profiling to be done...
##  2.5 % 97.5 %
##   0.67   0.95
```

The `absoluteRisk` function provides an estimate of the cumulative
incidence curve for a specific risk profile using the following
equation:

\[ CI(x, t) = 1 - \exp\left[ -\int_0^t h(x, u) \, du \right]. \]
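As a sanity check on this relationship, a small sketch (hypothetical constant hazard, not the ERSPC data) computes the cumulative incidence by numerical integration and compares it with the closed form \( 1 - \exp(-ht) \):

```r
# Cumulative incidence from a hazard: CI(t) = 1 - exp(-integral of h over [0, t]).
# For a constant hazard h, this reduces to 1 - exp(-h * t).
h  <- function(u) rep(0.1, length(u))   # constant hazard of 0.1 per year
ci <- function(t) 1 - exp(-integrate(h, 0, t)$value)
ci(5)                  # numerical:   both are approximately 0.3935
1 - exp(-0.1 * 5)      # closed form
```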

In the plot below, we overlay the estimated CIF from the casebase exponential model on the Cox model CIF:

```r
smooth_risk_exp <- casebase::absoluteRisk(object = casebase_exponential,
                                          time = seq(0, 15, 0.1),
                                          newdata = new_data)
plot(survfit(cox_model, newdata = new_data),
     xlab = "Years since Randomization",
     ylab = "Cumulative Incidence",
     fun = "event",
     xlim = c(0, 15), conf.int = FALSE, col = c("red", "blue"),
     main = sprintf("Estimated Cumulative Incidence (risk) of Death from Prostate Cancer Screening group Hazard Ratio: %.2g (%.2g, %.2g)",
                    sum_cox_model$conf.int[, "exp(coef)"],
                    sum_cox_model$conf.int[, "lower .95"],
                    sum_cox_model$conf.int[, "upper .95"]))
lines(smooth_risk_exp[, 1], smooth_risk_exp[, 2], col = "red", lty = 2)
lines(smooth_risk_exp[, 1], smooth_risk_exp[, 3], col = "blue", lty = 2)
legend("topleft",
       legend = c("Control group (Cox)", "Control group (Casebase)",
                  "Screening group (Cox)", "Screening group (Casebase)"),
       col = c("red", "red", "blue", "blue"),
       lty = c(1, 2, 1, 2),
       bg = "gray90")
```

As we can see, the exponential model is not a good fit. Based on what we observed in the population time plot, where more events are observed later on in time, this poor fit is expected. A constant hazard model would overestimate the cumulative incidence earlier on in time, and underestimate it later on, which is what we see in the cumulative incidence plot. This example demonstrates the benefits of population time plots as an exploratory analysis tool.

### Linear Time

Next we enter time linearly into the model:

```r
casebase_time <- fitSmoothHazard(DeadOfPrCa ~ Follow.Up.Time + ScrArm,
                                 data = ERSPC, ratio = 100)
```

```
## 'Follow.Up.Time' will be used as the time variable
```

```r
summary(casebase_time)
```

```
##
## Call:
## glm(formula = formula, family = binomial, data = sampleData)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -0.391 -0.161 -0.123 -0.095  3.462
##
## Coefficients:
##                       Estimate Std. Error z value            Pr(>|z|)
## (Intercept)            -9.0013     0.1116  -80.66 <0.0000000000000002 ***
## Follow.Up.Time          0.2174     0.0145   15.04 <0.0000000000000002 ***
## ScrArmScreening group  -0.2455     0.0886   -2.77              0.0056 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6059.0 on 54539 degrees of freedom
## Residual deviance: 5818.9 on 54537 degrees of freedom
## AIC: 5825
##
## Number of Fisher Scoring iterations: 8
```

```r
exp(casebase_time$coefficients)
```

```
##           (Intercept)        Follow.Up.Time ScrArmScreening group
##               0.00012               1.24290               0.78230
```

```r
exp(confint(casebase_time))
```

```
## Waiting for profiling to be done...
##                          2.5 %  97.5 %
## (Intercept)           0.000099 0.00015
## Follow.Up.Time        1.208290 1.27878
## ScrArmScreening group 0.656812 0.92986
```

```r
smooth_risk_time <- casebase::absoluteRisk(object = casebase_time,
                                           time = seq(0, 15, 0.1),
                                           newdata = new_data)
plot(survfit(cox_model, newdata = new_data),
     xlab = "Years since Randomization",
     ylab = "Cumulative Incidence",
     fun = "event",
     xlim = c(0, 15), conf.int = FALSE, col = c("red", "blue"),
     main = sprintf("Estimated Cumulative Incidence (risk) of Death from Prostate Cancer Screening group Hazard Ratio: %.2g (%.2g, %.2g)",
                    sum_cox_model$conf.int[, "exp(coef)"],
                    sum_cox_model$conf.int[, "lower .95"],
                    sum_cox_model$conf.int[, "upper .95"]))
lines(smooth_risk_time[, 1], smooth_risk_time[, 2], col = "red", lty = 2)
lines(smooth_risk_time[, 1], smooth_risk_time[, 3], col = "blue", lty = 2)
legend("topleft",
       legend = c("Control group (Cox)", "Control group (Casebase)",
                  "Screening group (Cox)", "Screening group (Casebase)"),
       col = c("red", "red", "blue", "blue"),
       lty = c(1, 2, 1, 2),
       bg = "gray90")
```

We see that the Gompertz model (time entering linearly) leads to a better fit than the exponential model.

### Flexible Time Using B-splines

Next, we enter a smooth function of time into the model using the
`splines` package:

```r
casebase_splines <- fitSmoothHazard(DeadOfPrCa ~ bs(Follow.Up.Time) + ScrArm,
                                    data = ERSPC, ratio = 100)
```

```
## 'Follow.Up.Time' will be used as the time variable
```

```r
summary(casebase_splines)
```

```
##
## Call:
## glm(formula = formula, family = binomial, data = sampleData)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -0.437 -0.173 -0.135 -0.085  3.803
##
## Coefficients:
##                       Estimate Std. Error z value             Pr(>|z|)
## (Intercept)           -10.3009     0.3182  -32.37 < 0.0000000000000002 ***
## bs(Follow.Up.Time)1     4.4289     0.8088    5.48       0.000000043451 ***
## bs(Follow.Up.Time)2     1.9887     0.4875    4.08       0.000045142561 ***
## bs(Follow.Up.Time)3     4.7511     0.7215    6.59       0.000000000045 ***
## ScrArmScreening group  -0.2097     0.0886   -2.37                0.018 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6059.0 on 54539 degrees of freedom
## Residual deviance: 5789.2 on 54535 degrees of freedom
## AIC: 5799
##
## Number of Fisher Scoring iterations: 8
```

```r
exp(casebase_splines$coefficients)
```

```
##           (Intercept)   bs(Follow.Up.Time)1   bs(Follow.Up.Time)2
##              0.000034             83.840791              7.305938
##   bs(Follow.Up.Time)3 ScrArmScreening group
##            115.710908              0.810854
```

```r
exp(confint(casebase_splines))
```

```
## Waiting for profiling to be done...
##                           2.5 %     97.5 %
## (Intercept)            0.000017   0.000061
## bs(Follow.Up.Time)1   17.745271 424.854594
## bs(Follow.Up.Time)2    2.900215  19.704298
## bs(Follow.Up.Time)3   26.568668 455.040311
## ScrArmScreening group  0.680804   0.963786
```

```r
smooth_risk_splines <- absoluteRisk(object = casebase_splines,
                                    time = seq(0, 15, 0.1),
                                    newdata = new_data)
plot(survfit(cox_model, newdata = new_data),
     xlab = "Years since Randomization",
     ylab = "Cumulative Incidence",
     fun = "event",
     xlim = c(0, 15), conf.int = FALSE, col = c("red", "blue"),
     main = sprintf("Estimated Cumulative Incidence (risk) of Death from Prostate Cancer Screening group Hazard Ratio: %.2g (%.2g, %.2g)",
                    sum_cox_model$conf.int[, "exp(coef)"],
                    sum_cox_model$conf.int[, "lower .95"],
                    sum_cox_model$conf.int[, "upper .95"]))
lines(smooth_risk_splines[, 1], smooth_risk_splines[, 2], col = "red", lty = 2)
lines(smooth_risk_splines[, 1], smooth_risk_splines[, 3], col = "blue", lty = 2)
legend("topleft",
       legend = c("Control group (Cox)", "Control group (Casebase)",
                  "Screening group (Cox)", "Screening group (Casebase)"),
       col = c("red", "red", "blue", "blue"),
       lty = c(1, 2, 1, 2),
       bg = "gray90")
```

Of the three models, the splines model appears to provide the best fit.

## Comparing Models Using a Likelihood Ratio Test

Since we are in the GLM framework, we can easily test which model fits the data better using a likelihood ratio test (LRT). The null hypothesis here is that the linear-time model fits as well as the larger (in terms of the number of parameters) splines model.

```r
anova(casebase_time, casebase_splines, test = "LRT")
```

```
## Analysis of Deviance Table
##
## Model 1: DeadOfPrCa ~ Follow.Up.Time + ScrArm + offset(offset)
## Model 2: DeadOfPrCa ~ bs(Follow.Up.Time) + ScrArm + offset(offset)
##   Resid. Df Resid. Dev Df Deviance   Pr(>Chi)
## 1     54537       5819
## 2     54535       5789  2     29.7 0.00000036 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We see that the splines model is the better fit.
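To see what `anova` is doing under the hood, the same p-value can be computed directly from the drop in deviance between two nested binomial GLMs (a self-contained sketch with simulated data, not the ERSPC models):

```r
# The LRT p-value is the upper tail of a chi-squared distribution evaluated
# at the drop in deviance, with df equal to the difference in parameters.
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + x + 0.5 * x^2))
fit_small <- glm(y ~ x,          family = binomial)
fit_big   <- glm(y ~ x + I(x^2), family = binomial)

dev_drop <- deviance(fit_small) - deviance(fit_big)
df_diff  <- df.residual(fit_small) - df.residual(fit_big)
p_manual <- pchisq(dev_drop, df = df_diff, lower.tail = FALSE)
p_anova  <- anova(fit_small, fit_big, test = "LRT")$`Pr(>Chi)`[2]
all.equal(p_manual, p_anova)  # TRUE
```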

## Session Information

```r
print(sessionInfo(), locale = FALSE)
```

```
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] casebase_0.1.0       data.table_1.10.4    survival_2.40-1
## [4] devtools_1.12.0.9000
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10        git2r_0.18.0        plyr_1.8.4
##  [4] tools_3.3.3         testthat_1.0.2      digest_0.6.12
##  [7] pkgbuild_0.0.0.9000 pkgload_0.0.0.9000  memoise_1.0.0
## [10] evaluate_0.10       tibble_1.3.0        gtable_0.2.0
## [13] lattice_0.20-35     Matrix_1.2-8        commonmark_1.1
## [16] curl_2.3            yaml_2.1.14         httr_1.2.1
## [19] withr_1.0.2         stringr_1.2.0       knitr_1.15.1
## [22] roxygen2_6.0.1      xml2_1.1.1          desc_1.1.0
## [25] stats4_3.3.3        rprojroot_1.2       grid_3.3.3
## [28] R6_2.2.0            VGAM_1.0-3          rmarkdown_1.3.9003
## [31] pacman_0.4.1        callr_1.0.0.9000    ggplot2_2.2.1
## [34] magrittr_1.5        MASS_7.3-45         scales_0.4.1
## [37] backports_1.0.5     htmltools_0.3.6     rsconnect_0.7
## [40] assertthat_0.1      colorspace_1.3-2    labeling_0.3
## [43] stringi_1.1.5       lazyeval_0.2.0      munsell_0.4.3
## [46] crayon_1.3.2
```