Miettinen and Hanley (2009) explained how case-base sampling can be used to estimate smooth-in-time parametric hazard functions. The idea is to sample person-moments, which may or may not correspond to an event, and then fit the hazard using logistic regression.

fitSmoothHazard(
  formula,
  data,
  time,
  family = c("glm", "gam", "glmnet"),
  censored.indicator,
  ratio = 100,
  ...
)

fitSmoothHazard.fit(
  x,
  y,
  formula_time,
  time,
  event,
  family = c("glm", "glmnet"),
  censored.indicator,
  ratio = 100,
  ...
)

prepareX(formula, data)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under Details.

data

a data frame, list or environment containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which fitSmoothHazard is called.

time

a character string giving the name of the time variable. See Details.

family

a character string specifying the family of regression models used to fit the hazard.

censored.indicator

a character string of length 1 indicating which value in event is the censored. This function will use relevel to set censored.indicator as the reference level. This argument is ignored if the event variable is a numeric.

ratio

integer, giving the ratio of the size of the base series to that of the case series. Defaults to 100.

...

Additional parameters passed to fitting functions (e.g. glm, glmnet, gam).

x

Matrix containing covariates.

y

Matrix containing two columns: one corresponding to time, the other to the event type.

formula_time

A formula describing how the hazard depends on time. Defaults to linear.

event

a character string giving the name of the event variable.

Value

An object of glm and lm when there is only one event of interest, or of class CompRisk, which inherits from vglm, for a competing risk analysis. As such, functions like summary, deviance and coefficients give familiar results.

Details

The object data should either be the output of the function sampleCaseBase or the source dataset on which case-base sampling will be performed. In the latter case, it is assumed that data contains the two columns corresponding to the supplied time and event variables. The variable time is used for the sampling the base series, and therefore it should represent the time variable on its original (i.e. non transformed) scale. If time is missing, the function looks for a column named "time" in the data. Note that the event variable is inferred from formula, since it is the left hand side.

For single-event survival analysis, it is also possible to fit the hazard function using glmnet or gam. The choice of fitting family is controlled by the parameter family. The default value is glm, which corresponds to logistic regression. For competing risk analysis, only glm and glmnet are allowed.

We also provide a matrix interface through fitSmoothHazard.fit, which mimics glm.fit. This is mostly convenient for family = "glmnet", since a formula interface becomes quickly cumbersome as the number of variables increases. In this setting, the matrix y should have two columns and contain the time and event variables (e.g. like the output of survival::Surv). We need this linear function of time in order to perform case-base sampling. Therefore, nonlinear functions of time should be specified as a one-sided formula through the argument formula_time (the left-hand side is always ignored).

prepareX is a slightly modified version of the same function from the glmnet package. It can be used to convert a data.frame to a matrix with categorical variables converted to dummy variables using one-hot encoding

Examples

# Simulate censored survival data for two outcome types from exponential
# distributions
library(data.table)
nobs <- 500
tlim <- 20

# simulation parameters
b1 <- 200
b2 <- 50

# event type 0-censored, 1-event of interest, 2-competing event
# t observed time/endpoint
# z is a binary covariate
DT <- data.table(z = rbinom(nobs, 1, 0.5))
DT[, `:=`(
  "t_event" = rweibull(nobs, 1, b1),
  "t_comp" = rweibull(nobs, 1, b2)
)]
#>          z   t_event    t_comp
#>      <int>     <num>     <num>
#>   1:     0 125.67322 53.090763
#>   2:     0  27.92365 41.517160
#>   3:     1 625.41749 23.956800
#>   4:     0 279.93382 62.361918
#>   5:     1 315.90292 19.739612
#>  ---                          
#> 496:     1 111.76799 57.900562
#> 497:     1 153.70621 63.501104
#> 498:     0  38.02085 14.325562
#> 499:     0 109.52164 20.573577
#> 500:     0 575.37257  8.339102
DT[, `:=`(
  "event" = 1 * (t_event < t_comp) + 2 * (t_event >= t_comp),
  "time" = pmin(t_event, t_comp)
)]
#>          z   t_event    t_comp event      time
#>      <int>     <num>     <num> <num>     <num>
#>   1:     0 125.67322 53.090763     2 53.090763
#>   2:     0  27.92365 41.517160     1 27.923655
#>   3:     1 625.41749 23.956800     2 23.956800
#>   4:     0 279.93382 62.361918     2 62.361918
#>   5:     1 315.90292 19.739612     2 19.739612
#>  ---                                          
#> 496:     1 111.76799 57.900562     2 57.900562
#> 497:     1 153.70621 63.501104     2 63.501104
#> 498:     0  38.02085 14.325562     2 14.325562
#> 499:     0 109.52164 20.573577     2 20.573577
#> 500:     0 575.37257  8.339102     2  8.339102
DT[time >= tlim, `:=`("event" = 0, "time" = tlim)]
#>          z   t_event    t_comp event      time
#>      <int>     <num>     <num> <num>     <num>
#>   1:     0 125.67322 53.090763     0 20.000000
#>   2:     0  27.92365 41.517160     0 20.000000
#>   3:     1 625.41749 23.956800     0 20.000000
#>   4:     0 279.93382 62.361918     0 20.000000
#>   5:     1 315.90292 19.739612     2 19.739612
#>  ---                                          
#> 496:     1 111.76799 57.900562     0 20.000000
#> 497:     1 153.70621 63.501104     0 20.000000
#> 498:     0  38.02085 14.325562     2 14.325562
#> 499:     0 109.52164 20.573577     0 20.000000
#> 500:     0 575.37257  8.339102     2  8.339102

out_linear <- fitSmoothHazard(event ~ time + z, DT, ratio = 10)
#> 'time' will be used as the time variable
out_log <- fitSmoothHazard(event ~ log(time) + z, DT, ratio = 10)
#> 'time' will be used as the time variable

# Use GAMs
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
DT[event == 2, event := 1]
#>          z   t_event    t_comp event      time
#>      <int>     <num>     <num> <num>     <num>
#>   1:     0 125.67322 53.090763     0 20.000000
#>   2:     0  27.92365 41.517160     0 20.000000
#>   3:     1 625.41749 23.956800     0 20.000000
#>   4:     0 279.93382 62.361918     0 20.000000
#>   5:     1 315.90292 19.739612     1 19.739612
#>  ---                                          
#> 496:     1 111.76799 57.900562     0 20.000000
#> 497:     1 153.70621 63.501104     0 20.000000
#> 498:     0  38.02085 14.325562     1 14.325562
#> 499:     0 109.52164 20.573577     0 20.000000
#> 500:     0 575.37257  8.339102     1  8.339102
out_gam <- fitSmoothHazard(event ~ s(time) + z, DT,
                           ratio = 10, family = "gam")
#> 'time' will be used as the time variable