Miettinen and Hanley (2009) explained how case-base sampling can be used to estimate smooth-in-time parametric hazard functions. The idea is to sample person-moments, which may or may not correspond to an event, and then fit the hazard using logistic regression.

- formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under Details.

- data
a data frame, list or environment containing the variables in the model. If not found in data, the variables are taken from

`environment(formula)`

, typically the environment from which`fitSmoothHazard`

is called.- time
a character string giving the name of the time variable. See Details.

- family
a character string specifying the family of regression models used to fit the hazard.

- censored.indicator
a character string of length 1 indicating which value in

`event`

is the censored. This function will use`relevel`

to set`censored.indicator`

as the reference level. This argument is ignored if the`event`

variable is a numeric.- ratio
integer, giving the ratio of the size of the base series to that of the case series. Defaults to 100.

- ...
Additional parameters passed to fitting functions (e.g.

`glm`

,`glmnet`

,`gam`

).- x
Matrix containing covariates.

- y
Matrix containing two columns: one corresponding to time, the other to the event type.

- formula_time
A formula describing how the hazard depends on time. Defaults to linear.

- event
a character string giving the name of the event variable.

An object of `glm`

and `lm`

when there is only one event of
interest, or of class `CompRisk`

, which inherits from
`vglm`

, for a competing risk analysis. As such, functions like
`summary`

, `deviance`

and `coefficients`

give familiar
results.

The object `data`

should either be the output of the function
`sampleCaseBase`

or the source dataset on which case-base
sampling will be performed. In the latter case, it is assumed that
`data`

contains the two columns corresponding to the supplied time and
event variables. The variable `time`

is used for the sampling the base
series, and therefore it should represent the time variable on its original
(i.e. non transformed) scale. If `time`

is missing, the function looks
for a column named `"time"`

in the data. Note that the event variable is
inferred from `formula`

, since it is the left hand side.

For single-event survival analysis, it is also possible to fit the hazard
function using `glmnet`

or `gam`

. The choice of fitting family is
controlled by the parameter `family`

. The default value is `glm`

,
which corresponds to logistic regression. For competing risk analysis, only
`glm`

and `glmnet`

are allowed.

We also provide a matrix interface through `fitSmoothHazard.fit`

, which
mimics `glm.fit`

. This is mostly convenient for ```
family =
"glmnet"
```

, since a formula interface becomes quickly cumbersome as the number
of variables increases. In this setting, the matrix `y`

should have two
columns and contain the time and event variables (e.g. like the output of
`survival::Surv`

). We need this linear function of time in order to
perform case-base sampling. Therefore, nonlinear functions of time should be
specified as a one-sided formula through the argument `formula_time`

(the left-hand side is always ignored).

`prepareX`

is a slightly modified version of the same function from
the `glmnet`

package. It can be used to convert a data.frame to a matrix
with categorical variables converted to dummy variables using one-hot
encoding

```
# Simulate censored survival data for two outcome types from exponential
# distributions
library(data.table)
nobs <- 500
tlim <- 20
# simulation parameters
b1 <- 200
b2 <- 50
# event type 0-censored, 1-event of interest, 2-competing event
# t observed time/endpoint
# z is a binary covariate
DT <- data.table(z = rbinom(nobs, 1, 0.5))
DT[, `:=`(
"t_event" = rweibull(nobs, 1, b1),
"t_comp" = rweibull(nobs, 1, b2)
)]
#> z t_event t_comp
#> <int> <num> <num>
#> 1: 0 125.67322 53.090763
#> 2: 0 27.92365 41.517160
#> 3: 1 625.41749 23.956800
#> 4: 0 279.93382 62.361918
#> 5: 1 315.90292 19.739612
#> ---
#> 496: 1 111.76799 57.900562
#> 497: 1 153.70621 63.501104
#> 498: 0 38.02085 14.325562
#> 499: 0 109.52164 20.573577
#> 500: 0 575.37257 8.339102
DT[, `:=`(
"event" = 1 * (t_event < t_comp) + 2 * (t_event >= t_comp),
"time" = pmin(t_event, t_comp)
)]
#> z t_event t_comp event time
#> <int> <num> <num> <num> <num>
#> 1: 0 125.67322 53.090763 2 53.090763
#> 2: 0 27.92365 41.517160 1 27.923655
#> 3: 1 625.41749 23.956800 2 23.956800
#> 4: 0 279.93382 62.361918 2 62.361918
#> 5: 1 315.90292 19.739612 2 19.739612
#> ---
#> 496: 1 111.76799 57.900562 2 57.900562
#> 497: 1 153.70621 63.501104 2 63.501104
#> 498: 0 38.02085 14.325562 2 14.325562
#> 499: 0 109.52164 20.573577 2 20.573577
#> 500: 0 575.37257 8.339102 2 8.339102
DT[time >= tlim, `:=`("event" = 0, "time" = tlim)]
#> z t_event t_comp event time
#> <int> <num> <num> <num> <num>
#> 1: 0 125.67322 53.090763 0 20.000000
#> 2: 0 27.92365 41.517160 0 20.000000
#> 3: 1 625.41749 23.956800 0 20.000000
#> 4: 0 279.93382 62.361918 0 20.000000
#> 5: 1 315.90292 19.739612 2 19.739612
#> ---
#> 496: 1 111.76799 57.900562 0 20.000000
#> 497: 1 153.70621 63.501104 0 20.000000
#> 498: 0 38.02085 14.325562 2 14.325562
#> 499: 0 109.52164 20.573577 0 20.000000
#> 500: 0 575.37257 8.339102 2 8.339102
out_linear <- fitSmoothHazard(event ~ time + z, DT, ratio = 10)
#> 'time' will be used as the time variable
out_log <- fitSmoothHazard(event ~ log(time) + z, DT, ratio = 10)
#> 'time' will be used as the time variable
# Use GAMs
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
DT[event == 2, event := 1]
#> z t_event t_comp event time
#> <int> <num> <num> <num> <num>
#> 1: 0 125.67322 53.090763 0 20.000000
#> 2: 0 27.92365 41.517160 0 20.000000
#> 3: 1 625.41749 23.956800 0 20.000000
#> 4: 0 279.93382 62.361918 0 20.000000
#> 5: 1 315.90292 19.739612 1 19.739612
#> ---
#> 496: 1 111.76799 57.900562 0 20.000000
#> 497: 1 153.70621 63.501104 0 20.000000
#> 498: 0 38.02085 14.325562 1 14.325562
#> 499: 0 109.52164 20.573577 0 20.000000
#> 500: 0 575.37257 8.339102 1 8.339102
out_gam <- fitSmoothHazard(event ~ s(time) + z, DT,
ratio = 10, family = "gam")
#> 'time' will be used as the time variable
```