Flexible Bayesian parametric survival models. Individual data are
represented using M-splines and a proportional hazards or flexible
non-proportional hazards model. External aggregate data can be
included, for example, to enable extrapolation outside the
individual data. A fixed background hazard can also be included in
an additive hazards (relative survival) model. Mixture cure
versions of these models can also be used.
formula |
A survival formula in standard R formula syntax, with a call to Surv()
on the left hand side.
Covariates included on the right hand side of the formula with be
modelled with proportional hazards, or if nonprop is
TRUE then a non-proportional hazards is used.
If data is omitted, so that the model is being fitted to
external aggregate data alone, without individual data, then the
formula should not include a Surv() call. The left-hand
side of the formula will then be empty, and the right hand side
specifies the covariates as usual. For example, formula =
~1 if there are no covariates.
|
data |
Data frame containing variables in formula .
Variables should be in a data frame, and not in the working
environment.
This may be omitted, in which case external must be
supplied. This allows a model to be fitted to external aggregate
data alone, without any individual-level data.
|
external |
External data as a data frame of aggregate survival counts with columns named:
start : Start time
stop : Follow-up time
n : Number of people alive at start
r : Number of those people who are still alive at stop
If there are covariates in formula , then the values they
take in the external data must be supplied as additional columns in
external . Therefore if there are external data, the
covariates in formula and data should not be named
start ,stop ,n or r .
|
cure |
If TRUE , a mixture cure model is used, where the
"uncured" survival is defined by the M-spline model, and the cure
probability is estimated.
|
nonprop |
Non-proportional hazards model specification.
This is achieved by modelling the spline basis coefficients in terms of the covariates. See
the methods vignette for more details.
If TRUE , then all covariates are modelled with
non-proportional hazards, using the same model formula as
formula .
If this is a formula, then this is assumed to define a model for
the dependence of the basis coefficients on the covariates.
IF this is NULL or FALSE (the default) then any
covariates are modelled with proportional hazards.
|
prior_hscale |
Prior for the baseline log hazard scale
parameter (alpha or log(eta) ). This should be a call to a
prior constructor function, such as p_normal(0,1) or
p_t(0,2,2) . Supported prior distribution families are normal
(parameters mean and SD) and t distributions (parameters
location, scale and degrees of freedom). The default is a normal
distribution with mean 0 and standard deviation 20.
Note that eta is not in itself a hazard, but it is proportional
to the hazard (see the vignette for the full model specification).
"Baseline" is defined by the continuous covariates taking a value
of zero and factor covariates taking their reference level. To
use a different baseline, the data should be transformed
appropriately beforehand, so that a value of zero has a different
meaning. For continuous covariates, it helps for both
computation and interpretation to define the value of zero to
denote a typical value in the data, e.g. the mean.
|
prior_loghr |
Priors for log hazard ratios. This should be a
call to p_normal() or p_t() . A list of calls can also be
provided, to give different priors to different coefficients,
where the name of each list component matches the name of the
coefficient, e.g. list("age45-59" = p_normal(0,1), "age60+" = p_t(0,2,3))
The default is p_normal(0,2.5) for all coefficients.
|
prior_hsd |
Gamma prior for the standard deviation that
controls the variability over time (or smoothness) of the hazard
function. This should be a call to p_gamma() . The default is
p_gamma(2,1) . See prior_haz_sd for a way to
calibrate this to represent a meaningful belief.
|
prior_cure |
Prior for the baseline cure probability. This should be a
call to p_beta() . The default is a uniform prior, p_beta(1,1) .
Baseline is defined by the mean of continuous covariates and the reference
level of factor covariates.
|
prior_logor_cure |
Priors for log odds ratios on cure probabilities.
This should be a call to p_normal() or p_t() . The default is
p_normal(0,2.5) .
|
prior_hrsd |
Prior for the standard deviation parameters that
smooth the non-proportionality effects over time in
non-proportional hazards models. This should be a call to
p_gamma() or a list of calls to p_gamma() with one component
per covariate, as in prior_loghr . See
prior_hr_sd for a way to calibrate this to
represent a meaningful belief.
|
backhaz |
Background hazard, that is, for causes of death
other than the cause of interest. This defines a
"relative survival" or "additive hazards" model. The overall
hazard that describes the all-cause survival data (given in the
data and/or external argument) is then modelled as the sum of
a cause-specific hazard and a background hazard.
The background hazard is assumed to be known, and the
cause-specific hazard is modelled with the flexible parametric
model.
The background hazard can be supplied in two forms. The meaning of predictions
from the model depends on which of these is used.
(a) A data frame with columns "hazard" and "time" ,
specifying the background hazard at all times as a
piecewise-constant (step) function. Each row gives the background
hazard between the specified time and the next time. The first
element of "time" should be 0, and the final row specifies
the hazard at all times greater than the last element of
"time" . Predictions from the model fitted by survextrap
will then include this background hazard, because it is known at
all times.
(b) The (quoted) name of a variable in the data giving the
background hazard. For censored cases, the exact value does not
matter. The predictions from survextrap will then describe the
excess hazard or survival on top of this background. The overall
hazard cannot be predicted in general, because the background
hazard is only specified over a limited range of time.
If there is external data, and backhaz is supplied in form (b),
then the user should also supply the background survival at the
start and stop points in columns of the external data named
"backsurv_start" and "backsurv_stop" . That is, the probability
of survival up to each of these times for someone alive at time 0.
This should describe the
same reference population as backhaz , though the package does not
check for consistency between these.
If there are stratifying variables specified in
backhaz_strata , then there should be multiple rows giving
the background hazard value for each time period and stratifying
variable.
If backhaz is NULL (the default) then no background hazard
component is included in the model.
|
backhaz_strata |
A character vector of names of variables that
appear in backhaz that indicate strata, e.g.
backhaz_strata = c("agegroup","sex") . This allows
different background hazard values to be used for different
subgroups. These variables must also
appear in the datasets being modelled, that is, in data ,
external or both. Each row of those datasets should then
have a corresponding row in backhaz which has the same
values of the stratifying variables.
This is NULL by default, indicating no stratification of the
background hazard.
If stratification is done, then backhaz must be supplied in
form (a), as a data frame rather than a variable in the data.
|
mspline |
A list of control parameters defining the spline model.
knots : Spline knots. If this is not supplied, then the number
of knots is taken from df , and their location is taken from
equally-spaced quantiles of the observed event times in the
individual-level data.
add_knots : This is intended
to be used when there are external data included in the
model. External data are typically outside the time period
covered by the individual data. add_knots would then be chosen
to span the time period covered by the external data, so that the
hazard trajectory can vary over that time.
If there are external data, and both knots and add_knots are
omitted, then a default set of knots is chosen to span both the
individual and external data, by taking the quantiles of a vector
defined by concatenating the individual-level event times with
the start and stop times in the external data.
df : Degrees of freedom, i.e. the number of parameters (or basis
terms) intended to result from choosing knots based on quantiles
of the data. The total number of parameters will then be df
plus the number of additional knots specified in
add_knots . df defaults to 10. This does not necessarily
overfit, because the function is smoothed through the prior.
degree : Polynomial degree used for the basis function. The
default is 3, giving a cubic. This can only be changed from 3
if bsmooth is FALSE .
bsmooth : If TRUE (on by default) the spline is smoother
at the highest knot, by defining the derivative and second derivative
at this point to be zero.
|
add_knots |
Any extra knots beyond those chosen from the
individual-level data (or supplied in knots ). All other spline
specifications are set to their defaults, as described in
mspline . For example, add_knots = 10 is a shorthand
for mspline = list(add_knots = 10) .
|
smooth_model |
The default "random_walk" , specifies a random walk
prior for the multinomial-logit spline coefficients, based on
logistic distributions. See the methods vignette
for full details.
The alternative "exchangeable" uses
independent logistic priors on the multinomial-logit spline
coefficients, conditionally on a common smoothing variance
parameter. Note this is the method explained in the original
survextrap paper (Jackson, BMC Med Res 2023). The random
walk model is shown to perform better in Timmins et al (2025).
In non-proportional hazards models, setting smooth_model also
determines whether an exchangeable or random walk model is used for the
non-proportionality parameters (\delta ).
|
hsd |
Smoothing variance parameter estimation.
"bayes" : the smoothing parameter is estimated by full Bayes (the default).
"eb" : empirical Bayes is used.
Alternatively, if a number is supplied here, then the smoothing parameter is fixed to this number.
|
coefs_mean |
Spline basis coefficients that define the prior
mean for the hazard function. By default, these are set to values
that define a constant hazard function (see
mspline_constant_coefs ). They are normalised to
sum to 1 internally (if they do not already).
|
fit_method |
Method from rstan used to fit the model.
If fit_method="mcmc" then a sample from the posterior is
drawn using Markov Chain Monte Carlo sampling, via rstan's
rstan::sampling()
function. This is the default. It is the most accurate but the
slowest method.
If fit_method="opt" , then instead of an MCMC sample from
the posterior, survextrap returns the posterior mode
calculated using optimisation, via rstan's
rstan::optimizing()
function. A sample from a normal approximation to the
(real-line-transformed) posterior distribution is drawn in order
to obtain credible intervals. This is useful for model development,
while using MCMC for the "final answer".
If fit_method="vb" , then variational Bayes methods are
used, via rstan's
rstan::vb() function.
This is labelled as "experimental" by rstan. It might
give a better approximation to the posterior than
fit_method="opt" , and is faster than MCMC, in particular
for large datasets, but has not been investigated in depth for
these models.
|
loo |
Compute leave-one-out cross-validation statistics.
This is done by default. Set to FALSE to not compute
them. If these statistics are computed, then they are returned
in the loo and loo_external components of the
object returned by survextrap . loo describes the
fit of the model to the individual-level data, and
loo_external describes fit to the external data.
See the "examples" vignette for more explanation of these.
|
... |
Additional arguments to supply to control the Stan fit,
passed to the appropriate rstan function, depending on
which is chosen through the fit_method argument.
|
A list of objects defining the fitted model. These are not
intended to be extracted directly by users. Instead see
summary.survextrap
for summarising the parameter
estimates, and the functions hazard
,
survival
, rmst
and others for
computing interesting summaries of the fitted survival
distribution.
Timmins I, Torabi F, Jackson C, Lambert P, Sweeting M J. (2025)
Simulation-based assessment of a Bayesian survival model with flexible baseline hazard and time-dependent effects.
doi:10.48550/arXiv.2503.21388.