gen.dat {BayesPIM}    R Documentation

gen.dat: Simulate Screening Data for a Prevalence-Incidence Mixture Model

Description

Generates synthetic data according to the Bayesian prevalence-incidence mixture (PIM) framework of Klausch et al. (2025) with interval-censored screening outcomes. The function simulates continuous or discrete baseline covariates, event times from one of several parametric families, and irregular screening schedules, yielding interval-censored observations suitable for testing or demonstrating PIM-based or other interval-censored survival methods.

Usage

gen.dat(
  kappa = 0.7,
  n = 1000,
  p = 2,
  p.discrete = 0,
  r = 0,
  s = 1,
  sigma.X = 1/2,
  mu.X = 4,
  beta.X = NULL,
  beta.W = NULL,
  theta = 0.15,
  v.min = 1,
  v.max = 6,
  mean.rc = 40,
  dist.X = "weibull",
  k = 1,
  sel.mod = "probit",
  prob.r = 0
)

Arguments

kappa

Numeric. Test sensitivity \kappa, the probability that a test detects a truly positive case. A value of 1 implies perfect sensitivity; values below 1 introduce misclassification.

n

Integer. Sample size.

p

Integer. Number of continuous baseline covariates to simulate.

p.discrete

Integer. If 1, an additional discrete covariate Z_{\mathrm{discrete}} drawn from \mathrm{Bernoulli}(0.5) is included; if 0, no discrete covariate is added.

r

Numeric. Correlation coefficient(s) used to build the covariance matrix of continuous covariates. If p > 1, off-diagonal entries of the correlation matrix are set to r.

s

Numeric. Standard deviation(s) of the continuous covariates. If p > 1, all continuous covariates share the same s.
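
As a rough illustration of how r and s could translate into a covariance matrix, the sketch below builds an exchangeable (compound-symmetric) matrix and draws covariates from it. The zero-mean multivariate normal draw and the object names R, Sigma, and Z are assumptions made for illustration. The documentation only states that r fills the off-diagonal correlations and that s is the common standard deviation.

```r
# Hypothetical sketch, not the package's internal code.
set.seed(1)
n <- 5000; p <- 3   # sample size and number of continuous covariates
r <- 0.3; s <- 2    # common correlation and common standard deviation

# Correlation matrix: 1 on the diagonal, r everywhere else
R <- matrix(r, nrow = p, ncol = p)
diag(R) <- 1

# Covariance matrix: scale the correlations by s^2
Sigma <- s^2 * R

# Assumed zero-mean multivariate normal draw via the Cholesky factor
Z <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)

round(cov(Z), 2)  # should be close to Sigma
```

The empirical covariance of Z approaches Sigma as n grows, which gives a quick check on a chosen r and s.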

sigma.X

Numeric. Scale parameter \sigma_X in the AFT model for \log(x_i).

mu.X

Numeric. Intercept \beta_{x0} in the AFT model. In the linear predictor, it appears as \log(x_i) = \beta_{x0} + \beta_{x}^\top Z_i + \sigma_X \epsilon_i. Practically, mu.X is prepended to beta.X when forming the full parameter vector.

beta.X

Numeric vector. The coefficients \beta_{x} for the AFT model. Combined with mu.X, the log-scale model is cbind(1, Z_i) %*% c(mu.X, beta.X).
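
To make the roles of mu.X, beta.X, and sigma.X concrete, the following sketch simulates event times from the AFT model for the Weibull case. It relies on the standard result that the log of a unit exponential variable follows the standard minimum extreme-value distribution. Whether gen.dat uses exactly this construction is an assumption, as are the standard normal covariates and the names lp, eps, and x.

```r
# Hypothetical sketch of the AFT model for dist.X = "weibull".
set.seed(2)
n <- 10000
mu.X <- 2; beta.X <- c(0.2, 0.2); sigma.X <- 0.5

# Two illustrative covariates (standard normal is an assumption)
Z <- matrix(rnorm(n * 2), nrow = n, ncol = 2)

# Linear predictor: intercept mu.X prepended to beta.X
lp <- drop(cbind(1, Z) %*% c(mu.X, beta.X))

# log of a unit exponential follows the standard minimum extreme-value
# distribution, which makes x Weibull with shape 1 / sigma.X
eps <- log(rexp(n))
x <- exp(lp + sigma.X * eps)

# Other families would swap the error draw, e.g. rnorm(n) for "lognormal"
# or rlogis(n) for "loglog"
summary(x)
```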

beta.W

Numeric vector. The coefficients \beta_{w} for the prevalence model. The intercept \beta_{w0} is derived from theta.

theta

Numeric. Baseline prevalence on the probability scale. The prevalence-model intercept \beta_{w0} is derived from theta according to sel.mod:

  • sel.mod = "probit": \beta_{w0} = \mathrm{qnorm}(\theta).

  • sel.mod = "logit": \beta_{w0} = \log(\theta / (1 - \theta)).
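
The two intercept formulas above can be checked numerically with base R. This short sketch computes both intercepts for the default theta and confirms that the inverse links recover theta when all covariates are zero. The object names beta.w0.probit and beta.w0.logit are illustrative only.

```r
theta <- 0.15

# Probit link: intercept is the standard normal quantile of theta
beta.w0.probit <- qnorm(theta)

# Logit link: intercept is the log-odds of theta
beta.w0.logit <- log(theta / (1 - theta))  # same as qlogis(theta)

# Both inverse links return theta when all covariates are zero
pnorm(beta.w0.probit)
plogis(beta.w0.logit)
```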

v.min

Numeric. Minimum spacing for irregular screening intervals.

v.max

Numeric. Maximum spacing for irregular screening intervals.

mean.rc

Numeric. Mean of the exponential distribution controlling a random right-censoring time t_{\mathrm{rc}} after the first screening.
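
One way to picture how v.min, v.max, and mean.rc interact is the sketch below, which builds a single individual's screening schedule before any detection is considered. Uniformly distributed gaps and a censoring time measured from the baseline visit at time 0 are both assumptions. The documentation states only that gaps lie between v.min and v.max and that the censoring time is exponential with mean mean.rc. The names t.rc, n.gaps, v, and schedule are illustrative.

```r
# Hypothetical sketch; uniform gaps are an assumption.
set.seed(4)
v.min <- 1; v.max <- 6; mean.rc <- 40

# Random right-censoring time, exponential with mean mean.rc
t.rc <- rexp(1, rate = 1 / mean.rc)

# Draw more gaps than could ever be needed, then accumulate them
n.gaps <- ceiling(t.rc / v.min) + 1
v <- cumsum(runif(n.gaps, min = v.min, max = v.max))

# Keep screenings before the censoring time; prepend the baseline visit at 0
schedule <- c(0, v[v <= t.rc])
schedule
```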

dist.X

Character. Distribution for survival times x_i: "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma).

k

Numeric. Shape parameter of the generalized gamma distribution; used only when dist.X = "gengamma".

sel.mod

Character. Either "probit" or "logit", specifying the link function for the prevalence model.

prob.r

Numeric. Probability that a baseline test is performed (r_i = 1). If prob.r = 0, no baseline tests are done.

Details

The data-generating process includes:

  1. Covariates Z: Continuous covariates are simulated using a correlation structure specified by r and a common standard deviation s. If p.discrete = 1, a single discrete covariate is added, drawn from \mathrm{Bernoulli}(0.5).

  2. Event Times X: An Accelerated Failure Time (AFT) model is used:

    \log(x_i) = \beta_{x0} + \beta_{x}^\top z_{xi} + \sigma_X \,\epsilon_i,

    where \beta_{x0} is the intercept (set by mu.X) and \beta_{x} are the other regression coefficients (provided via beta.X). The error term \epsilon_i is drawn from the distribution chosen by dist.X: "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma). For "gengamma", the shape parameter k is additionally used.

  3. Irregular Screening Schedules V_i: Each individual has multiple screening times generated randomly between v.min and v.max, ending in right censoring or the time of detection. These screening times (including a 0 for baseline and Inf for censoring) are returned in Vobs.

  4. Prevalence Indicator g_i: Baseline prevalence is modeled via either a probit or logit link, consistent with:

    w_i = \beta_{w0} + \beta_{w}^\top z_{wi} + \psi_i,

    where \beta_{w0} is determined by theta, and \beta_{w} by beta.W. Specifically:

    • If sel.mod = "probit", then \beta_{w0} = \mathrm{qnorm}(\theta).

    • If sel.mod = "logit", then \beta_{w0} = \log(\theta / (1-\theta)).

    We set g_i = 1 if w_i > 0, and g_i = 0 otherwise.

  5. Baseline Test Missingness r_i: A baseline test indicator r_i \in \{0,1\} is generated via \mathrm{Bernoulli}(\text{prob.r}), so r_i = 1 means the baseline test is performed and r_i = 0 means it is missing.

  6. Test Sensitivity \kappa: A misclassification parameter \kappa (test sensitivity) can be specified via kappa. If \kappa < 1, some truly positive cases are missed.
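
Steps 4 to 6 can be sketched together in base R. The example below assumes a probit link with a standard normal error \psi_i, standard normal covariates, and a test with no false positives. None of these details is confirmed as the package's internal implementation, and the names w, g, r.i, and detected are illustrative.

```r
# Hypothetical sketch of steps 4 to 6, not the package's internal code.
set.seed(3)
n <- 20000
theta <- 0.15; kappa <- 0.7; prob.r <- 0.5
beta.W <- c(0.5, -0.2)
Z <- matrix(rnorm(n * 2), nrow = n, ncol = 2)  # assumed covariates

# Step 4: latent variable with standard normal error (probit link)
w <- qnorm(theta) + drop(Z %*% beta.W) + rnorm(n)
g <- as.integer(w > 0)

# Step 5: baseline test performed with probability prob.r
r.i <- rbinom(n, size = 1, prob = prob.r)

# Step 6: a performed test on a truly positive case detects it with
# probability kappa (assumed: no false positives)
detected <- g * r.i * rbinom(n, size = 1, prob = kappa)

mean(detected[g == 1 & r.i == 1])  # should be close to kappa
```

Among individuals who are truly positive and tested at baseline, the detected fraction approximates kappa, so the remaining share represents prevalent cases missed by the baseline test.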

Value

A list with the following elements:

Vobs

A list of length n, where each entry contains the screening times of one individual. The first element is 0 (baseline), and a final element of Inf indicates right censoring.

X.true

Numeric vector of length n giving the true (latent) event times x_i.

Z

Numeric matrix of dimension n \times p (plus an extra column if p.discrete = 1) containing the covariates.

C

Binary vector of length n, indicating whether an individual is truly positive at baseline (g_i = 1).

r

Binary vector of length n, indicating whether the baseline test was performed (r_i = 1) or missing (r_i = 0).

p.W

Numeric vector of length n giving the true prevalence probabilities, P(g_i = 1).

References

T. Klausch, B. I. Lissenberg-Witte, and V. M. Coupé, “A Bayesian prevalence-incidence mixture model for screening outcomes with misclassification,” arXiv:2412.16065.

Examples

# Generate a small dataset for testing
set.seed(2025)
sim_data <- gen.dat(n = 100, p = 1, p.discrete = 1,
                    sigma.X = 0.5, mu.X = 2,
                    beta.X = c(0.2, 0.2), beta.W = c(0.5, -0.2),
                    theta = 0.2,
                    dist.X = "weibull", sel.mod = "probit")
str(sim_data)


[Package BayesPIM version 1.0.0 Index]