generate_data {ARTtransfer}    R Documentation

generate_data: Generate synthetic primary, auxiliary, and noisy datasets for transfer learning

Description

This function generates synthetic datasets for the primary task (target domain), auxiliary datasets (source domains), and noisy datasets for use in transfer learning simulations. It allows flexible input for the sizes of the auxiliary and noisy datasets, supports different covariance structures, and can optionally generate test datasets. Users can specify true coefficients or rely on random generation. The function supports generating datasets for both regression and binary classification tasks.

Usage

generate_data(
  n0,
  p,
  K,
  nk,
  is_noise = TRUE,
  K_noise = 2,
  nk_noise = 30,
  mu_trgt,
  xi_aux,
  ro,
  err_sig,
  true_beta = NULL,
  noise_beta = NULL,
  Sigma_type = "AR",
  is_test = TRUE,
  n_test = n0,
  task = "regression"
)

Arguments

n0

An integer specifying the number of observations in the primary dataset (target domain).

p

An integer specifying the dimension, namely the number of predictors. All the generated data must have the same dimension.

K

An integer specifying the number of auxiliary datasets (source domains).

nk

Either an integer specifying the number of observations in each auxiliary dataset (source domains), or a vector where each element specifies the size of the corresponding auxiliary dataset. If 'nk' is a vector, its length must match the number of auxiliary datasets ('K').

is_noise

Logical; if TRUE, includes noisy data. If FALSE, 'K_noise' and 'nk_noise' are ignored. Default is TRUE.

K_noise

An integer specifying the number of noisy auxiliary datasets. If 'K_noise = 0', noisy datasets are skipped. If 'is_noise = FALSE', this argument is not used.

nk_noise

Either an integer specifying the number of observations in each noisy dataset, or a vector where each element specifies the size of the corresponding noisy dataset. If 'nk_noise' is a vector, its length must match the number of noisy datasets ('K_noise').

mu_trgt

A numeric value specifying the mean of the true coefficients in the primary dataset.

xi_aux

A numeric value representing the shift applied to the true coefficients in the auxiliary datasets.

ro

A numeric value representing the correlation between predictors (applies to the covariance matrix).

err_sig

A numeric value specifying the standard deviation of the noise added to the response.

true_beta

A vector of true coefficients for the primary dataset. If 'NULL', it is randomly generated. Default is 'NULL'.

noise_beta

A vector of noise coefficients. If 'NULL', it is set to '-true_beta'. Default is 'NULL'.

Sigma_type

A string specifying the covariance structure for the predictors. Options are: "AR" (auto-regressive structure) or "CS" (compound symmetry structure). Default is "AR".

is_test

Logical; if TRUE, generates a test dataset ('X_test', 'y_test'). Default is TRUE.

n_test

An integer specifying the number of observations in the test data. Default is n0.

task

A string specifying the type of task. Options are "regression" or "classification". Default is "regression".

Details

The function first generates a covariance matrix based on the specified 'Sigma_type', then creates the primary dataset ('X', 'y'), the auxiliary datasets ('X_aux', 'y_aux'), and optionally generates test datasets ('X_test', 'y_test'). The auxiliary datasets are combined with noisy datasets into 'X_aux' and 'y_aux' for transfer learning use.
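The two covariance structures can be illustrated with a short base-R sketch. The helper 'make_sigma' is hypothetical and not part of ARTtransfer. The exact forms (AR as ro^|i - j|, CS as a constant off-diagonal 'ro' with unit variances) are the conventional definitions and are assumed here rather than taken from the package source.

```r
# Hypothetical helper, not part of ARTtransfer: build a p x p predictor
# covariance matrix under the two structures named by 'Sigma_type'.
make_sigma <- function(p, ro, Sigma_type = c("AR", "CS")) {
  Sigma_type <- match.arg(Sigma_type)
  if (Sigma_type == "AR") {
    # Assumed AR(1) form: entry (i, j) equals ro^|i - j|
    ro^abs(outer(seq_len(p), seq_len(p), "-"))
  } else {
    # Assumed compound symmetry: 1 on the diagonal, ro everywhere else
    Sigma <- matrix(ro, nrow = p, ncol = p)
    diag(Sigma) <- 1
    Sigma
  }
}

Sigma_ar <- make_sigma(4, 0.3, "AR")
Sigma_cs <- make_sigma(4, 0.3, "CS")

# Draw n rows from N(0, Sigma_ar) using the Cholesky factor (base R only).
# chol() returns the upper-triangular R with t(R) %*% R == Sigma_ar.
set.seed(1)
n <- 5
X <- matrix(rnorm(n * 4), nrow = n) %*% chol(Sigma_ar)
```

Under the AR form, correlation decays with the distance between predictor indices. Under CS, every pair of predictors shares the same correlation.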

If 'is_noise = FALSE', then no noisy data is generated and 'K_noise' and 'nk_noise' are ignored. If 'K_noise = 0', noisy data is skipped regardless of the value of 'is_noise'. The task can be either "regression" or "classification". In classification mode, binary response variables are generated using a logistic function.
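As a rough illustration of the difference between the two task types, the sketch below draws a response from a linear predictor in both modes. It is not the package's code: the coefficient vector, the absence of an intercept, and the use of 'plogis' with 'rbinom' are assumptions made for the example.

```r
# Illustrative only: how a response could be drawn for each task type.
# The linear predictor and link used by generate_data are assumptions here.
set.seed(42)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), nrow = n)
beta <- rep(1, p)      # stand-in for the true coefficients
err_sig <- 1           # noise standard deviation

eta <- drop(X %*% beta)                  # linear predictor

# Regression: add Gaussian noise with standard deviation err_sig
y_reg <- eta + rnorm(n, sd = err_sig)

# Classification: map eta to a probability with the logistic function,
# then draw binary (0/1) labels from a Bernoulli distribution
prob <- plogis(eta)                      # 1 / (1 + exp(-eta))
y_class <- rbinom(n, size = 1, prob = prob)
```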

If 'nk' or 'nk_noise' is a vector, the function checks that its length matches the number of auxiliary or noisy datasets, respectively. If the lengths do not match, an error is raised.
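The length check described above can be sketched as follows. The helper 'expand_sizes' and its error message are hypothetical. The sketch only shows one plausible way to accept either a single size or one size per dataset.

```r
# Hypothetical helper, not the package's code: expand a scalar size to one
# entry per dataset, or verify that a vector has exactly K entries.
expand_sizes <- function(nk, K, name = "nk") {
  if (length(nk) == 1) {
    return(rep(nk, K))
  }
  if (length(nk) != K) {
    stop(sprintf("length of '%s' (%d) must equal the number of datasets (%d)",
                 name, length(nk), K))
  }
  nk
}

sizes_scalar <- expand_sizes(50, K = 3)             # scalar recycled K times
sizes_vector <- expand_sizes(c(40, 50, 60), K = 3)  # vector returned unchanged

# A mismatched vector raises an error; capture it instead of halting
res <- tryCatch(expand_sizes(c(40, 50), K = 3), error = function(e) "error")
```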

Value

A list containing:

X

The primary dataset predictors (target domain).

y

The primary dataset responses (target domain).

X_aux

A list of matrices combining auxiliary and noisy dataset predictors.

y_aux

A list of vectors combining auxiliary and noisy dataset responses.

X_test

The test dataset predictors, if 'is_test=TRUE'.

y_test

The test dataset responses, if 'is_test=TRUE'.
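Because 'X_aux' and 'y_aux' hold both auxiliary and noisy datasets, it can help to compute the index ranges of each group. The sketch below uses a stand-in list rather than real output. It assumes that the 'K' auxiliary datasets come first and the 'K_noise' noisy datasets follow, which is the ordering suggested by the package example (noisy dataset 1 at 'X_aux[[4]]' when 'K = 3').

```r
# Stand-in for the 'X_aux' element of the returned list (zeros, not real data)
K <- 3
K_noise <- 2
p <- 10
sizes <- c(rep(50, K), rep(30, K_noise))
X_aux <- lapply(sizes, function(n) matrix(0, nrow = n, ncol = p))

# Assumed layout: auxiliary datasets first, noisy datasets after them
aux_idx   <- seq_len(K)             # positions of the auxiliary datasets
noise_idx <- K + seq_len(K_noise)   # positions of the noisy datasets

aux_rows   <- sapply(X_aux[aux_idx], nrow)    # rows in each auxiliary dataset
noise_rows <- sapply(X_aux[noise_idx], nrow)  # rows in each noisy dataset
```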

Examples

# Example: Generate data with auxiliary, noisy, and test datasets for regression
dat_reg <- generate_data(n0=100, p=10, K=3, nk=50, is_noise=TRUE, K_noise=2, nk_noise=30, 
                         mu_trgt=1, xi_aux=0.5, ro=0.3, err_sig=1, 
                         is_test=TRUE, task="regression")

# Example: Generate data with auxiliary, noisy, and test datasets for classification
dat_class <- generate_data(n0=100, p=10, K=3, nk=50, is_noise=TRUE, K_noise=2, nk_noise=30, 
                           mu_trgt=1, xi_aux=0.5, ro=0.3, err_sig=1, 
                           is_test=TRUE, task="classification")

# Display the dimensions of the generated data
cat("Primary dataset (X):", dim(dat_reg$X), "\n")   # Should print 100 10 (rows, columns)
cat("Primary dataset (y):", length(dat_reg$y), "\n") # Should print 100

# Display the dimensions of auxiliary datasets
cat("Auxiliary dataset 1 (X_aux[[1]]):", dim(dat_reg$X_aux[[1]]), "\n") # Should print 50 10
cat("Auxiliary dataset 2 (X_aux[[2]]):", dim(dat_reg$X_aux[[2]]), "\n") # Should print 50 10

# Display the dimensions of noisy datasets (if generated)
cat("Noisy dataset 1 (X_aux[[4]]):", dim(dat_reg$X_aux[[4]]), "\n") # Should print 30 10

# Display test data dimensions (if generated)
if (!is.null(dat_reg$X_test)) {
  cat("Test dataset (X_test):", dim(dat_reg$X_test), "\n") # Should print 100 10
  cat("Test dataset (y_test):", length(dat_reg$y_test), "\n") # Should print 100
}

[Package ARTtransfer version 1.0.0 Index]