rf_domain_score {viraldomain}    R Documentation

Calculate the Random Forest Model Domain Applicability Score

Description

This function fits a Random Forest model to the training data and computes a domain applicability score for each test observation based on PCA distances.

Usage

rf_domain_score(
  featured_col,
  train_data,
  rf_hyperparameters,
  test_data,
  threshold_value
)

Arguments

featured_col

A character string specifying the name of the response variable to predict.

train_data

A data frame containing predictor variables and the response variable for training the model.

rf_hyperparameters

A list of hyperparameters for the Random Forest model, including:

  • mtry: Number of predictors sampled at each split.

  • min_n: Minimum number of data points in a node for further splitting.

  • trees: Number of trees in the ensemble.
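
The list entries above share their names with the arguments of parsnip::rand_forest(). As a minimal sketch, assuming the function builds its model through parsnip (this page does not say so; only the ranger engine and the regression mode come from the Details section), the hyperparameters could be turned into a model specification like this:

```r
library(parsnip)

# Hyperparameter list in the shape described above
rf_hyperparameters <- list(mtry = 2, min_n = 5, trees = 500)

# Assumed mapping: each list entry feeds the rand_forest() argument of the same name
rf_spec <- rand_forest(
  mtry  = rf_hyperparameters$mtry,
  min_n = rf_hyperparameters$min_n,
  trees = rf_hyperparameters$trees
)

# Engine and mode as stated in Details: ranger, regression
rf_spec <- set_engine(rf_spec, "ranger")
rf_spec <- set_mode(rf_spec, "regression")

rf_spec
```

Printing rf_spec shows the specification without fitting anything; a model is only trained once the specification is passed to fit() together with a formula and data.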

test_data

A data frame of new observations for which predictions and domain applicability scores are computed.

threshold_value

A numeric threshold value used for computing domain applicability scores.
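
This page does not spell out how the threshold enters the calculation. The base-R sketch below illustrates one common reading of a PCA-distance applicability score, not this package's implementation. It assumes the threshold is the share of variance that the retained principal components must explain. The mtcars data, the column choice, and all object names are hypothetical stand-ins.

```r
# Hypothetical training and test sets taken from a built-in dataset
cols    <- c("mpg", "disp", "hp", "wt")
train_x <- mtcars[1:24, cols]
test_x  <- mtcars[25:32, cols]

# Assumption: the threshold is a cumulative-variance cutoff for the PCA
threshold_value <- 0.99

# PCA on centred, scaled training predictors
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Keep the fewest components whose cumulative variance reaches the threshold
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp  <- which(cum_var >= threshold_value)[1]

# Project both sets into the retained component space
train_pc <- pca$x[, seq_len(n_comp), drop = FALSE]
test_pc  <- predict(pca, newdata = test_x)[, seq_len(n_comp), drop = FALSE]

# Euclidean distance from the training centre (the origin of the PC scores)
train_dist <- sqrt(rowSums(train_pc^2))
test_dist  <- sqrt(rowSums(test_pc^2))

# Percentile of each test distance within the training distances
scores <- data.frame(
  distance      = test_dist,
  distance_pctl = 100 * ecdf(train_dist)(test_dist)
)
scores
```

Under this reading, a test observation with a high percentile lies far from the bulk of the training data in component space, so predictions for it deserve less trust.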

Details

Random Forest builds a large number of decision trees, each grown independently of the others. The final prediction combines the predictions of all individual trees; for regression, this is their average. This function fits regression models with the ranger engine.
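
The way an ensemble prediction is built from individual trees can be checked directly with ranger. The sketch below calls ranger itself rather than rf_domain_score(), and uses the built-in mtcars data as a hypothetical stand-in. The ranger arguments num.trees, mtry, and min.node.size play the roles of trees, mtry, and min_n.

```r
library(ranger)

set.seed(123)

# Small regression forest on a built-in dataset (illustration only)
fit <- ranger(
  mpg ~ disp + hp + wt,
  data          = mtcars,
  num.trees     = 50,
  mtry          = 2,
  min.node.size = 5
)

new_obs <- mtcars[1:5, ]

# One column of predictions per tree
per_tree <- predict(fit, data = new_obs, predict.all = TRUE)$predictions

# Aggregated forest prediction
ensemble <- predict(fit, data = new_obs)$predictions

# For regression, the forest prediction is the mean over trees
max_diff <- max(abs(rowMeans(per_tree) - ensemble))
max_diff
```

A max_diff that is zero up to floating-point error confirms that the regression forest averages its trees.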

Value

A data frame containing the computed domain applicability scores for each observation in the test dataset.

Examples

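set.seed(123)  # make the fitted forest reproducible
library(viraldomain)  # provides rf_domain_score()
library(dplyr)

# Name of the response variable, and training data holding it plus the predictor
featured_col <- "cd_2022"
train_data <- viral %>%
  dplyr::select(cd_2022, vl_2022)

# New observations to score
test_data <- sero

rf_hyperparameters <- list(mtry = 2, min_n = 5, trees = 500)
threshold_value <- 0.99
rf_domain_score(featured_col, train_data, rf_hyperparameters, test_data, threshold_value)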

[Package viraldomain version 0.0.7 Index]