ResamplingSameOtherCV {mlr3resampling}    R Documentation
Resampling for comparing training on same or other subsets
Description
ResamplingSameOtherCV defines how a task is partitioned for resampling, for example in resample() or benchmark().
Resampling objects can be instantiated on a Task, which should define at least one subset variable.
After instantiation, sets can be accessed via $train_set(i) and $test_set(i), respectively.
Details
This provides an implementation of SOAK, Same/Other/All K-fold cross-validation. After instantiation, this class provides information in $instance that can be used for visualizing the splits, as shown in the vignette. Most typical machine learning users should instead use ResamplingSameOtherSizesCV, which does not support these visualization features, but provides other relevant machine learning features, such as group role, which is not supported by ResamplingSameOtherCV.
A supervised learning algorithm inputs a train set and outputs a prediction function, which can be used on a test set. If each data point belongs to a subset (such as a geographic region or a year), how do we know whether it is possible to train on one subset and predict accurately on another?
Cross-validation can be used to determine the extent to which this is possible. First, fold IDs from 1 to K are assigned to all data (possibly using stratification, usually by subset and label). Then we loop over test sets (subset/fold combinations) and train sets (same subset, other subsets, all subsets), and compute test/prediction accuracy for each combination.
Comparing test accuracy between same and other shows the extent to which it is possible: it is perfect if same and other have similar test accuracy for each subset; other is usually somewhat less accurate than same; and other can be as bad as a featureless baseline when the subsets have different patterns.
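The loop described above can be sketched in base R. This is only an illustration on toy data, not the package's implementation: the names dat, splits, and get_split are hypothetical, and the sketch assumes that training rows always exclude the test fold, for all of same, other, and all.

```r
# Illustrative sketch in base R; not the mlr3resampling implementation.
set.seed(1)
n_folds <- 3
# Toy data: 12 observations in two hypothetical subsets.
dat <- data.frame(
  row_id = 1:12,
  subset = rep(c("A", "B"), each = 6))
# Assign fold IDs 1..K within each subset, so that each subset is
# spread about equally across the folds.
dat$fold <- ave(dat$row_id, dat$subset, FUN = function(ids) {
  sample(rep(seq_len(n_folds), length.out = length(ids)))
})
# One split per (test subset, test fold, train subsets) combination.
splits <- expand.grid(
  test_subset = unique(dat$subset),
  test_fold = seq_len(n_folds),
  train_subsets = c("same", "other", "all"),
  stringsAsFactors = FALSE)
get_split <- function(test_subset, test_fold, train_subsets) {
  is_test <- dat$subset == test_subset & dat$fold == test_fold
  in_subset <- switch(
    train_subsets,
    same = dat$subset == test_subset,
    other = dat$subset != test_subset,
    all = rep(TRUE, nrow(dat)))
  # Assumption: training rows exclude the test fold, so the train and
  # test sets never overlap.
  is_train <- in_subset & dat$fold != test_fold
  list(train = dat$row_id[is_train], test = dat$row_id[is_test])
}
# Row ids for the first combination in the splits table.
first <- with(splits[1, ], get_split(test_subset, test_fold, train_subsets))
```

A learner would then be trained on first$train and evaluated on first$test, and the resulting accuracy compared across the same, other, and all rows that share a test subset and test fold.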
Stratification
ResamplingSameOtherCV supports stratified sampling.
The stratification variables are assumed to be discrete, and must be stored in the Task with column role "stratum".
In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.
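As a small illustration of how combinations of values form strata, the base R sketch below builds one stratum per observed combination of two discrete variables. The column names label and region are hypothetical, and this only mimics the idea; the Task derives strata itself from the columns with role "stratum".

```r
# Toy data with two discrete stratification variables (hypothetical names).
dat <- data.frame(
  label = c("yes", "no", "yes", "no", "yes", "no"),
  region = c("north", "north", "south", "south", "north", "south"))
# Each observed combination of values is one stratum.
dat$stratum <- interaction(dat$label, dat$region, drop = TRUE)
# Number of observations in each stratum.
table(dat$stratum)
```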
Grouping
ResamplingSameOtherCV does not support grouping of observations that should not be split in cross-validation.
See ResamplingSameOtherSizesCV for another sampler which does support both group and subset roles.
Subsets
The subset variable is assumed to be discrete, and must be stored in the Task with column role "subset".
The number of cross-validation folds K should be defined as the folds parameter.
In each subset, there will be about an equal number of observations assigned to each of the K folds.
The assignments are stored in $instance$id.dt.
The train/test splits are defined by all possible combinations of test subset, test fold, and train subsets (Same/Other/All).
The splits are stored in $instance$iteration.dt.
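The following sketch shows how these pieces might fit together on a toy regression task. It assumes that the mlr3 and mlr3resampling packages are installed and that loading mlr3resampling makes the "subset" column role available; the data and the names task_dt, reg_task, train_ids, and test_ids are made up for illustration.

```r
# Loading the package is assumed to register the "subset" column role.
library(mlr3resampling)
set.seed(1)
N <- 60
# Toy data: one feature x, and a discrete subset variable person.
task_dt <- data.frame(
  x = runif(N),
  person = rep(1:2, each = N / 2))
task_dt$y <- sin(task_dt$x) + rnorm(N, sd = 0.1)
reg_task <- mlr3::TaskRegr$new("toy", task_dt, target = "y")
# Use person as both the subset and the stratum, and x as the only feature.
reg_task$col_roles$subset <- "person"
reg_task$col_roles$stratum <- "person"
reg_task$col_roles$feature <- "x"
same_other <- ResamplingSameOtherCV$new()
same_other$param_set$values$folds <- 3
same_other$instantiate(reg_task)
# Fold assignments, one row per observation.
same_other$instance$id.dt
# Train/test splits, one row per iteration.
same_other$instance$iteration.dt
# Row ids of the first split.
train_ids <- same_other$train_set(1)
test_ids <- same_other$test_set(1)
```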
Methods
Public methods
Method new()
Creates a new instance of this R6 class.
Usage
Resampling$new( id, param_set = ps(), duplicated_ids = FALSE, label = NA_character_, man = NA_character_ )
Arguments
id
(character(1)) Identifier for the new instance.
param_set
(paradox::ParamSet) Set of hyperparameters.
duplicated_ids
(logical(1)) Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
label
(character(1)) Label for the new instance.
man
(character(1)) String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help package can be opened via method $help().
Method train_set()
Returns the row ids of the i-th training set.
Usage
Resampling$train_set(i)
Arguments
i
(integer(1)) Iteration.
Returns
(integer()) of row ids.
Method test_set()
Returns the row ids of the i-th test set.
Usage
Resampling$test_set(i)
Arguments
i
(integer(1)) Iteration.
Returns
(integer()) of row ids.
See Also
arXiv paper https://arxiv.org/abs/2410.08643 describing the SOAK algorithm.
Articles https://github.com/tdhock/mlr3resampling/wiki/Articles
Package mlr3 for standard Resampling, which does not support comparing training on Same/Other/All subsets.
vignette(package="mlr3resampling") for more detailed examples.
Examples
same_other <- mlr3resampling::ResamplingSameOtherCV$new()
same_other$param_set$values$folds <- 5