DataSimilarity-package {DataSimilarity}    R Documentation

Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

Description

A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024) <doi:10.1214/24-SS149>.

Details

The DESCRIPTION file:

Package: DataSimilarity
Type: Package
Title: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing
Version: 0.2.0
Date: 2025-06-14
Authors@R: c(person(given = "Marieke", family = "Stolte", email = "stolte@statistik.tu-dortmund.de", role = c("aut", "cre", "cph"), comment = c(ORCID = "0009-0002-0711-6789")), person(given = "Luca", family = "Sauer", role = c("aut"), comment = c(ORCID = "0009-0000-1086-023X")), person(given = "David", family = "Alvarez-Melis", role = c("ctb"), comment = "Original python implementation of OTDD, <https://github.com/microsoft/otdd.git>"), person(given = "Nabarun", family = "Deb", role = c("ctb"), comment = "Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>"), person(given = "Bodhisattva", family = "Sen", role = c("ctb"), comment = "Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>"))
Depends: R (>= 3.5.0)
Imports: boot, stats
Suggests: ade4, approxOT, Ball, caret, clue, cramer, crossmatch, dbscan, densratio, DWDLargeR, e1071, Ecume, energy, expm, FNN, gTests, gTestsMulti, HDLSSkST, hypoRF, kernlab, kerTests, KMD, knitr, LPKsample, Matrix, mvtnorm, nbpMatching, pROC, purrr, randtoolbox, rlemon, rpart, rpart.plot, testthat, nnet
Description: A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024) <doi:10.1214/24-SS149>.
License: GPL (>=3)
LazyData: true
Author: Marieke Stolte [aut, cre, cph] (<https://orcid.org/0009-0002-0711-6789>), Luca Sauer [aut] (<https://orcid.org/0009-0000-1086-023X>), David Alvarez-Melis [ctb] (Original python implementation of OTDD, <https://github.com/microsoft/otdd.git>), Nabarun Deb [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>), Bodhisattva Sen [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>)
Maintainer: Marieke Stolte <stolte@statistik.tu-dortmund.de>

Index of help topics:

BF                      Baringhaus and Franz (2010) Rigid Motion
                        Invariant Multivariate Two-sample Test
BG                      Biau and Gyorfi (2005) Two-sample Homogeneity
                        Test
BG2                     Biswas and Ghosh (2014) Two-Sample Test
BMG                     Biswas et al. (2014) Two-sample Runs Test
BQS                     Barakat et al. (1996) Two-Sample Test
Bahr                    Bahr (1996) Multivariate Two-sample Test
BallDivergence          Ball Divergence Based Two- or k-sample Test
C2ST                    Classifier Two-Sample Test
CCS                     Weighted Edge-Count Two-Sample Test
CCS_cat                 Weighted Edge-Count Two-Sample Test for
                        Discrete Data
CF                      Generalized Edge-Count Test
CF_cat                  Generalized Edge-Count Test for Discrete Data
CMDistance              Constrained Minimum Distance
Cramer                  Cramér Two-Sample Test
DISCOB                  Distance Components (DISCO) Tests
DISCOF                  Distance Components (DISCO) Tests
DS                      Rank-Based Energy Test (Deb and Sen, 2021)
DataSimilarity          Dataset Similarity
DataSimilarity-package
                        Quantifying Similarity of Datasets and
                        Multivariate Two- And k-Sample Testing
DiProPerm               Direction-Projection-Permutation (DiProPerm)
                        Test
Energy                  Energy Statistic and Test
FR                      Friedman-Rafsky Test
FR_cat                  Friedman-Rafsky Test for Discrete Data
FStest                  Multisample FS Test
GGRL                    Decision-Tree Based Measure of Dataset Distance
                        and Two-Sample Test
GPK                     Generalized Permutation-Based Kernel (GPK)
                        Two-Sample Test
HMN                     Random Forest Based Two-Sample Test
HamiltonPath            Shortest Hamilton Path
Jeffreys                Jeffreys Divergence
KMD                     Kernel Measure of Multi-Sample Dissimilarity
                        (KMD)
LHZ                     Empirical Characteristic Distance
LHZStatistic            Calculation of the Li et al. (2022) Empirical
                        Characteristic Distance
MMCM                    Multisample Mahalanobis Crossmatch (MMCM) Test
MMD                     Maximum Mean Discrepancy (MMD) Test
MST                     Minimum Spanning Tree (MST)
MW                      Nonparametric Graph-Based LP (GLP) Test
NKT                     Decision-Tree Based Measure of Dataset
                        Similarity (Ntoutsi et al., 2008)
OTDD                    Optimal Transport Dataset Distance
Petrie                  Multisample Crossmatch (MCM) Test
RItest                  Multisample RI Test
Rosenbaum               Rosenbaum Crossmatch Test
SC                      Graph-Based Multi-Sample Test
SH                      Schilling-Henze Nearest Neighbor Test
Wasserstein             Wasserstein Distance Based Test
YMRZL                   Yu et al. (2007) Two-Sample Test
ZC                      Maxtype Edge-Count Test
ZC_cat                  Maxtype Edge-Count Test for Discrete Data
dipro.fun               Direction-Projection Functions for DiProPerm
                        Test
engineerMetric          Engineer Metric
findSimilarityMethod    Selection of Appropriate Methods for
                        Quantifying the Similarity of Datasets
gTests                  Graph-Based Tests
gTestsMulti             Graph-Based Multi-Sample Test
gTests_cat              Graph-Based Tests for Discrete Data
kerTests                Generalized Permutation-Based Kernel (GPK)
                        Two-Sample Test
knn                     K-Nearest Neighbor Graph
method.table            List of Methods Included in the Package
rectPartition           Calculate a Rectangular Partition
stat.fun                Univariate Two-Sample Statistics for DiProPerm
                        Test

The package provides various methods for comparing two or more datasets or their underlying distributions. Many of them perform a permutation or asymptotic test of the null hypothesis of equal distributions, H_0: F_1 = F_2 in the two-sample case or H_0: F_1 = ... = F_k in the k-sample case.
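
The permutation approach to testing H_0: F_1 = F_2 can be sketched in base R as follows. This is only an illustration under stated assumptions: the helpers energy_stat and perm_test are hypothetical names and not part of the package's interface, and the energy-type statistic is just one possible choice of test statistic. For real analyses, the functions listed in the index above should be preferred.

```r
# Hypothetical helpers for illustration only; not the DataSimilarity API.

# Energy-type statistic (V-statistic form): twice the mean between-sample
# distance minus the mean within-sample distances of each sample.
energy_stat <- function(X1, X2) {
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  D <- as.matrix(dist(rbind(X1, X2)))  # pooled Euclidean distance matrix
  i1 <- seq_len(n1)
  i2 <- n1 + seq_len(n2)
  2 * mean(D[i1, i2]) - mean(D[i1, i1]) - mean(D[i2, i2])
}

# Permutation test: randomly reassign the pooled rows to the two samples
# and compare the observed statistic with the permuted statistics.
perm_test <- function(X1, X2, n.perm = 199, seed = 42) {
  set.seed(seed)
  pooled <- rbind(X1, X2)
  n1 <- nrow(X1)
  n <- nrow(pooled)
  observed <- energy_stat(X1, X2)
  permuted <- replicate(n.perm, {
    idx <- sample(n)
    energy_stat(pooled[idx[seq_len(n1)], , drop = FALSE],
                pooled[idx[-seq_len(n1)], , drop = FALSE])
  })
  # Adding 1 to numerator and denominator keeps the p-value above 0
  p.value <- (1 + sum(permuted >= observed)) / (n.perm + 1)
  list(statistic = observed, p.value = p.value)
}

set.seed(1)
X1 <- matrix(rnorm(100), ncol = 2)            # 50 draws from N(0, I)
X2 <- matrix(rnorm(100, mean = 1), ncol = 2)  # 50 draws from N((1, 1), I)
res <- perm_test(X1, X2)
print(res)
```

Large values of the statistic speak against H_0, so the p-value is the share of permuted statistics that are at least as large as the observed one. The same scheme extends to the k-sample case by permuting the sample labels of all k datasets jointly.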

Author(s)

Marieke Stolte [aut, cre, cph] (<https://orcid.org/0009-0002-0711-6789>), Luca Sauer [aut] (<https://orcid.org/0009-0000-1086-023X>), David Alvarez-Melis [ctb] (Original python implementation of OTDD, <https://github.com/microsoft/otdd.git>), Nabarun Deb [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>), Bodhisattva Sen [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>)

Maintainer: Marieke Stolte <stolte@statistik.tu-dortmund.de>

References

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). A Comparison of Methods for Quantifying Dataset Similarity. https://shiny.statistik.tu-dortmund.de/data-similarity/

