DataSimilarity-package {DataSimilarity} | R Documentation |
Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing
Description
A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024) <doi:10.1214/24-SS149>.
Details
The DESCRIPTION file:
Package: | DataSimilarity |
Type: | Package |
Title: | Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing |
Version: | 0.2.0 |
Date: | 2025-06-14 |
Authors@R: | c(person(given = "Marieke", family = "Stolte", email = "stolte@statistik.tu-dortmund.de", role = c("aut", "cre", "cph"), comment = c(ORCID = "0009-0002-0711-6789")), person(given = "Luca", family = "Sauer", role = c("aut"), comment = c(ORCID = "0009-0000-1086-023X")), person(given = "David", family = "Alvarez-Melis", role = c("ctb"), comment = "Original python implementation of OTDD, <https://github.com/microsoft/otdd.git>"), person(given = "Nabarun", family = "Deb", role = c("ctb"), comment = "Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>"), person(given = "Bodhisattva", family = "Sen", role = c("ctb"), comment = "Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>")) |
Depends: | R (>= 3.5.0) |
Imports: | boot, stats |
Suggests: | ade4, approxOT, Ball, caret, clue, cramer, crossmatch, dbscan, densratio, DWDLargeR, e1071, Ecume, energy, expm, FNN, gTests, gTestsMulti, HDLSSkST, hypoRF, kernlab, kerTests, KMD, knitr, LPKsample, Matrix, mvtnorm, nbpMatching, pROC, purrr, randtoolbox, rlemon, rpart, rpart.plot, testthat, nnet |
Description: | A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024) <doi:10.1214/24-SS149>. |
License: | GPL (>=3) |
LazyData: | true |
Author: | Marieke Stolte [aut, cre, cph] (<https://orcid.org/0009-0002-0711-6789>), Luca Sauer [aut] (<https://orcid.org/0009-0000-1086-023X>), David Alvarez-Melis [ctb] (Original python implementation of OTDD, <https://github.com/microsoft/otdd.git>), Nabarun Deb [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>), Bodhisattva Sen [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>) |
Maintainer: | Marieke Stolte <stolte@statistik.tu-dortmund.de> |
Index of help topics:
BF                      Baringhaus and Franz (2010) Rigid Motion Invariant Multivariate Two-sample Test
BG                      Biau and Gyorfi (2005) Two-sample Homogeneity Test
BG2                     Biswas and Ghosh (2014) Two-Sample Test
BMG                     Biswas et al. (2014) Two-sample Runs Test
BQS                     Barakat et al. (1996) Two-Sample Test
Bahr                    Bahr (1996) Multivariate Two-sample Test
BallDivergence          Ball Divergence Based Two- or k-sample Test
C2ST                    Classifier Two-Sample Test
CCS                     Weighted Edge-Count Two-Sample Test
CCS_cat                 Weighted Edge-Count Two-Sample Test for Discrete Data
CF                      Generalized Edge-Count Test
CF_cat                  Generalized Edge-Count Test for Discrete Data
CMDistance              Constrained Minimum Distance
Cramer                  Cramér Two-Sample Test
DISCOB                  Distance Components (DISCO) Tests
DISCOF                  Distance Components (DISCO) Tests
DS                      Rank-Based Energy Test (Deb and Sen, 2021)
DataSimilarity          Dataset Similarity
DataSimilarity-package  Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing
DiProPerm               Direction-Projection-Permutation (DiProPerm) Test
Energy                  Energy Statistic and Test
FR                      Friedman-Rafsky Test
FR_cat                  Friedman-Rafsky Test for Discrete Data
FStest                  Multisample FS Test
GGRL                    Decision-Tree Based Measure of Dataset Distance and Two-Sample Test
GPK                     Generalized Permutation-Based Kernel (GPK) Two-Sample Test
HMN                     Random Forest Based Two-Sample Test
HamiltonPath            Shortest Hamilton path
Jeffreys                Jeffreys Divergence
KMD                     Kernel Measure of Multi-Sample Dissimilarity (KMD)
LHZ                     Empirical Characteristic Distance
LHZStatistic            Calculation of the Li et al. (2022) Empirical Characteristic Distance
MMCM                    Multisample Mahalanobis Crossmatch (MMCM) Test
MMD                     Maximum Mean Discrepancy (MMD) Test
MST                     Minimum Spanning Tree (MST)
MW                      Nonparametric Graph-Based LP (GLP) Test
NKT                     Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et al., 2008)
OTDD                    Optimal Transport Dataset Distance
Petrie                  Multisample Crossmatch (MCM) Test
RItest                  Multisample RI Test
Rosenbaum               Rosenbaum Crossmatch Test
SC                      Graph-Based Multi-Sample Test
SH                      Schilling-Henze Nearest Neighbor Test
Wasserstein             Wasserstein Distance Based Test
YMRZL                   Yu et al. (2007) Two-Sample Test
ZC                      Maxtype Edge-Count Test
ZC_cat                  Maxtype Edge-Count Test for Discrete Data
dipro.fun               Direction-Projection Functions for DiProPerm Test
engineerMetric          Engineer Metric
findSimilarityMethod    Selection of Appropriate Methods for Quantifying the Similarity of Datasets
gTests                  Graph-Based Tests
gTestsMulti             Graph-Based Multi-Sample Test
gTests_cat              Graph-Based Tests for Discrete Data
kerTests                Generalized Permutation-Based Kernel (GPK) Two-Sample Test
knn                     K-Nearest Neighbor Graph
method.table            List of Methods Included in the Package
rectPartition           Calculate a Rectangular Partition
stat.fun                Univariate Two-Sample Statistics for DiProPerm Test
The package provides various methods for comparing two or more datasets or their underlying distributions. Many of these methods perform a permutation or asymptotic test of the null hypothesis of equal distributions, H_0: F_1 = F_2 in the two-sample case or H_0: F_1 = \dots = F_k in the k-sample case.
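To make the idea of a permutation test of H_0: F_1 = F_2 concrete, here is a minimal sketch in base R. It does not call any function of this package: `energy_stat` and `perm_test` are hypothetical helpers written only for this illustration, and the energy-type distance statistic is just one possible choice of test statistic.

```r
# Illustrative base-R sketch of a two-sample permutation test.
# Assumption: energy_stat() and perm_test() are hypothetical helpers,
# not functions exported by the DataSimilarity package.
set.seed(1)

# Energy-type statistic from pairwise Euclidean distances:
# 2 * mean between-sample distance minus the two mean within-sample distances.
energy_stat <- function(X, Y) {
  n1 <- nrow(X)
  n2 <- nrow(Y)
  D <- as.matrix(dist(rbind(X, Y)))
  i1 <- seq_len(n1)
  i2 <- n1 + seq_len(n2)
  2 * mean(D[i1, i2]) - mean(D[i1, i1]) - mean(D[i2, i2])
}

# Permutation test: reshuffle the pooled rows, recompute the statistic,
# and compare the observed value with the permutation distribution.
perm_test <- function(X, Y, n_perm = 199) {
  observed <- energy_stat(X, Y)
  pooled <- rbind(X, Y)
  n1 <- nrow(X)
  n <- nrow(pooled)
  perm_stats <- replicate(n_perm, {
    idx <- sample(n)
    energy_stat(pooled[idx[seq_len(n1)], , drop = FALSE],
                pooled[idx[-seq_len(n1)], , drop = FALSE])
  })
  # Adding 1 to numerator and denominator keeps the p-value strictly positive
  p_value <- (1 + sum(perm_stats >= observed)) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

# Two simulated datasets with two variables each; the second is mean-shifted
X <- matrix(rnorm(100), ncol = 2)
Y <- matrix(rnorm(100, mean = 1), ncol = 2)
res <- perm_test(X, Y)
print(res)
```

A small p-value indicates evidence against equal distributions. The same scheme extends to the k-sample hypothesis by permuting group labels across all k datasets and using a statistic defined for k groups.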
Author(s)
Marieke Stolte [aut, cre, cph] (<https://orcid.org/0009-0002-0711-6789>), Luca Sauer [aut] (<https://orcid.org/0009-0000-1086-023X>), David Alvarez-Melis [ctb] (Original python implementation of OTDD, <https://github.com/microsoft/otdd.git>), Nabarun Deb [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>), Bodhisattva Sen [ctb] (Original implementation of rank-based Energy test (DS), <https://github.com/NabarunD/MultiDistFree.git>)
Maintainer: Marieke Stolte <stolte@statistik.tu-dortmund.de>
References
Stolte, M., Kappenberg, F., Rahnenführer, J. & Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
Stolte, M., Kappenberg, F., Rahnenführer, J. & Bommert, A. (2024). A Comparison of Methods for Quantifying Dataset Similarity. https://shiny.statistik.tu-dortmund.de/data-similarity/