MW {DataSimilarity} | R Documentation |
Nonparametric Graph-Based LP (GLP) Test
Description
Performs the nonparametric graph-based LP (GLP) multisample test proposed by Mukhopadhyay and Wang (2020). The function relies on the GLP implementation from the LPKsample package.
Usage
MW(X1, X2, ..., sum.all = FALSE, m.max = 4, components = NULL, alpha = 0.05,
   c.poly = 0.5, clust.alg = "kmeans", n.perm = 0, combine.criterion = "kernel",
   multiple.comparison = TRUE, compress.algorithm = FALSE, nbasis = 8, seed = NULL)
Arguments
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
... |
Optionally more datasets as matrices or data.frames |
sum.all |
Should all components be summed up for calculating the test statistic? (default: FALSE) |
m.max |
Maximum order of LP components to investigate (default: 4) |
components |
Vector specifying which components to test. If |
alpha |
Significance level (default: 0.05) |
c.poly |
Parameter for polynomial kernel (default: 0.5) |
clust.alg |
Character specifying the cluster algorithm used in graph community detection. Possible options are |
n.perm |
Number of permutations for the permutation test (default: 0, in which case the asymptotic test is performed). |
combine.criterion |
Character specifying how to obtain the overall test result based on the component-wise results. Possible options are |
multiple.comparison |
Should an adjustment for multiple comparisons be used when determining which components are significant? (default: TRUE) |
compress.algorithm |
Should smooth compression of Laplacian spectra be used for testing? (default: FALSE) |
nbasis |
Number of bases used for approximation when |
seed |
Random seed (default: NULL). A random seed will only be set if one is provided. |
Details
The GLP statistic is based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel. The cluster assignment is tested for association with the true dataset memberships for each component of the LP graph kernel. The results are combined by either constructing a super-kernel using specific components and performing the cluster and test step again or by using the combination of the significant components after adjustment for multiple testing.
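The cluster-then-test idea described above can be sketched in base R. This is a simplified illustration under loud assumptions, not the LPKsample algorithm: a Gaussian kernel stands in for the LP graph kernel, no LP components are computed, and a chi-squared test stands in for the GLP statistic. All object names are made up for the example.

```r
# Simplified illustration only (NOT the GLP implementation):
# a Gaussian kernel replaces the LP graph kernel and a chi-squared
# test replaces the GLP statistic.
set.seed(1)
X1 <- matrix(rnorm(200), ncol = 4)
X2 <- matrix(rnorm(200, mean = 2), ncol = 4)
X <- scale(rbind(X1, X2))
labels <- rep(1:2, times = c(nrow(X1), nrow(X2)))

# Stand-in similarity kernel on the pooled sample
K <- exp(-as.matrix(dist(X))^2 / (2 * ncol(X)))

# Symmetric normalized graph Laplacian of the kernel
d <- rowSums(K)
L <- diag(nrow(K)) - K / sqrt(outer(d, d))

# eigen() returns eigenvalues in decreasing order, so the eigenvectors
# belonging to the smallest eigenvalues are the last columns
eig <- eigen(L, symmetric = TRUE)
n <- nrow(L)
U <- eig$vectors[, (n - 1):n, drop = FALSE]

# Cluster the spectral embedding and test the association between
# cluster assignment and true dataset membership
cl <- kmeans(U, centers = 2, nstart = 10)$cluster
tab <- table(cluster = cl, dataset = labels)
assoc <- chisq.test(tab, correct = FALSE)
assoc$p.value
```

A strong association between cluster assignment and dataset membership speaks against the datasets being similar, which matches the direction in which the GLP test rejects.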
Small values of the GLP statistic indicate dataset similarity. Therefore, the test rejects for large values.
Value
An object of class htest with the following components:
statistic |
Observed value of the GLP test statistic |
p.value |
Asymptotic or permutation overall p value |
null.value |
Needed for pretty printing of results |
alternative |
The alternative hypothesis (needed for pretty printing of results) |
method |
Description of the test |
data.name |
The dataset names |
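Because the return value is a standard htest object, its components can be read with `$`. The following is a minimal sketch that assumes both DataSimilarity and LPKsample are installed; the name `res` and the 0.05 cutoff are chosen only for illustration.

```r
# Sketch: inspect the components of the returned htest object.
# Runs only if both packages are available.
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("LPKsample", quietly = TRUE)) {
  set.seed(1234)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  res <- DataSimilarity::MW(X1, X2, n.perm = 100)

  res$statistic        # observed GLP statistic
  res$p.value          # permutation p value, since n.perm > 0
  res$p.value < 0.05   # decision at an illustrative 5% level
}
```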
Applicability
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | Yes |
Note
When sum.all = FALSE and no components are significant, the test statistic value is always set to zero.
Note that the implementation cannot handle univariate data.
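Since univariate data is not supported, it can help to check the inputs before calling the test. The helper below is hypothetical and not part of DataSimilarity; it only checks that every dataset has at least two columns and that all datasets have the same number of columns.

```r
# Hypothetical pre-check, not part of the DataSimilarity package
check_multivariate <- function(...) {
  datasets <- list(...)
  # NCOL() treats a plain vector as a single column
  n.cols <- vapply(datasets, function(d) NCOL(d), integer(1))
  if (any(n.cols < 2)) {
    stop("each dataset needs at least two columns")
  }
  if (length(unique(n.cols)) != 1) {
    stop("all datasets need the same number of columns")
  }
  invisible(TRUE)
}

# Passes silently for multivariate input
check_multivariate(matrix(rnorm(30), ncol = 3), matrix(rnorm(60), ncol = 3))
```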
References
Mukhopadhyay, S. and Wang, K. (2020). A nonparametric approach to high-dimensional k-sample comparison problems, Biometrika, 107(3), 555-572, doi:10.1093/biomet/asaa015
Mukhopadhyay, S. and Wang, K. (2019). Towards a unified statistical theory of spectral graph analysis, doi:10.48550/arXiv.1901.07090
Mukhopadhyay, S. and Wang, K. (2020). LPKsample: LP Nonparametric High Dimensional K-Sample Comparison. R package version 2.1, https://CRAN.R-project.org/package=LPKsample
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
Examples
set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform GLP test
if (requireNamespace("LPKsample", quietly = TRUE)) {
  MW(X1, X2, n.perm = 100)
}
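Further datasets are passed through the `...` argument, so a K-sample comparison might look like the sketch below. It assumes both DataSimilarity and LPKsample are installed; the third dataset `X3` and the name `res3` are made up for illustration.

```r
# Sketch of a three-sample comparison; runs only if both packages are available
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("LPKsample", quietly = TRUE)) {
  set.seed(1234)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  X3 <- matrix(rnorm(1000, sd = 2), ncol = 10)
  # Additional datasets are supplied as further positional arguments
  res3 <- DataSimilarity::MW(X1, X2, X3, n.perm = 100)
  print(res3)
}
```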