SH {DataSimilarity} | R Documentation |
Schilling-Henze Nearest Neighbor Test
Description
Performs the Schilling-Henze two-sample test for multivariate data (Schilling, 1986; Henze, 1988).
Usage
SH(X1, X2, K = 1, graph.fun = knn.bf, dist.fun = stats::dist, n.perm = 0,
dist.args = NULL, seed = NULL)
Arguments
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
K |
Number of nearest neighbors to consider (default: 1) |
graph.fun |
Function for calculating a similarity graph using the distance matrix on the pooled sample (default: knn.bf) |
dist.fun |
Function for calculating a distance matrix on the pooled dataset (default: stats::dist) |
n.perm |
Number of permutations for permutation test (default: 0, asymptotic test is performed). |
dist.args |
Named list of further arguments passed to dist.fun (default: NULL) |
seed |
Random seed (default: NULL). A random seed will only be set if one is provided. |
Details
The test statistic is the proportion of edges connecting points from the same dataset in a K-nearest neighbor graph calculated on the pooled sample (standardized with expectation and SD under the null).
Low values of the test statistic indicate similarity of the datasets. Thus, the null hypothesis of equal distributions is rejected for high values.
For n.perm = 0, an asymptotic test using the asymptotic normal approximation of the conditional null distribution is performed. For n.perm > 0, a permutation test is performed.
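The quantity described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the helper name sh_proportion is hypothetical, the statistic is left unstandardized, and the permutation p value is computed on that raw proportion by shuffling the sample labels while keeping the distance matrix fixed.

```r
# Hypothetical helper (not part of DataSimilarity): proportion of
# K-nearest-neighbor edges that join two points from the same sample.
sh_proportion <- function(D, labels, K = 1) {
  n <- nrow(D)
  diag(D) <- Inf  # a point is never its own neighbor
  same <- 0
  for (i in seq_len(n)) {
    nn <- order(D[i, ])[seq_len(K)]  # indices of the K nearest neighbors of point i
    same <- same + sum(labels[nn] == labels[i])
  }
  same / (n * K)
}

set.seed(1234)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 3), ncol = 2)

# Pool the samples and compute pairwise Euclidean distances
pooled <- rbind(X1, X2)
labels <- rep(c(1, 2), times = c(nrow(X1), nrow(X2)))
D <- as.matrix(stats::dist(pooled))

# Observed (unstandardized) proportion of within-sample edges
obs <- sh_proportion(D, labels, K = 1)

# Permutation null distribution: shuffle labels, reuse the distance matrix
n_perm <- 200
perm <- replicate(n_perm, sh_proportion(D, sample(labels), K = 1))

# High values speak against equal distributions, so use the upper tail
p_perm <- (sum(perm >= obs) + 1) / (n_perm + 1)
```

Because the distance matrix does not depend on the labels, it is computed once and reused for every permutation; only the label vector changes.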
Value
An object of class htest with the following components:
statistic |
Observed value of the test statistic |
p.value |
Asymptotic or permutation p value |
estimate |
The number of within-sample edges |
alternative |
The alternative hypothesis |
method |
Description of the test |
data.name |
The dataset names |
Applicability
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | No |
Note
The default of K = 1 is chosen rather arbitrarily, based on computational speed, as no good rule for choosing K has been proposed in the literature so far. Typical values for K chosen in the literature are 1 and 5.
References
Schilling, M. F. (1986). Multivariate Two-Sample Tests Based on Nearest Neighbors. Journal of the American Statistical Association, 81(395), 799-806. doi:10.2307/2289012
Henze, N. (1988). A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences. The Annals of Statistics, 16(2), 772-783.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149
See Also
knn, BQS, FR, CF, CCS, ZC for other graph-based tests, FR_cat, CF_cat, CCS_cat, and ZC_cat for versions of the test for categorical data
Examples
set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Schilling-Henze test
SH(X1, X2)
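The call above uses the asymptotic test with K = 1. As a further sketch, the permutation variant with a larger neighborhood could be requested as follows, using only the arguments listed under Usage. The values K = 5 and n.perm = 100 are arbitrary illustrative choices, and the block assumes the DataSimilarity package is installed (it is skipped otherwise).

```r
# Assumes the DataSimilarity package is installed; skipped otherwise.
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  set.seed(1234)
  # Draw some data
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  # Permutation test with K = 5 neighbors (illustrative values)
  res <- DataSimilarity::SH(X1, X2, K = 5, n.perm = 100, seed = 1234)
  # Components documented under Value
  print(res$statistic)
  print(res$p.value)
}
```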