distGDM2 {flexord} | R Documentation |
Distance Functions for K-Centroids Clustering of (Ordinal) Categorical/Mixed Data
Description
Functions to calculate the distance between a matrix x and a matrix centers,
which can be used for K-centroids clustering via flexclust::kcca().
distSimMatch implements Simple Matching Distance (most frequently
used for categorical or symmetric binary data) for K-centroids
clustering.
distGower implements Gower's Distance after Gower (1971) and
Kaufman & Rousseeuw (1990) for mixed-type data with missing values for
K-centroids clustering.
distGDM2 implements GDM2 distance for ordinal data introduced by
Walesiak (1993) and adapted to K-centroids clustering by
Ernst et al. (2025).
These functions are designed for use with flexclust::kcca() or
functions that are built upon it. Their use is easiest via the
wrapper kccaExtendedFamily(). However, they can also easily be
used to obtain a distance matrix of x, see Examples.
Usage
distGDM2(x, centers, genDist, xrange = NULL)
distGower(x, centers, genDist)
distSimMatch(x, centers)
Arguments
x |
A numeric matrix or data frame. |
centers |
A numeric matrix with |
genDist |
Additional information on
|
xrange |
Range specification for the variables. Currently only
used for
|
Details
- distSimMatch: Simple Matching Distance between two observations is calculated as the proportion of disagreements across all variables. Described, e.g., in Kaufman & Rousseeuw (1990), p. 24. If this is used in K-centroids analysis in combination with mode centroids (as implemented in centMode), this results in the kModes algorithm. A wrapper for this algorithm is obtained with kccaExtendedFamily(which='kModes').
- distGower: Distances are calculated for each column (Euclidean distance, distEuclidean, is recommended for numeric, Manhattan distance, distManhattan, for ordinal, Simple Matching Distance, distSimMatch, for categorical, and Jaccard distance, distJaccard, for asymmetric binary variables), and they are combined into a weighted average:
  d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij}, x_{kj})}{\sum_{j=1}^p \delta_{ikj}}
  where p is the number of variables. The weight \delta_{ikj} is 1 if both values x_{ij} and x_{kj} are not missing and, in the case of asymmetric binary variables, at least one of them is not 0; otherwise it is 0. Please note that calculating Gower's distance requires scaling of numeric/ordered variables (as done, for instance, by .ScaleVarSpecific). A wrapper for K-centroids analysis using Gower's distance in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGower').
- distGDM2: GDM2 distance for ordinal variables conducts only relational operations on the variables, such as \leq, \geq and =. By translating x to its relative frequencies and empirical cumulative distributions, we are able to extend this principle to compare two arbitrary values, and thus use it within K-centroids clustering. For more details, see Ernst et al. (2025). A wrapper for this algorithm in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGDM2').
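The weighted average behind Gower's distance can be illustrated for a single pair of observations. The following is a minimal base R sketch, not the package's implementation: the helper gower_pair, the two per-variable distance functions and the example values are hypothetical, numeric values are assumed to be already scaled to [0, 1], and the asymmetric binary case is left out.

```r
# Hypothetical helper (not part of flexord): Gower-type distance between two
# observations xi and xk, given one distance function per variable.
gower_pair <- function(xi, xk, dists) {
  # per-variable distances d(x_ij, x_kj); NA wherever a value is missing
  d <- mapply(function(a, b, f) f(a, b), xi, xk, dists)
  # delta_ikj: TRUE (weight 1) only if both values are observed
  delta <- !(is.na(xi) | is.na(xk))
  sum(d[delta]) / sum(delta)
}

abs_diff <- function(a, b) abs(a - b)          # numeric/ordinal, pre-scaled
mismatch <- function(a, b) as.numeric(a != b)  # simple matching

xi <- c(0.2, 1, NA)
xk <- c(0.5, 2, 3)
res <- gower_pair(xi, xk, list(abs_diff, mismatch, mismatch))
res  # (0.3 + 1) / 2 = 0.65; the third variable is skipped
```

Because the denominator only counts variables observed in both rows, a missing value shrinks the number of comparisons instead of contributing a distance of 0.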
The distance functions presented here can also be used in clustering algorithms that rely on distance matrices (such as hierarchical clustering and PAM), if applied accordingly, see Examples.
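As a rough illustration of the distance-matrix route, the sketch below builds a simple matching distance matrix by hand in base R and passes it to stats::hclust(). The data and all object names are made up for the example. With flexord attached, the hand-written loop could be replaced by a call such as distSimMatch(x, x), as in the Examples.

```r
set.seed(1)
# Made-up categorical data coded as integers: 8 observations, 4 variables
x <- matrix(sample(1:3, 8 * 4, replace = TRUE), nrow = 8)

# Pairwise simple matching distance: proportion of disagreeing variables
n <- nrow(x)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (k in seq_len(n)) {
    d[i, k] <- mean(x[i, ] != x[k, ])
  }
}

# Hierarchical clustering on the resulting distance matrix
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 3)
table(groups)
```

The same as.dist() object could be handed to other distance-based methods, for example cluster::pam(), if that package is installed.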
Value
A matrix of dimensions c(nrow(x), nrow(centers)) that contains the distance
between each row of x and each row of centers.
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Gower, JC (1971). A General Coefficient for Similarity and Some of Its Properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
Kaufman, L, Rousseeuw, P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. doi:10.1002/9780470316801
Leisch, F (2006). A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51(2), 526-544. doi:10.1016/j.csda.2005.10.006
Walesiak, M (1993). Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej, 44-46.
Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005). klaR Analyzing German Business Cycles. In Baier D, Decker, R, Schmidt-Thieme, L (eds.). Data Analysis and Decision Support, 335-343. Berlin: Springer-Verlag. doi:10.1007/3-540-28397-8_36
See Also
flexclust::kcca(), klaR::kmodes(), cluster::daisy(), clusterSim::dist.GDM()
Examples
# Example 1: Simple Matching Distance
set.seed(123)
dat <- data.frame(question1 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question2 = factor(sample(LETTERS[1:6], 10, replace=TRUE)),
                  question3 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question4 = factor(sample(LETTERS[1:5], 10, replace=TRUE)),
                  state = factor(sample(state.name[1:10], 10, replace=TRUE)),
                  gender = factor(sample(c('M', 'F', 'N'), 10, replace=TRUE,
                                         prob=c(0.45, 0.45, 0.1))))
datmat <- data.matrix(dat)
initcenters <- datmat[sample(1:10, 3),]
distSimMatch(datmat, initcenters)
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
## as a distance matrix
as.dist(distSimMatch(datmat, datmat))
# Example 2: GDM2 distance
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2'))
# Example 3: Gower's distance
# Ex. 3.1: single variable type case with no missings:
flexclust::kcca(datmat, 3, kccaExtendedFamily('kGower'))
# Ex. 3.2: single variable type case with missing values:
nas <- sample(c(TRUE, FALSE), prod(dim(dat)), replace = TRUE,
              prob=c(0.1, 0.9)) |>
  matrix(nrow = nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower', cent=centMode))
# Ex. 3.3: mixed variable types (with or without missings):
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))