kccaExtendedFamily {flexord} | R Documentation |
Extending K-Centroids Clustering to (Mixed-with-)Ordinal Data
Description
This wrapper creates objects of class "kccaFamily", which can be used with flexclust::kcca() to conduct K-centroids clustering using the following methods:
- kModes (after Weihs et al., 2005)
- kGower (Gower's distance after Kaufman & Rousseeuw, 1990, and a user-specified centroid)
- kGDM2 (GDM2 distance after Walesiak, 1993, and a user-specified centroid)
Usage
kccaExtendedFamily(which = c('kModes', 'kGDM2', 'kGower'),
                   cent = NULL,
                   preproc = NULL,
                   xrange = NULL,
                   xmethods = NULL,
                   trim = 0, groupFun = 'minSumClusters')
Arguments
which |
One of `'kModes'`, `'kGDM2'`, or `'kGower'`. |
cent |
Function for determining cluster centroids. This argument is ignored for `which='kModes'`, where `centMode` is used instead. For `'kGDM2'` and `'kGower'`, `cent=NULL` defaults to a general purpose optimizer. |
preproc |
Preprocessing function applied to the data before clustering. This argument is ignored for `which='kGower'`. In this case, the default preprocessing proposed by Gower (1971) and Kaufman & Rousseeuw (1990) is conducted. For `'kGDM2'` and `'kModes'`, users can specify preprocessing steps here, though this is not recommended. |
xrange |
The range of the data in `x`.
This argument is ignored for |
xmethods |
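An optional character vector of length `p` (the number of variables) that specifies the distance for each column when `which='kGower'`; see Details. |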
trim |
Proportion of points trimmed in robust clustering, see `flexclust::kccaFamily()`. |
groupFun |
A character string specifying the function for clustering. Default is `'minSumClusters'`, see `flexclust::kccaFamily()`. |
Details
Wrappers for defining families are obtained by specifying which using:
- which='kModes' creates an object for kModes clustering, i.e., K-centroids clustering using Simple Matching Distance (counts of disagreements) and modes as centroids. Argument cent is ignored for this method.
- which='kGower' creates an object for performing clustering using Gower's method as described in Kaufman & Rousseeuw (1990):
  Numeric and/or ordinal variables are scaled by \frac{\mathbf{x}-\min{\mathbf{x}}}{\max{\mathbf{x}}-\min{\mathbf{x}}}. Note that for ordinal variables the internal coding with values from 1 up to their maximum level is used.
  Distances are calculated for each column (Euclidean distance, distEuclidean, is recommended for numeric, Manhattan distance, distManhattan, for ordinal, Simple Matching Distance, distSimMatch, for categorical, and Jaccard distance, distJaccard, for asymmetric binary variables), and they are combined as:
  d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij}, x_{kj})}{\sum_{j=1}^p \delta_{ikj}}
  where p is the number of variables and the weight \delta_{ikj} is 1 if both values x_{ij} and x_{kj} are not missing and, in the case of asymmetric binary variables, at least one of them is not 0; otherwise it is 0.
  The columnwise distances used can be influenced in two ways. The first is to pass a character vector of length p to xmethods that specifies the distance for each column. Options are: distEuclidean, distManhattan, distJaccard, and distSimMatch. The second is to not specify any methods within kccaExtendedFamily, but rather to pass a "data.frame" as argument x in kcca, where the class of each column is used to infer the distance measure: distEuclidean is used on numeric and integer columns, distManhattan on columns that are coded as ordered factors, distSimMatch is the default for categorically coded columns, and distJaccard is the default for binary coded columns.
  For this method, if cent=NULL, a general purpose optimizer with NA omission is applied for centroid calculation.
- which='kGDM2' creates an object for clustering using the GDM2 distance for ordinal variables. The GDM2 distance was first introduced by Walesiak (1993), and adapted in Ernst et al. (2025) for use as the distance measure within flexclust::kcca().
  This distance respects the ordinal nature of a variable by conducting only relational operations to compare values, such as \leq, \geq and =. By obtaining the relative frequencies and empirical cumulative distributions of x, we allow for comparison of two arbitrary values, and thus are able to conduct K-centroids clustering. For more details, see Ernst et al. (2025).
  Also for this method, if cent=NULL, a general purpose optimizer with NA omission will be applied for centroid calculation.
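The weighted combination used by 'kGower' can be illustrated with a small base R sketch for a single pair of observations. This is not the flexord implementation: the helper name gower_pair, the type labels ("num", "ord", "nom", "asym"), and the lo/hi range vectors are hypothetical and chosen only for illustration. The sketch relies on the fact that, for a single column, Euclidean and Manhattan distance both reduce to the absolute difference of the range-scaled values.

```r
# Illustrative sketch only; NOT the flexord implementation.
# gower_pair() is a hypothetical helper: it combines per-variable distances
# between two observations as a weighted average with weights delta_ikj.
gower_pair <- function(xi, xk, type, lo, hi) {
  p <- length(xi)
  d <- numeric(p)      # per-variable distances d(x_ij, x_kj)
  delta <- numeric(p)  # weights delta_ikj, 0 unless set below
  for (j in seq_len(p)) {
    # a missing value in either observation: weight stays 0
    if (is.na(xi[j]) || is.na(xk[j])) next
    # asymmetric binary variable with both values 0: weight stays 0
    if (type[j] == "asym" && xi[j] == 0 && xk[j] == 0) next
    delta[j] <- 1
    if (type[j] %in% c("num", "ord")) {
      # absolute difference after scaling by (x - min) / (max - min)
      d[j] <- abs(xi[j] - xk[j]) / (hi[j] - lo[j])
    } else {
      # simple matching (nominal) or Jaccard on one binary column
      d[j] <- as.numeric(xi[j] != xk[j])
    }
  }
  sum(delta * d) / sum(delta)
}

type <- c("num", "ord", "nom", "asym")
lo <- c(0, 1, NA, NA)   # assumed minima (only used for num/ord)
hi <- c(10, 5, NA, NA)  # assumed maxima (only used for num/ord)
xi <- c(2, 1, 1, 0)
xk <- c(7, 3, 2, 0)

d_full <- gower_pair(xi, xk, type, lo, hi)

# With a missing value in the first variable, the remaining
# variables are upweighted because the denominator shrinks.
xk_na <- c(NA, 3, 2, 0)
d_na <- gower_pair(xi, xk_na, type, lo, hi)
```

In this sketch the fourth variable never contributes, because both observations are 0 on an asymmetric binary variable. If every weight is 0, the result is NaN; how flexord treats that case is not covered here.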
Scale handling.
In 'kModes', all variables are treated as unordered factors.
In 'kGDM2', all variables are treated as ordered factors, with strict assumptions regarding their ordinality.
'kGower' is currently the only method designed to handle mixed-type data. For ordinal variables, the assumptions are more lax than with GDM2 distance.
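Because 'kGower' infers the columnwise distance from the class of each data.frame column when xmethods is not given, it can help to check how a data set is coded before clustering. The following base R sketch mirrors the defaults described in Details. The helper infer_method is hypothetical and not part of flexord, and it assumes that binary columns are stored as logical vectors.

```r
# Hypothetical helper mirroring the documented defaults; NOT part of flexord.
infer_method <- function(col) {
  if (is.ordered(col)) {
    "distManhattan"            # ordered factors
  } else if (is.factor(col) || is.character(col)) {
    "distSimMatch"             # categorically coded columns
  } else if (is.logical(col)) {
    "distJaccard"              # assumption: binary columns are logical
  } else if (is.numeric(col)) {
    "distEuclidean"            # numeric and integer columns
  } else {
    NA_character_
  }
}

dat_types <- data.frame(
  cont = c(1.5, 2.0, 3.7),
  bin  = c(TRUE, FALSE, TRUE),
  ord  = factor(c(1, 3, 2), levels = 1:3, ordered = TRUE),
  nom  = factor(c("a", "b", "a"))
)

# One distance name per column, in column order
methods <- vapply(dat_types, infer_method, character(1))
print(methods)
```

A vector like methods could then be adjusted by hand and passed as xmethods, for example to replace the Jaccard default for a binary column that is actually symmetric.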
NA handling.
NA handling via omission and upweighting non-missing variables is currently only implemented for 'kGower'. Within 'kModes', the omission of NA responses can be avoided by coding missings as valid factor levels. For 'kGDM2', currently the only option is to omit missing values completely.
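Coding missings as a valid factor level for 'kModes' can be done in base R before calling kcca. The sketch below uses addNA() on a single nominal variable; the variable name resp is made up for illustration.

```r
# A nominal variable with one missing response
resp <- factor(c("a", "b", NA, "a"))

# addNA() turns NA into an explicit factor level
resp_na <- addNA(resp)

levels(resp_na)      # the level set now includes NA
codes <- as.integer(resp_na)
anyNA(codes)         # FALSE: every observation has a valid level code
```

Whether treating a missing response as its own category is meaningful depends on the data; it makes two missing responses count as an agreement under Simple Matching Distance.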
Value
An object of class "kccaFamily".
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Gower, JC (1971). A General Coefficient for Similarity and Some of Its Properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
Kaufman, L, Rousseeuw, P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. doi:10.1002/9780470316801
Leisch, F (2006). A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51(2), 526-539. doi:10.1016/j.csda.2005.10.006
Walesiak, M (1993). Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej, 44-46.
Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005). klaR Analyzing German Business Cycles. In: Data Analysis and Decision Support, Springer: Berlin. 335-343. doi:10.1007/3-540-28397-8_36
See Also
flexclust::kcca(), flexclust::stepFlexclust(), flexclust::bootFlexclust()
Examples
# Example 1: kModes
library(flexord)  # provides kccaExtendedFamily()
set.seed(123)
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
# Example 2: kGDM2
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2',
                                                    xrange='columnwise'))
# Example 3: kGower
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))
# Set roughly 10% of the cells to NA to show NA handling
nas <- sample(c(TRUE,FALSE), prod(dim(dat)), replace=TRUE, prob=c(0.1,0.9)) |>
  matrix(nrow=nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
                                           xrange='all'))
# Setting xmethods explicitly covers the case where column 2 is a binary
# variable but is symmetric, so distJaccard should not be used for it
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
                                           xmethods=c('distEuclidean',
                                                      'distEuclidean',
                                                      'distJaccard',
                                                      'distManhattan',
                                                      'distManhattan',
                                                      'distSimMatch')))