distGDM2 {flexord}R Documentation

Distance Functions for K-Centroids Clustering of (Ordinal) Categorical/Mixed Data

Description

Functions to calculate the distance between a matrix x and a matrix c, which can be used for K-centroids clustering via flexclust::kcca().

distSimMatch implements Simple Matching Distance (most frequently used for categorical, or symmetric binary data) for K-centroids clustering.

distGower implements Gower's Distance after Gower (1971) and Kaufman & Rousseeuw (1990) for mixed-type data with missings for K-centroids clustering.

distGDM2 implements GDM2 distance for ordinal data introduced by Walesiak et al. (1993) and adapted to K-centroids clustering by Ernst et al. (2025).

These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily(). However, they can also easily be used to obtain a distance matrix of x, see Examples.

Usage

distGDM2(x, centers, genDist, xrange = NULL)

distGower(x, centers, genDist)

distSimMatch(x, centers)

Arguments

x

A numeric matrix or data frame.

centers

A numeric matrix with ncol(centers) equal to ncol(x) and nrow(centers) smaller or equal to row(x).

genDist

Additional information on x required for distance calculation. Filled automatically if used within flexclust::kcca().

  • For distGower: A character vector of variable specific distances to be used with length equal to ncol(x). The following options are possible:

    • distEuclidean: Euclidean distance between the scaled variables.

    • distManhattan: absolute distance between the scaled variables.

    • distJaccard: counts of zero if both binary variables are equal to 1, and 1 otherwise.

    • distSimMatch: Simple Matching Distance, i.e. the number of agreements between variables.

  • For distGDM2: Function creating a distance function that will be primed on x.

  • For distSimMatch: not used.

xrange

Range specification for the variables. Currently only used for distGDM2 (as distGower expects x to be already scaled). Possible values are:

  • NULL (default): defaults to "all".

  • "all": uses the same minimum and maximum value for each column of x by determining the whole range of values in the data object x.

  • "columnwise": uses different minimum and maximum values for each column of x by determining the columnwise ranges of values in the data object x.

  • A vector of c(min, max): specifies the same minimum and maximum value for each column of x.

  • A list of vectors list(c(min1, max1), c(min2, max2),...) with length ncol(x): specifies different minimum and maximum values for each column of x.

Details

The distances functions presented here can also be used in clustering algorithms that rely on distance matrices (such as hierarchical clustering and PAM), if applied accordingly, see Examples.

Value

A matrix of dimensions c(nrow(x), nrow(centers)) that contains the distance between each row of x and each row of centers.

References

See Also

flexclust::kcca(), klaR::kmodes(), cluster::daisy(), clusterSim::dist.GDM()

Examples

# Example 1: Simple Matching Distance
set.seed(123)
dat <- data.frame(question1 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question2 = factor(sample(LETTERS[1:6], 10, replace=TRUE)),
                  question3 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question4 = factor(sample(LETTERS[1:5], 10, replace=TRUE)),
                  state = factor(sample(state.name[1:10], 10, replace=TRUE)),
                  gender = factor(sample(c('M', 'F', 'N'), 10, replace=TRUE,
                                         prob=c(0.45, 0.45, 0.1))))
datmat <- data.matrix(dat)
initcenters <- datmat[sample(1:10, 3),]
distSimMatch(datmat, initcenters)
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
## as a distance matrix
as.dist(distSimMatch(datmat, datmat))

# Example 2: GDM2 distance
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2'))

# Example 3: Gower's distance
# Ex. 3.1: single variable type case with no missings:
flexclust::kcca(datmat, 3, kccaExtendedFamily('kGower'))

# Ex. 3.2: single variable type case with missing values:
nas <- sample(c(TRUE, FALSE), prod(dim(dat)), replace = TRUE,
   prob=c(0.1, 0.9)) |> 
   matrix(nrow = nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower', cent=centMode))

#Ex. 3.3: mixed variable types (with or without missings): 
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),                     
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))


[Package flexord version 1.0.0 Index]