kccaExtendedFamily {flexord}    R Documentation

Extending K-Centroids Clustering to (Mixed-with-)Ordinal Data

Description

This wrapper creates objects of class "kccaFamily", which can be used with flexclust::kcca() to conduct K-centroids clustering using one of three predefined methods: 'kModes', 'kGDM2' or 'kGower'.

Usage

kccaExtendedFamily(which = c('kModes', 'kGDM2', 'kGower'),
                   cent = NULL,
                   preproc = NULL,
                   xrange = NULL,
                   xmethods = NULL,
                   trim = 0, groupFun = 'minSumClusters')

Arguments

which

One of 'kModes', 'kGDM2' or 'kGower', the three predefined methods for K-centroids clustering. For more information on each of them, see the Details section.

cent

Function for determining cluster centroids.

This argument is ignored for which='kModes', where centMode is used. For 'kGDM2' and 'kGower', cent=NULL defaults to a general purpose optimizer.

preproc

Preprocessing function applied to the data before clustering.

This argument is ignored for which='kGower'. In this case, the default preprocessing proposed by Gower (1971) and Kaufman & Rousseeuw (1990) is conducted. For 'kGDM2' and 'kModes', users can specify preprocessing steps here, though this is not recommended.

xrange

The range of the data in x. Options are:

  • "all": uses the same minimum and maximum value for each column of x by determining the whole range of values in the data object x.

  • "columnwise": uses different minimum and maximum values for each column of x by determining the columnwise ranges of values in the data object x.

  • A vector of c(min, max): specifies the same minimum and maximum value for each column of x.

  • A list of vectors list(c(min1, max1), c(min2, max2),...) with length ncol(x): specifies different minimum and maximum values for each column of x.

This argument is ignored for which='kModes'. xrange=NULL defaults to "all" for 'kGDM2', and to "columnwise" for 'kGower'.
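
The four forms of xrange can be illustrated in base R. The following is a minimal sketch under stated assumptions, not the package's internal code: the helper resolve_xrange() is hypothetical, and x is assumed to be a numeric matrix.

```r
# Hypothetical helper (not part of flexord): expands an xrange
# specification into a list holding one c(min, max) per column of x.
resolve_xrange <- function(x, xrange) {
  x <- as.matrix(x)
  if (identical(xrange, "all")) {
    # one global range, reused for every column
    rep(list(range(x, na.rm = TRUE)), ncol(x))
  } else if (identical(xrange, "columnwise")) {
    # a separate range for each column
    lapply(seq_len(ncol(x)), function(j) range(x[, j], na.rm = TRUE))
  } else if (is.numeric(xrange) && length(xrange) == 2) {
    # user-supplied c(min, max), shared by all columns
    rep(list(xrange), ncol(x))
  } else if (is.list(xrange) && length(xrange) == ncol(x)) {
    # user-supplied list of c(min, max), one entry per column
    xrange
  } else {
    stop("unsupported xrange specification")
  }
}

x <- cbind(a = c(1, 3, 5), b = c(2, 2, 4))
resolve_xrange(x, "all")          # every column: c(1, 5)
resolve_xrange(x, "columnwise")   # a: c(1, 5), b: c(2, 4)
resolve_xrange(x, c(1, 6))        # every column: c(1, 6)
```

With "all", a column whose observed values span only part of the global range is still scaled against the full range, which matters when all variables share one response scale.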

xmethods

An optional character vector of length ncol(x) that specifies the distance measure for each column of x. Currently only used for 'kGower'. For 'kGower', xmethods=NULL results in the use of default methods for each column of x. For more information on allowed input values, and default measures, see the Details section.
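
Since xmethods must have exactly one entry per column of x, it can help to build the vector programmatically. The sketch below uses base R only. The helper pick_method() and its mapping from column class to distance name are assumptions made for illustration, not the package defaults; the distance names themselves are the ones used in the Examples section.

```r
# Illustrative data with one column per scale level
dat <- data.frame(
  cont = c(1.2, 3.4, 5.6),
  bin  = c(TRUE, FALSE, TRUE),
  ord  = factor(c(1, 3, 2), levels = 1:3, ordered = TRUE),
  nom  = factor(c("a", "b", "a"))
)

# Hypothetical mapping from column class to a distance name
# (an assumption for illustration, not the package defaults)
pick_method <- function(col) {
  if (is.ordered(col)) {
    "distManhattan"
  } else if (is.factor(col)) {
    "distSimMatch"
  } else if (is.logical(col)) {
    "distJaccard"
  } else {
    "distEuclidean"
  }
}

# One entry per column, in column order
xmethods <- vapply(dat, pick_method, character(1), USE.NAMES = FALSE)
stopifnot(length(xmethods) == ncol(dat))
xmethods
```

The resulting vector could then be passed as kccaExtendedFamily('kGower', xmethods = xmethods).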

trim

Proportion of points trimmed in robust clustering, see flexclust::kccaFamily().

groupFun

A character string specifying the function for clustering. Default is 'minSumClusters', see flexclust::kccaFamily().

Details

Wrappers for defining families are obtained by setting which to 'kModes', 'kGDM2' or 'kGower'.

For 'kGDM2' and 'kGower', if cent=NULL, a general purpose optimizer with NA omission will be applied for centroid calculation.

Scale handling. In 'kModes', all variables are treated as unordered factors. In 'kGDM2', all variables are treated as ordered factors, with strict assumptions regarding their ordinality. 'kGower' is currently the only method designed to handle mixed-type data. For ordinal variables, the assumptions are more lax than with GDM2 distance.

NA handling. NA handling via omission and upweighting of non-missing variables is currently only implemented for 'kGower'. Within 'kModes', the omission of NA responses can be avoided by coding missing values as valid factor levels. For 'kGDM2', currently the only option is to omit missing values completely.
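
Both ideas can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: gower_avg() is a hypothetical helper, and averaging over the observed variables is one common reading of "omission and upweighting".

```r
# Part 1, for 'kModes': keep missing responses as their own category
nom <- factor(c("a", "b", NA, "a", NA), levels = c("a", "b"))
nom_na <- addNA(nom)                                # NA becomes a level
levels(nom_na)[is.na(levels(nom_na))] <- "missing"  # give it a label
table(nom_na)

# Part 2: omission with upweighting (hypothetical helper)
# d: per-variable distances between two observations, each already
#    scaled to [0, 1]; NA marks a variable missing in either observation
gower_avg <- function(d) {
  ok <- !is.na(d)
  if (!any(ok)) return(NA_real_)
  # averaging over the observed variables only gives each of them
  # a larger weight than it would have with complete data
  sum(d[ok]) / sum(ok)
}

gower_avg(c(0.2, NA, 0.4, 1))   # (0.2 + 0.4 + 1) / 3
```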

Value

An object of class "kccaFamily".

References

See Also

flexclust::kcca(), flexclust::stepFlexclust(), flexclust::bootFlexclust()

Examples

# Example data with continuous, binary, ordinal and nominal columns
set.seed(123)
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))

# Example 1: kModes
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))

# Example 2: kGDM2
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2',
                                                    xrange='columnwise'))

# Example 3: kGower
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))

# kGower with missing values: set roughly 10% of the cells to NA
nas <- sample(c(TRUE,FALSE), prod(dim(dat)), replace=TRUE, prob=c(0.1,0.9)) |>
   matrix(nrow=nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
                                           xrange='all'))

# kGower with user-specified distances: column 2 is a binary variable,
# but is treated as symmetric (distEuclidean instead of distJaccard)
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
                                           xmethods=c('distEuclidean',
                                                      'distEuclidean',
                                                      'distJaccard',
                                                      'distManhattan',
                                                      'distManhattan',
                                                      'distSimMatch')))


[Package flexord version 1.0.0 Index]