MAXLEN_est {SONO}    R Documentation

Estimate MAXLEN

Description

Estimates the value of MAXLEN (the stopping criterion) before running the SONO algorithm. The estimate follows the approach described in Costa and Papatsouma (2025), which relies on the simultaneous confidence intervals for Multinomial proportions of Sison and Glaz (1995).

Usage

MAXLEN_est(data, probs, alpha = 0.01, frequent = FALSE)

Arguments

data

Dataset; needs to be of class data.frame and consist of factor variables only.
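
A quick pre-check of this requirement might look like the sketch below. The helper name check_factor_df is hypothetical and not part of SONO; it uses base R only.

```r
# Hypothetical helper (not part of SONO): returns TRUE only if `data` is a
# data.frame whose columns are all factors.
check_factor_df <- function(data) {
  is.data.frame(data) && all(vapply(data, is.factor, logical(1)))
}

# One valid input and one with a numeric column, for comparison.
ok  <- data.frame(V1 = factor(c("a", "b")), V2 = factor(c("x", "y")))
bad <- data.frame(V1 = factor(c("a", "b")), V2 = c(1, 2))

check_factor_df(ok)
check_factor_df(bad)
```

Numeric or character columns can be converted beforehand with as.factor(), as done in the Examples section.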

probs

List of probability vectors, one per variable. Each vector must contain as many probabilities as its variable has levels in the dataset.
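
As a minimal sketch, a probs list can be built directly from the data and checked for shape before it is passed on. The uniform probabilities are only an illustrative assumption, and the check that each vector sums to 1 reflects the usual meaning of a probability vector rather than a documented requirement of MAXLEN_est.

```r
# Toy data with two factor variables (2 and 3 levels).
dt <- data.frame(
  V1 = factor(c("a", "b", "a", "b")),
  V2 = factor(c("x", "y", "z", "x"))
)

# Assumed model for illustration: equal probability for every level.
probs <- lapply(dt, function(v) rep(1 / nlevels(v), nlevels(v)))

# Drop the names so the list has the same plain form as in the Examples.
probs <- unname(probs)

# Sanity checks: one vector per variable, one probability per level,
# and each vector sums to 1 (up to floating-point error).
stopifnot(
  length(probs) == ncol(dt),
  all(lengths(probs) == vapply(dt, nlevels, integer(1))),
  all(abs(vapply(probs, sum, numeric(1)) - 1) < 1e-8)
)
```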

alpha

Significance level for the simultaneous Multinomial confidence intervals. These intervals determine the frequency thresholds for itemsets of different lengths, which are used to detect outliers among discrete features. Must be a positive real number no greater than 0.50. A greater value leads to a much more conservative algorithm. Default value is 0.01.
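
To get a feel for how alpha affects Sison and Glaz intervals, the sketch below computes them for made-up counts at two significance levels. It assumes the separate DescTools package and its MultinomCI function; this is only an illustration and not necessarily how SONO computes the intervals internally. The helper name sg_ci and the counts are hypothetical.

```r
# Made-up observed counts for one variable with three levels.
counts <- c(50, 30, 20)

# Hypothetical wrapper: Sison-Glaz simultaneous intervals at level 1 - alpha.
sg_ci <- function(counts, alpha) {
  DescTools::MultinomCI(counts, conf.level = 1 - alpha, method = "sisonglaz")
}

# DescTools is an assumption here; skip quietly if it is not installed.
if (requireNamespace("DescTools", quietly = TRUE)) {
  ci_01 <- sg_ci(counts, alpha = 0.01)
  ci_10 <- sg_ci(counts, alpha = 0.10)
  print(ci_01)  # one row per level: estimate, lower bound, upper bound
  print(ci_10)
}
```

Comparing the two printed matrices shows how the interval bounds, and hence any thresholds derived from them, shift with alpha.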

frequent

Logical determining whether highly frequent or highly infrequent itemsets are considered as outliers. Defaults to FALSE, treating highly infrequent itemsets as outlying.

Value

Estimated MAXLEN value.

References

Costa E, Papatsouma I (2025). “A novel framework for quantifying nominal outlyingness.” doi:10.48550/arXiv.2408.07463, arXiv:2408.07463, http://arxiv.org/abs/2408.07463.

Sison CP, Glaz J (1995). “Simultaneous Confidence Intervals and Sample Size Determination for Multinomial Proportions.” Journal of the American Statistical Association, 90(429), 366–369. ISSN 0162-1459, doi:10.2307/2291162.

Examples

# Simulate two categorical variables with 2 and 3 levels
set.seed(1)  # for reproducibility
dt <- data.frame(
  V1 = as.factor(sample(1:2, 100, replace = TRUE, prob = c(0.5, 0.5))),
  V2 = as.factor(sample(1:3, 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
)
# Estimate MAXLEN under the specified level probabilities
MAXLEN_est(data = dt, probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
           alpha = 0.01, frequent = FALSE)


[Package SONO version 1.2 Index]