MAXLEN_est {SONO} | R Documentation |
Estimate MAXLEN
Description
Estimates the value of MAXLEN (the stopping criterion) before running the SONO algorithm. The estimate follows the approach described in Costa and Papatsouma (2025), which relies on the simultaneous confidence intervals for multinomial proportions of Sison and Glaz (1995).
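To make the idea of simultaneous multinomial confidence intervals concrete, the following minimal sketch computes Sison and Glaz intervals for a single three-level variable. It assumes the separate DescTools package is installed and uses its MultinomCI function; the counts are made-up illustrative data, and nothing here is claimed to reproduce how MAXLEN_est computes its thresholds internally.

```r
# Illustrative counts for one nominal variable with three levels (made-up data).
counts <- c(a = 52, b = 31, c = 17)

# Assumption: DescTools is available; the block is skipped otherwise.
if (requireNamespace("DescTools", quietly = TRUE)) {
  # Simultaneous 99% intervals, mirroring the default alpha = 0.01.
  ci <- DescTools::MultinomCI(counts, conf.level = 0.99, method = "sisonglaz")
  print(ci)
}
```

Each row of the result corresponds to one level and gives its estimated proportion with lower and upper simultaneous bounds.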
Usage
MAXLEN_est(data, probs, alpha = 0.01, frequent = FALSE)
Arguments
data |
Dataset; must be a data.frame consisting of factor variables only. |
probs |
List of probability vectors, one per variable. Each vector must contain as many probabilities as the corresponding variable has levels in the dataset. |
alpha |
Significance level for the simultaneous multinomial confidence intervals. These intervals determine the frequency thresholds for itemsets of different lengths, which are used to detect outliers among discrete features. Must be a positive real number no greater than 0.50. Larger values make the algorithm much more conservative. Default value is 0.01. |
frequent |
Logical; if TRUE, highly frequent itemsets are treated as outliers, otherwise highly infrequent itemsets are. Defaults to FALSE, treating highly infrequent itemsets as outlying. |
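The constraints listed above can be checked before calling the function. The sketch below is one possible way to do so in base R; check_maxlen_inputs is a hypothetical helper, not part of SONO, and the uniform probabilities are only an illustrative choice for probs.

```r
# Hypothetical helper (not part of SONO): checks the documented constraints.
check_maxlen_inputs <- function(data, probs, alpha = 0.01) {
  stopifnot(
    is.data.frame(data),
    all(vapply(data, is.factor, logical(1))),  # factor variables only
    is.list(probs),
    length(probs) == ncol(data),               # one vector per variable
    is.numeric(alpha), length(alpha) == 1,
    alpha > 0, alpha <= 0.5                    # positive, at most 0.50
  )
  # Each probability vector needs one entry per level of its variable.
  n_levels <- vapply(data, nlevels, integer(1))
  stopifnot(all(lengths(probs) == n_levels))
  invisible(TRUE)
}

# Small illustrative dataset with 2 and 3 levels.
dt <- data.frame(
  V1 = factor(c(1, 2, 1, 2, 1, 1)),
  V2 = factor(c(1, 2, 3, 1, 2, 3))
)

# Assumed choice: uniform probabilities over the levels of each variable.
probs <- unname(lapply(dt, function(x) rep(1 / nlevels(x), nlevels(x))))

check_maxlen_inputs(dt, probs, alpha = 0.01)
```

Building probs from the factor levels keeps each vector's length in step with the data, which is the requirement most easily violated when the list is typed by hand.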
Value
Estimated MAXLEN value.
References
Costa E, Papatsouma I (2025). “A novel framework for quantifying nominal outlyingness.” doi:10.48550/arXiv.2408.07463, arXiv:2408.07463, http://arxiv.org/abs/2408.07463.
Sison CP, Glaz J (1995). “Simultaneous Confidence Intervals and Sample Size Determination for Multinomial Proportions.” Journal of the American Statistical Association, 90(429), 366–369. ISSN 0162-1459, doi:10.2307/2291162.
Examples
# Simulate 100 observations of two nominal variables with 2 and 3 levels
dt <- data.frame(
  V1 = factor(sample(1:2, 100, replace = TRUE, prob = c(0.5, 0.5))),
  V2 = factor(sample(1:3, 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
)
# Estimate MAXLEN, supplying one probability vector per variable
MAXLEN_est(data = dt, probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
           alpha = 0.01, frequent = FALSE)