disp_DKL {tlda}    R Documentation

Calculate the dispersion measure D_{KL}

Description

This function calculates the dispersion measure D_{KL}, which is based on the Kullback-Leibler divergence (Gries 2020, 2021, 2024). It offers three options for standardization to the unit interval [0,1] (see Gries 2024: 90-92) and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.

Usage

disp_DKL(
  subfreq,
  partsize,
  directionality = "conventional",
  standardization = "o2p",
  freq_adjust = FALSE,
  freq_adjust_method = "even",
  unit_interval = TRUE,
  digits = NULL,
  verbose = TRUE,
  print_score = TRUE,
  suppress_warning = FALSE
)

Arguments

subfreq

A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part

partsize

A numeric vector specifying the size of the corpus parts

directionality

Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries"

standardization

Character string indicating which standardization method to use. See details below. Possible values are "o2p" (default), "base_e", and "base_2".

freq_adjust

Logical. Whether dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE

freq_adjust_method

Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive"

unit_interval

Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE

digits

Rounding: Integer value specifying the number of decimal places to retain (default: no rounding)

verbose

Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE

print_score

Logical. Whether the dispersion score should be printed to the console; default is TRUE

suppress_warning

Logical. Whether warning messages should be suppressed; default is FALSE

Details

The function calculates the dispersion measure D_{KL} based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).

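In the formulas given below, the following notation is used:

k the number of corpus parts

t_i the proportional subfrequency in part i, i.e. the subfrequency in part i divided by the total number of occurrences of the item in the corpus

w_i the proportional size of corpus part i, i.e. the size of part i divided by the total size of the corpus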

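The first step is to calculate the Kullback-Leibler divergence based on the proportional subfrequencies (t_i) and the proportional sizes of the corpus parts (w_i):

KLD = \sum_{i=1}^{k} t_i \log_2{\frac{t_i}{w_i}} with \log_2(0) = 0, so that parts in which the item does not occur (t_i = 0) contribute nothing to the sum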
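The KLD step can be sketched in a few lines of base R. This is an illustrative sketch, not the package's implementation: the helper name `kld_sketch` is hypothetical, and it assumes that all part sizes are positive and that the item occurs at least once in the corpus.

```r
# Hypothetical helper (not part of tlda): KLD from subfrequencies and part sizes
kld_sketch <- function(subfreq, partsize) {
  stopifnot(length(subfreq) == length(partsize), sum(subfreq) > 0)
  t_i <- subfreq / sum(subfreq)    # proportional subfrequencies
  w_i <- partsize / sum(partsize)  # proportional part sizes
  # Parts with t_i = 0 contribute nothing, following the log2(0) = 0 convention
  terms <- ifelse(t_i > 0, t_i * log2(t_i / w_i), 0)
  sum(terms)
}

# Uneven distribution across five equally sized parts
kld_example <- kld_sketch(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5))

# Perfectly even distribution: t_i equals w_i in every part
kld_even <- kld_sketch(subfreq = c(2, 2, 2, 2, 2), partsize = rep(1000, 5))
```

When the item is spread exactly in proportion to the part sizes, every log ratio is zero and so is the KLD; the more the subfrequencies depart from the part sizes, the larger the KLD becomes.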

This KLD score is then standardized (i.e. transformed) to the conventional unit interval [0,1]. Three options are discussed in Gries (2024: 90-92). The following formulas represent Gries scaling (0 = even, 1 = uneven):

(1) e^{-KLD} (Gries 2021: 20), represented by the value "base_e"

(2) 2^{-KLD} (Gries 2024: 90), represented by the value "base_2"

(3) \frac{KLD}{1+KLD} (Gries 2024: 90), represented by the value "o2p" (default)
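To compare the three options, the sketch below applies the formulas exactly as written above to a few KLD values. The helper name `standardize_kld_sketch` is hypothetical and not part of tlda; the sketch covers only the transformation itself and does not attempt to reproduce the function's handling of directionality or frequency adjustment.

```r
# Hypothetical helper (not part of tlda): applies the three formulas as written
standardize_kld_sketch <- function(kld,
                                   standardization = c("o2p", "base_e", "base_2")) {
  standardization <- match.arg(standardization)
  switch(standardization,
    base_e = exp(-kld),       # formula (1)
    base_2 = 2^(-kld),        # formula (2)
    o2p    = kld / (1 + kld)  # formula (3), odds-to-probability
  )
}

# A few non-negative KLD values, from a perfectly even distribution (0) upwards
kld_values <- c(0, 0.5, 1, 5)

# One column per standardization method, one row per KLD value
scores <- sapply(c("base_e", "base_2", "o2p"), standardize_kld_sketch,
                 kld = kld_values)
```

All three transformations map non-negative KLD values into [0,1]. Taken as written, a KLD of 0 maps to 1 under the two exponential forms and to 0 under the odds-to-probability form; since the directionality argument determines the orientation the function finally reports, the verbose output is worth checking when comparing scores across methods.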

Value

A numeric value: the D_{KL} dispersion score

Author(s)

Lukas Soenning

References

Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri

Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri

Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115

Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467

Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.

Examples

# Item occurring 0, 0, 1, 2 and 5 times in five corpus parts of 1,000 tokens each
disp_DKL(
  subfreq = c(0, 0, 1, 2, 5),
  partsize = rep(1000, 5),
  standardization = "base_e",
  directionality = "conventional")


[Package tlda version 0.1.0 Index]