disp_DA {tlda}R Documentation

Calculate the dispersion measure D_{A}

Description

This function calculates the dispersion measure D_{A}. It offers two computational procedures, the basic version as well as a computational shortcut. It allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also provides the option of calculating frequency-adjusted dispersion scores.

Usage

disp_DA(
  subfreq,
  partsize,
  procedure = "basic",
  directionality = "conventional",
  freq_adjust = FALSE,
  freq_adjust_method = "even",
  unit_interval = TRUE,
  digits = NULL,
  verbose = TRUE,
  print_score = TRUE,
  suppress_warning = FALSE
)

Arguments

subfreq

A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part

partsize

A numeric vector specifying the size of the corpus parts

procedure

Character string indicating which procedure to use for the calculation of D_{A}. See details below. Possible values are 'basic' (default), 'shortcut'.

directionality

Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries"

freq_adjust

Logical. Whether dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE

freq_adjust_method

Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive"

unit_interval

Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE

digits

Rounding: Integer value specifying the number of decimal places to retain (default: no rounding)

verbose

Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE

print_score

Logical. Whether the dispersion score should be printed to the console; default is TRUE

suppress_warning

Logical. Whether warning messages should be suppressed; default is FALSE

Details

The function calculates the dispersion measure D_{A} based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).

In the formulas given below, the following notation is used:

The value "basic" implements the basic computational procedure (see Wilcox 1973: 329, 343; Burch et al. 2017: 194; Egbert et al. 2020: 98). The basic version can be applied to absolute frequencies and normalized frequencies. For dispersion analysis, absolute frequencies only make sense if the corpus parts are identical in size. Wilcox (1973: 343, 'MDA', column 1 and 2) gives both variants of the basic version. The first use of D_{A} for corpus-linguistic dispersion analysis appears in Burch et al. (2017: 194), a paper that deals with equal-sized parts and therefore uses the variant for absolute frequencies. Egbert et al. (2020: 98) rely on the variant using normalized frequencies. Since this variant of the basic version of D_{A} works irrespective of the length of the corpus parts (equal or variable), we will only give this version of the formula. Note that while the formula represents conventional scaling (0 = uneven, 1 = even), in the current function the directionality is controlled separately using the argument directionality.

1 - \frac{\sum_{i = 1}^{k-1} \sum_{j = i+1}^{k} |R_i - R_j|}{\frac{k(k-1)}{2}} \times \frac{1}{2\frac{\sum_i^k R_i}{k}} (Egbert et al. 2020: 98)

The function uses a different version of the same formula, which relies on the proportional r_i values instead of the normalized subfrequencies R_i. This version yields the identical result; the r_i quantities are also the key to using the computational shortcut given in Wilcox (1973: 343). This is the basic formula for D_{A} using r_i instead of R_i values:

1 - \frac{\sum_{i = 1}^{k-1} \sum_{j = i+1}^{k} |r_i - r_j|}{k-1} (Wilcox 1973: 343; see also Soenning 2022)

The value "shortcut" implements the computational shortcut given in Wilcox (1973: 343). Critically, the proportional quantities r_i must first be sorted in decreasing order. Only after this rearrangement can the shortcut version be applied. We will refer to this rearranged version of r_i as r_i^{sorted}:

\frac{2\left(\sum_{i = 1}^{k} (i \times r_i^{sorted}) - 1\right)}{k-1} (Wilcox 1973: 343)

The value "shortcut_mod" adds a minor modification to the computational shortcut to ensure D_{A} does not exceed 1 (on the conventional dispersion scale):

\frac{2\left(\sum_{i = 1}^{k} (i \times r_i^{sorted}) - 1\right)}{k-1} \times \frac{k - 1}{k}

Value

A numeric value

Author(s)

Lukas Soenning

References

Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066

Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x

Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri

Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri

Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115

Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467

Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.

Soenning, Lukas. 2022. Evaluation of text-level measures of lexical dispersion: Robustness and consistency. PsyArXiv preprint. https://osf.io/preprints/psyarxiv/h9mvs/

Wilcox, Allen R. 1973. Indices of qualitative variation and political measurement. The Western Political Quarterly 26 (2). 325–343. doi:10.2307/446831

Examples

disp_DA(
  subfreq = c(0,0,1,2,5), 
  partsize = rep(1000, 5),
  procedure = "basic",
  directionality = "conventional",
  freq_adjust = FALSE)


[Package tlda version 0.1.0 Index]