find_min_disp_tdm {tlda}    R Documentation

Find the minimally dispersed distribution of each item in a term-document matrix

Description

This function takes as input a term-document matrix and returns, for each item (i.e. row), the (hypothetical) distribution of subfrequencies that represents the smallest possible level of dispersion for the item across the corpus parts. This distribution is required for the min-max transformation proposed by Gries (2022: 184–191; 2024: 196–208) to obtain frequency-adjusted dispersion scores.

Usage

find_min_disp_tdm(
  tdm,
  row_partsize = "first",
  freq_adjust_method = "even"
)

Arguments

tdm

A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix)

row_partsize

Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last"

freq_adjust_method

Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive"
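
The input layout described above can be sketched with a toy matrix. The item names, part names, and counts below are made up for illustration and are not taken from the package data; the call itself uses only the arguments documented here and runs only if tlda is installed.

```r
# Toy term-document matrix (made-up counts): the first row holds the size
# of each corpus part, the remaining rows hold the subfrequencies of items.
toy_tdm <- rbind(
  part_size = c(100L, 50L, 20L),
  item_a    = c(4L, 2L, 0L),
  item_b    = c(0L, 0L, 1L),   # occurs only once in the corpus (a hapax)
  item_c    = c(10L, 5L, 2L)
)
colnames(toy_tdm) <- c("part_1", "part_2", "part_3")

# Run the function only if the tlda package is available.
if (requireNamespace("tlda", quietly = TRUE)) {
  min_disp <- tlda::find_min_disp_tdm(
    tdm = toy_tdm,
    row_partsize = "first",
    freq_adjust_method = "even"
  )
  print(min_disp)
}
```

If the part sizes were stored in the last row instead, row_partsize = "last" would be the matching setting.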

Details

For each item in the term-document matrix, the function creates a hypothetical distribution of the item's total number of occurrences (i.e. the sum of its subfrequencies) across corpus parts. The argument freq_adjust_method determines which distributional feature defines the lowest possible level of dispersion: pervasiveness ("pervasive") or evenness ("even"). With "pervasive", the occurrences are allocated to as few corpus parts as possible; with "even", they are assigned to the smallest corpus part(s). For details and explanations, see vignette("frequency-adjustment").

The dispersion of an item that occurs only once in the corpus (a hapax) cannot be sensibly measured or manipulated. Such items are therefore disregarded, and the function returns their observed subfrequencies.

The function reuses code segments from the function most.uneven.distr() in Gries's (2025) 'KLD4C' package.
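
The two allocation strategies can be illustrated with a small base R sketch. This is not the package's implementation: allocate_greedy() is a hypothetical helper, and two assumptions go beyond the text above, namely that a corpus part cannot hold more occurrences than its size, and that "as few corpus parts as possible" is achieved by filling the largest parts first. The part sizes and the item frequency are made up.

```r
# Hypothetical helper (not part of tlda): place n_occ occurrences into
# corpus parts in the given order, never exceeding a part's size (assumption).
allocate_greedy <- function(n_occ, part_sizes, fill_order) {
  subfreq <- integer(length(part_sizes))
  remaining <- n_occ
  for (i in fill_order) {
    if (remaining == 0L) break
    subfreq[i] <- min(remaining, part_sizes[i])
    remaining <- remaining - subfreq[i]
  }
  subfreq
}

part_sizes <- c(5L, 100L, 20L, 50L)  # made-up corpus part sizes
n_occ <- 30L                         # made-up total frequency of one item

# "even": assign occurrences to the smallest corpus part(s) first
min_even <- allocate_greedy(n_occ, part_sizes, order(part_sizes))
print(min_even)       # 5 0 20 5

# "pervasive": use as few corpus parts as possible
# (assumed here: fill the largest parts first)
min_pervasive <- allocate_greedy(
  n_occ, part_sizes, order(part_sizes, decreasing = TRUE)
)
print(min_pervasive)  # 0 30 0 0
```

In both cases the hypothetical distribution keeps the item's total frequency unchanged; only its spread across the corpus parts differs.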

Value

A matrix of integers with one row per item and one column per corpus part

Author(s)

Lukas Soenning

References

Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri

Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115

Gries, Stefan Th. 2025. KLD4C: Gries 2024: Tupleization of corpus linguistics. R package version 1.01. (available from https://www.stgries.info/research/kld4c/kld4c.html)

See Also

find_min_disp()

Examples

find_min_disp_tdm(
  tdm = biber150_spokenBNC2014[1:10,],
  row_partsize = "first",
  freq_adjust_method = "even")


[Package tlda version 0.1.0 Index]