disp_R_tdm {tlda}    R Documentation

Calculate the dispersion measure 'range' for a term-document matrix

Description

This function calculates the dispersion measure 'range'. It offers three different versions: 'absolute range' (the number of corpus parts containing at least one occurrence of the item), 'relative range' (the proportion of corpus parts containing at least one occurrence of the item), and 'relative range with size' (relative range that takes into account the size of the corpus parts). The function also offers the option of calculating frequency-adjusted dispersion scores.

Usage

disp_R_tdm(
  tdm,
  row_partsize = "first",
  type = "relative",
  freq_adjust = FALSE,
  freq_adjust_method = "pervasive",
  unit_interval = TRUE,
  digits = NULL,
  verbose = TRUE,
  print_scores = TRUE
)

Arguments

tdm

A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix)

row_partsize

Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last"

type

Character string indicating which type of range to calculate. See details below. Possible values are "relative" (default), "absolute", "relative_withsize"

freq_adjust

Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE

freq_adjust_method

Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "pervasive" (default) and "even"

unit_interval

Logical. Whether frequency-adjusted scores that fall outside the unit interval should be replaced by 0 (if below 0) or 1 (if above 1); default is TRUE

digits

Rounding: Integer value specifying the number of decimal places to retain (default: no rounding)

verbose

Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE

print_scores

Logical. Whether the dispersion scores should be printed to the console; default is TRUE
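
The post-processing described for unit_interval and digits can be pictured with a few lines of base R. This is only a sketch under the assumption that out-of-range scores are clamped to the nearest limit and then rounded; the scores below are made up for illustration and are not output of disp_R_tdm().

```r
# Hypothetical frequency-adjusted scores; two of them fall outside [0, 1]
scores <- c(item_a = -0.0437, item_b = 0.51234, item_c = 1.0821)

# unit_interval = TRUE (assumed behaviour): values below 0 become 0, values above 1 become 1
clamped <- pmin(pmax(scores, 0), 1)

# digits = 2: keep two decimal places
rounded <- round(clamped, digits = 2)

print(rounded)
```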

Details

This function takes a term-document matrix as input and returns the dispersion measure 'range' for each item (i.e. each row). The rows of the matrix represent the items and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this row can be added as a column margin, i.e. the sum of the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, the part sizes cannot be derived from the matrix and must be provided as a separate row.
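
For a complete term-document matrix, the required row of part sizes can be built from the column sums. The following is a minimal sketch in base R; the matrix counts and its values are hypothetical and merely stand in for a real term-document matrix.

```r
# Hypothetical counts for a complete term-document matrix
# (rows = items, columns = corpus parts)
counts <- matrix(
  c(3, 0, 2,
    1, 4, 0,
    0, 6, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("item_a", "item_b", "item_c"),
                  c("part1", "part2", "part3"))
)

# Column margin: total number of tokens in each corpus part
part_size <- colSums(counts)

# Put the sizes in the first row, which corresponds to row_partsize = "first"
tdm <- rbind(part_size = part_size, counts)

print(tdm)
```

If the matrix holds only a selection of items, colSums() would underestimate the part sizes, so the sizes would have to come from an external source instead.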

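Three different types of range measures can be calculated:

Absolute range (type = "absolute"): the number of corpus parts containing at least one occurrence of the item

Relative range (type = "relative"): the proportion of corpus parts containing at least one occurrence of the item

Relative range with size (type = "relative_withsize"): relative range that takes into account the size of the corpus parts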
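
The range variants named in the Description can be worked through by hand on a toy matrix. The sketch below uses base R only and does not call disp_R_tdm(); the data are invented. The size-weighted variant is an assumption: this page does not give its formula, so the code takes it to be the share of the corpus (in tokens) located in parts that contain the item.

```r
# Toy term-document matrix; the first row holds the part sizes (hypothetical data)
tdm <- rbind(
  part_size = c(100, 200, 300, 400),
  item_a    = c(  2,   0,   5,   1),
  item_b    = c(  0,   0,   0,   7)
)
colnames(tdm) <- paste0("part", 1:4)

sizes  <- tdm[1, ]                  # sizes of the corpus parts
counts <- tdm[-1, , drop = FALSE]   # item frequencies only

present <- counts > 0               # TRUE where an item occurs at least once

# Absolute range: number of parts containing the item
range_absolute <- rowSums(present)

# Relative range: proportion of parts containing the item
range_relative <- rowSums(present) / ncol(counts)

# ASSUMED reading of 'relative range with size':
# summed size of the parts containing the item, divided by the corpus size
range_withsize <- as.vector((present * 1) %*% sizes) / sum(sizes)
names(range_withsize) <- rownames(counts)

print(rbind(range_absolute, range_relative, range_withsize))
```

In this toy corpus, item_a occurs in three of four parts, so its absolute range is 3 and its relative range is 0.75; under the assumed size weighting it scores 0.8, because the three parts hold 800 of the 1000 tokens.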

Value

A numeric vector whose length equals the number of items in the term-document matrix

Author(s)

Lukas Soenning

References

Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri

Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115

Examples

disp_R_tdm(
  tdm = biber150_spokenBNC2014[1:20,],
  row_partsize = "first",
  type = "relative",
  freq_adjust = FALSE)


[Package tlda version 0.1.0 Index]