disp_R {tlda} | R Documentation |
Calculate the dispersion measure 'range'
Description
This function calculates the dispersion measure 'range'. It offers three different versions: 'absolute range' (the number of corpus parts containing at least one occurrence of the item), 'relative range' (the proportion of corpus parts containing at least one occurrence of the item), and 'relative range with size' (relative range that takes into account the size of the corpus parts). The function also offers the option of calculating frequency-adjusted dispersion scores.
Usage
disp_R(
subfreq,
partsize,
type = "relative",
freq_adjust = FALSE,
freq_adjust_method = "pervasive",
unit_interval = TRUE,
digits = NULL,
verbose = TRUE,
print_score = TRUE,
suppress_warning = FALSE
)
Arguments
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
type |
Character string indicating which type of range to calculate. See details below. The default is "relative" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "pervasive" (default) and "even" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
Details
The function calculates the dispersion measure 'range' based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens). Three different types of range measures can be calculated:
Absolute range: The number of corpus parts containing at least one occurrence of the item
Relative range: The proportion of corpus parts containing at least one occurrence of the item; this version of 'range' follows the conventional scaling of dispersion measures (1 = widely dispersed)
Relative range with size (see Gries 2022: 179-180; Gries 2024: 27-28): Relative range that takes into account the size of the corpus parts. Each corpus part contributes to this version of range in proportion to its size. Suppose there are 100 corpus parts, and part 1 is relatively short, accounting for 1/200 of the words in the whole corpus. If the item occurs in part 1, ordinary relative range increases by 1/100, since each part receives the same weight. Relative range with size, on the other hand, increases by 1/200, i.e. the relative size of the corpus part; this version of range weights corpus parts proportionate to their size.
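The three definitions above can be sketched in a few lines of base R. This is a minimal illustration with made-up subfrequencies and part sizes, not the code used inside disp_R(); the variable names are hypothetical.

```r
# Hypothetical data: occurrences per corpus part and part sizes (word tokens)
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- c(500, 1000, 1500, 3000, 4000)

# TRUE for every corpus part with at least one occurrence of the item
occurs <- subfreq > 0

# Absolute range: number of parts containing the item
range_absolute <- sum(occurs)

# Relative range: proportion of parts containing the item
range_relative <- mean(occurs)

# Relative range with size: each part counts in proportion to its share
# of the whole corpus
range_withsize <- sum(partsize[occurs]) / sum(partsize)

c(absolute = range_absolute,
  relative = range_relative,
  withsize = range_withsize)
```

Because the three parts that contain the item are also the largest ones, relative range with size (0.85) exceeds ordinary relative range (0.6) for these data.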
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; for this function, the default shown under Usage is "pervasive". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even").
The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
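The min-max logic described above can be sketched for relative range with extremes built along the lines of the "pervasive" method. This is a simplified illustration under stated assumptions (equal-sized parts, all occurrences fit into one part), not the procedure implemented in the package; see vignette("frequency-adjustment") for the actual construction. All variable names are hypothetical.

```r
# Simplified illustration only; not the tlda implementation.
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- rep(1000, 5)

k <- length(subfreq)   # number of corpus parts
f <- sum(subfreq)      # overall corpus frequency of the item

# Observed (unadjusted) relative range
r_obs <- mean(subfreq > 0)

# Assumed minimum: all occurrences are packed into a single part
# (possible here because f does not exceed the largest part size)
stopifnot(f <= max(partsize))
r_min <- 1 / k

# Assumed maximum: one occurrence in each of as many parts as possible
r_max <- min(f, k) / k

# Min-max transformation; undefined when both extremes coincide (e.g. f = 1)
r_adj <- if (r_max > r_min) (r_obs - r_min) / (r_max - r_min) else NA_real_
r_adj   # approximately 0.5
```

Here the observed range of 0.6 lies halfway between the assumed minimum (0.2) and maximum (1), which yields an adjusted score of about 0.5.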
Value
A numeric value
Author(s)
Lukas Soenning
References
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Examples
disp_R(
subfreq = c(0, 0, 1, 2, 5),
partsize = rep(1000, 5),
type = "relative",
freq_adjust = FALSE)
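A further call, using only arguments listed under Usage, might request a frequency-adjusted score and keep the console quiet. This sketch assumes the tlda package is installed and skips the call otherwise; no output is shown because it depends on the package's implementation.

```r
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- rep(1000, 5)

# Only run the call if tlda is available on this system
if (requireNamespace("tlda", quietly = TRUE)) {
  score <- tlda::disp_R(
    subfreq = subfreq,
    partsize = partsize,
    type = "relative",
    freq_adjust = TRUE,
    freq_adjust_method = "pervasive",
    digits = 2,            # round the score to two decimal places
    verbose = FALSE,       # suppress the additional information
    print_score = FALSE)   # do not print; work with the returned value
}
```

The returned numeric value is stored in score and can be used in further computations.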