corpusStats,KorAPConnection-method {RKorAPClient} | R Documentation |
Get corpus size and statistics
Description
Retrieve information about corpus size (documents, tokens, sentences, paragraphs) for the entire corpus or a virtual corpus subset.
Usage
## S4 method for signature 'KorAPConnection'
corpusStats(kco, vc = "", verbose = kco@verbose, as.df = FALSE)
Arguments
kco |
KorAPConnection object (obtained e.g. from KorAPConnection()) |
vc |
string describing the virtual corpus. An empty string (default) means the whole corpus, to the extent that its licenses permit access. |
verbose |
logical. If |
as.df |
logical. If TRUE, return the result as a data frame instead of an S4 object. |
Value
Object containing corpus statistics with the following information:
vc
Virtual corpus definition used (empty string for entire corpus)
documents
Total number of documents in the (virtual) corpus
tokens
Total number of word tokens in the (virtual) corpus
sentences
Total number of sentences in the (virtual) corpus
paragraphs
Total number of paragraphs in the (virtual) corpus
webUIRequestUrl
URL to view this corpus subset in the KorAP web interface
When as.df=TRUE, returns a data frame with these columns.
When as.df=FALSE (default), returns a KorAPCorpusStats object with these values as slots.
Usage
# Get statistics for entire corpus
kcon <- KorAPConnection()
stats <- corpusStats(kcon)
# Get statistics for a specific time period
stats <- corpusStats(kcon, "pubDate in 2020")
# Access the number of tokens
stats@tokens
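Virtual corpus definitions are plain strings, so they can be assembled with base R before being passed as vc. The following is a minimal sketch that assumes only the query syntax shown in this page's examples (pubDate in YEAR, textType=/REGEX/). The variable names years, vcs and newspaper_vcs are illustrative and not part of RKorAPClient. The snippet runs offline and does not contact a KorAP server.

```r
# Build one virtual corpus definition per year (base R only, no server access).
years <- 2015:2020
vcs <- sprintf("pubDate in %d", years)

# Narrow each definition to newspaper texts, reusing the filter from this page.
newspaper_vcs <- paste(vcs, "& textType=/Zeitung.*/")

print(vcs[1])
print(newspaper_vcs[1])
```

Each element could then be passed as the vc argument, for example corpusStats(kco, vcs[1]).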
Examples
## Not run:
kco <- KorAPConnection()
# Get statistics for entire corpus (returns S4 object)
stats <- corpusStats(kco)
stats@tokens # Access number of tokens
# Get statistics for newspaper texts from 2017 (as data frame)
df <- corpusStats(kco, "pubDate in 2017 & textType=/Zeitung.*/", as.df = TRUE)
df$documents # Access number of documents
# Compare corpus sizes across years
years <- 2015:2020
sizes <- sapply(years, function(y) {
  corpusStats(kco, paste("pubDate in", y))@tokens
})
## End(Not run)
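The per-year comparison in the examples returns a bare numeric vector. Below is a small sketch of how such a result might be tabulated. The token counts are invented placeholders standing in for real corpusStats() results so that the snippet runs without a server, and size_table is an illustrative name, not part of the package.

```r
years <- 2015:2020

# Placeholder values, NOT real corpus sizes; in practice this vector would
# come from the sapply() call over corpusStats(...)@tokens.
sizes <- c(10, 20, 30, 40, 50, 50)

size_table <- data.frame(
  year = years,
  tokens = sizes,
  share = sizes / sum(sizes)  # each year's fraction of the combined tokens
)
print(size_table)
```

Keeping the years next to the counts in one data frame avoids relying on the positional order of the sapply() result.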