mobydick {tall} | R Documentation
Lemmatized Text of Moby-Dick (Chapters 1-10)
Description
This dataset contains the lemmatized text of the first 10 chapters of the novel Moby-Dick by Herman Melville. The data are stored as a data frame of tokens, each annotated with its lemma, part-of-speech tags, morphological features, and dependency relations.
Usage
data(mobydick)
Format
A data frame with one row per token and the 27 columns listed below:
- doc_id
Character: Unique document identifier
- paragraph_id
Integer: Paragraph index within the document
- sentence_id
Integer: Sentence index within the paragraph
- sentence
Character: Original sentence text
- start
Integer: Start position of the token in the sentence
- end
Integer: End position of the token in the sentence
- term_id
Integer: Unique term identifier
- token_id
Integer: Token index in the sentence
- token
Character: Original token (word)
- lemma
Character: Lemmatized form of the token
- upos
Character: Universal POS tag
- xpos
Character: Language-specific POS tag
- feats
Character: Morphological features
- head_token_id
Integer: token_id of the token's head in the dependency tree
- dep_rel
Character: Dependency relation label
- deps
Character: Enhanced dependency relations
- misc
Character: Additional information
- folder
Character: Folder containing the document
- split_word
Character: The word used to separate the chapters in the original book
- filename
Character: Source file name
- doc_selected
Logical: Whether the document is selected
- POSSelected
Logical: Whether POS was selected
- sentence_hl
Character: Highlighted sentence
- docSelected
Logical: Whether the document was manually selected
- noHapax
Logical: Whether hapax legomena were removed
- noSingleChar
Logical: Whether single-character words were removed
- lemma_original_nomultiwords
Character: Lemmatized form without multi-word units
Source
Extracted and processed from the text of Moby-Dick by Herman Melville.
Examples
data(mobydick)
head(mobydick)
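
A further sketch, not part of the package itself: because the data frame pairs each token with a lemma and a universal POS tag, the most frequent lemmas for a given tag can be tabulated with base R. The helper name top_lemmas and the toy data frame are hypothetical and only illustrate the column layout described under Format; the final step assumes the tall package is installed and that upos uses standard Universal Dependencies tags such as "NOUN".

```r
# Hypothetical helper (not provided by tall): count the lemmas that carry
# a given universal POS tag and return the n most frequent ones.
top_lemmas <- function(tokens, pos = "NOUN", n = 10) {
  keep <- !is.na(tokens$upos) & tokens$upos == pos
  counts <- sort(table(tokens$lemma[keep]), decreasing = TRUE)
  head(counts, n)
}

# Toy stand-in using the documented column names.
# The values are invented for illustration and are not taken from the dataset.
toy <- data.frame(
  doc_id = "toy",
  token  = c("whales", "swim", "whale", "ships", "sail"),
  lemma  = c("whale", "swim", "whale", "ship", "sail"),
  upos   = c("NOUN", "VERB", "NOUN", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)
toy_counts <- top_lemmas(toy, pos = "NOUN")
print(toy_counts)

# The same call on the real data, run only if tall is available.
if (requireNamespace("tall", quietly = TRUE)) {
  data(mobydick, package = "tall")
  print(top_lemmas(mobydick, pos = "NOUN"))
}
```

Filtering on upos before counting keeps function words out of the table; the same pattern could be applied to the logical columns (for example noHapax) if a narrower subset is wanted.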