split_text {deeplr} | R Documentation
Split Text into Byte-Limited Segments
Description
split_text divides input text into smaller segments that do not exceed a specified maximum size in bytes. Segmentation is performed at sentence or word boundaries.
Usage
split_text(text, max_size_bytes = 29000, tokenize = "sentences")
Arguments
text: A character vector containing the text(s) to be split.
max_size_bytes: An integer specifying the maximum size (in bytes) of each segment.
tokenize: A string indicating the level of tokenization. Must be either "sentences" or "words".
Details
This function uses tokenizers::tokenize_sentences (or tokenizers::tokenize_words if tokenize = "words") to split the text into natural-language units, which are then assembled into byte-limited segments.
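The assembly step described above can be sketched as a greedy loop that packs tokens into segments until adding the next token would exceed the byte limit. This is a simplified illustration, not the package's actual implementation; the helper name pack_segments is an assumption.

```r
# Simplified sketch (assumed logic, not deeplr's exact code): greedily pack
# tokens into segments, each staying within max_size_bytes.
pack_segments <- function(tokens, max_size_bytes) {
  segments <- character(0)
  current <- ""
  for (tok in tokens) {
    candidate <- if (nchar(current) == 0) tok else paste(current, tok)
    # nchar(type = "bytes") measures size in bytes, not characters
    if (nchar(candidate, type = "bytes") <= max_size_bytes) {
      current <- candidate
    } else {
      if (nchar(current) > 0) segments <- c(segments, current)
      current <- tok  # assumes a single token fits within the limit
    }
  }
  if (nchar(current) > 0) segments <- c(segments, current)
  segments
}

pack_segments(c("One.", "Two.", "Three."), max_size_bytes = 12)
# -> c("One. Two.", "Three.")
```

In the real function the tokens come from tokenizers::tokenize_sentences or tokenizers::tokenize_words, and segment and text indices are tracked to build the returned tibble.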
Value
A tibble with one row per text segment, containing the following columns:
- text_id: The index of the original text in the input vector.
- segment_id: A sequential ID identifying the segment number.
- segment_text: The resulting text segment, each within the specified byte limit.
Examples
## Not run:
long_text <- paste0(rep("This is a very long text. ", 10000), collapse = "")
split_text(long_text, max_size_bytes = 1000, tokenize = "sentences")
## End(Not run)