keyclust {keyclust} | R Documentation
Algorithm designed to efficiently extract keywords from a cosine similarity matrix
Description
This function takes a cosine similarity matrix derived from a word embedding model, along with a set of seed words, and outputs a set of semantically related keywords. The user controls the maximum size of the set and the minimum cosine similarity required for a word to be added.
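As a rough illustration of the kind of input described above, the following base R sketch builds a small cosine similarity matrix from a toy embedding matrix. The embedding values and the object names `emb` and `unit` are made up for this example; in practice the matrix would come from a real word embedding model.

```r
# Toy embedding matrix: one row per word, one column per dimension.
# The numbers are invented purely for illustration.
emb <- rbind(
  october  = c(0.9, 0.1, 0.0),
  november = c(0.8, 0.2, 0.1),
  banana   = c(0.0, 0.1, 0.9)
)

# Scale each row to unit length so that dot products equal cosine similarities.
unit <- emb / sqrt(rowSums(emb^2))

# Square, symmetric matrix with words as both row and column names.
sim_mat <- unit %*% t(unit)
round(sim_mat, 2)
```

The result has ones on the diagonal, and related words (here the two month names) score much higher with each other than with an unrelated word.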
Usage
keyclust(
  sim_mat,
  seed_words,
  sim_thresh = 0.25,
  max_n = 50,
  dictionary = NULL,
  exclude = NULL,
  verbose = TRUE
)
Arguments
sim_mat    | A cosine similarity matrix produced by cosine.
seed_words | A set of user-provided seed words that best represent the target concept.
sim_thresh | Minimum cosine similarity a candidate word must have to the existing set of keywords for it to be added.
max_n      | The maximum size of the output set of keywords.
dictionary | An optional dictionary that maps metadata, such as definitions, to keywords.
exclude    | A vector of words that the user does not want included in the final keyword set.
verbose    | If TRUE, keyclust will produce live updates as it adds keywords.
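To make the roles of sim_thresh, max_n, and exclude concrete, here is a minimal sketch of a greedy expansion loop. It is not the keyclust implementation: `greedy_expand` is a hypothetical helper, and scoring each candidate by its mean similarity to the current keyword set is an assumption made for this example.

```r
# Hypothetical helper, NOT part of keyclust: greedy keyword expansion.
greedy_expand <- function(sim_mat, seed_words, sim_thresh = 0.25,
                          max_n = 50, exclude = NULL) {
  # Start from the seeds that actually appear in the matrix.
  keywords <- intersect(seed_words, rownames(sim_mat))
  stopifnot(length(keywords) > 0)
  candidates <- setdiff(rownames(sim_mat), c(keywords, exclude))

  while (length(keywords) < max_n && length(candidates) > 0) {
    # Assumption: a candidate's score is its mean similarity to the current set.
    scores <- rowMeans(sim_mat[candidates, keywords, drop = FALSE])
    best <- which.max(scores)
    # Stop once the best remaining candidate falls below the threshold.
    if (scores[[best]] < sim_thresh) break
    keywords <- c(keywords, candidates[best])
    candidates <- candidates[-best]
  }
  keywords
}

# Toy similarity matrix with invented values.
words <- c("october", "november", "december", "banana")
toy_sim <- matrix(
  c(1.0, 0.9, 0.8, 0.1,
    0.9, 1.0, 0.7, 0.0,
    0.8, 0.7, 1.0, 0.2,
    0.1, 0.0, 0.2, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(words, words)
)

result <- greedy_expand(toy_sim, seed_words = c("october", "november"), max_n = 8)
print(result)  # "october" "november" "december"
```

In this toy run "december" is added because its mean similarity to the seeds (0.75) clears the threshold, while "banana" is rejected. Lowering max_n to 2 would return the seeds alone.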
Value
A list containing a data frame of keywords and their cosine similarities, and a matrix of cosine similarities.
Examples
# Define the seed words that represent the target concept
seeds <- c("october", "november")
# Create a cosine similarity matrix from a word embedding model
simmat_FasttextEng_sample <- wordemb_FasttextEng_sample |>
  process_embed(words = "words") |>
  similarity_matrix(words = "words")
# Use keyclust to generate a set of at most 8 keywords from the seeds
months <- keyclust(simmat_FasttextEng_sample, seed_words = seeds, max_n = 8)