keyclust {keyclust}    R Documentation

Algorithm designed to efficiently extract keywords from a cosine similarity matrix

Description

This function takes a cosine similarity matrix derived from a word embedding model, along with a set of seed words, and outputs a set of semantically related keywords. The user controls the size of the set and the minimum cosine similarity required for a word to be included.

Usage

keyclust(
  sim_mat,
  seed_words,
  sim_thresh = 0.25,
  max_n = 50,
  dictionary = NULL,
  exclude = NULL,
  verbose = TRUE
)

Arguments

sim_mat

A cosine similarity matrix produced by cosine.
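
As a rough illustration of the kind of input described here, the sketch below builds a small cosine similarity matrix in base R. The embedding values and the helper name cosine_sim are hypothetical, and the sketch assumes that sim_mat is a square, symmetric matrix whose row and column names are the words.

```r
# Toy embeddings: one row per word, one column per dimension (values are made up)
emb <- rbind(
  october  = c(0.9, 0.1, 0.0),
  november = c(0.8, 0.2, 0.1),
  banana   = c(0.0, 0.1, 0.9)
)

# Hypothetical helper: cosine similarity between all pairs of rows
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)                # dot products of unit vectors
}

toy_sim <- cosine_sim(emb)
round(toy_sim, 2)
```

In this toy matrix the two month names are far more similar to each other than either is to "banana", which is the kind of structure a seed-based keyword search relies on.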

seed_words

A set of user-provided seed words that best represent the target concept.

sim_thresh

Minimum cosine similarity a candidate word must have to the existing set of keywords for it to be added.

max_n

The maximum size of the output set of keywords.
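
One way to picture how sim_thresh and max_n interact is a greedy loop that repeatedly adds the best remaining candidate. The sketch below is an illustration only, not the package's implementation: the function name grow_keywords is hypothetical, and scoring a candidate by its mean similarity to the current keyword set is an assumption.

```r
# Hypothetical illustration; keyclust's actual selection rule may differ
grow_keywords <- function(sim_mat, seed_words, sim_thresh = 0.25, max_n = 50) {
  keywords <- seed_words
  while (length(keywords) < max_n) {
    candidates <- setdiff(rownames(sim_mat), keywords)
    if (length(candidates) == 0) break
    # Assumed score: mean similarity of each candidate to the current keywords
    scores <- rowMeans(sim_mat[candidates, keywords, drop = FALSE])
    best <- which.max(scores)
    if (scores[best] < sim_thresh) break  # no candidate clears the threshold
    keywords <- c(keywords, candidates[best])
  }
  keywords
}

# Toy similarity matrix (values are made up)
words <- c("october", "november", "december", "banana")
toy_sim <- matrix(
  c(1.0, 0.9, 0.8, 0.1,
    0.9, 1.0, 0.7, 0.0,
    0.8, 0.7, 1.0, 0.2,
    0.1, 0.0, 0.2, 1.0),
  nrow = 4, dimnames = list(words, words)
)

grow_keywords(toy_sim, seed_words = c("october", "november"))
```

With these toy values, "december" is added to the seeds, while "banana" is left out because its mean similarity to the three keywords (0.1) falls below the default threshold of 0.25. Lowering sim_thresh admits weaker candidates, and max_n caps the set regardless of how many candidates clear the threshold.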

dictionary

An optional dictionary that maps metadata, such as definitions, to keywords.

exclude

A vector of words that the user does not want included in the final keyword set.

verbose

If TRUE, keyclust prints live updates as it adds keywords.

Value

A list containing a data frame of keywords and their cosine similarities, and a matrix of cosine similarities.

Examples

# Create a set of keywords using a pre-defined set of seeds
seeds <- c("october", "november")
# Create a cosine similarity matrix from a word embedding model
simmat_FasttextEng_sample <- wordemb_FasttextEng_sample |>
    process_embed(words = "words") |>
    similarity_matrix(words = "words")
# Use keyclust to generate a set of keywords
months <- keyclust(simmat_FasttextEng_sample, seed_words = seeds, max_n = 8)

[Package keyclust version 1.2.5 Index]