getMatches {nzilbb.labbcat} | R Documentation |
Search for tokens
Description
Searches through transcripts for tokens matching the given pattern.
Usage
getMatches(
labbcat.url,
pattern,
participant.expression = NULL,
transcript.expression = NULL,
main.participant = TRUE,
aligned = NULL,
matches.per.transcript = NULL,
words.context = 0,
max.matches = NULL,
overlap.threshold = NULL,
anchor.confidence.min = NULL,
page.length = 1000,
no.progress = FALSE
)
Arguments
labbcat.url |
URL to the LaBB-CAT instance |
pattern |
An object representing the pattern to search for. This can be:
Examples of valid pattern objects include:
## the word 'the' followed immediately by a word starting with an orthographic vowel
pattern <- "the [aeiou].*"
## a word spelt with "k" but pronounced "n" word initially
pattern <- list(orthography = "k.*", phonemes = "n.*")
## the word 'the' followed immediately by a word starting with a phonemic vowel
pattern <- list(
  list(orthography = "the"),
  list(phonemes = "[cCEFHiIPqQuUV0123456789~#\\$@].*"))
## the word 'the' followed immediately or with one intervening word by
## a hapax legomenon (word with a frequency of 1) that doesn't start with a vowel
pattern <- list(columns = list(
  list(layers = list(
         orthography = list(pattern = "the")),
       adj = 2),
  list(layers = list(
         phonemes = list(not = TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
         frequency = list(max = "2")))))
## words that contain the /I/ phone followed by the /l/ phone
## (multiple patterns per word currently only works for segment layers)
pattern <- list(segment = list("I", "l"))
## words that contain the /I/ phone followed by the /l/ phone, targeting the /l/ segment
## (multiple patterns per word currently only works for segment layers)
pattern <- list(segment = list("I", list(pattern = "l", target = TRUE)))
## words where the spelling starts with "k", but the first segment is /n/
pattern <- list(
  orthography = "k.*",
  segment = list(pattern = "n", anchorStart = TRUE)) |
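The pattern forms shown above are ordinary R values, so they can be built and inspected before any request is sent. The following is a minimal sketch in base R; describePattern is a hypothetical helper written for this illustration, not part of nzilbb.labbcat, and its classification of forms is inferred only from the examples above.

```r
# Hypothetical helper (not part of nzilbb.labbcat): reports which of the
# example forms a pattern object resembles.
describePattern <- function(pattern) {
  if (is.character(pattern)) {
    return("string")
  }
  if (is.list(pattern) && !is.null(names(pattern))) {
    if ("columns" %in% names(pattern)) {
      return("full structure with 'columns'")
    }
    return("named list (one word, one or more layers)")
  }
  if (is.list(pattern)) {
    return("list of named lists (one element per word)")
  }
  stop("unrecognised pattern object")
}

simple <- "the [aeiou].*"
oneWord <- list(orthography = "k.*", phonemes = "n.*")
twoWords <- list(
  list(orthography = "the"),
  list(phonemes = "[cCEFHiIPqQuUV0123456789~#\\$@].*"))

describePattern(simple)
describePattern(oneWord)
describePattern(twoWords)
length(twoWords)  # number of consecutive words in the pattern
```

Checking the shape of a pattern this way can catch a misplaced parenthesis in a nested list before the search is run.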
participant.expression |
An optional participant query expression for identifying participants whose utterances will be searched. This should be the output of expressionFromIds, expressionFromAttributeValue, or expressionFromAttributeValues, or more than one of these concatenated together and delimited by ' && '. If not supplied, utterances of all participants will be searched. |
transcript.expression |
An optional transcript query expression for identifying transcripts to search in. This should be the output of expressionFromIds, expressionFromTranscriptTypes, expressionFromAttributeValue, or expressionFromAttributeValues, or more than one of these concatenated together and delimited by ' && '. If not supplied, all transcripts will be searched. |
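Because these expressions are plain character strings, several can be joined with paste. The sketch below uses placeholder strings in place of real output from the expressionFrom... functions; the placeholder text is an assumption for illustration only and does not reflect the actual expression syntax.

```r
# Placeholder strings standing in for the output of the expressionFrom...
# functions (the real syntax is generated by those functions).
exprA <- "<expression A>"
exprB <- "<expression B>"

# Two expressions given as separate arguments: use sep
combined <- paste(exprA, exprB, sep = " && ")

# Expressions held in a single character vector: use collapse instead,
# because sep does not join the elements of one vector
exprs <- c(exprA, exprB)
combinedFromVector <- paste(exprs, collapse = " && ")

identical(combined, combinedFromVector)
```

The resulting string would then be passed as participant.expression or transcript.expression.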
main.participant |
TRUE to search only main-participant utterances, FALSE to search all utterances. |
aligned |
This parameter is deprecated and will be removed in future versions;
please use |
matches.per.transcript |
Optional maximum number of matches per transcript to return. NULL means all matches. |
words.context |
Number of words of context to include in the 'Before.Match' and 'After.Match' columns in the results. |
max.matches |
The maximum number of matches to return, or NULL to return all. |
overlap.threshold |
The percentage overlap with other utterances above which simultaneous speech is excluded, or NULL to include overlapping speech. |
anchor.confidence.min |
The minimum confidence for alignments, e.g.
|
page.length |
In order to prevent timeouts when there are a large number of matches or the network connection is slow, rather than retrieving matches in one big request, they are retrieved using many smaller requests. This parameter controls the number of results retrieved per request. |
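To illustrate the trade-off, the following sketch computes how many requests a given page.length implies for a hypothetical number of matches. It shows the arithmetic only; pageStarts and the match count are assumptions made for this example, and the package's internal paging may differ.

```r
# Hypothetical helper: zero-based start index of each page of results
pageStarts <- function(total.matches, page.length = 1000) {
  if (total.matches <= 0) return(integer(0))
  seq(0, total.matches - 1, by = page.length)
}

# 2500 matches retrieved 1000 at a time
starts <- pageStarts(2500, page.length = 1000)
starts          # 0 1000 2000
length(starts)  # 3 requests
```

A smaller page.length means more requests, each of which returns sooner; a larger value means fewer but longer-running requests.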
no.progress |
TRUE to suppress the visual progress bar. Otherwise, the progress bar will be shown when interactive(). |
Value
A data frame identifying matches, containing the following columns:
- Title: The title of the LaBB-CAT instance
- Version: The current version of the LaBB-CAT instance
- SearchName: A name based on the pattern – the same for all rows
- MatchId: A unique ID for the matching target token
- Transcript: Name of the transcript in which the match was found
- Participant: Name of the speaker
- Corpus: The corpus of the transcript
- Line: The start offset of the utterance/line
- LineEnd: The end offset of the utterance/line
- Before.Match: Transcript text immediately before the match
- Text: Transcript text of the match
- After.Match: Transcript text immediately after the match
- Number: Row number
- URL: URL of the first matching word token
- Target.word: Text of the target word token
- Target.word.start: Start offset of the target word token
- Target.word.end: End offset of the target word token
- Target.segment: Label of the target segment (only present if the segment layer is included in the pattern)
- Target.segment.start: Start offset of the target segment (only present if the segment layer is included in the pattern)
- Target.segment.end: End offset of the target segment (only present if the segment layer is included in the pattern)
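As a sketch of how these columns might be used, the example below works on a small hand-made data frame containing only some of the columns listed above. The values are invented for illustration and are not real search results, and the offsets are assumed to be in seconds.

```r
# Invented rows standing in for the result of getMatches()
results <- data.frame(
  MatchId = c("m1", "m2", "m3"),   # placeholder IDs, not real MatchId values
  Participant = c("AP511_MikeThorpe", "AP511_MikeThorpe", "BR2044_OllyOhlson"),
  Target.segment = c("I", "I", "I"),
  Target.segment.start = c(1.20, 5.48, 2.10),
  Target.segment.end = c(1.28, 5.55, 2.19),
  stringsAsFactors = FALSE)

# Duration of each target segment
results$Duration <- results$Target.segment.end - results$Target.segment.start

# Number of tokens per participant
tokenCounts <- table(results$Participant)

# Mean duration per participant
meanDuration <- aggregate(Duration ~ Participant, data = results, FUN = mean)
```

The Target.segment columns are only available when the segment layer is part of the pattern, so code like this would need to check for them first.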
See Also
Examples
## Not run:
## the word 'the' followed immediately by a word starting with an orthographic vowel
theThenOrthVowel <- getMatches(labbcat.url, "the [aeiou].*")
## a word spelt with "k" but pronounced "n" word initially
knWords <- getMatches(labbcat.url, list(orthography = "k.*", phonemes = "n.*"))
## the word 'the' followed immediately by a word starting with a phonemic vowel
theThenPhonVowel <- getMatches(
  labbcat.url, list(
    list(orthography = "the"),
    list(phonemes = "[cCEFHiIPqQuUV0123456789~#\\$@].*")))
## the word 'the' followed immediately or with one intervening word by
## a hapax legomenon (word with a frequency of 1) that doesn't start with a vowel
results <- getMatches(
  labbcat.url, list(columns = list(
    list(layers = list(
           orthography = list(pattern = "the")),
         adj = 2),
    list(layers = list(
           phonemes = list(not=TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
           frequency = list(max = "2"))))),
  overlap.threshold = 5)
## all tokens of the KIT vowel, from the interview or monologue
## of the participants AP511_MikeThorpe and BR2044_OllyOhlson
results <- getMatches(labbcat.url, list(segment="I"),
  participant.expression = expressionFromIds(c("AP511_MikeThorpe","BR2044_OllyOhlson")),
  transcript.expression = expressionFromTranscriptTypes(c("interview","monologue")))
## all tokens of the KIT vowel for male speakers who speak English
results <- getMatches(labbcat.url, list(segment="I"),
  participant.expression = paste(
    expressionFromAttributeValue("participant_gender", "M"),
    expressionFromAttributeValues("participant_languages_spoken", "en"),
    sep=" && "))
## results$Text is the text that matched
## results$MatchId can be used to access results using other functions
## End(Not run)