aggregate_psd2_keyword_features {featForge} | R Documentation |
Aggregate PSD2 Keyword Features at the Application Level with Time Window Filtering
Description
This function extracts keyword features from a transaction descriptions column using the extract_keyword_features function and then aggregates these features at the application level using the aggregate_applications function. In addition, when the aggregation period is provided as a numeric vector (e.g., c(30, 3)), the function filters out transactions that fall outside the observation window, defined as the period between scrape_date - (period[1] * period[2]) and scrape_date. This prevents spending time processing keywords from transactions that would later be aggregated as zeros.
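The window arithmetic can be illustrated with a minimal base R sketch. It does not call featForge: the data frame tx and its column names are hypothetical, and it assumes that period = c(30, 3) means three periods of 30 days each and that both window bounds are inclusive.

```r
# Hypothetical transactions for one application (not the package's dataset).
tx <- data.frame(
  application_id   = c(1, 1, 1),
  transaction_date = as.Date(c("2024-01-05", "2024-03-20", "2024-04-28")),
  scrape_date      = as.Date("2024-05-01")
)

period <- c(30, 3)  # assumed meaning: 3 periods of 30 days each

# Window start: scrape_date - (period[1] * period[2]), i.e. 90 days back.
window_start <- tx$scrape_date - period[1] * period[2]

# Keep only transactions inside the window (inclusive bounds are an assumption).
in_window <- tx$transaction_date >= window_start &
  tx$transaction_date <= tx$scrape_date
tx_filtered <- tx[in_window, ]
```

Under these assumptions the January transaction falls before the window start and is dropped before any keyword processing takes place.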
Usage
aggregate_psd2_keyword_features(
  data,
  id_col,
  description_col,
  amount_col = NULL,
  time_col = NULL,
  observation_window_start_col = NULL,
  scrape_date_col = NULL,
  ops = NULL,
  period = "all",
  separate_direction = if (!is.null(amount_col)) TRUE else FALSE,
  group_cols = NULL,
  min_freq = 1,
  use_matrix = TRUE,
  convert_to_df = TRUE,
  period_agg = sum,
  period_missing_inputs = 0
)
Arguments
data |
A data frame containing transaction records. |
id_col |
A character string specifying the column name that identifies each application (e.g., |
description_col |
A character string specifying the column name that contains the transaction descriptions.
Note that this column may contain |
amount_col |
Optional. A character string specifying the column name that contains transaction amounts.
If provided, the function aggregates a value for each keyword (default |
time_col |
Optional. A character string specifying the column name that contains the transaction date
(or timestamp). When |
observation_window_start_col |
Optional. A character string indicating the column name with the observation window start date.
If |
scrape_date_col |
Optional. A character string indicating the column name with the scrape date.
If |
ops |
A named list of functions used to compute summary features on the aggregated values.
If |
period |
Either a character string or a numeric vector controlling time aggregation.
The default is "all". |
separate_direction |
Logical. If |
group_cols |
Optional. A character vector of additional grouping columns to use during aggregation.
If |
min_freq |
Numeric. The minimum frequency a token must have to be included in the keyword extraction. Default is 1. |
use_matrix |
Logical. Passed to |
convert_to_df |
Logical. Passed to |
period_agg |
A function used to aggregate values within each period (see |
period_missing_inputs |
A numeric value to replace missing aggregated values. Default is 0. |
Details
The function supports two modes:
- If amount_col is not provided (i.e., NULL), the function aggregates keyword counts (i.e., the number of transactions in which a keyword appears) for each application.
- If amount_col is provided, then for each transaction the keyword indicator is multiplied by the transaction amount. In this mode, the default aggregation operation is to sum these values (using ops = list(amount = sum)), yielding the total amount associated with transactions that mention each keyword.
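The difference between the two modes can be sketched in base R. This is only an illustration under stated assumptions: the data frame tx is hypothetical, and grepl() stands in for the package's own keyword extraction, whose tokenisation may differ.

```r
# Hypothetical transactions; grepl() is a stand-in for the package's
# keyword extraction.
tx <- data.frame(
  application_id = c("A", "A", "B"),
  description    = c("casino payment", "utilities bill", "casino refund"),
  amount         = c(-50, -120, 20),
  stringsAsFactors = FALSE
)

# Binary indicator: 1 if the description mentions the keyword, else 0.
casino <- as.numeric(grepl("casino", tx$description, fixed = TRUE))

# Mode 1 (no amount_col): count transactions mentioning the keyword.
casino_count <- tapply(casino, tx$application_id, sum)

# Mode 2 (amount_col given): weight the indicator by the amount, then sum.
casino_amount <- tapply(casino * tx$amount, tx$application_id, sum)
```

In the first mode each application gets the number of matching transactions; in the second it gets the total amount carried by those transactions.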
Additionally, if amount_col is provided and separate_direction is TRUE (the default), a new column named "direction" is created to separate incoming ("in") and outgoing ("out") transactions based on the sign of the amount. Any additional grouping columns can be provided via group_cols.
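A minimal base R sketch of the direction split follows. The data frame tx is hypothetical, and the treatment of zero amounts (labelled "out" here) is an assumption, since the text only says the label depends on the sign of the amount.

```r
# Hypothetical transactions with signed amounts.
tx <- data.frame(
  application_id = c("A", "A", "B"),
  amount         = c(250, -40, -15),
  stringsAsFactors = FALSE
)

# Assumption: positive amounts are incoming, everything else is outgoing.
tx$direction <- ifelse(tx$amount > 0, "in", "out")

# Sum amounts per application and direction.
by_direction <- aggregate(
  amount ~ application_id + direction,
  data = tx,
  FUN  = sum
)
```

Each application then contributes at most one aggregated value per direction, which is consistent with the per-direction output columns mentioned in the Examples section.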
The function performs the following steps:
1. Basic input checks are performed to ensure the required columns exist.
2. The full list of application IDs is stored from the original data.
3. If amount_col is provided and separate_direction is TRUE, a "direction" column is added to label transactions as incoming ("in") or outgoing ("out") based on the sign of the amount.
4. When period is provided as a numeric vector, the function computes the observation window as scrape_date - (period[1] * period[2]) to scrape_date and filters the dataset to include only transactions within this window. Applications with no records in the window are later assigned zeros.
5. Keyword features are extracted from the description_col using extract_keyword_features. If an amount_col is provided, the binary indicators are weighted by the transaction amount.
6. The extracted keyword features are combined with the (possibly filtered) original data.
7. For each keyword, the function calls aggregate_applications to aggregate the feature by application. The aggregation is performed over time periods defined by period (if applicable) and, if requested, further split by direction.
8. Aggregated results for each keyword are merged by application identifier.
9. Finally, the aggregated results are merged with the full list of application IDs so that applications with no transactions in the observation window appear with zeros.
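The final merge against the stored ID list can be sketched in base R as a left join followed by zero filling. This is an illustration, not the package's internal code: the data frames and the column name casino_amount are hypothetical.

```r
# Full list of application IDs taken before any filtering (hypothetical values).
all_ids <- data.frame(
  application_id = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Aggregated features after filtering; application "C" had no transactions
# in the window, so it is absent here.
agg <- data.frame(
  application_id = c("A", "B"),
  casino_amount  = c(-50, 20),
  stringsAsFactors = FALSE
)

# Left join onto the full ID list, then replace the resulting NAs with zero.
result <- merge(all_ids, agg, by = "application_id", all.x = TRUE)
result[is.na(result)] <- 0
```

Storing the ID list before filtering is what allows application "C" to reappear in the output with a zero instead of being dropped.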
Value
A data frame with one row per application and aggregated keyword features.
Examples
# Example: Aggregate keyword features for PSD2 transactions.
data(featForge_transactions)
# In this example, the 'description' field is parsed for keywords.
# Since the 'amount' column is provided, each keyword indicator is
# weighted by the transaction amount, and transactions are
# automatically split into incoming and outgoing via the 'direction' column.
# Additionally, the period is specified as c(30, 1), meaning only
# transactions occurring within the last 30 days
# (scrape_date - 30 to scrape_date) are considered.
result <- aggregate_psd2_keyword_features(
  data = featForge_transactions,
  id_col = "application_id",
  description_col = "description",
  amount_col = "amount",
  time_col = "transaction_date",
  scrape_date_col = "scrape_date",
  observation_window_start_col = "obs_start",
  period = c(30, 1),
  ops = list(amount = sum),
  min_freq = 1,
  use_matrix = TRUE,
  convert_to_df = TRUE
)
# The resulting data frame 'result' contains one
# row per application with aggregated keyword features.
# For example, if keywords "casino" and "utilities" were detected,
# aggregated columns might be named:
# "casino_amount_direction_in",
# "casino_amount_direction_out",
# "utilities_amount_direction_in", etc.
result