aggregate_applications {featForge} | R Documentation |
Aggregate Numeric Data by Periods
Description
Aggregates one or more numeric variables in a dataset over defined time periods and returns summary features computed with user-supplied operation functions. Typical uses include building features from transactional data, previous loan repayment behavior, or credit bureau inquiries. Aggregation is performed per grouping identifier (e.g., at the application, client, or agreement level) and is based on time periods.
Usage
aggregate_applications(
  data,
  id_col,
  amount_col,
  time_col = NULL,
  group_cols = NULL,
  ops,
  period,
  observation_window_start_col = NULL,
  scrape_date_col = NULL,
  period_agg = sum,
  period_missing_inputs = 0
)
Arguments
data |
A data frame containing the data to be aggregated. The dataset must include at least the columns specified by
|
id_col |
A character string specifying the column name used to define the aggregation level (e.g., |
amount_col |
A character string specifying the column in |
time_col |
A character string indicating the column name that contains the date (or timestamp) when the event occurred.
This column must be of class |
group_cols |
An optional character vector of column names by which to further subdivide the aggregation. For each unique value in these columns, separate summary features will be generated and appended as new columns. |
ops |
A named list of functions used to compute summary features on the aggregated period values. Each function must accept a single numeric vector as input. The names of the list elements are used to label the output columns. |
period |
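Either a character string specifying the time period grouping ("daily", "weekly", "monthly", or "all") or a numeric vector of length 2 giving the cycle length in days and the number of consecutive cycles (e.g., c(7, 8)). |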
observation_window_start_col |
A character string indicating the column name that contains the observation window start date.
This argument is required when |
scrape_date_col |
A character string indicating the column name that contains the scrape date (i.e., the end date for the observation
window). This is required when |
period_agg |
A function used to aggregate the numeric values within each period. The default is sum. |
period_missing_inputs |
A numeric constant used to replace missing values in periods with no observed data. The default value is 0. |
Details
When period is provided as a character string (one of "daily", "weekly", or "monthly"), data are grouped into complete calendar periods. For example, if the scrape date falls mid-month, the incomplete last period is excluded. Alternatively, period may be specified as a numeric vector of length 2 (e.g., c(7, 8)), in which case the first element defines the cycle length in days and the second element the number of consecutive cycles. In this example, if the scrape date is "2024-12-31", the periods span the last 56 days (8 consecutive 7-day cycles), with the first period starting on "2024-11-05".
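The date arithmetic for a numeric period can be checked with a few lines of base R. This is only a sketch of the windows described above, not the package's internal code: the variable names are illustrative, and it assumes the last cycle ends the day before the scrape date, which is the reading consistent with a first period starting on "2024-11-05".

```r
# Sketch: derive the cycle boundaries for period = c(7, 8).
# Assumption: the scrape date itself is not part of the last cycle.
scrape_date <- as.Date("2024-12-31")
period <- c(7, 8)

cycle_len <- period[1]  # cycle length in days
n_cycles  <- period[2]  # number of consecutive cycles

# Start of each cycle, oldest first.
starts <- scrape_date - cycle_len * (n_cycles:1)
ends   <- starts + cycle_len - 1

windows <- data.frame(cycle = seq_len(n_cycles), start = starts, end = ends)
print(windows)
```

The first row of windows starts on 2024-11-05, and the eight cycles together cover 56 days.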
aggregate_applications aggregates numeric data either by defined time periods or over the full observation window. Data are first grouped by the identifier specified in id_col (e.g., at the application, client, or agreement level).
When period is set to "daily", "weekly", or "monthly", transaction dates in time_col are partitioned into complete calendar periods (incomplete periods are excluded).
When period is set to a numeric vector of length 2 (e.g., c(7, 8)), consecutive cycles of fixed length are defined.
When period is set to "all", time aggregation is disabled. All observations for an identifier (or group) are aggregated together.
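For the calendar-based modes, the following base R sketch shows one way to identify the complete months inside an observation window. It is an illustration under assumptions, not the package's implementation: the dates are made up, first_of is a hypothetical helper, and treating a month that ends exactly on the scrape date as complete is a guess about boundary handling.

```r
# Sketch: which calendar months lie entirely inside the observation window?
obs_start   <- as.Date("2024-01-15")  # illustrative window start
scrape_date <- as.Date("2024-04-10")  # illustrative scrape date

# Hypothetical helper: first day of the month containing d.
first_of <- function(d) as.Date(format(d, "%Y-%m-01"))

month_starts <- seq(first_of(obs_start), first_of(scrape_date), by = "month")
# Last day of each month = first day of the following month minus one day.
month_ends <- seq(month_starts[1], by = "month",
                  length.out = length(month_starts) + 1)[-1] - 1

# A month counts as complete only if it starts and ends inside the window.
complete <- month_starts >= obs_start & month_ends <= scrape_date
complete_months <- format(month_starts[complete], "%Y-%m")
print(complete_months)
```

Here January is dropped because the window starts mid-month, and April is dropped because the scrape date falls mid-month.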
For each period, the numeric values in amount_col (or any other numeric measure) are aggregated using the function specified by period_agg. Then, for each unique group (if any group_cols are provided) and for each application (or other identifier), the summary functions specified in ops are applied to the vector of aggregated period values. When grouping is used, the resulting summary features are appended as new columns with names constructed in the format: <operation>_<group_column>_<group_value>. Missing aggregated values in periods with no observations are replaced by period_missing_inputs.
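The two-stage logic (aggregate within each period, then summarise across periods) can be mimicked in base R on a toy dataset. This is a minimal sketch under assumptions, not the function's actual code: the data, the set of complete months, and the feature names are invented for illustration.

```r
# Toy transactions for two applications (illustrative data).
tx <- data.frame(
  application_id   = c(1, 1, 1, 2),
  amount           = c(-10, -30, -5, -20),
  transaction_date = as.Date(c("2024-01-05", "2024-01-20",
                               "2024-03-02", "2024-02-11"))
)
tx$month <- format(tx$transaction_date, "%Y-%m")

months <- c("2024-01", "2024-02", "2024-03")  # assumed complete periods
period_agg <- sum                             # stage 1: within-period aggregation
period_missing_inputs <- 0                    # fill value for empty periods
ops <- list(avg_monthly = mean,               # stage 2: summaries across periods
            last_month  = function(x) x[length(x)])

features <- do.call(rbind, lapply(split(tx, tx$application_id), function(d) {
  # One aggregated value per period, oldest first.
  per_period <- vapply(months, function(m) {
    v <- d$amount[d$month == m]
    if (length(v) == 0) period_missing_inputs else period_agg(v)
  }, numeric(1), USE.NAMES = FALSE)
  # Apply every summary function to the vector of period values.
  data.frame(application_id = d$application_id[1],
             lapply(ops, function(f) f(per_period)))
}))
rownames(features) <- NULL
print(features)
```

For application 1 the period vector is c(-40, 0, -5): February has no transactions and is filled with period_missing_inputs before mean and the last-period function are applied.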
Value
A data frame where each row corresponds to a unique identifier (e.g., application, client, or agreement). The output includes aggregated summary features for each period and, if applicable, additional columns for each group defined in group_cols.
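To anticipate the grouped column names in the output, the documented pattern <operation>_<group_column>_<group_value> can be reproduced with paste. The sketch below covers a single grouping column only. The operation names and group values are illustrative, and the column order and the treatment of several group columns are not stated in this page, so they are not modelled here.

```r
# Sketch: expected feature names for one grouping column.
op_names     <- c("avg_monthly_transactions", "highest_monthly_transactions_count")
group_column <- "direction"
group_values <- c("in", "out")

# Every combination of operation and group value.
grid <- expand.grid(op = op_names, value = group_values, stringsAsFactors = FALSE)
feature_names <- paste(grid$op, group_column, grid$value, sep = "_")
print(feature_names)
```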
Examples
data(featForge_transactions)
# Example 1: Aggregate outgoing transactions (amount < 0) on a monthly basis.
aggregate_applications(featForge_transactions[featForge_transactions$amount < 0, ],
  id_col = 'application_id',
  amount_col = 'amount',
  time_col = 'transaction_date',
  ops = list(
    avg_monthly_outgoing_transactions = mean,
    last_month_transactions_amount = function(x) x[length(x)],
    # In the aggregated numeric vector, the last observation represents the most recent period.
    last_month_transaction_amount_vs_mean = function(x) x[length(x)] / mean(x)
  ),
  period = 'monthly',
  observation_window_start_col = 'obs_start',
  scrape_date_col = 'scrape_date'
)
# Example 2: Aggregate transactions by category and direction.
featForge_transactions$direction <- ifelse(featForge_transactions$amount > 0, 'in', 'out')
aggregate_applications(featForge_transactions,
  id_col = 'application_id',
  amount_col = 'amount',
  time_col = 'transaction_date',
  group_cols = c('category', 'direction'),
  ops = list(
    avg_monthly_transactions = mean,
    highest_monthly_transactions_count = max
  ),
  period = 'monthly',
  period_agg = length,
  observation_window_start_col = 'obs_start',
  scrape_date_col = 'scrape_date'
)
# Example 3: Aggregate using a custom numeric period:
# 30-day cycles for 3 consecutive cycles (i.e., the last 90 days).
aggregate_applications(featForge_transactions,
  id_col = 'application_id',
  amount_col = 'amount',
  time_col = 'transaction_date',
  ops = list(
    avg_30_day_transaction_count_last_90_days = mean
  ),
  period = c(30, 3),
  period_agg = length,
  observation_window_start_col = 'obs_start',
  scrape_date_col = 'scrape_date'
)
# Example 4: Aggregate transactions without time segmentation.
aggregate_applications(featForge_transactions,
  id_col = 'application_id',
  amount_col = 'amount',
  ops = list(
    total_transactions_counted = length,
    total_outgoing_transactions_counted = function(x) sum(x < 0),
    total_incoming_transactions_counted = function(x) sum(x > 0)
  ),
  period = 'all'
)