get_valid_subset {clinpubr} | R Documentation |
Get the subset that satisfies the missing rate condition.
Description
Get the subset of a data frame that satisfies the missing rate condition using a greedy algorithm.
Usage
get_valid_subset(
df,
row_na_ratio = 0.5,
col_na_ratio = 0.2,
row_priority = 1,
speedup_ratio = 0,
return_index = FALSE
)
Arguments
df |
A data frame. |
row_na_ratio |
The maximum acceptable missing rate of rows. |
col_na_ratio |
The maximum acceptable missing rate of columns. |
row_priority |
A positive numeric value giving the priority for keeping rows. The higher the value, the higher the priority,
with |
speedup_ratio |
A non-negative numeric value, the ratio of speedup. The higher the value, the greedier the algorithm. |
return_index |
A logical, whether to return only the row and column indices of the subset. |
Details
The function is based on a greedy algorithm. It iteratively removes the row or column with
the highest excess missing rate, weighted by the inverse of row_priority, until the missing rates
of all rows and columns are within the specified thresholds. It then works in reverse, trying to add
back removed rows and columns that do not break the conditions, and finalizes the subset. The result
depends heavily on the row_priority parameter, so it is recommended to try several row_priority
values and keep the most satisfactory one.
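The remove-then-add-back idea described above can be illustrated with a minimal base R sketch. This is not the clinpubr implementation: the helper name greedy_na_filter_sketch is hypothetical, speedup_ratio is ignored, and dividing the row excess by row_priority is only one plausible reading of "weighted by the inverse of row_priority".

```r
# Hypothetical helper, NOT part of clinpubr: a simplified greedy NA filter.
greedy_na_filter_sketch <- function(df, row_na_ratio = 0.5, col_na_ratio = 0.2,
                                    row_priority = 1) {
  na_mat <- is.na(df)            # logical matrix, TRUE where a value is missing
  rows <- seq_len(nrow(df))
  cols <- seq_len(ncol(df))

  # Removal phase: drop the worst offending row or column until all rates fit.
  repeat {
    if (length(rows) == 0 || length(cols) == 0) break
    sub <- na_mat[rows, cols, drop = FALSE]
    row_excess <- rowMeans(sub) - row_na_ratio
    col_excess <- colMeans(sub) - col_na_ratio
    if (max(row_excess) <= 0 && max(col_excess) <= 0) break
    # Assumption: a larger row_priority shrinks the row excess, protecting rows.
    if (max(row_excess) / row_priority >= max(col_excess)) {
      rows <- rows[-which.max(row_excess)]
    } else {
      cols <- cols[-which.max(col_excess)]
    }
  }
  if (length(rows) == 0 || length(cols) == 0) {
    return(df[rows, cols, drop = FALSE])
  }

  # Check whether a candidate set of rows and columns meets both thresholds.
  ok <- function(r, k) {
    sub <- na_mat[r, k, drop = FALSE]
    all(rowMeans(sub) <= row_na_ratio) && all(colMeans(sub) <= col_na_ratio)
  }

  # Add-back phase: restore removed columns, then rows, when conditions still hold.
  for (j in setdiff(seq_len(ncol(df)), cols)) {
    if (ok(rows, sort(c(cols, j)))) cols <- sort(c(cols, j))
  }
  for (i in setdiff(seq_len(nrow(df)), rows)) {
    if (ok(sort(c(rows, i)), cols)) rows <- sort(c(rows, i))
  }
  df[rows, cols, drop = FALSE]
}

# Toy data made up for illustration.
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(NA, NA, NA, 1, NA),
  c = c(1, 2, NA, 4, NA),
  d = c(1, 2, 3, 4, 5)
)
res <- greedy_na_filter_sketch(toy, row_na_ratio = 0.3, col_na_ratio = 0.2)
dim(res)
```

In this sketch each removal only looks one step ahead, which is why the add-back pass is useful: a row or column dropped early may become acceptable again once other offenders are gone.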
Value
The subset data frame, or, if return_index is TRUE, a list that contains the row and column indices of the subset.
Examples
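library(clinpubr)
data(cancer, package = "survival")

# Size and worst-case missing rates before filtering
dim(cancer)
max_missing_rates(cancer)

# Keep rows with at most 20% missing values and columns with at most 10%
cancer_valid <- get_valid_subset(
  cancer,
  row_na_ratio = 0.2, col_na_ratio = 0.1, row_priority = 1
)

# Size and worst-case missing rates after filtering
dim(cancer_valid)
max_missing_rates(cancer_valid)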
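Because the result depends heavily on row_priority, it may help to compare a few values side by side. The following is a sketch using only the arguments documented above; the candidate values 0.5, 1 and 2 are arbitrary choices for illustration, and the structure of the index list is inspected with str() rather than assumed.

```r
library(clinpubr)
data(cancer, package = "survival")

# Compare subset sizes for a few arbitrary row_priority values.
for (p in c(0.5, 1, 2)) {
  sub <- get_valid_subset(
    cancer,
    row_na_ratio = 0.2, col_na_ratio = 0.1, row_priority = p
  )
  cat("row_priority =", p, "->", nrow(sub), "rows,", ncol(sub), "columns\n")
}

# Ask for the indices instead of the data frame and inspect what comes back.
idx <- get_valid_subset(
  cancer,
  row_na_ratio = 0.2, col_na_ratio = 0.1, return_index = TRUE
)
str(idx)
```

Which trade-off between rows and columns is preferable depends on the analysis, so the printed sizes are meant as a starting point for choosing row_priority rather than a rule.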