CDF {EFAfactors} | R Documentation |
The Comparison Data Forest (CDF) Approach
Description
The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines a Random Forest with the comparison data (CD) approach.
Usage
CDF(
response,
num.trees = 500,
mtry = "sqrt",
nfact.max = 10,
N.pop = 10000,
N.Samples = 500,
cor.type = "pearson",
use = "pairwise.complete.obs",
vis = TRUE,
plot = TRUE
)
Arguments
response |
A required matrix or data frame of item responses (the empirical data), with rows as respondents and columns as items. |
num.trees |
the number of trees in the Random Forest. (default = 500) See details. |
mtry |
The number of candidate features sampled at each split of the Random Forest; either a number or the character string "sqrt" (default), in which case the square root of the number of features is used. See details. |
nfact.max |
The maximum number of factors considered by the CD approach. (default = 10) |
N.pop |
Size of each simulated finite population. (default = 10,000) |
N.Samples |
Number of samples drawn from each population. (default = 500) |
cor.type |
A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See also cor. |
use |
An optional character string giving a method for computing covariances in the presence of missing values. Must be one of "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See also cor. |
vis |
A Boolean variable: the factor retention results are printed when TRUE and suppressed when FALSE. (default = TRUE) |
plot |
A Boolean variable: the CDF plot is drawn when TRUE and suppressed when FALSE. See also plot.CDF. (default = TRUE) |
Details
The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines a Random Forest with the comparison data (CD) approach. Its basic steps are to simulate data with different numbers of factors using the method of Ruscio & Roche (2012), extract features from these data to train a Random Forest model, and then use the trained model to predict the number of factors in the empirical data. The algorithm consists of the following steps:
1. **Simulating Data:**
- (1) For each value of nfact in the range from 1 to nfact_{max}, generate a population of data using the GenData function.
- (2) Each population is based on nfact factors and consists of N_{pop} observations.
- (3) For each generated population, repeat the following N_{rep} times; for the j-th of the N_{rep} repetitions: a. draw a sample N_{sam} from the population that matches the size of the empirical data; b. compute a feature set \mathbf{fea}_{nfact,j} from each N_{sam}.
- (4) Combine all the generated feature sets \mathbf{fea}_{nfact,j} into a data frame \mathbf{data}_{train,nfact}.
- (5) Combine all \mathbf{data}_{train,nfact} into a final data frame serving as the training dataset \mathbf{data}_{train}.
2. **Training RF:** Train a Random Forest model RF using the combined \mathbf{data}_{train}.
3. **Predicting for the Empirical Data:**
- (1) Calculate the feature set \mathbf{fea}_{emp} for the empirical data.
- (2) Use the trained Random Forest model RF to predict the number of factors nfact_{emp} for the empirical data: nfact_{emp} = RF(\mathbf{fea}_{emp})
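The three steps above can be sketched in base R. This is a heavily simplified illustration, not the package implementation: the GenData simulator and the 181-feature extractor are replaced here by a simple orthogonal factor-model simulator and eigenvalue features, and the Random Forest is replaced by a nearest-centroid classifier so that the sketch runs without additional packages. All names and sizes below are assumptions made for the sketch.

```r
# Simplified sketch of the CDF pipeline; NOT the package implementation.
set.seed(1)

n_items   <- 12   # I: number of items (assumption for this sketch)
n_emp     <- 200  # N: size of the "empirical" sample
nfact_max <- 3    # kept small so the sketch runs quickly
n_rep     <- 20   # stands in for N.Samples

# Step 1a: simulate a sample from an nfact-factor population (stand-in for GenData)
sim_sample <- function(nfact, n) {
  L <- matrix(0, n_items, nfact)
  for (f in seq_len(nfact))  # simple structure: each item loads on one factor
    L[((f - 1) * n_items / nfact + 1):(f * n_items / nfact), f] <- 0.7
  Fs <- matrix(rnorm(n * nfact), n, nfact)
  E  <- matrix(rnorm(n * n_items, sd = sqrt(1 - 0.7^2)), n, n_items)
  Fs %*% t(L) + E
}

# Step 1b: feature set from one sample -- here just the eigenvalues of the
# correlation matrix (the package uses a much richer 181-feature set)
features <- function(x) eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

# Steps 1c-1e: build the training set over nfact = 1..nfact_max
train_X <- NULL; train_y <- NULL
for (nfact in seq_len(nfact_max)) {
  for (j in seq_len(n_rep)) {
    train_X <- rbind(train_X, features(sim_sample(nfact, n_emp)))
    train_y <- c(train_y, nfact)
  }
}

# Step 2: "train" a classifier -- nearest centroid instead of a Random Forest
centroids <- t(sapply(seq_len(nfact_max),
                      function(k) colMeans(train_X[train_y == k, , drop = FALSE])))

# Step 3: predict nfact for empirical data (here: data simulated with 2 factors)
fea_emp   <- features(sim_sample(2, n_emp))
nfact_emp <- which.min(rowSums((centroids - matrix(fea_emp, nfact_max,
                                                   ncol(centroids), byrow = TRUE))^2))
nfact_emp
```

The sketch recovers the simulated number of factors because the eigenvalue profiles of 1-, 2-, and 3-factor populations are well separated; the actual CDF replaces the nearest-centroid step with a Random Forest over far richer features.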
According to Goretzko & Ruscio (2024) and Breiman (2001), the number of
trees in the Random Forest num.trees
is recommended to be 500.
The Random Forest in CDF performs a classification task, so the recommended number of candidate features at each split, mtry, is \sqrt{q} (where q is the number of features), which results in m_{try} = \lfloor\sqrt{181}\rfloor = 13.
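As a quick check of the rule above, this is an illustrative computation (not package code), assuming the square root is floored to a whole number of features as is usual for classification forests:

```r
# Illustration of the default mtry rule: square root of the number of
# features, floored to a whole number of candidate features per split.
q    <- 181             # number of features used by CDF
mtry <- floor(sqrt(q))  # default when mtry = "sqrt"
mtry
```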
Since the CDF approach requires extensive data simulation and computation, it is much more time-consuming than the CD approach; C++ code is therefore used to speed up the process.
Value
An object of class CDF
is a list
containing the following components:
nfact |
The number of factors to be retained. |
RF |
The trained Random Forest model. |
probability |
A matrix (1 × nfact.max) containing the probabilities for factor numbers ranging from 1 to nfact.max; the value in the f-th column is the probability that the number of factors for the response is f. |
features |
A matrix (1 × 181) containing all the features used to determine the number of factors. See also extractor.feature.FF. |
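To illustrate how these components relate, here is a mock example; the probability values below are invented for illustration, not real output. The retained nfact corresponds to the column of probability with the highest value:

```r
# Mock (invented) probability matrix shaped like the CDF output (1 x nfact.max)
nfact.max   <- 10
probability <- matrix(c(0.02, 0.05, 0.70, 0.10, 0.05,
                        0.03, 0.02, 0.01, 0.01, 0.01),
                      nrow = 1, ncol = nfact.max)

# The retained number of factors is the class with the highest probability
nfact <- which.max(probability[1, ])
nfact
```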
Author(s)
Haijiang Qin <Haijiang133@outlook.com>
References
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
Goretzko, D., & Ruscio, J. (2024). The comparison data forest: A new comparison data approach to determine the number of factors in exploratory factor analysis. Behavior Research Methods, 56(3), 1838-1851. https://doi.org/10.3758/s13428-023-02122-4
Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282–292. http://dx.doi.org/10.1037/a0025697.