GenData {EFAfactors} | R Documentation |
Simulating Data Following John Ruscio's RGenData
Description
This function simulates data with nfact factors based on empirical data.
It implements the data-simulation step of the CD function and the CDF function,
and it improves upon GenDataPopulation by using C++ code for faster data simulation.
Usage
GenData(
response,
nfact = 1,
N.pop = 10000,
Max.Trials = 5,
lr = 1,
cor.type = "pearson",
use = "pairwise.complete.obs",
isSort = FALSE
)
Arguments
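response |
A required matrix or data.frame of empirical responses, with one row per respondent and one column for each of the I items. |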
nfact |
The number of factors to extract in factor analysis. (default = 1) |
N.pop |
Size of the finite population to simulate. (default = 10,000) |
Max.Trials |
The maximum number of consecutive trials without obtaining a lower RMSR. (default = 5) |
lr |
The learning rate for updating the correlation matrix during iteration. (default = 1) |
cor.type |
A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See cor. |
use |
An optional character string specifying a method for computing covariances in the presence of missing values. This must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See cor. |
isSort |
Logical; whether the simulated data should be sorted in descending order. (default = FALSE) |
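As a rough illustration of the cor.type and use options, the base R sketch below computes correlation matrices with stats::cor on a small synthetic matrix. It assumes that cor.type plays the role of the method argument of cor and that use is passed on unchanged. The argument descriptions above suggest this by pointing to cor, but the sketch does not show the package's internal call. The response matrix here is made up for the example.

```r
# Small synthetic response matrix with one missing value (illustration only).
set.seed(42)
response <- matrix(sample(1:5, 60, replace = TRUE), nrow = 20, ncol = 3)
response[2, 1] <- NA

# Assumption: cor.type corresponds to `method` and use to `use` in stats::cor().
R_pearson  <- cor(response, method = "pearson",  use = "pairwise.complete.obs")
R_spearman <- cor(response, method = "spearman", use = "pairwise.complete.obs")

# With use = "everything", the missing value propagates as NA into every
# correlation that involves column 1.
R_everything <- cor(response, method = "pearson", use = "everything")
```

With "pairwise.complete.obs", each correlation uses all rows where both variables are observed, so a single missing response does not remove the whole row from the other correlations.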
Details
The core idea of GenData is to start with the empirical data's correlation matrix
and iteratively approach data with nfact factors. Any value in the simulated data
must come from the empirical data. The specific steps of GenData are as follows:
- (1) Use the correlation matrix of the empirical data (\mathbf{Y}_{emp}) as the target, \mathbf{R}_{targ}.
- (2) Simulate scores for N.pop examinees on nfact factors using a multivariate standard normal distribution:
  \mathbf{S}_{(N.pop \times nfact)} \sim \mathcal{N}(0, 1)
  Simulate noise for N.pop examinees on I items:
  \mathbf{U}_{(N.pop \times I)} \sim \mathcal{N}(0, 1)
- (3) Initialize \mathbf{R}_{temp} = \mathbf{R}_{targ}, and set the minimum Root Mean Square Residual RMSR_{min} = \text{Inf}. Start the iteration process.
- (4) Extract nfact factors from \mathbf{R}_{temp} and obtain the factor loadings matrix \mathbf{L}_{shar}. Ensure that the first element of \mathbf{L}_{shar} is positive to standardize the direction.
- (5) Calculate the unique factor matrix \mathbf{L}_{uniq, (I \times 1)}, where i indexes items and j indexes factors:
  L_{uniq, i} = \sqrt{1 - \sum_{j=1}^{nfact} L_{shar, i, j}^2}
- (6) Calculate the simulated data \mathbf{Y}_{sim}, where i indexes examinees and j indexes items:
  Y_{sim, i, j} = \mathbf{S}_{i} \mathbf{L}_{shar, j}^T + U_{i, j} L_{uniq, j}
- (7) Compute the correlation matrix of the simulated data, \mathbf{R}_{simu}.
- (8) Calculate the residual correlation matrix \mathbf{R}_{resi} between the target matrix \mathbf{R}_{targ} and the simulated data's correlation matrix \mathbf{R}_{simu}:
  \mathbf{R}_{resi} = \mathbf{R}_{targ} - \mathbf{R}_{simu}
- (9) Calculate the current RMSR:
  RMSR_{cur} = \sqrt{\frac{\sum_{i < j} R_{resi, i, j}^2}{0.5 \times (I^2 - I)}}
- (10) If RMSR_{cur} < RMSR_{min}, update \mathbf{R}_{temp} = \mathbf{R}_{temp} + lr \times \mathbf{R}_{resi}, set RMSR_{min} = RMSR_{cur} and \mathbf{R}_{min, resi} = \mathbf{R}_{resi}, and reset the count of consecutive trials without improvement, cou = 0.
  If RMSR_{cur} \geq RMSR_{min}, update \mathbf{R}_{temp} = \mathbf{R}_{temp} + 0.5 \times cou \times lr \times \mathbf{R}_{min, resi} and increment cou = cou + 1.
- (11) Repeat steps (4) through (10) until cou \geq Max.Trials.
The iteration is implemented in C++ for speed.
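The steps above can be sketched in base R as follows. This is a minimal illustration written from the description, not the package's C++ implementation. The function name gen_data_sketch is hypothetical. The loadings are taken from an eigendecomposition of \mathbf{R}_{temp} (principal-component extraction), which is an assumption because the steps do not name the extraction method. Clamping negative eigenvalues and uniquenesses at zero, and keeping the best-fitting \mathbf{Y}_{sim} as the result, are also assumptions. The sketch returns continuous scores and leaves out the mapping of simulated values back onto the empirical values.

```r
# Illustrative sketch of steps (1)-(11); names and extraction method are assumptions.
gen_data_sketch <- function(response, nfact = 1, N.pop = 10000,
                            Max.Trials = 5, lr = 1) {
  I <- ncol(response)
  R_targ <- cor(response, use = "pairwise.complete.obs")       # step (1)
  S <- matrix(rnorm(N.pop * nfact), N.pop, nfact)               # step (2): factor scores
  U <- matrix(rnorm(N.pop * I), N.pop, I)                       # step (2): noise
  R_temp <- R_targ                                              # step (3)
  RMSR_min <- Inf
  R_min_resi <- matrix(0, I, I)
  Y_best <- NULL
  cou <- 0
  while (cou < Max.Trials) {                                    # step (11)
    # Step (4): loadings from the leading eigenpairs (assumed extraction method).
    e <- eigen(R_temp, symmetric = TRUE)
    vals <- pmax(e$values[seq_len(nfact)], 0)
    L_shar <- e$vectors[, seq_len(nfact), drop = FALSE] %*% diag(sqrt(vals), nfact)
    if (L_shar[1, 1] < 0) L_shar <- -L_shar                     # fix the direction
    # Step (5): unique loadings, clamped at zero to avoid sqrt of a negative.
    L_uniq <- sqrt(pmax(1 - rowSums(L_shar^2), 0))
    # Step (6): common part plus item-specific noise scaled per item (column).
    Y_sim <- S %*% t(L_shar) + sweep(U, 2, L_uniq, `*`)
    R_simu <- cor(Y_sim)                                        # step (7)
    R_resi <- R_targ - R_simu                                   # step (8)
    # Step (9): RMSR over the I * (I - 1) / 2 off-diagonal pairs.
    RMSR_cur <- sqrt(sum(R_resi[upper.tri(R_resi)]^2) / (0.5 * (I^2 - I)))
    if (RMSR_cur < RMSR_min) {                                  # step (10): improvement
      R_temp <- R_temp + lr * R_resi
      RMSR_min <- RMSR_cur
      R_min_resi <- R_resi
      Y_best <- Y_sim
      cou <- 0
    } else {                                                    # step (10): no improvement
      R_temp <- R_temp + 0.5 * cou * lr * R_min_resi
      cou <- cou + 1
    }
  }
  list(Y = Y_best, RMSR = RMSR_min)
}

# Usage on synthetic one-factor data (made up for the example).
set.seed(1)
f <- rnorm(300)
resp <- sapply(1:6, function(k) 0.7 * f + rnorm(300, sd = 0.7))
out <- gen_data_sketch(resp, nfact = 1, N.pop = 2000)
```

Because \mathbf{S} and \mathbf{U} are drawn once and reused, each iteration changes only the loadings, so any change in RMSR reflects the adjustment of \mathbf{R}_{temp} and not new sampling noise.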
Value
An N.pop × I matrix containing the simulated data.
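A hedged usage sketch follows. It builds a synthetic 5-point response matrix (made up for the example, not a dataset shipped with the package) and calls GenData only if the EFAfactors package is installed. The two checks printed at the end are based on the statements in this page: the result should have N.pop rows and I columns, and every simulated value should come from the empirical data.

```r
# Synthetic one-factor responses on a 1-5 scale (illustration only).
set.seed(123)
N <- 500
I <- 6
f <- rnorm(N)
latent <- sapply(seq_len(I), function(k) 0.7 * f + rnorm(N, sd = 0.7))
response <- apply(latent, 2, function(x) as.numeric(cut(x, breaks = 5)))

# Run only when the package is available, so the snippet is safe to source.
if (requireNamespace("EFAfactors", quietly = TRUE)) {
  sim <- EFAfactors::GenData(response, nfact = 1, N.pop = 1000)
  print(dim(sim))                # dimensions of the simulated data
  print(all(sim %in% response))  # whether all values occur in the empirical data
}
```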
References
Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282–292. http://dx.doi.org/10.1037/a0025697.