method_nn {nonprobsvy} | R Documentation |
Mass imputation using nearest neighbours matching method
Description
Mass imputation using the nearest neighbours approach described in Yang et al. (2021).
The implementation is currently based on the RANN::nn2 function and thus uses Euclidean distance for matching units from S_A (non-probability) to S_B (probability).
The mean is estimated using the S_B sample.
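The matching idea can be illustrated with a minimal base R sketch. It is not the package's implementation: the data are simulated, the helper `impute_one` is hypothetical, and the neighbour search is a brute-force Euclidean scan rather than the tree-based search of RANN::nn2.

```r
# Brute-force k-nearest-neighbour mass imputation (illustrative, base R only)
set.seed(1)
n_A <- 200; n_B <- 50; N <- 10000; k <- 5

# Non-probability sample S_A: covariates and observed target variable (simulated)
X_A <- cbind(x1 = rnorm(n_A), x2 = rnorm(n_A))
y_A <- 1 + 2 * X_A[, "x1"] - X_A[, "x2"] + rnorm(n_A, sd = 0.5)

# Probability sample S_B: covariates and design weights only (simulated)
X_B <- cbind(x1 = rnorm(n_B), x2 = rnorm(n_B))
d_B <- rep(N / n_B, n_B)

# Hypothetical helper: average y over the k closest donors of a single unit x
impute_one <- function(x, X_donor, y_donor, k) {
  dist2 <- rowSums(sweep(X_donor, 2, x)^2)  # squared Euclidean distances
  mean(y_donor[order(dist2)[seq_len(k)]])
}

# Impute y for every unit in S_B using donors from S_A
y_B_star <- apply(X_B, 1, impute_one, X_donor = X_A, y_donor = y_A, k = k)

# Design-weighted mean over S_B based on the imputed values
mu_hat <- sum(d_B * y_B_star) / N
mu_hat
```

Because each imputed value is an average of observed donor values, it always lies within the range of `y_A`.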
Usage
method_nn(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = NULL,
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
Arguments
y_nons |
target variable from non-probability sample |
X_nons |
a |
X_rand |
a |
svydesign |
a svydesign object |
weights |
case / frequency weights from non-probability sample |
family_outcome |
a placeholder (not used in |
start_outcome |
a placeholder (not used in |
vars_selection |
whether variable selection should be conducted |
pop_totals |
a placeholder (not used in |
pop_size |
population size from the |
control_outcome |
controls passed by the |
control_inference |
controls passed by the |
verbose |
parameter passed from the main |
se |
whether standard errors should be calculated |
Details
Analytical variance
The variance of the mean is estimated using the following approach.
(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)
This may be estimated using
\hat{V}_1 = \frac{1}{N^2}\sum_{i=1}^{n_A}\frac{1-\hat{\pi}_B(\boldsymbol{x}_i)}{\hat{\pi}_B(\boldsymbol{x}_i)}\hat{\sigma}^2(\boldsymbol{x}_i),
where \hat{\pi}_B(\boldsymbol{x}_i) is an estimator of the propensity scores, which we currently estimate using n_A/N (constant), and \hat{\sigma}^2(\boldsymbol{x}_i) is estimated using the average of (y_i - y_i^*)^2.
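With a constant propensity estimate and a single averaged \hat{\sigma}^2, every term of the sum is identical, so the estimator collapses to a closed form. The sketch below checks this numerically; the vectors `y` and `y_star` are made-up stand-ins for the observed and matched values, not output of the package.

```r
set.seed(2)
N <- 10000; n_A <- 200

y      <- rnorm(n_A, mean = 10)     # observed values in S_A (simulated)
y_star <- y + rnorm(n_A, sd = 0.3)  # stand-in for the matched predictions

pi_B   <- n_A / N                   # constant propensity estimate, as in the text
sigma2 <- mean((y - y_star)^2)      # average squared difference

# Sum over the n_A units, each contributing the same term
V1_sum <- sum(rep((1 - pi_B) / pi_B * sigma2, n_A)) / N^2

# Equivalent closed form
V1_closed <- n_A * (1 - pi_B) / pi_B * sigma2 / N^2

c(V1_sum = V1_sum, V1_closed = V1_closed)
```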
Chlebicki et al. (2025, Algorithm 2) proposed a non-parametric mini-bootstrap estimator (without assuming that it is consistent) but with good finite population properties.
This bootstrap can be applied using control_inference = control_inf(nn_exact_se = TRUE) and can be summarized as follows:
1. Sample n_A units from S_A with replacement to create S_A' (if pseudo-weights are present, inclusion probabilities should be proportional to their inverses).
2. Match units from S_B to S_A' to obtain predictions y^* = {k}^{-1}\sum_{k}y_k.
3. Estimate \hat{\mu}=\frac{1}{N} \sum_{i \in S_B} d_i y_i^*.
4. Repeat steps 1-3 M times (we set M=50 in our simulations; this is hard-coded).
5. Estimate \hat{V}_1=\text{var}({\hat{\boldsymbol{\mu}}}) obtained from simulations and save it as var_nonprob.
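The bootstrap steps above can be sketched in base R as follows. This is an illustration under simplifying assumptions, not the package's code: the data are simulated, no pseudo-weights are present (so resampling is with equal probabilities), and the helper `knn_mean` is a hypothetical brute-force substitute for RANN::nn2.

```r
set.seed(3)
N <- 10000; n_A <- 200; n_B <- 50; k <- 5; M <- 50

# Simulated non-probability (S_A) and probability (S_B) samples
X_A <- matrix(rnorm(n_A * 2), ncol = 2)
y_A <- 1 + 2 * X_A[, 1] - X_A[, 2] + rnorm(n_A, sd = 0.5)
X_B <- matrix(rnorm(n_B * 2), ncol = 2)
d_B <- rep(N / n_B, n_B)

# Hypothetical helper: mean of y over the k nearest donors of unit x
knn_mean <- function(x, X_donor, y_donor, k) {
  dist2 <- rowSums(sweep(X_donor, 2, x)^2)
  mean(y_donor[order(dist2)[seq_len(k)]])
}

# Step 4: repeat steps 1-3 M times
mu_boot <- replicate(M, {
  idx <- sample.int(n_A, n_A, replace = TRUE)       # step 1: resample S_A
  y_star <- apply(X_B, 1, knn_mean,
                  X_donor = X_A[idx, , drop = FALSE],
                  y_donor = y_A[idx], k = k)         # step 2: match S_B to S_A'
  sum(d_B * y_star) / N                              # step 3: estimate the mean
})

# Step 5: variance of the bootstrap means
var_nonprob <- var(mu_boot)
var_nonprob
```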
(b) probability part (S_B with size n_B; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following equation:
\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j},
where y^*_i and y_j^* are values imputed as an average of the k nearest neighbours, i.e. {k}^{-1}\sum_{k}y_k.
Note that \hat{V}_2 can, in principle, be estimated in various ways depending on the type of design and whether the population size is known.
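To make the double sum concrete, the sketch below evaluates it for one special case chosen purely for illustration: independent (Poisson) selection, where \pi_{ij} = \pi_i \pi_j for i \neq j and \pi_{ii} = \pi_i. The off-diagonal terms then vanish and the estimator reduces to a single sum. The inclusion probabilities and imputed values are simulated; the package itself delegates this computation to {survey}, which applies the estimator appropriate to the supplied design.

```r
set.seed(4)
N <- 10000; n_B <- 50

pi_i   <- runif(n_B, 0.002, 0.01)  # first-order inclusion probabilities (simulated)
y_star <- rnorm(n_B, mean = 10)    # stand-in for the imputed values

# Joint inclusion probabilities under independent (Poisson) selection
pi_ij <- outer(pi_i, pi_i)
diag(pi_ij) <- pi_i

# Double-sum form of the variance estimator
z <- y_star / pi_i
V2_double <- sum((pi_ij - outer(pi_i, pi_i)) / pi_ij * outer(z, z)) / N^2

# Closed form for this special case: only the diagonal terms remain
V2_closed <- sum((1 - pi_i) * z^2) / N^2

c(V2_double = V2_double, V2_closed = V2_closed)
```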
Value
a nonprob_method class object, which is a list with the following entries
- model_fitted
RANN::nn2 object
- y_nons_pred
predicted values for the non-probability sample (query to itself)
- y_rand_pred
predicted values for the probability sample
- coefficients
coefficients for the model (if available)
- svydesign
an updated surveydesign2 object (new column y_hat_MI is added)
- y_mi_hat
estimated population mean for the target variable
- vars_selection
whether variable selection was performed (not implemented, for further development)
- var_prob
variance for the probability sample component (if available)
- var_nonprob
variance for the non-probability sample component
- var_tot
total variance; if possible it should be var_prob+var_nonprob, if not, just a scalar
- model
model type (character "nn")
- family
placeholder for the NN approach information
References
Yang, S., Kim, J. K., & Hwang, Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, 47(1), 29-58.
Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.
Examples
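library(nonprobsvy)
library(survey)  # provides svydesign()

data(admin)
data(jvs)
# Stratified design object for the probability sample (jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)
# Nearest neighbour mass imputation of single_shift from admin to jvs
res_nn <- method_nn(y_nons = admin$single_shift,
                    X_nons = model.matrix(~ region + private + nace + size, admin),
                    X_rand = model.matrix(~ region + private + nace + size, jvs),
                    svydesign = jvs_svy)
res_nn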