## Improved Variance Estimation for Fully Synthetic Datasets


**Improved Variance Estimation for Fully Synthetic Datasets**
UNECE Work Session on Statistical Data Confidentiality
27 October 2011, Tarragona
Jörg Drechsler, Institute for Employment Research

**Fully synthetic datasets**
• Originally proposed by Rubin (1993)
• Closely related to the idea of multiple imputation for nonresponse
• All values of the original dataset are replaced by synthetic values
• Offer a very high level of data protection
• Attractive for very sensitive data such as healthcare data

**Fully synthetic datasets in theory**
[Figure: schematic with blocks labeled X, Y not observed, Y observed, and several copies of Y synthetic]

**Fully synthetic datasets in practice**
• Based on the original design, the synthetic populations consist of a large number of synthetic records and a small number of original records.
• There is a small chance that the released samples from these populations also contain original records.
• The main advantage of fully synthetic datasets is then lost.
• In practice, the intermediate step of generating populations is omitted.
• Synthetic samples are generated directly.
• All records are synthetic.

**Combining rules for fully synthetic datasets**
• Raghunathan et al. (2003) developed the combining rules necessary to obtain valid inferences from fully synthetic datasets
• Let $q^{(i)}$ be the point estimate obtained from dataset $i$, $i = 1, \dots, m$
• Let $u^{(i)}$ be the estimated variance of $q^{(i)}$
• The following quantities are needed for inference:

$$\bar{q}_m = \frac{1}{m}\sum_{i=1}^{m} q^{(i)}, \qquad b_m = \frac{1}{m-1}\sum_{i=1}^{m} \left(q^{(i)} - \bar{q}_m\right)^2, \qquad \bar{u}_m = \frac{1}{m}\sum_{i=1}^{m} u^{(i)}$$

**Combining rules for fully synthetic datasets**
• Final point estimate: $\bar{q}_m$
• Final variance estimate: $T_f = \left(1 + \frac{1}{m}\right) b_m - \bar{u}_m$
• Two major disadvantages:
  • Variance estimate strictly valid only for the original synthesis design
  • Variance estimate can be negative
• Reiter (2003) suggested an adjusted variance estimate that is always positive but conservative

**Alternative variance estimate**
• Closely related to the variance estimate for partially synthetic datasets
• Only need to adjust for the potentially different sample sizes between the original sample and the synthetic sample; the adjustment involves the finite population correction factor for the original sample
• Advantages:
  • Can never be negative
  • Valid even if all records are synthesized
• Disadvantages:
  • Only valid for $\sqrt{n}$-consistent estimators
  • Only valid under simple random sampling

**Illustrative simulations**
• Repeated simulation design
• One standard normal variable
• Population size N = 10,000
• Repeatedly draw simple random samples (SRS) of different sizes (1%, 5%, 10%, 20%)
• Generate two versions of synthetic data with $n_{syn} = 2\,n_{org}$ and m = 5, 20, 100:
  • Based on the original synthesis design (RRR approach)
  • Synthesizing all records directly (practical approach)
• Quantity of interest: [expression missing in the source]
• Compute the original and the alternative variance estimates under both synthesis designs
• Replicate 5,000 times

**Conclusions**
• The originally proposed variance estimate can be biased if all records are synthesized and the sampling rate is larger than 1%.
• The alternative variance estimate
  • shows less variability than the original variance estimate
  • can never be negative
  • is always unbiased irrespective of the synthesis design
• The alternative variance estimate is valid only
  • for $\sqrt{n}$-consistent estimates
  • under simple random sampling
• Future work: Think about adjustments for complex sampling designs
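
The combining rules of Raghunathan et al. (2003) mentioned in the slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: it assumes the standard form of those rules (the average of the m point estimates as the final estimate, and a final variance equal to (1 + 1/m) times the between-dataset variance minus the average within-dataset variance). The function name `combine_fully_synthetic` and its arguments are hypothetical. The alternative variance estimate is not implemented because its formula is not given in the text.

```python
from statistics import mean, variance


def combine_fully_synthetic(q, u):
    """Combine estimates from m fully synthetic datasets.

    q: list of m point estimates, one per synthetic dataset
    u: list of m estimated variances of those point estimates
    Returns (q_bar, b_m, u_bar, t_f). Hypothetical helper for illustration.
    """
    m = len(q)
    if m < 2 or len(u) != m:
        raise ValueError("need at least two datasets and matching lengths")
    q_bar = mean(q)        # final point estimate
    b_m = variance(q)      # between-dataset variance (denominator m - 1)
    u_bar = mean(u)        # average within-dataset variance
    # Variance estimate in the assumed standard form; it can be negative.
    t_f = (1 + 1 / m) * b_m - u_bar
    return q_bar, b_m, u_bar, t_f


if __name__ == "__main__":
    # Made-up numbers: the point estimates vary little between datasets,
    # so the subtraction of u_bar pushes the variance estimate below zero.
    q_bar, b_m, u_bar, t_f = combine_fully_synthetic(
        [2.0, 2.1, 1.9], [0.5, 0.5, 0.5]
    )
    print(q_bar, b_m, u_bar, t_f, "negative" if t_f < 0 else "non-negative")
```

The made-up input in the usage example shows the second disadvantage listed in the slides: when the between-dataset variance is small relative to the average within-dataset variance, the subtraction yields a negative variance estimate.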