Keeping Data Confidential in an Era of No Privacy

Prof. Jerry Reiter, Department of Statistical Science, Duke University. Disclosure limitation setting: an agency seeks to release data on individuals, with a risk of re-identification from matching to external databases.

Presentation Transcript


  1. Keeping Data Confidential in an Era of No Privacy
     Prof. Jerry Reiter, Department of Statistical Science, Duke University

  2. Disclosure limitation setting
     • Agency seeks to release data on individuals
     • Risk of re-identification from matching to external databases
     • Statistical disclosure limitation (SDL) applied to data before release

  3. Standard approaches to disclosure limitation
     • Recode variables
     • Suppress data
     • Swap data
     • Add random noise

  4. General issues with standard SDL
     • Recoding
        • Loses information in tails, disables fine spatial analysis, creates ecological fallacies
     • Suppression
        • Creates nonignorable missing data
        • May not be fully protective

  5. General issues with standard SDL
     • Swapping
        • Attenuates correlations
        • Protection based on perception
     • Adding noise
        • Inflates variances, distorts distributions, attenuates correlations
        • May need large noise variances
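The effects listed for swapping and noise addition can be seen in a small simulation. The sketch below is illustrative only: the simulated data, the noise standard deviation of 1.0, and the 30% swap rate are assumptions chosen for the example, not values from the presentation.

```python
import random


def variance(values):
    """Sample variance with divisor n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def add_noise(values, noise_sd, rng):
    """Add independent zero-mean normal noise to every value."""
    return [v + rng.gauss(0, noise_sd) for v in values]


def swap_values(values, fraction, rng):
    """Permute the values among a randomly chosen fraction of records."""
    chosen = rng.sample(range(len(values)), k=int(fraction * len(values)))
    sources = chosen[:]
    rng.shuffle(sources)
    swapped = values[:]
    for target, source in zip(chosen, sources):
        swapped[target] = values[source]
    return swapped


rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * a + rng.gauss(0, 0.6) for a in x]  # simulated "confidential" variable

y_noisy = add_noise(y, noise_sd=1.0, rng=rng)
y_swapped = swap_values(y, fraction=0.3, rng=rng)

print("variance    original / noisy / swapped:",
      variance(y), variance(y_noisy), variance(y_swapped))
print("correlation original / noisy / swapped:",
      correlation(x, y), correlation(x, y_noisy), correlation(x, y_swapped))
```

Swapping leaves the marginal distribution of y untouched, so its variance is unchanged, but the link to x weakens. Noise weakens the link and also inflates the variance.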

  6. Fully synthetic data
     Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:
     • No unit in released data has sensitive data from an actual unit in the population
     • Released data look like actual data
     • Statistical procedures valid for original data are valid for released data

  7. Generating fully synthetic data
     • Randomly sample new units from frame (can use simple random samples)
     • Impute survey variables for new units using models fit from observed data
     • Repeat multiple times and release m datasets
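The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by any agency: there is a single frame variable x and a single survey variable y, the synthesis model is a simple linear regression, and the function names `fit_linear` and `synthesize_fully` are hypothetical. For brevity the sketch plugs in the fitted parameters; a proper synthesizer would also draw the model parameters from their posterior distribution before imputing.

```python
import random


def fit_linear(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, sigma)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid_ss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sigma = (resid_ss / (n - 2)) ** 0.5
    return intercept, slope, sigma


def synthesize_fully(frame_x, obs_x, obs_y, n_syn, m, rng):
    """Release m datasets, each a simple random sample of frame units with imputed y."""
    intercept, slope, sigma = fit_linear(obs_x, obs_y)
    datasets = []
    for _ in range(m):
        new_x = rng.sample(frame_x, k=n_syn)  # step 1: sample new units from the frame
        new_y = [intercept + slope * a + rng.gauss(0, sigma) for a in new_x]  # step 2: impute
        datasets.append(list(zip(new_x, new_y)))
    return datasets  # step 3: m synthetic datasets


rng = random.Random(1)
frame_x = [rng.uniform(0, 10) for _ in range(2000)]  # x is known for every frame unit
sample_idx = rng.sample(range(len(frame_x)), k=200)
obs_x = [frame_x[i] for i in sample_idx]
obs_y = [2.0 + 1.5 * a + rng.gauss(0, 1.0) for a in obs_x]  # y is observed only in the survey

synthetic = synthesize_fully(frame_x, obs_x, obs_y, n_syn=200, m=5, rng=rng)
print(len(synthetic), "datasets of", len(synthetic[0]), "records each")
```

Every released y value comes from the model, so no record carries a respondent's actual sensitive value.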

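  8. Inferences from fully synthetic datasets
     Raghunathan, Reiter, Rubin (2003, Journal of Official Statistics)
     • Estimand: $Q = Q(X, Y)$
     • In each synthetic dataset $d_i$, $i = 1, \dots, m$, compute the point estimate $q_i$ of $Q$ and the estimate $u_i$ of its variance

  9. Quantities needed for inferences
     • $\bar{q}_m = \sum_{i=1}^{m} q_i / m$
     • $b_m = \sum_{i=1}^{m} (q_i - \bar{q}_m)^2 / (m - 1)$
     • $\bar{u}_m = \sum_{i=1}^{m} u_i / m$

  10. Inferences from fully synthetic data
     • Estimate of $Q$: $\bar{q}_m$
     • Estimate of variance is $T_f = (1 + 1/m)\, b_m - \bar{u}_m$
     • For large n, s, m, use normal based inference for $Q$: $(Q - \bar{q}_m) \sim N(0, T_f)$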
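As a rough sketch of how an analyst might apply the combining rules of Raghunathan, Reiter, Rubin (2003) for fully synthetic data: given the point estimates q_i and variance estimates u_i from the m datasets, the point estimate is their mean, and the variance estimate is T_f = (1 + 1/m) b_m - u_bar, where b_m is the between-dataset variance of the q_i and u_bar is the mean of the u_i. The function name `combine_fully_synthetic` is hypothetical. T_f can be negative, and raising an error in that case is an assumption of this sketch rather than a recommendation from the presentation.

```python
from statistics import NormalDist, mean, variance


def combine_fully_synthetic(q, u, level=0.95):
    """Combine per-dataset estimates q and variance estimates u from m fully synthetic datasets."""
    m = len(q)
    q_bar = mean(q)      # point estimate of Q
    b_m = variance(q)    # between-dataset variance, divisor m - 1
    u_bar = mean(u)      # average within-dataset variance
    t_f = (1 + 1 / m) * b_m - u_bar
    if t_f <= 0:
        # Assumption of this sketch: refuse to build an interval from a non-positive variance.
        raise ValueError("variance estimate T_f is not positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * t_f ** 0.5
    return q_bar, t_f, (q_bar - half_width, q_bar + half_width)


q_bar, t_f, interval = combine_fully_synthetic([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
print(q_bar, t_f, interval)
```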

  11. Advantages of full synthesis
     • No sensitive data released: very high protection
     • No need to decide which values to alter nor which variables are quasi-identifiers
     • Potential to preserve associations, maintain geographies, release data in tails
     • Analysts can use standard methods on simple random samples
     • Protection does not depend on hiding the nature of the SDL from the public

  12. Drawbacks of full synthesis
     • Analysts have to deal with multiple datasets (not a serious issue)
     • Quality of data highly dependent on quality of synthesis models
        • Relationships omitted in models are not in released data
        • Inaccurate distributions are passed on to analysts
        • Analysts can only rediscover what is in the synthesis models

  13. A modification of the proposal: Partially synthetic data
     Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:
     • Released data comprise mix of observed and synthetic values
     • Released data look like actual data
     • Statistical procedures valid for original data are valid for released data
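A minimal sketch of the "mix of observed and synthetic values" idea, under assumptions chosen only for illustration: one predictor x, one sensitive variable y, a simple linear regression as the synthesis model, and replacement of y only for the records in the top 10% of y. The names `fit_linear` and `synthesize_partially` are hypothetical, and a linear model fit to all records is a crude choice for tail values.

```python
import random


def fit_linear(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, sigma)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid_ss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sigma = (resid_ss / (n - 2)) ** 0.5
    return intercept, slope, sigma


def synthesize_partially(x, y, is_sensitive, m, rng):
    """Keep the collected units; replace y with a model draw only where is_sensitive(x, y) holds."""
    intercept, slope, sigma = fit_linear(x, y)
    datasets = []
    for _ in range(m):
        new_y = [
            intercept + slope * a + rng.gauss(0, sigma) if is_sensitive(a, b) else b
            for a, b in zip(x, y)
        ]
        datasets.append(list(zip(x, new_y)))
    return datasets


rng = random.Random(2)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [2.0 + 1.5 * a + rng.gauss(0, 1.0) for a in x]
cutoff = sorted(y)[int(0.9 * len(y))]  # assumed rule: the top 10% of y is at risk

partial = synthesize_partially(x, y, lambda a, b: b >= cutoff, m=3, rng=rng)
replaced = sum(1 for (_, new), old in zip(partial[0], y) if new != old)
print(replaced, "of", len(y), "values replaced in each dataset")
```

Unlike full synthesis, the released records are the collected units, so the original design is retained and only the selected values depend on the model.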

  14-16. [Figure: observed data with columns x, y shown alongside the synthetic datasets, each with columns x, y]

  17. Existing applications
     • Replace sensitive values for selected units:
        • Survey of Consumer Finances
        • County-to-county migration flows (current)
     • Replace values of identifiers for selected units:
        • American Community Survey group quarters
        • Tract IDs for NCI SEER cancer registry data
     • Replace all values of sensitive variables:
        • Longitudinal Business Database
        • OnTheMap
        • Survey of Income and Program Participation

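  18. Inference with partially synthetic datasets (no missing data)
     Reiter (2003, Survey Methodology)
     • Estimand: $Q = Q(X, Y)$
     • In each synthetic dataset $d_i$, $i = 1, \dots, m$, compute the point estimate $q_i$ of $Q$ and the estimate $u_i$ of its variance

  19. Inference with partially synthetic data (no missing data)
     • Estimate of $Q$: $\bar{q}_m = \sum_{i=1}^{m} q_i / m$
     • Estimate of variance is $T_p = b_m / m + \bar{u}_m$, where $b_m = \sum_{i=1}^{m} (q_i - \bar{q}_m)^2 / (m - 1)$ and $\bar{u}_m = \sum_{i=1}^{m} u_i / m$
     • For large n and m, use normal based inference for $Q$: $(Q - \bar{q}_m) \sim N(0, T_p)$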
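The combining rules of Reiter (2003) for partially synthetic data differ from the fully synthetic case only in the variance estimate: T_p = b_m / m + u_bar, where b_m is the between-dataset variance of the point estimates q_i and u_bar is the mean of the variance estimates u_i. A small sketch follows. The function name `combine_partially_synthetic` is hypothetical, and the normal-based interval follows the large-sample statement on the slide.

```python
from statistics import NormalDist, mean, variance


def combine_partially_synthetic(q, u, level=0.95):
    """Combine per-dataset estimates q and variance estimates u from m partially synthetic datasets."""
    m = len(q)
    q_bar = mean(q)      # point estimate of Q
    b_m = variance(q)    # between-dataset variance, divisor m - 1
    u_bar = mean(u)      # average within-dataset variance
    t_p = b_m / m + u_bar  # always positive when the u_i are positive
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * t_p ** 0.5
    return q_bar, t_p, (q_bar - half_width, q_bar + half_width)


q_bar_p, t_p, interval_p = combine_partially_synthetic([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
print(q_bar_p, t_p, interval_p)
```

Because b_m is divided by m rather than inflated, a small m can be adequate here, whereas the fully synthetic variance estimate subtracts u_bar and can turn negative.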

  20. Fully synthetic vs. partially synthetic
     • Fully synthetic
        • New units sampled
        • Cannot match: low disclosure risk
        • Full reliance on imputation models
        • Released data SRS
        • May need large synthetic sample sizes or m
     • Partially synthetic
        • Collected units used
        • Matches to observed data possible
        • Partial reliance on imputation models
        • Original design
        • Small m can be adequate for replacements

  21. Open research questions
     • Synthesis models for specific data types:
        • Data nested within households
        • Longitudinal data
        • Social network data
        • And many more…
     • Record linkage with synthetic data

  22. Guide to literature: Overviews of synthetic data
     • Rubin (1993, Journal of Official Statistics)
     • Little (1993, Journal of Official Statistics)
     • Abowd and Woodcock (2001) in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies
     • Reiter (2004, Chance)

  23. Guide to literature: Inferences with synthetic data
     • Full synthesis: Raghunathan, Reiter, Rubin (2003, Journal of Official Statistics)
     • Partial synthesis (no missing): Reiter (2003, Survey Methodology)
     • Partial synthesis with missing data: Reiter (2004, Survey Methodology)
     • Significance tests of multi-component hypotheses
        • Full synthesis and partial synthesis (no missing): Reiter (2005, Journal of Statistical Planning and Inference)
        • Partial synthesis with missing: Kinney and Reiter (2010, Journal of Official Statistics)
     • Model selection in regression: Kinney, Reiter, and Berger (forthcoming, Journal of Privacy and Confidentiality)

  24. Guide to literature: Generating synthetic data
     • Sequential regression approaches: Abowd and Woodcock (2004) in Privacy in Statistical Databases
     • Classification and regression trees: Reiter (2005, Journal of Official Statistics)
     • Survey weights and partial synthesis: Mitra and Reiter (2006) in Privacy in Statistical Databases
     • Bayesian networks: Young, Graham, Penny (2009, Journal of Official Statistics)
     • Regression with kernel density transformations: Woodcock and Benedetto (2009, Computational Statistics and Data Analysis)
     • Random forests: Caiola and Reiter (2010, Transactions on Data Privacy)
     • Support vector machines: Drechsler (2010) in Privacy in Statistical Databases

  25. Guide to literature: Disclosure risk estimation
     • Record linkage for partial synthesis: Abowd, Stinson, Benedetto (2006) technical report
     • Identification risks in partial synthesis
        • Reiter and Mitra (2009, Journal of Privacy and Confidentiality)
        • Drechsler and Reiter (2008) in Privacy in Statistical Databases
     • Differential privacy and synthetic data: Abowd and Vilhuber (2008) in Privacy in Statistical Databases

  26. Guide to literature: Utility of synthetic data
     • Complex designs in full synthesis: Reiter (2002, Journal of Official Statistics)
     • Impact of number of datasets on quality: Drechsler and Reiter (2009, Journal of Official Statistics)
     • Verification servers: Reiter, Oganian, and Karr (2009, Computational Statistics and Data Analysis)

  27. Guide to literature: Genuine applications
     • Synthesis instead of topcoding: An and Little (2007, Journal of the Royal Statistical Society - A)
     • Survey of Income and Program Participation linked data: www.census.gov/sipp/synth_data.html
     • Longitudinal Business Database: Kinney and Reiter (2007, Proceedings of the Joint Statistical Meetings)
     • American Community Survey group quarters: Hawala (2008, Proceedings of the Joint Statistical Meetings)
     • OnTheMap: http://lehdmap4.did.census.gov/themap4/
     • German Establishment Panel: Drechsler, Bender, and Rassler (2008, Transactions on Data Privacy)

  28. Guide to literature: Other adaptations
     • Combining two confidential datasets
        • Kohnen and Reiter (2009, Journal of the Royal Statistical Society - A)
        • Reiter (2009, International Statistical Review)
     • Synthesize some variables m times and others r times (Reiter and Drechsler 2010, Statistica Sinica)
     • Sampling from a census followed by synthesis of confidential data (Drechsler and Reiter 2010, Journal of the American Statistical Association)
