
Assessing Disclosure for a Longitudinal Linked File

Assessing Disclosure for a Longitudinal Linked File. Sam Hawala – US Census Bureau Sam.hawala@census.gov November 9th, 2005. Outline of the talk. Background of the project Confidentiality protection Disclosure analysis Conclusions. Linked SIPP-SSA-IRS Data.


Presentation Transcript


  1. Assessing Disclosure for a Longitudinal Linked File Sam Hawala – US Census Bureau Sam.hawala@census.gov November 9th, 2005

  2. Outline of the talk • Background of the project • Confidentiality protection • Disclosure analysis • Conclusions

  3. Linked SIPP-SSA-IRS Data • The Longitudinal Employer-Household Dynamics (LEHD) Program created a confidential data set that integrates five SIPP panels (1990, 1991, 1992, 1993, 1996) with earnings records and SSA benefits data • Data very useful to disability and retirement research communities • LEHD will provide a public-use version (PUF) of the integrated microdata using the synthetic data approach

  4. Synthetic Data • Fully-synthetic micro data • Uses the population or record linkage structure of the gold standard micro data • Generates synthetic entities and data elements from appropriate probability models • Partially-synthetic micro data • Preserves the record structure or sampling frame of the gold standard micro data • Replaces the data elements with synthetic values sampled from an appropriate probability model
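A minimal sketch of the partially-synthetic idea, for illustration only: the record structure and the unsynthesized variables are kept, and one data element is replaced with draws from a probability model. The model used here (a normal distribution fitted within each cell of the kept variables) and the names `partially_synthesize`, `keep_vars`, and `synth_var` are assumptions made for the example; the slides do not say which models the project used.

```python
import random
import statistics


def partially_synthesize(records, keep_vars, synth_var, seed=0):
    """Return copies of `records` with `synth_var` replaced by model draws.

    Assumption: a toy normal model per cell of `keep_vars`. A production
    synthesizer would use a much richer model.
    """
    rng = random.Random(seed)

    # Group the observed values of synth_var by the unsynthesized variables.
    cells = {}
    for rec in records:
        key = tuple(rec[v] for v in keep_vars)
        cells.setdefault(key, []).append(rec[synth_var])

    # Fit a (mean, standard deviation) pair for each cell.
    models = {}
    for key, values in cells.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        models[key] = (mean, sd)

    # Keep the record structure; replace only the synthesized element.
    synthetic = []
    for rec in records:
        mean, sd = models[tuple(rec[v] for v in keep_vars)]
        new_rec = dict(rec)
        new_rec[synth_var] = rng.gauss(mean, sd)
        synthetic.append(new_rec)
    return synthetic
```

Calling the function several times with different seeds would give several synthetic implicates of the same file. Note that a cell containing a single record gets a standard deviation of zero in this toy model and so reproduces the original value, which is exactly the kind of risk a disclosure analysis is meant to detect.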

  5. Data Confidentiality • Public product should prevent individuals from being re-identified in the current public use SIPP products • Limit number of SIPP variables included • Protect survey data, administrative data, and the links between the files

  6. Confidentiality Protection • Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based • This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF • Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability

  7. Disclosure Analysis • Uses probabilistic record linking • Each synthetic implicate is matched back to the original file • All unsynthesized variables are used as blocking variables
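Blocking on the unsynthesized variables can be sketched as follows. This is an illustrative sketch, not the project's code: the function name `candidate_pairs` and the representation of records as dictionaries are assumptions. Only pairs that agree exactly on every blocking variable are passed on to the probabilistic comparison, which keeps the number of comparisons manageable.

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, blocking_vars):
    """Yield (index_a, index_b) for records that agree on all blocking variables."""
    # Index file B by its blocking key so each record in A looks up one block.
    blocks = defaultdict(list)
    for j, rec_b in enumerate(file_b):
        blocks[tuple(rec_b[v] for v in blocking_vars)].append(j)

    for i, rec_a in enumerate(file_a):
        key = tuple(rec_a[v] for v in blocking_vars)
        for j in blocks.get(key, []):
            yield i, j
```

In the setting described on the slide, `file_a` would be the original file and `file_b` one synthetic implicate, with the loop repeated for each implicate.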

  8. Matching the Files • Two files A (original confidential data file) and B (synthetic data file)… over 200,000 records in each • Blocking criterion (unsynthesized variables) • Matching set of variables • Agreement criterion (M and U probabilities)
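The slide does not spell out how the M and U probabilities are combined. One common choice, assumed here, is the Fellegi-Sunter agreement weight: for each matching variable, M is the probability of agreement given that the pair is a true match and U is the probability of agreement given that it is not. The sketch below sums log-ratio weights over the matching variables; the function name `match_weight`, the variable name `birth_year`, and the default exact-equality comparison are hypothetical.

```python
import math


def match_weight(rec_a, rec_b, m_probs, u_probs, agrees=lambda x, y: x == y):
    """Sum Fellegi-Sunter log2 weights over the matching variables.

    m_probs[var] = P(agree on var | true match)
    u_probs[var] = P(agree on var | non-match)
    Both are assumed to lie strictly between 0 and 1.
    """
    total = 0.0
    for var, m in m_probs.items():
        u = u_probs[var]
        if agrees(rec_a[var], rec_b[var]):
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


# Hypothetical usage: one matching variable with M = 0.9 and U = 0.1.
weight = match_weight(
    {"birth_year": 1950}, {"birth_year": 1950},
    m_probs={"birth_year": 0.9}, u_probs={"birth_year": 0.1},
)
```

Pairs whose total weight exceeds a chosen threshold would be declared links. For continuous synthesized values, exact agreement is rare, so `agrees` would more plausibly be a tolerance-based comparison.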

  9. Basic Results

  10. Refinements Suggested by the Disclosure Review Board • The ratios of true matches to false matches should be close to 1. • The overall count of matches should be reduced. • Investigate a method to optimally choose the probabilities for the conditional matching and non-matching agreements
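The first two criteria can be checked with a simple tally once the linkage has been run. The sketch below assumes the agency can tell whether a declared link is correct because both files carry the same internal record identifier; the function name `true_false_counts` and the pair representation are hypothetical.

```python
def true_false_counts(declared_links):
    """Return (true_matches, false_matches, ratio) for declared links.

    Each link is an (id_in_original, id_in_synthetic) pair. A link counts as
    a true match when the two internal identifiers are equal (an assumption
    about how truth is known). The ratio is None when there are no false
    matches.
    """
    true_matches = sum(1 for id_a, id_b in declared_links if id_a == id_b)
    false_matches = len(declared_links) - true_matches
    ratio = true_matches / false_matches if false_matches else None
    return true_matches, false_matches, ratio
```

A ratio near 1 means a declared link is about as likely to be wrong as right, which is the protection goal stated earlier in the talk; the total `true_matches + false_matches` is the overall count of matches that the board asked to see reduced.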

  11. Conclusion • Confidentiality is an increasing problem for agencies releasing public use data • Linked longitudinal worker-employer data is difficult to protect through usual methods • Probabilistic record linkage technology can be a powerful way to assess when data may be at risk.
