Assessing Disclosure for a Longitudinal Linked File

Sam Hawala – US Census Bureau Sam.hawala@census.gov November 9th, 2005

Outline of the talk • Background of the project • Confidentiality protection • Disclosure analysis • Conclusions

Linked SIPP-SSA-IRS Data • The Longitudinal Employer-Household Dynamics (LEHD) Program created a confidential data set that integrates five SIPP panels (1990, 1991, 1992, 1993, 1996) with earnings records and SSA benefits data • The data are very useful to the disability and retirement research communities • LEHD will provide a public-use file (PUF) version of the integrated microdata using the synthetic data approach

Synthetic Data • Fully-synthetic micro data: uses the population or record linkage structure of the gold standard micro data; generates synthetic entities and data elements from appropriate probability models • Partially-synthetic micro data: preserves the record structure or sampling frame of the gold standard micro data; replaces the data elements with synthetic values sampled from an appropriate probability model
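The partially-synthetic idea can be illustrated with a minimal sketch. It is not the LEHD synthesizer: the function name `synthesize_partially`, the variables `age` and `earnings`, the toy records, and the simple linear model with normal noise are all assumptions made for illustration. The record structure and the unsynthesized variable are kept, and only the chosen data element is replaced by draws from a fitted model.

```python
import random
import statistics

def synthesize_partially(records, keep_key, synth_key, seed=0):
    """Return copies of records with synth_key replaced by model draws.

    Illustrative model only: ordinary least squares of synth_key on
    keep_key, with normal noise scaled by the residual spread.
    """
    rng = random.Random(seed)
    xs = [r[keep_key] for r in records]
    ys = [r[synth_key] for r in records]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = statistics.pstdev(residuals)

    synthetic = []
    for r in records:
        new = dict(r)  # copy, so the gold standard file is not modified
        new[synth_key] = rng.gauss(intercept + slope * r[keep_key], resid_sd)
        synthetic.append(new)
    return synthetic

# Hypothetical toy "gold standard" records.
gold = [
    {"age": 25, "earnings": 21000.0},
    {"age": 35, "earnings": 34000.0},
    {"age": 45, "earnings": 41000.0},
    {"age": 55, "earnings": 39000.0},
]
implicate = synthesize_partially(gold, "age", "earnings", seed=1)
```

Calling the function with different seeds gives several synthetic implicates of the same file. A production synthesizer would use richer models and draw from a posterior predictive distribution rather than from a single fitted line.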

Data Confidentiality • Public product should prevent individuals from being re-identified in the current public use SIPP products • Limit number of SIPP variables included • Protect survey data, administrative data, and the links between the files

Confidentiality Protection • Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based • This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF • Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability

Disclosure Analysis • Uses probabilistic record linking • Each synthetic implicate is matched back to the original file • All unsynthesized variables are used as blocking variables

Matching the Files • Two files A (original confidential data file) and B (synthetic data file)… over 200,000 records in each • Blocking criterion (unsynthesized variables) • Matching set of variables • Agreement criterion (M and U probabilities)
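The matching setup above can be sketched with Fellegi-Sunter style agreement weights, a standard formulation of probabilistic record linkage. Everything specific here is an assumption for illustration: the variable names, the M and U probabilities in `M_U`, the toy records, and the rule of linking each synthetic record to its highest-weight candidate within its block. The `id` field stands for the true correspondence between files, which the data custodian knows but a PUF user does not.

```python
import math
from collections import defaultdict

# Hypothetical (m, u) per matching variable:
# m = P(values agree | same person), u = P(values agree | different people).
M_U = {"birth_year": (0.95, 0.05), "earn_band": (0.80, 0.20)}

def pair_weight(rec_a, rec_b, m_u=M_U):
    """Sum of log2 agreement/disagreement weights over matching variables."""
    total = 0.0
    for var, (m, u) in m_u.items():
        if rec_a[var] == rec_b[var]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def link(file_a, file_b, block_keys, m_u=M_U):
    """For each record in B, return (b_id, best a_id or None, weight)."""
    blocks = defaultdict(list)
    for a in file_a:
        blocks[tuple(a[k] for k in block_keys)].append(a)
    results = []
    for b in file_b:
        # Only records that agree on every blocking variable are compared.
        candidates = blocks.get(tuple(b[k] for k in block_keys), [])
        best = max(candidates, key=lambda a: pair_weight(a, b, m_u), default=None)
        if best is None:
            results.append((b["id"], None, None))
        else:
            results.append((b["id"], best["id"], pair_weight(best, b, m_u)))
    return results

# File A: original confidential records. File B: one synthetic implicate.
file_a = [
    {"id": 1, "sex": "F", "state": "MD", "birth_year": 1950, "earn_band": 3},
    {"id": 2, "sex": "F", "state": "MD", "birth_year": 1962, "earn_band": 5},
    {"id": 3, "sex": "M", "state": "VA", "birth_year": 1950, "earn_band": 3},
]
file_b = [
    {"id": 1, "sex": "F", "state": "MD", "birth_year": 1950, "earn_band": 4},
    {"id": 2, "sex": "F", "state": "MD", "birth_year": 1950, "earn_band": 3},
    {"id": 3, "sex": "M", "state": "VA", "birth_year": 1971, "earn_band": 3},
]
links = link(file_a, file_b, ["sex", "state"])
```

In this toy run the second synthetic record links to original record 1 rather than record 2, which is the kind of false match the synthesis is meant to produce. Blocking on the unsynthesized variables keeps the number of compared pairs manageable for files of over 200,000 records.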

Basic Results

Refinements Suggested by the Disclosure Review Board • The ratios of true matches to false matches should be close to 1. • The overall count of matches should be reduced. • Investigate a method to optimally choose the probabilities for the conditional matching and non-matching agreements.
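The first two criteria can be tallied with a short sketch. It assumes, hypothetically, that linkage output is a list of `(synthetic_id, linked_original_id, weight)` tuples, that a link is declared only when the weight reaches a chosen `threshold`, and that a link is a true match when the two ids coincide. The function name `match_summary` and the example values are illustrative, not results from the talk.

```python
def match_summary(links, threshold=0.0):
    """Count true and false matches among declared links and their ratio."""
    true_matches = false_matches = 0
    for synth_id, orig_id, weight in links:
        if orig_id is None or weight < threshold:
            continue  # no link declared for this synthetic record
        if synth_id == orig_id:
            true_matches += 1
        else:
            false_matches += 1
    ratio = true_matches / false_matches if false_matches else float("inf")
    return true_matches, false_matches, ratio

# Hypothetical linkage output for four synthetic records.
example_links = [(1, 1, 2.2), (2, 1, 6.2), (3, 3, -2.2), (4, None, None)]
summary = match_summary(example_links, threshold=0.0)
```

Raising the threshold reduces the overall count of declared matches, and the ratio shows whether the remaining links are as likely to be false as true.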

Conclusion • Confidentiality is an increasing problem for agencies releasing public use data • Linked longitudinal worker-employer data is difficult to protect through usual methods • Probabilistic record linkage technology can be a powerful way to assess when data may be at risk.
