Alternative Approaches to Data Dissemination and Data Sharing
Jerome Reiter
Duke University
jerry@stat.duke.edu

Two general settings
• Agency seeks to release confidential data to the public.
• Multiple agencies seek to improve analyses by sharing their confidential data.
For both settings, agencies seek strategies that:
i) do not reveal identities or sensitive attributes,
ii) are useful for a wide range of analyses,
iii) are easy for analysts and agencies to use.

Some alternative approaches
• Remote access servers
• Synthetic (i.e., simulated) data
• Secure computation techniques

Definition of servers
• A server is any system that
  (i) allows users to submit queries for output from statistical analyses of microdata, but
  (ii) does not give direct access to microdata.
• Table Servers / Model Servers

Queries and responses
• Queries to model server: Users request results from fitting a statistical model to the data.
• Response from model server:
  Answerable query: model output.
  Unanswerable query: no results.
Model output also should include diagnostics.

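The query/response pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Query` and `ModelServer`, the simple linear regression, the R-squared diagnostic, and the minimum-record rule for deciding that a query is unanswerable are all hypothetical choices, not rules taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Query:
    """A request to fit y = intercept + slope * x (hypothetical query format)."""
    response: str
    predictor: str
    subset: Optional[Callable[[dict], bool]] = None  # optional row filter


class ModelServer:
    """Holds microdata privately and returns only model output."""

    def __init__(self, microdata, min_records=5):
        self._microdata = microdata      # never returned to the user
        self._min_records = min_records  # assumed disclosure rule

    def submit(self, query):
        rows = [r for r in self._microdata
                if query.subset is None or query.subset(r)]
        # Unanswerable query: subset too small, so no results are returned.
        if len(rows) < self._min_records:
            return None
        x = [r[query.predictor] for r in rows]
        y = [r[query.response] for r in rows]
        n = len(rows)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((a - mean_x) ** 2 for a in x)
        if sxx == 0:
            return None  # model cannot be fit, treated as unanswerable
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        # Diagnostic returned with the fit: R-squared.
        sst = sum((b - mean_y) ** 2 for b in y)
        sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
        r_squared = 1 - sse / sst if sst > 0 else None
        return {"intercept": intercept, "slope": slope,
                "r_squared": r_squared, "n": n}
```

A user would call `server.submit(Query("y", "x"))` and receive either a dictionary of model output or `None`; the microdata themselves never leave the server object.
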
Challenges in developing model servers
• Non-statistical: Operation costs, server security, etc.
• Statistical:
  -- Disclosure risks from smart queries (e.g., subsets, transformations).
  -- Inferential disclosure risks.
  -- Enabling complex model fitting.

Synthetic data
Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:
• No unit in released data has sensitive data from actual unit in population.
• Released data look like actual data.
• Statistical procedures valid for original data are valid for released data.

Generating fully synthetic data
• Randomly sample new units from sampling frame.
• Impute survey variables for new units using models fit from observed data.
• Repeat multiple times and release datasets.

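The three steps above can be sketched as follows. This is a minimal illustration, assuming one frame covariate `x`, one survey variable `y`, and a simple linear regression as the imputation model; the function names are hypothetical. For brevity the sketch plugs in point estimates of the model parameters, whereas a proper multiple-imputation approach would also draw the parameters from their posterior distribution.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y on x; returns intercept, slope, residual sd."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    sigma = (sse / (n - 2)) ** 0.5
    return intercept, slope, sigma


def generate_fully_synthetic(frame_x, observed, n_units, m, seed=0):
    """frame_x: covariate values for every unit on the sampling frame.
    observed: list of (x, y) pairs from the confidential survey.
    Returns m synthetic datasets, each a list of (x, synthetic y) pairs."""
    rng = random.Random(seed)
    xs = [x for x, _ in observed]
    ys = [y for _, y in observed]
    intercept, slope, sigma = fit_line(xs, ys)  # model fit from observed data
    datasets = []
    for _ in range(m):                          # repeat multiple times
        new_x = rng.sample(frame_x, n_units)    # sample new units from frame
        # Impute the survey variable for each new unit from the fitted model.
        datasets.append([(x, intercept + slope * x + rng.gauss(0, sigma))
                         for x in new_x])
    return datasets
```

Every released `y` value in this sketch is simulated; none is copied from a survey respondent.
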
Modification: Release partially synthetic data
Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:
• Released data comprise mix of observed and synthetic values.
• Released data look like actual data.
• Statistical procedures valid for original data are valid for released data.

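A small sketch of the "mix of observed and synthetic values" idea follows. It assumes records stored as dictionaries; `is_selected` (which units get replaced) and `draw_value` (a stand-in for whatever imputation model the agency would actually fit) are hypothetical parameters introduced only for illustration.

```python
import random


def partially_synthetic(records, variable, is_selected, draw_value, m, seed=0):
    """Return m released datasets in which `variable` is replaced by a
    synthetic draw for selected records and left as observed otherwise."""
    rng = random.Random(seed)
    releases = []
    for _ in range(m):
        release = []
        for rec in records:
            out = dict(rec)  # copy so the confidential data are not modified
            if is_selected(rec):
                out[variable] = draw_value(rec, rng)  # synthetic value
            release.append(out)
        releases.append(release)
    return releases
```

For example, an agency might pass `is_selected=lambda r: r["income"] > 100000` so that only high incomes are replaced, while all other values are released as observed.
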
Existing applications
• Kennickell (1997, Record Linkage Techniques): Replace sensitive values for selected units.
• Liu and Little (2002, JSM Proceedings): Replace values of key identifiers for selected units.
• Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access): Replace all values of sensitive variables.

Sample of research agenda
• Implement and compare various data generation approaches on genuine data in production settings.
• Evaluate risk/usefulness profile on genuine data in production setting.
• Develop packaged synthesizers for data disseminators to use.

Secure computations
• Horizontally Partitioned: Agencies have different records but same variables.
• Purely Vertically Partitioned: Agencies have same records but different variables.
• Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.

Horizontally Partitioned Data: Secure Summation
• Secure summation
  -- shares sums without sharing data.
  -- allows regressions, clustering, classifications.
  -- assumes semi-honest agencies (they follow the protocol but may try to learn from what they receive).

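To see why shared sums are enough for a regression on horizontally partitioned data, consider a simple linear regression of y on x. Each agency computes five sums from its own records, and the pooled sums determine the least-squares fit. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the totals are formed by plain addition, whereas in practice each total would be obtained through a secure summation protocol.

```python
def local_sums(xs, ys):
    """Sums an agency computes from its own records only."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def pooled_regression(summaries):
    """Least-squares intercept and slope from the agencies' pooled sums."""
    # Plain addition here; assumed to be replaced by secure summation.
    total = {key: sum(s[key] for s in summaries) for key in summaries[0]}
    n, sx, sy = total["n"], total["sx"], total["sy"]
    sxx, sxy = total["sxx"], total["sxy"]
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

The result equals the fit that would be obtained by stacking all agencies' records into one file, yet no agency reveals an individual record.
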
Horizontal Partitioning: Secure summation
Obtain the sum of the agencies' x values without sharing individual values.
• Agency A adds a random number R to its x and passes (x + R) to Agency B.
• Agency B adds its x to this value and passes sum to Agency C.
• Process continues until all agencies have added their x.
• Agency A subtracts R from the sum.

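The protocol above can be simulated in a single process as follows. This is a sketch, not a networked implementation: it assumes each agency holds a non-negative integer, that the true total is smaller than `modulus`, and that arithmetic is done modulo `modulus` so the masked running total reveals nothing about any single value. The function name and the modulus are illustrative choices.

```python
import secrets


def secure_sum(values, modulus=2 ** 64):
    """Simulate secure summation; values[i] is held privately by agency i.

    Assumes non-negative integers whose total is below `modulus`."""
    r = secrets.randbelow(modulus)         # Agency A's random mask R
    running = (values[0] + r) % modulus    # Agency A passes (x + R)
    for x in values[1:]:                   # each later agency adds its x
        running = (running + x) % modulus
    return (running - r) % modulus         # Agency A subtracts R
```

For instance, `secure_sum([10, 20, 30])` returns 60. Each agency only ever sees a running total masked by R, which protects individual values as long as agencies are semi-honest and do not collude.
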
Purely vertical partitioning
• Secure dot/matrix product
  -- shares dot/matrix products without sharing data.
  -- allows regressions, clustering, classification.
  -- assumes semi-honest agencies.
• Synthetic data approaches
  -- share synthetic copies of data across agencies.
  -- allows any analysis when distributions used to generate data are accurate.
  -- generates public use data file.

A research agenda for secure computation methods
- How to specify models without viewing data?
- What if sophisticated models needed?
- How to incorporate matching errors, differences in data quality and definitions?
- How to account for disclosure risks from models that “fit too well”?

Some References
• Remote access servers
  - Rowland (2003, NAS Panel on Data Access)
  - Gomatam, Karr, Reiter, and Sanil (2005, Stat. Science)
• Synthetic data
  - Raghunathan, Reiter, and Rubin (2003, JOS)
  - Reiter (2003, Surv. Meth.; 2005, JRSSA)
• Secure computation
  - Benaloh (1987, CRYPTO '86)
  - Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.)