
Statistical Disclosure Control Mark Elliot Confidentiality and Privacy Group CCSR University of Manchester


Presentation Transcript


  1. Statistical Disclosure Control Mark Elliot Confidentiality and Privacy Group CCSR University of Manchester

  2. Overview • CAPRI – who we are / what we do • SDC – some basics • SD Risk Assessment and Microdata • General Concepts • Our Approach • SD Risk Assessment and Aggregate Data • General Concepts • Our Approach • Statistical Disclosure and the Grid

  3. Confidentiality And PRIvacy group www.ccsr.ac.uk/capri University of Manchester

  4. Purpose To investigate the Confidentiality and Privacy issues that arise from the collection, dissemination and analysis of data.

  5. Multidisciplinary Approach • Mark Elliot, Knowledge and Data Engineering • Kingsley Purdam, Politics and Information Society • Anna Manning, Data Mining and HPC • Elaine Mackey, Social Policy • Duncan Smith, Statistics and Stochastic Systems • Karen McCullagh, the Law and Social Policy

  6. Associate Members in Manchester • Computer Science: Alan Rector, John Gurd, Len Freeman, Adel Taweel • Computation: John Keane • Psychology: Karen Lander, Lee Wickham • Medicine: Iain Buchan • Manchester Computing Centre: Stephen Pickles • Law: Joseph Jakaneli, John Harris

  7. Research Programmes • The Social and Political Aspects of Confidentiality and Privacy • The Detection of Risky Records: Special Uniqueness • The Disclosure Risk Issues Posed by the Grid • High Performance Computing and Statistical Disclosure • Medical Records: Clinical E-Science Framework • The SAMDIT Methodology: Data Monitoring Centre

  8. Consultancy • ONS • Census • Social Survey • Neighbourhood Statistics • US Census Bureau • Australian Bureau of Statistics • Statistics New Zealand

  9. Statistical Disclosure Control

  10. Sub Fields • Disclosure risk assessment. • Disclosure control methodology. • Analytical validity. • Microdata and Aggregate data. • Business and Personal data. • Intentional and Consequential data

  11. Our General Approach: The SAMDIT method • Scenario Analysis (Elliot and Dale 1999) • Metric Development • Implementation • Testing

  12. Microdata

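  13. The Microdata Disclosure Risk Problem: An Example [Figure: an identification file (Name, Address, Sex, Age, ...) and a target file (Sex, Age, ..., Income, ...). Name and Address are the ID variables, Sex and Age are the key variables common to both files, and Income is a target variable.]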

  14. Risk Assessment methods • File level • Population uniqueness, e.g. Bethlehem (1990), Samuels (1998) • DIS: Skinner and Elliot (2002) • Record level • Statistical modelling (Fienberg and Makov 1998, Skinner and Holmes 1998) • Computational search: Elliot et al. (2002)

  15. Data Intrusion Simulation • Uses microdata set (or table) itself to estimate risk - no population data. • An estimate of the probability of a correct match (given a unique match). • Special method: sub-sampling and re-sampling. • General method: derivation from the equivalence class structure.
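The general method can be sketched from equivalence class counts alone. The slide does not give the formula; the closed form below is the estimator usually attributed to Skinner and Elliot (2002), and it should be treated as an assumption to check against that paper. The function name and the input format (one key value per sample record) are hypothetical.

```python
from collections import Counter


def dis_general(keys, sampling_fraction):
    """Estimate Pr(correct match | unique match) from equivalence class sizes.

    keys: one hashable key value per sample record (hypothetical input format).
    Assumed estimator: pi*U / (pi*U + 2*(1 - pi)*P), where U is the number of
    sample uniques and P is the number of equivalence classes of size two.
    """
    class_sizes = Counter(keys)
    uniques = sum(1 for size in class_sizes.values() if size == 1)
    pairs = sum(1 for size in class_sizes.values() if size == 2)
    if uniques == 0:
        return None  # no unique matches are possible on this key
    pi = sampling_fraction
    return pi * uniques / (pi * uniques + 2 * (1 - pi) * pairs)
```

Only the sample file and the sampling fraction are needed, which is the sense in which no population data is required.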

  16. The DIS Method • Remove a small number of records from the microdata sample

  17. The DIS Method II • Copy back a random number of the removed records (each with a probability equal to the original sampling fraction)

  18. The DIS Method III Match the removed fragment against the truncated microdata file
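The three steps of the special method (remove, copy back, match) can be simulated directly. This is a minimal sketch, not the authors' implementation: the function name, the arguments and the representation of each record by a single key value are all assumptions made for illustration.

```python
import random
from collections import Counter


def dis_special(keys, sampling_fraction, fragment_size, rng):
    """One pass of the remove / copy back / match procedure.

    keys: one hashable key value per sample record (hypothetical format).
    Returns (unique_matches, correct_matches) for the removed fragment.
    """
    indices = range(len(keys))
    # Step 1: remove a small fragment of records from the sample.
    removed = set(rng.sample(indices, fragment_size))
    # Step 2: copy each removed record back with probability equal to
    # the original sampling fraction.
    copied_back = {i for i in removed if rng.random() < sampling_fraction}
    # The truncated file: records never removed, plus those copied back.
    in_file = [i for i in indices if i not in removed or i in copied_back]
    class_sizes = Counter(keys[i] for i in in_file)
    # Step 3: match every fragment record against the truncated file.
    unique_matches = correct_matches = 0
    for i in removed:
        if class_sizes[keys[i]] == 1:  # exactly one file record shares the key
            unique_matches += 1
            if i in copied_back:  # that file record is the same individual
                correct_matches += 1
    return unique_matches, correct_matches
```

Repeating the pass many times with a seeded `random.Random` and dividing the total correct matches by the total unique matches gives an empirical estimate of the probability of a correct match given a unique match.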

  19. Validation • Empirical validation studies comparing DIS with the results obtained using population data found no bias and small error: Elliot (2001) • Mathematical proof: Skinner and Elliot (2002)

  20. Pr(cm|um) for a 2% sample with the basic key (age, sex, marital status)

  21. Levels of Risk Analysis • DIS • Works at the file level • Very good for comparative analyses • e.g. SAMs

  22. Levels of Risk Analysis • Record level risk is important • Variations in risk topography • Risky records

  23. Special Uniques • Original concept • A counterintuitive geographical effect indicated two types of sample uniques: random and special • Special • Epidemiological peculiarity • Random • Effect of sampling and variable definition

  24. Special Uniques • Changing definition: • Sample uniques which remain unique despite geographical aggregation • Sample uniques which remain unique through any variable aggregation • Sample uniques on subset of key variables • Dichotomy to Dimension

  25. Minimal Sample Unique • A sample-unique set of variable values • for which no proper subset is also unique.

  26. Risk Signatures: combinations of minimal uniques • Example • Unique pairs 0 • Unique triples 5 • Unique fourfolds 1 • Unique fivefolds 3 • Unique sixfolds 0 • Unique sevenfolds 0 • ………
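The two definitions above can be illustrated with a brute-force search over subsets of the key variables. This is a sketch for small examples only, not the SUDA algorithm; the function names and the dictionary-per-record format are assumptions.

```python
from collections import Counter
from itertools import combinations


def minimal_sample_uniques(records, record_index, variables):
    """Return the minimal sample uniques of one record as tuples of variable names.

    records: list of dicts mapping variable name to value (hypothetical format).
    """
    target = records[record_index]
    found = []
    # Search by increasing subset size, so every proper subset of a
    # candidate has already been examined when the candidate is reached.
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            if any(set(smaller) <= set(subset) for smaller in found):
                continue  # contains a smaller unique subset, so not minimal
            values = tuple(target[v] for v in subset)
            matches = sum(
                1 for r in records if tuple(r[v] for v in subset) == values
            )
            if matches == 1:  # only the target record has this combination
                found.append(subset)
    return found


def risk_signature(minimal_uniques):
    """Count minimal sample uniques by size, e.g. {2: 0, 3: 5, 4: 1, ...}."""
    return Counter(len(subset) for subset in minimal_uniques)
```

A record whose signature is concentrated on small subsets stays unique on very little information, which is what makes it a candidate special unique.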

  27. Special Uniques • Problem: how to look at all the variables? • File may contain hundreds • Even with scenario keys individual records can contain hundreds of minimal sample uniques • Combinatorial explosion
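The scale of the combinatorial explosion is easy to quantify: with n variables there are 2^n - 1 non-empty variable subsets to examine per record. The helper below is purely illustrative and its name is hypothetical.

```python
from math import comb


def count_variable_subsets(n_variables, max_size=None):
    """Number of non-empty variable subsets of size at most max_size."""
    if max_size is None:
        max_size = n_variables
    return sum(comb(n_variables, size) for size in range(1, max_size + 1))
```

For example, 10 variables give 1,023 subsets, while 100 variables already give 166,750 subsets of size three or less.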

  28. HIPERSTAD Projects • Funded by ESRC, ONS and EPSRC • Use of high performance computing • Enables comprehensive analysis of patterns of uniqueness within each record • Has allowed investigation of more complex grading systems

  29. Risk Signatures II • Allow grading and classification of records • Differential treatment • Low impact high efficacy disclosure control

  30. Combining DIS and SUDA • A heuristic for combining the two methods to provide a per-record matching confidence has proved very effective • ONS evaluation studies show that the combined method picks out high-probability risks very well

  31. SUDA software • Available free under licence • Used at ONS, ABS and Statistics New Zealand

  32. Aggregate Data

  33. Introduction • Measurement of Disclosure Risk is an important precursor for its control • Intruder/scenario based metrics are better than abstract ones • Such metrics are available for microdata but not for aggregate data

  34. Overview • Overview of the issues and introducing the method on a conceptual level • Details of the algorithms • Ongoing and Future Work

  35. The Issues • Aggregate data is usually 100% data, so measures based on identification disclosure and sampling are meaningless • A better approach is to evaluate what can be inferred through attribute disclosure

  36. Attribute Disclosure

  37. Attribute Disclosure

  38. The Approach • Rather than assess the risk of actual attribute disclosure, we propose estimating the probability of producing a potentially disclosive table, which we define as any table containing at least one zero • The method/measure we propose can be applied to: • Single tables • Groups of tables • Unperturbed and perturbed tables • Unpublished tables

  39. The Bounds Problem • In a general sense, any set of tables can be viewed as a set of bounds on the full table. For example, if we release two one-way frequency tables:

  40. The Bounds Problem We are effectively releasing the marginals to a two-way frequency table where the entire joint distribution has been suppressed

  41. The cells in the joint distribution can be expressed as a set of bounds (or ranges of feasible values)
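For the two-way case described above, the feasible range of each cell follows from the margins alone. The sketch below uses the classical Fréchet bounds, on the assumption that these are the bounds intended; the function name is hypothetical.

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) bounds for each cell of a two-way table.

    Cell (i, j) must lie between max(0, r_i + c_j - N) and min(r_i, c_j),
    where N is the grand total shared by both margins.
    """
    grand_total = sum(row_totals)
    if grand_total != sum(col_totals):
        raise ValueError("row and column totals must have the same sum")
    return [
        [(max(0, r + c - grand_total), min(r, c)) for c in col_totals]
        for r in row_totals
    ]
```

With margins `[2, 2]` and `[2, 2]` every cell is only bounded to the range 0 to 2, but with margins `[3, 0]` and `[2, 1]` every lower bound equals its upper bound, so the suppressed joint distribution is recovered exactly.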

  42. The Subtraction – Attribution Probability (SAP) Method • The risk associated with a table release depends on the set of tables jointly, rather than on the individual tables. • SAP can be used on single tables, groups of tables, perturbed or unperturbed tables. • Bounds are calculated, and then the probability of an intruder producing one or more upper bounds of zero by subtracting k random individuals from the table is calculated. • The output can be set for user-defined levels of k.

  43. Original cell counts can be recovered from the marginal tables

  44. Subtraction • We consider that an intruder might have knowledge of the relevant population, as well as information in the table release • We assume (at least initially) that the intruder has perfect knowledge of k randomly selected individuals

  45. Single exact tables • The lower / upper bounds are equal to the published counts • The probability of an intruder recovering at least one zero by subtracting known individuals is found by calculating hypergeometric probabilities and applying the inclusion-exclusion principle

  46. The marginal probability of observing all individuals in a cell is calculated for each individual cell, and the sum is added to a total (initially zero) • The marginal probability of observing all individuals in a pair of cells is calculated for each pair of cells, and subtracted from the total • The marginal probability of observing all individuals in a ‘triple’ of cells is calculated for each triple of cells, and added to the total • And so on, until we have considered the table total, or all subsequent probabilities are zero
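The inclusion-exclusion calculation for a single exact table can be sketched as follows. It assumes the intruder knows k individuals drawn at random without replacement from the table population, and it ignores cells that are already zero; the function name is hypothetical and this is not the SAP software itself.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def prob_recover_zero(cell_counts, k):
    """Probability that subtracting k known individuals empties at least one cell."""
    population = sum(cell_counts)
    if not 0 <= k <= population:
        raise ValueError("k must lie between 0 and the table total")
    nonempty = [count for count in cell_counts if count > 0]
    total_draws = comb(population, k)
    result = Fraction(0)
    for size in range(1, len(nonempty) + 1):
        sign = 1 if size % 2 == 1 else -1  # add singles, subtract pairs, ...
        level = 0
        for cells in combinations(nonempty, size):
            m = sum(cells)
            if m <= k:
                # Hypergeometric term: all m individuals in these cells are
                # among the k known, and the other k - m come from the rest.
                level += comb(population - m, k - m)
        if level == 0:
            break  # larger sets of cells need even more individuals
        result += sign * Fraction(level, total_draws)
    return result
```

As a made-up illustration (not the table from the slides), `prob_recover_zero([1, 2, 3], 3)` returns `Fraction(7, 10)`: the single-cell terms contribute 10, 4 and 1 out of 20 possible draws, and the one feasible pair of cells removes 1.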

  47. For example, for k = 3 and the following table (and not showing zero-probability terms),

  48. Example output

  49. Confidentiality and the Grid • What new data possibilities does the Grid provide and what confidentiality implications do they have? • How could the Grid (or a Grid) be used to enable disclosure risk assessment and control? • How could a grid enable a data intruder? • What are the possibilities and issues provided by remote access?
