
The fallacy of de-identification and its impact on Open Data




  1. The fallacy of de-identification and its impact on Open Data Dr Chris Culnane, School of Computing and Information Systems, The University of Melbourne. Based on joint work with Dr Benjamin Rubinstein & Dr Vanessa Teague

  2. Overview • The ambiguity of de-identification, and why it is a fallacy • What re-identification is, with examples • k-anonymity: what it is and why it fails • MBS/PBS

  3. What is de-identification? • “A common error when thinking about de-identification is to focus on a fixed end state of the data.” • “De-identification, then, is a process of risk management but it is also a decision-making process: should we release this data or not and if so in what form?” • “De-identification is a process to produce safe data but it only makes sense if what you are producing is safe useful data…” (The De-Identification Decision-Making Framework, OAIC/CSIRO 2017) • “Personal information is de-identified if the information is no longer about an identifiable individual or an individual who is reasonably identifiable.” (Privacy Act 1988) • “This National Statement avoids the term 'de-identified data', as its meaning is unclear. While it is sometimes used to refer to a record that cannot be linked to an individual ('non-identifiable'), it is also used to refer to a record in which identifying information has been removed but the means still exist to re-identify the individual. When the term 'de-identified data' is used, researchers and those reviewing research need to establish precisely which of these possible meanings is intended.” (National Statement on Ethical Conduct in Human Research, via the Australian National Data Service: http://www.ands.org.au/working-with-data/sensitive-data/de-identifying-data)

  4. The fallacy of de-identification “…scientists have demonstrated that they can often ‘reidentify’ or ‘deanonymize’ individuals hidden in anonymized data with astonishing ease. By understanding this research, we realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention.” (Paul Ohm: “Broken promises of privacy: responding to the surprising failure of anonymization”, 2009)

  5. What is Re-identification? • Linking one or more de-identified datasets to infer additional information about individual records • Iterative: harm can be caused by partial re-identification, even without recovering a name • Ultimately it can culminate in personally identifying a person • Many sources of auxiliary data: • Public datasets (voter registration, public records) • Unstructured data (news media, social media) • Private datasets or knowledge (insurance companies, banks, personal knowledge) • Not rocket science – fundamentally just an inner join (see the sketch below)
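To make the last point concrete, here is a minimal sketch of a linkage attack as an inner join, assuming Python with pandas; the records, names, and columns are invented for illustration and are not taken from any real release.

import pandas as pd

# "De-identified" health data: names removed, quasi-identifiers kept.
health = pd.DataFrame({
    "age":     [34, 34, 51],
    "gender":  ["F", "M", "F"],
    "state":   ["VIC", "VIC", "NSW"],
    "illness": ["asthma", "diabetes", "cancer"],
})

# Public auxiliary data, e.g. a voter roll, with names attached.
voters = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carol"],
    "age":    [34, 34, 51],
    "gender": ["F", "M", "F"],
    "state":  ["VIC", "VIC", "NSW"],
})

# The attack really is just an inner join on the shared quasi-identifiers:
# every combination unique in both tables gets its name back.
linked = health.merge(voters, on=["age", "gender", "state"], how="inner")
print(linked)  # each illness is now attached to a named individual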

  6. How hard is re-identification? “…with advances of technology, methods that were sufficient to de-identify data in the past may become susceptible to re-identification in the future.” (Attorney-General, 2016) • It’s child’s play • Guess Who? • Twenty Questions

  7. What is Re-identification? • It is not only easy, it is surprisingly common: • AOL • Netflix – IMDB • New York Taxi Service • Massachusetts Group Insurance Commission • Predicting SSNs by Birth date and State • MBS/PBS

  8. K-anonymity – why it fails • Data is said to have the k-anonymity property if each individual's record is indistinguishable from at least k-1 other individuals' records in the release • Quasi-identifiers • Suppression • Grouping (generalisation) • Often used as the basis for evaluating the risk of a release (a checking sketch follows)
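A minimal sketch of the property itself, again assuming pandas; the function name and signature are mine, not a standard API.

import pandas as pd

def satisfies_k_anonymity(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    occurs at least k times in the release."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Suppression (dropping a column) and grouping/generalisation (e.g.
# exact age -> ten-year band) are the usual tools for pushing a table
# up to a chosen k before release.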

  9. K-anonymity – why it fails Quasi-identifiers: Name, Age, Gender, State, Illness [example table shown on slide]

  10. K-anonymity – why it fails Quasi-identifiers: Age, Gender, State, Illness [table shown on slide, with Name suppressed]

  11. K-anonymity – why it fails Quasi-identifiers: Age, Gender, State, Illness [table shown on slide]

  12. K-anonymity – why it fails (k=2) Quasi-identifiers: Age, Gender, State, Illness [table shown on slide; a hedged reconstruction follows]
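The tables on slides 9-12 were images and did not survive the transcript. The following is a hedged reconstruction with invented records, showing how suppression and grouping can produce a k=2 release of the kind the slides walk through.

import pandas as pd

records = pd.DataFrame({
    "name":    ["Alice", "Bob", "Carol", "Dan"],
    "age":     [34, 36, 51, 53],
    "gender":  ["F", "F", "M", "M"],
    "state":   ["VIC", "VIC", "NSW", "NSW"],
    "illness": ["asthma", "asthma", "diabetes", "flu"],
})

released = records.drop(columns=["name"])                         # suppression
released["age"] = (released["age"] // 10 * 10).astype(str) + "s"  # grouping

# Every (age, gender, state) combination now covers 2 people: k = 2.
print(released.groupby(["age", "gender", "state"]).size())

Note that both records in the ("30s", "F", "VIC") group share the illness "asthma", so the sensitive attribute leaks even though no single row was re-identified; this is exactly the attribute-disclosure problem raised on the next slide.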

  13. K-anonymity – why it fails • Accurate determination of quasi-identifiers is critical • Is there any data that isn’t identifiable? • Correlation between quasi-identifiers and non-identifiers can turn the latter into the former • Attribute disclosure: if everyone in a k-anonymous group shares a sensitive attribute, that attribute is revealed without re-identifying any individual row

  14. K-anonymity – Release environment • The release environment provides the auxiliary data • Public datasets (voter registration, public records) • Unstructured data (news media, social media) • Private datasets or knowledge (insurance companies, banks, personal knowledge) • Affects what should be considered a quasi-identifier • Impossible to know the true state of the release environment • At best an evaluation of the past, not a predictor of the future • What data will be released tomorrow?

  15. K-anonymity – Curse of dimensionality • K-anonymity does not work for high dimensional datasets • Often the most interesting datasets • Implicit, but often ignored, assumption that rows are independent • Longitudinal data • Data over time (transactional, episodic)
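A back-of-envelope sketch of the collapse; the population size and the uniform, independent binary attributes are simplifying assumptions of mine, chosen only to show the trend.

# With n records and d roughly independent binary quasi-identifier
# columns, the expected group size is about n / 2**d, so k-anonymity
# collapses quickly as dimensions are added.
n = 1_000_000
for d in [5, 10, 20, 30]:
    print(f"d={d:2d} columns -> expected group size ~ {n / 2**d:,.2f}")
# d= 5 columns -> expected group size ~ 31,250.00
# d=10 columns -> expected group size ~ 976.56
# d=20 columns -> expected group size ~ 0.95
# d=30 columns -> expected group size ~ 0.00

Longitudinal data makes this worse: each additional row for the same person is another handful of columns' worth of distinguishing information.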

  16. K-anonymity – Longitudinal Data [diagram shown on slide]

  17. MBS/PBS Open Data Release • 10% sample of 30 years of MBS/PBS data • Longitudinal data release • Approx. 1 billion rows • MBS linked to PBS • Supplier ID and Patient ID encrypted • We broke the encryption scheme covering the Supplier ID • Re-identification Amendment announced by the Attorney-General on the eve of the public statement (retrospectively introduced)

  18. MBS/PBS Open Data Release • Indexed by Patient ID (year of birth, gender) – 2,919,379 patients • Each MBS event consists of 18 fields (excluding the index) • Each PBS event consists of 13 fields (excluding the index) • Average number of records: MBS – 262, PBS – 78 • Max number of records: MBS – 12,733, PBS – 5,257 • Combined (MBS & PBS): Max – 13,720, Avg – 340 • Data points (MBS & PBS): Max – 242,025, Avg – 5,737 (a back-of-envelope sketch follows)
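A back-of-envelope calculation on the slide's own numbers; the figure of roughly 100 plausible birth years is my assumption, used only for scale.

patients = 2_919_379
index_bins = 100 * 2          # (year of birth, gender): ~100 cohorts x 2
print(patients / index_bins)  # ~14,600 patients per index value

# The index alone identifies no one. But with ~340 events (~5,737 data
# points) per patient on average, a handful of externally known service
# dates is typically enough to single one record out of a ~14,600-patient
# bin.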

  19. Sampling won’t protect you • Mistaken belief that sampling protects privacy • Based on the difficulty of determining whether a particular individual is in the sample • Auxiliary data helps determine presence in the sample • Full-population auxiliary data • Aggregate data releases over the entire population (DHS) • Techniques exist for determining the probability of presence in the sample (see the sketch below)
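One such technique, sketched under the simplifying assumption (mine, not the talk's) that the release is a uniform random sample and that exactly one sampled record matches the target's known attributes.

def confidence_in_match(f_population_matches):
    """If full-population auxiliary data says f people share the
    target's known attributes, and exactly one matching record appears
    in the sample, then by symmetry among the f candidates it is the
    target with probability 1/f."""
    return 1.0 / f_population_matches

for f in [1, 2, 10]:
    print(f"population matches f={f}: confidence = {confidence_in_match(f):.2f}")

# f == 1, a population-unique attribute combination, makes any hit a
# certain hit: sampling offered no protection at all.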

  20. Summary • De-identification does not work for unit record level data • Yet it continues to be advocated • Using k-anonymity to calculate the risk of re-identification is fundamentally flawed • Requires knowledge of the release environment that cannot be obtained • Past risk, not future risk • The real risk is not borne by those releasing the data • Motivates compliance as opposed to protection
