
Data Anonymization – Introduction and k-anonymity


Presentation Transcript


  1. Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security

  2. Outline • Problem definition • K-Anonymity Property • Disclosure Control Methods • Complexity • Attacks against k-Anonymity

  3. Defining Privacy in DB Publishing • Traditional information security: protecting information and information systems from unauthorized access and use • Privacy-preserving data publishing: protecting private data while publishing useful information (the original data are modified before release)

  4. Problem: Disclosure Control • Disclosure control is the discipline concerned with modifying data that contain confidential information about individual entities (persons, households, businesses, etc.) in order to prevent third parties working with these data from recognizing the individuals in them • Also called privacy-preserving data publishing, anonymization, or de-identification • Types of disclosure: • Identity disclosure - identification of an entity (person, institution) • Attribute disclosure - the intruder learns something new about the target person • Disclosure - identity disclosure, attribute disclosure, or both

  5. Microdata and External Information • Microdata - a series of records, each containing information on an individual unit such as a person, a firm, an institution, etc. (in contrast to computed tables) • Masked microdata (MM) - microdata from which names and other identifying information have been removed or modified • External information - any information about individuals from the initial microdata that is known to a presumptive intruder

  6. Disclosure Risk and Information Loss • Disclosure risk - the risk that a given form of disclosure will arise if the masked microdata is released • Information loss - the amount of information that exists in the initial microdata but, because of the disclosure control methods, is no longer present in the masked microdata

  7–9. Disclosure Control Problem (flow diagram) • Individuals submit their data; the data owner collects it, applies a masking process, and releases the masked data • The masking process must protect the confidentiality of individuals (disclosure risk / anonymity properties) while preserving data utility (information loss) • A researcher receives the masked data and uses it for statistical analysis; an intruder combines the masked data with external data to disclose confidential information

  10–13. Types of Disclosure • The data owner derives masked microdata from the initial microdata and releases it • The intruder combines the masked microdata with external information • Identity disclosure: Charlie is the third record • Attribute disclosure: Alice has AIDS

  14. Disclosure Control for Tables vs. Microdata • Microdata • Precomputed statistics tables

  15. Disclosure Control For Microdata

  16. Disclosure Control for Tables

  17. Outline Problem definition K-Anonymity Property Disclosure Control Methods Complexity and algorithms Attacks against k-Anonymity

  18. History • The term was introduced in 1998 by Samarati and Sweeney. • Important papers: • Sweeney L. (2002), K-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 557-570 • Sweeney L. (2002), Achieving K-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 571-588 • Samarati P. (2001), Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010-1027

  19. More recently • Many new research papers in the last 4-5 years • Theoretical results • Many algorithms achieving k-anonymity • Many improved principles and algorithms

  20. Terminology (Sweeney) • Data or Microdata - person-specific information that is conceptually organized as a table of rows and columns • Tuple, Row, or Record - a row from the microdata • Attribute or Field - a column from the microdata • Inference - a belief in a new fact based on other information • Disclosure - explicit or inferable information about a person

  21. Attribute Classification • I1, I2,..., Im - identifier attributes • Ex: Name and SSN • Found in the initial microdata (IM) only • Information that leads to a specific entity • K1, K2,..., Kp - key attributes (quasi-identifiers) • Ex: Zip Code and Age • Found in both IM and the masked microdata (MM) • May be known by an intruder • S1, S2,..., Sq - confidential attributes • Ex: Principal Diagnosis and Annual Income • Found in IM and MM • Assumed to be unknown to an intruder

  22. Attribute Types • Identifier, Key (Quasi-Identifiers) and Confidential Attributes

  23. Motivating Example (figure: the original data are modified before release)

  24. Motivating Example (continued) • Published data: Alice publishes the data without the Name column • Attacker's knowledge: voter registration list
  # | Name  | Zip   | Age | Nationality
  1 | John  | 13067 | 45  | US
  2 | Paul  | 13067 | 22  | US
  3 | Bob   | 13067 | 29  | US
  4 | Chris | 13067 | 23  | US

  25. Motivating Example (continued) • Joining the published data with the voter registration list on the quasi-identifiers (Zip, Age) reveals the individuals behind the records: a data leak!

  26. Data Re-identification • Released data containing (Disease, Birthdate, Zip, Sex) can be linked with an external source containing (Name, Birthdate, Zip, Sex) • 87% of the population in the USA can be uniquely identified using Birthdate, Sex, and Zipcode

  27. Source of the Problem • Even if we do not publish the identifiers, some fields (quasi-identifiers) may still uniquely identify some individuals • The attacker can use them to join with other sources and identify the individuals, as in the sketch below
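A minimal sketch of such a join in Python, using the voter registration list from the motivating example and a hypothetical published table (the disease values and variable names are invented for illustration):

    # Linking a public voter list with a "de-identified" release on the
    # quasi-identifier (Zip, Age). Disease values are hypothetical.
    voter_list = [
        {"name": "John",  "zip": "13067", "age": 45},
        {"name": "Paul",  "zip": "13067", "age": 22},
        {"name": "Bob",   "zip": "13067", "age": 29},
        {"name": "Chris", "zip": "13067", "age": 23},
    ]
    published = [                      # identifiers removed, quasi-identifiers kept
        {"zip": "13067", "age": 29, "disease": "AIDS"},
        {"zip": "13067", "age": 45, "disease": "Flu"},
    ]
    for record in published:
        matches = [v["name"] for v in voter_list
                   if (v["zip"], v["age"]) == (record["zip"], record["age"])]
        if len(matches) == 1:          # a unique match re-identifies the individual
            print(matches[0], "has", record["disease"])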

  28. K-Anonymity Definition • The k-anonymity property is satisfied for a masked microdata (MM) with respect to a quasi-identifier set (QID) if every count in the frequency set of MM with respect to QID is greater than or equal to k

  29. K-Anonymity Example • QID = { Age, Zip, Sex } • SELECT COUNT(*) FROM Patient GROUP BY Sex, Zip, Age; • If the results include groups with a count less than k, the relation Patient does not have the k-anonymity property with respect to QID.
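The same check can be expressed directly over the records; a small Python sketch (the function name and the toy data are illustrative, not the Patient table from the slides):

    from collections import Counter

    def is_k_anonymous(records, qid, k):
        # Count each quasi-identifier combination (the "frequency set") and
        # require every count to be >= k, mirroring the GROUP BY query above.
        freq = Counter(tuple(r[a] for a in qid) for r in records)
        return all(count >= k for count in freq.values())

    patients = [
        {"age": 29, "zip": "13067", "sex": "M"},
        {"age": 29, "zip": "13067", "sex": "M"},
        {"age": 45, "zip": "13068", "sex": "F"},
    ]
    print(is_k_anonymous(patients, qid=("sex", "zip", "age"), k=2))  # False: one group of size 1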

  30. Outline Problem definition K-Anonymity Property Disclosure Control Methods Attacks against k-Anonymity

  31. Disclosure Control Techniques • Remove Identifiers • Generalization • Suppression • Sampling • Microaggregation • Perturbation / randomization • Rounding • Data Swapping • Etc.

  32. Disclosure Control Techniques • Different disclosure control techniques are applied to the following initial microdata:

  33. Remove Identifiers • Identifiers such as Names, SSN etc. are removed

  34. Sampling • Sampling is the disclosure control method in which only a subset of the records is released • If n is the number of records in the initial microdata and t the number of released records, we call sf = t / n the sampling factor • Simple random sampling is the most frequently used technique: each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample
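A minimal sketch of simple random sampling with a given sampling factor (the function name is illustrative):

    import random

    def release_sample(records, sampling_factor, seed=None):
        # Simple random sampling: each record has an equal chance of inclusion;
        # sf = t / n, where t is the number of released records.
        rng = random.Random(seed)
        t = round(sampling_factor * len(records))
        return rng.sample(records, t)

    released = release_sample(list(range(100)), sampling_factor=0.4, seed=1)
    print(len(released))               # 40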

  35. Microaggregation • Order the records of the initial microdata by an attribute, create groups of consecutive values, and replace those values by the group average • Example: microaggregation for attribute Income with minimum group size 3 • The total sum of all Income values remains the same
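A small sketch of this procedure, assuming a single numeric attribute and a fixed minimum group size (the helper name is illustrative):

    def microaggregate(records, attribute, min_size=3):
        # Order by the attribute, form groups of at least `min_size` consecutive
        # records, and replace each value with its group average; the total sum
        # of the attribute is unchanged.
        ordered = sorted(records, key=lambda r: r[attribute])
        groups = [ordered[i:i + min_size] for i in range(0, len(ordered), min_size)]
        if len(groups) > 1 and len(groups[-1]) < min_size:
            groups[-2].extend(groups.pop())        # fold a short tail into the previous group
        for group in groups:
            avg = sum(r[attribute] for r in group) / len(group)
            for r in group:
                r[attribute] = avg
        return ordered

    incomes = [{"income": v} for v in (100, 30, 20, 10, 120, 110)]
    microaggregate(incomes, "income")              # values become 20, 20, 20, 110, 110, 110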

  36. Data Swapping • In this disclosure control method, a sequence of so-called elementary swaps is applied to the microdata • An elementary swap consists of two actions: • A random selection of two records i and j from the microdata • A swap (interchange) of the values of the attribute being swapped for records i and j
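The two actions translate directly into code; a minimal sketch (function names are illustrative):

    import random

    def elementary_swap(records, attribute, rng):
        # Randomly select two records i and j and interchange their values
        # of the swapped attribute.
        i, j = rng.sample(range(len(records)), 2)
        records[i][attribute], records[j][attribute] = records[j][attribute], records[i][attribute]

    def data_swapping(records, attribute, n_swaps, seed=None):
        # Apply a sequence of elementary swaps to the microdata.
        rng = random.Random(seed)
        for _ in range(n_swaps):
            elementary_swap(records, attribute, rng)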

  37. Generalization and Suppression • Generalization: replace the value with a less specific but semantically consistent value • Suppression: do not release a value at all

  38. Domain and Value Generalization Hierarchies • Zip: Z0 = {41075, 41076, 41095, 41099} → Z1 = {4107*, 4109*} → Z2 = {410**} • Sex: S0 = {Male, Female} → S1 = {Person}
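A minimal encoding of these two hierarchies as helper functions (assuming 5-digit zip codes; the names are illustrative):

    # Zip codes are generalized by replacing trailing digits with '*';
    # Sex generalizes to "Person".
    def generalize_zip(zipcode, level):
        # level 0 -> Z0 (exact), 1 -> Z1 (e.g. 4107*), 2 -> Z2 (410**)
        return zipcode if level == 0 else zipcode[:len(zipcode) - level] + "*" * level

    def generalize_sex(sex, level):
        # level 0 -> S0 ({Male, Female}), 1 -> S1 (Person)
        return sex if level == 0 else "Person"

    print(generalize_zip("41075", 1))   # 4107*
    print(generalize_zip("41075", 2))   # 410**
    print(generalize_sex("Female", 1))  # Person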

  39. Generalization Lattice • Nodes are combinations <Si, Zj> of generalization levels, each labeled with its distance vector [i, j]: • <S0, Z0> [0, 0] (no generalization) • <S1, Z0> [1, 0] and <S0, Z1> [0, 1] • <S1, Z1> [1, 1] and <S0, Z2> [0, 2] • <S1, Z2> [1, 2] (fully generalized) • where Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}, S0 = {Male, Female}, S1 = {Person}
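A sketch of how the lattice can be searched bottom-up for the least generalized node that yields k-anonymity, reusing the generalize_sex / generalize_zip helpers sketched above (a naive enumeration for illustration, not the full Samarati algorithm):

    from collections import Counter
    from itertools import product

    def minimal_generalization(records, k):
        # Visit lattice nodes <Si, Zj> in order of increasing distance i + j and
        # return the distance vector of the first node whose table is k-anonymous.
        for i, j in sorted(product(range(2), range(3)), key=sum):
            groups = Counter(
                (generalize_sex(r["sex"], i), generalize_zip(r["zip"], j))
                for r in records
            )
            if all(count >= k for count in groups.values()):
                return [i, j]
        return None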

  40. Generalization Tables

  41. Outline Problem definition K-Anonymity Property Disclosure Control Methods Attacks and problems with k-Anonymity

  42. Attacks against k-Anonymity • Unsorted Matching Attack • Complementary Release Attack • Temporal Attack • Homogeneity Attack for Attribute Disclosure

  43. Unsorted Matching Attack • The attack exploits the tuple order: if several releases preserve the original order of the rows, they can be matched position by position • Solution - random shuffling of the rows before release

  44. Complementary Release Attack

  45. Temporal Attack
  PTt1 (table at time t1):
  Race  | Birthdate | Sex  | Zip   | Problem
  black | 9/7/65    | male | 02139 | headache
  black | 11/4/65   | male | 02139 | rash
  GTt1 (generalized release of PTt1, birthdate generalized to year):
  black | 1965      | male | 02139 | headache
  black | 1965      | male | 02139 | rash

  46. Homogeneity Attack • k-Anonymity can create groups that leak information due to lack of diversity in the sensitive attribute.

  47. Coming up • Algorithms on k-anonymity • Improved principles on k-anonymity
