
K-Anonymity and Other Cluster-Based Methods

Presentation Transcript


  1. K-Anonymity and Other Cluster-Based Methods. Ge Ruan, Oct. 11, 2007

  2. Data Publishing and Data Privacy • Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. • This collected information is valuable for both research and business, so data sharing is common. • Publishing the data, however, may put the respondents’ privacy at risk. • Objective: • Maximize data utility while limiting disclosure risk to an acceptable level

  3. Related Works • Statistical Databases • The most common approach is to add noise while still maintaining some statistical invariant. Disadvantages: • destroys the integrity of the data

  4. Related Works (Cont’d) • Multi-level Databases • Data is stored at different security classifications and users have different security clearances. (Denning and Lunt) • Eliminating precise inference: sensitive information is suppressed, i.e. simply not released. (Su and Ozsoyoglu) Disadvantages: • It is impossible to anticipate every possible attack • Many data holders share the same data, but their concerns differ • Suppression can drastically reduce the quality of the data

  5. Related Works (Cont’d) • Computer Security • Access control and authentication ensure that the right people have the right access to the right objects at the right time and place. • That is not what we want here. A general doctrine of data privacy is to release as much information as possible, as long as the identities of the subjects (people) are protected.

  6. K-Anonymity Sweeney proposed a formal protection model named k-anonymity. • What is k-anonymity? • A release satisfies k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. • Ex. Suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release share that birth date and gender, the release satisfies k-anonymity.

  7. Classification of Attributes • Key Attributes: • Name, address, cell phone • Can uniquely identify an individual directly • Always removed before release • Quasi-Identifiers: • 5-digit ZIP code, birth date, gender • A set of attributes that can potentially be linked with external information to re-identify individuals • 87% of the U.S. population can be uniquely identified from these attributes, according to the Census summary data in 1991 • Suppressed or generalized

  8. Classification of Attributes (Cont’d) • Linking example: joining the Hospital Patient Data with the Voter Registration Data on the quasi-identifier attributes reveals that Andre has heart disease!

  9. Classification of Attributes (Cont’d) • Sensitive Attributes: • Medical record, wage, etc. • Always released directly. These attributes are what the researchers need; it depends on the requirements.

  10. K-Anonymity Protection Model • PT: private table • RT, GT1, GT2: released tables • QI: quasi-identifier (Ai,…,Aj) • (A1,A2,…,An): attributes • k-anonymity requirement: each sequence of values in RT[QI] appears with at least k occurrences in RT[QI] Lemma: if RT satisfies k-anonymity with respect to QI, then each value of any single attribute in QI also appears at least k times in RT[QI].
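A minimal sketch (not from the slides) of how the k-anonymity requirement can be checked in practice: count how often each combination of quasi-identifier values occurs in the released table. The attribute names and rows below are hypothetical.

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifier, k):
    """Return True if every combination of quasi-identifier values
    appears at least k times in the released table."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return all(c >= k for c in counts.values())

# Hypothetical released table: QI = (ZIP, Age), sensitive attribute = Disease.
released = [
    {"ZIP": "476**", "Age": "2*",   "Disease": "Heart Disease"},
    {"ZIP": "476**", "Age": "2*",   "Disease": "Flu"},
    {"ZIP": "4790*", "Age": ">=40", "Disease": "Cancer"},
    {"ZIP": "4790*", "Age": ">=40", "Disease": "Flu"},
]
print(is_k_anonymous(released, ("ZIP", "Age"), k=2))  # True
```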

  11. Attacks Against K-Anonymity • Unsorted Matching Attack • This attack is based on the order in which tuples appear in the released table. • Solution: • Randomly sort the tuples before releasing.

  12. Attacks Against K-Anonymity (Cont’d) • Complementary Release Attack • Different releases can be linked together to compromise k-anonymity. • Solution: • Consider all previously released tables before releasing a new one, and try to avoid linking. • Other data holders may release data that can be used in this kind of attack; in general, this kind of attack is hard to prevent completely.

  13. Attacks Against K-Anonymity(Cont’d) • Complementary Release Attack (Cont’d)

  14. Attacks Against K-Anonymity(Cont’d) • Complementary Release Attack (Cont’d)

  15. Attacks Against K-Anonymity (Cont’d) • Temporal Attack • Tuples added to or removed from the table over time may allow later releases to be linked with earlier ones, compromising k-anonymity protection.

  16. Attacks Against K-Anonymity (Cont’d) • k-Anonymity does not provide privacy if: • Sensitive values in an equivalence class lack diversity (homogeneity attack) • The attacker has background knowledge (background knowledge attack) • Both attacks are illustrated on a 3-anonymous patient table. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

  17. l-Diversity • Distinct l-diversity • Each equivalence class has at least l well-represented sensitive values • Limitation: • Doesn’t prevent probabilistic inference attacks • Ex. An equivalence class contains ten tuples. In the “Disease” column, one is “Cancer”, one is “Heart Disease” and the remaining eight are “Flu”. This satisfies distinct 3-diversity, but the attacker can still assert that the target person’s disease is “Flu” with 80% confidence. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

  18. l-Diversity (Cont’d) • Entropy l-diversity • Each equivalence class must not only have enough different sensitive values, but the sensitive values must also be distributed evenly enough. • Formally, the entropy of the distribution of sensitive values in each equivalence class E must be at least log(l): -Σs p(E,s)·log p(E,s) ≥ log(l), where p(E,s) is the fraction of tuples in E with sensitive value s. • Sometimes this may be too restrictive: when some values are very common, the entropy of the entire table may be very low. This leads to the less conservative recursive notion of l-diversity. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

  19. l-Diversity (Cont’d) • Recursive (c,l)-diversity • The most frequent sensitive value must not appear too frequently: let r1 ≥ r2 ≥ … ≥ rm be the counts of the m sensitive values in an equivalence class, sorted in decreasing order; the class satisfies recursive (c,l)-diversity if r1 < c·(rl + r(l+1) + … + rm). A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
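The three l-diversity notions above can be stated as simple checks on the multiset of sensitive values in one equivalence class. The sketch below (my own illustration, not from the slides or the paper) applies them to the ten-tuple Flu example from slide 17.

```python
import math
from collections import Counter

def distinct_l_diversity(values, l):
    """Distinct l-diversity: the class contains at least l distinct sensitive values."""
    return len(set(values)) >= l

def entropy_l_diversity(values, l):
    """Entropy l-diversity: entropy of the sensitive-value distribution is >= log(l)."""
    n = len(values)
    probs = [c / n for c in Counter(values).values()]
    return -sum(p * math.log(p) for p in probs) >= math.log(l)

def recursive_cl_diversity(values, c, l):
    """Recursive (c,l)-diversity: r1 < c*(rl + r(l+1) + ... + rm),
    where r1 >= r2 >= ... >= rm are the sorted counts of the sensitive values."""
    r = sorted(Counter(values).values(), reverse=True)
    return len(r) >= l and r[0] < c * sum(r[l - 1:])

# Equivalence class from slide 17: one Cancer, one Heart Disease, eight Flu.
cls = ["Cancer", "Heart Disease"] + ["Flu"] * 8
print(distinct_l_diversity(cls, 3))       # True:  three distinct values
print(entropy_l_diversity(cls, 3))        # False: distribution is far from uniform
print(recursive_cl_diversity(cls, 3, 2))  # False: 8 is not < 3*(1+1)
```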

  20. Limitations of l-Diversity l-diversity may be difficult and unnecessary to achieve. • A single sensitive attribute • Two values: HIV positive (1%) and HIV negative (99%) • Very different degrees of sensitivity • l-diversity is unnecessary to achieve • 2-diversity is unnecessary for an equivalence class that contains only negative records • l-diversity is difficult to achieve • Suppose there are 10000 records in total • To have distinct 2-diversity, every equivalence class needs at least one positive record, so there can be at most 10000*1% = 100 equivalence classes

  21. Limitations of l-Diversity (Cont’d) l-diversity is insufficient to prevent attribute disclosure. Skewness Attack • Two sensitive values • HIV positive (1%) and HIV negative (99%) • Serious privacy risk • Consider an equivalence class that contains an equal number of positive and negative records: anyone in it is considered 50% likely to be HIV positive, compared with 1% in the overall population • l-diversity does not differentiate: • Equivalence class 1: 49 positive + 1 negative • Equivalence class 2: 1 positive + 49 negative l-diversity does not consider the overall distribution of sensitive values

  22. Limitations of l-Diversity (Cont’d) l-diversity is insufficient to prevent attribute disclosure. Similarity Attack • In a 3-diverse patient table, the equivalence class containing Bob has salaries that all fall in a narrow low range and diseases that are all stomach-related. Conclusion • Bob’s salary is in [20k,40k], which is relatively low. • Bob has some stomach-related disease. l-diversity does not consider the semantic meanings of sensitive values

  23. t-Closeness: A New Privacy Measure • Rationale • Starting from external knowledge only, an observer who sees a completely generalized table learns the overall distribution Q of sensitive values.

  24. t-Closeness: A New Privacy Measure • Rationale • An observer who sees the actually released table additionally learns the distribution Pi of sensitive values in each equivalence class.

  25. t-Closeness: A New Privacy Measure • Rationale • Observations • Q should be treated as public information • Knowledge gain comes in two parts: • About the whole population (from belief B0 to B1, by learning Q) • About specific individuals (from B1 to B2, by learning Pi) • We bound the knowledge gain between B1 and B2 instead • Principle • The distance between Q and Pi should be bounded by a threshold t.
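As an illustration of the principle (a sketch under assumed table and attribute names, not code from the paper), t-closeness can be checked by computing the distance between each equivalence class's sensitive-value distribution Pi and the overall distribution Q; any distance function on distributions can be plugged in, for example one of the EMD variants sketched after slides 29 and 32.

```python
from collections import defaultdict

def satisfies_t_closeness(table, quasi_identifier, sensitive, t, dist):
    """The t-closeness principle: for every equivalence class, the distance
    between its sensitive-value distribution Pi and the overall distribution Q
    must be at most the threshold t."""
    domain = sorted({row[sensitive] for row in table})

    def distribution(rows):
        counts = {v: 0 for v in domain}
        for r in rows:
            counts[r[sensitive]] += 1
        return [counts[v] / len(rows) for v in domain]

    Q = distribution(table)                      # overall distribution
    classes = defaultdict(list)                  # group rows by QI values
    for r in table:
        classes[tuple(r[a] for a in quasi_identifier)].append(r)
    return all(dist(distribution(rows), Q) <= t for rows in classes.values())
```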

  26. Distance Measures • P=(p1,p2,…,pm), Q=(q1,q2,…,qm) • Trace (variational) distance: D[P,Q] = (1/2)·Σi |pi − qi| • KL-divergence: D[P,Q] = Σi pi·log(pi/qi) • Neither measure reflects the semantic distance among values. • Q:{3K,4K,5K,6K,7K,8K,9K,10K,11K} P1:{3K,4K,5K} P2:{6K,8K,11K} • Intuitively, D[P1,Q] > D[P2,Q], because P1’s values are all concentrated at the low end while P2’s are spread across the range • Define a ground distance for each pair of values; D[P,Q] should depend on these ground distances.
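A small sketch (illustration only, not from the slides) of the two measures on the salary example. Both assign exactly the same distance to P1 and P2, which is why neither captures the intuition that P1 is "farther" from Q than P2.

```python
import math

def variational_distance(P, Q):
    """Trace (variational) distance: half the L1 distance between the distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def kl_divergence(P, Q):
    """Kullback-Leibler divergence D(P || Q); terms with p = 0 contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Salary domain {3k,...,11k}: Q is uniform, P1 = {3k,4k,5k}, P2 = {6k,8k,11k}.
Q  = [1/9] * 9
P1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]   # mass on 3k, 4k, 5k
P2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]   # mass on 6k, 8k, 11k
print(variational_distance(P1, Q), variational_distance(P2, Q))  # 0.667 and 0.667
print(kl_divergence(P1, Q), kl_divergence(P2, Q))                # 1.099 and 1.099
```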

  27. Earth Mover’s Distance • Formulation • P=(p1,p2,…,pm), Q=(q1,q2,…,qm) • dij: the ground distance between element i of P and element j of Q • Find a flow F=[fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work WORK(P,Q,F) = Σi Σj dij·fij subject to the constraints: • fij ≥ 0 for all i, j • pi − Σj fij + Σj fji = qi for all i • Σi Σj fij = Σi pi = Σi qi = 1 • D[P,Q] is the minimum work over all feasible flows.

  28. Earth Mover’s Distance • Example • P1={3k,4k,5k} and Q={3k,4k,5k,6k,7k,8k,9k,10k,11k} • Move 1/9 probability mass for each of the following pairs • 3k→6k, 3k→7k cost: 1/9*(3+4)/8 • 4k→8k, 4k→9k cost: 1/9*(4+5)/8 • 5k→10k, 5k→11k cost: 1/9*(5+6)/8 • Total cost: 1/9*27/8 = 0.375 • With P2={6k,8k,11k}, the total cost is 0.167 < 0.375. This makes more sense than the other two distance measures.

  29. How to Calculate EMD • EMD for numerical attributes • Ordered distance: the ground distance between the i-th and j-th values of the ordered domain is |i − j|/(m − 1) • Ordered distance is a metric • Non-negativity, symmetry, triangle inequality • Let ri = pi − qi; then D[P,Q] is calculated as: D[P,Q] = (1/(m−1))·(|r1| + |r1+r2| + … + |r1+r2+…+r(m−1)|)
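The cumulative-sum formula above is easy to implement. The sketch below (my own illustration, not the authors' code) reproduces the numbers from slide 28.

```python
def emd_ordered(P, Q):
    """EMD with ordered ground distance |i-j|/(m-1): the sum of the absolute
    cumulative differences r1, r1+r2, ..., divided by m-1."""
    m = len(P)
    cumulative, total = 0.0, 0.0
    for p, q in zip(P, Q):
        cumulative += p - q
        total += abs(cumulative)
    return total / (m - 1)

# Slide 28 example: P1 = {3k,4k,5k}, P2 = {6k,8k,11k}, Q uniform over {3k,...,11k}.
Q  = [1/9] * 9
P1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]
P2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]
print(round(emd_ordered(P1, Q), 3))  # 0.375
print(round(emd_ordered(P2, Q), 3))  # 0.167
```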

  30. How to Calculate EMD • EMD for categorical attributes • Equal distance: the ground distance between any two distinct values is 1 • Equal distance is a metric • D[P,Q] is calculated as: D[P,Q] = (1/2)·Σi |pi − qi| (i.e., with equal ground distance the EMD coincides with the variational distance shown earlier)

  31. How to Calculate EMD (Cont’d) • EMD for categorical attributes • Hierarchical distance: the ground distance between two values v1 and v2 is level(v1,v2)/H, where level(v1,v2) is the height of their lowest common ancestor in the domain taxonomy tree and H is the height of the tree • Hierarchical distance is a metric

  32. How to Calculate EMD (Cont’d) • EMD for categorical attributes • D[P,Q] is calculated by moving surplus probability mass through the taxonomy tree: define extra(N) = pN − qN for a leaf N and extra(N) = sum of the children’s extras for an internal node; for an internal node let pos_extra(N) and neg_extra(N) be the sums of its children’s positive and (absolute) negative extras; then D[P,Q] = ΣN (height(N)/H)·min(pos_extra(N), neg_extra(N)), summed over all internal nodes N.
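A sketch of the hierarchical computation (my reconstruction from the formula above; the taxonomy tree and distributions are hypothetical): surplus mass is pushed up the tree, and each internal node contributes height(N)/H times the mass that must pass through it.

```python
def emd_hierarchical(P, Q, children, root, H):
    """EMD with hierarchical ground distance. `children` maps each internal
    node to its child nodes; leaves carry the probability mass of P and Q."""

    def extra(node):
        # Surplus (P minus Q) of the mass under this node.
        if node not in children:                       # leaf
            return P.get(node, 0.0) - Q.get(node, 0.0)
        return sum(extra(c) for c in children[node])

    def height(node):
        return 0 if node not in children else 1 + max(height(c) for c in children[node])

    def cost(node):
        # Mass that must cross this node pays height(node)/H per unit.
        if node not in children:
            return 0.0
        extras = [extra(c) for c in children[node]]
        pos = sum(e for e in extras if e > 0)
        neg = -sum(e for e in extras if e < 0)
        return height(node) / H * min(pos, neg) + sum(cost(c) for c in children[node])

    return cost(root)

# Hypothetical two-level disease taxonomy (tree height H = 2).
children = {"root": ["Respiratory", "Cancer"],
            "Respiratory": ["Flu", "Pneumonia", "Bronchitis"]}
Q = {"Flu": 0.5, "Pneumonia": 0.2, "Bronchitis": 0.2, "Cancer": 0.1}
P = {"Flu": 1.0}                                       # a class containing only Flu
print(emd_hierarchical(P, Q, children, "root", H=2))   # 0.3
```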

  33. Experiments • Goals • To show that l-diversity does not provide sufficient privacy protection (the similarity attack). • To show that the efficiency and data quality of using t-closeness are comparable with those of other privacy measures. • Setup • Adult dataset from the UC Irvine ML repository • 30162 tuples, 9 attributes (2 sensitive attributes) • Algorithm: Incognito

  34. Experiments • Similarity attack (Occupation) • 13 of 21 entropy 2-diversity tables are vulnerable • 17 of 26 recursive (4,4)-diversity tables are vulnerable • Comparisons of privacy measurements • k-Anonymity • Entropy l-diversity • Recursive (c,l)-diversity • k-Anonymity with t-closeness

  35. Experiments • Efficiency • The efficiency of using t-closeness is comparable with that of other privacy measures

  36. Experiments • Data utility • Discernibility metric; minimum average group size • The data quality of using t-closeness is comparable with that of other privacy measures

  37. Conclusion • Limitations of l-diversity • l-diversity is difficult and unnecessary to achieve • l-diversity is insufficient in preventing attribute disclosure • t-Closeness as a new privacy measure • The overall distribution of sensitive values should be public information • The separation of the knowledge gain • EMD to measure distance • EMD captures semantic distance well • Simple formulas for three ground distances

  38. Questions? Thank you!
