
Beyond k-Anonymity


Presentation Transcript


  1. Beyond k-Anonymity Arik Friedman November 2008 Seminar in Databases (236826)

  2. Outline • Recap – privacy and k-anonymity • l-diversity (beyond k-anonymity) • t-closeness (beyond k-anonymity and l-diversity) • Privacy?

  3. Recap – k-Anonymity • Goal: use medical data without disclosing patients’ identity • Voter list: Name, Address, Date registered, Party affiliation, Date last voted • Medical data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge • Quasi-identifier (shared by both sources): Zip, Birthdate, Gender • The problem: an attacker can cross (link) the released data with external data
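To make the linking attack concrete, here is a minimal sketch (not part of the original slides) that joins a toy voter list with a "de-identified" medical table on the shared quasi-identifier; all records and field names below are invented for illustration.

    # Toy linking attack: join two tables on the shared quasi-identifier.
    # All data and field names are made up for illustration.
    voter_list = [
        {"name": "Alice Smith", "zip": "02138", "birthdate": "1960-07-31", "gender": "F"},
        {"name": "John Doe",    "zip": "02139", "birthdate": "1955-02-14", "gender": "M"},
    ]
    medical_data = [  # names removed, but the quasi-identifier is intact
        {"zip": "02138", "birthdate": "1960-07-31", "gender": "F", "diagnosis": "cancer"},
        {"zip": "02139", "birthdate": "1955-02-14", "gender": "M", "diagnosis": "flu"},
    ]
    QI = ("zip", "birthdate", "gender")

    def link(voters, medical, qi=QI):
        """Re-identify medical records by joining on the quasi-identifier."""
        index = {tuple(rec[a] for a in qi): rec for rec in medical}
        for v in voters:
            match = index.get(tuple(v[a] for a in qi))
            if match is not None:
                yield v["name"], match["diagnosis"]

    print(list(link(voter_list, medical_data)))
    # [('Alice Smith', 'cancer'), ('John Doe', 'flu')]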

  4. k-Anonymity – Formal Definition • RT – released table • (A1, A2, …, An) – attributes of RT • QI_RT – quasi-identifier of RT • RT[QI_RT] – projection of RT onto QI_RT • Definition: RT satisfies k-anonymity with respect to QI_RT iff each sequence of values in RT[QI_RT] appears at least k times in RT[QI_RT]
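As a small illustration of the definition (this helper and the toy table are mine, not from the slides), a table is k-anonymous with respect to its quasi-identifier if every combination of quasi-identifier values occurs at least k times:

    from collections import Counter

    def is_k_anonymous(table, qi, k):
        """True iff every combination of quasi-identifier values in the
        released table appears at least k times (the k-anonymity condition)."""
        counts = Counter(tuple(row[a] for a in qi) for row in table)
        return all(c >= k for c in counts.values())

    # Hypothetical released rows with QI = (zip, age_group, nationality)
    rt = [
        {"zip": "130**", "age_group": "<30", "nationality": "*", "condition": "heart disease"},
        {"zip": "130**", "age_group": "<30", "nationality": "*", "condition": "viral infection"},
    ]
    print(is_k_anonymous(rt, ("zip", "age_group", "nationality"), k=2))  # True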

  5. Example – original data

  6. Example - 4-anonymized Table

  7. Example - 4-anonymized Table

  8. Example - 4-anonymized Table

  9. Example - 4-anonymized Table We have 4-anonymity!!! We have privacy!!!! Or do we?

  10. Example - 4-anonymized Table • Suppose the attacker knows the non-sensitive attributes of Umeko and Bob, and the fact that Japanese have a very low incidence of heart disease

  11. Example - 4-anonymized Table • Suppose the attacker knows the non-sensitive attributes of Umeko and Bob, and the fact that Japanese have a very low incidence of heart disease • Then: Umeko has viral infection! Bob has cancer!

  12. k-Anonymity Drawbacks • Basic reasons for the leak: • Sensitive attributes lack diversity in values • Homogeneity attack • Attacker has additional background knowledge • Background knowledge attack • Hence a new solution has been proposed in addition to k-anonymity: l-diversity

  13. Adversary’s background knowledge • Has access to published table T* and knows that it is a generalization of some base table T • Instance-level background knowledge: • Some individuals are present in the table. • Knowledge about sensitive attributes of specific individuals. • Demographic background knowledge • Partial knowledge about the distribution of sensitive and non-sensitive attributes in the population. • Diversity in the sensitive attribute values should mitigate both!

  14. Some notation… • T = {t1, t2, …, tn}: • A table with attributes A1, A2, …, Am • A subset of some population Ω • t[C] = (t[C1], t[C2], …, t[Cp]): • Projection of t onto a set of attributes C ⊆ A • S ⊆ A – sensitive attributes • QI ⊆ A – quasi-identifier attributes • T* – anonymized table • q*-block – the set of records that were generalized to the same value q* in T*

  15. Bayes Optimal Privacy • Ideal notion of privacy: models background knowledge as a probability distribution over attributes • Uses Bayesian inference techniques • Simplifying assumptions: • A single, multi-dimensional quasi-identifier attribute Q • A single sensitive attribute S • T is a simple random sample from the population Ω • Adversary Alice knows the complete joint distribution f of Q and S (worst-case assumption)

  16. Bayes Optimal Privacy • Assume Bob appears in the generalized table T*. • Alice’s prior belief about Bob’s sensitive attribute: α(q,s) = P_f( t[S] = s | t[Q] = q ) • After seeing T*, Alice’s belief changes to its posterior value (or observed belief): β(q,s,T*) = P_f( t[S] = s | t[Q] = q ∧ ∃t* ∈ T* such that t* generalizes t ) • We wouldn’t want Alice to learn “much”: α(q,s) ≈ β(q,s,T*)

  17. Bayes Optimal Privacy - Example • Bob, Alice’s neighbor, is a 62-year-old state employee. • Alice’s prior belief: 10% of men over 60 have cancer: α(⟨age > 60, ZIP code = 02138⟩, cancer) = α(age > 60, cancer) = 0.1 • In the k-anonymized GIC data T*, the following lines could relate to Bob: [table excerpt] • Alice’s belief changes to its posterior value: β(⟨age > 60, ZIP code = 02138⟩, cancer, T*) = 0.5

  18. Bayes Optimal Privacy • Theorem 3.1 (observed belief, as stated in the l-diversity paper): β(q,s,T*) = [ n(q*,s) · f(s|q) / f(s|q*) ] / [ Σ_{s'∈S} n(q*,s') · f(s'|q) / f(s'|q*) ], where n(q*,s') is the number of tuples in T* with t*[Q] = q* and t*[S] = s', and f(s'|q*) is the adversary’s prior for s' conditioned on the quasi-identifier generalizing to q*
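The sketch below evaluates an observed belief of this form; it is my illustration of the formula above with made-up numbers, and the adversary's priors f(·|q) and f(·|q*) are simply passed in as dictionaries.

    def observed_belief(s, counts, prior_q, prior_qstar):
        """beta(q, s, T*): posterior belief that an individual with
        quasi-identifier q has sensitive value s, given the q*-block counts.
        counts[v]      = n(q*, v), tuples in the q*-block with value v
        prior_q[v]     = f(v | q), prior for the exact quasi-identifier q
        prior_qstar[v] = f(v | q*), prior conditioned on the generalized q*"""
        def weight(v):
            return counts[v] * prior_q[v] / prior_qstar[v]
        return weight(s) / sum(weight(v) for v in counts)

    # Toy q*-block with 2 cancer and 2 flu tuples and an uninformative prior:
    counts = {"cancer": 2, "flu": 2}
    uniform = {"cancer": 0.5, "flu": 0.5}
    print(observed_belief("cancer", counts, uniform, uniform))  # 0.5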

  19. Privacy principles • Positive disclosure: the adversary can correctly identify the value of a sensitive attribute: ∃q,s such that β(q,s,T*) > 1 − δ for a given δ • Negative disclosure: the adversary can correctly eliminate a possible value of a sensitive attribute: β(q,s,T*) < ε for a given ε, and ∃t ∈ T such that t[Q] = q but t[S] ≠ s

  20. Privacy principles • Note: not all positive and negative disclosures are bad • If Alice already knew that Bob has cancer, there is not much one can do! • Uninformative principle: there should not be a large difference between the prior and posterior beliefs

  21. Bayes Optimal Privacy • Limitations in practice: • Insufficient knowledge: the data publisher is unlikely to know f • The publisher does not know how much the adversary actually knows • The adversary may have instance-level knowledge • No way to model non-probabilistic knowledge • Multiple adversaries may have different levels of knowledge • Hence a practical definition is needed

  22. l-diversity principle • Revisit: by Theorem 3.1, positive disclosure can occur when, for some value s, n(q*,s') · f(s'|q) ≈ 0 for every other sensitive value s' ≠ s, so that β(q,s,T*) ≈ 1

  23. l-diversity principle • This could occur due to a combination of: • Lack of diversity in the sensitive values • Strong background knowledge • Mitigation: require l “well-represented” sensitive values in each q*-block • Then at least l − 1 damaging pieces of background knowledge are required for the attack to succeed

  24. l-diversity principle • A q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute S. • A table is l-diverse if every q*-block is l-diverse. • Example – distinct l-diversity: there are at least l distinct values for the sensitive attribute in each q*-block.
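A minimal check for the distinct variant (my sketch, assuming the table is a list of dicts as in the earlier examples):

    from collections import defaultdict

    def is_distinct_l_diverse(table, qi, sensitive, l):
        """True iff every q*-block contains at least l distinct sensitive values."""
        blocks = defaultdict(set)
        for row in table:
            blocks[tuple(row[a] for a in qi)].add(row[sensitive])
        return all(len(values) >= l for values in blocks.values())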

  25. Example – 3-distinct diverse Table We have 3-distinct diversity!!! We have privacy!!!! Or do we?

  26. Example - 3-distinct diverse table • Suppose the attacker knows the non-sensitive attributes of Umeko, and the fact that Japanese have a very low incidence of heart disease • It is still very likely that Umeko has viral infection!

  27. Entropy l-diversity • A table is entropy l-diverse if for every q*-block: −Σ_{s∈S} p(q*,s) · log( p(q*,s) ) ≥ log(l), where p(q*,s) = n(q*,s) / Σ_{s'∈S} n(q*,s') is the fraction of tuples in the q*-block with sensitive value s • [Figure: entropy as a function of the mix of 2 sensitive attribute values] • Not feasible when one value is very common
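A sketch of the entropy check for a single q*-block (mine, not from the slides); it follows the condition above directly.

    import math
    from collections import Counter

    def is_entropy_l_diverse(block_sensitive_values, l):
        """-sum_s p(q*,s) * log p(q*,s) >= log(l) for one q*-block."""
        counts = Counter(block_sensitive_values)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        return entropy >= math.log(l)

    print(is_entropy_l_diverse(["flu", "flu", "cancer", "ulcer"], 2))  # True
    print(is_entropy_l_diverse(["flu"] * 9 + ["cancer"], 2))           # False: one value dominates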

  28. Recursive (c,l)-diversity • None of the sensitive values should occur too frequently. • Let r_i be the count of the i-th most frequent sensitive value in a q*-block • Given a constant c, recursive (c,l)-diversity is satisfied if r_1 < c · ( r_l + r_{l+1} + … + r_m ) • For example, with 3 sensitive values (m = 3): • (2,2)-diversity: r_1 < 2(r_2 + r_3) • (2,3)-diversity: r_1 < 2·r_3 • Equivalently: even if we eliminate one sensitive value, we still have (2,2)-diversity
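And a corresponding sketch of the recursive check for one q*-block (again my own illustration, with a toy block):

    from collections import Counter

    def is_recursive_cl_diverse(block_sensitive_values, c, l):
        """r_1 < c * (r_l + ... + r_m), with r_i the count of the
        i-th most frequent sensitive value in the q*-block."""
        r = sorted(Counter(block_sensitive_values).values(), reverse=True)
        if len(r) < l:
            return False  # fewer than l distinct values cannot satisfy the condition
        return r[0] < c * sum(r[l - 1:])

    print(is_recursive_cl_diverse(["flu", "flu", "cancer", "ulcer"], c=2, l=2))  # True: 2 < 2*(1+1)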

  29. An algorithm for l-diversity? • Monotonicity property: If T* preserves privacy, then so does every generalization of it • Satisfied by k-anonymity • Most k-anonymization algorithms work for any privacy measure that satisfies monotonicity - We can re-use previous algorithms directly • Bayes optimal privacy is not monotonic • l-diversity variants are monotonic!

  30. Example: Mondrian • Goal: entropy l-diverse partitions with l = 1.89 (for two sensitive attribute values, equivalent to limiting the prevalence of each value to at most 2/3; also equivalent to recursive (2,2)-diversity) • Mondrian(partition): • if (no allowable multidimensional cut for partition) return: partition → summary • else • dim ← choose_dimension() • fs ← frequency_set(partition, dim) • splitVal ← find_median(fs) • lhs ← {t ∈ partition : t.dim ≤ splitVal} • rhs ← {t ∈ partition : t.dim > splitVal} • return Mondrian(rhs) ∪ Mondrian(lhs) • [Figure: sample records plotted over the Age and Weight dimensions]
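Below is a runnable Python rendering of this pseudocode. It is a simplified sketch under my own assumptions: numeric quasi-identifier attributes, the dimension chosen by widest range, and an `allowable` predicate standing in for the k-anonymity or l-diversity test on a candidate cut.

    import statistics

    def spread(partition, dim):
        vals = [r[dim] for r in partition]
        return max(vals) - min(vals)

    def mondrian(partition, dims, allowable):
        """Greedy top-down Mondrian partitioning (simplified sketch).
        Returns a list of q*-blocks; each block would be generalized/summarized afterwards."""
        for dim in sorted(dims, key=lambda d: -spread(partition, d)):  # widest range first
            split_val = statistics.median(r[dim] for r in partition)
            lhs = [r for r in partition if r[dim] <= split_val]
            rhs = [r for r in partition if r[dim] > split_val]
            if lhs and rhs and allowable(lhs) and allowable(rhs):
                return mondrian(lhs, dims, allowable) + mondrian(rhs, dims, allowable)
        return [partition]  # no allowable multidimensional cut

    # Cut only if each side keeps >= 2 distinct diseases (distinct 2-diversity).
    def distinct_l(l, sensitive):
        return lambda part: len({r[sensitive] for r in part}) >= l

    records = [{"Age": a, "Weight": w, "Disease": d} for a, w, d in
               [(38, 55, "flu"), (42, 60, "cancer"), (47, 66, "cancer"),
                (55, 72, "flu"), (61, 80, "ulcer"), (66, 83, "ulcer")]]
    blocks = mondrian(records, dims=("Age", "Weight"), allowable=distinct_l(2, "Disease"))
    print([len(b) for b in blocks])  # [3, 3]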

  31. Experiments • Used Incognito (a popular generalization algorithm) • Adult dataset (census data) from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Adult) • [Table: Adult database attribute descriptions; the experimental results refer to the marked sensitive attribute]

  32. Experiments - Utility • Intuitively: the “usefulness” of the l-diverse and k-anonymized tables • Used k, l = 2, 4, 6, 8 • [Charts: number of generalization steps performed vs. k, l; average size of the generated q*-blocks (similar to C_AVG) vs. k, l]

  33. Example – 3-diverse Table We have 3-diversity!!! We have privacy!!!! Or do we?

  34. Similarity attack • l-diversity is insufficient to prevent attribute disclosure. • A 3-diverse patient table • Conclusion: Bob’s salary is in [20k, 40k], which is relatively low, and Bob has some stomach-related disease • l-diversity does not consider the semantic meanings of sensitive values

  35. Skewness attack • Two sensitive values in the population: Cancer (1%) and Healthy (99%) (entropy diversity ≈ 1.0576) • A q*-block can have higher entropy diversity than the overall distribution (e.g., an even Cancer/Healthy split has entropy diversity 2) and still reveal a lot: the attacker learned a lot! • Equivalent in terms of l-diversity, but very different semantically • [Pie charts of q*-block distributions, labeled with entropy diversity 2, 1.65, and 1.65]
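A tiny numeric illustration of the skewness problem (my numbers, not the ones on the original pie charts): a block that is more entropy-diverse than the whole table can still be a major disclosure.

    import math

    def entropy_bits(dist):
        """Shannon entropy (base 2) of {value: probability}."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    population = {"cancer": 0.01, "healthy": 0.99}  # overall distribution
    block      = {"cancer": 0.50, "healthy": 0.50}  # an evenly split q*-block

    print(entropy_bits(population))  # ~0.08 bits (entropy diversity ~1.06)
    print(entropy_bits(block))       # 1.0 bit (entropy diversity 2), yet anyone in
                                     # this block now has a 50% inferred cancer risk
                                     # instead of the 1% population baseline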

  36. t-Closeness: the main idea • Rationale: from external knowledge plus a completely generalized table, the adversary learns only the overall distribution Q of sensitive values

  37. t-Closeness: the main idea • Rationale: from external knowledge plus a released (partially generalized) table, the adversary additionally learns the distribution Pi of sensitive values in each equivalence class

  38. t-Closeness: the main idea • Observations: • Q should be treated as public • Knowledge gain comes in two parts: • About the whole population (from B0 to B1) • About specific individuals (from B1 to B2) • We bound the knowledge gain between B1 and B2 instead • Principle: the distance between Q and Pi should be bounded by a threshold t

  39. t-closeness • An equivalence class has t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. • A table has t-closeness if all of its equivalence classes have t-closeness. • The distance measure used is the Earth Mover’s Distance (EMD). • t-closeness maintains monotonicity!
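A minimal sketch of the check for a categorical sensitive attribute, assuming equal ground distance between any two distinct values (in that special case EMD reduces to total variation distance; the t-closeness paper also defines an ordered-distance EMD for numeric attributes such as Salary). The helpers and toy distributions are mine.

    def emd_equal_distance(p, q):
        """EMD between two distributions over the same categorical values when the
        ground distance between any two distinct values is 1 (total variation)."""
        values = set(p) | set(q)
        return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)

    def is_t_close(block_dist, table_dist, t):
        """An equivalence class has t-closeness if its sensitive-value distribution
        is within distance t of the whole-table distribution."""
        return emd_equal_distance(block_dist, table_dist) <= t

    whole_table = {"flu": 0.50, "cancer": 0.25, "ulcer": 0.25}
    block       = {"flu": 0.25, "cancer": 0.50, "ulcer": 0.25}
    print(emd_equal_distance(block, whole_table))  # 0.25
    print(is_t_close(block, whole_table, t=0.3))   # True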

  40. Example – t-closeness We have 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease!!! We have privacy!!!! Or do we?

  41. Netflix privacy breach (Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008) • Released for the Netflix Prize contest: • 17,770 movie titles • 480,189 users with random customer IDs • Ratings: 1–5 • For each movie we have the ratings: (MovieID, CustomerID, Rating, Date) • Re-arrange by CustomerID

  42. Netflix privacy breach (Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008) • The data can be linked, e.g., with IMDB data, to re-identify individuals! • [Example tables: IMDB data vs. Netflix data] • (This example is made up. Possibly, James Hitchcock has nothing to do with Netflix)

  43. Epilogue • “You have zero privacy anyway. Get over it.” Scott McNealy (Sun Microsystems CEO, January 1999)

  44. HIPAA excerpt (Health Insurance Portability and Accountability Act of 1996)

  45. Thank you!

  46. Bibliography • “Mondrian Multidimensional k-Anonymity”, K. LeFevre, D. J. DeWitt, R. Ramakrishnan, 2006 • “l-Diversity: Privacy Beyond k-Anonymity”, A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, 2006 • “t-Closeness: Privacy Beyond k-Anonymity and l-Diversity”, N. Li, T. Li, S. Venkatasubramanian, 2007 • Presentations: • “Privacy In Databases”, B. Aditya Prakash • “K-Anonymity and Other Cluster-Based Methods”, G. Ruan
