Anatomy: Simple and Effective Privacy Preservation

Anatomy: Simple and Effective Privacy Preservation • Xiaokui Xiao, Yufei Tao • Chinese University of Hong Kong

Privacy-preserving data publishing • Purposes of publishing microdata: • Allow researchers to effectively study the correlation between various attributes • Protect the privacy of every patient

A naïve solution: publish the table • It does not work. See next.

Inference attack • An adversary knows that Bob has been hospitalized before, is 23 years old, and lives in an area with zipcode 11000. • Such attributes of the published table are called quasi-identifier (QI) attributes.

Generalization • Transform each QI value into a less specific form, producing a generalized table. • How much generalization do we need?

l-diversity • A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group. • A table is l-diverse iff all of its QI-groups are l-diverse. • The example table on the slide, which consists of 2 QI-groups over the QI attributes and one sensitive attribute, is 2-diverse.
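The definition above can be checked mechanically. The following is a minimal sketch, assuming each QI-group is given simply as a list of its sensitive values; the function names and the sample disease values are hypothetical and only illustrate the "no more than m / l times" condition.

```python
from collections import Counter


def is_l_diverse_group(sensitive_values, l):
    """Return True if no sensitive value occurs more than m / l times."""
    m = len(sensitive_values)
    counts = Counter(sensitive_values)
    # Compare count * l <= m so that no floating-point division is needed.
    return all(count * l <= m for count in counts.values())


def is_l_diverse_table(groups, l):
    """A table is l-diverse iff every one of its QI-groups is l-diverse."""
    return all(is_l_diverse_group(group, l) for group in groups)


# Hypothetical sensitive values for two QI-groups of four tuples each.
group1 = ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"]
group2 = ["flu", "gastritis", "flu", "bronchitis"]

print(is_l_diverse_table([group1, group2], 2))  # True
print(is_l_diverse_table([group1, group2], 3))  # False
```

With these sample groups, the most frequent value in each group covers exactly half of the tuples, so the table is 2-diverse but not 3-diverse.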

What l-diversity guarantees • From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l. • Reference: A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

Defect of generalization • Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000] • Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode

Defect of generalization (cont.) • Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000] • p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the generalized rectangle of the QI-group and Q is the query rectangle • Estimated answer for query A: 2 * p = 0.1

Defect of generalization (cont.) • Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000] • Estimated answer from the generalized table: 0.1 • The exact answer should be: 1
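The estimate from the generalized table can be sketched as an area ratio under a uniformity assumption. The extents of R1 below (Age in [21, 60], Zipcode in [10001, 60000]) are assumptions made for illustration, since the generalized table itself is not reproduced in the text; the query extents come from query A, and the helper names are hypothetical.

```python
def overlap(lo1, hi1, lo2, hi2):
    """Length of the intersection of two closed intervals (0 if disjoint)."""
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))


def uniform_probability(rect, query):
    """Area(rect ∩ query) / Area(rect), assuming a uniform spread in rect.

    Each rectangle is a list of (low, high) pairs, one per dimension,
    and every dimension of rect is assumed to have positive extent.
    """
    p = 1.0
    for (r_lo, r_hi), (q_lo, q_hi) in zip(rect, query):
        p *= overlap(r_lo, r_hi, q_lo, q_hi) / (r_hi - r_lo)
    return p


# Assumed extents of the generalized QI-group R1: (Age, Zipcode).
R1 = [(21, 60), (10001, 60000)]
# Query rectangle Q taken from query A.
Q = [(0, 30), (10001, 20000)]

p = uniform_probability(R1, Q)
estimate = 2 * p  # two pneumonia tuples in the group
print(round(p, 2), round(estimate, 1))  # 0.05 0.1
```

Under these assumed extents the estimate is about 0.1, far from the exact answer of 1, because the uniformity assumption spreads the tuples over the whole generalized rectangle.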

Research Works on Generalization • V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002. • K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004. • R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005. • B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005. • K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005. • K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006. • D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006. • X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006. • K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006. • K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006. • J. Xu, W. Wang, J. Pei et al. Utility-Based Anonymization Using Local Recodings. KDD 2006. • …

Contributions • We propose an alternative to generalization, called Anatomy, which allows much more accurate data analysis while still preserving privacy. • We develop an algorithm for computing anatomized tables that • runs in linear I/O cost • (nearly) minimizes information loss

Outline • Basic Idea of Anatomy • Preserving Correlation • Algorithm for Anatomy • Experimental Results

Basic Idea of Anatomy • For a given microdata table, Anatomy releases two tables: a quasi-identifier table (QIT) and a sensitive table (ST).

Basic Idea of Anatomy (cont.) 1. Select a partition of the tuples (the example is a 2-diverse partition into QI group 1 and QI group 2)

Basic Idea of Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition


Privacy Preservation • From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.
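The construction of the two tables and the resulting confidence bound can be sketched as follows. This is a minimal in-memory illustration, assuming the QIT carries the QI values plus a group ID and the ST carries (group ID, sensitive value, count) rows; the function names and the sample tuples are hypothetical, with values chosen only to form a 2-diverse partition.

```python
from collections import Counter


def anatomize_tables(partition):
    """Build (QIT, ST) from a partition.

    partition: list of groups; each group is a list of
    (qi_values_tuple, sensitive_value) pairs.
    """
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        # QIT: one row per tuple, QI values plus the group ID.
        for qi_values, _ in group:
            qit.append(qi_values + (group_id,))
        # ST: one row per distinct sensitive value in the group, with its count.
        counts = Counter(sensitive for _, sensitive in group)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st


def max_confidence(st):
    """Largest fraction of any group occupied by a single sensitive value."""
    totals, largest = Counter(), Counter()
    for group_id, _, count in st:
        totals[group_id] += count
        largest[group_id] = max(largest[group_id], count)
    return max(largest[g] / totals[g] for g in totals)


# Illustrative partition: QI values are (age, zipcode).
partition = [
    [((23, 11000), "pneumonia"), ((27, 13000), "dyspepsia"),
     ((35, 59000), "dyspepsia"), ((59, 12000), "pneumonia")],
    [((61, 54000), "flu"), ((65, 25000), "gastritis"),
     ((65, 25000), "flu"), ((70, 30000), "bronchitis")],
]

qit, st = anatomize_tables(partition)
print(max_confidence(st))  # 0.5
```

An adversary who locates an individual in the QIT learns only the group ID, so the best guess for the sensitive value is the most frequent value in that group of the ST, which here has probability 1/2.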

Accuracy of Data Analysis • Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000] • The query is now answered from the quasi-identifier table (QIT) and the sensitive table (ST).

Accuracy of Data Analysis (cont.) • Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000] • 2 patients in the QI-group have contracted pneumonia • 2 out of the 4 patients in the group (t1, t2, t3, t4) satisfy the query conditions on Age and Zipcode • Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata
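The estimate above can be reproduced with a short sketch. The QIT and ST rows below are illustrative values chosen to agree with the numbers on the slide (2 pneumonia tuples, 2 of 4 group members matching the range conditions); the actual tables are not reproduced in the text, and the function name is hypothetical.

```python
# Illustrative QIT rows: (age, zipcode, group_id).
qit = [
    (23, 11000, 1), (27, 13000, 1), (35, 59000, 1), (59, 12000, 1),
    (61, 54000, 2), (65, 25000, 2), (65, 25000, 2), (70, 30000, 2),
]
# Illustrative ST rows: (group_id, disease, count).
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus closed ranges on age and zipcode."""
    estimate = 0.0
    for group_id in sorted({row[2] for row in qit}):
        rows = [row for row in qit if row[2] == group_id]
        # Exact number of group members satisfying the QI conditions.
        matching = sum(
            1 for age, zipcode, _ in rows
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        # Number of tuples in the group carrying the queried disease.
        with_disease = sum(
            count for g, value, count in st if g == group_id and value == disease
        )
        estimate += with_disease * matching / len(rows)
    return estimate


print(estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000)))  # 1.0
```

Because the QIT keeps the exact QI values, only the link between a tuple and its sensitive value is estimated, which is why the result is much closer to the true count than the area-based estimate from the generalized table.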

Outline • Basic Idea of Anatomy • Preserving Correlation • Algorithm for Anatomy • Experimental Results

Preserving Correlation • Let us first examine the correlation between Age and Disease in our running example. • Each tuple in the microdata can be mapped to a point in the (Age, Disease) domain. • For example, tuple t1 can be mapped to (23, pneumonia).

Preserving Correlation (cont.) • We model this tuple using a probability density function (pdf):

Preserving Correlation (cont.) • In the generalized table, the tuple becomes: • Its corresponding pdf becomes:

Preserving Correlation (cont.) • In the anatomized tables, the tuple becomes: • Its corresponding pdf becomes:

Preserving Correlation (cont.)

Quality Metric • For each approximated pdf, we measure its error from the original pdf by their “L2 distance”: • We aim at obtaining anatomized tables that minimize the following reconstruction error (RCE):
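Since the formulas themselves are not present in the text, the following is only a sketch of how such an error could be computed, under two stated assumptions: the error of one tuple is the squared L2 distance between its original and approximated distributions over the discrete sensitive values, and the RCE is the sum of these errors over all tuples. The function names and sample values are hypothetical.

```python
from collections import Counter


def tuple_error(true_value, group_values):
    """Squared L2 distance between the original and approximated distributions.

    Original: probability 1 on the tuple's true sensitive value.
    Approximated: each value weighted by its frequency in the QI-group.
    Assumes true_value occurs in group_values.
    """
    m = len(group_values)
    error = 0.0
    for value, count in Counter(group_values).items():
        original = 1.0 if value == true_value else 0.0
        approximated = count / m
        error += (approximated - original) ** 2
    return error


def reconstruction_error(partition):
    """Assumed RCE: sum of per-tuple errors over all groups of the partition."""
    return sum(
        tuple_error(value, group) for group in partition for value in group
    )


# Hypothetical sensitive values of one QI-group.
group1 = ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"]
print(tuple_error("pneumonia", group1))  # 0.5
print(reconstruction_error([group1]))    # 2.0
```

Under this assumed metric, the error of a tuple shrinks as its own sensitive value becomes more frequent in its group, which is in tension with the l-diversity requirement that no value be too frequent.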

Anatomize • An algorithm for computing anatomized tables that • runs in I/O cost linear in the cardinality n of the microdata table • minimizes the RCE when n is a multiple of l, and otherwise achieves an RCE that is higher than the lower bound by a factor of at most 1 + 1/n
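The slide does not list the steps of Anatomize, so the sketch below should not be read as the authors' algorithm. It is one simple greedy, in-memory way to build an l-diverse partition, assuming tuples are bucketed by sensitive value and groups are formed by drawing one tuple from each of the l currently largest buckets; it makes no claim about I/O cost or RCE optimality, and all names are hypothetical.

```python
from collections import defaultdict


def greedy_l_diverse_partition(tuples, sensitive_index, l):
    """Greedily split tuples into groups with pairwise distinct sensitive values."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[sensitive_index]].append(t)

    groups = []
    # Form groups while at least l sensitive values still have tuples left.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [v for v in buckets if buckets[v]]
        largest = sorted(non_empty, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Place each leftover tuple in a group that lacks its sensitive value.
    for value, bucket in buckets.items():
        for t in bucket:
            for group in groups:
                if all(u[sensitive_index] != value for u in group):
                    group.append(t)
                    break
            else:
                raise ValueError("no l-diverse partition found by this greedy sketch")
    return groups


# Illustrative tuples: (age, zipcode, disease).
data = [
    (23, 11000, "pneumonia"), (27, 13000, "dyspepsia"),
    (35, 59000, "dyspepsia"), (59, 12000, "pneumonia"),
    (61, 54000, "flu"), (65, 25000, "gastritis"),
    (65, 25000, "flu"), (70, 30000, "bronchitis"),
]
groups = greedy_l_diverse_partition(data, sensitive_index=2, l=2)
print(len(groups), sorted(len(g) for g in groups))
```

Every group produced this way contains at least l tuples with pairwise distinct sensitive values, so each value appears once among m >= l tuples and the m / l condition holds.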

Experimental Settings • Goal: to compare the accuracy of data analysis on the generalized / anatomized tables. • Real dataset with 9 attributes: Age, Gender, Education, Marital-status, Race, Work-class, Country, Occupation, Salary-class • OCC-d, SAL-d (d = 3, 4, 5, 6, 7) • OCC-3: • SAL-4: • Cardinality: 100k, 200k, 300k, 400k, 500k

Experimental Settings (cont.) • Competitor: multi-dimensional generalization • l = 10 • Average relative error over 10000 aggregate queries, where the relative error of a query is |act - est| / act • qd = 1, 2, …, d • s = 1%, …, 5%, …, 10%
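The error measure |act - est| / act, averaged over a query workload, can be written down directly. This is a minimal sketch; skipping queries whose actual answer is 0 is an assumption made here to avoid division by zero and is not stated on the slide, and the function name and sample numbers are hypothetical.

```python
def average_relative_error(actual, estimated):
    """Mean of |act - est| / act over queries with a non-zero actual answer."""
    pairs = [(a, e) for a, e in zip(actual, estimated) if a != 0]
    if not pairs:
        raise ValueError("no query with a non-zero actual answer")
    return sum(abs(a - e) / a for a, e in pairs) / len(pairs)


# Hypothetical actual and estimated answers for two queries.
print(average_relative_error([4, 8], [2, 6]))  # 0.375
```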

Accuracy of Data Analysis (cont.) C.C. Aggarwal. On k-anonymity and the curse of dimensionality. VLDB 2005

Accuracy of Data Analysis (cont.)

Computation Overhead

Summary • Anatomy outperforms generalization by allowing much more accurate data analysis on the published data. • Anatomized tables (with a nearly optimal quality guarantee) can be computed in I/O cost linear in the database cardinality.

Thank you! Datasets and implementation are available for download at http://www.cse.cuhk.edu.hk/~taoyf

Anatomy vs. Generalization Revisit • Sometimes the adversary is not sure whether an individual appears in the microdata or not. • Example: the adversary holds a voter registration list and sees a 2-diverse generalized table.

Anatomy vs. Generalization Revisit • From the adversary’s perspective, given the voter registration list and the 2-diverse generalized table: • Bob has 4 / 6 probability to be in the microdata • If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia • So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia

Anatomy vs. Generalization Revisit • With a 2-diverse QIT and ST, the adversary knows that • Bob must appear in the microdata • There is 1/2 probability that Bob has contracted pneumonia
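The two confidence figures from this comparison can be checked with exact fractions. The variable names are hypothetical; the numbers are the ones quoted on the slides.

```python
from fractions import Fraction

# Generalization: Bob may or may not be one of the published individuals.
p_in_microdata = Fraction(4, 6)
p_pneumonia_given_present = Fraction(2, 4)
generalization_confidence = p_in_microdata * p_pneumonia_given_present

# Anatomy: the QIT reveals that Bob is present, so only the ST ratio remains.
anatomy_confidence = Fraction(1, 2)

print(generalization_confidence, anatomy_confidence)  # 1/3 1/2
```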

Anatomy vs. Generalization Revisit • For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does. • But this is not always the case, since: • the external database may not contain any irrelevant individuals • the adversary may know that some individuals indeed appear in the microdata
