
Anatomy: Simple and Effective Privacy Preservation

Anatomy: Simple and Effective Privacy Preservation. Xiaokui Xiao, Yufei Tao, Chinese University of Hong Kong. Privacy-preserving publishing of microdata. Purposes: allow researchers to effectively study the correlation between various attributes; protect the privacy of every patient.


Presentation Transcript


  1. Anatomy: Simple and Effective Privacy Preservation • Xiaokui Xiao, Yufei Tao • Chinese University of Hong Kong

  2. Privacy preserving data publishing Microdata • Purposes: • Allow researchers to effectively study the correlation between various attributes • Protect the privacy of every patient

  3. A naïve solution • It does not work. See next. publish

  4. Inference attack Published table • An adversary knows that Bob • has been hospitalized before • is 23 years old • lives in an area with zipcode 11000 Quasi-identifier (QI) attributes

  5. Generalization • Transform each QI value into a less specific form A generalized table How much generalization do we need?

  6. l-diversity • A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m /l times in the QI-group. • A table is l-diverse, iff all of its QI-groups are l-diverse. • The above table is 2-diverse. Quasi-identifier (QI) attributes Sensitive attribute 2 QI-groups
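The definition on slide 6 can be checked mechanically. The following is a minimal sketch in Python; the function names and the example sensitive values are illustrative assumptions, since the table shown on the slide is not preserved in this transcript.

```python
from collections import Counter


def is_l_diverse_group(sensitive_values, l):
    """Return True if no sensitive value occurs more than m / l times,
    where m is the number of tuples in the QI-group."""
    m = len(sensitive_values)
    counts = Counter(sensitive_values)
    # Compare count * l <= m instead of count <= m / l to stay in integers.
    return all(count * l <= m for count in counts.values())


def is_l_diverse_table(groups, l):
    """A table is l-diverse iff every one of its QI-groups is l-diverse."""
    return all(is_l_diverse_group(group, l) for group in groups)


# Hypothetical sensitive values for two QI-groups (illustration only).
example_groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse_table(example_groups, 2))
print(is_l_diverse_table(example_groups, 3))
```

With these assumed values each group has 4 tuples and no value occurs more than twice, so the table passes the check for l = 2 but not for l = 3.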

  7. What l-diversity guarantees • From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l A 2-diverse generalized table A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

  8. Defect of generalization • Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000] • Estimated answer: 2 * p, where p is the probability that each of the two tuples satisfies the query conditions

  9. Defect of generalization (cont.) • Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000] • p = Area( R1∩Q) / Area( R1 ) = 0.05 • Estimated answer for query A: 2 * p = 0.1

  10. Defect of generalization (cont.) • Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000] • Estimated answer from the generalized table: 0.1 • The exact answer should be: 1
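The estimate on slides 8 to 10 assumes that tuples are spread uniformly inside the generalized rectangle R1, so p is the fraction of R1 covered by the query region Q. The sketch below illustrates that computation. The ranges used for R1 are an assumption made for illustration (the generalized table is not preserved in this transcript), and the helper names are hypothetical.

```python
def overlap_fraction(rect, query):
    """Fraction of rect's area that lies inside query.

    Both arguments map an attribute name to a (low, high) interval.
    Assumes every interval in rect has non-zero width.
    """
    fraction = 1.0
    for attr, (r_lo, r_hi) in rect.items():
        q_lo, q_hi = query[attr]
        overlap = max(0.0, min(r_hi, q_hi) - max(r_lo, q_lo))
        fraction *= overlap / (r_hi - r_lo)
    return fraction


def estimate_count(num_candidate_tuples, rect, query):
    """Expected number of candidate tuples that fall inside the query."""
    return num_candidate_tuples * overlap_fraction(rect, query)


# Assumed generalized ranges for R1 (illustration only).
r1 = {"age": (21, 60), "zipcode": (10001, 60000)}
query_a = {"age": (0, 30), "zipcode": (10001, 20000)}

# Two tuples in the group carry Disease = 'pneumonia'.
estimate = estimate_count(2, r1, query_a)
print(round(estimate, 3))
```

Under these assumed ranges p comes out close to the 0.05 quoted on slide 9, so the estimate stays near 0.1 while the exact answer is 1.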

  11. Research Works on Generalization • V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002. • K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004. • R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005. • B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005. • K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005. • K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006. • D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006. • X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006. • K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006. • K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006. • J. Xu, W. Wang, J. Pei, et al. Utility-Based Anonymization Using Local Recodings. KDD 2006. • …

  12. Contributions • We propose an alternative to generalization, called Anatomy, which allows much more accurate data analysis while still preserving privacy. • We develop an algorithm for computing anatomized tables that • runs in linear I/Os • (nearly) minimizes information loss

  13. Outline • Basic Idea of Anatomy • Preserving Correlation • Algorithm for Anatomy • Experimental Results

  14. Basic Idea of Anatomy • For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST) Sensitive Table (ST) Quasi-identifier Table (QIT) microdata

  15. Basic Idea of Anatomy (cont.) 1. Select a partition of the tuples QI group 1 QI group 2 a 2-diverse partition

  16. Basic Idea of Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition group 1 group 2 quasi-identifier table (QIT) sensitive table (ST)

  17. Basic Idea of Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition quasi-identifier table (QIT) sensitive table (ST)

  18. Basic Idea of Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition sensitive table (ST) quasi-identifier table (QIT)
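The two steps on slides 15 to 18 can be sketched as follows. The QIT keeps every tuple's exact QI values plus a group ID, and the ST keeps, per group, each sensitive value with its count. The row layout, the function name, and the example microdata are assumptions for illustration, as the tables shown on the slides are not preserved in this transcript.

```python
from collections import Counter


def build_qit_and_st(groups):
    """Build the two published tables from a partition.

    groups: list of QI-groups; each group is a list of
            (qi_values, sensitive_value) pairs.
    Returns (qit, st) where
      qit rows are (*qi_values, group_id) and
      st rows are (group_id, sensitive_value, count).
    """
    qit, st = [], []
    for group_id, group in enumerate(groups, start=1):
        for qi_values, _sensitive in group:
            qit.append((*qi_values, group_id))
        counts = Counter(sensitive for _qi, sensitive in group)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st


# Hypothetical microdata: ((age, sex, zipcode), disease), already split
# into two QI-groups that form a 2-diverse partition.
example_partition = [
    [((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
     ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia")],
    [((61, "F", 54000), "flu"), ((65, "F", 25000), "gastritis"),
     ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis")],
]
qit, st = build_qit_and_st(example_partition)
for row in st:
    print(row)
```

Note that the QI values are published unchanged; only the link between an individual tuple and its sensitive value is hidden inside the group.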

  19. Privacy Preservation • From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l sensitive table (ST) quasi-identifier table (QIT)

  20. Accuracy of Data Analysis • Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000] sensitive table (ST) quasi-identifier table (QIT)

  21. Accuracy of Data Analysis (cont.) • Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000] • 2 patients have contracted pneumonia • 2 out of 4 patients satisfy the query conditions on Age and Zipcode • Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata
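The arithmetic on slide 21 generalizes to any number of groups: for each group, multiply the ST count of the queried sensitive value by the fraction of the group's QIT rows that satisfy the QI conditions, then sum. Below is a minimal sketch. The row layouts and the four QIT rows are assumptions chosen to match the counts quoted on the slide, not data recovered from it.

```python
def estimate_count_anatomy(qit, st, sensitive_value, qi_predicate):
    """Estimate a COUNT(*) query from a QIT/ST pair.

    qit rows: (..., group_id); st rows: (group_id, value, count).
    qi_predicate takes a QIT row and returns True if its QI values
    satisfy the query conditions.
    """
    estimate = 0.0
    for group_id in sorted({row[-1] for row in qit}):
        members = [row for row in qit if row[-1] == group_id]
        matching = sum(1 for row in members if qi_predicate(row))
        value_count = sum(count for gid, value, count in st
                          if gid == group_id and value == sensitive_value)
        estimate += value_count * matching / len(members)
    return estimate


def satisfies_query_a(row):
    """QI conditions of query A: Age in [0, 30], Zipcode in [10001, 20000]."""
    age, zipcode, _group_id = row
    return 0 <= age <= 30 and 10001 <= zipcode <= 20000


# Hypothetical QIT rows (age, zipcode, group_id) and ST rows for group 1.
qit = [(23, 11000, 1), (27, 13000, 1), (35, 59000, 1), (59, 12000, 1)]
st = [(1, "dyspepsia", 2), (1, "pneumonia", 2)]

answer = estimate_count_anatomy(qit, st, "pneumonia", satisfies_query_a)
print(answer)  # 2 * 2 / 4 = 1.0
```

Because the QIT keeps exact QI values, the fraction 2 / 4 is computed from real data rather than from a uniformity assumption over a generalized range.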

  22. Outline • Rationale of Anatomy • Preserving Correlation • Algorithm for Anatomy • Experimental Results

  23. Preserving Correlation • Let us first examine the correlation between Age and Disease in our running example • Each tuple in the microdata can be mapped to a point in the (Age, Disease) domain • The above tuple (t1) can be mapped to (23, pneumonia).

  24. Preserving Correlation (cont.) • We model this tuple using a probability density function (pdf):

  25. Preserving Correlation (cont.) • In the generalized table, the tuple becomes: • Its corresponding pdf becomes:

  26. Preserving Correlation (cont.) • In the anatomized tables, the tuple becomes: • Its corresponding pdf becomes:
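The formulas on slides 24 to 26 are not preserved in this transcript. A plausible reading, offered here only as a reconstruction, is the following. The symbols G, a_lo, a_hi, c(v) and m are notation introduced for this sketch: [a_lo, a_hi] is the generalized Age interval of the tuple (assuming integer ages), c(v) is the count of sensitive value v in the tuple's group in the ST, and m is the size of that group.

```latex
% Original tuple t1 = (23, pneumonia): all probability mass on one point.
G_{t_1}(x) =
  \begin{cases}
    1 & \text{if } x = (23, \text{pneumonia}) \\
    0 & \text{otherwise}
  \end{cases}

% Generalized table: Age is only known to lie in [a_lo, a_hi],
% so the mass is spread uniformly over that interval.
G^{\mathrm{gen}}_{t_1}(x) =
  \begin{cases}
    \dfrac{1}{a_{hi} - a_{lo} + 1}
      & \text{if } x = (a, \text{pneumonia}),\ a_{lo} \le a \le a_{hi} \\
    0 & \text{otherwise}
  \end{cases}

% Anatomized tables: Age is exact, Disease follows the group's
% distribution in the ST.
G^{\mathrm{ana}}_{t_1}(x) =
  \begin{cases}
    \dfrac{c(v)}{m} & \text{if } x = (23, v) \\
    0 & \text{otherwise}
  \end{cases}
```

In the running example, slide 21 states that 2 of the 4 patients in the group have pneumonia, so under this reading the anatomized pdf places mass 2/4 on (23, pneumonia), at the true Age, instead of spreading a small mass over every age in the generalized interval.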

  27. Preserving Correlation (cont.)

  28. Outline • Rationale of Anatomy • Preserving Correlation • Algorithm for Anatomy • Experimental Results

  29. Quality Metric • For each approximated pdf, we measure its error from the original pdf by their “L2 distance”: • We aim at obtaining anatomized tables that minimize the following reconstruction error (RCE):
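The two formulas on slide 29 are not preserved in this transcript. The sketch below shows one way such a metric could be computed for anatomized tables, under these assumptions: the original pdf of a tuple is a point mass on its true sensitive value, the approximated pdf assigns each value v in the group the mass c(v) / m, the per-tuple error is the squared L2 distance between the two, and the RCE is the sum of per-tuple errors. All names are hypothetical.

```python
from collections import Counter


def tuple_error(true_value, group_counts, group_size):
    """Squared L2 distance between a point mass on true_value and the
    group's sensitive-value distribution (an assumed error definition)."""
    error = 0.0
    for value, count in group_counts.items():
        approximated = count / group_size
        original = 1.0 if value == true_value else 0.0
        error += (approximated - original) ** 2
    return error


def reconstruction_error(groups):
    """Sum of per-tuple errors over all QI-groups.

    groups: list of lists of sensitive values, one list per QI-group.
    """
    total = 0.0
    for group in groups:
        counts = Counter(group)
        for true_value in group:
            total += tuple_error(true_value, counts, len(group))
    return total


# Two groups of two distinct values each: every tuple contributes
# (1/2 - 1)^2 + (1/2)^2 = 0.5, so the total is 2.0.
print(reconstruction_error([["a", "b"], ["c", "d"]]))
```

Under this reading, smaller groups with distinct sensitive values give a lower error, which is consistent with the slide's goal of minimizing the RCE subject to l-diversity.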

  30. Anatomize • An algorithm for computing anatomized tables that • runs in I/O cost linear in the cardinality n of the microdata table • minimizes the RCE when n is a multiple of l, and otherwise achieves an RCE that exceeds the lower bound by a factor of at most 1 + 1/n
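The slides state the properties of Anatomize but not its steps. The following is therefore not the authors' algorithm, only a simple in-memory greedy sketch of how an l-diverse partition could be formed: bucket the tuples by sensitive value, repeatedly draw one tuple from each of the l largest buckets to form a group, then place leftover tuples into groups that do not yet contain their value. It makes no attempt to achieve linear I/O cost, and all names are hypothetical.

```python
from collections import defaultdict


def greedy_l_diverse_partition(tuples, sensitive_index, l):
    """Partition tuples into groups whose sensitive values are all distinct.

    Every group has at least l tuples with pairwise different sensitive
    values, which makes each group l-diverse. Raises ValueError if this
    greedy strategy cannot place every tuple.
    """
    if l < 1:
        raise ValueError("l must be at least 1")
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[sensitive_index]].append(t)

    groups = []
    # Group creation: one tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [v for v in buckets if buckets[v]]
        largest = sorted(non_empty, key=lambda v: -len(buckets[v]))[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue assignment: put each leftover tuple into a group that
    # does not already contain its sensitive value.
    for value, bucket in buckets.items():
        while bucket:
            target = next(
                (g for g in groups
                 if all(u[sensitive_index] != value for u in g)),
                None,
            )
            if target is None:
                raise ValueError("greedy sketch found no l-diverse partition")
            target.append(bucket.pop())
    return groups


# Hypothetical (age, disease) rows for illustration.
rows = [(23, "pneumonia"), (27, "dyspepsia"), (35, "dyspepsia"),
        (59, "pneumonia"), (61, "flu"), (65, "gastritis"),
        (65, "flu"), (70, "bronchitis")]
for group in greedy_l_diverse_partition(rows, 1, 2):
    print(group)
```

Drawing from the largest buckets first keeps any single sensitive value from piling up at the end, which is what would otherwise make the leftover tuples impossible to place.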

  31. Outline • Rationale of Anatomy • Preserving Correlation • Algorithm for Anatomy • Experimental Results

  32. Experimental Settings • Goal: to compare the accuracy of data analysis on the generalized / anatomized tables. • Real dataset with 9 attributes: • Age, Gender, Education, Marital-status, Race, Work-class, Country, • Occupation, Salary-class • OCC-d, SAL-d, (d = 3, 4, 5, 6, 7) • OCC-3: • SAL-4: • Cardinality: 100k, 200k, 300k, 400k, 500k

  33. Experimental Settings (cont.) • competitor: multi-dimensional generalization • l = 10 • avg. relative error for 10000 aggregate queries: |act – est| / act • qd = 1, 2, …, d • s = 1%, …, 5%, …, 10%
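The error measure on slide 33, |act – est| / act averaged over a query workload, can be written as a short helper. This is a sketch with hypothetical names. Skipping queries whose actual answer is 0 is an assumption made here to avoid division by zero; the slides do not say how such queries are handled.

```python
def average_relative_error(actual, estimated):
    """Mean of |act - est| / act over paired query answers.

    Queries with an actual answer of 0 are skipped (an assumption).
    """
    errors = [abs(act - est) / act
              for act, est in zip(actual, estimated)
              if act != 0]
    if not errors:
        raise ValueError("no query with a non-zero actual answer")
    return sum(errors) / len(errors)


# Query A from the earlier slides: exact answer 1, estimated as 0.1
# from the generalized table and 1 from the anatomized tables.
print(average_relative_error([1], [0.1]))  # 0.9
print(average_relative_error([1], [1]))    # 0.0
```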

  34. Accuracy of Data Analysis (cont.) C.C. Aggarwal. On k-anonymity and the curse of dimensionality. VLDB 2005

  35. Accuracy of Data Analysis (cont.)

  36. Accuracy of Data Analysis (cont.)

  37. Computation Overhead

  38. Summary • Anatomy outperforms generalization by allowing much more accurate data analysis on the published data. • Anatomized tables (with nearly optimal quality guarantee) can be computed in I/O cost linear in the database cardinality.

  39. Thank you! Datasets and implementation are available for download at http://www.cse.cuhk.edu.hk/~taoyf

  40. Anatomy vs. Generalization Revisit • Sometimes the adversary is not sure whether an individual appears in the microdata or not A Voter Registration List A 2-diverse generalized table

  41. Anatomy vs. Generalization Revisit • From the adversary’s perspective: • Bob has 4 / 6 probability to be in the microdata • If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia • So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia A Voter Registration List A 2-diverse generalized table

  42. Anatomy vs. Generalization Revisit • The adversary knows that • Bob must appear in the microdata • There is 1/2 probability that Bob has contracted pneumonia 2-diverse ST 2-diverse QIT
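The comparison on slides 41 and 42 is a product of two probabilities: that Bob is in the microdata at all, and that he has the disease given that he is. A small sketch with exact fractions makes the difference explicit; the function name is hypothetical.

```python
from fractions import Fraction


def breach_probability(p_in_microdata, p_disease_given_present):
    """Probability, from the adversary's view, that the individual has
    the disease: P(present) * P(disease | present)."""
    return p_in_microdata * p_disease_given_present


# Generalization (slide 41): 4 of the 6 matching voters are in the
# microdata, and 2 of the 4 tuples in the group carry pneumonia.
generalization = breach_probability(Fraction(4, 6), Fraction(2, 4))

# Anatomy (slide 42): the exact QI values reveal that Bob is present.
anatomy = breach_probability(Fraction(1), Fraction(1, 2))

print(generalization)  # 1/3
print(anatomy)         # 1/2
```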

  43. Anatomy vs. Generalization Revisit • For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does. • But this is not always the case, since: • the external database may not contain any irrelevant individuals • the adversary may know that some individuals indeed appear in the microdata
