
Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation

Chris Giannella, cgiannel AT acm DOT org


Presentation Transcript


  1. Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation. Chris Giannella, cgiannel AT acm DOT org

  2. Talk Outline • Introduction • Privacy preserving data mining – what problem is it aimed to address? • Focus of this talk: data transformation • Some data transformation approaches • My current research: Euclidean distance preserving data transformation • Wrap-up summary

  3. An Example Problem • The U.S. Census Bureau collects lots of data • If released in raw form, this data would provide • a wealth of valuable information regarding broad population patterns • access to private information regarding individuals • How to allow analysts to extract population patterns without learning private information?

  4. Privacy-Preserving Data Mining “The study of how to produce valid mining models and patterns without disclosing private information.” - F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006. Several broad approaches … this talk: data transformation (the “census model”)

  5. Data Transformation (the “Census Model”) [Diagram labels: Researcher, Private Data, Transformed Data, Data Miner]

  6. DT Objectives • Minimize risk of disclosing private information • Maximize the analytical utility of the transformed data • DT is also studied in the field of Statistical Disclosure Control.

  7. Some things DT does not address… • Preventing unauthorized access to the private data (e.g. hacking) • Securely communicating private data • DT and cryptography are quite different (moreover, standard encryption does not solve the DT problem).

  8. Assessing Transformed Data Utility How accurately does a transformation preserve certain kinds of patterns, e.g.: • data mean, covariance • Euclidean distance between data records • Underlying generating distribution? How useful are the patterns at drawing conclusions/inferences?

  9. Assessing Privacy Disclosure Risk • Some efforts in the literature to develop rigorous definitions of disclosure risk • no widely accepted agreement • This talk will take an ad-hoc approach: • for a specific attack, how closely can any private data record be estimated?

  10. Talk Outline • Introduction • Privacy preserving data mining – what problem is it aimed to address? • Focus of this talk: data transformation • Some data transformation approaches • My current research: Euclidean distance preserving data transformation • Wrap-up summary

  11. Some DT approaches • Discussed in this talk: • Additive independent noise • Euclidean distance preserving transformation • My current research • Others: • Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc…

  12. Additive Independent Noise • For each private data record (x1,…,xn), add independent random noise to each entry: (y1,…,yn) = (x1+e1,…,xn+en) • Each ei is generated independently as N(0, d*Var(i)) • Increasing d reduces privacy disclosure risk

  13. Additive Independent Noise d = 0.5

  14. Additive Independent Noise • Difficult to set d so that it produces both • low privacy disclosure risk • high data utility • Some enhancements on the basic idea exist, e.g. Muralidhar et al.
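
The perturbation described on slide 12 can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's implementation: the function name `add_independent_noise` and the example records are hypothetical, and the sketch assumes that Var(i) means the variance of attribute i taken over all private records (population variance is used here).

```python
import math
import random
import statistics


def add_independent_noise(records, d, rng):
    """Return a noisy copy of `records` (a list of equal-length tuples).

    Assumption: Var(i) is the population variance of attribute i over all
    records, so each noise term e_i is drawn from N(0, d * Var(i)).
    """
    columns = list(zip(*records))
    variances = [statistics.pvariance(col) for col in columns]
    # random.gauss expects a standard deviation, hence the square root.
    sigmas = [math.sqrt(d * var) for var in variances]
    return [
        tuple(x + rng.gauss(0.0, sigma) for x, sigma in zip(record, sigmas))
        for record in records
    ]


# Made-up example records (not from the talk): (wages, rent).
private = [(50000.0, 900.0), (62000.0, 1200.0), (48000.0, 750.0)]
released = add_independent_noise(private, d=0.5, rng=random.Random(0))
```

Because the noise is independent of the data, each released attribute has roughly (1 + d) times the variance of the original attribute, which is one way to see why a larger d protects records at the expense of utility.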

  15. Talk Outline • Introduction • Privacy preserving data mining – what problem is it aimed to address? • Focus of this talk: data transformation • Some data transformation approaches • My current research: Euclidean distance preserving data transformation (EDPDT) • Wrap-up summary

  16. EDPDT – High Data Utility! • Many data clustering algorithms use Euclidean distance to group records, e.g. • K-means clustering, hierarchical agglomerative clustering • If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as the original data.

  17. EDPDT – High Data Utility! [Figure panels: Original data, Transformed data]
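
The claim that distance-based algorithms see the same geometry can be checked with a small Python sketch. It builds a distance-preserving transformation as a sequence of random plane rotations, which is only one convenient way to obtain such a transformation and is not necessarily how the talk's transformation was generated. The names `random_rotation_plan` and `transform` are hypothetical, and the two sample records reuse the wages, rent, and tax values from the deck's backup example.

```python
import math
import random


def random_rotation_plan(n_attrs, n_steps, rng):
    """Pick `n_steps` random plane rotations (axis pair plus angle)."""
    plan = []
    for _ in range(n_steps):
        i, j = rng.sample(range(n_attrs), 2)
        plan.append((i, j, rng.uniform(0.0, 2.0 * math.pi)))
    return plan


def transform(record, plan):
    """Apply every plane rotation in `plan` to one record."""
    y = list(record)
    for i, j, theta in plan:
        c, s = math.cos(theta), math.sin(theta)
        # Rotating in the (i, j) plane leaves the record's length unchanged.
        y[i], y[j] = c * y[i] - s * y[j], s * y[i] + c * y[j]
    return tuple(y)


# Two records with attributes (wages, rent, tax).
a = (98563.0, 1889.0, 2899.0)
b = (83821.0, 1324.0, 2578.0)
plan = random_rotation_plan(n_attrs=3, n_steps=6, rng=random.Random(42))

before = math.dist(a, b)
after = math.dist(transform(a, plan), transform(b, plan))
print(math.isclose(before, after, rel_tol=1e-9))
```

The transformed records look nothing like the originals, yet the distance between them should agree up to floating-point rounding, which is the property K-means and hierarchical agglomerative clustering rely on.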

  18. EDPDT – Unclear Privacy Disclosure Risk • Focus of the research • Approach: • Develop attacks combining the transformed data with plausible prior knowledge. • How well can these attacks estimate private data records?

  19. Two Different Prior Knowledge Assumptions • Known input: The attacker knows a small subset of the private data records. • Focus of this talk. • Known sample: The attacker knows a set of data records drawn independently from the same underlying distribution as the private data records. • happy to discuss “off-line”.

  20. Known Input Prior Knowledge • Underlying assumption: individuals know a) whether there is a record for them among the private data records, and b) the attributes of the private data records. • Hence each individual knows one private record. • A small group of malicious individuals could cooperate to produce a small subset of the private data records.

  21. Known Input Attack • Given: {Y1,…,Ym} (transformed data records) and {X1,…,Xk} (known private data records) 1) Determine the transformation constraints, i.e. which transformed records came from which known private records. 2) Choose T randomly from the set of all distance preserving transformations that satisfy the constraints. 3) Apply T⁻¹ to the transformed data.

  22. Known Input Attack – 2D data, 1 known private data record

  23. Known Input Attack – General Case • Y = MX • Each column of X (Y) is a private (transformed) data record. • M is an orthogonal matrix. • [Y_known Y_unknown] = M [X_known X_unknown] • Attack: choose T randomly from {T an orthogonal matrix : T X_known = Y_known}. Produce T⁻¹(Y_unknown).
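
The attack is easiest to see in the 2D case with one known private record. There, for a nonzero known record, exactly two orthogonal matrices T satisfy the constraint: one rotation and one reflection. The Python sketch below enumerates both and picks one at random. It is an illustration under stated assumptions, not the author's code: all function names, the secret matrix, and the data values are made up.

```python
import math
import random


def rotation(theta):
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


def reflection(alpha):
    # Sends a vector at angle a to angle (alpha - a), keeping its length.
    return [[math.cos(alpha), math.sin(alpha)],
            [math.sin(alpha), -math.cos(alpha)]]


def apply(m, v):
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def transpose(m):
    # For an orthogonal matrix the transpose is the inverse.
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]


def candidate_transforms(x_known, y_known):
    """Both 2D orthogonal matrices T with T x_known = y_known."""
    ax = math.atan2(x_known[1], x_known[0])
    ay = math.atan2(y_known[1], y_known[0])
    return [rotation(ay - ax), reflection(ax + ay)]


def known_input_attack(x_known, y_known, y_unknown, rng):
    """Choose a consistent T at random and apply its inverse."""
    t = rng.choice(candidate_transforms(x_known, y_known))
    t_inv = transpose(t)
    return [apply(t_inv, y) for y in y_unknown]


# Made-up data: the data owner's secret M is a rotation by 1.1 radians.
secret_m = rotation(1.1)
x_known, x_unknown = (3.0, 4.0), (1.0, 2.0)
y_known = apply(secret_m, x_known)
y_unknown = apply(secret_m, x_unknown)
estimate = known_input_attack(x_known, y_known, [y_unknown], random.Random(7))[0]
```

If the attacker's random choice matches the type of the secret matrix, the unknown record is recovered exactly. Otherwise the estimate is the mirror image of the true record across the line through the origin and the known record, so even a wrong guess lands close to the truth for records that lie near that line.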

  24. Known Input Attack – Experiments • 18,000-record, 16-attribute real data set. • Given k known private data records, computed Pk, the probability that the attack estimates one unknown private record with > 85% accuracy. • P2 = 0.16, P4 = 1, …, P16 = 1

  25. Wrap-Up Summary • Introduction • Privacy preserving data mining – what problem is it aimed to address? • Focus of this talk: data transformation • Some data transformation approaches • My current research: Euclidean distance preserving data transformation

  26. Thanks to … • You: • for your attention • Kun Liu: • joint research & some material used in this presentation • Krish Muralidhar: • some material used in this presentation • Hillol Kargupta: • joint research

  27. Distance Preserving Perturbation [Figure labels: original, perturbed, Attributes, Records]

  28. Distance Preserving Perturbation: Y = M × X

      M:
        -0.2507   0.4556  -0.8542
        -0.9653  -0.0514   0.2559
         0.0726   0.8887   0.4527

      X (original):
        ID        1001     1002
        Wages   98,563   83,821
        Rent     1,889    1,324
        Tax      2,899    2,578

      Y (perturbed):
        ID        1001     1002
        Wages  -26,326  -22,613
        Rent   -94,502  -80,324
        Tax     10,151    8,432
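
The numbers on slide 28 can be checked directly. The Python sketch below multiplies the slide's M by the original records and compares the result with the perturbed values shown, then compares the distance between records 1001 and 1002 before and after perturbation. The helper names `matmul` and `column` are hypothetical, and the sketch assumes the positive table is X and the table with negative entries is Y.

```python
import math

# Values copied from slide 28.
M = [[-0.2507, 0.4556, -0.8542],
     [-0.9653, -0.0514, 0.2559],
     [0.0726, 0.8887, 0.4527]]
X = [[98563, 83821],   # Wages for IDs 1001, 1002
     [1889, 1324],     # Rent
     [2899, 2578]]     # Tax
Y_SLIDE = [[-26326, -22613],
           [-94502, -80324],
           [10151, 8432]]


def matmul(a, b):
    """Plain triple-loop matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column(m, j):
    return [row[j] for row in m]


Y = matmul(M, X)
# Largest absolute gap between the recomputed Y and the slide's Y.
max_gap = max(abs(Y[i][j] - Y_SLIDE[i][j]) for i in range(3) for j in range(2))
dist_x = math.dist(column(X, 0), column(X, 1))
dist_y = math.dist(column(Y_SLIDE, 0), column(Y_SLIDE, 1))
print(round(max_gap, 1), round(dist_x, 1), round(dist_y, 1))
```

The recomputed entries should differ from the slide's by only a few units, presumably because M is shown rounded to four decimals, and the two distances should agree to within a fraction of a unit.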

  29. Known Sample Attack

  30. Known Sample Attack Experiments (backup) Fig.: Known sample attack for Adult data with 32,561 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).
