On the Use of Spectral Filtering for Privacy Preserving Data Mining

Presentation Transcript


  1. On the Use of Spectral Filtering for Privacy Preserving Data Mining Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte

  2. Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg April 23-27, 2006

  3. PIPEDA 2000 • European Union (Directive 95/46/EC) • HIPAA for health care • California State Bill 1386 • Gramm-Leach-Bliley Act for financial institutions • COPPA for children's online privacy • Source: http://www.privacyinternational.org/survey/dpmap.jpg

  4. Mining vs. Privacy • Data mining • The goal of data mining is summary results (e.g., classification, clustering, association rules etc.) from the data (distribution) • Individual Privacy • Individual values in the database must not be disclosed, or at least no close estimation can be derived by attackers • Privacy Preserving Data Mining (PPDM) • How to “perturb” data such that • we can build a good data mining model (data utility) • while preserving individual’s privacy at the record level (privacy)?

  5. Outline • Additive Randomization • Distribution Reconstruction • Bayesian Method, Agrawal & Srikant SIGMOD00 • EM Method, Agrawal & Aggarwal PODS01 • Individual Value Reconstruction • Spectral Filtering, Kargupta et al. ICDM03 • PCA Technique, Du et al. SIGMOD05 • Error Bound Analysis for Spectral Filtering • Upper Bound • Conclusion and Future Work

  6. Additive Randomization • To hide the sensitive data by randomly modifying the data values using some additive noise • Privacy preserving aims at and • Utility preserving aims at • The aggregate characteristics remain unchanged or can be recovered
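The idea on this slide can be illustrated with a minimal sketch. It assumes zero-mean i.i.d. Gaussian noise and a single made-up attribute; the names `perturb`, `u`, `u_tilde`, and `sigma` are hypothetical and do not come from the slides.

```python
import numpy as np

def perturb(u, sigma, rng):
    """Return u + v, where v is i.i.d. zero-mean Gaussian noise (assumed noise model)."""
    v = rng.normal(loc=0.0, scale=sigma, size=u.shape)
    return u + v

rng = np.random.default_rng(0)
u = rng.normal(loc=50.0, scale=10.0, size=100_000)  # hypothetical sensitive attribute
sigma = 20.0                                        # noise standard deviation, published
u_tilde = perturb(u, sigma, rng)

# Privacy side: each released value is typically far from its original.
mean_abs_shift = np.mean(np.abs(u_tilde - u))

# Utility side: the mean is preserved in expectation, and the original
# variance can be recovered by subtracting the known noise variance.
recovered_var = u_tilde.var() - sigma**2
```

Here the aggregate statistics survive the perturbation because the noise is independent of the data and its distribution is known to the miner.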

  7. Distribution Reconstruction • The original density distribution can be reconstructed effectively given the perturbed data and the noise's distribution (Agrawal & Srikant, SIGMOD 2000) • Independent random noises with any distribution • $f_X^0$ := Uniform distribution • j := 0 // Iteration number • repeat • $f_X^{j+1}(a)$ := • j := j+1 • until (stopping criterion met) • It cannot reconstruct individual values
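The update formula did not survive in the transcript. The sketch below uses the Bayesian update published by Agrawal & Srikant, discretized on a grid: each new density value is the average, over all perturbed samples, of the posterior density of X given that sample. The grid discretization, the Gaussian noise, the stopping rule, and all function and variable names are assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(x, sigma):
    """Density of zero-mean Gaussian noise with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def reconstruct_density(w, noise_pdf, grid, n_iter=50, tol=1e-6):
    """Estimate the density of X on an evenly spaced grid from samples w = x + r."""
    delta = grid[1] - grid[0]
    f = np.full(grid.size, 1.0 / (grid.size * delta))   # f_X^0 := uniform
    # lik[i, k] = f_R(w_i - a_k): noise density linking sample i to grid point a_k
    lik = noise_pdf(w[:, None] - grid[None, :])
    for _ in range(n_iter):
        joint = lik * f[None, :]
        # Per-sample normalizer: integral of f_R(w_i - z) f_X^j(z) dz on the grid
        denom = np.maximum(joint.sum(axis=1, keepdims=True) * delta, 1e-300)
        f_new = (joint / denom).mean(axis=0)             # average posterior density
        change = np.abs(f_new - f).sum() * delta
        f = f_new
        if change < tol:                                 # assumed stopping criterion
            break
    return f

rng = np.random.default_rng(0)
# Hypothetical bimodal original data, perturbed with Gaussian noise of known sigma.
x = np.concatenate([rng.normal(-3.0, 0.5, 2000), rng.normal(3.0, 0.5, 2000)])
sigma = 1.0
w = x + rng.normal(0.0, sigma, x.size)
grid = np.linspace(-8.0, 8.0, 161)
f_hat = reconstruct_density(w, lambda d: gaussian_pdf(d, sigma), grid)
```

The result `f_hat` is a density over the grid, not a set of per-record estimates, which matches the slide's remark that this method recovers the distribution only.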

  8. Individual Value Reconstruction • Spectral Filtering, Kargupta et al. ICDM 2003 • Apply EVD : • Using some published information about V, extract the first k components of as the principal components. • and are the corresponding eigenvectors. • forms an orthonormal basis of a subspace . • Find the orthogonal projection onto : • Get the estimated data set: • PCA Technique, Huang, Du and Chen, SIGMOD 05
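The steps listed on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes rows are records and columns are attributes, that the number of components `k` is already known, and that the sample covariance of the perturbed data is used for the eigendecomposition. The function name `spectral_filter` is hypothetical.

```python
import numpy as np

def spectral_filter(u_tilde, k):
    """Project perturbed data (rows = records) onto its top-k principal directions."""
    mean = u_tilde.mean(axis=0)
    centered = u_tilde - mean
    cov = centered.T @ centered / (u_tilde.shape[0] - 1)
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, ::-1][:, :k]          # orthonormal basis of the signal subspace
    # Orthogonal projection onto that subspace, then restore the mean
    return centered @ top @ top.T + mean

rng = np.random.default_rng(0)
n, m, rank = 2000, 10, 2
# Hypothetical correlated data: 10 attributes driven by 2 latent factors.
u = rng.normal(size=(n, rank)) @ (3.0 * rng.normal(size=(rank, m)))
v = rng.normal(scale=1.0, size=(n, m))     # i.i.d. additive noise
u_hat = spectral_filter(u + v, k=rank)

err_perturbed = np.linalg.norm(v)          # Frobenius distance of released data
err_filtered = np.linalg.norm(u_hat - u)   # Frobenius distance after filtering
```

When the original attributes are strongly correlated, the projection discards the noise lying outside the signal subspace, so the filtered estimate sits closer to the original data than the released data does.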

  9. Motivation • Previous work on individual reconstruction is only empirical • The relationship between the estimation accuracy and the noise was not clear • Two questions • Attacker question: How close is the data estimated using SF to the original data? • Data owner question: How much noise should be added to preserve privacy at a given tolerated level?

  10. Our Work • Investigate the explicit relationship between the estimation accuracy and the noise • Derive one upper bound of in terms of V • The upper bound determines how close the data estimated by attackers is to the original data • It poses a serious threat of privacy breaches

  11. Preliminary • F-norm and 2-norm • Some properties • and • $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$, the square root of the largest eigenvalue of $A^T A$ • If A is symmetric, then $\|A\|_2 = \max_i |\lambda_i(A)|$, the largest eigenvalue of A in absolute value
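The two norms named on this slide can be checked numerically. The sketch below uses standard definitions (Frobenius norm as the square root of the sum of squared entries, 2-norm as the largest singular value); the matrices are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 3))

# Frobenius norm: square root of the sum of squared entries
fro = np.sqrt((a ** 2).sum())

# 2-norm: square root of the largest eigenvalue of A^T A
two = np.sqrt(np.linalg.eigvalsh(a.T @ a).max())

# For a symmetric matrix the 2-norm is the largest eigenvalue in absolute value
b = rng.normal(size=(4, 4))
s = b + b.T
two_sym = np.abs(np.linalg.eigvalsh(s)).max()
```

Both values agree with `np.linalg.norm(a, 'fro')` and `np.linalg.norm(a, 2)`, and the 2-norm never exceeds the Frobenius norm.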

  12. Matrix Perturbation • Traditional Matrix perturbation theory • How the derived perturbation E affects the covariance matrix A • Our scenario • How the primary perturbation V affects the data matrix U A + E

  13. Error Bound Analysis • Prop 1. Let covariance matrix of the perturbed data be . Given and • Prop 2. (eigenvalue of E) (eigengap)

  14. Theorem • Given a data set and a noise set we have the perturbed data set . Let be the estimation obtained from the Spectral Filtering, then where is the derived perturbation on the original covariance matrix $A = UU^T$ • Proof is skipped
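The form of the derived perturbation is not shown in the transcript. Under the assumption that the perturbed data is $\tilde{U} = U + V$ and that its covariance is taken as $\tilde{U}\tilde{U}^T$ (the symbols $\tilde{U}$ and $\tilde{A}$ are introduced here for illustration), expanding the product gives the relationship between the primary perturbation V and the derived perturbation E:

```latex
\tilde{A} = \tilde{U}\tilde{U}^T = (U + V)(U + V)^T
          = UU^T + UV^T + VU^T + VV^T
          = A + E,
\qquad E = UV^T + VU^T + VV^T .
```

This is why a bound stated in terms of E can be restated in terms of V: every term of E is a product involving the noise matrix.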

  15. Special Cases • When the noise matrix is generated by i.i.d. Gaussian distribution with zero mean and known variance • When the noise is completely correlated with data

  16. Experimental Results • Artificial Dataset • 35 correlated variables • 30,000 tuples

  17. Experimental Results • Scenarios of noise addition • Case 1: i.i.d. Gaussian noise • N(0, COV), where COV = $\mathrm{diag}(\sigma^2, \ldots, \sigma^2)$ • Case 2: Independent Gaussian noise • N(0, COV), where COV = $c \cdot \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ • Case 3: Correlated Gaussian noise • N(0, COV), where COV = $c \cdot \Sigma_U$ (or $c \cdot A$ ……) • Measure • Absolute error • Relative error
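The three noise scenarios can be sketched as below. Several details are assumptions, since the slide does not define them: rows are records, the per-attribute variances in Case 2 are taken from the original data, and the relative error is taken to be the Frobenius norm of the estimation error divided by the Frobenius norm of the original data. The data set here is a small random stand-in, not the 35-variable data set from the experiments, and the names `make_noise` and `relative_error` are hypothetical.

```python
import numpy as np

def make_noise(case, u, rng, sigma2=1.0, c=0.5):
    """Draw zero-mean Gaussian noise with the covariance structure of the given case."""
    n, m = u.shape
    if case == 1:        # i.i.d.: COV = diag(sigma^2, ..., sigma^2)
        cov = sigma2 * np.eye(m)
    elif case == 2:      # independent: COV = c * diag of per-attribute variances (assumed)
        cov = c * np.diag(u.var(axis=0, ddof=1))
    elif case == 3:      # correlated: COV = c * covariance of the original data
        cov = c * np.cov(u, rowvar=False)
    else:
        raise ValueError("case must be 1, 2, or 3")
    return rng.multivariate_normal(np.zeros(m), cov, size=n)

def relative_error(estimate, u):
    """Assumed measure: ||estimate - U||_F / ||U||_F."""
    return np.linalg.norm(estimate - u) / np.linalg.norm(u)

rng = np.random.default_rng(0)
n, m = 20_000, 5
u = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))   # stand-in correlated data
noise = {case: make_noise(case, u, rng) for case in (1, 2, 3)}

# Relative noise level ||V||_F / ||U||_F for each case
noise_ratio = {case: relative_error(u + v, u) for case, v in noise.items()}
```

Applying `relative_error` to the released data `u + v` gives the ratio of noise to data in the Frobenius norm; applying it to a reconstructed estimate would give the attacker's relative error.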

  18. Determining k • Determine k in Spectral Filtering • According to Matrix Perturbation Theory • Our heuristic approach: • check • k =

  19. Effect of varying k (case 1) • N(0, COV), where COV = $\mathrm{diag}(\sigma^2, \ldots, \sigma^2)$ • Plot axis: relative error

  20. Effect of varying k (case 2) • N(0, COV), where COV = $c \cdot \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$ • Plot axis: relative error

  21. Effect of varying k (case 3) • N(0, COV), where COV = $c \cdot \Sigma_U$

  22. Effect of varying noise • Panels: $\sigma^2 = 0.1$, $\sigma^2 = 0.5$, $\sigma^2 = 1.0$ • $\|V\|_F / \|U\|_F = 87.8\%$

  23. Effect of covariance matrix • Panels: Case 1, Case 2, Case 3 • $\|V\|_F / \|U\|_F = 39.1\%$

  24. Conclusion • Spectral filtering based techniques have been investigated as a major means of point-wise data reconstruction. • We present the upper bound • which enables attackers to determine how close their estimated data is to the original data

  25. Future Work • We are working on the lower bound • which represents the best estimate the attacker can achieve using SF • which can be used by data owners to determine how much noise should be added to preserve privacy • Bound analysis at point-wise level

  26. Acknowledgement • NSF Grant • CCR-0310974 • IIS-0546027 • Personnel • Xintao Wu • Songtao Guo • Ling Guo • More Info • http://www.cs.uncc.edu/~xwu/ • xwu@uncc.edu

  27. Questions? Thank you!
