Achieving Anonymity via Clustering

Presentation Transcript


  1. Achieving Anonymity via Clustering G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu Dilys Thomas PODS 2006

  2. Talk outline • k-Anonymity model • Achieving Anonymity via Clustering • r-Gather clustering • Cellular clustering • Future Work

  3. Medical Records

  4. De-identified Medical Records [table not captured; one surviving entry: 03/04/76]

  5. k-Anonymity model • Quasi-identifiers can uniquely identify you! • Quasi-identifiers: approximate foreign keys

  6. k-Anonymity Model [Swe00] • Suppress some entries of the quasi-identifiers • Each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers • Individual records are hidden in a crowd of size k
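The definition on slide 6 can be checked mechanically. The following is a minimal sketch, not code from the talk: the table, its column names, and the use of "*" for a suppressed entry are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that appears in the table appears in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical table; "*" marks a suppressed entry (an assumed convention).
rows = [
    {"age": "*", "zip": "94305", "disease": "flu"},
    {"age": "*", "zip": "94305", "disease": "cold"},
    {"age": "30", "zip": "*", "disease": "asthma"},
    {"age": "30", "zip": "*", "disease": "flu"},
]

print(is_k_anonymous(rows, ["age", "zip"], 2))  # True: each group has 2 rows
print(is_k_anonymous(rows, ["age", "zip"], 3))  # False: no group has 3 rows
```

Only the quasi-identifier columns are grouped; the sensitive column ("disease" here) is left untouched, which matches the model's focus on rows being indistinguishable with respect to quasi-identifiers.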

  7. 2-Anonymized Table

  8. k-Anonymity Optimization • Minimize the number of generalizations/suppressions needed to achieve k-anonymity • Finding the minimum number of suppressions/generalizations is NP-hard [MW04] • O(k)-approximation for k-anonymity [AFK+05] • Ω(k) lower bound on the approximation ratio under the graph assumption

  9. Talk outline • k-Anonymity model • Achieving Anonymity via Clustering • r-Gather clustering • Cellular Clustering • Future Work

  10. Original Table

  11. 2-Anonymity with Suppression • All attributes suppressed

  12. Original Table

  13. 2-Anonymity with Generalization • Generalization allows pre-specified ranges

  14. Original Table

  15. 2-Anonymity with Clustering • Cluster centers published • 27 = (25+27+29)/3 • 70 = (50+60+100)/3 • 37 = (35+39)/2 • 115 = (110+120)/2
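The arithmetic on slide 15 is a per-attribute mean over each cluster. A small sketch that reproduces it, assuming two numeric attributes per record; how the values pair up within a row is an assumption, since the original table was not captured (the means do not depend on the pairing).

```python
def cluster_center(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Values taken from the slide; the row pairing is assumed.
cluster_1 = [(25, 50), (27, 60), (29, 100)]
cluster_2 = [(35, 110), (39, 120)]

print(cluster_center(cluster_1))  # (27.0, 70.0)
print(cluster_center(cluster_2))  # (37.0, 115.0)
```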

  16. Advantages of Clustering • Clustering introduces less distortion than suppression/generalization • Clustering allows constant-factor approximation algorithms

  17. Quasi-Identifiers form a Metric Space • Convert quasi-identifiers into points in a metric space • Distance function D on points • D(X,X) = 0 (reflexive) • D(X,Y) = D(Y,X) (symmetric) • D(X,Z) <= D(X,Y) + D(Y,Z) (triangle inequality)

  18. Metric Space • Converting (gender, zip code, DOB) into points in a metric space is not easy • Define a distance function on each attribute • E.g. on zip code: D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2 • Weight the attributes; the weighted sum of attribute distances gives a metric
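One way to read slide 18 is sketched below: a distance per attribute, combined as a weighted sum. Everything concrete here is an assumption for illustration: the zip-code coordinates, the 0/1 distance on gender, the DOB distance in years, and the weights. A weighted sum with nonnegative weights preserves the three properties listed on slide 17.

```python
from datetime import date
from math import hypot

# Hypothetical planar coordinates (km) per zip code; real locations would be needed in practice.
ZIP_LOCATION = {"94305": (0.0, 0.0), "94301": (3.0, 4.0)}

def gender_distance(a, b):
    """Discrete metric: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def zip_distance(a, b):
    """Euclidean distance between the (assumed) locations of two zip codes."""
    (x1, y1), (x2, y2) = ZIP_LOCATION[a], ZIP_LOCATION[b]
    return hypot(x1 - x2, y1 - y2)

def dob_distance(a, b):
    """Absolute difference between two dates, in years."""
    return abs((a - b).days) / 365.25

WEIGHTS = (1.0, 0.5, 0.1)  # assumed weights for (gender, zip, DOB)

def record_distance(r, s):
    """Weighted sum of per-attribute distances between two (gender, zip, DOB) records."""
    parts = (
        gender_distance(r[0], s[0]),
        zip_distance(r[1], s[1]),
        dob_distance(r[2], s[2]),
    )
    return sum(w * d for w, d in zip(WEIGHTS, parts))

x = ("F", "94305", date(1976, 3, 4))
y = ("M", "94301", date(1976, 3, 4))
print(record_distance(x, y))  # 1.0*1 + 0.5*5 + 0.1*0 = 3.5
```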

  19. Clustering for Anonymity • Cluster the quasi-identifiers so that each cluster has at least r members • Publish the cluster centers together with the number of points and the radius of each cluster • Tight clusters → usefulness of data for mining • Large number of points per cluster → anonymity

  20. Quasi-identifiers: Metric Space • Assume from here on that a distance metric has already been defined on the quasi-identifiers

  21. Talk outline • k-Anonymity model • Achieving Anonymity via Clustering • r-Gather clustering • Cellular Clustering • Future Work

  22. r-Gather Clustering • Example clusters: 10 points, radius 5; 50 points, radius 20; 20 points, radius 10 • Objective: minimize the maximum radius (here 20)
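The r-gather objective from slide 22 can be evaluated for a given clustering as follows. This sketch only scores a clustering; it is not the 2-approximation algorithm from the talk. The 1-D points and the absolute-difference metric are assumptions, chosen so that the radii match the slide's 5, 20, and 10.

```python
def cluster_radius(center, members, dist):
    """Largest distance from the center to any member of the cluster."""
    return max(dist(center, m) for m in members)

def r_gather_cost(clusters, r, dist):
    """Return the maximum cluster radius, or None if any cluster
    has fewer than r members (the size constraint is violated)."""
    if any(len(members) < r for _, members in clusters):
        return None
    return max(cluster_radius(c, members, dist) for c, members in clusters)

def dist(a, b):
    """Assumed metric: absolute difference on the real line."""
    return abs(a - b)

# Hypothetical (center, members) pairs with radii 5, 20, and 10.
clusters = [
    (10, [5, 10, 15]),
    (100, [80, 100, 120]),
    (50, [40, 50, 60]),
]

print(r_gather_cost(clusters, 3, dist))  # 20
print(r_gather_cost(clusters, 4, dist))  # None: clusters are too small for r = 4
```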

  23. Results • 2-approximation for minimizing the maximum radius under the cluster size constraint • Matching lower bound of 2 for maximum radius minimization

  24. r-Gather Clustering [figure not captured; distances labeled 2d]

  25. Lower Bound: Reduction from 3-SAT • Example clause: C1 = X1 ∨ X2 • Figure labels: X1T, X2T, X1F, X2F, C1, with two groups of r-2 points • r-gather with radius 1 iff the formula is satisfiable; otherwise radius ≥ 2

  26. Talk outline • k-Anonymity model • Achieving Anonymity via Clustering • r-Gather clustering • Cellular Clustering • Future Work

  27. Cellular Clustering • Example clusters: 10 points, radius 5; 50 points, radius 20; 20 points, radius 10

  28. Cellular Clustering Metric • Example clusters: 10 points, radius 5; 50 points, radius 20; 20 points, radius 10 • Cellular clustering metric: 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250
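The metric on slide 28 sums, over all clusters, the number of points times the cluster radius. A minimal sketch using the slide's numbers; the (size, radius) tuple representation is an assumption.

```python
def cellular_cost(clusters):
    """Sum of (number of points * radius) over all clusters."""
    return sum(size * radius for size, radius in clusters)

# (number of points, radius) for the three clusters on the slide.
clusters = [(10, 5), (20, 10), (50, 20)]

print(cellular_cost(clusters))  # 50 + 200 + 1000 = 1250
```

Unlike the r-gather objective, which looks only at the worst cluster, this cost charges every point the radius of its own cluster, so one large cluster does not dominate the score for points in tight clusters.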

  29. Cellular Clustering • Primal-dual 4-approximation algorithm for cellular clustering • Constant-factor approximation with a minimum cluster size • Each cluster has at least r points

  30. Cellular Clustering: Linear Program
      Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$ (sum of cellular cost and facility cost)
      Subject to:
      $\sum_c x_{ic} \ge 1$ (each point belongs to a cluster)
      $x_{ic} \le y_c$ (a cluster must be opened for a point to belong to it)
      $0 \le x_{ic} \le 1$ (points belong to clusters positively)
      $0 \le y_c \le 1$ (clusters are opened positively)
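To make the linear program on slide 30 concrete, the sketch below evaluates its objective and checks its constraints for a given assignment. It does not solve the LP. The instance (three points, two candidate clusters, and the values of d_c and f_c) is hypothetical.

```python
def lp_objective(x, y, d, f):
    """Objective: sum over clusters c of (sum_i x[i][c]) * d[c] + f[c] * y[c]."""
    return sum(
        sum(row[c] for row in x) * d[c] + f[c] * y[c]
        for c in range(len(y))
    )

def is_feasible(x, y, eps=1e-9):
    """Check the four constraint families of the LP for assignment x and openings y."""
    for row in x:
        if sum(row) < 1 - eps:              # each point belongs to a cluster
            return False
        for c, value in enumerate(row):
            if value < -eps or value > 1 + eps:
                return False                # 0 <= x_ic <= 1
            if value > y[c] + eps:          # x_ic <= y_c
                return False
    return all(-eps <= v <= 1 + eps for v in y)  # 0 <= y_c <= 1

# Hypothetical instance: x[i][c] assigns point i to cluster c.
d = [5, 10]          # assumed cluster radii d_c
f = [2, 3]           # assumed facility costs f_c
x = [[1, 0], [1, 0], [0, 1]]
y = [1, 1]

print(is_feasible(x, y))         # True
print(lp_objective(x, y, d, f))  # (2*5 + 2) + (1*10 + 3) = 25
print(is_feasible(x, [1, 0]))    # False: point 2 uses a cluster that is not opened
```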

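  31. Dual Program
      With dual variables $\alpha_i$ (one per point) and $\beta_{ic}$ (one per point-cluster pair):
      Maximize $\sum_i \alpha_i$
      Subject to:
      $\sum_i \beta_{ic} \le f_c$ (1)
      $\alpha_i - \beta_{ic} \le d_c$ (2)
      $\alpha_i \ge 0$, $\beta_{ic} \ge 0$
      Overview of algorithm: first grow $\alpha_i$, keeping $\beta_{ic} = 0$, until (2) becomes tight; then grow $\beta_{ic}$ at the same rate until (1) becomes tight.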

  32. Future Work • Improve the approximation ratio for cellular clustering • Improve the running time: presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables • Linear or even sub-linear time algorithms • Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k

  33. THANK YOU! QUESTIONS?
