Towards Achieving Anonymity - PowerPoint PPT Presentation

Presentation Transcript

  1. Towards Achieving Anonymity An Zhu

  2. Introduction • Collect and analyze personal data • Infer trends and patterns • Making the personal data “public” • Joining multiple sources • Third party involvement • Privacy concerns • Q: How to share such data?

  3. Example: Medical Records

  4. De-identified Records

  5. Not Sufficient! [Sweeney ’00] [Figure: joining the de-identified records with a public database reveals unique identifiers.]


  7. Anonymize the Quasi-Identifiers! [Figure: with the quasi-identifiers anonymized, the join with the public database no longer reveals unique identifiers.]

  8. Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures


  10. k-anonymized Table [Samarati ’01]

  11. k-anonymized Table [Samarati ’01] Each row is identical to at least k-1 other rows

  12. Definition: k-anonymity • Input: a table consisting of n rows, each with m attributes (quasi-identifiers) • Output: suppress some entries so that each row is identical to at least k-1 other rows • Objective: minimize the number of suppressed entries
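As a concrete illustration of the definition, here is a minimal sketch (not from the talk; the helper names and the '*' placeholder for a suppressed entry are my own conventions) of checking k-anonymity and measuring the suppression objective:

```python
from collections import Counter

def is_k_anonymous(table, k):
    """True iff each row is identical to at least k-1 other rows.
    A suppressed entry is represented by the placeholder '*'."""
    counts = Counter(tuple(row) for row in table)
    return all(c >= k for c in counts.values())

def suppression_cost(table):
    """The objective: total number of suppressed entries."""
    return sum(row.count('*') for row in table)

# Suppressing the second attribute makes each row identical to one other row.
table = [['a', '*'], ['a', '*'], ['b', '*'], ['b', '*']]
print(is_k_anonymous(table, 2), suppression_cost(table))  # True 4
```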

  13. Past Work and New Results • [MW ’04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ ’05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity


  15. Graph Representation [Figure: graph on the rows A-F with edge weights 2, 3, 3, 4, 4, 6.] W(e) = Hamming distance between the two rows
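The weighted graph can be sketched as follows (an illustrative helper of my own, with rows given as strings over the attribute alphabet):

```python
from itertools import combinations

def hamming(r1, r2):
    """Number of attribute positions in which the two rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

def build_graph(rows):
    """Edge weights of the complete graph on the rows."""
    return {(i, j): hamming(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}

rows = ['0011', '0010', '1110']
print(build_graph(rows))  # {(0, 1): 1, (0, 2): 3, (1, 2): 2}
```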

  16. Edge Selection I [Figure: each vertex of A-F selects its lightest-weight incident edge.] Each node selects the lightest-weight edge; k = 3

  17. Edge Selection II [Figure: extra edges added to merge small components.] For components with < k vertices, add more edges; k = 3
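The two selection phases above can be sketched roughly as follows. This is a simplified reading of the slides with a basic union-find; the paper's actual procedure may differ in details, e.g. how a small component chooses its outgoing edge:

```python
class DSU:
    """Minimal union-find over vertices 0..n-1."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def select_edges(n, weight, k):
    """weight[(i, j)] for i < j is the Hamming distance between rows i and j."""
    dsu, chosen = DSU(n), []
    w = lambda i, j: weight[(min(i, j), max(i, j))]
    # Phase 1: each vertex selects its lightest incident edge (keeping a forest).
    for v in range(n):
        u = min((x for x in range(n) if x != v), key=lambda x: w(v, x))
        if dsu.find(u) != dsu.find(v):
            chosen.append((v, u))
            dsu.union(v, u)
    # Phase 2: components smaller than k get a cheap edge to the outside.
    def comp_size(v):
        r = dsu.find(v)
        return sum(dsu.find(x) == r for x in range(n))
    while any(comp_size(v) < k for v in range(n)):
        v = next(v for v in range(n) if comp_size(v) < k)
        u = min((x for x in range(n) if dsu.find(x) != dsu.find(v)),
                key=lambda x: w(v, x))
        chosen.append((v, u))
        dsu.union(v, u)
    return chosen

weights = {(0, 1): 1, (0, 2): 2, (0, 3): 1, (1, 2): 1, (1, 3): 2, (2, 3): 1}
print(select_edges(4, weights, 2))  # a forest whose components all have >= 2 vertices
```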

  18. Lemma • The total weight of the selected edges is at most OPT • In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge • Forest: at most one selected edge per vertex • By construction, each vertex's selected edge weighs no more than its (k-1)st lightest incident edge

  19. Grouping • Ideally, each connected component forms a group • Anonymize the vertices within a group • Total cost of a group: (total edge weights) × (number of nodes), e.g. (2+2+3+3) × 6 [Figure: component on A-F with its edge weights.] • Small groups give the O(k) bound

  20. Dividing a Component • Root the tree arbitrarily • Divide if both the sub-tree and the rest have ≥ k vertices • Aim: all sub-trees < k [Figure: sub-trees of size < k and ≥ k marked.]

  21. Dividing a Component • Root the tree arbitrarily • Divide if both the sub-tree and the rest have ≥ k vertices • Rotate the tree if necessary [Figure: re-rooting the tree.]

  22. Dividing a Component • Root the tree arbitrarily • Divide if both the sub-tree and the rest have ≥ k vertices • Terminating condition: component size at most max(2k-1, 3k-5) [Figure: sub-trees of size < k.]
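One simple way to realize the dividing step (my sketch of the idea, not the paper's exact procedure): DFS from the root and cut off a sub-tree as soon as it has accumulated at least k vertices, so every resulting group ends up with at least k vertices:

```python
def partition_tree(adj, root, k):
    """Split a tree (adjacency dict) into groups of >= k vertices by
    cutting off any sub-tree that reaches size k during the DFS."""
    groups = []
    def dfs(v, parent):
        part = [v]
        for u in adj[v]:
            if u != parent:
                part.extend(dfs(u, v))
        if len(part) >= k and v != root:
            groups.append(part)   # cut this sub-tree off as its own group
            return []
        return part               # pass the vertices up to the parent
    rest = dfs(root, None)
    if rest:
        if groups and len(rest) < k:
            groups[-1].extend(rest)  # leftover too small: merge into a group
        else:
            groups.append(rest)
    return groups

# A path on 7 vertices with k = 3 splits into groups of sizes 3 and 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
print(partition_tree(path, 0, 3))
```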

  23. An Example [Figure: tree on rows A-F with edge weights 0, 2, 2, 3, 3.]

  24. An Example [Figure: the same tree re-rooted at C.]

  25. An Example [Figure: the divided tree.] Estimated cost: 4·3 + 3·3 = 21; Optimal cost: 3·3 + 3·3 = 18

  26. Past Work and New Results • [MW ’04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ ’05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

  27. 1.5-approximation [Figure: graph on the rows A-F with edge weights 0, 1, 5, 6, 6, 6.] W(e) = Hamming distance between the two rows

  28. Minimum {1,2}-matching [Figure: a matching in which every vertex has degree 1 or 2.] Each vertex is matched to 1 or 2 other vertices

  29. Properties • Each component has at most 3 nodes • Components with > 3 nodes are either not possible (each vertex has degree ≤ 2) or not optimal [Figure.]

  30. Qualities • Cost ≤ 2·OPT • For a binary alphabet: 1.5·OPT [Figure: a matched pair at distance a, and a matched triple with pairwise distances p, q, r, where r ≥ p, q.] • Pair: OPT pays 2a, we pay 2a • Triple: OPT pays p+q+r, we pay 3(p+q) ≤ 2(p+q+r)
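The key inequality behind the bound on triples, 3(p+q) ≤ 2(p+q+r) whenever r ≥ p, q (equivalently, p+q ≤ 2r), can be spot-checked numerically; this check is mine, not part of the talk:

```python
from itertools import product

def bound_holds(p, q, r):
    # We pay at most 3(p+q) on a triple; OPT pays p+q+r.
    return 3 * (p + q) <= 2 * (p + q + r)

# Exhaustively verify the bound on a grid of triples with r >= max(p, q).
assert all(bound_holds(p, q, r)
           for p, q, r in product(range(12), repeat=3)
           if r >= max(p, q))
print("3(p+q) <= 2(p+q+r) holds whenever r >= max(p, q)")
```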

  31. Past Work and New Results • [MW ’04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ ’05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

  32. Open Problems • Can we improve O(k)? • Ω(k) for the graph representation

  33. Open Problems • Can we improve O(k)? • Ω(k) for the graph representation 1111111100000000000000000000000000000000 0000000011111111000000000000000000000000 0000000000000000111111110000000000000000 0000000000000000000000001111111100000000 0000000000000000000000000000000011111111 k = 5, d = 16, c = k · d / 2


  35. Open Problems • Can we improve O(k)? • Ω(k) for the graph representation 10101010101010101010101010101010 11001100110011001100110011001100 11110000111100001111000011110000 11111111000000001111111100000000 11111111111111110000000000000000 k = 5, d = 16, c = 2 · d


  37. Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures

  38. Clustering Approach [AFKKPTZ ’06]

  39. Transform into a Metric…

  40. Clusters and Centers


  42. Measure • How good are the clusters? • “Tight” clusters are better • Minimize the max radius: Gather-k • Minimize the max distortion error: Cellular-k • Σ radius × num_nodes [Figure: example clusters; costs shown: Gather-k: 10, Cellular-k: 624.]

  43. Measure • How good are the clusters? • “Tight” clusters are better • Minimize the max radius: Gather-k • Minimize the max distortion error: Cellular-k • Σ radius × num_nodes • Handle outliers • Constant-factor approximations!
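With each cluster summarized by its radius and size, the two objectives compare as in this small sketch (the cluster values are illustrative, not taken from the slides):

```python
def gather_cost(clusters):
    """Gather-k objective: the maximum cluster radius."""
    return max(r for r, _ in clusters)

def cellular_cost(clusters):
    """Cellular-k objective: sum over clusters of radius * cluster size."""
    return sum(r * size for r, size in clusters)

# Each cluster is given as (radius, number_of_nodes).
clusters = [(10, 6), (4, 2)]
print(gather_cost(clusters))    # 10
print(cellular_cost(clusters))  # 68
```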

  44. Comparison • k = 5 • 5-anonymity • Suppress all entries • More distortion • Clustering • Can pick R5 as the center • Less distortion • Distortion is directly related to the pair-wise distances

  45. Results [AFKKPTZ ’06] • Gather-k • Tight 2-approximation • Extension to outliers: 4-approximation • Cellular-k • Primal-dual constant-factor approximation • Extensions as well


  47. 2-approximation • Assume an optimal value R • Make sure each node has at least k - 1 neighbors within distance 2R. [Figure: node A with balls of radius R and 2R.]

  48. 2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. • Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone. • Make sure we can reassign nodes to the selected centers.
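The selection step just described can be sketched as follows (my illustrative rendering: points on a line with absolute-difference distance; `greedy_centers` is a hypothetical name, and it returns None when some point lacks k - 1 neighbors within 2R, i.e. the guessed R is too small):

```python
def greedy_centers(points, dist, R, k):
    """Sketch of the selection step for a guessed optimal radius R."""
    n = len(points)
    # Feasibility: each point needs >= k-1 others within distance 2R
    # (the count below includes the point itself, hence >= k).
    for v in range(n):
        if sum(dist(points[v], points[u]) <= 2 * R for u in range(n)) < k:
            return None  # R is too small; no guarantee possible
    # Greedy: pick an arbitrary remaining point as a center and remove
    # everything within distance 2R of it; repeat until nothing remains.
    remaining, centers = set(range(n)), []
    while remaining:
        c = min(remaining)  # arbitrary choice
        centers.append(c)
        remaining -= {u for u in remaining
                      if dist(points[c], points[u]) <= 2 * R}
    return centers

pts = [0, 1, 2, 10, 11, 12]
d = lambda a, b: abs(a - b)
print(greedy_centers(pts, d, 1, 3))  # [0, 3]
```

In the full algorithm one would try candidate values of R (e.g. the pairwise distances) and keep the smallest R for which this selection and the subsequent reassignment of nodes to centers succeed.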

  49. Example: k = 5

  50. Optimal Solution [Figure: the optimal clusters of radius R, labeled 1 and 2.]