# Towards Achieving Anonymity

##### Presentation Transcript

1. Introduction • Collect and analyze personal data • Infer trends and patterns • Making the personal data “public” • Joining multiple sources • Third party involvement • Privacy concerns • Q: How to share such data?

2. Example: Medical Records

3. De-identified Records

4. Not Sufficient! [Sweeney '00] [figure: joining the quasi-identifiers of the de-identified records with a public database recovers the unique identifiers]

6. Anonymize the Quasi-Identifiers! [figure: the same join after the quasi-identifiers are anonymized]

7. Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures

9. k-anonymized Table [Samarati '01]

10. k-anonymized Table [Samarati '01] Each row is identical to at least k-1 other rows

11. Definition: k-anonymity • Input: a table consisting of n rows, each with m attributes (quasi-identifiers) • Output: suppress some entries so that each row is identical to at least k-1 other rows • Objective: minimize the number of suppressed entries
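
The definition above can be sketched as a small check, assuming rows are tuples of attribute values and a suppressed entry is written `*` (both conventions chosen here for illustration, not from the slides):

```python
from collections import Counter

STAR = "*"  # marker for a suppressed entry (an assumed convention)

def is_k_anonymous(table, k):
    """True iff every row, after suppression, appears at least k times,
    i.e. is identical to at least k - 1 other rows."""
    counts = Counter(tuple(row) for row in table)
    return all(c >= k for c in counts.values())

def suppression_cost(table):
    """Number of suppressed entries, the objective to minimize."""
    return sum(entry == STAR for row in table for entry in row)

# Suppressing the third attribute collapses the rows into groups of two.
table = [
    ("02139", "M", STAR),
    ("02139", "M", STAR),
    ("02138", "F", STAR),
    ("02138", "F", STAR),
]
```

With this table, `is_k_anonymous(table, 2)` holds at a suppression cost of 4, while 3-anonymity fails because each group has only two rows.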

12. Past Work and New Results • [MW '04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ '05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

14. Graph Representation [figure: rows A-F as vertices of a complete weighted graph] W(e) = Hamming distance between the two rows
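
The construction can be sketched as follows; `build_graph` and the string row encoding are illustrative names chosen here, not the slide's data:

```python
from itertools import combinations

def hamming(row1, row2):
    """Hamming distance: number of attributes where the two rows differ."""
    return sum(a != b for a, b in zip(row1, row2))

def build_graph(rows):
    """Complete weighted graph: one vertex per row,
    W(e) = Hamming distance between the two endpoint rows."""
    return {(u, v): hamming(rows[u], rows[v])
            for u, v in combinations(range(len(rows)), 2)}

# Three illustrative rows encoded as attribute strings.
rows = ["0101", "0111", "1111"]
g = build_graph(rows)
```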

15. Edge Selection I [figure, k = 3] Each node selects its lightest-weight incident edge

16. Edge Selection II [figure, k = 3] For components with fewer than k vertices, add more edges
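
The two selection phases above might be sketched with a union-find structure; this is an illustrative implementation under the graph encoding used earlier, not the authors' code:

```python
from collections import Counter

def lightest_edge_forest(graph, n, k):
    """Phase 1: every vertex selects its lightest-weight incident edge.
    Phase 2: while some component has fewer than k vertices, it absorbs
    its cheapest outgoing edge.  Returns the set of chosen edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = set()

    def pick(u, v):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            chosen.add((min(u, v), max(u, v)))

    def w(u, v):
        return graph[(min(u, v), max(u, v))]

    # Phase 1: each vertex picks its lightest-weight incident edge.
    for v in range(n):
        u = min((x for x in range(n) if x != v), key=lambda x: w(v, x))
        pick(v, u)

    # Phase 2: grow components that still have fewer than k vertices.
    while True:
        small = [r for r, s in Counter(find(v) for v in range(n)).items()
                 if s < k]
        if not small:
            return chosen
        comp = [v for v in range(n) if find(v) == small[0]]
        u, v = min(((a, b) for a in comp
                    for b in range(n) if find(b) != small[0]),
                   key=lambda e: w(*e))
        pick(u, v)

# Two cheap pairs (0,1) and (2,3); with k = 2 phase 1 already suffices.
graph = {(0, 1): 1, (0, 2): 5, (0, 3): 5, (1, 2): 5, (1, 3): 5, (2, 3): 1}
forest = lightest_edge_forest(graph, 4, 2)
```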

17. Lemma • The total weight of the selected edges is at most OPT • In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge • Forest: at most one selected edge per vertex • By construction, each selected edge weighs no more than that vertex's (k-1)st lightest incident edge

18. Grouping • Ideally, each connected component forms a group • Anonymize the vertices within a group • Total cost of a group: (total edge weights) × (number of nodes), e.g. (2+2+3+3) × 6 [figure] • Small groups: O(k)

19. Dividing a Component • Root the tree arbitrarily • Divide if both the sub-tree and the rest have ≥ k vertices • Aim: all sub-trees < k [figure]

20. Dividing a Component • Root the tree arbitrarily • Divide if both the sub-tree and the rest have ≥ k vertices • Rotate the tree if necessary [figure]

21. Dividing a Component • Root the tree arbitrarily • Divide if both the sub-tree and the rest have ≥ k vertices • Termination condition: component size ≤ max(2k-1, 3k-5) [figure]
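
The dividing step can be illustrated by a helper that looks for a valid cut; `find_cut` and the rooted-tree child-list encoding are hypothetical names introduced here:

```python
def subtree_sizes(children, root):
    """Sizes of all subtrees of a rooted tree given as a child-list dict."""
    sizes = {}
    def dfs(v):
        sizes[v] = 1 + sum(dfs(c) for c in children.get(v, []))
        return sizes[v]
    dfs(root)
    return sizes

def find_cut(children, root, k):
    """Return a vertex whose subtree has >= k nodes while the rest of the
    component also keeps >= k nodes (preferring the smallest such
    subtree), or None when no further division is possible."""
    sizes = subtree_sizes(children, root)
    total = sizes[root]
    candidates = [v for v in sizes
                  if v != root and sizes[v] >= k and total - sizes[v] >= k]
    return min(candidates, key=lambda v: sizes[v]) if candidates else None

# Example: a path 0-1-2-3-4-5 rooted at 0; with k = 3 the subtree at
# vertex 3 (nodes 3, 4, 5) can be cut off, leaving nodes 0, 1, 2.
children = {0: [1], 1: [2], 2: [3], 3: [4], 4: [5]}
cut = find_cut(children, 0, 3)
```

When `find_cut` returns None, the component can no longer be divided, which is where the size bound max(2k-1, 3k-5) from the slide applies.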

22. An Example [figure: the selected forest on rows A-F]

23. An Example [figure: the component re-rooted at C]

24. An Example [figure] Estimated cost: 4×3 + 3×3; Optimal cost: 3×3 + 3×3

25. Past Work and New Results • [MW '04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ '05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

26. 1.5-approximation [figure: rows A-F as vertices of a weighted graph] W(e) = Hamming distance between the two rows

27. Minimum {1,2}-matching [figure] Each vertex is matched to 1 or 2 other vertices
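
For small inputs, the group structure a {1,2}-matching induces (parts of size 2 or 3) can be found by brute force. This exponential sketch is for illustration only; it is not the polynomial-time matching algorithm the slide refers to, and `group_cost` is a hypothetical cost function:

```python
from itertools import combinations

def best_partition(n, cost):
    """Minimum-cost partition of {0, ..., n-1} into groups of size 2 or 3,
    the structure a {1,2}-matching induces.  Exponential brute force."""
    def solve(items):
        if not items:
            return 0, []
        first, best = items[0], (float("inf"), None)
        for size in (2, 3):
            for rest in combinations(items[1:], size - 1):
                group = (first,) + rest
                sub_cost, sub_groups = solve(
                    [x for x in items if x not in group])
                if cost(group) + sub_cost < best[0]:
                    best = (cost(group) + sub_cost, [group] + sub_groups)
        return best
    return solve(list(range(n)))

# Example: cost of a group = (# non-uniform attributes) * group size,
# i.e. the suppression cost of anonymizing the group together.
rows = ["000", "001", "110", "111"]
def group_cost(g):
    return len(g) * sum(len({rows[i][j] for i in g}) > 1 for j in range(3))

total, groups = best_partition(4, group_cost)
```

Here the cheapest partition pairs the two near-identical rows on each side, at a total suppression cost of 4.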

28. Properties • Each component has ≤ 3 nodes • More than 3 nodes is either not possible (degree ≤ 2) or not optimal

29. Qualities • Cost ≤ 2·OPT • For a binary alphabet: 1.5·OPT [figure: a matched pair at distance a; a matched triple with pairwise distances p, q, r, where r ≥ p, q] • Pair: OPT pays 2a, we pay 2a • Triple: OPT pays p + q + r, we pay 3(p + q) ≤ 2(p + q + r)

30. Past Work and New Results • [MW '04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ '05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

31. Open Problems • Can we improve O(k)? • Ω(k) for the graph representation

32. Open Problems • Can we improve O(k)? • Ω(k) for the graph representation 1111111100000000000000000000000000000000 0000000011111111000000000000000000000000 0000000000000000111111110000000000000000 0000000000000000000000001111111100000000 0000000000000000000000000000000011111111 k = 5, d = 16, c = k × d / 2

34. Open Problems • Can we improve O(k)? • Ω(k) for the graph representation 10101010101010101010101010101010 11001100110011001100110011001100 11110000111100001111000011110000 11111111000000001111111100000000 11111111111111110000000000000000 k = 5, d = 16, c = 2 × d

36. Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures

37. Clustering Approach [AFKKPTZ '06]

38. Transfer into a Metric…

39. Clusters and Centers

41. Measure • How good are the clusters? • “Tight” clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k • Σ radius × num_nodes [figure] Cost: Gather-k: 10; Cellular-k: 624
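
The two objectives can be stated in a few lines; representing each cluster as a `(radius, number_of_nodes)` pair is an encoding assumed here for illustration:

```python
def gather_cost(clusters):
    """Gather-k objective: the maximum cluster radius."""
    return max(radius for radius, size in clusters)

def cellular_cost(clusters):
    """Cellular-k objective: total distortion error,
    the sum of radius * num_nodes over all clusters."""
    return sum(radius * size for radius, size in clusters)

# Hypothetical clustering: two clusters of 6 nodes with radii 10 and 4.
clusters = [(10, 6), (4, 6)]
```

Note the trade-off: Gather-k only sees the worst cluster, while Cellular-k charges every cluster in proportion to its size.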

42. Measure • How good are the clusters? • “Tight” clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k • Σ radius × num_nodes • Handle outliers • Constant approximations!

43. Comparison • k = 5 • 5-anonymity: suppress all entries, more distortion • Clustering: can pick R5 as the center, less distortion • Distortion is directly related to the pairwise distances

44. Results [AFKKPTZ '06] • Gather-k • Tight 2-approximation • Extension to outliers: 4-approximation • Cellular-k • Primal-dual constant approximation • Extensions as well

46. 2-approximation • Assume an optimal value R • Make sure each node has at least k - 1 neighbors within distance 2R [figure: node A with radii R and 2R]

47. 2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. • Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone. • Make sure we can reassign nodes to the selected centers.
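
The steps above can be sketched as follows, assuming points in the plane and a guessed radius R (the point encoding and function name are chosen here for illustration):

```python
import math

def gather_centers(points, k, R):
    """Greedy sketch of the 2-approximation: verify every point has at
    least k - 1 neighbors within 2R (else R is below the optimum and we
    return None), then repeatedly take an arbitrary uncovered point as a
    center and remove every point within distance 2R of it."""
    def dist(p, q):
        return math.dist(p, q)

    # Feasibility check: each node needs >= k - 1 neighbors within 2R.
    for p in points:
        if sum(dist(p, q) <= 2 * R for q in points if q != p) < k - 1:
            return None

    centers, remaining = [], list(points)
    while remaining:
        c = remaining[0]          # arbitrary uncovered node
        centers.append(c)
        remaining = [q for q in remaining if dist(c, q) > 2 * R]
    return centers

# Two well-separated groups of three collinear points; with k = 3 and
# R = 1 the greedy pass selects one center per group.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
centers = gather_centers(points, 3, 1)
```

In practice R would be found by trying candidate values (e.g. all pairwise distances) and keeping the smallest feasible one; each selected ball of radius 2R then contains the k - 1 guaranteed neighbors, giving the factor-2 bound.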

48. Example: k = 5

49. Optimal Solution [figure: the optimal clusters of radius R]