1 / 32

Probabilistic Inference Protection on Anonymized Data

Probabilistic Inference Protection on Anonymized Data. Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology) Ada Wai-Chee Fu (the Chinese University of Hong Kong) Ke Wang (Simon Fraser University) Yabo Xu (Sun Yat-sen University) Jian Pei (Simon Fraser University)

dior
Download Presentation

Probabilistic Inference Protection on Anonymized Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Probabilistic Inference Protection on Anonymized Data Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology) Ada Wai-Chee Fu (the Chinese University of Hong Kong) Ke Wang (Simon Fraser University) Yabo Xu (Sun Yat-sen University) Jian Pei (Simon Fraser University) Philip S. Yu (Univerisity of Illinois at Chicago) Prepared by Raymond Chi-Wing Wong Presented by Raymond Chi-Wing Wong

  2. Outline • Introduction • l-diversity • Background Knowledge • Proposed Model • Conclusion

  3. Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 1. l-diversity Bucketization In other words, P(Alan is linked to Lung Cancer) is at most 1/2. Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer with probability=1/2. Knowledge 1 This dataset satisfies 2-diversity. QI Table Sensitive Table

  4. Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 1. l-diversity This can be obtained from statistical reports from the US department of Health and Human Services and other statistical data sources discussed in previous studies Bucketization Knowledge 3 QI Based Distribution Knowledge 1 This dataset satisfies 2-diversity. QI Table Sensitive Table

  5. Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 1. l-diversity It is more likely that a male patient is linked to Lung Cancer compared with a female patient. Bucketization Knowledge 3 QI Based Distribution Knowledge 1 Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2). This dataset satisfies 2-diversity. Why? QI Table Sensitive Table

  6. Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Bucketization Knowledge 3 QI Based Distribution We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer) ) according to Knowledge 1, 2 and 3 Knowledge 1 Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2). This dataset satisfies 2-diversity. QI Table Sensitive Table

  7. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer) ) according to Knowledge 1, 2 and 3

  8. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer) ) according to Knowledge 1, 2 and 3

  9. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Most existing privacy studies involve some formulae which are monotonic. Thus, most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.

  10. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Most existing privacy studies involve some formulae which are monotonic. Thus, most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.

  11. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Related Work: There is a closely related work [LLZ09] for this problem. [LLZ09] T. Li, N. Li and J. Zhang, “Modeling and Integrating Background Knowledge in Data Anonymization”, ICDE 2009 [LLZ09] approximates the formula for this probability. Thus, there is no solid guarantee on the privacy protection.

  12. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Contributions: We propose a condition. If this condition is satisfied, we canguarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2 ) Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically, (1) Computing the condition is computationally cheap, and (2) The condition involves a monotonic function on the A-group size.

  13. Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • The major idea of the condition includes some simple calculations based on the statistics of an A-group • The size of the A-group (N) • The privacy requirement (r) • The global probabilities of each tuple in the A-group to a sensitive value Contributions: We propose a condition. If this condition is satisfied, we canguarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2 ) Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically, (1) Computing the condition is computationally cheap, and (2) The condition involves a monotonic function on the A-group size.

  14. N Satisfied/Not Satisfied r Global probabilities Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • The major idea of the condition includes some simple calculations based on the statistics of an A-group • The size of the A-group (N) • The privacy requirement (r) • The global probabilities of each tuple in the A-group to a sensitive value Condition Check If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2)

  15. 4. Conclusion • Background Knowledge • QI-based Probability Distribution • Two Challenges • Challenge 1: The formula for the probability is computationally expensive • Challenge 2: The formula is not monotonic • Proposed Condition • overcomes Challenge 1 and Challenge 2

  16. Q&A

  17. Release the data set to public 1. l-diversity There is another way to prevent this linkage called Generalization. The following principle to be discussed can also be applied to Generalization. A way to prevent this linkage. Bucketization These two tuples form an anonymized group (A-group) GID = L1 These two tuples form another A-group. GID = L2

  18. Release the data set to public 1. l-diversity Bucketization GID = L1 GID = L2 QI Table Sensitive Table

  19. Release the data set to public 1. l-diversity Bucketization QI Table Sensitive Table

  20. Knowledge 2 I also know Alan with (Male, 41) Release the data set to public 1. l-diversity Knowledge 1 Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.

  21. Merging Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity P(an individual is linked to a sensitive value) ≤ 0.5 • Monotonicity • Consider two A-groups P(an individual is linked to a sensitive value) = 0.5 An A-group with GID = L1 An A-group “merged”from these two A-groups An A-group with GID = L2 P(an individual is linked to a sensitive value) = 0.4 The probability is monotonically decreasing when the size of the A-gourp increases.

  22. Merging Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity It is possible that P(an individual is linked to a sensitive value) > 0.5 • Non-Monotonicity • Consider two A-groups P(an individual is linked to a sensitive value) = 0.5 An A-group with GID = L1 An A-group “merged”from these two A-groups An A-group with GID = L2 P(an individual is linked to a sensitive value) = 0.4 The probability is notmonotonically decreasing when the size of the A-gourp increases.

  23. Knowledge 2 I also know Alan with (Male, 41) N Satisfied/Not Satisfied r Global probabilities Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 Knowledge 1 For the sake of illustration, we focus on attribute Gender only. Knowledge 3 QI Based Distribution Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2. 2 Condition Check 2 If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2) 0.1 0.003

  24. N Satisfied/Not Satisfied r Global probabilities What is the condition check? In the condition check, there is an expression ceil in terms of N, r and global probabilities to compute. 2 Condition Check 2 0.1 0.003

  25. What is the condition check? In the condition check, there is an expression ceil in terms of N, r and global probabilities to compute. • Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.

  26. Theorem 2: Computing ceil can be done in O(1) time. This means that we overcome Challenge 1. Challenge 1: Calculating the probability is computationally expensive. • Theorem 3:ceil is a monotonically increasing function on N where N is the A-group size. This means that we overcome Challenge 2. Challenge 2: The formula for the original probability is not monotonic with respect to the A-group size.

  27. ceil = (N-r)/fmax fmax(r-1)/(1-fmax) + (N-1) N Satisfied/Not Satisfied r Global probabilities What is the condition check? The greatest global probability fmax = max{f1, f2} = max{0.1, 0.003} = 0.1 The difference between the greatest global probability and the “current” global probability = 0 = 0.1 – 0.1 1 = fmax– f1 in terms of N, r and fmax. = 0.097 = 0.1 – 0.003 2 = fmax– f2 The condition is whether this difference 1 (and 2) is at most an expression ceil 2 Condition Check 2 f1 0.1 0.003 f2

  28. ceil = (N-r)/fmax fmax(r-1)/(1-fmax) + (N-1) What is the condition check? The greatest global probability fmax = max{f1, f2} = max{0.1, 0.003} = 0.1 The difference between the greatest global probability and the “current” global probability = 0 = 0.1 – 0.1 1 = fmax– f1 = 0.097 = 0.1 – 0.003 2 = fmax– f2 The condition is whether this difference 1 (and 2) is at most an expression ceil • Theorem 1: If i≤ceil is satisfied, then the privacy requirement is satisfied.

  29. Anonymization • The condition check gives hints for anonymization • Initially, each tuple forms an A-group. • Repeat the following until each A-group satisfies the condition. • If there is an A-group violating the condition, merge this A-group with some other A-group such that the “merged” A-group satisfies the condition.

  30. Release the data set to public B.1.2 K-Anonymity Problem: to generate a data set such that each possible value appears at least TWO times. Two Kinds of Generalisations 1. ShatinNT 2. 16 July* “ShatinNT” causes LESS distortion than “16 July*” Question: how can we measure the distortion? This data set is 2-anonymous

  31. HKG * NT Jan July Oct Feb KLN 16 July 21 Oct 8 Feb 29 Jan Fanling Mongkok Jordon Shatin B.1.2 K-Anonymity Measurement= 1/1=1.0 * Measurement= 2/2=1.0 Male Female Measurement= 1/2 =0.5 Conclusion: We propose a measurement of distortion of the modified/anonymized data.

  32. HKG * NT Jan July Oct Feb KLN 16 July 21 Oct 8 Feb 29 Jan Fanling Mongkok Jordon Shatin B.1.2 K-Anonymity Measurement= 1/1=1.0 * Measurement= 2/2=1.0 Male Female Measurement= 1/2 =0.5 Can we modify the measurement? e.g. different weightings to each level

More Related