
Anonymization of Set-Valued Data via Top-Down, Local Generalization

Yeye He, Jeffrey F. Naughton (University of Wisconsin-Madison). The problem: anonymizing set-valued data presents challenges not seen in relational data.


Presentation Transcript


  1. Anonymization of Set-Valued Data via Top-Down, Local Generalization • Yeye He, Jeffrey F. Naughton • University of Wisconsin-Madison

  2. Overview • The problem: • Anonymizing set-valued data presents challenges not seen in relational data • Previous solutions explored parts but not all of the problem space • Our goals: • Develop a scalable algorithm for the new variant of the problem • Perform experiments to explore strengths and weaknesses of the approach

  3. What is set-valued data? • “Relational data” • One sensitive attribute for each tuple • “Set-valued data” • Logically: (personid, {item1, item2, …, itemn}) • Multiple sensitive values are possible in one record

  4. An attack scenario • Retailer publishes market basket data • The adversary knows Alice has bought milk, beer, and diapers • The adversary infers Alice has also bought a pregnancy test and diabetes medicine • Adversary's knowledge: {beer, milk, diapers} • Matching published transaction: {beer, milk, diapers, pregnancy test, diabetes medicine}
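The attack on slide 4 can be illustrated with a short Python sketch. The function name `infer_extra_items` and every basket other than Alice's are hypothetical, added only for illustration; the sketch assumes the adversary simply looks for published transactions that contain all the items they already know.

```python
def infer_extra_items(published, known_items):
    """Return the extra items leaked if exactly one published
    transaction contains everything the adversary knows, else None."""
    known = set(known_items)
    matches = [t for t in published if known <= t]
    if len(matches) == 1:
        # A unique match re-identifies the record and exposes the rest of it.
        return matches[0] - known
    return None

# Hypothetical published market basket data (only the first basket is from the slide).
published = [
    {"beer", "milk", "diapers", "pregnancy test", "diabetes medicine"},
    {"beer", "chips"},
    {"milk", "bread", "diapers"},
]

leaked = infer_extra_items(published, {"beer", "milk", "diapers"})
print(leaked)
```

If at least k transactions matched the known items, the function would return None, which is the situation the anonymity models in the following slides aim to guarantee.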

  5. Existing work: a priori QI/SI partition • Scenarios where set elements can be partitioned a priori into Quasi-Identifier (QI) items and Sensitive Items (SI) • {beer, milk, diapers, pregnancy test, diabetes medicine} • Substantial existing work & good algorithms • [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07] • But what if a priori partitioning is not possible? • Individuals may have different privacy requirements • The adversary may see sensitive items and use them as QI • [Diagram: taxonomy of set-valued data anonymization; first question: is an a priori QI/SI partition possible?]

  6. Existing work: no QI/SI partition • Prior work [Terrovitis+08] proposed the k^m-anonymity model • k^m-anonymity • For any transaction (data record) T, for any subset of m items in T, there are at least k-1 other transactions with the same m items • [Diagram: taxonomy of set-valued data anonymization; branches: a priori QI/SI partition possible, no a priori QI/SI partition]

  7. The m in k^m-anonymity [Terrovitis+08] • Attack revisited • The data is 10^3-anonymized, the adversary sees {beer, milk, diapers} • Cannot tell Alice’s transaction from the other 9 • Effective assuming the adversary never sees more than m=3 items • m in k^m-anonymity • requires some identified m s.t. no adversary will ever see more than m items • What about the case where there is no such m? • The case we consider • [Diagram: taxonomy of set-valued data anonymization; the "no a priori QI/SI partition" branch splits into "has identified m" and "no identified m"]

  8. Our model: k-anonymity for set-valued data • Transactional database D is k-anonymous if • Every transaction (data record) occurs at least k times • Different from k^m-anonymity [Terrovitis+08] • no limit on m, i.e., valid for any m • thus a stronger privacy model

  9. k-anonymity subsumes k^m-anonymity [Terrovitis+08] • Every database D that satisfies k-anonymity also satisfies k^m-anonymity • There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity • Example: 2^3-anonymous but not 2-anonymous • T1 = {A, B, C} • T2 = {A, B, C} • T3 = {A, B} • [Diagram: taxonomy of set-valued data anonymization; the "no QI/SI partition" branch shows k-anon and k^m-anon]
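The example on slide 9 can be checked with a small Python sketch. The function names are hypothetical, and the k^m check assumes the common reading of the definition in which every subset of up to m items of a transaction must be contained in at least k transactions (the transaction itself plus k-1 others).

```python
from collections import Counter
from itertools import combinations

def is_k_anonymous(db, k):
    """Every distinct transaction must occur at least k times."""
    counts = Counter(frozenset(t) for t in db)
    return all(c >= k for c in counts.values())

def is_km_anonymous(db, k, m):
    """Every subset of up to m items of any transaction must be
    contained in at least k transactions (assumed reading)."""
    sets = [frozenset(t) for t in db]
    for t in sets:
        for size in range(1, min(m, len(t)) + 1):
            for combo in combinations(sorted(t), size):
                support = sum(1 for u in sets if set(combo) <= u)
                if support < k:
                    return False
    return True

# The three transactions from the slide.
db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}]

print(is_km_anonymous(db, k=2, m=3))  # k^m check with k=2, m=3
print(is_k_anonymous(db, k=2))        # plain k-anonymity with k=2
```

T3 = {A, B} occurs only once, so the database fails 2-anonymity even though every item subset of T3 also appears in T1 and T2.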

  10. Problem statement • Given a transactional database D, find a transformation D’ of D s.t.: • D’ satisfies k-anonymity • the transformation minimizes information loss between (D, D’)

  11. Hierarchical generalization • Hierarchy: All → {Alcohol, Health care}; Alcohol → {Beer, Wine}; Health care → {Diaper, Pregnancy test} • Transaction generalization • Ti: {“Beer”, “Wine”, “Diaper”} → {“Alcohol”, “Health care”} • Duplicates removed
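A minimal Python sketch of the transaction generalization on slide 11, assuming the four-leaf hierarchy shown there. The `PARENT` map and the `levels` parameter are illustrative names, not part of the original talk.

```python
# Child -> parent map for the slide's hierarchy (illustrative encoding).
PARENT = {
    "Beer": "Alcohol",
    "Wine": "Alcohol",
    "Diaper": "Health care",
    "Pregnancy test": "Health care",
    "Alcohol": "All",
    "Health care": "All",
}

def generalize(transaction, levels=1):
    """Replace each item by its ancestor `levels` steps up the hierarchy.
    Using a set removes the duplicates that generalization creates."""
    result = set()
    for item in transaction:
        for _ in range(levels):
            item = PARENT.get(item, item)  # the root maps to itself
        result.add(item)
    return result

ti = {"Beer", "Wine", "Diaper"}
print(generalize(ti))            # one level up
print(generalize(ti, levels=2))  # all the way to the root
```

Because the result is a set, "Beer" and "Wine" collapse into a single "Alcohol" entry, which is the "duplicates removed" step on the slide.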

  12. Information loss metric • Normalized Certainty Penalty (NCP) [Xu+06] • Also used in previous work [Terrovitis+08] • Hierarchy: All → {Alcohol, Health care}; Alcohol → {Beer, Wine}; Health care → {Diaper, Pregnancy test} • Example: • Generalize “Beer” to “Alcohol”: (2/4 = 0.5) info loss • Generalize “Beer” to “All”: (4/4 = 1) info loss
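The two NCP examples on slide 12 can be reproduced with a short Python sketch. It assumes the penalty of a generalized node is the number of leaves under it divided by the total number of leaves, and that an ungeneralized leaf item has a penalty of 0; the second point is an assumption not stated on the slide. The names `CHILDREN`, `leaf_count`, and `ncp` are illustrative.

```python
# Parent -> children map for the slide's hierarchy (illustrative encoding).
CHILDREN = {
    "All": ["Alcohol", "Health care"],
    "Alcohol": ["Beer", "Wine"],
    "Health care": ["Diaper", "Pregnancy test"],
}

def leaf_count(node):
    """Number of leaf items covered by a hierarchy node."""
    kids = CHILDREN.get(node)
    if not kids:
        return 1
    return sum(leaf_count(c) for c in kids)

TOTAL_LEAVES = leaf_count("All")

def ncp(node):
    """Penalty for publishing `node` in place of a leaf item.
    Assumption: a leaf itself carries no penalty."""
    if node not in CHILDREN:
        return 0.0
    return leaf_count(node) / TOTAL_LEAVES

print(ncp("Alcohol"))  # Beer generalized to Alcohol
print(ncp("All"))      # Beer generalized to All
```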

  13. Our algorithm: Partition-based anonymization • Top-down • Generalize everything to the root representation • This results in one initial partition • Divide and conquer • Choose a node to specialize for each partition • Based on information gain heuristics • Recursively partition the resulting sub-partitions

  14. Example: 2-anonymization

  15. Generalize all data to root • {a1} → {ALL} • {a1,a2} → {ALL} • {b1,b2} → {ALL} • {b1,b2} → {ALL} • {a1,a2,b2} → {ALL} • {a1,a2,b2} → {ALL} • {a1,a2,b1,b2} → {ALL} • One initial partition

  16. Initial partition: specialize using ALL → {A, B} • {ALL} → {A} • {ALL} → {A} • {ALL} → {B} • {ALL} → {B} • {ALL} → {A, B} • {ALL} → {A, B} • {ALL} → {A, B} • Produces three sub-partitions: {A}, {B}, and {A, B}

  17. Green partition: specialize using A → {a1, a2} • {A} → {a1} • {A} → {a1,a2} • Unchanged: {B}, {B}, {A, B}, {A, B}, {A, B} • Specialization violates 2-anonymity, so it is rolled back

  18. Blue partition: specialize using B → {b1, b2} • {B} → {b1,b2} • {B} → {b1,b2} • Unchanged: {A}, {A}, {A, B}, {A, B}, {A, B} • Specialization is OK and reaches the leaf level, so stop

  19. Red partition: specialize using A → {a1, a2} • {A, B} → {a1,a2,B} • {A, B} → {a1,a2,B} • {A, B} → {a1,a2,B} • Unchanged: {A}, {A}, {b1,b2}, {b1,b2} • A is chosen over B based on the max info gain heuristic

  20. Red partition: specialize using B → {b1, b2} • {a1,a2,B} → {a1,a2,b2} • {a1,a2,B} → {a1,a2,b2} • {a1,a2,B} → {a1,a2,b1,b2} • Unchanged: {A}, {A}, {b1,b2}, {b1,b2} • Specializing B violates 2-anonymity, so it is rolled back
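The 2-anonymization walkthrough in slides 15 to 20 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it tries candidate nodes in sorted order and takes the first specialization that keeps every sub-partition at size k or more, instead of the information gain heuristic from the talk, and it rolls back a whole specialization when any sub-partition is too small. The names `CHILDREN`, `leaves`, `generalize`, and `anonymize` are hypothetical, and the input is assumed to contain at least k transactions.

```python
from collections import defaultdict

# Two-level hierarchy from the example (illustrative encoding).
CHILDREN = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}

def leaves(node):
    """Set of leaf items covered by a hierarchy node."""
    kids = CHILDREN.get(node)
    if not kids:
        return {node}
    covered = set()
    for child in kids:
        covered |= leaves(child)
    return covered

def generalize(transaction, cut):
    """Represent a transaction by the nodes of `cut` that cover its items."""
    return frozenset(n for n in cut if leaves(n) & transaction)

def anonymize(transactions, k, cut=frozenset({"ALL"})):
    """Return (original, generalized) pairs for one partition."""
    for node in sorted(cut):
        if node not in CHILDREN:
            continue  # leaf level, nothing left to specialize
        new_cut = (cut - {node}) | set(CHILDREN[node])
        groups = defaultdict(list)
        for t in transactions:
            groups[generalize(t, new_cut)].append(t)
        if all(len(g) >= k for g in groups.values()):
            # Specialization is safe: recurse on each sub-partition,
            # using its own representation as its local cut.
            result = []
            for rep, group in groups.items():
                result += anonymize(group, k, rep)
            return result
        # Otherwise roll back and try the next candidate node.
    return [(t, generalize(t, cut)) for t in transactions]

data = [{"a1"}, {"a1", "a2"}, {"b1", "b2"}, {"b1", "b2"},
        {"a1", "a2", "b2"}, {"a1", "a2", "b2"}, {"a1", "a2", "b1", "b2"}]

for original, published in anonymize(data, k=2):
    print(sorted(original), "->", sorted(published))
```

Each sub-partition keeps its own cut, so {b1,b2} can be published at the leaf level while {a1} and {a1,a2} stay generalized to {A}. That per-partition choice is what the talk calls local recoding.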

  21. Main advantages • Effective (less information loss) • Even though we impose a stronger privacy criterion • Local recoding vs. Global recoding • Efficient (less execution time) • Divide and conquer vs. bottom-up (exhaustive) enumeration • Linear in the input data & level of the hierarchy vs. worst case exponential in previous work

  22. Experimental setup: market basket data • Real-world benchmark data • BMS-WebView-1, BMS-WebView-2, BMS-POS • No accompanying hierarchy data • Used synthetic hierarchy (as in the previous work) • Comparing our Partition-based algorithm (Partition), with previous Apriori-Anonymization (AA) [Terrovitis+08]

  23. An order of magnitude faster on market basket data

  24. Less information loss on market basket data • Why? Local recoding

  25. Sensitivity analysis: consistently faster with varied parameters

  26. Sensitivity analysis: less information loss in most cases

  27. Experimental setup: AOL query log • From a set-valued perspective • No accompanying hierarchy data, again • Use alphabetical hierarchy • Use WordNet hierarchy • Compare with earlier work [Adar07]

  28. Less information loss than [Adar07] on AOL query log

  29. Reasonably efficient on AOL query log • Efficient given the size of the query log (2.2GB) • Information loss not as satisfactory as in market basket data • Words generalized to “event”, “process”, “thing”…

  30. Conclusion • Developed a faster, more information-preserving anonymization algorithm • for set-valued data with no QI/SI distinction • Performed well on market basket data • less satisfying for search log data • Open and important question: stronger privacy models • what is a good privacy model stronger than k-anonymity for set-valued data with no QI/SI distinction?
