Hippocratic Data Management

Hippocratic Data Management. Rakesh Agrawal IBM Almaden Research Center. Thesis. We need information systems that respect the privacy of data they manage AND do not impede the useful flow of information. It is feasible to reconcile the apparent contradiction.

Presentation Transcript


  1. Hippocratic Data Management Rakesh Agrawal IBM Almaden Research Center

  2. Thesis • We need information systems that • respect the privacy of data they manage AND • do not impede the useful flow of information. • It is feasible to reconcile the apparent contradiction

  3. Outline • Why Privacy in Data Systems • Some Technology Directions • Some Challenging Problems

  4. Drivers for Privacy • Privacy Surveys: • 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99) • 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001) • Govt. legislations & guidelines: • Fair Information Practices Act (US, 1974) • OECD Guidelines (Europe, 1980) • Canadian Standards Association’s Model Code (1995) • Australian Privacy Amendment (2000) • Japan: proposed legislation (2003) • HIPAA, GLB, Recent U.S. Federal & State Initiatives

  5. Privacy Violations • Accidents: • Kaiser, GlobalHealthtrax • Lax security: • Massachusetts govt. • Ethically questionable behavior: • Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food • Illegal: • Toysmart

  6. Assertion • Enterprises lack tools and technologies for managing private data and enforcing privacy policies.

  7. Founding Tenets of Current Database Systems • Ullman, “Principles of Database and Knowledgebase Systems” • Fundamental: • Manage persistent data. • Access a large amount of data efficiently. • Desirable: • Support for data model, high-level languages, transaction management, access control, and resiliency. • Similar list in other database textbooks.

  8. Statistical & Secure Databases • Statistical Databases • Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, [AW89] • Multilevel Secure Databases • Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91] • Need to protect privacy in transactional databases that support daily operations. • Cannot restrict queries to statistical queries. • Cannot tag all the records “top secret”.

  9. Our Research Directions • Privacy Preserving Data Mining • Hippocratic Databases

  10. Data Mining and Privacy • The primary task in data mining: development of models about aggregated data. • Can we develop accurate models without access to precise information in individual data records? R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), May 2000.

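  11. Privacy Preserving Data Mining [Figure: individual records such as (30 | 25K | …) and (50 | 40K | …) each pass through a Randomizer, producing randomized records such as (65 | 50K | …) and (35 | 60K | …). From these, "Reconstruct Age Distribution" and "Reconstruct Salary Distribution" steps feed a Data Mining Algorithm, which outputs the Model.]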

  12. Reconstruction Problem • Original values x_1, x_2, ..., x_n • from probability distribution X • To hide these values, we use y_1, y_2, ..., y_n • from probability distribution Y • Given • x_1+y_1, x_2+y_2, ..., x_n+y_n • the probability distribution of Y • Estimate the probability distribution of X.

  13. Intuition (Reconstruct single point) • Use Bayes' rule for density functions

  14. Intuition (Reconstruct single point) • Use Bayes' rule for density functions
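
The density formula itself did not survive in the transcript of the two "Intuition" slides. A plausible reconstruction, assuming the notation of the Reconstruction Problem slide (w = x_1 + y_1 is the observed randomized value, f_X and f_Y are the densities of X and Y) and assuming X and Y are independent, is the standard Bayes' rule for densities:

```latex
f_{X_1}\bigl(a \mid X_1 + Y_1 = w\bigr)
  = \frac{f_Y(w - a)\, f_X(a)}
         {\int_{-\infty}^{\infty} f_Y(w - z)\, f_X(z)\, dz}
```

The numerator weighs each candidate original value a by how likely the noise w - a is; the denominator normalizes over all candidates.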

  15. Reconstruction: Intuition • Combine estimates of where a point came from for all the points: • yields estimate of original distribution.

  16. Reconstruction Algorithm • f_X^0 := Uniform distribution • j := 0 • repeat • f_X^(j+1)(a) := Bayes’ Rule • j := j+1 • until (stopping criterion met) • Converges to maximum likelihood estimate. • D. Agrawal & C.C. Aggarwal, PODS 2001.
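
The loop on the Reconstruction Algorithm slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the distribution of X is discretized onto a fixed grid of candidate values, that the noise is zero-mean Gaussian, and that the stopping criterion is a small change between iterations. The names `gaussian_density`, `reconstruct`, `grid`, `max_iters`, and `tol` are all hypothetical.

```python
import math
import random


def gaussian_density(sigma):
    """Return the density f_Y of zero-mean Gaussian noise (assumed noise model)."""
    def f_y(d):
        return math.exp(-d * d / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
    return f_y


def reconstruct(w, grid, f_y, max_iters=100, tol=1e-4):
    """Estimate the probability mass of X on `grid` from randomized values w_i = x_i + y_i."""
    # f_X^0 := uniform distribution over the grid
    p = [1.0 / len(grid)] * len(grid)
    for _ in range(max_iters):
        new_p = [0.0] * len(grid)
        for w_i in w:
            # Bayes' rule: posterior over grid points for this single observation
            weights = [f_y(w_i - a) * p_a for a, p_a in zip(grid, p)]
            total = sum(weights)
            if total > 0.0:
                for k, wt in enumerate(weights):
                    new_p[k] += wt / total
        norm = sum(new_p)
        if norm == 0.0:
            raise ValueError("no observation is compatible with the grid")
        # Average the per-point posteriors to get the next estimate
        new_p = [v / norm for v in new_p]
        change = max(abs(a - b) for a, b in zip(p, new_p))
        p = new_p
        if change < tol:  # assumed stopping criterion
            break
    return p


# Usage: two clusters of original values, hidden by Gaussian noise with sigma = 10
rng = random.Random(0)
x = [30.0 if rng.random() < 0.7 else 70.0 for _ in range(1000)]
w = [x_i + rng.gauss(0.0, 10.0) for x_i in x]
grid = [float(a) for a in range(0, 101, 5)]
estimate = reconstruct(w, grid, gaussian_density(10.0))
est_mean = sum(a * p for a, p in zip(grid, estimate))
```

Only the randomized values `w` and the noise density are passed to `reconstruct`; the original values `x` are never used by the estimator.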

  17. Works Well

  18. Classification • Naïve Bayes • Assumes independence between attributes. • Decision Tree • Correlations are weakened by randomization.

  19. Experimental Methodology • Compare accuracy against • Original: unperturbed data without randomization. • Randomized: perturbed data but without making any corrections for randomization. • Test data not randomized. • Synthetic data benchmark from [AGI+92]. • Training set of 100,000 records, split equally between the two classes.

  20. Decision Tree Experiments

  21. Accuracy vs. Randomization

  22. So far… • Question: Can we develop accurate models without access to precise information in individual data records? • Answer: yes, by randomization. • for numerical attributes, classification • How about Association Rules?

  23. Associations Recap • A transaction t is a set of items (e.g. books) • All transactions form a set T of transactions • Any itemset A has support s in T if s = |{t ∈ T : A ⊆ t}| / |T| • Itemset A is frequent if s ≥ s_min • Task: Find all frequent itemsets
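
The definitions in the recap can be made concrete with a short Python sketch. It uses brute-force enumeration rather than an efficient miner such as Apriori, and the function names, the `max_size` cap, and the toy transactions are illustrative assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, s_min, max_size=3):
    """Brute-force search for all itemsets of size <= max_size with support >= s_min."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= s_min:
                result[frozenset(candidate)] = s
    return result


# Toy data: four transactions over three items
T = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(T, s_min=0.5)
```

With s_min = 0.5, the pair {bread, milk} appears in two of the four transactions and is therefore frequent, while {butter, milk} appears in only one and is not.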

  24. The Problem • How to randomize transactions so that • we can find frequent itemsets • while preserving privacy at transaction level? A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002.

  25. Randomization Overview [Figure: Alice (J.S. Bach, painting, nasa.gov, …), Bob (B. Spears, baseball, cnn.com, …) and Chris (B. Marley, camping, linux.org, …) each send their transaction unchanged to the Recommendation Service.]

  26. Randomization Overview [Figure: as on the previous slide, Alice, Bob and Chris send their transactions unchanged; the Recommendation Service mines Associations from them and produces Recommendations.]

  27. Randomization Overview [Figure: each transaction is now randomized before it is sent. Alice: (J.S. Bach, painting, nasa.gov, …) becomes (Metallica, painting, nasa.gov, …). Bob: (B. Spears, baseball, cnn.com, …) becomes (B. Spears, soccer, bbc.co.uk, …). Chris: (B. Marley, camping, linux.org, …) becomes (B. Marley, camping, ibm.com, …). The Recommendation Service performs Support Recovery, then mines Associations and produces Recommendations.]

  28. Uniform Randomization • Given a transaction, • keep item with, say 20% probability, • replace with a new random item with 80% probability.
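
A minimal Python sketch of the uniform randomization operator described on this slide. The function name, the explicit item `universe`, and the choice to draw replacement items uniformly from the whole universe are assumptions made for illustration.

```python
import random


def uniform_randomize(transaction, universe, keep_prob, rng):
    """Keep each item with probability keep_prob, else replace it by a random item."""
    randomized = set()
    for item in transaction:
        if rng.random() < keep_prob:
            randomized.add(item)                   # keep the true item
        else:
            randomized.add(rng.choice(universe))   # replace with a random item
    return randomized


# Usage: 20% keep probability, as in the slide
rng = random.Random(42)
universe = [f"item{i}" for i in range(10_000)]
t = {"item1", "item2", "item3"}
t_randomized = uniform_randomize(t, universe, keep_prob=0.2, rng=rng)
```

Because the result is a set, a replacement that collides with an item already present shrinks the randomized transaction slightly; with a large universe this is rare.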

  29. Example: {x, y, z} • 10 M transactions of size 10 with 10 K items: • 1% have {x, y, z} • 5% have {x, y}, {x, z}, or {y, z} only • 94% have one or zero items of {x, y, z} • Randomized transactions that contain {x, y, z}: • from the 1%: 1% · 0.2^3 = 0.008%, i.e. 800 transactions (97.8%) • from the 5%: 5% · 0.2^2 · 8/10,000 = 0.00016%, i.e. 16 transactions (1.9%) • from the 94%: at most 94% · 0.2 · (9/10,000)^2, less than 0.00002%, i.e. 2 transactions (0.3%) • Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one
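
The arithmetic of this example can be checked with a few lines of Python. The per-case factors (0.2^3, 0.2^2 · 8/10,000, and the upper bound 0.2 · (9/10,000)^2) are taken from the slide as given; the variable names are illustrative.

```python
n_transactions = 10_000_000

# Probability that a randomized transaction contains {x, y, z},
# split by what the original transaction contained (factors from the slide).
p_from_all_three = 0.01 * 0.2 ** 3                      # original had {x, y, z}
p_from_two = 0.05 * 0.2 ** 2 * 8 / 10_000               # original had exactly two
p_from_one_or_zero = 0.94 * 0.2 * (9 / 10_000) ** 2     # upper bound

expected_counts = {
    "all three": n_transactions * p_from_all_three,
    "exactly two": n_transactions * p_from_two,
    "one or zero": n_transactions * p_from_one_or_zero,
}

# Certainty that the original held {x, y, z}, given the randomized one does
total = p_from_all_three + p_from_two + p_from_one_or_zero
certainty = p_from_all_three / total
```

The resulting `certainty` is close to 0.98, which is the privacy breach the slide points out: seeing {x, y, z} after randomization almost pins down the original transaction.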

  30. Privacy Breach • Suppose: • t is an original transaction; • t’ is the corresponding randomized transaction; • A is a (frequent) itemset. • Definition: Itemset A causes a privacy breach of level ρ if, for some item z ∈ A, Pr[z ∈ t | A ⊆ t’] ≥ ρ.

  31. Our Solution “Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” G.K. Chesterton • Insert many false items into each transaction • Hide true itemsets among false ones Can we still find frequent itemsets while having sufficient privacy?

  32. Cut and Paste Randomization • Given transaction t of size m, construct t’: t = a, b, c, u, v, w, x, y, z t’ =

  33. Cut and Paste Randomization • Given transaction t of size m, construct t’: • Choose a number j between 0 and K_m (cutoff); t = a, b, c, u, v, w, x, y, z t’ = j = 4

  34. Cut and Paste Randomization • Given transaction t of size m, construct t’: • Choose a number j between 0 and K_m (cutoff); • Include j items of t into t’; t = a, b, c, u, v, w, x, y, z t’ = b, v, x, z j = 4

  35. Cut and Paste Randomization • Given transaction t of size m, construct t’: • Choose a number j between 0 and K_m (cutoff); • Include j items of t into t’; • Each other item is included into t’ with probability p_m. The choice of K_m and p_m is based on the desired level of privacy. t = a, b, c, u, v, w, x, y, z t’ = b, v, x, z œ, å, ß, ξ, ψ, €, א, ъ, ђ, … j = 4
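
The three steps of cut-and-paste randomization can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: j is drawn uniformly from 0..K_m and capped at the transaction size, the j kept items are sampled without replacement, and "each other item" is read as every item of the universe that was not kept, including the unselected items of t. The names `cut_and_paste`, `universe`, `k_m`, and `p_m` are hypothetical.

```python
import random


def cut_and_paste(t, universe, k_m, p_m, rng):
    """Randomize transaction t: keep j true items, then add other items with prob. p_m."""
    t = list(t)
    # Step 1: choose the cutoff j between 0 and K_m (capped at |t|)
    j = min(rng.randint(0, k_m), len(t))
    # Step 2: include j items of t in t'
    kept = set(rng.sample(t, j))
    randomized = set(kept)
    # Step 3: every other item enters t' independently with probability p_m
    for item in universe:
        if item not in kept and rng.random() < p_m:
            randomized.add(item)
    return randomized


# Usage with the transaction from the slide and 50 hypothetical extra items
rng = random.Random(0)
t = ["a", "b", "c", "u", "v", "w", "x", "y", "z"]
universe = t + [f"n{i}" for i in range(50)]
t_prime = cut_and_paste(t, universe, k_m=4, p_m=0.1, rng=rng)
```

Larger p_m inserts more false items and so hides the true ones better, at the cost of noisier support estimates; the slide notes that K_m and p_m are chosen from the desired privacy level.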

  36. Partial Supports To recover original support of an itemset, we need randomized supports of its subsets. • Given an itemset A of size k and transaction size m, • A vector of partial supports of A is s = (s_0, s_1, …, s_k), where s_l = |{t ∈ T : |t ∩ A| = l}| / |T| • Here s_k is the same as the support of A. • Randomized partial supports are denoted by s’ = (s’_0, s’_1, …, s’_k).

  37. Transition Matrix • Let k = |A|, m = |t|. • Transition matrix P = P(k, m) connects randomized partial supports with original ones: E[s’] = P · s, where P_{l’l} = Pr[|t’ ∩ A| = l’ | |t ∩ A| = l] • Randomized supports are distributed as a sum of multinomial distributions.

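  38. The Unbiased Estimators • Given randomized partial supports, we can estimate original partial supports: s_est = Q · s’, where Q = P^(-1) • Covariance matrix for this estimator: Cov(s_est) = (1/|T|) · Σ_{l=0..k} s_l · Q · D[l] · Q^T, where D[l]_{ij} = P_{il} · δ_{ij} − P_{il} · P_{jl} • To estimate it, substitute s_l with (s_est)_l. • Special case: estimators for support and its variance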
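
Support recovery by inverting the transition matrix can be illustrated for the smallest case, a single-item itemset (k = 1). The sketch below assumes a simplified randomization in which a true item survives with probability `p_keep` and an absent item is inserted with probability `p_ins`; these two parameters and all function names are hypothetical and stand in for the cut-and-paste transition probabilities.

```python
def transition_matrix(p_keep, p_ins):
    """P[l'][l] = Pr[item count in t' is l' | item count in t is l], for k = 1."""
    return [
        [1 - p_ins, 1 - p_keep],   # item absent from t'
        [p_ins, p_keep],           # item present in t'
    ]


def invert_2x2(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("transition matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def estimate_partial_supports(s_randomized, p):
    """Unbiased estimate: s_est = P^(-1) * s'."""
    q = invert_2x2(p)
    return [sum(q[i][j] * s_randomized[j] for j in range(2)) for i in range(2)]


# Usage: true supports (s_0, s_1) = (0.9, 0.1); compute the expected s' = P * s
P = transition_matrix(p_keep=0.6, p_ins=0.05)
s_true = [0.9, 0.1]
s_randomized = [sum(P[i][j] * s_true[j] for j in range(2)) for i in range(2)]
s_est = estimate_partial_supports(s_randomized, P)
```

Here `s_randomized` is the expected value of the randomized supports, so the estimate recovers `s_true` up to rounding; on real data s' fluctuates around that expectation, which is why the slide also gives the estimator's covariance.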

  39. Privacy Breach Analysis • How many added items are enough to protect privacy? • Have to satisfy Pr[z ∈ t | A ⊆ t’] < ρ (no privacy breaches) • Select parameters so that it holds for all itemsets. • Use formula ( ): • Parameters are to be selected in advance! • Construct a privacy-challenging test: an itemset all of whose subsets have the maximum possible support. • Enough to know the maximal support of an itemset for each size.

  40. Lowest Discoverable Support • LDS is the support s.t., when predicted, it is 4σ away from zero. • Roughly, LDS is proportional to 1/√|T| • Chart parameters: |t| = 5, ρ = 50%

  41. LDS vs. Breach Level (|t| = 5, |T| = 5 M) • Reminder: breach level is the limit on Pr[z ∈ t | A ⊆ t’]

  42. Real Datasets: soccer, mailorder • Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests. • 11 K items (HTMLs), 6.5 M transactions • Mailorder is a purchase dataset from a certain on-line store • Products are replaced with their categories • 96 items (categories), 2.9 M transactions

  43. Results Breach level = 50%. Soccer: smin = 0.2%  0.07% for 3-itemsets Mailorder: smin = 0.2%   0.05% for 3-itemsets

  44. Summary • Can have our cake and mine it too! • Randomization is an interesting approach for building data mining models while preserving user privacy!!! • Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000. S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002 J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002.

  45. The Hippocratic Oath “What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.” – Hippocratic Oath, 8 (circa 400 BC)

  46. Hippocratic Databases Founding tenet: Responsibility for the privacy of data they manage. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), August 2002.

  47. Approach • Derive founding principles from current privacy legislation. • Strawman Design

  48. Ten Principles of Hippocratic Databases • Collection Group • Purpose Specification, Consent, Limited Collection • Use Group • Limited Use, Limited Disclosure, Limited Retention, Accuracy • Security & Openness Group • Safety, Openness, Compliance

  49. Collection Group • Purpose Specification • For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information. • Consent • The purposes associated with personal information shall have consent of the donor (person whose information is being stored). • Limited Collection • The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
