
Defining and Achieving Differential Privacy


Presentation Transcript


  1. Defining and Achieving Differential Privacy. Cynthia Dwork, Microsoft

  2. Meaningful Privacy Guarantees • Statistical databases • Medical • Government Agency • Social Science • Searching / click stream • Learn non-trivial trends while protecting privacy of individuals and fine-grained structure

  3. Linkage Attacks • Using “innocuous” data in one dataset to identify a record in a different dataset containing both innocuous and sensitive data • At the heart of the voluminous research on hiding small cell counts in tabular data

  4. The Netflix Prize • Netflix Recommends Movies to its Subscribers • Offers $1,000,000 for 10% improvement in its recommendation system • Not concerned here with how this is measured • Publishes training data • Nearly 500,000 records, 18,000 movie titles • “The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided.” • Some ratings not sensitive, some may be sensitive • OK for Netflix to know, not OK for public to know

  5. A Publicly Available Set of Movie Rankings • Internet Movie Database (IMDb) • Individuals may register for an account and rate movies • Need not be anonymous • Visible material includes ratings, dates, comments • By definition, these ratings are not sensitive

  6. The Fiction of Non-PII Narayanan & Shmatikov 2006 • Movie ratings and dates are PII • “With 8 movie ratings (of which we allow 2 to be completely wrong) and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset.” • Linkage attack carried out using the IMDb • Link ratings in IMDb to the (non-sensitive) ratings in Netflix, revealing the sensitive ratings in Netflix • NS then draw conclusions about the user • May be wrong, may be right; the user is harmed either way
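
  The matching step can be illustrated with a toy Python sketch (hypothetical data and a made-up scoring rule, not the authors' code or the real datasets): score each pseudonymous record against the auxiliary public ratings, tolerating a 3-day date error, and treat a record as identified if it is the unique strong match.

      # Toy linkage-attack sketch in the spirit of Narayanan & Shmatikov.
      # All data, names, and thresholds here are illustrative assumptions.
      from datetime import date

      # Auxiliary (public, IMDb-like) information: movie -> (rating, date)
      aux = {
          "Movie A": (4, date(2005, 3, 10)),
          "Movie B": (1, date(2005, 6, 2)),
          "Movie C": (5, date(2005, 6, 4)),
      }

      # "Anonymized" records: pseudonym -> {movie: (rating, date)}
      records = {
          "id_017": {"Movie A": (4, date(2005, 3, 11)),
                     "Movie B": (1, date(2005, 6, 1)),
                     "Movie C": (5, date(2005, 6, 4)),
                     "Movie D": (2, date(2005, 7, 9))},   # rating not public anywhere
          "id_042": {"Movie A": (2, date(2004, 1, 5))},
      }

      def matches(aux_entry, rec_entry, date_slack_days=3):
          # A movie matches if the ratings agree and the dates differ by <= 3 days.
          (r1, d1), (r2, d2) = aux_entry, rec_entry
          return r1 == r2 and abs((d1 - d2).days) <= date_slack_days

      def score(rec):
          # Fraction of auxiliary movies that match this record.
          hits = sum(1 for movie, entry in aux.items()
                     if movie in rec and matches(entry, rec[movie]))
          return hits / len(aux)

      scores = {pid: score(rec) for pid, rec in records.items()}
      best = max(scores, key=scores.get)
      # A unique strong match re-identifies the pseudonym and exposes its
      # non-public ratings (here "Movie D").
      if scores[best] >= 2 / 3 and sum(s >= 2 / 3 for s in scores.values()) == 1:
          exposed = {m: r for m, (r, _) in records[best].items() if m not in aux}
          print("re-identified:", best, "exposed ratings:", exposed)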

  7. What Went Wrong? • What is “Personally Identifiable Information”? • Typically a syntactic notion, not a semantic one • E.g., a genome sequence is not considered PII?? • Suppressing “PII” doesn’t rule out linkage attacks • Famously observed by Sweeney, circa 1998 • The AOL debacle • Need a more semantic approach to privacy

  8. Semantic Security Against an Eavesdropper Goldwasser & Micali 1982 • Vocabulary • Plaintext: the message to be transmitted • Ciphertext: the encryption of the plaintext • Auxiliary information: anything else known to the attacker • The ciphertext leaks no information about the plaintext • Formalization: compare the ability of someone seeing aux and the ciphertext to guess (anything about) the plaintext with the ability of someone seeing only aux to do the same thing; the difference should be “tiny”

  9. Semantic Security for Statistical Databases? • Dalenius, 1977 • Anything that can be learned about a respondent from the statistical database can be learned without access to the database • An ad omnia guarantee • Happily, Formalizes to Semantic Security • Recall: Anything about the plaintext that can be learned from the ciphertext can be learned without the ciphertext • Popular Intuition: prior and posterior views about an individual shouldn’t change “too much”. • Clearly Silly • My (incorrect) prior is that everyone has 2 left feet. • Very popular in literature nevertheless • Definitional awkwardness even when used correctly

  10. Semantic Security for Statistical Databases? • Unhappily, Unachievable • Can’t achieve cryptographically small levels of “tiny” • Intuition: the (adversarial) user is supposed to learn unpredictable things about the DB; this translates to learning more than a cryptographically tiny amount about a respondent • Relax “tiny”?

  11. Relaxed Semantic Security for Statistical Databases? • Relaxing Tininess Doesn’t Help • Dwork & Naor 2006 • Database teaches average heights of population subgroups • “Terry Gross is two inches shorter than the avg Lithuanian ♀” • Access to the DB teaches Terry’s height • Terry’s height is learnable from the DB, not learnable otherwise • Formal proof extends to essentially any notion of privacy compromise; uses randomness extracted from the SDB as a one-time pad • Bad news for k-anonymity, ℓ-diversity, and their variants • Attack Works Even if Terry Is Not in the DB! • Suggests a new notion of privacy: risk incurred by joining the DB • “Differential Privacy” • Privacy, when the existence of the DB is stipulated • Before/after interacting vs. risk when in / not in the DB

  12. Differential Privacy • K gives ε-differential privacy if for all values of DB and Me and all transcripts t: Pr[K(DB - Me) = t] / Pr[K(DB + Me) = t] ≤ e^ε ≈ 1 + ε

  13. Differential Privacy is an Ad Omnia Guarantee • [Figure: Pr[response] distribution, with a set of bad responses marked X] • No perceptible risk is incurred by joining the DB • Anything the adversary can do to me, it could do without Me (my data)

  14. An Interactive Sanitizer: K (Dwork, McSherry, Nissim, Smith 2006) • [Diagram: analyst poses query f, K returns the noisy answer] • f: DB → R • K(f, DB) = f(DB) + noise • E.g., Count(P, DB) = # rows in DB with property P
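
  A minimal Python sketch of this interface (illustrative names; how the noise is calibrated is the subject of the next slides):

      # K(f, DB) = f(DB) + noise, with Count as the example query.
      # The noise source is a placeholder here; calibration comes next.
      import random

      def count(has_property, db):
          # Count(P, DB): number of rows in DB with property P.
          return sum(1 for row in db if has_property(row))

      def K(f, db, noise):
          # Answer the real-valued query f on db, perturbed by one noise draw.
          return f(db) + noise()

      # Hypothetical database of rows (age, smoker)
      db = [(34, True), (51, False), (29, True), (62, True)]
      is_smoker = lambda row: row[1]
      print(K(lambda d: count(is_smoker, d), db, noise=lambda: random.uniform(-1, 1)))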

  15. Sensitivity of a Function f • How much can f(DB + Me) exceed f(DB - Me)? • Recall: K(f, DB) = f(DB) + noise • The question asks: what difference must the noise obscure? • Δf = max over DB, Me of |f(DB + Me) - f(DB - Me)| • E.g., ΔCount = 1
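
  The definition can be sanity-checked by brute force over a tiny universe of rows (an illustrative Python sketch; a real sensitivity bound is argued over all possible databases and rows, not a fixed sample):

      # Brute-force check of Δf = max over DB, Me of |f(DB + Me) - f(DB - Me)|,
      # restricted to small databases drawn from a toy universe of rows.
      from itertools import combinations

      universe = [(34, True), (51, False), (29, True), (62, True)]

      def count_smokers(db):
          return sum(1 for (_, smoker) in db if smoker)

      def sensitivity(f, universe, max_size=3):
          delta = 0
          for size in range(max_size + 1):
              for db in combinations(universe, size):      # plays the role of DB - Me
                  for me in universe:
                      if me not in db:
                          delta = max(delta, abs(f(list(db) + [me]) - f(list(db))))
          return delta

      print(sensitivity(count_smokers, universe))   # prints 1, i.e. ΔCount = 1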

  16. Calibrate Noise to Sensitivity • Δf = max over DB, Me of |f(DB + Me) - f(DB - Me)| • Theorem: To achieve ε-differential privacy, add scaled symmetric (Laplace) noise with density Pr[x] proportional to exp(-|x|/R) and scale R = Δf/ε • [Figure: Laplace density centered at 0, with ticks at -4R, -3R, ..., 4R, 5R] • Increasing R flattens the curve: more privacy • The noise depends on Δf and ε, not on the database
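
  A Python sketch of the calibration (assumes numpy; function and variable names are illustrative):

      # Laplace mechanism sketch: release f(DB) plus Laplace noise of scale R = Δf/ε.
      import numpy as np

      rng = np.random.default_rng()

      def laplace_mechanism(f, db, sensitivity, epsilon):
          R = sensitivity / epsilon     # noise scale; larger R = flatter curve = more privacy
          return f(db) + rng.laplace(scale=R)

      db = [(34, True), (51, False), (29, True), (62, True)]
      count_smokers = lambda d: sum(1 for (_, s) in d if s)

      # ΔCount = 1, so ε = 0.5 gives scale R = 2; the true answer is 3.
      print(laplace_mechanism(count_smokers, db, sensitivity=1, epsilon=0.5))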

  17. Calibrate Noise to Sensitivity • Δf = max over DB, Me of |f(DB + Me) - f(DB - Me)| • Theorem: To achieve ε-differential privacy, add scaled symmetric (Laplace) noise with density proportional to exp(-|x|/R) and scale R = Δf/ε • Writing f- = f(DB - Me) and f+ = f(DB + Me): Pr[K(f, DB - Me) = t] / Pr[K(f, DB + Me) = t] = exp(-(|t - f-| - |t - f+|)/R) ≤ exp(|f+ - f-|/R) ≤ exp(Δf/R) = e^ε

  18. Multiple Queries • For a query sequence f1, …, fd, ε-differential privacy is achieved by calibrating each response to the sensitivity of the whole sequence: noise generation parameter Ri = Δ(f1, …, fd)/ε for each response • Can sometimes do better • Noise must increase with the sensitivity of the query sequence • Naively, more queries means noisier answers • Dinur and Nissim 2003, et sequelae, speak to the non-interactive setting: • Any non-interactive solution permitting “too accurate” answers to “too many” questions is vulnerable to attack • The privacy mechanism is at an even greater disadvantage than in the interactive case; this can be exploited
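
  A Python sketch of the naive sequence treatment (taking the sequence sensitivity as the sum of per-query sensitivities, an assumption suited to counting-style queries; all names are illustrative):

      # Answer a query sequence f1, ..., fd with every response calibrated to the
      # sensitivity of the whole sequence, so more queries means noisier answers.
      import numpy as np

      rng = np.random.default_rng()

      def answer_sequence(queries, sensitivities, db, epsilon):
          seq_sensitivity = sum(sensitivities)      # stands in for Δ(f1, ..., fd)
          R = seq_sensitivity / epsilon             # one noise scale for every response
          return [f(db) + rng.laplace(scale=R) for f in queries]

      db = [(34, True), (51, False), (29, True), (62, True)]
      queries = [lambda d: sum(1 for (_, s) in d if s),         # count smokers
                 lambda d: sum(1 for (a, _) in d if a >= 50)]   # count rows with age >= 50
      print(answer_sequence(queries, sensitivities=[1, 1], db=db, epsilon=1.0))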

  19. Future Work • Investigate Techniques from Robust Statistics • Area of statistics devoted to coping with • Small numbers of wild data-entry errors • Rounding errors • Limited dependence among samples • Problem: the statistical setting makes strong assumptions about the existence and nature of an underlying distribution • Differential Privacy for Social Networks, Graphs • What are the utility questions of interest? • Definitional and Algorithmic Work for Other Settings • The “differential” approach is more broadly useful • Several results discussed in the next few hours • Porous Boundary Between “Inside” and “Outside”? • Outsourcing, bug reporting, combating DDoS attacks and terror

  20. Privacy is a natural resource. It’s non-renewable, and it’s not yours. Conserve it.
