The Promise of Differential Privacy

1. The Promise of Differential Privacy Cynthia Dwork, Microsoft Research

2. On the Primacy of Definitions Learning from History

3. Pre-Modern Cryptography Propose Break

4. Modern Cryptography The cycle: Propose Definition → design algorithms satisfying the definition → Break Definition → Propose STRONGER Definition → …


6. No Algorithm? Propose Definition → no algorithm satisfying it is found. Why?

7. Provably No Algorithm? Propose Definition → prove no algorithm can satisfy it (Bad Definition) → Propose WEAKER/DIFFERENT Definition

8. Getting Started Model, motivation, definition

9. The Model • Database is a collection of rows • One per person in the database • Adversary/User and curator (C) computationally unbounded • All users are part of one giant adversary • “Curator against the world”

10. “Pure” Privacy Problem • Difficult even if • Curator is an angel • Data are in a vault

11. Typical Suggestions • “Large Set” Queries • How many MSFT employees have the Sickle Cell Trait (SCT)? • How many MSFT employees who are not female Distinguished Scientists with very curly hair have the SCT? • Add Random Noise to True Answer • Average of responses to repeated queries converges to the true answer • Can’t simply detect repetition (undecidable) • Detect When Answering is Unsafe • Refusal can be disclosive

12. A Litany

13. William Weld’s Medical Record [S02] Linkage attack: HMO data (ethnicity, visit date, diagnosis, procedure, medication, total charge) joined with voter registration data (name, address, date registered, party affiliation, date last voted) via the shared attributes ZIP, birth date, and sex.

14. AOL Search History Release (2006): “Heads Rolled” • Name: Thelma Arnold • Age: 62, widow • Residence: Lilburn, GA

15. Subsequent challenge abandoned

16. GWAS Membership [Homer et al. ‘08] • SNP: Single-Nucleotide (A,C,G,T) Polymorphism • A Genome-Wide Association Study publishes allele frequencies for many thousands of SNPs • Example SNP in the reference population: major allele (C) 94%, minor allele (T) 6% • NIH-funded studies pulled data from public view

17. Definitional Failures • Failure to Cope with Auxiliary Information • Existing and future databases, newspaper reports, Flickr, literature, etc. • Definitions are Syntactic • Dalenius’s Ad Omnia Guarantee (1977): • Anything that can be learned about a respondent from the statistical database can be learned without access to the database

18. Provably No Algorithm! • Dalenius’s Ad Omnia Guarantee (1977): • Anything that can be learned about a respondent from the statistical database can be learned without access to the database • Unachievable in useful databases [D.,Naor ‘06] • I’m from Mars. My (incorrect) prior is that everyone has 2 left feet. • Database teaches: almost everyone has one left and one right foot.

19. Databases that Teach • Database teaches that smoking causes cancer. • Smoker S’s insurance premiums rise. • This is true even if S not in database! • Learning that smoking causes cancer is the whole point. • Smoker S enrolls in a smoking cessation program. • Differential privacy: limit harms to the teachings, not participation • The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.

20. Differential Privacy [D., McSherry, Nissim, Smith 06] M gives (ε, 0)-differential privacy if for all adjacent x and x’, and all C ⊆ Range(M): Pr[M(x) ∈ C] ≤ e^ε · Pr[M(x’) ∈ C]. Neutralizes all linkage attacks. Composes unconditionally and automatically: the ratio stays bounded with total Σᵢ εᵢ.

21. (ε, δ)-Differential Privacy M gives (ε, δ)-differential privacy if for all adjacent x and x’, and all C ⊆ Range(M): Pr[M(x) ∈ C] ≤ e^ε · Pr[M(x’) ∈ C] + δ. Neutralizes all linkage attacks. Composes unconditionally and automatically: (Σᵢ εᵢ, Σᵢ δᵢ), ratio bounded. This talk: δ negligible.

22. Equivalently, “Privacy Loss”: for an output t ∈ Range(M), the privacy loss is ln(Pr[M(x) = t] / Pr[M(x’) = t]); (ε, 0)-differential privacy bounds its magnitude by ε.

23. Privacy by Process Randomized Response [Warner’65]

24. Did You Have Sex Last Night? • Flip a coin. • Heads: Flip again and respond “Yes” if heads, “No” if tails • Tails: Answer honestly • Analysis: • Pr[say “Y” | truth = Y] / Pr[say “Y” | truth = N] = (3/4)/(1/4) = 3 • Pr[say “N” | truth = N] / Pr[say “N” | truth = Y] = (3/4)/(1/4) = 3 • Privacy is by Process • “Plausible deniability”

25. Did You Have Sex Last Night? • Randomized response is differentially private. • The privacy loss is the log of the ratio of probabilities of seeing any answer, as the truth varies • Pr[say “Y” | truth = Y] / Pr[say “Y” | truth = N] = 3 • Pr[say “N” | truth = N] / Pr[say “N” | truth = Y] = 3 • So ε = ln 3
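The two ratios above are easy to verify mechanically. Here is a minimal sketch of the randomized-response process and the standard debiasing step for recovering the population fraction; the function names and the debiasing helper are mine, not from the talk:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Warner's randomized response: on heads, answer a fresh coin flip;
    on tails, answer honestly."""
    if random.random() < 0.5:          # first coin: heads
        return random.random() < 0.5   # second coin decides the answer
    return truth                       # tails: answer honestly

def estimate_fraction(responses) -> float:
    """Debias the aggregate: Pr[say Yes] = 1/4 + (1/2) * true_fraction,
    so invert that affine map on the observed fraction of Yes answers."""
    p_yes = sum(responses) / len(responses)
    return (p_yes - 0.25) / 0.5
```

Note the plausible-deniability arithmetic: Pr[Yes | truth = Yes] = 1/2 + 1/4 = 3/4 and Pr[Yes | truth = No] = 1/4, giving the ratio 3 from the slide.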

26. Different Bang, Same Buck Finding a “Stable” Algorithm Can be Hard

27. DP: A Definition, Not an Algorithm Many randomized algorithms for the same task provide ε-DP. The discovery that one method works poorly for your problem is only that: others may work better.

28. Randomized Response in Our Setting • Q = What fraction had sex? • The curator C randomizes the response bit in each record; releases the fraction of 1’s • Call this Algorithm 1

29. Sensitivity of a Function Adjacent databases differ in at most one row. Counting queries have sensitivity 1. Sensitivity captures how much one person’s data can affect the output: • Δf = max_{adjacent x, x’} |f(x) − f(x’)|

30. Laplace Distribution Lap(b): p(z) = exp(−|z|/b)/2b, variance = 2b², σ = √2·b. Increasing b flattens the curve.

31. Calibrate Noise to Sensitivity Δf = max_{adj x, x’} |f(x) − f(x’)| Theorem [DMNS06]: On query f, to achieve ε-differential privacy, it suffices to add scaled symmetric noise Lap(Δf/ε). The noise depends on Δf and ε, not on the database. Smaller sensitivity (Δf) means less distortion; better privacy (smaller ε) means more distortion.

32. Example: Counting Queries • How many people in the database had sex? • Sensitivity = 1 • Sufficient to add noise ~ Lap(1/ε) • Fractional version: add noise ~ Lap(1/(nε)) • Call this Algorithm 2
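Algorithm 2 is a one-liner once Laplace sampling is in place. A minimal sketch under the slide's calibration Lap(Δf/ε); the function names are mine:

```python
import random

def sample_laplace(b: float) -> float:
    """Draw from Lap(b): an exponential with scale b, given a random sign."""
    sign = 1 if random.random() < 0.5 else -1
    return sign * random.expovariate(1.0 / b)

def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float) -> float:
    """Release true_answer + Lap(sensitivity / epsilon)."""
    return true_answer + sample_laplace(sensitivity / epsilon)

def private_count(rows, epsilon: float) -> float:
    """Counting query: sensitivity 1, so Lap(1/epsilon) noise suffices."""
    return laplace_mechanism(sum(1 for r in rows if r), 1.0, epsilon)
```

The key point from the slide survives in the code: the noise scale is set only by the sensitivity and ε, never by the database contents.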

33. Two ε-DP Algorithms • Algorithm 1 (Randomized Response): error on the count Θ(√n) • Algorithm 2 (Laplace Mechanism): error O(1/ε), independent of n • Algorithm 2 is better than Algorithm 1

34. Vector-Valued Queries Δf = max_{adj x, x’} ||f(x) − f(x’)||₁ Theorem [DMNS06]: On query f, to achieve ε-differential privacy, it suffices to add scaled symmetric noise [Lap(Δf/ε)]^d, independently per coordinate. The noise depends on Δf and ε, not on the database. Smaller sensitivity (Δf) means less distortion; better privacy (smaller ε) means more distortion.

35. Example: Histograms Δf = max_{adj x, x’} ||f(x) − f(x’)||₁ Theorem: To achieve ε-differential privacy, it suffices to add scaled symmetric noise [Lap(Δf/ε)]^d.
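The histogram is the vector-valued mechanism in miniature: each row lands in exactly one bucket, so under add/remove adjacency (an assumption I make here; with replace-one-row adjacency the L1 sensitivity doubles to 2) the L1 sensitivity is 1 and each bucket count gets independent Lap(1/ε) noise. A sketch, with my own helper names:

```python
import random
from collections import Counter

def lap(b: float) -> float:
    """Draw from Lap(b)."""
    sign = 1 if random.random() < 0.5 else -1
    return sign * random.expovariate(1.0 / b)

def private_histogram(rows, buckets, epsilon: float) -> dict:
    """One row affects exactly one bucket count by 1, so (under
    add/remove adjacency) the L1 sensitivity is 1: add Lap(1/epsilon)
    noise independently to every bucket, including empty ones."""
    counts = Counter(rows)
    return {k: counts.get(k, 0) + lap(1.0 / epsilon) for k in buckets}
```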

36. Why Does it Work? Δf = max_{x, Me} ||f(x + Me) − f(x − Me)||₁ Theorem: To achieve ε-differential privacy, it suffices to add scaled symmetric noise [Lap(Δf/ε)]^d. With b = Δf/ε: Pr[M(f, x − Me) = t] / Pr[M(f, x + Me) = t] = exp(−(||t − f(x − Me)||₁ − ||t − f(x + Me)||₁)/b) ≤ exp(Δf/b) = e^ε.

37. Composition • “Simple”: k-fold composition of (ε, δ)-differentially private mechanisms is (kε, kδ)-differentially private. • Advanced: rather than kε, the k-fold composition of ε-dp mechanisms is (√(2k ln(1/δ′))·ε + kε(e^ε − 1), δ′)-dp • What is Bob’s lifetime exposure risk? • E.g., 10,000 ε-dp (or (ε, δ)-dp) databases, for a lifetime cost of (1, δ′)-dp • What should be the value of ε? • 1/801 • OMG, that is small! Can we do better?
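The lifetime-budget arithmetic is worth doing explicitly. A sketch of both composition bounds; the choice δ′ = e⁻³² below is my assumption, picked to reproduce the slide's ε = 1/801 for 10,000 mechanisms (√(2·10⁴·32) = 800):

```python
import math

def basic_composition(eps: float, delta: float, k: int):
    """Simple bound: k-fold composition of (eps, delta)-DP mechanisms
    is (k*eps, k*delta)-DP."""
    return k * eps, k * delta

def advanced_composition(eps: float, delta: float, k: int,
                         delta_prime: float):
    """Advanced bound [Dwork-Rothblum-Vadhan '10]:
    (sqrt(2k ln(1/delta')) * eps + k*eps*(e^eps - 1),
     k*delta + delta')-DP."""
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1.0))
    return eps_total, k * delta + delta_prime
```

With ε = 1/801 and k = 10,000, the simple bound gives a useless kε ≈ 12.5, while the advanced bound lands near a total ε of 1, which is the slide's point.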

38. Hugely Many Queries Single Database

39. Omitting polylog(various things, some of them big) terms: error [Hardt-Rothblum]; runtime exp(|U|).

40. Discrete-Valued Functions • Strings, experts, small databases, … • Each possible output y has a utility for x, denoted u(x, y) • Exponential Mechanism [McSherry-Talwar’07]: Output y with probability ∝ exp(ε·u(x, y)/(2Δu))
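For a finite output set, the McSherry-Talwar weighting is a few lines. A sketch, assuming the standard exp(ε·u/(2Δu)) weighting; the max-shift before exponentiating is my numerical-stability choice, and it does not change the sampling distribution:

```python
import math
import random

def exponential_mechanism(x, outputs, utility, sensitivity, epsilon):
    """Sample y from `outputs` with probability proportional to
    exp(epsilon * utility(x, y) / (2 * sensitivity))."""
    scores = [epsilon * utility(x, y) / (2.0 * sensitivity)
              for y in outputs]
    m = max(scores)  # shift by the max so exp() cannot overflow
    weights = [math.exp(s - m) for s in scores]
    return random.choices(outputs, weights=weights, k=1)[0]
```

High-utility outputs are exponentially favored, yet every output retains nonzero probability, which is what makes the mechanism private.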

41. Exponential Mechanism Applied Many (fractional) counting queries [Blum, Ligett, Roth’08]: given an n-row database x and a set of properties, produce a synthetic database y giving good approximations to “What fraction of the rows of x satisfy property P?” • The output set is the set of all “small” databases (size given by sampling-error bounds)

42. Non-Trivial Accuracy with ε-DP Stateless Mechanism vs. Stateful Mechanism. Barrier at [D., Naor, Vadhan].

43. Non-Trivial Accuracy with ε-DP Independent Mechanisms vs. Stateful Mechanism. To handle hugely many databases, one must introduce coordination.

44. Two Additional Techniques + An application that combines them

45. Functions “Expected” to Behave Well • Propose-Test-Release [D.-Lei’09] • Privacy-preserving test for “goodness” of the data set • E.g., low local sensitivity [Nissim-Raskhodnikova-Smith’07] • Robust statistics theory: lack of density at the median (a big gap between nearby order statistics) is the only thing that can go wrong • PTR: DP test for a low-sensitivity median (equivalently, for high density); if good, then release the median with low noise; else output ⊥ (or use a sophisticated dp median algorithm)
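The Propose-Test-Release pattern can be written as a small generic skeleton. This is a sketch in the spirit of [D.-Lei’09], not the paper's algorithm: it assumes the caller supplies a `distance_to_instability` function reporting how many rows must change before local sensitivity exceeds the proposed bound (that distance has global sensitivity 1, which is what makes testing it with Lap(1/ε) noise private); all names here are mine:

```python
import math
import random

def lap(b: float) -> float:
    """Draw from Lap(b)."""
    sign = 1 if random.random() < 0.5 else -1
    return sign * random.expovariate(1.0 / b)

def propose_test_release(x, f, distance_to_instability,
                         proposed_sensitivity, epsilon, delta):
    """PTR skeleton: noisily test the distance to instability; if the
    noisy distance clears the ln(1/delta)/epsilon threshold, release
    f(x) with noise calibrated to the *proposed* (low) sensitivity,
    else refuse."""
    d_hat = distance_to_instability(x) + lap(1.0 / epsilon)
    if d_hat <= math.log(1.0 / delta) / epsilon:
        return None  # bottom: the data may be unstable, so refuse
    return f(x) + lap(proposed_sensitivity / epsilon)
```

For the median, `distance_to_instability` would measure density near the median, matching the slide's point that a big gap is the only failure mode.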

46. High/Unknown Sensitivity Functions • Subsample-and-Aggregate [Nissim, Raskhodnikova, Smith’07]
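A sketch of the Subsample-and-Aggregate idea: partition the data into k disjoint blocks, run f on each block, and aggregate the k block values with a private aggregator. The clamped-mean aggregator and the clamping bounds are my illustrative choices, not necessarily [NRS’07]'s; the privacy argument is that one person's row sits in exactly one block, so it moves at most one clamped value, giving the mean sensitivity (upper − lower)/k:

```python
import random

def lap(b: float) -> float:
    """Draw from Lap(b)."""
    sign = 1 if random.random() < 0.5 else -1
    return sign * random.expovariate(1.0 / b)

def subsample_and_aggregate(data, f, k, epsilon, lower, upper):
    """Run f on k disjoint blocks, clamp each block value to
    [lower, upper], and release the mean plus Laplace noise at
    scale (upper - lower) / (k * epsilon)."""
    data = list(data)
    random.shuffle(data)                 # random partition into k blocks
    blocks = [data[i::k] for i in range(k)]
    vals = [min(max(f(b), lower), upper) for b in blocks]
    mean = sum(vals) / k
    return mean + lap((upper - lower) / (k * epsilon))
```

Notice that f itself can have high or unknown sensitivity; only the aggregator's sensitivity matters, which is the point of the technique.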

47. Application: Feature Selection If the block outputs are “far” from every collection having no large majority value (i.e., a large majority value exists robustly), then output the most common value. Else quit.

48. Application: Model Selection If the block outputs are “far” from every collection having no large majority value (i.e., a large majority value exists robustly), then output the most common value. Else quit.

49. A Few of Many Future Directions • Efficiency for handling hugely many queries • Time complexity (counting queries): connection to Tracing Traitors • Sample complexity / database size • Differentially private analysis of social networks? • Is DP the right definition? At what granularity? • What do we want to compute? • Is there an alternative to DP? • Axiomatic approach? • Focus on a specific application (data mining!) • Collaborative effort with domain experts • What can be proved about S&A for feature/model selection?

50. Thank You!