**The Promise of Differential Privacy** Cynthia Dwork, Microsoft Research

**On the Primacy of Definitions** Learning from History

**Pre-Modern Cryptography** Propose Break

**Modern Cryptography** Propose Definition → design algorithms satisfying the definition → break the definition → propose a STRONGER definition → … (repeat)

**No Algorithm?** Propose Definition → … no algorithm found. Why?

**Provably No Algorithm?** Propose Definition → provably no algorithm → bad definition → propose a WEAKER/DIFFERENT definition

**Getting Started** Model, motivation, definition

**The Model**
• Database is a collection of rows
• One per person in the database
• Adversary/user and curator computationally unbounded
• All users are part of one giant adversary
• “Curator against the world”

**“Pure” Privacy Problem**
• Difficult even if
• Curator is an angel
• Data are in a vault

**Typical Suggestions**
• “Large set” queries
• How many MSFT employees have the Sickle Cell Trait (SCT)?
• How many MSFT employees who are not female Distinguished Scientists with very curly hair have the SCT?
• Add random noise to the true answer
• Average of responses to repeated queries converges to the true answer
• Can’t simply detect repetition (undecidable)
• Detect when answering is unsafe
• Refusal can be disclosive
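The averaging attack behind the second suggestion is easy to demonstrate: if fresh random noise is added to each response, repeating the query washes the noise out. A minimal simulation (the true answer 17 and noise scale 5 are arbitrary illustrations):

```python
import random
import statistics

random.seed(0)
true_answer = 17

def noisy_query():
    # Fresh, independent noise on every invocation of the same query.
    return true_answer + random.gauss(0, 5)

# Averaging many repeated responses converges to the true answer:
avg = statistics.mean(noisy_query() for _ in range(100_000))
```

After 100,000 repetitions the empirical average sits within a small fraction of the true answer, defeating the noise entirely.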

**A Litany**

**William Weld’s Medical Record [S02]** Linkage attack: HMO data (ethnicity, visit date, diagnosis, procedure, medication, total charge) joined with voter-registration data (name, address, date registered, party affiliation, date last voted) via the shared attributes {ZIP, birth date, sex}.

**AOL Search History Release (2006)** Heads rolled. A user was re-identified from her queries: Thelma Arnold, age 62, widow, of Lilburn, GA.

**Subsequent challenge abandoned**

**GWAS Membership [Homer et al. ’08]** SNP: single-nucleotide polymorphism (A, C, G, T). A genome-wide association study publishes allele frequencies for many thousands of SNPs (e.g., reference population: major allele C 94%, minor allele T 6%). The published frequencies sufficed to test an individual’s membership in the study; NIH-funded studies pulled data from public view.

**Definitional Failures**
• Failure to cope with auxiliary information
• Existing and future databases, newspaper reports, Flickr, literature, etc.
• Definitions are syntactic
• Dalenius’s ad omnia guarantee (1977):
• Anything that can be learned about a respondent from the statistical database can be learned without access to the database

**Provably No Algorithm!**
• Dalenius’s ad omnia guarantee (1977):
• Anything that can be learned about a respondent from the statistical database can be learned without access to the database
• Unachievable in useful databases [D., Naor ’06]
• I’m from Mars; my (incorrect) prior is that everyone has two left feet
• Database teaches: almost everyone has one left and one right foot

**Databases that Teach**
• Database teaches that smoking causes cancer
• Smoker S’s insurance premiums rise
• This is true even if S is not in the database!
• Learning that smoking causes cancer is the whole point
• Smoker S enrolls in a smoking-cessation program
• Differential privacy: limit harms to the teachings, not to participation
• The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset

**Differential Privacy [D., McSherry, Nissim, Smith ’06]** M gives (ε, 0)-differential privacy if for all adjacent x and x’, and all C ⊆ Range(M): Pr[M(x) ∈ C] ≤ e^ε · Pr[M(x’) ∈ C]. Neutralizes all linkage attacks. Composes unconditionally and automatically: Σᵢ εᵢ.

**(ε, δ)-Differential Privacy** M gives (ε, δ)-differential privacy if for all adjacent x and x’, and all C ⊆ Range(M): Pr[M(x) ∈ C] ≤ e^ε · Pr[M(x’) ∈ C] + δ. Neutralizes all linkage attacks. Composes unconditionally and automatically: (Σᵢ εᵢ, Σᵢ δᵢ). This talk: δ negligible.

**Equivalently, “Privacy Loss”** For an output t ∈ Range(M), the privacy loss ln(Pr[M(x) = t] / Pr[M(x’) = t]) is bounded by ε.

**Privacy by Process** Randomized Response [Warner’65]

**Did You Have Sex Last Night?**
• Flip a coin
• Heads: flip again and respond “Yes” if heads, “No” if tails
• Tails: answer honestly
• Analysis:
• Pr[say “Y” | truth = Y] / Pr[say “Y” | truth = N] = 3
• Pr[say “N” | truth = N] / Pr[say “N” | truth = Y] = 3
• Privacy is by process
• “Plausible deniability”

**Did You Have Sex Last Night?**
• Randomized response is (ln 3)-differentially private
• The log of the ratio of probabilities of seeing any answer, as the truth varies, is at most ln 3:
• Pr[say “Y” | truth = Y] / Pr[say “Y” | truth = N] = 3
• Pr[say “N” | truth = N] / Pr[say “N” | truth = Y] = 3
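The coin-flip protocol and its privacy analysis can be written out directly; a minimal sketch:

```python
import math
import random

def randomized_response(truth: bool) -> bool:
    """Warner '65: flip a coin; heads -> answer a fresh coin flip,
    tails -> answer honestly."""
    if random.random() < 0.5:          # first flip: heads, so randomize
        return random.random() < 0.5   # second flip: heads -> "Yes"
    return truth                       # first flip: tails, answer honestly

# Exact response probabilities:
#   P[say Yes | truth = Yes] = 1/2 * 1/2 + 1/2 * 1 = 3/4
#   P[say Yes | truth = No ] = 1/2 * 1/2 + 1/2 * 0 = 1/4
p_yes_given_yes, p_yes_given_no = 0.75, 0.25
epsilon = math.log(p_yes_given_yes / p_yes_given_no)  # ln 3
```

The worst-case log-ratio of response probabilities is ln 3, so this is exactly the ε of the slide.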

**Different Bang, Same Buck** Finding a “Stable” Algorithm Can be Hard

**DP: A Definition, Not an Algorithm** Many randomized algorithms for the same task provide ε-DP. The discovery that one method works poorly for your problem is only that. Others may work better.

**Randomized Response in Our Setting**
• Q = what fraction had sex?
• The curator C randomizes the response in each record and releases the fraction of 1’s
• Call this Algorithm 1

**Sensitivity of a Function** Adjacent databases differ in at most one row. Counting queries have sensitivity 1. Sensitivity captures how much one person’s data can affect the output: Δf = max_{adjacent x, x’} |f(x) − f(x’)|.

**Laplace Distribution Lap(b)** p(z) = exp(−|z|/b)/(2b); variance = 2b²; σ = √2·b. Increasing b flattens the curve.

**Calibrate Noise to Sensitivity** Δf = max_{adj x, x’} |f(x) − f(x’)|. Theorem [DMNS06]: on query f, to achieve ε-differential privacy, it suffices to add symmetric noise scaled to the sensitivity: Lap(Δf/ε). The noise depends on Δf and ε, not on the database. Smaller sensitivity (Δf) means less distortion; better privacy (smaller ε) means more distortion.
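The theorem translates into a few lines of code. A minimal sketch (Lap(b) is sampled as a difference of two exponentials with mean b; the example answer 1000 is an arbitrary illustration):

```python
import random

def laplace_noise(b: float) -> float:
    """Sample Lap(b) as the difference of two i.i.d. Exponential(mean b) draws."""
    return random.expovariate(1.0 / b) - random.expovariate(1.0 / b)

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """[DMNS06]: adding Lap(sensitivity / epsilon) noise gives epsilon-DP."""
    return true_answer + laplace_noise(sensitivity / epsilon)

# Counting query: sensitivity 1, since one person changes the count by at most 1.
random.seed(0)
noisy_count = laplace_mechanism(true_answer=1000, sensitivity=1.0, epsilon=0.1)
```

Note that the noise scale 1/ε = 10 is fixed in advance, independent of the database contents.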

**Example: Counting Queries**
• How many people in the database had sex?
• Sensitivity Δf = 1
• Sufficient to add noise Lap(1/ε)
• Fractional version: add noise Lap(1/(εn))
• Call this Algorithm 2

**Two ε-DP Algorithms**
• Algorithm 1, randomized response: error on the fraction Θ(1/√n)
• Algorithm 2, Laplace mechanism: error on the fraction Θ(1/(εn))
• Algorithm 2 is better than Algorithm 1

**Vector-Valued Queries** Δf = max_{adj x, x’} ||f(x) − f(x’)||₁. Theorem [DMNS06]: on query f: x → ℝᵈ, to achieve ε-differential privacy, it suffices to add symmetric noise [Lap(Δf/ε)]ᵈ, i.e., independent Lap(Δf/ε) noise in each coordinate. As before, the noise depends on Δf and ε, not on the database; smaller sensitivity means less distortion, and better privacy (smaller ε) means more distortion.

**Example: Histograms** Δf = max_{adj x, x’} ||f(x) − f(x’)||₁ is small regardless of the number of cells d, since one person affects only the cell(s) containing her row. To achieve ε-differential privacy, it suffices to add noise [Lap(Δf/ε)]ᵈ.
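A per-cell noisy histogram is a one-liner per cell. A sketch, assuming add/remove adjacency (so one person changes the histogram’s L1 norm by 1) and arbitrary illustrative data:

```python
import random
from collections import Counter

def laplace_noise(b: float) -> float:
    """Lap(b) as a difference of two exponentials of mean b."""
    return random.expovariate(1.0 / b) - random.expovariate(1.0 / b)

def dp_histogram(rows, cells, epsilon):
    """One person changes the L1 norm of the histogram by 1 (add/remove
    adjacency), independent of the number of cells, so Lap(1/epsilon)
    noise in each cell suffices for epsilon-DP."""
    counts = Counter(rows)
    return {c: counts[c] + laplace_noise(1.0 / epsilon) for c in cells}

random.seed(0)
hist = dp_histogram(["a"] * 600 + ["b"] * 400, ["a", "b", "c"], epsilon=0.5)
```

Empty cells get noise too; skipping them would leak which cells are empty.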

**Why Does It Work?** Δf = max_{adj x, x’} ||f(x) − f(x’)||₁. For any output t, Pr[M(f, x) = t] / Pr[M(f, x’) = t] = exp((||t − f(x’)||₁ − ||t − f(x)||₁)/b) ≤ exp(||f(x) − f(x’)||₁/b) ≤ exp(Δf/b), which equals e^ε when b = Δf/ε.

**Composition**
• “Simple”: k-fold composition of (ε, δ)-differentially private mechanisms is (kε, kδ)-differentially private
• Advanced [D., Rothblum, Vadhan ’10]: rather than kε, the k-fold composition of ε-DP mechanisms is (ε√(2k ln(1/δ’)) + kε(e^ε − 1), δ’)-DP
• What is Bob’s lifetime exposure risk?
• E.g., 10,000 ε-DP or (ε, δ)-DP databases, for a lifetime cost of roughly (1, δ’)-DP
• What should be the value of ε? 1/801
• OMG, that is small! Can we do better?
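The arithmetic behind “1/801” can be checked directly. A sketch, assuming the failure probability is set to δ’ = e⁻³² (chosen here for illustration, so that √(2k ln(1/δ’)) = 800 when k = 10,000):

```python
import math

def simple_composition(eps: float, k: int) -> float:
    """k-fold composition of eps-DP mechanisms is (k * eps)-DP."""
    return k * eps

def advanced_composition(eps: float, k: int, delta: float) -> float:
    """Advanced composition: k-fold composition of eps-DP mechanisms is
    (eps', delta)-DP with eps' = eps*sqrt(2k ln(1/delta)) + k*eps*(e^eps - 1)."""
    return (eps * math.sqrt(2 * k * math.log(1 / delta))
            + k * eps * (math.exp(eps) - 1))

# 10,000 databases, each accessed with eps = 1/801:
k, eps = 10_000, 1 / 801
lifetime_simple = simple_composition(eps, k)                      # ~12.5
lifetime_advanced = advanced_composition(eps, k, math.exp(-32))   # ~1
```

With ε = 1/801, simple composition gives a useless lifetime bound of about 12.5, while advanced composition keeps the lifetime cost at roughly 1.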

**Hugely Many Queries** Single Database

**Omitting polylog terms (of various things, some of them big)** Error [Hardt-Rothblum]; runtime exp(|U|).

**Discrete-Valued Functions**
• Strings, experts, small databases, …
• Each candidate output y has a utility for the database x, denoted u(y, x)
• Exponential Mechanism [McSherry-Talwar ’07]: output y with probability proportional to exp(ε·u(y, x)/(2Δu))
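The exponential mechanism is short to implement when the candidate set is small. A minimal sketch (the candidate set, counting utility, and ε = 4 are illustrative choices):

```python
import math
import random

def exponential_mechanism(candidates, utility, sensitivity, epsilon):
    """McSherry-Talwar '07: sample y with probability proportional to
    exp(epsilon * u(y) / (2 * Delta_u))."""
    weights = [math.exp(epsilon * utility(y) / (2 * sensitivity))
               for y in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: privately select the most common item. Utility = count,
# which has sensitivity 1 (one row changes any count by at most 1).
random.seed(0)
data = list("aaaaabbc")
pick = exponential_mechanism(["a", "b", "c"], data.count,
                             sensitivity=1, epsilon=4.0)
```

With this ε the modal item “a” is selected with overwhelming probability, yet every candidate retains nonzero probability, which is what preserves privacy.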

**Exponential Mechanism Applied** Many (fractional) counting queries [Blum, Ligett, Roth ’08]: given an n-row database x and a set C of properties, produce a synthetic database y that gives a good approximation to “What fraction of the rows of x satisfy property P?” for every P ∈ C.
• The candidate set is the set of all “small” databases (size given by sampling-error bounds)

**Non-Trivial Accuracy with ε-DP** Stateless mechanism vs. stateful mechanism. Barrier for stateless mechanisms [D., Naor, Vadhan].

**Non-Trivial Accuracy with ε-DP** Independent mechanisms vs. a stateful mechanism. To handle hugely many databases, one must introduce coordination.

**Two Additional Techniques** + An application that combines them

**Functions “Expected” to Behave Well**
• Propose-Test-Release [D.-Lei ’09]
• Privacy-preserving test for “goodness” of the data set
• E.g., low local sensitivity [Nissim-Raskhodnikova-Smith ’07]
• Robust-statistics theory: lack of density at the median is the only thing that can go wrong
• PTR: DP test for a low-sensitivity median (equivalently, for high density); if good, release the median with low noise; else output ⊥ (or use a more sophisticated DP median algorithm)

**High/Unknown Sensitivity Functions** • Subsample-and-Aggregate [Nissim, Raskhodnikova, Smith’07]
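The pattern is easy to sketch: run the (arbitrary, possibly high- or unknown-sensitivity) function on disjoint blocks, then combine the block outputs with a DP aggregator. The aggregator below is a clamped mean plus Laplace noise, and the clamp range, block count, and data are illustrative assumptions, not part of the original construction:

```python
import random

def laplace_noise(b: float) -> float:
    """Lap(b) as a difference of two exponentials of mean b."""
    return random.expovariate(1.0 / b) - random.expovariate(1.0 / b)

def subsample_and_aggregate(db, f, num_blocks, epsilon, lo, hi):
    """Sketch of Subsample-and-Aggregate: one person lands in exactly one
    block, so the clamped mean of block outputs has sensitivity
    (hi - lo) / num_blocks, regardless of f's own sensitivity."""
    rows = list(db)
    random.shuffle(rows)
    blocks = [rows[i::num_blocks] for i in range(num_blocks)]
    vals = [min(max(f(blk), lo), hi) for blk in blocks]   # clamp to [lo, hi]
    sensitivity = (hi - lo) / num_blocks
    return sum(vals) / num_blocks + laplace_noise(sensitivity / epsilon)

# Illustration: a DP mean of 0..999 via per-block means.
random.seed(0)
est = subsample_and_aggregate(range(1000), lambda blk: sum(blk) / len(blk),
                              num_blocks=25, epsilon=1.0, lo=0, hi=1000)
```

The privacy argument never inspects f: only the aggregation step touches more than one person's influence.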

**Application: Feature Selection** If the collection of block outputs is “far” from every collection with no large majority value, output the most common value. Else quit.

**Application: Model Selection** If the collection of block outputs is “far” from every collection with no large majority value, output the most common value. Else quit.

**A Few of Many Future Directions**
• Efficiency for handling hugely many queries
• Time complexity (counting queries): connection to Tracing Traitors
• Sample complexity / database size
• Differentially private analysis of social networks?
• Is DP the right definition? At what granularity?
• What do we want to compute?
• Is there an alternative to DP?
• Axiomatic approach?
• Focus on a specific application (data mining!)
• Collaborative effort with domain experts
• What can be proved about S&A for feature/model selection?

**Thank You!**