
An Introduction to Differential Privacy and its Applications 1




Presentation Transcript


  1. An Introduction to Differential Privacy and its Applications¹. Ali Bagherzandi, Ph.D. candidate, University of California, Irvine. ¹ Most slides in this presentation are from Cynthia Dwork’s talk.

  2. Goal of Differential Privacy: privacy-preserving statistical analysis of confidential data.
  Finding statistical correlations: analyzing medical data to learn genotype/phenotype associations; correlating a cough outbreak with a chemical-plant malfunction (can’t be done with HIPAA safe-harbor sanitized data).
  Noticing events: detecting a spike in ER admissions for asthma.
  Official statistics: census contingency-table release (statistics for small sets of attributes); model fitting, regression analyses, etc.
  Standard data-mining tasks: clustering; learning association rules, decision trees, separators; principal component analysis.

  3. Two Models. (Diagram: Database → K → Sanitized Database.) Non-interactive: data are sanitized and released.

  4. Two Models. (Diagram: Database → K → Sanitized Database.) Interactive: multiple queries, adaptively chosen.

  5. Achievements for Statistical Databases.
  Defined differential privacy; proved the classical goal (Dalenius, 1977) unachievable. An “ad omnia” definition, independent of linkage information.
  General approach with rigorous proof: relates the degree of distortion to the (mathematical) sensitivity of the computation needed for the analysis; “how much” can the data of one person affect the outcome? Cottage industry: redesigning algorithms to be insensitive.
  Assorted extensions: when noise makes no sense; when actual sensitivity is much less than worst-case; when the database is distributed; …
  Lower bounds on distortion, initiated by Dinur and Nissim.

  6. Semantic Security for Confidentiality [Goldwasser-Micali ’82]. Vocabulary: plaintext, the message to be transmitted; ciphertext, the encryption of the plaintext; auxiliary information, anything else known to the attacker. The ciphertext leaks no information about the plaintext. Formalization: compare the ability of someone seeing aux and ciphertext to guess (anything about) the plaintext to the ability of someone seeing only aux to do the same thing; the difference should be “tiny”.

  7. Semantic Security for Privacy? Dalenius, 1977: anything that can be learned about a respondent from the statistical database can be learned without access to the database. Happily, this formalizes to semantic security. Unhappily, it is unachievable [Dwork and Naor 2006], both for non-serious and serious reasons.

  8. Semantic Security for Privacy? The serious reason (told as a parable). The database teaches average heights of population subgroups: “Terry Tao is four inches taller than the average Swedish ♀.” Access to the DB teaches Terry’s height; Terry’s height is learnable from the DB, not learnable without it. The proof extends to “any” notion of privacy breach. The attack works even if Terry is not in the DB! This suggests a new notion of privacy: the risk incurred by joining the DB; before/after interacting vs. risk when in / not in the DB.

  9. Differential Privacy: Formal Definition. K gives ε-differential privacy if for all values of DB, DB’ differing in a single element, and all S ⊆ Range(K): Pr[K(DB) ∈ S] / Pr[K(DB’) ∈ S] ≤ e^ε ≈ (1 + ε), i.e., the ratio is bounded. (Figure: two nearly identical response distributions Pr[t] under DB and DB’.)

  10. Achieving Differential Privacy: add noise. f: DB → R^d; K(f, DB) = f(DB) + [Noise]^d. E.g., Count(P, DB) = # rows in DB with property P. (Diagram: analyst sends f to the curator K and receives a noisy answer.)
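
As a concrete sketch of the counting-query case (not from the slides), the Python fragment below answers Count(P, DB) with Laplace noise of scale 1/ε; the toy database, the property P, and ε are invented for illustration.

```python
# Minimal sketch (illustrative data): answering Count(P, DB) with Laplace noise.
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(db, has_property, epsilon):
    """K(f, DB) = f(DB) + Lap(Δf/ε); for a counting query Δf = 1, so the scale is 1/ε."""
    true_count = sum(1 for row in db if has_property(row))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical rows: (age, has_asthma).
db = [(34, True), (61, False), (29, True), (45, True), (70, False)]
print(noisy_count(db, has_property=lambda row: row[1], epsilon=0.5))
```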

  11. Sensitivity of a Function. How much can f(DB + Me) exceed f(DB - Me)? Recall K(f, DB) = f(DB) + noise; the question asks: what difference must the noise obscure? Δf = max_{H(DB, DB’)=1} ||f(DB) - f(DB’)||₁; e.g., ΔCount = 1.
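
A quick, hypothetical way to get a feel for Δf is to compare f on a toy database against f with one row removed; note this only checks one particular database, not the worst case the definition requires.

```python
# Heuristic sketch (illustrative data): how much f changes when one row is dropped.
def removal_change(f, db):
    full = f(db)
    return max(abs(full - f(db[:i] + db[i + 1:])) for i in range(len(db)))

claims = [10, 12, 9, 20, 15]            # hypothetical claim amounts
print(removal_change(len, claims))       # count query: always 1, matching ΔCount = 1
print(removal_change(sum, claims))       # sum query: 20 here; the worst case is the largest allowed claim
```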

  12. Calibrate Noise to Sensitivity. Δf = max_{H(DB, DB’)=1} ||f(DB) - f(DB’)||₁. Theorem: to achieve ε-differential privacy, use scaled symmetric noise [Lap(R)]^d with R = Δf/ε, where Lap(R) has density f(x) = (1/2R) exp(-|x|/R). Increasing R flattens the curve: more privacy. The noise depends on f and ε, not on the database. (Figure: Laplace densities for increasing R.)
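
A minimal sketch of this mechanism in Python; the histogram and its assumed sensitivity Δf = 2 are illustrative, not from the slides.

```python
# Laplace mechanism sketch: release f(DB) + [Lap(R)]^d with R = Δf/ε.
import numpy as np

rng = np.random.default_rng(1)

def laplace_mechanism(f_db, delta_f, epsilon):
    """Add i.i.d. Lap(Δf/ε) noise to each coordinate of the d-dimensional answer f(DB)."""
    f_db = np.asarray(f_db, dtype=float)
    scale = delta_f / epsilon          # R = Δf/ε; larger R flattens the curve, giving more privacy
    return f_db + rng.laplace(loc=0.0, scale=scale, size=f_db.shape)

# Hypothetical 3-bucket histogram; changing one row moves two buckets by 1, so Δf = 2 in the L1 norm.
print(laplace_mechanism([42, 17, 8], delta_f=2.0, epsilon=1.0))
```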

  13. Why does it work? (d = 1.) Δf = max_{DB, DB-Me} |f(DB) - f(DB-Me)|. Theorem: to achieve ε-differential privacy, use scaled symmetric noise Lap(R) with R = Δf/ε. Then for any output t, Pr[K(f, DB) = t] / Pr[K(f, DB’) = t] = exp((|t - f(DB’)| - |t - f(DB)|)/R) ≤ exp(Δf/R) = e^ε. (Figure: two shifted Laplace densities.)
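
Spelled out (for d = 1, with the Lap(R) density centered at f(DB) and R = Δf/ε), the argument on the slide is the chain:

```latex
\[
\frac{\Pr[K(f,\mathrm{DB}) = t]}{\Pr[K(f,\mathrm{DB}') = t]}
  = \frac{\tfrac{1}{2R}\exp\bigl(-|t - f(\mathrm{DB})|/R\bigr)}
         {\tfrac{1}{2R}\exp\bigl(-|t - f(\mathrm{DB}')|/R\bigr)}
  = \exp\!\left(\frac{|t - f(\mathrm{DB}')| - |t - f(\mathrm{DB})|}{R}\right)
  \le \exp\!\left(\frac{|f(\mathrm{DB}) - f(\mathrm{DB}')|}{R}\right)
  \le \exp\!\left(\frac{\Delta f}{R}\right)
  = e^{\varepsilon}.
\]
```

The first inequality is the triangle inequality, and the last step substitutes R = Δf/ε.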

  14. Differential Privacy: Example. Query: average claim. Auxiliary input: a fraction C of the entries. Without noise, the adversary can learn my claim with probability C. What privacy level is enough for me? I feel safe if I am hidden among n people, i.e., Pr[identify] = 1/n. If n = 100 and C = 0.01, then in this example ε = 1. In most applications the same ε gives a stronger feeling of privacy, e.g. because identification by the adversary is itself probabilistic ⇒ ε = 0.1, 1.0, ln(2).

  15. Differential Privacy: Example. Query: average claim. Auxiliary input: 12 entries. ε = 1 ⇒ hide me among 1200 people. Δf = max_{DB, DB-Me} |f(DB) - f(DB-Me)| = max_{DB, DB-Me} |Avg(DB) - Avg(DB-Me)| = 11.10 - 10.29 = 0.81 ⇒ the average is not a sensitive function. Add noise: Lap(Δf/ε) = Lap(0.81/1) = Lap(0.81), Var = 1.31 ⇒ roughly, respond with the true value ± 1.31.

  16. Differential Privacy: Example. Query: max claim. Auxiliary input: 12 entries. ε = 1 ⇒ hide me among 1200 people. Δf = max_{DB, DB-Me} |f(DB) - f(DB-Me)| = max_{DB, DB-Me} |Max(DB) - Max(DB-Me)| = 20 - 12.54 = 7.46 ⇒ the max function is very sensitive! Add noise: Lap(Δf/ε) = Lap(7.46/1) = Lap(7.46), Var = 111.30 ⇒ roughly, respond with the true value ± 111.30 ⇒ useless: the error exceeds the sampling error.
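
The two examples above can be simulated directly. A minimal sketch that reuses the sensitivities quoted on the slides (Δf = 0.81 for the average, Δf = 7.46 for the max); the "true" answers are invented placeholders.

```python
# Sketch using the sensitivities quoted on the slides; the true answers are illustrative.
import numpy as np

rng = np.random.default_rng(2)
epsilon = 1.0

for query, true_answer, delta_f in [("average claim", 11.10, 0.81), ("max claim", 20.0, 7.46)]:
    scale = delta_f / epsilon                      # R = Δf/ε
    noisy = true_answer + rng.laplace(0.0, scale)  # Lap(R) noise
    print(f"{query}: Var = 2R^2 = {2 * scale**2:.2f}, noisy answer = {noisy:.2f}")
```

The printed variances (about 1.31 and 111.30) match the slides: calibrating to the max query drowns the answer in noise, while the average survives.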

  17. A Natural Relaxation: (δ, T)-Differential Privacy. For all DB, DB’ differing in at most one element and for all subsets S of Range(K): Pr[K(DB) ∈ S] ≤ e^δ · Pr[K(DB’) ∈ S] + T, where T = T(n) is negligible. Cf. ε-differential privacy, which is unconditional, independent of n.

  18. Feasibility Results [Dwork-Nissim ’04] (here: revisionist version). Database D = {d₁, …, dₙ}, rows in {0,1}^k; T “counting queries” q = (S, g) with S ⊆ [1..n] and g: rows → {0,1}: “How many rows in S satisfy g?”, i.e., q(D) = Σ_{i∈S} g(dᵢ). Privacy mechanism: add binomial noise, r = q(D) + B(f(n), ½). SuLQ Theorem: (δ, T)-differential privacy when T(n) = O(n^c), 0 < c < 1; δ = 1/O(n^{c’}), 0 < c’ < c/2; f(n) = (T(n)/δ²) log⁶ n. If √T(n)/δ < O(√n), then |noise| < |sampling error|! E.g., T = n^{3/4} and δ = 1/10 give √T/δ = 10·n^{3/8} ≪ √n.
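
A rough sketch of the mechanism this slide describes; the database, the query, and the choices of T and δ below are illustrative, and the binomial noise is centered at zero here so the answer is unbiased (the slide writes q(D) + B(f(n), ½)).

```python
# SuLQ-style noisy counting query: r = q(D) + (B(f(n), 1/2) - f(n)/2), with f(n) = (T/δ²)·log⁶ n.
import math
import numpy as np

rng = np.random.default_rng(3)

def sulq_answer(db, S, g, T, delta):
    n = len(db)
    true_answer = sum(g(db[i]) for i in S)          # q(D) = Σ_{i∈S} g(d_i)
    f_n = int((T / delta**2) * math.log(n) ** 6)     # noise "budget" from the theorem
    noise = rng.binomial(f_n, 0.5) - f_n / 2         # centered B(f(n), ½) noise
    return true_answer + noise

db = rng.integers(0, 2, size=(1000, 8))              # 1000 hypothetical rows in {0,1}^8
answer = sulq_answer(db, S=range(500), g=lambda row: int(row[0] == 1),
                     T=int(1000 ** 0.75), delta=0.1)
print(answer)  # for n this small the noise dominates; the theorem's regime is asymptotic
```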

  19. Impossibility Results: line of research initiated by Dinur and Nissim ’03. Blatant non-privacy: the adversary guesses all but o(n) rows correctly. Blatant non-privacy holds if all / r·cn / (1/2 + ε)c’n answers are within o(√n) of the true answer, even against an adversary restricted to n / cn / c’n queries and poly(n) / poly(n) / exp(n) computation [Y] / [DMT] / [DMT]. The results are independent of how the noise is distributed. A variant model permits poly(n) computation in the final case [DY]. With 2^n queries, noise of magnitude E permits correctly guessing n - 4E rows [DiNi]. Depressing implications for the non-interactive case.

  20. Thank You!
