Foundations of Privacy Lecture 4

Foundations of Privacy, Lecture 4
Lecturer: Moni Naor

Recap of last week's lecture
• Differential privacy
• Sensitivity:
  • Global sensitivity of a query $q: U^n \to \mathbb{R}^d$: $GS_q = \max_{D,D'} \|q(D) - q(D')\|_1$, over adjacent $D, D'$
  • Local sensitivity of a query $q$ at a point $D$: $LS_q(D) = \max_{D'} |q(D) - q(D')|$, over $D'$ adjacent to $D$
  • Smooth sensitivity: $S_f^*(X) = \max_Y \{ LS_f(Y)\, e^{-\varepsilon\, \mathrm{dist}(X,Y)} \}$
• Histograms
• Differential privacy of median
• Exponential mechanism

Histograms
• Inputs $x_1, x_2, \ldots, x_n$ in domain $U$
• Domain $U$ is partitioned into $d$ disjoint bins $S_1, \ldots, S_d$
• $q(x_1, x_2, \ldots, x_n) = (n_1, n_2, \ldots, n_d)$, where $n_j = \#\{i : x_i \in S_j\}$
• Can view as $d$ queries: $q_j$ counts the number of points in set $S_j$
• For adjacent $D, D'$, only one answer can change, and it can change by at most 1
• Global sensitivity of the answer vector is 1
• Sufficient to add $\mathrm{Lap}(1/\varepsilon)$ noise to each query and still get $\varepsilon$-privacy
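A minimal sketch of the noisy histogram, using only the Python standard library. The names `laplace` and `private_histogram`, and the `bin_of` callback that maps a point to its bin index, are illustrative assumptions and do not come from the lecture.

```python
import random

def laplace(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is distributed as Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_histogram(data, bin_of, num_bins, epsilon, rng):
    """Return the bin counts, each perturbed by independent Lap(1/epsilon) noise."""
    counts = [0] * num_bins
    for x in data:
        counts[bin_of(x)] += 1  # each point lands in exactly one bin
    return [c + laplace(1.0 / epsilon, rng) for c in counts]

# Hypothetical data: ages bucketed into four bins of width 20.
rng = random.Random(0)
ages = [3, 17, 25, 25, 31, 48, 52, 67]
noisy = private_histogram(ages, lambda a: a // 20, 4, epsilon=0.5, rng=rng)
```

The noise scale does not grow with the number of bins, because a single point only ever touches one count.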

The Exponential Mechanism [McSherry Talwar]
A general mechanism that
• Yields differential privacy
• May yield utility/approximation
• Is defined and evaluated by considering all possible answers
The definition does not yield an efficient way of evaluating it.
Application/original motivation: approximate truthfulness of auctions
• Collusion resistance
• Compatibility

Side bar: Digital Goods Auction
• Some product with 0 cost of production
• $n$ individuals with valuations $v_1, v_2, \ldots, v_n$
• Auctioneer wants to maximize profit

Example of the Exponential Mechanism
• Data: $x_i$ = website visited by student $i$ today
• Range: $Y$ = {website names}
• For each name $y$, let $q(y, X) = \#\{i : x_i = y\}$ (the size of the subset of students who visited $y$)
• Goal: output the most frequently visited site
• Procedure: given $X$, output website $y$ with probability proportional to $e^{\varepsilon q(y,X)}$
• Popular sites are exponentially more likely than rare ones
• Website scores don't change too quickly: changing one student's entry changes each $q(y, X)$ by at most 1

Setting
• For input $D \in U^n$, want to find $r \in R$
• Base measure $\mu$ on $R$, usually uniform
• Score function $q': U^n \times R \to \mathbb{R}$ assigns any pair $(D, r)$ a real value
• Want to maximize it (approximately)
The exponential mechanism
• Assign output $r \in R$ with probability proportional to $e^{\varepsilon q'(D,r)} \mu(r)$
• Normalizing factor: $\sum_{r} e^{\varepsilon q'(D,r)} \mu(r)$ (an integral when $R$ is continuous)
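The sampling rule can be sketched for a finite range as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the `base_measure` callback (defaulting to the uniform measure), and the website data are all hypothetical.

```python
import math
import random

def selection_probabilities(data, candidates, score, epsilon, base_measure=None):
    """Probability of each candidate r: proportional to exp(epsilon * score(data, r)) * mu(r)."""
    mu = base_measure if base_measure is not None else (lambda r: 1.0)
    exponents = [epsilon * score(data, r) for r in candidates]
    shift = max(exponents)  # a common shift cancels in the normalization and avoids overflow
    weights = [math.exp(e - shift) * mu(r) for e, r in zip(exponents, candidates)]
    total = sum(weights)  # the normalizing factor
    return [w / total for w in weights]

def exponential_mechanism(data, candidates, score, epsilon, rng, base_measure=None):
    probs = selection_probabilities(data, candidates, score, epsilon, base_measure)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical data for the website example: score = number of visits.
visits = ["a.org"] * 50 + ["b.org"] * 3 + ["c.org"]
sites = ["a.org", "b.org", "c.org"]
count = lambda X, y: sum(1 for x in X if x == y)
pick = exponential_mechanism(visits, sites, count, epsilon=1.0, rng=random.Random(0))
```

Enumerating every candidate is exactly the inefficiency the definition implies: the cost is linear in the size of the range.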

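The exponential mechanism is private
• Let $\Delta = \max_{D,D',r} |q'(D,r) - q'(D',r)|$, where the maximum is over adjacent $D, D'$
• Claim: the exponential mechanism yields a $2\varepsilon\Delta$-differentially private solution
• $\Pr[\text{output} = r \text{ on input } D] = e^{\varepsilon q'(D,r)} \mu(r) \,/\, \sum_{r'} e^{\varepsilon q'(D,r')} \mu(r')$
• $\Pr[\text{output} = r \text{ on input } D'] = e^{\varepsilon q'(D',r)} \mu(r) \,/\, \sum_{r'} e^{\varepsilon q'(D',r')} \mu(r')$
• For adjacent $D, D'$ the ratio is bounded by $e^{\varepsilon\Delta} \cdot e^{\varepsilon\Delta}$: one factor from the numerator and one from the normalizer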
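The ratio bound can be written out term by term. The derivation below is a sketch assuming a discrete range $R$, with $\mu$ the base measure, $\varepsilon$ the privacy parameter, and $\Delta$ the maximum change in the score between adjacent databases.

```latex
\begin{align*}
\frac{\Pr[\text{output}=r \mid D]}{\Pr[\text{output}=r \mid D']}
  &= \frac{e^{\varepsilon q'(D,r)}\,\mu(r)}{e^{\varepsilon q'(D',r)}\,\mu(r)}
     \cdot
     \frac{\sum_{r'} e^{\varepsilon q'(D',r')}\,\mu(r')}{\sum_{r'} e^{\varepsilon q'(D,r')}\,\mu(r')} \\
  &= e^{\varepsilon\,(q'(D,r)-q'(D',r))}
     \cdot
     \frac{\sum_{r'} e^{\varepsilon q'(D',r')}\,\mu(r')}{\sum_{r'} e^{\varepsilon q'(D,r')}\,\mu(r')} \\
  &\le e^{\varepsilon\Delta}
     \cdot
     \frac{\sum_{r'} e^{\varepsilon\Delta}\, e^{\varepsilon q'(D,r')}\,\mu(r')}{\sum_{r'} e^{\varepsilon q'(D,r')}\,\mu(r')}
   = e^{\varepsilon\Delta}\cdot e^{\varepsilon\Delta}
   = e^{2\varepsilon\Delta}.
\end{align*}
```

The second inequality uses $q'(D',r') \le q'(D,r') + \Delta$ for every $r'$, applied inside each term of the sum.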

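Laplace Noise as Exponential Mechanism
• On query $q: U^n \to \mathbb{R}$, let $q'(D, r) = -|q(D) - r|$
• $\Pr[\text{noise} = y] \propto e^{-\varepsilon|y|}$; normalizing, $e^{-\varepsilon|y|} \,/\, \left(2\int_{y \ge 0} e^{-\varepsilon y}\,dy\right) = \frac{\varepsilon}{2}\, e^{-\varepsilon|y|}$
• Laplace distribution: $Y = \mathrm{Lap}(b)$ has density function $\Pr[Y = y] = \frac{1}{2b}\, e^{-|y|/b}$, so the noise is $\mathrm{Lap}(1/\varepsilon)$
[Figure: density of the Laplace distribution as a function of y]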

Any Differentially Private Mechanism Is an Instance of the Exponential Mechanism
• Let $M$ be a differentially private mechanism
• Take $q'(D, r)$ to be $\log \Pr[M(D) = r]$
Remaining issue: accuracy

Private Ranking
• Each element $i \in \{1, \ldots, n\}$ has a real-valued score $S_D(i)$ based on a data set $D$
• Goal: output the $k$ elements with the highest scores
• Privacy
  • Data set $D$ consists of $n$ entries in domain $\mathcal{D}$
  • Differential privacy: protects the privacy of the entries in $D$
• Condition: insensitive scores
  • For any element $i$ and any data sets $D, D'$ that differ in one entry: $|S_D(i) - S_{D'}(i)| \le 1$

Approximate ranking
• Let $S_k$ be the $k$-th highest score based on data set $D$
• An output list is $\alpha$-useful if:
  • Soundness: no element in the output has score less than $S_k - \alpha$
  • Completeness: every element with score greater than $S_k + \alpha$ is in the output
[Figure: the score axis split into three regions: Score ≤ $S_k - \alpha$; $S_k - \alpha$ ≤ Score ≤ $S_k + \alpha$; $S_k + \alpha$ ≤ Score]

Two Approaches
Each input affects all scores.
• Score perturbation
  • Perturb the scores of the elements with noise
  • Pick the top $k$ elements in terms of noisy scores
  • Fast and simple implementation
  • Question: what sort of noise should be added? What sort of guarantees?
• Exponential sampling
  • Run the exponential mechanism $k$ times
  • More complicated and slower implementation
  • What sort of guarantees? (Homework)
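A sketch of the score-perturbation approach, together with a checker for the soundness and completeness conditions of approximate ranking (the slack parameter is called `alpha` here). The slides leave the choice of noise and the resulting guarantee as open questions, so Laplace noise with a caller-supplied `noise_scale` is an assumption made for illustration only, and all names are hypothetical.

```python
import random

def laplace(scale, rng):
    # Difference of two independent exponentials with mean `scale` is Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_top_k(scores, k, noise_scale, rng):
    """scores: dict mapping element -> true score. Returns k elements by noisy score."""
    noisy = {i: s + laplace(noise_scale, rng) for i, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

def is_alpha_useful(output, scores, k, alpha):
    """Check soundness and completeness of an output list against the true scores."""
    s_k = sorted(scores.values(), reverse=True)[k - 1]  # k-th highest true score
    sound = all(scores[i] >= s_k - alpha for i in output)
    complete = all(i in output for i, s in scores.items() if s > s_k + alpha)
    return sound and complete

# Hypothetical scores; noise_scale is NOT calibrated to any privacy guarantee here.
scores = {"a": 10.0, "b": 8.0, "c": 7.5, "d": 1.0}
top = noisy_top_k(scores, k=2, noise_scale=1.0, rng=random.Random(0))
useful = is_alpha_useful(top, scores, k=2, alpha=1.0)
```

Because each input affects all scores, the noise scale needed for privacy depends on how the per-score noise composes, which is the question the slide poses.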

Exponential Mechanism: Simple Example, (Almost Free) Private Lunch
• Database of $n$ individuals, lunch options $\{1, \ldots, k\}$; each individual likes or dislikes each option (1 or 0)
• Goal: output a lunch option that many like
• For each lunch option $j \in [k]$, $\ell(j)$ is the number of individuals who like $j$
• Exponential mechanism: output $j$ with probability proportional to $e^{\varepsilon \ell(j)}$
• Actual probability: $e^{\varepsilon \ell(j)} \,/\, \sum_i e^{\varepsilon \ell(i)}$, where the denominator is the normalizer
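The lunch example is small enough to compute the output distribution exactly and check the privacy ratio numerically. The sketch below assumes adjacency means changing one individual's row, so each count changes by at most 1 and the ratio should stay below $e^{2\varepsilon}$; the data and function names are hypothetical.

```python
import math

def lunch_probabilities(likes, epsilon):
    """likes: one row per individual, each a list of k votes (1 = likes, 0 = dislikes)."""
    k = len(likes[0])
    counts = [sum(row[j] for row in likes) for j in range(k)]  # l(j) for each option
    weights = [math.exp(epsilon * c) for c in counts]
    total = sum(weights)  # the normalizer
    return [w / total for w in weights]

def max_ratio(p, q):
    """Largest ratio between matching probabilities, in either direction."""
    return max(max(a / b, b / a) for a, b in zip(p, q))

epsilon = 0.5
db = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 0]]
neighbor = db[:-1] + [[1, 0, 1]]  # the last individual changes their votes

p = lunch_probabilities(db, epsilon)
p_neighbor = lunch_probabilities(neighbor, epsilon)
within_bound = max_ratio(p, p_neighbor) <= math.exp(2 * epsilon)
```

Here `within_bound` compares the worst-case ratio over the three options against the $e^{2\varepsilon}$ bound for one particular pair of adjacent databases; it illustrates the claim and does not prove it.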

Synthetic DB: Output is a DB
[Figure: Database, Sanitizer, incoming queries (query 1, query 2, ...) and returned answers (answer 1, answer 2, answer 3)]
• Synthetic DB: the output is also a DB (of entries from the same universe $X$); the user reconstructs answers by evaluating the query on the output DB
• Software and people compatible
• Consistent answers

Answering More Queries Using the Exponential Mechanism
• Differential privacy for every set $C$ of counting queries
• Error is $\tilde{O}(n^{2/3} \log|C|)$
• Remarkable: hope for rich private analysis of small DBs!
• Quantitative: #queries >> DB size
• Qualitative: the output of the sanitizer is a synthetic DB, that is, a DB itself

Counting Queries
• Database $D$ of size $n$
• Queries with low sensitivity
• Counting queries: $C$ is a set of predicates $c: U \to \{0,1\}$
• Query: how many participants in $D$ satisfy $c$?
• Relaxed accuracy: answer each query within additive error $\alpha$ w.h.p.
• Not so bad: error is inherent in statistical analysis anyway
• Non-interactive: assume all queries are given in advance

Utility and Privacy Can't Always Be Achieved Simultaneously
• Impossibility results for counting queries: a DB with $n$ participants can't have $o(\sqrt{n})$ error on $O(n)$ queries [DiNi, DwMcTa07, DwYe08]
• In all these cases, a strong privacy violation: almost the entire DB is compromised
• What can we do?

Huge DBs [Dwork Nissim]
• DB of size $n$ >> #queries $|C|$: add independent noise to the answer to every query
• Noise per query ~ #queries
• For accuracy, need #queries ≤ $n$
• May be reasonable for huge internet-scale DBs: privacy "for free"
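A sketch of the independent-noise approach, assuming Laplace noise and basic composition: each counting query has sensitivity 1, so giving each of the $|C|$ queries a budget of $\varepsilon/|C|$ leads to a noise scale of $|C|/\varepsilon$ per query, matching "noise per query ~ #queries". The function names and the sample data are hypothetical.

```python
import random

def laplace(scale, rng):
    # Difference of two independent exponentials with mean `scale` is Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def answer_counting_queries(db, predicates, epsilon, rng):
    """Answer every counting query with independent Lap(|C| / epsilon) noise."""
    scale = len(predicates) / epsilon  # noise grows linearly with the number of queries
    return [sum(1 for x in db if c(x)) + laplace(scale, rng) for c in predicates]

# Hypothetical database of ages and three predicates.
db = [23, 35, 41, 19, 67, 52, 30, 44]
predicates = [lambda a: a >= 40, lambda a: a < 25, lambda a: a % 2 == 0]
answers = answer_counting_queries(db, predicates, epsilon=1.0, rng=random.Random(0))
```

The answers stay useful only while the noise scale $|C|/\varepsilon$ is small compared with $n$, which is the "#queries ≤ n" condition.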

What about smaller DBs?
• DB of size $n$ < #queries $|C|$
• Impossibility results: can't have $o(\sqrt{n})$ error, i.e., the error must be $\Omega(\sqrt{n})$

The BLR Algorithm [Blum Ligett Roth 08]
• For DBs $F$ and $D$: $\mathrm{dist}(F, D) = \max_{q \in C} |q(F) - q(D)|$
• Algorithm on input DB $D$: sample from a distribution on DBs of size $m$ ($m < n$), where DB $F$ gets picked w.p. $\propto e^{-\varepsilon \cdot \mathrm{dist}(F,D)}$
• Intuition: far-away DBs get smaller probability
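A brute-force sketch of this sampling step for a tiny universe. Two assumptions are made here that the slide does not spell out: candidate databases are multisets of size m over the universe, and counts on a candidate F are rescaled by n/m so they are comparable with counts on D. All names and data are hypothetical.

```python
import itertools
import math
import random

def blr_distribution(db, universe, predicates, m, epsilon):
    """Return all size-m candidate DBs and their sampling probabilities."""
    n = len(db)
    true_counts = [sum(1 for x in db if c(x)) for c in predicates]
    candidates, weights = [], []
    for F in itertools.combinations_with_replacement(universe, m):
        # Assumption: rescale counts on F by n/m before comparing with D.
        est = [(n / m) * sum(1 for x in F if c(x)) for c in predicates]
        dist = max(abs(e - t) for e, t in zip(est, true_counts))
        candidates.append(F)
        weights.append(math.exp(-epsilon * dist))  # far-away DBs get smaller weight
    total = sum(weights)
    return candidates, [w / total for w in weights]

def blr_sample(db, universe, predicates, m, epsilon, rng):
    candidates, probs = blr_distribution(db, universe, predicates, m, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]

universe = [0, 1, 2, 3]
db = [0, 0, 1, 3, 3, 3, 2, 0]
predicates = [lambda x: x >= 2, lambda x: x % 2 == 0]
synthetic = blr_sample(db, universe, predicates, m=2, epsilon=1.0, rng=random.Random(0))
```

The loop visits every size-m multiset over the universe, so the cost grows roughly like $|U|^m$; this sketch is only practical for toy inputs.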

The BLR Algorithm: Idea
• In general: do not use a large DB
• Sample and answer accordingly
• A DB of size $m$ guarantees hitting each query with sufficient accuracy

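The BLR Algorithm: 2ε-Privacy
• Algorithm on input DB $D$: sample from a distribution on DBs of size $m$ ($m < n$), where DB $F$ gets picked w.p. $\propto e^{-\varepsilon \cdot \mathrm{dist}(F,D)}$
• For adjacent $D, D'$ and every $F$: $|\mathrm{dist}(F,D) - \mathrm{dist}(F,D')| \le 1$
• Probability of $F$ by $D$: $e^{-\varepsilon \cdot \mathrm{dist}(F,D)} \,/\, \sum_{G \text{ of size } m} e^{-\varepsilon \cdot \mathrm{dist}(G,D)}$
• Probability of $F$ by $D'$: numerator and denominator can each change by at most an $e^{\varepsilon}$ factor, which gives $2\varepsilon$-privacy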

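The BLR Algorithm: Error $\tilde{O}(n^{2/3} \log|C|)$
• Algorithm on input DB $D$: sample from a distribution on DBs of size $m$ ($m < n$), where DB $F$ gets picked w.p. $\propto e^{-\varepsilon \cdot \mathrm{dist}(F,D)}$
• There exists $F_{good}$ of size $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$ s.t. $\mathrm{dist}(F_{good}, D) \le \alpha$
• $\Pr[F_{good}] \sim e^{-\varepsilon\alpha}$
• For any $F_{bad}$ with $\mathrm{dist}(F_{bad}, D) \ge 2\alpha$: $\Pr[F_{bad}] \sim e^{-2\varepsilon\alpha}$
• Union bound: $\sum_{\text{bad DB } F_{bad}} \Pr[F_{bad}] \sim |U|^m e^{-2\varepsilon\alpha}$
• For $\alpha = \tilde{O}(n^{2/3} \log|C|)$: $\Pr[F_{good}] \gg \sum \Pr[F_{bad}]$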

The BLR Algorithm: Running Time
• Algorithm on input DB $D$: sample from a distribution on DBs of size $m$ ($m < n$), where DB $F$ gets picked w.p. $\propto e^{-\varepsilon \cdot \mathrm{dist}(F,D)}$
• Generating the distribution by enumeration: need to enumerate every size-$m$ database, where $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$
• Running time ≈ $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$

Conclusion
• Offline algorithm, $2\varepsilon$-differential privacy for any set $C$ of counting queries
• Error $\alpha$ is $\tilde{O}(n^{2/3} \log|C| / \varepsilon)$
• Super-poly running time: $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$

Can We Efficiently Sanitize?
• The good news: if the universe is small, can sanitize EFFICIENTLY, in time poly($|C|$, $|U|$)
• The bad news: cannot do much better, namely cannot sanitize in time sub-poly($|C|$) AND sub-poly($|U|$)

How Efficiently Can We Sanitize?

                 |C| subpoly    |C| poly
  |U| subpoly         ?             ?
  |U| poly            ?             ? (Good news!)

The Good News: Can Sanitize When Universe is Small
Efficient sanitizer for query set $C$
• DB size $n \ge \tilde{O}(|C|^{o(1)} \log|U|)$
• Error is ~ $n^{2/3}$
• Runtime poly($|C|$, $|U|$)
• Output is a synthetic database
Compare to [Blum Ligett Roth]: $n \ge \tilde{O}(\log|C| \log|U|)$, runtime super-poly($|C|$, $|U|$)

Recursive Algorithm
• Start with DB $D$ and a large query set $C$
• Repeatedly choose a random subset $C_{i+1}$ of $C_i$: shrink the query set by a (small) factor
• $C_0 = C \supseteq C_1 \supseteq C_2 \supseteq \ldots \supseteq C_b$

Recursive Algorithm
• Start with DB $D$ and a large query set $C$
• Repeatedly choose a random subset $C_{i+1}$ of $C_i$: shrink the query set by a (small) factor
• $C_0 = C \supseteq C_1 \supseteq C_2 \supseteq \ldots \supseteq C_b$
• End of recursion: sanitize $D$ w.r.t. the small query set $C_b$
• The output is good for all queries in the small set $C_{i+1}$
• Extract utility on almost all queries in the large set $C_i$
• Fix the remaining "underprivileged" queries in the large set $C_i$
