
Foundations of Privacy Lecture 11

This lecture opens with a recap of the previous lecture (continually changing data, counters, combining expert advice, multi-counters and the list update problem, pan privacy, and a general transformation to continual output), and then covers reconstruction attacks showing that sanitization cannot be too accurate.




Presentation Transcript


  1. Foundations of Privacy, Lecture 11 • Lecturer: Moni Naor

  2. Recap of the previous lecture • Continually changing data • Counters • How to combine expert advice • Multi-counters and the list update problem • Pan privacy • General transformation to continual output

  3. The Dynamic Privacy Zoo • Differentially Private Outputs • Privacy under Continual Observation • Pan Privacy • Continual Pan Privacy • Sketch vs. Stream • User-level Privacy

  4. Sanitization Can't Be Too Accurate • Usual counting queries: a query is a subset q ⊆ [n]; the response is Σi∈q di + noise • Blatant non-privacy: the adversary guesses 99% of the bits • Theorem: if all responses are within o(n) of the true answer, then the algorithm is blatantly non-private • But: the attack requires an exponential number of queries.
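As a concrete illustration of this query model, here is a minimal noisy subset-sum mechanism in Python (the uniform noise distribution and the bound E = 3 are illustrative choices of mine, not taken from the lecture):

```python
import random

def noisy_subset_sum(d, q, E):
    """Answer the counting query q (a set of indices into the bit
    database d) with the true subset sum plus noise of magnitude <= E."""
    true_answer = sum(d[i] for i in q)
    return true_answer + random.uniform(-E, E)

random.seed(0)
d = [random.randint(0, 1) for _ in range(100)]   # secret bit database
q = set(range(0, 100, 2))                        # query the even positions
answer = noisy_subset_sum(d, q, E=3)
assert abs(answer - sum(d[i] for i in q)) <= 3   # response is E-accurate
```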

  5. Proof: Exponential Adversary • Focus on the column of the database containing the super-private bit [Figure: "the database" d, shown as a column of bits] • Assume all answers are within error bound E • Will show that E cannot be o(n).

  6. Proof: Exponential Adversary • Estimate the number of 1's in every possible set: ∀S ⊆ [n], |K(S) − Σi∈S di| ≤ E, where K(S) denotes the mechanism's response on S • Weed out "distant" databases: for each candidate database c, if for any S ⊆ [n] we have |Σi∈S ci − K(S)| > E, rule out c • If c is not ruled out, halt and output c • Claim: the real database d will not be ruled out.

  7. Proof: Exponential Adversary • Suppose ∀S ⊆ [n]: |K(S) − Σi∈S di| ≤ E • Claim: for any c that has not been ruled out, the Hamming distance between c and d is at most 4E • Let S0 = {i : di = 1, ci = 0} and S1 = {i : di = 0, ci = 1}; since c is not ruled out, |K(S0) − Σi∈S0 ci| ≤ E and |K(S1) − Σi∈S1 ci| ≤ E, which together with the accuracy of K on S0 and S1 gives |S0| ≤ 2E and |S1| ≤ 2E.
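The exponential adversary of the last three slides can be run verbatim on a toy instance. The sketch below (function and variable names are mine) enumerates all 2^n candidate databases against answers on all 2^n subsets, so it is only feasible for tiny n; with exact answers (E = 0) it recovers d exactly, and in general the claim above guarantees the output is Hamming-close to d:

```python
from itertools import product

def exponential_adversary(answers, n, E):
    """Return the first candidate database consistent (within error E)
    with the answer given for every subset; answers[mask] is the noisy
    count of 1's of d on the subset encoded by the bit-mask `mask`."""
    for c in product([0, 1], repeat=n):
        if all(abs(sum(c[i] for i in range(n) if (mask >> i) & 1)
                   - answers[mask]) <= E
               for mask in range(2 ** n)):
            return list(c)
    return None  # cannot happen: d itself is never ruled out

# Toy demo with exact answers (E = 0): reconstruction is exact.
d = [1, 0, 1, 1, 0]
n = len(d)
answers = [sum(d[i] for i in range(n) if (mask >> i) & 1)
           for mask in range(2 ** n)]
assert exponential_adversary(answers, n, E=0) == d
```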

  8. Contradiction? • We have seen algorithms that allow answering each query with accuracy o(n), namely O(√n) and O(n^{2/3}) • Why is there no contradiction with the current results?

  9. What Can We Do Efficiently? • The previous attack allowed "too" much power to the adversary, both in the number of queries and in computation • On the other hand, it assumed no wild errors in the responses • Theorem: for any sanitization algorithm, if all responses are within o(√n) of the true answer, then it is blatantly non-private even against a polynomial-time adversary making O(n log² n) random queries • Next: the explicit adversary.

  10. The Model • As before, the database d is a bit string of length n • Users query for subset sums: a query is a subset q ⊆ {1, …, n}, and the (exact) answer is aq = Σi∈q di • E-perturbation: each answer is aq ± E.

  11. Privacy Requires Ω(√n) Perturbation • Consider a database with o(√n) perturbation E • The adversary makes t = n log² n random queries qj, getting noisy answers aj • Privacy-violating algorithm: construct a database c = (ci), 1 ≤ i ≤ n, by solving the linear program 0 ≤ ci ≤ 1 for 1 ≤ i ≤ n and aj − E ≤ Σi∈qj ci ≤ aj + E for 1 ≤ j ≤ t • A solution must exist: d itself • Round the solution: set ci to 1 if ci > 1/2 and to 0 otherwise • For every query qj, the answer according to c is at most 2E far from the real answer on d.
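The LP attack on this slide can be sketched directly with an off-the-shelf solver. The following toy run (assuming NumPy and SciPy are installed; the sizes n, t and noise bound E are illustrative choices of mine) recovers most bits of a random database:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, t, E = 60, 1000, 1.0            # toy sizes; E is o(√n) here
d = rng.integers(0, 2, size=n)     # secret bit database

# t random subset-sum queries, each answered within perturbation E
Q = rng.integers(0, 2, size=(t, n)).astype(float)
a = Q @ d + rng.uniform(-E, E, size=t)

# Feasibility LP: 0 <= c_i <= 1 and a_j - E <= <q_j, c> <= a_j + E
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([a + E, E - a])
res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 1.0)] * n, method="highs")

c = (res.x > 0.5).astype(int)      # round the LP solution to bits
print("fraction of bits recovered:", np.mean(c == d))
```

Note that the LP only asks for feasibility (the objective is zero); the theorem is what guarantees that any feasible point rounds to a database close to d.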

  12. Bad Solutions to the LP Do Not Survive • A query disqualifies a potential database c if the answer on c is more than 2E + 1 far from the real answer on d • Idea: show that a random query disqualifies any c that is far from d with some constant probability α • Want to use the union bound: all far-away solutions are disqualified with probability at least 1 − n^n(1 − α)^t = 1 − neg(n) • How do we limit the solution space so the bound applies? Discretize: round each ci to the closest multiple of 1/n.

  13. Privacy Requires Ω(√n) Perturbation • Recall: a query disqualifies a potential database c if the answer on c is more than 2E + 1 far from the real answer on d • Claim: a random query disqualifies any c far from d (counting the number of entries on which they differ) with some constant probability α • Therefore t = n log² n queries leave a negligible probability for each far reconstruction • Union bound, applicable after discretization: all far-away candidates are disqualified with probability at least 1 − n^n(1 − α)^t = 1 − neg(n).

  14. Review and Conclusion • When the perturbation is o(√n), asking Õ(n) random queries gives enough information to efficiently reconstruct a database that is o(n)-close to d • The database is reconstructed by linear programming, in polynomial time • Hence o(√n)-perturbed databases are blatantly non-private, with reconstruction in poly(n) time.

  15. Ω(√n) Lower Bound Revisited • An attack on an o(√n)-perturbation database with substantially better performance • The previous attack uses n log² n queries and runs in n⁵ log⁴ n time (LP) • The new attack issues n queries and runs in O(n log n) time • The new attack is deterministic: a fixed set of queries for each size • Not necessarily an advantage: the adversary must be allowed to ask those particular queries.

  16. The Fourier Attack • Treat the database d as a function Z2^{log n} → Z2 (a vector of length n defines such a function) • Query specific subset sums from which the Fourier coefficients of the function can be calculated, one query per coefficient • When the sums have o(√n) error, so do the coefficients, and the reconstruction can be shown to have o(n) error • Round the reconstructed function's values to bits • The Fourier transform can be computed in time O(n log n) • Key point: linearity of the Fourier transform implies that small error in the coefficients also means small error in the function.

  17. Fourier Transform • The characters of Z2^k are the homomorphisms into {−1, 1}; there are 2^k of them, one for each a = (a1, …, ak) ∈ Z2^k: χa(x) = (−1)^{Σi ai xi} • For a function f: Z2^{log n} → R, the Fourier coefficients are f̂(a) = Σx χa(x) f(x), and f(x) = (1/2^k) Σa χa(x) f̂(a) • In matrix form: H is the 2^k × 2^k Hadamard matrix with Ha,b = χa(b); H·H = 2^k·I, f̂ = H f, and f = (1/2^k) H f̂.
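The products H f and H f̂ above can be computed in O(n log n) time with the fast Walsh–Hadamard transform. A small self-contained sketch (my own implementation, assuming NumPy) checks the inversion formula and Parseval's identity numerically:

```python
import numpy as np

def fwht(f):
    """Fast Walsh-Hadamard transform: returns H @ f for the 2^k x 2^k
    Hadamard matrix H[a, b] = (-1)^<a, b>, in O(2^k * k) time."""
    f = np.array(f, dtype=float)
    h = 1
    while h < len(f):
        for i in range(0, len(f), 2 * h):
            for j in range(i, i + h):
                x, y = f[j], f[j + h]
                f[j], f[j + h] = x + y, x - y
        h *= 2
    return f

k = 4
rng = np.random.default_rng(1)
f = rng.integers(0, 2, size=2 ** k).astype(float)  # database as a function
fhat = fwht(f)                                     # Fourier coefficients

assert np.allclose(fwht(fhat) / 2 ** k, f)         # f = (1/2^k) H fhat
assert np.isclose((f ** 2).sum(), (fhat ** 2).sum() / 2 ** k)  # Parseval
```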

  18. Parseval's Identity • Relates the squared values of f to the squared Fourier coefficients: Σx∈Z2^k |f(x)|² = (1/2^k) Σa∈Z2^k |f̂(a)|².

  19. Evaluating Fourier Coefficients with Counting Queries • Let f̂(0) = Σx f(x) • For a = (a1, …, ak), let Sa = {x : ⟨a, x⟩ = 0 mod 2}; note |Sa| = 2^{k−1} • Then f̂(a) = 2 Σx∈Sa f(x) − f̂(0), so an approximate answer to the counting query on Sa yields an approximation of f̂(a) • With ê = (ê1, …, ên) the error vector of the Fourier coefficients: since f = (1/2^k) H f̂, the reconstruction computes (1/2^k) H (f̂ + ê) = f + (1/2^k) H ê.

  20. Error Analysis • The reconstruction is (1/2^k) H (f̂ + ê) = f + (1/2^k) H ê, where ê = (ê1, …, ên) is the error vector of the Fourier coefficients • If (1/2^k) H ê had Ω(n) entries of size ≥ ½ (i.e., Ω(n) bits were reconstructed incorrectly), then by Parseval's identity, Σx∈Z2^k |f(x)|² = (1/2^k) Σa∈Z2^k |f̂(a)|², the quantity (1/2^k) Σa |êa|² would be Ω(n) • Hence at least one |êa| would be Ω(√n), contradicting the assumption on the accuracy of the answers.

  21. Changing the Model: Weighted Counting • Previous attacks assume every query is answered within some small perturbation E • New model: unbounded noise may be added to up to a ½ − δ fraction of the queries, while the noise on the rest is bounded by a "small" E • Stronger query model: subset sums are weighted, with weights 0, …, p − 1 for some prime p = Ω(1/δ² + E/δ) • Some randomness in the queries is wanted, since otherwise the unbounded noise can simply be repeated • Note: with weighted queries one cannot "hide" single bits, since all the weight might be placed on one of them.

  22. Interpolation Attack • Treat the database as a linear form in n variables over Zp • Treat a query q = (q1, …, qn) as the evaluation of the form at a point: f(q1, …, qn) = Σi=1..n di qi mod p • An answer to the query q = ((p−1)/2, 0, …, 0) that is within (p−1)/4 error reveals the first database bit (by dropping the low-order information); similarly for all the other bits • But there is no point in asking such a query directly: these useful queries might be among those with unbounded noise • Need to deduce an (approximate) answer to q from other queries.

  23. Interpolation Attack: Implementation • Want to evaluate a specific query q with small error • Pick a random degree-2 curve that passes through q and issue queries for the p points on the curve • Key issue: points on the curve are pairwise independent • Therefore, for sufficiently many queries, with high probability interpolation gives a correct (up to small noise) answer for q • Can try all degree-2 polynomials exhaustively, similarly to Reed–Muller decoding.

  24. Interpolation Attack (continued) • Interpolation is implemented by searching all p³ degree-2 polynomials for one that is E-close to the answers on a ½ + δ fraction of the entries (the restriction of a linear form to a degree-2 curve is a degree-2 polynomial in the curve parameter) • Any two such polynomials must be 2E-close, due to the low degree • Hence the accuracy of the reconstructed answer is 2E • For (p − 1)/4 > 2E, any specific database bit can be recovered with high probability.

  25. Interpolation Attack: Evaluating a Query Accurately • The database defines f(q1, …, qn) = Σi=1..n di qi (a map Zp^n → Zp) • Pick a curve through q: for two random points u1, u2 ∈ Zp^n, let c(t) = q + u1 t + u2 t² (a map Zp → Zp^n) • The restriction of f to c, f|c(t) = f(c(t)), is a degree-2 polynomial (Zp → Zp) • Query all p points of c to get (inaccurate) evaluations of f|c • Interpolate to find f|c up to a small error • Evaluate f|c(0) = f(q) accurately.
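The curve trick on this slide can be sketched in a few lines. In the toy below, the database, prime, and query are my own illustrative choices, and the answers are noiseless for simplicity; it queries the p points of a random degree-2 curve through q, brute-forces the p³ candidate polynomials as on the previous slide, and reads off f(q) from the constant coefficient. With noisy answers, the same search would pick the polynomial agreeing with the most points.

```python
import random

p, n = 7, 5
d = [1, 0, 1, 1, 0]                      # secret database
q = (3, 0, 0, 0, 0)                      # query we want to evaluate

def f(x):                                # the linear form <d, x> mod p
    return sum(di * xi for di, xi in zip(d, x)) % p

random.seed(0)
u1 = tuple(random.randrange(p) for _ in range(n))
u2 = tuple(random.randrange(p) for _ in range(n))

def curve(t):                            # c(t) = q + u1*t + u2*t^2 over Z_p
    return tuple((qi + ai * t + bi * t * t) % p
                 for qi, ai, bi in zip(q, u1, u2))

# f restricted to the curve is a degree-2 polynomial in t; query all p points
answers = [f(curve(t)) for t in range(p)]

# Search all p^3 degree-2 polynomials for the one agreeing with the answers
best = (-1, None)
for a0 in range(p):
    for a1 in range(p):
        for a2 in range(p):
            agree = sum((a0 + a1 * t + a2 * t * t) % p == answers[t]
                        for t in range(p))
            if agree > best[0]:
                best = (agree, a0)       # a0 is the value at t = 0

assert best[1] == f(q)                   # f|c(0) recovers the answer at q
```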

  26. Interpolation Attack: Performance • Time for finding any specific bit: O(p⁴) = O(δ⁻⁸), essentially independent of the database size n (apart from the Θ(n) time just to write each query) • Can therefore be used with very large databases when the interesting part is small • Time to reconstruct the whole database with small error: O(n), using pn queries (or O(n²) overall time).

  27. Summary • The Ω(√n) perturbation lower bound revisited, with a simple and efficient attack • When queries allow sufficiently large weights, an adversary can handle unbounded noise on a large portion of the queries and find out private data in time independent of the size of the database.
