
Towards Privacy in Public Databases


Presentation Transcript


  1. Towards Privacy in Public Databases
  Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee

  2. Database Privacy
  • A “Census” problem
    • Individuals provide information
    • The Census Bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we achieve?
    • There is an inherent privacy vs. utility trade-off
  • Our goal:
    • Find a middle path: preserve macroscopic properties while “disguising” individual records (which contain private information)
    • Establish a framework for meaningful comparison of techniques

  3. What about Secure Function Evaluation?
  • Secure Function Evaluation (SFE) [Yao, GMW] allows parties to collaboratively compute a function f of their private inputs: y = f(a, b, c, …), e.g., y = sum(a, b, c, …)
  • Each player learns only what can be deduced from y and her own input to f
  • SFE and privacy are complementary problems: one does not imply the other
    • SFE: given what must be preserved, protect everything else
    • Privacy: given what must be protected, preserve as much as you can

  4. This talk…
  • A formalism for privacy
    • What we mean by privacy
    • A good sanitization procedure
  • Results
    • Histograms and perturbations
  • Subsequent work; open problems

  5. What do we mean by Privacy?
  • [Ruth Gavison] Protection from being brought to the attention of others
    • Inherently valuable
    • Attention invites further privacy loss
  • Privacy is assured to the extent that one blends in with the crowd
  • An appealing definition, and one that can be converted into a precise mathematical statement…

  6. The basic model – a geometric approach
  • The database consists of points in the high-dimensional space R^d
    • Samples from some underlying distribution
    • Points are unlabeled: you are your collection of attributes
    • (Relative) distance is everything: points that are closer are more similar, and vice versa
  • A “real” database RDB, controlled by a central authority
    • n points in d-dimensional space
    • Think of d as the number of sensitive attributes
  • A “sanitized” database SDB, released to the world
    • Information about fake individuals, a summary of the real data, or a combination of both

  7. The adversary, or Isolator
  • On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d
  • q “isolates” a real point x if q is much closer to x than to x’s neighbors, i.e., if B(q, cδ), with δ = ||q − x||, contains fewer than T other points from RDB (see the sketch below)
  • c and T are privacy parameters; e.g., c = 4, T = 100
  [Figure: x is isolated when the ball of radius cδ around q contains few other RDB points, and not isolated otherwise]
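A minimal sketch of this isolation check in Python (my own illustration, not from the paper; the function name, the numpy implementation, and the default parameters are assumptions):

```python
import numpy as np

def isolates(q, rdb, c=4.0, T=100):
    """Check whether the adversary's guess q isolates some point of rdb.

    q isolates x if B(q, c*delta), with delta = ||q - x||, contains
    fewer than T points of rdb other than x itself.
    """
    dists = np.linalg.norm(rdb - q, axis=1)        # ||q - y|| for every y in rdb
    for delta in dists:                            # delta = ||q - x|| for each candidate x
        others = np.sum(dists <= c * delta) - 1    # points in B(q, c*delta), excluding x
        if others < T:
            return True                            # q is much closer to x than to x's neighbors
    return False
```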

  8. Requirement for the sanitizer
  • No way of obtaining privacy if AUX already reveals too much!
  • Sanitization compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
    • The definition of “considerably” can be forgiving, e.g., ε = 1/1000 (ε = 2^-d in our results)
  • Rigorously: ∀ D, ∀ I, ∃ I′ such that, w.h.p. over RDB ~ D, ∀ aux:
    • ∀ S ⊆ RDB: | Pr[ ∃ x ∈ S: I(SDB, aux) isolates x ] − Pr[ ∃ x ∈ S: I′(aux) isolates x ] | ≤ ε
    • ∀ x ∈ RDB: | Pr[ I(SDB, aux) isolates x ] − Pr[ I′(aux) isolates x ] | ≤ ε
  • This provides a framework for describing the power of a sanitization method, and hence for comparisons
  • AUX is going to cause trouble; ignore it for now

  9. A “bad” sanitizer that passes [Abhinandan Das, Cornell]
  • Disguise one attribute extremely well; leave the others in the clear
  • Without information about the special attribute, the adversary cannot “isolate” any point
  • However, he knows all other attributes exactly!
  • What goes wrong? The assumption that “distance is everything”: no isolation ⇒ no privacy breach, even if the adversary knows a lot of information

  10. Utility goals
  • Desirable results:
    • Macroscopic properties (e.g., means) should be preserved
    • Running statistical tests or data-analysis algorithms should return results similar to those obtained from the real data
  • We show concrete point-wise results on histograms and clustering algorithms

  11. This talk…
  • A formalism for privacy
    • What we mean by privacy
    • A good sanitization procedure
  • Results
    • Histograms and perturbations
  • Subsequent work; open problems

  12. Two techniques for sanitization
  • Recursive histograms (sketched in code below)
    • Assume the universe is the d-dimensional hypercube [-1,1]^d
    • As long as a cell contains ≥ T points, subdivide it into 2^d hypercubes by splitting each side evenly
    • Recurse until all cells have < T points
    • Output a list of cells and counts
  [Figure: recursive subdivision with d=2, T=3; leaf cell counts 1 1 1 1 1 2 2 1 1 2 2 1 1]
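As a concrete reading of the recursion, here is a short Python sketch (my own illustration; the slide gives the procedure only informally, and the half-open cell boundaries are a simplifying assumption):

```python
import itertools
import numpy as np

def recursive_histogram(points, low, high, T):
    """Subdivide the cell [low, high) until every cell holds fewer than
    T points; return a list of (low, high, count) leaf cells."""
    count = len(points)
    if count < T:
        return [(low, high, count)]
    mid = (low + high) / 2.0
    d = len(low)
    cells = []
    # Split into 2^d subcells: each coordinate keeps its lower or upper half.
    for choice in itertools.product([0, 1], repeat=d):
        bits = np.array(choice)
        sub_low = np.where(bits == 0, low, mid)
        sub_high = np.where(bits == 0, mid, high)
        inside = np.all((points >= sub_low) & (points < sub_high), axis=1)
        cells += recursive_histogram(points[inside], sub_low, sub_high, T)
    return cells

# Example mirroring the figure: 2-d data in [-1,1]^2 with T = 3.
rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, size=(20, 2))
for low, high, count in recursive_histogram(
        data, np.array([-1.0, -1.0]), np.array([1.0, 1.0]), T=3):
    print(low, high, count)
```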

  13. Two techniques for sanitization
  • Recursive histograms
  • Perturbation (sketched in code below)
    • For every point x, compute its T-radius t_x: |B(x, t_x)| = T
    • Add a random vector to x of length proportional to t_x; this doesn’t work by itself
  [Figure: perturbation with T=1]
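A minimal sketch of the perturbation step (again my own Python illustration; the uniformly random direction drawn via a normalized Gaussian, and the scale parameter, are assumptions the slide does not specify):

```python
import numpy as np

def perturb(points, T, scale=1.0, seed=None):
    """Move each point x by a random vector whose length is proportional
    to its T-radius t_x, the smallest r with |B(x, r)| = T (x included).
    Assumes T <= number of points."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    out = np.empty_like(points)
    for i, x in enumerate(points):
        dists = np.sort(np.linalg.norm(points - x, axis=1))
        t_x = dists[T - 1]                       # radius at which B(x, t_x) holds T points
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)   # uniformly random unit vector
        out[i] = x + scale * t_x * direction     # noise magnitude proportional to t_x
    return out
```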

  14. Two techniques for sanitization
  • Recursive histograms
  • Perturbation, combined with histograms
  • Results on privacy:
    • Rely on randomness in the distribution and in the sanitization
    • Do not use any computational assumptions
    • When D = uniform over a hypercube, c = O(1), and T is arbitrary, the adversary’s probability of success is ε ≤ 2^-d
    • Better results for special cases

  15. Key results on utility
  • Perturbation-based sanitization allows various clustering algorithms to perform nearly as well as on the real data
    • Spectral techniques
    • Diameter-based clusterings
  • Histograms: a popular summarization technique in statistics
    • Recursive histograms offer the benefit of providing more detail where required
    • They provide density information even without the counts
    • No randomness involved!

  16. A brief proof of privacy
  • Recall recursive histograms
  • Simplifying assumption: the input distribution is uniform over the hypercube
  • Intuition:
    • The adversary’s view is a product of uniform distributions over histogram cells
    • The uniform distribution is “well-spread-out”: the adversary cannot conclusively single out a point in it

  17. A brief proof of privacy
  • Case 1: Sparse cell
    • The expected distance ||q − x|| is proportional to the diameter of the cell
    • c times this distance is larger than the diameter of the parent cell
    • Therefore, B(q, c·||q − x||) contains at least T points
  • Case 2: Dense cell
    • Consider the balls B(q, r) and B(q, cr) for some radius r
    • The adversary wins if Pr[ ∃ x ∈ B(q, r) ] is large and Pr[ ≥ T points in B(q, cr) ] is small
    • However, we show: Pr[ ∃ x ∈ B(q, cr) ] >> Pr[ ∃ x ∈ B(q, r) ]

  18. A brief proof of privacy
  • Lemma: Let c be a large enough constant. For any cell and any r < diam(cell)/c,
    Pr[ x ∈ B(q, cr) ∩ cell ] ≥ 2^d · Pr[ x ∈ B(q, r) ∩ cell ]
  • Proof idea:
    • Pr[ x ∈ B(q, r) ∩ cell ] ∝ Vol( B(q, r) ∩ cell )
    • Vol( B(q, cr) ∩ cell ) > 2^d · Vol( B(q, r) ∩ cell ); this uses arguments about Normal and Uniform random variables (a numerical sanity check follows below)
  • Corollary: the probability of success for the adversary is < 2^-d
  [Figure: B(q, r) inside B(q, cr), both intersected with the cell]
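The volume comparison can be sanity-checked by Monte Carlo. The sketch below (purely my own construction, with arbitrary constants, and covering only the easy case where q sits at the cell’s center so neither ball is clipped by the cell) estimates both intersection volumes by rejection sampling:

```python
import numpy as np

def ball_cell_ratio(q, r, c, d, samples=200_000, seed=1):
    """Estimate Vol(B(q, cr) ∩ cell) / Vol(B(q, r) ∩ cell) for the cell
    [0,1]^d by sampling uniformly from the cell and counting ball hits."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(0.0, 1.0, size=(samples, d))  # uniform over the cell
    dists = np.linalg.norm(pts - q, axis=1)
    hits_small = np.sum(dists <= r)        # samples landing in B(q, r) ∩ cell
    hits_big = np.sum(dists <= c * r)      # samples landing in B(q, cr) ∩ cell
    return hits_big / max(hits_small, 1)

d, c = 2, 4.0
q = np.full(d, 0.5)   # q at the cell's center, so both balls lie inside the cell
r = 0.05              # some r < diam(cell)/c = sqrt(2)/4
print(ball_cell_ratio(q, r, c, d), ">=", 2**d)  # unclipped case: ratio is about c^d = 16
```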

  19. This talk…
  • A formalism for privacy
    • What we mean by privacy
    • A good sanitization procedure
  • Results
    • Histograms and perturbations
  • Subsequent work; open problems

  20. Follow-up work
  • Isolation in few dimensions
    • The adversary must be more and more accurate in fewer dimensions
  • Randomized recursive histograms [Chawla, Dwork, McSherry, Talwar]
    • Similar privacy guarantees for “nearly-uniform” distributions over “well-rounded” universes
    • Preserve distances between pairs of points to a reasonable accuracy (additive error depending on T)
  • General-case impossibility
    • Cannot allow arbitrary AUX: ∀ utility goals and ∀ definitions of privacy, ∃ AUX that prevents privacy-preserving sanitization

  21. What about the real world?
  • Lessons from the abstract model:
    • High dimensionality is our friend
    • Histograms are powerful; spherical perturbations are promising
    • Different attributes need to be scaled appropriately, so that the data is well-rounded
  • Moving towards real data:
    • Outliers: our notion of c-isolation deals with them, though their existence may be disclosed
    • Discrete attributes: a possible solution is to convert them into real-valued attributes by adding noise
    • The low-dimensional case: is it inherently impossible? Dinur and Nissim show impossibility for 1-dimensional data

  22. Questions?
