
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

This research paper explores the trade-off between privacy and utility in public databases, proposing a novel approach to sanitize data while preserving macroscopic properties. The paper examines statistical techniques and perturbation of attribute values to achieve privacy without compromising data utility.


Presentation Transcript


  1. From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
  Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
  Work done at Microsoft Research, SVC

  2. Database Privacy
  • Census data – a prototypical example
    • Individuals provide information
    • Census bureau publishes sanitized records
    • Privacy is legally mandated; what utility can we achieve?
  • Inherent privacy vs. utility trade-off
    • One extreme – complete privacy; no information
    • Other extreme – complete information; no privacy
  • Goals:
    • Find a middle path: preserve macroscopic properties while "disguising" individual identifying information
    • Change the nature of the discourse
    • Establish a framework for meaningful comparison of techniques

  3. Current solutions
  • Statistical approaches
    • Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
    • Additionally, erase values that reveal too much
  • Query-based approaches
    • Perturb output, or disallow queries that breach privacy
  • Unsatisfying
    • Overly constrained definitions; ad hoc techniques
    • Ad hoc treatment of external sources of information
    • Erasure can disclose information; refusal to answer may be revelatory

  4. Our Approach
  • Crypto-flavored definitions
    • Mathematical characterization of the adversary's goal
    • Precise definition of when a sanitization procedure fails
    • Intuition: seeing the sanitized DB gives the adversary an "advantage"
  • Statistical techniques
    • Perturbation of attribute values
    • Differs from previous work: perturbation amounts depend on local densities of points
  • Highly abstracted version of the problem
    • If we can't understand this, we can't understand real life
    • If we get negative results here, the world is in trouble

  5. An outline of this talk
  • A mathematical formalism
    • What do we mean by privacy?
    • An abstract model of datasets
    • Isolation
    • Good sanitizations
  • A candidate sanitization
    • Privacy for the 2-point case
    • General argument for privacy of n-point datasets
  • A brief overview of results
  • Open issues; moving on to real-world applications

  6. What do WE mean by privacy?
  • [Ruth Gavison] Protection from being brought to the attention of others
    • Inherently valuable
    • Attention invites further privacy loss
  • Privacy is assured to the extent that one blends in with the crowd
  • An appealing definition; it can be converted into a precise mathematical statement…

  7. A geometric view
  • Abstraction:
    • Points in a high-dimensional metric space – say ℝ^d – drawn i.i.d. from some distribution
    • Points are unlabeled; you are your collection of attributes
    • Distance is everything: points are similar if and only if they are close (L2 norm)
  • Real Database (RDB) – private: n unlabeled points in d-dimensional space
  • Sanitized Database (SDB) – public: n′ new points, possibly in a different space

  8. The adversary or Isolator
  • Using the SDB and auxiliary information (AUX), outputs a point q
  • q "isolates" a real point x if it is much closer to x than to x's neighbors
  • Even if q looks similar to x, it may fail to isolate x if it looks just as similar to x's neighbors
  • Tightly clustered points have a smaller radius of isolation
  [Figure: RDB with an isolating query point and a non-isolating query point]

  9. The adversary or Isolator
  • I(SDB, AUX) = q
  • q isolates x if B(q, cδ) contains fewer than T points, where δ = |q − x|
  • T-radius of x – distance to its T-th nearest neighbor
  • x is "safe" if δ_x > (T-radius of x)/(c − 1); then B(q, cδ_x) contains x's entire T-neighborhood
  • c – privacy parameter, e.g., 4; large T and small c is good
  [Figure: query point q at distance δ from x; the isolating ball has radius cδ and the T-neighborhood lies beyond (c − 1)δ]
  The isolation test is spelled out in the Python sketch below.
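A concrete reading of the isolation test, as a minimal Python sketch (the NumPy array representation, the default parameter values, and the helper names t_radius and isolates are my assumptions, not the paper's code):

```python
import numpy as np

def t_radius(rdb: np.ndarray, i: int, T: int) -> float:
    """Distance from rdb[i] to its T-th nearest neighbor in the n x d RDB."""
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]  # dists[0] == 0 is the point itself

def isolates(q: np.ndarray, x: np.ndarray, rdb: np.ndarray,
             c: float = 4.0, T: int = 20) -> bool:
    """q isolates x if B(q, c*delta) holds fewer than T RDB points,
    where delta = |q - x|."""
    delta = np.linalg.norm(q - x)
    points_in_ball = int(np.sum(np.linalg.norm(rdb - q, axis=1) <= c * delta))
    return points_in_ball < T
```

With c = 4, q fails on any safe point x: once |q − x| > (T-radius of x)/3, the ball B(q, 4·|q − x|) contains x's entire T-neighborhood, so it holds at least T points.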

  10. A good sanitization
  • No way of obtaining privacy if AUX already reveals too much!
  • A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
  • The definition of "considerably" can be forgiving, say, n^{-2}
  • A rigorous definition: ∀ I, D, aux z, x ∃ I′ such that
    |Pr[I(SDB, z) succeeds on x] − Pr[I′(z) succeeds on x]| is small
  • Provides a framework for describing the power of a sanitization method, and hence for comparisons

  11. The Sanitizer
  • The privacy of x is linked to its T-radius ⇒ randomly perturb x in proportion to its T-radius
  • x′ = San(x) ∈_R B(x, T-rad(x))
  [Figure: sanitization with T = 1 – each point is displaced within the ball reaching its nearest neighbor]

  12. The Sanitizer
  • The privacy of x is linked to its T-radius ⇒ randomly perturb x in proportion to its T-radius
  • x′ = San(x) ∈_R B(x, T-rad(x))
  • Intuition:
    • We are blending x in with its crowd: if the number of dimensions d is large, there are "many" pre-images for x′, and the adversary cannot conclusively pick any one
    • We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
  A sketch of this sanitizer appears below.
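A minimal sketch of the sanitizer (uniform sampling within the ball and the function names are my assumptions; the slides specify only that the perturbation is proportional to the T-radius):

```python
import numpy as np

def t_radius(rdb: np.ndarray, i: int, T: int) -> float:
    """Distance from rdb[i] to its T-th nearest neighbor (as on slide 9)."""
    return np.sort(np.linalg.norm(rdb - rdb[i], axis=1))[T]

def uniform_in_ball(center: np.ndarray, radius: float, rng) -> np.ndarray:
    """Uniform sample from the d-dimensional ball B(center, radius)."""
    d = center.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                   # uniformly random direction
    r = radius * rng.uniform() ** (1.0 / d)  # U^(1/d): uniform over the volume
    return center + r * u

def sanitize(rdb: np.ndarray, T: int = 1, seed: int = 0) -> np.ndarray:
    """x' = San(x), drawn from B(x, T-rad(x)) for each x in the RDB."""
    rng = np.random.default_rng(seed)
    return np.stack([uniform_in_ball(x, t_radius(rdb, i, T), rng)
                     for i, x in enumerate(rdb)])
```

Note that the noise has mean zero by the symmetry of the ball, which is exactly what the second intuition bullet relies on.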

  13. Flavor of Results (Preliminary)
  • Assumptions
    • Data arises from a mixture of Gaussians
    • Dimension d and number of points n are large; d = ω(log n)
  • Results
    • Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^{-Ω(d)} (several special cases; the general result is not yet proved)
    • Utility: an honest user who does not know the Gaussians can compute the means with high probability

  14. The "simplest" interesting case
  • RDB = {x, y}, with x, y ∈_R B(o, r), where o is the origin
  • T = 1; c = 4; SDB = {x′, y′}
  • The adversary knows x′, y′, r, and δ = |x − y|
  • We show: there are m = 2^{Ω(d)} "decoy" pairs (x_i, y_i)
    • (x_i, y_i) are legal pre-images of (x′, y′); that is, |x_i − y_i| = δ and Pr[x_i, y_i | x′, y′] = Pr[x, y | x′, y′]
    • The adversary cannot know which of the (x_i, y_i) represents reality
    • The adversary can only isolate one point in {x_1, y_1, …, x_m, y_m} at a time

  15. The "simplest" interesting case
  • Consider a hyperplane H through x′, y′, and o
  • x_H, y_H – mirror reflections of x, y through H
    • Note: reflections preserve distances!
  • The world of x_H, y_H looks identical to the world of x, y:
    Pr[x_H, y_H | x′, y′] = Pr[x, y | x′, y′]
  [Figure: x and y reflected through H to x_H and y_H, with x′ and y′ lying on H]
  The reflection step is sketched in code below.
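For concreteness, the reflection is a Householder reflection across a hyperplane through the origin; a minimal sketch (the helper names and the random choice of hyperplane are my assumptions):

```python
import numpy as np

def unit_normal_containing(vectors, d: int, rng) -> np.ndarray:
    """Random unit normal orthogonal to the given vectors, so that the
    hyperplane H with this normal passes through the origin and through
    each given vector (here: x' and y')."""
    basis = []
    for v in vectors:                       # Gram-Schmidt on x', y'
        w = np.array(v, dtype=float)
        for b in basis:
            w -= np.dot(w, b) * b
        basis.append(w / np.linalg.norm(w))
    n = rng.normal(size=d)
    for b in basis:                         # project out span(x', y')
        n -= np.dot(n, b) * b
    return n / np.linalg.norm(n)

def reflect(p: np.ndarray, normal: np.ndarray) -> np.ndarray:
    """Householder reflection of p across the hyperplane with unit normal."""
    return p - 2.0 * np.dot(p, normal) * normal
```

Since the reflection is an isometry that fixes x′, y′, and o, the pair (reflect(x, n), reflect(y, n)) has the same pairwise distance δ and the same conditional probability given (x′, y′) – exactly the decoy pairs of slide 14.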

  16. The "simplest" interesting case
  • Consider a hyperplane H through x′, y′, and o
  • x_H, y_H – mirror reflections of x, y through H (reflections preserve distances!)
  • The world of x_H, y_H looks identical to the world of x, y
  • How many different H are there such that the corresponding x_H are pairwise distant?
    • Sufficient to pick r = (2/3)δ and θ = 30°
    • Fact: there are 2^{Ω(d)} vectors in d dimensions at angle 60° from each other
    • ⇒ probability that the adversary wins ≤ 2^{-Ω(d)}
  [Figure: two reflections x_1, x_2 of x, at angle 2θ apart and at distance 2r·sinθ from each other]

  17. The general case… n points
  • The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x′_1; T = 1; flat prior
  • Reflections do not work – too many constraints
  • A more direct argument – examine the posterior distribution on x_1
  • Let Z = { p ∈ ℝ^d | p is a legal pre-image for x′_1 },
    Q = { p | if x_1 = p then x_1 is isolated by q }
  • We show that Pr[Q ∩ Z | x′_1] ≤ 2^{-Ω(d)} · Pr[Z | x′_1]
  • Pr[x_1 ∈ Q ∩ Z | x′_1] = (probability mass contribution from Q ∩ Z)/(contribution from Z) ≤ 2^{1−d}/(1/4)
  The arithmetic is spelled out below.
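Making the last line explicit (my rendering of the garbled arithmetic, consistent with the 2^{-Ω(d)} claim above):

```latex
\Pr\bigl[x_1 \in Q \cap Z \mid x'_1\bigr]
  = \frac{\text{probability mass of } Q \cap Z}{\text{probability mass of } Z}
  \le \frac{2^{\,1-d}}{1/4}
  = 2^{\,3-d}
  = 2^{-\Omega(d)} .
```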

  18. The general case… n points
  • Z = { p | p is a legal pre-image for x′_1 }
  • Q = { p | x_1 = p is isolated by q }
  • Key observation:
    • As |q − x′| increases, Q becomes larger
    • But a larger distance from x′ implies a smaller probability mass, because x is randomized over a larger area
    • The probability depends only on the solid angle subtended at x′
  [Figure: query q and sanitized point x′ among points x_2, …, x_6, with the regions Z and Q ∩ Z]

  19. The general case… n sanitized points
  • Privacy does not follow immediately from the previous analysis with real points!
  • Problem: the sanitization is non-oblivious – other sanitized points reveal information about x if x is their nearest neighbor
  • Solution: decouple the two kinds of information – from x′ and from the x′_i
  [Figure: dataset split into halves L and R]

  20. The general case… n sanitized points
  • Claim 1 (Privacy for L): given all sanitizations, all points in R, and all but one point in L, the adversary cannot isolate the last point
    • Follows from the proof for n − 1 real points
  • Claim 2 (Privacy for R): given all sanitizations, all points in L, and all but one point in R, the adversary cannot isolate the last point
    • Work in progress
    • Idea: show that the adversary cannot distinguish whether R contains some point x or not (information-theoretic argument)

  21. Results on privacy – an overview

  22. Results on utility – an overview

  23. Learning mixtures of Gaussians (spectral methods)
  • Observation: the top eigenvectors of a matrix span a low-dimensional space that yields a good approximation of complex data sets, in particular Gaussian data
  • Intuition:
    • Sampled points are "close" to the means of the corresponding Gaussians in any subspace
    • The span of the top k singular vectors approximates the span of the means
    • Distances between means of Gaussians are preserved; other distances shrink by a factor of √(k/n)
  • Our goal: show that the same algorithm works for clustering sanitized data
  A minimal sketch of the projection step follows.
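A minimal sketch of the projection this slide describes (NumPy, with my own naming; the clustering step that would follow is omitted):

```python
import numpy as np

def spectral_project(data: np.ndarray, k: int) -> np.ndarray:
    """Project the n x d data matrix onto the span of its top-k right
    singular vectors, which approximately contains the k Gaussian means."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    top_k = vt[:k]                  # k x d matrix with orthonormal rows
    return data @ top_k.T @ top_k   # back in R^d, restricted to the top-k span
```

Clustering the projected points by distance then recovers the mixture components; the slide's goal is that the same pipeline still works when data is the sanitized SDB rather than the RDB.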

  24. Spectral techniques for perturbed data
  • A sanitized point is the sum of two Gaussian variables – sample + noise
  • W.h.p. the 1-radius of a point is less than the "radius" of its Gaussian, so the variance of the noise is small
  • Sanitized points are still close to their means (uses independence of direction)
  • The span of the top k singular vectors still approximates the span of the means of the Gaussians
  • Distances between means are preserved; others shrink

  25. Future directions
  • Extend the privacy argument to other "nice" distributions
    • Can revealing the distribution hurt privacy?
  • Characterize the kind of auxiliary information that is acceptable
    • Depends on the distribution on the data points
  • The low-dimensional case
    • Is it inherently impossible? Dinur & Nissim show impossibility for the 1-dimensional case
  • Extend the utility argument to other interesting macroscopic properties

  26. What about the real world?
  • Lessons from the abstract model
    • High dimensionality is our friend
    • Gaussian/spherically symmetric perturbations seem to be the right thing to do
    • Need to scale different attributes appropriately, so that the data is well rounded
  • Moving towards real data
    • Outliers – our notion of c-isolation deals with them, but the existence of an outlier may be disclosed
    • Discrete attributes – convert them into real-valued attributes, e.g., convert a binary variable into a probability
  One possible reading of these preprocessing steps is sketched below.
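A minimal sketch of the two preprocessing steps (the rescaling rule and the shrinkage weight are my assumptions; the slide only names the goals):

```python
import numpy as np

def standardize(column: np.ndarray) -> np.ndarray:
    """Rescale one real-valued attribute to zero mean and unit variance,
    so that no single attribute dominates the L2 distance."""
    return (column - column.mean()) / column.std()

def binary_to_probability(column: np.ndarray, weight: float = 0.9) -> np.ndarray:
    """Shrink each 0/1 value toward the attribute's empirical frequency,
    yielding a real value in (0, 1) that can then be perturbed like any
    other coordinate."""
    freq = column.mean()
    return weight * column + (1.0 - weight) * freq
```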

  27. Questions?
