
Privacy-Preserving Datamining on Vertically Partitioned Databases

Kobbi Nissim, Microsoft SVC. Joint work with Cynthia Dwork.


Presentation Transcript


  1. Privacy-Preserving Datamining on Vertically Partitioned Databases. Kobbi Nissim, Microsoft SVC. Joint work with Cynthia Dwork.

  2. Privacy and Usability in Large Statistical Databases. Kobbi Nissim, Microsoft SVC. Joint work with Cynthia Dwork.

  3. Data Privacy
  [Diagram: users query the data only through an access mechanism; the question is whether the answers it returns can breach privacy.]

  4. The Data Privacy Game: an Information-Privacy Tradeoff
  • Private functions: e.g. the function DB → d_kobbi (kobbi's row)
  • Information functions: want to reveal f(q, DB) for queries q
  • Explicit definition of the private functions
  • The question: which information functions may be allowed?
  • Crypto (secure function evaluation): want to reveal f(DB) and to hide every function of DB not computable from f(DB); an implicit definition of the private functions
  • Intuition: privacy is breached if it is possible to associate private information with an identity

  5. Model: Statistical Database (SDB)
  [Figure: an n × k binary table; n persons (rows), k attributes (columns); a query selects a subset q of rows and applies f to each selected row.]
  • Database {d_{i,j}}, i ∈ [n], j ∈ [k]
  • Query (q, f): q ⊆ [n], f : {0,1}^k → {0,1}
  • Answer a_{q,f} = Σ_{i∈q} f(d_i)
  • Row distribution D = (D_1, D_2, …, D_n)
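
To make the model concrete, here is a minimal sketch (not code from the talk) of an SDB and the exact, unperturbed answer a_{q,f} = Σ_{i∈q} f(d_i). The database contents, the sizes n and k, and the example predicate are illustrative assumptions.

```python
import random

# Toy statistical database: n persons (rows) x k binary attributes (columns).
# The contents are random placeholders for illustration only.
n, k = 1000, 8
db = [[random.randint(0, 1) for _ in range(k)] for _ in range(n)]

def exact_answer(q, f, db):
    """Exact answer a_{q,f}: count the rows i in q whose attribute vector satisfies f."""
    return sum(f(db[i]) for i in q)

# Example query: over the first 200 rows, how many satisfy "attribute 0 AND attribute 3"?
q = range(200)                       # q is a subset of [n]
f = lambda row: row[0] & row[3]      # f : {0,1}^k -> {0,1}
print(exact_answer(q, f, db))
```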

  6. Perturbation (Randomization Approach)
  • Exact answer to query (q, f): a_{q,f} = Σ_{i∈q} f(d_{i,1}…d_{i,k})
  • Actual SDB answer: â_{q,f}
  • Perturbation E: for all q, f: |â_{q,f} − a_{q,f}| ≤ E
  • Questions: Does perturbation give any privacy? How much perturbation is needed for privacy? What about usability?

  7. Previous Work
  [Figure: perturbation magnitude vs. database size (small / medium / large DB); large perturbation affects usability.]
  • [Dinur, N] considered 1-attribute SDBs:
  • Unlimited adversary: perturbation of magnitude Ω(n) required
  • Polynomial-time adversary: perturbation of magnitude Ω(√n) required
  • In both cases the adversary may reconstruct a good approximation of the database, which disallows even very weak notions of privacy
  • These results hold also for our model!
  • Bounded adversary, restricted to T << n queries (SuLQ):
  • [Dinur, Dwork, N] give a privacy-preserving access mechanism with perturbation magnitude << √n
  • A chance for usability
  • A reasonable model as databases grow larger and larger

  8. Previous Work - Privacy Definitions (1)
  X – data, Y – (noisy) observation of X
  • [Agrawal, Srikant '00] Interval of confidence: let Y = X + noise (e.g. uniform noise in [-100, 100]). Intuition: the larger the interval, the better privacy is preserved. Problematic once knowledge about how X is distributed is taken into account [AA]
  • [Agrawal, Aggarwal '01] Mutual information. Intuition: the smaller I(X;Y) is, the better privacy is preserved. There are examples where privacy is not preserved yet the mutual information shows no trouble [EGS]

  9. Previous Work - Privacy Definitions (2)
  X – data, Y – (noisy) observation of X
  • [Evfimievski, Gehrke, Srikant PODS 03] p1-to-p2 breach: Pr[Q(X)] ≤ p1 and Pr[Q(X)|Y] ≥ p2
  • Amplification = max_{a,b,y} Pr[a → y] / Pr[b → y]; they show a relationship between amplification and p1-to-p2 breaches
  • [Dinur, N PODS 03] A similar approach, described via an adversary
  • Neglects privacy breaches that happen with only negligible probability
  • Somewhat takes into account knowledge gained elsewhere

  10. Privacy and Usability Concerns for the Multi-Attribute Model
  • Rich set of queries: subset sums over any property of the k attributes
  • Obviously increases usability, but how is privacy affected?
  • More to protect: functions of the k attributes
  • Adversary prior knowledge: more possibilities
  • Partial information about the `attacked' row
  • Information gained about other rows
  • Row dependency
  • Data may be vertically split (between k or fewer databases):
  • Can privacy still be maintained with independently operating databases?
  • How is usability affected?

  11. Privacy Definition - Intuition
  [Diagram: the adversary uses all information gained so far to choose i, g.]
  • 3-phase adversary
  • Phase 0: define a target set G of poly(n) functions g : {0,1}^k → {0,1}; the adversary will try to learn some of this information about someone
  • Phase 1: adaptively query the database T = o(n) times
  • Phase 2: choose an index i of a row it intends to attack and a function g ∈ G
  • Attack: try to guess g(d_{i,1}…d_{i,k}), given d_{-i}

  12. Privacy Definition
  • p^0_{i,g} – the a-priori probability that g(d_{i,1}…d_{i,k}) = 1, assuming the adversary only knows the underlying distributions D_1…D_n
  • p^T_{i,g} – the a-posteriori probability that g(d_{i,1}…d_{i,k}) = 1, given the answers to the T queries and d_{-i}
  • Define conf(p) = log(p / (1−p)); proved useful in [DN03]; it is possible to rewrite our definitions directly in terms of probabilities
  • conf_{i,g} = conf(p^T_{i,g}) − conf(p^0_{i,g})
  • (δ, T)-privacy: for all distributions D_1…D_n, every row i, every function g, and any adversary making at most T queries: Pr[conf_{i,g} > δ] = neg(n)
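
As a small illustration of the definition (not code from the talk), conf is just the log-odds of a probability, and the quantity the definition bounds is the gain in log-odds between the adversary's prior and posterior beliefs about g(d_i). The numeric values below are made up.

```python
import math

def conf(p):
    """Log-odds ("confidence") of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical prior and posterior beliefs that g(d_i) = 1 for some row i and target g.
p_prior = 0.50       # a-priori, from the row distribution D_i alone
p_posterior = 0.55   # a-posteriori, after seeing T noisy answers and d_{-i}

gain = conf(p_posterior) - conf(p_prior)
print(f"confidence gain = {gain:.4f}")  # (delta, T)-privacy: this exceeds delta only with negligible probability
```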

  13. Notes on the Privacy Definition
  • Somewhat models knowledge the adversary may acquire `out of the system'
  • A different distribution per person (smoking/non-smoking)
  • The i-th row's privacy is preserved even when d_{-i} is given
  • Relative privacy: compares a-priori and a-posteriori knowledge
  • Privacy achieved:
  • For k = O(log n): bounded loss of privacy of the property g(d_{i,1},…,d_{i,k}) for all Boolean functions g and all i
  • Larger k: bounded loss of privacy of g(d_i) for any member g of a pre-specified poly-sized set of target functions

  14. The SuLQ Database
  • Adversary restricted to T << n queries
  • On query (q, f):
  • q ⊆ [n]
  • f : {0,1}^k → {0,1}
  • Let a_{q,f} = Σ_{i∈q} f(d_{i,1}…d_{i,k})
  • Let N ∼ Binomial(0, T) (zero-mean binomial noise)
  • Return a_{q,f} + N
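
A minimal sketch of the answering mechanism described on this slide, reusing the toy database from the earlier sketch. The noise here is a zero-centered Binomial(T, 1/2), and the choice T = n^0.4 is an illustrative stand-in for "T << n"; neither is the talk's exact parameterization.

```python
import random

n, k = 1000, 8
db = [[random.randint(0, 1) for _ in range(k)] for _ in range(n)]  # toy data, as before
T = int(n ** 0.4)  # illustrative query budget, T << n

def sulq_answer(q, f, db, T):
    """Noisy SuLQ answer: exact subset count plus zero-mean binomial noise of magnitude ~sqrt(T)."""
    exact = sum(f(db[i]) for i in q)
    noise = sum(random.randint(0, 1) for _ in range(T)) - T / 2.0  # Binomial(T, 1/2), centered at 0
    return exact + noise

q = range(500)
f = lambda row: row[1]            # statistics of a single attribute
print(sulq_answer(q, f, db, T))   # close to the exact count; the perturbation is o(sqrt(n))
```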

  15. Privacy Analysis of the SuLQ Database
  • p^m_{i,g} – the a-posteriori probability that g(d_{i,1}…d_{i,k}) = 1, given d_{-i} and the answers to the first m queries
  • conf(p^m_{i,g}) describes a random walk on the line with:
  • Starting point: conf(p^0_{i,g})
  • Compromise: conf(p^m_{i,g}) − conf(p^0_{i,g}) > δ
  • W.h.p. more than T steps are needed to reach compromise
  [Figure: a line marked at conf(p^0_{i,g}) and at conf(p^0_{i,g}) + δ; the walk starts at the former and must reach the latter before privacy is compromised.]
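
The argument is that each noisy answer nudges the adversary's log-odds by a small, roughly unbiased step, so accumulating a gain of δ takes many steps. The toy simulation below only illustrates that generic random-walk fact; the step distribution (a zero-mean Gaussian with small standard deviation) is an assumption for illustration, not the actual per-query confidence increment from the analysis.

```python
import random
import statistics

def steps_to_compromise(delta, step_std, max_steps=100_000):
    """Steps until a zero-mean random walk first exceeds delta (capped at max_steps)."""
    walk = 0.0
    for m in range(1, max_steps + 1):
        walk += random.gauss(0.0, step_std)  # illustrative per-query confidence increment
        if walk > delta:
            return m
    return max_steps

# With small unbiased steps, crossing delta typically takes on the order of (delta/step_std)^2 steps,
# i.e. far more than the adversary's query budget T when each answer moves the belief only slightly.
trials = [steps_to_compromise(delta=1.0, step_std=0.02) for _ in range(21)]
print("median steps to compromise:", statistics.median(trials))
```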

  16. Usability (1): One multi-attribute SuLQ DB
  [Figure: a single n × k binary table.]
  • Statistics of any property f of the k attributes
  • I.e. for what fraction of the (sub)population does f(d_1…d_k) hold?
  • Easy: just put f in the query

  17. Usability (2): k independent multi-attribute SuLQ DBs
  [Figure: separate binary tables over the same population.]
  • α implies β in probability: Pr[β|α] ≥ Pr[β] + ε
  • Estimate ε within constant additive error
  • Learn statistics for any conjunct of two attributes: Pr[α ∧ β] = Pr[α] · (Pr[β] + ε)
  • Principal Component Analysis?
  • Statistics for any Boolean function f of the two attribute values (e.g. their OR or XOR)
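
As a worked instance of the identity above (with made-up numbers, not figures from the talk): if the database holding α reports Pr[α] ≈ 0.4, the database holding β reports Pr[β] ≈ 0.3, and the implication tester estimates ε ≈ 0.1, then Pr[α ∧ β] = Pr[α] · (Pr[β] + ε) ≈ 0.4 · 0.4 = 0.16, even though no single database can evaluate the conjunct directly.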

  18. Probabilistic Implication
  • α implies β in probability: Pr[β|α] ≥ Pr[β] + ε
  • We construct a tester for distinguishing ε < ε1 from ε > ε2 (for constants ε1 < ε2)
  • Estimating ε follows by standard methods
  • In the analysis we consider deviations from an expected value of magnitude √n
  • As the perturbation is << √n, it does not mask out these deviations

  19. (2-1)sqrt(n) Pr[aq,] aq, 0 1 Probabilistic Implication –The Tester • Pr[|] ≥ Pr[]+ • Distinguishing <1from >2: • Find a query q s.t. aq,> |q| p+sqrt(n) • Let bias= aq,- |q|  p • Issue query (q, ) • If aq, >threshold(bias, p, 1) output 1 <1 >2 random q

  20. Usability (3): Vertically Partitioned SuLQ DBs
  [Figure: the k attributes split into two tables of k1 and k2 attributes, queried with functions f1 and f2 whose outputs are combined by f.]
  • E.g. k = k1 + k2
  • Learn statistics for any property f that is a Boolean function of the results f1, f2 computed by the two databases

  21. Usability (4): Published Statistics
  • Model: a trusted party (e.g. the Census Bureau) collects confidential information and publishes aggregate statistics
  • Let d << k
  • Repeat t times: choose a (pseudo)random q and publish the SuLQ answer (noisy statistics) for all d-ary conjuncts over the k attributes
  • E.g. for d = 3: (q, α1∧α2∧α3), (q, ¬α1∧α2∧α3), …, (q, α_{k−2}∧α_{k−1}∧α_k); (q′, α1∧α2∧α3), …, (q′, α_{k−2}∧α_{k−1}∧α_k); …
  • Total of t · C(k,d) · 2^d published numbers

  22. Usability (4): Published Statistics (cont.)
  • A dataminer can now compute statistics for all 2d-ary conjuncts: e.g. to compute Pr[α1∧α4∧α7∧α11∧α12∧α15], run the probabilistic-implication tester on α1∧α4∧α7 and α11∧α12∧α15
  • Hence, the dataminer can compute statistics for all 2d-ary Boolean functions
  • t is picked such that, with high probability, the statistics for all such functions are estimated within a small additive error
  • Savings: t · C(k,d) · 2^d published numbers vs. one number per 2d-ary conjunct; for constant error and confidence parameters this is O(2^{5d} k^d d^2 log d) vs. O(2^{2d} k^{2d})
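
To make the bookkeeping concrete, here is a small counting sketch (with made-up values of k, d, and t) comparing the number of published SuLQ answers with the number of 2d-ary conjunct statistics a miner can derive from them. The specific t below is arbitrary; in the talk t is determined by the desired accuracy and confidence, and the savings are asymptotic in k (published numbers grow like k^d, the derivable table like k^{2d}).

```python
from math import comb

k, d = 1000, 2     # illustrative attribute count and conjunct arity
t = 10_000         # illustrative number of published rounds (set by accuracy/confidence in the talk)

published = t * comb(k, d) * 2 ** d        # noisy answers released: t rounds x all d-ary conjuncts (with negations)
derivable = comb(k, 2 * d) * 2 ** (2 * d)  # 2d-ary conjunct statistics the miner can now estimate from them

print(f"published numbers                   : {published:,}")
print(f"derivable 2d-ary conjunct statistics: {derivable:,}")
```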

  23. Summary
  • A strong privacy definition and a rigorous privacy proof for SuLQ
  • Extends the Dinur-Dwork-Nissim observation that privacy may be preserved in large databases
  • Usability for the dataminer: the single-database case, vertically split databases, and positive indications regarding published statistics
  • Preserving privacy while enabling usability

  24. Open Questions (1)
  • Privacy definition - what's the next step? Goal: cover everything a realistic adversary may do
  • Improve usability/efficiency/…: is there an alternative way to perturb and use the data that would result in more efficient/accurate datamining? Same for datamining published statistics
  • Datamining 3-ary Boolean functions from single-attribute SuLQ DBs: our method does not seem to extend to ternary functions

  25. Open Questions (2)
  • Maintaining privacy of all possible functions: cryptographic measures???
  • New applications for our confidence analysis
  • Self-auditing? Decide whether to allow a query based on previous `good' queries and their answers (but not the DB contents). How to compute conf? Approximation?
