Helping Kinsey Compute

Helping Kinsey Compute Cynthia Dwork Microsoft Research

The Problem • Exploit data, e.g., a medical insurance database • Does smoking contribute to heart disease? • Was there a rise in asthma emergency room cases this month? • What fraction of the admissions during 2004 were men 25-35? • …while preserving privacy of individuals

Holistic Statistics • Is the dataset well clustered? • What is the single best predictor for risk of stroke? • How are attributes X and Y correlated; what is the cov(X,Y)? • Are the data inherently low-dimensional?

Statistical Database • Database (D_1, …, D_n) • Query (f, S), where f: row → [0,1] and S ⊆ [n] • Exact answer: Σ_{r∈S} f(row r) • Response: exact answer + noise

Statistical Database • Under control of interlocutor: • Noise generation • Number of queries T permitted

Why Bother With Noise? • Limiting the interface to queries about large sets is insufficient: • Take A = {1, …, n} and B = {2, …, n} • Then Σ_{a∈A} f(row a) − Σ_{b∈B} f(row b) = f(row 1)
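The differencing attack on this slide can be checked in a few lines. This is a minimal sketch on made-up data; `rows`, `exact_sum`, and `identity` are hypothetical names introduced for illustration, and indices are zero-based, so the slide's row 1 is `rows[0]`.

```python
# Toy database with one sensitive bit per row (made-up data).
rows = [1, 0, 1, 1, 0, 0, 1, 0]

def exact_sum(f, subset):
    """Answer the query (f, S) exactly, with no noise added."""
    return sum(f(rows[i]) for i in subset)

def identity(r):
    return r

n = len(rows)
A = range(0, n)  # every row: the slide's {1, ..., n}, zero-indexed
B = range(1, n)  # every row but the first: the slide's {2, ..., n}

# Two "large set" queries whose difference is exactly one person's value.
leaked = exact_sum(identity, A) - exact_sum(identity, B)
print(leaked == rows[0])  # True
```

Both queries touch almost the whole database, yet their difference is a single row, which is why restricting queries to large sets does not help without noise.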

Previous (Modern) Work in this Model • Dinur, Nissim [2003] • Single binary attribute (query function f = identity) • Non-privacy: whp adversary guesses a 1−ε fraction of the rows • Theorem: Polytime non-privacy if whp |noise| is o(√n) • Theorem: Privacy with o(√n) noise if #queries is << n • Privacy “for free”! • Rows ∼ samples from underlying distribution: Pr[row i = 1] = p • E[# 1’s] = pn, Var = Θ(n) • Actual # 1’s ∼ pn ± Θ(√n) • |Privacy-preserving noise| is o(sampling error)

Real Power in this Model • Dwork, Nissim [2004] • Multiple binary attributes: q = (S, f), f: {0,1}^d → {0,1} • Definition of privacy appropriate to enriched query set • Theorem: Privacy with o(√n) noise if #queries is << n • Coined term SuLQ • Vertically Partitioned Databases • Learn joint statistics from independently operated SuLQ databases: • Given SuLQ_A, SuLQ_B, learn if A implies B in probability • E.g., heart disease risk increases with smoking • Enables learning statistics for all Boolean fns of attributes

Still More Power [Blum, Dwork, McSherry, Nissim 05] • Extend Privacy Proofs • Real-valued functions f: [0,1]^d → [0,1] • Per row analysis: drop dependence on n! • How many queries has THIS row participated in? • Our Data, Ourselves • Holistic Statistics: A Calculus of Noisy Computation • Beyond statistics: • (not too) noisy versions of k-means, perceptron, ID3 algs • (not too) noisy optimal projections: SVD, PCA • All of STAT learning

Towards Defining Privacy: “Facts of Life” vs Privacy Breach • Diabetes is more likely in obese persons • Does not imply THIS obese person has or will have diabetes • Sneaker color preference is correlated with political party • Does not imply THIS person in red sneakers is a Republican • Half of all marriages result in divorce • Does not imply Pr [ THIS marriage will fail ] = ½

(ε, T)-Privacy • Power of adversary: • Phase 0: Specify a goal function g: row → {0,1} (actually, a polynomial number of functions); the adversary will try to learn this information about someone • Phase 1: Adaptively make T queries • Phase 2: Choose a row i to attack; get entire database except for row i • Privacy breach: occurs if adversary’s “confidence” in g(row i) changes by ε • Notes: • Adversary chooses goal • My privacy is preserved even if everybody else tells their secrets to the adversary

Flavor of Privacy Proofs • Define confidence in value of g(row i): • c_0 = log[p_0/(1−p_0)] • 0 when p_0 = ½; its magnitude skyrockets as p_0 moves toward 0 or 1 • Model evolution of confidence as a martingale • Argue expected difference at each step is small • Compute absolute upper bound on difference • Plug these two parameters into Azuma’s inequality • Obtain probabilistic statement regarding change in confidence, equivalently, change from prior to posterior probabilities about value of g(row i)
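The confidence measure on this slide is the log-odds of a probability. The sketch below assumes the natural logarithm (the slide does not state the base); `confidence`, `probability`, and the bound `eps` are hypothetical names used only for illustration.

```python
import math

def confidence(p):
    """Log-odds of a probability p, i.e. log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

def probability(c):
    """Inverse map: recover the probability from a confidence value c."""
    return 1.0 / (1.0 + math.exp(-c))

prior = 0.5
eps = 0.1  # hypothetical bound on how far the confidence may rise

print(confidence(prior))                     # 0.0 at p = 1/2
print(probability(confidence(prior) + eps))  # largest posterior the bound allows
```

A bound on the change in confidence therefore translates directly into a bound on how far the posterior can drift from the prior.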

Remainder of This Talk • Description of SuLQ Algorithm + Statement of Main Theorem • Examples • k means • SVD, PCA • Perceptron • STAT learning • Vertically Partitioned Data • Determining if α ⇒ β in probability: Pr[β|α] ≥ Pr[β] + Δ when α and β are in different SuLQ databases • Summary

The SuLQ Algorithm • Algorithm: • Input: query (S ⊆ [n], f: [0,1]^d → [0,1]) • Output: Σ_{i∈S} f(row i) + N(0, R) • Theorem: ∀ δ, with probability at least 1−δ, choosing R > 32 log(2/δ) log(T/δ) T/ε² ensures that for each (target, predicate) pair, after T queries the probability that the confidence has increased by more than ε is at most δ. • R is independent of n. Bigger n means better stats.
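The query primitive on this slide can be sketched as a small class. This is an illustrative sketch, not the authors' implementation: the class name `SuLQDatabase`, the `seed` parameter, and the hard stop after T queries are assumptions. R is treated as the variance of the Gaussian noise, consistent with the summary slide's remark that the noise variance grows with the number of queries.

```python
import math
import random

class SuLQDatabase:
    """Answers sum queries with additive Gaussian noise N(0, R)."""

    def __init__(self, rows, R, T, seed=None):
        self.rows = list(rows)
        self.R = R      # noise variance
        self.T = T      # number of queries permitted
        self.used = 0
        self.rng = random.Random(seed)

    def query(self, S, f):
        """Return the sum of f(row i) over i in S, plus fresh noise."""
        if self.used >= self.T:
            raise RuntimeError("query budget T exhausted")
        self.used += 1
        exact = sum(f(self.rows[i]) for i in S)
        return exact + self.rng.gauss(0.0, math.sqrt(self.R))

db = SuLQDatabase([0.2, 0.4, 0.6, 0.8], R=0.25, T=3, seed=1)
print(db.query([0, 1, 2, 3], lambda r: r))  # exact answer 2.0, perturbed
```

The caller is responsible for supplying an f with values in [0,1]; the sketch does not check that.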

k Means Clustering • physics, OR, machine learning, data mining, etc.

SuLQ k Means • Estimate size of each cluster • Estimate average of points in cluster: • Estimate their sum; and • Divide estimated sum by estimated size

Side by Side: k Means and SuLQ k-Means
k Means basic step: • Input: data points p_1, …, p_n and k ‘centers’ c_1, …, c_k in [0,1]^d • S_j = points for which c_j is the closest center • Output: c’_j = average of points in S_j, j = 1, …, k
SuLQ k Means basic step: • Input: data points p_1, …, p_n and k ‘centers’ c_1, …, c_k in [0,1]^d • s_j = SuLQ( f(d_i) := 1 if j = arg min_ℓ ||c_ℓ − d_i||, 0 otherwise ) • μ’_j = SuLQ( f(d_i) := d_i if j = arg min_ℓ ||c_ℓ − d_i||, 0 otherwise ) / s_j • k(1+d) queries total
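The noisy basic step can be sketched as follows: one query for each cluster size and d queries for each coordinate-wise sum, giving k(1+d) queries in total. This is a sketch under stated assumptions: `sulq`, `nearest`, and `sulq_kmeans_step` are hypothetical names, R is taken to be the noise variance, and the guard that keeps the old center for very small clusters is an addition that is not on the slide.

```python
import math
import random

def sulq(rows, f, R, rng):
    """Noisy sum of f over all rows; R is the variance of the Gaussian noise."""
    return sum(f(r) for r in rows) + rng.gauss(0.0, math.sqrt(R))

def nearest(point, centers):
    """Index of the center closest to point in Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def sulq_kmeans_step(points, centers, R, rng):
    """One noisy update of all k centers, using k(1 + d) queries."""
    d = len(centers[0])
    updated = []
    for j, old in enumerate(centers):
        def in_cluster(p, j=j):
            return 1.0 if nearest(p, centers) == j else 0.0
        s_j = sulq(points, in_cluster, R, rng)  # 1 query: noisy cluster size
        sums = [sulq(points, lambda p, c=c: p[c] * in_cluster(p), R, rng)
                for c in range(d)]              # d queries: noisy coordinate sums
        # Guard (an assumption, not on the slide): keep the old center when the
        # noisy size is not safely positive, i.e. the cluster is too small.
        if s_j > math.sqrt(R):
            updated.append([x / s_j for x in sums])
        else:
            updated.append(list(old))
    return updated

rng = random.Random(0)
points = [[rng.random() * 0.2, rng.random() * 0.2] for _ in range(500)]
points += [[0.8 + rng.random() * 0.2, 0.8 + rng.random() * 0.2] for _ in range(500)]
print(sulq_kmeans_step(points, [[0.3, 0.3], [0.6, 0.6]], R=4.0, rng=rng))
```

With 500 points per cluster and noise of standard deviation 2, the printed centers should land close to the true cluster means.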

Small Error! • For each 1 ≤ j ≤ k, if |S_j| >> R^{1/2} then with high probability ||μ’_j − c’_j|| is O( (||μ_j|| + d^{1/2}) R^{1/2} / |S_j| ). • Inaccuracies: • Estimating |S_j| • Summing points in S_j • Even with just the first: (1/s_j − 1/|S_j|) Σ_{i∈S_j} d_i = (1/s_j − 1/|S_j|)(μ_j |S_j|) = ((|S_j| − s_j)/s_j) μ_j ≈ (noise/size) μ_j

Reducing Dimensionality • Reduce dimensionality in a dataset while retaining those characteristics that contribute most to its variance • Find Optimal Linear Projections • Latent semantic indexing, spectral clustering, etc., employ best rank k approximations to A • Singular Value Decomposition uses top k eigenvectors of A^T A • Principal Component Analysis uses top k eigenvectors of cov(A) • Approach • Approximate A^T A and cov(A) using SuLQ, then compute eigenvectors

Optimal Projections • A^T A = Σ_i d_i^T d_i • μ = (Σ_i d_i)/n • cov(A) = Σ_i (d_i − μ)^T (d_i − μ) • SuLQ( f(i) = d_i^T d_i ) = A^T A + N(0,R)^{d×d} • μ’ = SuLQ( f(i) = d_i )/n • SuLQ( f(i) = (d_i − μ’)^T (d_i − μ’) ) • d² and d²+d queries, respectively
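The two noisy matrices can be sketched entry by entry, one query per entry, which gives the d² and d²+d query counts stated on the slide. The helper names are hypothetical, R is taken to be the noise variance, and the power iteration on the symmetrized matrix is a stand-in chosen for this sketch (it recovers only the top eigenvector); the slide does not say how the eigenvectors are computed.

```python
import math
import random

def sulq(rows, f, R, rng):
    """Noisy sum of f over all rows; R is the variance of the Gaussian noise."""
    return sum(f(r) for r in rows) + rng.gauss(0.0, math.sqrt(R))

def sulq_gram(rows, R, rng):
    """Noisy A^T A: one query per matrix entry, d^2 queries in all."""
    d = len(rows[0])
    return [[sulq(rows, lambda r, a=a, b=b: r[a] * r[b], R, rng)
             for b in range(d)] for a in range(d)]

def sulq_cov(rows, R, rng):
    """Noisy unnormalized covariance: d queries for the mean, then d^2 more."""
    n, d = len(rows), len(rows[0])
    mu = [sulq(rows, lambda r, a=a: r[a], R, rng) / n for a in range(d)]
    return [[sulq(rows, lambda r, a=a, b=b: (r[a] - mu[a]) * (r[b] - mu[b]), R, rng)
             for b in range(d)] for a in range(d)]

def top_eigenvector(M, iters=200):
    """Power iteration on (M + M^T)/2; the noisy matrix need not be symmetric."""
    d = len(M)
    S = [[(M[a][b] + M[b][a]) / 2.0 for b in range(d)] for a in range(d)]
    v = [1.0 / (a + 1) for a in range(d)]
    for _ in range(iters):
        w = [sum(S[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rng = random.Random(0)
data = [[t, 0.5 * t + 0.1 * rng.random()] for t in (rng.random() for _ in range(1000))]
print(top_eigenvector(sulq_cov(data, R=4.0, rng=rng)))
```

Note that the centered products can be negative, so the query functions here do not stay inside [0,1]; the sketch follows the slide's formulas as written and does not rescale.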

Perceptron [Rosenblatt 57] • Input: n points p_1, …, p_n in [−1,1]^d, and labels b_1, …, b_n in {−1,1} • Assumed linearly separable, with a plane through the origin • Initialize w randomly • ⟨w, p⟩ b > 0 iff label b agrees with sign of ⟨w, p⟩ • While ∃ labeled point (p_i, b_i) s.t. ⟨w, p_i⟩ b_i ≤ 0, set w = w + p_i·b_i • Output: w

SuLQ Perceptron • Initialize w = 0^d and s = n. • Repeat while s >> R^{1/2}: • Count the misclassified rows (1 query): s = SuLQ( f(d_i) := 1 if ⟨d_i, w⟩ b_i ≤ 0 and 0 otherwise ) • Synthesize a misclassified vector (d queries): v = SuLQ( f(d_i) := b_i d_i if ⟨d_i, w⟩·b_i ≤ 0 and 0 otherwise ) / s • Update w: set w = w + v • Return the final value of w.
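The loop on this slide can be sketched as below. The names are hypothetical, R is taken to be the noise variance, and the stopping rule "s >> R^{1/2}" is replaced by the concrete test s > factor·√R with an arbitrary `factor`; that constant, and the `max_rounds` cap, are assumptions of this sketch.

```python
import math
import random

def sulq(rows, f, R, rng):
    """Noisy sum of f over all rows; R is the variance of the Gaussian noise."""
    return sum(f(r) for r in rows) + rng.gauss(0.0, math.sqrt(R))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sulq_perceptron(points, labels, R, rng, factor=10.0, max_rounds=1000):
    rows = list(zip(points, labels))
    d = len(points[0])
    w = [0.0] * d
    for _ in range(max_rounds):
        def wrong(row):
            p, b = row
            return 1.0 if dot(p, w) * b <= 0 else 0.0
        s = sulq(rows, wrong, R, rng)  # 1 query: noisy count of misclassified rows
        if s <= factor * math.sqrt(R):
            break                      # too few mistakes left to see through the noise
        # d queries: noisy average of b_i * d_i over the misclassified rows
        v = [sulq(rows, lambda row, c=c: row[1] * row[0][c] * wrong(row), R, rng) / s
             for c in range(d)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

rng = random.Random(0)
pts = [[rng.uniform(0.2, 1.0), rng.uniform(0.2, 1.0)] for _ in range(500)]
pts += [[-x, -y] for x, y in pts]
lbl = [1] * 500 + [-1] * 500
print(sulq_perceptron(pts, lbl, R=4.0, rng=rng))
```

Each round costs 1 + d queries, so the round bound on the next slide determines how large T must be.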

How Many Rounds? • Theorem: If there exists a unit vector w’ and scalar γ such that for all i, ⟨w’, d_i⟩ b_i ≥ γ, and for all j, γ >> (dR)^{1/2}/|S_j|, then with high probability the algorithm terminates in at most 32 max_i |d_i|² / γ² rounds. • |S_j| = number of misclassified vectors at iteration j • In each round j, ⟨w’, w⟩ increases by more than |w| does. • Since ⟨w’, w⟩ ≤ |w’|·|w| = |w|, this must stop; otherwise ⟨w’, w⟩ would overtake |w|.

The Statistical Queries Learning Model [Kearns93] • Concept c: {0,1}^d → {0,1} • Distribution D on {0,1}^d • STAT(c, D) oracle • Query: (p, τ) where p: {0,1}^{d+1} → {0,1} and τ = 1/poly(d) • Answer: Pr_{x∼D}[p(x, c(x))] + ζ for |ζ| ≤ τ

Capturing STAT • Each row contains a labeled example (x, c(x)) • Input: predicate p and accuracy τ • Initialize tally = 0. • Reduce variance: repeat t ≥ R/(τn)² times: tally = tally + SuLQ( f(d_i) := p(d_i) ) • Output: tally / (tn)
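The simulation of a statistical query can be sketched as below. It assumes the repetition count on the slide reads t ≥ R/(τn)², with τ the requested accuracy and R the noise variance: the average of t noisy sums, divided by n, then has standard deviation √R/(√t·n) ≤ τ. The function and parameter names are hypothetical.

```python
import math
import random

def sulq(rows, f, R, rng):
    """Noisy sum of f over all rows; R is the variance of the Gaussian noise."""
    return sum(f(r) for r in rows) + rng.gauss(0.0, math.sqrt(R))

def stat_via_sulq(rows, p, tau, R, rng):
    """Estimate the fraction of rows satisfying predicate p to accuracy about tau."""
    n = len(rows)
    t = max(1, math.ceil(R / (tau * n) ** 2))  # repetitions needed to tame the noise
    tally = 0.0
    for _ in range(t):
        tally += sulq(rows, lambda row: 1.0 if p(row) else 0.0, R, rng)
    return tally / (t * n)

rng = random.Random(0)
# Each row is a labeled example (x, c(x)); here c(x) = x on half of the rows.
examples = [(i % 2, i % 2) for i in range(500)] + [(i % 2, 1 - i % 2) for i in range(500)]
print(stat_via_sulq(examples, lambda row: row[0] == row[1], tau=0.1, R=100.0, rng=rng))
```

For large n a single query already suffices, which matches the earlier remark that a bigger database means better statistics at the same noise level.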

Capturing STAT • Theorem: For any algorithm that α-learns a class C using at most q statistical queries of accuracy {τ_1, …, τ_q}, the adapted algorithm can α-learn C on a SuLQ database of n elements, provided that n² ≥ [R log(q/δ)/(T−q)] × Σ_{j≤q} 1/τ_j²

Probabilistic Implication: Two SuLQ Databases • α implies β in probability: Pr[β|α] ≥ Pr[β] + Δ • Construct a tester for distinguishing Δ < Δ_1 from Δ > Δ_2 (for constants Δ_1 < Δ_2) • Estimate Δ by binary search • In the analysis we consider deviations from an expected value, of magnitude Ω(√n) • As perturbation << √n, it does not mask out these deviations • Results generalize to functions α and β of attributes in two distinct SuLQ databases

Key Insight: Test for Pr[β|α] ≥ Pr[β] + Δ • Assume T chosen so that noise = o(√n). • Find a “heavy” set S for α: a subset of rows that has more than |S|a + [a(1−a)|S|]^{1/2} ones in the α database. Here, a = Pr[α] and |S| = Ω(n). • That is, find S s.t. a_{S,α} > |S|a + √[|S| a(1−a)]. Let excess = a_{S,α} − |S|a. Note that excess is Ω(n^{1/2}). • Query the SuLQ database for β on S: if a_{S,β} ≥ |S| Pr[β] + excess·(Δ/(1−a)) then return 1, else return 0 • If Δ is constant then noise is too small to hide the correlation.
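The two-database test can be sketched as follows, writing α and β for the two attributes and Δ for the gap being tested (`delta` in the code); these symbol names, the helper names, and the strategy of drawing random subsets until a heavy one turns up are assumptions of this sketch. The values a = Pr[α] and b = Pr[β] are taken as known inputs, and R is the noise variance.

```python
import math
import random

def sulq(column, S, R, rng):
    """Noisy count of ones among the rows in S; R is the noise variance."""
    return sum(column[i] for i in S) + rng.gauss(0.0, math.sqrt(R))

def find_heavy_set(alpha_col, a, size, R, rng, tries=1000):
    """Draw random subsets until one holds unusually many alpha-ones."""
    n = len(alpha_col)
    for _ in range(tries):
        S = rng.sample(range(n), size)
        count = sulq(alpha_col, S, R, rng)
        if count > size * a + math.sqrt(size * a * (1 - a)):
            return S, count - size * a  # the set and its excess
    raise RuntimeError("no heavy set found")

def implies_in_probability(beta_col, S, excess, a, b, delta, R, rng):
    """Return 1 if beta is over-represented on S by the amount delta predicts."""
    count = sulq(beta_col, S, R, rng)
    return 1 if count >= len(S) * b + excess * (delta / (1 - a)) else 0

rng = random.Random(0)
alpha = [i % 2 for i in range(400)]  # database A: Pr[alpha] = 1/2
same = list(alpha)                   # database B when beta equals alpha
flipped = [1 - x for x in alpha]     # database B when beta is the negation of alpha
S, excess = find_heavy_set(alpha, 0.5, 200, R=0.0, rng=rng)
print(implies_in_probability(same, S, excess, 0.5, 0.5, 0.25, R=0.0, rng=rng))     # 1
print(implies_in_probability(flipped, S, excess, 0.5, 0.5, 0.25, R=0.0, rng=rng))  # 0
```

The factor Δ/(1−a) is the expected gain in β-ones per extra α-row: swapping a non-α row for an α row raises the chance of β by Pr[β|α] − Pr[β|¬α], which equals Δ/(1−a) when Pr[β|α] = Pr[β] + Δ. The usage example sets R = 0 so the outcome is deterministic.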

Summary • SuLQ framework for privacy-preserving statistical databases • real-valued query functions • Variance for noise depends (roughly linearly) on number of queries, not size of database • Examples of power of SuLQ calculus • Vertically Partitioned Databases

Sources • C. Dwork and K. Nissim, Privacy-Preserving Datamining on Vertically Partitioned Databases • A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical Privacy: The SuLQ Framework • See http://research.microsoft.com/research/sv/DabasePrivacy
