
Private Statistics: A TCS Perspective



  1. Private Statistics: A TCS Perspective Gautam Kamath, Simons Institute / University of Waterloo. Data Privacy: Foundations and Applications Boot Camp, January 29, 2019

  2. Outline • Setting and Goals • Hypothesis Testing • Distribution Estimation • (Some) Other Statistical Tasks

  3. Algorithms vs. Statistics • Algorithms: dataset X → algorithm M → “utility” • Statistics: distribution P → random sampling → dataset X → algorithm M → “utility”

  4. Privacy in Statistics • Statistics desiderata: • Algorithm is accurate (with high probability over the random samples X ~ P) • May require assumptions about P to hold • Algorithm is private (always, for every input dataset) • This talk: ε-differentially private (usually)

  5. Privacy and Utility • Privacy ↑ ⇒ Utility ↓ • Privacy ↓ ⇒ Utility ↑

  6. Why Worst-Case Privacy? • Average-case notions can violate privacy of outliers • Example (salaries): add noise, then release the dataset; an outlier’s salary still stands out

  7. Why Worst-Case Privacy? • Can violate privacy of outliers • Statistics can be retracted, private information can’t

  8. Privacy? • M is ε-DP if, for all inputs X, X′ which differ on one entry and all output sets S: Pr[M(X) ∈ S] ≤ e^ε · Pr[M(X′) ∈ S] • This talk: less sensitive statistics are cheaper to privatize • Sensitivity of f: Δf = max |f(X) − f(X′)| • Biggest difference of f on two neighboring datasets
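A minimal sketch of the mechanism this definition suggests (not from the slides; the clamping range and function names are illustrative): a statistic's sensitivity determines how much Laplace noise suffices for ε-DP.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_mean(data, lo, hi, eps):
    # Clamp each record to [lo, hi]; the mean then has sensitivity
    # (hi - lo) / n, so Laplace noise of scale (sensitivity / eps)
    # gives an eps-DP release.
    n = len(data)
    clamped = [min(max(x, lo), hi) for x in data]
    sensitivity = (hi - lo) / n
    return sum(clamped) / n + laplace_noise(sensitivity / eps)
```

The clamping step is what bounds the sensitivity; without a known range, the mean of unbounded data has unbounded sensitivity.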

  9. “Utility”? How much data is needed to approximately infer a property of the underlying distribution with high probability? With n(α, β) samples, the error is at most α with probability at least 1 − β. Probably Approximately Correct (PAC) Learning [L. Valiant ’84]

  10. The Cost of Privacy • How much more data is needed to guarantee privacy? • Sample complexity: n = n_stat + n_priv • n_stat: non-private cost • n_priv: additional cost due to privacy • In what situations is n_priv ≪ n_stat?

  11. Asymptotic: One word, two meanings • Asymptotic statistics • Guarantees when sample size approaches infinity • Example: “As n → ∞, the statistic converges in distribution.” • Doesn’t quantify “error” for finite n • Asymptotics for computer scientists • Hide constant factors and lower-order terms • Example: “To achieve accuracy α, we require O(1/α²) samples.” • Read: require C/α² samples, for some fixed (known, but hidden) constant C • Is “asymptotic optimality” good enough?

  12. “Sample Complexity” versus “Rates” • Theorem, two phrasings: Given n i.i.d. samples from P, there exists an ε-private algorithm that, with probability ≥ 1 − β: • Sample complexity: with n(α, β, ε) samples, the error is at most α • Rate: for each n, the error is bounded by some function of n, β, and ε

  13. Hypothesis Testing

  14. Hypothesis Testing • Given a dataset X, was it generated from a distribution which satisfies some hypothesis? • “Yes or no?” question • H₀: the null hypothesis • Today: is P = P₀? • P₀: some model of interest • Statisticians: goodness-of-fit testing, one-sample testing • CS theorists: identity testing • Also today: P₀ = the uniform distribution over [k] integers • Multinomial data • “Uniformity testing”

  15. Classical Hypothesis Testing • Goal: If the null hypothesis holds (P = P₀), probability of “rejecting” is at most α_sig • α_sig: significance (false positive rate) • Generally the important constraint, easier to control • If H₀ doesn’t hold, probability of “not rejecting” is at most β • 1 − β: power • Problem: What if P is very close to P₀? • Often consider an “alternative hypothesis” H₁ • Can often control α_sig, but have to measure power • When H₀ holds, the statistic’s distribution is predictable • May require asymptotic approximations

  16. Non-private: Pearson’s Chi-squared Test • H₀: Uniform distribution over [k] • Statistic: Z = Σᵢ (Nᵢ − n/k)² / (n/k) • Nᵢ: number of occurrences of domain element i • n: number of samples

  17. Non-private: Pearson’s Chi-squared Test • Theorem: If H₀ holds, as n → ∞, Z → χ²₍k−1₎ in distribution • Chi-squared distribution: χ²_d = Σⱼ Yⱼ², where Y₁, …, Y_d are i.i.d. standard normals • Intuition: If n is large, each normalized count is approximately Gaussian • Use quantiles of χ²₍k−1₎ to determine test outcome • If Z exceeds the (1 − α_sig) quantile, output “reject” • p-value of α_sig when Z equals that quantile • Doesn’t account for asymptotic approximation • No guarantees about power
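The test on these two slides can be sketched end to end (not from the slides; the Wilson-Hilferty quantile approximation here stands in for a chi-squared table, and all names are illustrative):

```python
import math
from collections import Counter
from statistics import NormalDist

def pearson_statistic(samples, k):
    # Z = sum_i (N_i - n/k)^2 / (n/k): Pearson's chi-squared statistic
    # against the uniform distribution over {0, ..., k-1}.
    n = len(samples)
    counts = Counter(samples)
    expected = n / k
    return sum((counts.get(i, 0) - expected) ** 2 / expected for i in range(k))

def chi2_quantile(p, df):
    # Wilson-Hilferty approximation to the chi-squared quantile,
    # built from the standard normal quantile.
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * math.sqrt(2 / (9 * df))) ** 3

def uniformity_test(samples, k, significance=0.05):
    # Reject H0 (uniformity) when Z exceeds the (1 - significance)
    # quantile of chi-squared with k - 1 degrees of freedom.
    return pearson_statistic(samples, k) > chi2_quantile(1 - significance, k - 1)
```

As the slide notes, the threshold comes from an asymptotic approximation and the test carries no power guarantee for finite n.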

  18. Privatizing the Chi-Squared Test [Gaboardi-Lim-Rogers-Vadhan ’16] • Add noise to each count, compute the statistic on the noisy counts • Lemma: As n → ∞, the privatized statistic still converges to χ²₍k−1₎ • But finite-sample significance guarantees are now bad!

  19. Privatizing the Chi-Squared Test [Gaboardi-Lim-Rogers-Vadhan ’16] • Add noise to each count, compute the statistic on the noisy counts • Lemma: As n → ∞, the privatized statistic still converges to χ²₍k−1₎ • But finite-sample significance guarantees are now bad! • Use Monte Carlo to determine new thresholds • Analytically understand the statistic’s distribution for finite n • Significance is now accurate, but power could be improved • Some post-processing helps... • [Kifer-Rogers ’17]: try to “project out” noise • Can we rigorously reason about the required size of n?

  20. Minimax Hypothesis Testing • Alternative hypothesis H₁: all distributions which are α-far from P₀ (in total variation distance) • Parameterized by α • How many samples n are required to make the error rates under H₀ and H₁ both small (say, at most 1/3)? • Can be boosted to “high probability” at low cost

  21. A minimax-optimal non-private test • Statistic: Z′ = Σᵢ ((Nᵢ − n/k)² − Nᵢ) / (n/k) [Acharya-Daskalakis-K. ’15] • Subtracting Nᵢ allows us to bound the variance • Separate the mean of Z′ under H₀ and H₁, apply Chebyshev’s inequality • Sample complexity: Θ(√k / α²) • [Paninski ’08, G. Valiant-P. Valiant ’14] • “Sub-linear” in domain size • How much does privacy cost?
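A sketch of the subtracted statistic (names are illustrative; the comment records the standard intuition for why the subtraction helps):

```python
from collections import Counter

def subtracted_chi2(samples, k):
    # Z' = sum_i ((N_i - n/k)^2 - N_i) / (n/k).  Under Poissonized
    # uniform sampling, E[(N_i - n/k)^2] = n/k = E[N_i], so subtracting
    # N_i zeroes the mean of each term and shrinks the variance --
    # which is what enables the sqrt(k)/alpha^2 Chebyshev analysis.
    n = len(samples)
    counts = Counter(samples)
    expected = n / k
    return sum(((counts.get(i, 0) - expected) ** 2 - counts.get(i, 0)) / expected
               for i in range(k))
```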

  22. Subsample and Aggregate [Nissim-Raskhodnikova-Smith ’07] • Split dataset into m parts • Compute function non-privately on each part • “Aggregate” results privately • Theorem: Private decision problems cost at most O(1/ε) times the non-private sample complexity • Proof: “Aggregate” = pick one of the m results at random • Picking one of m = O(1/ε) results at random grants ε-DP for decision problems • More general and powerful framework • Privatizing “Normal-ish” statistics [Smith ’11] • PATE [Papernot-Song-Mironov-Raghunathan-Talwar-Erlingsson ’18] • Baseline for private hypothesis testing: O(√k / (α² ε))
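A minimal sketch of subsample-and-aggregate for a binary decision (not from the slides; for concreteness this uses a noisy majority vote as the private aggregator in place of the random-pick aggregator described above, and all names are illustrative):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def subsample_and_aggregate(data, test, parts, eps):
    # Split the dataset into `parts` disjoint chunks, run the
    # (non-private) binary `test` on each, then release a noisy
    # majority vote.  One record affects one chunk, so the vote
    # count has sensitivity 1 and Laplace(1/eps) noise gives eps-DP.
    random.shuffle(data)
    chunks = [data[i::parts] for i in range(parts)]
    votes = sum(test(chunk) for chunk in chunks)
    return (votes + laplace_noise(1 / eps)) > parts / 2
```

The `test` callable is any non-private decision procedure; only the aggregated vote ever leaves the private boundary.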

  23. A Sensitivity-Limited Chi-Squared Test [Cai-Daskalakis-K. ’17] • Sensitivity of Z′ is determined by the largest count maxᵢ Nᵢ • Sensitive if a count is much larger than its expectation • But then it can’t be the right distribution! • If maxᵢ Nᵢ is large, output “reject” • Else, noisily threshold Z′ (its sensitivity is now bounded) • Sample complexity: improves on the subsample-and-aggregate baseline

  24. Even Better Tests! • Other optimal non-private statistics are more natural for privacy! • Counting the number of non-observed elements [Paninski ’08] • Privatized in [Aliakbarpour-Diakonikolas-Rubinfeld ’18] • Empirical total variation distance [Diakonikolas-Gouleakis-Peebles-Price ’18] • Privatized in [Acharya-Sun-Zhang ’18] • Sample complexity: O(√k/α² + √k/(α√ε) + k^{1/3}/(α^{4/3} ε^{2/3}) + 1/(αε)) • Matching lower bounds in [Acharya-Sun-Zhang ’18]

  25. Distribution Estimation

  26. Private Distribution Estimation Given samples from P, (privately) learn P̂ such that d(P, P̂) ≤ α. Choice of distance d may vary... And it really matters!!

  27. Univariate Learning: Multinomials • Privately estimate a discrete distribution over [k] • In total variation distance: O(k/α² + k/(αε)) samples [folklore] • See, e.g., [Diakonikolas-Hardt-Schmidt ’15] • Cost of privacy: minimal • In Kolmogorov distance: roughly 2^{O(log* k)} samples suffice (*-DP) • [Beimel-Nissim-Stemmer ’13] • Ω(log* k) samples required! [Bun-Nissim-Stemmer-Vadhan ’15] • Cost of privacy: Hmm...
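The folklore upper bound comes from perturbing the empirical counts; a sketch (not from the slides; the clip-and-renormalize post-processing and all names are illustrative):

```python
import math
import random
from collections import Counter

def laplace_noise(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_multinomial(samples, k, eps):
    # Perturb each empirical count with Laplace(2/eps) noise (swapping
    # one sample changes two counts by 1, so the count vector has
    # L1-sensitivity 2), then clip to zero and renormalize.  The extra
    # L1 error is roughly k/(eps * n), matching the k/(alpha*eps) term.
    n = len(samples)
    counts = Counter(samples)
    noisy = [max(0.0, counts.get(i, 0) + laplace_noise(2 / eps)) for i in range(k)]
    total = sum(noisy) or 1.0
    return [c / total for c in noisy]
```

Clipping and renormalizing are post-processing, so they preserve the ε-DP guarantee of the noisy counts.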

  28. Univariate Learning: Gaussians • Privately estimate a Gaussian N(μ, σ²) with |μ| ≤ R, σ_min ≤ σ ≤ σ_max • In total variation distance: • O(1/α² + 1/(αε)) samples, plus terms logarithmic in R and σ_max/σ_min • [Karwa-Vadhan ’18] • Equivalently: estimate μ and σ² in a “scale invariant” fashion • Cost of privacy: Mild dependence on the scale parameters

  29. Multivariate Learning: Product Distributions • Privately estimate the mean of a binary product distribution over {0,1}^d • In ℓ∞-distance: O(log d / α²) samples [folklore] • In ℓ₂-distance: [K.-Li-Singhal-Ullman ’18] • Corresponds to learning the distribution in total variation distance • In ℓ∞-distance with privacy: Ω(√d / (αε)) samples [Bun-Ullman-Vadhan ’14] *-DP • Cost of privacy: exponential! (log d vs. poly(d))

  30. Univariate Learning: Gaussians • Privately estimate a Gaussian N(μ, σ²) with |μ| ≤ R, σ_min ≤ σ ≤ σ_max • In total variation distance: • O(1/α² + 1/(αε)) samples, plus terms logarithmic in R and σ_max/σ_min • [Karwa-Vadhan ’18] • Equivalently: estimate μ and σ² in a “scale invariant” fashion • Cost of privacy: Mild dependence on the scale parameters

  31. Multivariate Learning: Gaussians *-DP • Privately estimate a Gaussian N(μ, Σ) • In total variation distance: • Õ(d²/α² + d²/(αε)) samples, plus terms logarithmic in the range parameters • [K.-Li-Singhal-Ullman ’18] • Equivalently: estimate μ and Σ in a “scale invariant” fashion • Cost of privacy: Mild dependence on the scale parameters

  32. Distribution Learning vs. Reconstruction • Do reconstruction attacks give good private learning lower bounds? • Not really: • Weak parameters • Kobbi’s talk: can’t answer many queries with accuracy o(1/√n) • Gives an n = Ω(1/α²) lower bound: trivial (non-private learning already needs this) • Type mismatch • Throw out half your data: reconstruction is impossible • Throw out half your data: NBD, sample twice as much

  33. Distribution Learning vs. Linear Queries • Distribution learning • Learn all queries, but for a simple class of distributions • E.g., product distributions: poly(d, 1/α, 1/ε) samples • Linear queries • Learn some queries, but for a complex class of datasets • For k queries, polylogarithmic in k samples suffice • More from Gerome and Sasho tomorrow!

  34. Other Private Statistics

  35. Distributional Functional Estimation • P is a discrete distribution over [k]; privately estimate some functional f(P) • Support size, distance to uniformity, entropy • “Estimating the unseen” [G. Valiant-P. Valiant ’11] • The non-private sample complexity plus lower-order terms [Acharya-K.-Sun-Zhang ’18] • Cost of privacy: Negligible! • Privatizing low-sensitivity methods [Orlitsky-Suresh-Wu ’16], [Wu-Yang ’16]

  36. Simple Hypothesis Testing • Determine whether X was generated from (known) P or (known) Q • Compute the likelihood of the data under each, see which one is bigger • Neyman-Pearson Lemma: “The log-likelihood ratio test is optimal.” • Sample complexity: Θ(1/d_H²(P, Q)) samples • Hellinger distance: d_H²(P, Q) = ½ Σᵢ (√pᵢ − √qᵢ)² • How to privatize?
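The Neyman-Pearson test itself is one line; a sketch over a finite domain with both distributions having full support (names are illustrative):

```python
import math

def llr_test(samples, p, q):
    # Neyman-Pearson: accept H1 (distribution q) iff the log-likelihood
    # ratio sum_i log(q[x_i] / p[x_i]) exceeds 0, the equal-prior
    # threshold.  Assumes p[x] > 0 and q[x] > 0 for every observed x.
    llr = sum(math.log(q[x] / p[x]) for x in samples)
    return llr > 0
```

The unbounded per-sample terms of this sum are exactly what makes privatizing it nontrivial, as the next slide discusses.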

  37. Private Simple Hypothesis Testing • But... what is the sensitivity of the log-likelihood ratio? Unbounded! • Simple private hypothesis testing: not so simple • Theorem: a clamped, noised log-likelihood-ratio test has the optimal sample complexity (up to constants) • [Canonne-K.-McMillan-Smith-Ullman ’18] • Not quite Neyman-Pearson...

  38. Private Simple Hypothesis Testing on Binomials • H₀: Bin(n, p₀), H₁: Bin(n, p₁) • Neyman-Pearson: threshold the number of successes • Uniformly most powerful (UMP): the same threshold is optimal for all significance levels simultaneously • A UMP private test for Binomial data • Noise using a “Truncated-Uniform-Laplace” (Tulap) distribution • [Awan-Slavković ’18] • Improves upon overlapping work by [Ghosh-Roughgarden-Sundararajan ’09] • UMP tests can’t exist when the domain is larger than 2 • [Brenner-Nissim ’10]

  39. Changepoint Detection • X₁, …, X_{k*} ~ P, then X_{k*+1}, …, X_n ~ Q • Output k̂ which minimizes |k̂ − k*| • Non-private: Cumulative Sum (CUSUM) • Based on the log-likelihood ratio test • Private analysis by [Cummings-Krehbiel-Mei-Tuo-Zhang ’18] • Same drawbacks as the LLR (unbounded sensitivity)... • Reduction from changepoint detection to simple hypothesis testing • Apply the test from before • [Canonne-K.-McMillan-Smith-Ullman ’18]
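The non-private likelihood-ratio scan underlying CUSUM can be sketched for known discrete pre- and post-change distributions (names are illustrative):

```python
import math

def changepoint_mle(samples, p, q):
    # Maximum-likelihood changepoint: the full log-likelihood at
    # candidate k is sum_{i<=k} log p[x_i] + sum_{i>k} log q[x_i],
    # which (up to a constant) equals the prefix sum of log(p/q).
    # So the MLE is the argmax of that prefix sum, scanned in one pass.
    best_k, best_score, score = 0, 0.0, 0.0
    for i, x in enumerate(samples, start=1):
        score += math.log(p[x] / q[x])
        if score > best_score:
            best_k, best_score = i, score
    return best_k
```

Each term of the running score is an unbounded log-likelihood ratio, which is the same sensitivity obstacle the slide notes for the private analysis.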

  40. Other Things

  41. Local Privacy • Hypothesis Testing • [Gaboardi-Rogers ’18], [Sheffet ’18], [Acharya-Canonne-Freitag-Tyagi ’19] • Distribution Estimation • [Duchi-Jordan-Wainwright ’13] • Multinomials: [Kairouz-Bonawitz-Ramage ’16], [Acharya-Sun-Zhang ’18], [Ye-Barg ’18] • Gaussians: [Gaboardi-Rogers-Sheffet ’19], [Joseph-Kulkarni-Mao-Wu ’18]

  42. Other related tasks • PCA • [Chaudhuri-Sarwate-Sinha ’12], [Dwork-Talwar-Thakurta-Zhang ’14] • Clustering • [Wang-Wang-Singh ’15], [Balcan-Dick-Liang-Mou-Zhang ’17] • Computing Robust Statistics • [Dwork-Lei ’09]

  43. Thanks!
