
Attacks on Randomization based Privacy Preserving Data Mining



Presentation Transcript


  1. Attacks on Randomization based Privacy Preserving Data Mining • Xintao Wu, University of North Carolina at Charlotte • Sept 20, 2010

  2. Scope

  3. Outline Part I: Attacks on Randomized Numerical Data • Additive noise • Projection Part II: Attacks on Randomized Categorical Data • Randomized Response

  4. Additive Noise Randomization Example: Y = X + E (perturbed = original + noise)

  5. Individual Value Reconstruction (Additive Noise) • Methods: • Spectral Filtering, Kargupta et al., ICDM03 • PCA, Huang, Du, and Chen, SIGMOD05 • SVD, Guo, Wu, and Li, PKDD06 • All aim to remove noise by projecting onto a lower-dimensional space.

  6. Individual Reconstruction Algorithm (additive model: Up = U + V, perturbed = original + noise) • Apply EVD to the covariance matrix of the perturbed data Up • Using published information about the noise V, extract the first k components as the principal components: λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, with corresponding eigenvectors e1, e2, ··· , ek • Qk = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X • Find the orthogonal projection of Up onto X • The projection yields the estimated data set
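A minimal NumPy sketch of this spectral filtering attack (illustrative only: the synthetic data, the noise level sigma, and the 1.5·σ² eigenvalue threshold are assumptions for the demo, not the papers' exact rules):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated original data U (n samples x d attributes) plus i.i.d. noise V.
n, d, sigma = 2000, 5, 0.3
base = rng.normal(size=(n, 2))
U = base @ rng.normal(size=(2, d))        # rank-2 signal: attributes are correlated
V = rng.normal(scale=sigma, size=(n, d))
Up = U + V                                # perturbed data released to the attacker

# EVD of the sample covariance of the perturbed data, eigenvalues descending.
vals, vecs = np.linalg.eigh(np.cov(Up, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Keep the k leading eigenvectors whose eigenvalues exceed the noise level
# (the attacker is assumed to know sigma; 1.5 is an illustrative margin).
k = int(np.sum(vals > 1.5 * sigma**2))
Qk = vecs[:, :k]                          # orthonormal basis of the signal subspace

# Orthogonal projection of the perturbed data onto that subspace.
U_hat = Up @ Qk @ Qk.T

err_before = np.linalg.norm(Up - U)       # error of the raw perturbed data
err_after = np.linalg.norm(U_hat - U)     # error after spectral filtering
print(k, err_before, err_after)
```

Because the rank-2 signal dominates the top two eigenvalues while the noise spreads evenly over all five, the projection discards most of the noise and `err_after` comes out well below `err_before`.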

  7. Why it works • Noise is uncorrelated • Original data are correlated • (Figure: 1-d and 2-d estimation of perturbed = original signal + noise, showing the 1st and 2nd principal vectors)

  8. Challenging Questions • Previous work on individual reconstruction is only empirical • Attacker question: how close is the estimated data to the original? • Data owner question: how much noise should be added to preserve privacy at a given tolerated level?

  9. Determining k • Strategy 1: Huang and Du, SIGMOD05 • Strategy 2: Guo, Wu and Li, PKDD 2006 • The estimated data using the chosen k is approximately optimal

  10. Additive Noise vs. Projection • Additive perturbation (Y = X + E, perturbed = original + noise) is not safe • Spectral Filtering Technique: H. Kargupta et al., ICDM03 • PCA-Based Technique: Huang et al., SIGMOD05 • SVD-based & Bound Analysis: Guo et al., SAC06, PKDD06 • How about projection-based perturbation (Y = RX, perturbed = transformed original)? • Projection models • Vulnerabilities • Potential attacks

  11. Rotation Randomization Example: Y = RX, where R R^T = R^T R = I

  12. Rotation Approach (R is orthonormal) • When R is an orthonormal matrix (R^T R = R R^T = I): • Vector length: |Rx| = |x| • Euclidean distance: |Rxi − Rxj| = |xi − xj| • Inner product: <Rxi, Rxj> = <xi, xj> • Many clustering and classification methods are invariant to this rotation perturbation • Classification: Chen and Liu, ICDM05 • Distributed data mining: Liu and Kargupta, TKDE06
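These invariance properties are easy to verify numerically (a small sketch; the orthonormal R is generated from the QR decomposition of a random Gaussian matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random orthonormal matrix R via QR decomposition; Q from QR satisfies Q^T Q = I.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print(np.allclose(R.T @ R, np.eye(3)))   # R^T R = I

x1 = rng.normal(size=3)
x2 = rng.normal(size=3)

# Rotation preserves vector lengths, Euclidean distances, and inner products.
print(np.allclose(np.linalg.norm(R @ x1), np.linalg.norm(x1)))
print(np.allclose(np.linalg.norm(R @ x1 - R @ x2), np.linalg.norm(x1 - x2)))
print(np.allclose((R @ x1) @ (R @ x2), x1 @ x2))   # each line prints True
```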

  13. Example (numerical): Y = RX with R R^T = R^T R = I

  14. Weakness of Rotation • Known-sample attack: an attacker who knows a few original records and the perturbed data can estimate R by regression and recover the remaining original data
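A sketch of such a known-sample regression attack on Y = RX (all data synthetic; it assumes the attacker can match a handful of known original records to their perturbed images):

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, n_known = 4, 500, 10
X = rng.normal(size=(d, n))                 # original records as columns
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # secret orthonormal perturbation
Y = R @ X                                   # published rotated data

# Attacker's side information: a few original records and their perturbed images.
X_known, Y_known = X[:, :n_known], Y[:, :n_known]

# Least-squares (regression) estimate of R: minimize ||R X_known - Y_known||_F.
R_hat = Y_known @ np.linalg.pinv(X_known)

# Invert the estimated transformation to disclose every remaining record.
X_hat = np.linalg.solve(R_hat, Y)
print(np.max(np.abs(X_hat - X)))            # essentially zero: full disclosure
```

With more known samples than dimensions (here 10 > 4) and noiseless rotation, the regression recovers R exactly, which is why the slide calls this the weakness of the rotation approach.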

  15. General Linear Transformation • Y = RX + E (perturbed = transformed original + noise) • When R = I: Y = X + E (additive noise model) • When R R^T = R^T R = I and E = 0: Y = RX (rotation model) • R can be an arbitrary matrix

  16. Is Y = RX + E Safe? • R can be an arbitrary matrix, hence the regression-based attack won't work • How about a noisy ICA direct attack? • General linear transformation model: Y = RX + E • Noisy ICA model: X = AS + N

  17. ICA Revisited • ICA Motivation • Blind source separation: separating unobservable or latent independent source signals when mixed signals are observed • Cocktail-party problem • What is ICA • ICA is a statistical technique which aims to represent a set of random variables as linear combinations of statistically independent component variables • ICA is a process for determining the structure that produced a signal

  18. Separation Process (figure): linear mixing (source signals → mixing matrix → observed signals), then separation (demixing matrix → separated signals); a cost function measures independence and is optimized by the ICA algorithm
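The mixing/separation pipeline can be illustrated on a toy two-source case (a minimal sketch, not a production ICA: it grid-searches a demixing rotation angle that maximizes non-Gaussianity, measured by excess kurtosis, over made-up uniform sources):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two independent, non-Gaussian (uniform) source signals.
n = 5000
S = rng.uniform(-1, 1, size=(2, n))

# Orthogonal mixing keeps the equal-variance sources uncorrelated,
# so no separate whitening step is needed in this toy example.
theta_mix = 0.7
A = np.array([[np.cos(theta_mix), -np.sin(theta_mix)],
              [np.sin(theta_mix),  np.cos(theta_mix)]])
X = A @ S                                   # observed mixed signals

def non_gaussianity(Z):
    """Cost function: sum of |excess kurtosis| over rows (0 for Gaussian data)."""
    Z = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, keepdims=True)
    return np.sum(np.abs((Z**4).mean(axis=1) - 3))

def demix(t):
    """Apply a candidate demixing rotation by angle -t."""
    return np.array([[np.cos(t),  np.sin(t)],
                     [-np.sin(t), np.cos(t)]]) @ X

# Optimize: pick the demixing angle that makes the outputs most non-Gaussian.
best = max(np.linspace(0, np.pi / 2, 200), key=lambda t: non_gaussianity(demix(t)))
S_hat = demix(best)                         # separated signals (up to order/sign)

corr = np.abs(np.corrcoef(np.vstack([S, S_hat]))[:2, 2:])
print(corr.max(axis=1))                     # each true source is well matched
```

Real ICA implementations optimize the cost function with fixed-point or gradient methods rather than a grid search, but the structure is the same as in the slide's figure: mix, then search for a demixing matrix whose outputs look independent.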

  19. Restrictions of ICA • All the components si should be independent • They must be non-Gaussian, with the possible exception of one component • Can we apply ICA (X = AS) directly to Y = RX? No: • There are correlations among the attributes of X • More than one attribute may have a Gaussian distribution

  20. A-priori Knowledge based ICA (AK-ICA) Attack

  21. Correctness of AK-ICA • We prove that such a transformation J exists • J represents the connection between the two distributions; for more details, see Guo and Wu, PAKDD 2007

  22. Assumption • Privacy can be breached when a small subset of the original data X is available to attackers • The assumption is reasonable: a survey ("Understanding net users' attitude about online privacy", April 99) found 56% with privacy concern, 17% who refuse to provide data, and 27% with no concern, willing to provide data

  23. Outline Part I: Attacks on Randomized Numerical Data • Additive noise • Projection Part II: Attacks on Randomized Categorical Data • Randomized Response

  24. Randomized Response (Stanley Warner, JASA 1965) • A: cheated in the exam; Ā: didn't cheat in the exam • Purpose: estimate the proportion π of population members that cheated in the exam • Procedure: a randomization device asks "Do you belong to A?" with probability p and "Do you belong to Ā?" with probability 1 − p; the respondent answers "yes" or "no" truthfully • From the observed proportion of "yes" answers, an unbiased estimate of π is obtained
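The estimator elided on the slide is Warner's standard one: with observed "yes" proportion λ̂, the unbiased estimate is π̂ = (λ̂ − (1 − p)) / (2p − 1). A minimal simulation with synthetic respondents:

```python
import numpy as np

rng = np.random.default_rng(4)

n, pi_true, p = 100_000, 0.3, 0.75          # pi_true: true proportion who cheated
cheated = rng.random(n) < pi_true

# Each respondent privately flips a biased coin: with probability p answer the
# question "Do you belong to A?", otherwise answer about the complement Ā.
# Respondents always answer truthfully; the interviewer sees only yes/no.
direct = rng.random(n) < p
yes = np.where(direct, cheated, ~cheated)

lam_hat = yes.mean()                        # observed "yes" proportion
# P(yes) = p*pi + (1-p)*(1-pi), so solving for pi gives Warner's estimator:
pi_hat = (lam_hat - (1 - p)) / (2 * p - 1)
print(pi_hat)                               # close to pi_true = 0.3
```

Note that no individual answer reveals membership in A, yet the aggregate proportion is still estimable, which is the point of the randomized response design.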

  25. Matrix Expression • RR can be expressed in matrix form (0: No, 1: Yes): the vector of observed response proportions equals the randomization matrix times the vector of true proportions • An unbiased estimate of the true proportions is obtained by inverting the randomization matrix

  26. Vector Response • π: the true proportions of the population • λ: the observed proportions in the survey • P: the randomization device set by the interviewer • λ = Pπ, so the unbiased estimate is π̂ = P^(-1) λ̂
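A simulation sketch of the vector-response model λ = Pπ (the three-category proportions and the matrix P below are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)

# True category proportions pi, and a randomization device P chosen by the
# interviewer: P[i, j] = Pr(report category i | true category j).
pi = np.array([0.5, 0.3, 0.2])
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Simulate the survey: draw each respondent's true category, then sample the
# randomized report via the cumulative distribution of P's matching column.
n = 200_000
true_cat = rng.choice(3, size=n, p=pi)
cum = np.cumsum(P, axis=0)                  # column-wise CDF over reports
reported = (rng.random(n)[:, None] > cum.T[true_cat]).sum(axis=1)
lam_hat = np.bincount(reported, minlength=3) / n   # observed proportions

# Since lambda = P pi, the unbiased estimate is pi_hat = P^(-1) lambda_hat.
pi_hat = np.linalg.solve(P, lam_hat)
print(pi_hat)                               # close to [0.5, 0.3, 0.2]
```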

  27. Extension to Multiple Attributes • m sensitive attributes; the i-th has ti categories • Let π be the vector of true proportions over all category combinations, arranged lexicographically • e.g., if m = 2, t1 = 2 and t2 = 3, there are 6 combinations • Simultaneous model: consider all variables as one compounded variable and apply the regular vector-response RR technique • Sequential model: randomize the attributes one by one; the compound device is the Kronecker product (⊗) of the per-attribute devices
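The sequential model's Kronecker structure in a few lines (the per-attribute devices P1, P2 and the joint proportions are hypothetical values for the slide's m = 2, t1 = 2, t2 = 3 example):

```python
import numpy as np

# Per-attribute randomization devices for t1 = 2 and t2 = 3 categories.
P1 = np.array([[0.8, 0.2],
               [0.2, 0.8]])
P2 = np.array([[0.70, 0.15, 0.15],
               [0.15, 0.70, 0.15],
               [0.15, 0.15, 0.70]])

# Sequential randomization of both attributes acts on the joint proportions
# through the Kronecker product: a 6 x 6 device over lexicographic combinations.
P = np.kron(P1, P2)

# Hypothetical true joint proportions over the 6 combinations (lexicographic).
pi = np.array([0.2, 0.1, 0.1, 0.25, 0.2, 0.15])
lam = P @ pi                                 # expected observed proportions
pi_hat = np.linalg.solve(P, lam)             # invert the compound device
print(np.allclose(pi_hat, pi))               # prints True: exact in expectation
```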

  28. Disclosure Analysis • R: a typical response, which is "yes" or "no" • The posterior probabilities P(A | R) and P(Ā | R) are conditional probabilities determined by parameters set by the investigators • R is regarded as jeopardizing with respect to A or Ā if the corresponding posterior probability exceeds a tolerated threshold
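For Warner's device the posteriors follow directly from Bayes' rule (π, p, and the disclosure threshold below are example values, not the slide's):

```python
# A respondent in A answers "yes" with probability p; one in the complement
# answers "yes" with probability 1 - p. Prior membership probability is pi.
pi, p = 0.3, 0.75

p_yes = p * pi + (1 - p) * (1 - pi)                 # marginal P(R = yes)
post_A_given_yes = p * pi / p_yes                   # P(A | yes) by Bayes' rule
post_A_given_no = (1 - p) * pi / (1 - p_yes)        # P(A | no)
print(post_A_given_yes, post_A_given_no)            # 0.5625 0.125

# The response is jeopardizing w.r.t. A if a posterior exceeds the tolerated
# threshold; neither does here, so this (pi, p) pair would be considered safe.
threshold = 0.8
print(post_A_given_yes > threshold or post_A_given_no > threshold)   # False
```

Raising p sharpens the posteriors toward certainty (more disclosure); p near 0.5 pushes both posteriors toward the prior π (more privacy, noisier estimates), which is the trade-off the disclosure analysis quantifies.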

  29. Q & A • Xintao Wu, xwu@uncc.edu, http://www.sis.uncc.edu/~xwu • Data Privacy Lab: http://www.dpl.sis.uncc.edu
