
Privacy Preserving Outlier Detection using Locality Sensitive Hashing

Nisarg Raval, Madhuchand Rushi Pillutla, Piyush Bansal, K Srinathan and C V Jawahar. IIIT Hyderabad, India. CSTAR.


Presentation Transcript


  1. Privacy Preserving Outlier Detection using Locality Sensitive Hashing • Nisarg Raval, Madhuchand Rushi Pillutla, Piyush Bansal, K Srinathan and C V Jawahar • CSTAR, IIIT Hyderabad, India

  2. Motivation

  3. Motivation • Trusted Third Party (TTP)

  4. Motivation • Trusted Third Party (TTP) • Can we avoid TTP?

  5. Motivation • Simulate Trusted Third Party

  6. Privacy Preserving Outlier Detection • Alice and Bob each have a database of customer behavior. • Together they want to find fraudulent customers (outliers) in their respective databases. • Only outliers should be revealed. • Individual data should remain private.

  7. Outlier Detection Approach • Statistics based: Barnett et al. John Wiley 1994 • Density based: Papadimitriou et al. ICDE 2003 • Distance based: Knorr et al. VLDB 1998; Ramaswamy et al. SIGMOD 2000; Wang et al. ICDE 2011

  8. Privacy Preserving Data Mining [Verykios et al. SIGMOD 2004] • Heuristic based: Atallah et al. KDEW 1999; Verykios et al. KDE 2003 • Reconstruction based: Agrawal et al. SIGMOD 2000; Rizvi et al. VLDB 2002 • Cryptography based: Lindell et al. CRYPTO 2000; Clifton et al. SIGKDD 2002

  9-10. Related Work • Vaidya et al. ICDM 2004: pairwise distance computation; Secure Distance and Secure Comparison protocols • Zhou et al. EBISS 2009: pairwise distance computation; Homomorphic Encryption and Randomization • Quadratic cost: approximately 10^12 operations on 1 million data points • Our method is 10000 times faster on 1 million data points!
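The quadratic figure quoted above follows from comparing every pair of objects. As a rough check (taking n as the number of data points, a symbol not used on the slide):

```latex
\[
n = 10^{6}
\quad\Longrightarrow\quad
\binom{n}{2} = \frac{n(n-1)}{2} \approx 5 \times 10^{11},
\qquad
n^{2} = 10^{12}.
\]
```

So "approximately 10^12" is the order-of-magnitude count of pairwise distance computations, ignoring constant factors.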

  11-15. Outlier Detection • Distance based outliers [Knorr et al. VLDB 1998] • An object is an outlier if a very large fraction of the total objects lie outside the specified radius. • Figure labels: Outlier, Neighbors, Non Neighbors
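The distance-based definition above can be illustrated with a brute-force check. This is only a minimal sketch: the function name, the parameter names `radius` and `min_fraction_outside`, and the toy data are assumptions made for illustration, and the all-pairs loop is exactly the quadratic cost the talk sets out to avoid.

```python
import math

def is_distance_outlier(point, data, radius, min_fraction_outside):
    """True if at least `min_fraction_outside` of the other objects
    lie farther than `radius` from `point` (brute force, O(n) per query)."""
    others = [q for q in data if q is not point]
    outside = sum(1 for q in others if math.dist(point, q) > radius)
    return outside >= min_fraction_outside * len(others)

# Toy data: four points clustered near the origin and one far away.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [p for p in data
            if is_distance_outlier(p, data, radius=1.0, min_fraction_outside=0.9)]
print(outliers)
```

Running the check for every object costs one distance computation per pair of objects.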

  16-20. Our Approach • Converse of the definition • An object is a non-outlier if it has enough neighbors within the specified radius. • Easy to find a small number of neighbors! • Figure labels: Non-Outlier, Neighbors, Non Neighbors
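One way to see why the converse is cheaper: a non-outlier test can stop as soon as enough neighbors have been found, instead of scanning the whole database. The sketch below assumes a hypothetical `min_neighbors` threshold and still uses explicit distances; it is not the LSH-based procedure of the talk.

```python
import math

def is_non_outlier(point, data, radius, min_neighbors):
    """True as soon as `min_neighbors` objects within `radius` are found."""
    found = 0
    for q in data:
        if q is not point and math.dist(point, q) <= radius:
            found += 1
            if found >= min_neighbors:
                return True  # early exit: no need to look at the rest
    return False

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print([is_non_outlier(p, data, radius=1.0, min_neighbors=2) for p in data])
```

For an object in a dense region the loop ends after a handful of comparisons, while a true outlier still forces a full scan.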

  21. Locality Sensitive Hashing (LSH) • Similar objects are hashed to the same bin • Property • Condition • Hash Family

  22. Centralized Algorithm • Outlier Detection → Find Non Outliers → Near Neighbor Queries → LSH • Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar. LSH Based Outlier Detection and Its Application in Distributed Setting. CIKM 2011

  23-24. Distributed Settings • Vertically distributed data: each player has different attributes for the same set of objects • Horizontally distributed data: each player has the same attributes for a subset of the total objects

  25-27. Privacy in Vertical Distribution • Two phases: (1) generation of the global LSH bin structure using information from all the players; (2) finding non outliers using the generated global bin structure • How do we generate the LSH bin structure privately? • Private Hash Evaluation • LSH based on p-stable distribution
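For reference, the commonly used p-stable LSH scheme for Euclidean distance hashes a vector v to floor((a·v + b) / w), with a drawn from a Gaussian (which is 2-stable) and b uniform in [0, w). The sketch below shows that standard construction in the clear, with no privacy protection; the function name, the bin width `width`, and the number of hash functions are illustrative assumptions, and the slides do not state which parameters the authors used.

```python
import math
import random

def make_pstable_hash(dim, width, rng):
    """Return one hash function h(v) = floor((a.v + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # 2-stable (Gaussian) entries
    b = rng.uniform(0.0, width)                     # random offset in [0, width)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / width)

    return h

rng = random.Random(0)
hashes = [make_pstable_hash(dim=3, width=4.0, rng=rng) for _ in range(4)]

def bin_label(v):
    """Concatenate several hash values into one bin label."""
    return tuple(h(v) for h in hashes)

print(bin_label((1.0, 2.0, 3.0)))
```

The only data-dependent step is the dot product a·v, which is why a private evaluation of that product is enough to evaluate the hash privately.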

  28. Privacy in Vertical Distribution • Two phases: (1) generation of the global LSH bin structure using information from all the players; (2) finding non outliers using the generated global bin structure • Secure Evaluation of Dot Product (a.v): each player generates the values of vector a corresponding to the dimensions of v that it holds; each player adds its corresponding products to obtain its share of the dot product; the Secure Sum protocol combines the shares into the final dot product
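The dot-product steps above can be simulated in a single process as follows. This is a sketch under stated assumptions, not a secure implementation: `secure_sum` mimics a ring-style masked sum in which an initiator adds a random mask and removes it at the end, and `MODULUS`, `SCALE`, and the fixed-point encoding are hypothetical choices that the slides do not specify.

```python
import random

MODULUS = 2**61 - 1   # assumed ring size for the masked sum
SCALE = 10**6         # assumed fixed-point precision

def encode(x):
    """Map a real number to a ring element (fixed-point)."""
    return round(x * SCALE) % MODULUS

def decode(y):
    """Map a ring element back to a real number."""
    if y > MODULUS // 2:
        y -= MODULUS
    return y / SCALE

def secure_sum(private_values, rng):
    """Simulated masked sum: intermediate totals are hidden by a random mask."""
    mask = rng.randrange(MODULUS)
    running = mask
    for value in private_values:          # each player adds its own value in turn
        running = (running + value) % MODULUS
    return (running - mask) % MODULUS     # initiator removes the mask

def private_dot_product(a_parts, v_parts, rng):
    """Each player i holds a_parts[i] and v_parts[i] for its own dimensions."""
    local_shares = [sum(a * x for a, x in zip(a_part, v_part))
                    for a_part, v_part in zip(a_parts, v_parts)]
    return decode(secure_sum([encode(s) for s in local_shares], rng))

rng = random.Random(1)
v_parts = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]                  # three players
a_parts = [[rng.gauss(0.0, 1.0) for _ in part] for part in v_parts]
print(private_dot_product(a_parts, v_parts, rng))
```

In this sketch each player keeps its part of a fixed, so the same hash function applies to every object.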

  29-30. Privacy in Vertical Distribution • Two phases: (1) generation of the global LSH bin structure using information from all the players; (2) finding non outliers using the generated global bin structure • Perform near neighbor queries: Secure Distance and Secure Comparison; many neighbors; quadratic communication • Can we break the quadratic bound?

  31. Approximate Near Neighbor Queries • The definition of outliers is subjective • Unlike traditional LSH queries, NO explicit distance calculation • No communication required • Flowchart: Hash Objects, Count Neighbors, then Yes: Non Outlier / No: Outlier
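A minimal sketch of counting neighbors from bin membership alone, with no distance computation. It assumes each object already has one bin label per hash table and treats any object sharing a bin as an approximate neighbor; the names `labels_by_object` and `min_neighbors` and the toy labels are hypothetical.

```python
from collections import defaultdict

def build_tables(labels_by_object):
    """One dict per hash table, mapping bin label -> set of object ids."""
    num_tables = len(next(iter(labels_by_object.values())))
    tables = [defaultdict(set) for _ in range(num_tables)]
    for obj, labels in labels_by_object.items():
        for t, label in enumerate(labels):
            tables[t][label].add(obj)
    return tables

def approximate_neighbors(obj, labels_by_object, tables):
    """Objects that share at least one bin with `obj`."""
    neighbors = set()
    for t, label in enumerate(labels_by_object[obj]):
        neighbors |= tables[t][label]
    neighbors.discard(obj)
    return neighbors

def classify(labels_by_object, min_neighbors):
    tables = build_tables(labels_by_object)
    return {obj: ("non-outlier"
                  if len(approximate_neighbors(obj, labels_by_object, tables)) >= min_neighbors
                  else "outlier")
            for obj in labels_by_object}

# Toy bin labels for two hash tables.
labels = {"a": (0, 3), "b": (0, 3), "c": (0, 4), "d": (7, 9)}
print(classify(labels, min_neighbors=2))
```

Because only bin labels are compared, the query itself needs no further exchange of attribute values once the bin structure exists.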

  32. Need for Pruning • Number of queries = number of objects in the database • Databases are very large • Can we reduce the number of queries? • Flowchart: Hash Objects, Count Neighbors, then Yes: Non Outlier / No: Outlier

  33-35. Pruning • Neighbors of a non outlier are also non outliers • < 1% of the total database needs to be processed! • Flowchart: Hash Objects, Count Neighbors, then Yes: Non Outliers / No: Outlier
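The pruning rule can be sketched as a loop that skips any object already marked as a non-outlier. The `neighbors_of` callback and the toy neighbor map are hypothetical stand-ins for the LSH-based neighbor lookup, and the query counter is included only to show the saving.

```python
def prune_non_outliers(objects, neighbors_of, min_neighbors):
    """Return (outliers, number_of_queries) using the pruning heuristic."""
    non_outliers, candidates = set(), set()
    queries = 0
    for obj in objects:
        if obj in non_outliers:
            continue                       # pruned: no query needed
        queries += 1
        neighbors = neighbors_of(obj)
        if len(neighbors) >= min_neighbors:
            non_outliers.add(obj)
            non_outliers.update(neighbors)  # neighbors of a non-outlier are pruned
        else:
            candidates.add(obj)
    # An early candidate may later be pruned as some non-outlier's neighbor.
    return candidates - non_outliers, queries

neighbor_map = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
outliers, queries = prune_non_outliers(["a", "b", "c", "d"], neighbor_map.get, min_neighbors=2)
print(outliers, queries)
```

Here four objects are resolved with two queries, because the first query prunes "b" and "c".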

  36-37. Privacy in Horizontal Distribution • Data is the union of the sets of objects that all players have • Steps: (1) generate local LSH bin structure; (2) perform local pruning; (3) communicate to obtain global neighbor information; (4) perform global pruning • How do we obtain global neighbor information privately?

  38. Private Global Bin Structure • Construct global LSH bin labels: Secure Union protocol • Add the counts of objects of corresponding bins: Secure Sum protocol • Perform global pruning using the global bin structure
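The data flow of these steps can be sketched as below. The plain `set` union and `sum` are insecure stand-ins for the Secure Union and Secure Sum protocols named on the slide; the function names, the `min_neighbors` threshold, and the toy counts are assumptions for illustration.

```python
from collections import Counter

def global_bin_counts(local_bin_counts):
    """Merge each player's {bin label: object count} into global counts."""
    # Stand-in for Secure Union: the set of all bin labels held by any player.
    labels = set().union(*(counts.keys() for counts in local_bin_counts))
    # Stand-in for Secure Sum: per-label total across players.
    return {label: sum(counts.get(label, 0) for counts in local_bin_counts)
            for label in sorted(labels)}

def dense_bins(global_counts, min_neighbors):
    """Bins whose objects each have at least `min_neighbors` bin-mates."""
    return {label for label, n in global_counts.items() if n - 1 >= min_neighbors}

alice = Counter({(0, 3): 2, (1, 1): 1})
bob = Counter({(0, 3): 3, (7, 9): 1})
counts = global_bin_counts([alice, bob])
print(counts)
print(dense_bins(counts, min_neighbors=2))
```

Only bin labels and per-bin counts cross player boundaries in this flow; the objects themselves stay local.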

  39. Approximation Error • LSH is probabilistic • Probability of being a near neighbor is at least [value not captured in transcript] • False neighbors may cause pruning of an outlier (False Negatives) • How do we reduce False Negatives?

  40. Reducing False Negatives • Bin Threshold (BT) • An object is a neighbor only if it appears in at least BT bins • Increasing BT will decrease False Negatives • Flowchart: Hash Objects, Count Neighbors, then Yes: Non Outlier / No: Outlier
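A minimal sketch of the Bin Threshold rule, under the assumption that "appears in at least BT bins" means sharing a bin with the query object in at least BT of the hash tables. The function name and the toy labels are hypothetical.

```python
def neighbors_with_bin_threshold(obj, labels_by_object, bin_threshold):
    """Objects that share a bin with `obj` in at least `bin_threshold` tables."""
    mine = labels_by_object[obj]
    result = set()
    for other, theirs in labels_by_object.items():
        if other == obj:
            continue
        shared_bins = sum(1 for m, t in zip(mine, theirs) if m == t)
        if shared_bins >= bin_threshold:
            result.add(other)
    return result

# Toy bin labels for three hash tables.
labels = {"a": (0, 3, 5), "b": (0, 3, 5), "c": (0, 4, 6), "d": (7, 9, 2)}
print(neighbors_with_bin_threshold("a", labels, bin_threshold=1))
print(neighbors_with_bin_threshold("a", labels, bin_threshold=2))
```

Raising the threshold from 1 to 2 drops "c", which collides with "a" in only one table.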

  41. Bin Threshold • Bin Threshold may remove actual neighbors • A high Bin Threshold reduces pruning efficiency • False Positives • How do we reduce False Positives without increasing False Negatives?

  42. Reducing False Positives: Multiple Runs • Each iteration (1, 2, ..., n): Compute Parameters; Generate Bin Structure (LSH); Find Near Neighbors; Prune Non Outliers (Pruning) • Output: Intersection of Results gives the Final Set of Outliers
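The combination step can be sketched as a set intersection over the outlier sets reported by each run. The second function illustrates why few iterations may suffice: it assumes, purely for illustration, that each run reports a given false positive independently with the same probability, which the slides do not state.

```python
def final_outliers(run_outputs):
    """Keep only objects reported as outliers in every run."""
    runs = [set(r) for r in run_outputs]
    return set.intersection(*runs) if runs else set()

def surviving_false_positive_rate(per_run_rate, iterations):
    """Chance a false positive survives all runs, assuming independent runs."""
    return per_run_rate ** iterations

runs = [{"d", "x"}, {"d", "y"}, {"d", "x"}]
print(final_outliers(runs))
print(surviving_false_positive_rate(0.1, 3))
```

A true outlier appears in every run's output and survives the intersection, while a spurious report has to recur in all runs to survive.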

  43. Analysis • Security of the algorithm depends on the security of the Secure Union and Secure Sum protocols

  44. Experimental Results

  45. Effect of Bin Threshold (BT) • Increasing BT will increase the detection rate but also increase false positives • Optimal BT: high detection rate, low false positives • Plots: Corel, Landsat, Darpa, Household

  46. Effect of Iterations on False Positives • False positives decrease exponentially with an increase in iterations • A very small number of iterations is needed to achieve a low false positive rate

  47. Communication • Less than quadratic • Superior to previously known best results • Up to 10000 times less communication on datasets of size 10^6! • Plots: Corel, Landsat, Darpa, Household

  48. Performance • False positives can be considered as borderline outliers!

  49. Conclusion • Approximate outlier detection • Efficient private algorithms for both vertical and horizontal distribution • Efficient pruning based on LSH • Scalable for large and high dimensional data • Trade-off between accuracy and cost

  50. CSTAR nisarg.raval@research.iiit.ac.in Supported by Microsoft Research India Travel Grant
