
A Secure Clustering Algorithm for Distributed Data Streams




Presentation Transcript


  1. A Secure Clustering Algorithm for Distributed Data Streams Geetha Jagannathan Rutgers University Joint work with Krishnan Pillaipakkamnatt and D. Umano

  2. Outline • The problem • Prior results • Clustering data streams • Experimental results and comparison • A privacy-preserving protocol • Conclusion

  3. The problem • Alice and Bob each have a data stream, defined on the same attributes (horizontal partition). • They wish to compute a clustering on the combined data.

  4. Bob Alice

  5. Clustering on joint data Alice’s Data k = 4

  6. Clustering on joint data Bob’s Data k = 4

  7. Clustering on joint data Combined Data k = 4

  8. Trusted third party k-clustering k-clustering Bob Alice

  9. Privacy requirements • Parties are semi-honest • Same as trusted third party • Reveals nothing but the final output • In this case – the k cluster centers

  10. Prior results • PPDM protocols convert distributed DM algorithms into private ones • The k-means algorithm is the basis for many clustering protocols [VC03, JKM05, JW05, BO07] • “Leak” intermediate information • [JPW05] presents a leak-free clustering protocol based on a new clustering algorithm.

  11. Our Contributions • A leak-free privacy-preserving protocol for distributed data streams. • A data stream clustering algorithm • Better than k-means (on average) • Comparable performance with BIRCH on many data sets, but with lower memory needs.

  12. Data Stream Algorithms • Data arrives in “stream” fashion: d1, d2, …, dn, … (the “end” of the stream is not known ahead of time). • Data is too large to fit entirely in memory. • Data can be accessed only in the order that it arrives. • Each data item can only be “read” once.

  13. The clustering algorithm • “Incrementally agglomerative”: It merges intermediate clusters without waiting for all the data to be available. • Runs in time linear in n.

  14. Overview of clustering algorithm K = 5 Output expected after n = 25 data points Output Level 2 clustering Level 1 clustering Level 0 clustering

  15. Clustering Algorithm Outline • The algorithm maintains a list of k-clusterings (each clustering is on some partial data). • In each iteration: • Input the next k data points as a level-0 clustering. • If two clusterings at level i are in the list, “merge” them into a level-(i + 1) k-clustering.

  16. Clustering algorithm outline • If output is needed after some n points have been read, all k-clusterings are “merged” into a single k-clustering.
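The outline on slides 15 and 16 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `StreamClusterer` and `reduce_to_k` are hypothetical, clusters are assumed to be `(weight, center)` pairs, and the center of a union is assumed to be the weighted mean of the two centers (the slides do not say how centers are combined).

```python
from itertools import combinations

def reduce_to_k(clusters, k):
    """Greedily merge (weight, center) clusters until at most k remain."""
    s = list(clusters)
    while len(s) > k:
        def error(pair):
            # Merge error from slide 18: w1 * w2 * dist^2.
            (w1, p1), (w2, p2) = s[pair[0]], s[pair[1]]
            return w1 * w2 * sum((a - b) ** 2 for a, b in zip(p1, p2))
        i, j = min(combinations(range(len(s)), 2), key=error)
        (w1, p1), (w2, p2) = s[i], s[j]
        w = w1 + w2
        # Assumption: the merged center is the weighted mean of the two centers.
        center = tuple((w1 * a + w2 * b) / w for a, b in zip(p1, p2))
        s = [c for n, c in enumerate(s) if n not in (i, j)] + [(w, center)]
    return s

class StreamClusterer:
    """Keeps at most one k-clustering per level, like the digits of a binary counter."""

    def __init__(self, k):
        self.k = k
        self.levels = {}   # level -> k-clustering built from k * 2**level points
        self.buffer = []   # points waiting to form the next level-0 clustering

    def add(self, point):
        self.buffer.append((1, tuple(point)))
        if len(self.buffer) < self.k:
            return
        clustering, level = self.buffer, 0
        self.buffer = []
        # Two clusterings at the same level are merged into one at the next level.
        while level in self.levels:
            clustering = reduce_to_k(self.levels.pop(level) + clustering, self.k)
            level += 1
        self.levels[level] = clustering

    def output(self):
        """Merge every stored clustering (and any buffered points) into one k-clustering."""
        pending = list(self.buffer)
        for clustering in self.levels.values():
            pending += clustering
        return reduce_to_k(pending, self.k)
```

With k = 5 and 25 points, as in the overview on slide 14, this sketch ends up holding one level-2 clustering and one level-0 clustering, and `output()` merges the two into the final 5 clusters.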

  17. “Merging” clusterings • Have a set S of clusters, with |S| > k. • Need a set S' of k clusters. • S' = S • Repeat • Compute the merge error for every pair of clusters in S' • Replace the pair with the lowest error by its union • Until |S'| = k

  18. Error(C1 ∪ C2) = C1.weight * C2.weight * (dist(C1, C2))²
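The merge loop from slide 17 together with the error measure from slide 18 might look like the sketch below. The function names are hypothetical, `dist` is assumed to be Euclidean distance between cluster centers, and the center of a union is assumed to be the weighted mean of the two centers; the slides do not spell out either choice.

```python
from itertools import combinations

def merge_error(c1, c2):
    """Error(C1 ∪ C2) = C1.weight * C2.weight * dist(C1, C2)^2."""
    (w1, p1), (w2, p2) = c1, c2
    dist_sq = sum((a - b) ** 2 for a, b in zip(p1, p2))  # squared Euclidean distance (assumed)
    return w1 * w2 * dist_sq

def union(c1, c2):
    """Combine two (weight, center) clusters; weighted-mean center is an assumption."""
    (w1, p1), (w2, p2) = c1, c2
    w = w1 + w2
    return (w, tuple((w1 * a + w2 * b) / w for a, b in zip(p1, p2)))

def merge_clusters(clusters, k):
    """Repeatedly replace the lowest-error pair by its union until k clusters remain."""
    s = list(clusters)
    while len(s) > k:
        i, j = min(
            combinations(range(len(s)), 2),
            key=lambda ij: merge_error(s[ij[0]], s[ij[1]]),
        )
        merged = union(s[i], s[j])
        s = [c for idx, c in enumerate(s) if idx not in (i, j)]
        s.append(merged)
    return s

# Four unit-weight clusters on a line, reduced to k = 2.
points = [(1, (0.0,)), (1, (1.0,)), (1, (10.0,)), (1, (11.0,))]
print(merge_clusters(points, 2))  # [(2, (0.5,)), (2, (10.5,))]
```

Because the error is the product of the weights rather than a sum, merging two heavy clusters is penalized more than merging two light ones at the same distance. Each pass recomputes every pairwise error, which is acceptable here because the number of clusters being merged is small compared with the stream length.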

  19. Sample results (offset grid)

  20. Sample results (vs k-means)

  21. Sample result (vs. BIRCH)

  22. Realistic Data (Network Intrusion)

  23. The Secure Protocol • Input: Alice owns data stream D1; Bob owns data stream D2 • Output: k clusters on D1m ∪ D2n • Alice computes O(k log ( )) cluster centers and Bob computes O(k log ( )) cluster centers • Alice and Bob securely share their cluster centers • They securely merge clusters

  24. Sample run (distributed non-private protocol)

  25. Complexity • Communication complexity: O((k log(mn/k²))²) • Non-private setting (one party sends the intermediate clusters to the other) • Comm complexity: O(k log(m/k))
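One way to see where the O(k log(m/k)) term comes from: if a party keeps at most one k-clustering per level and a level-i clustering summarizes k·2^i points, then the clusterings it holds correspond to the 1-bits in the binary representation of the number of k-point batches it has read. The sketch below counts the stored centers under that assumption. The function names are hypothetical, and this is an illustration of the bound rather than part of the protocol.

```python
import math

def centers_held(num_points, k):
    """Centers stored after num_points items, assuming one k-clustering per occupied level."""
    batches = num_points // k           # completed level-0 clusterings
    return k * bin(batches).count("1")  # one k-clustering per 1-bit of the batch count

def centers_upper_bound(num_points, k):
    """Worst case: every level up to floor(log2(num_points / k)) is occupied."""
    return k * (math.floor(math.log2(num_points / k)) + 1)

# Slide 14's setting: k = 5 and 25 points gives 5 batches (binary 101).
print(centers_held(25, 5), centers_upper_bound(25, 5))  # 10 15
```

In the non-private setting, sending these centers to the other party costs a number of values proportional to `centers_held`, which is where the logarithmic dependence on stream length arises.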
