
Monitoring Distributed Data Streams

Monitoring Distributed Data Streams. Assaf Schuster, Technion. A line of work joint with Tsachi Scharfmann (Technion) and Daniel Keren (Haifa U.). The Distributed Systems Laboratory http://dsl.cs.technion.ac.il. High-performance clusters (DSM, Infiniband)


Presentation Transcript


  1. Monitoring Distributed Data Streams • Assaf Schuster, Technion • A line of work joint with Tsachi Scharfmann (Technion) and Daniel Keren (Haifa U.) • Israel Innovation Summit

  2. The Distributed Systems Laboratory http://dsl.cs.technion.ac.il • High-performance clusters (DSM, Infiniband) • Grid: research, development (Condor), production systems (Superlink, EGEE) • Mobile, ad-hoc, wireless networks • Sensor networks • Peer-to-peer systems • Knowledge extraction from distributed data/streams • In-core parallelism, multicore, multithreading, programming paradigms, debugging • Sponsors: EC, MOS, ISF, IDF, TAMAS, Intel, IBM, Microsoft, France Telecom, Voltaire, Mellanox, others

  3. Today’s Problem Definition • A set of distributed data streams • Example: a sensor network • A data vector is collected from each stream • Each stream is infinite, so data is taken over moving/jumping windows • Given: a function over the average of the data vectors • Given: a predetermined threshold • Question: did the function value cross the threshold? • Example 1: counting, frequency count, average (e.g. temperature) • Sum over all data elements and all streams

  4. Example 2: Monitoring Air Quality • Sensors monitor the concentration of air pollutants. • Each sensor holds a data vector with the measured concentrations of pollutants (CO2, SO2, O3, etc.). • A function of the average data vector determines the Air Quality Index (AQI). • Raise an alert when the AQI exceeds a given threshold.

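  5. Example 3: Variance Alert • Sensors monitor the temperature in a server room (machine room, conference room, etc.) • Ensure uniform temperature: monitor the variance of the readings • Raise an alert when the variance exceeds a threshold • Temperature readings by n sensors: $x_1, \ldots, x_n$ • Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$ • The average data vector is $v = \frac{1}{n}\sum_{i=1}^{n} v_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i\right)^T$ • Var(all sensors) $= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$, i.e. the first component of $v$ minus the square of the second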
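
The variance alert works because the population variance equals the mean of the squares minus the square of the mean, so it is a function of the average of the per-sensor vectors (x_i^2, x_i). A minimal Python sketch of that identity follows; the function names and the sample readings are illustrative assumptions, not part of the talk.

```python
def local_vector(x):
    """Data vector held by one sensor: (x^2, x)."""
    return (x * x, x)


def average_vector(vectors):
    """Component-wise average of the sensors' data vectors."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(2))


def variance_from_average(v):
    """Population variance: mean of squares minus square of mean."""
    return v[0] - v[1] ** 2


# Hypothetical temperature readings from three sensors.
readings = [20.0, 22.0, 24.0]
v = average_vector([local_vector(x) for x in readings])
variance = variance_from_average(v)

# Hypothetical alert threshold, for illustration only.
threshold = 2.0
alert = variance > threshold
```

Only the two-component vectors need to be averaged; the raw readings never have to be compared with one another directly.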

  6. Example 4 (running example): Distributed Feature Selection • A collaborative, distributed spam mail filtering system. • A mail server receives a stream of positive and negative examples. • Select a set of features (words) to be used to build a spam classifier. • A feature is good if its information gain is above a threshold.

  7. Distributed Calculation of Info Gain? • Each server maintains a local contingency table for each feature. • Is the info gain of the global contingency table above the threshold? • The information gain of the average contingency table cannot be derived from the information gains of the individual tables. • Slide example: IG(C1) = 1, IG(C2) = 1
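
To see why local information gains say little about the global one, here is a small Python sketch. The two contingency tables are hypothetical, chosen only so that each has an information gain of 1 bit; they are not taken from the slides. Rows are feature present/absent, columns are spam/non-spam.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def info_gain(table):
    """Information gain of the class given the feature.

    table[i][j] is the count of examples with feature value i and class j.
    """
    total = sum(sum(row) for row in table)
    class_counts = [sum(col) for col in zip(*table)]
    conditional = sum(
        (sum(row) / total) * entropy(row) for row in table if sum(row) > 0
    )
    return entropy(class_counts) - conditional


def average_tables(tables):
    """Cell-wise average of equally shaped contingency tables."""
    n = len(tables)
    rows, cols = len(tables[0]), len(tables[0][0])
    return [[sum(t[i][j] for t in tables) / n for j in range(cols)]
            for i in range(rows)]


# Hypothetical local tables: the feature predicts the class perfectly on
# each server, but in opposite directions.
c1 = [[10, 0], [0, 10]]
c2 = [[0, 10], [10, 0]]

ig1 = info_gain(c1)
ig2 = info_gain(c2)
ig_avg = info_gain(average_tables([c1, c2]))
```

In this constructed case both local gains are 1 while the gain of the averaged table is 0, so no rule based only on the local gain values could recover the global answer.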

  8. Previous Solutions • Naïve Algorithms • All data is moved to a central place • Communication overhead • CPU overhead • Power overhead • Privacy issues • Can we do better?

  9. A Novel Geometric Approach [Forthcoming SIGMOD 2006] • Coloring the vector space • Grey: function > threshold • White: function <= threshold • Goal: determine the color of the global data vector (the average). • Observation: the average lies in the convex hull of the streams' data vectors • If the convex hull is monochromatic, then the average has the same color • How do we know the convex hull is monochromatic? • Without global/central knowledge

  10. Distributively Bounding the Convex Hull • A reference point is known to all streams • Each stream constructs a ball • Theorem: the convex hull is bounded by the union of the balls
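
The covering theorem can be spot-checked numerically. The sketch below assumes that each stream's ball has the segment from the shared reference point to that stream's own point as its diameter; the slide does not spell out the construction, so treat this placement, the helper names, and the random test data as assumptions. It samples points of the convex hull and checks that each one falls inside at least one ball.

```python
import math
import random


def ball(reference, point):
    """Ball whose diameter is the segment from reference to point."""
    center = [(r + p) / 2 for r, p in zip(reference, point)]
    radius = math.dist(reference, point) / 2
    return center, radius


def in_any_ball(x, balls, eps=1e-9):
    """True if x lies in at least one ball (eps absorbs rounding error)."""
    return any(math.dist(x, center) <= radius + eps
               for center, radius in balls)


def random_convex_combination(points, rng):
    """Random point of the convex hull of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)]


rng = random.Random(0)
reference = [0.0, 0.0]
# Hypothetical per-stream points, one per stream.
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(5)]
balls = [ball(reference, p) for p in points]

# Sample the hull of the reference point and the stream points.
hull_vertices = [reference] + points
all_covered = all(
    in_any_ball(random_convex_combination(hull_vertices, rng), balls)
    for _ in range(2000)
)
```

Each ball is built from the reference point and one stream's local data only, which is what lets every stream test its own part of the bound without central knowledge.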

  11. Basic Algorithm • Set the reference point to the initial average • Drift: the difference between the current local data and the reference • The drift is the diameter of the ball • If a ball becomes non-monochromatic, recalculate the average • The new average is used as the new reference • All drifts become zero
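
The steps of the basic algorithm can be sketched as a small simulation. This is a rough illustration under stated assumptions, not the authors' implementation: the monitored function is taken to be the Euclidean norm (chosen because its minimum and maximum over a ball have a closed form), the ball spans from the reference point to the reference plus the stream's change since the last recalculation, and `monitor`, `ball_is_monochromatic`, and the sample data are hypothetical.

```python
import math


def ball_is_monochromatic(reference, drift_point, threshold):
    """Check that f(v) = ||v|| stays on one side of threshold over the ball.

    The ball has the segment reference -> drift_point as its diameter.
    """
    center = [(r + p) / 2 for r, p in zip(reference, drift_point)]
    radius = math.dist(reference, drift_point) / 2
    norm_c = math.dist(center, [0.0] * len(center))
    lowest = max(0.0, norm_c - radius)   # min of ||v|| over the ball
    highest = norm_c + radius            # max of ||v|| over the ball
    return lowest > threshold or highest <= threshold


def average(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def monitor(initial, rounds, threshold):
    """Return how many times the global average had to be recalculated.

    initial: one local vector per stream.
    rounds: list of rounds, each giving the new local vector of every stream.
    """
    synced = [list(v) for v in initial]   # local data at the last sync
    reference = average(synced)           # reference point = initial average
    syncs = 0
    for locals_now in rounds:
        violated = False
        for last, now in zip(synced, locals_now):
            # Reference shifted by this stream's change since the last sync.
            drift_point = [e + (c - s)
                           for e, c, s in zip(reference, now, last)]
            if not ball_is_monochromatic(reference, drift_point, threshold):
                violated = True
                break
        if violated:
            # Recalculate the average; it becomes the new reference and
            # every drift returns to zero.
            synced = [list(v) for v in locals_now]
            reference = average(synced)
            syncs += 1
    return syncs


# Hypothetical data: two streams, three rounds, threshold 2.
initial = [[1.0, 0.0], [1.0, 0.0]]
rounds = [
    [[1.2, 0.0], [0.9, 0.0]],   # small drifts: balls stay white
    [[4.0, 0.0], [1.0, 0.0]],   # first stream's ball crosses the threshold
    [[4.1, 0.0], [1.0, 0.0]],   # after the sync, balls are grey again
]
syncs = monitor(initial, rounds, threshold=2.0)
```

Because every ball contains the reference point, all monochromatic balls share the reference point's color, so no communication is needed while every local check passes.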

  12. Reuters Corpus (RCV1-v2) • 800,000+ news stories • Aug 20 1996 to Aug 19 1997 • Identify corporate/industrial stories • n=10

  13. Trade-off: Accuracy vs. Performance • Inefficiency arises when the value of the function on the average is close to the threshold • Performance can be improved at the cost of a less accurate result: • Set an error margin around the threshold value

  14. Balancing • Calculating the average globally is costly • It is often possible to average only some of the data vectors.

  15. Scalability • The number of messages per stream is constant.

  16. Questions?

  17. Window Size

  18. Simultaneous Features
