Real-Time Querying of Live and Historical Stream Data


    Presentation Transcript
    1. Real-Time Querying of Live and Historical Stream Data Joe Hellerstein UC Berkeley

    2. Joint Work • Fred Reiss, UC Berkeley (IBM Almaden) • Kurt Stockinger, Kesheng Wu, Arie Shoshani, Lawrence Berkeley National Lab

    3. Outline • A challenging stream query problem • Real-world example: US DOE network monitoring • Open-Source Components • Stream Query Engine: TelegraphCQ • Data Warehousing store: FastBit • Performance Study • Stream Analysis, Load, Lookup • Handling Bursts: Data Triage

    4. Outline • A challenging stream query problem • Real-world example: US DOE network monitoring • Open-Source Components • Stream Query Engine: TelegraphCQ • Data Warehousing store: FastBit • Performance Study • Stream Analysis, Load, Lookup • Handling Bursts: Data Triage

    5. Agenda • Study a practical application of stream queries • High data rates • Data-rich: needs to consult “history” • Obvious settings • Financial • System Monitoring • Keep it real

    6. DOE Network Monitoring • U.S. Department of Energy (DOE) runs a nationwide network of laboratories • Including our colleagues at LBL • Labs send data over a number of long-haul networks • DOE is building a nationwide network operations center • Need software to help operators monitor network security and reliability

    7. Monitoring infrastructure

    8. Challenges • Live Streams • Continuous queries over unpredictable streams • Archival Streams • Load/index all data • Access on demand as part of continuous queries • Open source

    9. Outline • A challenging stream query problem • Real-world example: US DOE network monitoring • Open-Source Components • Stream Query Engine: TelegraphCQ • Data Warehousing store: FastBit • Performance Study • Stream Analysis, Load, Lookup • Handling Bursts: Data Triage

    10. Telegraph Project • 1999–2006, joint with Mike Franklin • An “adaptive dataflow” system • v1 in 2000: Java-based • Deep-Web Bush/Gore demo presages live web mashups • v2: TelegraphCQ (TCQ), rewrite of PostgreSQL • External data & streams • Open source with active users, mostly in net monitoring • Commercialization at Truviso, Inc. • “Big” academic software • 2 faculty, 9 PhDs, 3 MS

    11. Some Key Telegraph Features • Eddies: Continuous Query Optimization • Reoptimize queries at any point in execution • FLuX: Fault-Tolerant Load-balanced eXchange • Cluster Parallelism with High Availability • Shared query processing • Query processing is a join of queries and data • Data Triage • Robust statistical approximation under stress • See http://telegraph.cs.berkeley.edu for more

    12. FastBit • Background • Vertically-partitioned relational store with bitmap indexes • Word-Aligned Hybrid (WAH) compression, tuned for CPU efficiency as well as disk bandwidth • LBL internal project since 2004 • Open source (LGPL) in 2008
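
    To make the WAH idea concrete, here is a minimal sketch of word-aligned hybrid encoding, assuming 32-bit words (a literal word carries 31 bitmap bits; a fill word run-length-encodes a run of identical all-zero or all-one 31-bit groups). Illustrative Python only, not FastBit's actual implementation:

    # Minimal WAH sketch: 32-bit words; literal words carry 31 bitmap bits,
    # fill words (MSB set) run-length-encode runs of identical 31-bit groups.
    GROUP = 31
    ALL_ONES = (1 << GROUP) - 1

    def wah_encode(bits):
        """bits: list of 0/1 values. Returns a list of 32-bit words."""
        padded = bits + [0] * (-len(bits) % GROUP)   # pad to 31-bit groups
        groups = [int("".join(map(str, padded[i:i + GROUP])), 2)
                  for i in range(0, len(padded), GROUP)]
        words, i = [], 0
        while i < len(groups):
            g = groups[i]
            if g == 0 or g == ALL_ONES:
                run = 1                              # count the fill run
                while i + run < len(groups) and groups[i + run] == g:
                    run += 1
                fill_bit = 1 if g == ALL_ONES else 0
                words.append((1 << 31) | (fill_bit << 30) | run)  # fill word
                i += run
            else:
                words.append(g)                      # literal word
                i += 1
        return words

    # A long run of zeros collapses into a single fill word:
    bitmap = [0] * (31 * 1000) + [1] + [0] * 30
    print([hex(w) for w in wah_encode(bitmap)])      # ['0x800003e8', '0x40000000']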

    13. Introduction to Bitmap Indexing • (Figure: a relational table alongside its bitmap index, one bitmap per distinct value per column; bit i of value v’s bitmap is 1 iff row i holds v)
        Column 1 values: 3 4 2 5 1 2 …   Column 2 values: 3 2 1 5 4 3 …
        Column 1 bitmaps (values 1–5): 000010…, 001001…, 100000…, 010000…, 000100…
        Column 2 bitmaps (values 1–5): 001000…, 010000…, 100001…, 000010…, 000100…

    14. Why Bitmap Indexes? • Fast incremental appending • One index per stream • No sorting or hashing of the input data • Efficient multidimensional range lookups • Example: Find number of sessions from prefix 192.168/16 with size between 100 and 200 bytes • Efficient batch lookups • Can retrieve entries for multiple keys in a single pass over the bitmap
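
    The properties above are easy to see in a toy, uncompressed equality-encoded bitmap index (illustrative Python with made-up names, not FastBit's API): an append sets one bit per column, and a multidimensional range lookup ORs value bitmaps within each column, then ANDs across columns.

    # Toy equality-encoded bitmap index; one bit vector (a Python int) per value.
    from collections import defaultdict

    class BitmapIndex:
        def __init__(self):
            self.bitmaps = defaultdict(int)   # value -> bit vector
            self.n_rows = 0

        def append(self, value):
            # Incremental append: set one bit; no sorting/hashing of old data.
            self.bitmaps[value] |= 1 << self.n_rows
            self.n_rows += 1

        def lookup_range(self, lo, hi):
            # OR the bitmaps of all values in [lo, hi].
            result = 0
            for v, bm in self.bitmaps.items():
                if lo <= v <= hi:
                    result |= bm
            return result

        def lookup_batch(self, keys):
            # Batch lookup: entries for many keys in one pass over the bitmaps.
            keys, result = set(keys), 0
            for v, bm in self.bitmaps.items():
                if v in keys:
                    result |= bm
            return result

    # Multidimensional range query = AND of per-column range lookups,
    # e.g. sessions from one source with size between 100 and 200 bytes:
    src, size = BitmapIndex(), BitmapIndex()
    for s, z in [(10, 150), (10, 500), (20, 120), (10, 180)]:
        src.append(s)
        size.append(z)
    print(bin(src.lookup_range(10, 10) & size.lookup_range(100, 200)))  # 0b1001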

    15. Outline • A challenging stream query problem • Real-world example: US DOE network monitoring • Open-Source Components • Stream Query Engine: TelegraphCQ • Data Warehousing store: FastBit • Performance Study • Stream Analysis, Load, Lookup • Handling Bursts: Data Triage

    16. The DOE dataset • 42 weeks, 08/2004 – 06/2005 • Est. 1/30 of DOE traffic • Projection: • 15K records/sec typical • 1.7M records/sec peak • TPC-C today • 4,092,799 tpmC (2/27/07) • i.e. 68.2K per sec • 1.5 orders of magnitude needed! (see the arithmetic check below) • And please touch 2 orders of magnitude more random data, too • But… append-only updates + streaming queries (temporal locality)
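
    A quick back-of-the-envelope check of that gap (my arithmetic, not from the slides):

    import math
    tpmC = 4_092_799              # TPC-C record, transactions per minute
    per_sec = tpmC / 60           # ~68.2K transactions/sec
    peak = 1_700_000              # projected peak stream rate, records/sec
    print(f"{per_sec:,.0f}/sec, gap {peak / per_sec:.1f}x "
          f"(~{math.log10(peak / per_sec):.1f} orders of magnitude)")
    # 68,213/sec, gap 24.9x (~1.4 orders of magnitude), i.e. roughly 1.5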

    17. Our Focus: Flagging abnormal traffic • When a network is behaving abnormally… • Notify network operators • Trigger in-depth analysis or countermeasures • How to detect abnormal behavior? • Analyze multiple aspects of live monitoring data • Compute relevant information about “normal” behavior from historical monitoring data • Compare current behavior against this baseline

    18. Example: “Elephants” • The query: • Find the k most significant sources of traffic on the network over the past t seconds. • Alert the network operator if any of these sources is sending an unusually large amount of traffic for this time of the day/week, compared with its usual traffic patterns.
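
    A rough sketch of the elephants logic (hypothetical names; in the real system the window half runs as a TelegraphCQ continuous query and the baseline comes from a FastBit lookup over archived streams):

    # Sketch only: find the top-k traffic sources in the window, flag those
    # far above their historical baseline for this time of day/week.
    from collections import Counter

    def find_elephants(window_records, k, baseline, threshold=3.0):
        """window_records: iterable of (src_ip, n_bytes) from the past t seconds.
        baseline: src_ip -> typical byte count (from historical data)."""
        traffic = Counter()
        for src_ip, n_bytes in window_records:
            traffic[src_ip] += n_bytes
        alerts = []
        for src_ip, total in traffic.most_common(k):      # top-k sources
            usual = baseline.get(src_ip, 0)
            if usual == 0 or total > threshold * usual:   # unusually large?
                alerts.append((src_ip, total, usual))
        return alerts

    window = [("10.0.0.1", 9_000_000), ("10.0.0.2", 200_000), ("10.0.0.1", 1_000_000)]
    print(find_elephants(window, k=2, baseline={"10.0.0.1": 500_000, "10.0.0.2": 300_000}))
    # [('10.0.0.1', 10000000, 500000)]  -- 20x its usual level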

    19. System Architecture

    20. Query Workload • Five monitoring queries, based on discussions with network researchers and practitioners • Each query has three parts: • Analyze flow record stream (TelegraphCQ query) • Retrieve and analyze relevant historical monitoring data (FastBit query) • Compare current behavior against baseline

    21. Query Workload Summary • Elephants • Find heavy sources of network traffic that are not normally heavy network users • Mice • Examine the current behavior of hosts that normally send very little traffic • Portscans • Find hosts that appear to be probing for vulnerable network ports • Filter out “suspicious” behavior that is actually normal • Anomaly Detection • Compare the current traffic matrix (all (source, destination) pairs) against past traffic patterns • Dispersion • Retrieve historical traffic data for sub-networks that exhibit suspicious timing patterns • Full queries are in the paper

    22. Best-Case Numbers • Single PC, dual 2.8GHz single-core Pentium 4, 2GB RAM, IDE RAID (60 MB/sec throughput) • TCQ performance up to 25K records/sec • Depends heavily on query, esp. window size • FastBit can load 213K tups/sec • NW packet trace schema • Depends on batch size: 10M tups per batch • FastBit can “fetch” 5M records/sec • 8 bytes of output per record only! 40 MB/sec, near RAID I/O throughput • Best end-to-end: 20K tups/sec • Recall desire of 15K tups/sec steady state, 1.7M tups/sec burst

    23. Streaming Query Processing

    24. Streaming Query Processing

    25. Index Insertion

    26. Index Insertion

    27. Index Lookup

    28. Index Lookup

    29. End-to-End Throughput

    30. End-to-End Throughput

    31. Summary of DOE Results • With a sufficiently large load window, FastBit can handle expected peak data rates • Streaming query processing becomes the bottleneck • Next step: Data Triage

    32. Outline • A challenging stream query problem • Real-world example: US DOE network monitoring • Open-Source Components • Stream Query Engine: TelegraphCQ • Data Warehousing store: FastBit • Performance Study • Stream Analysis, Load, Lookup • Handling Bursts: Data Triage

    33. Data Triage • Provision for the typical data rate • Fall back on approximation during bursts • But always do as much “exact” work as you can! • Benefits: • Monitor fast links with cheap hardware • Focus on query processing features, not speed • Graceful degradation during bursts • 100% result accuracy most of the time ICDE 2006

    34. Data Triage • Place a triage queue in front of each data source • Bursty data goes to the triage process first • Summarize excess tuples to prevent missing deadlines • (Figure: packets pass through initial parsing and filtering into the triage queue; relational tuples flow on to the query engine, while triaged tuples go through a summarizer whose summaries of the triaged tuples are also sent to the engine) ICDE 2006
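
    A bare-bones sketch of this front end, assuming a reservoir-sample summarizer (my simplification of the ICDE 2006 design; names are made up):

    # Tuples that fit in the queue reach the query engine exactly; excess
    # tuples are folded into a reservoir-sample summary instead of dropped.
    import random
    from collections import deque

    class TriageQueue:
        def __init__(self, capacity, reservoir_size):
            self.queue = deque()
            self.capacity = capacity
            self.reservoir = []                  # summary of triaged tuples
            self.reservoir_size = reservoir_size
            self.n_triaged = 0

        def offer(self, tup):
            if len(self.queue) < self.capacity:
                self.queue.append(tup)           # exact path
            else:
                self._summarize(tup)             # triage path: summarize, don't drop

        def _summarize(self, tup):
            # Classic reservoir sampling over the triaged tuples.
            self.n_triaged += 1
            if len(self.reservoir) < self.reservoir_size:
                self.reservoir.append(tup)
            else:
                j = random.randrange(self.n_triaged)
                if j < self.reservoir_size:
                    self.reservoir[j] = tup

        def poll(self):
            # The query engine pulls exact tuples as fast as it can process them.
            return self.queue.popleft() if self.queue else None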

    35. Data Triage • Query engine receives tuples and summaries • Use a shadow query to compute an approximation of the missing results • (Figure: inside the query engine, relational tuples feed the main query and summaries of triaged tuples feed the shadow query; a merge step combines the main query's output with summaries of the missing results for the user) ICDE 2006
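
    Continuing the sketch above, the merge step might combine the main query's exact aggregate with a shadow-query estimate scaled up from the reservoir sample (again my simplification, not the system's actual code):

    # Approximate a windowed count/sum by merging the main query's exact
    # result with a shadow-query estimate for the triaged tuples.
    def merged_count_and_sum(exact_tuples, reservoir, n_triaged, field):
        exact_count = len(exact_tuples)                    # main query (exact)
        exact_sum = sum(t[field] for t in exact_tuples)
        if reservoir:
            # Shadow query: scale the sample up to the population it summarizes.
            scale = n_triaged / len(reservoir)
            approx_sum = scale * sum(t[field] for t in reservoir)
        else:
            approx_sum = 0.0
        return exact_count + n_triaged, exact_sum + approx_sum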

    36. Read the paper for… • Provisioning • Where are the performance bottlenecks in this pipeline? • How do we mitigate those bottlenecks? • Implementation • How do we “plug in” different approximation schemes without modifying the query engine? • How do we build shadow queries? • Interface • How do we present the merged query results to the user? ICDE 2006

    37. Delay Constraints • (Figure: timeline with Window 1 and Window 2; all results from a window must be delivered by a fixed delay after the window ends) ICDE 2006

    38. Experiments • System • Data Triage implemented on TelegraphCQ • Pentium 3 server, 1.5 GB of memory • Data stream • Timing-accurate playback of real network traffic from the www.lbl.gov web server • Trace sped up 10x to simulate an embedded CPU

    select W.adminContact, avg(P.length) as avgLength,
           stdev(P.length) as stdevLength,
           wtime(*) as windowTime
    from Packet P [range '1 min' slide '1 min'], WHOIS W
    where P.srcIP > W.minIP and P.srcIP < W.maxIP
    group by W.adminContact
    limit delay to '10 seconds';

    ICDE 2006

    39. Experimental Results: Latency ICDE 2006

    40. Experimental Results: Accuracy • Compare accuracy of Data Triage with previous work • Comparison 1: Drop excess tuples • Both methods using 5-second delay constraint • No summarization ICDE 2006

    41. Experimental Results: Accuracy • Comparison 2: Summarize all tuples • Reservoir sample • 5 second delay constraint • Size of reservoir = number of tuples query engine can process in 5 sec ICDE 2006

    42. Conclusions & Issues • Stream Query + Append-Only Warehouse • Good match, can scale a long way at modest $$ • Data Triage combats the stream query bottleneck • Provision for the common case • Approximate on excess load during bursts • Keep approximation limited, extensible • Parallelism needed • See FLuX work for High Availability • Shah et al., ICDE ’03 and SIGMOD ’04 • vs. Google’s MapReduce • Query Optimization for streams? Adaptivity! • Eddies: Avnur & Hellerstein, SIGMOD ’00 • SteMs: Raman, Deshpande & Hellerstein, ICDE ’03 • STAIRs: Deshpande & Hellerstein, VLDB ’04 • Deshpande/Ives/Raman survey, F&T-DB ’07

    43. More? • http://telegraph.cs.berkeley.edu • F. Reiss and J. M. Hellerstein. “Declarative Network Monitoring with an Underprovisioned Query Processor.” ICDE 2006. • F. Reiss, K. Stockinger, K. Wu, A. Shoshani, and J. M. Hellerstein. “Enabling Real-Time Querying of Live and Historical Stream Data.” SSDBM 2007. • F. Reiss. “Data Triage.” Ph.D. thesis, UC Berkeley, 2007.