Integrating Distributed Data Streams

Alasdair J G Gray. Supervisors: M Howard Williams, Werner Nutt. 7th June 2007.


Presentation Transcript


  1. Integrating Distributed Data Streams • Alasdair J G Gray • Supervisors: M Howard Williams, Werner Nutt • 7th June 2007

  2. Overview • The problem • Limits of current technology • Proposed system • Architecture • Query Answering • Performance • Conclusions

  3. Streams of Data • Main sources: sensors • Characteristics: unbounded, append only, frequency • Managed by: sensor networks, network/Grid monitoring, ubiquitous/pervasive computing environments • (Diagram label: Reading)

  4. Streams are everywhere Internet

  5. Grid Monitoring • Resources supplied by various institutions • Resources publish status information • Scheduler must allocate jobs to resources • Bookkeeping tracks resource usage • Users track job progress • (Diagram labels: Grid, Monitoring data, Bookkeeping, Job progress)

  6. Requirements Ability to: • Publish distributed streams of data • Query multiple streams with no knowledge of • Existence of source streams • Location of individual streams • Access methods to individual streams • Scale to large numbers of users and sources

  7. Data Integration System • Several distributed data sources • Users send query to Mediator • Mediator • Translates user query into sub-queries • Combines results of sub-queries • Only for stored data sources • (Diagram: Mediator over DB1, DB2, DB3)

  8. Stream Management System • Data streams into the server • Server applies long-standing queries to the streams • Answers streamed out • Users need to know which streams exist • Centralised server

  9. Solution • Need a system that combines: • Ability to access multiple sources without specific source knowledge (data integration) • Ability to process streams of data (stream processing) • A Stream Integration System!

  10. Stream Integration System • Producers publish streams of data • Consumers query for streams of data • Registry matches consumer requests with publications • (Diagram: two Consumers, a Registry, three Producers)

  11. Publishing Monitoring Data • Stream data can be represented in terms of relations with • Keys: “what” and “where” • Measurements: the “value” • Timestamps: “when” For example, Network ThroughPut • One reading is a tuple in the relation
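
The split into keys, measurement and timestamp can be sketched as a small record type. This is only an illustration: the attribute names `from_site`, `tool`, `psize` and `latency` echo the conditions used on later slides, while `to_site`, the choice of which attributes form the key, and the sample values are assumptions, since the slide's example table did not survive in the transcript.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Reading:
    """One tuple of a (hypothetical) network throughput relation."""
    # Keys: "what" and "where" (assumed attribute set)
    from_site: str
    to_site: str
    tool: str
    psize: int
    # Measurement: the "value"
    latency: float
    # Timestamp: "when"
    timestamp: float

    def key(self) -> Tuple[str, str, str, int]:
        """Return the key part, which identifies the measured channel."""
        return (self.from_site, self.to_site, self.tool, self.psize)


# Made-up sample reading, for illustration only.
reading = Reading("hw", "ral", "ping", 1024, 8.2, 1181212800.0)
channel = reading.key()
```

Two readings with the same `key()` describe the same channel at different times, which is the notion the weak-order requirement on a later slide relies on.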

  12. Consuming Monitoring Data • Users are interested in how the grid changes over time. For example, • Latency for large packets sent from hw • Links with a low latency as recorded by the PingER tool • These can be expressed as SQL selection queries
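
As a rough illustration, the two example interests can be written as SQL selections and tried against an in-memory SQLite table. The table name `ntp`, its column names, and the sample rows are assumptions made for this sketch; the thresholds (`psize >= 1024`, `latency <= 10.0`) are taken from the q1 and q2 conditions on the query planning slide. A stored table answers each query once, whereas the system described in the slides answers such queries continuously over streams.

```python
import sqlite3

# Hypothetical schema: keys (from_site, to_site, tool, psize),
# measurement (latency), timestamp (ts).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ntp (from_site TEXT, to_site TEXT, tool TEXT, "
    "psize INTEGER, latency REAL, ts REAL)"
)

# Made-up sample readings.
rows = [
    ("hw", "ral", "ping", 1024, 8.2, 1.0),
    ("hw", "ral", "udp", 512, 3.1, 2.0),
    ("ral", "hw", "ping", 2048, 12.5, 3.0),
    ("an", "hw", "ping", 64, 4.0, 4.0),
]
conn.executemany("INSERT INTO ntp VALUES (?, ?, ?, ?, ?, ?)", rows)

# q1: latency for large packets sent from hw
Q1 = "SELECT * FROM ntp WHERE from_site = 'hw' AND psize >= 1024"
# q2: links with a low latency as recorded by the ping tool
Q2 = "SELECT * FROM ntp WHERE tool = 'ping' AND latency <= 10.0"

q1_rows = conn.execute(Q1).fetchall()
q2_rows = conn.execute(Q2).fetchall()
```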

  13. What is an Answer to a Query? • Global relations contain no tuples (virtual relation) • Need to translate into query over sources • An answer stream should be • Sound • Complete • Duplicate free • Weakly ordered: all tuples that share the same key value will be in timestamp order • Order in general is difficult in a distributed setting • Weak order sufficient for more complex queries such as aggregates
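
Two of the listed properties, duplicate freeness and weak order, are easy to check on a finite prefix of a stream. The sketch below assumes each stream element is a `(key, timestamp)` pair; the function names are illustrative and not part of the system described.

```python
from typing import Dict, Hashable, Iterable, Tuple

Element = Tuple[Hashable, float]  # (key value, timestamp)


def is_duplicate_free(stream: Iterable[Element]) -> bool:
    """True if no element occurs twice in the stream prefix."""
    seen = set()
    for element in stream:
        if element in seen:
            return False
        seen.add(element)
    return True


def is_weakly_ordered(stream: Iterable[Element]) -> bool:
    """True if tuples sharing a key value appear in timestamp order."""
    last_seen: Dict[Hashable, float] = {}
    for key, timestamp in stream:
        if key in last_seen and timestamp < last_seen[key]:
            return False
        last_seen[key] = timestamp
    return True


# Not globally ordered (key "b" at time 1 follows key "a" at time 3),
# but each key on its own is in timestamp order.
interleaved = [("a", 3.0), ("b", 1.0), ("a", 8.0)]
# The same key arriving out of order violates weak order.
swapped = [("a", 8.0), ("a", 3.0)]
```

Here `is_weakly_ordered(interleaved)` holds even though the prefix is not globally sorted, while `is_weakly_ordered(swapped)` does not.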

  14. Query Planning: Consumer Query • q1: from='hw' ∧ psize≥1024 • q2: tool='ping' ∧ latency≤10.0 • Satisfiability used to find relevant producers, e.g. for q1: from='hw' ∧ psize≥1024 ∧ from='hw' ∧ tool='udp' (with S1) and from='hw' ∧ psize≥1024 ∧ from='ral' ∧ tool='ping' (with S3) • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'
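
The relevance test on this slide can be sketched as a satisfiability check on the conjunction of the query condition and a producer's condition. The code below is a minimal sketch, not the system's planner: it assumes conditions are conjunctions of attribute-versus-constant comparisons, that ordered attributes range over a dense domain, and the names `satisfiable` and `relevant_producers` are made up for the example.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, float]
Atom = Tuple[str, str, Value]  # (attribute, operator, constant)


def _attr_satisfiable(constraints: List[Tuple[str, Value]]) -> bool:
    """Check all comparisons that mention one attribute."""
    eqs = {v for op, v in constraints if op == "="}
    if len(eqs) > 1:
        return False  # e.g. from='hw' and from='ral'
    lowers = [(v, op == ">") for op, v in constraints if op in (">", ">=")]
    uppers = [(v, op == "<") for op, v in constraints if op in ("<", "<=")]
    if eqs:
        x = next(iter(eqs))
        return (all(x > v if strict else x >= v for v, strict in lowers)
                and all(x < v if strict else x <= v for v, strict in uppers))
    if not lowers or not uppers:
        return True
    lo, lo_strict = max(lowers)  # tightest lower bound, strict wins ties
    hi, hi_strict = min(uppers, key=lambda b: (b[0], not b[1]))
    if lo < hi:
        return True  # assumes a dense domain between lo and hi
    return lo == hi and not lo_strict and not hi_strict


def satisfiable(atoms: List[Atom]) -> bool:
    """True if the conjunction of comparisons has a solution."""
    by_attr: Dict[str, List[Tuple[str, Value]]] = {}
    for attr, op, val in atoms:
        by_attr.setdefault(attr, []).append((op, val))
    return all(_attr_satisfiable(c) for c in by_attr.values())


def relevant_producers(query: List[Atom],
                       producers: Dict[str, List[Atom]]) -> List[str]:
    """Producers whose condition is jointly satisfiable with the query."""
    return sorted(n for n, cond in producers.items()
                  if satisfiable(query + cond))


PRODUCERS = {
    "S1": [("from", "=", "hw"), ("tool", "=", "udp")],
    "S2": [("from", "=", "hw"), ("tool", "=", "ping")],
    "S3": [("from", "=", "ral"), ("tool", "=", "ping")],
    "S4": [("from", "=", "ral"), ("tool", "=", "udp")],
    "S5": [("from", "=", "an"), ("tool", "=", "ping")],
}
Q1 = [("from", "=", "hw"), ("psize", ">=", 1024)]
Q2 = [("tool", "=", "ping"), ("latency", "<=", 10.0)]

q1_sources = relevant_producers(Q1, PRODUCERS)
q2_sources = relevant_producers(Q2, PRODUCERS)
```

With the conditions from the slide, `q1_sources` contains S1 and S2, and `q2_sources` contains S2, S3 and S5.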

  15. Scalability is an Issue • Problem: every consumer contacting every producer of interest does not scale • Even a small Grid of fewer than a dozen sites has problems • Grids may contain thousands of resources, for example the Large Hadron Collider Computing Grid (LCG)

  16. Republishers Allow the System to Scale • A republisher • Consumes answers to a selection query • Merges "trickles" into streams • Publishes • Answer stream • Latest-state answer • History • Problem: choice in where to obtain information • (Diagram: Republisher above Producer S1 and Producer S2)
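
The three things a republisher publishes can be sketched as follows. This is a toy in-process model under stated assumptions: readings are `(key, timestamp, value)` triples, the selection query is a Python predicate, and the class and method names are invented for the illustration rather than taken from the system.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Reading = Tuple[Hashable, float, float]  # (key, timestamp, value)


class Republisher:
    """Merges readings from several producers and republishes them."""

    def __init__(self, condition: Callable[[Reading], bool]) -> None:
        self.condition = condition  # stands in for the selection query
        self.subscribers: List[Callable[[Reading], None]] = []
        self.latest_state: Dict[Hashable, Reading] = {}
        self.history: List[Reading] = []

    def subscribe(self, callback: Callable[[Reading], None]) -> None:
        """Register a consumer of the answer stream."""
        self.subscribers.append(callback)

    def consume(self, reading: Reading) -> None:
        """Accept one reading from any producer."""
        if not self.condition(reading):
            return
        self.history.append(reading)  # history answer
        key, timestamp, _ = reading
        current = self.latest_state.get(key)
        if current is None or timestamp >= current[1]:
            self.latest_state[key] = reading  # latest-state answer
        for callback in self.subscribers:  # answer stream
            callback(reading)


# Keys are (from, tool) pairs here; values are made-up latencies.
rep = Republisher(lambda r: r[0][0] == "hw")
received: List[Reading] = []
rep.subscribe(received.append)

rep.consume((("hw", "udp"), 1.0, 7.5))    # trickle from producer S1
rep.consume((("hw", "ping"), 2.0, 3.0))   # trickle from producer S2
rep.consume((("ral", "ping"), 2.5, 9.0))  # filtered out by the condition
rep.consume((("hw", "ping"), 4.0, 8.0))
```

A consumer that subscribes to `rep` needs one connection instead of one per producer, which is the scaling argument the slide makes.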

  17. Query Planning in the Presence of Republishers • q1: from='hw' ∧ psize≥1024 • q2: tool='ping' ∧ latency≤10.0 • Meta query plan contains choice • Query plan uses one of R1 or R3 • Find all relevant publishers • Rank according to data provided • Republishers: R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

  18. Weak Order is not Guaranteed • q2: tool='ping' ∧ latency≤10.0 • Tuples for same channel • (3) published before (8) • Arrive at consumer in wrong order • (Diagram labels: S2: from='hw' ∧ tool='ping'; latency≤5.0; latency>5.0; slow link; tuples (3) and (8))

  19. Generating Well Formed Query Plans • A publisher is relevant for a global query if • Conditions are satisfiable, and • All measurements that agree on their key values come from the same publisher • The measurement condition can be checked using entailment. • Previous example was well formed.

  20. Query Re-Planning • Queries are long-lived • Set of publishers can change • Query plans should reflect changes

  21. How does a new Republisher affect our Consumers? • q1: from='hw' ∧ psize≥1024 • q2: tool='ping' ∧ latency≤10.0 • Find consumers for which R4 is relevant • Compare R4 to publishers in Meta Query Plan • Republishers: R4: TRUE; R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

  22. Planning a Republisher Query • Applying Consumer planning techniques results in a problem • Problem: • Hierarchy contains cycles • Republishers disconnected from Producers • Republishers: R4: TRUE; R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

  23. Desirable Properties for a Hierarchy • Correctness: streams answer queries • Cycle freeness: loops can lead to duplicates • Uniqueness: hierarchy defined for a set of publishers • Local planning: Publishers and Consumers only need to communicate with the Registry

  24. Generating Well Formed Hierarchies • Need a stricter relevance criterion • R1 can consume from R2 iff • Everything R2 offers is relevant to R1, and • R1 offers something R2 does not. • Can be checked by entailment • Ensures • No loops in the hierarchy • Republishers connected to the Producers
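
One way to read the stricter criterion is as strict containment between republisher conditions, checked by entailment in both directions. The sketch below is an interpretation under stated assumptions, not the author's algorithm: it handles only conjunctions of equalities (with the empty conjunction standing for TRUE), it is applied only between republishers, and the function names are invented for the example.

```python
from typing import Dict, List

Condition = Dict[str, str]  # attribute -> constant; {} means TRUE


def entails(c1: Condition, c2: Condition) -> bool:
    """c1 entails c2 if every equality required by c2 also holds in c1."""
    return all(c1.get(attr) == val for attr, val in c2.items())


def can_consume_from(r1: Condition, r2: Condition) -> bool:
    """R1 can consume from R2 under the assumed reading of the criterion.

    Everything R2 offers is relevant to R1: cond(R2) entails cond(R1).
    R1 offers something R2 does not: cond(R1) does not entail cond(R2).
    """
    return entails(r2, r1) and not entails(r1, r2)


REPUBLISHERS: Dict[str, Condition] = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
    "R4": {},  # TRUE
}


def sources_for(name: str) -> List[str]:
    """Republishers that the named republisher may consume from."""
    return sorted(other for other, cond in REPUBLISHERS.items()
                  if can_consume_from(REPUBLISHERS[name], cond))


hierarchy = {name: sources_for(name) for name in REPUBLISHERS}
```

Because the relation is a strict containment, it can never hold in both directions, so no two republishers consume from each other; in particular `can_consume_from(REPUBLISHERS["R1"], REPUBLISHERS["R4"])` is false, which agrees with the later slide's remark that R4 is not relevant for R1.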

  25. Re-Planning, Re-Visited! • Stricter relevance criterion • Republishers only consume from publishers below them • R4 is not relevant for R1 • Republishers: R4: TRUE; R1: from='hw'; R2: from='ral'; R3: from='hw' ∧ tool='ping' • Producers: S1: from='hw' ∧ tool='udp'; S2: from='hw' ∧ tool='ping'; S3: from='ral' ∧ tool='ping'; S4: from='ral' ∧ tool='udp'; S5: from='an' ∧ tool='ping'

  26. Republishers' Effect on Latency • Tuple published by producer • Tuple passes through some number of republishers • Tuple arrives at consumer • Republishers add to the time taken!

  27. Performance Measure

  28. Conclusions • Distributed streams of data are increasingly being made available • Distributed users interested in multiple streams • Developed a system for • Publishing distributed data streams • Querying multiple stream sources without source knowledge • Republishers required to allow system to scale

  29. Future Work • Increase complexity of query language • Integrate stored and stream sources
