
The PIER Relational Query Processing System



Presentation Transcript


  1. The PIER Relational Query Processing System Boon Thau Loo & Ryan Huebsch {boonloo, huebsch}@eecs.berkeley.edu Also developed with: Matthew Harren Faculty: Joe Hellerstein, Scott Shenker, Ion Stoica Group Email: p2p@db.cs.berkeley.edu UC Berkeley – DB Seminar 2/8/02

  2. What is PIER? Just another nonsense acronym? • PIER = Peer-to-Peer Infrastructure for Information Exchange and Retrieval • PIER = A relational query engine that runs on top of an existing P2P (semi-structured) network and its data • PIER = A better Napster, Gnutella, Freenet or your other favorite file-sharing software • PIER = An answer to the world’s information problems… well probably not

  3. Why? What’s wrong with existing P2P systems? • Existing P2P systems were designed mainly by networking people… • Excel in: • Fault tolerance • Flexibility • Decentralized (ad hoc, multiple administrators, limited controls) • Lack: • Real query facilities… only support keyword searching • Guarantees/promises (remember the networking motto… “best effort”)

  4. Why? What’s wrong with existing database systems? • Existing database systems… • Excel in: • Precise, correct answers • Powerful query facilities (SQL) • Guarantees and concurrency control (ACID Transactions) • Optimization… “don’t tell us how to do it… we tell you” • Lack: • Flexibility (servers come and go, loose structure) • Decentralized administration • Fault tolerance (can tolerate only a few problems, if any; the DB answer to a problem… “it didn’t work, try again later”)

  5. What about Napster, Gnutella, and the rest? • Napster is centralized • Subject to failure • Subject to censorship (in this case the music industry) • Requires administration • Gnutella does not scale • Queries only asked in nearby network (limited hop count) • Research projects: CAN, Chord (CFS, INS, TWINE), Pastry, Tapestry (OceanStore), etc. • Lack the query facilities

  6. What PIER IS and IS NOT about… • IS NOT ABOUT: Performance • In most situations a centralized solution could be faster… • IS ABOUT: Decentralized Features • No administrator, anonymity, shared resources, tolerates failures, resistant to censorship… • IS NOT (Just) ABOUT: Hype • Buzzwords, catch phrases, and empty promises • IS ABOUT: • Uniting the networking community and the database community to get the best of both

  7. General PIER Architecture • Based on Distributed Hash Tables (DHTs) to get many good networking properties • A query processor is built on top • Note: the data is stored separately from the query engine, which is not standard DB practice!

  8. DHTs – Overview • Wide-area distributed hash table that supports finding data in an efficient manner • Data has an ID that is used to map the item to a node (or bucket) that stores it • Lookups/inserts become a routing problem • Number of implementations: • CAN – UC Berkeley/ICSI • Chord – MIT/UC Berkeley • Tapestry – UC Berkeley, based on Plaxton’s algorithm (UTexas) • And many others, all with similar APIs & functionality

  9. DHTs – CAN in particular • Content Addressable Network (CAN) is one of many DHT algorithms • Overlay network – uses underlying TCP • Self-forming – only need to find one node to join, no central administration needed • Fault tolerant – adapts to many failures • Load balancing • Scalable – each node needs only a little state about its neighbors (O(d) cost), and primitives cost O(n^(1/d)) • d = # of dimensions, n = # of nodes
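As a rough worked example (the numbers are illustrative, not from the talk): with d = 2 and n = 1,000,000 nodes, a lookup takes on the order of n^(1/2) = 1,000 hops, while d = 10 cuts that to roughly n^(1/10) ≈ 4 hops, at the price of each node tracking O(10) neighbors instead of O(2).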

  10. CAN Overview • Establishes a d-dimensional Cartesian coordinate space • Nodes randomly choose a point in the space to join • The node currently responsible for that space splits, and the space is divided equally • Nodes store information about their spatial neighbors in the coordinate space • A node can refuse to split if it thinks its neighbors should be split instead, which keeps the load balanced • [Figure: a 2-D coordinate space (axes 0–7) being split among Nodes 1–6 as they join]
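As a loose illustration of the split step, here is a minimal sketch in Java; the Zone class, the 2-D layout, and the split-along-the-longer-side rule are assumptions for this example, not CAN’s or PIER’s actual code. The data owned by the half that is handed off would move to the joining node.

```java
// Hypothetical sketch: when a new node picks a random point, the current owner of
// that zone splits its zone in half and hands one half to the newcomer.
public class Zone {
    double xLo, xHi, yLo, yHi;          // 2-D zone boundaries

    Zone(double xLo, double xHi, double yLo, double yHi) {
        this.xLo = xLo; this.xHi = xHi; this.yLo = yLo; this.yHi = yHi;
    }

    boolean contains(double x, double y) {
        return x >= xLo && x < xHi && y >= yLo && y < yHi;
    }

    /** Split this zone in half along its longer side; keep one half, return the other. */
    Zone split() {
        if (xHi - xLo >= yHi - yLo) {
            double mid = (xLo + xHi) / 2;
            Zone other = new Zone(mid, xHi, yLo, yHi);
            xHi = mid;                  // this node keeps the lower half
            return other;               // the joining node takes the upper half
        } else {
            double mid = (yLo + yHi) / 2;
            Zone other = new Zone(xLo, xHi, mid, yHi);
            yHi = mid;
            return other;
        }
    }
}
```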

  11. CAN Overview II • Data is represented as a key/value pair • The key or resource ID (RID) is hashed into the coordinate space using a hash function h(x), such as SHA-1 • Data is stored at the node responsible for that space • Example: Item 1, h(RID1) = <3,2> • [Figure: the same coordinate space with item I1 stored at its hashed coordinate]
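A tiny sketch of the hash-to-coordinate idea, assuming 2 dimensions and the 0–8 space from the figure; the method names and the exact byte-to-coordinate mapping are illustrative guesses, not CAN’s actual scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch only: one plausible way to hash a resource ID into a 2-D CAN coordinate.
public class CoordHash {
    public static double[] toCoord(String namespace, String rid) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest((namespace + "|" + rid).getBytes(StandardCharsets.UTF_8));
        // Use two digest bytes per dimension, scaled into [0, 8).
        double x = ((digest[0] & 0xFF) << 8 | (digest[1] & 0xFF)) / 65536.0 * 8.0;
        double y = ((digest[2] & 0xFF) << 8 | (digest[3] & 0xFF)) / 65536.0 * 8.0;
        return new double[] { x, y };
    }

    public static void main(String[] args) throws Exception {
        double[] c = toCoord("images", "RID1");
        System.out.printf("h(RID1) = <%.1f, %.1f>%n", c[0], c[1]);
    }
}
```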

  12. CAN Overview III • Data is located by ‘hopping’ from neighbor to neighbor until the right node is found • Example: Node 6 wants I1 • Determine the coordinates: h(RID1) = <2,3> • Determine which neighbor is closest: N4 or N2 • Hop to that neighbor and repeat • Note that the space wraps (it is really a torus), so Node 6 has Nodes 2, 3, 4, and 5 as neighbors, but to keep this simple example working we ignore that detail • [Figure: the same coordinate space showing the lookup path toward item I1]
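The hop-by-hop step can be sketched as a greedy choice of the neighbor closest to the target coordinate (ignoring the torus wrap, as the slide does); the Neighbor bookkeeping and method names below are assumptions, not CAN’s real routing code.

```java
import java.util.List;

// Greedy next-hop selection sketch for a 2-D CAN-style overlay.
public class GreedyRouter {
    public static class Neighbor {
        final String nodeId;
        final double centerX, centerY;    // center of the neighbor's zone
        Neighbor(String id, double cx, double cy) { nodeId = id; centerX = cx; centerY = cy; }
    }

    /** Pick the neighbor whose zone center is closest to the target coordinate. */
    public static Neighbor nextHop(List<Neighbor> neighbors, double tx, double ty) {
        Neighbor best = null;
        double bestDist = Double.MAX_VALUE;
        for (Neighbor n : neighbors) {
            double dx = n.centerX - tx, dy = n.centerY - ty;
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = n; }
        }
        return best;                      // repeat hop-by-hop until the owner is reached
    }
}
```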

  13. CAN Overview IV • Cost of operations is based on the path length from node to node: O(n^(1/d)) • Basic API • Publish(namespace, RID, object) • Lookup(namespace, RID) • Multicast(group, object) • Applications can only do EXACT lookups… that just isn’t enough for real users!
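Roughly, the three calls above could look like the following Java interface; the exact signatures and types are assumptions for illustration, since the slide only names the operations.

```java
// Hedged sketch of the DHT-facing API described on the slide; not CAN's actual interface.
public interface DhtApi {
    /** Store an object under (namespace, RID); the DHT picks the node via h(RID). */
    void publish(String namespace, byte[] resourceId, byte[] object);

    /** Exact-match retrieval of everything stored under (namespace, RID). */
    java.util.List<byte[]> lookup(String namespace, byte[] resourceId);

    /** Deliver an object to every member of a multicast group (e.g. all PIER nodes). */
    void multicast(String group, byte[] object);
}
```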

  14. PIER Overview • PIER is just another application as far as CAN is concerned… CAN objects = PIER tuples • User applications can use PIER to query data using a subset of SQL (Joins and Group By) • Assume: PIER is running on all CAN nodes • Metadata is stored in CAN, not the actual large files, although functionally they could be • An object’s RID is its primary key (in most cases) • The namespace is equivalent to the relation’s name
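In code form, the mapping the slide describes might look like this; a sketch that reuses the hypothetical DhtApi interface from the slide-13 sketch, with serialization glossed over.

```java
// Sketch: namespace = relation name, RID = primary key, object = the serialized tuple.
public class TuplePublisher {
    private final DhtApi dht;

    public TuplePublisher(DhtApi dht) { this.dht = dht; }

    public void publishTuple(String relation, byte[] primaryKey, byte[] serializedTuple) {
        dht.publish(relation, primaryKey, serializedTuple);
    }
}
```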

  15. CAN Extensions • (Not needed for correctness; polling could be used instead) • To allow PIER to operate, it must have access to the local data store • LocalGet(namespace) – retrieve the data stored locally for a particular namespace • LocalNotify(namespace) – receive a callback when new data is inserted into the local store for the namespace • This violates the abstraction of location independence • However, access is limited to reading the data; applications cannot control the location of data through these extensions
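The two extensions could be expressed as a small local-store interface like the one below; the callback type and signatures are assumptions, since the slide only names the calls.

```java
// Hypothetical sketch of the local-store extensions, not PIER's actual code.
public interface LocalStoreApi {
    /** Scan only the data that happens to be stored at this node for a namespace. */
    java.util.List<byte[]> localGet(String namespace);

    /** Register a callback fired when new data for the namespace arrives locally. */
    void localNotify(String namespace, java.util.function.Consumer<byte[]> onInsert);
}
```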

  16. Indexes. The lifeblood of a database engine. • CAN’s mapping of RID/object is equivalent to an index • Additional indexes are created by adding another key/value pair, with the key being the value of the indexed field(s) and the value being a ‘pointer’ to the object (the RID or primary key) • Original: {35, abc.mp3, classical, 1837,…} • Index on 3rd/4th attributes: {classical.1837, 35} • Index lookups require two CAN lookups: one for the index entry, one for the object
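A sketch of that two-step pattern, again using the hypothetical DhtApi interface; the index namespace naming and string encodings are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Secondary-index sketch: the index entry's key is the indexed field value(s),
// its value is the primary RID; lookups then cost two DHT lookups.
public class SecondaryIndex {
    private final DhtApi dht;

    public SecondaryIndex(DhtApi dht) { this.dht = dht; }

    /** e.g. publishIndexEntry("music.idx_genre_year", "classical.1837", rid35) */
    public void publishIndexEntry(String indexNamespace, String fieldValues, byte[] primaryRid) {
        dht.publish(indexNamespace, fieldValues.getBytes(StandardCharsets.UTF_8), primaryRid);
    }

    /** Lookup 1: index entry -> primary RIDs.  Lookup 2: primary RID -> object. */
    public List<byte[]> lookupViaIndex(String indexNamespace, String baseNamespace, String fieldValues) {
        List<byte[]> results = new ArrayList<>();
        for (byte[] rid : dht.lookup(indexNamespace, fieldValues.getBytes(StandardCharsets.UTF_8))) {
            results.addAll(dht.lookup(baseNamespace, rid));
        }
        return results;
    }
}
```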

  17. Rehash-Join • Example: “I want Hawaiian images that appeared in movies produced since 1970…” • Create a query request: SELECT name, URL FROM images, movies WHERE image.ID = movie.ID AND… • Use multicast to distribute the request across the CAN network • When each node receives the multicast, it uses localGet to read all the data stored at that node; each object or tuple is analyzed: • The tuple is checked against the predicates that apply to it (e.g. produced > 1970) • Unnecessary fields can be projected out • Re-insert the resulting tuple into the network using the join key value as the new RID, under a new temporary namespace (both tables use the same namespace) • As data arrives from BOTH tables, use a pipelined hash join to generate results and send them to the requestor • [Figure: data being rehashed across the CAN network]
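At each rehash site the pipelined (symmetric) hash join could look roughly like the sketch below: an arriving tuple probes the other input’s hash table and is then inserted into its own. The Tuple type and method names are placeholders, not PIER’s actual operator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pipelined (symmetric) hash join sketch run at a rehash node.
public class PipelinedHashJoin {
    public static class Tuple { public final Map<String, Object> fields = new HashMap<>(); }

    private final Map<Object, List<Tuple>> leftByKey = new HashMap<>();
    private final Map<Object, List<Tuple>> rightByKey = new HashMap<>();

    /** Called as rehashed tuples arrive; returns any join results produced. */
    public List<Tuple[]> onArrival(Tuple t, Object joinKey, boolean fromLeft) {
        Map<Object, List<Tuple>> build = fromLeft ? leftByKey : rightByKey;
        Map<Object, List<Tuple>> probe = fromLeft ? rightByKey : leftByKey;

        List<Tuple[]> out = new ArrayList<>();
        for (Tuple match : probe.getOrDefault(joinKey, List.of())) {
            out.add(fromLeft ? new Tuple[] { t, match } : new Tuple[] { match, t });
        }
        build.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(t);
        return out;                       // results are forwarded to the requestor
    }
}
```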

  18. Get-Join • Example: “I want Hawaiian images that appeared in movies produced since 1970…” • Create a query request: SELECT name, URL FROM images, movies WHERE image.ID = movie.ID AND… • Use multicast to distribute the request across the CAN network • When each node receives the multicast, it uses localGet to read all the data stored at that node; each object or tuple is analyzed: • The tuple is checked against the predicates that apply to it (e.g. produced > 1970) • Issue a Lookup request for the tuple(s) that may be able to join with it… the RID is the join key and the namespace is the other table’s • ONE TABLE MUST BE INDEXED ON THE JOIN KEY! • When the Lookup request returns with an answer, perform the join and forward the results to the requestor • [Figure: Lookup requests & answers flowing across the CAN network]
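The per-node Get-Join step might be sketched like this (DhtApi is the hypothetical interface from the slide-13 sketch; predicate checks, serialization, and result forwarding are elided).

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Get-Join ("fetch matches") sketch: for each local tuple that passes its predicates,
// look up candidate partners in the other relation by join key.
public class GetJoin {
    private final DhtApi dht;

    public GetJoin(DhtApi dht) { this.dht = dht; }

    /** otherRelation must be stored/indexed on the join key for this to work. */
    public void joinLocalTuple(byte[] localTuple, String joinKeyValue, String otherRelation) {
        byte[] rid = joinKeyValue.getBytes(StandardCharsets.UTF_8);
        List<byte[]> partners = dht.lookup(otherRelation, rid);   // remote fetch by join key
        for (byte[] partner : partners) {
            emitJoinResult(localTuple, partner);                  // forward to the requestor
        }
    }

    private void emitJoinResult(byte[] left, byte[] right) { /* send to the query site */ }
}
```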

  19. Hybrid-Join • Starts the same as the others… the request is multicast to the nodes • Instead of rehashing the entire tuple, only rehash the primary key and join key of tuples that may satisfy the request (just like a semi-join algorithm or index-on-the-fly) • At the nodes with the rehashed data, if a join is possible (matching join keys)… look up the entire tuple and then join
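A sketch of the two phases, with illustrative names and the hypothetical DhtApi interface again; the real PIER code will differ.

```java
import java.nio.charset.StandardCharsets;

// Hybrid (semi-join-style) sketch: only a small (primaryKey, joinKey) pair is rehashed;
// the full tuple is fetched by primary key only when a match is likely.
public class HybridJoin {
    private final DhtApi dht;

    public HybridJoin(DhtApi dht) { this.dht = dht; }

    /** Phase 1, at the data node: rehash just the keys into a temporary namespace. */
    public void rehashKeys(String tempNamespace, String joinKeyValue, byte[] primaryKey) {
        dht.publish(tempNamespace, joinKeyValue.getBytes(StandardCharsets.UTF_8), primaryKey);
    }

    /** Phase 2, at the rehash node: join keys matched, so fetch the full tuples. */
    public void fetchMatchedTuple(String baseNamespace, byte[] primaryKey) {
        for (byte[] fullTuple : dht.lookup(baseNamespace, primaryKey)) {
            joinAndForward(fullTuple);    // join with its partner and forward the result
        }
    }

    private void joinAndForward(byte[] fullTuple) { /* perform the join, send to the query site */ }
}
```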

  20. Quick Comparison… • # of data movements (latency) • Rehash: Query Request, Rehash, Result = 3 • Get: Query Request, Lookup, Answer, Result = 4 • Hybrid: Query Request, Small Rehash, Lookup, Answer, Result = 5 • Amount of data sent • Rehash: send projected tuples matching some predicates, but can’t pre-check the join predicate • Get: only retrieve tuples matching the join predicate, but the entire tuple must be fetched and may not match some other predicates • Hybrid: send small tuples matching some predicates; the entire tuple is fetched only if a join is very likely

  21. Quick Comparison… • [Chart: message counts and total data sent for the three join strategies: 160k msgs = 750 kb, 250k msgs = 4.6 MB, 300k msgs = 600 kb]

  22. Rehash Group-By • Same as Rehash-Join… rehash tuples matching the predicates • At the rehash sites… perform the aggregate • After a predefined time, send the result • Obviously this needs work… • Partial results • Statistics work (confidence levels, etc.) • Hierarchical aggregation (like in sensor nets)
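The per-site aggregation could be sketched as below: a COUNT/SUM example with illustrative names; PIER’s actual aggregation operator and timer handling are not specified on the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of aggregation at a rehash site: tuples arrive grouped by the rehashed
// group-by value, a running aggregate is maintained, and totals are sent when the
// predefined timer fires.
public class RehashGroupBy {
    private static class Agg { long count; double sum; }

    private final Map<Object, Agg> groups = new HashMap<>();

    public void onArrival(Object groupKey, double value) {
        Agg a = groups.computeIfAbsent(groupKey, k -> new Agg());
        a.count++;
        a.sum += value;
    }

    /** Called when the predefined collection time expires. */
    public void flush() {
        groups.forEach((key, a) ->
            sendToRequestor(key, a.count, a.sum));   // e.g. COUNT and SUM per group
        groups.clear();
    }

    private void sendToRequestor(Object key, long count, double sum) { /* network send */ }
}
```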

  23. Current Optimizations • Result tuple batching… why send results one by one? • Pipelined hash join naturally produces more results in one pass as more tuples are processed • Network may be optimized for packets of a certain size; batch 2, 3, 4, or x tuples per message • Selection & Projection • Before rehashing, throw out irrelevant tuples and project to reduce the size of the tuple • Rehash optimization • Instead of rehashing to all sites, randomly select a smaller group to be used
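A minimal sketch of the batching idea; the buffer size and names are illustrative, not PIER’s actual settings.

```java
import java.util.ArrayList;
import java.util.List;

// Result-tuple batching sketch: buffer results and ship them in groups of batchSize
// per message instead of one message per tuple.
public class ResultBatcher {
    private final int batchSize;
    private final List<byte[]> buffer = new ArrayList<>();

    public ResultBatcher(int batchSize) { this.batchSize = batchSize; }

    public void add(byte[] resultTuple) {
        buffer.add(resultTuple);
        if (buffer.size() >= batchSize) flush();
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            sendMessage(new ArrayList<>(buffer));   // one network message, many tuples
            buffer.clear();
        }
    }

    private void sendMessage(List<byte[]> batch) { /* network send */ }
}
```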

  24. Interesting note… • The state of the join is stored in the DHT store • The query request is automatically delivered to new nodes when they join • Rehashed data is automatically re-routed to the proper node if the coordinate space is adjusted • When a node splits (to accept a new node into the network) the data is also split; this includes previously delivered rehashed tuples • Allows graceful re-organization of the network without interfering with ongoing operations

  25. Where we are… • 25,000 lines of Java code, hours of debugging, speed/memory enhancements, etc. • A real, working implementation of PIER on top of a CAN simulator • Initial work studying and analyzing algorithms… nothing really ground-breaking… YET! • Analyzing the design space and which problems seem most interesting to pursue

  26. DHT API? Is it good? • The API isn’t good for: • Range queries • Limited multicast – currently, all queries must be asked at all nodes; this is the same scaling problem as Gnutella & Freenet • Batch Publish/Lookup operations • How to properly extend the API if needed? • The end-to-end argument is always in play… which layer (CAN, PIER, or the user application) should do what?

  27. Where to go from here? 1st: Understand where we are… • Initial steps we need to take: • Testing so far has been limited to small micro-benchmarks… we need to take a careful look at: • Scaling • Effects of dynamic node re-organization (including failures, which result in data AND computation loss) • Heterogeneous network connections & various topologies • Different data distributions • Optimal levels of parallelism… too many nodes can be just as bad, if not worse, than too few • More realistic simulations (including traces) • Develop or adopt performance metrics to ensure efficiency and measure alternatives

  28. Where to go from here? Then see where we can go… • The hardcore research areas: • Replication – for speed and fault tolerance (both in data and computation) • Pre-computation of (intermediate) results • Caching – both at the DHT and PIER levels • Query optimization • More algorithms; distributed DBMSs have more tricks • Security • What are the new apps the system enables?

  29. Relationship to other Berkeley Projects. Are we alone? • We have data coming from many sources, different rates, always changing… Telegraph? • We expect that data is constantly published in this type of network, better aggregation algorithms… Sensor Nets? • Many similar or overlapping queries are likely to be asked from different nodes within a short time window… SteMs, XML Filters? • Any ideas?
