
Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)


Presentation Transcript


  1. Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval) Ryan Huebsch†, Joe Hellerstein†, Nick Lanham†, Boon Thau Loo†, Scott Shenker‡, Ion Stoica† p2p@db.cs.berkeley.edu †UC Berkeley, CS Division ‡International Computer Science Institute, Berkeley, CA VLDB 9/12/03

  2. What is Very Large? Depends on Who You Are • Database community: single site, clusters, distributed systems of 10’s–100’s of nodes • Network community: Internet-scale systems of 1000’s to millions of nodes • Challenge: how to run DB-style queries at Internet scale!

  3. What are the Key Properties? • Lots of data that is: • Naturally distributed (where it’s generated) • Centralized collection undesirable • Homogeneous in schema • Data is more useful when viewed as a whole • This is the design space we have chosen to investigate.

  4. Who Needs Internet Scale? Example 1: Filenames • Simple ubiquitous schemas: • Filenames, sizes, ID3 tags • Born from early P2P systems such as Napster, Gnutella, AudioGalaxy, etc. • Content is shared by “normal” non-expert users… home users • Systems were built by a few individuals ‘in their garages’ → low barrier to entry

  5. Example 2: Network Traces • Schemas are mostly standardized: • IP, SMTP, HTTP, SNMP log formats • Network administrators are looking for patterns within their site AND across other sites: • DoS attacks cross administrative boundaries • Tracking virus/worm infections • Timeliness is very helpful • It might surprise you how useful this is: • Network bandwidth on PlanetLab (a world-wide distributed research test bed) is mostly consumed by people monitoring network status

  6. Our Challenge • Our focus is on the challenge of scale: • Applications are homogeneous and distributed • Already have significant interest • Provide a flexible framework for a wide variety of applications • Other interesting issues we will not discuss today: • Heterogeneous schemas • Crawling/Warehousing/etc • These issues are complementary to the issues we will cover in this talk

  7. Four Design Principles (I) • Relaxed Consistency • ACID transactions severely limit the scalability and availability of distributed databases • We provide best-effort results • Organic Scaling • Applications may start small, without a priori knowledge of their eventual size

  8. Four Design Principles (II) • Natural Habitat • No CREATE TABLE/INSERT • No “publish to web server” • Wrappers or gateways allow the information to be accessed where it is created • Standard Schemas via Grassroots Software • Data is produced by widespread software, providing a de facto schema to build on

  9. Declarative Queries • [Diagram: declarative queries are compiled into a query plan, which executes over the overlay network, which in turn runs on the physical network.]

  10. Recent Overlay Networks: Distributed Hash Tables (DHTs) • What is a DHT? • Take an abstract ID space and partition it among a changing set of computers (nodes) • Given a message with an ID, route the message to the computer currently responsible for that ID • Messages can be stored at the nodes • This behaves like a “distributed hash table” • Provides a put()/get() API • Cheap maintenance when nodes come and go
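A minimal, single-process sketch of the DHT abstraction described above: an abstract ID space partitioned among nodes, with put()/get() routed to the node currently responsible for a key's ID. The class and method names are illustrative only; a real DHT (and PIER's DHT layer) also handles node churn, replication, soft state, and multi-hop routing, none of which is shown here.

```java
import java.util.*;

// Toy DHT: the ID space is partitioned among "nodes" (points in the space),
// and each node stores the items whose IDs fall in its region.
class ToyDht {
    private final SortedMap<Integer, Map<Integer, List<String>>> nodes = new TreeMap<>();

    // A node joins by claiming a point in the ID space; it becomes responsible
    // for the IDs up to and including its point (successor-style partitioning).
    void join(int nodeId) { nodes.put(nodeId, new HashMap<>()); }

    // Route an ID to the responsible node: the first node at or after the ID,
    // wrapping around to the smallest node ID.
    private Map<Integer, List<String>> ownerOf(int id) {
        SortedMap<Integer, Map<Integer, List<String>>> tail = nodes.tailMap(id);
        int owner = tail.isEmpty() ? nodes.firstKey() : tail.firstKey();
        return nodes.get(owner);
    }

    // put(): store the value at the node responsible for the key's ID.
    void put(int id, String value) {
        ownerOf(id).computeIfAbsent(id, k -> new ArrayList<>()).add(value);
    }

    // get(): route a lookup to the responsible node and return what it stores.
    List<String> get(int id) {
        return ownerOf(id).getOrDefault(id, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyDht dht = new ToyDht();
        dht.join(100); dht.join(200); dht.join(300);  // three nodes partition the ID space
        dht.put(150, "tuple-A");                      // stored at the node that claimed 200
        System.out.println(dht.get(150));             // [tuple-A]
    }
}
```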

  11. Recent Overlay Networks: Distributed Hash Tables (DHTs) • [Diagram: a two-dimensional ID space with corners (0,0), (0,16), (16,0), (16,16), partitioned among nodes; a data item with Key = (15,14) is routed to, and stored at, the node whose region contains that point.]

  12. Recent Overlay Networks: Distributed Hash Tables (DHTs) • Lots of effort is put into making DHTs better: • Scalable (thousands → millions of nodes) • Resilient to failure • Secure (anonymity, encryption, etc.) • Efficient (fast access with minimal state) • Load balanced • etc.

  13. PIER’s Three Uses for DHTs • Single elegant mechanism with many uses: • Search: Index • Like a hash index • Partitioning: Value (key)-based routing • Like Gamma/Volcano • Routing: Network routing for QP messages • Query dissemination • Bloom filters • Hierarchical QP operators (aggregation, join, etc) • Not clear there’s another substrate that supports all these uses
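To make the "one mechanism, many uses" point concrete, here is a hedged sketch of how value-based partitioning doubles as a hash index: tuples are published into the DHT under an ID derived from a namespace (table/attribute) and the attribute's value, so an equality lookup is a single get(), and a join can rehash both inputs into a common namespace so matching tuples meet at the same node. The helper names are hypothetical, not PIER's actual API.

```java
import java.util.Objects;

// Hypothetical helper: map a (namespace, key value) pair into the DHT's ID space.
// Publishing R's tuples under resourceId("R.num1", value) simultaneously
// indexes R on num1 (an equality lookup is one get()) and partitions R by num1
// (the basis for routing the tuples of a parallel join to the right node).
class DhtIds {
    static int resourceId(String namespace, Object keyValue) {
        return Objects.hash(namespace, keyValue);  // stand-in for a real 160-bit hash
    }

    public static void main(String[] args) {
        // Both sides of a join on R.num1 = S.pkey rehash into one shared namespace,
        // so R(num1=42) and S(pkey=42) land on the same node.
        int rSide = resourceId("tempJoinNS", 42);  // R tuple with num1 = 42
        int sSide = resourceId("tempJoinNS", 42);  // S tuple with pkey = 42
        System.out.println(rSide == sSide);        // true: they meet at the same node
    }
}
```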

  14. Metrics • We are primarily interested in 3 metrics: • Answer quality (recall and precision) • Bandwidth utilization • Latency • Different DHTs provide different properties: • Resilience to failures (recovery time) → answer quality • Path length → bandwidth & latency • Path convergence → bandwidth & latency • Different QP join strategies: • Symmetric Hash Join, Fetch Matches, Symmetric Semi-Join, Bloom Filters, etc. • Big picture: they trade off bandwidth (extra rehashing) against latency

  15. Experimental Setup • A simple join query: SELECT R.pkey, S.pkey, R.pad FROM R, S WHERE R.num1 = S.pkey AND R.num2 > constant1 AND S.num2 > constant2 AND f(R.num3, S.num3) > constant3 • Table R has 10 times more tuples than S • num1, num2, num3 are uniformly distributed • R.num1 is chosen such that 90% of R tuples have exactly one matching tuple in S, while 10% have no matching tuples • R.pad is used to ensure all result tuples are 1KB • Join strategy: Symmetric Hash Join
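Since the slide fully specifies the synthetic workload, a hedged data-generator sketch may help make it concrete. Table sizes, value ranges, and pad length below are illustrative assumptions, not the paper's exact constants; only the stated ratios (R is 10x S, 90% of R tuples match exactly one S tuple, ~1 KB result tuples) come from the slide.

```java
import java.util.Random;

// Generate R so that 90% of its tuples join with exactly one S tuple
// (S.pkey ranges over 0..sSize-1) and 10% join with none.
class WorkloadGen {
    public static void main(String[] args) {
        int sSize = 1_000;              // assumed |S|; R is 10x larger per the slide
        int rSize = 10 * sSize;
        Random rnd = new Random(42);

        for (int pkey = 0; pkey < rSize; pkey++) {
            int num1 = rnd.nextDouble() < 0.9
                    ? rnd.nextInt(sSize)           // 90%: matches exactly one S.pkey
                    : sSize + rnd.nextInt(sSize);  // 10%: outside S's key range, no match
            int num2 = rnd.nextInt(100);           // uniform; feeds the R.num2 > constant1 predicate
            int num3 = rnd.nextInt(100);           // uniform; feeds f(R.num3, S.num3) > constant3
            String pad = "x".repeat(900);          // rough padding so result tuples are ~1 KB
            // ...publish R(pkey, num1, num2, num3, pad) into the DHT here
        }
    }
}
```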

  16. Does This Work for Real? Scale-up Performance (1 MB source data/node) • [Graphs: scale-up results on a real network deployment and in simulation.]

  17. Current Status of PIER • PIER currently runs box-and-arrow dataflow graphs • We are working on query optimization • GUI (called Lighthouse) for query entry and result display • Designed primarily for application developers • User inputs an optimized physical query plan

  18. Future Research • Routing, Storage and Layering • Catalogs and Query Optimization • Hierarchical Aggregations • Range Predicates • Continuous Queries over Streams • Sharing between Queries • Semi-structured Data

  19. Bonus Material

  20. Symmetric Hash Join (SHJ)
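The deck gives only the operator name here, so the following is a hedged, single-node sketch of the symmetric (pipelined) hash join each PIER node would run after R and S tuples have been rehashed to it on the join key. The tuple representation is simplified to strings; the real operator consumes tuples streaming in from the DHT rather than method calls.

```java
import java.util.*;

// Symmetric hash join: keep one hash table per input; each arriving tuple is
// inserted into its own table and immediately probes the other table, so
// results stream out as soon as both matching tuples have arrived.
class SymmetricHashJoin {
    private final Map<Integer, List<String>> rTable = new HashMap<>();
    private final Map<Integer, List<String>> sTable = new HashMap<>();
    final List<String> results = new ArrayList<>();

    void onR(int key, String rTuple) {
        rTable.computeIfAbsent(key, k -> new ArrayList<>()).add(rTuple);
        for (String s : sTable.getOrDefault(key, List.of()))
            results.add(rTuple + " JOIN " + s);
    }

    void onS(int key, String sTuple) {
        sTable.computeIfAbsent(key, k -> new ArrayList<>()).add(sTuple);
        for (String r : rTable.getOrDefault(key, List.of()))
            results.add(r + " JOIN " + sTuple);
    }

    public static void main(String[] args) {
        SymmetricHashJoin j = new SymmetricHashJoin();
        j.onR(7, "R(pkey=1, num1=7)");
        j.onS(7, "S(pkey=7)");          // match produced as soon as the S tuple arrives
        j.onR(9, "R(pkey=2, num1=9)");  // no S(pkey=9) yet, so nothing is produced
        System.out.println(j.results);
    }
}
```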

  21. Fetch Matches (FM)
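Again only the name appears on the slide, so this is a hedged sketch of the Fetch Matches idea: when S is already stored in the DHT hashed on the join attribute, R need not be rehashed at all; each locally scanned R tuple issues one get() for its matching S tuples, trading a lookup round-trip per R tuple for the rehashing bandwidth saved. The local map below stands in for the DHT namespace holding S.

```java
import java.util.*;

// Fetch Matches: scan R locally and issue one DHT lookup per R tuple.
class FetchMatches {
    public static void main(String[] args) {
        // Stand-in for the DHT namespace holding S, already partitioned on S.pkey.
        Map<Integer, List<String>> sByPkey = new HashMap<>();
        sByPkey.put(7, List.of("S(pkey=7)"));

        // Local scan of R: {R.pkey, R.num1}; no rehashing of R takes place.
        int[][] rTuples = { {1, 7}, {2, 9} };
        for (int[] r : rTuples)
            for (String s : sByPkey.getOrDefault(r[1], List.of()))  // one "get()" per R tuple
                System.out.println("R(pkey=" + r[0] + ", num1=" + r[1] + ") JOIN " + s);
    }
}
```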

  22. Symmetric Semi Join (SSJ) • Both R and S are projected to save bandwidth • The complete R and S tuples are fetched in parallel to improve latency
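A hedged sketch of the symmetric semi-join strategy summarized above: only the projected (join key, source key) pairs of R and S are rehashed and joined; the surviving matches then fetch the complete R and S tuples in parallel. Maps and print statements stand in for the DHT rehash and the parallel fetches; this is not PIER's actual code.

```java
import java.util.*;

// Symmetric semi-join: join tiny projections first, fetch full tuples last.
class SymmetricSemiJoin {
    public static void main(String[] args) {
        // Step 1: rehash projections only -- a few bytes per tuple on the wire.
        Map<Integer, Integer> rProj = Map.of(7, 1, 9, 2);  // R.num1 -> R.pkey
        Map<Integer, Integer> sProj = Map.of(7, 7);        // S.pkey -> S.pkey

        // Step 2: join the projections (in PIER, a symmetric hash join over the DHT).
        for (Map.Entry<Integer, Integer> r : rProj.entrySet()) {
            Integer sId = sProj.get(r.getKey());
            if (sId == null) continue;                     // no match on this join key
            // Step 3: fetch the complete R and S tuples for each match, in parallel.
            System.out.println("fetch full R(pkey=" + r.getValue()
                    + ") and S(pkey=" + sId + ") in parallel");
        }
    }
}
```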
