
Query Processing and Networking Infrastructures


Presentation Transcript


  1. Query Processing and Networking Infrastructures Day 1 of 2 Joe Hellerstein UC Berkeley September 20, 2002

  2. Two Goals • Day 1: Primer on query processing • Targeted to networking/OS folk • Bias: systems issues • Day 2: Seed some cross-fertilized research • Especially with networking • Thesis: dataflow convergence • query processing and routing • Clearly other resonances here • Dataflow HW architectures • Event-based systems designs • ML and Control Theory • Online Algorithms

  3. (Sub)Space of Possible Topics [topic map] • Distributed & Federated QP • Transactional Storage & Networking • Active DBs (Trigger Systems) • Traditional Relational QP: Optimization & Execution • Data Model & Query Language Design • Parallel QP • NFNF Data Models (OO, XML, “Semistructured”) • Adaptive QP • Online and Approximate QP • Indexing • Data Reduction • Data Streams & Continuous Queries • Visual Querying & Data Visualization • Compression • Media Queries, Feature Extraction & Similarity Search • Statistical Data Analysis (“Mining”) • Boolean Text Search • Traditional Text Ranking • Hypertext Ranking

  4. Likely Topics Here [same topic map as slide 3, with the topics covered in this tutorial highlighted]

  5. Plus Some Speculative Ones [topic map] • Distributed & Federated QP • Traditional Relational QP: Optimization & Execution • Content Routing • Indirection Architectures • Parallel QP • Adaptive QP • Online and Approximate QP • Indexing • Peer-to-Peer QP • Data Streams & Continuous Queries • Sensornet QP • Network Monitoring • Boolean Text Search • Traditional Text Ranking

  6. Outline • Day 1: Query Processing Crash Course • Intro • Queries as indirection • How do relational databases run queries? • How do search engines run queries? • Scaling up: cluster parallelism and distribution • Day 2: Research Synergies w/Networking • Queries as indirection, revisited • Useful (?) analogies to networking research • Some of our recent research at the seams • Some of your research? • Directions and collective discussion

  7. Getting Off on the Right Foot

  8. Roots: database and IR research • “Top-down” traditions (“applications”) • Usually begins with semantics and models. • Common Misconceptions • Query processing = Oracle or Google. • Need not be so heavyweight or monolithic! Many reusable lessons within • IR search and DB querying are fundamentally different • Very similar from a query processing perspective • Many similarities in other data models as well • Querying is a synchronous, interactive process. • Triggers, rules and "continuous queries" not so different from plain old queries.

  9. So… we’ll go bottom-up • Focus on reusable building blocks • Attempt to be language- and model-agnostic • Illustrate with various querying scenarios

  10. Confession: Two Biases • Relational query engines • Most mature and general query technology • Best documented in the literature • Conceptually general enough to “capture” most all other models/schemes • Everybody does web searches • So it’s both an important app, and an inescapable usage bias we carry around • It will inform our discussion. Shouldn’t skew it • Lots of other query systems/languages you can keep in mind as we go • LDAP, DNS, XSL/Xpath/XQuery, Datalog

  11. What Are Queries For? I • Obvious answer: search and analysis over big data sets • Search: select data of interest • Boolean expressions over content • sometimes with an implicit ordering on results • Analysis: construct new information from base data. • compute functions over each datum • concatenate related records (join) • partition into groups, summarize (aggregates) • aside: “Mining” vs. “Querying”? As a rule of thumb, think of mining as WYGIWIGY. • Not the most general, powerful answer…

  12. What Are Queries For? II • Queries bridge a (large!) level of indirection • Declarative programming: what you want, not how to get it • Easy(er) to express • Allows the “how” to change under the covers • A critical issue! • Not just for querying • Method invocation, data update, etc.

  13. Motivation for this Indirection • Critical when rates of change differ across layers: • In particular, when d(app)/dt << d(environment)/dt • E.g. DB apps are used for years, decades (!!) • E.g. networked env: high rates of change (??) • DB lit calls this “data independence”

  14. Data Independence: Background • Bad Old Days • Hierarchical and “Network” (yep!) data models • Nesting & pointers mean that apps explicitly traverse data, become brittle when data layouts change • Apps with persistent data have slow dapp/dt • And the database environments change faster! • Logical changes to representation (schema) • Physical changes in storage (indexes, layouts, HW) • DBs often shared by multiple apps! • In B.O.D., all apps had to be rewritten on change

  15. It’s a SW Engineering Thing • Analogy: imagine if your C structs were to survive for decades • you’d keep them very simple • encapsulation to allow future mods • Similar Analogy to NWs • protocol simplicity is good • soft state is good (discourages hardcoded refs to transient resources) • But the fun systems part follows directly: • Achieve the goal w/respectable performance over a dynamic execution environment

  16. Codd’s Data Independence • Ted Codd, IBM c. 1969 and forward • Turing Award 1981 • Two layers of indirection [diagram]: Applications → Logical Representation (schema) → Physical Representation (storage) • Logical independence (applications to schema) is spanned by views and query rewriting • Physical independence (schema to storage) is spanned by query optimization and execution

  17. A More Architectural Picture [diagram] • Declarative query over views → Query Rewriter → declarative query over base tables (bridges logical independence) • Query Processor: Optimizer → Query Plan (procedural) → Executor → (Iterator API) → Access Methods (bridges physical independence) • N.B.: This classical QP architecture raises some problems. To be revisited!

  18. Access Methods & Indexing

  19. Access Methods • Base data access layer • Model: Data stored in unordered collections • Relations, tables, one type per collection • Interface: iterators • Open(predicate) -> cursor • Usually simple predicates: attribute op constant • op usually arithmetic (<, >, =), though we’ll see extensions (e.g. multi-d ops) • Next(cursor) -> datum (of known type) • Close(cursor) • Insert(datum of correct type) • Delete(cursor)
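
A minimal sketch of this iterator-style access-method API, assuming an in-memory list stands in for disk pages; the class and field names (HeapFile, Cursor) are illustrative, not from any real engine:

```python
class Cursor:
    def __init__(self, rows, predicate):
        # Lazily filter the collection by the (simple) predicate.
        self._it = (r for r in rows if predicate(r))

    def next(self):
        """Return the next matching record, or None when exhausted."""
        return next(self._it, None)

class HeapFile:
    """Unordered collection of records of one type."""
    def __init__(self):
        self.rows = []

    def open(self, predicate=lambda r: True):
        return Cursor(self.rows, predicate)

    def insert(self, record):
        self.rows.append(record)

# Usage: scan with an attribute-op-constant predicate.
emp = HeapFile()
emp.insert({"name": "ann", "age": 31})
emp.insert({"name": "bob", "age": 24})
c = emp.open(lambda r: r["age"] > 30)
while (r := c.next()) is not None:
    print(r)
```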

  20. Typical Access Methods • “Heap” files • unordered array of records • usually sequential on disk • predicates just save cross-layer costs • Traditional Index AMs • B-trees • actually, “B+”-trees: all data at leaves • Can scan across leaves for range search • predicates (<,>,=, between) result in fewer I/Os • random I/Os (at least to find beginning of range) • Linear Hash index • Litwin ‘78. Supports equality predicates only. • This is it for IR and standard relational DBs • Though when IR folks say “indexing”, they sometimes mean all of query processing
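
To make the range-search point concrete, here is a toy stand-in for a B+-tree's sorted leaf level using Python's bisect: the binary search plays the role of the directory descent, and a BETWEEN predicate turns into one lookup plus a sequential scan across leaves. This only illustrates the access pattern, not an actual B+-tree:

```python
import bisect

# Sorted (key, record) pairs stand in for the leaf level of a B+-tree.
leaves = sorted([(5, "e"), (1, "a"), (9, "i"), (3, "c"), (7, "g")])

def range_scan(lo, hi):
    """BETWEEN lo AND hi: one descent to find the start, then scan leaves."""
    i = bisect.bisect_left(leaves, (lo, ""))
    while i < len(leaves) and leaves[i][0] <= hi:
        yield leaves[i]
        i += 1

print(list(range_scan(3, 7)))   # [(3, 'c'), (5, 'e'), (7, 'g')]
```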

  21-22. Primary & Secondary Indexes [diagrams] • Both: a directory of (key, ptr) pairs over the data • Primary index: the data records themselves sit at the leaf level • Secondary index: leaves hold (key, ptr) pairs pointing to records stored elsewhere

  23. An Exotic Forest of Search Trees • Multi-dimensional indexes • For geodata, multimedia search, etc. • Dozens! E.g. R-tree family, disk-based Quad-Trees, kdB-trees • And of course “linearizations” with B-trees • Path indexes • For XML and OO path queries • E.g. Xfilter • Etc. • Lots of one-off indexes, often many per workload • No clear winners here • Extensible indexing scheme would be nice

  24. Generalized Search Trees (GiST) [Hellerstein et al., VLDB 95] • What is a (tree-based) DB index? Typically: • A clustering of data into leaf blocks • Hierarchical summaries (subtree predicates -- SPs) for pointers in directory blocks [diagram: a directory block holding SPs p1, p2, p3, …]

  25. Generalized Search Trees (GiST) • Can realize that abstraction with simple interface: • User registers opaque SP objects with a few methods • Consistent(q, p): should query q traverse subtree? • Penalty(d, p): how bad is it to insert d below p • Union (p1, p2): form SP that includes p1, p2 • PickSplit({p1, …, pn}): partition SPs into 2 • Tree maintenance, concurrency, recovery all doable under the covers • Covers many popular multi-dimensional indexes • Most of which had no concurrency/recovery story • http://gist.cs.berkeley.edu
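
A hedged sketch of what registering an SP class might look like, specialized to 1-D key ranges (a degenerate R-tree); the method names follow the slide, while the class name and the split heuristic are assumptions for illustration:

```python
class RangeSP:
    """Subtree predicate: 'all keys in this subtree lie in [lo, hi]'."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q_lo, q_hi):
        # Should a query for [q_lo, q_hi] traverse this subtree?
        return not (q_hi < self.lo or q_lo > self.hi)

    def penalty(self, key):
        # How much would inserting `key` enlarge this SP?
        return max(0, self.lo - key) + max(0, key - self.hi)

    @staticmethod
    def union(p1, p2):
        # Form an SP that includes both p1 and p2.
        return RangeSP(min(p1.lo, p2.lo), max(p1.hi, p2.hi))

    @staticmethod
    def pick_split(sps):
        # Crude split heuristic: sort by lower bound and cut in half.
        sps = sorted(sps, key=lambda p: p.lo)
        mid = len(sps) // 2
        return sps[:mid], sps[mid:]

root = RangeSP.union(RangeSP(1, 5), RangeSP(8, 12))
print(root.consistent(6, 9))   # True: query range overlaps [1, 12]
```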

  26. Some Additional Indexing Tricks [O’Neil/Quass, SIGMOD 97] • Bitmap indexing • Many matches per value in (secondary) index? Rather than storing pointers to heap file in leaves, store a bitmap of matches in a (sorted) heap file. • Only works if file reorg is infrequent • Can make intersection, COUNT, etc. quicker during query processing • Can mix/match bitmaps and lists in a single index • Works with any (secondary) index with duplicate matches • “Vertical Partitioning” / “Columnar storage” • Again, for sorted, relatively static files • Bit-slice indexes
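
A toy rendering of the bitmap-index idea, assuming a small static file: one bit per row position, so conjunctive selection becomes a bitwise AND and COUNT becomes a popcount. The row layout and column values are made up for the example:

```python
# Rows of a static, sorted heap file; column 0 = state, column 1 = tier.
rows = [("ca", "gold"), ("ny", "gold"), ("ca", "silver"), ("ca", "gold")]

def bitmap(col, value):
    """One bit per row position: bit i set iff rows[i][col] == value."""
    bits = 0
    for i, r in enumerate(rows):
        if r[col] == value:
            bits |= 1 << i
    return bits

ca, gold = bitmap(0, "ca"), bitmap(1, "gold")
both = ca & gold                        # WHERE state='ca' AND tier='gold'
print(bin(both), bin(both).count("1"))  # matching positions, and COUNT(*)
```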

  27. Query Processing Dataflow Infrastructures

  28. Dataflow Infrastructure • Dataflow abstraction is very simple • “box-and-arrow” diagrams • (typed) collections of objects flow along edges • Details can be tricky • “Push” or “Pull”? • More to it than that • How do control-flow and dataflow interact? • Where does the data live? • Don’t want to copy data • If passing pointers, where does the “real” data live?

  29. Iterators • Most uniprocessor DB engines use iterators • Open() -> cursor • Next(cursor) -> typed record • Close(cursor) • Simple and elegant • Control-flow and dataflow coupled • Familiar single-threaded, procedure-call API • Data refs passed on stack, no buffering • Blocking-agnostic • Works w/blocking ops -- e.g. Sort • Works w/pipelined ops • Note: well-behaved iterators “come up for air” in inner loops • E.g. for interrupt handling [diagram: plan with operators g over f over scans of R and S]
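
An illustrative (not the deck's) rendering of the pull-based discipline: each operator implements open/next/close and pulls one record at a time from its child, so control flow and dataflow travel together down the plan:

```python
class Scan:
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.rows): return None
        r = self.rows[self.i]; self.i += 1
        return r
    def close(self): pass

class Select:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def next(self):
        # Pull from the child until a record satisfies the predicate.
        while (r := self.child.next()) is not None:
            if self.pred(r):
                return r
        return None
    def close(self): self.child.close()

# The root's next() calls drive the whole plan, single-threaded.
plan = Select(Scan([1, 5, 10, 15]), lambda x: x > 7)
plan.open()
while (r := plan.next()) is not None:
    print(r)
plan.close()
```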

  30. Where is the In-Flight Data? • In standard DBMS, raw data lives in disk format, in shared Buffer Pool • Iterators pass references to BufPool • A tuple “slot” per iterator input • Never copy along edges of dataflow • Join results are arrays of refs to base tables • Operators may “pin” pages in BufPool • BufPool never replaces pinned pages • Ops should release pins ASAP (esp. across Next() calls!!) • Some operators copy data into their internal state • Can “spill” this state to private disk space

  31. Weaknesses of Simple Iterators • Evolution of uniprocessor archs to parallel archs • esp. “shared-nothing” clusters • Opportunity for pipelined parallelism • Opportunity for partition parallelism • Take a single “box” in the dataflow, and split it across multiple machines • Problems with iterators in this environment • Spoils pipelined parallelism opportunity • Polling (Next()) across the network is inefficient • Nodes sit idle until polled, and during comm • A blocking producer blocks its consumer • But would like to keep iterator abstraction • Especially to save legacy query processor code • And simplify debugging (single-threaded, synchronous)

  32. Exchange [Graefe, SIGMOD 90] • Encapsulate partition parallelism & asynchrony • Keep the iterator API between ops • Exchange operator partitions input data by content • E.g. join or sort keys • Note basic architectural idea! • Encapsulate dataflow tricks in operators, leaving infrastructure untouched • We’ll see this again next week, e.g. in Eddies

  33. Exchange Internals • Really 2 operators, XIN and XOUT • XIN is “top” of a plan, and pulls, pushing results to XOUT queue • XOUT spins on its local queue • One thread becomes two • Producer graph & XIN • Consumer graph & XOUT • Routing table/fn in XIN supports partition parallelism • E.g. for || sort, join, etc. • Producer and consumer see iterator API • Queue + thread barrier turns NW-based “push” into iterator-style “pull” [diagram: Exchange = XIN with routing logic feeding XOUT queues]
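
A very rough sketch of the Exchange idea under several simplifying assumptions (Python threads and in-process queues stand in for processes and network transfer; two consumers; hash routing): XIN pulls from its local subplan and pushes into per-partition queues, and XOUT's blocking dequeue turns that push back into an iterator-style pull:

```python
import queue, threading

N_CONSUMERS = 2
queues = [queue.Queue() for _ in range(N_CONSUMERS)]
DONE = object()

def xin(producer_iter, route):
    """Top of the producer plan: pull locally, push to the routed partition."""
    for rec in producer_iter:
        queues[route(rec)].put(rec)
    for q in queues:            # signal end-of-stream to every consumer
        q.put(DONE)

def xout_next(i):
    """Leaf of a consumer plan: a blocking dequeue looks like Next() to callers."""
    rec = queues[i].get()
    return None if rec is DONE else rec

threading.Thread(target=xin,
                 args=(iter(range(10)), lambda r: r % N_CONSUMERS)).start()
while (r := xout_next(0)) is not None:
    print("consumer 0 got", r)
```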

  34. Exchange Benefits? • Remember Iterator limitations? • “Spoils pipelined parallelism opportunity” • solved by Exchange thread boundary • “Polling (Next()) across the network is inefficient” • Solved by XIN pushing to XOUT queue • “A blocking producer blocks its consumer” • Still a problem!

  35. Exchange Limitations • Doesn’t allow consumer work to overlap w/blocking producers • E.g. streaming data sources, events • E.g. sort, some join algs • Entire consumer graph blocks if XOUT queue empty • Control flow coupled to dataflow, so XOUT won’t return without data • Queue is encapsulated from consumer • But … • Note that exchange model is fine for most traditional DB Query Processing • May need to be extended for new settings…

  36. Fjords [Madden/Franklin, ICDE 01] • Thread of control per operator • Queues between each operator • Asynch or synch calls • Can do asynch poll-and-yield iteration in each operator (for both consumer and producer) • Or can do synchronous get_next iteration • Can get traditional behavior if you want: • Synch polls + queue of size 1 = Iterators • Synch consumer, asynch producer = Exchange • Asynch calls solve the blocking problem of Exchange
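
A small illustration, not Fjords code, of poll-and-yield operators: each operator is written as a coroutine that checks its input queue and yields control when nothing is ready, so a stalled producer never blocks its consumer; a trivial round-robin loop stands in for the real scheduler:

```python
from collections import deque

def producer(out_q, items):
    for x in items:
        out_q.append(x)
        yield            # come up for air after each push
    out_q.append(None)   # end-of-stream marker

def consumer(in_q):
    while True:
        if not in_q:
            yield        # nothing ready: yield instead of blocking
            continue
        x = in_q.popleft()
        if x is None:
            return
        print("consumed", x)
        yield

q = deque()
ops = [producer(q, [1, 2, 3]), consumer(q)]
while ops:               # toy round-robin scheduler
    op = ops.pop(0)
    try:
        next(op)
        ops.append(op)
    except StopIteration:
        pass
```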

  37. Fjords • Disadvantages: • Lots of “threads” • Best done in an event-programming style, not OS threads • Operators really have to “come up for air” (“yield”) • Need to write your own scheduler • Harder to debug • But: • Maximizes flexibility for operators at the endpoints • Still provides a fairly simple interface for operator-writers

  38. Basic Relational Operators and Implementation

  39. Relational Algebra Semantics • Selection: σp(R) • Returns all rows in R that satisfy p • Projection: πC(R) • Returns all rows in R projected to columns in C • In strict relational model, remove duplicate rows • In SQL, preserve duplicates (multiset semantics) • Cartesian Product: R × S • Union: R ∪ S • Difference: R − S • Note: R, S must have matching schemata • Join: R ⋈p S = σp(R × S) • Missing: Grouping & Aggregation, Sorting
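
The algebra above can be rendered executably in a few lines; this sketch uses lists of dicts with SQL-style multiset semantics (duplicates preserved) and is purely illustrative:

```python
def select(R, p):      return [r for r in R if p(r)]
def project(R, cols):  return [{c: r[c] for c in cols} for r in R]
def product(R, S):     return [{**r, **s} for r in R for s in S]
def join(R, S, p):     return select(product(R, S), p)   # R join_p S = select_p(R x S)

R = [{"a": 1}, {"a": 2}]
S = [{"b": 2}, {"b": 3}]
print(join(R, S, lambda t: t["a"] == t["b"]))   # [{'a': 2, 'b': 2}]
```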

  40. Operator Overview: Basics • Selection • Typically “free”, so “pushed down” • Often omitted from diagrams • Projection • In SQL, typically “free”, so “pushed down” • No duplicate elimination • Always pass the minimal set of columns downstream • Typically omitted from diagrams • Cartesian Product • Unavoidable nested loop to generate output • Union: • Concat, or concat followed by dup. elim.

  41. Operator Overview, Cont. • Unary operators: Grouping & Sorting • Grouping can be done with hash or sort schemes (as we’ll see) • Binary matching: Joins/Intersections • Alternative algorithms: • Nested loops • Loop with index lookup (Index N.L.) • Sort-merge • Hash Join • Don’t forget: have to write as iterators • Every time you get called with Next(), you adjust your state and produce an output record

  42. Unary External Hashing [Bratbergsengen, VLDB 84] • E.g. GROUP BY, DISTINCT • Two hash functions, hc (coarse) and hf (fine) • Two phases: • Phase 1: for each tuple of input, hash via hc into a “spill” partition to be put on disk • B-1 blocks of memory used to hold output buffers for writing a block at a time per partition [diagram: original relation read from disk, hashed by hc through B-1 in-memory output buffers into partitions on disk]

  43. Unary External Hashing • Phase 2: for each partition, read off disk and hash into a main-memory hashtable via hf • For DISTINCT, when you find a value already in the hashtable, discard the copy • For GROUP BY, associate some agg state (e.g. running SUM) with each group in the hash table, and maintain [diagram: a partition read from disk into a hash table of k < B pages via hf, results written through an output buffer]

  44. External Hashing: Analysis • To utilize memory well in Phase 2, would like each partition to be ~ B blocks big • Hence works in two phases when B >= √|R| • Same req as external sorting! • Else can recursively partition the partitions in Phase 2 • Can be made to pipeline, to adapt nicely to small data sets, etc.
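
A simplified sketch of the two-phase scheme just described, with in-memory lists standing in for spilled disk partitions and a running SUM as the per-group aggregate state; B and the hash functions are toy choices:

```python
B = 4                                   # pretend memory budget, in "blocks"
h_coarse = lambda k: hash(k) % (B - 1)  # hc: routes tuples to spill partitions
input_rows = [("ca", 10), ("ny", 5), ("ca", 7), ("tx", 3), ("ny", 1)]

# Phase 1: partition the input by hc (one output buffer per partition).
partitions = [[] for _ in range(B - 1)]
for key, val in input_rows:
    partitions[h_coarse(key)].append((key, val))

# Phase 2: per partition, aggregate in an in-memory hash table (the "fine" hash).
for part in partitions:
    sums = {}
    for key, val in part:
        sums[key] = sums.get(key, 0) + val   # maintain running SUM per group
    for key, total in sums.items():
        print(key, total)
```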

  45. Hash Join (GRACE) [Fushimi, et al., VLDB 84] • Phase 1: partition each relation on the join key with hc, spilling to disk • Phase 2: • build each partition of smaller relation into a hashtable via hf • scan matching partition of bigger relation, and for each tuple probe the hashtable via hf for matches • Would like each partition of smaller relation to fit in memory • So works well if B >= √|smaller| • Size of bigger is irrelevant!! (Vs. sort-merge join) • Popular optimization: Hybrid hash join • Partition #0 doesn’t spill -- it builds and probes immediately • Partitions 1 through n use rest of memory for output buffers • [DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84]
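
A correspondingly simplified GRACE-style sketch (again with lists standing in for spilled partitions), assuming an equality join on the first field and that R is the smaller, build-side relation:

```python
B = 3
hc = lambda k: hash(k) % B              # coarse partitioning hash

def partition(rel):
    parts = [[] for _ in range(B)]
    for t in rel:
        parts[hc(t[0])].append(t)
    return parts

def grace_join(R, S):
    out = []
    for pr, ps in zip(partition(R), partition(S)):   # phase 1: co-partition
        table = {}                                   # phase 2: build on R's partition...
        for t in pr:
            table.setdefault(t[0], []).append(t)
        for s in ps:                                 # ...then probe with S's partition
            for r in table.get(s[0], []):
                out.append(r + s)
    return out

print(grace_join([("a", 1), ("b", 2)], [("a", "x"), ("a", "y"), ("c", "z")]))
```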

  46. Hash-Join [diagram] • Phase 1: both relations read from disk and hashed by hc through B-1 output buffers into matching partitions of R & S on disk • Phase 2: a partition Ri is built into an in-memory hash table (k < B-1 pages) via hf, the matching partition of S is read through an input buffer and probed via hf, and join results are written through an output buffer

  47. Symmetric Hash Join [Mikkilineni & Su, TOSE 88] [Wilschut & Apers, PDIS 91] • Pipelining, in-core variant • Build and probe symmetrically • Correctness: each output tuple generated when its last-arriving component appears • Can be extended to out-of-core case • Tukwila [Ives & Halevy, SIGMOD ‘99] • XJoin: spill and read partitions multiple times • Correctness guaranteed by timestamping tuples and partitions • [Urhan & Franklin, DEBull ‘00]
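
A sketch of the in-core symmetric variant as described on the slide: one hash table per input; each arriving tuple probes the other side's table and is then inserted into its own, so every result is emitted exactly when its last-arriving component appears. The stream encoding here is an assumption for illustration:

```python
def symmetric_hash_join(stream):
    """stream yields ('R', tuple) or ('S', tuple); join key is tuple[0]."""
    tables = {"R": {}, "S": {}}
    for side, t in stream:
        other = "S" if side == "R" else "R"
        for m in tables[other].get(t[0], []):        # probe the other side first
            yield (t, m) if side == "R" else (m, t)
        tables[side].setdefault(t[0], []).append(t)  # then build into our own table

arrivals = [("R", ("k1", 1)), ("S", ("k1", "a")), ("S", ("k2", "b")), ("R", ("k2", 2))]
print(list(symmetric_hash_join(arrivals)))
```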

  48. Relational Query Engines

  49. A Basic SQL primer SELECT [DISTINCT] <output expressions> FROM <tables> [WHERE <predicates>] [GROUP BY <gb-expression> [HAVING <h-predicates>]] [ORDER BY <expression>] • Join tables in FROM clause • applying predicates in WHERE clause • If GROUP BY, partition results by GROUP • And maintain aggregate output expressions per group • Delete groups that don’t satisfy HAVING clause • If ORDER BY, sort output accordingly

  50. Examples • Single-table S-F-W • DISTINCT, ORDER BY • Multi-table S-F-W • And self-join • Scalar output expressions • Aggregate output expressions • With and without DISTINCT • Group By • Having • Nested queries • Uncorrelated and correlated
