
data parallelism


Presentation Transcript


  1. data parallelism Chris Olston Yahoo! Research

  2. set-oriented computation • data management operations tend to be “set-oriented”, e.g.: • apply f() to each member of a set • compute intersection of two sets • easy to parallelize • parallel data management is parallel computing’s biggest success story
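
A minimal sketch of the “apply f() to each member of a set” pattern, using Python’s multiprocessing pool; the function f() and the worker count are placeholders, not from the talk:

```python
from multiprocessing import Pool

def f(x):
    # stand-in for the per-element function; any pure function of one element works
    return x * x

if __name__ == "__main__":
    data = list(range(1_000_000))            # the input set
    with Pool(processes=4) as pool:          # each worker processes a chunk of the set
        results = pool.map(f, data, chunksize=10_000)
    print(sum(results))
```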

  3. history • 1970’s: relational database systems (declarative set-oriented primitives) • 1980’s: parallel relational database systems • now: renaissance: map-reduce etc.

  4. architectures • shared-memory • shared-disk • shared-nothing (clusters) [diagram: spectrum trading off expense and scale against message overheads and skew]

  5. early systems • XPRS (Berkeley, shared-memory) • Gamma (Wisconsin, shared-nothing) • Volcano (Colorado, shared-nothing) • Bubba (MCC, shared-nothing) • Teradata (shared-nothing) • Tandem non-stop SQL (shared-nothing)

  6. example • data: • pages(url, change_freq, spam_score, …) • links(from, to) • question: • how many inlinks from non-spam pages does each page have?

  7. parallel evaluation [dataflow diagram: pages and links are read in parallel; pages are filtered by spam score, joined with links by pages.url = links.from, then grouped by links.to to produce the answer]
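
To make the dataflow concrete, here is a toy single-process version of the same plan; the sample rows and the spam-score threshold are invented for illustration, and in a real system each step would run partitioned across many nodes:

```python
pages = [  # (url, change_freq, spam_score)
    ("a.com", 0.5, 0.1), ("b.com", 0.2, 0.9), ("c.com", 0.7, 0.0),
]
links = [  # (from, to)
    ("a.com", "c.com"), ("b.com", "c.com"), ("a.com", "b.com"),
]

SPAM_THRESHOLD = 0.5                          # assumed cutoff, not from the talk

# filter by spam score
good = {url for (url, _, spam) in pages if spam < SPAM_THRESHOLD}

# join by pages.url = links.from, then group by links.to and count
inlinks = {}
for frm, to in links:
    if frm in good:
        inlinks[to] = inlinks.get(to, 0) + 1

print(inlinks)   # {'c.com': 1, 'b.com': 1}
```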

  8. parallelism opportunities • inter-job • intra-job • inter-operator • pipeline: f(g(X)) • tree: f(g(X), h(Y)) • intra-operator • partition: f(X) = f(X1) ∪ f(X2) ∪ … ∪ f(Xn)

  9. parallelism obstacles • data dependencies • e.g. set intersection w/asymmetric hashing: must hash input1 before reading input2 • resource contention • e.g. many nodes transmit to node X simultaneously • startup & teardown costs

  10. metrics • speed-up: throughput vs. parallelism at a fixed data size, compared against the ideal linear curve • scale-up: throughput as data size grows in proportion to parallelism, compared against the ideal flat curve
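
The two metrics can be written as ratios of running times (a standard formulation implied by the slide’s plots, with my notation):

$$\text{speed-up}(n) = \frac{T(1\ \text{node},\ \text{data of size}\ D)}{T(n\ \text{nodes},\ \text{data of size}\ D)} \qquad \text{ideal: } n$$

$$\text{scale-up}(n) = \frac{T(1\ \text{node},\ \text{data of size}\ D)}{T(n\ \text{nodes},\ \text{data of size}\ nD)} \qquad \text{ideal: } 1$$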

  11. talk outline • introduction • query processing • data placement • recent systems

  12. query evaluation • key primitives: • lookup • sort • group • join

  13. lookup by key • data partitioned on a function of the key? great! • otherwise: d’oh [diagram: lookup against partitioned data]
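
A minimal sketch of the “partitioned on a function of the key” case: the lookup is routed to a single node rather than broadcast to all of them. The node count and hash function are illustrative assumptions:

```python
import hashlib

NUM_NODES = 8

def node_for(key: str) -> int:
    # stable hash of the key -> node id; the same function is used
    # when loading the data and when answering lookups
    h = hashlib.md5(key.encode()).hexdigest()
    return int(h, 16) % NUM_NODES

# lookup by key: contact exactly one node
print(node_for("http://example.com/page"))

# if the data were partitioned on some other attribute, the lookup
# would have to be broadcast to all NUM_NODES nodes instead
```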

  14. sort [diagram: each node sorts its local data, then the sorted runs are merged into key ranges a-c … w-z] • problems with this approach?

  15. sort, improved • a key issue: avoiding skew • sample to estimate data distribution • choose ranges to get uniformity [diagram: partition the input by key range (a-c … w-z), then each node receives & sorts its range]
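
A sketch of the sampling step: draw a small sample, sort it, and take equally spaced quantiles as range boundaries, so each range gets roughly the same number of records even when keys are clustered. Sample size and partition count are illustrative:

```python
import random

def choose_range_boundaries(data, num_partitions, sample_size=1000):
    # sample to estimate the key distribution
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    # pick num_partitions - 1 cut points at equally spaced quantiles
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]

keys = [random.gauss(50, 10) for _ in range(100_000)]   # keys clustered around 50
boundaries = choose_range_boundaries(keys, num_partitions=4)
print(boundaries)  # roughly the 25th/50th/75th percentiles, so partitions stay balanced
```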

  16. group • again, skew is an issue • approaches: • avoid (choose partition function carefully) • react (migrate groups to balance load) [diagram: partition into buckets 0-99, then sort or hash within each partition]

  17. join • alternatives: • symmetric repartitioning • asymmetric repartitioning • fragment and replicate • generalized f-and-r

  18. join: symmetric repartitioning [diagram: input A and input B are each partitioned on the join key, then every node runs an equality-based join on its pair of partitions]
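
A single-process sketch of symmetric repartitioning: both inputs are hashed on the join key into matching buckets, and each bucket pair is joined independently (so each pair could run on a different node). The data and bucket count are made up:

```python
def repartition(rows, key_index, num_partitions):
    # same hash function on both inputs, so matching keys land in matching buckets
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts

def hash_join(part_a, part_b, key_a, key_b):
    table = {}
    for row in part_a:
        table.setdefault(row[key_a], []).append(row)
    return [ra + rb for rb in part_b for ra in table.get(rb[key_b], [])]

input_a = [("a.com", 0.1), ("b.com", 0.9)]           # (url, spam_score)
input_b = [("a.com", "c.com"), ("b.com", "c.com")]   # (from, to)

N = 4
parts_a = repartition(input_a, key_index=0, num_partitions=N)
parts_b = repartition(input_b, key_index=0, num_partitions=N)
result = [row for i in range(N) for row in hash_join(parts_a[i], parts_b[i], 0, 0)]
print(result)
```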

  19. join: asymmetric repartitioning [diagram: input A is repartitioned on the join key, input B is already suitably partitioned, then every node runs an equality-based join]

  20. join: fragment and replicate [diagram: input A is fragmented across the nodes; input B is replicated to every node]

  21. join: generalized f-and-r [diagram: both input A and input B are fragmented, with fragments replicated so that every A-fragment meets every B-fragment on some node]

  22. join: other techniques • semi-join • bloom-join
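
A rough sketch of the bloom-join idea (not the talk’s code): the side holding one input ships a compact bit-vector summary of its join keys, so the other side can drop rows that cannot possibly join before sending them over the network. Filter size and hash count are arbitrary:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)
    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits
    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1
    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

# side A builds a filter over its join keys and ships only the filter
keys_a = {"a.com", "c.com"}
bf = BloomFilter()
for k in keys_a:
    bf.add(k)

# side B drops rows that cannot join before shipping them across the network
rows_b = [("a.com", "x"), ("b.com", "y"), ("c.com", "z")]
to_ship = [r for r in rows_b if bf.might_contain(r[0])]
print(to_ship)   # 'b.com' is (almost certainly) filtered out locally
```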

  23. query optimization • degrees of freedom • objective functions • observations • approaches

  24. degrees of freedom • conventional query planning stuff: • access methods, join order, join algorithms, selection/projection placement, … • parallel join strategy (repartition, f-and-r) • partition choices (“coloring”) • degree of parallelism • scheduling of operators onto nodes • pipeline vs. materialize between nodes

  25. objective functions • want: • low response time for jobs • low overhead (high system throughput) • these are at odds • e.g., pipelining two operators may decrease response time, but incurs more overhead

  26. proposed objective functions • Hong: • linear combination of response time and overhead • Ganguly: • minimize response time, with limit on extra overhead • minimize response time, as long as cost-benefit ratio is low
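
One plausible way to write these objectives down (the notation is mine, not from the slides), with $T$ the response time of a plan and $W$ its total work/overhead:

$$\text{Hong:}\quad \min_{\text{plans}} \; \alpha\,T(\text{plan}) + (1-\alpha)\,W(\text{plan}), \qquad 0 \le \alpha \le 1$$

$$\text{Ganguly:}\quad \min_{\text{plans}} \; T(\text{plan}) \quad \text{s.t.} \quad W(\text{plan}) \le (1+\epsilon)\,W_{\min}$$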

  27. observations • response time metric violates “principle of optimality” • every subplan of an optimal plan is optimal • dynamic programming relies on this property • hence, so does System-R (w/“interesting orders” patch) • example: [cost diagram: alternative plans for A join B using index scans on A and B (subplan costs 10 vs. 20), illustrating that the locally cheapest subplan need not yield the best overall plan]

  28. approaches • two-phase [Hong]: • find optimal sequential plan • find optimal parallelization of above (“coloring”) • optimal for shared-memory w/intra-operator parallelism only • one-phase (still open research): • model sources & deterrents of parallelism in cost formulae • can’t use DP, but can still prune search space using partial orders (i.e., some subplans dominate others) [Ganguly]

  29. talk outline • introduction • query processing • data placement • recent systems

  30. data placement • degrees of freedom: • declustering degree • which set of nodes • map records to nodes

  31. declustering degree • spread table across how many nodes? • function of table size • determine empirically, for a given system

  32. which set of nodes • three strategies [Mehta]: • random (worst) • round-robin • heat-based (best, given accurate workload model) • (none take into account locality for joins)

  33. map records to nodes • avoid hot-spots • hash partitioning works fairly well • range partitioning with careful ranges better [DeWitt] • add redundancy • chained declustering [DeWitt]: node i holds the primary copy of partition i and the secondary copy of partition i+1, wrapping around (1p 2s | 2p 3s | 3p 1s)
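
A small sketch of the chained declustering placement rule as I read the diagram: node i stores the primary copy of partition i and the secondary copy of partition i+1 (mod n), so when a node fails its load can be shifted along the chain instead of doubling on a single mirror:

```python
def chained_declustering(num_nodes):
    # returns {node: (primary_partition, secondary_partition)}
    return {i: (i, (i + 1) % num_nodes) for i in range(num_nodes)}

print(chained_declustering(3))
# {0: (0, 1), 1: (1, 2), 2: (2, 0)}  -- matches the slide's 1p 2s / 2p 3s / 3p 1s (0-indexed)
```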

  34. talk outline • introduction • query processing • data placement • recent systems

  35. academia • C-store [MIT] • separate transactional & read-only systems • compressed, column-oriented storage • k-way redundancy; copies sorted on different keys • River & Flux [Berkeley] • run-time adaptation to avoid skew • high availability for long-running queries via redundant computation

  36. Google (batch computation) • Map-Reduce: • grouped aggregation with UDFs • fault tolerance: redo failed operations • skew mitigation: fine-grained partitioning & redundant execution of “stragglers” • Sawzall language: • SELECT-FROM-GROUPBY style queries • schemas (“protocol buffers”) • convert errors into undefined values • primitives for operating on nested sets
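
A bare-bones single-process sketch of the grouped-aggregation-with-UDFs pattern (map, shuffle by key, reduce); fault tolerance, partitioning across machines, and straggler handling are all omitted:

```python
from collections import defaultdict

def map_reduce(records, map_udf, reduce_udf):
    # map phase: each record yields (key, value) pairs
    groups = defaultdict(list)
    for record in records:
        for key, value in map_udf(record):
            groups[key].append(value)           # "shuffle": group values by key
    # reduce phase: one call per key
    return {key: reduce_udf(key, values) for key, values in groups.items()}

# example UDFs: count inlinks per target page from a list of (from, to) links
links = [("a.com", "c.com"), ("b.com", "c.com"), ("a.com", "b.com")]
counts = map_reduce(links,
                    map_udf=lambda link: [(link[1], 1)],
                    reduce_udf=lambda key, values: sum(values))
print(counts)   # {'c.com': 2, 'b.com': 1}
```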

  37. Google (random access) • Bigtable: • single logical table, physically distributed • horizontal partitioning • sorted base + deltas, with periodic coalescing • API: read/write cells, with versioning • one level of nesting: a top-level cell may contain a set • e.g. set of incoming anchortext strings
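
A toy sketch of the “sorted base + deltas, periodic coalescing” idea: writes go into an in-memory delta, reads merge base and delta and return versioned cell values, and a coalesce step folds the delta into a new sorted base. All names are mine, and distribution, persistence, and the nesting feature are ignored:

```python
class ToyTable:
    def __init__(self):
        self.base = []       # sorted list of (row_key, column, timestamp, value)
        self.delta = []      # recent writes, not yet merged

    def write(self, row, col, ts, value):
        self.delta.append((row, col, ts, value))

    def read(self, row, col):
        # merge base and delta; return all versions, newest first
        versions = [(ts, v) for (r, c, ts, v) in self.base + self.delta
                    if r == row and c == col]
        return sorted(versions, reverse=True)

    def coalesce(self):
        # fold the delta into a new sorted base
        self.base = sorted(self.base + self.delta)
        self.delta = []

t = ToyTable()
t.write("c.com", "anchor", ts=1, value="click here")
t.write("c.com", "anchor", ts=2, value="great page")
t.coalesce()
print(t.read("c.com", "anchor"))   # [(2, 'great page'), (1, 'click here')]
```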

  38. Yahoo! • Pig (batch computation): • relational-algebra-style query language • map-reduce-style evaluation • PNUTS (random access): • primary & secondary indexes • transactional semantics

  39. IBM, Microsoft • Impliance [IBM; still on drawing board] • 3 kinds of nodes: data, processing, xact mgmt • supposed to handle loosely structured data • Dryad [Microsoft] • computation expressed as logical dataflow graph with explicit parallelism • query compiler superimposes graph onto cluster nodes

  40. summary • big data = a good app for parallel computing • the game: • partition & repartition data • avoid hotspots, skew • be prepared for failures • still an open area! • optimizing complex queries, caching intermediate results, horizontal vs. vertical storage, …
