
Optimizing and Parallelizing Ranked Enumeration






Presentation Transcript


  1. Optimizing and Parallelizing Ranked Enumeration (VLDB 2011, Seattle, WA)

  2. Background: DB Search at HebrewU. Demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06 • The initial implementation was too slow… • We purchased a multi-core server • It didn’t help: the cores were usually idle • Due to the inherent flow of the enumeration technique we used • We needed a deeper understanding of ranked enumeration to benefit from parallelization • Hence this paper

  3. _

  4. The Ranked Enumeration Problem. The user receives the best answer, the 2nd-best answer, the 3rd-best answer, … out of a huge number (e.g., 2^|problem|) of ranked answers; we can’t afford to instantiate all of them. • Examples: • Various graph optimizations: shortest paths, smallest spanning trees, best perfect matchings • Top results of keyword search on DBs (graph search) • Most probable answers in probabilistic DBs • Best recommendations for schema integration • “Complexity”: • What is the delay between successive answers? • How much time to get the top-k?

  5. Abstract Problem Formulation. Input: a collection O of objects. Answers: A = { a ⊆ O }, a huge set described by a condition on the subsets of O. Score: score(a) is high ⟺ a is of high quality. Goal: find the top-k answers a1, a2, a3, …, ak.
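The abstraction can be made concrete with a deliberately naive baseline: instantiate every subset, keep the ones satisfying the answer condition, and rank them by score. This is exactly what the slides say we cannot afford (exponentially many answers), which is what motivates ranked enumeration; the function name and signature here are illustrative, not from the paper.

```python
import heapq
from itertools import chain, combinations

def naive_topk(objects, is_answer, score, k):
    """Naive baseline: enumerate ALL subsets of `objects`, keep those
    for which is_answer(a) holds, and return the k with highest score.
    Exponential in |objects| -- the cost ranked enumeration avoids."""
    all_subsets = chain.from_iterable(
        combinations(objects, r) for r in range(len(objects) + 1))
    answers = (a for a in map(frozenset, all_subsets) if is_answer(a))
    return heapq.nlargest(k, answers, key=score)

# Toy instance: answers are the 2-element subsets, score = total weight.
top = naive_topk([1, 2, 3, 4], lambda a: len(a) == 2, sum, 2)
```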

  6. Graph Search in the Abstraction. Input: a data graph G and a set Q of keywords. Objects: O = the edges of G. Answers: A = the subtrees (edge sets) a containing all keywords in Q (without redundancy, see [GKS 2008]). Goal: find the top-k answers.

  7. What is the Challenge? After the 1st (top) answer, the 2nd answer, and so on are printed, the jth answer must: • differ from the previous (j-1) answers • be the best remaining answer. This is an optimization problem, conceivably much more complicated than top-1! How do we handle these constraints? (j may be large!)

  8. Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]. Lawler-Murty’s gives a general reduction: finding the top-k answers is PTIME if finding the top-1 answer under simple constraints is PTIME. We understand optimization much better! Often this amounts to classical optimization, e.g., shortest path (though sometimes it gets involved, e.g., [KS 2006]). Another general top-k procedure, [Hamacher & Queyranne 84], is very similar!

  9. Among the Uses of Lawler-Murty’s. Graph/combinatorial algorithms: • Shortest simple paths [Yen 1972] • Minimum spanning trees [Gabow 1977, Katoh et al. 1981] • Best solutions in resource allocation [Katoh et al. 1981] • Best perfect matchings, best cuts [Hamacher & Queyranne 1985] • Minimum Steiner trees [KS 2006]. Bioinformatics: • Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]. Data management: • ORDER-BY queries [KS 2006, 2007] • Graph/XML search [GKS 2008] • Generation of forms over integrated data [Talukdar et al. 2008] • Course recommendation [Parameswaran & Garcia-Molina 2009] • Querying Markov sequences [K & Ré 2010]

  10. Lawler-Murty’s Method: Conceptual

  11. 1. Find & Print the Top Answer. In principle, at this point we should find the second-best answer. But instead…

  12. 2. Partition the Remaining Answers. The partition is defined by a set of simple constraints: • Inclusion constraint: “must contain a given object” • Exclusion constraint: “must not contain a given object”

  13. 3. Find the Top of Each Set

  14. 4. Find & Print the Second Answer. Next answer: the best among all the top answers in the partitions

  15. 5. Further Divide the Chosen Partition … and so on (until k answers are printed)
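The five steps above can be sketched end to end on a toy instance: answers are the m-element subsets of a weighted object set, scored by total weight. This is a minimal illustration of the reduction, not the paper's implementation; `top1` (the "top-1 under simple constraints" subroutine) and its signature are illustrative.

```python
import heapq
from itertools import count

def top1(weights, m, include, exclude):
    """Best m-subset (by total weight) containing `include` and avoiding
    `exclude`; None if infeasible. The top-1-under-constraints oracle."""
    if len(include) > m or include & exclude:
        return None
    rest = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: -weights[o])
    need = m - len(include)
    if len(rest) < need:
        return None
    return frozenset(include) | frozenset(rest[:need])

def lawler_murty(weights, m, k):
    """Pop the best answer, print it, split its partition with
    inclusion/exclusion constraints, solve each new partition's top-1,
    and repeat until k answers are printed. Assumes a feasible instance."""
    score = lambda a: sum(weights[o] for o in a)
    tie = count()                                   # heap tiebreaker
    seed = top1(weights, m, frozenset(), frozenset())
    heap = [(-score(seed), next(tie), seed, frozenset(), frozenset())]
    out = []
    while heap and len(out) < k:
        neg, _, ans, inc, exc = heapq.heappop(heap)
        out.append((-neg, ans))                     # steps 1 and 4: print
        forced = set(inc)
        for e in sorted(ans - inc):                 # steps 2 and 5: partition
            child = top1(weights, m, frozenset(forced), exc | {e})
            if child is not None:                   # step 3: top of each set
                heapq.heappush(heap, (-score(child), next(tie), child,
                                      frozenset(forced), exc | {e}))
            forced.add(e)
    return out
```

The partitions are disjoint by construction (the j-th child must contain the first j-1 free objects of the popped answer but exclude the j-th), so no answer is ever produced twice.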

  16. Lawler-Murty’s: Actual Execution. (Figure: the output holds the answers printed already; a queue holds the partition reps. + the best of each; the overall best is selected.)

  17. Lawler-Murty’s: Actual Execution. For each new partition, a task to find the best answer. (Figure: partition reps. + best of each.)

  18. Lawler-Murty’s: Actual Execution. (Figure: the new tasks return their best answers; the best among them is selected.)

  19. _

  20. Typical Bottleneck. (Figure: output; partition reps. + best of each.)

  21. Typical Bottleneck. Will these answers even be in the top-k? (Figure: output; partition reps. + best of each.)

  22. Progressive Upper Bound • Throughout the execution, an optimization algorithm can often upper-bound its final solution’s score • Progressive: the bound gets smaller over time • Often the bounds are nontrivial, e.g.: • Dijkstra’s algorithm: the distance at the top of the queue • Similarly, some Steiner-tree algorithms [Dreyfus & Wagner 72] • Viterbi algorithms: the maximal intermediate probability • Primal-dual methods: the value of the dual LP solution. (Figure: the bound shrinking over time: ≤24, ≤22, ≤18, ≤14, 12.)
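The Dijkstra example on this slide can be sketched as a resumable task: the algorithm is written as a generator that emits its current queue-top distance after every settled node. Since settled distances only grow, no remaining path can be shorter than the emitted value, so (with score = negated length) each event is a progressive upper bound on the final score. The event protocol here is an illustrative convention, not the paper's API.

```python
import heapq

def dijkstra_bounds(graph, source, target):
    """Dijkstra's algorithm as a resumable task. Yields ('bound', d)
    events, where d (the distance at the top of the queue) never
    decreases, and finally ('done', dist) with the exact distance."""
    dist = {source: 0}
    pq = [(0, source)]
    settled = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in settled:
            continue
        settled.add(u)
        if u == target:
            yield ('done', d)
            return
        yield ('bound', d)          # no remaining path is shorter than d
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    yield ('done', float('inf'))    # target unreachable
```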

  23. Freezing Tasks (Simplified). (Figure: output; partition reps. + best of each.)

  24. Freezing Tasks (Simplified). (Figure: running tasks report progressive bounds: ≤24, ≤23, ≤22, …)

  25. Freezing Tasks (Simplified). (Figure: a task whose bound drops to ≤20 is frozen, since a known answer scores 22 > 20.)

  26. Freezing Tasks (Simplified). (Figure: frozen tasks wait at their bounds (≤20, ≤18, ≤16, ≤15); the best one is resumed.)
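The freezing idea on slides 23-26 can be sketched as a scheduler over resumable tasks: always advance the task whose progressive upper bound is currently the best, leaving every other task frozen at its last bound; a finished score is printed only once it provably beats all remaining bounds. This is a simplified, serial sketch under an assumed event protocol (each task yields ('bound', b) events, then ('done', score)), not the paper's algorithm verbatim.

```python
import heapq
from itertools import count

def freezing_scheduler(tasks, k):
    """Advance the task with the best progressive upper bound; freeze
    the rest. Each task is a generator yielding ('bound', b) events and
    finally ('done', score). Returns the top-k scores (maximization)."""
    tie = count()                    # tiebreaker so heap entries compare
    heap = [(float('-inf'), next(tie), 'run', t) for t in tasks]
    heapq.heapify(heap)              # keyed by -bound: best bound first
    out = []
    while heap and len(out) < k:
        key, _, kind, payload = heapq.heappop(heap)
        if kind == 'done':
            out.append(payload)      # provably beats every frozen bound
            continue
        ev, val = next(payload)      # resume (un-freeze) the task one step
        heapq.heappush(heap, (-val, next(tie),
                              'run' if ev == 'bound' else 'done',
                              payload if ev == 'bound' else val))
    return out
```

Because a completed score re-enters the same queue keyed like a bound, the scheduler never wastes work on a partition whose bound is already below an answer in hand, which is exactly the waste the bottleneck slides illustrate.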

  27. Improvement of Freezing. Experiments: graph search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory. Simple Lawler-Murty vs. freezing on Mondial (k = 10, 100), DBLP (part) (k = 10, 100), and DBLP (full) (k = 10, 100). On average, freezing saved 56% of the running time.

  28. _

  29. Straightforward Parallelization. (Figure: awaiting tasks; output.)

  30. Straightforward Parallelization. (Figure: threads pick up the awaiting tasks.)

  31. Straightforward Parallelization. (Figure: completed tasks return their answers.)
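The straightforward scheme amounts to handing the awaiting per-partition top-1 tasks to a thread pool instead of solving them one at a time. A minimal sketch, assuming some `top1` solver that takes (inclusions, exclusions); both names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_partitions_parallel(partitions, top1, n_threads=8):
    """Submit every awaiting partition's top-1 task to a thread pool
    and collect the best answer of each partition."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(top1, inc, exc) for inc, exc in partitions]
        return [f.result() for f in futures]   # best answer per partition
```

As the next slide observes, this alone helps little: the procedure still waits for the whole batch before printing the next answer, so cores idle whenever one task runs long.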

  32. Not so fast… Typically this reduced the running time by only 30%, and the same held for 2, 3, …, 8 threads!

  33. Idle Cores while Waiting. (Figure: awaiting tasks; output.)

  34. Idle Cores while Waiting. (Figure: while one long task runs, the other cores are idle.)

  35. Early Popping. (Figure: tasks are popped before completion, carrying their current bounds: ≤22, ≤20, ≤23, ≤19, ≤24; an answer of 22 beats a bound of ≤20.) • Skipped issues: • Thread synchronization: semaphores, locking, etc. • Correctness

  36. Improvement of Early Popping. Experiments: graph search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory. Mondial and DBLP (part), with short, medium-size & long queries.

  37. Early Popping vs. (Serial) Freezing. Experiments: graph search; same machine. Mondial and DBLP (part), with short, medium-size & long queries. • We need 4 threads to start gaining • And even then the gain is fairly poor…

  38. Combining Freezing & Early Popping • We discuss additional ideas and techniques to further utilize the cores • Not here; see the paper • The main speedup comes from combining early popping with freezing • Cores are kept busy… on high-potential tasks • Thread synchronization is quite involved • At a high level, the final algorithm has the following flow:

  39. Combining: General Idea. Threads work on the frozen tasks; partition reps. are kept as frozen tasks. (Figure: computed answers flow to a to-print queue; frozen + new tasks rejoin the pool.)

  40. Combining: General Idea. (Figure: the threads advance the frozen tasks; computed answers accumulate in the to-print queue.)

  41. Combining: General Idea. The main task just pops computed results to print… but validates that no frozen task can still yield a better result. (Figure: computed answers (to-print); frozen + new tasks; threads work on frozen tasks; partition reps. as frozen tasks.)
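The flow of slides 39-41 can be sketched with real threads on a fixed pool of partition tasks: workers repeatedly advance the frozen task with the best progressive bound, while the main loop pops computed answers in score order and validates each against the last-known bound of every unfinished task. This is a heavily simplified sketch under an assumed event protocol (tasks yield ('bound', b), then ('done', score)); as the previous slide notes, the real synchronization is far more involved, and real code must also spawn new partition tasks as answers are printed.

```python
import heapq
import threading
from itertools import count

def combined_topk(tasks, k, n_threads=4):
    """Workers advance frozen tasks; the main loop prints a computed
    answer only once it beats every unfinished task's last-known bound."""
    cv = threading.Condition()
    tie = count()
    frozen = [(float('-inf'), next(tie), t) for t in tasks]
    heapq.heapify(frozen)                       # keyed by -bound
    bounds = {t: float('inf') for t in tasks}   # last-known upper bounds
    computed = []                               # (-score, tie) min-heap
    printed = []

    def worker():
        while True:
            with cv:
                if len(printed) >= k or not frozen:
                    cv.notify_all()
                    return
                _, _, t = heapq.heappop(frozen)
            ev, val = next(t)                   # advance outside the lock
            with cv:
                if ev == 'bound':
                    bounds[t] = val             # bounds only shrink
                    heapq.heappush(frozen, (-val, next(tie), t))
                else:                           # ('done', score)
                    del bounds[t]
                    heapq.heappush(computed, (-val, next(tie)))
                cv.notify_all()

    workers = [threading.Thread(target=worker) for _ in range(n_threads)]
    for w in workers:
        w.start()
    with cv:
        while len(printed) < k:
            best_bound = max(bounds.values(), default=float('-inf'))
            if computed and -computed[0][0] >= best_bound:
                printed.append(-heapq.heappop(computed)[0])  # validated
            elif not bounds:                    # nothing left to compute
                break
            else:
                cv.wait()                       # let workers make progress
    for w in workers:
        w.join()
    return printed
```

A task being advanced keeps its last-known bound in `bounds`, so the validation stays sound even for tasks that are mid-step: bounds only decrease, hence the stale value is still a valid upper bound.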

  42. Combined vs. (Serial) Freezing. Experiments: graph search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory. Mondial and DBLP. Now there is a significant gain (≈50%) already with 2 threads.

  43. Improvement of Combined. Experiments: graph search; same machine. Mondial: 3%-10%; DBLP: 4%-5%. On average, with 8 threads we got down to 5.7% of the original running time.

  44. _

  45. Conclusions • We considered Lawler-Murty’s ranked enumeration • It has theoretical complexity guarantees • …but a direct implementation is very slow • Straightforward parallelization utilizes the cores poorly • Ideas: progressive bounds, freezing, early popping • In the paper: additional ideas and combinations thereof • The most significant speedup comes from combining these ideas • Its flow substantially differs from the original procedure • 20x faster on 8 cores • Test case: graph search; focus: general applicability • Future: additional test cases. Questions?
