
MRShare: Sharing Across Multiple Queries in MapReduce


Presentation Transcript


  1. MRShare: Sharing Across Multiple Queries in MapReduce Presented by Xiaolan Wang and Pengfei Tang By Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

  2. Motivation • Reducing the execution time • Reducing energy consumption • Monetary savings *http://aws.amazon.com/ec2/#pricing

  3. MRShare – a sharing framework for Map Reduce • MRShare framework: • Inspired by sharing primitives from the relational domain • Introduces a cost model for Map Reduce jobs • Searches for the optimal sharing strategies • Does not change the Map Reduce computational model

  4. Outline • Introduction • Map Reduce recap. • MRShare – Sharing opportunities in Map-Reduce • Cost model for MapReduce • MRShare – Grouping algorithms • MRShare Implementation and Evaluation • Summary

  5. Outline • Introduction • Map Reduce recap. • MRShare – Sharing opportunities in Map-Reduce • Cost model for MapReduce • MRShare – Grouping algorithms • MRShare Implementation and Evaluation • Summary

  6. Map Reduce recap. • [Figure: MapReduce data flow – input read from HDFS, processed by map tasks, intermediate data shuffled over the network to reduce tasks, output written back to HDFS]
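
For reference, a minimal word-count job in the standard Hadoop MapReduce API – a sketch of the computational model only, not code from the presentation:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: read lines from HDFS and emit (word, 1) pairs; the framework
    // sorts and shuffles the intermediate data over the network to the reducers.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receive each word with all of its counts, sum them,
    // and write the totals back to HDFS.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }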

  7. Outline • Introduction • Map Reduce recap. • MRShare - Sharing opportunities in Map-Reduce • Sharing scans • Sharing intermediate data • Cost model for MapReduce • MRShare – Grouping algorithms • MRShare Implementation and Evaluation • Summary

  8. Sharing opportunities – sharing scans • SQL: SELECT COUNT(*) FROM user GROUP BY hometown • SQL: SELECT AVG(age) FROM user GROUP BY hometown • [Figure: both queries scan the same user table; the maps emit (hometown, value) pairs – e.g. (Toronto, 1) for the count and (Toronto, 17) for the average – and the reducers aggregate per hometown, so the input scan can be shared]

  9. MRShare – sharing scans (map) • [Figure: the shared input is read once by a meta-map, which invokes the map functions of Map 1 – Map 4 on every record and emits their combined map output]
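
A minimal sketch of how such a meta-map could look on Hadoop. This is illustrative only, not the MRShare code; the per-job map interface and the tag layout (job id prefixed to the key) are assumptions:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Meta-map sketch: the shared input is scanned once, and every record is
    // handed to the map logic of each job in the group. Each emitted key is
    // prefixed with the originating job's tag so the meta-reduce can later
    // route values to the right reduce function.
    public class MetaMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Per-job map logic registered by the rewriter (assumed interface).
        interface JobMapLogic {
            // Returns (key, value) pairs for one record, or nothing if filtered out.
            List<String[]> map(String record);
        }

        private List<JobMapLogic> jobs;

        @Override
        protected void setup(Context context) {
            // A real rewriter would deserialize the per-job map logic from the
            // job configuration; this sketch just starts with an empty list.
            jobs = new ArrayList<>();
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String line = record.toString();
            for (int jobId = 0; jobId < jobs.size(); jobId++) {
                for (String[] kv : jobs.get(jobId).map(line)) {
                    // Tag the key with the job id, e.g. "2:Toronto".
                    context.write(new Text(jobId + ":" + kv[0]), new Text(kv[1]));
                }
            }
        }
    }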

  10. MRShare – sharing scans (reduce) • [Figure: a meta-reduce receives the combined intermediate data and dispatches each group to the corresponding reduce function, Reduce 1 – Reduce 4]
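
The matching meta-reduce sketch, again illustrative and not the MRShare code (the tag layout and per-job reduce interface are the same assumptions as above):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Meta-reduce sketch: keys arrive with the job tag that the meta-map added,
    // so the meta-reduce strips the tag and hands the group of values to the
    // reduce logic of that job only.
    public class MetaReducer extends Reducer<Text, Text, Text, Text> {

        interface JobReduceLogic {
            // Aggregates the values of one key and returns the output value.
            String reduce(String key, Iterable<String> values);
        }

        private List<JobReduceLogic> jobs;

        @Override
        protected void setup(Context context) {
            // Populated from the job configuration in a real rewriter.
            jobs = new ArrayList<>();
        }

        @Override
        protected void reduce(Text taggedKey, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] parts = taggedKey.toString().split(":", 2);   // "jobId:key"
            int jobId = Integer.parseInt(parts[0]);
            String key = parts[1];

            // Unwrap Text values for the per-job reduce logic.
            List<String> plain = new ArrayList<>();
            for (Text v : values) {
                plain.add(v.toString());
            }
            String result = jobs.get(jobId).reduce(key, plain);
            // Each job's output could go to its own file (e.g. via MultipleOutputs);
            // here the output is simply prefixed with the job id.
            context.write(new Text(jobId + ":" + key), new Text(result));
        }
    }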

  11. Sharing Map Output • Q1: SELECT T.a, sum(T.b) FROM T WHERE T.a>10 AND T.a<20 GROUP BY T.a • Q2: SELECT T.a, avg(T.b) FROM T WHERE T.b>10 AND T.c<100 GROUP BY T.a • Both queries group on T.a, so for tuples that satisfy both WHERE clauses the map output is identical and can be produced and shipped once, tagged for both jobs

  12. Sharing Map • Q1: SELECT T.c, sum(T.b) FROM T WHERE T.c > 10 GROUP BY T.c • Q2: SELECT T.a, avg(T.b) FROM T WHERE T.c > 10 GROUP BY T.a • The selection predicate (T.c > 10) is identical, so the map-side processing of the input can be shared even though the grouping keys differ

  13. Sharing Parts of Map • Q1: SELECT T.a, sum(T.b) FROM T WHERE T.c>10 AND T.a<20 GROUP BY T.a • Q2: SELECT T.a, avg(T.b) FROM T WHERE T.c>10 AND T.c<100 GROUP BY T.a • Only the common predicate (T.c>10) can be shared; the remaining predicates are evaluated separately for each query

  14. Outline • Introduction • Map Reduce recap. • MRShare – Sharing opportunities in Map-Reduce • Cost model for MapReduce • MRShare – Grouping algorithm • MRShare Implementation and Evaluation • Summary

  15. Cost model for Map Reduce (single job) • Reading – f(input size) • Sorting – f(intermediate data size) • Transferring – f(intermediate data size) • Writing – f(output size) • T(J) = T_read(J) + T_sort(J) + T_tr(J) – the cost of writing the final output is the same with or without sharing, so it is left out of the comparison
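
One way to write these terms out – a sketch only; the per-byte cost constants C_r, C_t, C_l and the merge-pass term are assumptions, not values given on the slide:

    \[
    T_{\mathrm{read}}(J) = C_r \cdot |F|, \qquad
    T_{\mathrm{tr}}(J)   = C_t \cdot |D|, \qquad
    T_{\mathrm{sort}}(J) \approx C_l \cdot |D| \cdot (\text{number of merge passes over } D),
    \]

where |F| is the input size and |D| the intermediate data size of J. Because an external sort needs more merge passes as |D| grows beyond the sort buffer, the sort term grows faster than linearly, which is why combining jobs can increase the sorting cost even as it reduces the reading cost.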

  16. Cost of executing a group of jobs • [Figure: Read / Sort / Transfer / Write cost bars for J1, J2, J3 executed separately versus the single combined job J1+J2+J3 – the combined job saves on reading, while sorting is a potential extra cost]

  17. Cost without grouping • n MapReduce jobs, J = {J1, . . . , Jn}, read from the same input file F • n – the number of jobs; m – the number of map tasks; r – the number of reduce tasks • |Mi| – the average output size of a map task; |Ri| – the average input size of a reduce task; |Di| – the size of the intermediate data of job Ji • |Di| = |Mi| · m = |Ri| · r

  18. Cost with grouping • A single group G contains all n jobs and is executed as a single job JG • m – the number of map tasks; r – the number of reduce tasks • |Xm| – the average size of the combined output of map tasks; |Xr| – the average size of the combined input of reduce tasks; |XG| – the size of the combined intermediate data • |XG| = |Xm| · m = |Xr| · r
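
If only scans are shared and the map outputs of the individual jobs do not overlap (an assumption made here for illustration), the grouped quantities relate to the per-job ones roughly as:

    \[
    |X_G| = |X_m| \cdot m = |X_r| \cdot r \;\approx\; \sum_{i=1}^{n} |D_i|,
    \]

so grouping leaves the total intermediate data essentially unchanged; the benefit has to come from reading the input file F once instead of n times.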

  19. Beneficial conditions • The cost model yields a condition under which grouping all n jobs is beneficial: n ≤ B

  20. Finding the optimal sharing strategy • An optimization problem: choose how to partition the jobs J1, …, J5 into groups, between the two extremes of executing every job on its own (“NoShare”) and merging all jobs into a single group (“GreedyShare”)

  21. Sharing scans – cost-based optimization • Savings come from the reduced number of scans • The sorting cost might change • The costs of copying and writing the output do not change • [Figure: Read and Sort cost bars for J1, J2, J3 versus the combined job J1+J2+J3 – reading shrinks (savings), sorting may grow (potential cost)]
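
Continuing the cost sketch from slide 15 (the constants remain assumptions), the savings from sharing the scan across all n jobs is approximately:

    \[
    \text{savings}_{\mathrm{read}} \;\approx\; (n-1) \cdot C_r \cdot |F|,
    \]

while the transfer and output-writing costs stay the same; grouping pays off as long as any increase in the sorting cost of the combined intermediate data X_G stays below this amount.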

  22. Outline • Introduction • Map Reduce recap. • MRShare – Sharing opportunities in Map-Reduce • Cost model for MapReduce • MRShare – Grouping algorithms • SplitJobs – cost based algorithm for sharing scans • MultiSplitJobs – an improvement of SplitJobs • MRShare Evaluation • Summary

  23. SplitJobs – a DP solution for sharing scans • We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting • Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points • [Figure: the sorted job list J6, J5, …, J1 is split by SplitJobs into consecutive groups G1, G2, G3]

  24. SplitJobs (cont.) • GS(i, l) = GAIN(i, l) − fc(l) is the savings of the optimal grouping of jobs J1, …, Jl
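
A generic sketch of a split-point dynamic program of this kind. It is illustrative only: groupSavings stands in for the GAIN/fc terms from the slides, which are not reproduced here, and the recurrence is not claimed to be the paper's exact formulation:

    // Dynamic program over split points: jobs are assumed already sorted
    // (e.g., by intermediate data size), and groups must be consecutive runs
    // of that sorted list. best[l] is the maximum total savings achievable
    // for the first l jobs; split[l] remembers where the last group starts.
    public class SplitJobsSketch {

        /** Savings of executing jobs i..l (1-based, inclusive) as one group;
         *  a stand-in for GAIN(i, l) - fc(l) from the slides (assumption). */
        interface GroupSavings {
            double of(int i, int l);
        }

        static int[] optimalSplits(int n, GroupSavings gs) {
            double[] best = new double[n + 1];
            int[] split = new int[n + 1];
            for (int l = 1; l <= n; l++) {
                best[l] = Double.NEGATIVE_INFINITY;
                for (int i = 1; i <= l; i++) {
                    // Last group is {Ji, ..., Jl}; a group that would lose is
                    // treated as "do not share", i.e. zero savings.
                    double s = best[i - 1] + Math.max(0.0, gs.of(i, l));
                    if (s > best[l]) {
                        best[l] = s;
                        split[l] = i;
                    }
                }
            }
            return split;   // follow split[n], split[split[n]-1], ... to recover the groups
        }
    }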

  25. MultiSplitJobs – an improvement of SplitJobs • [Figure: MultiSplitJobs applies SplitJobs repeatedly to the job list J8, …, J1, producing the groups G1–G4]

  26. MultiSplitJobs (cont.)

  27. Outline • Introduction • Map Reduce recap. • MRShare – Sharing primitives in Map-Reduce • MRShare – Cost based approach to sharing • MRShare Implementation and Evaluation • Summary

  28. Implementing MRShare • MRShare is implemented on Hadoop • First, a batch of jobs is collected from the queries submitted within a short time window T • Second, MultiSplitJobs is called to compute the optimal grouping of the jobs • Third, each group is rewritten using a meta-map and a meta-reduce function; these are MRShare-specific containers whose functionality relies on tagging • Finally, the new jobs are submitted for execution (a driver sketch follows below)
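
A minimal sketch of what the final rewrite-and-submit step could look like with the standard Hadoop API. It is illustrative only; MetaMapper, MetaReducer, the configuration key, and the paths are assumptions, not the MRShare code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver sketch: one group of jobs becomes a single Hadoop job that runs the
    // meta-map and meta-reduce containers over the shared input file.
    public class SharedGroupDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // A real rewriter would serialize the per-job map/reduce logic and tags
            // into the configuration here; omitted in this sketch.

            Job job = Job.getInstance(conf, "shared-group");
            job.setJarByClass(SharedGroupDriver.class);
            job.setMapperClass(MetaMapper.class);      // dispatches records to each job's map
            job.setReducerClass(MetaReducer.class);    // routes tagged groups to each job's reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // shared input file F
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // combined, tagged output

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }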

  29. Tagging for Sharing Only Scans

  30. Tagging for Sharing Map Output

  31. Tagging for Sharing Map Output

  32. Tagging for Sharing Map Output

  33. Evaluation setup • 40 EC2 small instance virtual machines • Modified Hadoop engine • 30 GB text dataset consisting of blogs • Multiple grep-wordcount queries • Counts words matching a regular expression • Allows for variable intermediate data sizes • Generic aggregation Map Reduce job
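
A minimal sketch of what a grep-wordcount map function could look like; the regular expression, configuration key, and class name are illustrative, not the benchmark code:

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Grep-wordcount map sketch: emit (word, 1) only for words matching a regex.
    // Tightening or loosening the regex changes how much intermediate data is
    // produced, which is how the intermediate data size can be varied.
    public class GrepWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // Regex taken from the job configuration; the key name is an assumption.
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "a.*"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (pattern.matcher(word).matches()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }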

  34. Validation of the Cost Model

  35. Evaluation goals • Sharing is not always beneficial. • ‘GreedyShare’ policy • How much can we save on sharing scans? • MRShare - MultiSplitJobs evaluation • How much can we save on sharing intermediate data? • MRShare - γ-MultiSplitJobs evaluation

  36. Is sharing always beneficial? – ‘GreedyShare’ policy

  37. How much can we save on sharing scans? – MRShare MultiSplitJobs

  38. How much can we save on sharing map output? – MRShare MultiSplitJobs

  39. How much can we save on sharing intermediate data? – MRShare γ-MultiSplitJobs

  40. Summary • We introduced MRShare – a framework for automatic work sharing in Map Reduce. • We identified sharing primitives and demonstrated their implementation in a Map-Reduce engine. • We established a cost model and solved several work-sharing optimization problems. • We demonstrated substantial savings when using MRShare.

  41. Thank you!!! Questions?
