
Meeting Service Level Objectives of Pig Programs

Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo. University of Pennsylvania, Hewlett-Packard Labs.


Presentation Transcript


  1. Meeting Service Level Objectives of Pig Programs • Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo • University of Pennsylvania, Hewlett-Packard Labs

  2. Advantages • Large amount of resources • Elasticity • Pay-as-you-go pricing model • Challenges • Distributed resources • Error-prone Cloud Environment

  3. MapReduce and Pig • MapReduce: Simple and fault-tolerant framework for data processing in the cloud • Pig • Advanced MapReduce-based platform • Widely used: Yahoo!, Twitter, LinkedIn • PigLatin: A high-level declarative language for expressing data analysis tasks as Pig programs [Figure: jobs j1 through j7]

  4. Motivation • Latency-sensitive applications • Personalized advertising • Spam and fraud detection • Real-time log analysis • How many resources does an application need to meet its deadline?

  5. Contributions • Performance modeling for Pig programs • Given a Pig program, estimate its completion time as a function of assigned resources • Deadline-driven resource allocation estimates for Pig programs • Given a completion time target, determine the amount of resources a Pig program needs to achieve it

  6. Outline • Introduction • Building block • Performance model for single MapReduce jobs • Resource allocation for Pig programs • Evaluation • Conclusion and ongoing work

  7. Theoretical Makespan Bounds • Bounds-based makespan estimates • n tasks, k servers • avg: average duration of the n tasks • max: maximum duration of the n tasks • Lower bound • Upper bound

  8. Illustration • Schedule 1: 1432312 • Makespan = 4 • Lower bound = 4 • Schedule 2: 3123214 • Makespan = 7 • Upper bound = 8 [Figure: the two schedules drawn across slots 1 to 4]
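The bound formulas on slide 7 did not survive extraction. As a minimal sketch, the code below assumes the standard bounds for greedy task assignment, where each task goes to the slot that frees up first: a lower bound of n·avg/k and an upper bound of (n−1)·avg/k + max. It also assumes that the digit strings on slide 8 are task durations run on k = 4 slots. The function names are illustrative, not taken from the presentation.

```python
import heapq


def greedy_makespan(durations, k):
    """Simulate greedy assignment: each task, in order, takes the earliest-free slot."""
    slots = [0] * k              # time at which each slot becomes free
    heapq.heapify(slots)
    for d in durations:
        start = heapq.heappop(slots)      # earliest-free slot
        heapq.heappush(slots, start + d)  # slot is busy until start + d
    return max(slots)


def makespan_bounds(durations, k):
    """Return (lower, upper) bounds on the greedy makespan (assumed formulas)."""
    n = len(durations)
    total = sum(durations)
    avg = total / n
    lower = total / k                             # n * avg / k
    upper = (n - 1) * avg / k + max(durations)    # (n - 1) * avg / k + max
    return lower, upper


# Schedule 2 from the slide, read as task durations (an assumption).
schedule_2 = [3, 1, 2, 3, 2, 1, 4]
measured = greedy_makespan(schedule_2, k=4)
lower, upper = makespan_bounds(schedule_2, k=4)
print(measured, lower, round(upper, 2))
```

Under these assumptions, Schedule 2 produces a makespan of 7 and a lower bound of 4, matching the slide. The assumed upper bound evaluates to about 7.43; the slide reports 8, which may reflect rounding or a slightly looser form of the bound.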

  9. Estimate Completion Time for Single MR Job • Estimate the bounds of the job completion time based on the job profile • Most production jobs are executed routinely on new data sets • Job profile built from previous runs • Map stage: M_avg, M_max, AvgInputSize, Selectivity • Reduce stage: Sh_avg, Sh_max, R_avg, R_max, Selectivity • Predict the completion time of future runs with the profile

  10. Estimate Completion Time for Single MR Job • Estimating bounds on the duration of map and reduce stages • Map stage duration depends on: • N_M: the number of map tasks • S_M: the number of map slots • Reduce stage duration depends on: • N_R: the number of reduce tasks • S_R: the number of reduce slots • Job duration T_J^low, T_J^up, T_J^avg • Sum of the map and reduce stage durations
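The stage-level estimate described on slides 9 and 10 can be sketched as follows. This is a simplified illustration, not the authors' exact model: it assumes each phase (map, shuffle, reduce) is bounded by n·avg/k from below and (n−1)·avg/k + max from above, that the shuffle and reduce phases both run in the reduce slots, and that the average estimate is the midpoint of the two bounds. The `JobProfile` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Per-task statistics from a previous run (field names are illustrative)."""
    m_avg: float   # average map task duration
    m_max: float   # maximum map task duration
    sh_avg: float  # average shuffle duration
    sh_max: float  # maximum shuffle duration
    r_avg: float   # average reduce duration
    r_max: float   # maximum reduce duration


def phase_bounds(n_tasks, n_slots, avg, longest):
    """Assumed lower and upper bounds for one phase of n_tasks on n_slots."""
    lower = n_tasks * avg / n_slots
    upper = (n_tasks - 1) * avg / n_slots + longest
    return lower, upper


def job_bounds(p, n_map, n_reduce, s_map, s_reduce):
    """Sum the phase bounds into job-level (low, up, avg) estimates."""
    phases = [
        phase_bounds(n_map, s_map, p.m_avg, p.m_max),
        phase_bounds(n_reduce, s_reduce, p.sh_avg, p.sh_max),
        phase_bounds(n_reduce, s_reduce, p.r_avg, p.r_max),
    ]
    low = sum(lo for lo, _ in phases)
    up = sum(hi for _, hi in phases)
    return low, up, (low + up) / 2   # midpoint as the average is an assumption


profile = JobProfile(m_avg=10, m_max=20, sh_avg=5, sh_max=8, r_avg=4, r_max=6)
print(job_bounds(profile, n_map=100, n_reduce=20, s_map=10, s_reduce=5))
```

Doubling the slot counts roughly halves the lower bound, while the upper bound keeps the fixed contribution of the longest task in each phase.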

  11. Resource Allocation for Single MR Job • Given a deadline D and the job profile, find the minimum resources needed to complete the job within D • Given: the number of map/reduce tasks • Given: statistics from the job profile • Find the values of S_M^J and S_R^J that minimize S_M^J + S_R^J, using Lagrange multipliers
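The slide's equation was not preserved, so the following is a sketch under a stated assumption: the completion-time estimate has the form T = a/S_M + b/S_R + c, where a and b grow with the number and duration of map and reduce tasks and c collects the terms that do not depend on slot counts. Minimizing S_M + S_R subject to a/S_M + b/S_R = D − c with a Lagrange multiplier gives the closed form S_M = (a + sqrt(ab))/(D − c) and S_R = (b + sqrt(ab))/(D − c). The coefficient names and the function are hypothetical.

```python
import math


def min_slots(a, b, c, deadline):
    """Continuous (s_map, s_reduce) minimizing s_map + s_reduce
    subject to a / s_map + b / s_reduce + c == deadline (assumed model)."""
    budget = deadline - c            # time left for slot-dependent work
    if budget <= 0:
        raise ValueError("deadline is not achievable with any allocation")
    root = math.sqrt(a * b)
    s_map = (a + root) / budget      # from dL/dS_M = 0 and the constraint
    s_reduce = (b + root) / budget   # from dL/dS_R = 0 and the constraint
    return s_map, s_reduce


s_map, s_reduce = min_slots(a=400, b=100, c=10, deadline=40)
# Round up to whole slots; more slots can only shorten the estimate.
print(math.ceil(s_map), math.ceil(s_reduce))
```

Rounding both values up keeps the estimated completion time at or below the deadline, at the cost of at most one extra slot of each type.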

  12. Outline • Introduction • Building block • Performance model for single MapReduce jobs • Resource allocation for Pig programs • Evaluation • Conclusion and ongoing work

  13. Performance Model for Pig Programs • Let P = {J1, J2, ..., JN}; extract the job profile of each job contained in P • Assign a unique name to each job within a program • The program completion time is the sum of the completion times of all the jobs contained in P

  14. Resource Allocation for Pig Programs • Possible strategy: find an appropriate pair of map and reduce slots for each job in the program • Problem: difficult to implement and manage by the scheduler with

  15. Resource Allocation for Pig Programs • A simpler and more elegant solution • Allocate the same set of resources to the entire program instead of to each job • Rewrite the previous equations in terms of the entire program • Find the minimum set of map and reduce slots (S_M^P, S_R^P) for the entire Pig program
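A program-level allocation along these lines can be sketched as follows. It assumes, as an illustration only, that each job's estimate has the form a_i/S_M + b_i/S_R + c_i and that jobs run one after another, so the program estimate is the sum over jobs and the coefficients simply add up. The per-job coefficients below are made-up values, and the function names are hypothetical.

```python
import math


def program_time(jobs, s_map, s_reduce):
    """Estimated completion time of a program given as a list of (a, b, c) jobs."""
    return sum(a / s_map + b / s_reduce + c for a, b, c in jobs)


def program_min_slots(jobs, deadline):
    """Smallest whole (s_map, s_reduce) pair, shared by all jobs, that meets the
    deadline under the assumed model (Lagrange closed form on summed coefficients)."""
    a = sum(j[0] for j in jobs)
    b = sum(j[1] for j in jobs)
    c = sum(j[2] for j in jobs)
    budget = deadline - c
    if budget <= 0:
        raise ValueError("deadline is not achievable with any allocation")
    root = math.sqrt(a * b)
    return math.ceil((a + root) / budget), math.ceil((b + root) / budget)


# Two hypothetical jobs, each described by (a, b, c).
jobs = [(300, 60, 5), (100, 40, 5)]
s_map, s_reduce = program_min_slots(jobs, deadline=40)
print(s_map, s_reduce, program_time(jobs, s_map, s_reduce))
```

Using one shared pair of slot counts means the scheduler manages a single allocation per program rather than one per job, which is the simplification the slide argues for.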

  16. Experiment Setup • 66-node cluster in 2 racks • 4 AMD 2.39 GHz cores • 8 GB RAM • Two 160 GB hard disks • Configuration • 1 JobTracker, 1 NameNode, 64 worker nodes • 2 map slots and 1 reduce slot per node

  17. Benchmark • PigMix benchmark • 17 programs • 8 tables as the input data • Dataset • Test dataset • Generated with the PigMix data generator • Total size around 1 TB • Experimental dataset • Same layout as the test dataset • 20% larger in size

  18. Model Accuracy • How well does our performance model capture Pig program completion time? [Figure: normalized results for predicted and measured completion time]

  19. Meeting Deadlines • Are we meeting deadlines with our resource allocation model? • PigMix executed on the experimental dataset: do we meet deadlines?

  20. Conclusion • The performance model can accurately estimate the completion time of MapReduce workflows • Enables automatic resource provisioning for MapReduce workflows with deadlines • Ongoing work • Refine the performance model for workflows with concurrent jobs • Incorporate failure scenarios into the current model

  21. Thank you
