Meeting Service Level Objectives of Pig Programs - PowerPoint PPT Presentation

Presentation Transcript

  1. Meeting Service Level Objectives of Pig Programs • Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo • University of Pennsylvania • Hewlett-Packard Labs

  2. Advantages • Large amount of resources • Elasticity • Pay-as-you-go pricing model • Challenges • Distributed resources • Error-prone Cloud Environment

  3. MapReduce and Pig • MapReduce: Simple and fault-tolerant framework for data processing in the cloud • Pig • Advanced MapReduce-based platform • Widely used: Yahoo!, Twitter, LinkedIn • PigLatin: A high-level declarative language for expressing data analysis tasks as Pig programs • (Figure: MapReduce jobs j1 through j7)

  4. Motivation • Latency-sensitive applications • Personalized advertising • Spam and fraud detection • Real-time log analysis • How many resources does an application need to meet its deadline?

  5. Contributions • Performance modeling for Pig programs • Given a Pig program, estimate its completion time as a function of the assigned resources • Deadline-driven resource allocation estimates for Pig programs • Given a completion time target, determine the amount of resources a Pig program needs to achieve it

  6. Outline • Introduction • Building block • Performance model for single MapReduce jobs • Resource allocation for Pig programs • Evaluation • Conclusion and ongoing work

  7. Theoretical Makespan Bounds • Bounds-based makespan estimates • n tasks, k servers • avg: average duration of the n tasks • max: maximum duration of the n tasks • Lower bound • Upper bound

  8. Illustration • Schedule 1: 1432312 • Makespan = 4 • Lower bound = 4 • Schedule 2: 3123214 • Makespan = 7 • Upper bound = 8 • (Figure: the two schedules laid out across slots 1 to 4)
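The makespan idea on slides 7 and 8 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the digits of Schedule 2 are task durations processed in order by 4 slots with greedy assignment, and it assumes the classic greedy-scheduling bounds (lower = n·avg/k, upper = (n−1)·avg/k + max), since the bound formulas themselves did not survive in the transcript. The slide reports an upper bound of 8, so the deck may use a slightly looser or rounded form.

```python
import heapq

def greedy_makespan(durations, k):
    """Assign each task, in order, to the slot that becomes free first."""
    slots = [0.0] * k                 # finish time of each of the k slots
    heapq.heapify(slots)
    for d in durations:
        start = heapq.heappop(slots)  # earliest-free slot
        heapq.heappush(slots, start + d)
    return max(slots)

def makespan_bounds(durations, k):
    """Assumed bounds: n*avg/k and (n-1)*avg/k + max."""
    n = len(durations)
    avg = sum(durations) / n
    lower = sum(durations) / k        # same value as n * avg / k
    upper = (n - 1) * avg / k + max(durations)
    return lower, upper

# Task order taken from "Schedule 2" on slide 8 (interpreted as durations).
tasks = [3, 1, 2, 3, 2, 1, 4]
k = 4
makespan = greedy_makespan(tasks, k)
lower, upper = makespan_bounds(tasks, k)
print(f"makespan={makespan}, lower={lower}, upper={upper:.2f}")
```

Under these assumptions the simulated makespan for Schedule 2 is 7, which matches the slide, and it falls between the two computed bounds.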

  9. Estimate Completion Time for Single MR Job • Estimate the bounds of the job completion time based on the job profile • Most production jobs are executed routinely on new data sets • Job profile built from previous runs • Map stage: Mavg, Mmax, AvgInputSize, Selectivity • Reduce stage: Shavg, Shmax, Ravg, Rmax, Selectivity • Predict the completion time of future runs with the profile

  10. Estimate Completion Time for Single MR Job • Estimating bounds on the duration of map and reduce stages • Map stage duration depends on: • NM -- the number of map tasks • SM -- the number of map slots • Reduce stage duration depends on: • NR -- the number of reduce tasks • SR -- the number of reduce slots • Job duration TJlow, TJup, TJavg • Sum of the map and reduce stage durations
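A rough sketch of how the job-level estimate on slides 9 and 10 could be computed from a profile. Several things here are assumptions rather than the authors' exact model: the per-phase bounds reuse the generic makespan form (N·avg/S and (N−1)·avg/S + max), the shuffle and reduce phases are simply treated as two phases sharing the reduce slots, the average estimate is taken as the midpoint of the two bounds, and the `JobProfile` class and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    # Per-task statistics from a previous run (names follow slide 9).
    m_avg: float
    m_max: float
    sh_avg: float
    sh_max: float
    r_avg: float
    r_max: float

def phase_bounds(n_tasks, n_slots, avg, mx):
    """Assumed makespan bounds for one phase of n_tasks on n_slots."""
    low = n_tasks * avg / n_slots
    up = (n_tasks - 1) * avg / n_slots + mx
    return low, up

def job_bounds(p, n_map, n_reduce, s_map, s_reduce):
    """Sum the phase bounds: map, then shuffle and reduce on reduce slots."""
    phases = [
        phase_bounds(n_map, s_map, p.m_avg, p.m_max),
        phase_bounds(n_reduce, s_reduce, p.sh_avg, p.sh_max),
        phase_bounds(n_reduce, s_reduce, p.r_avg, p.r_max),
    ]
    low = sum(lo for lo, _ in phases)
    up = sum(hi for _, hi in phases)
    return low, up, (low + up) / 2    # midpoint used as the average estimate

# Hypothetical profile and allocation, for illustration only.
profile = JobProfile(m_avg=10, m_max=20, sh_avg=5, sh_max=8, r_avg=4, r_max=6)
t_low, t_up, t_avg = job_bounds(profile, n_map=100, n_reduce=20,
                                s_map=10, s_reduce=5)
print(t_low, t_up, t_avg)
```

The point of the sketch is the structure: each bound is a function of task counts, slot counts, and profile statistics, so the same profile can be re-evaluated for any candidate allocation.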

  11. Resource Allocation for Single MR Job • Given a deadline D and the job profile, find the minimal resources needed to complete the job within D • Given: the number of map/reduce tasks and statistics from the job profile • Find the values of SMJ and SRJ that minimize SMJ + SRJ, using Lagrange multipliers
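The Lagrange-multiplier step can be worked out under a simplified model. The derivation below assumes the completion time has the form a/S_M + b/S_R + c, where a, b, and c are hypothetical coefficients derived from the task counts and the job profile; the slide's actual equations are not in the transcript, so this is only a sketch of the technique.

```latex
% Assumed model: completion time as a function of map and reduce slots
T(S_M, S_R) = \frac{a}{S_M} + \frac{b}{S_R} + c = D

% Minimize S_M + S_R subject to the deadline constraint
\mathcal{L} = S_M + S_R + \lambda \left( \frac{a}{S_M} + \frac{b}{S_R} + c - D \right)

\frac{\partial \mathcal{L}}{\partial S_M} = 1 - \frac{\lambda a}{S_M^2} = 0
\;\Rightarrow\; S_M = \sqrt{\lambda a},
\qquad
\frac{\partial \mathcal{L}}{\partial S_R} = 1 - \frac{\lambda b}{S_R^2} = 0
\;\Rightarrow\; S_R = \sqrt{\lambda b}

% Substituting into the constraint gives \sqrt{\lambda} = (\sqrt{a} + \sqrt{b}) / (D - c), hence
S_M = \frac{a + \sqrt{ab}}{D - c},
\qquad
S_R = \frac{b + \sqrt{ab}}{D - c}
```

In practice the two values would be rounded up to whole slots, and the result is only meaningful when D > c.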

  12. Outline • Introduction • Building block • Performance model for single MapReduce jobs • Resource allocation for Pig programs • Evaluation • Conclusion and ongoing work

  13. Performance Model for Pig Programs • Let P = {J1, J2, ..., JN}; extract the job profile of each job contained in P • Assign a unique name to each job within a program • The program completion time is the sum of the completion times of all the jobs contained in P

  14. Resource Allocation for Pig Programs • Possible strategy: find an appropriate pair of map and reduce slots for each job in the program • Problem: difficult to implement and manage by the scheduler with

  15. Resource Allocation for Pig Programs • A simpler and more elegant solution • Allocate the same set of resources to the entire program instead of to each job • Rewrite the previous equations for the entire program • Find the minimum set of map and reduce slots (SMP, SRP) for the entire Pig program
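The program-level allocation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each job's completion time is modeled as a_i/S_M + b_i/S_R + c_i with hypothetical coefficients, jobs are assumed to run one after another so the coefficients add up, and the slot pair minimizing S_M + S_R is taken from the closed form that Lagrange multipliers yield for this model. The function name and the numbers are made up for the example.

```python
import math

def program_slots(jobs, deadline):
    """Smallest (map, reduce) slot pair meeting the deadline.

    jobs: list of (a, b, c) tuples; job time is modeled as a/S_M + b/S_R + c.
    Assumes sequential jobs, so program time is the sum over all jobs.
    """
    a = sum(j[0] for j in jobs)
    b = sum(j[1] for j in jobs)
    c = sum(j[2] for j in jobs)
    budget = deadline - c             # time left for slot-dependent work
    if budget <= 0:
        raise ValueError("deadline is below the fixed overhead")
    root = math.sqrt(a * b)
    # Closed form for minimizing S_M + S_R subject to a/S_M + b/S_R = budget.
    s_map = math.ceil((a + root) / budget)
    s_red = math.ceil((b + root) / budget)
    return s_map, s_red

def program_time(jobs, s_map, s_red):
    """Predicted completion time of the whole program for a given allocation."""
    return sum(a / s_map + b / s_red + c for a, b, c in jobs)

# Two hypothetical jobs and a deadline of 120 time units.
jobs = [(500, 60, 10), (400, 40, 10)]
deadline = 120
s_map, s_red = program_slots(jobs, deadline)
predicted = program_time(jobs, s_map, s_red)
print(s_map, s_red, predicted)
```

Using one slot pair for the whole program keeps the scheduler's job simple: it only has to reserve a single allocation, at the cost of possibly over-provisioning some individual jobs.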

  16. Experiment Setup • 66-node cluster in 2 racks • 4 AMD 2.39GHz cores • 8 GB RAM • two 160GB hard disks • Configuration • 1 jobtracker, 1 namenode, 64 worker nodes • 2 map slots and 1 reduce slot for each node

  17. Benchmark • PigMix benchmark • 17 programs • 8 tables as the input data • Dataset • Test dataset • Generated with the PigMix data generator • Total size around 1TB • Experimental dataset • Same layout as the test dataset • 20% larger in size

  18. Model Accuracy • How well does our performance model capture Pig program completion time? • (Figure: normalized results for predicted and measured completion time)

  19. Meeting Deadlines • Are we meeting deadlines with our resource allocation model? • (Figure: PigMix executed on the experimental data set: do we meet deadlines?)

  20. Conclusion • Conclusion • The performance model can accurately estimate the completion time of MapReduce workflows • Enables automatic resource provisioning for MapReduce workflows with deadlines • Ongoing work • Refine the performance model for workflows with concurrent jobs • Incorporate failure scenarios in the current model

  21. Thank you