A New Analytical Technique for Designing Provably Efficient MapReduce Schedulers

A New Analytical Technique for Designing Provably Efficient MapReduce Schedulers Yousi Zheng Dept. of ECE, The Ohio State University E-mail: zheng.540@osu.edu Joint work with Ness B. Shroff PrasunSinha

MapReduce • Data Centers • Facility containing a very large number of machines • They process very large datasets: MapReduce • MapReduce • Framework for processing highly parallel problems • Developed (popularized) by Google • Nearly ubiquitous (Google, IBM, Facebook, Microsoft, Yahoo, Amazon, eBay, twitter…) • Used in a variety of different applications • Distributed Grep, distributed sort, AI, scientific computation, Large scale pdf generation, Geographical data, Image processing…

MapReduce • Map phase: • Takes an input job and divides into many small sub-problems • Operations can run in parallel on potentially different machines • Reduce phase • Combines the output of Map • Occurs after the Map phase is completed • Runs on parallel machines… Goal:Schedule these Map and Reduce tasks in order to minimize the total flow time in the system

Model • Time is slotted • N “Machines”: Each machine runs 1 unit of workload per slot • Scheduling Constraint: Reduce job starts after all tasks in the Map job are completed. Each Reduce task can have multiple units of workload Ri(k)units of workload for task k for job i Each jobiconsists of Map tasks and Reduce tasks Ri(k) Each Map task has 1 unit of workload Total Mitasks for Map job i n jobs over T slots Reduce job i 1 Map Mi … units of workload for Reduce job i … Data Center with N machines

Types of Schedulers: Treatment of Reduce Tasks Machines • Preemptive • Reduce tasks can be interrupted at the end of a slot • Remaining workload in the task can be executed on any machine(s) • Reasonable if overhead of data-migration is small • Non-preemptive R2 Machine C Machine B R1 R1 Machine A 0 1 2 3 4 5 6 7 Time

Types of Schedulers: Treatment of Reduce Tasks Machines • Preemptive • Reduce tasks can be interrupted at the end of a slot • Remaining workload in the task can be executed on any machine(s) • Reasonable if overhead of data-migration is small • Non-preemptive • Once a Reduce task is started, it can’t be interrupted till the end of the task • An individual Reduce task cannot be completed on different machines • But different tasks from same job can be assigned to different machines • Note: Since Map tasks have unit workload, they cannot be interrupted. • In practice Map tasks may be > 1 unit, but small R2 R2 Machine C Machine B R1 R1 Machine A 0 1 2 3 4 5 6 7 Time

Flow-Time Minimization Problem: Preemptive Scenario Minimize flow-time Total # machines is N Total map workload is Mifor job i Total reduce workload is Rifor job i Scheduling Constraint

Flow-Time Minimization ProblemNon-Preemptive Scenario Minimize flow-time Total # machines is N Total map workload is Mifor job i Total reduce workload is R(k)ifor task k in job i Non-preemptive Both Problems are NP-hard in the strong sense

Competitive Ratio • : set of all possible schedulers (including non-causal) • : Total Flow time of scheduler (sum of FT of all jobs) • Define • Competitive ratio: A given online policy Sis said to have a competitive ratio of c if for any arrival pattern Theorem For any constant c0 and any online scheduler S, there are sequences of arrivals and workloads in the non-preemptive case, such that the competitive ratio c is greater than c0  No policy gives a finite competitive ratio

Efficiency Ratio Analysis • Efficiency ratio: A given online policy Sis said to have an efficiency ratio of  if (over some probability space of arrivals) Theorem • If the workload of Reduce tasks of each job is light-tailed distributed, then • Any work-conserving scheduler has a constant efficiency ratio, for both preemptive and non-preemptive scenarios.

So Far • No scheduler has a finite competitive ratio in non-preemptive scenario • An evaluation metric efficiency ratio is introduced • All work conserving schedulers have a finite efficiency ratio (g) for both preemptive & non-preemptive scenarios • But the performance of these schedulers could still be quite poor (g could be very large) • Key Question: Can we design a simple scheduler with a provably small efficiency ratio in the MapReduce paradigm? Preemptive scenario: Yes! (ASRPT)

ASRPT: Available Shortest Remaining Processing Time • SRPT: An infeasible scheduler • Priority given to jobs with smallest remaining workload • In SRPT, Map and Reduce for a given job can be scheduled in the same time-slot (infeasible in reality) • SRPT results in a lower flow time than any feasible (including non-causal) scheduler • ASRPT • Map tasks are scheduled according to SRPT (or sooner) • By running SRPT in a virtual manner • Reduce tasks greedily fill up the remaining machines • In the order determined by the shortest remaining processing time

ASRPT: Available Shortest Remaining Processing Time 4 4 Theorem: If the job arrivals and workload are i.i.d., then ASRPT has an efficiency ratio of 2 in the preemptive scenario. R1 R1 R1 Number of machines, N Number of machines, N 3 3 R1 2 2 M1 M1 R2 1 1 M2 R1 R2 M2 1 1 3 3 2 4 2 4 Time Time SRPT : Total Flow-time 4 units : Total Flow-time 5 units ASRPT

Simulation • N = 100 machines, total time slots T = 800 • #Reduce tasks in each job is 10 • Job arrival process: Poisson; λ = 2 jobs per time slot. • Preemptive Scenario • Schedulers • ASRPT • Fair • FIFO • ASRPT has smaller efficiency ratio than Fair and FIFO.

Conclusion • Investigated the flow time minimization problem inMapReduce schedulers • No online policy has a finitecompetitive ratio in non-preemptive scenarios • All work-conserving schedulers have constant efficiency ratios • Preemptive and non-preemptive scenarios. • A specific algorithm (ASRPT) can guarantee an efficiency ratio of 2 in the preemptive scenario. • Is efficiency ratio of ASRPT similarly small in non-pre-emptive case? • Efficiency Ratio based analysis provides worse case (w.p.1) guarantees (compared to competitive ratio), but gives more design flexibility. • Has the potential to be applied to other problems/domains

Future Work • Consider Shuffle Phase • Introduce more information (e.g., dependency Graph) to the scheduler • So that all reduce tasks don’t wait until Map tasks are completed • Consider cost of data migration • Combine preemptive and non-preemptive scenarios • Consider multiple phases • With phase precedence (for complex processing) • Develop Green Energy solutions • Combine with sleep-wake scheduling • Consider renewable energy fluctuations

Thank You!

Backup Slides

Main Results • Introduced a metric Efficiency Ratio to evaluate the performance of schedulers in MapReduce • Proved that no online scheduler has a finite competitive ratio • Guarantee with probability 1 • May have greater modeling implications for analysis of algorithms • Proved that Efficiency Ratio is bounded by a constant for anywork-conserving scheduler • For both bounded and light-tailed workload for Reduce phase • For both preemptive and non-preemptive scenarios • Developed a specific work-conserving algorithm (ASRPT) that has an efficiency ratio of 2 for preemptive scenario.

Efficiency Ratio: Preemptive & Non-preemptive • For any work-conserving scheduling policy: Preemptive: Non-preemptive: • Similar result holds but with different constants ( ) wherep0= Prob (a slot has no job arrivals)

MapReduce:FIFO scheduler R1 FCFS scheduler with 4 machines, 2 jobs Map must finish before Reduce can start M1 M2 R2 4 R1 Number of machines 3 Flow-time of a job: time spent by job in the system FT of Job 1: 3 time slots FT of Job 2: 3 time slots Total FT : 6 time slots 2 M2 M1 1 R1 R2 1 3 2 4 Time

MapReduce: Smarter Scheduler R1 M1 “Smarter Scheduler” M2 R2 4 R1 Number of machines 3 Flow-time: time in the system FT of Job 1: 3 time slots FT of Job 2: 2 time slots Total FT: 5 time slots R2 2 M1 R1 1 M2 1 3 2 4 Time Goal: Minimize total flow-time

Future Work • Consider Shuffle Phase • Introduce more information (e.g., dependency Graph) to the scheduler • So that all reduce tasks don’t wait until Map tasks are completed • Consider cost of data migration • Combine preemptive and non-preemptive scenarios • Consider multiple phases • With phase precedence (for complex processing) • Develop Green Energy solutions • Combine with sleep-wake scheduling • Consider renewable energy fluctuations

Analysis Outline • Divide up time into repeating intervals • Interval corresponds to the consecutive times slots until a time-slot occurs with an idle machine(s) • Bound the moments of these intervals with constants • Bound the efficiency ratio using the moments • Analysis applies to: • Preemptive and non-preemptive scenarios • Allows jobs to have infinite but light-tailed support

Preemptive Scenario: An Example Incomplete time-slot (all machines not used) Observations for a job arriving in interval j(work-conserving) • Map task finishes in interval j (otherwise last time-slot would be complete) • Reduce tasks finish in interval jor interval j+1 •  Flow time per job is bounded, wp1, if the interval sizes (moments) are bounded •  g is bounded wp1. Complete time-slot (all machines used) Kj: Length of j-thinterval (ends in incomplete slot)

Non-preemptive Scenario: An Example Observation for reduce tasks: • Unlike preemptive scenario, in the non-preemptive scenario, each job arriving in interval j will start (not finish) all its reduce tasks in intervaljor interval j+1 • Reduce tasks must be finished by the end of interval j+1 plus time Ri-1. • Hence, if (all moments) of the interval are bounded then the flow times (per job) are bounded wp1  g is bounded wp1.

Bounds on the Moments Lemma: If the scheduler is work-conserving, then for any given number , there exists a constant , such that , for all j. where Rough idea: • Since the traffic intensity ρ<1, and the workload in each slot is light-tailed, the number of consecutive complete time slots will decay exponentially (Chernoff’s Theorem) •  The moments of interval K are bounded

Many unanswered questions

Main schedulers proposed for MapReduce framework • FIFO scheduler: • Default in Hadoop • Suffers from the well known head-of-line blocking problem • Fair scheduler [Zaharia’09] • Mitigate head-of-line blocking problem of FIFO • Jobs stick to the machines on which they are initially scheduled • Coupling scheduler [Tan’12] • Mitigate the problem that jobs stick to machines

Main schedulers proposed for MapReduce framework • Consider machines with speed-up [Moseley’11] • O(1/ ε5) competitive algorithm with (1 + ε) speed for the online case, where 0 < ε ≤ 1 • Speed-up of machines is necessary in the algorithm (if ε decreases to 0, the competitive ratio will increase to infinity correspondingly). • Consider minimizing total completion time [Chang’11, Chen’12] • Minimizing total completion time = minimizing total flow time. • Competitive ratio of “minimizing total completion time” ≠ competitive ratio of “minimizing total flow time”.

ASRPT • ASRPT runs SRPT in a virtual fashion and keeps track of how SRPT would have scheduled jobs in any given slot. • ASRPT assigns machines to the tasks in the following priority order (from high to low): • (Only in the non-preemptive scenario) The previously scheduled Reduce tasks which are not yet finished, • The Map tasks which are scheduled in the SRPT, • The available Reduce tasks, • The available Map tasks. • For each priority group, the algorithm greedily assign machines to the corresponding tasks through the sorted available workload.

Light-tailed Distributed Workload where

Simulations

Main Results • Competitive Ratio: Deterministic guarantee under adversarial arrivals • Unbounded for all non-preemptive scheduler • Efficiency Ratio • Asymptotic • Guarantee with probability 1 • May have greater modeling implications for analysis of algorithms • Efficiency Ratio is bounded by a constant for anywork-conserving scheduler • For bounded workload for Reduce phase (Ri < Rmax) • For light-tailed workload for Reduce phase • An Algorithm (ASRPT) with efficiency ratio of 2 for preemptive scenario.

Types of Schedulers: Treatment of Reduce Tasks • Non-preemptive • One task can only be allocated to one machine (preserves data locality) • Once started cannot be interrupted till the end of the task

Types of Schedulers: Treatment of Reduce Tasks • Preemptive • Every task can be interrupted at the end of a slot • Remaining workload can be executed on any machine

More in Practice Shuffle Phase

ASRPT: Available Shortest Remaining Processing Time [Example] 4 4 R1 R1 R1 Number of machines, N Number of machines, N 3 3 R2 2 2 R1 M1 M1 1 1 M2 R2 M2 1 1 3 3 2 4 2 4 Time Time SRPT : Total Flow-time 4 units : Total Flow-time 5 units ASRPT

ASRPT: Available Shortest Remaining Processing Time [Example] 4 4 R1 R1 Number of machines, N Number of machines, N 3 3 R2 2 2 M2 R1 M1 M1 1 1 R2 R1 M2 1 1 3 3 2 4 2 4 Time Time : Total Flow-time 6 units : Total Flow-time 5 units FIFO ASRPT

Bounds on the Moments Lemma: If the scheduler is work-conserving, then for any given number , there exists a constant , such that , for all j. where

A New Analytical Technique for Designing Provably Efficient MapReduce Schedulers

A New Analytical Technique for Designing Provably Efficient MapReduce Schedulers

Presentation Transcript

Schedulers

Introduction to Light Scattering A bulk analytical technique

DSC: Differential Scanning Calorimetry A bulk analytical technique

Designing VM Schedulers for Embedded Real-Time Applications

Provably Efficient GPU Algorithms

Design Patterns for Efficient Graph Algorithms in MapReduce

A new provably secure certificateless short signature scheme

Design Patterns for Efficient Graph Algorithms in MapReduce

Designing MapReduce Algorithms

Towards Energy Efficient MapReduce

Designing experiments for efficient WTP estimates

A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs

Energy Analytical Models: A Holistic Energy Assessment Technique

Provably Efficient Authenticated Key Agreement Protocol For Multi-Servers

DESIGNING EFFICIENT ORGANIZATIONS FOR MICROFINANCE OPERATIONS:

Designing Efficient Libraries

Matchmaking: A New MapReduce Scheduling Technique

Choice of analytical technique(s)

A New Provably Secure Certificateless Signature Scheme

Efficient Skyline Computation in MapReduce

New Web Designing Technique And Trends

Designing Efficient Systems