
Natjam: Supporting Deadlines and Priorities in a Mapreduce Cluster

Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta,

Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin

University of Illinois (Urbana-Champaign)

Distributed Protocols Research Group (DPRG):

Hadoop Jobs have Priorities

  • Dual Priority Case

    • Production jobs (high priority)

      • Time sensitive

      • Directly affect critical operations or revenue

    • Research jobs (low priority)

      • e.g., long term analysis

  • Example: Ad provider

    • Production: ad click-through logs → count clicks → update ads

      • Slow counts → show old ads → don’t get paid $$$

    • Research: run machine learning analysis over daily and historical logs (“Is there a better way to place ads?”)

    • Therefore: prioritize production jobs

State-of-the-art: Separate clusters

  • Production cluster receives production jobs (high priority)

  • Research cluster receives research jobs (low priority)

  • Traces reveal large periods of under-utilization in each cluster

    • Long job completion times

    • Human involvement in job management

  • Goal: single consolidated cluster for all priorities and deadlines

    • Prioritize production jobs while affecting research jobs the least

  • Today’s Options:

    • Wait for research tasks to finish (e.g., Capacity Scheduler)

      → Prolongs production jobs

    • Kill research tasks (e.g., Fair Scheduler), which can lead to repeated work

      → Prolongs research jobs

Natjam’s Techniques

  • Scale down research jobs by

    • Preempting some Reduce tasks

    • Fast on-demand automated checkpointing of task state

    • Later, suspended Reduce tasks can resume where they left off

      • Focus on Reduces: Reduce tasks take longer, so more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])

  • Job Eviction Policies

  • Task Eviction Policies

Natjam built into Hadoop YARN Architecture

  • Preemptor

    • Chooses Victim Job

    • Reclaims queue resources

  • Releaser

    • Chooses Victim Task

  • Local Suspender

    • Saves state of Victim Task

[Architecture diagram: the Resource Manager runs the Capacity Scheduler and the Preemptor; Application Masters 1 and 2 ask it for containers and are told how many containers to release; on Nodes A and B, a Local Suspender saves the victim task’s state locally, leaving an empty container behind.]

Suspending and Resuming Tasks


Container freed,

Suspend state saved


Task Attempt 1


  • Existing intermediate data is reused

    • Reduce inputs, stored at the local host

    • Reduce outputs, stored on HDFS

    • Suspended task state saved locally, so resume can avoid network overhead

  • Checkpoint state saved (see the sketch below):

    • Key counter

    • Reduce input path

    • Hostname

    • List of suspended task attempt IDs
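
To make the saved state concrete, here is a minimal sketch of a checkpoint record holding exactly the four fields above; the class and field names are illustrative assumptions, not Natjam's actual implementation.

    import java.util.List;

    // Hypothetical record of a suspended reduce's checkpoint state.
    final class ReduceSuspendCheckpoint {
        final long keyCounter;                  // input keys already reduced
        final String reduceInputPath;           // local path to the reduce's intermediate input
        final String hostname;                  // host holding that input, so the resumed
                                                // attempt can run there without network I/O
        final List<String> suspendedAttemptIds; // earlier suspended attempts of this task

        ReduceSuspendCheckpoint(long keyCounter, String reduceInputPath,
                                String hostname, List<String> suspendedAttemptIds) {
            this.keyCounter = keyCounter;
            this.reduceInputPath = reduceInputPath;
            this.hostname = hostname;
            this.suspendedAttemptIds = suspendedAttemptIds;
        }
    }

On resume, the new attempt can reopen the reduce input on the saved host and skip the first keyCounter keys, continuing where the suspended attempt left off.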











Two-level Eviction Policies

  • On a container request in a full cluster:

    • Job Eviction, at the Preemptor (in the Resource Manager’s Capacity Scheduler)

    • Task Eviction, at the Releaser



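
As a rough illustration of the control flow, the sketch below composes the two levels; the Preemptor and Releaser interfaces and all names here are assumptions for exposition, not Natjam's actual classes.

    // Level 1 (job eviction) runs at the Resource Manager; level 2 (task
    // eviction) then runs for the chosen victim job.
    interface Preemptor { String chooseVictimJob(); }
    interface Releaser { String chooseVictimTask(String jobId); }

    final class TwoLevelEviction {
        private final Preemptor preemptor;
        private final Releaser releaser;

        TwoLevelEviction(Preemptor preemptor, Releaser releaser) {
            this.preemptor = preemptor;
            this.releaser = releaser;
        }

        // Invoked when a container request arrives in a full cluster.
        String evictOneContainer() {
            String victimJob = preemptor.chooseVictimJob();  // job eviction policy
            return releaser.chooseVictimTask(victimJob);     // task eviction policy
        }
    }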



Job Eviction Policies

  • Based on the total amount of resources (e.g., containers) held by the victim job (known at the Resource Manager); all three policies are sketched below

  • Least Resources (LR)

    • Pro: large research jobs unaffected

    • Con: starvation for small research jobs (e.g., under repeated production arrivals)

  • Most Resources (MR)

    • Pro: small research jobs unaffected

    • Con: starvation for the largest research job

  • Probabilistically-weighted on Resources (PR)

    • Pro: weighs jobs by number of containers, treating all tasks the same across jobs

    • Con: affects multiple research jobs

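The three policies reduce to how a victim is picked from the set of research jobs. A minimal sketch, assuming each candidate job exposes the number of containers it holds; JobInfo, the enum, and the selector are illustrative, not Natjam's code.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    record JobInfo(String id, int containersHeld) {}

    enum JobEvictionPolicy { LEAST_RESOURCES, MOST_RESOURCES, PROBABILISTIC }

    final class VictimJobSelector {
        private final Random rng = new Random();

        JobInfo chooseVictim(List<JobInfo> researchJobs, JobEvictionPolicy policy) {
            switch (policy) {
                case LEAST_RESOURCES:  // LR: job holding the fewest containers
                    return researchJobs.stream()
                            .min(Comparator.comparingInt(JobInfo::containersHeld))
                            .orElseThrow();
                case MOST_RESOURCES:   // MR: job holding the most containers
                    return researchJobs.stream()
                            .max(Comparator.comparingInt(JobInfo::containersHeld))
                            .orElseThrow();
                default:               // PR: probability proportional to containers held
                    int total = researchJobs.stream().mapToInt(JobInfo::containersHeld).sum();
                    int r = rng.nextInt(total);
                    for (JobInfo job : researchJobs) {
                        r -= job.containersHeld();
                        if (r < 0) return job;
                    }
                    throw new IllegalStateException("unreachable");
            }
        }
    }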
Task Eviction Policies

  • Based on time remaining (known at the Application Master); both policies are sketched below

  • Shortest Remaining Time (SRT)

    • Pro: leaves the tail of the research job alone

    • Con: holds on to containers that would have been released soon anyway

  • Longest Remaining Time (LRT)

    • Con: may lengthen the tail

    • Pro: releases more containers earlier

  • However: SRT is provably optimal under some conditions

    • Counter-intuitive: SRT amounts to longest-job-first scheduling.
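
A matching sketch for task eviction, assuming the Application Master tracks an estimated remaining time for each running reduce; TaskInfo and the method names are illustrative.

    import java.util.Comparator;
    import java.util.List;

    record TaskInfo(String attemptId, long remainingSeconds) {}

    final class VictimTaskSelector {
        // SRT: suspend the reduce with the shortest remaining time, so the
        // job's long tail tasks keep their containers.
        static TaskInfo chooseSRT(List<TaskInfo> running) {
            return running.stream()
                    .min(Comparator.comparingLong(TaskInfo::remainingSeconds))
                    .orElseThrow();
        }

        // LRT: suspend the reduce with the longest remaining time, releasing
        // more containers earlier at the risk of lengthening the tail.
        static TaskInfo chooseLRT(List<TaskInfo> running) {
            return running.stream()
                    .max(Comparator.comparingLong(TaskInfo::remainingSeconds))
                    .orElseThrow();
        }
    }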


Eviction Policies in Practice

  • Task Eviction

    • SRT 20% faster than LRT for research jobs

    • Production job completion time similar under SRT and LRT

    • Theorem: When research tasks resume simultaneously, SRT results in shortest job completion time.

  • Job Eviction

    • MR best

    • PR very close behind

    • LR 14%-23% worse than MR

  • MR + SRT best combination

Natjam-R: Multiple Priorities

  • Special case of priorities: jobs with real-time deadlines

  • Best-effort only (no admission control)

  • Resource Manager keeps a single queue of jobs sorted by increasing priority (derived from deadline)

    • Periodically scans the queue: evicts a job later in the queue to give resources to a waiting job earlier in the queue

  • Job Eviction Policies (sketched after this list)

  • Maximum Deadline First (MDF): Priority = Deadline

    • Prefers short-deadline jobs

      → May miss deadlines, e.g., schedules a large job instead of a small job with a slightly later deadline

  • Maximum Laxity First (MLF)

    • Priority = Laxity = Deadline minus the job’s projected completion time

    • Pays attention to the job’s resource requirements
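
Reading “Priority = Deadline” together with the sorted queue above, eviction targets the job at the back of the queue. A sketch of the two choices, assuming each job carries a deadline and a projected completion time; the record and method names are assumptions.

    import java.util.Comparator;
    import java.util.List;

    record DeadlineJob(String id, long deadlineMillis, long projectedCompletionMillis) {
        long laxity() { return deadlineMillis - projectedCompletionMillis; }
    }

    final class NatjamRJobEviction {
        // MDF: evict the job with the maximum deadline, protecting short-deadline jobs.
        static DeadlineJob chooseMDF(List<DeadlineJob> jobs) {
            return jobs.stream()
                    .max(Comparator.comparingLong(DeadlineJob::deadlineMillis))
                    .orElseThrow();
        }

        // MLF: evict the job with the maximum laxity (deadline minus projected
        // completion time), so a job's remaining work is taken into account.
        static DeadlineJob chooseMLF(List<DeadlineJob> jobs) {
            return jobs.stream()
                    .max(Comparator.comparingLong(DeadlineJob::laxity))
                    .orElseThrow();
        }
    }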

MDF vs. MLF in Practice

[Plot of job completion against deadlines: MDF prefers short-deadline jobs; MLF moves jobs in lockstep and misses all deadlines.]

  • 8-node cluster

  • Yahoo! trace experiments in the paper

Natjam vs. Alternatives

  • Microbenchmark: 7-node cluster, starting empty

    • t=0 s: Research-XL submitted (100% of cluster)

    • t=50 s: Production-S submitted (25% of cluster)

[Bar chart of completion times (seconds) against an ideal scheduler: Natjam’s production job is 7% worse than ideal (40% better than the Capacity Scheduler soft cap) and its research job is 2% worse than ideal (15% better than killing); the alternatives come in 20%, 50%, and 90% worse than ideal.]

Large Experiments

  • 250 nodes at Yahoo!, driven by Yahoo! traces

  • Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler: soft cap)

    • Production jobs: 53% benefit, 97% delayed < 5 s

    • Research jobs: 63% benefit, very few outliers (low starvation)

  • Natjam vs. killing research tasks

    • Production jobs: largely unaffected

    • Research jobs:

      • 38% finish at least 100 s faster

      • 5th percentile: more than 750 s faster

      • Biggest improvement: 1880 s

      • Negligible starvation

Related Work

  • Single cluster job scheduling has focused on:

    • Locality of Map tasks [Quincy, Delay Scheduling]

    • Speculative execution [LATE Scheduler]

    • Average fairness between queues [Capacity Scheduler, Fair Scheduler]

    • Recent work: elastic queues, but built on Sailfish, which needs a special intermediate file system and does not work with Hadoop [Amoeba]

    • MAPREDUCE-5269 JIRA: preemption in Hadoop


Summary

  • Natjam supports dual priority and arbitrary priorities (derived from deadlines)

  • SRT (Shortest remaining time) best policy for task eviction

  • MR (Most resources) best policy for job eviction

  • MDF (Maximum deadline first) best policy for job eviction in Natjam-R

  • 2-7% overhead for the dual priority case

  • Please see our poster + demo video later today!

Backup slides


  • Our system Natjam allows us to

    • Maintain one cluster

    • With a production queue and a research queue

    • Prioritize production jobs and complete them quickly

    • While affecting research jobs the least

    • (Later: Extend to multiple priorities.)

Hadoop 0.23’s Capacity Scheduler

  • Limitation: research jobs cannot scale down

  • Hadoop capacity shared using queues

    • Guaranteed capacity (G)

    • Maximum capacity (M)

  • Example

    • Production (P) queue: G 80% / M 80%

    • Research (R) queue: G 20% / M 40%

  • Production job submitted first:

  • Research job submitted first:


[Timelines: with the production job submitted first, P takes 80% and R can only grow to 40%; with the research job submitted first, R takes 40% and P cannot grow beyond 60%.]
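
For reference, the example’s G/M settings map directly onto the Capacity Scheduler’s queue properties. A minimal sketch using the real property names but illustrative queue names; in practice these values live in capacity-scheduler.xml rather than being set programmatically.

    import org.apache.hadoop.conf.Configuration;

    public class ExampleQueueConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            conf.set("yarn.scheduler.capacity.root.queues", "production,research");
            // Production (P) queue: G 80% / M 80%
            conf.set("yarn.scheduler.capacity.root.production.capacity", "80");
            conf.set("yarn.scheduler.capacity.root.production.maximum-capacity", "80");
            // Research (R) queue: G 20% / M 40%
            conf.set("yarn.scheduler.capacity.root.research.capacity", "20");
            conf.set("yarn.scheduler.capacity.root.research.maximum-capacity", "40");
            return conf;
        }
    }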

Natjam Scheduler

  • Does not require Maximum capacity

  • Scales down research jobs by

    • Preempting Reduce tasks

    • Fast on-demand automated checkpointing of task state

    • Resumption where it left off

      • Focus on Reduces: Reduce tasks take longer, so more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])

  • P/R Guaranteed: 80%/20%

  • P/R Guaranteed: 100%/0%

[Timelines for both settings: R takes 100% of the idle cluster; when the production job arrives, Natjam scales R down and P takes its full 80% (first setting) or 100% (second setting).]

Prioritize Production Jobs

  • Yahoo! Hadoop traces: CDF of completion-time differences (negative is good)

  • 7-node cluster

[CDF plot: only two starved jobs, at 260 s and 390 s; largest benefit 1880 s.]
