Natjam: Supporting Deadlines and Priorities in a Mapreduce Cluster

Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta,

Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin

University of Illinois (Urbana-Champaign)

Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu

Hadoop Jobs have Priorities
  • Dual-priority case
    • Production jobs (high priority)
      • Time-sensitive
      • Directly affect revenue or critical operations
    • Research jobs (low priority)
      • e.g., long-term analysis
  • Example: an ad provider
    • Production pipeline: count clicks in the ad click-through logs, then update the ads shown; slow counts → show old ads → don't get paid $$$
    • Research pipeline: run machine-learning analysis over daily and historical logs, asking "Is there a better way to place ads?"
    • Lesson: prioritize production jobs


State-of-the-art: Separate clusters
  • Production cluster receives production jobs (high priority)
  • Research cluster receives research jobs (low priority)
  • Traces reveal long periods of under-utilization in each cluster
    • Long job completion times
    • Human involvement in job management
  • Goal: a single consolidated cluster for all priorities and deadlines
    • Prioritize production jobs while affecting research jobs the least
  • Today's options:
    • Wait for research tasks to finish (e.g., Capacity Scheduler) → prolongs production jobs
    • Kill research tasks (e.g., Fair Scheduler) → repeated work prolongs research jobs


Natjam’s Techniques
  • Scales down research jobs by
    • Preempting some Reduce tasks
    • Fast, on-demand, automated checkpointing of task state
    • Later, Reduce tasks can resume where they left off
      • Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
  • Job Eviction Policies
  • Task Eviction Policies


Natjam built into Hadoop YARN Architecture

  • Preemptor (at the Resource Manager, alongside the Capacity Scheduler)
    • Chooses the victim job
    • Reclaims queue resources
  • Releaser (at each Application Master)
    • Chooses the victim task
  • Local Suspender (at each node)
    • Saves the state of the victim task

[Architecture diagram: the Resource Manager runs the Capacity Scheduler and the Preemptor; each node (A, B, ...) runs a Node Manager hosting Application Masters and task containers. Flow: an Application Master asks for a container in a full cluster; the Preemptor calls preempt() and tells the victim job's Application Master how many containers to release; the Releaser calls release() on a victim task; the Local Suspender suspends the task and saves its state; the emptied container goes to the asking job, and the suspended task is later restarted via resume(). This flow is sketched in code below.]
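The diagram's control flow, reduced to a minimal Java sketch. All names here (Container, Task, AppMaster, ResourceManager) are illustrative stand-ins for the YARN components above, not Natjam's actual code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class Container { }

class Task {
    String savedState;                 // checkpoint written on suspend
    Container container = new Container();

    Container suspend() {              // the Local Suspender's role
        savedState = "checkpoint";     // key counter, input path, hostname, ...
        Container freed = container;   // "(empty container)" in the diagram
        container = null;
        return freed;
    }

    void resume(Container c) {         // later, when capacity frees up,
        container = c;                 // restart from savedState
    }
}

class AppMaster {                      // the victim job's Application Master
    Deque<Task> running = new ArrayDeque<>();

    // Releaser: told "# containers to release", suspends that many victims
    Deque<Container> release(int n) {
        Deque<Container> freed = new ArrayDeque<>();
        for (int i = 0; i < n && !running.isEmpty(); i++) {
            Task victim = running.poll();   // a task eviction policy picks this
            freed.add(victim.suspend());
        }
        return freed;
    }
}

class ResourceManager {
    // Preemptor: on a container request in a full cluster, a job eviction
    // policy picks the victim job; its Application Master must then release.
    Deque<Container> preempt(AppMaster victimJob, int needed) {
        return victimJob.release(needed);   // freed containers go to the asker
    }
}
```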


Suspending and Resuming Tasks

[Figure: Task Attempt 1 reads its Reduce inputs from tmp/task_att_1 on the local host and writes output to outdir/ on HDFS; the checkpoint records a key counter. On suspend, the container is freed and the suspend state is saved. The resumed Task Attempt 2, under tmp/task_att_2, reads the same inputs, skips the keys Attempt 1 already reduced, and continues.]
  • Existing intermediate data is reused
    • Reduce inputs, stored at the local host
    • Reduce outputs, stored on HDFS
    • Suspended task state is saved locally, so resume can avoid network overhead
  • Checkpoint state saved (see the sketch below)
    • Key counter
    • Reduce input path
    • Hostname
    • List of suspended task attempt IDs

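A minimal sketch of that checkpoint record and the resume path, assuming exactly the four fields listed above; the names ReduceCheckpoint and ResumedReducer are hypothetical:

```java
import java.util.List;

class ReduceCheckpoint {
    long keyCounter;              // # of input keys already fully reduced
    String reduceInputPath;       // intermediate data on the local host
    String hostname;              // where the inputs live; resuming there
                                  // avoids re-fetching them over the network
    List<String> suspendedTaskAttemptIds;   // e.g., [task_att_1]
}

class ResumedReducer {
    // The resumed attempt replays the sorted reduce input and skips the
    // first keyCounter keys, whose output the first attempt already wrote.
    void run(Iterable<String> sortedKeys, ReduceCheckpoint cp) {
        long seen = 0;
        for (String key : sortedKeys) {
            if (seen++ < cp.keyCounter) continue;   // the "(skip)" step
            reduce(key);                            // process remaining keys
        }
    }

    void reduce(String key) { /* user's reduce function */ }
}
```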


Two-level Eviction Policies

  • On a container request in a full cluster:
    • Job eviction, at the Preemptor (in the Resource Manager, alongside the Capacity Scheduler)
    • Task eviction, at the Releaser (in each Application Master)


Job Eviction Policies
  • Based on the total resources (e.g., containers) held by the victim job (known at the Resource Manager); all three policies are sketched below
  • Least Resources (LR)
    • Pro: large research jobs unaffected
    • Con: starves small research jobs (e.g., under repeated production arrivals)
  • Most Resources (MR)
    • Pro: small research jobs unaffected
    • Con: starves the largest research job
  • Probabilistically-weighted on Resources (PR)
    • Pro: weighs jobs by container count, treating all tasks the same across jobs
    • Con: affects multiple research jobs

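A sketch of the three policies, assuming the Resource Manager can see each research job's container count; the Job record and method names are illustrative:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class JobEviction {
    record Job(String id, int containers) { }

    // LR: evict the job holding the fewest containers
    static Job leastResources(List<Job> jobs) {
        return jobs.stream().min(Comparator.comparingInt(Job::containers)).orElseThrow();
    }

    // MR: evict the job holding the most containers
    static Job mostResources(List<Job> jobs) {
        return jobs.stream().max(Comparator.comparingInt(Job::containers)).orElseThrow();
    }

    // PR: pick a job with probability proportional to its container count,
    // so every container in the cluster is equally likely to be evicted
    static Job probabilisticallyWeighted(List<Job> jobs, Random rng) {
        int total = jobs.stream().mapToInt(Job::containers).sum();  // assumed > 0
        int pick = rng.nextInt(total);
        for (Job j : jobs) {
            pick -= j.containers();
            if (pick < 0) return j;
        }
        throw new IllegalStateException("unreachable");
    }
}
```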

Task Eviction Policies
  • Based on the time remaining in each task (known at the Application Master); both policies are sketched below
  • Shortest Remaining Time (SRT)
    • Pro: leaves the tail of the research job alone
    • Con: holds on to containers that would have been released soon anyway
  • Longest Remaining Time (LRT)
    • Con: may lengthen the tail
    • Pro: releases more containers earlier
  • However: SRT is provably optimal under some conditions
    • Counter-intuitive: SRT amounts to longest-job-first scheduling
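A sketch of both selectors, assuming the Application Master can estimate each task's remaining time from its progress; the Task record and method names are illustrative:

```java
import java.util.Comparator;
import java.util.List;

class TaskEviction {
    record Task(String id, long remainingSec) { }

    // SRT: suspend the task with the shortest remaining time, so the
    // job's long tail tasks keep running
    static Task shortestRemainingTime(List<Task> tasks) {
        return tasks.stream().min(Comparator.comparingLong(Task::remainingSec)).orElseThrow();
    }

    // LRT: suspend the task with the longest remaining time, releasing
    // its container earlier at the risk of lengthening the tail
    static Task longestRemainingTime(List<Task> tasks) {
        return tasks.stream().max(Comparator.comparingLong(Task::remainingSec)).orElseThrow();
    }
}
```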


Eviction Policies in Practice
  • Task eviction
    • SRT 20% faster than LRT for research jobs
    • Production job completion times similar under SRT and LRT
    • Theorem: when research tasks resume simultaneously, SRT results in the shortest job completion time
      • Example: with tasks of 2 s and 10 s remaining, one container to give up, and resumption at t = 5 s, suspending the 2 s task (SRT) finishes the job at t = 10 s, while suspending the 10 s task (LRT) finishes it at t = 15 s
  • Job eviction
    • MR best
    • PR very close behind
    • LR 14%-23% worse than MR
  • MR + SRT is the best combination


Natjam-R: Multiple Priorities
  • A special case of priorities: jobs with real-time deadlines
  • Best-effort only (no admission control)
  • The Resource Manager keeps a single queue of jobs sorted by priority (derived from the deadline)
    • It periodically scans the queue, evicting resources from a lower-priority job to give to a higher-priority waiting job
  • Job eviction policies (both sketched below):
  • Maximum Deadline First (MDF): priority = deadline
    • Prefers short-deadline jobs
    • Con: may miss deadlines, e.g., by scheduling a large job instead of a small job with a slightly later deadline
  • Maximum Laxity First (MLF)
    • Priority = laxity = deadline minus the job's projected completion time
    • Pays attention to the job's resource requirements
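A sketch of the two eviction rules as victim selectors, assuming deadlines and projected completion times are expressed in the same time units; the names are illustrative:

```java
import java.util.Comparator;
import java.util.List;

class NatjamR {
    record Job(String id, long deadline, long projectedCompletion) { }

    // MDF: evict the job whose deadline is furthest away,
    // protecting short-deadline jobs
    static Job mdfVictim(List<Job> jobs) {
        return jobs.stream().max(Comparator.comparingLong(Job::deadline)).orElseThrow();
    }

    // MLF: evict the job with the most laxity (deadline minus projected
    // completion time), protecting a large job that is close to its
    // deadline even if that deadline is far away
    static Job mlfVictim(List<Job> jobs) {
        return jobs.stream()
                .max(Comparator.comparingLong(j -> j.deadline() - j.projectedCompletion()))
                .orElseThrow();
    }
}
```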
MDF vs. MLF in Practice

[Figure: job completion timelines plotted against job deadlines. MDF prefers short-deadline jobs; MLF makes jobs move in lockstep and misses all deadlines.]
  • 8 node cluster
  • Yahoo! trace experiments in paper
Natjam vs. Alternatives

  • Microbenchmark on a 7-node cluster
  • Setup: empty cluster; Research-XL job (sized to 100% of the cluster) submitted at t = 0 s; Production-S job (25% of the cluster) submitted at t = 50 s

[Bar chart: job completion times in seconds under each alternative. Bar annotations: 7% worse than ideal; 40% better than soft cap; 50% worse than ideal; 90% worse than ideal; 20% worse than ideal; 2% worse than ideal; 15% better than killing.]

Large Experiments
  • 250 nodes at Yahoo!, driven by Yahoo! traces
  • Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler, soft cap):
    • Production jobs: 53% benefit; 97% delayed by < 5 s
    • Research jobs: 63% benefit; very few outliers (low starvation)
  • Natjam vs. killing research tasks:
    • Production jobs: largely unaffected
    • Research jobs:
      • 38% finish at least 100 s faster
      • 5th percentile at least 750 s faster
      • Biggest improvement: 1880 s
      • Negligible starvation


Related Work
  • Single-cluster job scheduling has focused on:
    • Locality of Map tasks [Quincy, Delay Scheduling]
    • Speculative execution [LATE Scheduler]
    • Average fairness between queues [Capacity Scheduler, Fair Scheduler]
    • Recent work on elastic queues builds on Sailfish, which needs a special intermediate file system and does not work with Hadoop [Amoeba]
    • MAPREDUCE-5269 JIRA: preemption in Hadoop


Takeaways
  • Natjam supports dual priorities as well as arbitrary priorities (derived from deadlines)
  • SRT (Shortest Remaining Time) is the best task-eviction policy
  • MR (Most Resources) is the best job-eviction policy
  • MDF (Maximum Deadline First) is the best job-eviction policy in Natjam-R
  • 2-7% overhead in the dual-priority case
  • Please see our poster and demo video later today!



Backup slides


Contributions
  • Our system Natjam allows us to
    • Maintain one cluster
    • With a production queue and a research queue
    • Prioritize production jobs and complete them quickly
    • While affecting research jobs the least
    • (Later: Extend to multiple priorities.)

http://dprg.cs.uiuc.edu

Hadoop 23’s Capacity Scheduler
  • Limitation: research jobs cannot scale down
  • Hadoop capacity is shared using queues:
    • Guaranteed capacity (G)
    • Maximum capacity (M)
  • Example (timelines and a small model below):
    • Production (P) queue: G 80% / M 80%
    • Research (R) queue: G 20% / M 40%

[Timelines: if the production job is submitted first, P takes 80% and R can only grow to 40%, leaving capacity idle (under-utilization). If the research job is submitted first, R takes only 40%, its maximum (under-utilization), and P cannot grow beyond 60% when it arrives.]
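A tiny model of that arithmetic, assuming a queue can use at most its maximum capacity and at most what the other queue leaves free (illustrative, not YARN code):

```java
class SoftCapExample {
    // Percent of the cluster a queue can actually use
    static int usable(int demand, int maxCapacity, int usedByOther) {
        return Math.min(demand, Math.min(maxCapacity, 100 - usedByOther));
    }

    public static void main(String[] args) {
        // Research job submitted first into an empty cluster (G 20 / M 40):
        int r = usable(100, 40, 0);   // r == 40: 60% of the cluster sits idle
        // Production job arrives (G 80 / M 80), but R will not shrink:
        int p = usable(100, 80, r);   // p == 60: below P's 80% guarantee
        System.out.println("R=" + r + "% P=" + p + "%");
    }
}
```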


Natjam Scheduler
  • Does not require a Maximum capacity
  • Scales down research jobs by:
    • Preempting Reduce tasks
    • Fast, on-demand, automated checkpointing of task state
    • Resuming where they left off
      • Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
  • Works with P/R guaranteed capacities of 80%/20%, or even 100%/0% (timelines below)

[Timelines: with P/R guarantees of 80%/20%, R takes 100% of an idle cluster and is scaled down when the production job arrives, so P takes its 80%. With guarantees of 100%/0%, R still takes 100% of an idle cluster, and P takes 100% on arrival. Either way, production jobs are prioritized without leaving the cluster idle.]
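A sketch of the scale-down computation this implies, assuming the Preemptor reclaims just enough containers to restore the production queue's guarantee; the formula and all names are illustrative:

```java
class ScaleDown {
    // Containers to preempt from research when production arrives:
    // production's guaranteed share, minus what production already holds,
    // minus containers that are already free.
    static int containersToPreempt(int clusterContainers, double prodGuarantee,
                                   int prodUsed, int freeContainers) {
        int entitled = (int) Math.round(clusterContainers * prodGuarantee);
        return Math.max(0, entitled - prodUsed - freeContainers);
    }

    public static void main(String[] args) {
        // 100-container cluster, 80%/20% guarantees, research holding all 100:
        System.out.println(containersToPreempt(100, 0.8, 0, 0));   // prints 80
    }
}
```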


Yahoo! Hadoop Traces: CDF of differences (negative is good)

[CDF plots of completion-time differences on the 7-node cluster and the 250-node Yahoo! cluster. Only two starved jobs: 260 s and 390 s. Largest benefit: 1880 s.]