Presentation Transcript

Sparrow

Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Distributed Low-Latency Scheduling

Sparrow schedules tasks in clusters using a decentralized, randomized approach, supports constraints and fair sharing, and provides response times within 12% of ideal.

Scheduling Setting

[Diagram: each MapReduce/Spark/Dryad job consists of multiple parallel tasks]

Job Latencies Rapidly Decreasing

[Timeline from 10 min. down to 1 ms — 2004: MapReduce batch job; 2009: Hive query; 2010: Dremel query; 2010: in-memory Spark query; 2012: Impala query; 2013: Spark streaming. Axis marks at 10 min., 10 sec., 100 ms, 1 ms]

Scheduling challenges:

  • Millisecond latency
  • Quality placement
  • Fault tolerance
  • High throughput

1000 16-core machines

[Figure: the latency timeline again, annotated with the scheduler throughput each task duration demands — 10 min. tasks: 26 decisions/second; 10 sec.: 1.6K decisions/second; 100 ms: 160K decisions/second; 1 ms: 16M decisions/second]
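The throughput figures follow from simple arithmetic: 1000 machines × 16 cores give 16,000 task slots, and a fully utilized cluster needs one placement decision per slot per task duration. A quick check:

```python
# Scheduler throughput needed to keep 1000 16-core machines busy:
# one placement decision per task slot per task duration.
slots = 1000 * 16  # 16,000 task slots

for label, duration_ms in [("10 min", 600_000), ("10 sec", 10_000),
                           ("100 ms", 100), ("1 ms", 1)]:
    decisions = slots * 1000 // duration_ms  # decisions per second
    print(f"{label:>7} tasks -> {decisions:,} decisions/second")
```

This reproduces the slide's figures: 26, 1.6K, 160K, and 16M decisions/second.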

Today: completely centralized. Sparrow: completely decentralized.

[Figure: a spectrum from today's completely centralized schedulers to Sparrow's completely decentralized design — can less centralization still achieve millisecond latency, quality placement, fault tolerance, and high throughput?]

Sparrow

  • Decentralized approach
  • Existing randomized approaches
  • Batch sampling
  • Late binding
  • Analytical performance evaluation
  • Handling constraints
  • Fairness and policy enforcement
  • Within 12% of ideal on 100 machines

Scheduling with Sparrow

[Diagram: several schedulers, each able to place an incoming job's tasks on any worker in the cluster]

Random

[Diagram: a scheduler assigns each of the job's tasks to a uniformly random worker]

Simulated Results

100-task jobs in a 10,000-node cluster, exponentially distributed task durations. Omniscient baseline: an infinitely fast centralized scheduler.
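The gap between random placement and the omniscient baseline can be illustrated with a toy snapshot model — this is not the paper's event-driven simulator, and the backlog distribution here is an assumption for illustration only:

```python
import random

random.seed(42)

NODES, TASKS, LOAD = 10_000, 100, 0.8
# Toy snapshot: each worker is busy with probability LOAD, and busy
# workers carry a small random backlog of queued tasks.
queues = [0 if random.random() > LOAD else 1 + int(random.expovariate(1.0))
          for _ in range(NODES)]

# Random placement: each task goes to a uniformly random worker.
random_wait = [queues[random.randrange(NODES)] for _ in range(TASKS)]

# Omniscient placement: pick the TASKS least-loaded workers cluster-wide.
ideal_wait = sorted(queues)[:TASKS]

print("mean backlog ahead of a task (random):    ", sum(random_wait) / TASKS)
print("mean backlog ahead of a task (omniscient):", sum(ideal_wait) / TASKS)
```

With ~20% of 10,000 workers idle, the omniscient scheduler always finds empty queues for 100 tasks, while random placement mostly lands behind existing work.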

Per-Task Sampling

Power of Two Choices

[Diagram: for each task, the scheduler probes two randomly chosen workers and places the task on the less loaded one]
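The power-of-two-choices idea fits in a few lines, assuming probes return instantaneous, accurate queue lengths (an idealization the later slides revisit):

```python
import random

def place_per_task(queues, num_tasks, d=2, rng=random):
    """Per-task sampling: for each task, probe d random workers and
    enqueue the task at the one with the shortest queue."""
    placements = []
    for _ in range(num_tasks):
        probes = rng.sample(range(len(queues)), d)
        best = min(probes, key=lambda w: queues[w])
        queues[best] += 1          # the task joins that worker's queue
        placements.append(best)
    return placements

# Example: 100 tasks on 10,000 initially idle workers.
rng = random.Random(0)
queues = [0] * 10_000
place_per_task(queues, 100, d=2, rng=rng)
print("longest queue after placement:", max(queues))
```

Compared with purely random placement, choosing the better of two probes sharply reduces the chance of stacking tasks behind one another.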

Simulated Results

[Plot: same setup — 100-task jobs in a 10,000-node cluster — now adding per-task sampling]

Per-Task Sampling

[Diagram: the scheduler probes two workers per task; Task 1 and Task 2 each sample their workers independently]
Batch Sampling

4 probes (d = 2): place the job's m tasks on the least loaded of the d·m probed slaves.

[Diagram: batch versus per-task sampling — one shared set of d·m probes covers all of the job's tasks]
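Batch sampling aggregates the probe information: rather than the best of d probes per task, the job takes the m least-loaded of all d·m probed workers. A sketch under the same idealized-probe assumption:

```python
import random

def place_batch(queues, m, d=2, rng=random):
    """Batch sampling: probe d*m random workers at once and place the
    job's m tasks on the m least loaded of them."""
    probes = rng.sample(range(len(queues)), d * m)
    chosen = sorted(probes, key=lambda w: queues[w])[:m]
    for w in chosen:
        queues[w] += 1
    return chosen

rng = random.Random(0)
queues = [3] * 5 + [0] * 95        # 5 busy workers among 100
chosen = place_batch(queues, m=10, d=2, rng=rng)
print("placed on workers:", sorted(chosen))
```

Because unlucky probes for one task can be covered by lucky probes for another, batch sampling avoids the busy workers that per-task sampling would sometimes be forced to use.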

Simulated Results

[Plot: same setup, now adding batch sampling]

Queue length is a poor predictor of wait time, which leads to poor performance on heterogeneous workloads.

[Diagram: two workers with equally long queues, but queued tasks of 80 ms, 155 ms, and 530 ms give very different wait times]

Late Binding

4 probes (d = 2): instead of committing a task at probe time, each probe waits in the worker's queue as a placeholder.

[Diagram: probes queued at the d·m probed workers]

Worker requests task: when a probe reaches the front of a worker's queue, the worker asks the scheduler for a task, and the scheduler launches the job's m tasks on the first m workers to respond.
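The binding step can be modeled as a race: probes wait at workers, and tasks go to whichever probed workers free up first. The names below are illustrative — real Sparrow implements this with probe reservations and RPCs between schedulers and node monitors:

```python
def late_binding(finish_times, probed_workers, m):
    """Toy late binding: the d*m probes wait in worker queues; the first
    m probed workers to become idle each request and receive a task, and
    the remaining probes get a 'no task' reply."""
    order = sorted((finish_times[w], w) for w in probed_workers)
    assigned = [w for _, w in order[:m]]
    cancelled = [w for _, w in order[m:]]
    return assigned, cancelled

# Each probed worker's current task finishes at these times (ms).
finish_times = {0: 530, 1: 80, 2: 155, 3: 40, 4: 300, 5: 95}
assigned, cancelled = late_binding(finish_times, [0, 1, 2, 3, 4, 5], m=3)
print("tasks bound to:", assigned)    # the three earliest-idle workers
print("no-op replies:", cancelled)
```

Binding at idle time rather than probe time sidesteps the queue-length problem above: the scheduler never has to guess how long a queue will actually take.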

Simulated Results

[Plot: same setup, now adding late binding]

Job Constraints

Restrict probed machines to those that satisfy the constraint.

[Diagram: the scheduler probes only workers satisfying the job's constraint]

Per-Task Constraints

Probe separately for each task.

[Diagram: each task's probes are restricted to workers satisfying that task's constraints]
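Both flavors of constraint reduce to filtering the candidate pool before sampling — once per job, or once per task. A sketch, where the data-locality predicate and the cached-block setup are hypothetical stand-ins:

```python
import random

def constrained_probes(workers, constraint, d, rng=random):
    """Sample d probe targets only from workers satisfying the constraint."""
    candidates = [w for w in workers if constraint(w)]
    return rng.sample(candidates, min(d, len(candidates)))

# Hypothetical setup: each worker caches one input block, and a task may
# only run where its input block is cached.
workers = [{"id": i, "block": "block-%d" % (i % 4)} for i in range(100)]
needs = lambda b: (lambda w: w["block"] == b)

rng = random.Random(1)
probes = constrained_probes(workers, needs("block-2"), d=2, rng=rng)
print([w["id"] for w in probes])   # every probed worker holds block-2
```

For per-task constraints, this filter-and-sample step simply runs once per task with that task's own predicate.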

Technique Recap

Batch sampling + late binding + constraint handling.

[Diagram: decentralized schedulers placing tasks on workers using the combined techniques]

Spark on Sparrow

Query: DAG of stages.

[Diagram: as each stage of the DAG becomes ready, Spark submits it to a Sparrow scheduler as a job, and the stage's tasks run on the workers]

How does Sparrow compare to Spark’s native scheduler?

[Plot: 100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load]

TPC-H Queries: Background

  • TPC-H: common benchmark for analytics workloads
  • Shark: SQL execution engine
  • Spark: distributed in-memory analytics framework
  • Sparrow: distributed low-latency scheduler

TPC-H Queries

[Plot: query response-time percentiles (5th, 25th, 50th, 75th, 95th) on 100 16-core EC2 nodes, 10 schedulers, 80% load]

Sparrow is within 12% of ideal, with a median queuing delay of 9 ms.

Fault Tolerance

  • Timeout: 100 ms
  • Failover: 5 ms
  • Re-launch queries: 15 ms

[Diagram: Spark Client 1 detects that Scheduler 1 has failed and fails over to Scheduler 2]
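The failover path is client-driven: a client that times out on its scheduler (100 ms) switches to another scheduler from its list and re-launches its in-flight queries. A toy sketch with hypothetical class names:

```python
def submit_with_failover(job, schedulers, timeout_ms=100):
    """Try each scheduler in turn; on timeout, fail over to the next one
    and re-launch the job (toy model of client-side failover)."""
    for scheduler in schedulers:
        try:
            return scheduler.submit(job, timeout_ms=timeout_ms)
        except TimeoutError:
            continue  # scheduler presumed dead: fail over and re-launch
    raise RuntimeError("no live scheduler")

class DeadScheduler:
    def submit(self, job, timeout_ms):
        raise TimeoutError

class LiveScheduler:
    def submit(self, job, timeout_ms):
        return f"{job} accepted"

print(submit_with_failover("query-1", [DeadScheduler(), LiveScheduler()]))
# -> query-1 accepted
```

Because schedulers hold no state a client cannot recreate, failover needs nothing more than this retry loop.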

Related Work

  • Centralized task schedulers, e.g., Quincy
  • Two-level schedulers, e.g., YARN, Mesos
  • Coarse-grained cluster schedulers, e.g., Omega
  • Load balancing: single task

www.github.com/radlab/sparrow

Batch sampling + late binding + constraint handling.

[Diagram: decentralized schedulers and workers]

Sparrow provides near-ideal job response times without global visibility.

Policy Enforcement

Can we do better without losing simplicity?

  • Priorities: serve queues based on strict priorities.
  • Fair shares: serve queues using weighted fair queuing.

[Diagram: one slave with high- and low-priority queues; another with per-user queues for User A (75%) and User B (25%)]
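Per-worker queue selection for both policies can be sketched as follows — strict priorities pop from the highest non-empty queue, while the fair-share case is approximated here with lottery-style weighted selection (an approximation for illustration, not true weighted fair queuing):

```python
import random

def serve_strict(queues):
    """Strict priorities: pop from the highest-priority non-empty queue
    (lower number = higher priority)."""
    for prio in sorted(queues):
        if queues[prio]:
            return queues[prio].pop(0)
    return None

def serve_weighted(queues, weights, rng=random):
    """Weighted shares, approximated by lottery selection among
    non-empty user queues in proportion to their weights."""
    users = [u for u in queues if queues[u]]
    if not users:
        return None
    user = rng.choices(users, weights=[weights[u] for u in users])[0]
    return queues[user].pop(0)

# User A should get ~75% of this slave's slots, User B ~25%.
rng = random.Random(0)
queues = {"A": ["a"] * 2000, "B": ["b"] * 2000}
served = [serve_weighted(queues, {"A": 3, "B": 1}, rng) for _ in range(1000)]
print("A's share:", served.count("a") / 1000)   # close to 0.75
```

Each worker enforces the policy locally over its own queues, so no central coordination is needed — consistent with the rest of the decentralized design.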