
Scheduling Streaming Computations

Kunal Agrawal



The Streaming Model

  • Computation is represented by a directed graph:

    • Nodes: Computation Modules.

    • Edges: FIFO Channels between nodes.

    • Infinite input stream.

    • We only consider acyclic graphs (dags).

  • When modules fire, they consume data from incoming channels and produce data on outgoing channels.
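To make the firing rule concrete, here is a minimal Python sketch (the module names, rates, and the fire helper are illustrative, not from the talk): a module fires only if every input channel holds enough items, then consumes from and produces onto FIFO queues.

```python
from collections import deque

# Illustrative two-module pipeline: src -> f -> g.
channels = {("src", "f"): deque(), ("f", "g"): deque()}

def fire(inputs, outputs, consume, produce):
    """Fire one module: consume `consume` items from every input channel
    and produce `produce` items on every output channel (FIFO order)."""
    if any(len(channels[ch]) < consume for ch in inputs):
        return False                      # not enough buffered data to fire
    for ch in inputs:
        for _ in range(consume):
            channels[ch].popleft()
    for ch in outputs:
        for k in range(produce):
            channels[ch].append(k)        # item payloads elided in this sketch
    return True

# Drive the (conceptually infinite) input stream, truncated to 8 source firings.
for _ in range(8):
    channels[("src", "f")].append(0)      # the source consumes 1 item per firing
    fire([("src", "f")], [("f", "g")], consume=2, produce=1)
print(len(channels[("f", "g")]))          # 4 items produced by f
```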



Cache-Conscious Scheduling of Streaming Applications

  • Goal: Schedule the computation to minimize the number of cache misses on a sequential machine.

with Jeremy T. Fineman, Jordan Krage, Charles E. Leiserson, and Sivan Toledo



Disk Access Model

[Figure: a CPU with a cache of M/B blocks, each of size B, in front of a slow memory.]

  • The cache has M/B blocks, each of size B.

  • Cost = number of cache misses.

  • If the CPU accesses data that is in cache, the cost is 0.

  • If the CPU accesses data that is not in cache, there is a cache miss of cost 1, and the block containing the requested data is read into cache.

  • If the cache is full, some block is evicted to make room for the new block.

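A minimal sketch of this cost model, assuming LRU eviction (the model itself does not fix an eviction policy); M and B are the cache and block sizes above.

```python
from collections import OrderedDict

def count_misses(accesses, M, B):
    """Count cache misses in the disk access model: the cache holds M/B
    blocks of size B; a hit costs 0, a miss costs 1 and loads the block.
    Assumes LRU eviction when the cache is full."""
    cache = OrderedDict()                 # block id -> present, in LRU order
    misses = 0
    for addr in accesses:
        block = addr // B                 # block containing the requested data
        if block in cache:
            cache.move_to_end(block)      # hit: cost 0, refresh recency
        else:
            misses += 1                   # miss: cost 1, read block into cache
            cache[block] = True
            if len(cache) > M // B:       # cache full: evict some block
                cache.popitem(last=False)
    return misses

# Example: a sequential scan of 100 items with B = 10 incurs 10 misses.
print(count_misses(range(100), M=40, B=10))
```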



Contributions

  • The problem of minimizing cache misses is reduced to a problem of graph partitioning.

  • Theorem: If the optimal algorithm has X cache misses given a cache of size M, there exists a partitioned schedule that incurs O(X) cache misses given a cache of size O(M).

  • In other words, some partitioned schedule is O(1) competitive given O(1) memory augmentation.



Outline

  • Cache-Conscious Scheduling

    • Streaming Application Model

    • The Sources of Cache Misses and Intuition Behind Partitioning

    • Proof Intuition

    • Thoughts

  • Deadlock Avoidance

    • Model and Source of Deadlocks

    • Deadlock Avoidance Using Dummy Items.

    • Thoughts



Streaming Applications

[Figure: an example pipeline of modules a, b, c, d with state sizes s:60, s:20, s:40, s:35 and per-edge input/output rates i and o (e.g., i:4, o:2).]

When a module v fires, it

  • must load its state s(v),

  • consumes i(u,v) items from each incoming edge (u,v), and

  • produces o(v,w) items on each outgoing edge (v,w).


  • Assumptions:

    • All items are unit sized.

    • The source consumes 1 item each time it fires.

    • Input/output rates and state sizes are known.

    • The state size of modules is at most M.



Definition: Gain

[Figure: the same pipeline annotated with vertex gains: 1/2 for b and 1 for two of the other modules.]

Vertex Gain: the number of firings of vertex u per source firing,

gain(u) = ∏_{(x,y) ∈ p} o(x,y) / i(x,y), where p is any path from the source s to u.

s:35

o:4

i:1

s:40

gain: 1

i:4

o:1

c

gain: 1

Edge Gain: the number of items produced along edge (u,v) per source firing, gain(u,v) = gain(u) · o(u,v).

A graph is well-formed iff all gains are well-defined, i.e., every path from the source to a vertex yields the same gain.
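The gains of a pipeline can be computed by walking from the source and multiplying o/i ratios. The sketch below uses illustrative rates chosen to reproduce b's gain of 1/2 (the full example's rates are not recoverable from the transcript).

```python
# Pipeline sketch with illustrative rates: s -> a -> b. On edge (a,b),
# a produces 2 items per firing and b consumes 4 per firing.
edges = {("s", "a"): (1, 1), ("a", "b"): (4, 2)}   # (u,v) -> (i, o)

def vertex_gain(target):
    """gain(u): firings of u per source firing, i.e. the product of
    o(x,y)/i(x,y) over the edges (x,y) of the path from the source s to u."""
    g, node = 1.0, "s"
    while node != target:
        (_, v), (i, o) = next((e, r) for e, r in edges.items() if e[0] == node)
        g *= o / i
        node = v
    return g

def edge_gain(u, v):
    """gain(u,v): items produced on (u,v) per source firing."""
    return vertex_gain(u) * edges[(u, v)][1]

print(vertex_gain("b"))      # 0.5: b fires once per two source firings
print(edge_gain("a", "b"))   # 2.0: two items appear on (a,b) per source firing
```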



Outline

  • Cache-Conscious Scheduling

    • Streaming Application Model

    • The Sources of Cache Misses and Intuition Behind Partitioning

    • Proof Intuition

    • Thoughts

  • Deadlock Avoidance

    • Model and Source of Deadlocks

    • Deadlock Avoidance Using Dummy Items.

    • Thoughts



Cache Misses Due to State Load

[Figure: the pipeline with state sizes s:40, s:60, s:20, s:35 and per-edge item counts; cache parameters B:1, M:100.]

Strategy: push each input item all the way through the pipeline.

Cost per input item: the sum of the state sizes, since every module's state must be reloaded for every item.

Idea: reuse the state once it is loaded.




Cache Misses Due to Data Items

[Figure: the same pipeline with states labeled s1:60, s2:20, s3:40, s4:35; cache parameters B:1, M:100.]

Strategy: once a module's state is loaded, execute the module many times by adding large buffers between modules.

Cost per input item: the total number of items produced on all channels per input item (the sum of the edge gains), since every buffered item must be written to memory and later read back.




Partitioning: Reduce Cache Misses

[Figure: the pipeline partitioned into segments that fit in cache, with buffers on the cross edges; B:1, M:100.]

Strategy: partition the graph into segments that fit in cache, and add buffers only on cross edges C, the edges that go between partitions.

Cost per input item: proportional to the total gain of the cross edges, ∑_{e∈C} gain(e).




Which Partition?

[Figure: the same pipeline with an alternative choice of partition; B:1, M:100.]

Strategy: partition the graph into segments that fit in cache, and add buffers only on cross edges C, the edges that go between partitions.

Cost per input item: proportional to the total gain of the cross edges, ∑_{e∈C} gain(e), which depends on where we cut.


Lesson: cut small-gain edges.
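In the simplest case, the lesson amounts to cutting where the gain is smallest; a toy snippet with hypothetical gains:

```python
# Hypothetical gains for candidate cut edges of a pipeline.
gains = {("a", "b"): 0.5, ("b", "c"): 8.0, ("c", "d"): 1.0}

# Buffering on a cross edge costs about gain(e) items per input item,
# so a single cut is cheapest at the minimum-gain edge.
best = min(gains, key=gains.get)
print(best, gains[best])    # ('a', 'b') 0.5
```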



Outline

  • Cache-Conscious Scheduling

    • Streaming Application Model

    • The Sources of Cache Misses and Intuition Behind Partitioning

    • Proof Intuition

    • Thoughts

  • Deadlock Avoidance

    • Model and Source of Deadlocks

    • Deadlock Avoidance Using Dummy Items.

    • Thoughts



Is Partitioning Good?

  • Show that the optimal scheduler cannot do much better than the best partitioned scheduler.

  • Theorem: On processing T items, if the optimal algorithm given M-sized cache has X cache misses, then some partitioning algorithm given O(M) cache has at most O(X) cache misses.

  • The number of cache misses incurred by a partitioned scheduler is proportional to the total gain of its cross edges; the best partitioned scheduler should therefore minimize ∑_{e∈C} gain(e).

  • We must prove the matching lower bound on the optimal scheduler’s cache misses.



Optimal Scheduler With Cache M

S: segment with state size at least 2M.

e = g_m(S): the edge with the minimum gain within S.

[Figure: a segment S containing the minimum-gain edge e = (u, v).]

  • u fires X times.

  • Case 1: at least 1 item produced by u is processed by v.

    • Cost: Ω(M/B), since the segment's state (at least 2M) cannot fit in an M-sized cache.

  • Case 2: all items are buffered within S.

    • The cheapest place to buffer is at e, the minimum-gain edge.

    • Cost: every buffered item must be written to memory and later read back.

    • If X is large enough, this buffering cost is also Ω(M/B).

  • In both cases, the cost per firing of u is bounded from below.



Lower Bound

  • Divide the pipeline into segments of state size between 2M and 3M.

  • The source node fires T times.

  • Consider the optimal scheduler with an M-sized cache.

    • Number of firings of u_i: T · gain(u_i).

    • Cost due to S_i per firing of u_i: the per-firing bound from the previous slide.

    • Total cost due to S_i: the product of the two quantities above.

    • Total cost: the sum over all segments.

[Figure: the pipeline cut into segments S_i, each containing a minimum-gain edge e_i = (u_i, v_i); the cut edges are e_1, …, e_k.]



Matching Upper Bound

  • Divide the pipeline into segments of state size between 2M and 3M.

  • The source node fires T times.

  • The cost of the optimal scheduler with an M-sized cache is bounded below as on the previous slide.

  • Consider the partitioned schedule that cuts all the edges e_i.

    • Each resulting segment has size at most 6M.

    • The total cost of that schedule matches the lower bound up to constant factors.

  • Therefore, given constant-factor memory augmentation, this partitioned schedule is constant-competitive in the number of cache misses.




Generalization to DAG

  • Say we partition a DAG such that

    • Each component has size at most O(M).

    • When contracted, the components form a dag.



Generalization to DAG

  • Say we partition a DAG such that

    • Each component has size at most O(M).

    • When contracted, the components form a dag.

    • If C is the set of cross edges, ∑_{e∈C} gain(e) is minimized over all such partitions.

  • The optimal schedule has cost per item at least proportional to this minimized sum.

  • Given constant-factor memory augmentation, a partitioned schedule achieves cost per item O(∑_{e∈C} gain(e)).



When B ≠ 1

  • Lower Bound: the optimal algorithm has cost per item roughly Ω((1/B) ∑_{e∈C} gain(e)).

  • Upper Bound: With constant factor memory augmentation:

    • Pipelines: Upper bound matches the lower bound.

    • DAGs: Upper bound matches the lower bound as long as each component of the partition has O(M/B) incident cross edges.



Finding A Good Partition

  • For pipelines, we can find a good-enough partition greedily and the best partition using dynamic programming (a sketch follows below).

  • For general DAGs, finding the best partition is NP-complete.

  • Our proof is approximation-preserving: an approximation algorithm for the partitioning problem also works for our scheduling problem.
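A sketch of the pipeline dynamic program (the recurrence and the variable names are my reconstruction; the talk does not spell it out): best[j] is the minimum total cross-edge gain over partitions of the first j modules into contiguous segments whose total state fits in a cache of size M.

```python
def best_pipeline_partition(state, gain, M):
    """DP sketch: split a pipeline into contiguous segments whose total
    state size fits in M, minimizing the total gain of the cut edges.
    state[k]: state size of module k; gain[k]: gain of the edge entering
    module k (gain[0] = 0, since the first module has no incoming edge)."""
    n = len(state)
    best = [0.0] + [float("inf")] * n     # best[j]: first j modules handled
    for j in range(1, n + 1):
        seg = 0.0
        for i in range(j, 0, -1):         # candidate segment: modules i-1 .. j-1
            seg += state[i - 1]
            if seg > M:
                break                     # segment no longer fits in cache
            best[j] = min(best[j], best[i - 1] + gain[i - 1])
    return best[n]                        # minimum total gain of cut edges

# Toy example: the figure's state sizes with hypothetical edge gains.
print(best_pipeline_partition([60, 20, 40, 35], [0.0, 2.0, 1.0, 1.0], 100))
# 1.0: cut between the second and third modules ({60,20} | {40,35}).
```

The inner loop extends a candidate segment backwards until it overflows the cache, so the whole DP runs in O(n²) time for an n-module pipeline.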



Conclusions and Future Work

  • We can reduce the problem of minimizing cache misses to the problem of calculating the best partition.

  • Solving the partitioning problem:

    • Approximation algorithms.

    • Exact solution for special cases such as SP-DAGs.

  • Space bounds: Bound the buffer sizes on cross edges.

  • Cache-conscious scheduling for multicores.



Deadlock Avoidance for Streaming Computations with Filtering

  • Goal: Devise mechanisms to avoid deadlocks on applications with filtering and finite buffers.

with Peng Li, Jeremy Buhler, and Roger D. Chamberlain



Outline

  • Cache-Conscious Scheduling

    • Streaming Application Model

    • The Sources of Cache Misses and Intuition Behind Partitioning

    • Proof Intuition

    • Thoughts

  • Deadlock Avoidance

    • Model and Source of Deadlocks

    • Deadlock Avoidance Using Dummy Items.

    • Thoughts



Filtering Applications Model

  • Data-dependent filtering: the number of items produced depends on the data.

  • When a node fires, it

    • has a compute index (CI), which increases monotonically,

    • consumes/produces 0 or 1 items from each input/output channel, and

    • all of its input/output items must have index = CI.

  • A node cannot proceed until it is sure that it has received all items for its current CI.

  • Channels can have unbounded delays.

[Figure: nodes A, B, C feeding X and Y; channel items carry indices 1, 2, 3, and each node tracks a compute index; a square labeled 1 denotes an item with index 1.]



A Deadlock Demo

Filtering can cause deadlocks due to finite buffers.

[Figure: a deadlocked configuration of nodes u, v, w, x: the buffers on one branch are full while those on the other are empty, so no node can fire.]

  • A deadlock example (channel buffer size is 3).



Contributions

  • Deadlock avoidance mechanism using dummy or heartbeat messages sent at regular intervals:

    • provably correct: guarantees deadlock freedom,

    • no global synchronization,

    • no dynamic buffer resizing.

  • Efficient algorithms to compute dummy intervals for structured DAGs such as series-parallel DAGs and CS4 DAGs.



Outline

  • Cache-Conscious Scheduling

    • Streaming Application Model

    • The Sources of Cache Misses and Intuition Behind Partitioning

    • Proof Intuition

    • Thoughts

  • Deadlock Avoidance

    • Model and Source of Deadlocks

    • Deadlock Avoidance Using Dummy Items.

    • Thoughts



The Naïve Algorithm

  • Filtering Theorem:

    • If no node ever filters any token, then the system cannot deadlock.

  • The Naïve Algorithm:

    • sends a dummy for every filtered item,

    • changing the filtering system into a non-filtering system.

[Figure: node A sends node X a stream of tokens and dummies; the legend shows a token with index 1 and a dummy with index 1.]
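A sketch of the naïve rule (the Channel class and the compute callback are hypothetical scaffolding): whenever an item is filtered, a dummy carrying its index is sent instead.

```python
from collections import deque

class Channel:
    def __init__(self):
        self.buf = deque()
    def push(self, msg):
        self.buf.append(msg)

def fire_naive(out_channels, compute, index):
    """Naive algorithm (sketch): on every output channel send either the
    real item or, if it is filtered, a dummy carrying the same index, so
    downstream nodes never block waiting for a filtered index."""
    for chan in out_channels:
        item = compute(index)                  # None means "filtered"
        if item is not None:
            chan.push(("token", index, item))
        else:
            chan.push(("dummy", index, None))  # one dummy per filtered item

def keep_even(i):
    return i if i % 2 == 0 else None           # odd indices are filtered

out = Channel()
for idx in range(1, 5):
    fire_naive([out], keep_even, idx)
print(list(out.buf))   # dummies at indices 1 and 3, tokens at 2 and 4
```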



Comments on the Naïve Algorithm

  • Pros:

    • Easy to schedule dummy items.

  • Cons:

    • Does not utilize channel buffer sizes.

    • Sends many unnecessary dummy items, wasting both computation and bandwidth.

  • Next step: reduce the number of dummy items.



The Propagation Algorithm

  • Computes a static dummy schedule.

  • Sends dummies periodically based on dummy intervals.

  • Dummy items must be propagated to all downstream nodes.

[Figure: nodes u, v, w, x with each channel labeled by its buffer size and dummy interval (e.g., 3, 8 and 4, 6; ∞ means dummies are never needed). Worked example at one node: compute index 6, index of last dummy 0, dummy interval 6; since 6 − 0 ≥ 6, send a dummy.]
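The per-channel check is simple; a sketch reproducing the slide's worked example (class and field names are assumptions):

```python
class OutChannel:
    """Per-channel dummy-interval state for the propagation algorithm."""
    def __init__(self, dummy_interval):
        self.dummy_interval = dummy_interval
        self.last_dummy = 0            # index of the last dummy sent
        self.buf = []

def maybe_send_dummy(chan, compute_index):
    """Propagation algorithm (sketch): send a dummy whenever the node's
    compute index has advanced a full dummy interval past the last dummy.
    Downstream nodes must propagate the dummy onward."""
    if compute_index - chan.last_dummy >= chan.dummy_interval:
        chan.buf.append(("dummy", compute_index))
        chan.last_dummy = compute_index

# Example from the slide: compute index 6, last dummy 0, interval 6.
chan = OutChannel(dummy_interval=6)
maybe_send_dummy(chan, compute_index=6)   # 6 - 0 >= 6, send a dummy
print(chan.buf)                           # [('dummy', 6)]
```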



Comments on the Propagation Algorithm

  • Pros:

    • Takes advantage of channel buffer sizes.

    • Greatly reduces the number of dummy items compared to the Naïve Algorithm.

  • Cons:

    • Does not utilize filtering history.

    • Dummy items must be propagated.

  • Next step: eliminate propagation.

    • Use shorter dummy intervals.

    • Use filtering history for dummy scheduling.



The Non-Propagation Algorithm

  • Send dummy items based on filtering history.

  • Dummy items do not propagate.

  • If (index of filtered item − index of previous token/dummy) ≥ dummy interval, send a dummy.

[Figure: nodes u, v, w, x with channels labeled (buffer size, dummy interval), e.g., (3, 4) and (4, 3). Worked example: data with index 3 is filtered and the last token/dummy had index 0; since 3 − 0 ≥ 3, a dummy is sent on that channel only.]
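A sketch of the non-propagating rule (names are assumptions): the channel remembers the index of the last token or dummy it carried, and a dummy is sent only when a filtered item's index has outrun it by a full dummy interval.

```python
class OutChannel:
    """Per-channel state for the non-propagation algorithm: dummies are
    scheduled from the filtering history and are not propagated further."""
    def __init__(self, dummy_interval):
        self.dummy_interval = dummy_interval
        self.last_sent = 0             # index of the previous token or dummy
        self.buf = []

def send_token(chan, index, item):
    chan.buf.append(("token", index, item))
    chan.last_sent = index

def on_filtered(chan, index):
    """If (index of filtered item - index of previous token/dummy) is at
    least the dummy interval, send a dummy on this channel only."""
    if index - chan.last_sent >= chan.dummy_interval:
        chan.buf.append(("dummy", index))
        chan.last_sent = index

# Example from the slide: interval 3, last token/dummy 0, item 3 filtered.
chan = OutChannel(dummy_interval=3)
on_filtered(chan, 3)            # 3 - 0 >= 3, send a dummy
print(chan.buf)                 # [('dummy', 3)]
```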



Comparison of the Algorithms

  • Performance measurement:

    • the number of dummies sent,

    • fewer dummies are better.

  • The Non-Propagation Algorithm is expected to be the best in most cases.

  • Experimental data:

    • Mercury BLASTN (a biological application),

    • 787 billion input elements.



How Do We Compute These Intervals?

  • Exponential-time algorithms for general DAGs, since we have to enumerate cycles.

  • Can we do better for structured DAGs? Yes:

    • Polynomial-time algorithms for SP DAGs.

    • Polynomial-time algorithms for CS4 DAGs, a class of DAGs in which every undirected cycle has a single source and a single sink.



Conclusions and Future Work

  • Designed efficient deadlock-avoidance algorithms using dummy messages.

  • Find polynomial-time algorithms to compute dummy intervals for general DAGs.

  • Consider more general models: allowing multiple outputs from one input, and feedback loops.

  • The reverse problem: computing efficient buffer sizes from dummy intervals.

