Scheduling Streaming Computations

Kunal Agrawal

- Computation is represented by a directed graph:
- Nodes: Computation Modules.
- Edges: FIFO Channels between nodes.
- Infinite input stream.
- We only consider acyclic graphs (dags).

- When modules fire, they consume data from incoming channels and produce data on outgoing channels.

Cache-Conscious Scheduling of Streaming Applications

- Goal: Schedule the computation to minimize the number of cache misses on a sequential machine.

with Jeremy T. Fineman, Jordan Krage, Charles E. Leiserson, and Sivan Toledo

[Figure: the CPU, a cache of M/B blocks of size B each, and slow memory.]

- The cache has M/B blocks each of size B.
- Cost = Number of cache misses.
- If CPU accesses data in cache, the cost is 0.
- If CPU accesses data not in cache, then there is a cache miss of cost 1. The block containing the requested data is read into cache.
- If the cache is full, some block is evicted from cache to make room for new blocks.
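
The cost model above can be made concrete with a small simulator. This is a minimal sketch, not part of the original talk: the function name `count_misses` is hypothetical, and LRU eviction is an assumption, since the model only says that "some block" is evicted.

```python
from collections import OrderedDict

def count_misses(addresses, M, B):
    """Count cache misses for a sequence of integer memory addresses.

    The cache holds M // B blocks, each covering B consecutive addresses.
    LRU eviction is an assumption made for this sketch.
    """
    capacity = M // B
    cache = OrderedDict()  # block id -> None, least recently used first
    misses = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)       # hit: cost 0
        else:
            misses += 1                    # miss: cost 1, block is loaded
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return misses

# Scanning 16 addresses twice with B = 4:
small = count_misses(list(range(16)) * 2, M=8, B=4)   # cache holds 2 blocks
large = count_misses(list(range(16)) * 2, M=16, B=4)  # cache holds 4 blocks
```

With room for only two blocks, every block is evicted before it is reused, so the second scan misses again; with room for all four blocks, the second scan is free.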

- The problem of minimizing cache misses is reduced to a problem of graph partitioning.
- Theorem: If the optimal algorithm has X cache misses given a cache of size M, there exists a partitioned schedule that incurs O(X) cache misses given a cache of size O(M).
- In other words, some partitioned schedule is O(1) competitive given O(1) memory augmentation.

- Cache Conscious Scheduling
- Streaming Application Model
- The Sources of Cache Misses and Intuition Behind Partitioning
- Proof Intuition
- Thoughts

- Deadlock Avoidance
- Model and Source of Deadlocks
- Deadlock Avoidance Using Dummy Items.
- Thoughts

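[Figure: example dag on modules a, b, c, and d; each module is labelled with its state size (s) and each edge with its input (i) and output (o) rates.]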

When a module v fires, it

- must load s(v) state,
- consumes i(u,v) items from incoming edge(s) (u,v), and
- produces o(v,w) items on outgoing edge(s) (v,w).
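
The firing rule can be sketched in a few lines. The class and function names below are hypothetical, and the two-module graph is only an illustration of the rates i and o, not the graph from the figure.

```python
from collections import deque

class Module:
    """A computation module v with state size s(v) and per-edge rates."""
    def __init__(self, name, state, in_rates, out_rates):
        self.name = name
        self.state = state          # s(v): state that must be loaded to fire
        self.in_rates = in_rates    # {u: i(u, v)} items consumed per firing
        self.out_rates = out_rates  # {w: o(v, w)} items produced per firing

def can_fire(v, channels):
    """True if every incoming channel holds enough items for one firing."""
    return all(len(channels[(u, v.name)]) >= n for u, n in v.in_rates.items())

def fire(v, channels):
    """Fire v once: consume i(u, v) items per input edge, produce o(v, w) per output edge."""
    if not can_fire(v, channels):
        raise RuntimeError(f"{v.name} cannot fire yet")
    for u, n in v.in_rates.items():
        for _ in range(n):
            channels[(u, v.name)].popleft()
    for w, n in v.out_rates.items():
        channels[(v.name, w)].extend([v.name] * n)

# Illustrative example: a produces 2 items per firing, b consumes 4.
a = Module("a", state=60, in_rates={}, out_rates={"b": 2})
b = Module("b", state=20, in_rates={"a": 4}, out_rates={})
channels = {("a", "b"): deque()}   # one FIFO channel per edge
fire(a, channels)
fire(a, channels)
fire(b, channels)                  # possible only after a has fired twice
```

The sketch tracks only item counts on channels; the cost of loading s(v) is what the cache model charges for.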

- Assumptions:
- All items are unit sized.
- The source consumes 1 item each time it fires.
- Input/output rates and state sizes are known.
- The state size of modules is at most M.

[Figure: the example dag annotated with vertex gains; for instance, module b has gain 1/2.]

Vertex Gain: The number of firings of vertex u per source firing.

$\mathrm{gain}(u) = \prod_{(x,y) \in p} \frac{o(x,y)}{i(x,y)}$, where p is a path from s to u.

Edge Gain: The number of items produced along the edge (u,v) per source firing.

A graph is well-formed iff all gains are well-defined.
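
Gains can be computed by walking the graph from the source and multiplying rate ratios, checking along the way that every path to a vertex gives the same value. The sketch below assumes that each firing of u puts o(u,v) items on the edge and each firing of v removes i(u,v), so the edge gain is gain(u) times o(u,v). The function names and the diamond-shaped example rates are illustrative, not taken from the figure.

```python
from fractions import Fraction

def vertex_gains(edges, source):
    """edges maps (u, v) -> (o, i): items produced by u / consumed by v per firing."""
    out = {}
    for (u, v), rates in edges.items():
        out.setdefault(u, []).append((v, rates))
    gains = {source: Fraction(1)}   # the source fires once per source firing
    pending = [source]
    while pending:
        u = pending.pop()
        for v, (o, i) in out.get(u, []):
            g = gains[u] * Fraction(o, i)
            if v not in gains:
                gains[v] = g
                pending.append(v)
            elif gains[v] != g:
                # Two paths disagree, so the gain is not well-defined.
                raise ValueError(f"gain of {v} depends on the path: not well-formed")
    return gains

def edge_gains(edges, gains):
    """Items produced on each edge per source firing."""
    return {(u, v): gains[u] * o for (u, v), (o, _) in edges.items()}

# Illustrative diamond a -> {b, c} -> d.
edges = {("a", "b"): (2, 4), ("b", "d"): (2, 1),
         ("a", "c"): (1, 1), ("c", "d"): (1, 1)}
gains = vertex_gains(edges, "a")   # b has gain 1/2; a, c, d have gain 1
```

Exact fractions are used so that the well-formedness check does not suffer from floating-point rounding.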

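[Figure: a pipeline of four modules with state sizes s1 = 60, s2 = 20, s3 = 40, and s4 = 35; edges are labelled with their input and output rates.]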

Strategy: Push items through.

Cost Per Input Item: The sum of the state sizes

Idea: Reuse the state once loaded.

Cache parameters in this example: B = 1, M = 100.

Strategy: Once loaded, execute module many times by adding large buffers between modules.

Cost Per Input Item: Total number of items produced on all channels per input item

Strategy: Partition into segments that fit in cache, and add buffers only on the cross edges C, the edges that go between partitions.

Cost Per Input Item: The total gain of the cross edges, $\sum_{e \in C} \mathrm{gain}(e)$.

Lesson: Cut small gain edges.
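
The lesson can be checked numerically. Since buffering costs one miss per item produced on a buffered channel when B = 1, a partition that buffers only its cross edges pays the sum of their gains per input item. The sketch below uses the state sizes 60, 20, 40, 35 and M = 100 from the example; the edge gains and function names are hypothetical.

```python
from fractions import Fraction

def fits_in_cache(state, component, M):
    """True if the total state of every component is at most M."""
    totals = {}
    for v, c in component.items():
        totals[c] = totals.get(c, 0) + state[v]
    return all(t <= M for t in totals.values())

def cross_edge_cost(edge_gain, component):
    """Sum of gains over edges whose endpoints lie in different components."""
    return sum(g for (u, v), g in edge_gain.items() if component[u] != component[v])

state = {1: 60, 2: 20, 3: 40, 4: 35}
# Hypothetical edge gains for the pipeline 1 -> 2 -> 3 -> 4.
edge_gain = {(1, 2): Fraction(4), (2, 3): Fraction(1, 2), (3, 4): Fraction(2)}

cut_middle = {1: "P", 2: "P", 3: "Q", 4: "Q"}   # cuts the low-gain edge (2, 3)
cut_first = {1: "P", 2: "Q", 3: "Q", 4: "Q"}    # cuts the high-gain edge (1, 2)

cost_middle = cross_edge_cost(edge_gain, cut_middle)   # 1/2
cost_first = cross_edge_cost(edge_gain, cut_first)     # 4
```

Both partitions fit in a cache of size 100, but cutting the low-gain edge is eight times cheaper per input item under these assumed gains.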

- Show that the optimal scheduler cannot do much better than the best partitioned scheduler.
- Theorem: On processing T items, if the optimal algorithm given an M-sized cache has X cache misses, then some partitioning algorithm given an O(M)-sized cache has at most O(X) cache misses.
- The number of cache misses due to a partitioned scheduler is $O\left(\frac{T}{B}\sum_{e \in C} \mathrm{gain}(e)\right)$, where C is the set of cross edges. The best partitioned scheduler should minimize $\sum_{e \in C} \mathrm{gain}(e)$.
- We must prove the matching lower bound on the optimal scheduler's cache misses.

S: segment with state size at least 2M.

e = gm(S): the edge with the minimum gain within S.

[Figure: a segment S containing modules u and v and its minimum-gain edge e.]

- u fires X times.
- Case 1: At least 1 item produced by u is processed by v.
- Cost

- Case 2: All items are buffered within S.
- The cheapest place to buffer is at e.
- Cost
- If , Cost

- In both cases, Cost/firing of u

- Divide the pipeline into segments of size between 2M and 3M.

- Source node fires T times.
- Consider the optimal scheduler with M cache.
- Number of firings of ui
- Cost due to Si per firing of ui
- Total cost due to Si
- Total Cost over all segments

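[Figure: the pipeline divided into segments S_i, each containing modules u_i and v_i and its minimum-gain edge e_i, for i = 1, ..., k.]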

- Cost of optimal scheduler with M cache
- Consider the partitioned schedule that cuts all ei.
- Each segment has size at most 6M.
- The total cost of that schedule is

- Therefore, if this partitioned schedule has constant factor memory augmentation, it provides constant-competitiveness in the number of cache misses.
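
The construction used in the proof can be sketched for a pipeline given as a list of state sizes and a list of edge gains. This is an illustration under stated assumptions: module states are at most M, so greedy accumulation stops between 2M and 3M; a leftover tail smaller than 2M is simply left uncut (the slides do not say how it is handled); and all function names are hypothetical.

```python
def segments(state, M):
    """Greedily split module indices into consecutive segments of state >= 2M.

    Returns the full segments and the leftover tail (total state < 2M).
    """
    segs, cur, total = [], [], 0
    for k, s in enumerate(state):
        assert s <= M, "each module's state is assumed to be at most M"
        cur.append(k)
        total += s
        if total >= 2 * M:          # since s <= M, total is below 3M here
            segs.append(cur)
            cur, total = [], 0
    return segs, cur

def min_gain_cuts(state, gain, M):
    """For each segment, pick the internal edge of minimum gain.

    Edge k joins module k and module k + 1 and has gain gain[k].
    """
    segs, _ = segments(state, M)
    cuts = []
    for seg in segs:
        internal = range(seg[0], seg[-1])
        cuts.append(min(internal, key=lambda k: gain[k]))
    return cuts

def piece_sizes(state, cuts):
    """Total state of each piece after removing the cut edges."""
    sizes, total = [], 0
    for k, s in enumerate(state):
        total += s
        if k in cuts:
            sizes.append(total)
            total = 0
    sizes.append(total)
    return sizes

M = 10
state = [10, 10, 5, 5, 10, 3, 10, 10, 4]
gain = [3, 9, 5, 1, 9, 2, 4, 9]          # illustrative edge gains
cuts = min_gain_cuts(state, gain, M)     # [0, 3, 5]
sizes = piece_sizes(state, cuts)         # [10, 20, 13, 24], all at most 6M
```

Each piece spans at most parts of two adjacent segments, which is why its size stays within 6M.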

- Say we partition a DAG such that
- Each component has size at most O(M).
- When contracted, the components form a dag.
- If C is the set of cross edges, $\sum_{e \in C} \mathrm{gain}(e)$ is minimized over all such partitions.

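- The optimal schedule has cost/item $\Omega\left(\frac{1}{B}\sum_{e \in C} \mathrm{gain}(e)\right)$.

- Given constant factor memory augmentation, a partitioned schedule has cost/item $O\left(\frac{1}{B}\sum_{e \in C} \mathrm{gain}(e)\right)$.

- Lower Bound: The optimal algorithm has cost $\Omega\left(\frac{T}{B}\sum_{e \in C} \mathrm{gain}(e)\right)$ on T items, where C is the set of cross edges of the best partition.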

- Upper Bound: With constant factor memory augmentation:
- Pipelines: Upper bound matches the lower bound.
- DAGs: Upper bound matches the lower bound as long as each component of the partition has O(M/B) incident cross edges.

- For pipelines, we can find a good-enough partition greedily and the best partition using dynamic programming.
- For general DAGs, finding the best partition is NP-complete.
- Our proof is approximation-preserving: an approximation algorithm for the partitioning problem also works for our problem.

- We can reduce the problem of minimizing cache misses to the problem of calculating the best partition.
- Solving the partitioning problem:
- Approximation algorithms.
- Exact solution for special cases such as SP-DAGs.

- Space bounds: Bound the buffer sizes on cross edges.
- Cache-conscious scheduling for multicores.

Deadlock Avoidance for Streaming Computations with Filtering

- Goal: Devise mechanisms to avoid deadlocks on applications with filtering and finite buffers.

with Peng Li, Jeremy Buhler, and Roger D. Chamberlain

- Data dependent filtering: The number of items produced depends on the data.
- When a node fires, it
- has a compute index (CI), which increases monotonically,
- consumes/produces 0 or 1 items from input/output channels,
- and every item it consumes or produces has index equal to CI.

- A node cannot proceed until it is sure that it has received all items of its current CI.
- Channels can have unbounded delays.
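
One way to read the waiting rule is sketched below, under the assumption that indices on each channel arrive in increasing order. A node then knows that no item with a given index is still on its way once every input channel has a head item with that index or a larger one. The function names are hypothetical and each channel is modelled as a queue of item indices.

```python
from collections import deque

def ready_index(inputs):
    """Return the compute index that can safely be processed, or None to wait."""
    if any(len(ch) == 0 for ch in inputs):
        return None   # an empty channel might still deliver a smaller index
    return min(ch[0] for ch in inputs)

def fire_node(inputs):
    """Process one compute index, consuming 0 or 1 items from each channel."""
    ci = ready_index(inputs)
    if ci is None:
        return None
    consumed = []
    for k, ch in enumerate(inputs):
        if ch[0] == ci:          # this channel carries an item for index ci
            ch.popleft()
            consumed.append(k)
    return ci, consumed

# The second channel had index 1 filtered upstream.
inputs = [deque([1, 2, 3]), deque([2])]
first = fire_node(inputs)    # (1, [0]): index 1 arrives on channel 0 only
second = fire_node(inputs)   # (2, [0, 1])
third = fire_node(inputs)    # None: channel 1 is empty, so the node must wait
```

The last call shows the hazard: the node holds an item with index 3 but cannot use it until the other channel delivers something, which may never happen if upstream items are filtered.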

[Figure: nodes U, A, B, C, X, and Y connected by channels carrying items with indices 1 to 3. Legend entries: "Compute index" and "An item with index 1".]

Filtering can cause deadlocks due to finite buffers.

[Figure: a graph on nodes u, v, w, and x in which two channels are full and two are empty.]

- A deadlock example (channel buffer size is 3).

- Deadlock avoidance mechanism using dummy or heartbeat messages sent at regular intervals
- Provably correct --- guarantees deadlock freedom.
- No global synchronization.
- No dynamic buffer resizing.

- Efficient algorithms to compute dummy intervals for structured DAGs such as series-parallel DAGs and CS4 DAGs

- Filtering Theorem
- If no node ever filters any token, then the system cannot deadlock.

- The Naïve Algorithm
- Sends a dummy on every filtered item.
- Changes a filtering system to a non-filtering system.
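
A minimal sketch of the naive rule follows. The function name and the `keep` predicate are hypothetical; the predicate stands in for whatever data-dependent filter the node applies.

```python
def naive_outputs(indices, keep):
    """Emit one message per input index: a token if kept, a dummy if filtered."""
    return [("token" if keep(i) else "dummy", i) for i in indices]

# A node that keeps only every third item.
out = naive_outputs(range(1, 7), keep=lambda i: i % 3 == 0)
dummies = sum(1 for kind, _ in out if kind == "dummy")
```

Every index appears on the output channel, so downstream nodes never wait on a filtered index, but in this example four of the six messages are dummies.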

[Figure: nodes u, A, and X exchanging tokens and dummies. Legend: a token with index 1; a dummy with index 1.]

- Pros
- Easy to schedule dummy items

- Cons
- Doesn’t utilize channel buffer sizes.
- Sends many unnecessary dummy items, wasting both computation and bandwidth.

- Next step: reduce the number of dummy items.

- Computes a static dummy schedule.
- Sends dummies periodically based on dummy intervals.
- Dummy items must be propagated to all downstream nodes.
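
The periodic rule on a single channel can be sketched as follows, using the test "compute index minus index of last dummy >= dummy interval". The function name is hypothetical, the sketch covers only when dummies are sent on one channel, and it does not model how received dummies are forwarded downstream.

```python
import math

def periodic_dummy_indices(n, interval, last_dummy=0):
    """Compute indices in 1..n at which a dummy is sent on one channel."""
    sent = []
    for ci in range(1, n + 1):
        if ci - last_dummy >= interval:   # e.g. 6 - 0 >= 6: send a dummy
            sent.append(ci)
            last_dummy = ci
    return sent

every_six = periodic_dummy_indices(20, interval=6)      # [6, 12, 18]
never = periodic_dummy_indices(20, interval=math.inf)   # []
```

An infinite interval means the channel never needs a dummy, which is how a static schedule can leave some channels untouched.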

[Figure: a graph on nodes u, v, w, and x with each channel labelled by its channel buffer size and dummy interval. Worked example: compute index 6, index of last dummy 0; since 6 - 0 >= 6, a dummy is sent.]

- Pros
- Takes advantage of channel buffer sizes.
- Greatly reduces the number of dummy items compared to the Naïve Algorithm.

- Cons
- Does not utilize filtering history.
- Dummy items must be propagated.

- Next step, eliminate propagation
- Use shorter dummy intervals.
- Use filtering history for dummy scheduling.

- Send dummy items based on filtering history
- Dummy items do not propagate.
- If (index of filtered item - index of previous token/dummy) >= dummy interval, send a dummy.
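
The history-based rule can be sketched for one output channel as follows. The function name and the `keep` predicate are hypothetical, and the index of the previous token/dummy is assumed to start at 0.

```python
def non_propagating_outputs(indices, keep, interval):
    """Messages sent on one channel when dummies depend on filtering history."""
    out, last = [], 0   # last: index of the previous token or dummy
    for idx in indices:
        if keep(idx):
            out.append(("token", idx))
            last = idx
        elif idx - last >= interval:      # e.g. 3 - 0 >= 3: send a dummy
            out.append(("dummy", idx))
            last = idx
    return out

# Dummy interval 3; only the item with index 4 survives filtering.
out = non_propagating_outputs(range(1, 8), keep=lambda i: i == 4, interval=3)
```

A real token resets the counter just as a dummy does, so a channel that filters rarely sends few dummies, unlike a purely periodic schedule.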

[Figure: the same graph with shorter dummy intervals. Worked example: an item is filtered at current index 3 and the index of the last token/dummy is 0; since 3 - 0 >= 3, a dummy is sent.]

- Performance measurement
- # of dummies sent
- Fewer dummies are better

- Non-Propagation Algorithm is expected to be the best in most cases
- Experimental data
- Mercury BLASTN (biological app.)
- 787 billion input elements

- Exponential time algorithms for general DAGs, since we have to enumerate cycles.
- Can we do better for structured DAGs?
- Yes.
- Polynomial time algorithms for SP DAGs
- Polynomial time algorithms for CS4 DAGs --- a class of DAGs where every undirected cycle has a single source and a single sink.

- Designed efficient deadlock-avoidance algorithms using dummy messages.
- Find polynomial-time algorithms to compute dummy intervals for general DAGs.
- Consider general models: allowing multiple outputs from one input and feedback loops.
- The reverse problem: computing efficient buffer sizes from dummy intervals.