Presentation Transcript

  1. Stream Scheduling – MaxAcademy Lecture Series, V1.0, September 2011

  2. Overview • Latencies in stream computing • Scheduling algorithms • Stream offsets

  3. Latencies in Stream Computing • Consider a simple arithmetic pipeline computing (A + B) + C • Each operation has a latency: the number of cycles from input to output (may be zero) • Throughput is still 1 value per cycle; L values can be in-flight in the pipeline

  4. [Diagram] Basic hardware implementation: Inputs A, B and C feed two chained adders (+) to produce the Output

  5. [Diagram] Values 1, 2 and 3 enter the inputs: data propagates through the circuit in “lock step”

  6. [Diagram] The values advance through the adders, one stage per cycle

  7. [Diagram] Data arrives at the wrong time due to pipeline latency: value 3 reaches the second adder before the first adder’s result (marked X)

  8. [Diagram] Insert buffering on Input C’s path to correct the timing

  9. [Diagram] The circuit now with buffering; values 1, 2 and 3 enter again

  10. [Diagram] Values 1 and 2 reach the first adder while 3 is held in the buffer

  11. [Diagram] The first adder’s result (3) emerges as the buffered 3 is released

  12. [Diagram] Both values (3 and 3) arrive at the second adder together

  13. [Diagram] The second adder produces 6

  14. [Diagram] Success! The correct result, 6, appears at the Output

  15. Stream Scheduling Algorithms • A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations • Can be applied automatically to large dataflow graphs (many thousands of nodes) • Can try to optimize for various metrics: • Latency from inputs to outputs • Amount of buffering inserted (generally the most interesting) • Area (resource sharing)

  16. ASAP – As Soon As Possible

  17. [Diagram] Inputs A, B and C are placed, each at latency 0: build up the circuit incrementally, keeping track of latencies

  18. [Diagram] An adder combining Inputs A and B is placed; its output is at latency 1

  19. [Diagram] A second adder is added, but its input latencies are mismatched (1 from the first adder, 0 from Input C)

  20. [Diagram] Insert buffering on Input C’s edge so both of the second adder’s inputs arrive at latency 1; its output is at latency 2

  21. [Diagram] The Output is attached, completing the ASAP schedule
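The incremental build-up on slides 17–21 can be sketched as a short pass over the graph in topological order (illustrative pseudo-Python, not the MaxCompiler API; `asap_schedule` and its arguments are hypothetical names):

```python
# ASAP sketch: fire each node as soon as its latest input arrives, and pad
# the earlier edges with buffers so all inputs line up (illustrative only).

def asap_schedule(nodes, latency, preds):
    """nodes: topological order; latency: node -> cycles;
    preds: node -> list of predecessor nodes."""
    ready = {}      # cycle at which each node's output is available
    buffering = 0   # total buffer stages inserted
    for n in nodes:
        if not preds[n]:                   # input nodes start at cycle 0
            ready[n] = latency[n]
            continue
        arrive = [ready[p] for p in preds[n]]
        start = max(arrive)                # wait for the latest input
        buffering += sum(start - a for a in arrive)   # buffer earlier edges
        ready[n] = start + latency[n]
    return ready, buffering
```

On the (A + B) + C example with 1-cycle adders this places the second adder's output at cycle 2 with one buffer stage on Input C's edge, matching slide 21.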

  22. ALAP – As Late As Possible

  23. [Diagram] Start at the Output, at latency 0

  24. [Diagram] Working backwards, latencies are negative relative to the end of the circuit: the final adder’s inputs are at -1

  25. [Diagram] The adders and Input C are placed: the first adder’s inputs at -2, the second adder’s inputs at -1, Input C at -1

  26. [Diagram] Inputs A and B are placed at -2, completing the schedule

  27. [Diagram] Buffering is saved: every edge’s latencies already match, so no buffers are needed
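The backward pass of slides 23–27 admits the same kind of sketch (again illustrative pseudo-Python with hypothetical names): times are negative relative to the outputs at cycle 0, and each node is placed as late as its earliest consumer allows.

```python
# ALAP sketch: walk the graph in reverse topological order, placing each
# node as late as possible relative to the outputs at cycle 0.

def alap_schedule(nodes_rev, latency, succs):
    """nodes_rev: reverse topological order; succs: node -> consumer nodes."""
    due = {}        # cycle at which each node must produce its output
    buffering = 0
    for n in nodes_rev:
        if not succs[n]:               # output nodes sit at cycle 0
            due[n] = 0
            continue
        # consumer s needs this value at due[s] - latency[s]
        needs = [due[s] - latency[s] for s in succs[n]]
        due[n] = min(needs)            # as late as the earliest consumer allows
        buffering += sum(t - due[n] for t in needs)   # delay for later consumers
    return due, buffering
```

For the single-output (A + B) + C design this reproduces slide 26: Inputs A and B at -2, Input C at -1, and no buffering inserted.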

  28. [Diagram] Sometimes this is suboptimal: what if we add an extra output, with Output 1 taken from the first adder and Output 2 from the second?

  29. [Diagram] Unnecessary buffering is added on the path to Output 1: neither ASAP nor ALAP can schedule this design optimally

  30. Optimal Scheduling • ASAP and ALAP both fix either the inputs or the outputs in place • More complex scheduling algorithms may be able to find a better schedule, e.g. using ILP (integer linear programming)
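As a concrete illustration, here is a brute-force stand-in for the ILP the slide mentions, workable only for tiny graphs (all names are hypothetical): for each edge (u, v) a feasible schedule needs t[v] >= t[u] + latency[u], the slack on the edge is the buffering inserted there, and the objective is to minimise total slack.

```python
from itertools import product

# Brute-force minimum-buffering search over small integer schedules.
# A real implementation would hand the same constraints to an ILP solver.
def min_buffering(nodes, edges, latency, horizon=4):
    best = None
    for times in product(range(horizon + 1), repeat=len(nodes)):
        t = dict(zip(nodes, times))
        slack = [t[v] - (t[u] + latency[u]) for u, v in edges]
        if all(s >= 0 for s in slack):        # schedule is feasible
            total = sum(slack)
            best = total if best is None else min(best, total)
    return best

# The two-output design of slides 28-29: ASAP and ALAP each insert a buffer,
# but letting both inputs and outputs float finds a schedule with none.
nodes = ["A", "B", "C", "add1", "add2", "out1", "out2"]
edges = [("A", "add1"), ("B", "add1"), ("add1", "add2"),
         ("C", "add2"), ("add1", "out1"), ("add2", "out2")]
latency = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out1": 0, "out2": 0}
```

Here `min_buffering(nodes, edges, latency)` finds a zero-buffer schedule, e.g. by scheduling Input C one cycle after Inputs A and B.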

  31. Buffering data on-chip • Consider: a[i] = a[i] + (a[i - 1] + b[i - 1]) • We can see that we might need some explicit buffering to hold more than one data element on-chip • We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))
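In software terms the explicit-buffer version can be pictured like this (an illustrative analogy, not MaxCompiler code; zero-initialised buffers are an assumption the slide does not specify):

```python
# Software analogy of a = a + (buffer(a, 1) + buffer(b, 1)):
# each buffer(s, 1) holds the previous value of its stream.
def explicit_buffers(a, b):
    out = []
    prev_a = prev_b = 0            # buffer contents (assumed zero-initialised)
    for x, y in zip(a, b):
        out.append(x + (prev_a + prev_b))   # a[i] + (a[i-1] + b[i-1])
        prev_a, prev_b = x, y               # shift the buffers
    return out
```

For example, `explicit_buffers([1, 2, 3], [10, 20, 30])` yields `[1, 13, 25]`.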

  32. [Diagram] Inputs A and B each feed a Buffer(1) element into the adders; the buffer has zero latency in the schedule

  33. [Diagram] This will schedule with the inputs and buffers at 0 and the adders’ outputs at 1 and 2: Buffering = 3

  34. Buffers and Latency • Accessing previous values with buffers is looking backwards in the stream • This is equivalent to having a wire with negative latency • Cannot be implemented directly, but can affect the schedule

  35. [Diagram] Offset wires can have negative latency: each Offset(-1) output is at -1, so the first adder’s output is at 0 and the second’s at 1

  36. [Diagram] This is scheduled with no buffers inserted: Buffering = 0

  37. Stream Offsets • A stream offset is just a wire with a positive or negative latency • Negative latencies look backwards in the stream • Positive latencies look forwards in the stream • The entire dataflow graph will re-schedule to make sure the right data value is present when needed • Buffering could be placed anywhere, or pushed into inputs or outputs (more efficient than manual instantiation)

  38. [Diagram] a[i] = a[i] + a[i + 1], written a = a + stream.offset(a, +1): Input A feeds both an Offset(1) and the adder driving the Output

  39. [Diagram] Scheduling produces a circuit with 1 buffer: Input A at 0, the Offset(1) output at 1, and the adder’s output at 2
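In software terms (an illustrative analogy, not MaxCompiler code), computing a[i] + a[i + 1] means holding each element until its successor arrives; that single held element corresponds to the one buffer the scheduler inserts:

```python
# Software analogy of a = a + stream.offset(a, +1):
# hold one element of the stream until the next one arrives.
def add_forward_offset(stream):
    out = []
    held = None                     # the single buffered element
    for x in stream:
        if held is not None:
            out.append(held + x)    # a[i] + a[i+1]
        held = x                    # keep a[i] for the next iteration
    return out                      # one element shorter than the input
```

For example, `add_forward_offset([1, 2, 3, 4])` returns `[3, 5, 7]`.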

  40. Exercises For the questions below, assume that the latency of an addition operation is 10 cycles and a multiply takes 5 cycles, while inputs/outputs take 0 cycles. • Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph • Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling for: (a) c = ((a1 + a2) + a3) + a4 (b) c = (a1 + a2) + (a3 + a4) • Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule: (a) c = ((a1 * a2) + (a3 * a4)) + a1 (b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4. How many values of stream a1 will be buffered on-chip for (b)?