Presentation Transcript

  1. Stream Scheduling – MaxAcademy Lecture Series, V1.0, September 2011

  2. Overview • Latencies in stream computing • Scheduling algorithms • Stream offsets

  3. Latencies in Stream Computing • Consider a simple arithmetic pipeline computing (A + B) + C • Each operation has a latency: the number of cycles from input to output (may be zero) • Throughput is still 1 value per cycle; L values can be in-flight in the pipeline

  4. [Diagram] Basic hardware implementation: Inputs A, B and C feed two chained adders (+) to produce the Output

  5. [Diagram] Values 1, 2 and 3 enter the inputs: data propagates through the circuit in “lock step”

  6. [Diagram] The values advance through the adders, one stage per cycle

  7. [Diagram] Data arrives at the wrong time due to pipeline latency: value 3 reaches the second adder before the first adder’s result (marked X)

  8. [Diagram] Insert buffering on Input C’s path to correct the timing

  9. [Diagram] The circuit now with buffering; values 1, 2 and 3 enter again

  10. [Diagram] Values 1 and 2 reach the first adder while 3 is held in the buffer

  11. [Diagram] The first adder’s result (3) emerges as the buffered 3 is released

  12. [Diagram] Both values (3 and 3) arrive at the second adder together

  13. [Diagram] The second adder produces 6

  14. [Diagram] Success! The correct result, 6, appears at the Output

  15. Stream Scheduling Algorithms • A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations • Can be applied automatically to large dataflow graphs (many thousands of nodes) • Can try to optimize for various metrics: • Latency from inputs to outputs • Amount of buffering inserted (generally the most interesting) • Area (resource sharing)

  16. ASAP – As Soon As Possible

  17. [Diagram] Inputs A, B and C are placed, each at latency 0: build up the circuit incrementally, keeping track of latencies

  18. [Diagram] An adder combining Inputs A and B is placed; its output is at latency 1

  19. [Diagram] A second adder is added, but its input latencies are mismatched (1 from the first adder, 0 from Input C)

  20. [Diagram] Insert buffering on Input C’s edge so both of the second adder’s inputs arrive at latency 1; its output is at latency 2

  21. [Diagram] The Output is attached, completing the ASAP schedule
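The incremental build-up on slides 17–21 can be sketched as a short pass over the graph in topological order (illustrative pseudo-Python, not the MaxCompiler API; `asap_schedule` and its arguments are hypothetical names):

```python
# ASAP sketch: fire each node as soon as its latest input arrives, and pad
# the earlier edges with buffers so all inputs line up (illustrative only).

def asap_schedule(nodes, latency, preds):
    """nodes: topological order; latency: node -> cycles;
    preds: node -> list of predecessor nodes."""
    ready = {}      # cycle at which each node's output is available
    buffering = 0   # total buffer stages inserted
    for n in nodes:
        if not preds[n]:                   # input nodes start at cycle 0
            ready[n] = latency[n]
            continue
        arrive = [ready[p] for p in preds[n]]
        start = max(arrive)                # wait for the latest input
        buffering += sum(start - a for a in arrive)   # buffer earlier edges
        ready[n] = start + latency[n]
    return ready, buffering
```

On the (A + B) + C example with 1-cycle adders this places the second adder's output at cycle 2 with one buffer stage on Input C's edge, matching slide 21.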

  22. ALAP – As Late As Possible

  23. [Diagram] Start at the Output, at latency 0

  24. [Diagram] Working backwards, latencies are negative relative to the end of the circuit: the final adder’s inputs are at -1

  25. [Diagram] The adders and Input C are placed: the first adder’s inputs at -2, the second adder’s inputs at -1, Input C at -1

  26. [Diagram] Inputs A and B are placed at -2, completing the schedule

  27. [Diagram] Buffering is saved: every edge’s latencies already match, so no buffers are needed
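The backward pass of slides 23–27 admits the same kind of sketch (again illustrative pseudo-Python with hypothetical names): times are negative relative to the outputs at cycle 0, and each node is placed as late as its earliest consumer allows.

```python
# ALAP sketch: walk the graph in reverse topological order, placing each
# node as late as possible relative to the outputs at cycle 0.

def alap_schedule(nodes_rev, latency, succs):
    """nodes_rev: reverse topological order; succs: node -> consumer nodes."""
    due = {}        # cycle at which each node must produce its output
    buffering = 0
    for n in nodes_rev:
        if not succs[n]:               # output nodes sit at cycle 0
            due[n] = 0
            continue
        # consumer s needs this value at due[s] - latency[s]
        needs = [due[s] - latency[s] for s in succs[n]]
        due[n] = min(needs)            # as late as the earliest consumer allows
        buffering += sum(t - due[n] for t in needs)   # delay for later consumers
    return due, buffering
```

For the single-output (A + B) + C design this reproduces slide 26: Inputs A and B at -2, Input C at -1, and no buffering inserted.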

  28. [Diagram] Sometimes this is suboptimal: what if we add an extra output, with Output 1 taken from the first adder and Output 2 from the second?

  29. [Diagram] Unnecessary buffering is added on the path to Output 1: neither ASAP nor ALAP can schedule this design optimally

  30. Optimal Scheduling • ASAP and ALAP both fix either the inputs or the outputs in place • More complex scheduling algorithms may be able to find a better schedule, e.g. using ILP (integer linear programming)
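As a concrete illustration, here is a brute-force stand-in for the ILP the slide mentions, workable only for tiny graphs (all names are hypothetical): for each edge (u, v) a feasible schedule needs t[v] >= t[u] + latency[u], the slack on the edge is the buffering inserted there, and the objective is to minimise total slack.

```python
from itertools import product

# Brute-force minimum-buffering search over small integer schedules.
# A real implementation would hand the same constraints to an ILP solver.
def min_buffering(nodes, edges, latency, horizon=4):
    best = None
    for times in product(range(horizon + 1), repeat=len(nodes)):
        t = dict(zip(nodes, times))
        slack = [t[v] - (t[u] + latency[u]) for u, v in edges]
        if all(s >= 0 for s in slack):        # schedule is feasible
            total = sum(slack)
            best = total if best is None else min(best, total)
    return best

# The two-output design of slides 28-29: ASAP and ALAP each insert a buffer,
# but letting both inputs and outputs float finds a schedule with none.
nodes = ["A", "B", "C", "add1", "add2", "out1", "out2"]
edges = [("A", "add1"), ("B", "add1"), ("add1", "add2"),
         ("C", "add2"), ("add1", "out1"), ("add2", "out2")]
latency = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out1": 0, "out2": 0}
```

Here `min_buffering(nodes, edges, latency)` finds a zero-buffer schedule, e.g. by scheduling Input C one cycle after Inputs A and B.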

  31. Buffering data on-chip • Consider: a[i] = a[i] + (a[i - 1] + b[i - 1]) • We can see that we might need some explicit buffering to hold more than one data element on-chip • We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))
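In software terms the explicit-buffer version can be pictured like this (an illustrative analogy, not MaxCompiler code; zero-initialised buffers are an assumption the slide does not specify):

```python
# Software analogy of a = a + (buffer(a, 1) + buffer(b, 1)):
# each buffer(s, 1) holds the previous value of its stream.
def explicit_buffers(a, b):
    out = []
    prev_a = prev_b = 0            # buffer contents (assumed zero-initialised)
    for x, y in zip(a, b):
        out.append(x + (prev_a + prev_b))   # a[i] + (a[i-1] + b[i-1])
        prev_a, prev_b = x, y               # shift the buffers
    return out
```

For example, `explicit_buffers([1, 2, 3], [10, 20, 30])` yields `[1, 13, 25]`.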

  32. [Diagram] Inputs A and B each feed a Buffer(1) element into the adders; the buffer has zero latency in the schedule

  33. [Diagram] This will schedule with the inputs and buffers at 0 and the adders’ outputs at 1 and 2: Buffering = 3

  34. Buffers and Latency • Accessing previous values with buffers is looking backwards in the stream • This is equivalent to having a wire with negative latency • Cannot be implemented directly, but can affect the schedule

  35. [Diagram] Offset wires can have negative latency: each Offset(-1) output is at -1, so the first adder’s output is at 0 and the second’s at 1

  36. [Diagram] This is scheduled with no buffers inserted: Buffering = 0

  37. Stream Offsets • A stream offset is just a wire with a positive or negative latency • Negative latencies look backwards in the stream • Positive latencies look forwards in the stream • The entire dataflow graph will re-schedule to make sure the right data value is present when needed • Buffering could be placed anywhere, or pushed into inputs or outputs (more efficient than manual instantiation)

  38. [Diagram] a[i] = a[i] + a[i + 1], written a = a + stream.offset(a, +1): Input A feeds both an Offset(1) and the adder driving the Output

  39. [Diagram] Scheduling produces a circuit with 1 buffer: Input A at 0, the Offset(1) output at 1, and the adder’s output at 2
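In software terms (an illustrative analogy, not MaxCompiler code), computing a[i] + a[i + 1] means holding each element until its successor arrives; that single held element corresponds to the one buffer the scheduler inserts:

```python
# Software analogy of a = a + stream.offset(a, +1):
# hold one element of the stream until the next one arrives.
def add_forward_offset(stream):
    out = []
    held = None                     # the single buffered element
    for x in stream:
        if held is not None:
            out.append(held + x)    # a[i] + a[i+1]
        held = x                    # keep a[i] for the next iteration
    return out                      # one element shorter than the input
```

For example, `add_forward_offset([1, 2, 3, 4])` returns `[3, 5, 7]`.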

  40. Exercises For the questions below, assume that the latency of an addition operation is 10 cycles and a multiply takes 5 cycles, while inputs/outputs take 0 cycles. • Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph • Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling for: (a) c = ((a1 + a2) + a3) + a4 (b) c = (a1 + a2) + (a3 + a4) • Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule: (a) c = ((a1 * a2) + (a3 * a4)) + a1 (b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4. How many values of stream a1 will be buffered on-chip for (b)?