##### Presentation Transcript

1. Stream Scheduling (MaxAcademy Lecture Series, V1.0, September 2011)

2. Overview
• Latencies in stream computing
• Scheduling algorithms
• Stream offsets

3. Latencies in Stream Computing
• Consider a simple arithmetic pipeline computing (A + B) + C
• Each operation has a latency: the number of cycles from input to output (may be zero)
• Throughput is still 1 value per cycle; L values can be in-flight in the pipeline

4. Basic hardware implementation [diagram: inputs A, B and C feed two chained adders to the output]

5. Data propagates through the circuit in "lock step" [values 1, 2, 3 enter at inputs A, B, C]

6. [the values advance one cycle through the circuit]

7. Data arrives at the wrong time due to pipeline latency [C's value reaches the second adder a cycle before A + B does]

8. Insert buffering to correct [a one-cycle buffer is added on input C]

9. Now with buffering [values 1, 2, 3 enter at inputs A, B, C again]

10. [the values advance one cycle]

11. [1 + 2 = 3 leaves the first adder just as C's value 3 leaves the buffer]

12. [both 3s arrive at the second adder together]

13. [3 + 3 = 6 propagates to the output]

14. Success! [the output is 6, as expected]
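The walkthrough above can be checked with a small cycle-accurate simulation. This is an illustrative Python sketch, not MaxCompiler code: the `simulate` helper and its register model are made up for the example, assuming each adder has a latency of one cycle as in the diagrams.

```python
def simulate(a, b, c, buffer_c):
    """Feed streams a, b, c into the (A + B) + C pipeline in lock step.
    Each adder is modelled as a 1-cycle register; buffer_c adds the
    one-cycle buffer on input C from the slides. Returns the output stream."""
    add1 = add2 = c_buf = 0   # pipeline registers start empty (0)
    outputs = []
    for av, bv, cv in zip(a, b, c):
        # Registers update simultaneously at the clock edge: each new
        # value is computed from the *previous* register contents.
        new_add2 = add1 + (c_buf if buffer_c else cv)
        new_add1 = av + bv
        add1, add2, c_buf = new_add1, new_add2, cv
        outputs.append(add2)
    return outputs

# One data set (A=1, B=2, C=3) followed by zero padding to flush the pipe.
a, b, c = [1, 0, 0], [2, 0, 0], [3, 0, 0]
print(simulate(a, b, c, buffer_c=False))  # [3, 3, 0] -- wrong: C met stale data
print(simulate(a, b, c, buffer_c=True))   # [0, 6, 0] -- correct: 6 after 2 cycles
```

Without the buffer, C's value meets the first adder's not-yet-ready result, exactly as on slide 7; with it, both operands reach the second adder in the same cycle.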

15. Stream Scheduling Algorithms
• A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
• Can be applied automatically to a large dataflow graph (many thousands of nodes)
• Can try to optimize for various metrics:
• Latency from inputs to outputs
• Amount of buffering inserted (generally the most interesting)
• Area (resource sharing)

16. ASAP: As Soon As Possible

17. Build up the circuit incrementally, keeping track of latencies [inputs A, B and C all start at latency 0]

18. [the first adder is added; its output has latency 1]

19. Input latencies are mismatched [the second adder would combine the first adder's output (latency 1) with input C (latency 0)]

20. Insert buffering [a one-cycle buffer on input C brings it to latency 1; the second adder's output has latency 2]

21. [the output is connected at latency 2]
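The ASAP walkthrough amounts to a single forward pass over the dataflow graph in topological order: each node is scheduled as soon as its latest input is ready, and earlier-arriving inputs are padded with buffers. A minimal Python sketch, with made-up node names and adder latency 1 as in the diagrams:

```python
def asap_schedule(nodes, preds, latency):
    """nodes: topologically ordered node names; preds: node -> list of
    predecessor nodes; latency: node -> cycles of delay the node adds.
    Returns (arrival_time, buffering), where buffering counts the
    registers inserted to align mismatched input arrival times."""
    arrival = {}
    buffering = 0
    for n in nodes:
        ps = preds.get(n, [])
        if not ps:
            arrival[n] = 0                        # inputs start at time 0
        else:
            ready = max(arrival[p] for p in ps)   # wait for the latest input
            # each earlier-arriving input needs (ready - arrival[p]) buffers
            buffering += sum(ready - arrival[p] for p in ps)
            arrival[n] = ready + latency[n]
    return arrival, buffering

# The (A + B) + C example, assuming each adder takes 1 cycle:
nodes = ["A", "B", "C", "add1", "add2", "out"]
preds = {"add1": ["A", "B"], "add2": ["add1", "C"], "out": ["add2"]}
lat = {"add1": 1, "add2": 1, "out": 0}
times, bufs = asap_schedule(nodes, preds, lat)
print(times["add2"], bufs)   # 2 1 -- output ready at 2; 1 buffer on input C
```

This reproduces slides 17-21: the second adder's output lands at latency 2 with one buffer inserted on input C.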

22. ALAP: As Late As Possible

23. Start at the output [the output is fixed at latency 0]

24. Latencies are negative relative to the end of the circuit [the second adder's inputs are required at -1]

25. [the first adder's inputs are required at -2; input C connects at -1]

26. [inputs A and B connect at -2]

27. Buffering is saved [every input connects exactly when its value is required, so no buffers are needed]
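ALAP is the mirror image: a backward pass from the outputs, fixed at time 0, so every other node gets a negative time relative to the end of the circuit as on the slides. An illustrative Python sketch (node names and the `alap_schedule` helper are made up for the example):

```python
def alap_schedule(nodes, succs, latency):
    """nodes: topological order; succs: node -> list of consumer nodes;
    latency: node -> cycles. Returns (required_time, buffering)."""
    required = {}
    buffering = 0
    for n in reversed(nodes):                   # reverse topological order
        cons = succs.get(n, [])
        if not cons:
            required[n] = 0                     # outputs fixed at time 0
        else:
            # a consumer c scheduled at required[c] reads its inputs
            # at required[c] - latency[c]
            deadlines = [required[c] - latency[c] for c in cons]
            required[n] = min(deadlines)        # be ready for earliest reader
            # later deadlines need buffering on those edges
            buffering += sum(d - required[n] for d in deadlines)
    return required, buffering

# The (A + B) + C example again, adders taking 1 cycle:
nodes = ["A", "B", "C", "add1", "add2", "out"]
succs = {"A": ["add1"], "B": ["add1"], "C": ["add2"],
         "add1": ["add2"], "add2": ["out"]}
lat = {"add1": 1, "add2": 1, "out": 0}
req, bufs = alap_schedule(nodes, succs, lat)
print(req["C"], bufs)    # -1 0 -- C connects at -1; no buffering needed
```

As on slide 27, input C simply connects one cycle later than A and B, so the buffer ASAP would have inserted disappears.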

28. Sometimes this is suboptimal: what if we add an extra output? [Output 2 is taken directly from the first adder, alongside Output 1 from the second adder]

29. Unnecessary buffering is added [with both outputs fixed at 0, the first adder's result must be buffered before reaching Output 2]. Neither ASAP nor ALAP can schedule this design optimally.

30. Optimal Scheduling
• ASAP and ALAP both fix either the inputs or the outputs in place
• More complex scheduling algorithms can find better schedules, e.g. by formulating the problem as an integer linear program (ILP)
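The optimization can be phrased as: choose a time for every node so that, on each edge, the consumer reads no earlier than the producer writes, minimizing the total slack (which is exactly the buffering). Real schedulers solve this as an ILP; for the toy two-output example a brute-force search over node times is enough. A sketch, with illustrative node names and adder latency 1:

```python
from itertools import product

def total_buffering(times, edges, latency):
    """edges: list of (producer, consumer). A consumer v scheduled at
    times[v] with latency L reads its inputs at times[v] - L; the slack
    against the producer's time is the buffering on that edge."""
    total = 0
    for u, v in edges:
        slack = (times[v] - latency[v]) - times[u]
        if slack < 0:
            return None          # infeasible: data would be needed too early
        total += slack
    return total

# The two-output example from the slides, adders with latency 1:
nodes = ["A", "B", "C", "add1", "add2", "out1", "out2"]
edges = [("A", "add1"), ("B", "add1"), ("C", "add2"),
         ("add1", "add2"), ("add1", "out2"), ("add2", "out1")]
lat = {n: 0 for n in ["A", "B", "C", "out1", "out2"]}
lat.update({"add1": 1, "add2": 1})

feasible = []
for ts in product(range(4), repeat=len(nodes)):
    b = total_buffering(dict(zip(nodes, ts)), edges, lat)
    if b is not None:
        feasible.append(b)
best_buf = min(feasible)
print(best_buf)   # 0 -- e.g. schedule input C one cycle later than A and B
```

With all nodes free to float, this design schedules with zero buffering (C starts one cycle after A and B), whereas pinning the inputs (ASAP) or the outputs (ALAP) each forces one buffer.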

31. Buffering data on-chip
• Consider: a[i] = a[i] + (a[i - 1] + b[i - 1])
• We can see that we might need some explicit buffering to hold more than one data element on-chip
• We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))
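At the stream level, buffer(x, n) simply yields the value from n elements earlier in the stream. A small Python sketch of the semantics (the `buffer` helper and zero-padding before the stream start are illustrative assumptions, not MaxCompiler behaviour):

```python
def buffer(xs, n, fill=0):
    """Return the stream xs delayed by n elements, padding the first
    n positions with `fill` (there is no earlier data yet)."""
    return [fill] * n + xs[:len(xs) - n]

# a[i] = a[i] + (a[i - 1] + b[i - 1]) written with explicit buffers:
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [ai + (a_prev + b_prev)
       for ai, a_prev, b_prev in zip(a, buffer(a, 1), buffer(b, 1))]
print(out)   # [1, 13, 25, 37]
```

For example, out[1] = a[1] + (a[0] + b[0]) = 2 + (1 + 10) = 13.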

32. The buffer has zero latency in the schedule [diagram: inputs A and B each feed a Buffer(1); the buffered values are added together, then added to A]

33. This will schedule thus: Buffering = 3 [inputs and buffers at 0, the first adder's output at 1, the second adder's output at 2; two elements held in the Buffer(1)s plus one scheduling buffer on the direct path from A]

34. Buffers and Latency
• Accessing previous values with buffers is looking backwards in the stream
• This is equivalent to having a wire with negative latency
• Such a wire cannot be implemented directly, but it can affect the schedule

35. Offset wires can have negative latency [diagram: inputs A and B each feed an Offset(-1) wire at latency -1]

36. This is scheduled: Buffering = 0 [the offsets' outputs are at -1, the first adder's output at 0, the second adder's output at 1; no buffers are inserted]

37. Stream Offsets
• A stream offset is just a wire with a positive or negative latency
• Negative latencies look backwards in the stream
• Positive latencies look forwards in the stream
• The entire dataflow graph is re-scheduled to make sure the right data value is present when needed
• Buffering can be placed anywhere, or pushed into inputs or outputs, which is more efficient than manual instantiation
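The zero-buffering result on slide 36 follows from treating each Offset(-1) as an ordinary node with latency -1 and scheduling forward in topological order. An illustrative Python sketch (node names and the `schedule` helper are made up; adders take 1 cycle as in the diagrams):

```python
def schedule(nodes, preds, latency):
    """Forward ASAP-style pass; offsets are just nodes whose latency
    may be negative, so a node's output time can precede its inputs'."""
    time = {}
    buffering = 0
    for n in nodes:
        ps = preds.get(n, [])
        ready = max((time[p] for p in ps), default=0)
        buffering += sum(ready - time[p] for p in ps)  # align inputs
        time[n] = ready + latency.get(n, 0)
    return time, buffering

# a = a + (offset(a, -1) + offset(b, -1)), the design from slides 35-36:
nodes = ["A", "B", "offA", "offB", "add1", "add2", "out"]
preds = {"offA": ["A"], "offB": ["B"],
         "add1": ["offA", "offB"], "add2": ["A", "add1"], "out": ["add2"]}
lat = {"offA": -1, "offB": -1, "add1": 1, "add2": 1}
time, bufs = schedule(nodes, preds, lat)
print(time["add2"], bufs)   # 1 0 -- output at 1, zero buffers inserted
```

The offsets pull their consumers one cycle earlier, so the direct path from A and the offset path meet at the second adder without any inserted buffering, matching Buffering = 0 on slide 36 (versus 3 for the explicit-buffer version).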

38. a = a + stream.offset(a, +1), i.e. a[i] = a[i] + a[i + 1] [diagram: input A at 0 feeds both an Offset(1) wire and the adder]

39. Scheduling produces a circuit with 1 buffer [the Offset(1) output is at 1, so the direct path from A needs one buffer; the adder's output is at 2]

40. Exercises
For the questions below, assume that an addition has a latency of 10 cycles and a multiply 5 cycles, while inputs/outputs take 0 cycles.
• Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph
• Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling for:
  • (a) c = ((a1 + a2) + a3) + a4
  • (b) c = (a1 + a2) + (a3 + a4)
• Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
  • (a) c = ((a1 * a2) + (a3 * a4)) + a1
  • (b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4
How many values of stream a1 will be buffered on-chip for (b)?