**MaxAcademy Lecture Series – V1.0, September 2011**

**Stream Scheduling**

**Overview**

- Latencies in stream computing
- Scheduling algorithms
- Stream offsets

**Latencies in Stream Computing**

- Consider a simple arithmetic pipeline computing (A + B) + C
- Each operation has a latency: the number of cycles from input to output (which may be zero)
- Throughput is still 1 value per cycle; with latency L, up to L values can be in flight in the pipeline

*(Slide animation: a basic hardware implementation with inputs A, B and C feeding two adders. Data propagates through the circuit in lock step, but because of the adders' pipeline latency the value on C arrives at the second adder at the wrong time. Inserting buffering on the C input corrects the alignment: with A = 1, B = 2, C = 3 the circuit now produces the correct result, 6. Success!)*
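The lock-step behaviour above can be simulated in a few lines. This is a minimal sketch in plain Python (not MaxCompiler code), assuming the first adder has a 1-cycle latency: without a buffer on C, the partial sum A + B reaches the second adder one cycle after C does, so the operands are misaligned; delaying C by one cycle restores lock step. The function name and stream values are illustrative.

```python
def pipelined_sum(a_stream, b_stream, c_stream, buffer_c):
    """Cycle-accurate sketch: the first '+' registers its output (latency 1)."""
    NOP = None                  # placeholder for 'no valid data yet'
    add1_reg = NOP              # output register of the first adder
    c_reg = NOP                 # optional 1-cycle buffer on input C
    out = []
    pad = [NOP] * 2             # pad inputs so in-flight values can drain
    for a, b, c in zip(a_stream + pad, b_stream + pad, c_stream + pad):
        c_aligned = c_reg if buffer_c else c      # delayed or raw C
        if add1_reg is not NOP and c_aligned is not NOP:
            out.append(add1_reg + c_aligned)      # second adder
        add1_reg = a + b if a is not NOP else NOP # first adder latches A + B
        c_reg = c                                 # C advances one cycle
    return out

# With the buffer, element i of the output is (A[i] + B[i]) + C[i]:
print(pipelined_sum([1, 4], [2, 5], [3, 6], buffer_c=True))   # [6, 15]
# Without it, C[1] pairs with the wrong partial sum A[0] + B[0]:
print(pipelined_sum([1, 4], [2, 5], [3, 6], buffer_c=False))  # [9]
```

The first element of the buffered result, 6, matches the value the slides show emerging from the corrected circuit.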
**Stream Scheduling Algorithms**

- A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
- Can be applied automatically to a large dataflow graph (many thousands of nodes)
- Can try to optimize for various metrics:
  - Latency from inputs to outputs
  - Amount of buffering inserted (generally the most interesting)
  - Area (resource sharing)

**ASAP (As Soon As Possible)**

- Build up the circuit incrementally from the inputs, keeping track of latencies
- Inputs A, B and C all start at latency 0; the first adder's output is at latency 1
- The second adder's input latencies are mismatched (1 from the first adder, 0 from C), so buffering is inserted on C and the output is produced at latency 2

**ALAP (As Late As Possible)**

- Start at the output, at latency 0, and work backwards; latencies are negative relative to the end of the circuit
- The final adder's inputs sit at latency -1, and the first adder's inputs at -2
- In this example the buffering is saved: input C simply starts one cycle later than A and B

*(Slide animation: adding an extra output tapped from the first adder makes ALAP insert unnecessary buffering. Neither ASAP nor ALAP can schedule this design optimally.)*

**Optimal Scheduling**

- ASAP and ALAP both fix either the inputs or the outputs in place
- More complex scheduling algorithms may be able to develop a more optimal schedule, e.g.
using ILP (integer linear programming).

**Buffering Data On-Chip**

- Consider: a[i] = a[i] + (a[i - 1] + b[i - 1])
- We can see that we might need some explicit buffering to hold more than one data element on-chip
- We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))

*(Slide diagram: inputs A and B each feed a Buffer(1) element. The buffers have zero latency in the schedule, so the design schedules with the adders at latencies 1 and 2, for a total of 3 cycles of buffering.)*

**Buffers and Latency**

- Accessing previous values with buffers is looking backwards in the stream
- This is equivalent to having a wire with negative latency
- Such a wire cannot be implemented directly, but it can affect the schedule

*(Slide diagram: replacing each Buffer(1) with an Offset(-1) wire of latency -1 lets the graph schedule with the adders at latencies 0 and 1, and no buffering at all.)*

**Stream Offsets**

- A stream offset is just a wire with a positive or negative latency
- Negative latencies look backwards in the stream; positive latencies look forwards
- The entire dataflow graph will re-schedule to make sure the right data value is present when needed
- Buffering could be placed anywhere, or pushed into inputs or outputs; this is more optimal than manual instantiation

*(Example: a[i] = a[i] + a[i + 1], written as a = a + stream.offset(a, +1); scheduling produces a circuit with 1 buffer.)*

**Exercises**

For the questions below, assume that an addition operation has a latency of 10 cycles, a multiply takes 5 cycles, and inputs/outputs take 0 cycles.

1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph.
2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling for:
   - (a) c = ((a1 + a2) + a3) + a4
   - (b) c = (a1 + a2) + (a3 + a4)
3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c.
Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
   - (a) c = ((a1 * a2) + (a3 * a4)) + a1
   - (b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4

   How many values of stream a1 will be buffered on-chip for (b)?
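As a starting point, the ASAP algorithm described earlier can be sketched in plain Python. This is an illustrative sketch, not MaxCompiler's actual scheduler: nodes are given in topological order as (name, latency, inputs), inputs arrive at cycle 0, each node is scheduled as soon as its latest operand arrives, and earlier operands are buffered up to that time. All names here are hypothetical.

```python
def asap_schedule(nodes):
    """ASAP-schedule a dataflow graph given in topological order.

    nodes: list of (name, latency, [input names]); inputs have no operands.
    Returns (arrival, buffering): output-ready cycle per node, and the
    total cycles of buffering inserted on edges.
    """
    arrival = {}       # name -> cycle at which the node's output is ready
    buffering = 0      # total cycles of buffering inserted
    for name, latency, inputs in nodes:
        if not inputs:                  # a stream input: available at cycle 0
            arrival[name] = 0
            continue
        ready = max(arrival[i] for i in inputs)               # latest operand
        buffering += sum(ready - arrival[i] for i in inputs)  # delay early ones
        arrival[name] = ready + latency
    return arrival, buffering

# The chain ((a1 + a2) + a3) + a4, with 10-cycle adds: each later operand
# must wait for the whole chain so far, so buffering accumulates.
chain = [("a1", 0, []), ("a2", 0, []), ("a3", 0, []), ("a4", 0, []),
         ("s1", 10, ["a1", "a2"]),
         ("s2", 10, ["s1", "a3"]),
         ("c", 10, ["s2", "a4"])]
print(asap_schedule(chain))   # s1 at 10, s2 at 20, c at 30; buffering = 30

# The balanced tree (a1 + a2) + (a3 + a4): both operands of every adder
# arrive at the same cycle, so no buffering is needed.
tree = [("a1", 0, []), ("a2", 0, []), ("a3", 0, []), ("a4", 0, []),
        ("s1", 10, ["a1", "a2"]),
        ("s2", 10, ["a3", "a4"]),
        ("c", 10, ["s1", "s2"])]
print(asap_schedule(tree))    # c at 20; buffering = 0
```

The two calls illustrate why the balanced tree in exercise 2(b) schedules with less buffering than the chain in 2(a). An ALAP variant would walk the graph in reverse topological order from the outputs, assigning negative latencies, as in the slides.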