
ECE 486/586 Computer Architecture Chapter 7 Software Pipelining

Explore the concept of software pipelining for a novel register-based RISC microprocessor that incorporates a VLIW instruction with parallel arithmetic, integer, and memory-access operations.





Presentation Transcript


  1. ECE 486/586 Computer Architecture Chapter 7 Software Pipelining Herbert G. Mayer, PSU Status 2/21/2017

  2. Introduction – Imagine • Imagine your team had built a novel, register-based RISC microprocessor for fast FP operation • But you broke the RISC mold by adding one super, very long instruction VL . . . • VL can perform numerous arithmetic, integer, and memory-access operations at the same time • Provided that there be no data dependence between registers used in various parts of VL • I.e. if an FP mult operation, being part of VL, places the product of r1 and r2 into r3 . . . • And an FP add operation, part of VL, places the sum of r3 and r4 into r5 • Then both parts of VL use r3, once as a source, once as a destination, but without dependence!

  3. Introduction – Imagine • Think about HW design challenges: • How do you design the HW for a VL instruction, so that there be no conflict (i.e. no dependence) between different uses of r3 in multiple sub instructions? • Think of the multiple, simultaneous sub instructions VL executes: When will the total VL instruction retire? • Could the same resource, say register r3, be used multiple times in one VL instruction as a source operand? • Could the same resource, say register r3, be used multiple times in one VL instruction as destination? • Could some sub instruction be a branch or call?

  4. Introduction • Software Pipelining (SW PL) is a performance-enhancing technique that exploits instruction-level parallelism on a wide instruction word to accelerate execution through parallelism • SW PL is not to be confused with HW instruction pipelining, the latter being a pure HW method of speeding up execution by overlapping multiple instructions and having the HW modules execute simultaneously, running parts of multiple instructions in parallel • As the name suggests, SW pipelining is not primarily achieved through HW, yet it does exploit HW parallelism • Refers to interleaved execution of parts of multiple loop iterations, mapped onto the sub-instructions of VLIW operations

  5. Introduction • In SW PL, different parts of multiple iterations of a source program’s loop body are executed simultaneously, in a pipelined fashion • SW and HW similarity of pipelining: both execute multiple steps of a program simultaneously in an interleaved fashion • HW pipelining runs different parts of different, consecutive instructions simultaneously on a single processor; pipelined CPUs are still UP processors • SW pipelining runs different parts of the same SW loop simultaneously, activating different iterations of that loop in one single step • Requires that the target machine has multiple HW modules that can execute simultaneously, e.g. a VLIW architecture, or just a superscalar machine

  6. Introduction • For example, an LIW architecture has instructions that perform a floating point (FP) multiply at the same time it executes an FP add, and perhaps also an integer addition, plus memory-access operations • This parallelism has long existed on a smaller scale, in superscalar architectures • Similarly, a VLIW architecture provides instructions that do all of the above plus a store, several other integer operations, and perhaps loop instructions • Very roughly, the boundary between VLIW and LIW is 96 opcode bits; a design beyond 96 instruction bits is called VLIW • The key problem with SW PL is to fill the different sub-instructions of a VLIW opcode with different parts –even different iterations– of a loop to be run in parallel

  7. Introduction • In the late 1980s, Intel Corp. actually built and delivered processors jointly with CMU that packed numerous arithmetic operations plus memory-access and loop instructions into one single VLIW operation • The architecture name was iWarp • iWarp was a successor to, and an improvement over, CMU’s earlier Warp design for DARPA • Computers were sold to industrial customers as well, but did not gain wide acceptance • Reason? . . .

  8. Syllabus • Motivation • Outline of Software Pipelining • Definitions • Pipelining, SW PL, and Superscalar Execution • VLIW instructions • Example of Simple SW PL Loop • Peeling-Off Iterations • Software Pipelined Matrix Multiply • A Complex SW Pipelined Loop • Skipping the Epilogue? • Literature References

  9. Motivation • Imagine a super-machine instruction, one that can perform numerous operations in parallel, way more than a superscalar CPU! • The execution time for the novel instruction to complete is simply the time necessary for the longest of its many sub operations, not the time for the sum of all different computations! • When a fast –perhaps very simple– sub operation completes, then after completion its HW module idles, waiting for other parts of the super-machine instruction to terminate • We’ll name this super-machine instruction a Very Long Instruction Word (VLIW) operation; very long referring to > 96 instruction bits!

  10. Motivation • Such a VLIW operation may compute a floating point (FP) multiply (fmul) at the same time it performs an FP add (fadd), a load (ld), and a store (st), plus some integer operations; perhaps more • Below is a sequence of 3 fictitious assembler operations, to be executed in sequence • Later we shall package them into a single VLIW operation, and consider asm syntax and semantics: • -- code 1: two loads, 1 add, good addresses in r1, r2 • 1 ld r3, (r1)++ -- load r3 thru r1, post-incr. • 2 ld r4, (r2)++ -- load into r4 mem word at (r2) • 3 fadd r5, r3, r4 -- FP add into r5: r5 ← r3 + r4

  11. Motivation • The three instructions above execute sequentially • The VLIW instruction can do all of the above, plus more if needed in one instruction in parallel • The new, imagined super-assembler expresses this simultaneousness via assembly pseudo-operators, the { and } braces • These braces say: All { and } enclosed operations are packed into a single VLIW instruction • And all sub-opcodes, regardless of the order listed, are executed in parallel, in a single logical step, and in a single instruction • See the new VLIW instruction below:

  12. Motivation { -- code 2: two loads, 1 add, good addresses in r1, r2 1 ld r3, (r1)++ -- load r3 thru r1, post-incr. 2 ld r4, (r2)++ -- load into r4 mem word at (r2) 3 fadd r5, r3, r4 -- FP add r5 ← r3 + r4 -- r5 is destination } Looks very similar to code 1 earlier, almost the same syntax But the semantics are: code 1: serial code 2: parallel Quite different semantics! Quite different execution times! That is the point!

  13. Motivation • More obvious to see that all parts of a VLIW instruction run in parallel, by writing these sub-opcodes horizontally: • -- code 3: same as code 2, but written on single line • -- r5 is destination for floating-point “sum”! • { ld r3, (r1)++; ld r4, (r2)++; fadd r5, r3, r4 } • Not surprisingly, this style of architecture, VLIW architecture, is also referred to as horizontal microcode

  14. Motivation • It is unrealistic to expect that operands r3 and r4, being loaded from memory via ld, would be the ones added via fadd into r5 during the same instruction; the duration would be the length of the loads plus that of fadd • Instead, the old values in r3 and r4 that had already been in the registers when the instruction started are added into r5 • The new values are then loaded into r3 and r4. So the meaning of the VLIW instruction above is not at all equivalent to the earlier, sequential assembler code! • But then it seems that a VLIW operation cannot truly be used for parallel execution? • Only by using software pipelining technology can we exploit the potential of VLIW instructions!
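The latch-then-write behavior described above can be mimicked in plain C. This is a hedged sketch: the Regs struct and the function name are invented for illustration, not part of the slides' hypothetical assembler.

```c
/* Hypothetical register model mirroring r1..r5 from the slides. */
typedef struct { double *r1, *r2; double r3, r4, r5; } Regs;

/* Sketch of code 2's latch-then-write VLIW semantics: all source
 * operands are latched first, then every destination is written,
 * so fadd sees the OLD r3 and r4 while the two loads deposit new
 * values into those same registers. */
void code2_vliw_step(Regs *r) {
    double old_r3 = r->r3, old_r4 = r->r4; /* latch sources */
    r->r3 = *r->r1++;                      /* ld r3, (r1)++ */
    r->r4 = *r->r2++;                      /* ld r4, (r2)++ */
    r->r5 = old_r3 + old_r4;               /* fadd on the latched values */
}
```

Note that if the two loads were written before the fadd without latching, the step would degenerate into the sequential code 1 semantics; the explicit latch is what makes the braces meaningful.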

  15. Motivation • That’s where Software Pipelining comes into play: a technique that takes advantage of the built-in parallelism, while overcoming the limitation that the destination register of one operation cannot be used in the same step as a source for another • In Software Pipelining this multi-use is ok, but refers to 2 different purposes, e.g. once as source, a second time as destination! • Software Pipelining weaves parts of multiple iterations of one source loop into the same single VLIW instruction • For this to work the Software Pipelined loop must be suitably primed; this is done in the loop Prologue • And before the iterations terminate, the VLIW instruction must be emptied or drained; that is done in the loop Epilogue

  16. Outline of Software Pipelining

  17. Outline of Software Pipelining Progressing on Multiple Loop Iterations in 1 Step

  18. Outline of Software Pipelining • Software Pipelining requires a target processor with multiple executing agents (HW modules) that can run simultaneously • Target may be an LIW, a VLIW, or even an ancient superscalar processor • These multiple HW modules operate in parallel on (parts of) different iterations of a program loop • It is key that none of these multiple, simultaneous operations generate results that would be expected as input of another operation in the same iteration • I.e. there may not exist any data dependence between sub instructions of a VLIW instruction!

  19. Outline of Software Pipelining • Instead, in step 1 an operation generates a result in some destination register, say r1, while in step 2 another operation can use r1 as an input to generate the next result, to be placed into any register • It is the compiler’s or programmer’s task to pack operations of different loop iterations into suitable fields of one VLIW instruction in a way that they be executed in one step, but without dependence • In extreme cases, a complete source loop may be mapped onto a single VLIW instruction; clearly a desirable special case • Intel’s iWarp architecture was designed so that a matrix-multiply could be packed into a single instruction, including loop overhead! Awesome HW!

  20. Outline of Software Pipelining • In the following examples we’ll use hypothetical VLIW instructions • The assembly language syntax is similar to conventional assemblers, e.g. SPARC: There will be individual load and store operations, floating point adds and multiplies, etc. • In addition, VLIW instructions can group several of these operations together; this is expressed by the { and } braces, indicating that all operations enclosed can be executed together, jointly in a single step; but without dependences! • The sequential order in which the various VLIW sub-operations are written in assembler source is arbitrary; the order written by no means implies sequential execution order!

  21. Outline of Software Pipelining all sub-operations within { and } braces start at the same time; simple ones complete in a small fraction of a cycle; complex ones require the full cycle: 1 ld r3, (r1)++ -- load r3 thru r1, post-incr. 2 add r0, r6, #2 -- add 2 to r6, put sum into r0 3 { -- start of VLIW operation 4 fadd r5, r3, r4 -- add current r3 + r4 into r5 5 ld r3, (r1)++ -- new value in r3, old r3 used 6 ld r4, (r2)++ -- new value in r4, old r4 used 7 } -- end of VLIW instruction

  22. Outline of Software Pipelining • Lines 1 and 2 above display typical sequential load and add instructions • However, lines 3 through 7 display one single VLIW operation that lumps a floating point add, two integer increments, and two loads into one single executable step • Note that r3 and r4 are both used as operands, as inputs to the floating point add operation, but they also receive new values as a result of the loads • This works perfectly well, provided the processor immediately latches their old values for use in fadd • After completion of the VLIW instruction the loaded values in r3 and r4 can be used as operands again

  23. Outline of Software Pipelining • The execution time of the above VLIW instruction is dominated by the time to execute loads, i.e. memory accesses • These may even be in sequence, unless the addresses in r1 and r2 happen to activate different memory banks, and the HW detects this! • In that case, the time for two loads would be the time for one load plus a small number of clock cycles to synchronize bus traffic

  24. Loop Instruction • The loop instruction used in the examples here is akin to the one used on the Intel 8086 processor • The hypothetical operation: loop foo -- which uses a dedicated loop reg “rx” -- has the meaning: An implied loop register rx initially holds the number of iterations of a countable loop; thus no need to mention rx • This integer value is decreased by 1 each iteration. If that decremented value is not yet 0, then execution continues at label foo • Typically foo resides at an address before the loop instruction (e.g. while statement), resulting in a back branch most of the time, and once in a fall-through
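The decrement-and-branch meaning of this hypothetical loop instruction can be sketched as a C do-while. The function name and the passes counter are invented for the sketch:

```c
/* Sketch of the 8086-style `loop foo` semantics: the implied count
 * register rx is decremented on each pass, and control branches
 * back to foo until rx reaches zero (so the body always runs at
 * least once, and exactly rx times for rx >= 1). */
int loop_demo(int rx) {
    int passes = 0;
    do {                  /* foo: body of the countable loop */
        passes++;
    } while (--rx != 0);  /* loop foo: decrement, back-branch if != 0 */
    return passes;
}
```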

  25. Some Definitions

  26. Definitions Basic Block Sequence of instructions (one or more) with a single entry point and a single exit point; entry and exit w.r.t. transfer of control Entry point may be the destination of a branch, a fall-through from a conditional branch, or the program entry point, i.e. the destination of an OS jump to start Exit point may be an unconditional branch instruction, a call, a return, or a fall-through Fall-through means: one instruction is a conditional flow-of-control change, and the subsequent instruction is executed by default, if the change in control flow does not take place Or fall-through can mean: The successor of the BB’s exit point is a branch target or call target!

  27. Definitions Column A Software Pipelined loop operates on (parts of) more than one iteration of a source program’s loop body Each distinct loop iteration toward which one compute step makes progress is called a column Hence, if one execution of the loop makes some progress toward 4 different iterations, such a SW-pipelined loop is said to have 4 columns

  28. Definitions Cycles Per Instruction: cpi cpi quantifies how long it takes for a single instruction to execute Generally, the number of execution cycles per instruction on a CISC architecture is: cpi > 1 However, on a pipelined UP architecture, where a new instruction can be initiated at each cycle, it is conceivable to reach a cpi rate of 1; assuming no stall and no hazard Note the different meanings for the duration of a cycle On a UP pipelined architecture the cpi rate cannot shrink below one Yet on an MP architecture, or a superscalar machine that is pipelined, the rate may be cpi < 1
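The claim that a pipelined UP architecture approaches, but never drops below, cpi = 1 follows from simple arithmetic: n instructions on an ideal k-stage pipeline need k + (n - 1) cycles. A small sketch (function name invented for illustration):

```c
/* Back-of-envelope cpi for an ideal k-stage pipeline with no stalls
 * or hazards: n instructions finish in k + (n - 1) cycles, so the
 * per-instruction rate approaches 1 from above as n grows, and it
 * never shrinks below 1 on a uniprocessor. */
double ideal_pipeline_cpi(int stages, int n) {
    return (double)(stages + n - 1) / (double)n;
}
```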

  29. Definitions Dependence If the logic of the underlying program imposes an order between two instructions, there exists dependence (data or control dependence) between them Generally, the order of execution cannot then be permuted It is conventional in Computer Architecture to call this dependence, not dependency

  30. Definitions Draining Once the iterations of a SW pipelined loop are complete, there are still partial iterations yet to be completed; see also Epilogue Therefore, some instructions will be necessary after the VLIW loop to complete the actions still necessary in partial loop iterations E.g. some last registers still holding good values after loop termination must be used up by some sequential (non-VLIW) instructions That step, generally not empty, is called draining Synonym: Flushing Antonym: Priming

  31. Definitions Epilogue When the steady state of the VLIW loop terminates, there will generally be some valid operands in the hardware resources used For example, an operand may have been loaded and is yet to be used. Or, some value already has been computed that is yet to be added to a final result Thus the last operands must be consumed, the pipeline must be drained This is accomplished in the object code after the steady state and is called the Epilogue Antonym: Prologue

  32. Definitions IPC Instructions per cycle is a measure for Instruction Level Parallelism. How many different instructions are being executed –not necessarily to completion– during one single cycle? Desired to have an IPC rate > 1; ideally even, given sufficient parallelism, IPC >> 1 On conventional UP CISC architectures it is typical to have IPC << 1

  33. Definitions ILP Instruction Level Parallelism: Architectural attribute, allowing multiple instructions to be executed at the same time Generally requires lack of dependence between these instructions executed simultaneously Related: Pipelined, Superscalar, LIW, and VLIW

  34. Definitions LIW Long Instruction Word; an instruction requiring more opcode bits than a conventional (e.g. RISC) instruction, because multiple simultaneous operations are packed and encoded into a single LIW instruction Typically, LIW instructions are longer than 32 but no longer than 96 bits The opcode proper may be short, possibly even a single bit, plus further bits to specify sub-opcodes For example, whether the floating point add performs a subtract, negate, unsigned add, or even a noop must be specified via bits for sub-opcodes

  35. Definitions Peeling Off Removal of an iteration from the original loop’s full iteration space; usually done to perform an optimization In software pipelining this is done to ensure the pipeline is primed before, and drained after execution of the loop The object code of peeled off iterations can be scheduled together with other instructions Hence the Prologue and Epilogue may end up in VLIW instructions of other code preceding or following a software pipelined loop
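Peeling off an iteration is easy to picture on an ordinary loop. In this hedged C sketch (function name invented), iteration 0 is removed from the loop's iteration space so it could be scheduled with preceding code, exactly as a prologue would be:

```c
/* Peeling off the first iteration of a summation loop: the peeled
 * copy of the body (iteration 0) stands alone before the loop, and
 * the remaining loop covers the reduced iteration space 1..n-1. */
int sum_peeled(const int *x, int n) {
    if (n == 0) return 0;
    int s = x[0];                  /* peeled-off iteration 0 */
    for (int i = 1; i < n; i++)    /* remaining iteration space */
        s += x[i];
    return s;
}
```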

  36. Definitions Pipelining Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete Highly pipelined Xeon processors, for example, have a greater than 20-stage pipeline

  37. Definitions Priming (on a VLIW architecture) Executing sequential instructions on a VLIW architecture before loop execution, such that a subsequent SW-pipelined loop can run correctly That means all registers hold good values at the start of the loop, but also hold partly unused values at termination of the loop Antonym: flushing

  38. Definitions Prologue Before the Software Pipelined loop body can be initiated, the various hardware resources (e.g. registers) that partake in the SW pipelined loop must be initialized For example, the first operands may have to be loaded, or the first sum of two loaded operands must be computed Thus the first operands must be generated, the pipeline must be primed This is accomplished in the object code before the steady state and is called the Prologue Antonym: Epilogue

  39. Definitions Reduction Optimization Common optimization that replaces (i.e. reduces) multiple stores with one store For example, the inner loop of a matrix multiply may have repeated references to location c[row][col]: c[row][col] = c[row][col] + a[row][i] * b[i][col]; Abbreviated even in C source to: c[row][col] += a[row][i] * b[i][col]; Accumulated (reduced) into a 0.0-initialized register: reg += a[row][i] * b[i][col]; . . . until the loop completes, and then c[row][col] = reg
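The reduction described above, written out as a complete C matrix multiply (dimension N and the function name are chosen for the sketch):

```c
#define N 2  /* small dimension for the sketch */

/* Reduction optimization applied to matrix multiply: the inner
 * product accumulates in a register-resident scalar `reg`, and
 * c[row][col] is stored only once per element, instead of being
 * re-loaded and re-stored on every pass of the inner loop. */
void matmul_reduced(double a[N][N], double b[N][N], double c[N][N]) {
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++) {
            double reg = 0.0;              /* 0.0-initialized register */
            for (int i = 0; i < N; i++)
                reg += a[row][i] * b[i][col];
            c[row][col] = reg;             /* the single remaining store */
        }
}
```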

  40. Definitions Steady State The object code executed repeatedly, after the Prologue has been initiated and before the Epilogue becomes active, is called the Steady State Each single execution of the Steady State makes some progress toward multiple iterations of the source loop; see columns The loop was mapped into Prologue, Steady State, plus Epilogue by the Software Pipelining compiler

  41. Definitions VLIW Very Long Instruction Word; like an LIW instruction, but VLIW instructions typically consume > 96 bits for the opcode, sub-opcodes, plus all operand specifications Some of the sub-opcodes may actually be NOOPs A VLIW architecture also provides regular, sequential opcodes, like a conventional architecture

  42. Pipelining, Software Pipelining, and Superscalar Execution

  43. Pipelining Assumes: multiple sequentially ordered HW units that each execute one part of a single instruction Hardware modules are not replicated At any step (clock cycle) each HW module executes one part of a different instruction Allows: simultaneous execution of multiple parts of different instructions at the same time. This does not accelerate the execution time of a single instruction Speeds up: the throughput of numerous consecutive instructions; provided there are no dependences

  44. Superscalar Relies on some replicated independent HW units that each can execute a complete instruction Or else relies on some instruction sequences arranged in the object code such that they can be fetched and executed together Simultaneous execution requires they exhibit no data dependence on one another; detected by superscalar HW Does: at any step execute either a single, or sometimes multiple, instructions simultaneously Always fetches multiple instructions greedily with expectation of potential parallel execution

  45. Superscalar Allows: simultaneous execution of certain, defined sequences of instructions Not all instruction sequences (or instruction pairs in a superscalar architecture with a maximum of 2 identical HW modules) can be executed concurrently Speeds up: those select instruction sequences for which the architecture provides multiple HW modules, and for which the logic of the running program provides data independence, i.e. the output of one is not required as the input to the other

  46. SW Pipelining Assumes: VLIW or LIW instructions Does: at any VLIW step execute multiple operations at the same time, packaged into a single VLIW instruction Allows: simultaneous execution of parts of multiple iterations of the same source loop. But generally one VLIW instruction does not map into a full source loop; only a portion of the loop At each iteration the SW-pipelined loop executes VLIW instructions that jointly execute multiple source statements of a loop body

  47. VLIW Explained With Actual Programs

  48. VLIW Instructions Architecture design goal is to group together, into a single VLIW instruction, all of the following: floating point multiply, one or even more floating point add, 1 or more load with optional auto-increment, decrement, pre- or post of address register second load or store, with auto-increment, decrement, pre- or post integer add, also with optional auto-increment; “add” can be int addition, negation, subtraction integer multiply, divide, or invert loop-related operation sub-opcode needs a NOOP option, if not needed
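The slot list above suggests one possible encoding. The sketch below is purely illustrative: the field names, widths, and sub-opcode values are invented, not taken from the slides. The point is that every slot carries a NOOP option, so an instruction that needs only one operation still fills every field:

```c
#include <stdint.h>

/* Invented sub-opcode values; SUB_NOOP fills any unused slot. */
enum { SUB_NOOP = 0, SUB_FMUL, SUB_FADD, SUB_LD, SUB_ST, SUB_IADD, SUB_LOOP };

/* One field per slot the slide lists: FP multiply, FP add, two
 * memory slots, integer add, integer mul/div, loop operation. */
typedef struct {
    uint8_t fmul, fadd, ld1, ld2_or_st, iadd, imuldiv, loopop;
} VliwWord;

/* A VLIW word that only multiplies: every other slot is a NOOP. */
VliwWord fmul_only(void) {
    VliwWord w = { SUB_FMUL, SUB_NOOP, SUB_NOOP, SUB_NOOP,
                   SUB_NOOP, SUB_NOOP, SUB_NOOP };
    return w;
}
```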

  49. VLIW Instructions Note: If both a load and a store are performed in one VLIW instruction on a system with a single memory controller, there could be data dependence, or interference, if they reference the same address Address comparison in HW is needed to resolve aliasing! Could prevent SW pipelining Execution time of a VLIW instruction is dominated by the cycles needed for the longest sub instruction, typically a load or store Having two store operations as part of the same VLIW instruction would require special care to detect the case of both memory destinations being identical or overlapping (alias analysis)

  50. VLIW Instructions: A Loop Sample The program snippet below [p. 52], to be software pipelined, adds all elements of floating-point vector a[] to the corresponding elements of vector b[] and moves the sums into vector a[] First we show the source program in pseudo-C, with a possible mapping into hypothetical assembly language Then comes a pictorial representation of the problem with HW resources used, and a software pipelined version in VLIW assembly language
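The slide's VLIW assembly for this vector-add loop is not part of this excerpt, but the prologue / steady-state / epilogue structure it builds toward can be sketched at the C level. All names here are illustrative; r3 and r4 play the role of the latched registers:

```c
/* C-level sketch of a software-pipelined a[i] = a[i] + b[i] loop:
 * the prologue primes the "registers", each steady-state pass adds
 * and stores iteration i-1 while loading iteration i (the work one
 * VLIW instruction would do in parallel), and the epilogue drains
 * the last pair still held in registers. */
void vec_add_sw_pipelined(double *a, const double *b, int n) {
    if (n <= 0) return;
    /* Prologue: load the first operand pair */
    double r3 = a[0], r4 = b[0];
    /* Steady state: one "VLIW step" per pass */
    for (int i = 1; i < n; i++) {
        a[i - 1] = r3 + r4;   /* fadd + st for iteration i-1 */
        r3 = a[i];            /* ld for iteration i */
        r4 = b[i];            /* ld for iteration i */
    }
    /* Epilogue: drain the final pair */
    a[n - 1] = r3 + r4;
}
```

In a real VLIW encoding the add/store of one column and the loads of the next would share a single instruction; here they merely share a loop body.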
