Reconfigurable Computing

Reconfigurable Computing - Pipelined Systems
John Morris
Chung-Ang University, The University of Auckland
[Photo: ‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia]

Pipelines
• Key strategy for improving the performance of systems
• Provide a form of parallelism (pipeline parallelism)
  • Different parts of different computations are being processed at the same time
• In general, blocks A, B, C, … will be different
  • although in some applications (e.g. pipelined multipliers, digital filters, image processing) some or all of them may be identical
[Figure: combinatorial blocks A, B, C separated by registers driven by a common clock]

Pipelines
• Any modern high performance processor provides an example of a pipelined system
• ‘Work’ of processing an instruction is broken up into several sub-tasks, e.g.
  • IF - Instruction fetch
  • ID/OF - Instruction decode and operand fetch
  • Ex - Execute
  • WB - Write back
[Figure: part of a simple pipelined RISC processor: instruction memory and register file feeding the IF, ID/OF, Ex and WB stages, separated by registers on a common clock]

High performance processor pipelines
• Basic idea
  • If an instruction requires x ns to fetch, decode, execute and store results,
    • a simple (non-pipelined) processor can be driven by a clock of frequency f = 1/x
  • However
    • divide the work into 4 blocks, each requiring x/4 ns
    • build a 4-stage pipeline clocked at 4/x = 4f
  • The pipeline completes an instruction every x/4 ns, so it appears to be processing instructions at a 4f rate
    • 4-fold increase in processing power!!
    • Because the system is processing 4 instructions at once!!
  • but …

High performance processor pipelines
• Basic idea
  • Use an n-stage pipeline
    • n-fold increase in processing power!!
    • Because the system is processing n instructions at once!!
• Note
  • The time to actually process an instruction hasn’t changed
    • It’s still x ns
  • Thus the latency (time for the first instruction to complete) is still x ns
  • It’s the throughput that has increased, to nf (4f in the 4-stage example)

High performance processor pipelines
• … and don’t forget reality!!
  • It will not be possible to divide the work exactly into x/4 ns chunks, so the longest stage will take y > x/4 ns
  • The registers are not ‘free’
    • There is a propagation delay associated with them
    • so even a perfectly balanced pipeline has a shortest cycle time of ymin = x/4 + (tSU + tOD) ns
    • where tSU and tOD are the setup and output delay times for the register
  • The longest stage sets the clock, so the actual cycle time is ymax = y + (tSU + tOD) ns
  • thus the real throughput will be f’ = 1/ymax < 4f
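The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and all the numbers (a 20 ns instruction, an uneven four-way split, 0.3 ns setup and 0.2 ns output delay) are assumptions, not values from the slides.

```python
def real_throughput(stage_times_ns, t_su_ns, t_od_ns):
    """Return (cycle time in ns, throughput in instructions per ns).

    The clock period is the longest stage time plus the register
    setup and output delays, i.e. ymax in the slide's notation.
    """
    y_max = max(stage_times_ns) + t_su_ns + t_od_ns
    return y_max, 1.0 / y_max


# Hypothetical numbers, chosen only for illustration.
x_ns = 20.0                          # unpipelined instruction time
stage_times = [4.0, 6.0, 5.0, 5.0]   # uneven 4-way split of the same work
y_max, f_real = real_throughput(stage_times, t_su_ns=0.3, t_od_ns=0.2)

f = 1.0 / x_ns                       # unpipelined clock rate
n_eff = f_real / f                   # the n' of the slides
print(f"cycle time = {y_max:.2f} ns, speed-up n' = {n_eff:.2f} (ideal: 4)")
```

With these assumed values the 6 ns stage and the 0.5 ns of register overhead together hold the speed-up to a little over 3, rather than the ideal 4.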

High performance processor pipelines
• Basic idea
  • Use an n-stage pipeline
    • n-fold increase in processing power!!
    • Because the system is processing n instructions at once!!
• So we should write ...
  • n’-fold increase in processing power!! where n’ < n
• Nevertheless, n’ is usually substantial, so that pipelining speeds up throughput considerably
• Remember
  • Throughput increases
  • but
  • Latency does not improve
    • In fact, it increases to n × ymax

High performance processor pipelines
• Pipeline stalls
  • The picture presented earlier makes a severe assumption, i.e. that the pipeline is always full and never stalls
  • For example,
    • Extend the simple RISC processor with a cache and data memory
[Figure: part of a simple pipelined RISC processor, extended with a cache and data memory alongside the instruction memory, register file and the IF, ID/OF, Ex and WB stages]

High performance processor pipelines
• Now, when an instruction reads data from memory, the execution unit tries to find the data in the cache and, if that fails, looks in main memory
• Assume the slowest arithmetic operation, multiply, takes 5 ns (including register time)
  • So f can be set to 200 MHz
• Now
  • cache access time = 8 ns
  • main memory access time = 100 ns
• This means that
  • For a cache access, the pipeline must stall (wait) for 1 extra cycle
  • For a main memory access, the pipeline must stall for 10 extra cycles

High performance processor pipelines
• Pipeline stalls
  • The simple picture presented up to now makes one severe assumption, i.e. that the pipeline is always full and never stalls
  • When a pipeline may stall (as in a general purpose processor)
    • Effect of stalls on throughput is generally >> all other factors!
    • e.g. in a typical processor, ~25% of instructions access memory and so stall the pipeline for 1-10 cycles
  • Calculate the effect for a cache hit rate of 90%
    • 75% of instructions - stall 0 cycles
    • 25% × 0.9 = 22.5% - stall 1 cycle
    • 25% × 0.1 = 2.5% - stall 10 cycles
    • Average stall = 0.225 × 1 + 0.025 × 10 = 0.475 cycles = 0.475 × 5 ns ≈ 2.37 ns
  • So the effective cycle time is 5 + 2.37 ≈ 7.4 ns
  • Still considerably better than the original 4 × 5 ns = 20 ns!
    • i.e. we still gained from pipelining! (Just not quite so much!)
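The stall calculation above can be reproduced with a short Python sketch. The function and parameter names are assumptions made for illustration; the numbers passed in are the slide's own (5 ns cycle, 25% memory instructions, 90% hit rate, stalls of 1 and 10 cycles).

```python
def effective_cycle_time(cycle_ns, mem_fraction, hit_rate, hit_stall, miss_stall):
    """Return (average stall in cycles, effective cycle time in ns).

    mem_fraction : fraction of instructions that access memory
    hit_rate     : fraction of those accesses served by the cache
    hit_stall    : extra cycles lost on a cache hit
    miss_stall   : extra cycles lost on a main memory access
    """
    avg_stall = mem_fraction * (hit_rate * hit_stall
                                + (1.0 - hit_rate) * miss_stall)
    return avg_stall, cycle_ns * (1.0 + avg_stall)


# The slide's figures.
avg_stall, t_eff = effective_cycle_time(5.0, 0.25, 0.90, 1, 10)
print(f"average stall = {avg_stall:.3f} cycles, "
      f"effective cycle time = {t_eff:.2f} ns")

# Sensitivity to the hit rate (illustrative values only).
for rate in (0.80, 0.90, 0.99):
    _, t = effective_cycle_time(5.0, 0.25, rate, 1, 10)
    print(f"hit rate {rate:.0%}: effective cycle time {t:.2f} ns")
```

Varying `hit_rate` in the loop gives a feel for how strongly the cache hit rate drives the effective cycle time.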

Balance
• If a processing operation is divided into n stages, in general these stages will perform different operations and have different delay times, t1, t2, t3, …, tn
• The pipeline cannot run faster than the slowest of these times. Thus, the critical time is:
  • tstage = max(t1, t2, t3, …, tn)
  • fmax = 1/(tstage + tSU + tOD)
• In order that tstage ≈ Σti/n, the average time for a stage, the pipeline must be balanced, i.e. the stage times must be as close to the same as possible!
  • One slow stage slows the whole pipeline!
• This implies that
  • the separation of work into pipeline stages needs care!
  • because of the fixed overheads, too many stages can have a negative effect on performance!
  • Too many stages ⇒ ti < (tSU + tOD) and no net gain!
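To see how the fixed register overhead limits the gain from adding stages, here is a rough Python sketch. It assumes a perfectly balanced pipeline and uses hypothetical numbers (20 ns of total logic delay, 1 ns for tSU + tOD); neither the function name nor the values come from the slides.

```python
def balanced_pipeline(total_ns, n_stages, overhead_ns):
    """Model a perfectly balanced n-stage pipeline.

    total_ns    : combinatorial delay of the whole operation
    overhead_ns : register overhead per stage (tSU + tOD)
    Returns (cycle time, speed-up over the unpipelined logic, latency).
    """
    t_stage = total_ns / n_stages       # every stage takes the average time
    cycle = t_stage + overhead_ns       # 1 / fmax
    speedup = total_ns / cycle          # throughput gain
    latency = n_stages * cycle          # time for one result to emerge
    return cycle, speedup, latency


TOTAL_NS = 20.0      # hypothetical
OVERHEAD_NS = 1.0    # hypothetical tSU + tOD

for n in (1, 2, 4, 10, 20, 40):
    cycle, speedup, latency = balanced_pipeline(TOTAL_NS, n, OVERHEAD_NS)
    print(f"n = {n:2d}: cycle = {cycle:5.2f} ns, "
          f"speed-up = {speedup:5.2f}, latency = {latency:5.1f} ns")
```

In this model the speed-up can never exceed `TOTAL_NS / OVERHEAD_NS`, while the latency grows by one register overhead for every stage added, so once the stage time falls below the overhead each extra stage buys very little.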

Pipelines - Performance effects
• Remember
  • Throughput increases
  • but
  • Latency remains (almost) the same
    • In fact, it increases slightly because of overhead factors!

Pipelines in VHDL
• VHDL synthesizers will ‘register’ SIGNALs that are assigned inside a clocked process!
  • You don’t need to explicitly add registers!
• Example - ALU (execution unit) of pipelined processor
  • 3 input components
    • opcode - operation + dest reg address
    • op1 - operand 1
    • op2 - operand 2
  • 3 output components
    • result
    • dest reg address
    • exception flag
[Figure: part of a simple pipelined RISC processor, showing the Ex stage between pipeline registers, with the data memory, cache, register file and instruction memory]

-- Pipelined ALU
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY ALU IS
  PORT(
    instn     : IN  std_ulogic_vector;   -- opcode + dest reg address
    op1, op2  : IN  std_ulogic_vector;   -- operand 1 and operand 2
    res       : OUT std_ulogic_vector;   -- result
    add       : OUT std_ulogic_vector;   -- dest reg address
    exception : OUT std_ulogic;          -- exception flag
    clk       : IN  std_ulogic
  );
END ENTITY ALU;

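-- Architecture of the pipelined ALU
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;               -- provides the signed type and arithmetic

ARCHITECTURE m OF ALU IS
  -- Bit positions of the opcode and dest reg address fields within instn
  -- (instn is assumed to have an ascending range that covers bits 0 to 7)
  CONSTANT op_start  : NATURAL := 0;
  CONSTANT op_end    : NATURAL := 2;
  CONSTANT add_start : NATURAL := (op_end+1);
  CONSTANT add_end   : NATURAL := (op_end+5);
  -- std_ulogic_vector, to match the type of instn
  SUBTYPE opcode_wd IS std_ulogic_vector( op_start TO op_end );
  CONSTANT no_op  : opcode_wd := "000";
  CONSTANT add_op : opcode_wd := "001";
  CONSTANT sub_op : opcode_wd := "010";
BEGIN
  PROCESS ( clk )
    -- Variables must be constrained; op1, op2 and res are assumed to be the same width
    VARIABLE result : signed( op1'LENGTH-1 DOWNTO 0 );
    VARIABLE opcode : opcode_wd;
  BEGIN
    IF clk'EVENT AND clk = '1' THEN
      opcode := instn( opcode_wd'RANGE );
      CASE opcode IS
        WHEN no_op  => exception <= '0';
        WHEN add_op => result := SIGNED(op1) + SIGNED(op2);
                       exception <= '0';
        WHEN sub_op => result := SIGNED(op1) - SIGNED(op2);
                       exception <= '0';
        WHEN OTHERS => exception <= '1';
      END CASE;
      -- Assigned on the clock edge, so the synthesizer infers the pipeline registers
      res <= std_ulogic_vector( result );
      add <= instn( add_start TO add_end );  -- registered so it stays in step with res
    END IF;
  END PROCESS;
END ARCHITECTURE m;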

Multipliers - Pipelined
• Pipelining will
  • increase throughput (results produced per second)
• but also
  • increase total latency (time to produce full result)
• Insert registers to capture partial sums
• Benefits
  • Simple
  • Regular
  • Register width can vary
    • Need to capture operands also!
  • Usual pipeline advantages
• Inserting a register at every stage may not produce a benefit!

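Multipliers
• We can add the partial products with FA (full adder) blocks
• Note that an extra adder is needed below the last row to add the last partial products and the carries from the row above!
[Figure: array of FA blocks, one row of four per multiplier bit b0, b1, b2, with multiplicand inputs a3 a2 a1 a0 and a 0 fed into the first row; product bits p0, p1, … emerge from the array, and the final row is labelled ‘Carry select adder’]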
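A small behavioural model in Python can make the structure concrete. This is a sketch under stated assumptions, not the circuit in the figure: each row of full adders absorbs one partial product in carry-save form, and a final adder (a plain ripple-carry loop here, where the figure labels a carry select adder) combines the remaining sum and carry bits. The function names and the use of full-width rows are choices made for simplicity.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def array_multiply(a, b, width=4):
    """Multiply two unsigned width-bit integers using only AND gates
    and full adders (carry-save rows plus one final adder)."""
    n = 2 * width
    s = [0] * n          # sum bits passed down to the next row
    c = [0] * n          # carry bits passed down to the next row
    for j in range(width):
        # Partial product for multiplier bit b_j, shifted left by j.
        pp = [0] * n
        for i in range(width):
            pp[i + j] = ((a >> i) & 1) & ((b >> j) & 1)
        # One row of full adders: carries are saved, not rippled.
        new_s = [0] * n
        new_c = [0] * n
        for k in range(n):
            new_s[k], cout = full_adder(s[k], c[k], pp[k])
            if k + 1 < n:
                new_c[k + 1] = cout
        s, c = new_s, new_c
    # The extra adder below the last row: add the sums and saved carries.
    result, carry = 0, 0
    for k in range(n):
        bit, carry = full_adder(s[k], c[k], carry)
        result |= bit << k
    return result


# Exhaustive check for 4-bit operands.
assert all(array_multiply(x, y) == x * y
           for x in range(16) for y in range(16))
```

In a pipelined version, the `s` and `c` vectors passed from one row to the next are the partial sums that the pipeline registers would capture, along with the operand bits still needed by later rows.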
