
Graduate Computer Architecture I


Presentation Transcript


  1. Graduate Computer Architecture I Lecture 16: FPGA Design

  2. Emergence of FPGA
  • Great for Prototyping and Testing
    • Enables logic verification without the high cost of fabrication
    • Reprogrammable → well suited to research and education
    • Meets most computational requirements
    • Options for transferring the design to an ASIC
  • Technology Advances
    • Huge FPGAs are available
      • Up to 200,000 logic units
      • Clock rates above 500 MHz
    • Competitive pricing

  3. System on Chip (SoC)
  • Large Embedded Memories
    • Up to 10 megabits of on-chip memory (Virtex 4)
    • High bandwidth and reconfigurable
  • Processor IP Cores
    • Many soft processor cores (some open source)
    • Embedded processor cores: PowerPC, Nios RISC, etc. – 450+ MHz
  • Simple Digital Signal Processing Cores
    • Up to 512 DSP blocks on Virtex 4
  • Interconnects
    • High-speed network I/O (10 Gbps)
    • Built-in Ethernet MACs (soft/hard core)
  • Security
    • Embedded 256-bit AES encryption

  4. Potential Advantages of FPGAs

  5. Designing with FPGAs
  • Opportunities
    • Hardware logic is programmable
    • Immediate testing on the actual platform
  • Challenges
    • Programming environment
      • Think and design in 2-D instead of 1-D
      • Consider hardware limitations
    • Hardware synthesis
      • Smart language interpretation and translation
      • Efficient HW resource utilization

  6. Today
  • Programming Environment
    • Object-oriented programming models
    • Template-based language editors
    • Hardware/software co-design
      • Still a disconnect between SW and HW methods
      • Lack of education to bring them together
  • Hardware Synthesis
    • Getting smarter, but not smart enough
    • Tuned specifically for each platform
    • Not able to take full advantage of resources
    • Manual tweaking and use of templates

  7. High Performance Design in FPGA
  • Fine-Grain Pipelining
    • Reduces the critical path
    • One level of look-up table between D flip-flops
    • Works best for streaming data with little or no data dependence
  • Logic Resources
    • Smaller designs often run faster
    • Use all available resources
    • Fewer map and place conflicts
    • Quicker compilation
  • Parallel Engines
    • Exploit parallelism in the application
    • Faster place and route

  8. Pipelining
  • DEFINITION:
    • A K-stage pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output.
    • A combinational circuit is thus a 0-stage pipeline.
  • CONVENTION:
    • Every pipeline stage, hence every K-stage pipeline, has a register on its OUTPUT (not on its input).
  • ALWAYS:
    • The clock common to all registers must have a period sufficient to cover propagation over combinational paths + (input) register propagation delay + (output) register setup time (worked example below).
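
  For a rough worked example of this clock constraint, take the Virtex-4 numbers quoted on the decoder slides later in the lecture: an unpipelined combinational path of 8.281 ns bounds the clock at 1 / 8.281 ns ≈ 121 MHz, while the pipelined path of one LUT4 level plus DFF propagation delay and setup time, 2.198 ns in total, allows 1 / 2.198 ns ≈ 455 MHz, the quoted 3.76x speedup.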

  9. Bad Pipelining
  • You cannot just add registers at random points.
  • Successive inputs get mixed: e.g., B(A(Xi+1), Yi)
  • This happens because some paths from inputs to outputs have 2 registers, and some have only 1!
  • Not a well-formed K-pipeline!

  10. Adding Pipelines
  • Method (see the sketch after this slide)
    • Draw a line that crosses every output in the circuit and mark its endpoints as terminal points.
    • Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction.
    • These lines represent pipeline stages.
  • Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline.
  • Focus on the slowest part of the circuit.
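
  A minimal VHDL sketch of the cut method applied to the two-path circuit B(A(X), Y) from the previous slide. Everything named here is illustrative rather than from the lecture: the entity PIPE2 and the placeholder functions A_FUNC and B_FUNC simply stand in for arbitrary combinational logic, and each register marks a point where a separating line crosses a connection.

    library ieee;
    use ieee.std_logic_1164.all;

    entity PIPE2 is
      port( CLK  : in  std_logic;
            X, Y : in  std_logic_vector(7 downto 0);
            Z    : out std_logic_vector(7 downto 0) );
    end PIPE2;

    architecture rtl of PIPE2 is
      -- placeholder combinational functions
      function A_FUNC(v : std_logic_vector) return std_logic_vector is
      begin return not v; end;
      function B_FUNC(u, v : std_logic_vector) return std_logic_vector is
      begin return u xor v; end;
      signal a_r, y_r : std_logic_vector(7 downto 0);  -- registers on cut 1
    begin
      process (CLK)
      begin
        if rising_edge(CLK) then
          a_r <= A_FUNC(X);         -- cut 1 crosses A's output...
          y_r <= Y;                 -- ...and also the direct Y path (the "bad pipeline" omits this register)
          Z   <= B_FUNC(a_r, y_r);  -- cut 2 crosses the output: every path now has exactly 2 registers
        end if;
      end process;
    end rtl;

  Because both cuts cross every input-to-output path, an X and Y presented in the same cycle emerge together two cycles later, avoiding the B(A(Xi+1), Yi) mixing shown on the previous slide.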

  11. Pipelining Example
  • 8-bit to 256-bit decoder
  • 256 different output combinations; the output is 256 bits wide, one bit per input code

    library ieee;
    use ieee.std_logic_1164.all;

    entity DECODER is
      port( I : in  std_logic_vector(7 downto 0);
            O : out std_logic_vector(255 downto 0));
    end DECODER;

    architecture behavioral of DECODER is
    begin
      process (I)
      begin
        case I is
          when "00000000" => O <= "1000...0000";
          when "00000001" => O <= "0100...0000";
          when "00000010" => O <= "0010...0000";
          ...
          when "11111110" => O <= "0000...0010";
          when "11111111" => O <= "0000...0001";
        end case;
      end process;
    end behavioral;

  12. Hardware Synthesis
  [Diagram: 256 blocks of combinational LUT4 logic, one per output bit, decoding I(7:0) into O(255:0)]
  • Synthesis
    • Uses at least three 4-input look-up tables (LUT4s) per output bit to decode the 256 combinations of I(7:0)
  • Resource Usage
    • 3 LUT4s x 256 outputs = 768 LUT4s
  • Critical Path
    • Input/output pin delays
    • 2 levels of LUT4 (sometimes 3 levels?!)
    • Virtex 4, speed grade 11: 8.281 ns → 121 MHz

  13. Pipelined Decoder
  [Diagram: the same LUT4 decoder with D flip-flops added at the I/O pins and after every LUT4]
  • Input/Output Pin DFFs
    • Already present in most FPGAs
    • Minimize pin latencies
  • DFF after Every LUT4 (sketch after this slide)
    • Each LUT4 is already followed by a DFF in the logic block (why not use it)
    • Only when possible
    • Minimizes logic latency
  • FPGA Resources
    • 768 LUT4s as before
    • Plus 768 DFFs and 264 pin DFFs
    • But not really…
  • Critical Path
    • 1 level of LUT4, plus a small DFF propagation delay and setup time
    • Virtex 4, speed grade 11: 2.198 ns → 455 MHz
    • 3.76x speedup
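
  A behavioral VHDL sketch of this pipelining step, assuming the same interface as the DECODER on slide 11; the entity name DECODER_PIPE is mine, and a synthesis tool may still place the LUT-level registers differently from the hand count above.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity DECODER_PIPE is
      port( CLK : in  std_logic;
            I   : in  std_logic_vector(7 downto 0);
            O   : out std_logic_vector(255 downto 0));
    end DECODER_PIPE;

    architecture behavioral of DECODER_PIPE is
      signal i_reg : std_logic_vector(7 downto 0);    -- input pin register
    begin
      process (CLK)
        variable one_hot : std_logic_vector(255 downto 0);
      begin
        if rising_edge(CLK) then
          i_reg   <= I;                               -- register the input pins
          one_hot := (others => '0');
          one_hot(255 - to_integer(unsigned(i_reg))) := '1';  -- decode the registered input ("00000000" -> O(255), as on slide 11)
          O <= one_hot;                               -- register after the LUT stage
        end if;
      end process;
    end behavioral;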

  14. Logic Resource
  • Leverage the FPGA Architecture
    • Design for similarity with the architecture: a LUT plus a few special logic elements, followed by a DFF
    • A smaller design is often faster: easier for the tools to map, place, and route
    • Optimize designs wherever possible
  • In an FPGA, each wire has a large fanout limit
    • Fanout → the capacity of a wire to drive the inputs of other logic
    • Reuse logic and intermediate results

  15. Reusing Logic
  [Diagram: two shared 4-to-16 LUT4 decoders feeding 256 two-input AND gates, one per output bit of O(255:0)]
  • Synthesis Tools
    • Obvious duplicated logic is combined automatically
    • Most of it is not optimized
  • Decoder Example (see the sketch after this slide)
    • Two 4-bit to 16-bit decoders
    • Combining the decoder outputs: two 16-bit one-hot halves → 256-bit output
  • Critical Path
    • 1 level of LUT4 – approximately the same; the differences are in wire delay
  • FPGA Resources
    • I/O DFFs remain the same
    • 2 x 16 LUT4s and DFFs, plus 256 LUT4s and DFFs
    • Total: 288 LUT4s and DFFs!
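
  A structural VHDL sketch of this reuse, under the same port list as the DECODER on slide 11; the entity name DECODER2 and the signals hi and lo are my own labels, and the pipeline registers are omitted here to keep the sharing visible.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity DECODER2 is
      port( I : in  std_logic_vector(7 downto 0);
            O : out std_logic_vector(255 downto 0));
    end DECODER2;

    architecture structural of DECODER2 is
      signal hi, lo : std_logic_vector(15 downto 0);  -- two shared 4-to-16 predecoders
    begin
      gen_pre : for k in 0 to 15 generate
        hi(k) <= '1' when to_integer(unsigned(I(7 downto 4))) = k else '0';  -- one LUT4 each
        lo(k) <= '1' when to_integer(unsigned(I(3 downto 0))) = k else '0';  -- one LUT4 each
      end generate;

      -- 256 two-input AND gates combine the two one-hot halves
      gen_and : for k in 0 to 255 generate
        O(255 - k) <= hi(k / 16) and lo(k mod 16);    -- same bit ordering as slide 11
      end generate;
    end structural;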

  16. Virtex 4 – Elementary Logic Block
  [Diagram of one logic block: 4-input LUTs, 2-to-1 multiplexors (MUXF), and 1-bit D flip-flops]

  17. Using MUXF as 2-input Gates
  [Diagram: a MUXF with one data input tied to a constant '0' or '1', the other data input driven by b, and the select driven by a implements a 2-input gate z]
  • Inverters can be pushed into the LUT4 or the DFF (by using the inverted Q output) – instantiation sketch below
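
  For example, a sketch of one such gate as an explicit primitive instantiation, assuming the Xilinx UNISIM MUXF5 component stands in for the slide's generic MUXF; the wrapper entity MUX_AND is mine.

    library ieee;
    use ieee.std_logic_1164.all;
    library UNISIM;
    use UNISIM.vcomponents.all;

    entity MUX_AND is
      port( a, b : in  std_logic;
            z    : out std_logic );
    end MUX_AND;

    architecture structural of MUX_AND is
    begin
      -- selects '0' when a = '0' and b when a = '1', i.e. z = a AND b
      u_and : MUXF5
        port map ( O => z, I0 => '0', I1 => b, S => a );
    end structural;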

  18. Using Unused Multiplexors
  [Diagram: the two shared 4-to-16 decoders now feeding 256 MUXF-based AND gates, one per output bit of O(255:0)]
  • Decoder Example (variant sketch after this slide)
    • Replace all LUT4s in the second decoder stage with MUX-based 2-input AND gates
  • Critical Path
    • Same: 2.198 ns → 455 MHz
  • FPGA Resources
    • I/O DFFs remain the same
    • 256 MUXFs and DFFs
    • 32 LUT4s and DFFs
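
  Continuing the DECODER2 sketch from slide 15, a possible variant in which the generate loop for the 256 second-stage AND gates instantiates the UNISIM MUXF5 primitive (as on the previous slide) instead of inferring LUT4s; the architecture name structural_muxf is mine.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    library UNISIM;
    use UNISIM.vcomponents.all;

    architecture structural_muxf of DECODER2 is
      signal hi, lo : std_logic_vector(15 downto 0);  -- shared 4-to-16 predecoders (32 LUT4s)
    begin
      gen_pre : for k in 0 to 15 generate
        hi(k) <= '1' when to_integer(unsigned(I(7 downto 4))) = k else '0';
        lo(k) <= '1' when to_integer(unsigned(I(3 downto 0))) = k else '0';
      end generate;

      -- 256 AND gates mapped onto MUXF5s instead of LUT4s
      gen_and : for k in 0 to 255 generate
        u_and : MUXF5
          port map ( O => O(255 - k), I0 => '0', I1 => lo(k mod 16), S => hi(k / 16) );
      end generate;
    end structural_muxf;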

  19. Parallel Design
  • Use Area to Increase Performance (see the sketch after this slide)
    • Increase the input bandwidth (input bus width) to process multiple data items at a time
    • Duplicate engines to process independent data sets (thread/object-level parallelism)
    • Instruction-level parallelism: unroll loops to expose the parallelism
  • Excellent for Streaming Data Applications
    • Multimedia
    • Network processing
  • Performance Scalability
    • Linear performance increase with size, achieved for many algorithms
    • Sometimes exponential hardware size
    • Try to scale using higher levels of parallelism
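
  A structural VHDL sketch of the duplicate-engine idea: N_ENGINES independent copies of a kernel each process one lane of a widened input bus. The entity PARALLEL_TOP, the generic N_ENGINES, and the ENGINE component are all hypothetical stand-ins for whatever streaming kernel is being replicated.

    library ieee;
    use ieee.std_logic_1164.all;

    entity PARALLEL_TOP is
      generic ( N_ENGINES : integer := 4 );            -- number of duplicated engines
      port( CLK  : in  std_logic;
            DIN  : in  std_logic_vector(N_ENGINES*32 - 1 downto 0);  -- widened input bus
            DOUT : out std_logic_vector(N_ENGINES*32 - 1 downto 0) );
    end PARALLEL_TOP;

    architecture structural of PARALLEL_TOP is
      component ENGINE                                 -- hypothetical streaming kernel
        port( CLK : in  std_logic;
              D   : in  std_logic_vector(31 downto 0);
              Q   : out std_logic_vector(31 downto 0) );
      end component;
    begin
      -- one engine per independent 32-bit lane
      gen_engines : for k in 0 to N_ENGINES - 1 generate
        u_engine : ENGINE
          port map ( CLK => CLK,
                     D   => DIN ((k+1)*32 - 1 downto k*32),
                     Q   => DOUT((k+1)*32 - 1 downto k*32) );
      end generate;
    end structural;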

  20. Summary
  • FPGA Design Methods
    • Fine-grain pipelining to increase clock rate
      • If possible, one level of LUT followed by a DFF
    • Parallel engines to increase bandwidth
      • Duplicate logic to increase performance linearly
    • Reducing logic resource usage
      • Reuse duplicated logic
      • Use all the available embedded logic: there are other resources (e.g., embedded processors, large memories, optimized primitive gates, and IP cores)
  • Best Methods Today
    • Learn the internal architecture of the FPGA
    • Make your own templates and use them
    • Use IP cores
  • Future Research Topics
    • Integration of generalized pipelining algorithms (in the works)
    • Smarter synthesis tools (understanding HDL)
    • Automatic platform-specific optimization techniques
