

CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/2/2009 (Reconfigurable Computing Architectures, VHDL Overview 3). Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA. http://class.ece.iastate.edu/cpre583/.





Presentation Transcript


  1. CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/2/2009 (Reconfigurable Computing Architectures, VHDL Overview 3) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ece.iastate.edu/cpre583/

  2. Overview • Reinforce some common questions • Finish Chapter 1 Lecture • Continue Chapter 2 • VHDL review

  3. Common Questions • How does an FPGA work? • How does VHDL execute on an FPGA? • How many LUTs on the class's FPGA? 44,000 • State machines will be covered more next lecture • Final Project group selection: choose your own groups • Class machine resources • Coover 2048, 1212; Coover 2041 ML507 (will be 2) • Distance students: xilinx.ece.iastate.edu (other servers on the way)

  4. What you should learn • Basic trade-offs associated with different aspects of a Reconfigurable Architecture. (Chapter 2) • Practice with timing diagrams, start state machines

  5. Reconfigurable Architectures • Main idea Chapter 2's author wants to convey • Applications often have one or more small, computationally intense regions of code (kernels) • Can these kernels be sped up using dedicated hardware? • Different kernels have different needs. How do a kernel's requirements guide design decisions when implementing a Reconfigurable Architecture?

  6. Reconfigurable Architectures • Forces that drive a Reconfigurable Architecture • Price • Mass production: 100K to millions • Experimental: 1 to 10's • Granularity of reconfiguration • Fine grain • Coarse grain • Degree of system integration/coupling • Tightly • Loosely • All are a function of the application that will run on the Architecture

  7. Example Points in (Price, Granularity, Coupling) Space [Figure: a three-axis design space — Price ($100's to $1M's), Granularity (Fine to Coarse), Coupling (Loose to Tight). Example points: an Intel/AMD processor (with Exec, Int, Float, Decode, Store units) tightly coupled to an RFU; a PC loosely coupled to an ML507 board over Ethernet]

  8. What's the point of a Reconfigurable Architecture? Increase application performance! • Performance metrics • Computational • Throughput • Latency • Power • Total power dissipation • Thermal • Reliability • Recovery from faults

  9. Typical Approach for Increasing Performance • Application/algorithm implemented in software • Often easier to write an application in software • Profile application (e.g. gprof) • Determine where the application is spending its time • Identify kernels of interest • e.g. application spends 90% of its time in function matrix_multiply() • Design custom hardware/instruction to accelerate kernel(s) • Analyze the kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?) • Amdahl's Law!
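The profile-first workflow above can be sketched in Python, with the standard library's cProfile standing in for gprof (the toy application, matrix sizes, and call counts here are mine, purely for illustration):

```python
import cProfile
import io
import pstats

def matrix_multiply(a, b):
    """Naive O(n^3) multiply -- the hot kernel we expect the profile to expose."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def my_app(n=30, reps=20):
    """Toy application that spends most of its time in matrix_multiply()."""
    a = [[1] * n for _ in range(n)]
    for _ in range(reps):
        matrix_multiply(a, a)

profiler = cProfile.Profile()
profiler.enable()
my_app()
profiler.disable()

# Sort by cumulative time, like reading a gprof flat profile top-down:
# the kernel worth moving to hardware shows up at the top.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The top entries of the report identify the kernel(s) of interest; only then does it make sense to ask whether that kernel has exploitable parallelism.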

  10. Amdahl's Law: Example • Application My_app • Running time: 100 seconds • Spends 90 seconds in matrix_mul() • What is the maximum possible speed up of My_app if I place matrix_mul() in hardware? 10 seconds remain = 10x faster • What if the original My_app spends 99 seconds in matrix_mul()? 1 second remains = 100x faster • Good recent FPGA paper that illustrates increasing an algorithm's performance with hardware: "Novel FPGA Based Haar Classifier Face Detection Algorithm Acceleration", FPL 2008, http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf
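The slide's arithmetic is Amdahl's Law with an idealized, infinitely fast hardware kernel; a minimal check (the function name is mine):

```python
def amdahl_max_speedup(total_time, accelerated_time):
    """Upper bound on whole-application speedup.

    Even an infinitely fast hardware kernel leaves the serial remainder
    running in software, so speedup <= total / (total - accelerated).
    """
    return total_time / (total_time - accelerated_time)

# My_app: 100 s total, 90 s in matrix_mul() -> at best 10 s remain.
print(amdahl_max_speedup(100, 90))   # 10.0x
# If 99 of the 100 s are in matrix_mul(), the bound rises to 100x.
print(amdahl_max_speedup(100, 99))   # 100.0x
```

The lesson for accelerator design: the fraction of time spent in the kernel, not the raw speed of the hardware, caps the achievable speedup.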

  11. Reconfigurable Architectures • RPF -> VIC (short slide)

  12. Granularity

  13.–15. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects [Figure animation across the three slides: an example 3x3 array of ALUs whose interconnect is progressively configured]

  16.–18. Granularity: Fine Grain • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates [Figure animation across the three slides: a grid of Configurable Logic Blocks (CLBs)]

  19.–23. Granularity: Trade-offs • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits of configuration) vs. 10-LUT (1024 = 32x32 bits) [Figure animation: the same small datapath function — 3-bit inputs A and B with a 4-bit op select — mapped onto a microprocessor, onto a collection of 2-LUTs, and onto a single 1024-bit 10-LUT]

  24.–27. Granularity: Trade-offs • Trade-offs associated with LUT size, continued: bit logic and constants, e.g. (A and "1100") or (B or "1000") [Figure animation: the expression mapped onto 2-LUTs (one AND, two ORs) vs. onto 10-LUTs, which need far more area than the 2-LUTs required] • It's much worse: each 10-LUT only has one output
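The storage numbers in these trade-off slides follow from the fact that a k-input LUT stores a full 2^k-bit truth table; a quick check (the helper name is mine):

```python
def lut_config_bits(k):
    """A k-input LUT stores one output bit per input combination: 2**k bits."""
    return 2 ** k

print(lut_config_bits(2))    # 4 bits, as on the slides (2x2)
print(lut_config_bits(10))   # 1024 bits (32x32)

# The flip side of the trade-off: routing a 2-input function through a
# 10-LUT still burns the entire truth table, wasting 1024 - 4 = 1020 bits
# (and the 10-LUT still produces only a single output bit).
print(lut_config_bits(10) - lut_config_bits(2))
```

This exponential growth is why large LUTs suit wide, word-level operations while bit-level logic and constants fit fine-grained 2-LUTs far more efficiently.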

  28. Granularity: Example Architectures • Fine grain: GARP • Coarse grain: PipeRench

  29. Granularity: GARP [Figure: Garp chip block diagram — CPU with I-cache and D-cache, RFU with config cache, both attached to memory]

  30. Granularity: GARP [Figure: Garp chip as on slide 29, plus RFU detail — N rows of PEs (Processing Elements), each row with 16 2-bit execution PEs and 1 control PE]

  31. Granularity: GARP [Figure: RFU detail as on slide 30] • Example computations in one cycle: A<<10 | (b&c), (A - 2*b + c)

  32. Granularity: GARP • Impact of configuration size • 1 GHz bus frequency • 128-bit memory bus • 512 Kbits of configuration data • On an RFU context switch, how long does it take to load a new full configuration? 4 microseconds • An estimate of the time for the CPU to perform a context switch is ~5 microseconds • ~2x increase in context-switch latency!!
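The 4-microsecond figure on this slide falls straight out of the bus parameters; a back-of-the-envelope check (assuming 512 Kbit = 512,000 bits as the slide's round number implies; with 512 * 1024 bits it comes to ~4.1 us instead):

```python
def full_config_load_time_us(config_bits, bus_width_bits, bus_freq_hz):
    """Time to stream an entire configuration over the memory bus, in us.

    One bus-width transfer per cycle: cycles = bits / width,
    then convert cycles at the given frequency to microseconds.
    """
    cycles = config_bits / bus_width_bits
    return cycles / bus_freq_hz * 1e6

# 512 Kbit configuration over a 128-bit bus at 1 GHz.
load_us = full_config_load_time_us(512_000, 128, 1e9)
print(load_us)   # 4.0 microseconds

# Against a ~5 us CPU context switch, reloading the RFU configuration
# roughly doubles the effective context-switch latency.
print((5 + load_us) / 5)
```

This is the cost that motivates configuration caches and partial reconfiguration in architectures like Garp.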

  33. Granularity: GARP [Figure: RFU detail as on slide 30] • "The Garp Architecture and C Compiler" • http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf

  34. Granularity: PipeRench • Coarse granularity • Higher-level programming • Reference papers • PipeRench: A Coprocessor for Streaming Multimedia Acceleration (ISCA 1999): http://www.cs.cmu.edu/~mihaib/research/isca99.pdf • PipeRench Implementation of the Instruction Path Coprocessor (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf

  35. Granularity: PipeRench [Figure: a stripe of PEs — each PE contains an 8-bit ALU and a register file; PEs are connected by a per-stripe interconnect and a global bus]

  36.–49. Granularity: PipeRench [Figure animation, cycles 1–6. Top panel: a 5-stage virtual pipeline (stages 0–4) filling a fabric with enough physical PE rows, one new stage configured per cycle. Bottom panel: the same 5 virtual stages virtualized onto only 3 physical pipeline stages — stages 0–2 are configured in cycles 1–3, then stages 3 and 4 are loaded round-robin over the rows holding stages 0 and 1, and stage 0 reappears in cycle 6]
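The cycle-by-cycle fill shown in these animation frames can be reproduced with a tiny scheduler model: when there are more virtual pipeline stages than physical PE rows, each new stage's configuration is loaded round-robin, evicting the stage loaded num_rows cycles earlier. The function name and the exact eviction rule below are my reading of the slides' animation, not the PipeRench specification:

```python
def physical_rows(cycle, num_rows, num_virtual_stages):
    """Which virtual stage occupies each physical row at a given cycle (0-indexed).

    Row r is (re)configured at cycles r, r + num_rows, r + 2*num_rows, ...
    with successive virtual stages taken modulo the virtual pipeline depth,
    so the pipeline wraps around once all stages have been loaded.
    """
    rows = []
    for r in range(num_rows):
        if cycle < r:
            rows.append(None)  # this row has not been configured yet
        else:
            last_config = cycle - ((cycle - r) % num_rows)
            rows.append(last_config % num_virtual_stages)
    return rows

# 5 virtual stages on only 3 physical rows (the slides' bottom panel,
# with the slides' cycles 1-6 here numbered 0-5):
for c in range(6):
    print(c, physical_rows(c, 3, 5))
```

At cycle 3 stage 3 overwrites stage 0's row, at cycle 4 stage 4 overwrites stage 1's row, and at cycle 5 stage 0 returns, matching the wrap-around the frames depict.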

  50. Degree of Integration/Coupling • Independent Reconfigurable Coprocessor • Reconfigurable Fabric does not have direct communication with the CPU • Processor + Reconfigurable Processing Fabric • Loosely coupled on the same chip • Tightly coupled on the same chip
