

CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/2/2009 (Reconfigurable Computing Architectures, VHDL Overview 3). Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA. http://class.ece.iastate.edu/cpre583/.





Presentation Transcript


  1. CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/2/2009 (Reconfigurable Computing Architectures, VHDL Overview 3) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ece.iastate.edu/cpre583/

  2. Overview • Reinforce some common questions • Finish Chapter 1 Lecture • Continue Chapter 2 • VHDL review

  3. Common Questions • How does an FPGA work? • How does VHDL execute on an FPGA? • How many LUTs on the class's FPGA? 44,000 • State machines will be covered more next lecture • Final Project group selection: choose your own groups • Class machine resources • Coover 2048, 1212; Coover 2041 ML507 (will be 2) • Distance students: xilinx.ece.iastate.edu (other servers on the way)

  4. What you should learn • Basic trade-offs associated with different aspects of a Reconfigurable Architecture. (Chapter 2) • Practice with timing diagrams, start state machines

  5. Reconfigurable Architectures • Main idea Chapter 2's author wants to convey • Applications often have one or more small, computationally intense regions of code (kernels) • Can these kernels be sped up using dedicated hardware? • Different kernels have different needs. How do a kernel's requirements guide design decisions when implementing a Reconfigurable Architecture?

  6. Reconfigurable Architectures • Forces that drive a Reconfigurable Architecture • Price • Mass production: 100K to millions • Experimental: 1 to 10's • Granularity of reconfiguration • Fine grain • Coarse grain • Degree of system integration/coupling • Tightly • Loosely • All are a function of the application that will run on the Architecture

  7. Example Points in (Price, Granularity, Coupling) Space [Figure: a three-axis design space — Price ($100's to $1M's), Granularity (Fine to Coarse), Coupling (Loose to Tight). Example points: an Intel/AMD processor (with Exec, Int, Float, Decode, Store units) tightly coupled to an RFU; a PC loosely coupled to an ML507 board over Ethernet]

  8. What's the point of a Reconfigurable Architecture? Increase application performance! • Performance metrics • Computational • Throughput • Latency • Power • Total power dissipation • Thermal • Reliability • Recovery from faults

  9. Typical Approach for Increasing Performance • Application/algorithm implemented in software • Often easier to write an application in software • Profile application (e.g. gprof) • Determine where the application is spending its time • Identify kernels of interest • e.g. application spends 90% of its time in function matrix_multiply() • Design custom hardware/instruction to accelerate kernel(s) • Analyze the kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?) • Amdahl's Law!
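The profile-first workflow above can be sketched in Python, with the standard library's cProfile standing in for gprof (the toy application, matrix sizes, and call counts here are mine, purely for illustration):

```python
import cProfile
import io
import pstats

def matrix_multiply(a, b):
    """Naive O(n^3) multiply -- the hot kernel we expect the profile to expose."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def my_app(n=30, reps=20):
    """Toy application that spends most of its time in matrix_multiply()."""
    a = [[1] * n for _ in range(n)]
    for _ in range(reps):
        matrix_multiply(a, a)

profiler = cProfile.Profile()
profiler.enable()
my_app()
profiler.disable()

# Sort by cumulative time, like reading a gprof flat profile top-down:
# the kernel worth moving to hardware shows up at the top.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The top entries of the report identify the kernel(s) of interest; only then does it make sense to ask whether that kernel has exploitable parallelism.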

  10. Amdahl's Law: Example • Application My_app • Running time: 100 seconds • Spends 90 seconds in matrix_mul() • What is the maximum possible speed up of My_app if I place matrix_mul() in hardware? 10 seconds remain = 10x faster • What if the original My_app spends 99 seconds in matrix_mul()? 1 second remains = 100x faster • Good recent FPGA paper that illustrates increasing an algorithm's performance with hardware: "Novel FPGA Based Haar Classifier Face Detection Algorithm Acceleration", FPL 2008, http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf
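The slide's arithmetic is Amdahl's Law with an idealized, infinitely fast hardware kernel; a minimal check (the function name is mine):

```python
def amdahl_max_speedup(total_time, accelerated_time):
    """Upper bound on whole-application speedup.

    Even an infinitely fast hardware kernel leaves the serial remainder
    running in software, so speedup <= total / (total - accelerated).
    """
    return total_time / (total_time - accelerated_time)

# My_app: 100 s total, 90 s in matrix_mul() -> at best 10 s remain.
print(amdahl_max_speedup(100, 90))   # 10.0x
# If 99 of the 100 s are in matrix_mul(), the bound rises to 100x.
print(amdahl_max_speedup(100, 99))   # 100.0x
```

The lesson for accelerator design: the fraction of time spent in the kernel, not the raw speed of the hardware, caps the achievable speedup.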

  11. Reconfigurable Architectures • RPF -> VIC (short slide)

  12. Granularity

  13.–15. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects [Figure animation across the three slides: an example 3x3 array of ALUs whose interconnect is progressively configured]

  16.–18. Granularity: Fine Grain • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates [Figure animation across the three slides: a grid of Configurable Logic Blocks (CLBs)]

  19.–23. Granularity: Trade-offs • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits of configuration) vs. 10-LUT (1024 = 32x32 bits) [Figure animation: the same small datapath function — 3-bit inputs A and B with a 4-bit op select — mapped onto a microprocessor, onto a collection of 2-LUTs, and onto a single 1024-bit 10-LUT]

  24.–27. Granularity: Trade-offs • Trade-offs associated with LUT size, continued: bit logic and constants, e.g. (A and "1100") or (B or "1000") [Figure animation: the expression mapped onto 2-LUTs (one AND, two ORs) vs. onto 10-LUTs, which need far more area than the 2-LUTs required] • It's much worse: each 10-LUT only has one output
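The storage numbers in these trade-off slides follow from the fact that a k-input LUT stores a full 2^k-bit truth table; a quick check (the helper name is mine):

```python
def lut_config_bits(k):
    """A k-input LUT stores one output bit per input combination: 2**k bits."""
    return 2 ** k

print(lut_config_bits(2))    # 4 bits, as on the slides (2x2)
print(lut_config_bits(10))   # 1024 bits (32x32)

# The flip side of the trade-off: routing a 2-input function through a
# 10-LUT still burns the entire truth table, wasting 1024 - 4 = 1020 bits
# (and the 10-LUT still produces only a single output bit).
print(lut_config_bits(10) - lut_config_bits(2))
```

This exponential growth is why large LUTs suit wide, word-level operations while bit-level logic and constants fit fine-grained 2-LUTs far more efficiently.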

  28. Granularity: Example Architectures • Fine grain: GARP • Coarse grain: PipeRench

  29. Granularity: GARP [Figure: Garp chip block diagram — CPU with I-cache and D-cache, RFU with config cache, both attached to memory]

  30. Granularity: GARP [Figure: Garp chip as on slide 29, plus RFU detail — N rows of PEs (Processing Elements), each row with 16 2-bit execution PEs and 1 control PE]

  31. Granularity: GARP [Figure: RFU detail as on slide 30] • Example computations in one cycle: A<<10 | (b&c), (A - 2*b + c)

  32. Granularity: GARP • Impact of configuration size • 1 GHz bus frequency • 128-bit memory bus • 512 Kbits of configuration data • On an RFU context switch, how long does it take to load a new full configuration? 4 microseconds • An estimate of the time for the CPU to perform a context switch is ~5 microseconds • ~2x increase in context-switch latency!!
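The 4-microsecond figure on this slide falls straight out of the bus parameters; a back-of-the-envelope check (assuming 512 Kbit = 512,000 bits as the slide's round number implies; with 512 * 1024 bits it comes to ~4.1 us instead):

```python
def full_config_load_time_us(config_bits, bus_width_bits, bus_freq_hz):
    """Time to stream an entire configuration over the memory bus, in us.

    One bus-width transfer per cycle: cycles = bits / width,
    then convert cycles at the given frequency to microseconds.
    """
    cycles = config_bits / bus_width_bits
    return cycles / bus_freq_hz * 1e6

# 512 Kbit configuration over a 128-bit bus at 1 GHz.
load_us = full_config_load_time_us(512_000, 128, 1e9)
print(load_us)   # 4.0 microseconds

# Against a ~5 us CPU context switch, reloading the RFU configuration
# roughly doubles the effective context-switch latency.
print((5 + load_us) / 5)
```

This is the cost that motivates configuration caches and partial reconfiguration in architectures like Garp.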

  33. Granularity: GARP [Figure: RFU detail as on slide 30] • "The Garp Architecture and C Compiler" • http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf

  34. Granularity: PipeRench • Coarse granularity • Higher-level programming • Reference papers • PipeRench: A Coprocessor for Streaming Multimedia Acceleration (ISCA 1999): http://www.cs.cmu.edu/~mihaib/research/isca99.pdf • PipeRench Implementation of the Instruction Path Coprocessor (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf

  35. Granularity: PipeRench [Figure: a stripe of PEs — each PE contains an 8-bit ALU and a register file; PEs are connected by a per-stripe interconnect and a global bus]

  36.–49. Granularity: PipeRench [Figure animation, cycles 1–6. Top panel: a 5-stage virtual pipeline (stages 0–4) filling a fabric with enough physical PE rows, one new stage configured per cycle. Bottom panel: the same 5 virtual stages virtualized onto only 3 physical pipeline stages — stages 0–2 are configured in cycles 1–3, then stages 3 and 4 are loaded round-robin over the rows holding stages 0 and 1, and stage 0 reappears in cycle 6]
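The cycle-by-cycle fill shown in these animation frames can be reproduced with a tiny scheduler model: when there are more virtual pipeline stages than physical PE rows, each new stage's configuration is loaded round-robin, evicting the stage loaded num_rows cycles earlier. The function name and the exact eviction rule below are my reading of the slides' animation, not the PipeRench specification:

```python
def physical_rows(cycle, num_rows, num_virtual_stages):
    """Which virtual stage occupies each physical row at a given cycle (0-indexed).

    Row r is (re)configured at cycles r, r + num_rows, r + 2*num_rows, ...
    with successive virtual stages taken modulo the virtual pipeline depth,
    so the pipeline wraps around once all stages have been loaded.
    """
    rows = []
    for r in range(num_rows):
        if cycle < r:
            rows.append(None)  # this row has not been configured yet
        else:
            last_config = cycle - ((cycle - r) % num_rows)
            rows.append(last_config % num_virtual_stages)
    return rows

# 5 virtual stages on only 3 physical rows (the slides' bottom panel,
# with the slides' cycles 1-6 here numbered 0-5):
for c in range(6):
    print(c, physical_rows(c, 3, 5))
```

At cycle 3 stage 3 overwrites stage 0's row, at cycle 4 stage 4 overwrites stage 1's row, and at cycle 5 stage 0 returns, matching the wrap-around the frames depict.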

  50. Degree of Integration/Coupling • Independent Reconfigurable Coprocessor • Reconfigurable Fabric does not have direct communication with the CPU • Processor + Reconfigurable Processing Fabric • Loosely coupled on the same chip • Tightly coupled on the same chip
