1 / 65

CPRE 583 Reconfigurable Computing Lecture 5: Wed 9/8/2010 (Reconfigurable Computing Architectures)

CPRE 583 Reconfigurable Computing Lecture 5: Wed 9/8/2010 (Reconfigurable Computing Architectures). Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA. http://class.ece.iastate.edu/cpre583/. Overview.

drake
Download Presentation

CPRE 583 Reconfigurable Computing Lecture 5: Wed 9/8/2010 (Reconfigurable Computing Architectures)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CPRE 583Reconfigurable ComputingLecture 5: Wed 9/8/2010(Reconfigurable Computing Architectures) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ece.iastate.edu/cpre583/

  2. Overview • Chapter 2 (Reconfigurable Architectures)

  3. Common Questions

  4. Common Questions

  5. Common Questions

  6. What you should learn • Basic trade-offs associated with different aspects of a Reconfigurable Architecture. (Chapter 2)

  7. Reconfigurable Architectures • Main Idea Chapter 2’s author wants to convey • Applications often have one or more small computationally intense regions of code (kernels) • Can these kernels be sped up using dedicated hardware? • Different kernels have different needs. How does a kernels requirements guide design decisions when implementing a Reconfigurable Architecture?

  8. Reconfigurable Architectures • Forces that drive a Reconfigurable Architecture • Price • Mass production 100K to millions • Experimental 1 to 10’s • Granularity of reconfiguration • Fine grain • Course Grain • Degree of system integration/coupling • Tightly • Loosely All are a function of the application that will run on the Architecture

  9. Example Points in (Price,Granularity,Coupling) Space $1M’s Exec Int Intel / AMD Decode Store float RFU Processor Price Coupling Tight $100’s Loose Coarse PC Ethernet Granularity ML507 Fine

  10. What’s the point of a Reconfigurable Architecture • Performance metrics • Computational • Throughput • Latency • Power • Total power dissipation • Thermal • Reliability • Recovery from faults Increase application performance!

  11. Typical Approach for Increasing Performance • Application/algorithm implemented in software • Often easier to write an application in software • Profile application (e.g. gprof) • Determine where the application is spending its time • Identify kernels of interest • e.g. application spends 90% of its time in function matrix_multiply() • Design custom hardware/instruction to accelerate kernel(s) • Analysis to kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?) Amdahl’s Law!

  12. Amdahl’s Law: Example • Application My_app • Running time: 100 seconds • Spends 90 seconds in matrix_mul() • What is the maximum possible speed up of My_app if I place matrix_mul() in hardware? • What if the original My_app spends 99 seconds in matrx_mul()? 10 seconds = 10x faster 1 seconds = 100x faster Good recent FPGA paper that illustrates increasing an algorithm’s performance with Hardware “NOVEL FPGA BASED HAAR CLASSIFIER FACE DETECTION ALGORITHM ACCELERATION”, FPL 2008 http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf

  13. Granularity

  14. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects Example ALU ALU ALU ALU ALU ALU ALU ALU ALU

  15. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects Example ALU ALU ALU ALU ALU ALU ALU ALU ALU

  16. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects Example ALU ALU ALU ALU ALU ALU ALU ALU ALU

  17. Granularity: Fine Grain CLB CLB CLB CLB CLB CLB CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB CLB CLB • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates

  18. Granularity: Fine Grain Configurable Logic Block • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB

  19. Granularity: Fine Grain Configurable Logic Block • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates CLB CLB CLB CLB CLB CLB CLB CLB

  20. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 10-LUT Microprocessor 1024-bits

  21. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 op 3 A 3 10-LUT Microprocessor 3 B 1024-bits

  22. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 op 3 A 3 10-LUT Microprocessor 3 B 4 op 1024-bits 3 A 3 B 3 4 op 3 3 A B 3

  23. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 op A 3 10-LUT Microprocessor 3 3 B 1024-bits op A 3 3 3 B 4 op 3 A 3 B 3

  24. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 4 op op A A 3 3 10-LUT Microprocessor 3 3 3 3 B B 1024-bits 4 op 3 A 3 3 B

  25. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 10-LUT Bit logic and constants 1024-bits

  26. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 10-LUT Bit logic and constants 1024-bits (A and “1100”) or (B or “1000”)

  27. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT A 10-LUT B Bit logic and constants 1024-bits (A and “1100”) or (B or “1000”)

  28. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT AND 4 A 10-LUT 1 Bit logic and constants 1024-bits OR Area that was required using 2-LUTS (A and “1100”) or (B or “1000”) 0 OR 4 B It’s much worse, each 10-LUT only has one output

  29. Granularity: Example Architectures • Fine grain: GARP • Course grain: PipeRench

  30. Granularity: GARP Memory D-cache I-cache CPU RFU Config cache Garp chip

  31. Granularity: GARP Memory RFU Execution (16, 2-bit) control (1) D-cache I-cache CPU RFU N Config cache PE (Processing Element) Garp chip

  32. Granularity: GARP Memory RFU Execution (16, 2-bit) control (1) D-cache I-cache CPU RFU N Config cache PE (Processing Element) Garp chip Example computations in one cycle A<<10 | (b&c) (A-2*b+c)

  33. Granularity: GARP Memory • Impact of configuration size • 1 GHz bus frequency • 128-bit memory bus • 512Kbits of configuration size D-cache I-cache On a RFU context switch how long to load a new full configuration? CPU RFU 4 microseconds An estimate of amount of time for the CPU perform a context switch is ~5 microseconds Config cache Garp chip ~2x increase context switch latency!!

  34. Granularity: GARP Memory RFU Execution (16, 2-bit) control (1) D-cache I-cache CPU RFU N Config cache PE (Processing Element) Garp chip • “The Garp Architecture and C Compiler” • http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf

  35. Granularity: PipeRench • Coarse granularity • Higher (higher) level programming • Reference papers • PipeRench: A Coprocessor for Streaming Multimedia Acceleration (ISCA 1999): http://www.cs.cmu.edu/~mihaib/research/isca99.pdf • PipeRench Implementation of the Instruction Path Coprocessor (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf

  36. Granularity: PipeRench PE PE PE PE PE PE PE PE PE 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU Reg file Reg file Reg file Reg file Reg file Reg file Reg file Reg file Reg file Interconnect Global bus Interconnect

  37. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 PE PE PE PE PE PE PE PE PE PE PE PE

  38. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 PE PE PE PE PE PE PE PE PE PE PE PE

  39. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 PE PE PE PE 1 PE PE PE PE PE PE PE PE

  40. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 PE PE PE PE 1 1 2 PE PE PE PE PE PE PE PE

  41. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 PE PE PE PE 1 1 1 2 2 PE PE PE PE 3 PE PE PE PE

  42. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 4 PE PE PE PE

  43. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE

  44. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2

  45. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0

  46. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 1

  47. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 1 1 2

  48. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 3 1 1 1 2 2

  49. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 3 3 1 1 1 4 2 2 2

  50. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 3 3 3 1 1 1 4 4 2 2 2 0

More Related