
FPGA and Reconfigurable Computing

FPGA and Reconfigurable Computing. Wu, Jinyuan Fermilab ICT May, 2009. Outline. Electronic Aspect of FPGA: LED Flashing Logic Elements in a Nutshell TDC and ADC FPGA as a Computing Fabric: Moore’s Law Forever? Space Charge Computing with FPGA Cores Doublet Matching & Hash Sorter


Presentation Transcript


  1. FPGA and Reconfigurable Computing Wu, Jinyuan Fermilab ICT May, 2009 Wu Jinyuan, Fermilab jywu168@fnal.gov

  2. Outline • Electronic Aspect of FPGA: • LED Flashing • Logic Elements in a Nutshell • TDC and ADC • FPGA as a Computing Fabric: • Moore’s Law Forever? • Space Charge Computing with FPGA Cores • Doublet Matching & Hash Sorter • Triplet Matching & Tiny Triplet Finder • Enclosed Loop Micro-Sequencer (ELMS)

  3. Moore’s Law • The number of transistors in a package doubles every 18 months. (Figure taken from www.intel.com)

  4. Status of Moore’s Law: an Inconvenient Truth • Number of transistors: yes, via multi-core. • Clock speed: ? (Figure taken from www.intel.com)

  5. The Execution & Non-Execution Cycles (figure from the MIT 6.823 open course site) • In current microprocessors: • Each instruction takes one clock cycle to execute. • It takes many clock cycles to prepare for executing an instruction. • Pipelined? Yes, but the non-execution pipeline stages consume silicon area, power, etc. • Executing an instruction is not the same as doing a useful calculation. • Can we do something different?

  6. Outline • Electronic Aspect of FPGA: • LED Flashing • Logic Elements in a Nutshell • TDC and ADC • FPGA as a Computing Fabric: • Moore’s Law Forever? • Space Charge Computing with FPGA Cores • Doublet Matching & Hash Sorter • Triplet Matching & Tiny Triplet Finder • Enclosed Loop Micro-Sequencer (ELMS)

  7. The Space Charge Computing • Each electron sees the sum of the Coulomb forces from the other N-1 electrons. • The total number of calculations is about N^2, and each Coulomb force calculation requires a square root, a division and several multiplications. • Regular sequential computers are not fast enough.
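The N^2 cost described in slide 7 can be seen in a plain software version. The following is a minimal sketch, not the author's code: the function name `coulomb_forces`, the constant `k`, and the assumption of equal charges are all illustrative.

```python
import math

def coulomb_forces(positions, k=1.0):
    """Net force on each particle from all the others (equal charges assumed).

    positions: list of (x, y, z) tuples.
    k: placeholder for the physical constant q^2 / (4*pi*eps0) (assumption).
    """
    n = len(positions)
    forces = []
    for i in range(n):                      # outer loop: particle feeling the force
        xi, yi, zi = positions[i]
        fx = fy = fz = 0.0
        for j in range(n):                  # inner loop: the other N-1 particles
            if j == i:
                continue
            xj, yj, zj = positions[j]
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            r2 = dx * dx + dy * dy + dz * dz        # sum of three squares
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))     # square root + division
            fx += k * dx * inv_r3                   # several multiplications
            fy += k * dy * inv_r3
            fz += k * dz * inv_r3
        forces.append((fx, fy, fz))
    return forces
```

The inner-loop body is the "one Coulomb force calculation" that the slides later count in clock cycles.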

  8. The FPGA Board • Up to 16 FPGA devices ($32 each) can be installed on each board. • Each FPGA hosts one core.

  9. The 16-bit Demo Core [Block diagram labels: 16-bit coordinates (xj, yj, zj) and (xi, yi, zi); subtractors; three squaring units (x^2) and adders; LUT, 10 bits in, 16 bits out; multipliers (X); 32-bit forces; accumulators (Σ); 16-bit velocities (vxj, vyj, vzj)]

  10. A Double-Layer + Single-Layer Sequencer [Diagram labels: State Control, Outer Loop, Inner Loop]

  11. The Lookup Table • The LUT (10 bits in, 16 bits out) replaces these operations: • A square root • Two multiplications • A reciprocal [Diagram: the sum of three squares (x^2 + x^2 + x^2) feeds the LUT]

  12. Number of Bits for Input to LUT • Shifters are used before and after the LUT. • Leading zeros are eliminated: 00000001010110 becomes 0101011000. • A 32-bit input LUT is too big: 2^32 = 4G words. [Diagram labels: 16-bit coordinates; 32-bit sum of squares; LUT, 10 bits in, 16 bits out; 32-bit forces]

  13. Bit Evolution Before LUT • If (High Bits) != 0, choose (High Bits); else choose (Low Bits). [Diagram stages: x1, x2; (x1-x2); (x1-x2)^2; sum of 3 squares; LUT]

  14. Bit Evolution After LUT • Shift by 2n bits before the LUT. • Shift by 3n bits after the LUT. [Diagram stages: (x1-x2); LUT]
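Slides 11 to 14 describe a small LUT wrapped by shifters. Below is a minimal sketch of that idea under stated assumptions: the LUT is taken to map a 10-bit slice of r^2 to a 16-bit value proportional to r^-3, and the input is normalized by an even shift of 2n bits so that the output can be corrected by 3n bits. The names `normalize` and `inv_r_cubed`, the normalization range 256 to 1023, and the scaling are illustrative choices, not details recovered from the slides.

```python
LUT_IN_BITS = 10
LUT_OUT_MAX = (1 << 16) - 1          # largest 16-bit output value
LO = 1 << (LUT_IN_BITS - 2)          # 256: smallest normalized input (assumption)
HI = 1 << LUT_IN_BITS                # 1024: one past the largest 10-bit input

# LUT[idx] ~ LUT_OUT_MAX * (LO / idx) ** 1.5; entries below LO are never addressed.
LUT = [0] * LO + [round(LUT_OUT_MAX * (LO / idx) ** 1.5) for idx in range(LO, HI)]

def normalize(r2):
    """Return (idx, n) with r2 ~= idx * 4**n and LO <= idx < HI."""
    if r2 <= 0:
        raise ValueError("r2 must be a positive integer")
    n = 0
    while (r2 >> (2 * n)) >= HI:     # too many bits: shift right by 2 bits at a time
        n += 1
    if n == 0:
        while (r2 << (-2 * n)) < LO:  # too few bits: shift left by 2 bits at a time
            n -= 1
    idx = r2 >> (2 * n) if n >= 0 else r2 << (-2 * n)
    return idx, n

def inv_r_cubed(r2):
    """Approximate r**-3 from an integer r2 using the LUT and two shifts."""
    idx, n = normalize(r2)
    # r2 = idx * 2**(2n)  =>  r**-3 = idx**-1.5 * 2**(-3n)
    return LUT[idx] / (LUT_OUT_MAX * LO ** 1.5) * 2.0 ** (-3 * n)
```

Because the input shift is always an even number of bits, the correction after the LUT is an exact power of two, which is why a shifter suffices on the output side.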

  15. Two Electrons with Natural Scales [Figure labels: two electrons (e, e); 256 nm; 28 ps]

  16. 256 Charged Particles, Iteration 0

  17. 256 Charged Particles, Iteration 5

  18. 256 Charged Particles, Iteration 10

  19. 256 Charged Particles, Iteration 15

  20. 256 Charged Particles, Iteration 20

  21. 256 Charged Particles, Iteration 25

  22. 256 Charged Particles, Iteration 30

  23. 256 Charged Particles, Iteration 35

  24. 256 Charged Particles, Iteration 40

  25. Speed Comparison with Regular CPU • The FPGA core is about 10 times faster than a typical 2.2 GHz CPU core. • The FPGA core runs at 200 MHz, i.e., 200 M Coulomb force calculations per second. • It seems the CPU core needs 80-100 clock cycles for each Coulomb force calculation.

  26. One Board: 8 FPGA Cores • One core/FPGA = 5 dual-core CPUs; 8 cores/board = 40 dual-core CPUs. • One board has the calculation capacity of 40 dual-core CPUs. • The power consumption of one board is < 4.5 W. • Newer FPGAs capable of hosting 4 cores/FPGA are available.
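The figures in slides 25 and 26 can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function names are illustrative, and it assumes one Coulomb force calculation per FPGA clock cycle, as slide 25 states.

```python
FPGA_CLOCK_HZ = 200e6   # one force calculation per clock (slide 25)
CPU_CLOCK_HZ = 2.2e9    # "typical 2.2 GHz CPU core" (slide 25)

def speedup(cpu_cycles_per_force):
    """Ratio of FPGA-core to CPU-core force calculations per second."""
    cpu_rate = CPU_CLOCK_HZ / cpu_cycles_per_force
    return FPGA_CLOCK_HZ / cpu_rate

def dual_core_equivalents(fpga_cores, speedup_per_core=10):
    """Number of dual-core CPUs matched by a board with fpga_cores cores."""
    return fpga_cores * speedup_per_core // 2

for cycles in (80, 100):
    print(cycles, "cycles per force ->", round(speedup(cycles), 1), "x")
print("8 cores ->", dual_core_equivalents(8), "dual-core CPUs")
```

With 80 to 100 CPU cycles per force the ratio comes out between roughly 7 and 9, which is consistent with the rounded "x10" on the slide.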

  27. Outline • Electronic Aspect of FPGA: • LED Flashing • Logic Elements in a Nutshell • TDC and ADC • FPGA as a Computing Fabric: • Moore’s Law Forever? • Space Charge Computing with FPGA Cores • Doublet Matching & Hash Sorter • Triplet Matching & Tiny Triplet Finder • Enclosed Loop Micro-Sequencer (ELMS)

  28. Example of Doublet Match, PET • Positrons and electrons annihilate to produce pairs of photons. The back-to-back photons hit the detector at nearly the same time. • Detector hits are digitized, and hits at nearly the same time are to be matched together. • Comparing every hit in one group with every hit in the other takes O(n^2) clock cycles. [Diagram: hits (T, D) from Group 1 and Group 2; the time difference ΔT (Group 1 minus Group 2) is tested with ΔT < A? and ΔT > (-A)?]

  29. Hash Sorter • Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K. • Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K. • The entire pairing process takes 2n clock cycles, rather than n^2 clock cycles. [Diagram: Group 1 (K, D) and Group 2 (K, D) items binned by K]
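The two passes in slide 29 map directly onto a binning routine. Here is a minimal software sketch, with a Python dictionary standing in for the hash sorter bins; the function name `hash_sort_pairs` and the (key, data) tuple layout are illustrative assumptions.

```python
from collections import defaultdict

def hash_sort_pairs(group1, group2):
    """Pair (key, data) items from two groups that share the same key K."""
    bins = defaultdict(list)
    # Pass 1: store every Group 1 item in the bin selected by its key.
    for key, data in group1:
        bins[key].append(data)
    # Pass 2: stream Group 2 through and read out the matching bin.
    pairs = []
    for key, data2 in group2:
        for data1 in bins.get(key, ()):
            pairs.append((key, data1, data2))
    return pairs

group1 = [(3, "a"), (7, "b"), (3, "c")]
group2 = [(7, "x"), (3, "y"), (5, "z")]
print(hash_sort_pairs(group1, group2))
```

Each item is touched once per pass, which is where the 2n figure comes from when bins hold few entries. For a time-coincidence match such as the PET example, the key would presumably be a coarse time bin, and neighbouring bins may also need checking; the slides do not spell this out.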

  30. Linked-List Structure of Hash Sorter [Diagram labels: key K; Index RAM; Pointer RAM; DATA RAM; DIN; DOUT]
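One way three memories named Index RAM, Pointer RAM and DATA RAM can form linked lists is sketched below. The wiring in the actual diagram is not recoverable from the transcript, so this is an assumption: the index RAM holds the head address of each key's chain, and the pointer RAM holds the next address for each stored item. The class and method names are hypothetical.

```python
class LinkedListHashSorter:
    """Software model of a hash sorter built from three RAM arrays (a sketch)."""

    NULL = -1  # marks "no entry" in the index and pointer RAMs

    def __init__(self, num_keys, capacity):
        self.index_ram = [self.NULL] * num_keys    # key -> address of newest item
        self.pointer_ram = [self.NULL] * capacity  # address -> address of older item
        self.data_ram = [None] * capacity          # address -> stored data
        self.count = 0                             # next free address

    def store(self, key, data):
        """Write one item; one RAM write per memory, no searching."""
        if self.count >= len(self.data_ram):
            raise OverflowError("data RAM is full")
        addr = self.count
        self.data_ram[addr] = data
        self.pointer_ram[addr] = self.index_ram[key]  # link to the previous head
        self.index_ram[key] = addr                    # new item becomes the head
        self.count += 1

    def fetch(self, key):
        """Yield all items stored under key, newest first."""
        addr = self.index_ram[key]
        while addr != self.NULL:
            yield self.data_ram[addr]
            addr = self.pointer_ram[addr]
```

With this layout the bins need no pre-allocated depth: all keys share one data RAM, and a busy key simply has a longer chain.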

  31. Hash Sorter • Using the hash sorter, matching pairs can be grouped together in 2n, rather than n^2, clock cycles.

  32. Outline • Electronic Aspect of FPGA: • LED Flashing • Logic Elements in a Nutshell • TDC and ADC • FPGA as a Computing Fabric: • Moore’s Law Forever? • Space Charge Computing with FPGA Cores • Doublet Matching & Hash Sorter • Triplet Matching & Tiny Triplet Finder • Enclosed Loop Micro-Sequencer (ELMS)

  33. Hits, Hit Data & Triplets • Hit data come out of the detector planes in random order. • Hit data from 3 planes generated by the same particle track are grouped together to form triplets.

  34. Triplet Finding • Three data items must satisfy the condition xA + xC = 2 xB. • A total of n^3 combinations must be checked (e.g. 5x5x5 = 125). • A software implementation needs three nested loops. • Without careful planning, a large silicon resource may be needed: O(N^2). [Diagram: Plane A, Plane B, Plane C]

  35. Tiny Triplet Finder Operations, Pass I: Filling Bit Arrays • The condition xA + xC = 2 xB is rewritten as xA = -xC + constant. • For any hit in the physical planes, fill a corresponding logic cell in the bit arrays/shifters. • Note: flipped bit order.

  36. Tiny Triplet Finder Operations, Pass II: Making Match • For any center plane hit, logically shift the bit array. • Perform a bit-wise AND in this range. • Where the AND leaves a bit set, a triplet is found. [Diagram: physical planes and bit arrays/shifters]
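The two passes of slides 35 and 36 can be modelled with Python integers acting as bit arrays. This is a sketch under assumptions, not the author's firmware: hits are taken to be integer positions in 0..width-1, plane C is stored in flipped bit order, and the shift amount for a center hit xB is derived from xA = 2 xB - xC. The function name `tiny_triplet_finder` is hypothetical.

```python
def tiny_triplet_finder(hits_a, hits_b, hits_c, width):
    """Return triplets (xa, xb, xc) with xa + xc == 2 * xb.

    hits_*: integer hit positions in range(width) for planes A, B (center), C.
    """
    mask = (1 << width) - 1

    # Pass I: fill the bit arrays, one clock cycle per hit in hardware.
    array_a = 0
    for xa in hits_a:
        array_a |= 1 << xa
    array_c = 0
    for xc in hits_c:
        array_c |= 1 << (width - 1 - xc)      # flipped bit order

    # Pass II: for each center hit, shift the C array and AND it with the A array.
    triplets = []
    for xb in hits_b:
        # A flipped C bit sits at (width-1-xc); it must land on xa = 2*xb - xc.
        shift = 2 * xb - (width - 1)
        shifted = array_c << shift if shift >= 0 else array_c >> -shift
        match = array_a & shifted & mask
        xa = 0
        while match:                           # decode the set bits
            if match & 1:
                triplets.append((xa, xb, 2 * xb - xa))
            match >>= 1
            xa += 1
    return triplets

print(tiny_triplet_finder([2, 5, 9], [4, 7], [6, 9, 1], width=16))
```

The flipped order is what turns the sum constraint into a pure shift: one shifter plus one row of AND gates replaces a comparator for every (A, C) pair.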

  37. Tiny? Yes, Tiny! Logic Cell Usage • The triplet finding process for FPGA schemes takes 2n clock cycles. • The Tiny Triplet Finder uses far fewer logic elements: AM, CAM, Hough Transform, etc. need O(N^2), while the Tiny Triplet Finder needs O(N*logN).

  38. Tiny Triplet Finder: Reuse Coincident Logic via Shifting Hit Patterns • One set of coincident logic is implemented. • For an arbitrary hit on C3, rotate, i.e., shift, the hit patterns for C1 and C2 to search for coincidence. [Diagram: layers C1, C2, C3]

  39. Tiny Triplet Finder for Circular Tracks • Fill the C1 and C2 bit arrays (n1 clock cycles). • Loop over C3 hits, shift the bit arrays and check for coincidence (n3 clock cycles). • Also works with more than 3 layers. [Diagram labels: two Bit Arrays; two Shifters; *R1/R3; *R2/R3; Bit-wise Coincident Logic; Triplet Map Output; To Decoder]

  40. Hit Matching

  41. Outline • Electronic Aspect of FPGA: • LED Flashing • Logic Elements in a Nutshell • TDC and ADC • FPGA as a Computing Fabric: • Moore’s Law Forever? • Space Charge Computing with FPGA Cores • Doublet Matching & Hash Sorter • Triplet Matching & Tiny Triplet Finder • Enclosed Loop Micro-Sequencer (ELMS)

  42. Micro-computing vs. Reconfigurable Computing • Example: (100+3-4)*5+7 = ? Data: 100, 3, 4, 5, 7. Control: LD, (+), (-), (*), (+). • In a microprocessor, users specify a program that runs on fixed logic circuits. • In an FPGA, users specify the logic circuits (as well as the program). • FPGA computing need not follow microprocessor architectures (but useful experience can be borrowed). • The usefulness of FPGA reconfigurable computing is still to be fully appreciated. [Diagram labels: CPU with Data and Program; FPGA with Data, Program and Configuration]
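The contrast in slide 42 can be illustrated in software, with the caveat that both halves below run on a CPU and only mimic the two styles. `run_program` is a fixed machine driven by a control stream, while `configured_datapath` stands for a circuit wired for this one expression. Both names are illustrative.

```python
import operator

# Operations the fixed machine knows; the program selects among them.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run_program(data, control):
    """Microprocessor style: fixed logic, behaviour chosen by the control stream."""
    acc = 0
    for op, value in zip(control, data):
        acc = value if op == "LD" else OPS[op](acc, value)
    return acc

def configured_datapath(a, b, c, d, e):
    """FPGA style: the expression itself is the 'circuit'; only data flows in."""
    return (a + b - c) * d + e

data = [100, 3, 4, 5, 7]
control = ["LD", "+", "-", "*", "+"]
print(run_program(data, control), configured_datapath(*data))
```

In hardware, the second form needs no instruction fetch or decode at all, which is the point the earlier slide on non-execution cycles makes.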

  43. FPGA Process Sequencing Options

  44. The Between Counter [Diagram labels: counter with inputs T, M-1 and N; ports A[], B[], ==, SCLR, SLOAD, D[], Q[]; Between Counter feeding a ROM that outputs Control Signals] • ROM contents: PC0: instr0, PC1: instr1, ... PCD: instrD. • Count sequence shown: 0,1,2,3,4,5,6,7,8,9,A; then 5,6,7,8,9,A repeated four times; then 5,6,7,8,9,A,B,C,D,E,F…

  45. ELMS: Enclosed Loop Micro-Sequencer [Block diagram labels: Reset; CLK; Program Counter; A; ROM, 128 x 36 bits; Control Signals; Conditional Branch Logic (allows jumping back as in microprocessors); Loop & Return Logic + Stack (special in ELMS, supports FOR loops at machine code level)] • PC + ROM is a good sequencer in an FPGA. • Adding Conditional Branch Logic allows the program to loop back. • Loop & Return Logic + Stack is a special feature of ELMS that supports FOR loops at machine code level.
      PC  Control Signals  Label  Operation
      00  000000000000000
      01  001000100011010         LD R1, #n
      02  000010001000000         LD R2, #addr_a
      03  000000000000100         LD R3, #addr_X
      04  000000010001000         LD R7, #0
      05  000000000100001  BckA1  LD R4, (R2)
      06  000100000010000         INC R2
      07  000001000100000         LD R5, (R3)
      08  000100010000001         INC R3
      09  001001000100000         MUL R6, R4, R5
      0a  000000010001000  EndA1  ADD R7, R7, R6
      0b  000010000010000         DEC R1
      0c  000000100000100         BRNZ BckA1

  46. ELMS: Detailed Block Diagram [Diagram labels: ROM, 128 x 36 bits; PC with Reset and +1; 0x04, RUNat04; JMP, JMPIF, CondJMP; desA; cnt, EndA, BckA; Loop & Return Registers + Stack (128 words) holding bckA, endA and CNT; Compare; Push, Pop, RTN; User Control Signals] • LoopBack = DEC = (PC==endA) && (CNT!=0) • LastPass = (PC==endA) && (CNT==1) • The Stack supports nested loops and subroutine calls up to 128 layers.
      Label  Operation
             FOR BckA1 EndA1 #n
             LD R2, #addr_a
             LD R3, #addr_X
             LD R7, #0
      BckA1  LD R4, (R2)
             INC R2
             LD R5, (R3)
             INC R3
             MUL R6, R4, R5
      EndA1  ADD R7, R7, R6
             LD R8, R7
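The loop mechanism of slide 46 can be modelled in a few lines. The sketch below is a simplified software model and does not claim to reproduce the exact LoopBack and LastPass signal timing: it assumes that FOR pushes (bckA, endA, CNT) onto the stack, that reaching endA with CNT > 1 decrements CNT and jumps to bckA, and that reaching endA with CNT == 1 pops the entry and falls through. The tuple encoding of instructions and all function names are hypothetical.

```python
def run_elms(program, max_cycles=10_000):
    """Return the sequence of operations issued by a toy ELMS sequencer.

    program: list of ('FOR', bck, end, n) or ('OP', text) tuples, n >= 1.
    """
    pc = 0
    stack = []                      # each entry: [bck, end, remaining passes]
    trace = []
    while pc < len(program) and len(trace) < max_cycles:
        instr = program[pc]
        if instr[0] == "FOR":
            _, bck, end, n = instr
            stack.append([bck, end, n])     # push loop registers
            trace.append("FOR")
            pc += 1
            continue
        trace.append(instr[1])
        next_pc = pc + 1
        # Compare PC with endA of the innermost loop(s); no branch instruction used.
        while stack and pc == stack[-1][1]:
            if stack[-1][2] > 1:            # more passes left: loop back
                stack[-1][2] -= 1
                next_pc = stack[-1][0]
                break
            stack.pop()                     # last pass done: pop and fall through
        pc = next_pc
    return trace

def dot_product_program(n):
    """The listing from slide 46, with BckA1 at address 4 and EndA1 at address 9."""
    ops = ["LD R2, #addr_a", "LD R3, #addr_X", "LD R7, #0",
           "LD R4, (R2)", "INC R2", "LD R5, (R3)", "INC R3",
           "MUL R6, R4, R5", "ADD R7, R7, R6", "LD R8, R7"]
    return [("FOR", 4, 9, n)] + [("OP", text) for text in ops]

trace = run_elms(dot_product_program(3))
print(len(trace), "cycles;", trace.count("MUL R6, R4, R5"), "multiplies")
```

Because the loop-back decision is made by comparing PC with endA in parallel with execution, the loop body contains no DEC or BRNZ instruction.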

  47. Software: Using a Spreadsheet as Compiler

  48. What’s Good About ELMS: FOR Loops at Machine Code Level with Zero Overhead • The looping sequence in this example is known before entering the loop. • A regular microprocessor treats the sequence as unknown. • ELMS supports FOR loops with pre-defined iteration counts at machine code level. • Execution time is saved, and the micro-complexities associated with conditional branches (branch penalty, pipeline bubbles, etc.) are avoided.
      Microprocessor:
             LD R1, #n
             LD R2, #addr_a
             LD R3, #addr_X
             LD R7, #0
      BckA1  LD R4, (R2)
             INC R2
             LD R5, (R3)
             INC R3
             MUL R6, R4, R5
      EndA1  ADD R7, R7, R6
             DEC R1            (conditional branch: 25%)
             BRNZ BckA1
      The ELMS:
             FOR BckA1 EndA1 #n
             LD R2, #addr_a
             LD R3, #addr_X
             LD R7, #0
      BckA1  LD R4, (R2)
             INC R2
             LD R5, (R3)
             INC R3
             MUL R6, R4, R5
      EndA1  ADD R7, R7, R6
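The 25% figure on slide 48 follows from counting instructions in the two listings. The sketch below does that count, assuming one clock cycle per instruction and ignoring branch penalties, which would only widen the gap; the function names are illustrative.

```python
SETUP = 4                 # both listings have four instructions before the loop
BODY_MICRO = 8            # six useful instructions + DEC R1 + BRNZ BckA1
BODY_ELMS = 6             # the same six useful instructions, no loop bookkeeping

def cycles_microprocessor(n):
    """Instruction count for n iterations with DEC/BRNZ loop control."""
    return SETUP + BODY_MICRO * n

def cycles_elms(n):
    """Instruction count for n iterations with the hardware FOR loop."""
    return SETUP + BODY_ELMS * n

overhead = (BODY_MICRO - BODY_ELMS) / BODY_MICRO
print("loop-control share of the loop body:", overhead)
print("n=100:", cycles_microprocessor(100), "vs", cycles_elms(100))
```

For large n the setup cost vanishes and the saving approaches one quarter of the execution time.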

  49. ELMS as a Hardware Loop Sequencer [Block diagram labels: Reset; CLK; Program Counter; A; ROM, 128 x 36 bits; Control Signals; Conditional Branch Logic; Loop & Return Logic + Stack. A second figure is from http://www.analog.com/] • There are DSP devices that support hardware loops for zero-overhead loop implementation. • The emphasis of ELMS is that FOR loops and subroutine calls/returns are treated the same. • Any program passage can be used as a subroutine without needing a return instruction. • The ELMS uses as few resources as possible for FPGA implementation.

  50. No ALU => Small Resource Usage [Diagram: three architectures. The von Neumann (Princeton) Architecture: Program/DATA Memory, Program Control, ALU. Harvard Architecture: Program Memory, Program Control, ALU, DATA Memory. Fermilab (?) Architecture: Program Memory, Sequencer (ELMS), DATA Memory, Data Processor] • The Princeton Architecture is more suitable at system level, while the Harvard Architecture is better suited at micro-structure level. • Regular microprocessors cannot run a looped program without an ALU. • The ALU takes a large amount of resources, yet may not be used efficiently for data processing tasks in an FPGA. • The ELMS can run nested-loop programs without an ALU. • Further separation of program and data is therefore possible. • The ELMS is kept small.
