1 / 46

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator. F. Mehdipour , Hiroaki Honda * , H. Kataoka, K. Inoue and K. Murakami Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

kineta
Download Presentation

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoueand K. Murakami Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan *Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan E-mail: farhad@c.csce.kyushu-ua.c.jp

  2. Agenda Introduction SFQ-LSRDP General Architecture The Design Procedure and Tool Chain Input/ Output Nodes Placement Area Minimization Experimental Results Conclusions

  3. CREST-JST SFQ-RDP Project (2006~): A Low-power, high-performance reconfigurable processor based on single-flux quantum circuits Superconducting Research Lab. (SRL) SFQ process Yokohama National Univ. SFQ-FPU chip, cell library Nagoya Univ. SFQ-RDP chip, cell library, and wiring Nagoya Univ. CAD for logic design and arithmetic circuits Dr. S. Nagasawa et al. Prof. N. Yoshikawa et al. Prof. A. Fujimaki et al. Prof. N. Takagi (Leader) et al. Kyushu Univ. Architecture, Compiler and Applications SFQ-LSRDP Prof. K. Murakami et al.

  4. Goals Discovering appropriate scientific applications Developing compiler tools Developing performance analyzing tools Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits

  5. How a reconfigurable processor works GPP Non-critical code Computation-intensive (critical) code LSRDP Non-critical code ... PE PE PE PE Computation-intensive (critical) code ORN … LSRDP ... Non-critical code PE PE PE PE . . . ORN ... PE PE PE PE Application code Main Memory

  6. Single-flux quantum (SFQ)against CMOS • CMOS main issues in implementing a large accelerator: • High electric power consumption • High heat radiation • Difficulties in high-density packing SFQ Features: • High-speed switching and signal transmission • Low power consumption • Compact implementation (smaller area) • Suitable for pipeline processing of data stream

  7. Outline of large-scale reconfigurable data-path (LSRDP) processor • Features: • Handling data flow graphs (DFGs) extracted from scientific applications • Pipeline execution • Burst transfer of input /output rearranged data from/to memory • Reduced no. of memory accesses (alleviating the memory wall problem) • Reconfigurable data-path components: • A matrix of large number of floating-point Functional Units (FUs) • Reconfigurable Operand Routing Network : (ORN) • Dynamic reconfiguration facilities • Streaming Buffer (SB) for I/O ports LSRDP GPP ... PE PE PE PE ORN : Operand Routing Network : : : : ... PE PE PE PE ORN ... PE PE PE PE SB SMAC Main Memory Scratchpad Memory

  8. SFQ-LSRDP General Architecture

  9. 4 7 15 13 12 LSRDP architecture Input ports • Processing Elements • FU (Functional Unit): implements basic 64-bit double-precision floating point operations including: ADD/SUB and MUL • TU(transfer unit): as a routing resource for transferring data b/w inconsecutive rows MUL Node 15 TU FU FU TU FU FU TU TU FU TU PE including two components TU Four functionalities Output ports

  10. FU - FU TU - FU TU - FU - - - TU - TU TU TU TU - FU TU TU FU TU PE Basic arch. 3-inps/2-outs FU TU PE arch. I 4-inps/3-outs TU FU TU TU TU PE structures FU - FU TU TU TU TU TU - TU TU-TU TU PE arch. II 3-inps/3-outs FU TU

  11. A A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T T T T T T T T T T ADD/SUB TU M M M M M M M M M M M M M M M M M M M M MUL Layout types- Type I W … Each PE implements ADD/SUB and MUL ORN … H ORN M : MUL … . . . A : ADD/SUB T : Transfer Unit ORN … Flexible but consumes a lot of resources

  12. M A A A M M A M M A A M A A A A M M A A T T T T T T T T T T T T T T T T T T T T … ADD/SUB TU MUL TU Layout types- Type II W Each PE implements ADD/SUB or MUL Each PE implements ADD/SUB or MUL ORN … ORN … . . . H ORN …

  13. Maximum connection length (MCL)-Definition MCL:maximum horizontal distance b/wtwo PEs located in two subsequent rows

  14. An ORN structure ORN 2bit shiftregister ORN is consisted of2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

  15. Dynamic reconfiguration architecture • Three bit-stream lines for dynamic reconfiguration of: • Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs • Cross-bar switches in ORNs

  16. What should be decided during the design procedure Width and Height ? The number of I/O ports? Maximum Connection Length (MCL)? ORN size and structure? Layout: FU types (ADD/SUB and MUL)? Reconfiguration mechanism? (PE, ORN, Immediate data) On-chip memory configuration?

  17. The Design Procedure and Tool Chain

  18. Compiler and design flow • DFGs are manually generated • DFG mapping results are employed for: • Analyzing LSRDP architecture statistics (a quantitative approach) • Generating LSRDP configuration bit-streams

  19. Benchmark applications • Finite differential method calculation of2nd order partial differential equations • 1dim-Heat equation(Heat) • 1dim-Vibration equation (Vibration) • 2dim-Poisson equation (Poisson) • Quantum chemistry application • Recursive parts of Electron Repulsion Integral calculation(ERI-Rec) Types of operations in the calculations: ADD/SUB and MUL

  20. DFG extraction- Heat equation • 1-dim. heat equation for T(x,t) • Calculation by Finite DifferenceMethod (FDM) (A is const.) Basic DFG Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

  21. A sample DFG - Heat Inputs: 32 Outputs: 16 Operations: 721 Immediates: 364 A sample DFG (Heat)

  22. DFG mapping flow Longest connections MCL= 2

  23. Placing Input/Output Nodes

  24. Fan-out based I/O nodes placement • ni: the number of children of input node i • Ci1, Ci2, Ci3, Ci,ni • X: location of the input node i • Total Connection Length: TCL= |Ci1-X|+ |Ci2-X|+…|Ci,ni-X| • Objective: MinimizeTCL • ni= 1 X= Ci1 • ni= 2 Ci1 <= X <= Ci2 • ni= 3 X = Ci2 • ni>=2 X = Cij, j=2…ni-1

  25. One main reason for the large MCL Inputs Ports are far from each other

  26. Proximity-factor based placement • Proximity factor indicates how far a pair of input ports should be located from each other • For a pair of input nodes • The larger number of closer descendants, higher proximity factor is assigned • Si,j: a set of common descendants for input nodes i and j • Dk,i(=Dk,j): distance of common descendant node k to the input nodes i and j (it is equal to ASAP execution level of the node)

  27. I3 I1 I2 3 2 1 4 5 6 7 Proximity factor-Example Inputs nodes I1 and I2 should be located closer than I3

  28. Input nodes placement alg.: Example if C(l)> C(r) l= l+1, L[l]=j else r= r+1, L[r]=j Placing the 1st input node with the highest proximity factor … … N/2-3 N/2-2 N/2-1 1 N/2+1 N/2+2 N/2+3 Placing the 2nd input node with the highest proximity factor … … N/2-3 N/2-2 2 1 N/2+1 N/2+2 N/2+3

  29. Input ports placement alg.: Example Placing i-th input node … … N/2-K … 2 1 3 … N/2+M l r If C(l)> C(r): … … i … 2 1 3 … N/2+M r l If C(r)> C(l): … N/2-K … 2 1 3 … i l r

  30. Area Minimization

  31. Estimating the area of a PE Area(FU)= Area(ADD/SUB)= Area(MUL) Area(TU)= Area(MUX)~ 0.1 Area (FU) FU TU FU TU TU PE arch. I PE basic arch Layout I: Area(PE)= 2.2x Area(FU), Layout II: Area(PE)= 1.2x Area(FU) Layout I: Area(PE)= 2.1x Area(FU), Layout II: Area(PE)= 1.1x Area(FU) C A B A B C FU TU op TU TU TU PE arch. II mux sel Layout I: Area(PE)= 2.2x Area(FU) Layout II: Area(PE)= 1.2x Area(FU)

  32. Estimating the ORN area-PE Basic arch. Number of rows = 1.5×W FU TU Basic arch. 3-inps/2-outs MCL= 1 Number of columns = 4×MCL Area (ORN)= 1.5 x W x (4 x MCL) x Area (CB) W: the no. of the PEs in a RDP row

  33. Estimating the ORN area-PE arch. I TU FU TU Number of rows = 2×W PE arch. I 4-inps/3-outs MCL= 1 Number of columns = 6×MCL+2 Area (ORN)= 2 x W x (6 x MCL+ 2) x Area (CB)

  34. Estimating the ORN area-PE arch. II Number of rows = 1.5×W FU TU TU TU PE arch. II 3-inps/3-outs MCL= 2 Number of columns = 4×MCL+1 Area (ORN) = 1.5 x W x (4 x MCL + 1) x Area (CB)

  35. A modified connection length measurement • New measurement technique for the net length src Connection length measurement: initialC.L.= dh modified  C.L.= dh/ dv dv dest dh C.L.(previous)= 3 C.L.(new)=3 src dest1 C.L.(previous)= 3 C.L.(new)=1 dest2

  36. A modified connection length measurement- Example Parent 2 is chosen when C.L. is measured as dh/dv MCL= 1 Parent 1 dh dh/dv 0, 4 0, 4/3 1, 3 1, 1 2, 2 2, 2/3 3, 1 3,1/3 4, 0 4, 0 is chosen when C.L. is measured as dh MCL= 2 dh dh/dv 0, 4 0, 1 1, 3 0.5, 0.75 2, 2 1, 0.5 3, 1 3/2, 1/4 4, 0 2, 0

  37. MCL minimization- Using a MCL threshold • A maximum threshold is assumed for the MCL • During the placement process: • For each CL larger than the threshold, the vertical distance increases as: • dv= CL/MCL_Threshold PE with the min. C.L to the source src • max permitted length= 2 • dh =3 > max permitted length • dv= 1 dest dest dv= dv+ [3/2]=dv+1= 2

  38. Basic placement and routing vs. integrated placement and routing DFG DFG Placing Input Nodes using PF-based alg. Placing Input Nodes LSRDP Architecture Description LSRDP Architecture Description Placing Operational Nodes & Routing Nets (node by node) Placing Operational & Output Nodes Placing Output Nodes Routing Nets Final Map Final Map Routing Output Nets Routing IO Nets Basic Placement and Routing Flow Integrated Placement and Routing Flow

  39. Experimental Results

  40. Specifications of the benchmark DFGs

  41. Evaluation results for various architectures-MCL and ORN sizes S2 results in smaller MCL and ORN size for both layout types

  42. Evaluation results for various architectures-no. of utilized PEs By using lhv, larger number of RDP rows are utilized larger number of PEs will be employed for S2

  43. FU TU TU FU TU FU TU TU TU Basic PE arch. 3-inps/2-outs PE arch. I 4-inps/3-outs PE arch. II 3-inps/3-outs Evaluation results for various architectures-overall LSRDP area (KJJ) S2 results in smaller overall area in terms of KJJ for both layout types Layout II results in smaller area PE arch. II gives smaller area

  44. A sample ORN implementation Block diagram of a high frequency test bench clkin_hf ladder clkin_lfout clkin_lfin circuit under test data_in data_out input shift register output shift register A photograph of a chip with 1-to-3 ORN prototype test bench circuit under test ladder 5 mm input shift register output shift register

  45. Conclusions • SFQ-LSRDP is a basic core of a high-performance low-power computer • Data Flow Graphs (DFGs) extracted from scientific applications are mapped on the LSRDP • LSRDP micro-architecture is designed based on characteristics of DFGs via a quantitative approach • LSRDP is promising for resolving issues originated from CMOS technology as well as achieving remarkable performance Acknowledgement: This research was supported in part by Core Research for Evolutional Scienceand Technology (CREST) of Japan Scienceand Technology Corporation (JST).

  46. Thanks for your attention! Any questions?

More Related