F. Mehdipour*, Hiroaki Honda**, H. Kataoka, K. Inoue and K. Murakami*

An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami*
*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: dahon@c.csce.kyushu-ua.c.jp, honda@isit.or.jp

Agenda
• Introduction
• Large-Scale Reconfigurable Data-Path (LSRDP) General Architecture and Specifications
• Design Procedure and Tool Chain
• Preliminary Results
• Conclusions and Future Work

Introduction
• Parallel computer clusters with General-Purpose Processors (GPPs) are often used for HPC
• Various accelerators are used with GPPs for further performance improvement
  • PowerXCell, GPGPU, GRAPE-DR, ClearSpeed, etc.
  • Small size and low power consumption compared to processors with similar performance
[Images: Roadrunner with PowerXCell, TSUBAME, NVIDIA Tesla S1070. Sources: http://it.nikkei.co.jp/, http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html, http://www.top500.org/system/9485]

Single-Flux Quantum Large-Scale Reconfigurable Data-Path (SFQ-LSRDP)
• The Large-Scale Reconfigurable Data-Path (LSRDP):
  • is introduced as an alternative accelerator
  • reduces the number of memory accesses
  • is implemented with Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
  • is suitable for high-performance scientific computations
• A large memory bandwidth is demanded in conventional accelerators for high-performance computation
• On-chip memories are often used to hide memory access latency

Outline of the Large-Scale Reconfigurable Data-Path (LSRDP) Processor
• Features:
  • Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped
  • Pipeline execution
  • Burst transfer is used to move rearranged input/output data from/to memory
• The reconfigurable data-path includes:
  • A large number of floating-point Functional Units (FUs)
  • A reconfigurable Operand Routing Network (ORN)
  • Dynamic reconfiguration facilities
  • Streaming Buffers (SBs) for I/O ports
• Implementation by SFQ circuits
[Block diagram: LSRDP (rows of FUs alternating with ORNs, plus SBs), GPP, SMAC, main memory, scratchpad memory]

Single-Flux Quantum (SFQ) against CMOS
• CMOS issues:
  • High electric power consumption
  • High heat radiation and difficulties in high-density packing
  • Memory wall problem, which limits the processing speed
• SFQ features:
  • High-speed switching and signal transmission
  • Low power consumption
  • Compact implementation of a system (small area)
  • No cost for latches
  • Suitable for pipeline processing of data streams
  • Serial bit-level processing

CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits (SFQ-LSRDP)
• Superconducting Research Lab. (SRL): SFQ process (Dr. S. Nagasawa et al.)
• Yokohama National Univ.: SFQ-FPU chip, cell library (Prof. N. Yoshikawa et al.)
• Nagoya Univ.: SFQ-RDP chip, cell library, and wiring (Prof. A. Fujimaki et al.)
• Nagoya Univ.: CAD for logic design and arithmetic circuits (Prof. N. Takagi (Leader) et al.)
• Kyushu Univ.: Architecture, Compiler and Applications (Prof. K. Murakami, Dr. K. Inoue, Dr. H. Honda, Dr. F. Mehdipour, H. Kataoka)

Goals of the Project
• Designing and implementing the SFQ-LSRDP architecture considering the features and limitations of SFQ circuits
• Discovering appropriate applications
• Developing compiler tools
• Developing performance analysis tools

LSRDP General Architecture and Specifications

Parameters to Be Decided Within the LSRDP Design Procedure
• Core structure: a matrix of PEs
  • PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)
• Width and height?
• Maximum Connection Length (MCL) between consecutive rows?
• Layout: FU types (ADD/SUB and MUL)?
• Reconfiguration mechanism (PE, ORN, immediate data)?
• On-chip memory configuration?

LSRDP Architecture
• Processing Elements (PEs)
  • FU: implements basic 64-bit double-precision floating-point operations: ADD, SUB and MUL
  • TU (Transfer Unit): a routing resource for transferring data from a row to a non-consecutive row
[Figure: a PE includes two components (FU and TU) and provides four functionalities]

Layout Types - Type I
• Each PE implements ADD/SUB and MUL
• Flexible, but consumes a lot of resources
[Figure: W x H array of PEs with an ORN between rows. Legend: A = ADD/SUB, M = MUL, T = Transfer Unit]

Layout Types - Type II (Checkered)
• Each PE implements ADD/SUB or MUL
[Figure: W x H array of PEs with an ORN between rows. Legend: A = ADD/SUB, M = MUL, T = Transfer Unit]

Layout Types - Type III (Striped)
• Each PE implements ADD/SUB or MUL
• Type II or III: which one is more efficient?
[Figure: W x H array of PEs with an ORN between rows. Legend: A = ADD/SUB, M = MUL, T = Transfer Unit]

Maximum Connection Length (MCL)
• MCL: the maximum horizontal distance between two connected PEs located in two consecutive rows
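The MCL definition can be illustrated with a minimal sketch. It assumes each connection is given as a pair of column indices (source PE in one row, destination PE in the next row); the function name and the sample connections are hypothetical and not taken from the tool chain.

```python
def max_connection_length(edges):
    """Return the MCL of a mapping.

    `edges` is an iterable of (src_col, dst_col) pairs (assumed format),
    each describing a connection between PEs in two consecutive rows.
    The connection length is the horizontal distance |dst_col - src_col|.
    """
    return max((abs(dst - src) for src, dst in edges), default=0)


# Hypothetical connections between two consecutive rows.
sample_edges = [(0, 0), (1, 3), (4, 2), (5, 5)]
sample_mcl = max_connection_length(sample_edges)
print(sample_mcl)
```

Under this convention a connection between PEs in the same column has length 0, so an ORN must support at least `sample_mcl` columns of horizontal displacement to route every connection of the mapping.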

An ORN Structure
• The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches
A. Fujimaki, et al., “Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
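The switching elements named on this slide can be modeled behaviorally. The sketch below is only an illustration: it assumes a 2-by-2 switch has two states (straight or crossed) and a 1-by-2 switch forwards its input to exactly one output, each controlled by one configuration bit. The actual SFQ cells may behave differently, and the function names are hypothetical.

```python
def crossbar_2x2(a, b, cross):
    """Assumed 2-by-2 crossbar: pass (a, b) straight, or swap them if `cross` is set."""
    return (b, a) if cross else (a, b)


def switch_1x2(x, select):
    """Assumed 1-by-2 switch: route x to output 0 or output 1; the other output is idle (None)."""
    return (None, x) if select else (x, None)


# Route two operands through one crossbar stage in both configurations.
straight = crossbar_2x2("op0", "op1", cross=False)
crossed = crossbar_2x2("op0", "op1", cross=True)
print(straight, crossed)
```

Chaining such stages row by row gives a network in which each configuration bit decides whether an operand moves one step horizontally, which is why longer connections require more switch stages.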

Dynamic Reconfiguration Mechanism
• Three bit-stream lines for dynamic reconfiguration of:
  • Immediate registers (64-bit) in each PE
  • Selector bits for the muxes selecting the input data of FUs
  • Crossbar switches in ORNs

Design Procedure and Tool Chain

Compiler and Design Flow
• DFGs are manually generated from critical parts of applications
• DFG mapping results are used for:
  • Analyzing LSRDP architecture statistics
  • Generating LSRDP configuration bit-streams

LSRDP Design Procedure
[Flowchart: input is the DFGs and LSRDP HW constraints; for each parameter, an appropriate value is determined]

Benchmark Applications for Design Procedures
• Finite difference method calculation of 2nd-order partial differential equations
  • 1-dim. heat equation (Heat)
  • 1-dim. vibration equation (Vibration)
  • 2-dim. Poisson equation (Poisson)
• Quantum chemistry application
  • Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)
Only ADD/SUB and MUL operations are used in the critical calculations of all the above applications

DFG Extraction - Heat Equation
• 1-dim. heat equation for T(x,t)
• Calculation by the Finite Difference Method (FDM) (A is const.)
• The basic DFG corresponds to the minimum FDM calculation
• The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG
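The slide's own update formula is not reproduced in this text. As a hedged sketch, the code below assumes the standard explicit FDM scheme for the 1-dim. heat equation, T_new[i] = T[i] + A * (T[i+1] - 2*T[i] + T[i-1]), with A the constant named on the slide. Fixed boundary values and the function name are assumptions. Each grid-point update uses only ADD/SUB and MUL, which is the kind of minimum calculation a basic DFG would capture.

```python
def heat_step(T, A):
    """One explicit FDM time step (assumed scheme) with fixed boundary values.

    Per interior point: 2 MUL and 3 ADD/SUB operations.
    """
    new_T = list(T)
    for i in range(1, len(T) - 1):
        neighbours = T[i + 1] + T[i - 1]       # ADD
        twice_centre = 2.0 * T[i]              # MUL (2.0 as an immediate)
        laplacian = neighbours - twice_centre  # SUB
        new_T[i] = T[i] + A * laplacian        # MUL, ADD
    return new_T


# A single hot point diffusing into its neighbours over two steps.
T0 = [0.0, 0.0, 1.0, 0.0, 0.0]
T1 = heat_step(T0, 0.25)
T2 = heat_step(T1, 0.25)
print(T1, T2)
```

Applying `heat_step` across more grid points corresponds to extending the DFG horizontally, and chaining several steps corresponds to extending it vertically.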

Example of Extracted DFGs - Heat
A huge sample DFG (Heat): Inputs: 32, Outputs: 16, Operations: 721, Immediates: 364

DFG Classification
• In total, 24 DFGs are prepared as benchmarks
• Due to the broad range of DFG sizes, DFGs are classified as S, M, L, or XL with respect to their size and the number of input/output nodes

Mapping DFGs onto LSRDP
[Figure label: longest connections]

Preliminary Results

LSRDP Specifications: Width & Height
[Table: LSRDP dimensions and the number of input/output ports]

LSRDP Specifications: MCL
• Needs further MCL optimization

Analyzing Various LSRDP Layouts
[Charts: Layout I and Layout II (the ERI1 DFG is the exception, giving a better size for Layout III)]
• Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost

LSRDP at One Glance (1/2)

LSRDP at One Glance (2/2)

Preliminary Performance Evaluation
• Configurations: base processor, GPP+LSRDP
• GPP: execution time measured by means of a processor simulator
• LSRDP: estimated by performance modeling

Preliminary Performance Evaluation (Heat)
• Basic: SB only
• Reuse: SB + SPM
Data reuse is employed to avoid the need for data rearrangement as well as frequent data retrieval from the scratchpad memory.

Preliminary Performance Evaluation (Poisson)
Only a small fraction of the total time is processing time on the LSRDP; the main fraction consists of various overhead times and the execution time on the GPP.

Conclusions & Future Work
• A high-performance computer comprising an accelerator (LSRDP) implemented with superconducting circuits was introduced.
• 24 benchmark Data Flow Graphs (DFGs) were manually generated.
• The LSRDP micro-architecture was designed from the characteristics of scientific applications via a quantitative approach.
• LSRDP is promising for resolving issues originating from CMOS technology as well as for achieving considerable performance.
• Future work:
  • To achieve higher performance, the various overhead costs, mainly related to data management, must be reduced.
  • To reduce the implementation cost of LSRDP, we will focus on reducing the maximum connection length and the ORN size.

Acknowledgement
This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of the Japan Science and Technology Corporation (JST).

Thanks! Any Questions?

Backup Slides

Single Flux Quantum
• SFQ (Single Flux Quantum) circuit: high speed, low power consumption, and operating on a different principle from CMOS
[Figure: superconducting loop with inductance L, bias current Ib, and a Josephson junction (critical current Ic, tunneling effect) holding a flux quantum Φ0; pulse labeled 2 mV, 2 ps]

Mapping Results
• For each class, many extra TUs are needed to map all DFGs
[Figure: PE types (FU with one TU; FU with three TUs)]

Connection Length Minimization - Results
• Final optimized Maximum Connection Length (MCL) results
• ORNs should provide a connection length of 9 in LSRDP-S/M (MCL = 9)
• For LSRDP-L, MCL = 19 ⇒ serious implementation cost
• Is it possible to decrease it?

Distributions of Connection Lengths
• 93% of connection lengths are 0 ~ 2
• Only a small fraction of the connections is responsible for the larger ORNs
[Plot axis: connection length]
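A distribution like the one summarized here can be computed from a mapping result. The sketch below is an assumption-laden illustration: connections are (source column, destination column) pairs between consecutive rows, the sample data is made up, and the function names are hypothetical. It does not reproduce the 93% figure from the slide.

```python
from collections import Counter


def length_histogram(edges):
    """Count how many connections have each horizontal length."""
    return Counter(abs(dst - src) for src, dst in edges)


def fraction_within(edges, limit):
    """Fraction of connections whose length is at most `limit`."""
    edges = list(edges)
    if not edges:
        return 0.0
    short = sum(1 for src, dst in edges if abs(dst - src) <= limit)
    return short / len(edges)


# Made-up connections: mostly short, with a single long outlier.
demo_edges = [(0, 0), (1, 2), (2, 2), (3, 1), (4, 4), (5, 6), (6, 6), (0, 9)]
demo_hist = length_histogram(demo_edges)
demo_share = fraction_within(demo_edges, 2)
print(sorted(demo_hist.items()), demo_share)
```

In this made-up data a single connection of length 9 sets the MCL even though most connections have length 0 to 2, which mirrors the observation that a few long connections drive the ORN size.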

Analyzing Various LSRDP Layouts
• Similarly small sizes are achieved with Layouts I and II for the majority of DFGs (except the ERI1 DFG, which gives a better size for Layout III)
• Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost

Why Is Only the ERI1 DFG Suitable for Layout III?
[Figures: Heat on Layout II; ERI1 on Layout III]

FU Layout for DIV, SQRT, and EXP Operations
• 16-bit floating-point DIV, SQRT, and EXP functional units have already been developed with current SFQ technology
• Three times larger latency, while pipeline execution is based on ADD and MUL latency
• Where should FUs with a different latency be placed?
• Heterogeneous configuration of the FU array?
[Figure: FU rows alternating with ORNs (Operand Routing Network), with a DIV unit marked "Where?"]

Estimated Performance Improvement of 2-dim. Poisson Equation Calc. by LSRDP
[Plot axes: normalized exec. time by GPP (3 GHz) calc.; main memory bandwidth [GByte/sec]]

Estimated performance improvement of ERI calculation by LSRDP (3GHz)

Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)
• DFG sizes are already determined by the original recursive formula

What Types of Software/Algorithms Are Suitable for LSRDP?
• The same calculations have to be performed repeatedly
• The LSRDP is used as a high-throughput accelerator
• Input/output data size is small compared with the number of operations
[Figure: small-size input, large amount of calculations in the LSRDP, small-size output; memory access crossed out]
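The third criterion can be expressed as a simple ratio of operations to streamed I/O words. The sketch below applies it to the sample Heat DFG reported earlier in the deck (32 inputs, 16 outputs, 721 operations). The metric and the function name are illustrative assumptions, not the authors' performance model, and immediates are left out because they are assumed to be loaded through reconfiguration rather than streamed.

```python
def ops_per_io_word(operations, inputs, outputs):
    """Operations executed per input/output word moved (assumed metric)."""
    return operations / (inputs + outputs)


# Sample Heat DFG figures taken from the deck.
heat_ratio = ops_per_io_word(operations=721, inputs=32, outputs=16)
print(round(heat_ratio, 2))
```

A higher ratio means more computation is done per word transferred, so less memory bandwidth is needed to keep the data-path busy.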

Exploration of Suitable Applications for LSRDP
• Applications
  • Matrix element calculation
  • Molecular integral calculations in the molecular orbital method
  • Monte Carlo type simulation
  • etc.
• Numerical calculation library
  • Special functions (promising?)
  • Differential equations
  • Numerical integration
  • Matrix operations (difficult?)
  • Triangular matrix simultaneous equations
  • etc.
Investigating applicability to various applications
