### An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer

F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami*

*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: dahon@c.csce.kyushu-ua.c.jp, honda@isit.or.jp

Agenda

- Introduction
- Large-Scale Reconfigurable Data-Path (LSRDP) General Architecture and Specifications
- Design Procedure and Tool Chain
- Preliminary Results
- Conclusions and Future Work

Introduction

[Figure: example HPC systems and accelerators: Roadrunner with PowerXCell, TSUBAME, NVIDIA Tesla S1070. Image sources: http://it.nikkei.co.jp/ and http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html]

- Parallel computer clusters with General-Purpose Processors (GPPs) are often used for HPC
- Various accelerators are used with GPPs for further performance improvement
  - PowerXCell, GPGPU, GRAPE-DR, ClearSpeed, etc.
  - Small size and low power consumption compared with processors of similar performance

http://www.top500.org/system/9485

Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)

- Large-Scale Reconfigurable Data-Path (LSRDP):
  - is introduced as an alternative accelerator
  - reduces the number of memory accesses
  - is implemented with Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
  - is suitable for high-performance scientific computations

Conventional accelerators demand a large memory bandwidth for high-performance computation

On-chip memories are often used to hide memory access latency

Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor

- Features:
  - Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped
  - Pipeline execution
  - Burst transfer is used to move rearranged input/output data from/to memory

- Reconfigurable data-path includes:
  - A large number of floating-point Functional Units (FUs)
  - Reconfigurable Operand Routing Network (ORN)
  - Dynamic reconfiguration facilities
  - Streaming Buffers (SB) for I/O ports
  - Implementation by SFQ circuits

[Figure: LSRDP block diagram. A GPP is coupled to the LSRDP, which consists of rows of FUs separated by Operand Routing Networks (ORNs); Streaming Buffers (SB), SMAC, scratchpad memory and main memory supply the data.]

Single-Flux Quantum (SFQ) versus CMOS

- CMOS issues:
  - High electric power consumption
  - High heat radiation and difficulties in high-density packing
  - Memory wall problem, which limits the processing speed

- SFQ features:
  - High-speed switching and signal transmission
  - Low power consumption
  - Compact implementation of a system (small area)
  - No cost for latches
  - Suitable for pipeline processing of data streams
  - Serial bit-level processing

CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits

- Superconducting Research Lab. (SRL): SFQ process (Dr. S. Nagasawa et al.)
- Yokohama National Univ.: SFQ-FPU chip, cell library (Prof. N. Yoshikawa et al.)
- Nagoya Univ.: SFQ-RDP chip, cell library, and wiring (Prof. A. Fujimaki et al.)
- Nagoya Univ.: CAD for logic design and arithmetic circuits (Prof. N. Takagi (Leader) et al.)
- Kyushu Univ.: Architecture, Compiler and Applications (Prof. K. Murakami, Dr. K. Inoue, Dr. H. Honda, Dr. F. Mehdipour, H. Kataoka)

SFQ-LSRDP

Goals of the Project

Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits

Discovering appropriate applications

Developing compiler tools

Developing performance analyzing tools

Parameters to Be Decided Within the LSRDP Design Procedure

- Core structure: a matrix of PEs
  - PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)
- Width and height?
- Maximum Connection Length (MCL) between consecutive rows?
- Layout: FU types (ADD/SUB and MUL)?
- Reconfiguration mechanism? (PE, ORN, immediate data)
- On-chip memory configuration?

LSRDP Architecture

- Processing Elements
  - FU: implements basic 64-bit double-precision floating-point operations including ADD, SUB and MUL
  - TU (transfer unit): a routing resource for transferring data from a row to a non-consecutive row

[Figure: a PE includes two components (FU and TU) and provides four functionalities.]

Layout Types: Type I

Each PE implements ADD/SUB and MUL

[Figure: Type I array of height H. Rows of ADD/SUB units (A) with Transfer Units (T) and rows of MUL units (M) are separated by ORNs. Legend: M = MUL, A = ADD/SUB, T = Transfer Unit.]

Flexible, but consumes a lot of resources

Layout Types: Type II (Checkered)

Each PE implements ADD/SUB or MUL

[Figure: Type II array of height H. ADD/SUB (A) and MUL (M) units are interleaved in a checkered pattern, each with a Transfer Unit (T); ORNs separate consecutive rows.]

Layout Types: Type III (Striped)

Each PE implements ADD/SUB or MUL

[Figure: Type III array of height H. ADD/SUB (A) and MUL (M) units are grouped in stripes, each with a Transfer Unit (T); ORNs separate consecutive rows.]

Type II or III, which one is more efficient?

Maximum Connection Length (MCL)

MCL: the maximum horizontal distance between two PEs located in two consecutive rows
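The MCL definition can be illustrated with a small Python sketch. The placement dictionary, the edge list and both function names are hypothetical and not part of the authors' tool chain; every edge is assumed to join two consecutive rows, with longer hops already split through TUs.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping result: node name -> (row, column) of the PE it occupies.
Placement = Dict[str, Tuple[int, int]]


def connection_lengths(placement: Placement, edges: List[Tuple[str, str]]) -> List[int]:
    """Return the horizontal distance of every edge between consecutive rows."""
    lengths = []
    for src, dst in edges:
        src_row, src_col = placement[src]
        dst_row, dst_col = placement[dst]
        if dst_row - src_row != 1:
            # Assumption: hops over several rows were already routed through TUs.
            raise ValueError(f"edge {src}->{dst} does not link consecutive rows")
        lengths.append(abs(dst_col - src_col))
    return lengths


def max_connection_length(placement: Placement, edges: List[Tuple[str, str]]) -> int:
    """MCL of a mapping: the largest horizontal distance over all edges."""
    return max(connection_lengths(placement, edges), default=0)


# Tiny made-up mapping: two inputs feed an adder, which feeds a multiplier.
example_placement = {"in0": (0, 0), "in1": (0, 3), "add0": (1, 1), "mul0": (2, 1)}
example_edges = [("in0", "add0"), ("in1", "add0"), ("add0", "mul0")]
print(max_connection_length(example_placement, example_edges))
```

An ORN sized for a given MCL must be able to route every edge whose length is at most that value, so a single long edge is enough to enlarge the network.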

An ORN Structure

[Figure: ORN structure built from 2-bit shift registers and crossbar switches.]

An ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches

A. Fujimaki, et al., “Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

Dynamic Reconfiguration Mechanism

- Three bit-stream lines for dynamic reconfiguration of:
  - Immediate registers (64-bit) in each PE
  - Selector bits for the muxes selecting the input data of FUs
  - Crossbar switches in ORNs

Compiler and Design Flow

- DFGs are manually generated from critical parts of applications
- DFG mapping results are used for:
  - Analyzing LSRDP architecture statistics
  - Generating LSRDP configuration bit-streams

LSRDP Design Procedure

[Figure: design flow. Inputs are the DFGs and the LSRDP hardware constraints; for each parameter, the procedure derives an appropriate value.]

Benchmark Applications for Design Procedures

- Finite difference method calculation of 2nd-order partial differential equations
  - 1-dim heat equation (Heat)
  - 1-dim vibration equation (Vibration)
  - 2-dim Poisson equation (Poisson)
- Quantum chemistry application
  - Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)

Only ADD/SUB and MUL operations are used in the critical calculations of all the above applications

DFG Extraction: Heat Equation

- 1-dim. heat equation for T(x,t)
- Calculation by the Finite Difference Method (FDM)

(A is const.)

The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG

The basic DFG corresponds to the minimum FDM calculation
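The slide's equations did not survive extraction, so the following Python sketch assumes the standard explicit FDM update T(x, t+dt) = T(x, t) + A·(T(x-dx, t) - 2·T(x, t) + T(x+dx, t)) with constant A. The function names are hypothetical. One call to `fdm_step` plays the role of a row of basic DFGs (horizontal extension); repeating it fuses several time steps (vertical extension).

```python
from typing import List


def fdm_step(t: List[float], a: float) -> List[float]:
    """One explicit time step; uses only ADD/SUB and MUL.

    Each output needs its left and right neighbours, so the result is
    two points shorter than the input.
    """
    return [t[i] + a * (t[i - 1] - 2.0 * t[i] + t[i + 1]) for i in range(1, len(t) - 1)]


def fdm_block(t: List[float], a: float, steps: int) -> List[float]:
    """Fuse several time steps, as a vertically extended DFG would."""
    for _ in range(steps):
        t = fdm_step(t, a)
    return t


# A 32-point input fused over 8 time steps leaves 16 output points.
initial = [0.0] * 15 + [1.0, 1.0] + [0.0] * 15
result = fdm_block(initial, 0.25, 8)
print(len(initial), len(result))
```

Under this assumed scheme, 32 inputs and 8 fused steps give 16 outputs, which matches the input and output counts quoted for the sample Heat DFG; whether that DFG is actually built this way is not stated in the slides.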

Example of extracted DFGs: Heat

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)

DFG Classification

In total, 24 DFGs are prepared as benchmark DFGs

Because of the broad range of DFG sizes, the DFGs are classified as S, M, L or XL with respect to their size and the number of input/output nodes

Mapping DFGs onto LSRDP

Longest connections

LSRDP Specifications: Width & Height

LSRDP Dimensions and the number of Input/Output Ports

LSRDP Specifications: MCL

Needs further MCL optimization

Layout II

Analyzing Various LSRDP Layouts (except the ERI1 DFG, which gives a better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.

Preliminary Performance Evaluation

Base processor configuration

GPP+LSRDP configuration

GPP: Exec. time measurement by means of a processor simulator

LSRDP: Estimation by performance modeling

Preliminary Performance Evaluation(Heat)

Basic: SB only

Reuse: SB + SPM

Data reuse is employed to avoid the need for data rearrangement as well as frequent data retrieval from the scratchpad memory.

Preliminary Performance Evaluation (Poisson)

Only a small fraction of the total time is processing time on the LSRDP; the main fraction consists of various overhead times and the execution time on the GPP

Conclusions & Future Work

- A high-performance computer comprising an accelerator (LSRDP) implemented with superconducting circuits was introduced.
- 24 benchmark Data Flow Graphs (DFGs) were manually generated.
- The LSRDP micro-architecture was designed from the characteristics of scientific applications via a quantitative approach.
- LSRDP is promising for resolving issues originating from CMOS technology as well as for achieving considerable performance.

- Future Work:
  - To achieve higher performance, the various overhead costs, mainly those related to data management, must be reduced.
  - To reduce the implementation cost of LSRDP, we will focus on reducing the maximum connection length and the ORN size.

Acknowledgement

This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

Any Questions?

Backup Slides

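SFQ (Single Flux Quantum) Circuit

High speed, low power consumption, and operation by a different principle from CMOS

[Figure: superconducting loop with inductance L, bias current Ib and a Josephson junction (tunneling effect) with critical current Ic, holding a flux quantum Φ0; the SFQ pulse is annotated with 2 mV and 2 ps.]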

Connection Length Minimization- Results

Final optimized Maximum Connection Length (MCL) results

ORNs should provide a connection length of 9 in LSRDP-S/M (MCL = 9).

For LSRDP-L, MCL = 19 !!!

⇒ Serious Implementation Cost

Possible to decrease?

Distributions of Connection Lengths

[Figure: histogram of connection lengths.]

93% of connection lengths are 0 ~ 2

Only a small fraction of the connections results in larger ORNs

Analyzing Various LSRDP Layouts

- Nearly the same small sizes are achieved with Layouts I and II for the majority of DFGs (except the ERI1 DFG, which gives a better size for Layout III)

Layout II can therefore be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost

FU Layout for DIV, SQRT, EXP operations

16-bit floating-point DIV, SQRT, and EXP functional units have already been developed with current SFQ technology.

[Figure: FU array with ORNs between the rows. Pipeline execution is based on the ADD and MUL latency; a DIV unit has three times larger latency, and its position in the array is marked "Where?".]

Where should we place FUs with a different latency?

Heterogeneous configuration of the FU array?

[Figure: estimated performance improvement of the 2-dim Poisson equation by LSRDP calc. Y-axis: exec. time normalized by GPP (3 GHz) calc.; x-axis: main memory bandwidth [GByte/sec].]

Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)

DFG sizes are already determined by the original recursive formula

[Figure: ERI-Rec recursion: a large amount of calculation with a small input size and a small output size.]

What types of software/algorithms are suitable for LSRDP?

- The same calculations have to be performed repeatedly.
- LSRDP is used as a high-throughput accelerator.
- The input/output data size is small compared with the number of operations.
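The last criterion can be made concrete by dividing the operation count of a DFG by its number of inputs and outputs. The Python sketch below uses the input, output and operation (FU) counts quoted on the example DFG slides; the ratio itself and the function name are illustrative assumptions, not a metric used by the authors.

```python
# (inputs, outputs, operations) as quoted on the example DFG slides.
EXAMPLE_DFGS = {
    "Heat (sample DFG)": (32, 16, 721),
    "ERI-Rec (pipj,pkpl)": (28, 81, 1004),
    "ERI-Rec vertical partition": (24, 1, 108),
}


def ops_per_io(inputs: int, outputs: int, operations: int) -> float:
    """Operations performed per word moved in or out of the data-path."""
    return operations / (inputs + outputs)


for name, (n_in, n_out, n_ops) in EXAMPLE_DFGS.items():
    print(f"{name}: {ops_per_io(n_in, n_out, n_ops):.2f} operations per I/O word")
```

A higher ratio means more computation per transferred word, which is the situation in which a memory-bandwidth-limited accelerator benefits most.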


Exploration of suitable applications for LSRDP

- Application
  - Matrix elements calculation
  - Molecular integral calculations in the molecular orbital method
  - Monte Carlo type simulation
  - etc.
- Numerical calculation library
  - Special function (promising?)
  - Differential equation
  - Numerical integration
  - Matrix operation (difficult?)
  - Triangular matrix simultaneous equation
  - etc.

Investigating applicability to various applications

Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.

~Up to (pp,pp) Recursive Calculation~

(ss,ss)(m) and all coefficients are given as input

# of Inputs: Max. 28

# of Outputs: 1 ~ 81

(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)

Each DFG has only ADD (SUB) and MUL FUs.

DFG sizes are determined by original calculation algorithm

DFG Distribution for each application

[Figure: scatter plot of the # of FUs versus the # of Inputs for the benchmark DFGs: ERI-Rec (8 DFGs), Vibration (7), Heat (6), Poisson (3).]

DFGs have different qualities in terms of the # of FUs and the # of Inputs and Outputs

Example of extracted DFGs (ERI-Rec)

- Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0

Poisson Equation

2D – Poisson Eq.

Successive Over Relaxation method

ω is const.

Red/Black Gauss Seidel

To obtain $u^{(n+1)}(x_i, y_j)$ in the next iteration, the current values of five variables are needed: $u^{(n)}(x_i, y_j)$, $u^{(n)}(x_{i\pm1}, y_j)$ and $u^{(n)}(x_i, y_{j\pm1})$
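The five-point dependence can be sketched in Python. The slide's own formula was lost in extraction, so this assumes the textbook SOR update for ∇²u = f on a uniform grid with spacing h, u ← (1 - ω)·u + (ω/4)·(sum of the four neighbours - h²·f), swept in red/black order. The function name and grid layout are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def sor_sweep(u: Grid, f: Grid, h: float, omega: float, color: int) -> None:
    """Update in place all interior points whose (i + j) parity equals `color`.

    Points of one colour depend only on points of the other colour, so a
    whole colour can be streamed through a pipeline. Only ADD/SUB and MUL
    are used.
    """
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            if (i + j) % 2 != color:
                continue
            neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            gauss_seidel = 0.25 * (neighbours - h * h * f[i][j])
            u[i][j] = (1.0 - omega) * u[i][j] + omega * gauss_seidel


def solve(u: Grid, f: Grid, h: float, omega: float, iterations: int) -> None:
    """One iteration is a red sweep followed by a black sweep."""
    for _ in range(iterations):
        sor_sweep(u, f, h, omega, 0)
        sor_sweep(u, f, h, omega, 1)


# Laplace problem (f = 0) on a 6x6 grid with the boundary held at 1.0.
n = 6
u = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)] for i in range(n)]
f = [[0.0] * n for _ in range(n)]
solve(u, f, 1.0, 1.5, 200)
print(max(abs(u[i][j] - 1.0) for i in range(n) for j in range(n)))
```

With ω = 1 the update reduces to red/black Gauss-Seidel, the other method named on the slide.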


Performance Evaluation: Simulation Environment

- Variable parameters:
  - Freq. of GPP and LSRDP
  - Bandwidth between main memory and LSRDP
  - Latency of reconfiguration time
  - # of FPUs in LSRDP
  - Supported FPU types (Add, Mul, Div, Exp, Sqrt and Error function units are supported)

[Figure: GPP and LSRDP connected to the main memory.]

The streaming buffer in the LSRDP chip is used

I/O data is sorted in the main memory.

GPP: Exec. time measurement by processor simulator

LSRDP: Estimation by performance modeling

[Figure: estimated performance improvement of the 1-dim heat equation by LSRDP calc. Y-axis: exec. time normalized by GPP (3 GHz) calc.; x-axis: main memory bandwidth [GByte/sec].]

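Increase in the number of iterations through DFG enlargement in the Poisson red/black method

- SOR formula, one calculation: input of 4+1 nodes, output of the 1 center node
- SOR formula, two iterations: input of 9+4 nodes, output of the 1 center node
- Enlarging the DFG increases the number of iterations that can be computed at once
- The required number of inputs increases accordingly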

Implementation of Heat calculation to LSRDP

Original GPP code:

  Loop j
    Loop i
      T(xi,tj)
    End Loop
  End Loop

LSRDP code:

  LSRDP Reconfiguration
  Loop j’
    Input Data Rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data Rearrangement
  End Loop

Implementation of Poisson calculation to LSRDP

Original GPP code:

  Loop Iter
    Loop i
      Loop j
        u(xi,yj)
      End Loop
    End Loop
  End Loop

LSRDP code:

  LSRDP Reconfiguration
  Loop Iter’
    Input Data rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data rearrangement
  End Loop

Implementation of ERI-Rec calculation to LSRDP

Original GPP code:

  Loop I,J,K,L
    Loop contraction
      Initial Integral Calc.
      Recursive Calc.
    End Loop
    Partial Fock Calc.
  End Loop

LSRDP code:

  Loop I,J,K,L
    LSRDP Reconfiguration
    Loop contraction
      Initial Integral Calc.
    End Loop
    Input Data rearrangement
    Loop N
      LSRDP pipeline calc. (Recursive DFG calc.)
    End Loop
    Output Data rearrangement
    Partial Fock Calc.
  End Loop

Initial Integral Calc.: 1/Sqrt, Exp, Fm(T) are utilized => GPP calculation

Recursive Calc.: only ADD/SUB, MUL => LSRDP calculation

Vertical vs. Horizontal DFG Decomposition

Original:

  Loop N
    Reconfiguration
    Loop M
      LSRDP pipeline calc.
    End Loop
  End Loop

Horizontal Decomp.:

  Loop N
    Reconfiguration
    Loop M
      1st LSRDP pipeline calc.
    End Loop
  End Loop
  Loop N
    Reconfiguration
    Loop M
      2nd LSRDP pipeline calc.
    End Loop
  End Loop

Vertical Decomp.:

  Loop n ( > N)
    Reconfiguration
    Loop M
      LSRDP pipeline calc.
    End Loop
  End Loop


Performance Evaluation: Execution Time Modeling

Execution time is modeled as the sum of:

- Overhead: sort data + reconfiguration + send signal for communication
- Calculation time: determined by the total pipeline depth in the given program and the # of rows of the LSRDP (latency of LSRDP)
- Stall time: stall that occurs when Bandwidth_req > Bandwidth_mem
- Latency between LSRDP and memory, for the first input and the last output
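A rough Python sketch of such an additive model follows. Only the list of terms comes from the slide; every formula inside the function is an assumption made for illustration (one result per cycle once the pipeline is full, a stall equal to the extra transfer time when the required bandwidth exceeds the memory bandwidth), and all parameter names are hypothetical.

```python
def execution_time(
    overhead_s: float,        # sort data + reconfiguration + signalling (assumed lump sum)
    n_items: int,             # data sets streamed through the pipeline
    pipeline_depth: int,      # cycles needed to traverse all LSRDP rows
    freq_hz: float,           # LSRDP clock frequency
    bytes_per_item: float,    # input + output bytes per data set
    mem_bw_bytes_s: float,    # main memory bandwidth
    mem_latency_s: float,     # one-way latency between LSRDP and memory
) -> float:
    """Estimated execution time in seconds under the assumptions above."""
    # Fill the pipeline once, then accept one data set per cycle.
    calc_s = (pipeline_depth + n_items - 1) / freq_hz

    # Stall only when the pipeline asks for more bandwidth than memory offers.
    required_bw = bytes_per_item * freq_hz
    stall_s = 0.0
    if required_bw > mem_bw_bytes_s:
        stall_s = n_items * bytes_per_item / mem_bw_bytes_s - n_items / freq_hz

    # Latency is paid for the first input and for the last output.
    latency_s = 2.0 * mem_latency_s
    return overhead_s + calc_s + stall_s + latency_s


# Made-up numbers, chosen only to exercise both branches.
print(execution_time(1.0, 1000, 24, 1000.0, 8.0, 4000.0, 0.001))
print(execution_time(1.0, 1000, 24, 1000.0, 8.0, 16000.0, 0.001))
```

Sweeping `mem_bw_bytes_s` in such a model gives curves of execution time against memory bandwidth, the kind of plot shown in the evaluation slides.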

Layout Types: Type I

Each PE implements ADD/SUB and MUL

[Figure: Type I array of height H. Rows of ADD/SUB units (A) with Transfer Units (T) and rows of MUL units (M) are separated by ORNs. Legend: M = MUL, A = ADD/SUB, T = Transfer Unit.]

Total No. of PEs = W * H

Total Area = W * H * [Area(MUL) + Area(ADD/SUB) + Area(TU)] + Area(ORNs)

Layout Types: Type II

Each PE implements ADD/SUB or MUL

[Figure: Type II array of height H. ADD/SUB (A) and MUL (M) units are interleaved, each with a Transfer Unit (T); ORNs separate consecutive rows.]

Total No. of PEs = W * H

Total Area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)

Layout Types: Type III

Each PE implements ADD/SUB or MUL

[Figure: Type III array of height H. ADD/SUB (A) and MUL (M) units are grouped in stripes, each with a Transfer Unit (T); ORNs separate consecutive rows.]

Total No. of PEs = W * H

Total Area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
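The area expressions of the three layout types can be compared numerically. This Python sketch transcribes the formulas as given; the unit areas in the example are made-up placeholders rather than measured SFQ figures, and the ORN area is treated as a single given number.

```python
def area_type1(w: int, h: int, a_mul: float, a_add: float, a_tu: float, a_orn: float) -> float:
    """Type I: every PE holds a MUL unit, an ADD/SUB unit and a TU."""
    return w * h * (a_mul + a_add + a_tu) + a_orn


def area_type2_or_3(w: int, h: int, a_mul: float, a_add: float, a_tu: float, a_orn: float) -> float:
    """Types II and III: half the PEs hold MUL + TU, the other half ADD/SUB + TU."""
    return 0.5 * w * h * (a_mul + a_tu) + 0.5 * w * h * (a_add + a_tu) + a_orn


# Placeholder unit areas (arbitrary units) for a 4 x 2 array.
W, H = 4, 2
A_MUL, A_ADD, A_TU, A_ORN = 3.0, 2.0, 1.0, 10.0
type1 = area_type1(W, H, A_MUL, A_ADD, A_TU, A_ORN)
type2 = area_type2_or_3(W, H, A_MUL, A_ADD, A_TU, A_ORN)
print(type1, type2, type1 - type2)
```

For equal W, H and ORN area, the saving of Types II and III over Type I is ½ · W · H · [Area(MUL) + Area(ADD/SUB)]. In practice a Type II or III array may need different dimensions to fit the same DFG, so the comparison holds only for a fixed array size.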

CB Various Functionalities

[Figure: switching patterns of a CB and a ½CB for the configuration codes 00, 01, 10 and 11.]

- CB
  - two inputs / two outputs
  - four cases are possible
  - reconfigurable

- ½CB
  - one input / two outputs
  - four cases are possible
  - reconfigurable

An ORN Structure

[Figure: ORN built from CBs, ½CBs and T2 elements connecting rows of FPUs and Transfer Units (T).]

- “+”:
  - scalable
  - pipelined
  - easily re-designed for any number of N and M

- “–”:
  - large number of Josephson junctions
  - M ½CBs and (2×M+1)×MCL CBs

The number of FPUs is M and the number of Transfer Units (T) is also M; MCL is the maximum connection length if we consider FPUs only

=>

- ½CB: 2×M
- T2: M + 4×MCL + 2
- CB: (2×MCL+1) × (4×M-1)

CB: 351 JJs

½CB: 216 JJs

Reduction of the number of Josephson junctions is essential!

* T2 is a 2-bit shift register
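The component counts can be evaluated for concrete values of M and MCL. The Python sketch below applies the three count formulas and the quoted junction counts for a CB (351 JJs) and a ½CB (216 JJs). The JJ count of a T2 is not given, so T2 elements are left out of the junction total. M = 32 and MCL = 9 are example values chosen here, not a configuration stated on this slide.

```python
from typing import Dict

JJ_PER_CB = 351
JJ_PER_HALF_CB = 216


def orn_components(m: int, mcl: int) -> Dict[str, int]:
    """Component counts of one ORN for M FPUs plus M Transfer Units."""
    return {
        "half_cb": 2 * m,
        "t2": m + 4 * mcl + 2,
        "cb": (2 * mcl + 1) * (4 * m - 1),
    }


def orn_switch_junctions(m: int, mcl: int) -> int:
    """Josephson junctions in the CBs and half-CBs only (T2 cost unknown)."""
    parts = orn_components(m, mcl)
    return parts["cb"] * JJ_PER_CB + parts["half_cb"] * JJ_PER_HALF_CB


example = orn_components(32, 9)
print(example, orn_switch_junctions(32, 9))
```

Because the CB count grows with the product of MCL and M, lowering the MCL shrinks the dominant term directly, which is why the future-work item targets connection length.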

A. Fujimaki, et al., “Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

10 TFLOPS SFQ-RDP computer

- 4.2 K, SFQ 0.5 um process
- CMOS CPU (1 chip)
- 2 TB memory module (FB-DIMM [DDR3 @ 1333 MHz, 128 GB] × 16 modules)
- SFQ RDP (32 FPUs × 32 chips, 4 GFLOPS/FPU)
- SFQ Streaming Buffer (64 Kb × 2 chips)
- 1024 FPUs per MCM (34 chips) × 4 MCMs
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)

[Figure: system diagram with FPU rows and ORNs inside the SFQ RDP, the Streaming Buffer (SB) and SMACs connecting to the memory modules.]
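The quoted figures can be cross-checked with a few lines of Python. All constants are taken from the slide; the derived peak and the bytes-per-FLOP ratio are plain arithmetic. Reading the 10 TFLOPS in the title as a target below the arithmetic peak is an assumption.

```python
FPUS_PER_CHIP = 32
RDP_CHIPS_PER_MCM = 32
SB_CHIPS_PER_MCM = 2
GFLOPS_PER_FPU = 4
MCMS = 4
DIMM_GB = 128
DIMM_MODULES = 16
CHANNEL_GB_S = 16
CHANNELS_PER_MCM = 16

fpus_per_mcm = FPUS_PER_CHIP * RDP_CHIPS_PER_MCM            # FPUs on one MCM
chips_per_mcm = RDP_CHIPS_PER_MCM + SB_CHIPS_PER_MCM        # RDP + streaming buffer chips
peak_gflops_per_mcm = fpus_per_mcm * GFLOPS_PER_FPU
peak_tflops_total = peak_gflops_per_mcm * MCMS / 1000
memory_gb = DIMM_GB * DIMM_MODULES
bandwidth_gb_s_per_mcm = CHANNEL_GB_S * CHANNELS_PER_MCM
bytes_per_flop = bandwidth_gb_s_per_mcm / peak_gflops_per_mcm

print(fpus_per_mcm, chips_per_mcm, peak_tflops_total, memory_gb, bandwidth_gb_s_per_mcm, bytes_per_flop)
```

The low bytes-per-FLOP ratio shows why the design relies on mapping whole DFGs onto the data-path and on streaming buffers rather than on per-operation memory accesses.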
