Loading in 5 sec....

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami*PowerPoint Presentation

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami*

Download Presentation

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami*

Loading in 2 Seconds...

- 82 Views
- Uploaded on
- Presentation posted in: General

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami*

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

An Accelerator Based on Single-Flux Quantum Circuits for a High-PerformanceReconfigurable Computer

F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

and K. Murakami*

*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: dahon@c.csce.kyushu-ua.c.jp, honda@isit.or.jp

- Introduction
- Large-Scale Reconfigurable Data-Path(LSRDP) General Architecture and Specifications
- Design Procedure and Tool Chain
- Preliminary Results
- Conclusions and Future Work

Roadrunner with PowerXcell

TSUBAME

NVIDIA Tesla S1070

http://it.nikkei.co.jp/

http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html

- Parallel computer clusters with General-Purpose Processors (GPP)are often used for HPC
- Various accelerators are used with GPPs for further performance improvement
- PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc.
- Small size and low power consumption comparing to processors with similar performance

http://www.top500.org/system/9485

- Large-Scale Reconfigurable Data-Path (LSRDP):
- is introduced as an alternative accelerator
- reduces the no. of memory accesses
- is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
- is suitable for high performance scientific computations

A large memory bandwidth is demanded in conventional accelerators for high-performance computation

On chip memories are often used to hide memory access latency

- Features:
- Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped
- Pipeline execution
- Burst transfer is used for input /output rearranged data from/to memory

- Reconfigurable data-path includes:
- A large number of floating point Functional Units (FUs)
- Reconfigurable Operand Routing Network : ORN
- Dynamic reconfiguration facilities
- Streaming Buffers (SB) for I/O ports
- Implementation by SFQ circuits

LSRDP

GPP

...

FU

FU

FU

FU

ORN : Operand Routing Network

:

:

:

:

...

FU

FU

FU

FU

ORN

...

FU

FU

FU

FU

SB

SMAC

Main

Memory

Scratchpad Memory

- CMOS issues:
- high electric power consumption
- high heat radiation and difficulties in high-density packing
- memory wall problem which limits the processing speed

- SFQ Features:
- High-speed switching and signal transmission
- Low power consumption
- Compact implementation of a system (small area)
- No cost for latch
- Suitable for pipeline processing of data stream
- Serial bit-level processing

Superconducting

Research Lab. (SRL)

SFQ process

Yokohama National Univ.

SFQ-FPU chip, cell library

Nagoya Univ.

SFQ-RDP chip, cell library,

and wiring

Nagoya Univ.

CAD for logic design

and arithmetic circuits

Dr. S. Nagasawa et al.

Prof. N. Yoshikawa et al.

Prof. A. Fujimaki et al.

Prof. N. Takagi (Leader)

et al.

Kyushu Univ.

Architecture, Compiler

and Applications

Prof. K. Murakami

Dr. K. Inoue

Dr. H. Honda

Dr. F. Mehdipour

H. Kataoka

SFQ-LSRDP

Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits

Discovering appropriate applications

Developing compiler tools

Developing performance analyzing tools

LSRDP General Architecture and Specifications

- Core structure a matrix of PEs

- PE: combination of a Functional Unit (FU)
- and a data Transfer Unit (TU)

Width and Height ?

Maximum Connection Length (MCL)

between consecutive rows?

Layout: FU types

(ADD/SUB and MUL)?

- Reconfiguration mechanism?
- (PE, ORN, Immediate data)

- On-chip memory configuration?

- Processing Elements
- FU
- implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL

- TU (transfer unit) as a routing resource for transferring datafrom a row to an inconsecutive row

- FU

FU

FU

FU

TU

FU

TU

FU

TU

TU

TU

PE including Two components

Four functionalities

W

…

…

…

…

ORN

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ADD/SUM

TU

ORN

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

MUL

.

.

.

ORN

Each PE implements ADD/SUB and MUL

H

M

: MUL

A

: ADD/SUB

T

: Transfer Unit

Flexible but consume a lot of resources

W

A

A

A

M

M

M

A

A

A

M

A

M

A

A

A

M

M

A

A

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

…

…

…

…

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

W

M

M

M

M

M

A

A

A

A

M

M

A

A

A

A

A

M

M

M

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

…

…

…

…

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Type II or III, which one is more efficient?

MCL:maximum horizontal distance between two PEs located in two consecutive rows

ORN

2bit shiftregister

ORN is consisted of2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

- Three bit-stream lines for dynamic reconfiguration of:
- Immediate registers (64bit) in each PE
- Selector bits for muxes selecting the input data of FUs
- Cross-bar switches in ORNs

Design Procedure and Tool Chain

- DFGs are manually generated from critical parts of applications
- DFG mapping results are used for
- Analyzing LSRDP architecture statistics
- Generating LSRDP configuration bit-streams

DFGs & LSRDP HW constraints

For each

parameter

Appropriate value for each parameter

- Finite differential method calculation of2nd order partial differential equations
- 1dim-Heat equation(Heat)
- 1dim-Vibration equation (Vibration)
- 2dim-Poisson equation (Poisson)

- Quantum chemistry application
- Recursive parts of Electron Repulsion Integral calculation(ERI-Rec)

Only ADD/SUB and MUL operations are usedin the critical calculations of all above applications

- 1-dim. heat equation for T(x,t)
- Calculation by Finite DifferenceMethod (FDM)

(A is const.)

Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

Basic DFG corresponding to Minimum FDM calculation

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)

Totally,

24 DFGs are prepared

for benchmark DFG

Due to broad range of DFG sizes

DFGs are classified as S, M, L, XL with respect to their size

and the number of Input/Output nodes

Longest connections

Preliminary Results

LSRDP Dimensions and the number of Input/Output Ports

Needs further MCL optimization

Layout I

Layout II

(Except ERI1 DFG which gives better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.

Base processor configuration

GPP+LSRDP configuration

GPP： Exec. time measurement by means of a processor simulator

LSRDP： Estimation by performance modeling

Basic: SB only

Reuse: SB + SPM

Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory.

A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well as the execution time on GPP

- A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced.
- 24 benchmark Data Flow Graphs (DFGs) were manually generated.
- LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach.
- LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances.

- Future Work:
- To achieve higher performance it is required to reduce various overhead costs mainly related to data management part.
- To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size.

This research was supportedin part by Core Research for Evolutional Scienceand Technology (CREST) of Japan Scienceand Technology Corporation (JST).

Thanks!

Any Questions?

- Backup Slides

Single Flux Quantum

Superconductivity

loop

Ib

L

Josephson junction

Ic

Φ0

Tunneling effect

2mV

2ps

For each class, a lot of extra TUs

are needed to map all DFGs

FU

T

PE types

FU

T

T

T

Final optimized Maximum Connection Length (MCL) results

ORNs should provide the connection length of 9in LSRDP-S/M (MCL= 9).

For LSRDP-L, MCL = 19 !!!

⇒ Serious Implementation Cost

Possible to decrease?

Connection

length

93% of connection lengths are 0 ~ 2

Only small fractions of connections results in larger ORNs

- Almost a similar small size values are achievedfor Layout I and IIfor the majority of DFGs (except ERI1 DFG which gives better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost as well

Heat

Layout II

ERI 1

Layout III

16Bits Floating point DIV, SQRT, and EXP Functional unit have been already developed by SFQ current technology.

...

FU

FU

FU

FU

DIV

ORN : Operand Routing Network

...

FU

FU

FU

FU

ORN

Where ?

...

FU

FU

FU

FU

:

:

:

:

Three times

larger latency

...

FU

FU

FU

FU

ORN

...

FU

FU

FU

FU

Pipeline execution based

on ADD and MUL latency

Where should we place different latency FU ?

Heterogeneous configuration of FU array ?

Normalized exec. time

by GPP(3GHz) calc.

Main Mem. bandwidth [GByte/sec]

(3GHz)

DFG sizes have already determinedfrom original recursive formula

memory access

X

Large amount

of calculations

small size

of input

small size

of output

- When same calculations have to be calculated repeatedly.
- LSRDP is used for high throughput accelerator.

- Input/Output data size is small compared with the amount of the operations.

LSRDP

- Application
- matrix elements calculation
- Molecular integral calculations in molecular orbital method

- Monte Carlo type simulation
- etc…

- matrix elements calculation
- Numerical calculation library
- special function (promising?)
- differential equation
- numerical integration
- matrix operation (difficult ??)
- Triangular matrix simultaneous equation
- etc…

Investigating applicability against various applications

~Up to (pp,pp) Recursive Calculation~

(ss,ss)(m)and all coefficients

are given as input

# of Inputs： Max. 28

# of Outputs：1 ~ 81

(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)

Each DFG has only ADD (SUB) and MUL FUs.

DFG sizes are determined by original calculation algorithm

ERI-Rec

(8 DFGs)

Vibration (7)

# of FUs

Heat (6)

Poisson (3)

# of Inputs

DFGs have different qualities in terms of the

# of FUs, # of Inputs and Outputs

MCL

Heat original DFG

(I/O: 8/4, FUs: 32)

Mapping result

- Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0

2D – Poisson Eq.

Successive Over Relaxation method

ω is const.

Red/Black Gauss Seidel

In order to obtain u(n+1) (xi,yj) in the next iteration,

current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n) (xi,yj±1) are needed

55

- Maximum Poisson DFG

Inputs: 32

Outputs: 1

FUs: 721 Immediates: 364

- Variable parameters:
- Freq. of GPP and LSRDP
- Bandwidth between main memory and LSRDP
- Latency of reconfiguration time
- # of FPUs in LSRDP
- Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)

LSRDP

GPP

Main

Memory

Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

GPP： Exec. time measurement by processor simulator

LSRDP： Estimation by performance modeling

57

Main Mem. bandwidth [GByte/sec]

Normalized exec. time

by GPP(3GHz) calc.

Main Mem. bandwidth [GByte/sec]

4+1ノードの入力

中心1ノードの出力

SOR式

1回の計算

SOR式

2回の繰り返し

9+4ノードの入力

中心1ノードの出力

- DFGの拡大により1度に計算可能な繰り返し回数が増加

これに伴い必要な入力数も増加

60

Original GPP code

LSRDP code

LSRDP Reconfiguration

Loop j’Input Data Rearrangement

Loop N

LSRDP pipeline exec.

(FDM DFG calc.)

End Loop

Output Data Rearrangement

End Loop

Loop j

Loop i

T(xi,tj)

End Loop

End Loop

61

Original GPP code

LSRDP code

Loop Iter

Loop i

loop j

u(xi,yj)

End Loop

End Loop

End Loop

LSRDP Reconfiguration

Loop Iter’

Input Data rearrangement

Loop N

LSRDP pipeline exec. (FDM DFG calc.)

End Loop

Output Data rearrangement

End Loop

62

LSRDP code

original GPP code

Loop I,J,K,L

Loop contraction

Initial Integral Calc.

Recursive Calc.

End Loop

Partial Fock Calc.

End Loop

Loop I,J,K,LLSRDP ReconfigurationLoop contraction

Initial Integral Calc.

End Loop

Input Data rearrangement

Loop N

LSRDP pipeline calc.

(Recursive DFG calc.)End Loop

Output Data rearrangement

Partial Fock Calc.

End Loop

Initial Integral Calc.:

1/Sqrt, Exp, Fm(T) are utilized => GPP calculation

Recursive Calc.:

only ADD/SUB, MUL

=> LSRDP calculation

63

Original

Horizontal Decomp.

Loop N

ReconfigurationLoop M

LSRDP pipeline calc.

End Loop

End Loop

Loop N

ReconfigurationLoop M

1stLSRDP pipeline calc.

End Loop

End Loop

Loop N

ReconfigurationLoop M

2ndLSRDP pipeline calc.

End Loop

End Loop

Vertical Decomp.

Loop n ( > N)

ReconfigurationLoop M

LSRDP pipeline calc.

End Loop

End Loop

64

- Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)

- Variable parameters:
- Freq. of GPP and LSRDP
- Bandwidth between main memory and LSRDP
- Latency of reconfiguration time
- # of FPUs in LSRDP
- Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)

LSRDP

GPP

Main

Memory

Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

GPP： Exec. time measurement by processor simulator

LSRDP： Estimation by performance modeling

67

Sort data + Reconfig.

+ Send signal for comm.

Execution time

Calculation time

Total pipeline depth

in the given program

# of rows of LSRDP

(latency of LSRDP)

+

Stall time

Stall from

Bandwidthreq > Bandwidthmem

Latency of LSRDP <->Mem

For first Input and last output

+

68

W

…

…

…

…

ORN

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ADD/SUM

TU

ORN

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

MUL

.

.

.

ORN

Each PE implements ADD/SUB and MUL

H

M

: MUL

A

: ADD/SUB

T

: Transfer Unit

Total No. of PEs= W * H

Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs)

W

A

A

A

M

M

M

A

A

A

M

A

M

A

A

A

M

M

A

A

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

…

…

…

…

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+

½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)

W

M

M

M

M

M

A

A

A

A

M

M

A

A

A

A

M

M

M

M

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

…

…

…

…

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+

½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)

CB

½CB

10

11

00

01

10

11

00

01

- CB
- two inputs/outputs
- four cases are possible
- reconfigurable

- 1/2CB
- one input/ two outputs
- four cases are possible
- reconfigurable

T2

T2

½CB

FPU

FPU

CB

CB

CB

CB

T

½CB

T

T2

CB

CB

CB

½CB

CB

CB

FPU

FPU

CB

CB

T

T2

T

½CB

CB

CB

CB

½CB

CB

CB

FPU

FPU

CB

CB

T

½CB

T

T2

CB

CB

CB

½CB

FPU

FPU

CB

CB

CB

CB

½CB

T

T

T2

CB

CB

CB

½CB

FPU

CB

CB

FPU

CB

CB

½CB

T

T

T2

CB

T2

T2

- “+”:
- scalable
- pipelined
- easily re-designed for any number of N and M

- “–”:
- large number of Josephson junctions
- M number of ½ CB and (2×M+1)×MCL number of CB

The number of FPUs is M, the number of Transfer Units (T) is also M;

MCL is a maximum connection length if we consider FPUs only

=>

- ½ CB – 2×M
- T2 – (M+4×MCL+2)
- CB – (2×MCL+1) ×(4×M-1)

CB: 351 JJs

Reduction of the number of Josephson junctions is essential!

½ CB: 216 JJs

* T2 is a 2-bit shift register

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

PE & ORN reconfiguration structure

State Diagram

Reconfiguration in LSRDP

4.2 K

SFQ 0.5um process

CMOS

CPU

(1chip)

ORN

2TB memory module

（FB-DIMM

[DDR3@1333MHz, 128GB]

×16 modules）

...

FPU

SFQ RDP

（32FPU×32chips）

（４GFLOPS／FPU)

ORN

:

:

:

:

...

ORN

SFQ Streaming Buffer

（64Kb×2chips）

...

ORN

SB

1024FPU@MCM

（３４chips）×4MCM

:

:

:

...

:

SMAC

SMAC

SMAC

Memory band width per MCM：256GB/ｓ

(=16GB/s ×16 channels)