An accelerator based on single flux quantum circuits for a high performance reconfigurable computer
Download
1 / 78

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami* - PowerPoint PPT Presentation


  • 99 Views
  • Uploaded on

An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer. F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami* *Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami*' - dior


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
An accelerator based on single flux quantum circuits for a high performance reconfigurable computer

An Accelerator Based on Single-Flux Quantum Circuits for a High-PerformanceReconfigurable Computer

F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

and K. Murakami*

*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: [email protected], [email protected]


Agenda
Agenda

  • Introduction

  • Large-Scale Reconfigurable Data-Path(LSRDP) General Architecture and Specifications

  • Design Procedure and Tool Chain

  • Preliminary Results

  • Conclusions and Future Work


Introduction
Introduction

Roadrunner with PowerXcell

TSUBAME

NVIDIA Tesla S1070

http://it.nikkei.co.jp/

http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html

  • Parallel computer clusters with General-Purpose Processors (GPP)are often used for HPC

  • Various accelerators are used with GPPs for further performance improvement

    • PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc.

    • Small size and low power consumption comparing to processors with similar performance

http://www.top500.org/system/9485


Single flux quantum large scale reconfigurable data path sfq lsrdp
Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)

  • Large-Scale Reconfigurable Data-Path (LSRDP):

    • is introduced as an alternative accelerator

    • reduces the no. of memory accesses

    • is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits

    • is suitable for high performance scientific computations

A large memory bandwidth is demanded in conventional accelerators for high-performance computation

On chip memories are often used to hide memory access latency


Outline of large scale reconfigurable data path lsrdp processor
Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor

  • Features:

    • Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped

    • Pipeline execution

    • Burst transfer is used for input /output rearranged data from/to memory

  • Reconfigurable data-path includes:

    • A large number of floating point Functional Units (FUs)

    • Reconfigurable Operand Routing Network : ORN

    • Dynamic reconfiguration facilities

    • Streaming Buffers (SB) for I/O ports

    • Implementation by SFQ circuits

LSRDP

GPP

...

FU

FU

FU

FU

ORN : Operand Routing Network

:

:

:

:

...

FU

FU

FU

FU

ORN

...

FU

FU

FU

FU

SB

SMAC

Main

Memory

Scratchpad Memory


Single flux quantum sfq against cmos
Single-Flux Quantum (SFQ) processoragainst CMOS

  • CMOS issues:

    • high electric power consumption

    • high heat radiation and difficulties in high-density packing

    • memory wall problem which limits the processing speed

  • SFQ Features:

    • High-speed switching and signal transmission

    • Low power consumption

    • Compact implementation of a system (small area)

    • No cost for latch

    • Suitable for pipeline processing of data stream

    • Serial bit-level processing


CREST-JST (2006~): Low-power, processorhigh-performance, reconfigurable processor using single-flux quantum circuits

Superconducting

Research Lab. (SRL)

SFQ process

Yokohama National Univ.

SFQ-FPU chip, cell library

Nagoya Univ.

SFQ-RDP chip, cell library,

and wiring

Nagoya Univ.

CAD for logic design

and arithmetic circuits

Dr. S. Nagasawa et al.

Prof. N. Yoshikawa et al.

Prof. A. Fujimaki et al.

Prof. N. Takagi (Leader)

et al.

Kyushu Univ.

Architecture, Compiler

and Applications

Prof. K. Murakami

Dr. K. Inoue

Dr. H. Honda

Dr. F. Mehdipour

H. Kataoka

SFQ-LSRDP


Goals of the project
Goals of the Project processor

Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits

Discovering appropriate applications

Developing compiler tools

Developing performance analyzing tools



Parameters should be decided within the lsrdp design procedure
Parameters Should Be Decided processorWithin the LSRDP Design Procedure

  • Core structure a matrix of PEs

  • PE: combination of a Functional Unit (FU)

  • and a data Transfer Unit (TU)

Width and Height ?

Maximum Connection Length (MCL)

between consecutive rows?

Layout: FU types

(ADD/SUB and MUL)?

  • Reconfiguration mechanism?

  • (PE, ORN, Immediate data)

  • On-chip memory configuration?


Lsrdp architecture
LSRDP Architecture processor

  • Processing Elements

    • FU

      • implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL

    • TU (transfer unit) as a routing resource for transferring datafrom a row to an inconsecutive row

FU

FU

FU

TU

FU

TU

FU

TU

TU

TU

PE including Two components

Four functionalities


Layout types type i

W processor

ORN

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ADD/SUM

TU

ORN

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

MUL

.

.

.

ORN

Layout Types- Type I

Each PE implements ADD/SUB and MUL

H

M

: MUL

A

: ADD/SUB

T

: Transfer Unit

Flexible but consume a lot of resources


Layout types type ii checkered

W processor

A

A

A

M

M

M

A

A

A

M

A

M

A

A

A

M

M

A

A

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type II (Checkered)

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H


Layout types type iii striped

W processor

M

M

M

M

M

A

A

A

A

M

M

A

A

A

A

A

M

M

M

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type III (Striped)

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Type II or III, which one is more efficient?


Maximum connection length mcl
Maximum Connection Length processor (MCL)

MCL:maximum horizontal distance between two PEs located in two consecutive rows


An orn structure
An ORN Structure processor

ORN

2bit shiftregister

ORN is consisted of2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.


Dynamic reconfiguration mechanism
Dynamic processorReconfiguration Mechanism

  • Three bit-stream lines for dynamic reconfiguration of:

    • Immediate registers (64bit) in each PE

    • Selector bits for muxes selecting the input data of FUs

    • Cross-bar switches in ORNs


Design procedure and tool chain

Design Procedure and processorTool Chain


Compiler and design flow
Compiler and Design Flow processor

  • DFGs are manually generated from critical parts of applications

  • DFG mapping results are used for

    • Analyzing LSRDP architecture statistics

    • Generating LSRDP configuration bit-streams


Lsrdp design procedure
LSRDP Design Procedure processor

DFGs & LSRDP HW constraints

For each

parameter

Appropriate value for each parameter


Benchmark applications for design procedures
Benchmark Applications processorfor Design Procedures

  • Finite differential method calculation of2nd order partial differential equations

    • 1dim-Heat equation(Heat)

    • 1dim-Vibration equation (Vibration)

    • 2dim-Poisson equation (Poisson)

  • Quantum chemistry application

    • Recursive parts of Electron Repulsion Integral calculation(ERI-Rec)

Only ADD/SUB and MUL operations are usedin the critical calculations of all above applications


Dfg extraction heat equation
DFG Extraction- Heat Equation processor

  • 1-dim. heat equation for T(x,t)

  • Calculation by Finite DifferenceMethod (FDM)

(A is const.)

Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

Basic DFG corresponding to Minimum FDM calculation


Example of extracted dfgs heat
Example of extracted DFGs- Heat processor

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)


Dfg classification
DFG Classification processor

Totally,

24 DFGs are prepared

for benchmark DFG

Due to broad range of DFG sizes

DFGs are classified as S, M, L, XL with respect to their size

and the number of Input/Output nodes


Mapping dfgs onto lsrdp
Mapping DFGs onto LSRDP processor

Longest connections



Lsrdp specifications width height
LSRDP Specifications: Width & Height processor

LSRDP Dimensions and the number of Input/Output Ports


Lsrdp specifications mcl
LSRDP Specifications: MCL processor

Needs further MCL optimization


Analyzing various lsrdp layouts

Layout I processor

Layout II

Analyzing Various LSRDP Layouts

(Except ERI1 DFG which gives better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.




Preliminary performance evaluation
Preliminary Performance Evaluation processor

Base processor configuration

GPP+LSRDP configuration

GPP: Exec. time measurement by means of a processor simulator

LSRDP: Estimation by performance modeling


Preliminary performance evaluation heat
Preliminary Performance Evaluation processor(Heat)

Basic: SB only

Reuse: SB + SPM

Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory.


Preliminary performance evaluation poisson
Preliminary Performance Evaluation (Poisson) processor

A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well as the execution time on GPP


Conclusions future work
Conclusions & Future Work processor

  • A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced.

  • 24 benchmark Data Flow Graphs (DFGs) were manually generated.

  • LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach.

  • LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances.

  • Future Work:

    • To achieve higher performance it is required to reduce various overhead costs mainly related to data management part.

    • To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size.


Acknowledgement
Acknowledgement processor

This research was supportedin part by Core Research for Evolutional Scienceand Technology (CREST) of Japan Scienceand Technology Corporation (JST).


Thanks

Thanks! processor

Any Questions?


Backup slides
Backup Slides processor

  • Backup Slides


Single Flux Quantum processor

Superconductivity

loop

Ib

L

Josephson junction

Ic

Φ0

SFQ (Single Flux Quantum) CircuitHigh speed, Low power consumption, and Operating by a different principle from the CMOS

Tunneling effect

2mV

2ps


Mapping results
Mapping Results processor

For each class, a lot of extra TUs

are needed to map all DFGs

FU

T

PE types

FU

T

T

T


Connection length minimization results
Connection Length Minimization- Results processor

Final optimized Maximum Connection Length (MCL) results

ORNs should provide the connection length of 9in LSRDP-S/M (MCL= 9).

For LSRDP-L, MCL = 19 !!!

⇒ Serious Implementation Cost

Possible to decrease?


Distributions of connection lengths
Distributions of Connection Lengths processor

Connection

length

93% of connection lengths are 0 ~ 2

Only small fractions of connections results in larger ORNs


Analyzing various lsrdp layouts1
Analyzing Various LSRDP Layouts processor

  • Almost a similar small size values are achievedfor Layout I and IIfor the majority of DFGs (except ERI1 DFG which gives better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost as well


Why only eri1 dfg is suitable to layout iii
Why only ERI1 DFG is suitable to Layout III ? processor

Heat

Layout II

ERI 1

Layout III


Fu layout for div sqrt exp operations
FU Layout for DIV, SQRT, EXP operations processor

16Bits Floating point DIV, SQRT, and EXP Functional unit have been already developed by SFQ current technology.

...

FU

FU

FU

FU

DIV

ORN : Operand Routing Network

...

FU

FU

FU

FU

ORN

Where ?

...

FU

FU

FU

FU

:

:

:

:

Three times

larger latency

...

FU

FU

FU

FU

ORN

...

FU

FU

FU

FU

Pipeline execution based

on ADD and MUL latency

Where should we place different latency FU ?

Heterogeneous configuration of FU array ?


Estimated performance improvement of 2 dim poisson equation by lsrdp calc
Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.

Normalized exec. time

by GPP(3GHz) calc.

Main Mem. bandwidth [GByte/sec]



Recursive parts of electron repulsion integral formula eri rec
Recursive Parts of Electron Repulsion LSRDPIntegral Formula (ERI-Rec)

DFG sizes have already determinedfrom original recursive formula


What types of software algorithms are suitable for lsrdp

memory access LSRDP

X

Large amount

of calculations

small size

of input

small size

of output

What types of software/algorithms are suitable for LSRDP ?

  • When same calculations have to be calculated repeatedly.

    • LSRDP is used for high throughput accelerator.

  • Input/Output data size is small compared with the amount of the operations.

LSRDP


Exploration of suitable applications for lsrdp
Exploration of suitable applications LSRDPfor LSRDP

  • Application

    • matrix elements calculation

      • Molecular integral calculations in molecular orbital method

    • Monte Carlo type simulation

    • etc…

  • Numerical calculation library

    • special function (promising?)

    • differential equation

    • numerical integration

    • matrix operation (difficult ??)

    • Triangular matrix simultaneous equation

    • etc…

Investigating applicability against various applications


Recursive parts of electron repulsion integral formula in molecular orbital calc
Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.

~Up to (pp,pp) Recursive Calculation~

(ss,ss)(m)and all coefficients

are given as input

# of Inputs: Max. 28

# of Outputs:1 ~ 81

(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)

Each DFG has only ADD (SUB) and MUL FUs.

DFG sizes are determined by original calculation algorithm


Dfg distribution for each application
DFG Distribution for each application Molecular Orbital Calc.

ERI-Rec

(8 DFGs)

Vibration (7)

# of FUs

Heat (6)

Poisson (3)

# of Inputs

DFGs have different qualities in terms of the

# of FUs, # of Inputs and Outputs


Example of mcl heat
Example of MCL (Heat) Molecular Orbital Calc.

MCL

Heat original DFG

(I/O: 8/4, FUs: 32)

Mapping result


Example of extracted dfgs eri rec
Example of extracted DFGs (ERI-Rec) Molecular Orbital Calc.

  • Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0


Poisson equation
Poisson Equation Molecular Orbital Calc.

2D – Poisson Eq.

Successive Over Relaxation method

ω is const.

Red/Black Gauss Seidel

In order to obtain u(n+1) (xi,yj) in the next iteration,

current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n) (xi,yj±1) are needed

55


Example of extracted dfgs poisson
Example of extracted DFGs (Poisson) Molecular Orbital Calc.

  • Maximum Poisson DFG

Inputs: 32

Outputs: 1

FUs: 721 Immediates: 364


Performance evaluation simulation environment
Performance Evaluation: Molecular Orbital Calc.Simulation Environment

  • Variable parameters:

  • Freq. of GPP and LSRDP

  • Bandwidth between main memory and LSRDP

  • Latency of reconfiguration time

  • # of FPUs in LSRDP

  • Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)

LSRDP

GPP

Main

Memory

Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

GPP: Exec. time measurement by processor simulator

LSRDP: Estimation by performance modeling

57


Estimated performance improvement of 1 dim heat equation by lsrdp calc
Estimated performance improvement Molecular Orbital Calc.of 1-dim heat equation by LSRDP calc.

Main Mem. bandwidth [GByte/sec]


Estimated performance improvement of 1 dim heat equation by lsrdp calc1
Estimated performance improvement of 1-dim heat equation by LSRDP calc.

Normalized exec. time

by GPP(3GHz) calc.

Main Mem. bandwidth [GByte/sec]


Poisson red black dfg
Poisson Red/Black LSRDP calc. 法におけるDFGの拡大による繰り返し回数の増加

4+1ノードの入力

中心1ノードの出力

SOR式

1回の計算

SOR式

2回の繰り返し

9+4ノードの入力

中心1ノードの出力

  • DFGの拡大により1度に計算可能な繰り返し回数が増加

これに伴い必要な入力数も増加

60


Implementation of heat calculation to lsrdp
Implementation of Heat calculation to LSRDP LSRDP calc.

Original GPP code

LSRDP code

LSRDP Reconfiguration

Loop j’Input Data Rearrangement

Loop N

LSRDP pipeline exec.

(FDM DFG calc.)

End Loop

Output Data Rearrangement

End Loop

Loop j

Loop i

T(xi,tj)

End Loop

End Loop

61


Implementation of poisson calculation to lsrdp
Implementation of Poisson calculation to LSRDP LSRDP calc.

Original GPP code

LSRDP code

Loop Iter

Loop i

loop j

u(xi,yj)

End Loop

End Loop

End Loop

LSRDP Reconfiguration

Loop Iter’

Input Data rearrangement

Loop N

LSRDP pipeline exec. (FDM DFG calc.)

End Loop

Output Data rearrangement

End Loop

62


Implementation of eri rec calculation to lsrdp
Implementation of ERI-Rec calculation to LSRDP LSRDP calc.

LSRDP code

original GPP code

Loop I,J,K,L

Loop contraction

Initial Integral Calc.

Recursive Calc.

End Loop

Partial Fock Calc.

End Loop

Loop I,J,K,LLSRDP ReconfigurationLoop contraction

Initial Integral Calc.

End Loop

Input Data rearrangement

Loop N

LSRDP pipeline calc.

(Recursive DFG calc.)End Loop

Output Data rearrangement

Partial Fock Calc.

End Loop

Initial Integral Calc.:

1/Sqrt, Exp, Fm(T) are utilized => GPP calculation

Recursive Calc.:

only ADD/SUB, MUL

=> LSRDP calculation

63


Vertical vs horizontal dfg decomposition
Vertical vs. Horizontal LSRDP calc. DFG Decomposition

Original

Horizontal Decomp.

Loop N

ReconfigurationLoop M

LSRDP pipeline calc.

End Loop

End Loop

Loop N

ReconfigurationLoop M

1stLSRDP pipeline calc.

End Loop

End Loop

Loop N

ReconfigurationLoop M

2ndLSRDP pipeline calc.

End Loop

End Loop

Vertical Decomp.

Loop n ( > N)

ReconfigurationLoop M

LSRDP pipeline calc.

End Loop

End Loop

64


Example of extracted dfgs
Example of extracted DFGs LSRDP calc.

  • Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0


Example of extracted dfgs heat1
Example of extracted DFGs- Heat LSRDP calc.

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)


Performance evaluation simulation environment1
Performance Evaluation: LSRDP calc. Simulation Environment

  • Variable parameters:

  • Freq. of GPP and LSRDP

  • Bandwidth between main memory and LSRDP

  • Latency of reconfiguration time

  • # of FPUs in LSRDP

  • Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)

LSRDP

GPP

Main

Memory

Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

GPP: Exec. time measurement by processor simulator

LSRDP: Estimation by performance modeling

67


Performance evaluation execution time modeling
Performance Evaluation: LSRDP calc. Execution Time Modeling

Sort data + Reconfig.

+ Send signal for comm.

Execution time

Calculation time

Total pipeline depth

in the given program

# of rows of LSRDP

(latency of LSRDP)

+

Stall time

Stall from

Bandwidthreq > Bandwidthmem

Latency of LSRDP <->Mem

For first Input and last output

+

68


Layout types type i1

W LSRDP calc.

ORN

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ADD/SUM

TU

ORN

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

MUL

.

.

.

ORN

Layout Types- Type I

Each PE implements ADD/SUB and MUL

H

M

: MUL

A

: ADD/SUB

T

: Transfer Unit

Total No. of PEs= W * H

Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs)


Layout types type ii

W LSRDP calc.

A

A

A

M

M

M

A

A

A

M

A

M

A

A

A

M

M

A

A

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type II

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+

½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)


Layout types type iii

W LSRDP calc.

M

M

M

M

M

A

A

A

A

M

M

A

A

A

A

M

M

M

M

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type III

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+

½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)


Cb various functionalities
CB Various Functionalities LSRDP calc.

CB

½CB

10

11

00

01

10

11

00

01

  • CB

    • two inputs/outputs

    • four cases are possible

    • reconfigurable

  • 1/2CB

    • one input/ two outputs

    • four cases are possible

    • reconfigurable


An orn structure1

T2 LSRDP calc.

T2

½CB

FPU

FPU

CB

CB

CB

CB

T

½CB

T

T2

CB

CB

CB

½CB

CB

CB

FPU

FPU

CB

CB

T

T2

T

½CB

CB

CB

CB

½CB

CB

CB

FPU

FPU

CB

CB

T

½CB

T

T2

CB

CB

CB

½CB

FPU

FPU

CB

CB

CB

CB

½CB

T

T

T2

CB

CB

CB

½CB

FPU

CB

CB

FPU

CB

CB

½CB

T

T

T2

CB

T2

T2

An ORN Structure

  • “+”:

    • scalable

    • pipelined

    • easily re-designed for any number of N and M

  • “–”:

    • large number of Josephson junctions

    • M number of ½ CB and (2×M+1)×MCL number of CB

The number of FPUs is M, the number of Transfer Units (T) is also M;

MCL is a maximum connection length if we consider FPUs only

=>

  • ½ CB – 2×M

  • T2 – (M+4×MCL+2)

  • CB – (2×MCL+1) ×(4×M-1)

CB: 351 JJs

Reduction of the number of Josephson junctions is essential!

½ CB: 216 JJs

* T2 is a 2-bit shift register

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.


Reconfiguration mechanism
Reconfiguration Mechanism LSRDP calc.

PE & ORN reconfiguration structure

State Diagram

Reconfiguration in LSRDP



10tflops sfq rdp computer
10TFLOPS SFQ-RDP computer LSRDP calc.

4.2 K

SFQ 0.5um process

CMOS

CPU

(1chip)

ORN

2TB memory module

(FB-DIMM

[[email protected], 128GB]

×16 modules)

...

FPU

SFQ RDP

(32FPU×32chips)

(4GFLOPS/FPU)

ORN

:

:

:

:

...

ORN

SFQ Streaming Buffer

(64Kb×2chips)

...

ORN

SB

[email protected]

(34chips)×4MCM

:

:

:

...

:

SMAC

SMAC

SMAC

Memory band width per MCM:256GB/s

(=16GB/s ×16 channels)


Power consumption comparison for 10tflops computers
Power Consumption Comparison LSRDP calc. for 10TFLOPS computers



ad