An accelerator based on single flux quantum circuits for a high performance reconfigurable computer
This presentation is the property of its rightful owner.
Sponsored Links
1 / 78

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami* PowerPoint PPT Presentation


  • 55 Views
  • Uploaded on
  • Presentation posted in: General

An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer. F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami* *Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

Download Presentation

F. Mehdipour*, Hiroaki Honda** , H. Kataoka*, K. Inoue* and K. Murakami*

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


An accelerator based on single flux quantum circuits for a high performance reconfigurable computer

An Accelerator Based on Single-Flux Quantum Circuits for a High-PerformanceReconfigurable Computer

F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

and K. Murakami*

*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: [email protected], [email protected]


Agenda

Agenda

  • Introduction

  • Large-Scale Reconfigurable Data-Path(LSRDP) General Architecture and Specifications

  • Design Procedure and Tool Chain

  • Preliminary Results

  • Conclusions and Future Work


Introduction

Introduction

Roadrunner with PowerXcell

TSUBAME

NVIDIA Tesla S1070

http://it.nikkei.co.jp/

http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html

  • Parallel computer clusters with General-Purpose Processors (GPP)are often used for HPC

  • Various accelerators are used with GPPs for further performance improvement

    • PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc.

    • Small size and low power consumption comparing to processors with similar performance

http://www.top500.org/system/9485


Single flux quantum large scale reconfigurable data path sfq lsrdp

Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)

  • Large-Scale Reconfigurable Data-Path (LSRDP):

    • is introduced as an alternative accelerator

    • reduces the no. of memory accesses

    • is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits

    • is suitable for high performance scientific computations

A large memory bandwidth is demanded in conventional accelerators for high-performance computation

On chip memories are often used to hide memory access latency


Outline of large scale reconfigurable data path lsrdp processor

Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor

  • Features:

    • Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped

    • Pipeline execution

    • Burst transfer is used for input /output rearranged data from/to memory

  • Reconfigurable data-path includes:

    • A large number of floating point Functional Units (FUs)

    • Reconfigurable Operand Routing Network : ORN

    • Dynamic reconfiguration facilities

    • Streaming Buffers (SB) for I/O ports

    • Implementation by SFQ circuits

LSRDP

GPP

...

FU

FU

FU

FU

ORN : Operand Routing Network

:

:

:

:

...

FU

FU

FU

FU

ORN

...

FU

FU

FU

FU

SB

SMAC

Main

Memory

Scratchpad Memory


Single flux quantum sfq against cmos

Single-Flux Quantum (SFQ)against CMOS

  • CMOS issues:

    • high electric power consumption

    • high heat radiation and difficulties in high-density packing

    • memory wall problem which limits the processing speed

  • SFQ Features:

    • High-speed switching and signal transmission

    • Low power consumption

    • Compact implementation of a system (small area)

    • No cost for latch

    • Suitable for pipeline processing of data stream

    • Serial bit-level processing


F mehdipour hiroaki honda h kataoka k inoue and k murakami

CREST-JST (2006~): Low-power,high-performance, reconfigurable processor using single-flux quantum circuits

Superconducting

Research Lab. (SRL)

SFQ process

Yokohama National Univ.

SFQ-FPU chip, cell library

Nagoya Univ.

SFQ-RDP chip, cell library,

and wiring

Nagoya Univ.

CAD for logic design

and arithmetic circuits

Dr. S. Nagasawa et al.

Prof. N. Yoshikawa et al.

Prof. A. Fujimaki et al.

Prof. N. Takagi (Leader)

et al.

Kyushu Univ.

Architecture, Compiler

and Applications

Prof. K. Murakami

Dr. K. Inoue

Dr. H. Honda

Dr. F. Mehdipour

H. Kataoka

SFQ-LSRDP


Goals of the project

Goals of the Project

Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits

Discovering appropriate applications

Developing compiler tools

Developing performance analyzing tools


Lsrdp general architecture and specifications

LSRDP General Architecture and Specifications


Parameters should be decided within the lsrdp design procedure

Parameters Should Be DecidedWithin the LSRDP Design Procedure

  • Core structure a matrix of PEs

  • PE: combination of a Functional Unit (FU)

  • and a data Transfer Unit (TU)

Width and Height ?

Maximum Connection Length (MCL)

between consecutive rows?

Layout: FU types

(ADD/SUB and MUL)?

  • Reconfiguration mechanism?

  • (PE, ORN, Immediate data)

  • On-chip memory configuration?


Lsrdp architecture

LSRDP Architecture

  • Processing Elements

    • FU

      • implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL

    • TU (transfer unit) as a routing resource for transferring datafrom a row to an inconsecutive row

FU

FU

FU

TU

FU

TU

FU

TU

TU

TU

PE including Two components

Four functionalities


Layout types type i

W

ORN

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ADD/SUM

TU

ORN

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

MUL

.

.

.

ORN

Layout Types- Type I

Each PE implements ADD/SUB and MUL

H

M

: MUL

A

: ADD/SUB

T

: Transfer Unit

Flexible but consume a lot of resources


Layout types type ii checkered

W

A

A

A

M

M

M

A

A

A

M

A

M

A

A

A

M

M

A

A

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type II (Checkered)

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H


Layout types type iii striped

W

M

M

M

M

M

A

A

A

A

M

M

A

A

A

A

A

M

M

M

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type III (Striped)

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Type II or III, which one is more efficient?


Maximum connection length mcl

Maximum Connection Length (MCL)

MCL:maximum horizontal distance between two PEs located in two consecutive rows


An orn structure

An ORN Structure

ORN

2bit shiftregister

ORN is consisted of2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.


Dynamic reconfiguration mechanism

Dynamic Reconfiguration Mechanism

  • Three bit-stream lines for dynamic reconfiguration of:

    • Immediate registers (64bit) in each PE

    • Selector bits for muxes selecting the input data of FUs

    • Cross-bar switches in ORNs


Design procedure and tool chain

Design Procedure and Tool Chain


Compiler and design flow

Compiler and Design Flow

  • DFGs are manually generated from critical parts of applications

  • DFG mapping results are used for

    • Analyzing LSRDP architecture statistics

    • Generating LSRDP configuration bit-streams


Lsrdp design procedure

LSRDP Design Procedure

DFGs & LSRDP HW constraints

For each

parameter

Appropriate value for each parameter


Benchmark applications for design procedures

Benchmark Applicationsfor Design Procedures

  • Finite differential method calculation of2nd order partial differential equations

    • 1dim-Heat equation(Heat)

    • 1dim-Vibration equation (Vibration)

    • 2dim-Poisson equation (Poisson)

  • Quantum chemistry application

    • Recursive parts of Electron Repulsion Integral calculation(ERI-Rec)

Only ADD/SUB and MUL operations are usedin the critical calculations of all above applications


Dfg extraction heat equation

DFG Extraction- Heat Equation

  • 1-dim. heat equation for T(x,t)

  • Calculation by Finite DifferenceMethod (FDM)

(A is const.)

Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

Basic DFG corresponding to Minimum FDM calculation


Example of extracted dfgs heat

Example of extracted DFGs- Heat

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)


Dfg classification

DFG Classification

Totally,

24 DFGs are prepared

for benchmark DFG

Due to broad range of DFG sizes

DFGs are classified as S, M, L, XL with respect to their size

and the number of Input/Output nodes


Mapping dfgs onto lsrdp

Mapping DFGs onto LSRDP

Longest connections


Preliminary results

Preliminary Results


Lsrdp specifications width height

LSRDP Specifications: Width & Height

LSRDP Dimensions and the number of Input/Output Ports


Lsrdp specifications mcl

LSRDP Specifications: MCL

Needs further MCL optimization


Analyzing various lsrdp layouts

Layout I

Layout II

Analyzing Various LSRDP Layouts

(Except ERI1 DFG which gives better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.


Lsrdp at one glance 1 2

LSRDP at One Glance (1/2)


Lsrdp at one glance 2 2

LSRDP at One Glance (2/2)


Preliminary performance evaluation

Preliminary Performance Evaluation

Base processor configuration

GPP+LSRDP configuration

GPP: Exec. time measurement by means of a processor simulator

LSRDP: Estimation by performance modeling


Preliminary performance evaluation heat

Preliminary Performance Evaluation(Heat)

Basic: SB only

Reuse: SB + SPM

Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory.


Preliminary performance evaluation poisson

Preliminary Performance Evaluation (Poisson)

A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well as the execution time on GPP


Conclusions future work

Conclusions & Future Work

  • A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced.

  • 24 benchmark Data Flow Graphs (DFGs) were manually generated.

  • LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach.

  • LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances.

  • Future Work:

    • To achieve higher performance it is required to reduce various overhead costs mainly related to data management part.

    • To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size.


Acknowledgement

Acknowledgement

This research was supportedin part by Core Research for Evolutional Scienceand Technology (CREST) of Japan Scienceand Technology Corporation (JST).


Thanks

Thanks!

Any Questions?


Backup slides

Backup Slides

  • Backup Slides


F mehdipour hiroaki honda h kataoka k inoue and k murakami

Single Flux Quantum

Superconductivity

loop

Ib

L

Josephson junction

Ic

Φ0

SFQ (Single Flux Quantum) CircuitHigh speed, Low power consumption, and Operating by a different principle from the CMOS

Tunneling effect

2mV

2ps


Mapping results

Mapping Results

For each class, a lot of extra TUs

are needed to map all DFGs

FU

T

PE types

FU

T

T

T


Connection length minimization results

Connection Length Minimization- Results

Final optimized Maximum Connection Length (MCL) results

ORNs should provide the connection length of 9in LSRDP-S/M (MCL= 9).

For LSRDP-L, MCL = 19 !!!

⇒ Serious Implementation Cost

Possible to decrease?


Distributions of connection lengths

Distributions of Connection Lengths

Connection

length

93% of connection lengths are 0 ~ 2

Only small fractions of connections results in larger ORNs


Analyzing various lsrdp layouts1

Analyzing Various LSRDP Layouts

  • Almost a similar small size values are achievedfor Layout I and IIfor the majority of DFGs (except ERI1 DFG which gives better size for Layout III)

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost as well


Why only eri1 dfg is suitable to layout iii

Why only ERI1 DFG is suitable to Layout III ?

Heat

Layout II

ERI 1

Layout III


Fu layout for div sqrt exp operations

FU Layout for DIV, SQRT, EXP operations

16Bits Floating point DIV, SQRT, and EXP Functional unit have been already developed by SFQ current technology.

...

FU

FU

FU

FU

DIV

ORN : Operand Routing Network

...

FU

FU

FU

FU

ORN

Where ?

...

FU

FU

FU

FU

:

:

:

:

Three times

larger latency

...

FU

FU

FU

FU

ORN

...

FU

FU

FU

FU

Pipeline execution based

on ADD and MUL latency

Where should we place different latency FU ?

Heterogeneous configuration of FU array ?


Estimated performance improvement of 2 dim poisson equation by lsrdp calc

Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.

Normalized exec. time

by GPP(3GHz) calc.

Main Mem. bandwidth [GByte/sec]


Estimated performance improvement of eri calculation by lsrdp

Estimated performance improvement of ERI calculation by LSRDP

(3GHz)


Recursive parts of electron repulsion integral formula eri rec

Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)

DFG sizes have already determinedfrom original recursive formula


What types of software algorithms are suitable for lsrdp

memory access

X

Large amount

of calculations

small size

of input

small size

of output

What types of software/algorithms are suitable for LSRDP ?

  • When same calculations have to be calculated repeatedly.

    • LSRDP is used for high throughput accelerator.

  • Input/Output data size is small compared with the amount of the operations.

LSRDP


Exploration of suitable applications for lsrdp

Exploration of suitable applicationsfor LSRDP

  • Application

    • matrix elements calculation

      • Molecular integral calculations in molecular orbital method

    • Monte Carlo type simulation

    • etc…

  • Numerical calculation library

    • special function (promising?)

    • differential equation

    • numerical integration

    • matrix operation (difficult ??)

    • Triangular matrix simultaneous equation

    • etc…

Investigating applicability against various applications


Recursive parts of electron repulsion integral formula in molecular orbital calc

Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.

~Up to (pp,pp) Recursive Calculation~

(ss,ss)(m)and all coefficients

are given as input

# of Inputs: Max. 28

# of Outputs:1 ~ 81

(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)

Each DFG has only ADD (SUB) and MUL FUs.

DFG sizes are determined by original calculation algorithm


Dfg distribution for each application

DFG Distribution for each application

ERI-Rec

(8 DFGs)

Vibration (7)

# of FUs

Heat (6)

Poisson (3)

# of Inputs

DFGs have different qualities in terms of the

# of FUs, # of Inputs and Outputs


Example of mcl heat

Example of MCL (Heat)

MCL

Heat original DFG

(I/O: 8/4, FUs: 32)

Mapping result


Example of extracted dfgs eri rec

Example of extracted DFGs (ERI-Rec)

  • Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0


Poisson equation

Poisson Equation

2D – Poisson Eq.

Successive Over Relaxation method

ω is const.

Red/Black Gauss Seidel

In order to obtain u(n+1) (xi,yj) in the next iteration,

current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n) (xi,yj±1) are needed

55


Example of extracted dfgs poisson

Example of extracted DFGs (Poisson)

  • Maximum Poisson DFG

Inputs: 32

Outputs: 1

FUs: 721 Immediates: 364


Performance evaluation simulation environment

Performance Evaluation:Simulation Environment

  • Variable parameters:

  • Freq. of GPP and LSRDP

  • Bandwidth between main memory and LSRDP

  • Latency of reconfiguration time

  • # of FPUs in LSRDP

  • Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)

LSRDP

GPP

Main

Memory

Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

GPP: Exec. time measurement by processor simulator

LSRDP: Estimation by performance modeling

57


Estimated performance improvement of 1 dim heat equation by lsrdp calc

Estimated performance improvementof 1-dim heat equation by LSRDP calc.

Main Mem. bandwidth [GByte/sec]


Estimated performance improvement of 1 dim heat equation by lsrdp calc1

Estimated performance improvement of 1-dim heat equation by LSRDP calc.

Normalized exec. time

by GPP(3GHz) calc.

Main Mem. bandwidth [GByte/sec]


Poisson red black dfg

Poisson Red/Black 法におけるDFGの拡大による繰り返し回数の増加

4+1ノードの入力

中心1ノードの出力

SOR式

1回の計算

SOR式

2回の繰り返し

9+4ノードの入力

中心1ノードの出力

  • DFGの拡大により1度に計算可能な繰り返し回数が増加

これに伴い必要な入力数も増加

60


Implementation of heat calculation to lsrdp

Implementation of Heat calculation to LSRDP

Original GPP code

LSRDP code

LSRDP Reconfiguration

Loop j’Input Data Rearrangement

Loop N

LSRDP pipeline exec.

(FDM DFG calc.)

End Loop

Output Data Rearrangement

End Loop

Loop j

Loop i

T(xi,tj)

End Loop

End Loop

61


Implementation of poisson calculation to lsrdp

Implementation of Poisson calculation to LSRDP

Original GPP code

LSRDP code

Loop Iter

Loop i

loop j

u(xi,yj)

End Loop

End Loop

End Loop

LSRDP Reconfiguration

Loop Iter’

Input Data rearrangement

Loop N

LSRDP pipeline exec. (FDM DFG calc.)

End Loop

Output Data rearrangement

End Loop

62


Implementation of eri rec calculation to lsrdp

Implementation of ERI-Rec calculation to LSRDP

LSRDP code

original GPP code

Loop I,J,K,L

Loop contraction

Initial Integral Calc.

Recursive Calc.

End Loop

Partial Fock Calc.

End Loop

Loop I,J,K,LLSRDP ReconfigurationLoop contraction

Initial Integral Calc.

End Loop

Input Data rearrangement

Loop N

LSRDP pipeline calc.

(Recursive DFG calc.)End Loop

Output Data rearrangement

Partial Fock Calc.

End Loop

Initial Integral Calc.:

1/Sqrt, Exp, Fm(T) are utilized => GPP calculation

Recursive Calc.:

only ADD/SUB, MUL

=> LSRDP calculation

63


Vertical vs horizontal dfg decomposition

Vertical vs. HorizontalDFG Decomposition

Original

Horizontal Decomp.

Loop N

ReconfigurationLoop M

LSRDP pipeline calc.

End Loop

End Loop

Loop N

ReconfigurationLoop M

1stLSRDP pipeline calc.

End Loop

End Loop

Loop N

ReconfigurationLoop M

2ndLSRDP pipeline calc.

End Loop

End Loop

Vertical Decomp.

Loop n ( > N)

ReconfigurationLoop M

LSRDP pipeline calc.

End Loop

End Loop

64


Example of extracted dfgs

Example of extracted DFGs

  • Maximum DFG of ERI-Rec: (pipj,pkpl)

Inputs: 28

Outputs: 81

FUs: 1004

Immediates: 0

Vertical Partitioning

Inputs: 24

Outputs: 1

FUs: 108

Immediates: 0


Example of extracted dfgs heat1

Example of extracted DFGs- Heat

Inputs: 32

Outputs: 16

Operations: 721

Immediates: 364

A huge sample DFG (Heat)


Performance evaluation simulation environment1

Performance Evaluation:Simulation Environment

  • Variable parameters:

  • Freq. of GPP and LSRDP

  • Bandwidth between main memory and LSRDP

  • Latency of reconfiguration time

  • # of FPUs in LSRDP

  • Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)

LSRDP

GPP

Main

Memory

Use streaming buffer in the LSRDP chip

I/O data is sorted in the main memory.

GPP: Exec. time measurement by processor simulator

LSRDP: Estimation by performance modeling

67


Performance evaluation execution time modeling

Performance Evaluation:Execution Time Modeling

Sort data + Reconfig.

+ Send signal for comm.

Execution time

Calculation time

Total pipeline depth

in the given program

# of rows of LSRDP

(latency of LSRDP)

+

Stall time

Stall from

Bandwidthreq > Bandwidthmem

Latency of LSRDP <->Mem

For first Input and last output

+

68


Layout types type i1

W

ORN

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

A

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ADD/SUM

TU

ORN

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

M

MUL

.

.

.

ORN

Layout Types- Type I

Each PE implements ADD/SUB and MUL

H

M

: MUL

A

: ADD/SUB

T

: Transfer Unit

Total No. of PEs= W * H

Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs)


Layout types type ii

W

A

A

A

M

M

M

A

A

A

M

A

M

A

A

A

M

M

A

A

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type II

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+

½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)


Layout types type iii

W

M

M

M

M

M

A

A

A

A

M

M

A

A

A

A

M

M

M

M

M

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

T

ORN

ADD/SUM

TU

ORN

MUL

TU

.

.

.

ORN

Layout Types- Type III

Each PE implements ADD/SUB or MUL

Each PE implements ADD/SUB or MUL

H

Total No. of PEs= W * H

Total Area= ½* W*H*[Area(MUL)+Area(TU)]+

½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)


Cb various functionalities

CB Various Functionalities

CB

½CB

10

11

00

01

10

11

00

01

  • CB

    • two inputs/outputs

    • four cases are possible

    • reconfigurable

  • 1/2CB

    • one input/ two outputs

    • four cases are possible

    • reconfigurable


An orn structure1

T2

T2

½CB

FPU

FPU

CB

CB

CB

CB

T

½CB

T

T2

CB

CB

CB

½CB

CB

CB

FPU

FPU

CB

CB

T

T2

T

½CB

CB

CB

CB

½CB

CB

CB

FPU

FPU

CB

CB

T

½CB

T

T2

CB

CB

CB

½CB

FPU

FPU

CB

CB

CB

CB

½CB

T

T

T2

CB

CB

CB

½CB

FPU

CB

CB

FPU

CB

CB

½CB

T

T

T2

CB

T2

T2

An ORN Structure

  • “+”:

    • scalable

    • pipelined

    • easily re-designed for any number of N and M

  • “–”:

    • large number of Josephson junctions

    • M number of ½ CB and (2×M+1)×MCL number of CB

The number of FPUs is M, the number of Transfer Units (T) is also M;

MCL is a maximum connection length if we consider FPUs only

=>

  • ½ CB – 2×M

  • T2 – (M+4×MCL+2)

  • CB – (2×MCL+1) ×(4×M-1)

CB: 351 JJs

Reduction of the number of Josephson junctions is essential!

½ CB: 216 JJs

* T2 is a 2-bit shift register

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.


Reconfiguration mechanism

Reconfiguration Mechanism

PE & ORN reconfiguration structure

State Diagram

Reconfiguration in LSRDP


Lsrdp design procedure1

LSRDP Design Procedure


10tflops sfq rdp computer

10TFLOPS SFQ-RDP computer

4.2 K

SFQ 0.5um process

CMOS

CPU

(1chip)

ORN

2TB memory module

(FB-DIMM

[[email protected], 128GB]

×16 modules)

...

FPU

SFQ RDP

(32FPU×32chips)

(4GFLOPS/FPU)

ORN

:

:

:

:

...

ORN

SFQ Streaming Buffer

(64Kb×2chips)

...

ORN

SB

[email protected]

(34chips)×4MCM

:

:

:

...

:

SMAC

SMAC

SMAC

Memory band width per MCM:256GB/s

(=16GB/s ×16 channels)


Power consumption comparison for 10tflops computers

Power Consumption Comparisonfor 10TFLOPS computers


Power consumption performance

Power Consumption / Performance


  • Login