Architecture level synthesis for automatic interconnect pipelining



Architecture-Level Synthesis for Automatic Interconnect Pipelining

Jason Cong, Yiping Fan, Zhiru Zhang

VLSI CAD Lab

Computer Science Department

University of California, Los Angeles

{cong, fanyp, zhiruz}@cs.ucla.edu

Funded by GSRC, NSF, and Altera Corp.


Outline

  • Motivation

  • Our contributions

    • RDR-Pipe micro-architecture

      • Regular Distributed Register micro-architecture with interconnect pipelining

    • Synthesis flow and algorithms

      • MCAS-Pipe: automatic interconnect pipelining and sharing

  • Experimental results

  • Conclusions


Interconnect Bottleneck in Nanometer Designs

  • Challenge: single-cycle full-chip communication will no longer be possible

  • Not supported by the current CAD toolset

  • ITRS’01 0.07 µm technology

  • 5.63 GHz across-chip clock

  • 800 mm² die (28.3 mm x 28.3 mm)

  • IPEM BIWS estimations

    • Buffer size: 100x

    • Driver/receiver size: 100x

  • Semi-global layer (Tier 3)

    • Can travel up to 11.4 mm in one cycle

    • Needs 5 clock cycles from corner to corner

[Figure: map of the 28.3 mm x 28.3 mm die showing regions reachable in 1, 2, 3, 4, and 5 clock cycles; axis ticks at 0, 11.4, 22.8, and 28.3 mm]
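The corner-to-corner figure can be checked with a few lines of arithmetic. The sketch below assumes Manhattan (horizontal plus vertical) routing, which the slide does not state explicitly; the function name `cycles_to_cross` is hypothetical.

```python
import math

def cycles_to_cross(distance_mm, reach_per_cycle_mm):
    """Clock cycles needed for a signal to travel distance_mm."""
    return math.ceil(distance_mm / reach_per_cycle_mm)

DIE_EDGE_MM = 28.3   # die edge length from the slide
REACH_MM = 11.4      # distance covered in one cycle on the Tier 3 layer

# Assumption: the wire follows a Manhattan route, so a corner-to-corner
# path is two full die edges long (56.6 mm).
corner_to_corner_mm = 2 * DIE_EDGE_MM

print(cycles_to_cross(corner_to_corner_mm, REACH_MM))  # 5
```

Under this assumption 56.6 mm / 11.4 mm is about 4.96, which rounds up to the 5 cycles quoted on the slide.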


Related Work

  • Retiming with placement or floorplanning

    • Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]

    • Retiming + floorplanning [Chong & Brayton, IWLS’01]

    • Retiming + placement for FPGAs [Singh & Brown, FPGA’02]

  • Global wire pipelining in the Itanium™ processor

    • [McInerney et al. ISPD’00]

  • Buffer and flip-flop insertion in RTL

    • [Lu et al. DATE’02]

    • [Cocchini, ICCAD’02]



Limitation during Logic/Physical Level to Explore Multicycle Communication

  • Minimum clock period achievable by logic optimization is bounded by max. delay-to-register (DR) ratio of the loops in the circuits [Papaefthymiou, MST’94]

  • Interconnect pipelining by flip-flop insertion?

    • Requires considerable amount of manual rework on the original RTL descriptions
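As a small illustration of the delay-to-register bound, the sketch below takes each loop as a (total delay, register count) pair and returns the largest ratio. The loop values and the function name are hypothetical; a real tool would extract the loops from the netlist.

```python
def min_clock_period_bound(loops):
    """Lower bound on the clock period: max delay-to-register (DR) ratio.

    loops: list of (total_delay_ns, register_count) pairs, one per loop.
    """
    return max(delay / regs for delay, regs in loops)

# Hypothetical circuit with two loops.
loops = [
    (12.0, 3),  # DR ratio 4.0 ns
    (10.0, 2),  # DR ratio 5.0 ns, this loop sets the bound
]

print(min_clock_period_bound(loops))  # 5.0
```

No amount of register relocation within the loops can bring the clock period below this value, because the number of registers on each loop stays fixed.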


Our Approach

  • Consideration of multicycle communication during architectural (or behavioral) synthesis

    • [Cong et al, ISPD’03] [Cong et al. ICCAD’03]

    • Regular Distributed Register (RDR) micro-architecture

      • Highly regular

      • Direct support of multicycle on-chip communication

    • MCAS: Architectural Synthesis for Multi-cycle Communication

      • Efficiently maps the behavioral descriptions to RDR uArch

      • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning

  • This work

    • Extension of RDR and MCAS for interconnect pipelining



Regular Distributed Register Micro-Architecture

[Figure: chip divided into a grid of islands of width Wi and height Hi; each island holds a local computational cluster (LCC) with units such as MUL, ALU, and MUX, a register file, and an FSM; register banks are labeled 1 cycle, 2 cycles, …, k cycles and connect to the global interconnect]

  • Distribute registers to each “island”

  • Choose the island size such that local computation and communication in each island can be done in a single cycle

  • Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication


Wiring Overhead in RDR Designs

[Figure: schedule over cycles 1 to 5 and the resulting datapath, with ALU1 (+), MUL1 (*), sender and receiver registers r1 to r4, and interconnects with a delay of 2 cycles]

  • Data transfers r1 → r3 and r2 → r4 are overlapped

  • Two dedicated global wires are needed


Architectural Solution: RDR-Pipe

[Figure: grid of islands, each with an LCC, an FSM, and a register file, separated by horizontal (H) and vertical (V) channels, with pipeline register stations (PRS) placed between islands]

  • Keep the intra-island structures

  • Inter-island pipeline register station (PRS) for global communications

  • PRS performs autonomous store-and-forward

    • Synchronous design

    • No global control signal needed for PRS
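The store-and-forward behavior can be modeled as a plain shift register that advances on every clock edge. The sketch below is a behavioral illustration only; the class name `PipelinedLink` and its interface are hypothetical and do not come from the slides.

```python
class PipelinedLink:
    """Behavioral model of a global wire with pipeline registers.

    Each pipeline register stores its input on every clock edge and
    forwards it on the next one. There is no enable or control input,
    which mirrors the "no global control signal" property.
    """

    def __init__(self, stages):
        self.regs = [None] * stages  # contents of the pipeline registers

    def clock(self, data_in=None):
        """Advance one cycle; return the value leaving the last register."""
        data_out = self.regs[-1]
        self.regs = [data_in] + self.regs[:-1]
        return data_out


# One pipeline register between sender and receiver.
link = PipelinedLink(stages=1)
print(link.clock("a"))  # None: "a" has just entered the pipeline register
print(link.clock("b"))  # a: "a" leaves while "b" enters
print(link.clock())     # b
```

Because a new value can enter the link every cycle, two back-to-back transfers share the same wire without colliding.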


Reducing Wiring Overhead in RDR-Pipe

[Figure: the schedule over cycles 1 to 5 with 2-cycle communication between ALU1 (+) and MUL1 (*); legend: sender register, receiver register, pipeline register]

  • Data transfers are pipelined

  • One wire with a pipeline register is enough
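To see why pipelining reduces the wire count, one can count how many transfers occupy a wire in the same cycle. The sketch below is a simplified model under stated assumptions: issue cycles are fixed, an unpipelined wire is busy for the whole transfer delay, and a pipelined wire is busy only in the cycle a transfer is issued. The function names and the issue cycles are hypothetical.

```python
from collections import Counter

def wires_without_pipelining(issue_cycles, delay):
    """Wires needed when each transfer holds its wire for `delay` cycles."""
    busy = Counter()
    for start in issue_cycles:
        for cycle in range(start, start + delay):
            busy[cycle] += 1
    return max(busy.values(), default=0)

def wires_with_pipelining(issue_cycles):
    """Wires needed when a wire accepts one new transfer per cycle."""
    return max(Counter(issue_cycles).values(), default=0)

# Two transfers with a 2-cycle delay, issued in consecutive cycles.
issue_cycles = [2, 3]
print(wires_without_pipelining(issue_cycles, delay=2))  # 2
print(wires_with_pipelining(issue_cycles))              # 1
```

The two transfers overlap in cycle 3 on unpipelined wires, so they need two wires; with a pipeline register they enter the same wire one cycle apart.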


Synthesis Flow: MCAS-Pipe System

  • Global interconnect sharing

    • After scheduling and functional unit binding

    • Before register and port binding

    • Enables multiple data communications to share a physical link (a wire with pipeline registers)

  • Advantages over MCAS

    • Expect to reduce global wiring demand

    • No multicycle path constraint needed

  • MCAS-Pipe flow (from the flow chart)

    • Input: C / VHDL

    • CDFG generation (produces the CDFG)

    • Resource allocation & functional unit binding (produces the ICG)

    • Scheduling-driven placement (produces locations)

    • Placement-driven rescheduling & rebinding

    • Global interconnect sharing

    • Register and port binding

    • Datapath & FSM generation

    • Output: RTL VHDL & floorplan constraints


Global Interconnect Sharing

[Figure: islands A and B at distance D = 2, with producers pe and pg and consumers ce and cg; two schedules over cycles 1 to 7, labeled "Conflicted data transfers" and "Compatible data transfers"; legend: sender register, receiver register, pipeline register]

  • Conflicted data transfers

    • Two physical links are needed to support the concurrent data transfers

  • Compatible data transfers

    • Now, two producer registers can be merged, since their lifetimes become compatible

    • Only one physical link is required to support the scheduled data transfers


Global Pipelined Interconnect Minimization

  • Definitions

    • Data links: pipelined global interconnects

    • Channel: set of data links between two islands

      • Width of a channel: number of its data links

    • Data transfer: movement of data from a producer to a consumer

  • Architectural assumption

    • Channels cannot share interconnects

  • Theorem

    • Global pipelined interconnects are minimized if and only if the width of every channel is minimized


Transfer Scheduling for a Single Channel

  • A decision problem formulation

    • Given:

      • A channel (A, B) containing m data links

      • A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time

    • Fact: for every time slot, at most one transfer can be issued on a data link

    • Objective: to find a feasible transfer schedule on these data links

  • Transfer scheduling is solvable in polynomial time

    • A special real-time scheduling problem [J. Blazewicz, 1979]

      • Binary search for minimum feasible channel width m

      • For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)

      • Overall time complexity: O(n log² n)
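The formulation above can be sketched as a standard EDF scheduler for unit-time jobs, wrapped in a binary search over the channel width. This is a minimal illustration, not the MCAS-Pipe implementation: the function names, the tuple layout, and the example values are all hypothetical, and arrival times and deadlines are treated as inclusive integer time slots.

```python
import heapq

def transfer_window(t_producer, t_consumer, distance):
    """(arrival, deadline) of a transfer: T(pe)+1 and T(ce)-D(A, B)."""
    return (t_producer + 1, t_consumer - distance)

def edf_schedule(transfers, num_links):
    """Schedule unit-time transfers on num_links data links with EDF.

    transfers: list of (arrival, deadline) slots, both inclusive.
    Returns a list of (slot, link, arrival, deadline) or None if infeasible.
    """
    pending = sorted(transfers)   # ordered by arrival time
    ready = []                    # min-heap keyed on deadline
    schedule = []
    i, slot = 0, 0
    while i < len(pending) or ready:
        if not ready:
            slot = max(slot, pending[i][0])  # skip idle slots
        while i < len(pending) and pending[i][0] <= slot:
            arrival, deadline = pending[i]
            heapq.heappush(ready, (deadline, arrival))
            i += 1
        # At most one transfer is issued per link in each time slot.
        for link in range(num_links):
            if not ready:
                break
            deadline, arrival = heapq.heappop(ready)
            if deadline < slot:
                return None       # this transfer missed its deadline
            schedule.append((slot, link, arrival, deadline))
        slot += 1
    return schedule

def min_channel_width(transfers):
    """Smallest number of links with a feasible schedule, or None."""
    if not transfers:
        return 0
    lo, hi = 1, len(transfers)
    if edf_schedule(transfers, hi) is None:
        return None               # some window is empty
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_schedule(transfers, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical channel with distance D = 2 and three transfers.
windows = [
    transfer_window(1, 4, 2),  # (2, 2)
    transfer_window(1, 4, 2),  # (2, 2)
    transfer_window(2, 6, 2),  # (3, 4)
]
print(min_channel_width(windows))  # 2
```

The first two transfers can only be issued in slot 2, so one link is not enough, while two links fit all three. The binary search relies on feasibility being monotone in the width: adding a link never breaks a feasible schedule.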


EDF-Based Transfer Scheduling Example

[Figure: six data transfers, numbered 1 to 6, drawn as intervals over time slots and assigned to Data Link 1 and Data Link 2 under two orderings]

  • Ordered by left edge

    • Failed for 2 data links!

  • Ordered by Earliest-Deadline-First

    • Successfully scheduled onto 2 data links



Experiment Settings

  • Common front end: C / VHDL, CDFG generation, functional unit allocation & binding

  • Additional inputs: uArch. spec., target clock period

  • Conventional flow: conventional scheduling

  • MCAS flow: scheduling-driven placement, then placement-driven rebinding & rescheduling

  • MCAS-Pipe flow: the MCAS steps followed by global interconnect sharing

  • Common back end: register and port binding, datapath & control generation

  • Outputs

    • RTL VHDL files (for all flows)

    • Floorplan constraints (for MCAS and MCAS-Pipe)

    • Multicycle path constraints (for MCAS only)

  • Implementation: Altera QuartusII + Stratix


Experimental Results: Register and LE Usage

  • Design environment: Altera QuartusII, Stratix EP1S40

  • MCAS vs. Conventional flow:

    • Uses more registers and logic elements (LE)

  • MCAS-Pipe vs. MCAS:

    • Slightly more registers, and comparable logic element cost


Experimental Results: Performance

  • Design environment: Altera QuartusII, Stratix EP1S40

  • MCAS vs. Conventional flow:

    • 36% reduction in clock period and 30% in total latency

  • MCAS-Pipe vs. MCAS:

    • Comparable design performance (4% better)

[Charts: total latency and clock period]


Interconnect Structure of Altera’s Stratix

[Figure: Stratix routing resources: local wires LL and LO; horizontal wires H4, H8, and H24 (global); vertical wires V4, V8, and V16 (global)]


Experimental Results: Wirelength

  • Wire types

    • LL, LO: local wires; H4, V4, H8, V8: short global wires

    • V16, H24: long global wires

  • MCAS-Pipe vs. MCAS:

    • 28.8% reduction in long global wires, 19.3% reduction in total wirelength


Conclusions

  • High-level automatic on-chip interconnect pipelining

    • RDR-Pipe: extension of RDR micro-architecture

      • Micro-architecture supporting interconnect pipelining

    • MCAS-Pipe: enhancement of MCAS synthesis system

      • Adds a novel global interconnect sharing algorithm that reduces global wiring

  • Experimental results

    • Matches or exceeds the RDR-based approach in performance

    • Reduces wiring demand: 28.8% fewer long global wires and 19.3% less total wirelength than MCAS


