Architecture-Level Synthesis for Automatic Interconnect Pipelining

Jason Cong, Yiping Fan, Zhiru Zhang

VLSI CAD Lab

Computer Science Department

University of California, Los Angeles

{cong, fanyp, zhiruz}@cs.ucla.edu

Funded by GSRC, NSF, and Altera Corp.

Outline
  • Motivation
  • Our contributions
    • RDR-Pipe micro-architecture
      • Regular Distributed Register micro-architecture with interconnect pipelining
    • Synthesis flow and algorithms
      • MCAS-Pipe: automatic interconnect pipelining and sharing
  • Experimental results
  • Conclusions
Interconnect Bottleneck in Nanometer Designs
  • Challenge: single-cycle full-chip communication will no longer be possible
  • Not supported by the current CAD toolset
  • ITRS’01 0.07um Tech
  • 5.63 GHz across-chip clock
  • 800 mm2 (28.3mm x 28.3mm)
  • IPEM BIWS estimations
    • Buffer size: 100x
    • Driver/receiver size: 100x
  • Semi-global layer (Tier 3)
    • Can travel up to 11.4mm in one cycle
    • Need 5 clock cycles from corner to corner

[Figure: 28.3mm x 28.3mm die with distance marks at 0, 11.4, 22.8 and 28.3 mm, showing the regions reachable in 1, 2, 3, 4 and 5 cycles]
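The 5-cycle figure can be checked with a short calculation. This is only a sketch: it assumes Manhattan (horizontal plus vertical) routing from one corner to the opposite corner, which the slide does not state explicitly, and the function name `cycles_needed` is made up for illustration.

```python
import math

DIE_EDGE_MM = 28.3          # from the slide: 28.3mm x 28.3mm die
REACH_PER_CYCLE_MM = 11.4   # from the slide: Tier 3 wire reach in one cycle


def cycles_needed(distance_mm, reach_mm=REACH_PER_CYCLE_MM):
    """Clock cycles needed to cover distance_mm, at least one cycle."""
    return max(1, math.ceil(distance_mm / reach_mm))


# Assumption: corner-to-corner wires are routed in Manhattan fashion.
corner_to_corner_mm = 2 * DIE_EDGE_MM
print(cycles_needed(corner_to_corner_mm))  # 56.6 / 11.4 is about 4.96, so 5
```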

Related Work
  • Retiming with placement or floorplanning
    • Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]
    • Retiming + floorplanning [Chong & Brayton, IWLS’01]
    • Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
  • Global wire pipelining in Itanium™ processor
    • [McInerney et al. ISPD’00]
  • Buffer and flip-flop insertion in RTL
    • [Lu et al. DATE’02]
    • [Cocchini, ICCAD’02]
Limitation during Logic/Physical Level to Explore Multicycle Communication
  • Minimum clock period achievable by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]
  • Example: a loop with 4 logic cells and 2 registers
    • Cell delay = 1ns
    • Interconnect delay = 1ns
    • DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2 = 4ns
    • Clock period ≥ 4ns
  • Interconnect pipelining by flip-flop insertion?
    • Requires a considerable amount of manual rework on the original RTL descriptions
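The DR-ratio bound from the loop example can be reproduced in a few lines. This is an illustrative sketch; `dr_ratio` and `clock_period_lower_bound` are hypothetical helper names, and each loop is described simply by its cell delays, wire delays, and register count.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    if num_registers <= 0:
        raise ValueError("a loop must contain at least one register")
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers


def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)


# Loop from the slide: 4 cells of 1ns, 4 wire segments of 1ns, 2 registers.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)
print(clock_period_lower_bound([slide_loop]))  # (4 + 4) / 2 = 4.0
```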
Our Approach
  • Consideration of multicycle communication during architectural (or behavioral) synthesis
    • [Cong et al, ISPD’03] [Cong et al. ICCAD’03]
    • Regular Distributed Register (RDR) micro-architecture
      • Highly regular
      • Direct support of multicycle on-chip communication
    • MCAS: Architectural Synthesis for Multi-cycle Communication
      • Efficiently maps the behavioral descriptions to RDR uArch
      • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
  • This work
    • Extension of RDR and MCAS for interconnect pipelining
Outline
  • Motivation
  • Our contributions
    • RDR-Pipe micro-architecture
      • Regular Distributed Register micro-architecture with interconnect pipelining
    • Synthesis flow and algorithms
      • MCAS-Pipe: automatic interconnect pipelining and sharing
  • Experimental results
  • Conclusions
Regular Distributed Register Micro-Architecture
  • Distribute registers to each “island”
  • Choose the island size such that local computation and communication in each island can be done in a single cycle
  • Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication

[Figure: grid of islands of width Wi and height Hi; each island holds a local computational cluster (LCC) with functional units (e.g. MUL, ALU, MUX), an FSM, and register files banked for 1-cycle, 2-cycle, …, k-cycle communication over the global interconnect]
Wiring Overhead in RDR Designs
  • Data transfers r1→r3 and r2→r4 are overlapped
  • Two dedicated global wires are needed

[Figure: schedule over cycles 1 to 5 for ALU1 (+) and MUL1 (*); sender registers r1, r2 and receiver registers r3, r4 are connected by interconnects with a delay of 2 cycles]
Architectural Solution: RDR-Pipe
  • Keep the intra-island structures
  • Inter-island pipeline register station (PRS) for global communications
  • PRS performs autonomous store-and-forward
    • Synchronous design
    • No global control signal needed for PRS

[Figure: RDR-Pipe islands (LCC, FSM, register file) with pipeline register stations (PRS) placed in the horizontal (H) and vertical (V) channels between islands]
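The store-and-forward behavior of a chain of PRSs can be pictured as a shift register that advances on every clock edge. The following is a behavioral sketch only, not the authors' hardware; the class name `PipelinedLink` and its interface are assumptions made for illustration.

```python
class PipelinedLink:
    """Hypothetical model of one global wire with `stages` PRS registers."""

    def __init__(self, stages):
        if stages < 1:
            raise ValueError("need at least one pipeline register station")
        self.regs = [None] * stages

    def tick(self, value_in=None):
        """Advance one clock cycle and return the value leaving the link.

        Every station latches whatever its predecessor held, so no global
        control signal is involved: the clock alone moves the data.
        """
        value_out = self.regs[-1]
        self.regs = [value_in] + self.regs[:-1]
        return value_out


link = PipelinedLink(stages=2)
# Two values issued in back-to-back cycles travel on the same wire.
print([link.tick(v) for v in ["a", "b", None, None]])
```

In this model a new value can enter the link every cycle, which is what makes sharing one wire among several transfers possible.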
Reducing Wiring Overhead in RDR-Pipe
  • Data transfers are pipelined
  • One wire with a pipeline register is enough

[Figure: the same schedule over cycles 1 to 5 for ALU1 (+) and MUL1 (*) with 2-cycle communication; the sender registers reach receiver registers r3 and r4 through a single wire with a pipeline register]
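The difference between dedicated multicycle wires and a pipelined link can be counted directly. This sketch assumes that a non-pipelined wire stays busy for the whole delay of a transfer, while a pipelined link accepts one new transfer per cycle. The issue cycles used below are hypothetical, since the exact cycles in the figure are not recoverable from the transcript.

```python
from collections import Counter


def dedicated_wires(issue_cycles, delay):
    """Wires needed when each transfer occupies its wire for `delay` cycles."""
    busy = Counter()
    for start in issue_cycles:
        for cycle in range(start, start + delay):
            busy[cycle] += 1
    return max(busy.values(), default=0)


def pipelined_links(issue_cycles):
    """Links needed when a link can accept one new transfer every cycle."""
    return max(Counter(issue_cycles).values(), default=0)


# Hypothetical schedule: two 2-cycle transfers issued in cycles 2 and 3.
issues = [2, 3]
print(dedicated_wires(issues, delay=2))  # overlapped in cycle 3, so 2 wires
print(pipelined_links(issues))           # distinct issue cycles, so 1 link
```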
Synthesis Flow: MCAS-Pipe System
  • Global interconnect sharing
    • After scheduling and functional unit binding
    • Before register and port binding
    • Enables multiple data communications to share a physical link (a wire with pipeline registers)
  • Advantages over MCAS
    • Expected to reduce global wiring demand
    • No multicycle path constraint needed
  • MCAS-Pipe flow (from the flow diagram)
    • Input: C / VHDL
    • CDFG generation (produces the CDFG)
    • Resource allocation & functional unit binding (produces the ICG)
    • Scheduling-driven placement (produces locations)
    • Placement-driven rescheduling & rebinding
    • Global interconnect sharing
    • Register and port binding
    • Datapath & FSM generation
    • Output: RTL VHDL & floorplan constraints

Global Interconnect Sharing
  • Conflicted data transfers: two physical links are needed to support the concurrent data transfers
  • Compatible data transfers: the two producer registers can now be merged, since their lifetimes become compatible
  • Only one physical link is required to support the scheduled data transfers

[Figure: two transfers, e (pe → ce) and g (pg → cg), from island A to island B with D = 2, scheduled over cycles 1 to 7, shown once as conflicted and once as compatible data transfers, with sender, pipeline and receiver registers marked]
Global Pipelined Interconnect Minimization
  • Definitions
    • Data links: pipelined global interconnects
    • Channel: set of data links between two islands
      • Width of a channel: number of its data links
    • Data transfer: movement of data from a producer to a consumer
  • Architectural assumption
    • Channels cannot share interconnects
  • Theorem
    • Global pipelined interconnects are minimized if and only if the width of every channel is minimized
Transfer Scheduling for a Single Channel
  • A decision problem formulation
    • Given:
      • A channel (A, B) containing m data links
      • A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time
    • Fact: for every time slot, at most one transfer can be issued on a data link
    • Objective: to find a feasible transfer schedule on these data links
  • Transfer scheduling is solvable in polynomial time
    • A special real-time scheduling problem [J. Blazewicz, 1979]
      • Binary search for minimum feasible channel width m
      • For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
      • Overall time complexity: O(n log² n)
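The binary search plus EDF procedure can be sketched as follows. This is a minimal illustration under the slide's assumptions (integer time slots, unit occupancy, one issue per link per slot), not the MCAS-Pipe implementation; the function names and the sample windows are hypothetical.

```python
import heapq


def transfer_window(t_producer, t_consumer, delay):
    """Issue window from the slide: arrival T(pe)+1, deadline T(ce)-D(A, B)."""
    return (t_producer + 1, t_consumer - delay)


def edf_feasible(windows, m):
    """True if every (arrival, deadline) window gets an issue slot on m links."""
    if not windows:
        return True
    if m <= 0:
        return False
    pending = sorted(windows)           # ordered by arrival time
    ready = []                          # min-heap of deadlines of arrived transfers
    i, t = 0, pending[0][0]
    while i < len(pending) or ready:
        if not ready:
            t = max(t, pending[i][0])   # skip slots in which nothing is waiting
        while i < len(pending) and pending[i][0] <= t:
            heapq.heappush(ready, pending[i][1])
            i += 1
        for _ in range(min(m, len(ready))):
            if heapq.heappop(ready) < t:    # earliest deadline already missed
                return False
        t += 1
    return True


def min_channel_width(windows):
    """Binary search for the smallest number of data links that is feasible."""
    n = len(windows)
    if n == 0:
        return 0
    if not edf_feasible(windows, n):
        raise ValueError("some transfer has an empty issue window")
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(windows, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical windows (arrival slot, deadline slot) for one channel.
windows = [(1, 1), (1, 2), (2, 2), (3, 5)]
print(min_channel_width(windows))  # one link misses a deadline, two suffice: 2
```

Each feasibility check sorts the transfers and uses a heap, which gives O(n log n) per width; the binary search over widths 1 to n adds the second logarithmic factor.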
EDF-Based Transfer Scheduling Example
  • Ordered by left edge
    • Failed for 2 data links!
  • Ordered by Earliest-Deadline-First
    • Successfully scheduled onto 2 data links

[Figure: six transfers (1 to 6) assigned to time slots on Data Link 1 and Data Link 2 under each ordering]
Outline
  • Motivation
  • Our contributions
    • RDR-Pipe micro-architecture
      • Regular Distributed Register micro-architecture with interconnect pipelining
    • Synthesis flow and algorithms
      • MCAS-Pipe: automatic interconnect pipelining and sharing
  • Experimental results
  • Conclusions
Experiment Settings
  • Inputs: C / VHDL, uArch. spec., target clock period
  • Common front end: CDFG generation, then functional unit allocation & binding
  • Conventional flow: conventional scheduling
  • MCAS flow: scheduling-driven placement, then placement-driven rebinding & rescheduling
  • MCAS-Pipe flow: the MCAS steps followed by global interconnect sharing
  • Common back end: register and port binding, then datapath & control generation
  • Outputs
    • RTL VHDL files (for all flows)
    • Floorplan constraints (for MCAS and MCAS-Pipe); multicycle path constraints (for MCAS only)
  • Implementation: Altera QuartusII + Stratix

Experimental Results: Register and LE Usage
  • Design environment: Altera QuartusII, Stratix EP1S40
  • MCAS vs. Conventional flow:
    • Uses more registers and logic elements (LE)
  • MCAS-Pipe vs. MCAS:
    • Slightly more registers, and comparable logic element cost
Experimental Results: Performance
  • Design environment: Altera QuartusII, Stratix EP1S40
  • MCAS vs. Conventional flow:
    • 36% reduction in clock period and 30% in total latency
  • MCAS-Pipe vs. MCAS:
    • Comparable design performance (4% better)

[Figure: total latency and clock period results]

Interconnect Structure of Altera’s Stratix

[Figure: Stratix interconnect with local wires (LL, LO), horizontal wires H4, H8 and global H24, and vertical wires V4, V8 and global V16]

Experimental Results: Wirelength
  • Wire types
    • LL, LO: local wires; H4, V4, H8, V8: short global wires
    • V16, H24: long global wires
  • MCAS-Pipe vs. MCAS:
    • 28.8% reduction in long global wires, 19.3% reduction in total wirelength
Conclusions
  • High-level automatic on-chip interconnect pipelining
    • RDR-Pipe: extension of RDR micro-architecture
      • Micro-architecture supporting interconnect pipelining
    • MCAS-Pipe: enhancement of MCAS synthesis system
      • Adds a novel global interconnect sharing algorithm to effectively reduce global wiring
  • Experimental results
    • Matches or exceeds the RDR-based approach in performance
    • Greatly reduces wiring demand