NanoMap: An Integrated Design Optimization Flow

NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture. Wei Zhang†, Li Shang‡ and Niraj K. Jha†. †Dept. of Electrical Engineering, Princeton University.


NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture

Wei Zhang†, Li Shang‡ and Niraj K. Jha†

†Dept. of Electrical Engineering, Princeton University

‡Dept. of Electrical and Computer Engineering, Queen’s University


Outline

  • Temporal Logic Folding

  • Background on NRAMs

  • Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)

  • NanoMap: Design Optimization Flow

  • Experimental Results

  • Conclusions


Temporal Logic Folding

  • Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles

[Figure: temporal logic folding example with LUT 1, LUT 2, LUT 3 and a memory block (MEM), implementing i = abc', l = (i' + e' + f')h', OUT = d'g' + l]
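The basic idea can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it assumes the three functions on the slide are evaluated one per folding cycle on a single LUT, with intermediate values held in flip-flops, and it reads the "I'" in the second function as "i'". The names `CONFIGS`, `run_folded`, and `run_flat` are hypothetical.

```python
from itertools import product

# One configuration per folding cycle: (output name, input names, function).
# The functions are the slide's example: i = abc', l = (i'+e'+f')h', OUT = d'g'+l.
CONFIGS = [
    ("i", ("a", "b", "c"), lambda a, b, c: a and b and not c),
    ("l", ("i", "e", "f", "h"),
     lambda i, e, f, h: (not i or not e or not f) and not h),
    ("OUT", ("d", "g", "l"), lambda d, g, l: (not d and not g) or l),
]


def run_folded(inputs):
    """Evaluate all three functions on one LUT, reconfigured every cycle."""
    regs = dict(inputs)  # flip-flops: primary inputs plus intermediate results
    for out, ins, fn in CONFIGS:  # each iteration models one folding cycle
        regs[out] = bool(fn(*(regs[name] for name in ins)))
    return regs["OUT"]


def run_flat(a, b, c, d, e, f, g, h):
    """Reference: the same logic with three separate LUTs and no folding."""
    i = a and b and not c
    l = (not i or not e or not f) and not h
    return bool((not d and not g) or l)


# Folded and unfolded versions agree on every input combination.
equivalent = all(
    run_folded(dict(zip("abcdefgh", bits))) == run_flat(*bits)
    for bits in product([False, True], repeat=8)
)
```

The folded version uses one LUT for three cycles, while the flat version needs three LUTs for one pass; this is the area-delay trade-off that folding exposes.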


Overview of NATURE

  • Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits

  • Fine-grain reconfiguration (even cycle-by-cycle) and logic folding

    • Area-delay trade-off flexibility

    • More than an order of magnitude increase in logic density

    • More than an order of magnitude reduction in area-time product

    • Comparisons assume NRAMs and CMOS logic are implemented in the same technology

    • Non-volatility: useful for low-power and secure processing

[Figure: NATURE at the center, linked to its key properties: CMOS-fabrication compatible, NRAM-based, run-time reconfiguration, temporal logic folding, logic density, design flexibility]


Overview of NATURE (Contd.)

  • Challenges in nano-circuits/architectures

    • Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.

    • Lack of a mature fabrication process

    • Fabrication defects and run-time failures (between 1% and 10%)

  • Regular, reconfigurable architectures, such as FPGAs, are favored

    • Facilitates fabrication

    • Fault tolerance through reconfiguration

    • NATURE: can be fabricated using a CMOS-compatible process


NRAM™ by Nantero

  • Non-volatile nanotube random-access memory (NRAM)

    • Whether a nanotube is mechanically bent or not determines its bistable on/off state

    • A same or opposite voltage is applied to change the state

    • CMOS-compatible fabrication process

    • 10 Gbit NRAMs already fabricated: ready to be commercialized in the near future

Source: http://www.nantero.com/nram.html


NRAMs

  • Properties of NRAMs

    • Non-volatile

    • Similar speed to SRAM

    • Similar density to DRAM

    • Chemically and mechanically stable

  • NATURE not tied to NRAMs

    • Phase change RAM

    • Magnetoresistive RAM

    • Ferroelectric RAM


Architecture of NATURE

  • Island-style logic blocks (LBs) connected by various levels of interconnects

  • An LB contains a super macroblock (SMB) and a local switch matrix


Architecture of a Super Macroblock (SMB)

  • n1 macroblocks (MBs) make up an SMB; here n1 = 4


Architecture of a Macroblock (MB)

  • n2 logic elements (LEs) make up an MB; here n2 = 4


Logic Element (Basic Configuration)

  • An LE implements a computation and contains:

    • An m-input look-up table (LUT)

    • l flip-flops

    • Input to flip-flop selected between LUT output and a primary input


Folding Levels

  • Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs

  • Level-p folding: LE reconfiguration after the execution of p LUT computations

    • Reconfiguration time: 160ps

  • A larger folding level typically decreases delay and increases area

(a) level-1 folding

(b) level-2 folding


Design Optimization Flow: NanoMap

  • Optimize and implement design on NATURE

  • Integrate temporal logic folding

    • Choose a proper folding level

    • Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles

  • Input design specified in register-transfer level (RTL) and/or gate-level VHDL


Motivational Example

  • Different planes should have the same number of folding stages to guarantee global synchronization

  • Key issue: how to achieve the optimization objective

    • Appropriate folding level

    • Assign the logic to folding stages

[Figure: a circuit divided into planes by level-1 and level-2 registers; labels: logic in plane, folding stage, folding cycle, plane cycle]


Motivational Example (Contd.)

  • Example optimization objective

    • Minimize circuit delay under an area constraint of 32 LEs

    • Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops

[Figure: example RTL circuit; annotations: 8 LUTs, logic depth 4; 50 LUTs, 14 flip-flops, plane depth 9; 38 LUTs, logic depth 7]


Iterative Design Flow

  • Start with initial guess for folding level and iteratively refine it

    • A larger folding level gives lower circuit delay, but at a larger area cost

    • Initial #folding stages:

    • Initial folding levels:

  • Partition RTL modules into a series of connected LUT clusters

    • Logic depth of each cluster is at most equal to the folding level

    • Significantly speeds up the mapping procedure


Iterative Design Flow (Contd.)

  • Cluster size should be smaller than the area constraint

[Figure: level-5 folding vs. level-4 folding; annotation for level-5 folding: 34 LUTs > 32 LUTs]


Solution for the Example

  • Three folding stages using level-4 folding

  • 32 LEs required for mapping the RTL circuit; area constraint satisfied

  • Circuit delay = 3 * folding cycle delay
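The iterative refinement for this example can be sketched as follows. The transcript does not preserve the slide's formulas for the initial number of folding stages and the initial folding level, so the ones used here are assumptions: stages = ceil(#LUTs / #available LEs) and level = ceil(depth / stages), with one LUT per LE. The function names and the `peak_luts_at` callback are hypothetical; the value 34 for level 5 comes from the slide, while treating 32 as the peak LUT demand at level 4 is an assumption based on the "32 LEs required" statement.

```python
import math


def initial_guess(total_luts, available_les, max_depth):
    """Assumed starting point: fewest stages that could fit the area budget."""
    stages = math.ceil(total_luts / available_les)
    level = math.ceil(max_depth / stages)
    return stages, level


def refine_level(level, available_les, peak_luts_at):
    """Lower the folding level until the largest LUT cluster fits the budget.

    peak_luts_at(level) is a caller-supplied function returning the largest
    number of LUTs that must be resident at once for that folding level.
    """
    while level > 1 and peak_luts_at(level) > available_les:
        level -= 1
    return level


# Numbers from the example: 50 LUTs, plane depth 9, area constraint 32 LEs.
stages0, level0 = initial_guess(50, 32, 9)
peak = {5: 34, 4: 32}  # level 5 from the slide; level 4 value is an assumption
final_level = refine_level(level0, 32, peak.__getitem__)
final_stages = math.ceil(9 / final_level)
```

With these assumptions the search starts at level-5 folding, rejects it because 34 LUTs exceed the 32-LE budget, and settles on level-4 folding with three folding stages, matching the solution stated above.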


NanoMap: Flow Diagram

[Figure: NanoMap flow diagram with steps numbered 1 to 16. Inputs: input network, optimization objective, user constraint, circuit parameter. Logic mapping: module library search, folding level computation, RTL module partition, decision on whether to perform logic folding, scheduling of each LUT/LUT cluster using FDS. Temporal clustering: map each LUT/LUT cluster to SMBs, check whether area constraints are satisfied. Temporal placement: fast placement using modified VPR placer, routability check, delay estimation, check whether delay constraints are satisfied, placement refinement, final placement using modified VPR placer. Routing: final routing using VPR router. Output: reconfiguration bits]


Force-Directed Scheduling

  • Perform FDS on RTL modules partitioned into LUTs/LUT clusters

  • Iteratively schedule each LUT/LUT cluster to minimize overall resource usage

  • Model resource usage as a force: F = Kx

    • K: distribution graphs (DGs) that describe the probability of resource usage

    • Aim of FDS: minimize force, indicating minimum increase in resource usage

  • LE usage depends on LUT computations and register storage operations: two DGs needed
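The force model can be sketched using the classic force-directed scheduling formulation, which is assumed here rather than taken from the slides: each unscheduled operation is equally likely to land in any step of its time frame, the distribution graph (the K in F = Kx) sums these probabilities per step, and x is the change in an operation's probability when it is fixed to one step. The sketch builds a single DG; NanoMap needs two (LUT computation and register storage). All names and the example time frames are hypothetical.

```python
def distribution_graph(frames, n_steps):
    """Sum, per folding cycle, the probability that each operation lands there.

    frames maps an operation name to its (earliest, latest) feasible step.
    """
    dg = [0.0] * n_steps
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)  # uniform probability over the time frame
        for t in range(lo, hi + 1):
            dg[t] += p
    return dg


def self_force(dg, frame, step):
    """Force of fixing one operation to `step`: sum of K * x over its frame."""
    lo, hi = frame
    p = 1.0 / (hi - lo + 1)
    return sum(
        dg[t] * ((1.0 if t == step else 0.0) - p)  # x = new prob - old prob
        for t in range(lo, hi + 1)
    )


# Hypothetical example: three LUTs with time frames over three folding cycles.
frames = {"u": (0, 0), "v": (0, 1), "w": (1, 2)}
dg = distribution_graph(frames, 3)
forces_v = {s: self_force(dg, frames["v"], s) for s in (0, 1)}
best_step_v = min(forces_v, key=forces_v.get)
```

Here cycle 0 is already crowded by `u`, so placing `v` in cycle 1 yields the lower (negative) force, which is the balancing behavior the scheduler relies on.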


Temporal Clustering

  • For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs

    • Unpacked LUT with a maximal number of inputs selected as initial seed

    • New LUTs with high attractions to the seed selected and assigned to the SMB

      • Attractions depend on timing criticality and input pin sharing

      • Considers attractions across all the folding cycles
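The seed-and-attract packing described above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the authors' algorithm: attraction is reduced to the number of shared input nets (timing criticality is omitted), only one folding stage is considered, and LEs are packed straight into SMBs without the intermediate MB level. The default capacity of 16 assumes 4 MBs of 4 LEs each; `pack_smbs` and the example netlist are hypothetical.

```python
def pack_smbs(luts, capacity=16):
    """Greedily pack LUTs into SMBs.

    luts maps a LUT name to the set of its input net names.
    Returns a list of SMBs, each a list of LUT names in packing order.
    """
    unpacked = dict(luts)
    smbs = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        nets = set(unpacked.pop(seed))
        smb = [seed]
        while unpacked and len(smb) < capacity:
            # Attraction: number of input nets shared with the SMB so far.
            best = max(sorted(unpacked), key=lambda n: len(unpacked[n] & nets))
            nets |= unpacked.pop(best)
            smb.append(best)
        smbs.append(smb)
    return smbs


# Hypothetical netlist, packed with a tiny capacity of 2 for readability.
example = {
    "x": {"a", "b", "c", "d"},
    "y": {"a", "b"},
    "z": {"p", "q"},
    "w": {"p", "q", "r"},
}
packed = pack_smbs(example, capacity=2)
```

In the example, `x` seeds the first SMB and attracts `y` through two shared nets, then `w` seeds the second SMB and attracts `z`.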


Placement and Routing

  • VPR (U. Toronto) modified to perform placement and support temporal logic folding

    • Simulated annealing approach

    • Cost function computed across the folding stages

  • Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
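A placement cost computed across folding stages might look like the sketch below. The transcript gives neither VPR's cost function nor the authors' modification, so plain half-perimeter wirelength is used as a stand-in, summed over the nets that are active in each folding stage. The key assumption is that one block location is shared by all stages, since the same SMB is reconfigured in place. All names and coordinates are hypothetical.

```python
def hpwl(net, placement):
    """Half-perimeter wirelength of one net given block (x, y) locations."""
    xs = [placement[block][0] for block in net]
    ys = [placement[block][1] for block in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def folded_cost(nets_per_stage, placement):
    """Sum the wirelength of every net in every folding stage.

    The same placement is used for all stages, so a move made by the
    annealer is judged by its effect on the whole folded schedule.
    """
    return sum(
        hpwl(net, placement) for nets in nets_per_stage for net in nets
    )


# Hypothetical example: three SMBs, two folding stages with different nets.
placement = {"A": (0, 0), "B": (2, 1), "C": (1, 3)}
nets_per_stage = [
    [("A", "B")],              # connections active in folding stage 0
    [("A", "C"), ("B", "C")],  # connections active in folding stage 1
]
cost = folded_cost(nets_per_stage, placement)
```

A simulated annealing placer would evaluate this total before and after each candidate swap and accept or reject the swap based on the difference.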


Experimental Setup

  • Instance of architecture:

    • 4 MBs in an SMB

    • 4 LEs in an MB

    • LEs contain a 4-input LUT and 2 flip-flops

  • Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs

  • Results based on 100 nm technology parameters to implement CMOS logic and NRAMs


Experimental Results (Contd.)

[Figure: bar chart of the #LE × delay advantage under area-time (AT) optimization, normalized to no folding, for benchmarks ex1, ex2, FIR, c5315, Paulin, ASPP4, and Biquad; series: no folding, k enough, k = 16; vertical axis 0 to 18]


Experimental Results (Contd.)

Improvement under AT optimization for RTL benchmarks

  • LE utilization around 100%

  • Level-1 folding reduces the need for a deep interconnect hierarchy by 50% compared with no folding, which indicates that trading interconnect area for NRAM area is advantageous


Experimental Results (Contd.)

  • Flexibility in choosing the best folding level and performing area-delay trade-offs

  • Mapping results for typical optimizations using Paulin benchmark as an example

Typical optimizations


Conclusions

  • NATURE: A new high-performance run-time reconfigurable architecture

  • NanoMap: an integrated optimization design flow for NATURE

  • Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages

  • Can be very useful for cost-conscious embedded systems and improvement of future FPGAs

  • Non-volatility: helpful in secure and low power processing

