Digital integrated circuits a design perspective

Digital Integrated Circuits: A Design Perspective

System on a Chip Design


Application specific integrated circuits introduction

Application Specific Integrated Circuits: Introduction

Jun-Dong Cho

SungKyunKwan Univ.

Dept. of ECE, Vada Lab.

http://vada.skku.ac.kr


Contents

Contents

  • Why ASIC?

  • Introduction to System On Chip Design

  • Hardware and Software Co-design

  • Low Power ASIC Designs


Why asic design productivity grows

Why ASIC – Design productivity grows!

Complexity increases 40% per year

Design productivity increases 15% per year

  • Integration of PCB on single die
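The gap between the two growth rates compounds quickly. The sketch below is only an illustration: it assumes both rates stay constant and normalizes complexity and productivity to 1 in year 0 (the function name is hypothetical).

```python
def design_gap(years, complexity_growth=0.40, productivity_growth=0.15):
    """Ratio of design complexity to design productivity after `years`.

    Both quantities are normalized to 1 at year 0 and assumed to grow
    at constant annual rates (40% and 15%, as quoted on the slide).
    """
    return ((1 + complexity_growth) / (1 + productivity_growth)) ** years


for n in (1, 5, 10):
    print(f"after {n:2d} years the gap is x{design_gap(n):.2f}")
```

Under these assumptions the gap grows by roughly 22% each year, which is the motivation for reuse-based ASIC and SOC methodologies.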


Silicon in 2010

Silicon in 2010

Die Area: 2.5 x 2.5 cm

Voltage: 0.6 V

Technology: 0.07 µm


Asic principles

ASIC Principles

  • Value-added ASIC for huge volume opportunities; standard parts for quick time to market applications

  • Economics of Design

    • Fast Prototyping, Low Volume

    • Custom Design, Labor Intensive, High Volume

  • CAD Tools Needed to Achieve the Design Strategies

    • System-level design: concept to VHDL/C

    • Physical design: VHDL/C to silicon, timing closure (Monterey, Magma, Synopsys, Cadence, Avant!)

  • Design Strategies: Hierarchy; Regularity; Modularity; Locality


Asic design strategies

ASIC Design Strategies

  • Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters.

  • Performance specs: function, timing, speed, power

  • Size of die: manufacturing cost

  • Time to design: engineering cost and schedule

  • Ease of test generation and testability: engineering cost, manufacturing cost, schedule


Asic flow

ASIC Flow


Structured asic designs

Structured ASIC Designs

  • Hierarchy: Subdivide the design into many levels of sub-modules

  • Regularity: Subdivide into the maximum number of similar sub-modules at each level

  • Modularity: Define sub-modules unambiguously, with well-defined interfaces

  • Locality: Maximize local connections, keeping critical paths within module boundaries


Asic design options

ASIC Design Options

  • Programmable Logic

  • Programmable Interconnect

  • Reprogrammable Gate Arrays

  • Sea of Gates & Gate Array Design

  • Standard Cell Design

  • Full Custom Mask Design

  • Symbolic Layout

  • Process Migration - Retargeting Designs


Asic design methodologies

ASIC Design Methodologies


Why soc

Why SOC?

  • SOC specs are coming from system engineers rather than from RTL descriptions

  • SOC will bridge the gap between hardware/software and their implementation in novel, energy-efficient silicon architectures.

  • In SOC design, chips are assembled at the IP block level (design reuse) and through IP interfaces rather than at the gate level


Cmos density now allows complete system on a chip solutions

CMOS Density Now Allows Complete System-on-a-Chip Solutions

[Figure: single-chip block diagram. Block labels: µP core; dedicated logic; keypad interface; phone book; RAM & ROM; DMA; S/P; control; protocol; demodulation and sync; Viterbi equalizer; voice recognition; speech quality enhancement; A/D; de-interleaver & decoder; RPE-LTP speech decoder; digital down-conversion; analog; DSP core.]

Also like to add:

  • FPGA

  • Reconfigurable Interconnect

Source: Brodersen, ICASSP '98

How do we design these chips?


Possible single chip radio architectures

Possible Single-Chip Radio Architectures

Software Radio

GOAL: Simplify System Design Process

Seek architectures which are flexible such that hardware and protocols can be designed independently

APPROACH: Minimize the use of dedicated logic

Universal Radio

GOAL: Maximize Bandwidth Efficiency and Battery Life

Seek architectures which perform complex algorithms very fast with minimal energy

APPROACH: Minimize the use of programmable logic

Why is SOC design so scary?


60 ghz sige transceiver for wireless lan applications

60 GHz SiGe Transceiver for Wireless LAN Applications

A low power 30 GHz LNA is designed as the front end of the receiver.

Wideband and high gain response is realized by a 2-stage design using a stagger-tuned technique.

The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range with an input match of |S11| < -30 dB and output match of |S22| < -10 dB.

The mixer consists of a single balanced Gilbert cell.

A fully-integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF.

[Figure: 30 GHz receiver layout consisting of the LNA, mixer and VCO]
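The frequency plan and the dB figures quoted above can be checked with a few lines. This is a minimal sketch: the helper names are hypothetical, and the S-parameter magnitudes are treated as voltage ratios (20·log10).

```python
def intermediate_frequency_ghz(rf_ghz, lo_ghz):
    """Difference-frequency output of a downconverting mixer."""
    return abs(rf_ghz - lo_ghz)


def db_to_voltage_ratio(value_db):
    """Convert an S-parameter magnitude in dB to a linear voltage ratio."""
    return 10 ** (value_db / 20)


if_ghz = intermediate_frequency_ghz(30.0, 25.0)  # 30 GHz RF, 25 GHz VCO
s21_linear = db_to_voltage_ratio(20.0)           # forward gain bound
s11_linear = db_to_voltage_ratio(-30.0)          # input reflection bound

print(f"IF = {if_ghz} GHz")
print(f"|S21| > {s21_linear:.1f} (linear), |S11| < {s11_linear:.4f} (linear)")
```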


Wideband cmos lc vco

Wideband CMOS LC VCO

A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured.

This VCO utilizes a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V.

The amplitude reference level is programmable by means of a 3-bit DAC.

[Figure: VCO die photograph]
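The two ways of quoting the tuning range (a frequency ratio and a percentage) can be related as follows. The sketch assumes the percentage is measured relative to the center frequency; the slide does not state which definition it uses, and the function names are hypothetical.

```python
def tuning_range_percent(f_min, f_max):
    """Tuning range as a percentage of the center frequency (assumed definition)."""
    center = (f_min + f_max) / 2
    return 100 * (f_max - f_min) / center


def ratio_for_percent(percent):
    """f_max / f_min ratio that produces the given percentage tuning range."""
    p = percent / 100
    return (2 + p) / (2 - p)


print(f"2:1 ratio -> {tuning_range_percent(1.0, 2.0):.1f} % of center frequency")
print(f"73 %      -> ratio of {ratio_for_percent(73):.2f}:1")
```

Under this definition a 73% range corresponds to a ratio slightly above 2:1, which is consistent with "exceeding 2:1".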


A high level view of an industry standard design flow

A High Level View of an Industry Standard Design Flow

[Figure: flowchart with a "good?" check after each stage. Front-End: HDL Entry, then Synthesis. Back-End: Floor-plan, then Place & Route. Physical Verification: DRC & LVS, then done. Source: Hitachi, Prof. R. W. Brodersen]

Problems with this flow:

  • Every step can loop to every other step

  • Each step can take hours or days for a 100,000 line description

  • HDL description contains no physical information

  • Different engineers handle the front-end and back-end design

How have semiconductor companies made this flow work?


A more accurate picture of the standard flow

A More Accurate Picture of the Standard Flow

[Figure: project timeline. Architecture: 10 months; Front-End: 10 months; Back-End: 2 months; Fabrication: 2 months. Source: IBM Semiconductor, Prof. R. Newton]

  • Architecture: Partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit. TOOLS: Matlab, C, SPW, (VCC). FREEZE the test vectors.

  • Front-End: Enter HDL code which matches the test vectors. TOOLS: HDL simulators, Design Compiler. FREEZE the HDL code.

  • Back-End: Create a floor-plan and tweak the tools until a successful mask layout is created. TOOLS: Design Compiler, floor-planners, placers, routers, clock-tree generators, physical verification.

How can we improve this flow?


Common fabric for ip blocks

Common Fabric for IP Blocks

  • Soft IP blocks are portable, but not as predictable as hard IP.

  • Hard IP blocks are very predictable since a specific physical implementation can be characterized, but are hard to port since they are often tied to a specific process.

  • Common fabric is required for both portability and predictability.

  • Wide availability: Cell Based Array, metal programmable architecture that provides the performance of a standard cell and is optimized for synthesis.


Four main applications

Four main applications

  • Set-top box: Mobile multimedia system, base station for the home local-area network.

  • Digital PCTV: concurrent use of TV, 3D graphics, and Internet services

  • Set-top box LAN service: Wireless home-networks, multi-user wireless LAN

  • Navigation system: steer and control traffic and/or goods-transportation



Pc multimedia applications

PC-Multimedia Applications


Types of system on a chip designs

Types of System-on-a-Chip Designs


Physical gap

Physical gap

  • Timing closure problem: layout-driven logic and RT-level synthesis

  • Energy efficiency requires locality of computation and storage: a match for stream-based data processing of speech, images, and multimedia-system packets.

  • Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.


Circular y chart

Circular Y-Chart


Soc co design challenges

SOC Co-Design Challenges

  • Current systems are complex and heterogeneous: they contain many different types of components

  • Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory; the other half is ASIC

  • Computational power will not result from multi-GHz clocking but from parallelism, with clock rates below 200 MHz.

  • This will greatly simplify the design for correct timing, testability, and signal integrity.


Bridging the architectural gap

Bridging the architectural gap

  • One million gates of reconfigurable logic, one million gates of hardwired logic.

  • 50 GIPS for programmable components or 500 GIPS for dedicated hardware

  • Product reliability: design at a level far above the RT level, with reuse factors in excess of 100

  • Trade-off: 100 MOPS/watt (microprocessor) versus 100 GOPS/watt (hardwired); reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)


Why lower power

Why Lower Power

  • Portable systems

    • long battery life

    • light weight

    • small form factor

  • IC priority list

    • power dissipation

    • cost

    • performance

  • Technology direction

    • Reduced voltage/power designs based on mature high performance IC technology, high integration to minimize size, cost, power, and speed


Microprocessor power dissipation

Microprocessor Power Dissipation

[Figure: scatter plot of power in watts (axis from 5 to 50 W) versus year (1980 to 2000). Labeled points, as named on the slide: i286, i386 DX 16, i486 DX25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha21064 200, Alpha 21164, Alpha 21264.]


Levels for low power design

Levels for Low Power Design


Power hungry applications

Power-hungry Applications

  • Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management

  • Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders


New computing platforms

New Computing Platforms

  • SOC power efficiency more than 10 GOPS/W

    • Higher on-chip system integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers)

    • Speed & performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures

  • Mixed signal systems

  • Reuse of IP blocks

  • Multiprocessor, configurable computing

  • Domain-specific, combined memory-logic


Low power design flow i

Low Power Design Flow I

[Figure: flowchart from system level down to RT level. Box labels, as far as they can be recovered: Function; System Level Specification; Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Software Functions; Processor Selection; Power-driven Behavioral Transformation; Behavioral-Level Power Analysis; Power Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Optimization; Software-Level Power Analysis; To RT-Level Design.]


Low power design flow ii

Low Power Design Flow II

[Figure: flowchart from RT level down to switch level. Box labels, as far as they can be recovered: RT-level Description; Controller; Data-path; Logic Synthesis and Optimization; Gate-Level Power Analysis; RTL Library; RTL mapping; Gate-level Description; High-Level Synthesis and Optimization; Switch-Level Power Analysis; Standard cell Library; Processor; Memory; Control and Steering Logic; RTL Macrocells; Switch-level Description.]


Three factors affecting energy

Three Factors affecting Energy

  • Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing

  • All-in-one approach (SOC): I/O pin and buffer reduction

  • Voltage-reducible hardware

    • 2-D pipelining (systolic arrays)

    • SIMD: parallel processing, useful for data with parallel structure

    • VLIW approach: flexible


Ibm s powerpc lower power architecture

IBM’s PowerPC Lower Power Architecture

  • Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution

    • 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)

    • FPU is pipelined so a multiply-add instruction can be issued every clock cycle

    • Low power 3.3-volt design

  • Use small complex instruction with smaller instruction length

    • IBM’s PowerPC 603e is RISC

  • Superscalar: CPI < 1

    • 603e issues as many as three instructions per cycle

  • Low Power Management

    • 603e provides four software controllable power-saving modes.

  • Copper Processor with SOI

  • IBM’s Blue Logic ASIC: new design reduces power by a factor of 10


Power down techniques

Power-Down Techniques

Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
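A minimal sketch of why the voltage matters more than the clock: with the usual dynamic-energy model E = C · VDD², slowing the clock alone lowers power but leaves the energy per operation unchanged, whereas lowering the voltage reduces it quadratically. The capacitance, voltage and frequency values below are hypothetical.

```python
def energy_per_op(c_eff, vdd):
    """Dynamic switching energy per operation, E = C * VDD^2 (joules)."""
    return c_eff * vdd ** 2


def power(c_eff, vdd, f_clk):
    """Dynamic power when one operation completes per clock cycle (watts)."""
    return energy_per_op(c_eff, vdd) * f_clk


C_EFF = 1e-9  # hypothetical effective switched capacitance per operation (F)

modes = {
    "full speed":            (3.3, 100e6),
    "half clock only":       (3.3, 50e6),
    "half clock, half VDD":  (1.65, 50e6),
}
for name, (vdd, f_clk) in modes.items():
    print(f"{name:22s} E/op = {energy_per_op(C_EFF, vdd) * 1e9:6.2f} nJ, "
          f"P = {power(C_EFF, vdd, f_clk):.3f} W")
```

Only the third mode reduces the energy needed for a fixed amount of work; the second just spreads the same energy over a longer time.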


Implementing digital systems

Implementing Digital Systems


H w and s w co design

H/W and S/W Co-design


Three co design approaches

Three Co-Design Approaches

  • N.S. Voros et al., “Hardware-software co-design of embedded systems using multiple formalisms for application development”, IFIP International Conference FORTE/PSTV’98, Nov. ’98

  • ASIP co-design: builds a specific programmable processor for an application, and translates the application into software code. H/w and s/w partitioning includes the instruction set design.

  • H/w s/w synchronous system co-design: s/w processor as a master controller, and a set of h/w accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma

  • H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation and communication transformation. Coware (powerful), Siera (reuse), Ptolemy (DSP)


Mixing h w and s w

Mixing H/W and S/W

  • Argument: Mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.

  • Counterpoint: From a design standpoint, it is the worst of both worlds

    • Simulation: Problems of verification, and test become harder

    • Interface: Too many tools, too many interactions, too much heterogeneity

    • Hardware/ software partitioning is “AI- complete”!

    • (MIT, Stanford: by analogy with "NP-complete") A term used to describe problems in artificial intelligence, to indicate that the solution presupposes a solution to the "strong AI problem" (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is just too hard.


Low power partitioning approach

Low power partitioning approach

  • Different HW resources are invoked according to the instruction executed at a specific point in time

  • During the execution of the add operation, the ALU and registers are used, but the multiplier is idle.

  • Non-active resources still consume energy, since the corresponding circuits continue to switch

  • Calculate the wasted energy

  • Add application-specific cores and run them selectively: whenever one core is running, all the other cores are shut down
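The "calculate the wasted energy" step can be sketched as follows. Everything numeric here is hypothetical: the per-cycle energies, the resource names and the instruction-to-resource table are illustrative assumptions, not data from the slides.

```python
# Hypothetical per-cycle switching energy (pJ) of each datapath resource.
ENERGY_PJ = {"alu": 5.0, "register": 1.0, "multiplier": 20.0}

# Resources actually needed by each instruction (illustrative mapping).
NEEDED = {
    "add": {"alu", "register"},
    "mul": {"multiplier", "register"},
}


def wasted_energy_pj(trace):
    """Energy spent by resources that keep switching although the
    currently executing instruction does not need them."""
    wasted = 0.0
    for op in trace:
        for resource, energy in ENERGY_PJ.items():
            if resource not in NEEDED[op]:
                wasted += energy
    return wasted


trace = ["add", "add", "mul", "add"]
print(f"wasted energy: {wasted_energy_pj(trace)} pJ over {len(trace)} instructions")
```

Shutting down the idle multiplier during the three add instructions would remove most of the waste in this toy trace, which is the idea behind gating unused cores.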


Asip application specific instruction processors design

ASIP (Application Specific Instruction Processors) Design

  • Given a set of applications, determine micro architecture of ASIP (i. e., configuration of functional units in datapaths, instruction set)

  • To accurately evaluate the performance of the processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.

  • The micro architecture of the processor is a design parameter!


Asip design flow

ASIP Design Flow


Cross disciplinary nature

Cross-Disciplinary nature

  • Software for low power: loop transformation leads to much higher temporal and spatial locality of data.

  • Code size becomes an important objective; software will eventually become a part of the chip

  • Behavior-platform-compiler codesign: codesigned with C++ or Java, describing their h/w and s/w implementation.

  • Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english)


Vlsi signal processing design methodology

VLSI Signal Processing Design Methodology

  • pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering

  • bit-serial, bit-parallel and digit-serial architectures, carry save architecture

  • redundant and residue systems

  • Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems


Low power dsp

Low Power DSP

  • DO-LOOP dominant

    • VSELP vocoder: 83.4 %

    • 2D 8x8 DCT: 98.3 %

    • LPC computation: 98.0 %

DO-LOOP power minimization ==> DSP power minimization

VSELP: Vector Sum Excited Linear Prediction

LPC: Linear Prediction Coding
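Why minimizing DO-loop power nearly minimizes total DSP power can be seen with an Amdahl-style estimate. The sketch assumes the percentages above are the loops' share of total power (the slide does not say whether they are power or execution-time shares), and the 2x loop reduction factor is hypothetical.

```python
def total_power_fraction(loop_share, loop_reduction):
    """Fraction of the original power that remains when only the
    loop portion is reduced by a factor of `loop_reduction`."""
    return (1 - loop_share) + loop_share / loop_reduction


# Loop shares quoted on the slide, interpreted here as shares of power.
SHARES = {"VSELP vocoder": 0.834, "2D 8x8 DCT": 0.983, "LPC computation": 0.980}

for name, share in SHARES.items():
    remaining = total_power_fraction(share, loop_reduction=2.0)
    print(f"{name:16s} halving loop power leaves {remaining:.1%} of the total")
```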


Deep submicron design flows

Deep-Submicron Design Flows

  • Rapid evaluation of complex designs for area and performance

  • Timing convergence via estimated routing parasitics

  • In-place timing repair without resynthesis

  • Shorter design intervals, minimum iterations

  • Block-level design and place and route

  • Localized changes without disturbance

  • Integration of complex projects and design reuse


Soc cad companies

SOC CAD Companies

  • Avant! www.avanticorp.com

  • Cadence www.cadence.com

  • Duet Tech www.duettech.com

  • Escalade www.escalade.com

  • LogicVision www.logicvision.com

  • Mentor Graphics www.mentor.com

  • Palmchip www.palmchip.com

  • Sonics www.sonicsinc.com

  • Summit Design www.summit-design.com

  • Synopsys www.synopsys.com

  • Topdown design solutions www.topdown.com

  • Xynetix Design Systems www.xynetix.com

  • Zuken-Redac www.redac.co.uk


Design technology for low power radio systems

Design Technology for Low Power Radio Systems

Rhett Davis

Dept. of EECS

Univ. of Calif.

Berkeley

http://bwrc.eecs.berkeley.edu


Domain of interest

Domain of Interest

  • Highly integrated system-on-a-chip solutions – SOC’s

  • Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc…

  • Primary computation is high complexity dataflow with a relatively small amount of control


Why systems on a chip soc

Why Systems-on-a-Chip - SOC ?

State-of-the-Art CMOS is easily able to implement complete systems (or what was on a board before)

  • A microprocessor core is only 1-2 mm² (1-2 % of the area of a $4 chip)

  • Portability (size) is critical to meet the cost, power and size requirements of future wireless systems

  • Chips will be required to support the complete application (wireless internet, multimedia)

  • Dedicated stand-alone computation is replacing general purpose processors as the semiconductor industry driver


Digital integrated circuits a design perspective

Cellular Phones: An Example

[Figure: cellular phone chip set. Blocks: Small Signal RF; Power RF; Power Management; Analog Baseband; Digital Baseband (DSP + MCU).]

Digital Cellular Market (Phones Shipped)

Year:   1996   1997   1998   1999   2000

Units:  48M    86M    162M   260M   435M

(Courtesy Mike McMahon, Texas Instruments)


Digital integrated circuits a design perspective

Cellular Phone Baseband SOC

[Figure: chip layout with blocks ROM, MCU, DSP, Gates, RAM, Analog.]

  • 2000+ phones on each 8” wafer @ 0.15 µm Leff

  • 1 million baseband chips per day!

(Courtesy Mike McMahon, Texas Instruments)


Wireless system design issues

Wireless System Design Issues

  • It is now possible to use CMOS to integrate all digital radio functions – but what is the “best” architectural way to use CMOS???

  • Computation rates for wireless systems will easily range up to 100’s of GOPS in signal processing

    • What’s keeping us from achieving this in silicon?

    • What can we do about it?


Computational efficiency metrics

Computational Efficiency Metrics

  • Definition: MOPS

    • Millions of algorithmically defined arithmetic operations (e.g. multiply, add, shift) – in a GP processor several instructions per “useful” operation

  • Figures of merit

    • MOPS/mW - Energy efficiency (battery life)

    • MOPS/mm² - Area efficiency (cost)

      Optimization of these “efficiencies” is the basic goal assuming functionality is met


Energy efficiency of architectures

Energy-Efficiency of Architectures

[Figure: energy efficiency in MOPS/mW (or MIPS/mW), log axis from 0.1 to 1000, versus flexibility (coverage).]

  • Dedicated HW (direct mapped): 100-1000 MOPS/mW

  • Reconfigurable processor/logic (reconfiguration ???): potential of 10-100 MOPS/mW

  • ASIPs and DSPs: DSP at 1-10 MIPS/mW

  • Embedded µProcessors: microprocessor at 0.1-1 MIPS/mW


Software processors energy trends

Software Processors: Energy Trends

[Figure: clock frequency in MHz (axis from 0 to 300) versus year (1991 to 1996). Labeled points, as named on the slide: i386, i386C-33, i486C-33, 486-66, DX4 100, PP-66, PP-100, PP-133, PP166, PPro-150, PPro200, PPC 601-80, PPC603e-100, PPC 604-120, SuperSparc2-90, UltraSparc-167, HP PA7200, HP PA8000, MIPS R4400, MIPS R5000, MIPS R10000, A21064A, A21164-300.]

Primary means of performance increase of software processors has been by increasing clock rate

Decreasing Energy Efficiency

$E \propto C \cdot V_{DD}^2$


Software processors area trends

Software Processors: Area Trends

  • Increasing clock rate results in a memory bottleneck – addressed by bringing memory on-chip

  • Area is increasingly dominated by memory – degrading MOPS/mm²

16x16 multiplier (0.05 mm²)

DSP processor with 1 multiplier (25 mm²)

Why time multiplex to save area if the overhead is much greater than the area saved?


Parallelism is the answer but

Parallelism is the answer, but …

  • Not by putting Von Neumann processors in parallel and programming with a sequential language

    • Attempts to do this have failed over and over again…

    • The parallel computer compiler problem is very difficult

  • Not by trying to capture parallelism at the instruction level

    • Superscalar, VLIW, etc… are very inefficient

    • Hardware can’t figure out the parallelism from a sequential language either

      The problem is the initial sequential description (e.g. C) which is poorly matched to highly parallel applications


What is really happening

What is really happening…

Starting with a parallel algorithmic description

Re-entering it using a sequential description

for (i = 0; i < num; i++) {

    a = a * c[i];

    b[i] = sin(a * pi) + cos(a * pi);

}

Outfil = b[i] * indata;

Then try to rediscover the parallelism

We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area ??????


What can a fully parallel cmos solution potentially do

What can a fully parallel CMOS solution potentially do?

In 0.25 micron, a multiplier requires 0.05 mm² and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower in energy.

Let's implement a 50 mm², 0.25 micron chip using adders, registers and multipliers

  • We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip; also assume 1/3 of the power goes into clocks

  • A 25 MHz clock (1 volt) gives ~50 GOPS at 100 mW

  • 500 MOPS/mW and 1000 MOPS/mm²
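The arithmetic behind these figures can be reproduced with a short sketch. It assumes that every adder/register and multiplier completes one operation per clock cycle, which the slide implies but does not state; the variable names are illustrative.

```python
# Figures quoted on the slide (0.25 micron, 1 V).
MULT_AREA_MM2, MULT_ENERGY_PJ = 0.05, 7.0
ADD_AREA_MM2, ADD_ENERGY_PJ = MULT_AREA_MM2 / 10, MULT_ENERGY_PJ / 10
N_MULT, N_ADD = 200, 2000
F_CLK_HZ = 25e6
CHIP_AREA_MM2 = 50.0
CLOCK_POWER_SHARE = 1 / 3

# Assumption: one operation per unit per clock cycle.
datapath_area_mm2 = N_MULT * MULT_AREA_MM2 + N_ADD * ADD_AREA_MM2
ops_per_second = (N_MULT + N_ADD) * F_CLK_HZ
datapath_power_w = F_CLK_HZ * (N_MULT * MULT_ENERGY_PJ + N_ADD * ADD_ENERGY_PJ) * 1e-12
total_power_w = datapath_power_w / (1 - CLOCK_POWER_SHARE)

mops = ops_per_second / 1e6
mops_per_mw = mops / (total_power_w * 1e3)
mops_per_mm2 = mops / CHIP_AREA_MM2

print(f"datapath area : {datapath_area_mm2:.0f} mm2 of {CHIP_AREA_MM2:.0f} mm2")
print(f"throughput    : {ops_per_second / 1e9:.0f} GOPS")
print(f"total power   : {total_power_w * 1e3:.0f} mW")
print(f"efficiency    : {mops_per_mw:.0f} MOPS/mW, {mops_per_mm2:.0f} MOPS/mm2")
```

Under that assumption the result lands close to the rounded values on the slide (about 50 GOPS, 100 mW, 500 MOPS/mW and 1000 MOPS/mm²).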


Start with a parallel description of the algorithm

Start with a parallel description of the algorithm…


Then directly map into hardware

Then directly map into hardware …

[Figure: datapath blocks S reg, X reg, Add/Sub/Shift, Mult1, Mult2, Mac1, Mac2.]


Results in fully parallel solutions

Results in fully parallel solutions

(numbers taken from vendor-published benchmarks)

Orders of magnitude lower efficiency even for an optimized processor architecture


Reasons software solutions seem attractive

Reasons software solutions seem attractive

(1) Believed to reduce time-to-system-implementation

(2) Provides flexibility

(3) Locks the customers into an architecture they can’t change

(4) Difficulty in getting dedicated SOC chips designed

Are these good reasons???


1 believed to reduce time to system implementation

(1) Believed to reduce time-to-system implementation

  • Software decreases time to get first prototype, but time to fully verified system is much longer (hardware is often ready but software still needs to be done)

  • Limitations of the software prototype often set the ultimate limit of the system performance

  • Software solutions can be shipped with bugs, not a real option for SOC


2 need flexibility

(2) Need flexibility

  • Software is not always flexible

    • Can be hard to verify

  • Flexibility does not imply software programmability

    • Domain specific design can have multiple modules, coefficients and local state control (the factor of 100 in efficiency) to address a range of applications

    • Reconfiguration of interconnect can achieve flexibility with high levels of efficiency


Flexibility without software

Flexibility without software

[Figure: two plots, Energy per Transform vs. FFT size and Transforms per Second per mm² vs. FFT size.]

* All results are scaled to 0.18 µm


Reasons software solutions seem attractive1

Reasons software solutions seem attractive

(1) Believed to reduce time-to-system implementation

(2) Provides flexibility

(3) Locks the customers into an architecture they can’t change

(4) Difficulty in getting dedicated SOC chips designed


Standard dsp asic design flow

Standard DSP-ASIC Design Flow

[Figure: four design stages, each with its own representation of the design.]

  • Algorithm Design: floating-point simulation (sequential)

  • System/Architecture Design: fixed-point simulation (mixed sequential & structural)

  • Hardware/Front-End Design: RTL code (integer only, structural w/ sequential leaf-cells)

  • Physical/Back-End Design: mask layout (single-wire connectivity w/ timing constraints)

Problems:

  • Three translations of design data

  • Requirements for re-verification at each stage

  • Uncontrolled looping when pipeline stalls

Prohibitively Long Design Time for Direct Mapped Architectures


Direct mapping design flow

Direct Mapping Design Flow

[Figure: flow diagram. Labels: Algorithm/System; Simulation; Front-End; Back-End; RTL Libraries; Floorplan; Automated Flow; Mask Layout; Performance Estimates.]

  • Encourages iterations of layout

  • Controls looping

  • Reduces the flow to a single phase

  • Depends on fast automation


D j vu

Déjà vu???

  • An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago

  • What happened?

    • A decline of research into design methodologies

    • A single dominant flow has resulted - the Verilog-Synopsys-Standard Cell

    • Lack of tool flows to support alternative styles of design

    • Research community lost access to technology – moved to highly sub-optimal processor and FPGA solutions


Capturing design decisions

Capturing Design Decisions

[Figure: datapath with blocks reg. file, MAC, add, shift, reg. file, Σ.]

Categories:

  • Function - basic input-output behavior

  • Signal - physical signals and types

  • Circuit - transistors

  • Floorplan - physical positions

How to get layout and performance estimates in a day?


Simplified view of the flow

Simplified View of the Flow

[Figure: flow diagram. Labels: dataflow graph; elaborate; netlist; macro library; floorplan; merge; autoLayout; route; layout.]

New Software:

  • Generation of netlists from a dataflow graph

  • Merging of floorplan from last iteration

  • Automatic routing and performance analysis

  • Automation of flow as a dependency graph (UNIX MAKE program)
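Treating the flow as a dependency graph, in the spirit of make, can be sketched as below. The target names and the exact dependencies between them are assumptions read off the figure labels, not the actual tool flow; the sketch uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical targets mapped to their prerequisites; the step that
# produces each target is noted in the comment.
FLOW = {
    "netlist": {"dataflow_graph"},                           # elaborate
    "merged_floorplan": {"netlist", "floorplan"},            # merge
    "placed_design": {"merged_floorplan", "macro_library"},  # autoLayout
    "layout": {"placed_design"},                             # route
}


def build_order(flow):
    """Order in which targets can be built so prerequisites come first."""
    return list(TopologicalSorter(flow).static_order())


def stale_targets(flow, changed):
    """Targets that must be rebuilt after `changed` is modified
    (all transitive dependents, as make would determine)."""
    stale, frontier = set(), {changed}
    while frontier:
        frontier = {t for t, deps in flow.items() if deps & frontier} - stale
        stale |= frontier
    return stale


print(build_order(FLOW))
print(sorted(stale_targets(FLOW, "floorplan")))
```

Editing only the floorplan leaves the netlist untouched, so an iteration re-runs just the merge, layout and routing steps.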


Why simulink

Why Simulink?

  • Simulink is an easy sell to algorithm developers

  • Closely integrated with popular system design tool Matlab

  • Successfully models digital and analog circuits

[Figure: Simulink model of a Time-Multiplexed FIR Filter]


Modeling datapath logic

Modeling Datapath Logic

  • Discrete-time (cycle accurate)

  • Fixed-point types (bit true)

  • Completely specify function and signal decisions

  • No need for RTL

[Figure: Multiply / Accumulate]
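What "cycle accurate" and "bit true" mean for a multiply/accumulate block can be illustrated in plain Python. This is not Simulink and not the authors' model: the class, the 32-bit accumulator width and the wrap-around overflow behavior are assumptions chosen for the sketch.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


class Mac:
    """Cycle-accurate multiply/accumulate: one product is added per tick."""

    def __init__(self, acc_bits=32):
        self.acc_bits = acc_bits
        self.acc = 0

    def tick(self, a, b, reset=False):
        """Advance one clock cycle; `reset` clears the accumulator first."""
        base = 0 if reset else self.acc
        self.acc = wrap_signed(base + a * b, self.acc_bits)
        return self.acc


def fir_sample(mac, coeffs, window):
    """Time-multiplex one MAC over all taps: len(coeffs) ticks per output."""
    out = 0
    for i, (c, x) in enumerate(zip(coeffs, window)):
        out = mac.tick(c, x, reset=(i == 0))
    return out


mac = Mac()
print(fir_sample(mac, [1, 2, 3], [4, 5, 6]))
```

Because every value is an integer of fixed width, the same stimulus gives the same bits in simulation and in hardware, which is what lets the model stand in for RTL.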


Modeling control logic

Modeling Control Logic

  • Extended finite state-machine editor

  • Co-simulation with dataflow graph

  • New software: Stateflow-VHDL translator

  • No need for RTL

[Figure: Address Generator / MAC Reset]


Specifying circuit decisions

Specifying Circuit Decisions

[Figure: Time-Multiplexed FIR Filter. Labels: Black Box (RTL code, or data-path generator code, or custom module); Stateflow-VHDL translator.]

  • Macro choices embedded in dataflow graph

  • Cross-check simulations required


Hierarchy hardened progressively

Hierarchy Hardened Progressively

[Figure: loop between the System-Level Design Environment and the Hard Macro Characterization Libraries: layout and characterize new hard macro; estimate performance (power, area, delay).]

  • Macro characterization saved for fast estimates

  • Each level of hierarchy becomes a new hard macro

  • Higher levels of hierarchy are adjusted

  • When top level of hierarchy is hardened, the design is done


Capturing floorplan decisions

Capturing Floorplan Decisions

[Figure: floorplan of the Parallel Pipelined FIR Filter]

  • Commercial physical design tools used

  • Instance names in floorplan match dataflow graph

  • Placements merged on each iteration

  • Manhattan distance can be used for parasitic estimates
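A parasitic estimate from placement alone might look like the following. The capacitance per micron, the star-shaped net model and the instance coordinates are all hypothetical; the slides only say that Manhattan distance can be used.

```python
CAP_FF_PER_UM = 0.2  # hypothetical wire capacitance per micron of length


def manhattan_um(a, b):
    """Manhattan (rectilinear) distance between two (x, y) placements in microns."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def net_capacitance_ff(driver, sinks, cap_per_um=CAP_FF_PER_UM):
    """Star-model estimate: total driver-to-sink Manhattan length
    multiplied by a unit capacitance."""
    return cap_per_um * sum(manhattan_um(driver, sink) for sink in sinks)


# Illustrative placements keyed by instance name (coordinates in microns).
placement = {"mac0": (0, 0), "reg0": (100, 0), "add0": (0, 50)}
cap = net_capacitance_ff(placement["mac0"], [placement["reg0"], placement["add0"]])
print(f"estimated net capacitance: {cap:.1f} fF")
```

Because instance names in the floorplan match the dataflow graph, such an estimate can be attached to each graph edge before any routing is done.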


Reduced impact of interconnect

Reduced Impact of Interconnect

[Figure: FO4 inverter delay compared with wire delay, 0.18 µm]

Long wires can be modeled as lumped capacitances


Race immune clock tree synthesis

Race-Immune Clock Tree Synthesis

Hierarchical Clock Tree Synthesis

$t_{skew(max)} < t_{clk\text{-}Q(min)} - t_{hold(max)}$

Example clock tree:

  • Stages: 22

  • Sinks: 7650

  • Skew: 320 ps

  • Clock power: 2.8 mW

  • Logic power: 21 mW

Race margin = 580 ps

  • 0.18 µm

  • VDD = 1 V

Demonstrated on a 600k transistor design
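The race condition on this slide (maximum skew must stay below minimum clock-to-Q delay minus maximum hold time) can be checked per flip-flop pair as sketched below. The 320 ps skew is the slide's figure; the clock-to-Q and hold values are hypothetical, and the slide does not say exactly how its 580 ps race margin is defined, so the slack computed here is not claimed to match it.

```python
def race_slack_ps(t_clk_q_min, t_hold_max, t_skew_max):
    """Slack of the race condition t_skew(max) < t_clk-Q(min) - t_hold(max).

    A positive result means the receiving flip-flop cannot capture
    data launched in the same clock cycle."""
    return t_clk_q_min - t_hold_max - t_skew_max


def is_race_immune(t_clk_q_min, t_hold_max, t_skew_max):
    return race_slack_ps(t_clk_q_min, t_hold_max, t_skew_max) > 0


SKEW_PS = 320        # from the slide
CLK_Q_MIN_PS = 650   # hypothetical flip-flop clock-to-Q delay
HOLD_MAX_PS = 70     # hypothetical flip-flop hold time

print(race_slack_ps(CLK_Q_MIN_PS, HOLD_MAX_PS, SKEW_PS),
      is_race_immune(CLK_Q_MIN_PS, HOLD_MAX_PS, SKEW_PS))
```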


Example 1 macro hardening

Example 1: Macro Hardening

Parallel pipelined FIR filter:

  • Area in 0.25 µm: 1.4 mm²

  • Power @ 25 MHz (1 V, PowerMill): 13.0 mW

  • Critical path delay (1 V, PathMill): 18.0 ns

  • Cells: 21 k

  • Transistors: 240 k

  • Execution time: 3 hours (elaborate / route), 9 hours (characterization)

  • Disk space: 180 MB (elaborate / route), 1.5 GB (characterization)

Most time/disk space spent on extraction and power simulation


Example 2 test chip

Example 2: Test Chip

  • 300k transistors

  • 0.25 µm

  • 1.0 V

  • 25 MHz

  • 6.8 mm²

  • 14 mW

  • 2 phase clock

  • 3 layers of P&R hierarchy

Parallel Pipelined FIR Filter (8X decimation filter for 12-bit 200 MHz ΣΔ)


Tdma baseband receiver

TDMA Baseband Receiver

[Figure: chip layout with blocks carrier detection, frequency estimation, rotate & correlate, control.]

  • 600k transistors

  • 0.18 µm

  • 1.0 V

  • 25 MHz

  • 1.1 mm²

  • 21 mW

  • single phase clock

  • 5 clock domains

  • 2 layers of P&R hierarchy


Conclusions

Conclusions

  • Direct-Mapped hardware is the most efficient use of silicon

  • Direct-Mapped hardware can be easier to design and verify than embedded hardware/software systems

  • Don’t translate design data, refine it

  • Design with dataflow graphs, not sequential code

  • Design flow automation speeds up design space exploration


Embedded processor architectures and re configurable computing

Embedded Processor Architectures and (Re)Configurable Computing

Vandana Prabhu

Professor Jan M. Rabaey

Jan 10, 2000


Digital integrated circuits a design perspective

Pico Radio Architecture

  • Embedded uP

  • FPGA

  • Dedicated FSM

  • Dedicated DSP

  • Reconfigurable DataPath


Reconfigurable computing merging efficiency and versatility

Reconfigurable Computing: Merging Efficiency and Versatility

Spatially programmed connection of processing elements.

  • “Hardware” customized to specifics of problem.

    • Direct map of problem specific dataflow, control.

  • Circuits “adapted” as problem requirements change.


Matching computation and architecture

[Figure: a convolution mapped onto two address generators (AddressGen), two memories, and two MAC units, next to a control processor; additional labels L, G, C]

Two architectural models:

sequential control + data-driven

Two models of computation:

communicating processes + data-flow

Matching Computation and Architecture


Implementation fabrics for data processing

Implementation Fabrics for Data Processing

300 million multiplications/sec

357 million add-sub’s/sec

Data In

16 Mmacs/mW!


Software methodology flow

Software Methodology Flow

[Figure: software methodology flow. Labels: Algorithms; Area & Timing Constraints; µproc & Accelerator PDA Models; Kernel Detection; Behavioral Xform’s for low power; Estimation/Exploration; Premapped Kernels; Power & Timing Estimation of Various Kernel Implementations; Partitioning; Executable Intermediate Form; Software Compilation; Reconfig. Hardware Mapping; Interface Code Generation; Interconnect Optimization]

(Marlene Wan)


Maia reconfigurable baseband processor for wireless

Maia: Reconfigurable Baseband Processor for Wireless

  • 0.25um tech: 4.5mm x 6mm

  • 1.2 Million transistors

  • 40 MHz at 1V

  • 1 mW VCELP voice coder

  • Hardware

    • 1 ARM-8

    • 8 SRAMs & 8 AGPs

    • 2 MACs

    • 2 ALUs

    • 2 In-Ports and 2 Out-Ports

    • 14x8 FPGA


Implementation fabrics for protocols

[Figure: protocol state machines and memory. State labels: idle, RACH req, RACH akn, slotset, write, read, update. Memory: Slot_Set_Tbl (2x16) with addr, R_ENA, W_ENA, and BUF. Signals: Slot start, slot_set<31:0>, Slot_no<5:0>, Pkt end]

Implementation Fabrics for Protocols

A protocol = Extended FSM

  • ASIC: 1 V, 0.25 µm CMOS process

  • FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA

  • ARM8: 1 V 25 MHz processor; n = 13,000

  • Ratio: 1 - 8 - >> 400

Idea: Exploit model of computation: concurrent finite state machines, communicating through message passing

Intercom TDMA MAC
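The model of computation named above, concurrent finite state machines communicating through message passing, can be sketched in a few lines. Everything here is hypothetical: the two machines, their states, and the messages ("start", "req", "ack") are invented for illustration and are not the Intercom TDMA MAC protocol itself. The counter stands in for the extra state variables that make an FSM "extended".

```python
from collections import deque


class Fsm:
    """Extended FSM: a control state, one data variable, and a message inbox."""

    def __init__(self, name, transitions, state="idle"):
        self.name = name
        self.state = state
        # (state, message) -> (next_state, reply message or None)
        self.transitions = transitions
        self.inbox = deque()
        self.handled = 0  # extended-state variable: messages handled so far

    def step(self, peer):
        """Consume one pending message; return True if a transition fired."""
        if not self.inbox:
            return False
        message = self.inbox.popleft()
        self.state, reply = self.transitions[(self.state, message)]
        self.handled += 1
        if reply is not None:
            peer.inbox.append(reply)
        return True


requester = Fsm("requester", {
    ("idle", "start"): ("wait_ack", "req"),
    ("wait_ack", "ack"): ("granted", None),
})
granter = Fsm("granter", {("idle", "req"): ("busy", "ack")})

requester.inbox.append("start")
# Run both machines until neither has a message left to process.
while requester.step(granter) or granter.step(requester):
    pass

print(requester.state, granter.state)  # granted busy
```

Each machine only reacts to messages in its own inbox, so the same description maps naturally to separate hardware FSMs, FPGA logic, or software tasks.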


Low power fpga

Low-Power FPGA

  • Low Energy Embedded FPGA (Varghese George)

  • Test chip

    • 8x8 CLB array

    • 5 in - 3 out CLB

    • 3-level interconnect hierarchy

    • 4 mm² in 0.25 µm ST CMOS

    • 0.8 and 1.5 V supply

  • Simulation Results

    • 125 MHz Toggle Frequency

    • 50 MHz 8-bit adder

    • energy 70 times lower than comparable Xilinx


An energy efficient p system

Integrated dc-dc converter

An Energy-Efficient µP System

  • Dynamic Voltage Scaling (Trevor Pering & Tom Burd)

Lower speed, lower voltage, lower energy

[Figure: µProc. speed over time before and after dynamic voltage scaling; the "before" trace includes idle periods]
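The "lower speed, lower voltage, lower energy" argument follows from the first-order CMOS switching energy, which scales with the square of the supply voltage. The sketch below compares running a fixed number of cycles at two supply voltages. The effective capacitance, cycle count, and voltages are hypothetical, and idle and leakage energy are assumed to be zero.

```python
def switching_energy_j(c_eff_f, vdd_v, n_cycles):
    """First-order dynamic energy: E = C_eff * Vdd^2 per cycle, times cycles."""
    return c_eff_f * vdd_v ** 2 * n_cycles


C_EFF_F = 1e-9        # assumed effective switched capacitance per cycle
N_CYCLES = 1_000_000  # the same work is done in both cases

# Before: run fast at high voltage, then sit idle (idle energy assumed zero).
energy_fast = switching_energy_j(C_EFF_F, vdd_v=2.0, n_cycles=N_CYCLES)
# After: stretch the same cycles over the available time at a lower voltage.
energy_slow = switching_energy_j(C_EFF_F, vdd_v=1.0, n_cycles=N_CYCLES)

print(energy_fast / energy_slow)  # halving Vdd cuts switching energy by about 4x
```

The saving is only available when the deadline leaves room to slow down, which is why the voltage has to be adjusted dynamically with the workload.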


Xtensa configurable processor

Xtensa Configurable Processor

  • Xtensa (Tensilica, Inc.) for embedded CPU

    • Configurability allows designer to keep “minimal” hardware overhead

    • ISA (compatible with 32 bit RISC) can be extended for software optimizations

    • Fully synthesizable

    • Complete HW/SW suite

  • VCC modeling for exploration

    • Requires mapping of “fuzzy” instructions of VCC processor model to real ISA

    • Requires multiple models depending on memory configuration

    • ISS simulation to validate accuracy of model

(Vandana Prabhu)


Microprocessor optimizations for network protocols

[Figure: breakdown of total execution time, with memory routines split into calloc, memcpy, and other]

Microprocessor Optimizations for Network Protocols

  • Implements Transport layer on configurable processor

    • TDMA control and channel usage management

  • Upper layer of protocol is dominated by processor control flow

    • Memory routines, Branches, Procedure calls

  • Artifacts of code generation tools are significant

    • Excessively modular code introduces procedure calls

    • Uses dynamic memory allocation

  • Configurable processor

    • Increased size of register file

    • Customized instructions help datapath but not control

Efficient implementation at code generation and architecture levels!

(Kevin Camera & Tim Tuan)


Implementation methodology for reconfigurable wireless protocol

Implementation Methodology for Reconfigurable Wireless Protocol

  • Changing granularity within protocol stack requires estimation tool for energy-efficient implementation

  • Software exploration on processors

    • Exploring Xtensa’s TIE

  • Hardware exploration on FPGA platforms

    • Optimal FPGA architecture

    • Alternately “Reconfigurable FSM” analogous to Pleiades approach for datapath kernels

(Suetfei Li & Tim Tuan)


Tci a first generation piconode

TCI - A First Generation PicoNode

  • Memory Sub-system

  • Tensilica Embedded Proc.

  • Sonics Backplane

  • Programmable Protocol Stack

  • Configurable Logic (Physical Layer)

  • Baseband Processing


The system on a chip nightmare

[Figure: CPU, DSP, DMA, MPEG, memory controller (Mem Ctrl.), and blocks labelled C, I, O, O, joined by a system bus, a bridge, a peripheral bus, custom interfaces, and control wires]

The System-on-a-Chip Nightmare

The “Board-on-a-Chip” Approach

Courtesy of Sonics, Inc


The communications perspective

[Figure: DMA, DSP, CPU, MPEG, MEM, and blocks labelled C, I, O, with labels Open Core Protocol™, SiliconBackplane Agent™, and Guaranteed Bandwidth Arbitration]

Example: “The Silicon Backplane”

(Sonics, Inc)

The Communications Perspective

(Mike Sheets)

Communications-based Design


Summary

Summary

  • Design for low-energy impacts all stages of the design process — the earlier the better

  • Energy reduction requires clear communication and computation abstractions

  • Efficient and abstract modeling of energy at behavior and architecture level is crucial

  • Efficient hardware implementation of protocol stack

  • Beat the SoC monster!


Targeting tiled architectures in design exploration

1 LESTER Lab

Université de Bretagne Sud

Lorient, France

{lilian.bossuet, guy.gogniat, [email protected]

2 Department of Electrical

and Computer Engineering

University of Massachusetts,

Amherst, USA

{burleson, vanand, [email protected]

Targeting Tiled Architectures in Design Exploration

Lilian Bossuet1, Wayne Burleson2, Guy Gogniat1,

Vikas Anand2, Andrew Laffely2, Jean-Luc Philippe1


Design space exploration motivations

Design Space Exploration: Motivations

  • Design solutions for new telecommunication and multimedia applications targeting embedded systems

  • Optimization and reduction of SoC power consumption

  • Increase computing performance

    • Increase parallelism

    • Increase speed

  • Be flexible

    • Take into account run-time reconfiguration

    • Targeting multi-granularity (heterogeneous) architectures


Design space exploration flow

Design Space Exploration: Flow

  • Progressive design space reduction:

    • iterative exploration

    • refinement of architecture model

    • increase of performance estimation accuracy

  • One level of abstraction for one level of estimation accuracy


Reconfigurable architectures

Reconfigurable Architectures

  • Bridging the flexibility gap between ASICs and microprocessors [Hartenstein DATE 2001]

  • Energy-efficient solution for low-power programmable DSP [Rabaey ICASSP 1997, FPL 2000]

  • Run Time Reconfigurable [Compton & Hauck 1999]

  • => A key ingredient for future silicon platforms [Schaumont et al. DAC 2001]


Design space of reconfigurable architecture

Design Space of Reconfigurable Architecture

RECONFIGURABLE ARCHITECTURES (R-SOC)

  • MULTI GRANULARITY (Heterogeneous)

  • FINE GRAIN (FPGA)

  • COARSE GRAIN (Systolic)

[Figure: taxonomy tree. Leaf labels: Tile-Based Architecture; Processor + Coprocessor; Island Topology; Hierarchical Topology; Coarse Grain Coprocessor; Fine Grain Coprocessor; Mesh Topology; Linear Topology; Hierarchical Topology]

  • RAW

  • CHESS

  • MATRIX

  • KressArray

  • Systolix PulseDSP

  • Xilinx Virtex

  • Xilinx Spartan

  • Atmel AT40K

  • Lattice ispXPGA

  • Altera Stratix

  • Altera Apex

  • Altera Cyclone

  • Chameleon

  • REMARC

  • Morphosys

  • Pleiades

  • Garp

  • FIPSOC

  • Triscend E5

  • Triscend A7

  • Xilinx Virtex-II Pro

  • Altera Excalibur

  • Atmel FPSIC

  • aSoC

  • E-FPFA

  • Systolic Ring

  • RaPiD

  • PipeRench

  • DART

  • FPFA


A target architecture asoc

A Target Architecture: aSoC

  • Adaptive System-on-a-Chip (aSoC)

  • Tiled architecture containing many heterogeneous processing cores (RISC, DSP, FPGA, Motion Estimation, Viterbi Decoder)

  • Mesh communication network controlled with statically determined communication schedule

  • A scalable architecture.


Fpga in system on a chip

FPGA in System-on-a-Chip

  • Fast Time-To-Market

  • Post-Fabrication Customization

    • Broaden application domain

    • Run-time Reconfiguration

    • Bug Fixes

    • Upgrades

  • 10x-100x Worse:

    • Area

    • Performance

    • Power

Mark L. Chang


Asoc architecture

North

West

East

ctrl

  • Point-to-point connections

  • Communication Interface

South

Core

aSoC Architecture

tile

  • Heterogeneous Cores

uProc

MUL

FPGA

MUL


Asoc communications interface

aSoC Communications Interface

  • Interface Crossbar

    • inter-tile transfer

    • tile to core transfer

  • Interconnect/Instruction Memory

    • contains instructions to configure the interface crossbar (cycle-by-cycle)

  • Interface Controller

    • selects the instruction

  • Coreports

    • data interface and storage for transfers with the tile IP core

  • Dynamic Voltage and Frequency Selection

    • Dynamic Power Management

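[Figure: communication interface. The core connects through coreports to the interface crossbar, which has North, South, East, West, and local inputs and outputs. A decoder, a local frequency & voltage configuration block, and a controller with a PC and instruction memory drive the crossbar; example instruction: North to South & East]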
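The statically scheduled interface described on this slide (an instruction memory that reconfigures the crossbar cycle by cycle, stepped by a controller) can be sketched as below. The port names follow the slide; the data structure, the repeating program counter, and the three-entry schedule are assumptions made for illustration.

```python
# Each instruction maps an output port to the input port that drives it
# during that cycle. The first entry mirrors the "North to South & East"
# label on the slide; the other two are invented.
INSTRUCTION_MEMORY = [
    {"south": "north", "east": "north"},
    {"core": "west"},
    {"north": "core"},
]


def run_crossbar(instruction_memory, inputs_per_cycle):
    """Apply a repeating static schedule to a stream of per-cycle inputs."""
    outputs = []
    for cycle, inputs in enumerate(inputs_per_cycle):
        pc = cycle % len(instruction_memory)  # controller selects the instruction
        instruction = instruction_memory[pc]
        # Route each selected input to its output; unconnected ports stay silent.
        outputs.append({out: inputs.get(src) for out, src in instruction.items()})
    return outputs


stream = [{"north": 7}, {"west": 3}, {"core": 9}, {"north": 1}]
for cycle, routed in enumerate(run_crossbar(INSTRUCTION_MEMORY, stream)):
    print(cycle, routed)
```

Because the schedule is fixed ahead of time, no arbitration or routing decision is made at run time, which is what makes the transfers predictable.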


Asoc exploration

aSoC Exploration ...

  • Type of tiles

  • Number of each type of tile

  • Placement of the tiles

  • Internal architecture of reconfigurable tiles (FPGA core)

  • Communication scheduling


Design space exploration goals

Design Space Exploration: Goals

  • Goal: Rapid exploration of various architectural solutions to be implemented on heterogeneous reconfigurable architectures (aSoC) in order to select the most efficient architecture for one or several applications

  • Takes place before architectural synthesis (algorithmic specification in a high-level language)

  • Estimations are based on a functional architecture model (generic, technology-independent)

  • Iterative exploration flow to progressively refine the architecture definition, from a coarse model to a dedicated model


Design exploration flow targeting tiled architecture

[Figure: design exploration flow. A C specification passes through a C to HCDFG parser to give the HCDFG graphs of the application. A model of the aSOC architectures (aSOC, tiles T1 and T2, functions F1 and F2) and a model of the application (App, functions F1 and F2) are labelled THF Model and HF Model. Steps: Application Analysis, Tile Exploration, aSOC Builder (static communication scheduling), and aSOC Analysis, leading to the final model of the aSOC architecture]

Results of the Tile exploration step

Function | Tile | Performance

F1 | T1 | T11, C11, Occ11

F1 | T2 | T21, C21, Occ21

F2 | T1 | T12, C12, Occ12

F2 | T2 | T22, C22, Occ22

Design Exploration Flow Targeting Tiled Architecture


Application analysis

Use of algorithmic metrics and dedicated scheduling algorithms to highlight the target architectures

  • Algorithmic metrics:

    • Characterize the application orientation

      • Processing

      • Memory

      • Control

    • Characterize the application potential parallelism

      • Processing

      • Memory

Application Analysis
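The slides do not define the algorithmic metrics, so the following is only a guess at the simplest possible form of an "orientation" metric: the fraction of operations in the application graph that fall into the processing, memory, and control classes. The operation names and their classification are assumptions.

```python
from collections import Counter

# Hypothetical classification of graph operations into the three classes.
CATEGORY = {
    "add": "processing",
    "mul": "processing",
    "load": "memory",
    "store": "memory",
    "branch": "control",
}


def orientation(ops):
    """Return the share of processing, memory, and control operations."""
    counts = Counter(CATEGORY[op] for op in ops)
    total = sum(counts.values())
    return {cat: counts.get(cat, 0) / total
            for cat in ("processing", "memory", "control")}


ops = ["load", "load", "mul", "add", "store", "branch", "mul", "add"]
print(orientation(ops))
```

An application dominated by the processing share would point toward datapath-oriented tiles, while a high control share would point toward a processor tile.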


Tile exploration with 3 steps

  • Projection:

    • Link between necessary resources (application) and available resources (tile)

    • Use of an allocation algorithm based on communication cost reduction

  • Composition:

    • Takes the function scheduling into account to estimate additional resources (register, mux, …)

  • Estimation:

    • Performance interval computation (lower and upper bounds)

    • Speed/resource utilization/power characterization

Tile Exploration: with 3 steps


Asoc builder

  • Environment AppMapper

  • Partition and assignment

    • based on Run Time Estimation

  • Compilation

    • Communication Scheduling

    • Core compilation

  • Generate tiles configuration

    • Communications instructions

    • Bitstreams (for reconfigurable tile)

    • RISC instructions

aSoC Builder


Asoc analysis

  • Use the results of previous steps

    • Functions scheduling

    • Tile allocation

    • Communication scheduling

  • Complete estimation of the proposed solution

    • Global execution time

    • Global power consumption

    • Total area

aSoC Analysis


Power aware system on a chip

Power-Aware System on a Chip

A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson

University of Massachusetts Amherst

Boston Area Architecture Conference

30 Jan 2003

{alaffely, jliang, tessier, moritz, [email protected]

This material is based upon work supported by the National Science Foundation under Grant No. 9988238.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Adaptive system on a chip

[Figure: array of tiles. Each tile has a core (µProc, Multiplier, or FPGA) and a communication interface (ctrl) with North, South, East, and West ports]

Adaptive System-on-a-Chip

  • Tiled architecture with mesh interconnect

    • Point to point communication pipeline

  • Allows for heterogeneous cores

    • Differing sizes, clock rates, voltages

  • Low-overhead core interface

    • On-chip bus substitute for streaming applications

  • Based on static scheduling

    • Fast and predictable


Asoc implementation

aSoC Implementation

  • 0.18 µm technology

  • Full custom

  • Layout dimensions: 2500 λ, 3000 λ


Some results

Some Results

  • 9 and 16 core systems tested for IIR, MPEG encoding and Image processing applications

    • ~2x the performance of the CoreConnect bus (burst and hierarchical)

    • ~1.5x the performance of an oblivious routing network [1] (dynamic routing)

    • Max speedup is 5x

1. W. Dally and H. Aoki, “Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels”, IEEE Transactions on Parallel and Distributed Systems, April 1993

