Digital Integrated Circuits: A Design Perspective

System on a Chip Design

Application Specific Integrated Circuits: Introduction

Jun-Dong Cho

SungKyunKwan Univ.

Dept. of ECE, Vada Lab.

http://vada.skku.ac.kr


Contents

  • Why ASIC?

  • Introduction to System On Chip Design

  • Hardware and Software Co-design

  • Low Power ASIC Designs


Why ASIC: Design Productivity Grows!

Complexity increases 40% per year

Design productivity increases 15% per year

  • Integration of PCB on single die
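The two growth rates above compound, so the gap between what can be built and what can be designed widens every year. A minimal sketch of that arithmetic, assuming both quantities are normalized to 1 in year 0 (the function name `design_gap` is illustrative and not from the slides):

```python
def design_gap(years, complexity_growth=0.40, productivity_growth=0.15):
    """Ratio of chip complexity to design productivity after `years`.

    Both quantities are assumed to start at 1.0 in year 0 and to grow at a
    constant annual rate (40% and 15% per year, as quoted on the slide).
    """
    return ((1 + complexity_growth) / (1 + productivity_growth)) ** years


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"after {n:2d} years the gap is x{design_gap(n):.2f}")
```

Under these assumptions the gap grows by roughly 22% per year, which is one way to read the slide's motivation for moving to higher levels of integration and reuse.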


Silicon in 2010

Die Area: 2.5 x 2.5 cm

Voltage: 0.6 V

Technology: 0.07 µm


ASIC Principles

  • Value-added ASIC for huge volume opportunities; standard parts for quick time to market applications

  • Economics of Design

    • Fast Prototyping, Low Volume

    • Custom Design, Labor Intensive, High Volume

  • CAD Tools Needed to Achieve the Design Strategies

    • System-level design: Concept to VHDL/C

    • Physical design: VHDL/C to silicon, timing closure (Monterey, Magma, Synopsys, Cadence, Avant!)

  • Design Strategies: Hierarchy; Regularity; Modularity; Locality


ASIC Design Strategies

  • Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters.

  • Performance specs: function, timing, speed, power

  • Size of die: manufacturing cost

  • Time to design: engineering cost and schedule

  • Ease of test generation and testability: engineering cost, manufacturing cost, schedule



Structured ASIC Designs

  • Hierarchy: Subdivide the design into many levels of sub-modules

  • Regularity: Subdivide into the maximum number of similar sub-modules at each level

  • Modularity: Define sub-modules unambiguously, with well-defined interfaces

  • Locality: Maximize local connections, keeping critical paths within module boundaries


ASIC Design Options

  • Programmable Logic

  • Programmable Interconnect

  • Reprogrammable Gate Arrays

  • Sea of Gates & Gate Array Design

  • Standard Cell Design

  • Full Custom Mask Design

  • Symbolic Layout

  • Process Migration - Retargeting Designs



Why SOC?

  • SOC specs are coming from system engineers rather than from RTL descriptions

  • SOC will bridge the gap between hardware/software and their implementation in novel, energy-efficient silicon architectures.

  • In SOC design, chips are assembled at the level of IP blocks (reusable designs) and IP interfaces rather than at gate level


CMOS density now allows complete System-on-a-chip Solutions

[Figure: single-chip phone block diagram. Blocks: µP core, DSP core, dedicated logic, RAM & ROM, DMA, keypad interface, phone book, S/P, control, protocol, demodulation and sync, Viterbi equalizer, de-interleaver & decoder, RPE-LTP speech decoder, voice recognition, speech quality enhancement, digital down-conversion, A/D, analog]

Also like to add:

  • FPGA

  • Reconfigurable Interconnect

Source: Brodersen, ICASSP ‘98

How do we design these chips?


Possible Single-Chip Radio Architectures

Software Radio

  • GOAL: Simplify System Design Process. Seek architectures which are flexible such that hardware and protocols can be designed independently

  • APPROACH: Minimize the use of dedicated logic

Universal Radio

  • GOAL: Maximize Bandwidth Efficiency and Battery Life. Seek architectures which perform complex algorithms very fast with minimal energy

  • APPROACH: Minimize the use of programmable logic

Why is SOC design so scary?


60 GHz SiGe Transceiver for Wireless LAN Applications

A low power 30 GHz LNA is designed as the front end of the receiver.

Wideband and high gain response is realized by a 2-stage design using a stagger-tuned technique.

The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range with an input match of |S11| < -30 dB and output match of |S22| < -10 dB.

The mixer consists of a single balanced Gilbert cell.

A fully-integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF.

[Figure: 30 GHz receiver layout consisting of the LNA, mixer and VCO]


Wideband CMOS LC VCO

A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured.

This VCO utilizes a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V.

The amplitude reference level is programmable by means of a 3-bit DAC.

[Figure: VCO die photograph]


A High Level View of an Industry Standard Design Flow

[Flow diagram: Front-End (HDL Entry, Synthesis), then Back-End (Floor-plan, Place & Route, Physical Verification with DRC & LVS); each stage is followed by a "good?" check before the flow reaches "done"]

Source: Hitachi, Prof. R. W. Brodersen

Problems with this flow:

  • Every step can loop to every other step

  • Each step can take hours or days for a 100,000 line description

  • HDL description contains no physical information

  • Different engineers handle the front-end and back-end design

How have semiconductor companies made this flow work?


A More Accurate Picture of the Standard Flow

Architecture: 10 months

Front-End: 10 months

Back-End: 2 months

Fabrication: 2 months

Source: IBM Semiconductor, Prof. R. Newton

  • Architecture: Partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit. TOOLS: Matlab, C, SPW, (VCC). FREEZE the test vectors.

  • Front-End: Enter HDL code which matches the test vectors. TOOLS: HDL Simulators, Design Compiler. FREEZE the HDL code.

  • Back-End: Create a floor-plan and tweak the tools until a successful mask layout is created. TOOLS: Design Compiler, Floor-planners, Placers, Routers, Clock-tree generators, Physical Verification.

How can we improve this flow?


Common Fabric for IP Blocks

  • Soft IP blocks are portable, but not as predictable as hard IP.

  • Hard IP blocks are very predictable since a specific physical implementation can be characterized, but are hard to port since they are often tied to a specific process.

  • Common fabric is required for both portability and predictability.

  • Wide availability: Cell Based Array, metal programmable architecture that provides the performance of a standard cell and is optimized for synthesis.


Four Main Applications

  • Set-top box: Mobile multimedia system, base station for the home local-area network.

  • Digital PCTV: concurrent use of TV, 3D graphics, and Internet services

  • Set-top box LAN service: Wireless home-networks, multi-user wireless LAN

  • Navigation system: steer and control traffic and/or goods-transportation

  • CMPR is a multipurpose program that can be used for displaying diffraction data, manual- & auto-indexing, peak fitting and other


PC-Multimedia Applications


Types of System-on-a-Chip Designs


Physical Gap

  • Timing closure problem: layout-driven logic and RT-level synthesis

  • Energy efficiency requires locality of computation and storage: a match for stream-based data processing of speech, images, and multimedia-system packets.

  • Next generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.


Circular Y-Chart


SOC Co-Design Challenges

  • Current systems are complex and heterogeneous: they contain many different types of components

  • Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses and embedded in 20 Mbytes of distributed DRAM and flash memory; the other half is ASIC

  • Computational power will not result from multi-GHz clocking but from parallelism, with clock rates below 200 MHz.

  • This will greatly simplify the design for correct timing, testability, and signal integrity.


Bridging the Architectural Gap

  • One-M gate reconfigurable, one-M gate hardwired logic.

  • 50 GIPS for programmable components or 500 GIPS for dedicated hardware

  • Product reliability: design at a level far above the RT level, with reuse factors in excess of 100

  • Trade-off: 100 MOPS/watt (microprocessor) vs. 100 GOPS/watt (hardwired). Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)


Why Lower Power

  • Portable systems: long battery life, light weight, small form factor

  • IC priority list: power dissipation, cost, performance

  • Technology direction: Reduced voltage/power designs based on mature high performance IC technology, high integration to minimize size, cost, power, and speed


Microprocessor Power Dissipation

[Figure: Power (W), 0 to 50, versus year, 1980 to 2000. Data points: i286, i386 DX 16, i486 DX25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, Alpha 21264]


Levels for Low Power Design


Power-hungry Applications

  • Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management

  • Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders


New Computing Platforms

  • SOC power efficiency more than 10 GOPS/W

    • Higher On Chip System Integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers)

    • Speed & Performance: shorter interconnection, fewer drivers, faster devices, more efficient processing architectures

  • Mixed signal systems

  • Reuse of IP blocks

  • Multiprocessor, configurable computing

  • Domain-specific, combined memory-logic


Low Power Design Flow I

[Flow diagram. Recoverable labels: System-Level Function Specification; System Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Software Functions; Power-driven Behavioral Transformation; Processor Selection; Behavioral-Level Power Analysis; Power Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Optimization; Software-Level Power Analysis; To RT-Level Design]


Low Power Design Flow II

[Flow diagram. Recoverable labels: RT-level Description; Controller; Data-path; Processor; Memory; Control and Steering Logic; Logic Synthesis and Optimization; Gate-Level Power Analysis; RTL Library; RTL Macrocells; Gate-level Description; High-Level Synthesis and Optimization; library mapping; Standard cell Library; Switch-Level Power Analysis; Switch-level Description]


Three Factors Affecting Energy

  • Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing

  • All-in-one approach (SOC): I/O pin and buffer reduction

  • Voltage-reducible hardware

    • 2-D pipelining (systolic arrays)

    • SIMD: parallel processing, useful for data with parallel structure

    • VLIW approach: flexible


IBM’s PowerPC Lower Power Architecture

  • Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution

    • 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)

    • FPU is pipelined so a multiply-add instruction can be issued every clock cycle

    • Low power 3.3-volt design

  • Use small complex instruction with smaller instruction length

    • IBM’s PowerPC 603e is RISC

  • Superscalar: CPI < 1

    • 603e issues as many as three instructions per cycle

  • Low Power Management

    • 603e provides four software controllable power-saving modes.

  • Copper Processor with SOI

  • IBM’s Blue Logic ASIC: new design reduces power by a factor of 10


Power-Down Techniques

Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
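The claim above can be illustrated with the usual first-order CMOS model, in which the dynamic energy of one operation is proportional to C·VDD² and does not depend on clock frequency. A minimal sketch under that assumption (the effective capacitance and the two supply voltages are hypothetical values chosen for illustration):

```python
def energy_per_op(c_eff_farads, vdd_volts):
    """First-order dynamic energy of one operation: E = C * VDD^2 (joules)."""
    return c_eff_farads * vdd_volts ** 2


def task_energy(n_ops, c_eff_farads, vdd_volts):
    """Energy for a fixed amount of work; the clock rate does not appear."""
    return n_ops * energy_per_op(c_eff_farads, vdd_volts)


if __name__ == "__main__":
    C_EFF = 1e-9      # hypothetical effective switched capacitance per op
    N_OPS = 1_000_000
    full = task_energy(N_OPS, C_EFF, 3.3)    # nominal supply
    half = task_energy(N_OPS, C_EFF, 1.65)   # supply halved with the clock
    print(f"energy ratio after halving VDD: {half / full:.2f}")
```

Slowing the clock alone only spreads the same energy over a longer time; in this model it is the lower voltage that reduces the energy for the fixed workload.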


Implementing Digital Systems


H/W and S/W Co-design


Three Co-Design Approaches

  • Source: N.S. Voros et al., “Hardware-software co-design of embedded systems using multiple formalisms for application development”, IFIP International Conference FORTE/PSTV’98, Nov. ’98

  • ASIP co-design: builds a specific programmable processor for an application, and translates the application into software code. H/w and s/w partitioning includes the instruction set design.

  • H/w s/w synchronous system co-design: s/w processor as a master controller, and a set of h/w accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma

  • H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation and communication transformation. CoWare (powerful), Siera (reuse), Ptolemy (DSP)


Mixing H/W and S/W

  • Argument: Mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.

  • Counterpoint: From a design standpoint, it is the worst of both worlds

    • Simulation: Problems of verification and test become harder

    • Interface: Too many tools, too many interactions, too much heterogeneity

    • Hardware/software partitioning is “AI-complete”!

    • (MIT, Stanford: by analogy with "NP-complete") A term used to describe problems in artificial intelligence, to indicate that the solution presupposes a solution to the "strong AI problem" (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is just too hard.


Low Power Partitioning Approach

  • Different HW resources are invoked according to the instruction executed at a specific point in time

  • During the execution of an add operation, the ALU and register are used, but the multiplier is idle.

  • Non-active resources still consume energy since the corresponding circuits continue to switch

  • Calculate the wasted energy

  • Add application-specific cores and run them selectively: whenever one core is executing, all the other cores are shut down
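The "calculate the wasted energy" step can be sketched as a sum, over an instruction trace, of the energy burned by the resources each instruction leaves idle. This is a minimal illustration only: the resource names, the per-cycle idle energies, and the instruction-to-resource table are hypothetical and not taken from the slides.

```python
# Hypothetical energy (pJ per cycle) a resource burns even when it is not used.
IDLE_ENERGY_PJ = {"alu": 2.0, "register": 1.0, "multiplier": 5.0}

# Hypothetical map from instruction to the resources it actually uses.
USES = {
    "add": {"alu", "register"},
    "mul": {"multiplier", "register"},
}


def wasted_energy_pj(trace):
    """Sum the energy of resources left idle by each instruction in `trace`."""
    total = 0.0
    for instr in trace:
        idle = IDLE_ENERGY_PJ.keys() - USES[instr]
        total += sum(IDLE_ENERGY_PJ[r] for r in idle)
    return total


if __name__ == "__main__":
    print(wasted_energy_pj(["add", "add", "mul"]), "pJ wasted")
```

A partitioner in the spirit of the slide would compare this figure against the cost of adding a separately powered application-specific core whose idle neighbours can be shut down.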


ASIP (Application Specific Instruction Processors) Design

  • Given a set of applications, determine the micro architecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set)

  • To accurately evaluate the performance of a processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.

  • The micro architecture of the processor is a design parameter!


ASIP Design Flow


Cross-Disciplinary Nature

  • Software for low power: loop transformation leads to much higher temporal and spatial locality of data.

  • Code size becomes an important objective: software will eventually become a part of the chip

  • Behavior-platform-compiler codesign: codesigned with C++ or JAVA, describing their h/w and s/w implementation.

  • Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english)


VLSI Signal Processing Design Methodology

  • pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering

  • bit-serial, bit-parallel and digit-serial architectures, carry save architecture

  • redundant and residue systems

  • Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems


Low Power DSP

  • DO-LOOP dominant

    • VSELP Vocoder: 83.4%

    • 2D 8x8 DCT: 98.3%

    • LPC computation: 98.0%

DO-LOOP power minimization ==> DSP power minimization

VSELP: Vector Sum Excited Linear Prediction

LPC: Linear Prediction Coding


Deep-Submicron Design Flows

  • Rapid evaluation of complex designs for area and performance

  • Timing convergence via estimated routing parasitics

  • In-place timing repair without resynthesis

  • Shorter design intervals, minimum iterations

  • Block-level design and place and route

  • Localized changes without disturbance

  • Integration of complex projects and design reuse


SOC CAD Companies

Avant! www.avanticorp.com

Cadence www.cadence.com

Duet Tech www.duettech.com

Escalade www.escalade.com

LogicVision www.logicvision.com

Mentor Graphics www.mentor.com

Palmchip www.palmchip.com

Sonics www.sonicsinc.com

Summit Design www.summit-design.com

Synopsys www.synopsys.com

Topdown design solutions www.topdown.com

Xynetix Design Systems www.xynetix.com

Zuken-Redac www.redac.co.uk


Design Technology for Low Power Radio Systems

Rhett Davis

Dept. of EECS

Univ. of Calif.

Berkeley

http://bwrc.eecs.berkeley.edu


Domain of Interest

  • Highly integrated system-on-a-chip solutions – SOC’s

  • Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc…

  • Primary computation is high complexity dataflow with a relatively small amount of control


Why Systems-on-a-Chip (SOC)?

State-of-the-Art CMOS is easily able to implement complete systems (or what was on a board before)

  • A microprocessor core is only 1-2 mm² (1-2% of the area of a $4 chip)

  • Portability (size) is critical to meet the cost, power and size requirements of future wireless systems

  • Chips will be required to support the complete application (wireless internet, multimedia)

  • Dedicated stand-alone computation is replacing general purpose processors as the semiconductor industry driver


Cellular Phones: An example

[Figure: phone chip-set partition: Small Signal RF, Power RF, Power Management, Analog Baseband, Digital Baseband (DSP + MCU)]

Digital Cellular Market (Phones Shipped)

Year   1996  1997  1998  1999  2000
Units  48M   86M   162M  260M  435M

(Courtesy Mike McMahon, Texas Instruments)


Cellular Phone Baseband SOC

[Figure: baseband SOC die with ROM, MCU, DSP, Gates, RAM and Analog blocks]

2000+ phones on each 8” wafer @ .15 Leff

1 Million Baseband Chips per Day!!!

(Courtesy Mike McMahon, Texas Instruments)


Wireless System Design Issues

  • It is now possible to use CMOS to integrate all digital radio functions – but what is the “best” architectural way to use CMOS???

  • Computation rates for wireless systems will easily range up to 100’s of GOPS in signal processing

    • What’s keeping us from achieving this in silicon?

    • What can we do about it?


Computational Efficiency Metrics

  • Definition: MOPS

    • Millions of algorithmically defined arithmetic operations (e.g. multiply, add, shift) per second; a GP processor needs several instructions per “useful” operation

  • Figures of merit

    • MOPS/mW: energy efficiency (battery life)

    • MOPS/mm²: area efficiency (cost)

      Optimization of these “efficiencies” is the basic goal, assuming functionality is met


Energy-Efficiency of Architectures

[Figure: Energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage)]

  • Dedicated HW, direct mapped: 100-1000 MOPS/mW

  • Reconfigurable processor/logic, reconfiguration (???): potential of 10-100 MOPS/mW

  • ASIPs, DSPs: 1-10 MIPS/mW

  • Embedded µProcessors: 0.1-1 MIPS/mW


Software Processors: Energy Trends

[Figure: clock frequency (MHz), 0 to 300, versus year, 1991 to 1996. Data points: i386, i386C-33, i486C-33, 486-66, DX4 100, PP-66, PP-100, PP-133, PP166, PPro-150, PPro200, PPC 601-80, PPC603e-100, PPC 604-120, SuperSparc2-90, UltraSparc-167, HP PA7200, HP PA8000, MIPS R4400, MIPS R5000, MIPS R10000, A21064A, A21164-300]

Primary means of performance increase of software processors has been by increasing clock rate

Decreasing Energy Efficiency

$E \propto C \cdot V_{DD}^2$


Software Processors: Area Trends

  • Increasing clock rate results in a memory bottleneck, addressed by bringing memory on-chip

  • Area is increasingly dominated by memory, degrading MOPS/mm²

16x16 multiplier (0.05 mm²)

DSP processor with 1 multiplier (25 mm²)

Why time multiplex to save area if the overhead is much greater than the area saved?


Parallelism is the answer, but …

  • Not by putting Von Neumann processors in parallel and programming with a sequential language

    • Attempts to do this have failed over and over again…

    • The parallel computer compiler problem is very difficult

  • Not by trying to capture parallelism at the instruction level

    • Superscalar, VLIW, etc… are very inefficient

    • Hardware can’t figure out the parallelism from a sequential language either

      The problem is the initial sequential description (e.g. C) which is poorly matched to highly parallel applications


What is really happening…

Starting with a parallel algorithmic description

Re-entering it using a sequential description:

for (i = 0; i < num; i++) {

a = a * c[i];

b[i] = sin (a * pi) + cos(a*pi);

};

Outfil = b[i] * indata;

Then try to rediscover the parallelism

We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area?


What can a fully parallel CMOS solution potentially do?

In 0.25 micron a multiplier requires 0.05 mm² and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy.

Let's implement a 50 mm², 0.25 micron chip using adders, registers and multipliers

  • We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip; also assume 1/3 of power goes into clocks

  • 25 MHz clock (1 volt) gives ~50 Gops at 100 mW

  • 500 MOPS/mW and 1000 MOPS/mm²
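The back-of-envelope numbers on this slide can be reproduced from the per-unit figures it quotes. The sketch below is one way to do that; it assumes every adder/register and multiplier completes one operation per clock cycle (an assumption not stated on the slide), and the function and parameter names are illustrative.

```python
def parallel_chip_estimate(
    n_mult=200,
    n_add=2000,
    mult_area_mm2=0.05,       # multiplier area in 0.25 micron
    mult_energy_pj=7.0,       # multiplier energy per operation at 1 V
    small_factor=10.0,        # adders/registers: ~10x smaller, ~10x lower energy
    clock_hz=25e6,
    chip_area_mm2=50.0,
    clock_power_fraction=1 / 3,
):
    """Estimate throughput, power and efficiency of a fully parallel datapath.

    Assumes each unit performs one operation per clock cycle.
    """
    add_area_mm2 = mult_area_mm2 / small_factor
    add_energy_pj = mult_energy_pj / small_factor

    datapath_area = n_mult * mult_area_mm2 + n_add * add_area_mm2
    ops_per_s = (n_mult + n_add) * clock_hz
    energy_per_cycle_j = (n_mult * mult_energy_pj + n_add * add_energy_pj) * 1e-12
    datapath_power_w = energy_per_cycle_j * clock_hz
    # Clocks take a fixed fraction of the total, so scale the datapath power up.
    total_power_w = datapath_power_w / (1 - clock_power_fraction)

    mops = ops_per_s / 1e6
    return {
        "datapath_area_mm2": datapath_area,
        "gops": ops_per_s / 1e9,
        "power_mw": total_power_w * 1e3,
        "mops_per_mw": mops / (total_power_w * 1e3),
        "mops_per_mm2": mops / chip_area_mm2,
    }


if __name__ == "__main__":
    for name, value in parallel_chip_estimate().items():
        print(f"{name}: {value:.1f}")
```

With the slide's inputs this gives about 20 mm² of datapath (under half of the 50 mm² die), 55 Gops, 105 mW, roughly 524 MOPS/mW and 1100 MOPS/mm², in line with the rounded figures quoted above.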


Start with a parallel description of the algorithm…


Then directly map into hardware …

[Figure: direct-mapped datapath with S reg, X reg, Add/Sub/Shift, Mult1, Mult2, Mac1 and Mac2 blocks]


Results in fully parallel solutions

(numbers taken from vendor-published benchmarks)

Orders of magnitude lower efficiency even for an optimized processor architecture


Reasons software solutions seem attractive

(1) Believed to reduce time-to-system-implementation

(2) Provides flexibility

(3) Locks the customers into an architecture they can’t change

(4) Difficulty in getting dedicated SOC chips designed

Are these good reasons???


(1) Believed to reduce time-to-system implementation

  • Software decreases the time to get a first prototype, but the time to a fully verified system is much longer (hardware is often ready but software still needs to be done)

  • Limitations of the software prototype often set the ultimate limit of the system performance

  • Software solutions can be shipped with bugs, not a real option for SOC


(2) Need flexibility

  • Software is not always flexible

    • Can be hard to verify

  • Flexibility does not imply software programmability

    • Domain specific design can have multiple modules, coefficients and local state control (the factor of 100 in efficiency) to address a range of applications

    • Reconfiguration of interconnect can achieve flexibility with high levels of efficiency


Flexibility without software

[Figures: Energy per Transform vs. FFT size; Transforms per Second per mm² vs. FFT size]

* All results are scaled to 0.18 µm


Reasons software solutions seem attractive

(1) Believed to reduce time-to-system implementation

(2) Provides flexibility

(3) Locks the customers into an architecture they can’t change

(4) Difficulty in getting dedicated SOC chips designed


Standard DSP-ASIC Design Flow

  • Algorithm Design: Floating-Point Simulation (sequential)

  • System/Architecture Design: Fixed-Point Simulation (mixed sequential & structural)

  • Hardware/Front-End Design: RTL Code (integer only, structural with sequential leaf-cells)

  • Physical/Back-End Design: Mask Layout (single-wire connectivity with timing constraints)

Problems:

  • Three translations of design data

  • Requirements for re-verification at each stage

  • Uncontrolled looping when pipeline stalls

Prohibitively Long Design Time for Direct Mapped Architectures


Direct Mapping Design Flow

[Flow diagram. Recoverable labels: Algorithm/System Simulation; Front-End; Back-End; Floorplan; RTL Libraries; Automated Flow; Mask Layout; Performance Estimates]

  • Encourages iterations of layout

  • Controls looping

  • Reduces the flow to a single phase

  • Depends on fast automation


Déjà vu???

  • An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago

  • What happened?

    • A decline of research into design methodologies

    • A single dominant flow has resulted - the Verilog-Synopsys-Standard Cell

    • Lack of tool flows to support alternative styles of design

    • Research community lost access to technology – moved to highly sub-optimal processor and FPGA solutions


Capturing Design Decisions

[Figure: example datapath with reg. file, MAC, add, shift, reg. file and S blocks]

Categories:

  • Function: basic input-output behavior

  • Signal: physical signals and types

  • Circuit: transistors

  • Floorplan: physical positions

How to get layout and performance estimates in a day?


Simplified View of the Flow

[Flow diagram. Recoverable labels: dataflow graph, elaborate, netlist, macro library, floorplan, merge, autoLayout, route, layout]

New Software:

  • Generation of netlists from a dataflow graph

  • Merging of floorplan from last iteration

  • Automatic routing and performance analysis

  • Automation of flow as a dependency graph (UNIX MAKE program)
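Running the flow as a dependency graph, in the way the UNIX make program does, means each artifact is rebuilt only when something it depends on has changed. A minimal sketch of the idea using Python's standard `graphlib`; the artifact names and the edges between them are assumptions pieced together from the slide's labels, not the authors' actual tool:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each target maps to the artifacts it is built from.
DEPS = {
    "netlist": {"dataflow_graph", "macro_library"},   # "elaborate" step
    "merged_floorplan": {"netlist", "floorplan"},     # "merge" step
    "layout": {"merged_floorplan"},                   # "autoLayout" and "route"
    "performance_estimates": {"layout"},
}


def build_order(deps):
    """Return every artifact in an order that respects the dependencies."""
    return list(TopologicalSorter(deps).static_order())


def stale_targets(deps, changed):
    """Targets to rebuild, in build order, after `changed` is modified."""
    stale = {changed}
    order = build_order(deps)
    for target in order:
        # A target is stale if any artifact it is built from is stale.
        if deps.get(target, set()) & stale:
            stale.add(target)
    return [t for t in order if t in stale and t != changed]


if __name__ == "__main__":
    print("full build order:", build_order(DEPS))
    print("after editing the floorplan:", stale_targets(DEPS, "floorplan"))
```

In this sketch, editing only the floorplan leaves the netlist untouched and reruns the merge, layout and estimation steps, which is the property that keeps each design iteration short.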


Why Simulink?

[Figure: Simulink model of a Time-Multiplexed FIR Filter]

  • Simulink is an easy sell to algorithm developers

  • Closely integrated with popular system design tool Matlab

  • Successfully models digital and analog circuits


Modeling Datapath Logic

  • Discrete-Time (cycle accurate)

  • Fixed-Point Types (bit true)

  • Completely specify function and signal decisions

  • No need for RTL

Multiply / Accumulate
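A rough sketch of what "cycle accurate" and "bit true" mean for a multiply/accumulate block: one function call stands for one clock cycle, and every result is wrapped to a fixed two's-complement width. The 24-bit accumulator width and the function names are assumptions for illustration; the slides do not state the actual word lengths.

```python
def wrap(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac_step(acc, a, b, acc_bits=24):
    """One clock cycle: accumulate a*b, wrapping on overflow as hardware would."""
    return wrap(acc + a * b, acc_bits)

def run_mac(samples, coeffs, acc_bits=24):
    """Run the MAC for one cycle per (sample, coefficient) pair."""
    acc = 0
    for a, b in zip(samples, coeffs):
        acc = mac_step(acc, a, b, acc_bits)
    return acc

if __name__ == "__main__":
    print(run_mac([1, 2, 3], [4, 5, 6]))
```

Because the model already fixes both timing and word length, it can serve as the reference against which generated hardware is cross-checked.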


Modeling control logic
Modeling Control Logic

  • Extended finite state-machine editor

  • Co-simulation with dataflow graph

  • New Software: Stateflow-VHDL translator

  • No need for RTL

Address Generator / MAC Reset


Specifying circuit decisions

Black Box

RTL Code or Data-path Generator Code or Custom Module

Stateflow-VHDL translator

Time-Multiplexed FIR Filter

Specifying Circuit Decisions

  • Macro choices embedded in dataflow graph

  • Cross-check simulations required


Hierarchy hardened progressively

System-Level Design Environment

layout and characterize new hard macro

estimate performance: power, area, delay

Hard Macro Characterization Libraries

Hierarchy Hardened Progressively

  • Macro characterization saved for fast estimates

  • Each level of hierarchy becomes a new hard macro

  • Higher levels of hierarchy are adjusted

  • When top level of hierarchy is hardened, the design is done


Capturing floorplan decisions

Parallel Pipelined FIR Filter

Capturing Floorplan Decisions

  • Commercial physical design tools used

  • Instance names in floorplan match dataflow graph

  • Placements merged on each iteration

  • Manhattan distance can be used for parasitic estimates
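A minimal sketch of a Manhattan-distance parasitic estimate, treating each wire as a lumped capacitance. The capacitance per micron used here is a placeholder value, not a number from the slides, and the function names are hypothetical.

```python
def manhattan_um(p, q):
    """Manhattan distance between two placed pins, coordinates in microns."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def wire_cap_ff(p, q, cap_per_um_ff=0.2):
    """Lumped wire capacitance in fF; cap_per_um_ff is an assumed process constant."""
    return manhattan_um(p, q) * cap_per_um_ff

if __name__ == "__main__":
    driver, load = (0, 0), (300, 400)
    print(manhattan_um(driver, load), wire_cap_ff(driver, load))
```

Since the placements come from the merged floorplan, an estimate like this is available before any routing has been run.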


Reduced impact of interconnect

FO4 inv delay

Wire delay

...

Reduced Impact of Interconnect

  • 0.18 µm

Long wires can be modeled as lumped capacitances


Race immune clock tree synthesis

$t_{\text{skew(max)}} < t_{\text{clk-Q(min)}} - t_{\text{hold(max)}}$

Hierarchical Clock Tree Synthesis

Example Clock Tree: Stages: 22; Sinks: 7650; Skew: 320 ps; Clock Power: 2.8 mW; Logic Power: 21 mW

Race-Immune Clock Tree Synthesis

Race margin = 580 ps

  • 0.18 µm

  • VDD = 1 V

Demonstrated on a 600k transistor design
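The race condition on this slide (maximum skew must stay below minimum clock-to-Q delay minus maximum hold time) can be checked with a few lines. This sketch assumes that the quoted "race margin" of 580 ps denotes the right-hand side of that inequality; the slide does not say so explicitly, and the function names are hypothetical.

```python
def race_margin_ps(t_clk_q_min_ps, t_hold_max_ps):
    """Right-hand side of the race condition: min clock-to-Q minus max hold time."""
    return t_clk_q_min_ps - t_hold_max_ps

def is_race_immune(t_skew_max_ps, margin_ps):
    """True when the worst-case clock skew is strictly below the race margin."""
    return t_skew_max_ps < margin_ps

if __name__ == "__main__":
    # Slide values: skew 320 ps, race margin 580 ps (interpretation assumed).
    print(is_race_immune(320, 580))
```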


Example 1 macro hardening

parallel pipelined FIR filter

area in 0.25 µm: 1.4 mm²

power @ 25 MHz (1 V, PowerMill): 13.0 mW

critical path delay (1 V, PathMill): 18.0 ns

cells: 21 k

transistors: 240 k

execution time: 3 hours (elaborate / route), 9 hours (characterization)

disk space: 180 MB (elaborate / route), 1.5 GB (characterization)

Example 1: Macro Hardening

Most time/disk space spent on extraction and power simulation


Example 2 test chip
Example 2: Test Chip

  • 300k transistors

  • 0.25 µm

  • 1.0 V

  • 25 MHz

  • 6.8 mm²

  • 14 mW

  • 2 phase clock

  • 3 layers of P&R hierarchy

Parallel Pipelined FIR Filter (8X decimation filter for 12-bit 200 MHz ΣΔ)


Tdma baseband receiver

carrier detection

frequency estimation

rotate & correlate

control

TDMA Baseband Receiver

  • 600k transistors

  • 0.18 µm

  • 1.0 V

  • 25 MHz

  • 1.1 mm²

  • 21 mW

  • single phase clock

  • 5 clock domains

  • 2 layers of P&R hierarchy


Conclusions
Conclusions

  • Direct-Mapped hardware is the most efficient use of silicon

  • Direct-Mapped hardware can be easier to design and verify than embedded hardware/software systems

  • Don’t translate design data, refine it

  • Design with dataflow graphs, not sequential code

  • Design flow automation speeds up design space exploration


Embedded processor architectures and re configurable computing

Embedded Processor Architectures and (Re)Configurable Computing

Vandana Prabhu

Professor Jan M. Rabaey

Jan 10, 2000


Pico Radio Architecture

Embedded uP

FPGA

Dedicated FSM

Dedicated

DSP

Reconfigurable

DataPath


Reconfigurable computing merging efficiency and versatility
Reconfigurable Computing: Merging Efficiency and Versatility

Spatially programmed connection of processing elements.

  • “Hardware” customized to specifics of problem.

    • Direct map of problem specific dataflow, control.

  • Circuits “adapted” as problem requirements change.


Matching computation and architecture

AddressGen

AddressGen

Memory

Memory

Convolution

MAC

MAC

L

G

C

Control

Processor

Two architectural models:

sequential control + data-driven

Two models of computation:

communicating processes + data-flow

Matching Computation and Architecture


Implementation fabrics for data processing
Implementation Fabrics for Data Processing

300 million multiplications/sec

357 million add-sub’s/sec

Data In

16 Mmacs/mW!


Software methodology flow
Software Methodology Flow

Algorithms

Area &

µproc

&

Timing

Accelerator

Constraints

PDA Models

Kernel Detection

Behavioral

Xform’s

Estimation/Exploration

for low

power

Premapped

Power & Timing Estimation

Kernels

of Various Kernel Implementations

Kernels

Partitioning

Executable Intermediate Form

Reconfig HW

Software Compilation

Reconfig. Hardware Mapping

Interface Code Generation

Interconnect

Optimization

(Marlene Wan)


Maia reconfigurable baseband processor for wireless
Maia: Reconfigurable Baseband Processor for Wireless

  • 0.25 µm tech: 4.5 mm x 6 mm

  • 1.2 Million transistors

  • 40 MHz at 1V

  • 1 mW VCELP voice coder

  • Hardware

    • 1 ARM-8

    • 8 SRAMs & 8 AGPs

    • 2 MACs

    • 2 ALUs

    • 2 In-Ports and 2 Out-Ports

    • 14x8 FPGA


Implementation fabrics for protocols

RACH

akn

RACH

req

Memory

idle

RACH

BUF

BUF

slotset

write

read

update

R_ENA

idle

W_ENA

Slot_Set_Tbl

2x16

addr

Slot

start

slot_set

<31:0>

Slot_no

<5:0>

Pkt

end

Implementation Fabrics for Protocols

A protocol = Extended FSM

  • ASIC: 1 V, 0.25 µm CMOS process

  • FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA

  • ARM8: 1 V 25 MHz processor; n = 13,000

  • Ratio: 1 - 8 - >> 400

Idea: Exploit model of computation: concurrent finite state machines, communicating through message passing

Intercom TDMA MAC
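The model of computation named above, concurrent finite state machines that communicate only by message passing, can be illustrated with a toy simulation. The two machines, their states, and their transitions below are hypothetical; they borrow signal names from the figure (RACH req, RACH akn) but are not the actual Intercom MAC protocol.

```python
from collections import deque

class Fsm:
    """A finite state machine with an inbox; table maps (state, msg) -> (next_state, output)."""
    def __init__(self, start, table):
        self.state, self.table, self.inbox = start, table, deque()

    def step(self):
        """Consume one message; return (destination, message) to send, or None."""
        if not self.inbox:
            return None
        msg = self.inbox.popleft()
        self.state, out = self.table.get((self.state, msg), (self.state, None))
        return out

def make_fsms():
    """Build two hypothetical machines: a mobile requesting a slot and a base granting it."""
    mobile = Fsm("idle", {
        ("idle", "start"): ("wait_akn", ("base", "rach_req")),
        ("wait_akn", "rach_akn"): ("slot_set", None),
    })
    base = Fsm("idle", {("idle", "rach_req"): ("idle", ("mobile", "rach_akn"))})
    return {"mobile": mobile, "base": base}

def run(fsms, max_rounds=10):
    """Step every machine round-robin until no machine has pending messages."""
    for _ in range(max_rounds):
        progressed = False
        for fsm in fsms.values():
            progressed |= bool(fsm.inbox)
            out = fsm.step()
            if out is not None:
                dest, msg = out
                fsms[dest].inbox.append(msg)
        if not progressed:
            break

if __name__ == "__main__":
    fsms = make_fsms()
    fsms["mobile"].inbox.append("start")
    run(fsms)
    print(fsms["mobile"].state, fsms["base"].state)
```

Because the machines share nothing except their inboxes, each one could be mapped to a separate fabric (ASIC, FPGA, or processor) without changing the others.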


Low power fpga
Low-Power FPGA

  • Low Energy Embedded FPGA (Varghese George)

  • Test chip

    • 8x8 CLB array

    • 5 in - 3 out CLB

    • 3-level interconnect hierarchy

    • 4 mm² in 0.25 µm ST CMOS

    • 0.8 and 1.5 V supply

  • Simulation Results

    • 125 MHz Toggle Frequency

    • 50 MHz 8-bit adder

    • energy 70 times lower than comparable Xilinx


An energy efficient µP system

Integrated dc-dc converter

An Energy-Efficient µP System

  • Dynamic Voltage Scaling (Trevor Pering & Tom Burd)

Lower speed, Lower voltage, Lower energy

Before

µProc. Speed

After

Idle
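Why running slower at a lower voltage saves energy follows from the standard first-order model in which dynamic switching energy per operation scales with C·V². The sketch below uses only that relation; it ignores leakage and dc-dc converter losses, and the function names and example voltages are assumptions, not values from the slides.

```python
def energy_per_op(c_eff_farads, vdd_volts):
    """First-order dynamic energy per operation: C_eff * Vdd^2 (leakage ignored)."""
    return c_eff_farads * vdd_volts ** 2

def energy_ratio(v_low, v_high):
    """Energy per operation at v_low relative to v_high for the same workload."""
    return (v_low / v_high) ** 2

if __name__ == "__main__":
    # Halving the supply voltage cuts energy per operation to a quarter.
    print(energy_ratio(1.0, 2.0))
```

A processor that idles after a full-speed burst spends the full-voltage energy on every operation, while one that stretches the same work over the available time at a lower voltage does not.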


Xtensa configurable processor
Xtensa Configurable Processor

  • Xtensa (Tensilica, Inc.) for embedded CPU

    • Configurability allows designer to keep “minimal” hardware overhead

    • ISA (compatible with 32 bit RISC) can be extended for software optimizations

    • Fully synthesizable

    • Complete HW/SW suite

  • VCC modeling for exploration

    • Requires mapping of “fuzzy” instructions of VCC processor model to real ISA

    • Requires multiple models depending on memory configuration

    • ISS simulation to validate accuracy of model

(Vandana Prabhu)


Microprocessor optimizations for network protocols

Total Execution Time

calloc

memcpy

other

Memory Routines

Microprocessor Optimizations for Network Protocols

  • Implements Transport layer on configurable processor

    • TDMA control and channel usage management

  • Upper layer of protocol is dominated by processor control flow

    • Memory routines, Branches, Procedure calls

  • Artifacts of code generation tools are significant

    • Excessively modular code introduces procedure calls

    • Uses dynamic memory allocation

  • Configurable processor

    • Increased size of register file

    • Customized instructions help datapath but not control

Efficient implementation at code generation and architecture levels!

(Kevin Camera & Tim Tuan)


Implementation methodology for reconfigurable wireless protocol
Implementation Methodology for Reconfigurable Wireless Protocol

  • Changing granularity within protocol stack requires estimation tool for energy-efficient implementation

  • Software exploration on processors

    • Exploring Xtensa’s TIE

  • Hardware exploration on FPGA platforms

    • Optimal FPGA architecture

    • Alternately “Reconfigurable FSM” analogous to Pleiades approach for datapath kernels

(Suetfei Li & Tim Tuan)


Tci a first generation piconode
TCI - A First Generation PicoNode

Memory

Sub-system

Tensilica

Embedded Proc.

Sonics Backplane

Programmable

Protocol Stack

ConfigurableLogic

(Physical Layer)

Baseband Processing


The system on a chip nightmare

System Bus

DMA

CPU

DSP

Mem

Ctrl.

Bridge

MPEG

C

I

O

O

Custom Interfaces

Peripheral

Bus

Control Wires

The System-on-a-Chip Nightmare

The “Board-on-a-Chip”

Approach

Courtesy of Sonics, Inc


The communications perspective

Open Core Protocol™

DMA

DSP

CPU

MPEG

SiliconBackplane Agent™

C

MEM

I

O

Guaranteed Bandwidth

Arbitration

Example: “The Silicon Backplane”

(Sonics, Inc)

The Communications Perspective

(Mike Sheets)

Communications-based Design


Summary
Summary

  • Design for low-energy impacts all stages of the design process — the earlier the better

  • Energy reduction requires clear communication and computation abstractions

  • Efficient and abstract modeling of energy at behavior and architecture level is crucial

  • Efficient hardware implementation of protocol stack

  • Beat the SoC monster!


Targeting tiled architectures in design exploration

1 LESTER Lab

Université de Bretagne Sud

Lorient, France

{lilian.bossuet, guy.gogniat, [email protected]

2 Department of Electrical

and Computer Engineering

University of Massachusetts,

Amherst, USA

{burleson, vanand, [email protected]

Targeting Tiled Architectures in Design Exploration

Lilian Bossuet1, Wayne Burleson2, Guy Gogniat1,

Vikas Anand2, Andrew Laffely2, Jean-Luc Philippe1


Design space exploration motivations
Design Space Exploration: Motivations

  • Design solutions for new telecommunication and multimedia applications targeting embedded systems

  • Optimization and reduction of SoC power consumption

  • Increase computing performance

    • Increase parallelism

    • Increase speed

  • Be flexible

    • Take into account run-time reconfiguration

    • Targeting multi-granularity (heterogeneous) architectures


Design space exploration flow
Design Space Exploration: Flow

  • Progressive design space reduction:

    • iterative exploration

    • refinement of architecture model

    • increase of performance estimation accuracy

  • One level of abstraction for one level of estimation accuracy


Reconfigurable architectures
Reconfigurable Architectures

  • Bridging the flexibility gap between ASICs and microprocessors [Hartenstein DATE 2001]

  • Energy-efficient solution for low-power programmable DSP [Rabaey ICASSP 1997, FPL 2000]

  • Run-Time Reconfigurable [Compton & Hauck 1999]

  • => A key ingredient for future silicon platforms [Schaumont et al. DAC 2001]


Design space of reconfigurable architecture
Design Space of Reconfigurable Architecture

RECONFIGURABLE ARCHITECTURES

(R-SOC)

MULTI GRANULARITY

(Heterogeneous)

FINE GRAIN

(FPGA)

COARSE GRAIN

(Systolic)

Tile-Based

Architecture

Processor +

Coprocessor

Island

Topology

Hierarchical Topology

Coarse Grain Coprocessor

Fine Grain

Coprocessor

Mesh

Topology

Linear

Topology

Hierarchical

Topology

  • RAW

  • CHESS

  • MATRIX

  • KressArray

  • Systolix Pulsedsp

  • Xilinx Virtex

  • Xilinx Spartan

  • Atmel AT40K

  • Lattice ispXPGA

  • Altera Stratix

  • Altera Apex

  • Altera Cyclone

  • Chameleon

  • REMARC

  • Morphosys

  • Pleiades

  • Garp

  • FIPSOC

  • Triscend E5

  • Triscend A7

  • Xilinx Virtex-II Pro

  • Altera Excalibur

  • Atmel FPSIC

  • aSoC

  • E-FPFA

  • Systolic Ring

  • RaPiD

  • PipeRench

  • DART

  • FPFA


A target architecture asoc
A Target Architecture: aSoC

  • Adaptive System-on-a-Chip (aSoC)

  • Tiled architecture containing many heterogeneous processing cores (RISC, DSP, FPGA, Motion Estimation, Viterbi Decoder)

  • Mesh communication network controlled with statically determined communication schedule

  • A scalable architecture.


Fpga in system on a chip
FPGA in System-on-a-Chip

  • Fast Time-To-Market

  • Post-Fabrication Customization

    • Broaden application domain

    • Run-time Reconfiguration

    • Bug Fixes

    • Upgrades

  • 10x-100x Worse:

    • Area

    • Performance

    • Power

Mark L. Chang [email protected]


Asoc architecture

North

West

East

ctrl

  • Point-to-point connections

  • Communication Interface

South

Core

aSoC Architecture

tile

  • Heterogeneous Cores

uProc

MUL

FPGA

MUL


Asoc communications interface
aSoC Communications Interface

  • Interface Crossbar

    • inter-tile transfer

    • tile to core transfer

  • Interconnect/Instruction Memory

    • contains instructions to configure the interface crossbar (cycle-by-cycle)

  • Interface Controller

    • selects the instruction

  • Coreports

    • data interface and storage for transfers with the tile IP core

  • Dynamic Voltage and Frequency Selection

    • Dynamic Power Management
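How an instruction memory can configure the interface crossbar cycle by cycle is sketched below. The encoding (one dictionary per instruction, mapping an input port to the output ports it drives) and the wrap-around program counter are assumptions made for illustration; the slides do not give the real instruction format.

```python
def run_schedule(schedule, inputs_per_cycle):
    """Apply a static communication schedule to per-cycle port inputs.

    schedule: list of instructions; each maps an input port to a list of output ports.
    inputs_per_cycle: list of dicts mapping input port -> data word for that cycle.
    Returns, for each cycle, the data word driven on every active output port.
    """
    trace = []
    pc = 0  # program counter into the instruction memory
    for inputs in inputs_per_cycle:
        outputs = {}
        for src, dests in schedule[pc].items():
            if src in inputs:
                for dest in dests:
                    outputs[dest] = inputs[src]
        trace.append(outputs)
        pc = (pc + 1) % len(schedule)  # assumed: the schedule repeats
    return trace

if __name__ == "__main__":
    # Cycle 0: north to south and east; cycle 1: west to the local core.
    schedule = [{"north": ["south", "east"]}, {"west": ["core"]}]
    inputs = [{"north": 7}, {"west": 9, "north": 1}, {"north": 3}]
    for cycle, out in enumerate(run_schedule(schedule, inputs)):
        print(cycle, out)
```

Because the routing decisions are fixed at compile time, no arbitration happens at run time, which is what makes transfers fast and predictable.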

Core

Coreports

Interface Crossbar

North

North

South

South

East

East

West

West

Outputs

Inputs

Local

Config

.

Local

Decoder

Controller

Frequency

& Voltage

North to South & East

PC

Instruction Memory


Asoc exploration
aSoC Exploration ...

  • Type of tiles

  • Number of each type of tile

  • Placement of the tiles

  • Internal architecture of reconfigurable tiles (FPGA core)

  • Communication scheduling


Design space exploration goals
Design Space Exploration: Goals

  • Goal: Rapid exploration of various architectural solutions to be implemented on heterogeneous reconfigurable architectures (aSoC) in order to select the most efficient architecture for one or several applications

  • Takes place before architectural synthesis (algorithmic specification with high level abstraction language)

  • Estimations are based on a functional architecture model (generic, technology-independent)

  • Iterative exploration flow to progressively refine the architecture definition, from a coarse model to a dedicated model


Design exploration flow targeting tiled architecture

C

SPECIFICATION

C to HCDFG parser

Model of the aSOC Architectures

HCDFG Graphs of the application

T

Tile

A

aSOC

2

1

App

Application

F

Function

1

2

T

Tile

1

F

Function

1

T

1

F

1

T

2

F

2

THF Model

HF Model

Application

Analysis

aSOC

Builder

Tile Exploration

Final model of

aSOC architecture

Results of the Tile exploration step

Static Communication

Scheduling

Function

Tile

Performance

F

T

T

, C

,

Occ

1

1

11

11

11

T

T

, C

,

Occ

2

21

21

21

F

T

T

, C

,

Occ

2

1

12

12

12

T

T

, C

,

Occ

2

22

22

22

aSOC

Analysis

Design Exploration Flow Targeting Tiled Architecture


Application analysis

Use of algorithmic metrics and dedicated scheduling algorithms to highlight the target architectures

Algorithmic metrics:

Characterize the application orientation

Processing

Memory

Control

Characterize the application potential parallelism

Processing

Memory

Application Analysis


Tile exploration with 3 steps

Projection:

Link between necessary resources (application) and available resources (tile)

Use of an allocation algorithm based on communication costs reduction

Composition:

Takes the function scheduling into account to estimate additional resources (register, mux, …)

Estimation:

performance interval computation (lower and upper bounds)

speed/resource utilization/power characterization

Tile Exploration: with 3 steps


Asoc builder

Environment AppMapper

Partition and assignment

based on Run Time Estimation

Compilation

Communication Scheduling

Core compilation

Generate tiles configuration

Communications instructions

Bitstreams (for reconfigurable tile)

RISC instructions

aSoC Builder


Asoc analysis

Use the results of previous steps

Functions scheduling

Tile allocation

Communication scheduling

Complete estimation of the proposed solution

Global execution time

Global power consumption

Total area

aSoC Analysis


Power aware system on a chip

Power-Aware System on a Chip

A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson

University of Massachusetts Amherst

Boston Area Architecture Conference

30 Jan 2003

{alaffely, jliang, tessier, moritz, [email protected]

This material is based upon work supported by the National Science Foundation under Grant No. 9988238.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Adaptive system on a chip

Tile

Communication

Interface

North

µProc

Multiplier

East

West

ctrl

Multiplier

FPGA

South

Core

Adaptive System-on-a-Chip

  • Tiled architecture with mesh interconnect

    • Point to point communication pipeline

  • Allows for heterogeneous cores

    • Differing sizes, clock rates, voltages

  • Low-overhead core interface for

    • On-chip bus substitute for streaming applications

  • Based on static scheduling

    • Fast and predictable


Asoc implementation
aSoC Implementation

2500 λ

0.18 µm technology

Full custom

3000 λ


Some results
Some Results

  • 9 and 16 core systems tested for IIR, MPEG encoding and Image processing applications

    • ~2x the performance of the CoreConnect bus (burst and hierarchical)

    • ~1.5x the performance of an oblivious routing network [1] (dynamic routing)

    • Max speedup is 5 x

1. W. Dally and H. Aoki, “Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels”, IEEE Transactions on Parallel and Distributed Systems, April 1993

