Flexible wireless communication architectures. Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX Faculty Candidate Seminar – Southern Methodist University April 23, 2003. This work has been supported in part by NSF, Nokia and Texas Instruments.


Flexible wireless communication architectures

Sridhar Rajagopal

Department of Electrical and Computer Engineering

Rice University, Houston TX

Faculty Candidate Seminar – Southern Methodist University

April 23, 2003

This work has been supported in part by NSF, Nokia and Texas Instruments

Future wireless devices demand flexibility

[Figure: wireless cellular, Bluetooth/home networks, wireless LAN]
  • Multiple algorithms and environments supported in same device
  • High data rate mobile devices with multimedia
    • Flexible algorithms: Multiple antennas, complex signal processing
    • Flexible architectures: High performance (Mbps), low power (mW)
  • Fast design with structured exploration

Flexibility needed in different layers

  • Application Layer: Puppeteer project at Rice, http://www.cs.rice.edu/CS/Systems/Puppeteer/
  • Network Layer
  • MAC Layer
  • Physical Layer
  • Analog RF

[Figure: flexible algorithms are mapped onto flexible architectures]

Research vision: Attain flexibility

Algorithms:

  • Flexibility: support a variety of sophisticated algorithms

Architectures:

  • Flexibility: adapt hardware to algorithms

Fast, structured design exploration

Design me

Contributions: Algorithms

Multi-user channel estimation: [Jnl. of VLSI Sig. Proc.’02, ASAP’00]

  • Matrix-inversions
  • Numerical techniques
    • conjugate-gradient descent for complexity reduction

Multi-user detection: [ISCAS’01]

  • Block-based computation to streaming computations
    • Pipelining, lower memory requirements

Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]

Contributions: Architectures

Heterogeneous DSP-FPGA system designs: [ICSPAT’00]

Computer arithmetic: [Symp. on Comp. Arith.’01]

  • Dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation

Scalable Wireless Application-specific Processors (SWAPs): [Ph.D. Thesis]

  • Rapid, structured architectures with flexibility-performance tradeoffs

Scalable Wireless Application-specific Processors
  • Family of flexible programmable processors
    • Clusters of ALUs
    • High performance by supporting 100’s of ALUs
    • Can provide customization for various algorithms
    • Adapts (“swaps”) architecture dynamically for power

[Figure: clusters of ALUs; the design scales the number of ALUs per cluster and the number of clusters]

Rapid, structured design for SWAPs

[Figure: low-“complexity”, parallel, fixed-point algorithms feed architecture exploration; ASIC design and DSP design are both applied to SWAPs]

Research vision summary
  • Provide a structured framework to rapidly explore:
    • flexible, high performance, low power architectures (SWAPs)
  • Efficient algorithm design for mapping to SWAPs
  • Understanding of algorithms, DSPs and ASICs used
  • Flexibility-performance trade-offs

Inter-disciplinary research:

Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, Circuits, CAD, Compilers

Talk Outline
  • Research vision
  • SWAPs - Background
  • Algorithm design for SWAPs
  • Architecture design for SWAPs
  • Current and Future Research Goals

SWAPs borrow from DSPs

[Figure: register file (RF) shown for 1, 4, 16 and 32 ALUs]
  • DSPs use instruction-level parallelism (ILP) and subword parallelism (as in MMX)
  • Not enough ALUs for GOPs of computation; hundreds are needed
    • TI C6x has 8 ALUs
  • Why not more ALUs?
    • Cannot support more registers (area,ports)
    • Difficult to find ILP as ALUs increase
SWAPs borrow from ASICs

Exploit data parallelism (DP)

  • Available in many wireless algorithms
  • This is what ASICs do!

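#define N 1024

int a[N], b[N], sum[N];        // 32 bits
short int c[N], d[N], diff[N]; // 16 bits, packed

void add_sub(void)             // wrapper added so the fragment compiles
{
    for (int i = 0; i < N; ++i)    // DP: iterations are independent
    {
        sum[i] = a[i] + b[i];      // ILP: the two statements are independent
        diff[i] = c[i] - d[i];     // Subword: 16-bit operands
    }
}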
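
The subword parallelism on the 16-bit arrays in the loop above can be imitated in portable C. This is a minimal sketch, assuming two 16-bit lanes packed into one 32-bit word; `pack16`, `packed_add16` and `packed_add16_array` are hypothetical helpers that do not appear in the talk, and a DSP with subword support would normally do the lane-wise add in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs two 16-bit values into one 32-bit word. */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Hypothetical helper: adds both 16-bit lanes at once.
   Each lane wraps modulo 2^16 and no carry crosses the lane boundary. */
uint32_t packed_add16(uint32_t x, uint32_t y)
{
    const uint32_t low15 = 0x7FFF7FFFu;  /* bits 0..14 of each lane */
    const uint32_t top   = 0x80008000u;  /* bit 15 of each lane */
    uint32_t partial = (x & low15) + (y & low15);  /* carries stop at bit 15 */
    return partial ^ ((x ^ y) & top);              /* add top bits, drop carry-out */
}

/* Data-parallel loop: every iteration is independent of the others. */
void packed_add16_array(const uint32_t *x, const uint32_t *y,
                        uint32_t *out, size_t n)
{
    assert(x != NULL && y != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = packed_add16(x[i], y[i]);
}
```

Calling `packed_add16(pack16(3, 4), pack16(10, 20))` yields `pack16(13, 24)`: one 32-bit operation performs two 16-bit additions.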

SWAPs borrow from stream processors

[Figure: stream program for a receiver. Kernels: correlator, channel estimation, matched filter, interference cancellation, Viterbi decoding. Streams: received signal (input data) through to decoded bits (output data).]
  • Kernels (computation) and streams (communication)
  • Use local data in clusters providing GOPs support
  • Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.

SWAPs are multi-cluster DSPs

[Figure: a DSP (1 cluster) has adders (+) and multipliers (*) on internal memory and exploits ILP. A SWAP replicates the cluster along the DP axis. Memory: Stream Register File (SRF).]

  • SWAPs adapt clusters to DP
  • Identical clusters, same operations
  • Power-down unused FUs, clusters

Arithmetic clusters in SWAPs

[Figure: one arithmetic cluster. Functional units with distributed register files (supports more ALUs) connect through a cross point from/to the SRF, to a comm. unit on the intercluster network, and to a scratchpad (indexed accesses).]

SWAPs: Physical layer algorithms

[Figure: receiver chain: antenna, RF front-end, baseband processing (channel estimation, detection, decoding), higher (MAC/Network/OS) layers]

Complex signal processing algorithms with GOPs of computation

SWAP mapping example: Viterbi decoding
  • Multiple antenna systems (MIMO systems)
    • Complexity exponential with transmit x receive antennas
  • Estimation: Linear MMSE, blind, conjugate gradient….
  • Detection: FFT, (blind) interference cancellation….
  • Decoding: Viterbi, Turbo, LDPC…. & joint schemes
  • SWAP flexibility lets you use the best algorithms for the situation

Example for concept demonstration: Viterbi decoding

Parallel Viterbi Decoding for SWAPs

[Figure: detected bits enter the ACS unit, which feeds the traceback unit, which outputs decoded bits]

  • Add-Compare-Select (ACS): trellis interconnect, computations
    • Parallelism depends on constraint length (#states)
  • Traceback: searching
    • Conventional
      • Sequential (No DP) with dynamic branching
      • Difficult to implement in parallel architecture
    • Use Register Exchange (RE)
      • parallel solution
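
The ACS and register-exchange steps above can be sketched in C. This is an illustration under stated assumptions, not the talk's implementation: constraint length K = 3, rate-1/2 generators 7 and 5 (octal), hard-decision Hamming branch metrics, and at most 32 decoded bits. All function names are hypothetical. Each state keeps a register holding the decoded bits of its survivor path, so no sequential traceback search is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K       3                    /* constraint length (assumed) */
#define M       (K - 1)              /* encoder memory bits */
#define NSTATES (1u << M)            /* trellis states */

static const unsigned G[2] = { 7u, 5u };  /* assumed generators, octal 7 and 5 */

static unsigned parity(unsigned v)
{
    unsigned p = 0;
    while (v) { p ^= v & 1u; v >>= 1; }
    return p;
}

/* Two coded bits produced when input bit u leaves state s. */
static unsigned encode_step(unsigned s, unsigned u)
{
    unsigned reg = (u << M) | s;
    return (parity(reg & G[0]) << 1) | parity(reg & G[1]);
}

/* Hamming distance between two 2-bit symbols. */
static unsigned hamming2(unsigned a, unsigned b)
{
    unsigned d = (a ^ b) & 3u;
    return (d & 1u) + (d >> 1);
}

/* Encodes the low n bits of msg, most significant of those bits first. */
void conv_encode(uint32_t msg, int n, unsigned *out)
{
    unsigned s = 0;
    for (int t = 0; t < n; ++t) {
        unsigned u = (msg >> (n - 1 - t)) & 1u;
        out[t] = encode_step(s, u);
        s = (u << (M - 1)) | (s >> 1);
    }
}

/* Decodes n received 2-bit symbols using ACS plus register exchange. */
uint32_t viterbi_re_decode(const unsigned *rx, int n)
{
    unsigned pm[NSTATES], new_pm[NSTATES];
    uint32_t reg[NSTATES] = { 0 }, new_reg[NSTATES];
    unsigned best = 0;

    assert(rx != NULL && n > 0 && n <= 32);
    for (unsigned s = 0; s < NSTATES; ++s)
        pm[s] = (s == 0) ? 0u : 1000u;           /* encoder starts in state 0 */

    for (int t = 0; t < n; ++t) {
        for (unsigned ns = 0; ns < NSTATES; ++ns) {  /* independent per state: DP */
            unsigned u  = ns >> (M - 1);             /* input bit leading to ns */
            unsigned p0 = (ns << 1) & (NSTATES - 1u);/* even predecessor */
            unsigned p1 = p0 | 1u;                   /* odd predecessor */
            unsigned m0 = pm[p0] + hamming2(encode_step(p0, u), rx[t]); /* add */
            unsigned m1 = pm[p1] + hamming2(encode_step(p1, u), rx[t]);
            unsigned surv = (m1 < m0) ? p1 : p0;     /* compare, select */
            new_pm[ns]  = (m1 < m0) ? m1 : m0;
            new_reg[ns] = (reg[surv] << 1) | u;      /* register exchange */
        }
        memcpy(pm, new_pm, sizeof pm);
        memcpy(reg, new_reg, sizeof reg);
    }
    for (unsigned s = 1; s < NSTATES; ++s)
        if (pm[s] < pm[best]) best = s;
    return reg[best];                                /* survivor of the best state */
}
```

The inner loop over `ns` has no dependence between states, which is the data parallelism that register exchange preserves and that a pointer-chasing traceback would lose.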

Parallel Viterbi needs re-ordering for SWAPs

[Figure: 16 trellis states as DP vectors for regular ACS and for ACS in SWAPs. Regular ACS lists the states in natural order X(0), X(1), ..., X(15); the re-ordered vector lists them as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15).]

Exploiting Viterbi DP in SWAPs:

  • Use RE instead of regular traceback
  • Re-order ACS, RE
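
The state labels in the figure suggest a re-ordering that places even-indexed states first and odd-indexed states second. A minimal sketch of that permutation follows; `reorder_even_odd` is a hypothetical name, and the reading of the figure is an assumption. Under one common trellis convention, a new state j has predecessors 2j and 2j+1, so after this re-ordering every new state reads element j of the "even" half and element j of the "odd" half, which is a plain element-wise vector operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the even-indexed entries of in[] to the first
   half of out[] and the odd-indexed entries to the second half.
   n must be even, and in and out must not alias. */
void reorder_even_odd(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL && in != out && n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        out[i]         = in[2 * i];      /* X(0), X(2), X(4), ... */
        out[n / 2 + i] = in[2 * i + 1];  /* X(1), X(3), X(5), ... */
    }
}
```

For 16 states, the output order is 0, 2, 4, ..., 14, 1, 3, ..., 15, matching the re-ordered column in the figure.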
SWAP architecture design

More clusters are better than more ALUs per cluster (if #clusters > 2)

  • Decide how many clusters
    • Exploit DP
  • Decide what to put within each cluster
    • Maximize ILP with high functional unit efficiency
    • Search design space with “explore” tool

Time-power-area characterization

[Figure: ALUs within a cluster exploit ILP; clusters exploit DP]

Design a SWAP cluster: “Explore”

Auto-exploration of adders and multipliers for “ACS”

[Figure: 3-D plot of instruction count (axis 40 to 160) against #Adders (1 to 5) and #Multipliers (1 to 5). Each point is labelled (Adder util%, Multiplier util%): (80,34), (85,24), (85,17), (85,13), (85,11), (70,59), (73,41), (62,62), (76,33), (72,22), (65,45), (54,59), (43,58), (72,19), (47,43), (61,33), (39,41), (60,26), (49,33), (61,22), (40,32), (48,26), (39,27), (50,22), (39,22).]

“Explore” tool benefits
  • Instruction count vs. ALU efficiency
    • What goes inside each cluster
  • Design customized application-specific units
    • Better performance with increased ALU utilization
  • Explore multiple algorithms
    • turn off functional units not in use for given kernel
    • Vdd-gating, clock gating techniques

Example for SWAP architecture design

Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Explore Algorithm 3: 2 adders, 2 multipliers, 64 clusters
Explore Algorithm 4: 2 adders, 2 multipliers, 16 clusters

Chosen Architecture: 4 adders, 3 multipliers, 64 clusters

[Figure axes: ILP, DP]
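
The chosen architecture in this example equals the per-resource maximum over the four explored algorithms. The sketch below assumes that this is the selection rule; the talk does not state it, and `SwapConfig` and `cover_all` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int adders;       /* adders per cluster */
    int multipliers;  /* multipliers per cluster */
    int clusters;     /* number of clusters */
} SwapConfig;

/* Returns the smallest configuration that covers every requirement in req[],
   taking the maximum of each resource independently. */
SwapConfig cover_all(const SwapConfig *req, size_t n)
{
    SwapConfig c;
    assert(req != NULL && n > 0);
    c = req[0];
    for (size_t i = 1; i < n; ++i) {
        if (req[i].adders > c.adders)           c.adders = req[i].adders;
        if (req[i].multipliers > c.multipliers) c.multipliers = req[i].multipliers;
        if (req[i].clusters > c.clusters)       c.clusters = req[i].clusters;
    }
    return c;
}
```

Feeding in the four explored configurations from the slide gives 4 adders, 3 multipliers and 64 clusters. An algorithm that needs less then leaves ALUs or clusters idle, which is where the power-down options come in.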

SWAP flexibility provides power savings
  • Multiple algorithms
    • Different ALU, cluster requirements
  • Turning off ALUs ( –add –mul compiler options)
    • Use the right #ALUs from “explore” tool
  • Turning off clusters
    • Data is spread across the SRF of all clusters
    • Cluster only has access to its own SRF
    • Next kernel may need data from SRF of other clusters
    • Reconfiguration support needs to be provided

SWAPs provide cluster reconfiguration

[Figure: a mux-demux network with stream buffers sits between the SRF and the clusters]

  • Additional latency (a few cycles) due to microcontroller stalls
    • Minimal loss in performance

Cluster reconfiguration for Viterbi

  • Packet 1: constraint length 7 (16 clusters)
  • Packet 2: constraint length 9 (64 clusters)
  • Packet 3: constraint length 5 (4 clusters)
  • Clusters not needed along the DP axis can be turned OFF
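
The three packets above are consistent with a fixed ratio of four trellis states per cluster, since a constraint length K gives 2^(K-1) states. The sketch below assumes that ratio; it is inferred from the numbers on the slide rather than stated in the talk, and the names are hypothetical.

```c
#include <assert.h>

#define MAX_CLUSTERS       64  /* cluster count given on the slides */
#define STATES_PER_CLUSTER 4   /* assumption inferred from the three packets */

/* Number of trellis states for constraint length k. */
int viterbi_states(int k)
{
    assert(k >= 2 && k <= 30);
    return 1 << (k - 1);
}

/* Clusters kept active for constraint length k; the rest can be turned off. */
int clusters_needed(int k)
{
    int c = viterbi_states(k) / STATES_PER_CLUSTER;
    if (c < 1) c = 1;
    if (c > MAX_CLUSTERS) c = MAX_CLUSTERS;
    return c;
}
```

With these assumptions, `clusters_needed` returns 16, 64 and 4 for K = 7, 9 and 5, so a K = 5 packet leaves 60 of the 64 clusters idle.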

SWAPs provide flexibility at negligible overhead

[Figure: execution time (cycles) of cluster and memory activity for three 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Kernels (computation) run with no data memory accesses.]

SWAP exploration for Viterbi decoding

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) against number of clusters (1 to 100) for K = 5, K = 7 and K = 9, up to the maximum DP. Curves: DSP, different SWAPs (without reconfiguration), same SWAP (with reconfiguration).]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

SWAPs: Salient features
  • 1-2 orders of magnitude better than a DSP
  • Any constraint length: 10 MHz at 128 Kbps
  • Same code for all constraint lengths
    • no need to re-compile or load another code
    • as long as parallelism/cluster ratio is constant
  • Power savings due to dynamic cluster scaling

Expected SWAP power consumption
  • Power model based on [Khailany’03]
  • 64 clusters and 1 multiplier per cluster:
    • 0.13 micron, 1.2 V
    • Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW)
    • Area: ~53.7 mm²
  • 10 MHz, 128 Kbps with reconfiguration

Viterbi       Clusters Used   Peak Power
K = 9         64              ~90 mW
K = 7         16              ~28.57 mW
K = 5         4               ~13.8 mW
overhead      0               ~8.1 mW
DSP, K = 9    1               ~200 mW

[Figure: power (in mW, 0 to 90) against active clusters (max 64)]

Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.

Multiuser Estimation-Detection+Decoding

Real-time target: 128 Kbps per user

[Figure: log-log plot of frequency needed to attain real-time (in MHz, up to 100000) against number of clusters (1 to 100) for fast, medium and slow fading scenarios, with a DSP for comparison. Marked operating points: mobile, 32-user base-station.]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Expected SWAP power: base-station
  • 32-user base-station with 3 multipliers per cluster and 64 clusters:
    • 0.13 micron, 1.2 V
    • Peak Active Power: ~18.19 mW at 1 MHz (more multipliers)
    • Area: ~93.4 mm²
  • Total Peak Base-station power consumption:
    • ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
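
The power figures on the two power slides follow from scaling the quoted peak power at 1 MHz by the clock frequency. A minimal sketch of that arithmetic, assuming peak active power scales linearly with frequency; `peak_power_mw` is a hypothetical helper.

```c
#include <assert.h>

/* Peak active power in mW, assuming it scales linearly with clock frequency.
   mw_per_mhz is the quoted peak power at 1 MHz. */
double peak_power_mw(double mw_per_mhz, double freq_mhz)
{
    assert(mw_per_mhz >= 0.0 && freq_mhz >= 0.0);
    return mw_per_mhz * freq_mhz;
}
```

With the slides' numbers: ~9 mW at 1 MHz run at 10 MHz gives ~90 mW for Viterbi on 64 clusters, a DSP at ~1 mW per MHz run at ~200 MHz gives ~200 mW, and ~18.19 mW at 1 MHz run at 1 GHz gives ~18.19 W for the base-station.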
Current research: Flexibility vs. performance

SWAPs: 128 Kbps at ~10-100 mW for Viterbi

  • Borrow DP from ASICs!
  • Suitable for base-stations
    • Flexibility more important than power
  • Suitable for mobile devices
    • Power constraints tighter
    • Can be customized for further power savings

Handset SWAPs (H-SWAPs)

  • Borrow task pipelining from ASICs!
  • Application-specific units and specialized comm. network

Handset SWAPs: H-SWAPs
  • Trade Data Parallelism for Task Pipelining

[Figure: SWAPs (max. clusters and reconfigure) exploit DP across clusters fed from the SRF. H-SWAPs are a collection of customized SWAPlets; each SWAPlet limits its clusters and so has limited DP.]

Sample points in architecture exploration

Programmable solutions with increased customization

  • DSPs (1 cluster): ILP, Subword
  • SWAPs (multiple clusters): ILP, Subword, DP
  • H-SWAPs (optimized for handsets): ILP, Subword, DP, Task Pipelining, Custom ALUs

Performance, power benefits (with decreasing flexibility)

Future: Efficient algorithms and mapping

Multiple antenna systems with 1-2 orders-of-magnitude higher complexity

Future research: Architectures

Generalized and structured framework and tools

    • Joint algorithm-architecture exploration
    • Area-time-power-flexibility tradeoffs

Potential applications: embedded systems

  • Image and Video processing:
    • Cameras : variety of compression algorithms
  • Biomedical applications:
    • Hearing aids: DSP running on body heat*
  • Sensor networks
    • Compression of data before transmission

*Quote: Gene Frantz, TI Fellow

SWAPs: Flexibility, Performance, Power
  • Need flexibility in future wireless devices
    • Algorithms and Architectures
  • Rapid Exploration for Scalable, Wireless Application-specific Processors
    • Structured approach with flexibility-performance trade-offs
  • SWAPs - flexibility, high performance and low power
    • Exploit data parallelism like ASICs
    • 1-2 orders of magnitude better performance than DSPs
    • Turn off unused clusters and unused ALUs for low power