Programmable processors for wireless base-stations

Sridhar Rajagopal, December 9, 2003

Fact #1: Wireless rates ⇒ clock rates

[Figure: log-scale plot, 1996 to 2006, of clock frequency (MHz), W-LAN data rate (Mbps) and cellular data rate (Mbps). Clock frequency grows from 200 MHz to 4 GHz, W-LAN data rate from 1 Mbps to 54-100 Mbps, cellular data rate from 9.6 Kbps to 2-10 Mbps.]

Need to process 100X more bits per clock cycle today than in 1996

Source: Intel, IEEE 802.11x, 3GPP

Fact #2: Base-stations need horsepower

[Block diagram of a base-station receiver. RF: LNA, RF RX, ADC, DDC, frequency offset compensation, power measurement and gain control (AGC). Baseband processing: chip-level demodulation and despreading, channel estimation, symbol detection, symbol decoding. Network interface: packet/circuit switch control and BSC/RNC interface to an E1/T1 or packet network. Power supply and control unit.]

Sophisticated signal processing for multiple users

Need 100-1000s of arithmetic operations to process 1 bit

Source: Texas Instruments

Need ≥ 100 ALUs in base-stations

Example:

1000 arithmetic operations/bit with 1 bit/10 cycles

  • 100 arithmetic operations/clock cycle

Base-stations need ≥ 100 ALUs

  • irrespective of the type of (clocked) architecture
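
The arithmetic behind this example can be written out as a short worked equation. It only restates the slide's numbers; the symbol N_ALU and the assumption that each ALU completes one operation per clock cycle are introduced here for illustration.

```latex
\[
1000\ \frac{\text{operations}}{\text{bit}}
\times
\frac{1\ \text{bit}}{10\ \text{cycles}}
= 100\ \frac{\text{operations}}{\text{cycle}}
\quad\Longrightarrow\quad
N_{\text{ALU}} \ge 100
\quad \text{(assuming one operation per ALU per cycle)}
\]
```

Because the bound comes from operations per cycle alone, it holds for any clocked architecture that sustains at most one operation per ALU per cycle.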
Fact #3: Base-stations need power-efficiency*

Wireless gets blacked out too

Trying to use your cell phone during the blackout was nearly impossible. What went wrong? (August 16, 2003, 8:58 AM EDT, by Paul R. La Monica, CNN/Money Senior Writer)

Wireless systems getting denser

  • More base-stations per unit area
  • operational and maintenance costs

Architectures first tested on base-stations

*implies does not waste power – does not imply low power

Fact #4: Base-stations need flexibility*
  • Wireless systems are continuously evolving
    • New algorithms designed and evaluated
    • allow upgrading, co-existing, minimize design time, reuse
  • Flexibility needed for power-efficiency
    • Base-stations rarely operate at full capacity
    • Varying users, data rates, spreading, modulation, coding
    • Adapt resources to needs

*how much flexibility? – as flexible as possible

Fact #5: Current base-stations not flexible / not power-efficient

[Block diagram of a current base-station: RF (analog); ‘chip rate’ processing on ASIC(s) and/or ASSP(s) and/or FPGA(s); ‘symbol rate’ processing on DSP(s); decoding on co-processor(s) and/or ASIC(s); control and protocol on a DSP or RISC processor.]

Change implies re-partitioning algorithms, designing new hardware

Design done for the worst case – no adaptation with workload

Source: [Baines2003]

Thesis addresses the following problem
  • design a base-station
  • supports 100’s of ALUs
  • power-efficient (adapts resources to needs)
  • as flexible as possible
  • How many ALUs at what clock frequency?
    • HYPOTHESIS:
  • Programmable* processors for wireless base-stations

*how programmable? – as programmable as possible

Programmable processors
  • No processor optimization for specific algorithm
    • As programmable as possible
    • Example: no instruction for Viterbi decoding
    • FPGAs, ASICs, ASIPs etc. not considered
  • Use characteristics of wireless systems
    • precision, parallelism, operations, …
    • MMX extensions for multimedia
Single processors won’t do

(1) Find ways for increasing clock frequency

  • C64x DSP: 600 MHz – 720 MHz – 1 GHz – 100 GHz?
  • Easiest solution, but there are physical limits to scaling f
  • Not good for power, given the cubic dependence of power on f

(2) Increasing ALUs

  • Limited instruction level parallelism (ILP,MMX)
  • Register file area, ports explosion
  • Compiler issues in extracting more ILP

(3) Multiprocessors
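
The cubic dependence noted under option (1) is consistent with the standard CMOS dynamic power model when the supply voltage has to rise roughly in proportion to the clock frequency. The slides do not state the model, so the symbols below (activity factor α, switched capacitance C, supply voltage V) are assumptions used only for illustration.

```latex
\[
P_{\text{dyn}} = \alpha\, C\, V^{2} f,
\qquad
V \propto f
\;\Longrightarrow\;
P_{\text{dyn}} \propto f^{3}
\]
```

Under this model, doubling the clock frequency of a single processor costs about eight times the dynamic power, which is why option (1) scales poorly.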

Related work - Multiprocessors

[Figure: taxonomy of multiprocessors with three branches: reconfigurable* processors; MIMD (Multiple Instructions Multiple Data), control parallel; and SIMD (Single Instruction Multiple Data), data parallel. Sub-categories shown: single chip, multi-chip, chip multiprocessor (CMP), multi-threading (MT), array, vector, stream, clustered VLIW. One branch is annotated "Cannot scale to support 100’s of arithmetic units". Examples shown: RAW, Chameleon, picoChip, TI TMS320C40 DSP, Sundance, Cm*, ClearSpeed, MasPar, Illiac-IV, BSP, CODE, Vector IRAM, Cray 1, Sandbridge SandBlaster DSP, Cray MTA, TI TMS320C8x DSP, Imagine, Sun MAJC, TI TMS320C6x DSP, Hydra, Motorola RSVP, PowerPC RS64IV, Multiflow TRACE, IBM Power4, Alpha 21464, Alpha 21264.]

*Reconfigurable processor uses reconfiguration for execution time benefits

Challenges in proving hypothesis
  • Architecture choice for design exploration
    • SIMD generally more programmable* than reconfigurable
    • Compiler, simulators, tools and support play a major role
  • Benchmark workloads need to be designed
    • Previously done as ASICs, so none available
    • Not easy – finite precision, algorithms changing
  • Need detailed knowledge of wireless algorithms, architectures, mapping, compilers, design tools

*Programmable here refers to how easy the architecture is to use and to write code for

Architecture choice: Stream processors
  • State-of-the-art programmable media processors
    • Can scale to 1000’s of arithmetic units [Khailany 2003]
    • Wireless algorithms have similar characteristics
  • Cycle-accurate simulator with open-source code
  • Parameters such as ALUs, register files can be varied
  • Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …
  • Almost anything can be changed, some changes easier than others!
Thesis contributions
  • Mapping algorithms on stream processors
    • designing data-parallel algorithm versions
    • tradeoffs between packing, ALU utilization and memory
    • reduced inter-cluster communication network
  • Improve power efficiency in stream processors
    • adapting compute resources to workload variations
    • varying voltage and frequency to real-time requirements
  • Design exploration between #ALUs and clock frequency to minimize power consumption
    • fast real-time performance prediction
Outline
  • Background
    • Wireless systems
    • Stream processors
  • Contribution #1 : Mapping
  • Contribution #2 : Power-efficiency
  • Contribution #3 : Design exploration
  • Broader impact and limitations
Wireless workloads : 2G (Basic)

[Block diagram: 2G physical layer signal processing. The received signal after DDC feeds, for each of users 1 to K, a sliding correlator, a code matched filter and a Viterbi decoder, whose outputs go to the MAC and network layers.]

32 users, 16 Kbps/user

Single-user algorithms (other users noise)

> 2 GOPs

3G Multiuser system

[Block diagram: 3G physical layer signal processing. The received signal after DDC feeds multiuser channel estimation and, for each of users 1 to K, a code matched filter. Multiuser detection uses parallel interference cancellation stages, followed by a Viterbi decoder per user and the MAC and network layers.]

32 users, 128 Kbps/user

Multi-user algorithms (cancels interference)

> 20 GOPs

4G MIMO system

[Block diagram: 4G physical layer signal processing with M antennas. For each user 1 to K and each antenna 1 to T, the received signal after DDC passes through channel estimation, chip level equalization and a code matched filter. Each user's antenna streams feed an LDPC decoder, followed by the MAC and network layers.]

32 users, 1 Mbps/user

Multiple antennas (higher spectral efficiency, higher data rates)

> 200 GOPs

Programmable processors

#define N 1024

int i, a[N], b[N], sum[N];       // 32 bits
short int c[N], d[N], diff[N];   // 16 bits, packed

for (i = 0; i < N; ++i) {
    sum[i] = a[i] + b[i];
    diff[i] = c[i] - d[i];
}

  • Instruction Level Parallelism (ILP) - DSP
  • Subword Parallelism (MMX) - DSP
  • Data Parallelism (DP) – Vector Processor
  • DP can decrease by increasing ILP and MMX
    • Example: loop unrolling

[Figure: the parallelism in the loop above, labeled DP, ILP and MMX.]
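
To make the loop-unrolling remark concrete, here is a minimal C sketch. The function names `add_rolled` and `add_unrolled4` and the unroll factor of 4 are illustrative assumptions, not taken from the slides. Unrolling by 4 places four independent additions in each iteration, which a VLIW scheduler could issue together (more ILP), while the number of independent iterations left to spread across data-parallel units drops from n to n/4.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: n independent iterations, so the available DP is n. */
void add_rolled(const int *a, const int *b, int *sum, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        sum[i] = a[i] + b[i];
}

/* Unrolled by 4 (illustrative factor): each iteration exposes four
 * independent additions (ILP = 4), leaving n / 4 iterations of DP.
 * n must be a multiple of 4 in this sketch. */
void add_unrolled4(const int *a, const int *b, int *sum, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        sum[i + 2] = a[i + 2] + b[i + 2];
        sum[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Both functions compute the same result. Only the split between parallelism inside an iteration and parallelism across iterations changes.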

Stream Processors : multi-cluster DSPs

[Figure: a VLIW DSP (1 cluster), with internal memory, a microcontroller, and adders (+) and multipliers (*) exploiting ILP and MMX, shown next to a stream processor, with a Stream Register File (SRF) as memory, a microcontroller, and several identical clusters of adders and multipliers that exploit ILP and MMX within each cluster and DP across clusters.]

  • Adapt clusters to DP
  • Identical clusters, same operations
  • Power-down unused FUs, clusters

Outline

Contribution #1

  • Mapping algorithms to stream processors (parallel, fixed pt)
  • Tradeoffs between packing, ALU utilization and memory
  • Reduced inter-cluster communication network
Packing
  • Packing introduced around 1996 for exploiting subword parallelism
    • Intel MMX
    • Subword parallelism never looked back
    • Integrated into all current microprocessors and DSPs
  • SIMD + MMX : Stream processor/vector IRAM : 2000 +
    • relatively new concept
  • Not necessarily useful in SIMD processors
    • May add to inter-cluster communication
Packing may not be useful

Algorithm:

short a[8];
int y[8];

for (i = 0; i < 8; ++i) {
    y[i] = a[i] * a[i];
}

[Figure: a holds the values 1 to 8 packed two per word as (1 2), (3 4), (5 6), (7 8). Multiplication produces p = 1 3 5 7 and q = 2 4 6 8. A data re-ordering step, an add, and a second re-ordering step bring the results back into the order p = 1 2 3 4, q = 5 6 7 8.]

Packing uses odd-even grouping
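
The effect can be sketched in plain C. This is a scalar emulation under stated assumptions, not the stream processor's actual instructions: the helper names (`pack2`, `unpack_lo`, `unpack_hi`, `square_packed`, `interleave`) are hypothetical, and the packing layout (first value in the low 16 bits) is chosen for illustration. Squaring packed 16-bit values yields 32-bit products, so the products of the low halves and of the high halves land in separate outputs (an odd-even grouped order) and must be interleaved to restore the original order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word: lo in bits 0-15, hi in bits 16-31. */
static uint32_t pack2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

static int16_t unpack_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t unpack_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Square every packed value. The 32-bit products no longer fit two per word:
 * low-half products go to p, high-half products go to q. */
void square_packed(const uint32_t *packed, size_t n_words, int32_t *p, int32_t *q)
{
    for (size_t i = 0; i < n_words; ++i) {
        int32_t lo = unpack_lo(packed[i]);
        int32_t hi = unpack_hi(packed[i]);
        p[i] = lo * lo;
        q[i] = hi * hi;
    }
}

/* Re-ordering step: interleave p and q back into the original element order. */
void interleave(const int32_t *p, const int32_t *q, size_t n_words, int32_t *y)
{
    for (size_t i = 0; i < n_words; ++i) {
        y[2 * i]     = p[i];
        y[2 * i + 1] = q[i];
    }
}
```

With the values 1 to 8 packed pairwise, `square_packed` gives p = 1 9 25 49 and q = 4 16 36 64. On a multi-cluster machine the `interleave` step is what turns into inter-cluster communication.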

Data re-ordering in memory
  • Matrix transpose
    • Common in wireless communication systems
    • Column access to data expensive
  • Re-ordering data inside the ALUs
    • Faster
    • Lower power
Trade-offs during memory re-ordering

[Figure: three execution timelines, (a), (b) and (c), in which two ALU phases t1 and t2 are separated by a transpose. Transpose in memory, partly hidden behind computation: t = t2 + t_stalls, with 0 < t_stalls < t_mem. Transpose in memory, fully overlapped: t = t2. Transpose in the ALUs: t = t2 + t_alu.]

Transpose uses odd-even grouping

[Figure: IN is an N x M array holding A B C D and 1 2 3 4; OUT holds A 1 B 2 and C 3 D 4, with M/2 marked.]

Repeat LOG(M) times
{
    OUT = odd-even grouping of IN;
    IN = OUT;
}
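
A minimal C sketch of a transpose built from repeated odd-even grouping follows. It assumes the grouping pass puts even-indexed elements first and odd-indexed elements second (0 1 2 3 4 5 6 7 becomes 0 2 4 6 1 3 5 7), that the matrix is stored row-major, and that the number of columns is a power of two. The function names, the pass direction, and the choice of which dimension must be a power of two are assumptions for illustration; the slide's figure may apply the grouping in the opposite direction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One odd-even grouping pass: even-indexed elements first, then odd-indexed.
 * len must be even. */
void odd_even_group(const int *in, int *out, size_t len)
{
    assert(len % 2 == 0);
    for (size_t i = 0; i < len / 2; ++i) {
        out[i]           = in[2 * i];
        out[len / 2 + i] = in[2 * i + 1];
    }
}

/* Transpose a rows x cols row-major matrix held in data, using scratch
 * (same length) as the OUT buffer. cols must be a power of two.
 * After log2(cols) passes, data holds the cols x rows transpose. */
void transpose_by_grouping(int *data, int *scratch, size_t rows, size_t cols)
{
    assert(cols != 0 && (cols & (cols - 1)) == 0);
    size_t len = rows * cols;
    for (size_t c = cols; c > 1; c /= 2) {
        odd_even_group(data, scratch, len);
        memcpy(data, scratch, len * sizeof *data);   /* IN = OUT */
    }
}
```

For a 2 x 4 matrix holding 0 to 7, two passes give 0 4 1 5 2 6 3 7, the row-major layout of the 4 x 2 transpose. Each pass uses the same fixed permutation, so the interconnect only has to support that one pattern.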

ALU Bandwidth > Memory Bandwidth

[Plot: execution time (cycles, log scale) versus matrix size (32x32, 64x64, 128x128) for three cases: transpose in memory (t_mem) with DRAM at 8 cycles, transpose in memory (t_mem) with DRAM at 3 cycles, and transpose in ALU (t_alu).]

Viterbi needs odd-even grouping

[Figure: the 16 trellis states X(0) to X(15) laid out along the DP vector for regular ACS and for ACS in SWAPs. In the re-ordered layout the even-indexed states X(0), X(2), ..., X(14) are followed by the odd-indexed states X(1), X(3), ..., X(15).]

Exploiting Viterbi DP in SWAPs:

  • Use Register exchange (RE) instead of regular traceback
  • Re-order ACS, RE
Performance of Viterbi decoding

[Plot: frequency needed to attain real-time (in MHz, log scale from 1 to 1000) versus number of clusters (log scale from 1 to 100) for constraint lengths K = 5, K = 7 and K = 9, with a DSP reference and the maximum-DP point marked.]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

Pattern in inter-cluster comm
  • Broadcasting
    • Matrix-vector multiplication, matrix-matrix multiplication, outer product updates
  • Odd-even grouping
    • Transpose, Packing, Viterbi decoding
Odd-even grouping

[Figure: 4 clusters holding data elements 0/4, 1/5, 2/6 and 3/7. The inter-cluster communication network re-orders the data 0 1 2 3 4 5 6 7 into 0 2 4 6 1 3 5 7.]

Inter-cluster communication

  • Entire chip length
  • Limits clock frequency
  • Limits scaling

O(C^2) wires, O(C^2) interconnections, 8 cycles

A reduced inter-cluster comm network

[Figure: 4 clusters holding data elements 0/4, 1/5, 2/6 and 3/7, connected through a multiplexer, registers (pipelining) and a demultiplexer, with broadcasting support and odd-even grouping marked.]

O(C log(C)) wires, O(C) interconnections, 8 cycles

Only nearest neighbor interconnections

Outline

Contribution #2 : Power-efficiency

High performance is low power

- Mark Horowitz

Flexibility needed in workloads

[Bar chart: operation count (in GOPs, 0 to 25) for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user) at (users, constraint length) = (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9).]

Note: GOPs refer only to arithmetic computations

Billions of computations per second needed

Workload variation from ~1 GOPs for 4 users, constraint length 7 Viterbi, to ~23 GOPs for 32 users, constraint length 9 Viterbi

Flexibility affects Data Parallelism*

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

U - Users, K - constraint length,

N - spreading gain, R - decoding rate

Adapting #clusters to Data Parallelism

[Figure: four clusters (C) connected through an adaptive multiplexer network, shown in four configurations: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off.]

Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation
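
One way to read the four configurations is as a simple selection rule: keep the largest power-of-two number of clusters that the available data parallelism can fill, and gate off the rest. The C sketch below encodes that reading. The function name `active_clusters` and the power-of-two restriction are assumptions for illustration; the slides do not give the actual policy.

```c
#include <assert.h>

/* Number of clusters to keep powered for a kernel with data parallelism dp
 * on a machine with total_clusters clusters (a power of two in this sketch).
 * Returns the largest power of two that is <= min(dp, total_clusters),
 * or 0 when dp is 0 (all clusters off). */
unsigned active_clusters(unsigned dp, unsigned total_clusters)
{
    assert(total_clusters != 0 &&
           (total_clusters & (total_clusters - 1)) == 0);
    unsigned limit = dp < total_clusters ? dp : total_clusters;
    unsigned n = total_clusters;
    while (n > limit)
        n /= 2;          /* halve until the clusters can all be kept busy */
    return n;
}
```

For a 4-cluster machine this gives 4 clusters for DP of 4 or more (no reconfiguration), 2 for DP of 2 or 3 (4:2), 1 for DP of 1 (4:1), and 0 when there is nothing to run.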

Cluster utilization variation

[Plots: cluster utilization (0 to 100) versus cluster index (0 to 30) in four panels, for workloads (4,7) and (4,9), (8,7) and (8,9), (16,7) and (16,9), and (32,7) and (32,9).]

Cluster utilization variation on a 32-cluster processor

(32, 9) = 32 users, constraint length 9 Viterbi

Frequency variation

[Bar chart: real-time frequency (in MHz, 0 to 1200), broken down into busy, microcontroller (uC) stall and memory stall components, for (users, constraint length) = (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9).]

Operation
  • Dynamic Voltage-Frequency scaling when system changes significantly
    • Users, data rates …
    • Coarse time scale (every few seconds)
  • Turn off clusters
    • when parallelism changes significantly
    • Memory operations
    • Exceed real-time requirements
    • Finer time scales (100’s of microseconds)
Power : Voltage Gating & Scaling

Power can change from 12.38 W to 300 mW depending on workload changes

Outline

Contribution #3 : Design exploration

  • How many adders, multipliers, clusters, clock frequency
  • Quickly predict real-time performance
Deciding ALUs vs. clock frequency
  • No independent variables
    • Clusters, ALUs (adders, multipliers), frequency, voltage (c, a, m, f)
    • Trade-offs exist
  • How to find the right combination for lowest power?
Static design exploration

[Figure: execution time split into a static part (computations) and a dynamic part (memory stalls, microcontroller stalls).]

Also helps in quickly predicting real-time performance

Sensitivity analysis important
  • We have a capacitance model [Khailany2003]
  • The equations are not exact
    • Need to see how variations affect solutions
Design exploration methodology
  • 3 types of parallelism: ILP, MMX, DP
  • For best performance (power)
    • Maximize the use of all
  • Maximize ILP and MMX at expense of DP
    • Loop unrolling, packing
    • Schedule on sufficient number of adders/multipliers
  • If DP remains, use clusters = DP
    • No other way to exploit that parallelism
Setting clusters, adders, multipliers
  • If sufficient DP, linear decrease in frequency with clusters
    • Set clusters depending on DP and execution time estimate
  • To find adders and multipliers,
    • Let compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find execution time
  • Put all numbers in power equation
    • Compare increase in capacitance due to added ALUs and clusters with benefits in execution time
  • Choose the solution that minimizes the power
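
The last two steps can be sketched as a small search over candidate configurations. This is a sketch under stated assumptions, not the thesis tool: the `struct candidate` layout, the function names, and the relative power model (capacitance times f to the power p, with p = 1, 2 or 3 depending on how voltage scales with frequency) are introduced here for illustration. The capacitance values would come from a capacitance model and the frequencies from the compiler's execution-time estimates.

```c
#include <assert.h>
#include <stddef.h>

struct candidate {
    unsigned clusters;     /* number of clusters in this configuration */
    double   capacitance;  /* relative switched capacitance (hypothetical units) */
    double   freq_mhz;     /* real-time frequency needed by this configuration */
};

/* Relative power = capacitance * f^p, for p = 1, 2 or 3. */
double relative_power(const struct candidate *c, int p)
{
    double fp = 1.0;
    for (int i = 0; i < p; ++i)
        fp *= c->freq_mhz;
    return c->capacitance * fp;
}

/* Index of the candidate with the lowest relative power (first one on ties). */
size_t min_power_index(const struct candidate *cands, size_t n, int p)
{
    assert(n > 0 && p >= 1 && p <= 3);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (relative_power(&cands[i], p) < relative_power(&cands[best], p))
            best = i;
    return best;
}
```

The exponent p matters: a steeper dependence of power on frequency rewards configurations with more clusters and a lower clock, even though they carry more capacitance.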
Design exploration

For sufficiently large #adders, #multipliers per cluster:

  • Explore Algorithm 1 : 32 clusters (t1)
  • Explore Algorithm 2 : 64 clusters (t2)
  • Explore Algorithm 3 : 64 clusters (t3)
  • Explore Algorithm 4 : 16 clusters (t4)

[Figure labels: DP, ILP]

Clusters: frequency and power

[Plots: real-time frequency (MHz, log scale) versus number of clusters (log scale), and normalized power (0 to 1) versus number of clusters (0 to 70) for three power models: Power ∝ f, Power ∝ f^2, Power ∝ f^3.]

32 clusters at frequency = 836.692 MHz (p = 1)

64 clusters at frequency = 543.444 MHz (p = 2)

64 clusters at frequency = 543.444 MHz (p = 3)

3G workload

ALU utilization with frequency

[3-D plot: real-time frequency (in MHz, about 500 to 1100) versus #adders (1 to 5) and #multipliers (1 to 3), each point labeled with FU utilization (+,*): (78,18), (78,27), (78,45), (64,31), (50,31), (65,46), (38,28), (51,42), (67,62), (32,28), (42,37), (33,34), (55,62), (43,56), (36,53).]

3G workload

Exploration results

*************************
Final Design Conclusion
*************************
Clusters : 64
Multipliers/cluster : 1    Utilization: 62%
Adders/cluster : 3    Utilization: 55%
Real-time frequency : 568.68 MHz
*************************

Exploration done with plots generated in seconds….

Outline

Broader impact and limitations

Broader impact
  • Results not specific to base-stations
    • High performance, low power system designs
  • Concepts can be extended to handsets
  • Mux network applicable to all SIMD processors
    • Power efficiency in scientific computing
  • Results #2, #3 applicable to all stream applications
    • Design and power efficiency
    • Multimedia, MPEG, …
Limitations

Don’t believe the model is the reality

(Proof is in the pudding)

  • Fabrication needed to verify concepts
    • Cycle accurate simulator
    • Extrapolating models for power
  • LDPC decoding (in progress)
    • Sparse matrix requires permutations over large data
    • Indexed SRF may help
  • 3G requires 1 GHz at 128 Kbps/user
    • 4G equalization at 1 Mbps breaks down (expected)
Conclusions
  • Road ends - conventional architectures [Agarwal2000]
  • Wide range of architectures – DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable +
    • Difficult to compare and contrast
    • Need new definitions that allow comparisons
  • Wireless workloads – SPECwireless standard needed
  • Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms is not easy in programmable architectures
    • my thesis lays the initial foundations
Alternate view of the CMP DSP

[Figure: a streaming memory system (L2 internal memory in banks 1 to C, with prefetch buffers) feeding clusters of C64x (cluster 0 to cluster C) that share an instruction decoder and an inter-cluster communication network.]

Adapting clusters using (1) memory transfers

[Figure: stream A, holding elements 0 to 15, is moved in two steps between memory and the SRF to form stream A', in which the elements are laid out across the clusters with the slots of unused clusters marked X.]

(2) Using conditional streams

[Figure: clusters with index 0 to 3 read data A B C D from a conditional buffer under a condition switch set to 1 1 0 0. Data received: A B - - on access 0 and C D - - on access 1.]

4-clusters reconfiguring to 2

Arithmetic clusters in stream processors

[Figure: an arithmetic cluster with adders (+) and multipliers (*) fed by distributed register files (supports more ALUs) through a cross point, with a connection from/to the SRF, a comm. unit attached to the intercluster network, and a scratchpad (indexed accesses).]

Programming model

kernel add(istream a, istream b, ostream sum)
{
    int inputA, inputB, output;
    loop_stream(a)
    {
        a >> inputA;                  // read one element from each input stream
        b >> inputB;
        output = inputA + inputB;
        sum << output;                // write the result to the output stream
    }
}

kernel sub(istream c, istream d, ostream diff)
{
    int inputC, inputD, output;
    loop_stream(c)
    {
        c >> inputC;
        d >> inputD;
        output = inputC - inputD;
        diff << output;
    }
}

stream a(1024);
stream b(1024);
stream sum(1024);
stream c(512);
stream d(512);
stream diff(512);

add(a,b,sum);
sub(c,d,diff);

Your new hardware won’t run your old software – Balch’s law

Stream processor programming

[Figure: kernels connected by streams, from input data to output data. The received signal feeds the correlator (channel estimation) and matched filter kernels, followed by interference cancellation and Viterbi decoding, which outputs the decoded bits.]

  • Kernels (computation) and streams (communication)
  • Use local data in clusters providing GOPs support
  • Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.

Parallel Viterbi Decoding
  • Add-Compare-Select (ACS) : trellis interconnect : computations
    • Parallelism depends on constraint length (#states)
  • Traceback: searching
    • Conventional
      • Sequential (No DP) with dynamic branching
      • Difficult to implement in parallel architecture
    • Use Register Exchange (RE)
      • parallel solution
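
A minimal C sketch of one trellis stage combining ACS with register exchange is shown below. It is an illustration under stated assumptions, not the thesis implementation: the function name, the state numbering (a state holds the last K-1 input bits with the newest bit in the most significant position), the branch-metric layout `bm[2*n + j]`, and the 32-bit survivor registers are all choices made here. With this numbering, next state n has predecessors 2n mod S and (2n mod S) + 1, so every next state can be computed independently, and no traceback search is needed because each state carries its own decoded bit history.

```c
#include <assert.h>
#include <stdint.h>

/* One trellis stage of Viterbi decoding with register exchange.
 * K       constraint length; S = 2^(K-1) states.
 * pm_in   path metric of each state before this stage (lower is better).
 * bm      branch metrics: bm[2*n + j] is the metric of the branch from
 *         the j-th predecessor of next state n into n.
 * reg_in  survivor register of each state (decoded bits so far).
 * pm_out, reg_out receive the updated metrics and survivor registers. */
void acs_register_exchange(unsigned K,
                           const unsigned *pm_in, const unsigned *bm,
                           const uint32_t *reg_in,
                           unsigned *pm_out, uint32_t *reg_out)
{
    assert(K >= 2 && K <= 16);
    unsigned S = 1u << (K - 1);
    for (unsigned n = 0; n < S; ++n) {        /* iterations are independent: DP = S */
        unsigned p0 = (2u * n) & (S - 1u);    /* even predecessor */
        unsigned p1 = p0 | 1u;                /* odd predecessor  */
        unsigned m0 = pm_in[p0] + bm[2u * n];          /* add */
        unsigned m1 = pm_in[p1] + bm[2u * n + 1u];
        unsigned sel = (m1 < m0) ? p1 : p0;            /* compare, select */
        unsigned bit = n >> (K - 2);          /* input bit that leads into state n */
        pm_out[n]  = (m1 < m0) ? m1 : m0;
        reg_out[n] = (reg_in[sel] << 1) | bit;         /* register exchange */
    }
}
```

After the last stage, the survivor register of the state with the smallest path metric holds the most recent decoded bits (up to 32 in this sketch). The even and odd predecessor pairs are where the odd-even grouping pattern comes from.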

[Figure: detected bits enter the ACS unit, followed by the traceback unit, which outputs the decoded bits.]