Programmable processors for wireless base stations
Sridhar Rajagopal ([email protected]), December 9, 2003


Fact #1: Wireless rates growing faster than clock rates

[Figure: log-scale plot, 1996–2006 — clock frequency (MHz) rising from 200 MHz to 4 GHz, W-LAN data rate (Mbps) from 1 Mbps to 54–100 Mbps, cellular data rate (Mbps) from 9.6 Kbps to 2–10 Mbps]

Need to process 100X more bits per clock cycle today than in 1996

Source: Intel, IEEE 802.11x, 3GPP


Fact #2: Base-stations need horsepower

[Figure: base-station receiver block diagram — RF section (LNA, RF RX, ADC, DDC, frequency-offset compensation, power measurement and gain control (AGC), power supply and control unit) feeding baseband processing (chip-level despreading, symbol detection, channel estimation, demodulation, decoding) and a network interface (packet/circuit switch, E1/T1 or packet network, BSC/RNC interface, symbol/packet control)]

Sophisticated signal processing for multiple users

Need 100-1000s of arithmetic operations to process 1 bit

Source: Texas Instruments


Need ~100 ALUs in base-stations

Example:

1000 arithmetic operations/bit, with 1 bit processed every 10 clock cycles

  • 100 arithmetic operations/clock cycle

    Base-stations need ~100 ALUs

  • irrespective of the type of (clocked) architecture


Fact #3: Base-stations need power-efficiency*

Wireless gets blacked out too

Trying to use your cell phone during the blackout was nearly impossible. What went wrong? — August 16, 2003, 8:58 AM EDT, by Paul R. La Monica, CNN/Money Senior Writer

Wireless systems getting denser

  • More base-stations per unit area

  • operational and maintenance costs

    Architectures first tested on base-stations

*implies does not waste power – does not imply low power


Fact #4: Base-stations need flexibility*

  • Wireless systems are continuously evolving

    • New algorithms designed and evaluated

    • allow upgrading, co-existing, minimize design time, reuse

  • Flexibility needed for power-efficiency

    • Base-stations rarely operate at full capacity

    • Varying users, data rates, spreading, modulation, coding

    • Adapt resources to needs

*how much flexibility? – as flexible as possible


Fact #5: Current base-stations are neither flexible nor power-efficient

[Figure: current base-station partitioning — 'chip rate' processing in ASIC(s), 'symbol rate' processing in DSP(s) and/or FPGA(s), decoding in ASIC(s), ASSP(s) and/or co-processor(s), control and protocol in a DSP or RISC processor, plus analog RF]

Change implies re-partitioning algorithms, designing new hardware

Design done for the worst case – no adaptation with workload

Source: [Baines2003]


Thesis addresses the following problem

  • Design a base-station that

    • supports 100's of ALUs

    • is power-efficient (adapts resources to needs)

    • is as flexible as possible

  • How many ALUs, at what clock frequency?

  • HYPOTHESIS: Programmable* processors for wireless base-stations

*how much programmable? – as programmable as possible


Programmable processors

  • No processor optimization for specific algorithm

    • As programmable as possible

    • Example: no instruction for Viterbi decoding

    • FPGAs, ASICs, ASIPs etc. not considered

  • Use characteristics of wireless systems

    • precision, parallelism, operations,..

    • MMX extensions for multimedia


Single processors won’t do

(1) Find ways for increasing clock frequency

  • C64x DSP: 600 – 720 – 1GHz – 100GHz?

  • Easiest solution but physical limits to scaling f

  • Not good for power, given cubic dependence with f

    (2) Increasing ALUs

  • Limited instruction level parallelism (ILP,MMX)

  • Register file area, ports explosion

  • Compiler issues in extracting more ILP

    (3) Multiprocessors


Related work - Multiprocessors

[Figure: taxonomy of multiprocessors — MIMD (Multiple Instructions, Multiple Data): multi-chip (e.g. TI TMS320C40 DSP, Sundance, Cm*), chip multiprocessor (CMP, e.g. TI TMS320C8x DSP, Sun MAJC, Hydra, IBM Power4), multi-threading (MT, e.g. Sandbridge SandBlaster DSP, Cray MTA, Alpha 21464), clustered VLIW (e.g. TI TMS320C6x DSP, Multiflow TRACE, PowerPC RS64IV, Alpha 21264); SIMD (Single Instruction, Multiple Data): array (e.g. ClearSpeed, MasPar, Illiac-IV), vector (e.g. Cray 1, Vector IRAM, BSP, CODE), stream (e.g. Imagine, Motorola RSVP); reconfigurable* processors (e.g. RAW, Chameleon, picoChip). Control processors cannot scale to support 100's of arithmetic units]

*Reconfigurable processor uses reconfiguration for execution time benefits


Challenges in proving hypothesis

  • Architecture choice for design exploration

    • SIMD generally more programmable* than reconfigurable

    • Compiler, simulators, tools and support play a major role

  • Benchmark workloads need to be designed

    • Previously done as ASICs, so none available

    • Not easy – finite precision, algorithms changing

  • Need detailed knowledge of wireless algorithms, architectures, mapping, compilers, design tools

*Programmable here refers to ease of use and of writing code


Architecture choice: Stream processors

  • State-of-the-art programmable media processors

    • Can scale to 1000’s of arithmetic units [Khailany 2003]

    • Wireless algorithms have similar characteristics

  • Cycle-accurate simulator with open-source code

  • Parameters such as ALUs, register files can be varied

  • Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …

  • Almost anything can be changed, some changes easier than others!


Thesis contributions

  • Mapping algorithms on stream processors

    • designing data-parallel algorithm versions

    • tradeoffs between packing, ALU utilization and memory

    • reduced inter-cluster communication network

  • Improve power efficiency in stream processors

    • adapting compute resources to workload variations

    • varying voltage and frequency to real-time requirements

  • Design exploration between #ALUs and clock frequency to minimize power consumption

    • fast real-time performance prediction


Outline

  • Background

    • Wireless systems

    • Stream processors

  • Contribution #1 : Mapping

  • Contribution #2 : Power-efficiency

  • Contribution #3 : Design exploration

  • Broader impact and limitations


Wireless workloads: 2G (Basic)

2G physical layer signal processing

[Figure: per-user chains (users 1…K) — received signal after DDC → matched filter → sliding correlator (code) → Viterbi decoder → MAC and network layers]

32 users, 16 Kbps/user

Single-user algorithms (other users treated as noise)

> 2 GOPs


3G Multiuser system

3G physical layer signal processing

[Figure: received signal after DDC → per-user code matched filters → parallel interference cancellation stages (multiuser detection) with multiuser channel estimation → per-user Viterbi decoders → MAC and network layers]

32 users, 128 Kbps/user

Multi-user algorithms (cancel interference)

> 20 GOPs


4G MIMO system

[Figure: 4G physical layer signal processing, M antennas — received signal after DDC → per-user, per-antenna chip-level equalization and code matched filters with channel estimation → per-user LDPC decoders → MAC and network layers]

32 users, 1 Mbps/user

Multiple antennas (higher spectral efficiency, higher data rates)

> 200 GOPs


Programmable processors

int i, a[N], b[N], sum[N]; // 32 bits

short int c[N], d[N], diff[N]; // 16 bits, packed

for (i = 0; i < 1024; ++i) {

sum[i] = a[i] + b[i];

diff[i] = c[i] - d[i];

}

Instruction Level Parallelism (ILP) - DSP

Subword Parallelism (MMX) - DSP

Data Parallelism (DP) – Vector Processor

  • DP can decrease by increasing ILP and MMX

    – Example: loop unrolling

[Figure: the loop exposes DP across iterations, ILP between the two statements, and MMX subword parallelism within the 16-bit subtraction]


Stream Processors: multi-cluster DSPs

[Figure: stream processor organization — a VLIW DSP is one cluster (adders +, multipliers *, internal memory, micro-controller), exploiting ILP and MMX; a stream processor replicates identical clusters (DP) behind a micro-controller and a Stream Register File (SRF) memory]

adapt clusters to DP

Identical clusters, same operations.

Power-down unused FUs, clusters


Outline

Contribution #1

  • Mapping algorithms to stream processors (parallel, fixed pt)

  • Tradeoffs between packing, ALU utilization and memory

  • Reduced inter-cluster communication network


Packing

  • Packing introduced around 1996 for exploiting subword parallelism

    • Intel MMX

    • Subword parallelism never looked back

    • Integrated into all current microprocessors and DSPs

  • SIMD + MMX : Stream processor/vector IRAM : 2000 +

    • relatively new concept

  • Not necessarily useful in SIMD processors

    • May add to inter-cluster communication


Packing may not be useful

[Figure: packed multiplication example — a = [1 2 3 4 5 6 7 8] is re-ordered into halves p = [1 3 5 7] and q = [2 4 6 8] for the packed multiply, partial results are added, and the data is re-ordered again at the end]

Algorithm:

short a[8];

int y[8];

for (i = 0; i < 8; ++i)

y[i] = a[i] * a[i];

Packing uses odd-even grouping


Data re-ordering in memory

  • Matrix transpose

    • Common in wireless communication systems

    • Column access to data expensive

  • Re-ordering data inside the ALUs

    • Faster

    • Lower power


Trade-offs during memory re-ordering

[Figure: three transpose schedules between ALU phases t1 and t2 — (a) transpose in memory: t = t1 + t_mem + t2; (b) transpose overlapped with computation: t = t1 + t_stalls + t2, with 0 < t_stalls < t_mem; (c) transpose inside the ALUs: t = t1 + t_alu + t2]


Transpose uses odd-even grouping

[Figure: M×N matrix transpose via odd-even grouping — rows A, B, C, D (elements 1–4 each) interleaved and regrouped on each pass, with M/2 groups per pass]

Repeat LOG(M) times {

OUT = odd-even-group(IN);

IN = OUT;

}


ALU Bandwidth > Memory Bandwidth

[Figure: execution time (cycles, log scale) vs. matrix size (32x32, 64x64, 128x128) — transpose in memory (t_mem) with 8-cycle DRAM, transpose in memory (t_mem) with 3-cycle DRAM, and transpose in the ALUs (t_alu); the ALU transpose is fastest]


Viterbi needs odd-even grouping

[Figure: Viterbi add-compare-select (ACS) data flow for 16 states — regular ACS ordering X(0)…X(15) vs. ACS in SWAPs with odd-even re-ordered states (X(0), X(2), …, X(14), then X(1), X(3), …, X(15)), exposing the DP vector across states]

Exploiting Viterbi DP in SWAPs:

  • Use Register exchange (RE) instead of regular traceback

  • Re-order ACS, RE


Performance of Viterbi decoding

[Figure: frequency needed to attain real-time (MHz, log scale) vs. number of clusters, for constraint lengths K = 5, 7, 9 — frequency falls as clusters are added, up to the maximum DP; DSP point shown for reference]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time


Pattern in inter-cluster comm

  • Broadcasting

    • Matrix-vector multiplication, matrix-matrix multiplication, outer product updates

  • Odd-even grouping

    • Transpose, Packing, Viterbi decoding


Odd-even grouping

[Figure: odd-even grouping on 4 clusters (data 0/4, 1/5, 2/6, 3/7; sequence 0 1 2 3 4 5 6 7 re-ordered to 0 2 4 6 1 3 5 7) — inter-cluster communication spans the entire chip length, limiting clock frequency and scaling]

O(C^2) wires, O(C^2) interconnections, 8 cycles


A reduced inter-cluster comm network

[Figure: reduced network on 4 clusters (data 0/4, 1/5, 2/6, 3/7) — multiplexers (broadcasting support), pipeline registers, and demultiplexers (odd-even grouping)]

O(C log(C)) wires, O(C) interconnections, 8 cycles

only nearest-neighbor interconnections


Outline

Contribution #2 : Power-efficiency

High performance is low power

- Mark Horowitz


Flexibility needed in workloads

[Figure: operation count (GOPs, 0–25) for 2G (16 Kbps/user) and 3G (128 Kbps/user) base-stations vs. (users, constraint length), from (4,7) to (32,9)]

Note: GOPs refer only to arithmetic computations.

Billions of computations per second needed

Workload varies from ~1 GOPs (4 users, constraint length 7 Viterbi) to ~23 GOPs (32 users, constraint length 9 Viterbi)


Flexibility affects Data Parallelism*

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

U - Users, K - constraint length,

N - spreading gain, R - decoding rate


Adapting #clusters to Data Parallelism

[Figure: cluster configurations — no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off; unused clusters turned off using voltage gating to eliminate static and dynamic power dissipation; an adaptive multiplexer network routes data to the active clusters]


Cluster utilization variation

[Figure: cluster utilization (%) vs. cluster index (0–30) for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9) — utilization concentrated in the first clusters, idle beyond the available DP]

Cluster utilization variation on a 32-cluster processor

(32, 9) = 32 users, constraint length 9 Viterbi


Frequency variation

[Figure: real-time frequency (MHz, 0–1200), broken into busy, microcontroller-stall, and memory-stall components, for workloads (4,7) through (32,9)]


Operation

  • Dynamic Voltage-Frequency scaling when system changes significantly

    • Users, data rates …

    • Coarse time scale (every few seconds)

  • Turn off clusters

    • when parallelism changes significantly

    • Memory operations

    • Exceed real-time requirements

    • Finer time scales (100’s of microseconds)


Power: Voltage Gating & Scaling

Power can change from 12.38 W to 300 mW

depending on workload changes


Outline

Contribution #3 : Design exploration

  • How many adders, multipliers, clusters, clock frequency

  • Quickly predict real-time performance


Deciding ALUs vs. clock frequency

  • Variables not independent

    • Clusters, ALUs, frequency, voltage (c, a, m, f)

    • Trade-offs exist

  • How to find the right combination for lowest power?


Static design exploration

Execution time = static part (computations) + dynamic part (memory stalls, microcontroller stalls)

also helps in quickly predicting real-time performance


Sensitivity analysis important

  • We have a capacitance model [Khailany2003]

  • Not all equations are exact

    • Need to see how variations affect solutions


Design exploration methodology

  • 3 types of parallelism: ILP, MMX, DP

  • For best performance (power)

    • Maximize the use of all

  • Maximize ILP and MMX at expense of DP

    • Loop unrolling, packing

    • Schedule on sufficient number of adders/multipliers

  • If DP remains, use clusters = DP

    • No other way to exploit that parallelism


Setting clusters adders multipliers
Setting clusters, adders, multipliers power-efficient

  • If sufficient DP, linear decrease in frequency with clusters

    • Set clusters depending on DP and execution time estimate

  • To find adders and multipliers,

    • Let compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find execution time

  • Put all numbers in power equation

    • Compare increase in capacitance due to added ALUs and clusters with benefits in execution time

  • Choose the solution that minimizes the power


Design exploration

For sufficiently large

#adders, #multipliers per cluster

Explore Algorithm 1 : 32 clusters (t1)

Explore Algorithm 2 : 64 clusters (t2)

Explore Algorithm 3 : 64 clusters (t3)

Explore Algorithm 4 : 16 clusters (t4)



Clusters: frequency and power

[Figure: real-time frequency (MHz, log scale) and normalized power vs. number of clusters, for power ∝ f, power ∝ f^2, and power ∝ f^3 models — frequency falls with clusters; the power-minimizing cluster count depends on the exponent]

32 clusters at frequency = 836.692 MHz (p = 1)

64 clusters at frequency = 543.444 MHz (p = 2)

64 clusters at frequency = 543.444 MHz (p = 3)

3G workload


ALU utilization with frequency

[Figure: real-time frequency (MHz) vs. #adders (1–5) and #multipliers (1–3) per cluster, annotated with FU utilization pairs (+,*) from (78,18) near 1100 MHz down to (36,53) near 500 MHz]

3G workload


Power variations with f and



Exploration results

*************************

Final Design Conclusion

*************************

Clusters: 64

Multipliers/cluster: 1 (utilization: 62%)

Adders/cluster: 3 (utilization: 55%)

Real-time frequency: 568.68 MHz

*************************

Exploration done with plots generated in seconds…


Outline

Broader impact and limitations


Broader impact

  • Results not specific to base-stations

    • High performance, low power system designs

  • Concepts can be extended to handsets

  • Mux network applicable to all SIMD processors

    • Power efficiency in scientific computing

  • Results #2, #3 applicable to all stream applications

    • Design and power efficiency

    • Multimedia, MPEG, …


Limitations

Don’t believe the model is the reality

(Proof is in the pudding)

  • Fabrication needed to verify concepts

    • Cycle accurate simulator

    • Extrapolating models for power

  • LDPC decoding (in progress)

    • Sparse matrix requires permutations over large data

    • Indexed SRF may help

  • 3G requires 1 GHz at 128 Kbps/user

    • 4G equalization at 1 Mbps breaks down (expected)


Conclusions

  • Road ends for conventional architectures [Agarwal2000]

  • Wide range of architectures – DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable +

    • Difficult to compare and contrast

    • Need new definitions that allow comparisons

  • Wireless workloads – a SPECwireless standard needed

  • Utilizing 100-1000s of ALUs/clock cycle and mapping algorithms is not easy in programmable architectures

    • my thesis lays the initial foundations



Alternate view of the CMP DSP [from the proposal]

[Figure: streaming memory system (L2 internal memory, banks 1…C, prefetch buffers) feeding clusters of C64x-style units (clusters 0, 1, …, C) with an instruction decoder and an inter-cluster communication network]


Adapting clusters using (1) memory transfers [from the proposal]

[Figure: adapting clusters via memory transfers — step 1: stream A (elements 0–15 across the clusters) is written to memory through the SRF; step 2: it is re-read as stream A' into the active clusters, with unused cluster slots (X) skipped]


(2) Using conditional streams [from the proposal]

[Figure: conditional streams — cluster indices 0–3 hold data A B C D in a conditional buffer; condition bits 1 1 0 0 drive a condition switch so that access 0 receives A B and access 1 receives C D: 4 clusters reconfiguring to 2]


Arithmetic clusters in stream processors [from the proposal]

[Figure: arithmetic cluster — distributed register files (supports more ALUs) feeding adders (+), multipliers (*), and a divider (/) through a cross-point network; a comm. unit to the intercluster network and a scratchpad (indexed accesses); data from/to the SRF]


Programming model [from the proposal]

kernel add(istream<int> a, istream<int> b, ostream<int> sum)

{

int inputA, inputB, output;

loop_stream(a)

{

a >> inputA;

b >> inputB;

output = inputA + inputB;

sum << output;

}

}

kernel sub(istream<half2> c, istream<half2> d, ostream<half2> diff)

{

int inputC, inputD, output;

loop_stream(c)

{

c >> inputC;

d >> inputD;

output = inputC - inputD;

diff << output;

}

}

stream<int> a(1024);

stream<int> b(1024);

stream<int> sum(1024);

stream<half2> c(512);

stream<half2> d(512);

stream<half2> diff(512);

add(a, b, sum);

sub(c, d, diff);

Your new hardware won’t run your old software – Balch’s law


Stream processor programming [from the proposal]

[Figure: stream program for the receiver — kernels (correlator, channel estimation, matched filter, interference cancellation, Viterbi decoding) connected by streams, from the received-signal input to the decoded-bits output]

  • Kernels (computation) and streams (communication)

  • Use local data in clusters providing GOPs support

  • Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.


Parallel Viterbi Decoding [from the proposal]

  • Add-Compare-Select (ACS) : trellis interconnect : computations

    • Parallelism depends on constraint length (#states)

  • Traceback: searching

    • Conventional

      • Sequential (No DP) with dynamic branching

      • Difficult to implement in parallel architecture

    • Use Register Exchange (RE)

      • parallel solution

[Figure: detected bits → ACS unit → traceback unit → decoded bits]

