Programmable Processors for Wireless Basestations
Sridhar Rajagopal ([email protected])
December 9, 2003

Fact #1: Wireless data rates are rising faster than clock rates.
[Figure: clock frequency (MHz, log scale) and WLAN/cellular data rates (Mbps, log scale) vs. year, 1996-2006. Clock frequency grew from 200 MHz to 4 GHz, WLAN data rates from 1 Mbps to 54-100 Mbps, and cellular data rates from 9.6 Kbps to 2-10 Mbps.]

Need to process 100X more bits per clock cycle today than in 1996.

Source: Intel, IEEE 802.11x, 3GPP
Fact #2: Sophisticated signal processing for multiple users.

[Figure: basestation receiver block diagram. An RF front-end (LNA, RF RX, ADC, DDC, frequency offset compensation, power measurement and gain control (AGC), power supply and control unit) feeds baseband processing -- chip-level despreading and demodulation, symbol detection, channel estimation, and decoding -- which feeds symbol-level control and a packet/circuit-switch network interface (E1/T1 or packet interface to the BSC/RNC).]

Need 100-1000s of arithmetic operations to process 1 bit.

Source: Texas Instruments
Example: at 1000 arithmetic operations per bit and a throughput of 1 bit every 10 clock cycles, a basestation needs 1000 / 10 = 100 ALUs working every cycle.
Fact #3: Wireless gets blacked out too.

"Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" -- Paul R. La Monica, CNN/Money Senior Writer, August 16, 2003, 8:58 AM EDT
Wireless systems are getting denser, and new architectures are first tested on basestations.

*Power-efficient implies "does not waste power" -- it does not imply low power.
*How much flexibility? As flexible as possible.
Current basestation architectures

[Figure: partitioned basestation signal chain. RF (analog), then 'chip rate' processing in ASIC(s), then 'symbol rate' processing in DSP(s) and/or ASSP(s) and/or FPGA(s), then decoding in a DSP or coprocessor(s) and/or ASIC(s), with control and protocol handled by a RISC processor.]

- Any change implies repartitioning algorithms and designing new hardware.
- Design is done for the worst case -- no adaptation with workload.

Source: [Baines2003]

*How programmable? As programmable as possible.
Options for scaling performance:
(1) find ways to increase clock frequency,
(2) increase the number of ALUs,
(3) use multiprocessors.
Multiprocessors

[Figure: taxonomy of parallel processors.]
- Control-parallel: MIMD (Multiple Instructions, Multiple Data)
  - Multichip: Cm*, Sundance (TI TMS320C40 DSPs)
  - Single chip:
    - Chip multiprocessor (CMP): Hydra, Sun MAJC, IBM Power4, PowerPC RS64-IV
    - Multithreading (MT): Cray MTA, Alpha 21464
    - Clustered VLIW: Multiflow TRACE, TI TMS320C6x DSP, Sandbridge SandBlaster DSP, Alpha 21264
- Data-parallel: SIMD (Single Instruction, Multiple Data)
  - Array: IlliacIV, MasPar, ClearSpeed
  - Vector: Cray 1, BSP, Vector IRAM, CODE
  - Stream: Imagine, Motorola RSVP, TI TMS320C8x DSP
- Reconfigurable*: RAW, Chameleon, picoChip

Cannot scale to support 100's of arithmetic units.

*A reconfigurable processor uses reconfiguration for execution-time benefits.
*Programmable here refers to ease of use and of writing code.
2G physical layer signal processing

[Figure: 2G receiver. The received signal after DDC is processed per user (User 1 ... User K) by a code matched filter and sliding correlator followed by a Viterbi decoder, then passed to the MAC and network layers.]

- 32 users at 16 Kbps/user
- Single-user algorithms (other users treated as noise)
- > 2 GOPs
3G physical layer signal processing

[Figure: 3G receiver with multiuser detection. The received signal after DDC is processed per user (User 1 ... User K) by a code matched filter, then by parallel interference cancellation stages and multiuser channel estimation, followed by per-user Viterbi decoders, and passed to the MAC and network layers.]

- 32 users at 128 Kbps/user
- Multiuser algorithms (cancel interference)
- > 20 GOPs
4G physical layer signal processing

[Figure: 4G receiver with M antennas. For each user (User 1 ... User K) and each antenna (1 ... T), the received signal after DDC passes through chip-level equalization, a code matched filter, and channel estimation; the per-antenna outputs are combined per user and fed to an LDPC decoder, then to the MAC and network layers.]

- 32 users at 1 Mbps/user
- Multiple antennas (higher spectral efficiency, higher data rates)
- > 200 GOPs
int i, a[N], b[N], sum[N];       // 32-bit
short int c[N], d[N], diff[N];   // 16-bit, packed
for (i = 0; i < 1024; ++i) {
    sum[i]  = a[i] + b[i];
    diff[i] = c[i] - d[i];
}
- Instruction Level Parallelism (ILP) -- DSP
- Subword Parallelism (MMX) -- DSP
- Data Parallelism (DP) -- vector processor (example: loop unrolling)
[Figure: from a VLIW DSP to a stream processor. A VLIW DSP (1 cluster) pairs a microcontroller and internal memory with a few adders and multipliers, exploiting ILP and subword (MMX-style) parallelism. A stream processor replicates that cluster many times: one microcontroller drives identical clusters in lockstep (DP), all fed from a stream register file (SRF).]

- Adapt the number of clusters to the available DP: identical clusters, same operations.
- Power down unused functional units and clusters.
Contribution #1

[Figure: subword multiplication needs data reordering. For code such as

    short a[8];
    int y[8];
    for (i = 1; i < 8; ++i)
        y[i] = a[i] * a[i];

the packed 16-bit vector a = (1 2 | 3 4 | 5 6 | 7 8) is first reordered into streams p = (1 3 5 7) and q = (2 4 6 8), so each 16x16 multiply can produce a full 32-bit product; the add step then operates on the regrouped pairs.]

Packing uses odd-even grouping.
[Figure: three schedules for a transpose, total time t = t_alu + t_stalls. (a) Transpose in memory: the ALUs stall for the full memory transpose time (t_stalls = t_mem). (b) Overlapped: part of the memory transpose is hidden behind ALU work (0 < t_stalls < t_mem). (c) Transpose in the ALUs: no memory transpose and no stalls.]

[Figure: transposing in the ALUs by repeated exchanges. In each pass the input stream of M elements (A B C D ...) is split and interleaved; the exchange is repeated -- Repeat LOG(M) times { IN = OUT; } -- until the data is transposed.]

[Figure: execution time (cycles, log scale) vs. matrix size (32x32, 64x64, 128x128) for transpose in memory (t_mem) with an 8-cycle DRAM, transpose in memory (t_mem) with a 3-cycle DRAM, and transpose in the ALUs (t_alu).]
ACS in SWAPs

[Figure: trellis state ordering for add-compare-select (ACS). Regular ACS visits states X(0), X(1), ..., X(15) in natural order; the data-parallel (vector) version regroups them by odd-even grouping -- X(0), X(2), ..., X(14) followed by X(1), X(3), ..., X(15) -- so that compare-select pairs line up across clusters.]
Exploiting Viterbi DP in SWAPs:

[Figure: clock frequency needed to attain real time (MHz, log scale, 1-1000) vs. number of clusters (log scale, 1-100) for constraint lengths K = 5, 7, and 9, with a DSP reference point and the maximum available DP marked.]

An ideal C64x (without coprocessors) needs ~200 MHz for real time.
Intercluster communication

[Figure: full crossbar between 4 clusters (data 0-7 mapped as 0/4, 1/5, 2/6, 3/7). Wires span the entire chip length, limiting clock frequency and scaling: O(C^2) wires, O(C^2) interconnections, 8 cycles to produce the odd-even ordering 0 2 4 6 1 3 5 7.]

[Figure: odd-even grouping interconnect for the same 4 clusters. A multiplexer with broadcasting support, pipelining registers, and a demultiplexer realize the odd-even grouping with only nearest-neighbor interconnections: O(C log(C)) wires, O(C) interconnections, 8 cycles.]
Contribution #2: Power efficiency

"High performance is low power." -- Mark Horowitz

[Figure: operation count (GOPs, 0-25) vs. (users, constraint length) pairs from (4,7) to (32,9), for a 2G basestation (16 Kbps/user) and a 3G basestation (128 Kbps/user). Note: GOPs refer only to arithmetic computations.]

Billions of computations per second are needed, and the workload varies from ~1 GOPs (4 users, constraint-length-7 Viterbi) to ~23 GOPs (32 users, constraint-length-9 Viterbi).

*Data parallelism is defined as the parallelism available after subword packing and loop unrolling.
U = users, K = constraint length, N = spreading gain, R = decoding rate.
[Figure: adaptive multiplexer network over the clusters, supporting no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, or all clusters off. Unused clusters are turned off using voltage gating, eliminating both static and dynamic power dissipation.]
[Figure: cluster utilization (%) vs. cluster index (0-31) on a 32-cluster processor for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), and (32,9), where (32,9) = 32 users, constraint-length-9 Viterbi. Utilization varies widely across workloads.]
[Figure: real-time frequency (MHz, 0-1200) vs. workload from (4,7) to (32,9), broken down into busy cycles, microcontroller stalls, and memory stalls.]

Power can change from 12.38 W to 300 mW depending on workload changes.
Contribution #3: Design exploration

Execution time = static part (computations) + dynamic part (memory stalls and microcontroller stalls). For sufficiently large numbers of adders and multipliers per cluster, this split also helps in quickly predicting real-time performance.

Explore algorithm 1: 32 clusters (t1)
Explore algorithm 2: 64 clusters (t2)
Explore algorithm 3: 64 clusters (t3)
Explore algorithm 4: 16 clusters (t4)
[Figure: two panels. Left: real-time frequency (MHz, log scale) vs. number of clusters (log scale), exploiting DP and ILP. Right: normalized power vs. number of clusters under three models: Power ∝ f, Power ∝ f^2, and Power ∝ f^3.]

The exploration picks:
- 32 clusters at frequency = 836.692 MHz for p = 1 (Power ∝ f)
- 64 clusters at frequency = 543.444 MHz for p = 2 (Power ∝ f^2)
- 64 clusters at frequency = 543.444 MHz for p = 3 (Power ∝ f^3)
3G workload

[Figure: real-time frequency (MHz) vs. number of adders (1-5) and multipliers (1-3) per cluster, with each design point annotated by its functional-unit utilization (adder %, multiplier %), ranging from (78,18) at ~1100 MHz to (36,53) at ~500 MHz.]
3G workload

*************************
Final Design Conclusion
*************************
Clusters: 64
Multipliers/cluster: 1 (utilization: 62%)
Adders/cluster: 3 (utilization: 55%)
Real-time frequency: 568.68 MHz
*************************

Exploration done with plots generated in seconds.
Broader impact and limitations
Don’t believe the model is the reality
(Proof is in the pudding)
Streaming memory system

[Figure: memory and cluster organization. External memory banks (bank 1 ... bank C) feed prefetch buffers and an L2 internal memory (as in the C64x); an instruction decoder drives clusters 0 ... C, which are fed from the stream register file (SRF) and exchange data over the intercluster communication network.]

[Figure: stream reordering. Step 1: stream A (elements 0-15) is loaded from memory into the SRF. Step 2: the clusters (cluster index 0-3) read the SRF and produce the reordered stream A', with some slots unused (X).]
Conditional streams

[Figure: conditional buffer and condition switch. Data A B C D arrives with condition bits 1 1 0 0; the condition switch delivers only the selected elements, so access 0 receives A B and access 1 receives C D (remaining slots are don't-cares).]
4 clusters reconfiguring to 2

[Figure: cluster internals. Each cluster holds adders and multipliers backed by distributed register files (which support more ALUs), a cross-point connection to the intercluster network, a communication unit to/from the SRF, and a scratchpad for indexed accesses.]
kernel add(istream<int> a, istream<int> b, ostream<int> sum)
{
    int inputA, inputB, output;
    loop_stream(a)
    {
        a >> inputA;
        b >> inputB;
        output = inputA + inputB;
        sum << output;
    }
}

kernel sub(istream<half2> c, istream<half2> d, ostream<half2> diff)
{
    int inputC, inputD, output;
    loop_stream(c)
    {
        c >> inputC;
        d >> inputD;
        output = inputC - inputD;
        diff << output;
    }
}

stream<int> a(1024);
stream<int> b(1024);
stream<int> sum(1024);
stream<half2> c(512);
stream<half2> d(512);
stream<half2> diff(512);

add(a, b, sum);
sub(c, d, diff);
Your new hardware won’t run your old software – Balch’s law
[Figure: the receiver expressed as kernels and streams. Kernels (matched filter, correlator, channel estimation, interference cancellation, Viterbi decoding) are connected by input and output data streams, from the received signal through to the decoded bits.]
Scott Rixner, Stream Processor Architecture. Kluwer Academic Publishers, Boston, MA, 2001.
[Figure: Viterbi decoder datapath. Detected bits enter the ACS unit, and the traceback unit produces the decoded bits.]