
Programmable processors for wireless base-stations

Sridhar Rajagopal (sridhar@rice.edu), December 16, 2003.







  1. Programmable processors for wireless base-stations Sridhar Rajagopal (sridhar@rice.edu) December 16, 2003

  2. Wireless rates vs. clock rates [Plot: clock frequency (MHz), W-LAN data rate (Mbps) and cellular data rate (Mbps) on a log scale, by year, 1996 to 2006. Annotated values: 200 MHz and 4 GHz (clock), 2-10 Mbps and 54-100 Mbps (W-LAN), 9.6 Kbps and 1 Mbps (cellular)] • Need to process 100X more bits per clock cycle today than in 1996

  3. Base-stations need horsepower [Block diagram: RF (analog) → 'chip rate' processing (ASIC(s) and/or ASSP(s) and/or FPGA(s)) → 'symbol rate' processing (DSP(s)) → decoding (co-processor(s) and/or ASIC(s)) → 'packet rate' processing (DSP or RISC processor)] • Sophisticated signal processing for multiple users • Need 100-1000s of arithmetic operations to process 1 bit • Base-stations require > 100 ALUs

  4. Programmable architectures • Wireless algorithm kernels • Well known, ASIC mapping well-studied • Processors getting more powerful every year • Historic trend: ASICs → Programmable • Can we design a fully programmable wireless system?

  5. Thesis addresses the following problem • Design programmable processors for wireless base-stations with 100s of ALUs: (a) map wireless algorithms on these processors (b) power-efficient (adapt resources to needs) (c) decide #ALUs, clock frequency • How much programmable? As programmable as possible

  6. Choice : Multi-processors • Single processors won’t do • ILP, subword parallelism not sufficient • Register file explosion with increasing ALUs • Multiprocessors • Data parallelism in wireless systems • Data-parallel/SIMD/vector processors appropriate • Exploit ILP, MMX, DP

  7. Thesis contributions (a) Mapping algorithms on data-parallel processors • designing data-parallel algorithms • tradeoffs between packing, ALU utilization and memory • reduced inter-cluster communication network (b) Improve power efficiency • adapting compute resources to workload variations • varying voltage and frequency to real-time requirements (c) Design exploration between #ALUs and clock frequency to minimize power consumption • fast real-time performance prediction

  8. Outline • Background • Wireless systems • Data-parallel (Stream) processors • Mapping algorithms to stream processors • Power efficiency • Design exploration • Broad impact and future work

  9. Wireless workloads [Figure: workload timeline marked 1996, 2003, ?]

  10. Key kernels studied for wireless • FFT - Media processing • QRD - Media processing • Outer product updates • Matrix-vector operations • Matrix-matrix operations • Matrix transpose • Viterbi decoding • LDPC decoding (in progress)

  11. Characteristics of wireless • Compute-bound • Finite precision • Limited temporal data reuse • Streaming data • Data parallelism • Static, deterministic, regular workloads • Limited control flow

  12. Parallelism levels in wireless
      int i, a[N], b[N], sum[N];        // 32 bits
      short int c[N], d[N], diff[N];    // 16 bits, packed
      for (i = 0; i < N; ++i) {
          sum[i] = a[i] + b[i];
          diff[i] = c[i] - d[i];
      }
      • Instruction Level Parallelism (ILP) - DSP • Subword Parallelism (MMX) - DSP • Data Parallelism (DP) - Vector Processor • DP can decrease by increasing ILP and MMX - Example: loop unrolling
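
The loop-unrolling remark on slide 12 can be illustrated with a minimal C sketch. The function names `add_simple` and `add_unrolled4` and the unroll factor of 4 are illustrative assumptions, not taken from the slides. Unrolling places four independent additions in each iteration, which a VLIW scheduler can issue together (more ILP), while the number of independent iterations left over as DP drops by a factor of four.

```c
#include <stddef.h>

/* Baseline: one addition per iteration, so all n iterations are
   available as data parallelism (DP). */
void add_simple(const int *a, const int *b, int *sum, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        sum[i] = a[i] + b[i];
}

/* Unrolled by 4 (an assumed factor): four independent additions per
   iteration expose ILP, leaving about n / 4 iterations of DP. */
void add_unrolled4(const int *a, const int *b, int *sum, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        sum[i + 2] = a[i + 2] + b[i + 2];
        sum[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n % 4 != 0 */
        sum[i] = a[i] + b[i];
}
```

Both functions compute the same result; only the balance between ILP and DP seen by the scheduler changes.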

  13. Stream Processors: multi-cluster DSPs [Figure: a VLIW DSP (1 cluster: internal memory, microcontroller, adders and multipliers exploiting ILP and MMX) next to a stream processor (memory: Stream Register File (SRF), microcontroller, many clusters of adders and multipliers exploiting ILP, MMX and DP)] • Adapt clusters to DP • Identical clusters, same operations • Power-down unused FUs, clusters

  14. Outline • Background • Wireless systems • Stream processors • Mapping algorithms to stream processors • Reduced inter-cluster communication network • Power efficiency • Design exploration • Broad impact and future work

  15. Patterns in inter-cluster comm • Intercluster comm network fully connected • Structure in access patterns can be exploited • Broadcasting • Matrix-vector multiplication, matrix-matrix multiplication, outer product updates • Odd-even grouping • Transpose, Packing, Viterbi decoding

  16. Viterbi needs odd-even grouping [Figure: trellis states X(0) to X(15) under regular ACS and under ACS in SWAPs, with the DP vector reordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15)] • Exploiting Viterbi DP: odd-even grouping of trellis states
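
The reordering shown on slide 16 (even-indexed states first, then odd-indexed states) can be written as a short C sketch. The function name `odd_even_group` is an assumption for illustration. The sketch runs on a plain array rather than over an inter-cluster network.

```c
#include <stddef.h>

/* Odd-even grouping: copy the even-indexed elements of in to the first
   half of out and the odd-indexed elements to the second half.
   n is assumed even; in and out must not overlap. */
void odd_even_group(const int *in, int *out, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        out[i]        = in[2 * i];
        out[half + i] = in[2 * i + 1];
    }
}
```

Applied to the indices 0 to 15, this yields 0, 2, ..., 14 followed by 1, 3, ..., 15, matching the DP vector on the slide.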

  17. Performance of Viterbi decoding [Plot: frequency needed to attain real-time (in MHz, log scale, 1 to 1000) vs. number of clusters (log scale, 1 to 100) for K = 5, 7, 9, with a DSP reference point and the max DP limit marked] • Ideal C64x DSP (w/o co-proc) needs ~200 MHz for real-time

  18. Odd-even grouping • Packing • If odd-even data packed in same cluster and precision doubles • Odd-even grouping required for bringing data to right cluster • Not always beneficial for performance • Matrix transpose • Better done in ALUs than in memory • Shown to have an order-of-magnitude better performance • Done in ALUs as repeated odd-even groupings

  19. Transpose uses odd-even grouping [Figure: input matrix IN with dimensions labeled M and N, split into blocks A, B, C, D and rearranged into OUT] • Repeat LOG(M) times { IN = OUT; }
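
A minimal C sketch of a transpose built from repeated odd-even grouping, as slide 19 describes. The names `group_pass` and `transpose_by_grouping` are illustrative assumptions. The slide does not make clear which matrix dimension it calls M; this sketch assumes a row-major input and performs log2(cols) passes, which requires the column count to be a power of two.

```c
#include <stddef.h>
#include <string.h>

/* One odd-even grouping pass over n elements (n even): evens first,
   then odds. in and out must not overlap. */
static void group_pass(const int *in, int *out, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        out[i]        = in[2 * i];
        out[half + i] = in[2 * i + 1];
    }
}

/* Transpose a rows x cols row-major matrix held in buf, using tmp as
   scratch (both hold rows * cols elements). cols is assumed to be a
   power of two. Afterwards buf holds the cols x rows transpose. */
void transpose_by_grouping(int *buf, int *tmp, size_t rows, size_t cols)
{
    size_t n = rows * cols;
    for (size_t span = 1; span < cols; span *= 2) {  /* log2(cols) passes */
        group_pass(buf, tmp, n);
        memcpy(buf, tmp, n * sizeof *buf);           /* IN = OUT */
    }
}
```

Each pass moves the lowest bit of the column index to the top of the linear index, so after log2(cols) passes the column index has become the row index.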

  20. Odd-even grouping • 4 clusters, data 0/4 1/5 2/6 3/7 • 0 1 2 3 4 5 6 7 → 0 2 4 6 1 3 5 7 • O(C^2) wires, O(C^2) interconnections, 8 cycles • Inter-cluster communication: entire chip length • Limits clock frequency • Limits scaling

  21. A reduced inter-cluster comm network [Figure: 4 clusters holding data 0/4, 1/5, 2/6, 3/7, connected through a multiplexer, registers (pipelining) and a demultiplexer, with broadcasting and odd-even grouping support] • O(C log(C)) wires, O(C) interconnections, 8 cycles • Only nearest neighbor interconnections

  22. Outline • Background • Wireless systems • Stream processors • Mapping algorithms to stream processors • Power efficiency • Design exploration • Broad impact and future work

  23. Flexibility needed in workloads [Bar chart: operation count (in GOPs), also read as min. ALUs needed at 1 GHz, for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user), over (users, constraint lengths) = (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9). Note: GOPs refer only to arithmetic computations] • Billions of computations per second needed • Workload variation from ~1 GOPs for 4 users, constraint 7 Viterbi to ~23 GOPs for 32 users, constraint 9 Viterbi
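
The "min. ALUs needed at 1 GHz" reading of slide 23 is a division of the operation rate by the clock rate. A minimal C sketch follows; the function name `min_alus` is an assumption, as is the idealization that every ALU completes one arithmetic operation per cycle at 100% utilization.

```c
/* Minimum ALU count for a workload, assuming (idealized) that each ALU
   completes one arithmetic operation per clock cycle with no stalls.
   ops_per_sec and clock_hz must be positive. */
unsigned long min_alus(double ops_per_sec, double clock_hz)
{
    double exact = ops_per_sec / clock_hz;
    unsigned long whole = (unsigned long)exact;          /* truncate */
    return (exact > (double)whole) ? whole + 1 : whole;  /* round up */
}
```

With the slide's upper figure of ~23 GOPs, `min_alus(23e9, 1e9)` gives 23 ALUs at 1 GHz. Real utilization is below 100%, so this is a lower bound.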

  24. DP changes with users [Figure: DP per packet; unused clusters can be turned OFF] • Packet 1: constraint length 7 (16 clusters) • Packet 2: constraint length 9 (64 clusters) • Packet 3: constraint length 5 (4 clusters)
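
The cluster counts on slide 24 follow a simple pattern that can be sketched in C. A Viterbi decoder for constraint length K has 2^(K-1) trellis states. The slide's figures (16, 64 and 4 clusters for K = 7, 9 and 5) match that state count divided by 4. The divisor 4 is inferred from those numbers and is not stated in the slides. All function names are illustrative.

```c
/* Number of trellis states for constraint length k (k >= 1). */
unsigned viterbi_states(unsigned k)
{
    return 1u << (k - 1);
}

/* Clusters that can be kept busy, assuming states_per_cluster trellis
   states are handled by each cluster (4 reproduces the slide's numbers;
   that value is an inference, not a documented parameter). */
unsigned clusters_needed(unsigned k, unsigned states_per_cluster)
{
    unsigned states = viterbi_states(k);
    return (states + states_per_cluster - 1) / states_per_cluster;
}

/* Clusters left idle, and so candidates for power-down, on a machine
   with total clusters. */
unsigned clusters_idle(unsigned total, unsigned needed)
{
    return needed >= total ? 0 : total - needed;
}
```

On a 64-cluster machine, a constraint length 7 packet would leave `clusters_idle(64, clusters_needed(7, 4))`, that is 48 clusters, available to switch off.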

  25. Data is not in the right banks [Figure: SRF feeding 4 clusters (C), reconfigured from 4 → 2 clusters] • Data not in the right SRF banks • Overhead in bringing data to the right banks • Via memory • Via inter-cluster communication network

  26. Adapting #clusters to Data Parallelism [Figure: SRF, adaptive multiplexer network and 4 clusters (C), shown with no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off] • Clusters turned off using voltage gating to eliminate static and dynamic power dissipation

  27. Cluster utilization variation [Plot: cluster utilization (0 to 100) vs. cluster index (0 to 30) for workloads (32,7) and (32,9)] • Cluster utilization variation on a 32-cluster processor • (32, 9) = 32 users, constraint length 9 Viterbi

  28. Frequency variation [Stacked bar chart: real-time frequency (in MHz, 0 to 1200), split into Busy, uC Stall and Mem Stall, for (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9)]

  29. Operation • Dynamic Voltage-Frequency scaling when system changes significantly • Users, data rates … • Coarse time scale (when system changes) • Turn off clusters • when parallelism changes • Finer time scale (once every 1000 cycles) (di/dt effects) • Memory operations • Exceed real-time requirements

  30. Power: Voltage Gating & Scaling • Power can change from 12.38 W to 300 mW (40x savings) depending on workload changes

  31. Outline • Background • Wireless systems • Stream processors • Mapping algorithms to stream processors • Power efficiency • Design exploration • Broad impact and future work

  32. Deciding ALUs vs. clock frequency • No independent variables • Clusters, ALUs, frequency, voltage (c,a,m,f) • Trade-offs exist • How to find the right combination for lowest power?

  33. Static design exploration [Figure: execution time split into a static, predictable part (computations) and a dynamic part (memory stalls, microcontroller stalls)] • Static part also helps in quickly predicting real-time performance

  34. Sensitivity analysis important • We have a capacitance model [Khailany2003] • Not all equations are exact • Need to see how variations affect solutions

  35. Design exploration methodology • 3 types of parallelism: ILP, MMX, DP • For best performance (power) • Maximize the use of all • Maximize ILP and MMX at expense of DP • Loop unrolling, packing • Schedule on sufficient number of adders/multipliers • If DP remains, set clusters = DP • No other way to exploit that parallelism

  36. Setting clusters, adders, multipliers • If sufficient DP, linear decrease in frequency with clusters • Set clusters depending on DP and execution time estimate • To find adders and multipliers, • Let compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find execution time • Put all numbers in power equation • Compare increase in capacitance due to added ALUs and clusters with benefits in execution time • Choose the solution that minimizes the power

  37. Design exploration for clusters (c) [Figure: DP vs. time] • For sufficiently large #adders, #multipliers per cluster • Explore Algorithm 1: 32 clusters • Explore Algorithm 2: 64 clusters • Explore Algorithm 3: 64 clusters • Explore Algorithm 4: 16 clusters

  38. Clusters: frequency and power [Plots for the 3G workload: real-time frequency f(c) (MHz, log scale) vs. clusters (c); normalized power (0 to 1) vs. clusters (0 to 70) for Power ∝ f, Power ∝ f^2 and Power ∝ f^3] • 32 clusters at frequency = 836.692 MHz (p = 1) • 64 clusters at frequency = 543.444 MHz (p = 2) • 64 clusters at frequency = 543.444 MHz (p = 3)
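
The selection on slide 38 can be sketched as a search over candidate cluster counts for the lowest value of capacitance times f^p. This is a simplified stand-in, not the thesis's method: it assumes capacitance grows linearly with the cluster count, whereas the slides refer to the capacitance model of [Khailany2003]. The struct and function names are illustrative. The two frequencies in the usage note are the ones printed on the slide.

```c
#include <stddef.h>

struct candidate {
    unsigned clusters;   /* cluster count c */
    double   freq_mhz;   /* real-time frequency f(c) for this c */
};

/* Relative power under the assumed model: capacitance proportional to
   the cluster count, power proportional to capacitance * f^p. */
static double relative_power(struct candidate cand, unsigned p)
{
    double power = (double)cand.clusters;
    for (unsigned i = 0; i < p; ++i)
        power *= cand.freq_mhz;
    return power;
}

/* Index of the candidate with the lowest relative power.
   n must be at least 1. */
size_t pick_lowest_power(const struct candidate *cands, size_t n, unsigned p)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (relative_power(cands[i], p) < relative_power(cands[best], p))
            best = i;
    return best;
}
```

With only the two points from the slide, {32, 836.692} and {64, 543.444}, this linear-capacitance assumption selects 32 clusters for p = 1 and 64 clusters for p = 2 and p = 3, which agrees with the slide's listed results.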

  39. ALU utilization with frequency [3D plot for the 3G workload: real-time frequency (in MHz, 500 to 1100) vs. #adders (1 to 5) and #multipliers (1 to 3), each point labeled with FU utilization (+,*): (78,18), (78,27), (78,45), (64,31), (50,31), (65,46), (38,28), (51,42), (67,62), (32,28), (42,37), (33,34), (55,62), (43,56), (36,53)] • Relation between ALU utilization and power minimization?

  40. Choice of adders and multipliers

  41. Exploration results • Final Design Conclusion • Clusters: 64 • Multipliers/cluster: 1 (multiplier utilization: 62%) • Adders/cluster: 3 (adder utilization: 55%) • Real-time frequency: 568.68 MHz for 128 Kbps/user • Exploration done in seconds

  42. Outline • Background • Wireless systems • Stream processors • Mapping algorithms to stream processors • Power efficiency • Design exploration • Broad impact and future work

  43. Broader impact • Results not specific to base-stations • High performance, low power system designs • Concepts can be extended to handsets • Mux network applicable to all SIMD processors • Power efficiency in scientific computing • Results #2, #3 applicable to all stream applications • Design and power efficiency • Multimedia, MPEG, …

  44. Future work Don’t believe the model is the reality • Fabrication needed to verify concepts • Cycle accurate simulator • Extrapolating models for power • LDPC decoding (in progress) • Sparse matrix requires permutations over large data • Indexed SRF may help • 3G requires 1 GHz at 128 Kbps/user • 4G equalization at 1 Mbps breaks down (expected)

  45. Options for higher performance • Multi-threading (ILP, MMX, DP, MT) • Schedule other kernels on unused clusters • Additional microcontroller and issue logic complexity • Pipelining (ILP, MMX, DP, MT, PP) • Standard way of improving performance • Inter-processor communication overhead • Load-balancing difficult • min(t1,t2,…) instead of min(t1+t2+…) • Software tools need to catch up with hardware

  46. Need for new architectures, definitions and benchmarks • Road ends - conventional architectures [Agarwal2000] • Wide range of architectures - DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable + • Difficult to compare and contrast • Need new definitions that allow comparisons • Wireless workloads • Typically ASIC designs • SPEC benchmark needed for programmable designs

  47. Conclusions • Utilizing 100-1000s ALUs/clock cycle and mapping algorithms not easy in programmable architectures • Data parallel algorithms need to be designed and mapped • Power efficiency needs to be provided • Design exploration needed to decide #ALUs to meet real-time constraints • My thesis lays the initial foundations
