
DSP architectures for wireless communications

DSP architectures for wireless communications. Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX ECE Pizza Talk March 28, 2003. This work has been supported in part by Nokia, TI, TATP and NSF. Wireless Cellular. Wireless LAN. Bluetooth/


Presentation Transcript


  1. DSP architectures for wireless communications. Sridhar Rajagopal, Department of Electrical and Computer Engineering, Rice University, Houston, TX. ECE Pizza Talk, March 28, 2003. This work has been supported in part by Nokia, TI, TATP and NSF.

  2. Future wireless devices (figure labels: Wireless Cellular, Wireless LAN, Bluetooth/Home Networks)
     • High data rate mobile devices with multimedia
     • Multiple antennas with complex algorithms, GOPs of computation
     • Area-Time-Power constraints
     • Seamless connection across environments and standards
     • Use the fastest and cheapest available service

  3. Aim of the talk (figure label: "Design me")

  4. Trends: FLEXIBILITY

  5. Change in flexibility requirements
     • Layers, top to bottom: Application Layer, Network Layer, MAC Layer, Physical Layer
     • Top of the stack: no change (already flexible)
     • Bottom of the stack: maximum change (needs to support multiple environments, algorithms and standards)

  6. Architecture trade-offs
     • Spectrum (figure labels): Programmable, Intermediate, ASICs
     • Programmable end: flexibility, time-to-market, software updates
     • ASIC end: Area-Time-Power benefits
     • Past: more DSP + less ASIC. Current: less DSP + more ASIC
     • Reason: need less flexibility, OR DSPs not powerful enough?
     • Can't we build better DSPs?
     • How much flexibility do we need?

  7. Problems with current DSPs
     • Current DSPs
     • Not enough functional units (FUs) for GOPs of computation
     • Need 100s of FUs
     • Not low power enough!!
     • Cannot extend to more FUs
     • Limited Instruction Level Parallelism (ILP)
     • Limited Subword Parallelism (such as MMX)
     • Cannot support more registers (area, ports)
     • Compilers: difficult to find ILP as FUs increase

  8. Scalable Wireless Application-specific Processors (SWAPs)
     • Exploit data parallelism (DP)
     • Available in many wireless algorithms
     • This is what ASICs do!!
     • Example:
         int i, a[1024], b[1024], c[1024];     // 32 bits
         short int d[1024], e[1024], f[1024];  // 16 bits, packed
         for (i = 0; i < 1024; ++i) {
             a[i] = b[i] + c[i];   // ILP: independent of the next statement
             d[i] = e[i] + f[i];   // Subword: 16-bit operands can be packed
         }                         // DP: loop iterations are independent
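The subword parallelism mentioned on this slide can be illustrated in portable C. The following is a minimal sketch, assuming two 16-bit values packed into one 32-bit word; the names `pack16x2`, `add16x2` and `add16x2_selfcheck` are hypothetical and not from the talk, and real DSPs expose this as a single instruction rather than as masking code.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word (hypothetical helper). */
static uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* Add both 16-bit lanes of x and y at once.
 * Each lane wraps modulo 2^16; a carry out of the low lane is masked off
 * so it cannot leak into the high lane. */
static uint32_t add16x2(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;          /* low lane only */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;  /* high lane, wraps at 32 bits */
    return hi | lo;
}

/* Small self-check: the low lane overflows (0xFFFF + 1) without touching
 * the high lane (1 + 1). */
static void add16x2_selfcheck(void)
{
    assert(add16x2(0x0001FFFFu, 0x00010001u) == 0x00020000u);
}
```

One packed add replaces two scalar adds, which is the effect the slide attributes to packed 16-bit data.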

  9. SWAPs: stream processors for wireless
     • Figure labels: Input Data (received signal); Kernels (Correlator channel estimation, Matched filter, Interference Cancellation, Viterbi decoding); Stream; Output Data (Decoded bits)
     • Kernels (computation) and streams (communication)
     • Operations on kernels use local data
     • Streams expose data parallelism
     • Imagine stream processor at Stanford
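The kernel/stream split on this slide can be sketched in plain C. This is an illustrative model only, not the Imagine programming interface: streams are modelled as arrays, kernels as functions that read one stream and write another, and the names `kernel_fn`, `kernel_scale`, `kernel_clip` and `run_pipeline` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel consumes an input stream and produces an output stream. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Hypothetical kernel: every element is processed independently,
 * which is what exposes data parallelism across clusters. */
static void kernel_scale(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2 * in[i];
}

/* Hypothetical kernel: saturate each element at 100. */
static void kernel_clip(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] > 100 ? 100 : in[i];
}

/* Run the kernels in order, ping-ponging between two stream buffers.
 * Returns the buffer that holds the final output stream. */
static int *run_pipeline(const kernel_fn *kernels, size_t nk,
                         int *a, int *b, size_t n)
{
    assert(kernels != NULL && a != NULL && b != NULL);
    int *src = a, *dst = b;
    for (size_t k = 0; k < nk; ++k) {
        kernels[k](src, dst, n);
        int *tmp = src;   /* the output of this kernel feeds the next one */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

In a receiver chain, the array of kernels would hold stages such as matched filtering and Viterbi decoding instead of these toy stages.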

  10. DSP vs. SWAPs
      • Figure, DSP (1 cluster): Internal Memory feeding adders (+) and multipliers (*); exploits ILP
      • Figure, SWAPs (max. clusters; all clusters same and do same operations): Stream Register File (SRF) feeding clusters of adders (+) and multipliers (*); exploits DP and ILP

  11. Arithmetic clusters
      • FUs (+, *, /)
      • Scratch-pad (Sp)
      • Indexed accesses
      • Comm. unit (CU)
      • Intercluster comm.
      • Distributed reg. files
      • More FUs
      • Figure labels: From/To SRF, Local Register File, Cross Point, Sp, CU, Intercluster Network

  12. SWAPs vs. DSPs trade-offs
      • Same internal memory size as DSPs
      • Dependent on application, not architecture
      • Needs more area to support more functional units
      • Area is less of a constraint than power
      • Varying levels of DP in applications
      • Needs reconfiguration!!
      • Need to turn off unused clusters (and FUs)
      • More parallelism → lower clock frequency → lower voltage → low power (CV²f + leakage) in spite of larger area
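The last bullet's power argument can be made explicit with the standard dynamic-power relation. The derivation below is a sketch under simplifying assumptions that are not stated on the slide: N identical clusters multiply the switched capacitance by N, the clock drops to f/N for the same throughput, and leakage is ignored. The lowered supply voltage V' is a hypothetical value chosen only for the arithmetic.

```latex
\begin{align*}
P &\approx C V^{2} f + P_{\text{leak}} \\
\frac{P_{\text{dyn}}^{\text{parallel}}}{P_{\text{dyn}}^{\text{serial}}}
  &= \frac{(N C)\, V'^{2}\, (f/N)}{C\, V^{2}\, f}
   = \left(\frac{V'}{V}\right)^{2} \\
V = 1.2\ \text{V},\; V' = 0.9\ \text{V}
  &\;\Longrightarrow\; \left(\frac{0.9}{1.2}\right)^{2} = 0.5625
\end{align*}
```

Under these assumptions the saving comes entirely from the voltage reduction that the lower clock permits; leakage grows with area and works against it, which is why the slide lists it separately.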

  13. Design methodology
      • Flow-chart labels, in slide order: Chain of receiver algorithms; Low "complexity", parallel, fixed point; Flexibility-performance trade-offs; High level language implementation; Architecture exploration; FPGA, customized, reconfigurable, heterogeneous designs; ASIC design; Modular programmable architecture design; DSP, SWAPs, H-SWAPs
      • Two feedback arrows labelled "learn"

  14. Physical layer of wireless receivers
      • Block diagram labels: Antenna, RF Front-end, Baseband processing (Channel estimation, Detection, Decoding), Higher (MAC/Network/OS) Layers
      • Receiver more complex than transmitter

  15. Algorithms for
      • Multiple antenna systems (MIMO systems)
      • Complexity exponential with transmit * receive antennas
      • Wide range of extremely complex algorithms
      • Optimal depends on fading, mobility, bandwidth, antennas
      • GOPs of computations
      • Estimation: Linear MMSE, blind, conjugate gradient...
      • Detection: FFT, (blind) interference cancellation...
      • Decoding: Viterbi, Turbo, LDPC...
      • Implement ALL of them AND the NEXT one in line
      • Use the best for the situation
      Example for concept demonstration: Viterbi decoding

  16. Parallel Viterbi Decoding
      • 1. Add-Compare-Select (ACS): trellis interconnect
      • Parallelism depends on constraint length (#states)
      • 2. Conventional Traceback
      • Sequential (no DP)
      • Difficult to implement in parallel architecture
      • Use Register Exchange (RE)
      • Parallel solution
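One trellis stage of ACS combined with register exchange can be sketched as follows. This is a minimal illustration, not the talk's implementation: it assumes a 4-state code (constraint length 3), a state numbering in which state j has predecessors 2j mod S and 2j+1 mod S, and branch metrics supplied by the caller. The names `NSTATES` and `acs_re_step` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSTATES 4u  /* 2^(K-1) states for an assumed constraint length K = 3 */

/* One trellis stage of Add-Compare-Select with register exchange.
 * pm_in/pm_out : path metrics before/after the stage
 * reg_in/reg_out: survivor registers (decoded bits so far, newest in bit 0)
 * bm[s][b]     : branch metric for leaving state s with input bit b
 * Every iteration of the loop is independent, so the states can be
 * spread across clusters. */
static void acs_re_step(const unsigned pm_in[NSTATES],
                        const uint32_t reg_in[NSTATES],
                        const unsigned bm[NSTATES][2],
                        unsigned pm_out[NSTATES],
                        uint32_t reg_out[NSTATES])
{
    assert(pm_in && reg_in && bm && pm_out && reg_out);
    for (unsigned j = 0; j < NSTATES; ++j) {
        unsigned b  = j / (NSTATES / 2);   /* input bit that leads into state j */
        unsigned p0 = (2u * j) % NSTATES;  /* first predecessor state */
        unsigned p1 = p0 + 1u;             /* second predecessor state */
        unsigned m0 = pm_in[p0] + bm[p0][b];            /* add */
        unsigned m1 = pm_in[p1] + bm[p1][b];
        unsigned win = (m1 < m0) ? p1 : p0;             /* compare */
        pm_out[j] = (m1 < m0) ? m1 : m0;                /* select */
        /* Register exchange: copy the winner's survivor register and append
         * the new bit, so no sequential traceback is needed later. */
        reg_out[j] = (reg_in[win] << 1) | b;
    }
}
```

After the last stage, the register of the state with the smallest path metric already holds the decoded bits, which is what makes the approach parallel where traceback is sequential.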

  17. Re-ordering for parallel Viterbi
      • Figure a, Trellis: states X(0) to X(15) in natural order
      • Figure b, Shuffled Trellis: states re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15)
      • Exploiting Viterbi DP in SWAPs:
      • Re-order ACS, RE
      • Overhead
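The shuffled column in the figure appears to list the even-indexed states first and the odd-indexed states after them. A minimal sketch of that re-ordering, assuming this reading of the figure is right (the function name `shuffle_states` is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Re-order n state values so that even-indexed states come first and
 * odd-indexed states follow: X(0), X(2), ..., X(n-2), X(1), X(3), ..., X(n-1).
 * n must be even; in and out must not overlap. */
static void shuffle_states(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL && n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        out[i]         = in[2 * i];      /* even-indexed states, first half */
        out[n / 2 + i] = in[2 * i + 1];  /* odd-indexed states, second half */
    }
}
```

With this layout, the two values that feed one ACS butterfly sit at positions i and i + n/2, so every cluster can apply the same regular access pattern. The copy itself is the overhead the slide refers to.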

  18. SWAP: Algorithms + Architecture. Algorithm design for parallelism. Architecture design?

  19. SWAP design
      • Figure: clusters of adders (+) and multipliers (*) with unknown contents (?); DP across clusters, ILP within a cluster
      • Decide how many clusters
      • Exploit DP
      • Decide what to put within each cluster
      • Maximize ILP with high functional unit efficiency
      • Search design space with "explore" tool
      • See how it meets time-area-power constraints

  20. Inside a SWAP cluster: EXPLORE
      • Plot: auto-exploration of adders and multipliers for "ACS"; axes: #Adders (1 to 5), #Multipliers (1 to 5), Instruction count; each point is labelled (Adder FU%, Multiplier FU%)

  21. "Explore" tool benefits
      • Instruction count vs. functional unit efficiency
      • What goes inside each cluster
      • Explore all algorithms
      • Turn off functional units not in use for given kernel
      • Design customized application-specific units
      • Better performance with increased FU utilization
      Algorithm 1: 3 adders, 3 multipliers, 32 clusters
      Algorithm 2: 4 adders, 1 multiplier, 64 clusters
      Architecture: 4 adders, 3 multipliers, 64 clusters
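The architecture row on this slide is the element-wise maximum of the two algorithm rows. A short sketch of that rule, with hypothetical names (`swap_config`, `covering_config`); the actual "explore" tool is not described in enough detail here to reproduce:

```c
#include <assert.h>
#include <stddef.h>

/* Resources one algorithm needs (or one architecture provides). */
typedef struct {
    unsigned adders;       /* adders per cluster */
    unsigned multipliers;  /* multipliers per cluster */
    unsigned clusters;     /* number of clusters */
} swap_config;

static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* Smallest configuration that covers every algorithm in the list:
 * take the maximum of each resource over all algorithms. */
static swap_config covering_config(const swap_config *algs, size_t n)
{
    assert(algs != NULL && n > 0);
    swap_config out = algs[0];
    for (size_t i = 1; i < n; ++i) {
        out.adders      = max_u(out.adders, algs[i].adders);
        out.multipliers = max_u(out.multipliers, algs[i].multipliers);
        out.clusters    = max_u(out.clusters, algs[i].clusters);
    }
    return out;
}
```

Whatever an individual algorithm does not use out of this covering configuration is what gets turned off while it runs.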

  22. Viterbi reconfiguration (figure labels: DP; "Can be turned OFF")
      • Packet 1: constraint length 7 (16 clusters)
      • Packet 2: constraint length 9 (64 clusters)
      • Packet 3: constraint length 5 (4 clusters)
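The three packet examples are consistent with four trellis states per cluster, since a constraint length K gives 2^(K-1) states. The sketch below encodes that inferred ratio; the constant `STATES_PER_CLUSTER` and the function `active_clusters` are hypothetical names, and the ratio is read off the slide's numbers rather than stated by the author.

```c
#include <assert.h>

/* Inferred from the slide: K = 7 -> 16 clusters, K = 9 -> 64, K = 5 -> 4. */
enum { STATES_PER_CLUSTER = 4 };

/* Number of clusters to keep powered for a Viterbi decoder of constraint
 * length K, capped at the clusters the chip provides. The remaining
 * clusters can be turned off for that packet. */
static unsigned active_clusters(unsigned K, unsigned max_clusters)
{
    assert(K >= 3 && K <= 31);
    unsigned states = 1u << (K - 1);               /* trellis states */
    unsigned needed = states / STATES_PER_CLUSTER; /* constant ratio */
    return needed < max_clusters ? needed : max_clusters;
}
```

For a 64-cluster part, `active_clusters(7, 64)` leaves 48 clusters idle for a constraint-length-7 packet.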

  23. Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
      • Plot: frequency needed to attain real-time (in MHz, log scale, 1 to 1000) vs. number of clusters (log scale, 1 to 100); curves for K = 5, K = 7, K = 9; labels: DSP, SWAPs, Static architecture
      • Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

  24. SWAPs: Salient features
      • 1-2 orders of magnitude better than 1 processor DSP
      • Any constraint length: 10 MHz at 128 Kbps
      • Same code for all constraint lengths
      • No need to re-compile or load another code
      • As long as parallelism/cluster ratio is constant
      • Power savings due to dynamic cluster scaling

  25. Expected SWAP power consumption
      Viterbi     Clusters used   Peak Power
      K = 9       64              ~90 mW
      K = 7       16              ~28.57 mW
      K = 5       4               ~13.8 mW
      overhead    0               ~8.1 mW
      • Plot: Power (in mW, 0 to 90) vs. Active Clusters (max 64)
      • 64 clusters and 1 multiplier per cluster:
      • 0.13 micron, 1.2 V
      • Peak Active Power: ~9 mW at 1 MHz
      • Area: ~53.7 mm²
      • 10 MHz, 128 Kbps with reconfiguration
      * Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al., Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
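The figures on this slide can be cross-checked with simple arithmetic. The working below assumes that active power scales linearly with clock frequency at a fixed voltage and that power grows linearly with the number of active clusters on top of the fixed overhead; neither assumption is stated on the slide.

```latex
\begin{align*}
P_{K=9} &\approx 9\ \text{mW/MHz} \times 10\ \text{MHz} = 90\ \text{mW} \\
p_{\text{cluster}} &\approx \frac{90 - 8.1}{64}\ \text{mW} \approx 1.28\ \text{mW} \\
P(16\ \text{clusters}) &\approx 8.1 + 16 \times 1.28 \approx 28.6\ \text{mW} \\
P(4\ \text{clusters}) &\approx 8.1 + 4 \times 1.28 \approx 13.2\ \text{mW}
\end{align*}
```

The 64-cluster and 16-cluster rows match this linear estimate; the 4-cluster row (~13.8 mW) sits slightly above it.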

  26. Flexibility vs. performance
      • Suitable for mobile devices?
      • SWAPs: Real-time at ~10-100 mW
      • Maybe; but can we do better?
      • ASICs: Real-time at ~10-100 W
      • No special customization for the application
      • No application-specific units
      • Generic inter-cluster communication network
      • Overhead for extracting parallelism
      • SWAPs suitable for base-stations?
      • Why not? Power is not a primary constraint!

  27. Multiuser Estimation-Detection+Decoding
      • Real-time target: 128 Kbps per user
      • Plot: frequency needed to attain real-time (in MHz, log scale) vs. number of clusters (log scale); curves: FAST, MEDIUM, SLOW; labels: DSP, Mobile, 32-user base-station
      • Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

  28. Current research
      • SWAPs: Completely flexible and general
      • How do we trade off flexibility for better performance?
      • Handset SWAPs (H-SWAPs)

  29. H-SWAPs: Potential advantages
      • Figure: execution time compared for DSP (RE), SWAPs and H-SWAPs, with ASIC/FPGA real-time performance as the reference
      • Figure labels: DP; Partial DP + Task Pipelining; Application-specific units; Task Pipelining; Dedicated interconnect

  30. Conclusions
      • Need flexible architectures for future wireless devices
      • Higher data rates, lower power, more complex algorithms
      • Design methodology (SWAPs, H-SWAPs, ASICs)
      • Flexibility vs. performance trade-offs
      • Blurs distinction between ASICs and programmable solutions
      • Also need parallel, low precision algorithms for efficient mapping
      • Inter-disciplinary research:
      • Computer architecture, VLSI, wireless communications, computer arithmetic, compilers
