
Handset architectures

Sridhar Rajagopal, sridhar@rice.edu, http://www.ece.rice.edu/~sridhar. The support for this work in part by Nokia, TI and NSF is gratefully acknowledged.

Presentation Transcript


  1. Handset architectures: ASICs and programmable • Sridhar Rajagopal, sridhar@rice.edu, http://www.ece.rice.edu/~sridhar • The support for this work in part by Nokia, TI and NSF is gratefully acknowledged

  2. 2G handsets • DSP for most of the baseband • ASIC for compute-intensive operations (spreading etc.) • Microcontroller for higher layers • Source: M. L. McMahan, "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs," TI Report SPRA650, March 2000

  3. Proposed 3G handsets (TI, VIA) • Increased number of co-processors as DSPs unable to do most of the baseband • Sources: U. Ko, M. McMahan and E. Auslander, "DSP for the third generation wireless communications," International Conference on Computer Design, 1999, pp. 516–520; H. Chen, "Introduction to W-CDMA SoC design approach," VIA Technologies, August 2002, www.itpilot.org.tw/provisional/910802/INTRODUCTION%20TO%20WCDMA%20SOC%20.PDF

  4. Motivation How does this scale? Do we need a DSP or should we build ASICs? If ASICs, how to build better ASICs? If programmable, how to build better DSPs? If both, how do we mix them better? Answers dependent on • level of programmability needed • area-time-power architecture tradeoffs

  5. Rice innovations for ASICs and DSPs • ASICs: On-line arithmetic for dynamic truncation • Programmable: Scalable Wireless Application-specific Processors (SWAPs) • Mix and match: Hybrid SWAPs (H-SWAPs)

  6. Outline • On-line arithmetic for dynamic truncation • SWAPs • H-SWAPs

  7. ASIC designs • Finite precision arithmetic • Faster • Low power • Low area • How to keep finite precision bounded: • Saturation • Truncation

  8. Keeping precision bounded • Example of truncation • Multiplication by the step size in gradient descent • Sign detection • Example of saturation • Avoiding overflows • When the probability of useful MSBs is low
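
The two ways of bounding precision named on this slide can be sketched on plain integers standing in for fixed-point words. The function names and word widths below are illustrative assumptions, not taken from the slides.

```python
def saturate(x: int, bits: int) -> int:
    """Clamp x to the range of a signed two's-complement word of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def truncate(x: int, drop: int) -> int:
    """Discard the `drop` least significant bits (arithmetic right shift)."""
    return x >> drop


# Saturation bounds the MSB side: an accumulator value that overflows
# an 8-bit word is clamped instead of wrapping around.
acc = 300
acc_sat = saturate(acc, 8)

# Truncation bounds the LSB side: after scaling by a power of two,
# the low-order bits are simply dropped.
prod = 0b10110110
prod_trunc = truncate(prod, 4)
```

Saturation only changes the result when the value actually exceeds the word range, whereas truncation always discards information from the low end.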

  9. Dynamic precision requirements • Precision needs change with algorithms, SNR • Adapt hardware dynamically to save power • 25-35% power reduction possible • Dynamic saturation vs. dynamic truncation: • Ease: saturation is easy as LSBs come first; truncation is difficult • Error: saturation gives no error; truncation gives significant error • Throughput: saturation gives throughput benefits; truncation gives no benefits

  10. On-line arithmetic for dynamic truncation • Works Most Significant Digit First • Natural way of truncation • Digit-serial → dynamic truncation • Redundant number system → error only in LSD • Throughput benefits as digit-serial
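
Why most-significant-digit-first processing makes truncation natural can be illustrated with a small sketch of sign detection. It assumes a finite radix-2 signed-digit fraction with digits in {-1, 0, 1}, already available MSD first. The on-line multipliers and adder tree that would produce such a stream are not modeled, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def sign_msd_first(digits: Sequence[int]) -> Tuple[int, int]:
    """Return (sign, digits_consumed) for a finite signed-digit fraction.

    digits[i] is in {-1, 0, 1} and has weight 2**-(i + 1), MSD first.
    """
    consumed = 0
    for d in digits:
        consumed += 1
        if d != 0:
            # All remaining digits together weigh strictly less than this
            # digit, so they cannot flip the sign: stop early.
            return (1 if d > 0 else -1), consumed
    return 0, consumed


def value(digits: Sequence[int]) -> float:
    """Full-precision value of the digit string, used only for checking."""
    return sum(d * 2.0 ** -(i + 1) for i, d in enumerate(digits))


# 0.25 - 0.125 - 0.0625 is positive, and that is known after two digits.
example = [0, 1, -1, -1]
sign, used = sign_msd_first(example)
```

In this toy setting the remaining digits never need to be computed once the sign is known, which is the kind of early stop the slide refers to as dynamic truncation.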

  11. Example for sign detection [Figure: four pipelines, each multiplying a_i * b_i and summing through levels of tree addition. Panels: (a) truncated conventional arithmetic (CONV-MF); (b) on-line arithmetic with full precision (OL-MF); (c) and (d) dynamically truncated on-line arithmetic, one using 2 MSDs and one without truncation error, with idle stages (pipeline bubbles) and a stop once the sign is determined.]

  12. Throughput comparisons

  13. Area comparisons

  14. ASIC design conclusion • Using on-line arithmetic for dynamic truncation and conventional arithmetic for dynamic saturation, one can design efficient ASICs for handsets. • Details: Predrag

  15. Outline • On-line arithmetic for dynamic truncation • SWAPs • H-SWAPs

  16. Programmable architectures • Current DSPs • Not enough functional units (FUs) • Cannot extend to more FUs • Limited Instruction Level Parallelism (ILP) • Cannot support more registers (register area increases quadratically with FUs) • Compilers: difficult to find ILP as FUs increase

  17. Solution • Exploit data parallelism (DP) • Lots available in wireless algorithms • Example: for (i = 1:1024) { a[i] = b[i] + c[i]; d[i] = b[i] * c[i]; } • DP: across the loop iterations • ILP: between the two statements in the loop body
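
The loop on this slide can be emulated sequentially to show where the two kinds of parallelism sit. This is only a sketch: `run_clusters` and the strided assignment of iterations to clusters are assumptions made for illustration, and the code runs on one thread.

```python
def run_clusters(b, c, num_clusters):
    """Emulate splitting the loop iterations across `num_clusters` clusters."""
    n = len(b)
    a = [0] * n
    d = [0] * n
    for cluster in range(num_clusters):
        # Data parallelism: each cluster owns a strided slice of iterations,
        # and no iteration depends on any other.
        for i in range(cluster, n, num_clusters):
            # Instruction-level parallelism: the add and the multiply are
            # independent and could issue together within one cluster.
            a[i] = b[i] + c[i]
            d[i] = b[i] * c[i]
    return a, d


b = list(range(1024))
c = list(range(1024, 2048))
a1, d1 = run_clusters(b, c, 1)    # one cluster, as in a DSP
a8, d8 = run_clusters(b, c, 8)    # eight clusters
```

The result is identical for any cluster count; only the number of iterations each cluster has to execute changes.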

  18. DSP vs. SWAPs [Figure: a DSP drawn as 1 cluster of adders (+) and multipliers (*) exploiting ILP, next to SWAPs drawn as the same cluster replicated (max. clusters) to exploit DP, each with its internal memory.]

  19. SWAPs trade-offs • Same internal memory size as DSPs • Dependent on application, not architecture • Needs more area to support more functional units • Area is not a constraint (power is) • Varying levels of DP in applications • Needs reconfiguration!! • Need to turn off unused clusters • More parallelism → lower clock frequency → lower voltage → low power (CV²f + leakage) in spite of larger area
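
The last bullet can be made concrete with a toy calculation of the dynamic term CV²f. Every number below (capacitance, voltages, frequencies, cluster count) is a made-up placeholder chosen only to show the trend, and leakage is ignored.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Dynamic switching power C * V^2 * f (leakage not included)."""
    return c_eff * vdd ** 2 * freq_hz


# Hypothetical single-cluster design: 1 nF switched, 1.2 V, 320 MHz.
base = dynamic_power(1e-9, 1.2, 320e6)

# Hypothetical 32-cluster design: 32x the switched capacitance, but each
# cluster only needs 1/32 of the clock rate for the same throughput.
par_same_v = dynamic_power(32e-9, 1.2, 10e6)

# The lower clock rate is assumed to allow a lower supply voltage.
par_low_v = dynamic_power(32e-9, 0.8, 10e6)

saving = 1.0 - par_low_v / base
```

At the same voltage the two designs draw the same dynamic power, so in this model the saving comes entirely from the V² term. The larger area also adds leakage, which this sketch leaves out.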

  20. Example: Viterbi Decoding • Add-Compare-Select (ACS): trellis interconnect • Re-order for exploiting DP • Traceback: sequential • Use Register Exchange (RE) instead • Exploiting DP in programmable architecture implies: • Re-order ACS • Re-order RE
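
A minimal sketch of one Add-Compare-Select step followed by a register-exchange update is given below. The state numbering (next state = previous state shifted left by one with the input bit in the LSB) is an assumed convention, the branch metric is passed in as a hypothetical callback, and the re-ordering described on the slides is not reproduced.

```python
def acs_step(path_metrics, branch_metric):
    """One ACS step over all states; returns (new_metrics, decisions)."""
    num_states = len(path_metrics)
    new_metrics = [0] * num_states
    decisions = [0] * num_states
    for n in range(num_states):      # states are independent: data parallel
        p0 = n >> 1                  # predecessor whose dropped bit was 0
        p1 = p0 | (num_states >> 1)  # predecessor whose dropped bit was 1
        m0 = path_metrics[p0] + branch_metric(p0, n)   # add
        m1 = path_metrics[p1] + branch_metric(p1, n)
        if m1 < m0:                                    # compare
            new_metrics[n], decisions[n] = m1, 1       # select
        else:
            new_metrics[n], decisions[n] = m0, 0
    return new_metrics, decisions


def register_exchange(survivors, decisions):
    """Copy each winning predecessor's bit history and append the new bit."""
    num_states = len(survivors)
    updated = []
    for n in range(num_states):
        pred = (n >> 1) | (decisions[n] * (num_states >> 1))
        updated.append(survivors[pred] + [n & 1])
    return updated


# Four-state toy example with all branch metrics equal to zero.
metrics, dec = acs_step([5, 5, 0, 5], lambda prev, nxt: 0)
paths = register_exchange([[], [], [], []], dec)
```

Both loops touch every state independently, which is why register exchange suits a data-parallel machine better than a sequential traceback.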

  21. Re-ordering for parallel Viterbi [Figure: (a) trellis and (b) shuffled trellis over the 16 states X(0) to X(15); in the shuffled version one column lists the even-numbered states X(0), X(2), ..., X(14) followed by the odd-numbered states X(1), X(3), ..., X(15).]

  22. Viterbi reconfiguration • Packet 1: Constraint length 7 (16 clusters) • Packet 2: Constraint length 9 (64 clusters) • Packet 3: Constraint length 5 (4 clusters) • DP: can be turned OFF
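
The cluster counts on this slide are consistent with a fixed number of trellis states per cluster. The sketch below assumes 2^(K-1) states for constraint length K and 4 states per cluster; the latter is inferred from the three data points rather than stated in the slides, and the 64-cluster total is likewise an assumption.

```python
def num_states(constraint_length: int) -> int:
    """Trellis states of a convolutional code with the given constraint length."""
    return 1 << (constraint_length - 1)


def clusters_needed(constraint_length: int, states_per_cluster: int = 4) -> int:
    """Active clusters, assuming a constant states-per-cluster ratio."""
    return num_states(constraint_length) // states_per_cluster


def clusters_off(constraint_length: int, total_clusters: int = 64) -> int:
    """Clusters that could be turned off for this packet."""
    return total_clusters - clusters_needed(constraint_length)


# Per-packet schedule for the three packets on the slide.
schedule = {k: (clusters_needed(k), clusters_off(k)) for k in (7, 9, 5)}
```

Under these assumptions a constraint-length-5 packet keeps only 4 of 64 clusters active, so the other 60 are candidates for being turned off.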

  23. [Figure: timeline of three 64-bit packets showing kernels (computation) and memory accesses. Packet 1: rate ½, constraint length 7. Packet 2: rate ½, constraint length 9. Packet 3: rate ½, constraint length 5.]

  24. Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz [Plot: frequency needed to attain real-time (in MHz, log scale from 10^0 to 10^3) vs. number of clusters (log scale from 10^0 to 10^2). Curves: actual K = 9, actual K = 7, actual K = 5, for regular code and reconfigurable code.]

  25. Viterbi decoding: Comparisons [Plot: log-log axes (10^0 to 10^3 and 10^0 to 10^2) with curves for actual K = 9, 7 and 5. Points marked: DSP C64x (w/o co-proc), DSP (RE), DP SWAP, and a Virtex II FPGA* using task pipelining and dedicated interconnect at 128 KHz (1 bit/cycle).] *VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

  26. Salient features of this solution • Any constraint length → 10 MHz at 128 Kbps • Same code for all constraint lengths • No need to re-compile or load another code • As long as parallelism/cluster ratio is constant • Exploiting parallelism at 3 levels for real-time: • Instruction Level Parallelism (DSP) • Subword Parallelism (DSP) • Data Parallelism (SWAP)

  27. Problems • Suitable for handsets? - Not yet! • Still too general • Not low power enough!!! • No special customization for the application • Except for a fixed-point architecture • Generic instruction set • Generic ALUs (though can be powered down) • Generic inter-cluster communication network

  28. Outline • On-line arithmetic for dynamic truncation • SWAPs • Hybrid SWAPs (H-SWAPs)

  29. H-SWAPs • Trade Data Parallelism for Task Pipelining • Customize each mini-SWAP [Figure: SWAPs (max. clusters and reconfigure, full DP) next to a mini-SWAP (limit clusters, limited DP) and H-SWAPs (collection of customized mini-SWAPs, each with limited DP), all built from adders (+) and multipliers (*) under an internal memory.]

  30. Work in progress • How to trade-off task vs. data parallelism? • Power estimation for SWAPs (actual numbers) • Comparisons with ASIC solutions in terms of area-time-power • Evaluation of specialized inter-cluster communication • Specialized instructions (ACS) and arithmetic units (on-line) I am looking for jobs!!!
