Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation
Sridhar Rajagopal
Digital Signal Processors (DSPs)
• Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications
• A $5 billion (and growing) market today
We always want something faster!
New high-performance applications drive the need for faster DSPs
• Physical-layer signal processing in high-speed wireless communications to support multimedia
• Application-layer signal processing for video and imaging
Example: wireless systems (32-user system)

                               2G (1996)        3G (2003)                   4G (?)
Data rates                     16 Kbps/user     128 Kbps/user               1 Mbps/user
Algorithms                     Single-user      Multi-user                  MIMO
  Estimation                   Correlator       Max. likelihood             Chip equalizer
  Detection                    Matched filter   Interference cancellation   Matched filter
  Decoding                     Viterbi          Viterbi                     LDPC
Theoretical min. ALUs @ 1 GHz  > 2              > 20                        > 200
Data-Parallel DSPs: state-of-the-art
[Figure: internal memory feeding a cluster of ALUs (adders and multipliers)]
• Clusters of ALUs provide billions of computations per second
• Exploit data parallelism in signal processing applications
• Imagine stream processor, Stanford (1998-2004)
Proposal: Research questions for DP-DSPs
• Will DP-DSPs work well for wireless systems?
• How do I design DP-DSPs to meet real-time at lowest power?
• Can I improve power efficiency further by adapting DSPs to the application?
Contributions: Algorithm mapping
• Efficient mapping of (wireless) algorithms
  • parallelization, structure, memory access patterns
  • tradeoffs between ALU utilization, inter-cluster communication, memory stalls, and packing
• A reduced inter-cluster network is proposed
  • exploits inter-cluster communication patterns
  • allows greater scalability of the architecture by reducing wires
Contributions: Architecture scaling • Design methodology and tool to explore architectures for low power • Provides candidate architectures for low power • Provides insights into ALU utilization and performance • Compile-time exploration is orders-of-magnitude faster than run-time exploration
Contributions: Workload adaptation • Adapt the number of clusters and ALUs to changes in workload during run-time • Multiplexer network designed • adapts clusters to DP at run-time • turns off unused clusters using power gating • Significant power savings at run-time (up to 60%)
Thesis contributions
[Figure: data-parallel DSP with clusters of adders and multipliers]
• Algorithm mapping: design of algorithms for efficient mapping and performance
• Architecture scaling: having designed the algorithms, find a low-power processor
• Workload adaptation: having designed the processor, improve power at run-time
Outline • DP-DSPs : Parallelism and architecture • Power-aware design exploration • Power-aware resource utilization at run-time • Conclusions
Parallelism levels in DP-DSPs
• Instruction-Level Parallelism (ILP): DSPs
• Subword Parallelism (SubP): DSPs
• Data Parallelism (DP): vector processors
These levels are not independent:
• DP can decrease by increasing ILP and SubP, e.g., through loop unrolling
Code snippet for ILP, SubP, DP

#define N 64
int i, a[N], b[N], sum[N];
short int c[N], d[N], diff[N];

for (i = 0; i < N; ++i) {
    sum[i]  = a[i] + b[i];    /* 32-bit add */
    diff[i] = c[i] - d[i];    /* 16-bit subtract */
}

/* DP:   each of the N iterations is independent                 */
/* ILP:  the add and the subtract can issue in parallel          */
/* SubP: the 16-bit subtracts can be packed into 32-bit ALU ops  */
Data-Parallel DSPs
[Figure: internal memory, microcontroller, and clusters of ALUs; ILP and SubP within a cluster, DP across clusters]
• ILP and SubP are exploited within a cluster, DP across clusters
• Communication between clusters uses the inter-cluster communication network
• The microcontroller issues the same instruction to all clusters
(a sketch mapping the earlier code snippet onto this organization follows)
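To make the mapping concrete, here is a minimal sketch (my illustration, not code from the Imagine tool chain) of how the earlier loop could be strip-mined across clusters, with 8 clusters assumed:

/* Sketch only: every cluster executes the same instruction stream on its own
 * slice of the data (DP across clusters); the independent add and subtract
 * inside an iteration provide ILP, and the 16-bit subtract is a SubP candidate. */
#define N        64
#define CLUSTERS 8   /* assumed cluster count for illustration */

void kernel(int a[N], int b[N], int sum[N],
            short c[N], short d[N], short diff[N]) {
    for (int k = 0; k < CLUSTERS; k++) {                      /* one slice per cluster */
        for (int i = k * (N / CLUSTERS); i < (k + 1) * (N / CLUSTERS); i++) {
            sum[i]  = a[i] + b[i];                            /* 32-bit add  (ILP)  */
            diff[i] = c[i] - d[i];                            /* 16-bit sub  (SubP) */
        }
    }
}

In the actual DP-DSP the outer cluster loop is implicit: the microcontroller broadcasts one instruction and each cluster works on its own portion of the data.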
ILP is resource-bound
[Figure: schedule for matrix-matrix multiplication (adders, multipliers, inter-cluster communication over time) as the number of ALUs increases]
• ILP is dependent on resources such as ALUs, read/write ports, inter-cluster communication, and registers
• Any one resource bottleneck can affect ILP
Signal processing algorithms have plenty of DP
Observations:
• More DP is available after exploiting ILP and SubP to the point of diminishing returns
  • this is used to set the number of clusters
• As clusters are added to exploit this 'extra' DP, ILP and SubP are not affected significantly
This 'extra' DP is defined as Cluster DP (CDP)
Observing CDP in Viterbi decoding
[Figure: frequency needed to attain real-time (MHz, log scale) vs. number of clusters (log scale) for Viterbi decoding with K = 5, 7, 9, with a DSP shown for reference; each curve flattens once the maximum CDP is reached]
Designing low power DP-DSPs
[Figure: design space spanning 1 cluster of many ALUs at 100 GHz, 100 clusters at 10 MHz, and in general 'c' clusters of 'a' adders and 'm' multipliers at 'f' MHz]
Find the right (a, m, c, f) to minimize power
a = #adders/cluster, m = #multipliers/cluster, c = #clusters, f = clock frequency
Detailed simulation using the Imagine processor simulator • Cycle accurate, parameterized simulator • Insights into operations every cycle • High-level C++-based programming • GUI interface shows dependencies and schedule • Power and VLSI scaling model available • Open source allows modifications in architecture, tools
Need for design exploration tool • Random choice may be way off • 100x power variation possible • Exhaustive simulation not possible • large parameter space (hours for each simulation) • DSP compilers need hand optimizations for performance • evolving algorithms -- architecture exploration needed
Design exploration framework
[Figure: two phases. Design phase: for the base (worst-case) workload, explore the (a, m, c, f) combination that minimizes power for the data-parallel DSP, then move to hardware implementation. Dynamic adaptation phase: use the application workload and utilization to turn down (a, m, c, f) and save power]
DSPs are compute-bound with predictable performance
[Figure: total execution time (cycles) split into t_compute (computations) and t_stall (microcontroller stalls plus exposed memory stalls); hidden memory stalls overlap with computation]
Minimization for power
• C(a,m,c): capacitance, from the simulator power model
• f(a,m,c): real-time clock frequency, obtained by running the application on the (a,m,c) architecture
(a sketch of the resulting objective follows)
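As a sketch only (an assumed form pieced together from the quantities above and the Power ∝ f^p curves on later slides, not a formula quoted from the thesis), the exploration can be viewed as minimizing

\min_{(a,m,c)} P(a,m,c) \;\propto\; C(a,m,c)\,\bigl(f(a,m,c)\bigr)^{p}, \qquad 2 \le p \le 3

where p captures how the supply voltage scales with the clock, and f(a,m,c) is pinned by the real-time requirement: roughly the compute cycle count of the workload on (a,m,c) divided by the available time.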
Sensitivity to technology and modeling
• Sensitivity to the technology exponent 'p'
• Sensitivity to the adder-to-multiplier power ratio
  • 0.01 to 0.1 for 32-bit adders vs. 32x32-bit multipliers
• Sensitivity to memory stalls 'b'
  • difficult to predict at compile time (5-20%)
  • assume q = 25% of execution time as the worst case
  • f_stall = q * (1 - b) * f_min, with 0 <= b <= 1
Design exploration: big picture
• Start from (a, m, c) = (∞, ∞, ∞)
• Find the (a, m, c) at which ILP, SubP, and DP are fully exploited
• Find the c that minimizes P for (max(a), max(m))
• Find the (a, m) that minimizes P using that c
• Explore sensitivity to the adder-to-multiplier power ratio, the stall factor b, and p
(a minimal sketch of this search loop appears below)
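For concreteness, a minimal sketch of such a search loop, with placeholder (assumed) capacitance and real-time frequency models standing in for the tool's actual models:

/* Sketch of the compile-time (a, m, c) exploration; cap() and freq() below
 * are illustrative placeholders, not the thesis tool's models. */
#include <stdio.h>
#include <math.h>

/* Placeholder capacitance model: grows with ALUs per cluster and with clusters. */
static double cap(int a, int m, int c) {
    return c * (a + 4.0 * m);                 /* assumed relative ALU costs */
}

/* Placeholder real-time frequency model (MHz): adding clusters exploits more
 * cluster DP and lowers the required clock, until the workload's CDP
 * (assumed to be 64 here) is exhausted. */
static double freq(int a, int m, int c) {
    int useful = (c < 64) ? c : 64;
    return 40000.0 / (useful * (a + m));
}

int main(void) {
    const double p = 3.0;                     /* technology exponent, 2 <= p <= 3 */
    double best = INFINITY;
    int ba = 0, bm = 0, bc = 0;

    for (int c = 1; c <= 512; c *= 2)         /* clusters */
        for (int a = 1; a <= 5; a++)          /* adders per cluster */
            for (int m = 1; m <= 3; m++) {    /* multipliers per cluster */
                double power = cap(a, m, c) * pow(freq(a, m, c), p);
                if (power < best) { best = power; ba = a; bm = m; bc = c; }
            }

    printf("lowest-power candidate: (a,m,c) = (%d,%d,%d) at %.0f MHz\n",
           ba, bm, bc, freq(ba, bm, bc));
    return 0;
}

In the thesis tool, f(a,m,c) comes from cycle-approximate compile-time estimates of the application rather than a closed-form model, which is what makes the exploration so much faster than simulation (see the simulator/tool comparison later).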
Real-time frequency vs. clusters for (a, m) = (5, 3)
[Figure: required real-time frequency (MHz, log scale) vs. number of clusters (log scale) for stall factors b = 0, 0.5, and 1; points at 538 MHz and 541 MHz are marked]
Choosing clusters
[Figure: normalized power (log scale) vs. number of clusters (log scale) for Power ∝ f^2, Power ∝ f^2.5, and Power ∝ f^3; the chosen point is c = 64 at 541 MHz]
ALU utilization (+, *)
[Figure: adder and multiplier utilization for c = 64, adder-to-multiplier power ratio = 0.01, b = 1, p = 3]
Insights from analysis
• Sensitivity importance: p, then the adder-to-multiplier power ratio, then the stall factor b
• The exploration gives candidates for low power solutions
  • Design I: (a, m, c): (∞, ∞, ∞) → (5, 3, 512) → (5, 3, 64) → (2, 1, 64)
  • Design II: (a, m, c): (∞, ∞, ∞) → (5, 3, 512) → (5, 3, 64) → (3, 1, 64)
• Power minimization is related to ALU efficiency
  • it is the same as maximizing a scaled version of ALU utilization
Advantages of design exploration tool
• Simulator (S)
  • cycle-accurate (execution time obtained at run-time)
  • explores 100 machine configurations in 100 hours (conservative)
  • requires modification of parameters and code for different runs
• Tool (T)
  • cycle-approximate (execution time estimated at compile time)
  • explores millions of configurations in 100 hours
  • automated process all the way
  • can generate plots for the defense the day before
• Rapid evaluation of candidate algorithms for future systems
Verification of design tool
[Figure: real-time clock frequency (MHz), broken into computations and stalls, comparing tool (T) and simulator (S) estimates for Design I, Design II, and the human design]
• Human choice: (3, 3, 32) @ 1.2 V, 0.13 µm, 1 GHz; base power = 18.2 W
• Exploration tool choice: (2, 1, 64) at 887 MHz; estimated base power @ 1.2 V, 0.13 µm = 13.2 W
Cluster utilization
[Figure: cluster utilization (%) vs. cluster index for the 32-cluster and 64-cluster designs]
• The 64-cluster design is inefficient in terms of cluster utilization (54% for clusters 33-64)
• But it still has lower power than 32 clusters due to the difference in f
• The difference shrinks as p → 2
Improving power efficiency
• Clusters are a significant source of power consumption (50-75%)
• When CDP < c, unutilized clusters waste power
• Dynamically turn off clusters using power gating to improve power efficiency
Data access is difficult after adaptation
[Figure: adapting from 4 clusters to 2 clusters; with clusters off, how do the remaining clusters get data from the other memory banks?]
• Data is no longer in the correct memory banks
• Overhead in bringing the data in: external memory or the inter-cluster network
Multiplexer network design
[Figure: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off]
• Unused clusters are turned off using power gating to eliminate static and dynamic power dissipation
• The multiplexer network adapts the number of active clusters to the DP (a minimal run-time sketch follows)
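A minimal sketch of the run-time adaptation decision, assuming (as placeholders) that the hardware exposes per-cluster power gating and that the multiplexer network halves the active clusters in powers of two:

/* Sketch only: choose the largest power-of-two cluster count that does not
 * exceed the workload's cluster DP (CDP); the rest are power-gated off. */
#include <stdio.h>

#define NUM_CLUSTERS 64

static int active_clusters(int cdp) {
    int active = NUM_CLUSTERS;
    while (active > 1 && active > cdp)
        active /= 2;                          /* 64 -> 32 -> 16 -> ... via the mux network */
    return active;
}

int main(void) {
    /* Illustrative CDP values only, not measurements from the thesis. */
    int cdp_small = 12, cdp_large = 64;
    printf("small workload: %d of %d clusters on\n",
           active_clusters(cdp_small), NUM_CLUSTERS);
    printf("large workload: %d of %d clusters on\n",
           active_clusters(cdp_large), NUM_CLUSTERS);
    return 0;
}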
Run-time variations in workload
[Figure: cluster utilization (%) vs. cluster index for Viterbi decoding with K = 5, 7, and 9]
Benefits of multiplexer network
Power efficiency at design time:
• Human choice: (3, 3, 32); base power @ 1.2 V, 0.13 µm, 1 GHz = 18.2 W
• Exploration tool choice: (2, 1, 64); base power @ 1.2 V, 0.13 µm, 887 MHz = 13.2 W
Power efficiency at run-time, with the mux network:
• K = 9: 9.9 W
• K = 7: 7.4 W
• K = 5: 6.8 W
Design exploration for 2G-3G-4G systems
[Figure: real-time clock frequency (MHz, log scale) vs. data rate (log scale) for 2G, 3G, and 4G workloads; candidate designs marked include (2, 1, 64) and (3, 1, 64), and for 4G* (1, 1, 32) and (2, 1, 32)]
A "power"ful tool for algorithm-architecture exploration
Broader impact • Power-aware design exploration with improved run-time power efficiency • Techniques can be applied to all high performance, power efficient DSP designs • Handsets, cameras, video
Future extensions • Fabrication needed to verify concepts • Higher performance • Multi-threading (ILP, SubP, DP, MT) • Pipelining (ILP, SubP, DP, MT, PP) • LDPC decoding • Sparse matrix requires permutations over large data • Indexed SRF in stream processors [Jayasena, HPCA 2004]
Conclusions
• Providing high performance with 100s-1000s of ALUs while keeping power low is a challenge for DSP designers
• Algorithm design for efficient mapping on DP-DSPs
• Design exploration tool for low power DP-DSPs
  • provides candidate DSPs for low power
  • allows algorithm-architecture evaluation for new systems
• Power efficiency provided during both the design and the use of DP-DSPs
Acknowledgements • Dr. Joseph R. Cavallaro, Dr. Scott Rixner • Imagine stream processor group at Stanford • Abhishek, Ujval, Brucek, Dr. Dally • Marjan, Predrag, Alex • 4G MIMO + LDPC • Thesis committee • Nokia, Texas Instruments, TATP, NSF