Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke • University of Michigan, Ann Arbor • December 11, 2012

Convergence of Functionalities • Anatomy of an iPhone 4: 4G wireless, audio, video, 3D, navigation • Convergence of functionalities demands a flexible solution (a flexible accelerator) because of design cost and programmability

CGRA: Attractive Alternative to ASICs • Array of PEs connected by a mesh-like interconnect • High throughput from a large number of resources • Distributed hardware offers low cost and low power consumption • High flexibility through dynamic reconfiguration

Bridging the Gap Between Market Demand and Computation Power • How can performance scale while retaining energy efficiency? [Canali, Internet Computing Magazine, IEEE, 2009]

Agenda: Scaling the Energy Efficiency of CGRAs • Investigate the key factors and their feasibility in terms of performance and power efficiency • Hardware scalability vs. hardware flexibility • Interconnection topology • Complex PE vs. simple PE • Vector memory operation support • Homogeneity vs. heterogeneity

Experimental Setup • Target applications • Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering • Game physics benchmarks: line of sight, convolution, and conjugate • Target architecture: various types of CGRAs • 16 to 64 heterogeneous/homogeneous resources • IMPACT frontend compiler + edge-centric modulo scheduler • Power measurement • IBM 65nm technology @ 200MHz/1V

Q1: Interconnection Topology • Overview • Routing overhead limits performance as the size of the CGRA increases • Common solution: clustering • What is the optimal interconnection topology? • Methodology • Compare the performance of three clustering schemes • Baseline • Fixed partition: the CGRA is physically split into multiple partitions • Flexible partition: the number of partitions can be changed dynamically from 1 to 8 • Total number of PEs: 4 to 128
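The flexible-partition scheme can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and does not come from the slides: the power-of-two partition counts, the even split of PEs, the `min_pes_per_partition` parameter, and the policy of running no-DLP loops on the whole array.

```python
def partition_sizes(total_pes, max_partitions=8):
    """Map each candidate partition count to the PEs per partition.

    Assumption: partition counts are powers of two and PEs split evenly.
    """
    sizes = {}
    count = 1
    while count <= max_partitions and count <= total_pes:
        if total_pes % count == 0:
            sizes[count] = total_pes // count
        count *= 2
    return sizes


def choose_partitions(has_dlp, total_pes, min_pes_per_partition, max_partitions=8):
    """Pick a partition count for one loop (hypothetical policy).

    A loop without data-level parallelism keeps the whole array as one
    partition. A DLP loop takes the largest partition count whose
    partitions still hold at least min_pes_per_partition PEs.
    """
    if not has_dlp:
        return 1
    fitting = [
        count
        for count, size in partition_sizes(total_pes, max_partitions).items()
        if size >= min_pes_per_partition
    ]
    return max(fitting) if fitting else 1


if __name__ == "__main__":
    print(partition_sizes(64))
    print(choose_partitions(True, 64, 16))
    print(choose_partitions(False, 64, 16))
```

A fixed-partition CGRA corresponds to always using one value of the partition count, whereas the flexible scheme lets the count vary per loop.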

Q1: Interconnection Topology • [Figure: an application's no-DLP loops and DLP loops mapped onto the baseline, fixed partition, and flexible mapping]

Performance Comparison (Base, Fixed, Flex) • Fixed partitioning does not always show better performance • Flexible architectures show the best performance and retain scalability

Q2: Complex PEs vs. Simple PEs • Overview • CGRAs with complex PEs have been introduced • Two-level interconnect • Number of RFs can decrease • Multiple instructions can be chained • Challenge: resource utilization • Goal: determine the feasibility of complex PEs in terms of energy consumption • Methodology • Compare the energy consumption of different PE styles • Number of FUs inside a PE: 1 to 6 • Uniform vs. optimized

PE Designs

Energy Consumption • Energy consumption does not increase dramatically as the number of PEs increases • Within a 1.5x energy budget, complex PEs with 2 to 3 FUs are also viable solutions

Q3: SIMD Memory Support • Overview • SIMD memory support reduces power and the number of instructions • Challenge: degree of DLP • Goal: determine the feasibility of SIMD memory access in terms of energy consumption • Methodology • Compare the energy consumption across SIMD widths: 1 to 16
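The instruction-count benefit and the DLP challenge can be seen in a small Python sketch. It assumes contiguous accesses in which one SIMD memory instruction moves up to `simd_width` elements. The function names and the element counts are hypothetical and are not taken from the slides.

```python
import math


def memory_instructions(num_elements, simd_width):
    """Memory instructions needed to access num_elements contiguous elements."""
    if simd_width < 1:
        raise ValueError("simd_width must be at least 1")
    return math.ceil(num_elements / simd_width)


def lane_utilization(num_elements, simd_width):
    """Fraction of SIMD lanes doing useful work across all issued instructions."""
    issued = memory_instructions(num_elements, simd_width)
    return num_elements / (issued * simd_width)


if __name__ == "__main__":
    # Sweep the SIMD widths named in the methodology (1 to 16).
    for width in (1, 2, 4, 8, 16):
        print(width, memory_instructions(64, width), lane_utilization(64, width))
    # A loop with little DLP leaves lanes idle at wide widths.
    print(lane_utilization(10, 16))
```

With plenty of DLP, a wider unit issues proportionally fewer memory instructions. With little DLP, a wide unit is only partly used, which is the challenge the slide names.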

Relative Energy Consumption • Total energy consumption at wider vector widths can be at a level similar to that of a scalar memory unit • A high degree of spatial locality can compensate for the power overhead

Conclusion • Flexible partitioning should be supported to further improve performance • Complex PEs can be more energy efficient even at low resource utilization • Wide SIMD memory support can be realistic because of mobile application characteristics

Questions? • For more information: http://cccp.eecs.umich.edu

Q1: Homogeneity vs. Heterogeneity • Overview • Heterogeneous CGRAs are common • No experiments on the effect of heterogeneity compared with homogeneity • Methodology • Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit) • Decrease the number of PEs supporting the complex ALU and memory unit • Performance goal: 80% of the performance of the homogeneous CGRA • How about performance?

Performance Degradation • [Figure panels: Media, Game] • The performance degradation is not substantial • Performance is normally not constrained by complex instructions • Performance degradation depends much more on memory operations • For 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%
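The search implied by the methodology (shrink the number of specialized units until the 80% goal is no longer met) can be sketched in Python. The relative-performance numbers below are made up purely for illustration and are not measurements from the slides. The function names are hypothetical.

```python
def fewest_units_meeting_goal(relative_perf, goal=0.8):
    """Return the smallest unit count whose performance meets the goal.

    relative_perf maps a unit count to performance normalized to the
    homogeneous CGRA (1.0 means equal to the homogeneous baseline).
    """
    feasible = [count for count, perf in relative_perf.items() if perf >= goal]
    if not feasible:
        raise ValueError("no configuration meets the performance goal")
    return min(feasible)


def unit_reduction(baseline_units, chosen_units):
    """Fraction of units removed relative to the homogeneous baseline."""
    return 1 - chosen_units / baseline_units


if __name__ == "__main__":
    # Illustrative, invented data: unit count -> relative performance.
    example = {16: 1.0, 8: 0.95, 4: 0.85, 2: 0.6}
    chosen = fewest_units_meeting_goal(example, goal=0.8)
    print(chosen, unit_reduction(16, chosen))
```

In practice each entry of `relative_perf` would come from compiling and simulating the benchmarks on that configuration.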

Conclusion • Heterogeneous FU organization is highly effective • Flexible partitioning should be supported to further improve performance • Complex PEs can be more energy efficient even at low resource utilization • Wide SIMD memory support can be realistic because of mobile application characteristics

CGRA: Attractive Alternative to ASICs • Suitable for running multimedia applications on future embedded systems • High throughput, low power consumption, high flexibility • Morphosys: 8x8 array with RISC processor • SiliconHive: hierarchical systolic array • ADRES: 4x4 array with tightly coupled VLIW • [Figure labels: Morphosys, SiliconHive, ADRES; viterbi at 80Mbps, h.264 at 30fps, 50-60 MOps/mW]
