Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke. December 11, 2012. University of Michigan, Ann Arbor.

Presentation Transcript


  1. Efficient Performance Scaling of Future CGRAs for Mobile Applications
     Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke
     December 11, 2012 • University of Michigan, Ann Arbor

  2. Convergence of Functionalities
     • [Figure: Anatomy of an iPhone4. Labels: 4G Wireless, Audio, Video, 3D, Navigation; callout: Flexible Accelerator!]
     • Convergence of functionalities demands a flexible solution because of design cost and programmability

  3. CGRA: Attractive Alternative to ASICs
     • Array of PEs connected in a mesh-like interconnect
     • High throughput with a large number of resources
     • Distributed hardware offers low cost/power consumption
     • High flexibility with dynamic reconfiguration
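
The "array of PEs connected in a mesh-like interconnect" can be pictured with a small sketch. The slide only says "mesh-like", so the plain four-neighbor 2D mesh, the 4x4 size, and the name `mesh_neighbors` are all assumptions made for illustration, not the architecture evaluated in the talk.

```python
def mesh_neighbors(rows, cols):
    """Map each PE (row, col) to its north/south/west/east neighbors.

    Assumption: a plain 2D mesh without wrap-around or diagonal links.
    """
    links = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only neighbors that fall inside the array.
            links[(r, c)] = [
                (nr, nc)
                for nr, nc in candidates
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return links


# Hypothetical 4x4 array: corner PEs have 2 links, interior PEs have 4.
mesh = mesh_neighbors(4, 4)
```

Because each PE talks only to adjacent PEs, values that must travel farther consume intermediate PEs as routing hops, which is the routing overhead the later topology question addresses.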

  4. Bridging the Gap Between Market Demand and Computation Power
     • How can performance scale while retaining energy efficiency?
     • [Canali, Internet Computing Magazine, IEEE, 2009]

  5. Agenda: Scaling the Energy Efficiency of CGRAs
     • Investigate the key factors and their feasibility in terms of performance and power efficiency
     • Hardware scalability vs. hardware flexibility
     • Interconnection topology
     • Complex PE vs. simple PE
     • Vector memory operation support
     • Homogeneity vs. heterogeneity

  6. Experimental Setup
     • Target applications
       • Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
       • Game physics benchmarks: line of sight, convolution, and conjugate
     • Target architecture: various types of CGRAs
       • 16 ~ 64 heterogeneous/homogeneous resources
       • IMPACT frontend compiler + edge-centric modulo scheduler
     • Power measurement
       • IBM 65nm technology @ 200MHz/1V

  7. Q1: Interconnection Topology
     • Overview
       • Routing overhead limits performance as the size of the CGRA increases
       • Common solution: clustering
       • What is the optimal interconnection topology?
     • Methodology
       • Compare the performance of three different clustering schemes
         • Baseline
         • Fixed partition: CGRAs are physically split into multiple partitions
         • Flexible partition: the number of partitions can be dynamically changed from 1 to 8
       • Total number of PEs: 4 to 128
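
A minimal sketch of what "number of partitions can be dynamically changed from 1 to 8" could mean in practice. The equal-sized partitions, the rule that DLP loops get as many partitions as still fit one loop copy each, and the names `partition_options` and `pick_partitions` are assumptions for illustration; the slides do not describe the actual selection policy.

```python
def partition_options(total_pes, max_partitions=8):
    """List (partition_count, pes_per_partition) pairs.

    Assumption: partitions are equal-sized, so only counts that divide
    the PE total evenly are considered.
    """
    return [
        (k, total_pes // k)
        for k in range(1, max_partitions + 1)
        if total_pes % k == 0
    ]


def pick_partitions(total_pes, has_dlp, min_pes_per_copy):
    """Choose a partition count for one loop (hypothetical policy).

    Loops without data-level parallelism use the whole array as one
    partition; DLP loops use the largest count whose partitions are still
    big enough to hold one copy of the loop.
    """
    if not has_dlp:
        return 1
    fitting = [
        k for k, size in partition_options(total_pes)
        if size >= min_pes_per_copy
    ]
    return max(fitting) if fitting else 1


options_64 = partition_options(64)
dlp_choice = pick_partitions(64, has_dlp=True, min_pes_per_copy=16)
no_dlp_choice = pick_partitions(64, has_dlp=False, min_pes_per_copy=16)
```

Under these assumptions, a 64-PE array running a DLP loop that needs 16 PEs per copy would be split four ways, while a loop without DLP keeps the array whole. A fixed partition scheme cannot make that per-loop choice.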

  8. Q1: Interconnection Topology
     • [Figure labels: Application; No-DLP loops; DLP loops; Baseline; Fixed partition; Flexible mapping]

  9. Performance Comparison (Base, Fixed, Flex)
     • Fixed partitioning doesn't always show better performance
     • Flexible architectures show the best performance and retain scalability

  10. Q2: Complex PEs vs. Simple PEs
      • Overview
        • CGRAs with complex PEs are introduced
          • Two-level interconnect
          • Number of RFs can decrease
          • Multiple instructions can be chained
        • Challenge: resource utilization
        • Goal: determine the viability of complex PEs in terms of energy consumption
      • Methodology
        • Compare the energy consumption of different PE styles
          • Number of FUs inside a PE: 1 ~ 6
          • Uniform vs. optimized

  11. PE Designs
      • [Figure: PE designs]

  12. Energy Consumption
      • Energy consumption does not increase dramatically with the number of PEs
      • Within a 1.5x energy budget, complex PEs with 2~3 FUs can also be suitable solutions
      • [Figure annotation: 1.5x energy]

  13. Q3: SIMD Memory Support
      • Overview
        • SIMD memory support requires less power and fewer instructions
        • Challenge: degree of DLP
        • Goal: determine the viability of SIMD memory access in terms of energy consumption
      • Methodology
        • Compare the energy consumption of different SIMD widths: 1 ~ 16
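
The "fewer instructions" point can be made concrete with a small count. This sketch assumes a fully vectorizable, unit-stride access pattern and power-of-two widths within the 1 ~ 16 range from the slide; the 64-element buffer and the name `memory_instruction_count` are hypothetical, and no energy figures are modeled.

```python
import math


def memory_instruction_count(num_elements, simd_width):
    """Memory instructions needed to touch num_elements contiguous elements.

    Assumption: every access is unit-stride and fully vectorizable, so one
    instruction moves simd_width elements (the last one may be partial).
    """
    return math.ceil(num_elements / simd_width)


# Hypothetical 64-element buffer across the widths studied on the slide.
counts = {w: memory_instruction_count(64, w) for w in (1, 2, 4, 8, 16)}
```

The reduction only materializes when the loop has enough data-level parallelism to fill the vector, which is the "degree of DLP" challenge noted above.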

  14. Relative Energy Consumption
      • Total energy consumption at wider vector widths can be at a level similar to that of a scalar memory unit
      • A high degree of spatial locality can compensate for power overheads

  15. Conclusion
      • Flexible partitioning should be supported to further improve performance
      • Complex PEs can be more energy efficient even at low resource utilization
      • Wide SIMD memory support can be realistic given mobile application characteristics

  16. Questions?
      • For more information: http://cccp.eecs.umich.edu

  17. Q1: Homogeneity vs. Heterogeneity
      • Overview
        • Heterogeneous CGRAs are common
        • No experiments on the effect of heterogeneity over homogeneity
      • Methodology
        • Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)
        • Decrease the number of PEs supporting the complex ALU and the memory unit
        • Performance goal: 80% of the homogeneous CGRA's performance
      • How about performance?

  18. Performance Degradation
      • [Figure panels: Media; Game]
      • The amount of performance degradation is not substantial
        • Performance is normally not constrained by complex instructions
        • Performance degradation depends much more on memory operations
      • For 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%
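
As a worked example of the 75% figure: assuming the 16-PE homogeneous baseline from the methodology slide, where every PE starts with a complex ALU and a memory unit, the reduction leaves four of each. The helper name `units_after_reduction` is hypothetical, and which PEs keep the units is not specified in the slides.

```python
def units_after_reduction(total_units, reduction_fraction):
    """Units remaining after removing a fraction of them."""
    return round(total_units * (1 - reduction_fraction))


# Assumed baseline: 16 PEs, each with a complex ALU and a memory unit.
baseline_pes = 16
complex_units_left = units_after_reduction(baseline_pes, 0.75)
memory_units_left = units_after_reduction(baseline_pes, 0.75)
```

All 16 PEs would still keep their integer ALUs in this reading; only the complex and memory capability is thinned out.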

  19. Conclusion
      • Heterogeneous FU organization is highly effective
      • Flexible partitioning should be supported to further improve performance
      • Complex PEs can be more energy efficient even at low resource utilization
      • Wide SIMD memory support can be realistic given mobile application characteristics

  20. CGRA: Attractive Alternative to ASICs
      • Suitable for running multimedia applications for future embedded systems
        • High throughput, low power consumption, high flexibility
      • [Figure labels: Morphosys; SiliconHive; ADRES; Viterbi at 80Mbps; H.264 at 30fps; 50-60 MOps/mW]
      • Morphosys: 8x8 array with RISC processor
      • SiliconHive: hierarchical systolic array
      • ADRES: 4x4 array with tightly coupled VLIW