Efficient Performance Scaling of Future CGRAs for Mobile Applications
Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke. December 11, 2012. University of Michigan, Ann Arbor.


Presentation Transcript


Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke

December 11, 2012

  • University of Michigan, Ann Arbor

1


Convergence of Functionalities

[Figure: anatomy of an iPhone 4. Functions shown: 4G wireless, audio, video, 3D, and navigation. Callout: flexible accelerator.]

Convergence of functionalities demands a flexible solution because of design cost and programmability requirements

2


CGRA: Attractive Alternative to ASICs

  • Array of processing elements (PEs) connected by a mesh-like interconnect

  • High throughput with a large number of resources

  • Distributed hardware offers low cost/power consumption

  • High flexibility with dynamic reconfiguration
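The "array of PEs connected in a mesh-like interconnect" can be pictured with a small sketch. It assumes a plain 2D mesh in which each PE talks only to its north, south, west, and east neighbors. The slide says only "mesh-like", so the exact link pattern and the function name `mesh_neighbors` are illustrative assumptions, not the authors' design.

```python
def mesh_neighbors(rows, cols):
    """Map each PE at (row, col) to its neighbors in a plain 2D mesh.

    Assumption: only north/south/west/east links exist; real CGRAs often
    add diagonal or longer-distance links on top of this.
    """
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only coordinates that fall inside the array.
            neighbors[(r, c)] = [
                (nr, nc) for nr, nc in candidates
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


mesh = mesh_neighbors(4, 4)   # a 4x4 array, i.e. 16 PEs
print(len(mesh))              # 16
print(mesh[(0, 0)])           # [(1, 0), (0, 1)]
```

A corner PE has two neighbors and an interior PE has four, so a value may need several hops to cross the array. That routing cost is what the later interconnection-topology question is about.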

3


Bridging the Gap Between Market Demand and Computation Power

How can performance scale while retaining energy efficiency?

[Canali, Internet Computing Magazine, IEEE, 2009]

4


Agenda: Scaling the Energy Efficiency of CGRAs

  • Investigate the key factors and their feasibility in terms of performance and power efficiency

    • Hardware scalability vs. hardware flexibility

  • Interconnection topology

  • Complex PE vs. simple PE

  • Vector memory operation support

  • Homogeneity vs. Heterogeneity

5


Experimental Setup

  • Target applications

    • Media benchmark: AAC decoder, H.264 decoder, and 3D rendering

    • Game physics benchmarks: line of sight, convolution, and conjugate

  • Target architecture: various types of CGRAs

    • 16 ~ 64 heterogeneous/homogeneous resources

  • IMPACT frontend compiler + Edge-centric modulo scheduler

  • Power measurement

    • IBM 65nm technology @ 200MHz/1V

6


Q1: Interconnection Topology

  • Overview

    • Routing overhead limits the performance when increasing the size of the CGRA

    • Common solution: clustering

    • What is the optimal interconnection topology?

  • Methodology

    • Compare the performance of three different clustering schemes.

      • Baseline

      • Fixed partition: CGRAs are physically split into multiple partitions

      • Flexible partition: number of partitions can be dynamically changed from 1 to 8

    • Total number of PEs: 4 to 128
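As a rough illustration of the flexible scheme, the sketch below lists the partition counts from 1 to 8 that split a CGRA into equal-sized partitions. Equal-sized partitions, even divisibility, and the name `partition_options` are assumptions made for this example; the slides do not say how PEs are distributed among partitions.

```python
def partition_options(total_pes, max_partitions=8):
    """List (partitions, PEs per partition) pairs for a flexible CGRA.

    Assumption: partitions are equal-sized, so only partition counts that
    divide the PE count evenly are reported.
    """
    options = []
    for parts in range(1, max_partitions + 1):
        if total_pes % parts == 0:
            options.append((parts, total_pes // parts))
    return options


print(partition_options(64))  # [(1, 64), (2, 32), (4, 16), (8, 8)]
print(partition_options(4))   # [(1, 4), (2, 2), (4, 1)]
```

Under this reading, a fixed-partition design commits to one of these rows at design time, while a flexible design can move between them at run time.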

7


Q1: Interconnection Topology

[Figure: an application's no-DLP loops and DLP loops mapped onto the baseline, fixed-partition, and flexible-mapping CGRAs]

8


Performance Comparison (Base, Fixed, Flex)

  • Fixed partitioning doesn’t always show better performance.

  • Flexible architectures show the best performance and retain scalability

9


Q2: Complex PEs vs. Simple PEs

  • Overview

    • CGRAs with complex PEs are introduced

      • Two level interconnect

      • Number of RFs can decrease

      • Multiple instructions can be chained

    • Challenge: resource utilization

    • Goal: determine the viability of complex PEs in terms of energy consumption

  • Methodology

    • Compare the energy consumption of different PE styles

      • Number of FUs inside a PE: 1 ~ 6

      • Uniform vs. Optimized

10


PE Designs

11


Energy Consumption

  • Energy consumption does not increase dramatically as the number of PEs increases

  • Within a 1.5x energy budget, complex PEs with 2~3 FUs are also viable solutions

12


Q3: SIMD Memory Support

  • Overview

    • SIMD memory support lowers power consumption and reduces the number of instructions

    • Challenge: the degree of DLP

    • Goal: determine the viability of SIMD memory access in terms of energy consumption

  • Methodology

    • Compare the energy consumption across different SIMD widths: 1 ~ 16

13


Relative Energy Consumption

  • Total energy consumption at wider vector widths can be similar to that of a scalar memory unit

    • High degree of spatial locality can compensate for power overheads
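The instruction-count side of this trade-off can be sketched with simple arithmetic. The example assumes the accessed elements are contiguous and aligned, so one SIMD memory instruction covers up to `simd_width` elements. The function name and the 64-element loop are illustrative, not figures from the talk, and the sketch does not model energy.

```python
def memory_instructions(num_elements, simd_width):
    """Memory instructions needed to touch num_elements contiguous elements.

    Assumption: accesses are contiguous and aligned, so each instruction
    moves up to simd_width elements (integer ceiling division).
    """
    return (num_elements + simd_width - 1) // simd_width


# The SIMD widths studied on the slide range from 1 to 16.
for width in (1, 2, 4, 8, 16):
    print(width, memory_instructions(64, width))
# prints: 1 64, 2 32, 4 16, 8 8, 16 4 (one pair per line)
```

A wider unit costs more power per access, but with high spatial locality it issues far fewer accesses, which is how the total energy can end up close to the scalar case.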

14


Conclusion

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support can be realistic given the characteristics of mobile applications.

15


Questions?

  • For more information

    • http://cccp.eecs.umich.edu

16


Q1: Homogeneity vs. Heterogeneity

  • Overview

    • Heterogeneous CGRAs are common

    • No experiments exist on the effect of heterogeneity versus homogeneity

  • Methodology

    • Start from 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)

    • Decrease the number of PEs supporting complex ALU and memory unit

    • Performance goal: 80% of the homogeneous CGRA's performance

How about performance?

17


Performance Degradation

[Figure panels: Media, Game]

  • The performance degradation is not substantial

    • Performance is normally not constrained by complex instructions

  • Performance degradation depends much more on memory operations

  • For 80% of the baseline performance, we can decrease the number of both complex and memory units by up to 75%.
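One way to build intuition for these observations is the resource-constrained lower bound on the initiation interval (ResMII) from modulo scheduling. The sketch below is not the authors' experiment. It assumes integer operations can run on any PE, while complex and memory operations need PEs with the matching unit. The operation counts (40 integer, 4 complex, 8 memory) are hypothetical values chosen only to illustrate the arithmetic.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b


def res_mii(int_ops, complex_ops, mem_ops, total_pes, complex_units, mem_units):
    """Lower bound on the initiation interval from resource counts alone.

    Assumption: every PE executes integer ops; complex and memory ops are
    restricted to PEs that contain the corresponding unit.
    """
    total_ops = int_ops + complex_ops + mem_ops
    return max(
        ceil_div(total_ops, total_pes),        # all ops share all PEs
        ceil_div(complex_ops, complex_units),  # complex ops share complex units
        ceil_div(mem_ops, mem_units),          # memory ops share memory units
    )


# Hypothetical loop body: 40 integer, 4 complex, 8 memory operations.
homogeneous = res_mii(40, 4, 8, total_pes=16, complex_units=16, mem_units=16)
reduced = res_mii(40, 4, 8, total_pes=16, complex_units=4, mem_units=4)
starved = res_mii(40, 4, 8, total_pes=16, complex_units=4, mem_units=1)
print(homogeneous, reduced, starved)  # 4 4 8
```

In this made-up loop, cutting both unit types from 16 to 4 (a 75% reduction) leaves the bound unchanged, while cutting memory units further makes memory operations the bottleneck. The measured results on the slide also depend on routing and scheduling effects that this bound ignores.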

18


Conclusion

  • Heterogeneous FU organization is highly effective.

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support can be realistic given the characteristics of mobile applications.

19


CGRA: Attractive Alternative to ASICs

  • Suitable for running multimedia applications for future embedded systems

    • High throughput, low power consumption, high flexibility

Morphosys, SiliconHive, ADRES

Viterbi at 80 Mbps

H.264 at 30 fps

50-60 MOps/mW

  • Morphosys: 8x8 array with RISC processor

  • SiliconHive: hierarchical systolic array

  • ADRES: 4x4 array with tightly coupled VLIW

20

