Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke. December 11, 2012. University of Michigan, Ann Arbor.


Presentation Transcript


Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke

December 11, 2012

  • University of Michigan, Ann Arbor


Convergence of Functionalities

[Figure: Anatomy of an iPhone 4. Labels: 4G wireless, audio, video, 3D, navigation, and a flexible accelerator.]

Convergence of functionalities demands a flexible solution, because of design cost and the need for programmability.


CGRA: Attractive Alternative to ASICs

  • Array of PEs connected in a mesh-like interconnect

  • High throughput with a large number of resources

  • Distributed hardware offers low cost/power consumption

  • High flexibility with dynamic reconfiguration
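
The mesh-like interconnect in the first bullet can be pictured with a small sketch. The 4x4 array size, the four-neighbor connectivity, and the name `mesh_neighbors` are illustrative assumptions rather than details from the slides; real CGRAs often add diagonal or longer-range links.

```python
def mesh_neighbors(row, col, rows=4, cols=4):
    """Return the PEs directly reachable from PE (row, col) in a plain 2D mesh.

    Assumption: each PE connects only to its north, south, west, and east
    neighbors, with no wrap-around at the array edges.
    """
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


# A corner PE has fewer neighbors than an interior PE.
print(mesh_neighbors(0, 0))
print(mesh_neighbors(1, 1))
```

Values that must travel between non-adjacent PEs take several hops, which is one way to see why routing overhead grows with array size.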


Bridging the Gap Between Market Demand and Computation Power

How can performance be scaled while retaining energy efficiency?

[Canali, Internet Computing Magazine, IEEE, 2009]


Agenda: Scaling the Energy Efficiency of CGRAs

  • Investigate the key factors and their feasibility in terms of performance and power efficiency

    • Hardware scalability vs. hardware flexibility

  • Interconnection topology

  • Complex PE vs. simple PE

  • Vector memory operation support

  • Homogeneity vs. Heterogeneity


Experimental Setup

  • Target applications

    • Media benchmark: AAC decoder, H.264 decoder, and 3D rendering

    • Game physics benchmarks: line of sight, convolution, and conjugate

  • Target architecture: various types of CGRAs

    • 16 to 64 heterogeneous or homogeneous resources

  • Compiler: IMPACT front end with an edge-centric modulo scheduler

  • Power measurement

    • IBM 65 nm technology at 200 MHz, 1 V
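
As general background on modulo scheduling (not a description of the edge-centric scheduler itself), the resource-constrained lower bound on the initiation interval can be sketched as below. The operation classes, the counts, and the name `resource_mii` are hypothetical and do not come from the slides.

```python
import math


def resource_mii(op_counts, unit_counts):
    """Lower bound on the initiation interval imposed by resources alone.

    For each operation class, a loop body with `ops` operations needs at
    least ceil(ops / units) cycles per iteration on `units` units.
    """
    return max(
        (math.ceil(ops / unit_counts[kind]) for kind, ops in op_counts.items() if ops > 0),
        default=1,
    )


# Hypothetical loop body and CGRA configuration.
loop_ops = {"alu": 20, "mem": 6}
cgra_units = {"alu": 16, "mem": 4}
print(resource_mii(loop_ops, cgra_units))
```

A real scheduler also has to respect recurrence constraints and routing, so the achieved initiation interval can be higher than this bound.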


Q1: Interconnection Topology

  • Overview

    • Routing overhead limits the performance when increasing the size of the CGRA

    • Common solution: clustering

    • What is the optimal interconnection topology?

  • Methodology

    • Compare the performance of three different clustering schemes.

      • Baseline

      • Fixed partition: the CGRA is physically split into multiple partitions

      • Flexible partition: the number of partitions can be changed dynamically from 1 to 8

    • Total number of PEs: 4 to 128
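
A minimal sketch of the flexible-partition idea follows. The selection policy, the restriction to power-of-two partition counts, the even split, and the `min_pes_per_partition` threshold are all assumptions made for illustration; the slides only state that the partition count can vary from 1 to 8.

```python
def partition_sizes(total_pes, num_partitions):
    """Split total_pes evenly into num_partitions groups of PEs."""
    if not 1 <= num_partitions <= 8:
        raise ValueError("partition count must be between 1 and 8")
    if total_pes % num_partitions != 0:
        raise ValueError("this sketch assumes an even split")
    return [total_pes // num_partitions] * num_partitions


def choose_partitions(total_pes, has_dlp, min_pes_per_partition=4):
    """Pick a partition count for one loop (hypothetical policy).

    Loops without data-level parallelism keep the whole array as one
    partition; DLP loops use as many partitions as the array allows.
    """
    if not has_dlp:
        return 1
    for n in (8, 4, 2, 1):
        if total_pes % n == 0 and total_pes // n >= min_pes_per_partition:
            return n
    return 1


print(partition_sizes(64, choose_partitions(64, has_dlp=True)))
print(partition_sizes(64, choose_partitions(64, has_dlp=False)))
```

A fixed-partition design corresponds to always returning the same count, regardless of the loop.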


Q1: Interconnection Topology

[Figure. Labels: Application (no-DLP loops, DLP loops); Baseline; Fixed partition; Flexible mapping.]


Performance Comparison (Base, Fixed, Flex)

  • Fixed partitioning does not always improve performance.

  • Flexible architectures show the best performance and retain scalability.


Q2: Complex PEs vs. Simple PEs

  • Overview

    • CGRAs with complex PEs are introduced

      • Two level interconnect

      • Number of RFs can decrease

      • Multiple instructions can be chained

    • Challenge: resource utilization

    • Goal: determine the viability of complex PEs in terms of energy consumption

  • Methodology

    • Compare the energy consumption of different PE styles

      • Number of FUs inside a PE: 1 to 6

      • Uniform vs. Optimized


PE Designs


Energy Consumption

  • Energy consumption does not increase dramatically with the number of PEs

  • Within a 1.5x energy budget, complex PEs with 2 to 3 FUs are also viable solutions
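
The energy-budget check above can be sketched as follows. The power and cycle figures are hypothetical placeholders, not measurements from the slides, and the sketch assumes that both designs run at the same clock frequency.

```python
def relative_energy(power_mw, cycles, base_power_mw, base_cycles):
    """Energy relative to a baseline, assuming both run at the same clock.

    Energy is power times time, and at a fixed clock frequency time is
    proportional to the cycle count, so the frequency cancels out.
    """
    return (power_mw * cycles) / (base_power_mw * base_cycles)


def within_budget(ratio, budget=1.5):
    """True if the relative energy stays inside the given budget."""
    return ratio <= budget


# Hypothetical design: higher power than the baseline, but fewer cycles.
ratio = relative_energy(power_mw=180.0, cycles=750, base_power_mw=100.0, base_cycles=1000)
print(ratio, within_budget(ratio))
```

The point of the sketch is that a PE that draws more power can still fit the budget when it finishes the loop in fewer cycles.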


Q3: SIMD Memory Support

  • Overview

    • SIMD memory support lowers power and reduces the number of instructions

    • Challenge: degree of DLP

    • Goal: determine the viability of SIMD memory access in terms of energy consumption

  • Methodology

    • Compare the energy consumption across SIMD widths from 1 to 16
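
The instruction-count benefit can be illustrated with a short sketch. It assumes contiguous, fully vectorizable accesses, which is the best case; the "degree of DLP" challenge is that real loops do not always meet it. The element count and the name `memory_instructions` are illustrative.

```python
def memory_instructions(num_elements, simd_width):
    """Memory instructions needed to touch num_elements contiguous elements
    when each instruction moves simd_width elements."""
    if simd_width < 1:
        raise ValueError("simd_width must be at least 1")
    # Integer ceiling division: a partial final vector still costs one instruction.
    return -(-num_elements // simd_width)


for width in (1, 2, 4, 8, 16):
    print(width, memory_instructions(64, width))
```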


Relative Energy Consumption

  • Total energy consumption at wider vector widths can be similar to that of a scalar memory unit

    • A high degree of spatial locality can compensate for the power overhead


Conclusion

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support is realistic given the characteristics of mobile applications.


Questions?

  • For more information

    • http://cccp.eecs.umich.edu


Q1: Homogeneity vs. Heterogeneity

  • Overview

    • Heterogeneous CGRAs are common

    • No experiments on the effect of heterogeneity compared with homogeneity

  • Methodology

    • Start from a 16-PE homogeneous CGRA (each PE has an integer ALU, complex ALU, and memory unit)

    • Decrease the number of PEs supporting the complex ALU and memory unit

    • Performance goal: 80% of the performance of the homogeneous CGRA

What about performance?
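
The methodology above amounts to a simple downward search, sketched here under stated assumptions. `measure` is a hypothetical caller-supplied function (for example, a wrapper around a compile-and-simulate run), the toy model in the usage lines is invented, and the sketch varies a single unit count although the slides reduce complex ALUs and memory units.

```python
def min_units_meeting_goal(measure, total_pes=16, goal=0.8):
    """Find the fewest specialized PEs that still meet the performance goal.

    measure(n) returns the performance with n PEs that support the
    specialized unit. Assumption: performance does not improve as units
    are removed, so the search can stop at the first failure.
    """
    baseline = measure(total_pes)  # homogeneous CGRA: every PE has the unit
    best = total_pes
    for n in range(total_pes - 1, 0, -1):
        if measure(n) < goal * baseline:
            break
        best = n
    return best


# Toy performance model, for illustration only.
print(min_units_meeting_goal(lambda n: n / (n + 2)))
```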


Performance Degradation

[Figure: performance degradation for the Media and Game benchmarks.]

  • The performance degradation is not substantial

    • Performance is normally not constrained by complex instructions

  • Performance degradation depends much more on memory operations

  • For 80% of the baseline performance, we can decrease the number of both complex and memory units by up to 75%.


Conclusion

  • Heterogeneous FU organization is highly effective.

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support is realistic given the characteristics of mobile applications.


CGRA: Attractive Alternative to ASICs

  • Suitable for running multimedia applications for future embedded systems

    • High throughput, low power consumption, high flexibility

[Figure: Morphosys, SiliconHive, and ADRES. Labels: Viterbi at 80 Mbps; H.264 at 30 fps; 50-60 MOps/mW.]

  • Morphosys: 8x8 array with RISC processor

  • SiliconHive: hierarchical systolic array

  • ADRES: 4x4 array with tightly coupled VLIW
