

Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke

December 11, 2012

  • University of Michigan, Ann Arbor

Convergence of Functionalities

[Figure: Anatomy of an iPhone4. Labels: 4G Wireless, Audio, Video, 3D, Navigation; callout: Flexible Accelerator!]

Convergence of functionalities demands a flexible solution because of design cost and programmability requirements.

CGRA: Attractive Alternative to ASICs

  • Array of PEs connected in a mesh-like interconnect

  • High throughput with a large number of resources

  • Distributed hardware offers low cost/power consumption

  • High flexibility with dynamic reconfiguration

Bridging the Gap Between Market Demand and Computation Power

How can performance scale while retaining energy efficiency?

[Canali, Internet Computing Magazine, IEEE, 2009]

Agenda: Scaling the Energy Efficiency of CGRAs

  • Investigate the key factors and their feasibility in terms of performance and power efficiency

    • Hardware scalability vs. hardware flexibility

  • Interconnection topology

  • Complex PE vs. simple PE

  • Vector memory operation support

  • Homogeneity vs. Heterogeneity

Experimental Setup

  • Target applications

    • Media benchmark: AAC decoder, H.264 decoder, and 3D rendering

    • Game physics benchmarks: line of sight, convolution, and conjugate

  • Target architecture: various types of CGRAs

    • 16 ~ 64 heterogeneous/homogeneous resources

  • IMPACT frontend compiler + Edge-centric modulo scheduler

  • Power measurement

    • IBM 65nm technology @ 200MHz/1V
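Since the later slides compare energy rather than power, it may help to recall how the two relate at a fixed clock. The sketch below is only an illustration, assuming energy is taken as average power multiplied by execution time; the function name and parameters are hypothetical and are not taken from the presentation.

```python
CLOCK_HZ = 200e6  # 200 MHz, the clock frequency given in the setup above


def energy_joules(avg_power_watts, cycles, clock_hz=CLOCK_HZ):
    """Energy = average power * execution time, with time = cycles / frequency.

    Assumption: a simple average-power model, not the authors' measurement flow.
    """
    return avg_power_watts * cycles / clock_hz
```

Under this model, a schedule that needs fewer cycles can use less energy even when its average power is higher.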

Q1: Interconnection Topology

  • Overview

    • Routing overhead limits the performance when increasing the size of the CGRA

    • Common solution: clustering

    • What is the optimal interconnection topology?

  • Methodology

    • Compare the performance of three different clustering schemes.

      • Baseline

      • Fixed partition: CGRAs are physically split into multiple partitions

      • Flexible partition: number of partitions can be dynamically changed from 1 to 8

    • Total number of PEs: 4 to 128
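As a rough illustration of the flexible-partition design space, the sketch below lists the ways an array could be split into 1 to 8 partitions. It assumes equal-sized partitions, which the slides do not state; the function name is hypothetical.

```python
def partition_options(total_pes, max_partitions=8):
    """List (partition_count, pes_per_partition) pairs for an array of PEs.

    Assumption: partitions are equal in size, so only partition counts
    that divide total_pes evenly are returned.
    """
    options = []
    for count in range(1, max_partitions + 1):
        if total_pes % count == 0:
            options.append((count, total_pes // count))
    return options


print(partition_options(64))  # [(1, 64), (2, 32), (4, 16), (8, 8)]
```

A fixed-partition design corresponds to picking one of these entries at design time, while a flexible design can move between them at run time.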

Q1: Interconnection Topology

[Figure labels: Application (No-DLP loops, DLP loops); Baseline, Fixed partition, Flexible mapping]

Performance Comparison (Base, Fixed, Flex)

  • Fixed partitioning doesn’t always show better performance.

  • Flexible architectures show the best performance and retain scalability

Q2: Complex PEs vs. Simple PEs

  • Overview

    • CGRAs with complex PEs are introduced

      • Two level interconnect

      • Number of RFs can decrease

      • Multiple instructions can be chained

    • Challenge: resource utilization

    • Goal: determine the viability of complex PEs in terms of energy consumption

  • Methodology

    • Compare the energy consumption of different PE styles

      • Number of FUs inside a PE: 1 ~ 6

      • Uniform vs. Optimized

PE Designs

Energy Consumption

  • Energy consumption does not increase dramatically with the number of PEs

  • Within a 1.5x energy budget, complex PEs with 2~3 FUs can also be suitable solutions
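The budget check described above can be sketched as a simple filter. This is a minimal illustration, assuming the budget is defined relative to a PE with a single FU; the function name is hypothetical and the numbers in the usage lines are made-up placeholders, not results from the presentation.

```python
def designs_within_budget(energy_by_fus, budget_ratio=1.5, baseline_fus=1):
    """Return the FU counts whose energy stays within budget_ratio of the baseline.

    energy_by_fus maps "number of FUs inside a PE" to an energy value.
    Assumption: the baseline is the single-FU PE design.
    """
    limit = budget_ratio * energy_by_fus[baseline_fus]
    return sorted(fus for fus, energy in energy_by_fus.items() if energy <= limit)


# Placeholder values for illustration only, NOT measured data.
placeholder = {1: 100.0, 2: 120.0, 3: 160.0}
print(designs_within_budget(placeholder))  # [1, 2]
```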

Q3: SIMD Memory Support

  • Overview

    • SIMD memory support lowers power and reduces the number of instructions

    • Challenge: degree of DLP.

    • Goal: determine the viability of SIMD memory access in terms of energy consumption

  • Methodology

    • Compare the energy consumption across SIMD widths: 1 ~ 16

Relative Energy Consumption

  • At wider vector widths, total energy consumption can be similar to that of a scalar memory unit

    • High degree of spatial locality can compensate for power overheads
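One way to see why spatial locality can offset the power overhead is to compare energy per useful element. The model below is a simplified assumption for illustration (one energy cost per access, amortized over the lanes actually used); the names are hypothetical and it is not the authors' power model.

```python
def energy_per_element(access_energy, useful_lanes):
    """Energy of one memory access divided by the elements it usefully delivers."""
    return access_energy / useful_lanes


def vector_is_competitive(scalar_energy, vector_energy, useful_lanes):
    """True when a vector access costs no more per element than a scalar access.

    Assumption: useful_lanes is the average number of lanes whose data is
    actually consumed, which depends on the spatial locality of the loop.
    """
    return energy_per_element(vector_energy, useful_lanes) <= scalar_energy
```

With this model, a vector access that costs four times a scalar access breaks even once at least four lanes are useful on average.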

Conclusion

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support can be realistic given the characteristics of mobile applications.

Questions?

  • For more information

    • http://cccp.eecs.umich.edu

Q1: Homogeneity vs. Heterogeneity

  • Overview

    • Heterogeneous CGRAs are common

    • No experiments on the effect of heterogeneity over homogeneity

  • Methodology

    • Start from 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)

    • Decrease the number of PEs supporting complex ALU and memory unit

    • Performance goal: 80% of the homogeneous CGRA's performance

How about performance?

Performance Degradation

[Figure panels: Media, Game]

  • The performance degradation is not substantial

    • Performance is normally not constrained by the complex instructions

  • Performance degradation depends much more on memory operations

  • For 80% of the baseline performance, we can decrease the number of both complex and memory units by up to 75%.
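To make the 75% figure concrete for the 16-PE starting point, the sketch below works out the remaining unit count and shows how a search for the smallest configuration meeting the 80% goal could look. Both function names are hypothetical, and the search assumes a table of measured performance per unit count is available.

```python
def units_after_reduction(total_units, reduction_percent):
    """Number of units left after removing reduction_percent of them."""
    return total_units * (100 - reduction_percent) // 100


def smallest_config_meeting_goal(perf_by_units, baseline_units, goal=0.8):
    """Smallest unit count whose performance is at least goal * baseline.

    perf_by_units maps "number of PEs with the unit" to measured performance;
    the caller supplies the measurements, none are assumed here.
    """
    threshold = goal * perf_by_units[baseline_units]
    return min(units for units, perf in perf_by_units.items() if perf >= threshold)


print(units_after_reduction(16, 75))  # 4
```

So in a 16-PE array, a 75% reduction leaves 4 PEs with a complex ALU and 4 PEs with a memory unit.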

Conclusion

  • Heterogeneous FU organization is highly effective.

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support can be realistic given the characteristics of mobile applications.

CGRA: Attractive Alternative to ASICs

  • Suitable for running multimedia applications for future embedded systems

    • High throughput, low power consumption, high flexibility

[Figure labels: Morphosys, SiliconHive, ADRES; viterbi at 80 Mbps; h.264 at 30 fps; 50-60 MOps/mW]

  • Morphosys: 8x8 array with RISC processor

  • SiliconHive: hierarchical systolic array

  • ADRES: 4x4 array with tightly coupled VLIW
