Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke

December 11, 2012

  • University of Michigan, Ann Arbor

Convergence of Functionalities

[Figure: Anatomy of an iPhone 4. Labels: 4G Wireless, Audio, Video, 3D, Navigation. Callout: Flexible Accelerator!]

Convergence of functionalities demands a flexible solution because of design cost and programmability requirements.

CGRA: Attractive Alternative to ASICs

  • Array of processing elements (PEs) connected in a mesh-like interconnect

  • High throughput with a large number of resources

  • Distributed hardware offers low cost/power consumption

  • High flexibility with dynamic reconfiguration
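The array organization above can be pictured with a minimal sketch. It assumes a plain 2D mesh in which each PE connects only to its north, south, east, and west neighbors, with no wraparound; the slides say only "mesh-like", so real designs may add other links. The function name `mesh_neighbors` is hypothetical.

```python
def mesh_neighbors(rows, cols):
    """Map each PE coordinate to its directly connected neighbors in a 2D mesh.

    Assumption: 4-neighbor connectivity without wraparound links.
    """
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only the candidates that fall inside the array.
            neighbors[(r, c)] = [
                (nr, nc)
                for nr, nc in candidates
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


# A 4x4 array: corner PEs have 2 neighbors, interior PEs have 4.
mesh = mesh_neighbors(4, 4)
```

Because each PE reaches only its neighbors directly, values that travel farther need routing hops through intermediate PEs, which is where routing overhead comes from as the array grows.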

Bridging the Gap Between Market Demand and Computation Power

How can performance be scaled while retaining energy efficiency?

[Canali, Internet Computing Magazine, IEEE, 2009]

Agenda: Scaling the Energy Efficiency of CGRAs

  • Investigate the key factors and their feasibility in terms of performance and power efficiency

    • Hardware scalability vs. hardware flexibility

  • Interconnection topology

  • Complex PE vs. simple PE

  • Vector memory operation support

  • Homogeneity vs. Heterogeneity

Experimental Setup

  • Target applications

    • Media benchmark: AAC decoder, H.264 decoder, and 3D rendering

    • Game physics benchmarks: line of sight, convolution, and conjugate

  • Target architecture: various types of CGRAs

    • 16 ~ 64 heterogeneous/homogeneous resources

  • IMPACT frontend compiler + Edge-centric modulo scheduler

  • Power measurement

    • IBM 65nm technology @ 200MHz/1V
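Modulo scheduling looks for the smallest initiation interval (II) at which loop iterations can be overlapped. The sketch below shows the standard lower bound on II used in modulo scheduling in general; it is not the Edge-centric modulo scheduler itself. It assumes fully pipelined units that each issue one operation per cycle, and the operation counts and unit counts are hypothetical.

```python
import math


def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on the initiation interval (II).

    Assumes every unit is fully pipelined and issues one operation per cycle.
    """
    bounds = [
        math.ceil(ops / unit_counts[kind])
        for kind, ops in op_counts.items()
        if ops > 0
    ]
    return max(bounds, default=1)


def min_ii(op_counts, unit_counts, rec_mii=1):
    """Lower bound on II: the larger of the resource and recurrence bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii)


# Hypothetical loop body: 20 ALU operations and 6 memory operations.
loop_ops = {"alu": 20, "mem": 6}
wide = min_ii(loop_ops, {"alu": 16, "mem": 4})    # several memory units
narrow = min_ii(loop_ops, {"alu": 16, "mem": 1})  # a single memory unit
```

With these made-up counts the bound is set by whichever resource class is most oversubscribed, which is why the mix of units in the array matters as much as their total number.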

Q1: Interconnection Topology

  • Overview

    • Routing overhead limits the performance when increasing the size of the CGRA

    • Common solution: clustering

    • What is the optimal interconnection topology?

  • Methodology

    • Compare the performance of three different clustering schemes.

      • Baseline

      • Fixed partition: CGRAs are physically split into multiple partitions

      • Flexible partition: number of partitions can be dynamically changed from 1 to 8

    • Total number of PEs: 4 to 128
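As a rough sketch of the flexible scheme, the code below enumerates the even splits of an array into 1 to 8 partitions and picks one per loop. The selection policy is an assumption made for illustration, not the authors' method: loops without DLP get the whole array, and DLP loops get as many partitions as possible while each keeps at least `min_pes_per_partition` PEs (a hypothetical parameter).

```python
def partition_options(total_pes, max_partitions=8):
    """List (partitions, PEs per partition) pairs that split the array evenly."""
    return [
        (p, total_pes // p)
        for p in range(1, max_partitions + 1)
        if total_pes % p == 0
    ]


def choose_partitions(total_pes, has_dlp, min_pes_per_partition=4):
    """Pick a partitioning for one loop under the assumed policy."""
    options = partition_options(total_pes)
    if not has_dlp:
        # No data-level parallelism: map the loop onto the whole array.
        return options[0]
    # DLP loop: prefer the largest partition count that keeps partitions usable.
    usable = [opt for opt in options if opt[1] >= min_pes_per_partition]
    return usable[-1] if usable else options[0]


dlp_choice = choose_partitions(64, has_dlp=True)
no_dlp_choice = choose_partitions(64, has_dlp=False)
```

A fixed-partition design corresponds to always returning the same entry from `partition_options`, whatever the loop looks like.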

Q1: Interconnection Topology

[Figure: an application's No-DLP loops and DLP loops mapped onto the Baseline, Fixed partition, and Flexible mapping architectures]

Performance Comparison (Base, Fixed, Flex)

  • Fixed partitioning doesn’t always show better performance.

  • Flexible architectures show the best performance and retain scalability

Q2: Complex PEs vs. Simple PEs

  • Overview

    • CGRAs with complex PEs are introduced

      • Two level interconnect

      • Number of RFs can decrease

      • Multiple instructions can be chained

    • Challenge: resource utilization

    • Goal: determine the viability of complex PEs in terms of energy consumption

  • Methodology

    • Compare the energy consumption of different PE styles

      • Number of FUs inside a PE: 1 ~ 6

      • Uniform vs. Optimized

PE Designs

Energy Consumption

  • Energy consumption does not increase dramatically with the number of PEs

  • Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be viable solutions
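The budget check described above can be sketched as a simple filter. The function name and the energy values are hypothetical placeholders chosen for illustration; they are not measurements from the slides.

```python
def within_energy_budget(energy_by_fu_count, baseline_fus=1, budget=1.5):
    """Return the FU counts whose energy stays within budget x the baseline."""
    baseline = energy_by_fu_count[baseline_fus]
    return sorted(
        fus
        for fus, energy in energy_by_fu_count.items()
        if energy / baseline <= budget
    )


# Placeholder energy values (arbitrary units), NOT data from the presentation.
placeholder_energy = {1: 10.0, 2: 12.0, 3: 15.0, 4: 20.0}
viable = within_energy_budget(placeholder_energy)
```

The baseline here is assumed to be the simple PE with one FU; a different reference design would only change `baseline_fus`.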

Q3: SIMD Memory Support

  • Overview

    • SIMD memory support reduces power and the number of instructions

    • Challenge: degree of DLP.

    • Goal: determine the viability of SIMD memory access in terms of energy consumption

  • Methodology

    • Compare the energy consumption at different SIMD widths: 1 ~ 16
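To see why wider SIMD memory access needs fewer instructions, the sketch below counts the memory instructions required to touch a block of elements at each width in the range studied. It assumes contiguous, aligned accesses with enough DLP to fill every lane; the element count of 64 is a hypothetical example.

```python
import math


def memory_instructions(num_elements, simd_width):
    """Memory instructions needed to access num_elements contiguous elements."""
    return math.ceil(num_elements / simd_width)


# Instruction counts for a hypothetical 64-element access at each SIMD width.
counts = {width: memory_instructions(64, width) for width in (1, 2, 4, 8, 16)}
```

When the degree of DLP is low, lanes go unused and the count stops shrinking, which is the challenge the slide points to.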

Relative Energy Consumption

  • Total energy consumption at wider vector widths can be at a level similar to that of a scalar memory unit

    • High degree of spatial locality can compensate for power overheads

Conclusion

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support can be realistic given the characteristics of mobile applications.

Questions?

  • For more information

    • http://cccp.eecs.umich.edu

Q1: Homogeneity vs. Heterogeneity

  • Overview

    • Heterogeneous CGRAs are common

    • No experiments on the effect of heterogeneity over homogeneity

  • Methodology

    • Start from 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)

    • Decrease the number of PEs supporting complex ALU and memory unit

    • Performance goal: 80% of the homogeneous CGRA's performance

How about performance?

Performance Degradation

[Figure: performance degradation, with panels for the Media and Game benchmarks]

  • The amount of performance degradation is not substantial

    • Performance is normally not constrained by the complex instructions

  • Performance degradation depends much more on memory operations

  • For 80% of the baseline performance, we can decrease the number of both complex and memory units by up to 75%.
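The arithmetic behind the last point can be sketched as follows. The sketch assumes, as one reading of the methodology, that every PE keeps its integer ALU and only the complex ALUs and memory units are removed; the function name and dictionary keys are hypothetical.

```python
def heterogeneous_config(num_pes=16, reduction=0.75):
    """Count the units of each kind after trimming complex and memory units.

    Assumption: all PEs keep an integer ALU; only the specialized units shrink.
    """
    specialized = round(num_pes * (1 - reduction))
    return {
        "integer_alu": num_pes,
        "complex_alu": specialized,
        "memory_unit": specialized,
    }


# Starting from the 16-PE homogeneous CGRA and applying the 75% reduction.
config = heterogeneous_config()
```

Under this reading, a 75% reduction leaves 4 of the 16 PEs with a complex ALU and 4 with a memory unit.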

Conclusion

  • Heterogeneous FU organization is highly effective.

  • Flexible partitioning should be supported to further improve performance.

  • Complex PEs can be more energy efficient even at low resource utilization.

  • Wide SIMD memory support can be realistic given the characteristics of mobile applications.

CGRA: Attractive Alternative to ASICs

  • Suitable for running multimedia applications for future embedded systems

    • High throughput, low power consumption, high flexibility

[Figure: Morphosys, SiliconHive, and ADRES. Annotations: viterbi at 80Mbps, h.264 at 30fps, 50-60 MOps/mW]

  • Morphosys: 8x8 array with RISC processor

  • SiliconHive: hierarchical systolic array

  • ADRES: 4x4 array with tightly coupled VLIW
