
VEAL: Virtualized Execution Accelerator for Loops

Nate Clark (Georgia Tech), Amir Hormati and Scott Mahlke (U. Michigan)


Presentation Transcript


  1. VEAL: Virtualized Execution Accelerator for Loops. Nate Clark (1), Amir Hormati (2), Scott Mahlke (2). (1) Georgia Tech, (2) U. Michigan

  2. How to get Efficiency? • Microarchitecture changes • Multi- / many-core • Heterogeneity (Pictured: Core2 Duo, STI Cell)

  3. How is Heterogeneity Used? Control statically placed in binary (Diagram labels: Engineer/Compiler, Program, GPP, Hetero.)

  4. Problem With Static Control: Not forward/backward compatible (Diagram labels: Engineer/Compiler, Program, CPU, Hetero.)

  5. Solution: Virtualization • Abstract accelerator features • Reexamine compiler algorithms • Key: do the hard stuff offline (Diagram labels: Engineer/Compiler, Program, Dyn Comp., CPU, Hetero.; Offline, Online)

  6. This Paper: • Examines loops as heterogeneity target (ASICs often implement loops) • Designs a generalized loop accelerator (not covered in this talk) • Explores how to virtualize loop accelerators (i.e., abstract the accelerator interface)

  7. Loop Accelerator Template

  8. Why More Efficient Than GPP? • Simple control flow • Decoupled memory accesses • I-Cache unnecessary • Customize execution resources for loops

  9. Proposed Loop Accelerator • 1 CCA • 2 Int units • 16 regs • Memory (4x) • 16 Input streams • 8 Output streams • 0.8 mm², 90 nm

  10. Modulo Scheduling: + High-quality software pipelining technique + Simple control structure (low HW cost) - Can be slow, i.e., hard to do dynamically - Loop restrictions: no side exits, no while loops, must be if-convertible

  11. Benchmark Execution Time

  12. Modulo Scheduling Basics (Figure labels: FU, C, Kernel)
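
The kernel idea behind modulo scheduling can be illustrated with a small cycle-count sketch. It is not taken from the slides: it assumes the textbook model in which a new iteration starts every II (initiation interval) cycles and one iteration takes `schedule_length` cycles from start to finish. The function names and the numbers in the usage note are hypothetical.

```python
def pipelined_cycles(iterations: int, ii: int, schedule_length: int) -> int:
    """Cycles for a modulo-scheduled loop: one new iteration every `ii` cycles.

    The last iteration starts at (iterations - 1) * ii and then needs
    `schedule_length` more cycles to drain.
    """
    if iterations <= 0:
        return 0
    return (iterations - 1) * ii + schedule_length


def sequential_cycles(iterations: int, schedule_length: int) -> int:
    """Cycles when iterations run back to back with no overlap."""
    return max(iterations, 0) * schedule_length
```

For example, with 100 iterations, an II of 2, and a schedule length of 6, the overlapped loop takes 204 cycles where the non-overlapped one takes 600, so for long loops the II dominates the cost.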

  13. Modulo Scheduling Example. Steps: 1. CCA Mapping 2. II Calculation 3. Priority 4. Scheduling 5. Reg. assignment/communication (Figure: operations 2 through 7 placed on a time axis; priority order: 2, 4, 6; then 3, 5; then 7)
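
Steps 2 through 4 above can be sketched in a few lines. This is a simplified illustration, not the algorithm from the paper: it computes only the resource-constrained lower bound on II, ignores recurrences, assumes the priority order lists every operation after its predecessors, and uses a hypothetical four-operation dependence graph. Only the unit counts (1 CCA, 2 Int units) come from the accelerator slide.

```python
from collections import Counter
from math import ceil


def res_mii(ops, unit_counts):
    """Resource-constrained lower bound on II: the busiest unit class decides."""
    uses = Counter(ops.values())
    return max(ceil(n / unit_counts[kind]) for kind, n in uses.items())


def modulo_schedule(ops, deps, unit_counts, priority, ii):
    """Place ops in priority order using a modulo reservation table.

    ops:  name -> unit kind; deps: name -> [(predecessor, latency)].
    Returns name -> start cycle, or None if some op finds no free slot.
    """
    table = {kind: [0] * ii for kind in unit_counts}
    start = {}
    for op in priority:
        kind = ops[op]
        earliest = max((start[p] + lat for p, lat in deps.get(op, [])), default=0)
        # Trying ii consecutive cycles visits every row of the table once.
        for t in range(earliest, earliest + ii):
            if table[kind][t % ii] < unit_counts[kind]:
                table[kind][t % ii] += 1
                start[op] = t
                break
        else:
            return None
    return start


def find_schedule(ops, deps, unit_counts, priority, max_extra=8):
    """Start at the lower bound and raise II until scheduling succeeds."""
    first = res_mii(ops, unit_counts)
    for ii in range(first, first + max_extra + 1):
        start = modulo_schedule(ops, deps, unit_counts, priority, ii)
        if start is not None:
            return ii, start
    return None


# Hypothetical loop body: three integer ops feeding one CCA op.
UNITS = {"int": 2, "cca": 1}
OPS = {"a": "int", "b": "int", "c": "int", "d": "cca"}
DEPS = {"b": [("a", 1)], "c": [("a", 1)], "d": [("b", 1), ("c", 1)]}
PRIORITY = ["a", "b", "c", "d"]
```

On this toy graph `find_schedule(OPS, DEPS, UNITS, PRIORITY)` settles on an II of 2, because three integer operations cannot share two integer units in a single cycle.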

  14. Measured Scheduling Overhead: 70% Priority, 19% CCA

  15. Supporting Hybrid Compilation. Listing 1: Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or, 6 or, 7 add, 8 str; CCA: and, sub, xor, ret; Data: 0 1 4 6 3 … Listing 2: Loop: 1 ld, 2 add, 3 sub, 4 brl CCA, 5 or … Listing 3: Loop: 1 ld, 2 add, 3 sub, and, sub, xor, 5 or, 6 or, 7 add, 8 str
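
One reading of the listings on this slide is that the CCA subgraph is outlined into a small function reached through `brl CCA`, and that the call can be expanded back into the loop body, as in the last listing. The sketch below illustrates only that expansion on the slide's toy instruction names. The list-of-strings representation and the `inline_calls` helper are assumptions made for illustration, not the paper's binary format.

```python
def inline_calls(loop, functions):
    """Replace each `brl NAME` with the body of NAME, dropping its `ret`."""
    out = []
    for instr in loop:
        op, *args = instr.split()
        if op == "brl" and args and args[0] in functions:
            out.extend(i for i in functions[args[0]] if i != "ret")
        else:
            out.append(instr)
    return out


# Toy program mirroring the slide: a loop that calls an outlined CCA function.
LOOP = ["ld", "add", "sub", "brl CCA", "or", "or", "add", "str"]
FUNCTIONS = {"CCA": ["and", "sub", "xor", "ret"]}
```

Calling `inline_calls(LOOP, FUNCTIONS)` yields the sequence ld, add, sub, and, sub, xor, or, or, add, str, which matches the inlined loop shown on the slide.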

  16. Speedups

  17. Summary • Virtualization key to heterogeneity • VEAL speedup: 2.54 (2.63 w/o translation, i.e., not binary compatible; 2.17 fully dynamic) • CCA and priority: 89% overhead (mpeg2dec 2.1 vs. 1.15)

  18. Thank you! Questions? http://www.cc.gatech.edu/~ntclark http://cccp.eecs.umich.edu/
