VEAL: Virtualized Execution Accelerator for Loops

Nate Clark (1), Amir Hormati (2), Scott Mahlke (2)
(1) Georgia Tech, (2) University of Michigan

How to get Efficiency?
• Microarchitecture changes
• Multi-/many-core
• Heterogeneity
[Images: Core 2 Duo, STI Cell]

How is Heterogeneity Used?
[Diagram labels: Engineer/Compiler, Program, GPP, Hetero.]
Control statically placed in binary

Problem With Static Control
[Diagram labels: Engineer/Compiler, Program, CPU, Hetero.]
Not forward/backward compatible

Solution: Virtualization
• Abstract accelerator features
• Reexamine compiler algorithms
• Key: do the hard stuff offline
[Diagram labels: Engineer/Compiler, Program, Dyn. Comp., CPU, Hetero., Offline, Online]

This Paper:
• Examines loops as a heterogeneity target
  • ASICs often implement loops
• Designs a generalized loop accelerator
  • Not covered in this talk
• Explores how to virtualize loop accelerators
  • I.e., abstract the accelerator interface

Loop Accelerator Template

Why More Efficient Than GPP?
• Simple control flow
• Decoupled memory accesses
• I-Cache unnecessary
• Customize execution resources for loops

Proposed Loop Accelerator
• 1 CCA
• 2 Int units
• 16 regs
• Memory (4x)
• 16 Input streams
• 8 Output streams
• 0.8 mm², 90 nm

Modulo Scheduling
+ High-quality software pipelining technique
+ Simple control structure (low HW cost)
- Can be slow, i.e., hard to do dynamically
- Loop restrictions: no side exits, no while loops, must be if-convertible
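To make the "simple control structure" point concrete, here is a minimal sketch of the core bookkeeping in modulo scheduling: a modulo reservation table that lets an operation occupy a functional unit only if that unit is free at `cycle % II`. This is a generic textbook-style illustration, not the scheduler used in VEAL. The function name, the operation list, the latencies, and the unit counts are all hypothetical.

```python
from collections import defaultdict


def modulo_schedule(ops, deps, units, ii):
    """Greedy modulo scheduling sketch (hypothetical helper, not VEAL's algorithm).

    ops:   dict name -> (resource, latency), listed in topological order
    deps:  dict name -> list of same-iteration predecessor names
    units: dict resource -> number of units of that resource
    ii:    candidate initiation interval
    Returns dict name -> start cycle, or None if this II does not fit.
    """
    mrt = defaultdict(int)  # (resource, cycle % ii) -> units already reserved
    start = {}
    for name, (res, _latency) in ops.items():
        # An op may start once all of its predecessors have produced results.
        earliest = max(
            (start[p] + ops[p][1] for p in deps.get(name, [])), default=0
        )
        # Trying ii consecutive cycles visits every modulo slot exactly once.
        for cycle in range(earliest, earliest + ii):
            slot = (res, cycle % ii)
            if mrt[slot] < units[res]:
                mrt[slot] += 1
                start[name] = cycle
                break
        else:
            return None  # every modulo slot for this resource is full
    return start


# Hypothetical loop body: a load, three integer ops, and a store.
EXAMPLE_OPS = {
    "ld": ("mem", 2),
    "add": ("int", 1),
    "sub": ("int", 1),
    "or": ("int", 1),
    "str": ("mem", 1),
}
EXAMPLE_DEPS = {"add": ["ld"], "sub": ["add"], "or": ["sub"], "str": ["or"]}
EXAMPLE_UNITS = {"int": 2, "mem": 1}  # assumed counts, for illustration only

if __name__ == "__main__":
    for ii in (1, 2):
        print(ii, modulo_schedule(EXAMPLE_OPS, EXAMPLE_DEPS, EXAMPLE_UNITS, ii))
```

With these assumed inputs, II = 1 fails because three integer operations cannot share two units in a single modulo slot, while II = 2 succeeds. A real scheduler would also handle loop-carried dependences and backtracking, which this sketch omits.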

Benchmark Execution Time

Modulo Scheduling Basics
[Figure labels: FU, C, Kernel]

Modulo Scheduling Example
1. CCA mapping
2. II calculation
3. Priority
4. Scheduling
5. Reg. assignment/communication
[Figure: example dataflow graph and schedule, with a Time axis]
Priority: 2, 4, 6; 3, 5; 7
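The II calculation step can be sketched with the standard lower bounds from the modulo scheduling literature: a resource bound (ResMII) and a recurrence bound (RecMII), with the minimum II being the larger of the two. The slides do not give the formula VEAL uses, so this is an assumption. The function names, the operation counts, and the dependence cycle below are hypothetical; the unit counts borrow the "1 CCA, 2 Int units" figures from the Proposed Loop Accelerator slide.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on II.

    op_counts:   dict resource -> number of ops in the loop body using it
    unit_counts: dict resource -> number of units of that resource
    """
    return max(ceil(n / unit_counts[r]) for r, n in op_counts.items())


def rec_mii(dependence_cycles):
    """Recurrence-constrained lower bound on II.

    dependence_cycles: list of (total_latency, total_iteration_distance),
    one entry per cycle in the dependence graph. A loop with no
    recurrences is bounded only by resources, so the default is 1.
    """
    return max((ceil(lat / dist) for lat, dist in dependence_cycles), default=1)


def min_ii(op_counts, unit_counts, dependence_cycles):
    """Minimum II is the larger of the resource and recurrence bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(dependence_cycles))


# Hypothetical loop: 1 CCA op and 5 integer ops on 1 CCA and 2 Int units,
# with one recurrence of latency 4 spanning 2 iterations.
EXAMPLE_OP_COUNTS = {"cca": 1, "int": 5}
EXAMPLE_UNIT_COUNTS = {"cca": 1, "int": 2}
EXAMPLE_CYCLES = [(4, 2)]

if __name__ == "__main__":
    print(min_ii(EXAMPLE_OP_COUNTS, EXAMPLE_UNIT_COUNTS, EXAMPLE_CYCLES))
```

Here the integer units set the bound: five operations on two units need at least three cycles per iteration, which exceeds the recurrence bound of two.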

Measured Scheduling Overhead
70% priority, 19% CCA

Supporting Hybrid Compilation
Loop: 1 ld  2 add  3 sub  4 brl CCA  5 or  6 or  7 add  8 str
CCA: and  sub  xor  ret
Data: 0 1 4 6 3 …
Loop: 1 ld  2 add  3 sub  4 brl CCA  5 or  …
Loop: 1 ld  2 add  3 sub  and  sub  xor  5 or  6 or  7 add  8 str
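One reading of the listings on this slide is that the statically identified CCA subgraph is outlined into a small function reached by `brl CCA`, and that the function body can be inlined back into the loop when needed. The sketch below shows only that inlining step on the slide's instruction mnemonics. The function name and the list-of-strings representation are hypothetical, and the slide does not say how VEAL implements this.

```python
def inline_cca_calls(loop, cca_functions):
    """Replace each `brl <name>` with the body of the named function.

    loop:          list of instruction strings, e.g. ["ld", "brl CCA", ...]
    cca_functions: dict function name -> list of instruction strings,
                   terminated by "ret"
    Returns a new instruction list with the calls inlined and "ret" dropped.
    """
    inlined = []
    for instr in loop:
        parts = instr.split()
        is_call = len(parts) == 2 and parts[0] == "brl"
        if is_call and parts[1] in cca_functions:
            # Copy the outlined body, omitting the return instruction.
            inlined.extend(op for op in cca_functions[parts[1]] if op != "ret")
        else:
            inlined.append(instr)
    return inlined


# Instruction sequences taken from the slide's listings.
EXAMPLE_LOOP = ["ld", "add", "sub", "brl CCA", "or", "or", "add", "str"]
EXAMPLE_CCA = {"CCA": ["and", "sub", "xor", "ret"]}

if __name__ == "__main__":
    print(inline_cca_calls(EXAMPLE_LOOP, EXAMPLE_CCA))
```

Applied to the slide's first listing, this reproduces the instruction order of the final listing, where `and`, `sub`, `xor` take the place of the `brl CCA` call.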

Speedups

Summary
• Virtualization is key to heterogeneity
• VEAL speedup: 2.54
  • 2.63 w/o translation (i.e., not binary compatible)
  • 2.17 fully dynamic
• CCA and priority: 89% of scheduling overhead
  • mpeg2dec 2.1 vs. 1.15

Thank you! Questions?
http://www.cc.gatech.edu/~ntclark
http://cccp.eecs.umich.edu/
