Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*

Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.

A Case for Customization
• General purpose processors handle many applications fairly well, but…
  • Each application has different requirements
  • Need for efficient execution
• Impressive design wins through customization
  • Performance, power, area
  • Up to 3.5x speedup [Hot Chips 16]

Instruction Set Customization
• Computationally demanding parts of applications run on special hardware
• New instructions use the special hardware
[Figure: dataflow graph of LD, MPY, XOR, SHR, AND, and MOV operations, shown with a CUSTOM instruction]

Traditional vs. Transparent Customization
• Traditional: high non-recurring engineering (NRE) costs
• Transparent: a “universal” accelerator, the Compute Accelerator (CCA), with no ISA change
[Figure: CPU blocks under traditional and transparent customization]

Design of a Compute Accelerator
• Goal: support important computation subgraphs
• Array of function units
• Exploits subgraph parallelism
• Allows natural data propagation
[Figure: the CCA as an array of function units (FU) fed by inputs IN 1, IN 2, …, shown next to the ALUs in a Fetch / Issue / WB pipeline]

CCA Shape: 164.gzip
[Figure: subgraph shape for 164.gzip, built from Mov, And, and Or operations]

CCA Shape: Blowfish
[Figure: subgraph shape for Blowfish, built from Add, Mov, Xor, and And operations]

CCA Utilization
• Dynamic % of subgraphs using FU

CCA Operations
• Dynamic opcodes in important subgraphs
• Excluded mpy/div, load/store, branch
• Two main categories: logicals, adds
• Subgraphs rarely have more than 3 dependent adds

Proposed CCA Design
• 4 inputs/2 outputs
• Two FU types
  • Arith/logic
  • Logic
• Crossbar between rows
• Captures > 99% of important subgraphs
[Figure: CCA with inputs I1, I2, I3, I4 and outputs O1, O2]
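A rough way to see how rows of function units constrain which subgraphs fit is to place each operation in the earliest row that supports it. The sketch below is illustrative only: the slide does not give the number of rows, their widths, or their order, so `ROWS`, the opcode sets, and the `place` helper are all assumptions made for the example. The 4-input/2-output limit is not checked here.

```python
# Hypothetical layout (not from the slides): (kind, width) for each row.
# "arith" rows handle adds and logic ops; "logic" rows handle logic ops only.
ROWS = [("arith", 4), ("logic", 3), ("arith", 2), ("logic", 1)]
ARITH_OPS = {"ADD", "SUB"}
LOGIC_OPS = {"AND", "OR", "XOR", "MOV"}


def place(ops, rows=ROWS):
    """Greedily place a subgraph onto the rows.

    ops: list of (name, opcode, parent_names) in topological order.
    Parents that are not names of earlier ops are treated as external inputs.
    Returns {name: row_index}, or None if some op cannot be placed.
    """
    free = [width for _, width in rows]
    placed = {}
    for name, opcode, parents in ops:
        # An op must sit below every producer it depends on.
        first = max((placed[p] + 1 for p in parents if p in placed), default=0)
        for r in range(first, len(rows)):
            kind = rows[r][0]
            supported = opcode in LOGIC_OPS or (
                opcode in ARITH_OPS and kind == "arith"
            )
            if supported and free[r] > 0:
                free[r] -= 1
                placed[name] = r
                break
        else:
            return None  # unsupported opcode, or no row left for this op
    return placed


chain = [
    ("a", "ADD", ["in1"]),
    ("b", "XOR", ["a", "in2"]),
    ("c", "ADD", ["b", "in3"]),
    ("d", "XOR", ["c", "in4"]),
]
print(place(chain))
```

Under this assumed layout an alternating add/xor chain fills one row per operation, while a chain of three dependent adds is rejected because only two rows can perform arithmetic. A greedy earliest-row placement can also reject subgraphs that a smarter placement would fit.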

Synthesis of CCA
• Synopsys design tools, 130nm library

CCA Utilization
Approaches, classified by when subgraphs are selected (static or dynamic) and when they are realized (static or dynamic):
• Static selection, static realization: ASIPs
  – ISA change
  – High NRE
• Static selection, dynamic realization
  + Powerful selection
  + Simple hardware
  – Some ISA change
  – Recompile necessary
• Dynamic selection, dynamic realization
  + No ISA change
  + No recompile
  – Simple selection
  – Hardware complexity

Dynamic Selection – Dynamic Realization
• Detect and replace subgraphs in the fill unit of the trace cache
Original instruction stream:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR …
Trace after subgraph selection and insertion:
  …
  LSR r2, r2, #4
  LD r3
  CUSTOM
  SHR …
[Figure: pipeline (I-Cache, Decode, …, Execute, …, Retire) with a Trace Cache; steps labeled Trace Construction and Subgraph Selection and Insertion]
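The replacement in this example can be sketched in a few lines of Python. This is an illustrative model, not the fill-unit algorithm from the talk: the tuple encoding of instructions, the `CCA_OPS` set, the `live_out` argument, and the helpers `can_sink` and `collapse` are all assumptions made for the example. It gathers the CCA-eligible operations in a trace, checks that moving them past the remaining instructions is safe, enforces the 4-input/2-output limit, and emits the remaining instructions followed by a single CUSTOM op.

```python
# Assumption: only these simple ops are CCA-eligible in this sketch.
CCA_OPS = {"ADD", "SUB", "AND", "OR", "XOR"}
MAX_INPUTS, MAX_OUTPUTS = 4, 2  # limits from the proposed CCA design


def can_sink(trace, in_sub):
    """True if every subgraph op may move below all later non-subgraph ops."""
    for i, (_, dst, srcs) in enumerate(trace):
        if not in_sub[i]:
            continue
        for j in range(i + 1, len(trace)):
            if in_sub[j]:
                continue
            _, dst2, srcs2 = trace[j]
            # Reject read-after-write, write-after-write, write-after-read.
            if dst in srcs2 or dst2 == dst or dst2 in srcs:
                return False
    return True


def collapse(trace, live_out):
    """trace: list of (opcode, dst, srcs). Returns a new trace in which the
    eligible ops are replaced by one CUSTOM op, or a copy if that is unsafe."""
    in_sub = [op in CCA_OPS for op, _, _ in trace]
    if not can_sink(trace, in_sub):
        return list(trace)
    inputs, outputs, defined = [], [], set()
    for (_, dst, srcs), sub in zip(trace, in_sub):
        if not sub:
            continue
        for s in srcs:
            # Immediates ("#...") and values produced inside the subgraph
            # are not external inputs.
            if not s.startswith("#") and s not in defined and s not in inputs:
                inputs.append(s)
        defined.add(dst)
        if dst in live_out and dst not in outputs:
            outputs.append(dst)
    if not outputs or len(inputs) > MAX_INPUTS or len(outputs) > MAX_OUTPUTS:
        return list(trace)
    rest = [ins for ins, sub in zip(trace, in_sub) if not sub]
    return rest + [("CUSTOM", tuple(outputs), tuple(inputs))]


trace = [
    ("ADD", "r4", ("r1", "#1")),
    ("LSR", "r2", ("r2", "#4")),
    ("XOR", "r5", ("r4", "r2")),
    ("LD", "r3", ()),
    ("ADD", "r6", ("r5", "r3")),
    ("XOR", "r7", ("r6", "r8")),
]
for ins in collapse(trace, live_out={"r7"}):
    print(ins)
```

Treating r7 as the only value needed after the subgraph is an assumption; if r4, r5, or r6 were also live, the 2-output limit would be exceeded and the trace would be left unchanged.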

Simulation
• SimpleScalar – ARM instruction set
• 4-wide execution, 1 compute accelerator
• 128 RUU entries
• 32k-instruction trace cache, 256-instruction traces
• 5000-cycle selection/insertion latency
• L1 I-cache: 32k, 2-way, 2-cycle hit
• L1 D-cache: 32k, 4-way, 2-cycle hit

Varying CCA Latency
[Figure: bar chart of speedup (axis from 1.00 to 1.45) for CCA latency (Lat) values of 1, 2, 4, and 6. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, 3des, blowfish, rc4, sha, and Average.]

Static Selection – Dynamic Realization
• Compiler selects subgraphs offline
• Communicated to the hardware at load time
• Control bits stored in a table and inserted at decode
Original instruction stream:
  …
  ADD r4, r1, #1
  LSR r2, r2, #4
  XOR r5, r4, r2
  LD r3
  ADD r6, r5, r3
  XOR r7, r6, r8
  SHR …
Compiler output with the subgraph delimited:
  …
  LSR r2, r2, #4
  LD r3
  CCA_Start #2
  ADD r4, r1, #1
  XOR r5, r4, r2
  ADD r6, r5, r3
  XOR r7, r6, r8
  CCA_End
  SHR …
[Figure: pipeline (I-Cache, Decode, …, Execute, …, Retire) with a Control Table]
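The decode-time substitution described in the bullets can be sketched as follows. This is a minimal model under stated assumptions: reading the `#2` operand of `CCA_Start` as an index into the control table is a guess, and the `decode` function, the string encoding of instructions, and the `"cfg2"` control-bits placeholder are hypothetical names introduced only for the example.

```python
def decode(stream, control_table):
    """Replace each CCA_Start ... CCA_End region with one CUSTOM op.

    stream: list of instruction strings.
    control_table: maps the (assumed) CCA_Start operand to control bits
    that were loaded at program load time.
    """
    out = []
    it = iter(stream)
    for line in it:
        if line.startswith("CCA_Start"):
            index = int(line.split("#", 1)[1])
            # Skip the delimited instructions; the CCA executes them as one op.
            for inner in it:
                if inner == "CCA_End":
                    break
            else:
                raise ValueError("CCA_Start without matching CCA_End")
            out.append(f"CUSTOM {control_table[index]}")
        else:
            out.append(line)
    return out


stream = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
]
print(decode(stream, {2: "cfg2"}))
```

Because the compiler has already grouped the subgraph and marked its boundaries, the decoder in this sketch only needs a table lookup, which reflects the "simple hardware" advantage claimed for static selection.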

Dynamic vs. Static Selection
[Figure: bar chart of speedup (axis from 1.00 to 1.45) comparing Dynamic Selection and Static Selection. Benchmark groups: SPECint, MediaBench, Encryption. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, 3des, blowfish, rc4, sha, and Average.]

Summary
• Transparent instruction set customization
  • Benefits of customization without changing the ISA
• Presented design of a compute accelerator
  • Handles the majority of important computation subgraphs in many benchmarks
• Developed ways to utilize the accelerator
  • Table-based static selection – dynamic realization
  • Trace-cache-based dynamic selection – dynamic realization
