
Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization

Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*

Advanced Computer Architecture Lab, University of Michigan; *ARM Ltd.


Presentation Transcript


  1. Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization • Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner* • Advanced Computer Architecture Lab, University of Michigan • *ARM Ltd.

  2. A Case for Customization • General purpose processors handle many applications fairly well, but… • Each application has different requirements • Need for efficient execution • Impressive design wins through customization • Performance, power, area • Up to 3.5x speedup [Hot Chips 16]

  3. Instruction Set Customization • Computationally demanding parts of applications run on special hardware • New instructions use the special hardware [Figure: instruction dataflow graph with SHR, LD, AND, MPY, XOR, and MOV nodes and a CUSTOM instruction]

  4. Traditional vs. Transparent Customization • Traditional: high non-recurring engineering (NRE) costs • Transparent: "universal" accelerator, the Compute Accelerator (CCA) • No ISA change [Figure: CPU block diagrams for the traditional and transparent approaches]

  5. Design of a Compute Accelerator • Goal: support important computation subgraphs • Array of function units • Exploits subgraph parallelism • Allows natural data propagation [Figure: CCA drawn as rows of function units (FU) with inputs IN 1, IN 2, …, shown with the ALUs in a Fetch, Issue, WB pipeline]

  6. CCA Shape • 164.gzip [Figure: subgraphs built from Mov, And, and Or operations]

  7. CCA Shape • Blowfish [Figure: subgraph built from Add, Mov, Xor, and And operations]

  8. CCA Utilization • Dynamic % of subgraphs using FU

  9. CCA Operations • Dynamic opcodes in important subgraphs • Excluded mpy/div, load/store, branch • Two main categories: logicals, adds • Subgraphs rarely have more than 3 dependent adds

  10. Proposed CCA Design • 4 inputs/2 outputs • Two FU types • Arith/logic • Logic • Crossbar between rows • Captures > 99% of important subgraphs [Figure: CCA with inputs I1, I2, I3, I4 and outputs O1, O2]
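The limits on this slide (4 inputs, 2 outputs) and the exclusions on the previous one (no mpy/div, load/store, or branch) can be read as a legality check on candidate subgraphs. The following is a minimal Python sketch of such a check, not the authors' implementation. The `Op` tuple, the `EXCLUDED` mnemonics, and the function names are hypothetical, and the set of registers that are live after the subgraph is supplied by the caller as an assumption. The sketch does not model the number of rows, FU placement, or the crossbar.

```python
from typing import NamedTuple

class Op(NamedTuple):
    opcode: str
    dest: str
    srcs: tuple  # source register names only; immediates are omitted

# Operation classes the slides say were excluded from subgraphs.
# The specific mnemonics are illustrative assumptions.
EXCLUDED = {"MPY", "DIV", "LD", "ST", "B"}

def subgraph_io(ops, live_out):
    """Return (inputs, outputs) for a subgraph listed in program order.

    An input is a register read before any op in the subgraph writes it.
    An output is a register written by the subgraph that is in live_out.
    """
    defined, inputs = set(), []
    for op in ops:
        for src in op.srcs:
            if src not in defined and src not in inputs:
                inputs.append(src)
        defined.add(op.dest)
    outputs = [reg for reg in sorted(defined) if reg in live_out]
    return inputs, outputs

def fits_cca(ops, live_out, max_inputs=4, max_outputs=2):
    """Check the opcode exclusions and the input/output limits only."""
    if any(op.opcode in EXCLUDED for op in ops):
        return False
    inputs, outputs = subgraph_io(ops, live_out)
    return len(inputs) <= max_inputs and len(outputs) <= max_outputs

# The ADD/XOR chain used later in the deck, assuming only r7 is live after it.
chain = [
    Op("ADD", "r4", ("r1",)),
    Op("XOR", "r5", ("r4", "r2")),
    Op("ADD", "r6", ("r5", "r3")),
    Op("XOR", "r7", ("r6", "r8")),
]
print(subgraph_io(chain, {"r7"}), fits_cca(chain, {"r7"}))
```

Under the stated liveness assumption the chain reads four registers (r1, r2, r3, r8) and produces one, so it fits within the 4-input/2-output limit.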

  11. Synthesis of CCA • Synopsys design tools, 130nm library

  12. CCA Utilization • Two axes: selection (static or dynamic) and realization (static or dynamic) • Static selection, static realization (ASIPs): – ISA change, – High NRE • Static selection, dynamic realization: + Powerful selection, + Simple hardware, – Some ISA change, – Recompile necessary • Dynamic selection, dynamic realization: + No ISA change, + No recompile, – Simple selection, – Hardware complexity

  13. Dynamic Selection – Dynamic Realization • Detect and replace subgraphs in fill unit of trace cache • Original instructions: ADD r4, r1, #1; LSR r2, r2, #4; XOR r5, r4, r2; LD r3; ADD r6, r5, r3; XOR r7, r6, r8; SHR … • Trace after subgraph selection and insertion: … LSR r2, r2, #4; LD r3; CUSTOM; SHR … [Figure: pipeline of I-Cache, Decode, Execute, Retire, with trace construction and subgraph selection and insertion feeding the Trace Cache]
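The rewrite shown on this slide can be sketched as a list transformation: the selected instructions are removed from the trace and a single custom instruction is left in their place. This is a minimal Python sketch under assumptions, not the fill-unit logic itself. The function name, the choice of which indices are selected, and the rule of placing the custom instruction at the position of the last selected instruction are all hypothetical. The sketch also does not check that unselected instructions between the selected ones are independent of the subgraph.

```python
def replace_subgraph(trace, selected, custom="CUSTOM"):
    """Collapse the instructions at indices `selected` into one custom op.

    Unselected instructions keep their relative order. The custom op is
    emitted where the last selected instruction was, so every unselected
    instruction that came before that point still comes before it.
    """
    if not selected:
        return list(trace)
    chosen = set(selected)
    last = max(chosen)
    out = []
    for i, inst in enumerate(trace):
        if i == last:
            out.append(custom)
        elif i not in chosen:
            out.append(inst)
    return out

trace = [
    "ADD r4, r1, #1",
    "LSR r2, r2, #4",
    "XOR r5, r4, r2",
    "LD r3",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "SHR …",
]
# Assumed selection: the two ADDs and two XORs form the subgraph.
rewritten = replace_subgraph(trace, [0, 2, 4, 5])
print(rewritten)
```

With that selection, the remaining order is LSR, LD, CUSTOM, SHR, which matches the rewritten trace on the slide.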

  14. Simulation • SimpleScalar, ARM instruction set • 4-wide execution, 1 compute accelerator • 128 RUU entries • 32k inst. trace cache, 256 inst. traces • 5000-cycle selection/insert latency • L1 I-cache: 32k, 2-way, 2-cycle hit • L1 D-cache: 32k, 4-way, 2-cycle hit

  15. Varying CCA Latency [Figure: bar chart of speedup (axis from 1.00 to 1.45) for CCA latencies (Lat) of 1, 2, 4, and 6, with benchmarks grouped as SPECint, MediaBench, and Encryption, plus an average. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, 3des, blowfish, rc4, sha]

  16. Static Selection – Dynamic Realization • Compiler selects subgraphs offline • Communicated to the hardware at load time • Control bits stored in a table and inserted at decode • Original instructions: ADD r4, r1, #1; LSR r2, r2, #4; XOR r5, r4, r2; LD r3; ADD r6, r5, r3; XOR r7, r6, r8; SHR … • Instructions with the subgraph marked: … LSR r2, r2, #4; LD r3; CCA_Start #2; ADD r4, r1, #1; XOR r5, r4, r2; ADD r6, r5, r3; XOR r7, r6, r8; CCA_End; SHR … [Figure: pipeline of I-Cache, Decode, Execute, Retire, with a Control Table]
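The two steps named on this slide, recording marked subgraphs at load time and substituting them at decode, can be sketched as two passes over an instruction list. This is a hedged Python sketch, not the hardware mechanism. The function names and the table layout are hypothetical, the operand of `CCA_Start` is carried through as an opaque string because the slides do not define it, and the bracketed instruction list stands in for the control bits, whose real encoding is not given.

```python
def build_control_table(program):
    """Load-time pass: record each CCA_Start ... CCA_End region.

    Maps the index of each CCA_Start to (tag, body, end_index). The body
    is a stand-in for the control bits that configure the accelerator.
    """
    table, start = {}, None
    for i, inst in enumerate(program):
        if inst.startswith("CCA_Start"):
            if start is not None:
                raise ValueError("nested CCA_Start")
            start = i
        elif inst == "CCA_End":
            if start is None:
                raise ValueError("CCA_End without CCA_Start")
            operand = program[start].split(maxsplit=1)[1:]  # [] if absent
            tag = operand[0] if operand else None
            table[start] = (tag, program[start + 1:i], i)
            start = None
    if start is not None:
        raise ValueError("unterminated CCA_Start")
    return table

def decode(program, table):
    """Decode-time pass: emit one CCA operation per marked region."""
    out, i = [], 0
    while i < len(program):
        if i in table:
            tag, body, end = table[i]
            out.append(("CCA", tag, tuple(body)))
            i = end + 1  # skip past the matching CCA_End
        else:
            out.append(program[i])
            i += 1
    return out

program = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
    "SHR …",
]
table = build_control_table(program)
decoded = decode(program, table)
print(decoded)
```

Keeping the two passes separate mirrors the slide's split between load time and decode: the table is built once, and decode only performs a lookup.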

  17. Dynamic vs. Static Selection [Figure: bar chart of speedup (axis from 1.00 to 1.45) comparing Dynamic Selection and Static Selection, with benchmarks grouped as SPECint, MediaBench, and Encryption, plus an average. Benchmarks: 164.gzip, 181.mcf, 186.crafty, 197.parser, 300.twolf, cjpeg, djpeg, epic, unepic, g721encode, gsmdecode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawdaudio, 3des, blowfish, rc4, sha]

  18. Summary • Transparent instruction set customization • Benefits of customization without changing ISA • Presented design of a compute accelerator • Handle majority of important computation subgraphs in many benchmarks • Developed ways to utilize the accelerator • Table-based static selection – dynamic realization • Trace cache based dynamic selection – dynamic realization
