An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Nathan Clark, Jason Blome, Michael Chu, Scott Mahlke, Stuart Biles*, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan; *ARM Ltd.

Presentation Transcript


  1. An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
     Nathan Clark, Jason Blome, Michael Chu, Scott Mahlke, Stuart Biles*, Krisztián Flautner*
     Advanced Computer Architecture Lab, University of Michigan
     *ARM Ltd.

  2. The Expression Gap
     • RISC ISAs are lowest common denominator
     • Don’t match applications’ computation
     • Don’t match hardware capabilities
     • Need efficient execution
     • Impressive design wins through customization
     • Performance, power, etc.

  3. Customization Gains: Performance
     [Bar chart: speedup (y-axis, 0 to 4) of OptimoDE (5-issue VLIW, 333 MHz) versus OptimoDE + Custom ISA on 3Des, AES, Blowfish, Md5, Rc4, and SHA.]

  4. Traditional ISA Customization
     • Demanding parts of applications run on special hardware
     • New instructions use the special hardware
     [Diagram labels: CPU, Custom Hardware; instruction sequences "LD MPY XOR SHR XOR MOV XOR" and "SHR LD MPY CUSTOM AND".]

  5. Objectives of Transparent ISA Customization
     • Increase execution efficiency of processors
     • Architecture framework for subgraph acceleration
     • Create a pipeline with fixed interface
     • Design and verify once
     • Support Plug-and-Play style accelerators
     • CISC on Demand

  6. Traditional vs. Transparent Customization
     Traditional:
     • Significant ISA change
     • High NRE
     • Verification
     • Masks
     • Control placed in binary
     • Software migration
     • No legacy codes
     Transparent:
     • No ISA change
     • Baseline CPU unchanged
     • Hardware generates control
     • Eases software burden
     • Forward compatible

  7. Architecture Framework
     [Diagram labels: Application, Compiler, Instructions ("… Subg. …"), Standard Pipeline, Subgraph Execution Unit (Inputs, Outputs), Control Generation, "Augments Instruction Stream"; steps numbered 1 to 4.]

  8. Configurable Compute Array (CCA)
     • Array of function units
     • Two types of FUs: arith/logic, logic
     • 82% of important subgraphs
     • Crossbar between rows
     • 3.19 ns critical path
     • 0.61 mm² in 0.13 µm
     [Diagram labels: inputs I1 to I4, outputs O1 and O2.]
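
One way to picture the row structure described on this slide is a simple fit check: given a subgraph already split into dependence levels, does each level fit in the corresponding CCA row? The sketch below is illustrative only. The row count, the row widths, the opcode sets, and the names `ROWS`, `row_supports`, and `fits_cca` are assumptions and do not come from the slides; the limits of four inputs and two outputs are read off the diagram labels (I1 to I4, O1 and O2).

```python
# Hypothetical CCA model: one (kind, width) pair per row, top to bottom.
# Row count, widths, and opcode sets are illustrative assumptions.
ARITH_OPS = {"ADD", "SUB"}
LOGIC_OPS = {"AND", "OR", "XOR"}

ROWS = [
    ("arith/logic", 4),
    ("logic", 3),
    ("arith/logic", 2),
    ("logic", 1),
]

MAX_INPUTS = 4   # I1..I4 in the slide's diagram
MAX_OUTPUTS = 2  # O1, O2 in the slide's diagram


def row_supports(kind, op):
    """Return True if a function unit of the given kind can execute op."""
    if op in LOGIC_OPS:
        return True  # both FU types handle logic operations
    return kind == "arith/logic" and op in ARITH_OPS


def fits_cca(levels, n_inputs, n_outputs, rows=ROWS):
    """Check whether a levelized subgraph fits the modeled array.

    levels[i] is the list of opcodes that must execute in row i.
    """
    if n_inputs > MAX_INPUTS or n_outputs > MAX_OUTPUTS:
        return False
    if len(levels) > len(rows):
        return False  # subgraph is deeper than the array
    for ops, (kind, width) in zip(levels, rows):
        if len(ops) > width:
            return False  # not enough function units in this row
        if not all(row_supports(kind, op) for op in ops):
            return False  # e.g. an ADD scheduled on a logic-only row
    return True


if __name__ == "__main__":
    print(fits_cca([["AND", "AND"], ["OR"]], n_inputs=3, n_outputs=1))
    print(fits_cca([["ADD"], ["ADD"]], n_inputs=3, n_outputs=1))
```

This check assumes each operation is pinned to the row matching its dependence level; a real mapper could also slide an operation down to a later row, which the sketch does not attempt.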

  9. Architecture Framework
     [Diagram repeated from slide 7.]

  10. Compiler
      • Identify and delineate subgraphs
      • “Procedural Abstraction” – used in compression
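
The delineation step can be sketched as outlining: a chosen run of instructions is moved into a labeled subroutine and replaced by a call, in the style of procedural abstraction. The following is a minimal sketch under stated assumptions, not the authors' compiler pass. The function name `outline_subgraph`, the use of `BL` as the call mnemonic, and the surrounding `LD`/`ST` lines are hypothetical; only the `Subg:` label, the four-instruction body, and `RET` appear in the slides.

```python
def outline_subgraph(instructions, start, end, label="Subg"):
    """Move instructions[start:end] into a subroutine and call it instead.

    Returns (main, subroutine) as two lists of assembly-like strings.
    """
    if not 0 <= start < end <= len(instructions):
        raise ValueError("invalid subgraph range")
    body = instructions[start:end]
    # Assumption: the call is written as an ARM-style branch-and-link.
    main = instructions[:start] + [f"BL {label}"] + instructions[end:]
    subroutine = [f"{label}:"] + body + ["RET"]
    return main, subroutine


if __name__ == "__main__":
    program = [
        "LD r1, [r0]",        # hypothetical surrounding code
        "AND r3, r1, #-4",    # subgraph body as shown on the slides
        "SEXT r2, r4",
        "AND r2, r2, #3",
        "OR r3, r3, r2",
        "ST r3, [r0]",        # hypothetical surrounding code
    ]
    main, sub = outline_subgraph(program, 1, 5)
    print("\n".join(main))
    print("\n".join(sub))
```

Because the outlined code is still ordinary instructions behind a call, a core without an accelerator can simply execute the subroutine, which is consistent with the "no ISA change" goal stated earlier in the talk.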

  11. Architecture Framework
      [Diagram repeated from slide 7.]

  12. Control Generation
      Subg:
          AND  r3, r1, #-4
          SEXT r2, r4
          AND  r2, r2, #3
          OR   r3, r3, r2
          RET
      [Diagram labels: inputs I1 to I4, outputs O1 and O2.]
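
To make the example subgraph concrete, the sketch below evaluates its four operations as a single function of its live-in registers. It is an illustration under assumptions: registers are taken to be 32 bits wide, and `SEXT` is taken to sign-extend a 16-bit value (the slide does not state the width). The helper names `sext16` and `subg` are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def sext16(x):
    """Sign-extend the low 16 bits of x to 32 bits (width is an assumption)."""
    x &= 0xFFFF
    return (x | 0xFFFF0000) if x & 0x8000 else x


def subg(r1, r4):
    """Evaluate the slide's subgraph: live-ins r1 and r4, live-out r3."""
    r3 = r1 & (-4 & MASK32)  # AND  r3, r1, #-4  (clear the low two bits)
    r2 = sext16(r4)          # SEXT r2, r4
    r2 = r2 & 3              # AND  r2, r2, #3   (keep the low two bits)
    r3 = (r3 | r2) & MASK32  # OR   r3, r3, r2
    return r3


if __name__ == "__main__":
    print(hex(subg(0x12345678, 0xFFFE)))  # prints 0x1234567a
```

Seen this way, the subgraph has two inputs and one output, so it stays within the four-input, two-output interface suggested by the diagram labels. Under the assumed semantics the result equals `(r1 & ~3) | (r4 & 3)`, since the sign extension does not change the low two bits.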

  13. Architecture Framework
      [Diagram repeated from slide 7.]

  14. Pipeline Interface

  15. Evaluation
      • Ported Trimaran compiler to ARM ISA
      • Subgraph identification engine
      • Synthesized control generator and accelerator
      • SimpleScalar configured as ARM926EJ-S
      • 5-stage pipeline, 250 MHz
      • 1-cycle 16k I/D caches
      • Single issue
      • 1-cycle subgraph execution latency

  16. Performance Results
      • 1.6 IPC on a single-issue core
      [Bar chart: speedup (y-axis, 1 to 5, with one value labeled 6.51) for three benchmark groups: SPECint, MediaBench, Encryption. Benchmarks shown: rc4, sha, md5, epic, djpeg, cjpeg, unepic, Rijndael, 181.mcf, blowfish, 164.gzip, 300.twolf, 256.bzip2, pegwitenc, pegwitdec, rawdaudio, rawcaudio, 197.parser, gsmencode, gsmdecode, g721encode, g721decode.]

  17. Plug-and-Play Benefits
      • Baseline area: 0.61 mm²
      • Baseline speedup: 1.8

  18. Effect of CCA Pipelining
      • Average: 2.17, 1.86, 1.64, 1.48
      [Bar chart: speedup (y-axis, 1 to 5) with four series labeled 1, 2, 3, 4. Benchmarks shown: rc4, sha, md5, cjpeg, djpeg, rijndael, epicdec, epicenc, blowfish, rawdaudio, rawcaudio, pegwitdec, pegwitenc, gsmdecode, gsmencode, g721decode, g721encode.]

  19. Conclusions
      • Expression gap between ISAs and computation
      • Inherent inefficiency
      • Transparent ISA Customization
      • Fixed core ⇒ low NRE
      • Plug-and-Play accelerators
      • Enables “CISC on demand”
      • 1.8x speedup for 15% area overhead

  20. Questions?
      More info: http://cccp.eecs.umich.edu
