ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami†. †Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan


Presentation Transcript


  1. ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions • Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami† • †Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan • ‡School of Electrical and Computer Engineering, University of Tehran

  2. Outline • Introduction • ADEXOR: Adaptive Extensible Processor • Overview • Microarchitecture • Coarse-grained Reconfigurable Functional Unit • Evaluation • Conclusions

  3. Motivation and Solution • Embedded processors have to achieve • Low cost • High performance • Low power or low energy consumption • Key point • How can processors adapt to target applications? • Solution: ASIP w/ reconfigurability • Application-specific ISA • Provide custom instructions (CIs) • Implement reconfigurable FUs

  4. ADaptive EXtensible processOR (ADEXOR) • Has a coarse-grained reconfigurable functional unit • Supports efficient “Multi-Exit CIs” • Achieves high performance and low energy
     Hot basic block (instruction listing from the slide):
       400680 subiu $25,$25,1
       400688 lbu $13,0($7)
       400690 lbu $2,0($4)
       400698 sll $2,$2,0x18
       4006a0 sra $14,$2,0x18
       4006a8 addiu $4,$4,1
       4006b0 srl $8,$2,0x1c
       4006b8 sll $2,$8,0x2
       4006c0 addu $2,$2,$25
       4006c8 bgez $10,4006f0
       4006d0 xori $13,$13,1
       4006d8 addu $10,$10,$2
     Second instruction listing from the slide:
       400680 subiu $25,$25,1
       400698 sll $2,$2,0x18
       4006a0 sra $14,$2,0x18
       400688 lbu $13,0($7)
       4006e0 bgez $10,4006f0
       . . . .
     [Figure: ADEXOR datapath. Labels, in slide order: Register File; Indexed by mtc1 or sequencer; RFU; Configuration Memory; ID/EXE Reg; CRFU; ALU; MUX; Counter; Triggered by mtc1 or sequencer; EXE/MEM Reg; GPP; Augmented HW]
     GPP: General Purpose Processor
     CRFU: Coarse-grained Reconfigurable Functional Unit

  5. CRFU Microarchitecture • 16 FUs controlled by configuration bits • MUX-based interconnection between FUs • Early-stage data can be transferred to output ports
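The three points on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' design: the operation set (`OPS`), the `FUConfig` record, the `("in" | "fu" | "imm", n)` selector encoding, and the rule that an FU may only read results of earlier FUs are all hypothetical, since the slides do not give the CRFU's opcodes, row organization, or configuration-bit format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

MASK32 = 0xFFFFFFFF

# Hypothetical operation set; the slides do not list the CRFU's actual opcodes.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
    "and": lambda a, b: a & b,
    "xor": lambda a, b: a ^ b,
    "sll": lambda a, b: (a << (b & 31)) & MASK32,
    "srl": lambda a, b: (a & MASK32) >> (b & 31),
}

# One selector models one MUX setting (assumed encoding):
#   ("in", i)  -> CRFU input port i
#   ("fu", j)  -> result of an earlier FU j
#   ("imm", v) -> constant v
Source = Tuple[str, int]


@dataclass
class FUConfig:
    """Configuration bits for one FU: its operation and its two MUX selections."""
    op: str
    src_a: Source
    src_b: Source


def run_crfu(config: List[FUConfig], outputs: List[Source], inputs: List[int]) -> List[int]:
    """Evaluate up to 16 configured FUs in order and return the selected outputs."""
    if len(config) > 16:
        raise ValueError("the slide describes a CRFU with 16 FUs")
    results: List[int] = []

    def select(src: Source) -> int:
        kind, value = src
        if kind == "in":
            return inputs[value] & MASK32
        if kind == "fu":
            if value >= len(results):
                raise ValueError("an FU can only read results of earlier FUs in this model")
            return results[value]
        if kind == "imm":
            return value & MASK32
        raise ValueError(f"unknown selector kind: {kind}")

    for fu in config:
        results.append(OPS[fu.op](select(fu.src_a), select(fu.src_b)))
    # Output ports may tap any FU, including early ones.
    return [select(src) for src in outputs]


def demo() -> List[int]:
    """Map the sll/srl/sll/addu chain from the hot basic block onto four FUs."""
    config = [
        FUConfig("sll", ("in", 0), ("imm", 0x18)),  # sll  $2,$2,0x18
        FUConfig("srl", ("fu", 0), ("imm", 0x1C)),  # srl  $8,$2,0x1c
        FUConfig("sll", ("fu", 1), ("imm", 0x2)),   # sll  $2,$8,0x2
        FUConfig("add", ("fu", 2), ("in", 1)),      # addu $2,$2,$25
    ]
    # Final result plus the early-stage value of $8.
    outputs = [("fu", 3), ("fu", 1)]
    return run_crfu(config, outputs, inputs=[0xA5, 10])
```

Calling `demo()` with the inputs shown returns `[50, 10]`: the second value is the intermediate `srl` result, which reaches an output port without passing through the later FUs.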

  6. Supporting Multi-Exit Custom Instructions (MECIs) • Multiple-exit custom instruction: conditional execution + hot-path selection • Example (adpcm): 16 required nodes, two exits • Assumption: at most 16 nodes can be included in one CI

  7. Experimental Setup (1/2) Base Processor Configuration

  8. Experimental Setup (2/2)
     • arch1 (4-read/2-write)
       • Clock freq: 135 MHz
       • RF read/write access
         • Input: 5, 6, 7, or 8 → +1 extra cycle
         • Output: 3 or 4 → +1 extra cycle
         • Output: 5 or 6 → +2 extra cycles
       • CRFU execution
         • arch-1-var: variable (1 or 2 cycles)
         • arch-1-fix: 2 cycles
     • arch2 (8-read/4-write)
       • Clock freq: 130 MHz
       • RF read/write access
         • Input: no extra cycle
         • Output: 5 or 6 → +1 extra cycle
       • CRFU execution
         • arch-2-var: variable (1 or 2 cycles)
         • arch-2-fix: 2 cycles
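The cycle penalties on this slide can be tabulated in a few lines of code. The sketch below assumes two things the slide does not state: that operand counts not listed (for example, 4 or fewer inputs on arch1) cost no extra cycle, and that a custom instruction's latency is simply the CRFU execution cycles plus the register-file penalties. The function names `rf_extra_cycles`, `ci_cycles`, and `ci_time_ns` are illustrative.

```python
# Clock frequencies from the slide, in MHz.
CLOCK_MHZ = {"arch1": 135.0, "arch2": 130.0}


def rf_extra_cycles(arch: str, n_inputs: int, n_outputs: int) -> int:
    """Extra register-file access cycles for a CI with the given operand counts."""
    if not (0 <= n_inputs <= 8 and 0 <= n_outputs <= 6):
        raise ValueError("the slide only covers up to 8 inputs and 6 outputs")
    if arch == "arch1":  # 4 read ports, 2 write ports
        in_extra = 1 if n_inputs >= 5 else 0
        if n_outputs <= 2:
            out_extra = 0  # assumption: fits the 2 write ports, no penalty
        elif n_outputs <= 4:
            out_extra = 1
        else:
            out_extra = 2
    elif arch == "arch2":  # 8 read ports, 4 write ports
        in_extra = 0
        out_extra = 1 if n_outputs >= 5 else 0
    else:
        raise ValueError(f"unknown architecture: {arch}")
    return in_extra + out_extra


def ci_cycles(arch: str, n_inputs: int, n_outputs: int, crfu_cycles: int = 2) -> int:
    """Assumed total CI latency: CRFU execution (1 or 2 cycles) plus RF penalties."""
    if crfu_cycles not in (1, 2):
        raise ValueError("CRFU execution takes 1 or 2 cycles on the slide")
    return crfu_cycles + rf_extra_cycles(arch, n_inputs, n_outputs)


def ci_time_ns(arch: str, n_inputs: int, n_outputs: int, crfu_cycles: int = 2) -> float:
    """Convert the cycle count to nanoseconds at the architecture's clock."""
    return ci_cycles(arch, n_inputs, n_outputs, crfu_cycles) * 1000.0 / CLOCK_MHZ[arch]
```

Under these assumptions, a CI with 8 inputs and 6 outputs in fixed 2-cycle mode takes 5 cycles on arch1 and 3 cycles on arch2, which shows the trade-off between arch2's extra register-file ports and its slightly lower clock frequency.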

  9. Performance Evaluation

  10. Energy Consumption
     • Pros
       • Low activity of hardware components
         • I-Cache, Bpred
         • Decoder
         • Register File
         • Functional Unit
       • Higher I-Cache hit rates
         • Reduce the energy for off-chip accesses
     • Cons
       • RFU configuration
         • Accessing the config. memory
         • Setting control signals in the RFU
       • Increased complexity
       • Communication between the processor’s data-path and the RFU

  11. Total Energy Reduction

  12. Temperature Analysis [Figure: CRFU floor plan (1.7 x 1.7 mm²) showing 16 FUs]

  13. Conclusions • ADEXOR: Adaptive Extensible Processor • Has a coarse-grained reconfigurable functional unit • Supports multi-exit custom instructions • Performance / Energy Analysis • 5x speedup (best case) • 60% energy reduction (best case) • Future Work • Extend for 3D-IC implementation

  14. Acknowledgement • This research was supported in part by • New Energy and Industrial Technology Development Organization • The chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Hitachi Ltd. and Dai Nippon Printing Corporation.

  15. Backup Slides

  16. Area overhead (1/2) • VHDL & Hitachi 0.18 μm library • Base processor: 4.5 mm² • CRFU: 1.7 mm² • CACTI 4.2 (0.18 μm) • I-Cache & D-Cache (32 KB, 4-way): 2.25 mm² • Configuration memory (SRAM, for 32 MECIs): 0.56 mm² • Sequencer (CAM, 32 entries): 0.092 mm² • Base processor (with caches) • Area: 9.0 mm²
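The figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes that 2.25 mm² is the area of each cache (this reproduces the stated 9.0 mm² total) and that the augmented hardware consists of the CRFU, the configuration memory, and the sequencer. The resulting overhead ratio is derived here for illustration and is not a number quoted on the slides.

```python
# Areas in mm^2, taken from the "Area overhead (1/2)" slide.
BASE_CORE = 4.5
CACHE_EACH = 2.25      # assumption: the 2.25 figure applies to each of the two caches
CRFU = 1.7
CONFIG_MEMORY = 0.56
SEQUENCER = 0.092

# Base processor including I-Cache and D-Cache.
base_with_caches = BASE_CORE + 2 * CACHE_EACH

# Assumed set of added blocks; a design invoked only by mtc1 may not need the sequencer.
augmented_hw = CRFU + CONFIG_MEMORY + SEQUENCER

# Derived ratio of added area to the base processor with caches.
overhead_ratio = augmented_hw / base_with_caches

print(f"base with caches: {base_with_caches:.2f} mm^2")
print(f"augmented hardware: {augmented_hw:.3f} mm^2")
print(f"derived overhead: {overhead_ratio:.1%}")
```

With these inputs the base area matches the 9.0 mm² stated on the slide, and the added blocks sum to about 2.35 mm².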

  17. Area overhead (2/2)

  18. Access Reduction [Figure: access reduction for seq and mtc1 invocation; axis ticks 15, 35, 55]

  19. Energy Consumption Breakdown for arch1/invoke-mtc1

  20. Energy Consumption Breakdown for arch2/invoke-seq
