
General Overview of An Adaptive Dynamic Extensible Processor



  1. General Overview of An Adaptive Dynamic Extensible Processor
  Hamid Noori, Kazuaki Murakami, Koji Inoue & Victor Goulart
  Kyushu University, Department of Informatics
  Workshop on Introspective Architecture (WISA06)

  2. Agenda • Background • Research goal • General overview of the architecture • Modes of operation • Profiler • Accelerator • Sequencer • Generation of Custom Instructions • Configuration Data for the Accelerator • Experiments and Results • Conclusions & Future work

  3. Background

  4. Some definitions • Hot Basic Block (HBB) • A basic block whose execution frequency is greater than a given threshold specified in the profiler • Custom Instructions (CIs) • Instructions that extend the Instruction Set Architecture (ISA) and are executed on the ACC • Accelerator (ACC) • Custom hardware for executing CIs • Training mode • Operation mode for detecting HBBs and generating CIs • Normal mode • Normal operation mode in which CIs are executed on the ACC

  5. Research Goal • Proposal of an Adaptive Dynamic Extensible Processor for Embedded Systems • Custom instructions are adaptable to the applications • Custom instructions are detected and created during execution/training • Generation of custom instructions is done transparently and automatically • Advantages of the novel approach • Higher performance than GPPs • Higher flexibility compared to Extensible Processors • Shorter TAT (turnaround time) and cheaper design and verification cost compared to ASIPs and Extensible Processors

  6. General overview of the architecture [block diagram] The Adaptive Dynamic Extensible Processor couples a base processor (an N-way in-order general RISC with Fetch, Decode, Execute, Memory and Write stages and a Reg File) with augmented hardware: a Profiler that detects the start addresses of Hot Basic Blocks (HBBs), a Sequencer that switches between the main processor and the ACC, and the ACC, which executes Custom Instructions.

  7. General overview of the architecture • Modes of operation • Training mode • Profiling • Detecting start addresses of Hot Basic Blocks (HBBs) • Generating Custom Instructions • Generating Configuration Data for the ACC • Binary rewriting • Initializing the Sequencer Table ♦ Online • Needs simple hardware for profiling • All tasks run on the base processor ♦ Offline • Needs a PC trace after taken branches/jumps • Normal mode • Profiling (optional) • Executing Custom Instructions on the ACC and the rest of the code on the base processor

  8. Components [block diagram] GPP datapath (Register File, ID/EXE and EXE/MEM pipeline registers, Functional Unit, Mux) plus the augmented hardware: Profiler with Profiler Table (HWT), Sequencer with Sequencer Table, Accelerator with Multi-Context Memory, configuration Cache and DMA, and Online Training logic.

  9. Operation modes [diagram] Training mode, step 1: binary-level profiling while the application runs on the base processor; the Profiler detects the start addresses of HBBs. Training mode, step 2: tools run for generating Custom Instructions, generating configuration data for the ACC, binary rewriting, and initializing the Sequencer Table. Normal mode: the Sequencer monitors the PC and switches between the main processor and the ACC, which executes the CIs.

  10. Profiler [flowchart] The profiler compares the current PC with the previous PC; if they differ by no more than one instruction length, nothing is done. Otherwise a branch or jump has been taken, and the profiler looks up the basic-block start address (BBSA, the target PC) in the Profiler Table: if it is a miss, the address is added as a new entry with its counter set to one; otherwise the counter is incremented.
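To make the table update concrete, here is a minimal software sketch of the profiler bookkeeping; the dictionary, the 8-byte instruction length (matching the PISA-style listings below), and treating every non-sequential PC transition as a taken branch are simplifying assumptions, not the hardware design itself.

    # Sketch of the profiler logic above (illustrative only).
    INSTRUCTION_LENGTH = 8  # bytes per instruction, as in the PISA listings below

    def update_profiler(profiler_table, prev_pc, cur_pc):
        """Record one PC transition in the (software-modelled) profiler table."""
        if cur_pc == prev_pc + INSTRUCTION_LENGTH:
            return  # sequential execution: nothing to do
        # Non-sequential transition: a branch/jump was taken, so cur_pc is a
        # basic-block start address (BBSA). Miss -> new entry with counter 1,
        # hit -> increment the counter.
        profiler_table[cur_pc] = profiler_table.get(cur_pc, 0) + 1

    # Replaying a tiny PC trace: the backward branch to 0x400d10 is counted twice.
    table = {}
    trace = [0x400d00, 0x400d08, 0x400d10, 0x400d18, 0x400d10, 0x400d18, 0x400d10]
    for prev, cur in zip(trace, trace[1:]):
        update_profiler(table, prev, cur)
    print({hex(pc): n for pc, n in table.items()})  # {'0x400d10': 2}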

  11. Detecting Start Addr of HBBs (Threshold = 100)
  Profiler Table: entry 400d10 with taken frequency 500; 400db8 is the branch-target address (BTA) of the bne. Since the counter (500) exceeds the threshold (100), the basic block starting at 400d10 is recorded as hot in the HBB Table (exec freq 500).
  HBB starting at 400d10:
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: bne $2,$0,400db8 <usqrt+0xa8>
  400d58: srl $2,$2,0x1e
  400d60: lw $3,0($29)
  400d68: addu $4,$4,$2
  400d70: sll $8,$8,0x2
  400d78: sll $6,$3,0x1
  400d80: sll $3,$3,0x2
  400d88: addiu $3,$3,1
  400d90: sltu $2,$4,$3
  400d98: sw $6,0($29)

  12. Size of Profiler Table [chart] Number of basic blocks with execution frequency greater than the threshold.

  13. Accelerator (ACC) • The ACC is a matrix of Functional Units (FUs) • The ACC has a two-level configuration memory • A multi-context memory (holds two or four configurations) • A cache • FUs support only logical operations, add/subtract, shifts and compare • The ACC updates the PC • The ACC has a variable delay that depends on the size of the Custom Instruction

  14. Connecting ACC to the Base Processor [diagram] The ACC (FU1–FU4 with its Config Mem) is placed between the DEC/EXE and EXE/MEM pipeline registers, alongside the register file (Reg0–Reg31), the decoder and the Sequencer.

  15. Connecting ACC to the Base Processor [diagram, continued] The same datapath, with the Sequencer's connection on the decoder side also shown.

  16. Sequencer • The sequencer mainly determines the microcode execution sequence • Selects between the decoder and the config memory for reading the RF • Selects between the output of the Functional Unit and that of the Accelerator • Determines when to switch between different contexts of the multi-context memory • Determines when to load configuration data from the cache into the multi-context memory • Checks whether the configuration data of a custom instruction is available • If it is in the multi-context memory, the custom instruction is executed on the accelerator • If it is not in the multi-context memory • If there is enough time to load it from the cache into the multi-context memory, it is loaded and the CI is executed on the ACC • If there is not enough time, the original code is executed
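The dispatch decision the sequencer makes when the PC reaches a custom-instruction address can be sketched as below; the dictionaries, the cycles_until_needed timing test and the return strings are illustrative assumptions rather than the actual hardware interface.

    # Sketch of the sequencer's dispatch decision described above.
    def dispatch(ci_id, multi_context, config_cache, cycles_until_needed, load_latency):
        """Decide where a custom instruction executes (hypothetical software model)."""
        if ci_id in multi_context:
            return "execute CI on the ACC"                  # config already resident
        if ci_id in config_cache and cycles_until_needed >= load_latency:
            multi_context[ci_id] = config_cache[ci_id]      # enough time: stage the config
            return "load config from cache, then execute CI on the ACC"
        return "execute the original code on the base processor"

    multi_context = {"CI1": "cfg1"}
    config_cache = {"CI1": "cfg1", "CI2": "cfg2"}
    print(dispatch("CI2", multi_context, config_cache, cycles_until_needed=10, load_latency=4))
    # -> load config from cache, then execute CI on the ACC
    print(dispatch("CI3", multi_context, config_cache, cycles_until_needed=10, load_latency=4))
    # -> execute the original code on the base processor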

  17. Generation of Custom Instructions • Custom instructions • Exclude floating-point, multiply, divide and load instructions • Include at most one STORE, at most one BRANCH/JUMP and all other fixed-point instructions (see the sketch below) • Simple algorithm for generating custom instructions • HBBs usually include 10–40 instructions for MiBench • The custom instruction generator is executed on the base processor (in online training mode)
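The admission rules above can be phrased as a simple filter; the opcode groupings below are an illustrative subset of MIPS/PISA mnemonics, not an exhaustive list from the presentation.

    # Sketch of the CI admission rules: no FP/multiply/divide/load instructions,
    # at most one store, at most one branch/jump. Opcode sets are illustrative.
    EXCLUDED = {"mult", "div", "lw", "lb", "lh", "lbu", "lhu",
                "mov.d", "add.d", "mul.d", "mfc1", "mtc1"}
    STORES  = {"sw", "sb", "sh"}
    CONTROL = {"beq", "bne", "j", "jr", "jal"}

    def can_form_ci(opcodes):
        """Can this instruction sequence become a single custom instruction?"""
        if any(op in EXCLUDED for op in opcodes):
            return False
        if sum(op in STORES for op in opcodes) > 1:
            return False
        return sum(op in CONTROL for op in opcodes) <= 1

    print(can_form_ci(["addiu", "sll", "and", "sw", "bne"]))  # True
    print(can_form_ci(["addiu", "lw", "sll"]))                # False (contains a load)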

  18. Generating Custom Instructions
  Example HBB (object code):
  4052c0 addiu $29,$29,-32
  4052c8 mov.d $f0,$f12
  4052d0 sw $18,24($29)
  4052d8 addu $18,$0,$6
  4052e0 sw $31,28($29)
  4052e8 sw $16,16($29)
  4052f0 mfc1 $16,$f0
  4052f8 mfc1 $17,$f1
  405300 srl $6,$17,0x14
  405308 andi $6,$6,2047
  405310 sltiu $2,$6,2047
  405318 addu $6,$6,$18
  405320 sltiu $2,$6,2047
  405328 lui $2,32783
  405330 and $17,$17,$2
  405338 andi $2,$6,2047
  405340 sll $2,$2,0x14
  405348 or $17,$17,$2
  405350 mtc1 $16,$f0
  405358 mtc1 $17,$f1
  405360 lw $31,28($29)
  405370 lw $16,16($29)
  405378 addiu $29,$29,32
  405380 jr $31
  Algorithm:
  • Finding the biggest sequence of instructions in the HBB that can be executed on the ACC
  • Moving instructions and appending supportable instructions to the head of the detected instruction sequence after checking flow-dependency and anti-dependency
  • Moving instructions and appending supportable instructions to the tail of the detected instruction sequence after checking flow-dependency and anti-dependency
  • Rewriting the object code if instructions have been moved
  • Moving instructions must not modify the logic of the application
  • Custom instruction generation is done without considering any other constraints

  19. Generating Custom Instructions [diagram: the HBB split into blocks B1 (supported), B2 (not supported), B3 (supported), B4 (not supported), B5 (supported), shown before and after exchanging blocks] • Block 3 (B3) is selected as the biggest instruction sequence that can be executed on the ACC • Block 2 (B2) cannot be executed on the ACC • Block 1 (B1) can be executed on the ACC • If there is no flow or anti-dependency between B1 and B2, they are exchanged, so B1 becomes adjacent to B3 • The same is done for B4 and B5
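A simplified model of this exchange test: blocks are reduced to the register sets they read and write, and two adjacent blocks may be swapped only if they are independent. The set representation is an assumption, and an output-dependency check is added here for safety even though the slides mention only flow and anti-dependencies.

    # Sketch of the block-exchange step: can a supported block (B1) be moved past
    # an unsupported block (B2) so that it borders the selected sequence (B3)?
    def independent(block_a, block_b):
        reads_a, writes_a = block_a
        reads_b, writes_b = block_b
        no_flow   = writes_a.isdisjoint(reads_b)   # B2 does not read what B1 writes
        no_anti   = reads_a.isdisjoint(writes_b)   # B2 does not overwrite what B1 reads
        no_output = writes_a.isdisjoint(writes_b)  # they do not write the same registers
        return no_flow and no_anti and no_output

    b1 = ({"$4"}, {"$7"})         # B1: supported; reads $4, writes $7
    b2 = ({"$8", "$9"}, {"$2"})   # B2: not supported; reads $8 and $9, writes $2
    order = ["B2", "B1", "B3"] if independent(b1, b2) else ["B1", "B2", "B3"]
    print(order)  # ['B2', 'B1', 'B3'] -> B1 now adjoins B3 and can join the CI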

  20. Example 1 [the HBB yields two custom instructions, Customized Instruction 1 and Customized Instruction 2]
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: srl $29,$2,0x1e
  400d58: lw $3,0($29)
  400d60: addu $4,$4,$3
  400d68: sll $8,$8,0x2
  400d70: sll $6,$3,0x1
  400d78: sll $3,$3,0x2
  400d80: addiu $3,$3,1
  400d88: sltu $2,$4,$3
  400d90: sw $6,0($29)
  400d98: bne $2,$0,400db8 <usqrt+0xa8>

  21. Example 2 (rewriting obj code)
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: addu $7,$0,$0
  400d28: lui $9,49152
  400d30: sll $4,$4,0x2
  400d38: and $2,$8,$9
  400d40: srl $2,$2,0x1e
  400d48: lw $22,0($29)
  400d50: addu $4,$4,$2
  400d58: sll $8,$8,0x2
  400d60: sll $6,$3,0x1
  400d68: sll $3,$3,0x2
  400d70: sltu $2,$4,$3
  400d78: bne $2,$0,400db8 <usqrt+0xa8>

  22. ACC Config Data Generation Flow [diagram] MiBench applications run on SimpleScalar (PISA configuration) as the base processor; the Profiler detects the start addresses of HBBs; the HBBs are read from the object code and turned into a DFG.

  23. Preliminary Performance Evaluation [diagram: the instruction sequence mapped onto the FU matrix, depth = 3]
  400d10: addiu $29,$29,-8
  400d18: addu $8,$0,$4
  400d20: sw $0,0($29)
  400d28: addu $4,$0,$0
  400d30: addu $7,$0,$0
  400d38: lui $9,49152
  400d40: sll $4,$4,0x2
  400d48: and $2,$8,$9
  400d50: srl $2,$2,0x1e
  Depth = 3: 1st row = 1 clock, each further row = 0.5 clock, total = 2 clocks on the ACC.
  9 - 2 = 7 clock cycles saved per execution; 7 * freq = reduced clock cycles; 7 * 50K = 350K clock cycles.
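The cycle arithmetic on this slide, restated as a small calculation; the latency model (one cycle for the first row of the FU matrix, half a cycle per additional row) is taken directly from the slide.

    # Worked form of the estimate above: 9 instructions vs. a depth-3 CI on the ACC,
    # executed 50,000 times.
    instructions = 9
    depth        = 3
    exec_freq    = 50_000

    base_cycles = instructions                # single-issue base processor
    acc_cycles  = 1 + 0.5 * (depth - 1)       # 1 + 0.5 + 0.5 = 2 cycles
    saved       = (base_cycles - acc_cycles) * exec_freq
    print(acc_cycles, base_cycles - acc_cycles, saved)   # 2.0 7.0 350000.0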

  24. Results – Number of CIs considering their length [chart; x-axis: length of CIs; value shown: 82]

  25. Results – Percentage of CIs considering their length [chart; x-axis: length of CIs]

  26. More info on Custom Instructions

  27. Conclusions • An Adaptive Dynamic Extensible Processor • Training mode and Normal mode • Advantages • It has a simple profiler • CIs are detected and added after production • There is no need for a new compiler • There is no need for new opcodes for CIs • There is no penalty for the absence of CI config data • Lower design cost and shorter design time • By accelerating a small part of the code that has a high execution frequency, an average speedup of 25% can be obtained; compared with a single-issue processor, the speedup ranges from 7.8% to 52%.

  28. Future Work • Linking HBBs • Providing more details on the architecture (accelerator, sequencer, etc.) • Designing an accelerator to support conditional execution • Developing a complete framework • Extending the ACC for floating-point operations • Substituting the in-order base processor with an out-of-order one

  29. Thank you for listening

  30. Example • Application X • CIx1, 100, input = 3 • CIx2, 200, input = 6 • Total executed instructions = 400,000 • Application Y • CIy1, 50, input = 4 • CIy2, 400, input = 6 • Total executed instructions = 800,000 • Input < 5

  31. Mapping Tool - Example

  32. RFU Design: A Quantitative Approach • The RFU or Accelerator is a matrix of ALUs • No. of inputs • No. of outputs • No. of ALUs • Connections • Location of inputs & outputs • Some definitions: • Frequency and weight are considered in the measurements • CI execution frequency • Weight (so that the measure reflects the number of executed instructions) • Average over all CIs = Σ(Freq × Weight) • Rejection: percentage of CIs that could not be mapped onto the RFU • Coverage: percentage of CIs that could be mapped onto the RFU • Basic Block: a sequence of instructions terminating in a control instruction • Hot Basic Block: a basic block executed more than a threshold number of times
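One way to compute the frequency- and weight-based coverage and rejection figures used in the following slides; interpreting the weight as the CI length (so that freq × weight counts executed instructions) and the sample data are assumptions for illustration, not measured results.

    # Sketch of the weighted mapping metrics defined above.
    cis = [
        # (name, exec_freq, length_in_instructions, mapped_onto_RFU)
        ("CI1", 50_000,  9, True),
        ("CI2", 12_000, 14, True),
        ("CI3",  8_000, 21, False),   # e.g. needs more nodes than the RFU offers
    ]

    total   = sum(freq * length for _, freq, length, _  in cis)
    covered = sum(freq * length for _, freq, length, ok in cis if ok)
    coverage  = 100.0 * covered / total
    rejection = 100.0 - coverage
    print(f"coverage = {coverage:.2f}%, rejection = {rejection:.2f}%")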

  33. RFU Inputs (no constraint) [chart; values shown: 96.37, 89.37, 98.48, 8]

  34. RFU Outputs (no constraint) [chart; values shown: 96.58, 6]

  35. RFU Node No (Input=8, Output=8) [chart; values shown: 94.74, 16]

  36. RFU Width (Inp=8, Out=8, Node=16) [chart; values shown: 95.65, 97.65, 6]

  37. RFU Depth (Inp=8, Out=8, Node=16) [chart; values shown: 93.41, 6]

  38. RFU Configuration • Input=8 • Output=8 • Node=16 • Width = 6,4,3,2,1 • Depth = 5
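A quick consistency check of the chosen configuration: the per-row widths 6, 4, 3, 2, 1 over the 5 rows account for exactly the 16 nodes (the field names below are illustrative).

    # The selected RFU configuration above, with a sanity check on the row widths.
    rfu = {"inputs": 8, "outputs": 8, "nodes": 16, "depth": 5,
           "row_widths": [6, 4, 3, 2, 1]}
    assert len(rfu["row_widths"]) == rfu["depth"]
    assert sum(rfu["row_widths"]) == rfu["nodes"]   # 6 + 4 + 3 + 2 + 1 = 16
    print(rfu)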

  39. General overview of RFU (Architecture 1) • Inputs are applied to the first row • Outputs of each row are connected only to the inputs of the subsequent row • MOVE is used for transferring data • Rejection is 22.47%

  40. General overview of RFU (Architecture 2) • Distributing inputs over different rows • Row 1 = 7 • Row 2 = 2 • Row 3 = 2 • Row 4 = 2 • Row 5 = 1 • Connections with variable length • row1 → row3 = 1 • row1 → row4 = 1 • row1 → row5 = 1 • row2 → row4 = 1 • Rejection is 9.52%

  41. Functional Units • Types of FUs: • Type 1: logical (xor, nor, and, or) • Type 2: add, sub, compare • Type 3: shift (left/right) • Number of each type in the RFU • Type 1 = 6 • Type 2 = 14 • Type 3 = 9

  42. RFU with 8 outputs [diagram] Accelerator output stage: FU1-Output, FU2-Output, FU3-Output and FU4-Output latched into output registers, selected by sequencer/control bits.

  43. Control Bits & Immediate Data • 287 bits are needed as control bits for • Multiplexers • Functional units • 204 bits are needed for immediates • Each CI configuration needs 287 + 204 = 491 bits

  44. CI Configuration Memory • 2K x 1-bit multi-context memory → 4 CI configurations • 8K x 1-bit cache → 16 CI configurations • In total, 20 CI configurations can be kept in the configuration memories
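The capacities above follow directly from the 491-bit configuration size; a small check, assuming "2K" and "8K" denote bits.

    # Each CI configuration needs 287 control bits + 204 immediate bits = 491 bits.
    config_bits        = 287 + 204          # 491
    multi_context_bits = 2 * 1024           # 2K x 1-bit multi-context memory
    cache_bits         = 8 * 1024           # 8K x 1-bit configuration cache

    in_multi_context = multi_context_bits // config_bits   # 4 configurations
    in_cache         = cache_bits // config_bits            # 16 configurations
    print(in_multi_context, in_cache, in_multi_context + in_cache)   # 4 16 20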

  45. Extension of Custom Instructions over HBBs – Motivating Example [diagram: control flow over basic blocks B1–B12 with labels S1–S10 and J1–J3]

  46. Multi-Exit Custom Instructions

  47. Conclusions • Adaptive Dynamic Extensible Processor • Binary Profiler • RFU (Inp=8, Out=6, Nodes=16, Width=6,4,3,2,1, Depth=5) • Sequencer • Advantages of the Adaptive Dynamic Extensible Processor • No additional design time • No extra read or write ports • No additional design and verification cost • No new compiler needed • No new opcodes needed • No penalty for the absence of a custom instruction's configuration data in the multi-context memory

  48. Custom Instruction • Generated from HBBs • Using the HBB table • Object code • A custom instruction can include • logical operations • add/sub • shift • at most one store • at most one control instruction (jump/branch) • no loads • no floating-point instructions • The new object code is logically equivalent to the original

  49. Processor modes (1/2) • Training mode • Profiling applications • Detecting critical regions of code • Generating DFGs for critical regions • Generating custom instructions from the DFGs • Generating new object code • Generating data for the accelerator configuration memories and initializing the sequencer table • Training can be done in the gap between two consecutive executions of the application if possible; otherwise it is done just once before the processor starts its normal operation

  50. Processor modes (2/2) • Normal mode • Profiling applications • Using the data generated in training mode to execute custom instructions on the accelerator • Critical regions of the code are executed as custom instructions on the accelerator, and the remaining part of the code is executed using the processor's functional units as usual
