
Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers

Stephen Hines, David Whalley and Gary Tyson. Computer Science Dept., Florida State University. October 23, 2006.


Presentation Transcript


  1. Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines, David Whalley and Gary Tyson Computer Science Dept. Florida State University October 23, 2006

  2. Instruction Packing
     • Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
     • Allow multiple instruction fetches from the IRF by packing instruction references together
       • Tightly packed: multiple IRF references
       • Loosely packed: piggybacks an IRF reference onto an existing instruction
     • Facilitate parameterization of some instructions using an Immediate Table (IMM)

  3. Execution of IRF Instructions
     [Figure: pipeline diagram, "Executing a Tightly Packed Param4c Instruction". Labels: Instruction Fetch Stage, First Half of Instruction Decode Stage, PC, Instruction Cache, IF/ID, packed instruction, IRF, IMM, IRWP, To Instruction Decoder.]

  4. Outline
     • Introduction
     • Improved Promotion to the IRF
     • Compiler Optimizations
       • Instruction Selection
       • Register Re-assignment
       • Instruction Scheduling
     • Experimental Evaluation
     • Conclusions & Future Work

  5. Improved Promotion to the IRF
     • Different classes of instructions can consume 1 to 5 slots
     • More accurately model the benefits of promoting from one class of instruction to another
     • Original IRF papers did not promote multiple I-type instructions with different default immediate values
       • addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the IRF, no matter how frequently they occurred

  6. Mixed Profiling
     • Static profiling is best for decreasing code size
     • Dynamic profiling is best for reducing energy consumption
     • Can simultaneously weight static and dynamic profile data to obtain a mixed result that has both good code compression and reduced energy consumption
     • Can obtain most of the benefits of individual static/dynamic profiling
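The weighting described on the Mixed Profiling slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear blend, the normalization of each profile to a fraction, and the default weight of 0.5 are all assumptions; the slides only say that static and dynamic data are weighted together. The IRF size of 32 follows the 32-entry figure used elsewhere in the deck.

```python
def mixed_scores(static_counts, dynamic_counts, static_weight=0.5):
    """Blend normalized static and dynamic frequencies (hypothetical linear blend)."""
    static_total = sum(static_counts.values()) or 1
    dynamic_total = sum(dynamic_counts.values()) or 1
    scores = {}
    for insn in set(static_counts) | set(dynamic_counts):
        s = static_counts.get(insn, 0) / static_total    # share of static code
        d = dynamic_counts.get(insn, 0) / dynamic_total  # share of executed code
        scores[insn] = static_weight * s + (1.0 - static_weight) * d
    return scores


def select_for_irf(scores, irf_size=32):
    """Pick the highest-scoring instructions; ties are broken by name."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [insn for insn, _ in ranked[:irf_size]]
```

With static_weight=1.0 the selection reduces to pure static profiling, and with static_weight=0.0 to pure dynamic profiling, so intermediate weights trade code size against energy.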

  7. Compiler Optimizations
     • Instruction Selection
       • Choose beneficial encodings for increasing redundancy
     • Register Re-assignment
       • Attempts to rename registers such that instructions can be accessed via IRF
     • Instruction Scheduling
       • Intra-block: focus on reordering instructions so that dense packs are formed (both tight and loose)
       • Inter-block: attempt to move instructions between blocks to fill up packs ending with branches/jumps
         • Code duplication
         • Predication

  8. Intra-block Instruction Scheduling
     [Figure: an Instruction Dependence DAG, with the packs formed Without Instruction Scheduling and With Instruction Scheduling.]

  9. Code Duplication to Reduce Code Size
     [Figure: control-flow graph with blocks W, X, Y and Z, before and after duplication.]
     • 6 slots is too many to fit in a single packed instruction …
     • but we can duplicate a single instruction …
     • resulting in the ability to pack the remaining 5 slots together.

  10. Predication: Forward Branches
     [Figure: blocks X, Y and Z with a Cond Branch, its Fall-through and its Branch taken path.]
     • Instructions packed after forward branches will only be executed when the branch is not taken

  11. Predication: Backward Branches
     [Figure: loop body a b c, a Branch with its Branch offset, and following instructions d e f.]
     • Instructions packed after backward branches will only be executed when the branch is taken

  12. Predication Advantages with IRF
     • IRF facilitates a form of predication for the MIPS, a baseline architecture that traditionally does not support predication
     • No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reducing code size with Thumb and Thumb2)
     • No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)

  13. Experimental Evaluation
     • MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
     • SimpleScalar MIPS/PISA architectural simulator
       • Out-of-order, single-issue embedded machine with 8KB 4-way set-associative L1 instruction and data caches and a 128-entry bimodal branch predictor
     • Wattch/Cacti extensions for modeling energy consumption (inactive portions of the pipeline dissipate only 10% of normal energy when using cc3 clock gating)
     • VPO: Very Portable Optimizer targeted for SimpleScalar MIPS/PISA

  14. Energy Consumption

  15. Static Code Size

  16. IRF Promotion with Mixed Profiling

  17. Conclusions & Future Work
     • Compiler optimizations targeted specifically for IRF can further reduce energy (12.2% → 15.8%), code size (16.8% → 28.8%) and execution time
     • Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
     • As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
     • Register targeting and loop unrolling should also be explored with instruction packing
     • Enhanced parameterization techniques

  18. Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers

  19. Tightly Packed Instruction Format
     • New opcodes for this T-format of MISA instructions
     • Supports sequential execution of up to 5 RISA instructions from the IRF
     • Unnecessary fields are padded with nop
     • Supports up to 2 parameters replacing instruction slots
     • Parameters can come from the 32-entry IMM
     • Each IRF entry also retains a default immediate value
     • Branches use these 5 bits for displacements
     • R-type RISA instructions can use a parameter to replace the RD field
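As a rough illustration of the T-format described on the slide above, the sketch below packs up to five 5-bit IRF references into one word and pads unused slots with nop. The exact bit layout is an assumption made for the example: a 32-bit word with a 6-bit opcode, five 5-bit slots and one spare low bit, with nop stored at IRF index 0. The slides confirm only the five slots, the 5-bit field width and the nop padding; parameters are not modeled.

```python
NOP_INDEX = 0   # assumption: the IRF entry holding nop
SLOT_BITS = 5   # 5-bit fields, matching the 32-entry tables on the slide
MAX_SLOTS = 5   # up to 5 RISA instructions per tight pack


def encode_tight(opcode, irf_indices):
    """Pack IRF references into a hypothetical 32-bit T-format word."""
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    if len(irf_indices) > MAX_SLOTS:
        raise ValueError("a tight pack holds at most 5 references")
    # Pad unused slots with nop, as the slide describes.
    slots = list(irf_indices) + [NOP_INDEX] * (MAX_SLOTS - len(irf_indices))
    word = opcode
    for idx in slots:
        if not 0 <= idx < (1 << SLOT_BITS):
            raise ValueError("IRF index must fit in 5 bits")
        word = (word << SLOT_BITS) | idx
    return word << 1  # leave the assumed spare low bit clear


def decode_tight(word):
    """Return (opcode, [slot1..slot5]) from an encoded word."""
    word >>= 1  # drop the spare bit
    slots = []
    for _ in range(MAX_SLOTS):
        slots.append(word & ((1 << SLOT_BITS) - 1))
        word >>= SLOT_BITS
    slots.reverse()  # slots were extracted last-to-first
    return word & 0x3F, slots
```

Executing such a word would then mean fetching each referenced IRF entry in slot order and skipping the nop padding.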

  20. MIPS Instruction Format Modifications
     • Creating Loosely Packed instructions
       • R-type: Removed shamt field and merged with rs
       • I-type: Shortened immediate values (16-bit → 11-bit)
         • Lui now uses 21-bit immediate values, hence no loose packing
       • J-type: Unchanged
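One consequence of shortening I-type immediates from 16 to 11 bits is that only instructions whose immediate still fits can carry a loosely packed IRF reference. A minimal range check might look like the sketch below. The helper names are hypothetical, and treating the field as a signed two's-complement value is an assumption; the slide does not say how the 11-bit field is interpreted.

```python
def fits_signed(value, bits):
    """True if value is representable in a signed two's-complement field."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def can_loosely_pack_itype(imm, loose_bits=11):
    """Assumed rule: an I-type instruction is a loose-pack candidate only
    when its immediate survives the 16-bit to 11-bit shortening."""
    return fits_signed(imm, loose_bits)
```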

  21. Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers

  22. Introduction
     • Embedded Processor Design Constraints
       • Energy Consumption
       • Static Code Size
       • Execution Time
     • Fetch logic is responsible for 36% of total processor power on StrongARM
     • Two Primary Areas for Improvement in Instruction Fetch
       • Better fetch mechanism and storage
         • Instruction Cache and/or ROM: lower power than main memory, but still a fairly large, flat storage method
       • Better instruction encodings
         • Instruction encodings are wasteful with bits
         • Maximize functionality, but simplify decoding (fixed length)
         • Most applications use only a subset of the available instructions
         • Nowhere near theoretical compression limits

  23. Instruction Redundancy
     • Profiled largest benchmark in each of six MiBench categories
     • Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions

  24. Access of Data & Instructions
     [Figure: memory hierarchy. Main Memory, L2 Cache, then L1 Data Cache and L1 Instruction Cache; below the data cache sits the Data Register File, while the matching position below the instruction cache is marked "?????".]
     • Each lower layer is designed to improve accessibility of current/frequent items, albeit at a reduction in number of available items.
     • Caching is beneficial, but compilers can do better for the "most frequently" accessed data items (e.g. Register Allocation).
     • Instructions have no analogue to the Data Register File (RF).

  25. Instruction Packing
     • Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
     • Allow multiple instruction fetches from the IRF by packing instruction references together
     • Facilitate parameterization of some instructions using an Immediate Table (IMM)

  26. Outline
     • Introduction
     • IRF Instruction Set Architecture
     • IRF Register Windowing
     • Compiler Optimizations
     • Experimental Framework
     • Results
     • Interaction with Other Techniques
     • Proposed Enhancements for Instruction Packing
     • Conclusions
     • Publication Plan

  27. IRF Instruction Set Architecture
     • MIPS ISA: commonly known and provides simple encoding
     • RISA (Register ISA): instructions available via IRF access
     • MISA (Memory ISA): instructions available in memory
     • New instruction formats that can reference multiple RISA instructions: Tightly Packed
     • Original instructions modified to pack an additional RISA reference: Loosely Packed

  28. Execution of IRF Instructions
     [Figure: pipeline diagram, "Executing a Tightly Packed Param4c Instruction". Labels: Instruction Fetch Stage, First Half of Instruction Decode Stage, PC, Instruction Cache, IF/ID, packed instruction, IRF, IMM, IRWP, To Instruction Decoder.]

  29. Tightly Packed Instruction Format
     • New opcodes for this T-format of MISA instructions
     • Supports sequential execution of up to 5 RISA instructions from the IRF
     • Unnecessary fields are padded with nop
     • Supports up to 2 parameters replacing instruction slots
     • Parameters can come from the 32-entry IMM
     • Each IRF entry also retains a default immediate value
     • Branches use these 5 bits for displacements
     • R-type RISA instructions can use a parameter to replace the RD field

  30. MIPS Instruction Format Modifications
     • Creating Loosely Packed instructions
       • R-type: Removed shamt field and merged with rs
       • I-type: Shortened immediate values (16-bit → 11-bit)
         • Lui now uses 21-bit immediate values, hence no loose packing
       • J-type: Unchanged

  31. Compilation Framework

  32. Packing Instructions
     • Greedy selection of instructions based on static or dynamic profile data
     • Examine a sliding window of instructions for each basic block, attempting to pack adjacent RISA instructions together
     • Denser packs are more favorable (e.g. tight5)
     • Branch offset distances can slip into range when we apply an iterative packing algorithm
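The sliding-window pass over a basic block might be sketched as below. This is a simplified illustration rather than the authors' algorithm: instructions are plain strings, `irf` is the set of promoted (RISA) instructions, and the pack kinds "tight", "loose" and "plain" are names chosen for the example. It ignores parameters, branch placement and the fact that some MISA formats (such as lui) cannot carry a loose reference.

```python
def pack_block(block, irf, max_tight=5):
    """Greedily group adjacent RISA instructions within one basic block."""
    packs = []
    i = 0
    while i < len(block):
        if block[i] in irf:
            # Extend the run of adjacent RISA instructions, up to 5 per pack.
            j = i
            while j < len(block) and j - i < max_tight and block[j] in irf:
                j += 1
            packs.append(("tight", block[i:j]))
            i = j
        elif i + 1 < len(block) and block[i + 1] in irf:
            # A MISA instruction piggybacks the RISA instruction that follows it.
            packs.append(("loose", [block[i], block[i + 1]]))
            i += 2
        else:
            packs.append(("plain", [block[i]]))
            i += 1
    return packs
```

Each pack occupies one instruction word, so the number of packs returned is a proxy for the block's static size after packing; an iterative driver could rerun the pass after branch offsets shrink.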

  33. IRF Register Windowing
     • Simply increasing the number of IRF registers → reduced number of RISA instructions available in a single MISA instruction
     • Instead, keep format the same and provide multiple IRF windows to switch between on a per-function basis
     • Can dynamically switch IRF contents on calls/returns or provide actual physically separate register windows

  34. Software Windowing
     • Add new instructions to dynamically save/restore instruction registers
     • Perform a cost/benefit analysis of creating a new window/partition for a given function
     • Insert appropriate instructions for saving/restoring IRs between call/return sequences
     • Drawbacks
       • Requires the complete call graph for the application
       • Extra instructions for saving/restoring can add to execution time

  35. Hardware Windowing
     • IRF windows are shared amongst functions with similar instruction mix
     • Function addresses are modified to incorporate an instruction register window pointer (IRWP)
     • Call instruction: transfers to new function and switches the window by adjusting the IRWP; also saves the return address and current IRWP
     • Return instruction: transfers control back to the caller and restores the previous IRWP
     • Inactive windows can be kept in a low-power state
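The call/return behaviour on the Hardware Windowing slide can be modeled with a small sketch. Several details here are assumptions made for illustration: four windows, the window number carried in the low two bits of a word-aligned function address, and a Python list standing in for wherever the hardware saves the return address and IRWP. The class and function names are hypothetical.

```python
WINDOW_BITS = 2                        # assumption: 4 IRF windows
WINDOW_MASK = (1 << WINDOW_BITS) - 1


def tag_address(addr, window):
    """Fold a window number into a word-aligned function address."""
    if addr % 4 != 0 or not 0 <= window <= WINDOW_MASK:
        raise ValueError("need a word-aligned address and a valid window")
    return addr | window


class WindowedCPU:
    def __init__(self):
        self.pc = 0
        self.irwp = 0          # currently active IRF window
        self.saved = []        # models the saved (return address, IRWP) pairs

    def call(self, tagged_target, return_addr):
        # Save the return address together with the caller's window.
        self.saved.append((return_addr, self.irwp))
        self.irwp = tagged_target & WINDOW_MASK
        self.pc = tagged_target & ~WINDOW_MASK

    def ret(self):
        # Return to the caller and restore its window.
        self.pc, self.irwp = self.saved.pop()
```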

  36. Compiler Optimizations
     • Improved Instruction Promotion
     • Adapt existing techniques to improve instruction packing
       • Instruction Selection
       • Register Re-assignment
       • Instruction Scheduling
     • Increase application redundancy
     • Eliminate constraints on packing

  37. Improving Instruction Promotion
     • Different classes of instructions can consume between one and five of the potentially available slots for a tight pack
     • Goal is to more accurately model the benefits of promoting from one class of instruction to another
     • Original IRF papers did not consider the promotion of multiple I-type instructions with different default immediate values
       • addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the IRF, no matter how frequently they occurred
     • Can simultaneously weight static and dynamic profile data to obtain a mixed result that has both good code compression and reduced energy consumption
     • Can obtain most of the benefits of individual static/dynamic profiling

  38. Instruction Selection
     • Encode short jumps as branches (now parameterizable)
       • j label → beq $0, $0, rel_label
     • Replace simple instructions with equivalent parameterizable forms
       • addu $2, $3, $0 → addiu $2, $3, 0
     • Ensure that commutative operations always have the same order of operands
       • addu $2, $4, $2 → addu $2, $2, $4
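The three rewrites on the Instruction Selection slide can be expressed as a small canonicalization pass. The sketch below is illustrative only: instructions are modeled as tuples, the set of commutative opcodes is a partial list, and the operand-ordering rule (a source equal to the destination goes first, otherwise lower register number first) is an assumption that happens to reproduce the slide's example. The `short_labels` parameter is a hypothetical stand-in for the compiler's check that a jump target is within branch range.

```python
COMMUTATIVE = {"addu", "and", "or", "xor", "nor"}  # partial, illustrative list


def reg_num(reg):
    """'$4' -> 4"""
    return int(reg.lstrip("$"))


def canonicalize(insn, short_labels=frozenset()):
    """Rewrite one instruction tuple into a form that is easier to pack."""
    op, *args = insn
    if op == "j":
        # Short jumps become always-taken branches, which are parameterizable.
        if args[0] in short_labels:
            return ("beq", "$0", "$0", args[0])
        return insn
    if op == "addu" and "$0" in args[1:]:
        # A register move via addu becomes addiu with immediate 0.
        rd, rs, rt = args
        src = rs if rt == "$0" else rt
        return ("addiu", rd, src, 0)
    if op in COMMUTATIVE:
        rd, rs, rt = args
        # Assumed ordering rule: destination-matching source first, then by number.
        rs, rt = sorted((rs, rt), key=lambda r: (r != rd, reg_num(r)))
        return (op, rd, rs, rt)
    return insn
```

Running every instruction through one such pass before profiling makes equivalent instructions textually identical, which raises the counts of the candidates competing for IRF slots.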

  39. Register Re-assignment
     • Performed after registers have been assigned in the second compilation with packing
     • Use a Register Interference Graph to represent register live ranges in the function
     • Re-assign live ranges to other registers if this improves packing density
     • Live ranges ordered by dynamic frequency
     • Modify register in conflicting live ranges if beneficial

  40. Instruction Scheduling
     • Intra-block: focus on reordering instructions so that dense packs are formed (both tight and loose)
     • Inter-block: attempt to move instructions between blocks to fill up packs ending with branches/jumps
       • Code duplication
       • Predication

  41. Experimental Framework
     • MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
     • SimpleScalar MIPS/PISA architectural simulator
       • Out-of-order, single-issue embedded machine with 8KB 4-way set-associative L1 instruction and data caches and a 128-entry bimodal branch predictor
     • Wattch/Cacti extensions for modeling energy consumption (inactive portions of the pipeline dissipate only 10% of normal energy when using cc3 clock gating)
     • VPO: Very Portable Optimizer targeted for SimpleScalar MIPS/PISA

  42. Results: Processor Energy

  43. Results: Static Code Size

  44. Results: Execution Cycles

  45. Results: Mixed Profiling

  46. Interaction with Other Techniques
     • Loop Cache: small buffer that captures innermost loops dynamically
       • IRF can operate synergistically, increasing utilization of loop cache when innermost loops are larger than normal
       • Fetch energy reduced by 56% with 8-entry loop cache and 4-window IRF
     • L0 (Filter) Cache: small instruction cache with low energy consumption, but can negatively impact execution time due to cache misses
       • IRF reduces working set size and can mask a portion of the cache miss penalty due to overlapped fetch
       • Fetch energy reduced by 70% with 256-byte L0 cache and 4-window IRF

  47. Proposed Enhancements for Instruction Packing
     • Splitting of MISA/RISA
     • Split Opcode/Operand Encoding in RISA
     • Improved Parameterization
     • Link-Time Instruction Packing
     • Instruction Promotion as a Side Effect

  48. Splitting of MISA/RISA
     • No need to arbitrarily limit RISA to same instructions as MISA
     • Tailor each ISA to particular design goals
       • MISA: reduced code size (small instructions)
       • RISA: improved performance & expressiveness (large instructions)
     • Baseline ISA choices
       • ARM/Thumb: 32/16-bit dual-width ISA
       • FITS: opcodes mapped to 16-bit instructions
       • ARM/Thumb2: 32/16-bit variable-length ISA

  49. Split MISA Encodings

  50. Splitting Opcodes/Operands
     • Existing code compression schemes have benefited from separating the encoding of opcodes and operands
     • Can re-encode RISA to separate opcode and operand streams
     • Loosely packed instructions can use traditional or new split encodings
     • Tightly packed instructions could even continue supporting parameterization
