Energy Efficient Instruction Cache for Wide-issue Processors


Alex Veidenbaum

Information and Computer Science

University of California, Irvine

Motivation
  • Power dissipation is a serious problem
  • It matters both for high-performance processors,
    • e.g. COMPAQ Alpha, MIPS R10K, Intel x86,
  • and for embedded processors,
    • e.g. ARM, StrongARM, etc.
  • Our current research is focused on reducing the average energy consumption of high-performance embedded processors
      • e.g. MIPS R20K, IBM/Motorola PowerPC, Mobile Pentium, ...
  • We want to address energy consumption via
    • architecture
    • compiler
  • Technology is, to the first order, an orthogonal parameter and is NOT considered
  • So the first question is
    • What are the major sources of energy dissipation?
    • This information is hard to find
Experimental Setup
  • We started with Wattch (Princeton)
    • architectural-level power simulator
    • based on SimpleScalar (sim-outorder) simulator
    • can specify technology, clock and voltage
    • computes power for major internal units of processor
      • parameterizable models for most of the components
      • mostly memory-based units…
      • ALU power is a constant (based on input from industry)
  • Modified it to match our needs
  • Need to account for energy correctly:
    • “Worst case”: F(V, C, f) ≈ C·V²·f
      • charges every unit on every cycle
        • No good
    • Depending on program behavior (charge a unit only when it is used)
        • the right way
  • Basic organization:
    • Used a typical wide-issue processor, assuming
      • 600 MHz, 32-bit
      • 32K L1 instruction cache, 32K L1 data cache
      • 512K L2 unified cache
      • 2 int ALUs, 1 FP adder, 1 FP multiplier
      • 3.3V
  • MIPS R10K-like; modified from the default SimpleScalar configuration
  • Major units to look at:
    • Instruction and data caches, ALU, branch predictor, register file, …
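
The difference between the two accounting styles above can be illustrated with a minimal sketch. It is not Wattch's actual model: the unit names, capacitance values, and activity counts are made-up assumptions, and each unit is reduced to a single effective switched capacitance.

```python
# Hypothetical illustration only: unit names, capacitances and activity
# counts are invented, not taken from Wattch or from the slides.

def dynamic_power(c_farads, v_volts, f_hz):
    """Switching power of a unit that is active on every cycle: C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz

def worst_case_energy(units, v, f, cycles):
    """'Worst case' accounting: every unit is charged on every cycle."""
    seconds = cycles / f
    return sum(dynamic_power(c, v, f) for c in units.values()) * seconds

def activity_based_energy(units, v, active_cycles):
    """Behavior-dependent accounting: a unit costs C * V^2 only per cycle it is used."""
    return sum(c * v ** 2 * active_cycles.get(name, 0) for name, c in units.items())

# Assumed effective capacitances (farads) for three units.
units = {"icache": 1.0e-9, "dcache": 1.0e-9, "alu": 0.5e-9}
worst = worst_case_energy(units, v=3.3, f=600e6, cycles=1000)
actual = activity_based_energy(units, v=3.3,
                               active_cycles={"icache": 1000, "dcache": 300, "alu": 500})
```

With these assumed numbers the activity-based figure is lower than the worst-case one, because the data cache and ALU are idle on many cycles.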
Some typical results
  • Power distribution among major internal units
Motivation cont’d
  • Now we can attack specific important sources
  • Instruction cache is one such unit
  • Reason:
    • Every cycle, four 32-bit instruction words need to be fetched
  • Next we discuss a hardware mechanism for reducing instruction cache energy consumption
Previous Work
  • Hasegawa 1995 - phased cache
      • Examines the tag and data fields in two separate phases
      • Reduces power consumption by 70%
      • Increases the average cache-access time by 100%
  • Inoue 1999 - set-associative, way-prediction cache
      • Speculatively selects one way before starting a normal access
      • On a way-prediction hit, power is reduced by a factor of 4
      • Increases the cache-access time on mispredictions
  • Lee 1999 - loop cache
      • Shuts down the main cache completely while executing tight program loops from the loop cache
      • Power savings vary with the application
      • No performance degradation
Our approach
  • Not all of the fetched instructions in a line are used
    • When a branch is taken:
      • the words after the branch, up to the end of the line
    • When there is a branch target in a line:
      • the words from the beginning of the line up to the target
  • Save energy by fetching only useful instructions
    • Design a hardware mechanism (fetch predictor) that predicts which instructions are going to be used out of a cache line before that line is fetched
    • Selectively fetch only predicted useful instructions in each fetch cycle
  • This requires a cache that can fetch any consecutive sequence of instructions from a line
  • Such caches have been implemented before
      • Su 1995 - divide the cache into subbanks, activated individually by a control vector
      • RS/6000 - cache organized as 4 separate arrays, each of which could use a different row address
  • Generate a control vector with one bit for each “bank”
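
As a rough sketch of the control vector idea, the function below builds one enable bit per bank for a consecutive run of words in a line. It assumes one bank per instruction word; the function name and the list-of-bits representation are hypothetical choices for illustration, not details from the slides.

```python
def control_vector(line_words, first=0, last=None):
    """Return one enable bit per bank for words first..last (inclusive).

    Assumption: one bank per instruction word in the cache line.
    """
    if last is None:
        last = line_words - 1
    if not (0 <= first <= last < line_words):
        raise ValueError("need 0 <= first <= last < line_words")
    return [1 if first <= i <= last else 0 for i in range(line_words)]

# Taken branch at word 5 of an 8-word line: skip the words after the branch.
up_to_branch = control_vector(8, last=5)    # [1, 1, 1, 1, 1, 1, 0, 0]

# Entering the line at a branch target in word 3: skip the words before it.
from_target = control_vector(8, first=3)    # [0, 0, 0, 1, 1, 1, 1, 1]
```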
Fetch Predictor
  • General idea:
    • Rely on the branch predictor to get the PC of the next instruction
    • Build a fetch predictor on top of the branch predictor to decide which instructions to fetch
    • Use the branch misprediction detection mechanism and the branch predictor update phase to update the fetch predictor
Some specifics
  • Predict for the next line to be fetched
  • For a target in the next line, use the address in the BTB
    • Fetch from the target on
  • For a branch in the next line, a separate predictor is needed
    • It must predict before the line is fetched
    • Update it when the branch predictor is updated
    • Initialize it to fetch all words
  • Need to handle the case when both a branch and a target are in the same line
    • AND the two control bit vectors
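
The points above can be put together in a toy software model. This is only a sketch under stated assumptions, not the hardware described in the talk: the table size, the direct-mapped indexing by line address, the `FetchPredictor` name, and the fallback when the target lies past the remembered branch are all hypothetical.

```python
class FetchPredictor:
    """Toy model: remembers, per cache line, the last word worth fetching."""

    def __init__(self, line_words=8, entries=1024):
        self.line_words = line_words
        self.entries = entries
        # Initialize to fetch all words of every line.
        self.last_word = [line_words - 1] * entries

    def _index(self, line_addr):
        # Hypothetical direct-mapped indexing by line address.
        return line_addr % self.entries

    def predict(self, line_addr, target_offset=0):
        """Control bit vector for the next line, produced before it is fetched."""
        last = self.last_word[self._index(line_addr)]
        # Target in the line (offset taken from the BTB address): fetch from target on.
        from_target = [int(i >= target_offset) for i in range(self.line_words)]
        # Branch in the line: fetch up to and including the predicted taken branch.
        up_to_branch = [int(i <= last) for i in range(self.line_words)]
        # Both in the same line: AND the two control bit vectors.
        vector = [a & b for a, b in zip(from_target, up_to_branch)]
        # Assumption: if the target lies past the remembered branch, that branch is
        # not on the path, so fall back to fetching from the target to the line end.
        return vector if any(vector) else from_target

    def update(self, line_addr, taken_branch_offset=None):
        """Called when the branch predictor is updated."""
        idx = self._index(line_addr)
        if taken_branch_offset is None:
            self.last_word[idx] = self.line_words - 1   # no taken branch: fetch all
        else:
            self.last_word[idx] = taken_branch_offset

fp = FetchPredictor(line_words=8, entries=16)
fp.update(line_addr=3, taken_branch_offset=5)
vector = fp.predict(line_addr=3, target_offset=2)   # [0, 0, 1, 1, 1, 1, 0, 0]
```

Storing a single "last useful word" per line keeps the sketch small; a real design might instead store the control vector itself or attach the information to branch predictor entries.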
Experimental Setup
  • SimpleScalar extended with a fetch predictor
  • A simple power model:
    • Energy per cycle is proportional to the number of fetched instructions
  • Simulated a subset of SPEC95
      • 3 billion instructions executed in each benchmark
      • Direct mapped cache, with 4 and 8 instructions per line
      • Bimodal branch predictor
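
Under the simple power model above, the I-cache energy saving follows directly from how many words are fetched. The sketch below assumes the baseline always fetches a full line; the function name and the example fetch counts are invented for illustration and are not simulation data.

```python
def fetch_energy_savings(words_fetched_per_access, line_words):
    """Fraction of fetch energy saved versus always fetching the full line.

    Assumes energy per cycle is proportional to the number of fetched words.
    """
    if not words_fetched_per_access:
        raise ValueError("need at least one fetch")
    baseline = len(words_fetched_per_access) * line_words
    selective = sum(words_fetched_per_access)
    return 1 - selective / baseline

# Four accesses to 8-word lines that fetch 8, 6, 3 and 8 words respectively:
# 25 words fetched instead of 32.
savings = fetch_energy_savings([8, 6, 3, 8], line_words=8)
```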
Summary of Results
  • Average power savings
    • Perfect predictor
      • 33%, between 8% and 55% for an 8-instruction cache line
    • Fetch predictor
      • 25%, between 5% and 41% for an 8-instruction cache line
  • Larger power savings for integer benchmarks than for the floating point ones
  • Contribution to power-aware hardware
    • a fetch predictor for fetching only useful instructions
  • Preliminary results
    • Savings of about 5% of the total power for an 8-instruction cache line (assuming the I-cache consumes 20% of the total)
  • Advantage: No performance penalty!
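
The 5% figure is consistent with scaling the average fetch predictor saving by the assumed I-cache share of total power. A small worked check, where the function name is a hypothetical helper and the inputs are the slide's own numbers:

```python
def total_power_savings(icache_share, icache_savings):
    """Whole-processor saving when only the I-cache's share of power is reduced."""
    return icache_share * icache_savings

# Fetch predictor: 25% average I-cache saving, I-cache assumed to be 20% of total.
with_fetch_predictor = total_power_savings(0.20, 0.25)   # about 5% of total power

# Perfect predictor under the same 20% assumption: 33% average I-cache saving.
with_perfect_predictor = total_power_savings(0.20, 0.33)  # about 6.6% of total power
```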