Highlights of the 36th Annual International Symposium on Microarchitecture, December 2003

Theo Theocharides

Embedded and Mobile Computing Center

Department of Computer Science and Engineering

The Pennsylvania State University


Acknowledgements: K. Bernstein, T. Austin, D. Blaauw, L. Peh, D. Jimenez

  • The International Symposium on Microarchitecture is the premier forum for discussing new microarchitecture and software techniques
  • Brings together researchers in processor architecture, compilers, and systems for technical interaction on traditional MICRO topics
    • Special emphasis on optimizations that take advantage of application-specific opportunities
    • Interaction between the microarchitecture and embedded architecture communities
  • http://www.microarch.org
  • http://www.microarch.org/micro36/
Symposium Outline
  • Session 1: Voltage Scaling & Transient Faults
  • Session 2: Cache
  • Session 3: Power and Energy Efficient Architectures
  • Session 4: Application-Specific Optimization and Analysis
  • Session 5: Dynamic Optimization Systems
  • Session 6: Dynamic Program Analysis and Optimization
  • Session 7: Branch, Value, and Scheduling Optimization
  • Session 8: Dataflow, Data Parallel, and Clustered Architectures
  • Session 9: Secure and Network Processors
  • Session 10: Scaling Design

Keynote Speech

  • Caution Flag Out: Microarchitecture's Race for Power Performance
    • Kerry Bernstein, IBM T. J. Watson Research Center

Interesting Papers

  • Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, D. Ernst et al.
  • Power-Driven Design of Router Microarchitectures in On-Chip Networks, H. Wang, Li-Shiuan Peh, S. Malik
  • Fast Path-Based Neural Branch Prediction, D. Jimenez
Workshops and Tutorials
  • 5th Workshop on Media and Streaming Processors (MSP)
  • 3rd Workshop on Power-Aware Computer Systems (PACS)
  • 2nd Workshop on Application Specific Processors (WASP)
  • Tutorial: Challenges in Embedded Computing
  • Tutorial: Open Research Compiler (ORC): Proliferation of Technologies and Tools
  • Tutorial: Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design
  • Tutorial: Network Processors
  • Tutorial: Architectural Exploration with Liberty
Keynote Speech
  • Given by Kerry Bernstein, IBM T.J. Watson Research Center
  • Microarchitecture and technology relationship
  • We cannot continue to scale down to achieve higher frequencies without any catch
  • Increasing pipeline depth does not necessarily help
  • Power consumption, process variation, soft errors, and die area erosion are becoming more and more important
  • Keynote explored how past technologies have influenced high-speed microarchitectures
  • Keynote showed how characteristics of proposed new devices and interconnects for lithographies beyond 90 nm may shape future machine design
  • Given the present issues and incoming trends, the role of microarchitecture in extending CMOS performance will be more important than ever
Issues in summary:
  • Feature size
  • Device count (transistors per chip)
  • Pipeline depth
  • Power consumption increases non-linearly with scaling
  • Power grows when we reduce the FO4 delay
  • Delay and power affected by process variation
  • Cooling creates more problems
  • Cost of power diverges from performance gain
  • Monitor-based Full Chip Voltage, Clock Throttling
  • Voltage Islands
    • Technology aid required here
    • Latency required
    • Low-activity FET count increase
  • Clock Gating
    • So far has been a nice solution…
  • Pipeline depth optimization
  • Performance accelerators for ASICs (DSPs, GPUs, etc.)
    • They need power anyway, so at least make them efficient
    • Software solutions should be developed here
  • Compute-Informed Power Management
    • Instruction Stream
    • Dynamic Resource Assertion
    • Power Aware OS
    • Thermal Modeling
New Ideas
  • “Evolutionary”
    • Strained Silicon
    • High-K Gate Dielectrics
    • Hybrid Crystal Silicon
      • Increase current drive/micron of device
      • Allow transistor density improvement
      • Introduce Features which enable active static power management
  • “Revolutionary”
    • Double Gated MOSFETs
    • 3D Integration
    • Molecular Computing
      • Reduce Power Density without architectural management
      • Eliminate power dependence on frequency
      • Return the industry to threshold and supply voltage scaling
Keynote Conclusions
  • New technologies will likely help, but not necessarily
  • Power is by far the predominant factor in scaling – we need to see what new technologies can give us
  • Staying ahead requires power-aware systems
Razor Project (T. Austin, D. Blaauw, T. Mudge)
  • We (designers/architects) have scaled the supply voltage down only to the point where it is proven that, under all possible worst cases, no errors occur
  • Very conservative voltage scaling
  • IDEA!
  • Instead of trying to avoid ALL errors, ALLOW some errors to happen and correct them!
  • Major argument: scaling the supply voltage down by almost 0.25 V gives an average error rate of less than 5%
  • Instead of spending energy, logic, effort, and time on avoiding every error, allow a very small error rate and gain large power savings
  • Cost of fixing errors is minimal when the error percentage is kept under control
Razor Advantages
  • Eliminate safety margins
    • Process variation, IR-drop, temperature fluctuation, data-dependent latencies, model uncertainty
  • Operate at sub-critical voltage for optimal trade-off between:
    • Energy gain from voltage scaling
    • Energy overhead from dynamic error correction
  • Tune voltage for average instruction data
    • Exploit delay dependence in data
  • Tolerate delay degradation due to infrequent noise events
    • SER, capacitive, inductive noise, charge sharing, floating body effect…
    • Most severe noise also least frequent
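The trade-off between energy gained from voltage scaling and energy spent on error correction can be illustrated with a toy model. This is a minimal sketch, not the Razor authors' model: energy per operation is assumed to scale with V², the error rate is assumed to rise exponentially below a hypothetical worst-case-safe voltage `V_SAFE`, and each error is assumed to cost `RECOVERY_OPS` replayed operations. All constants are invented for illustration.

```python
import math

V_SAFE = 1.8        # hypothetical worst-case-safe supply voltage (V)
RECOVERY_OPS = 10   # hypothetical cost of one error, in replayed operations


def error_rate(v: float) -> float:
    """Assumed error model: negligible at V_SAFE, rising exponentially below it."""
    return min(1.0, 1e-6 * math.exp(40.0 * (V_SAFE - v)))


def energy_per_op(v: float) -> float:
    """Relative energy per useful operation, including error recovery.

    Dynamic energy is taken to scale as V^2 (normalized to 1.0 at V_SAFE);
    every error adds RECOVERY_OPS extra operations at the same voltage.
    """
    base = (v / V_SAFE) ** 2
    return base * (1.0 + RECOVERY_OPS * error_rate(v))


def best_voltage(lo: float = 1.2, hi: float = V_SAFE, steps: int = 601) -> float:
    """Grid-search the supply voltage that minimizes energy per operation."""
    candidates = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(candidates, key=energy_per_op)
```

Under these assumptions the minimum sits below the worst-case-safe voltage but well above the point where recovery cost dominates, which is the sub-critical operating point the slide describes.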
Power-driven Design of Router Microarchitectures in On-chip Networks (Hangsheng Wang, Li-Shiuan Peh, Sharad Malik)
  • Investigates on-chip network microarchitectures from a power-driven perspective
  • Power-efficient network microarchitectures:
    • segmented crossbar, cut-through crossbar and write-through buffer
  • Studies and uncovers the power saving potential of an existing network architecture: Express cube
  • Reduction in network power of up to 44.9%
  • No degradation in network performance
  • Improved latency-throughput in some cases
Power in NoC
  • E_wrt is the average energy dissipated when writing a flit into the input buffer
  • E_rd is the average energy dissipated when reading a flit from the input buffer
  • E_buf = E_wrt + E_rd is the average buffer energy
  • E_arb is the average arbitration energy
  • E_xb is the average crossbar traversal energy
  • E_lnk is the average link traversal energy
  • H is the number of hops traversed by this flit
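Read together, these definitions suggest a per-flit energy of (E_buf + E_arb + E_xb + E_lnk) × H. The slide's own equation did not survive in this transcript, so treat that composition as an assumption. A minimal Python sketch, with the class name and all energy values made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class RouterEnergy:
    """Average per-hop energies for one flit (arbitrary units, e.g. pJ)."""
    e_wrt: float  # input-buffer write
    e_rd: float   # input-buffer read
    e_arb: float  # arbitration
    e_xb: float   # crossbar traversal
    e_lnk: float  # link traversal

    @property
    def e_buf(self) -> float:
        """Average buffer energy: E_buf = E_wrt + E_rd."""
        return self.e_wrt + self.e_rd

    def flit_energy(self, hops: int) -> float:
        """Assumed total energy for a flit that traverses `hops` routers."""
        return (self.e_buf + self.e_arb + self.e_xb + self.e_lnk) * hops
```

For example, `RouterEnergy(2.0, 1.0, 0.5, 3.0, 1.5).flit_energy(4)` sums 8.0 units per hop over 4 hops. In this model, lowering any per-hop term or the hop count H lowers the total.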
Architectural Methods
  • Segmented crossbar
  • Cut-through crossbar
  • Write-through input buffer
  • Express cube
Segmented Crossbar

Schematic of a matrix crossbar and a segmented crossbar. F is flit size in bits, dw is track width, E, W, N, S are ports.

Cut-through crossbar

Schematic of cut-through crossbars

F is flit size, dw is track width, E, W, N, S are ports

Write-through buffer
  • Bypassing without overlapping
  • Bypassing with overlapping
  • Schematic of a write-through input buffer.
Power savings and conclusions
  • Importance of a power-driven approach to on-chip network design
  • Need to investigate the interactions between traffic patterns and On Chip Network architectures
  • Need to reach a systematic design methodology for on-chip networks
Fast Path-Based Neural Branch Prediction (D. Jimenez)
  • Paper presented a new neural branch predictor
    • both more accurate and much faster than previous neural predictors
  • Accuracy far superior to conventional predictors
  • Latency comparable to predictors from industrial designs
  • Improves the instructions-per-cycle (IPC) rate of an aggressively clocked microarchitecture by 16%
Latency - Accuracy Gain

Rather than being done all at once (above), computation is staggered (below)

  • Train a neural network with path history, and update it dynamically.
  • Choose the weight vectors according to the path leading up to the branch rather than branch address alone
  • Directly reduces latency (can begin prior to the prediction – see figure on the left)
  • Improves accuracy as the predictor incorporates path information
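The idea of choosing weights by path can be sketched as a small functional model. This is a simplified illustration, not the paper's pipelined algorithm: it computes the whole sum at prediction time instead of staggering it, and the table size, history length, training threshold, and unbounded integer weights are all assumptions (hardware would use small saturating weights).

```python
class PathPerceptronPredictor:
    """Toy path-based perceptron branch predictor (functional model only)."""

    def __init__(self, rows: int = 256, history: int = 8, theta: int = 20):
        self.rows = rows
        self.history = history
        self.theta = theta  # training threshold (assumed value)
        # weights[row][0] is the bias weight; weights[row][j + 1] pairs with
        # the outcome of the j-th most recent branch.
        self.weights = [[0] * (history + 1) for _ in range(rows)]
        self.path = [0] * history      # table rows of the last `history` branches
        self.outcomes = [1] * history  # their outcomes: +1 taken, -1 not taken

    def _output(self, row: int) -> int:
        # Bias comes from the current branch; every other weight is selected
        # by the branch that occupied that position on the path.
        y = self.weights[row][0]
        for j in range(self.history):
            y += self.weights[self.path[j]][j + 1] * self.outcomes[j]
        return y

    def predict(self, pc: int) -> bool:
        return self._output(pc % self.rows) >= 0

    def update(self, pc: int, taken: bool) -> None:
        row = pc % self.rows
        y = self._output(row)
        t = 1 if taken else -1
        # Perceptron rule: train on a misprediction or a low-confidence output.
        if (y >= 0) != taken or abs(y) <= self.theta:
            self.weights[row][0] += t
            for j in range(self.history):
                self.weights[self.path[j]][j + 1] += t * self.outcomes[j]
        # Shift the new branch into the path and outcome histories.
        self.path = [row] + self.path[:-1]
        self.outcomes = [t] + self.outcomes[:-1]
```

Because each weight is indexed by an earlier branch on the path, the partial sums for a future prediction depend only on branches already seen, which is what allows the computation to start early in a real design.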
IPC per hardware cost
  • Faster and more accurate than existing neural branch predictors
  • Overview of MICRO36
  • Conference lasted 5 days – impossible to review in half an hour!
  • If you are interested, you should read the proceedings on-line at


The Call For Papers for MICRO37 is available, at



Links to the papers reviewed
  • Razor
    • http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf
  • NoC Router Power-Driven Design
    • http://www.microarch.org/micro36/html/pdf/wang-PowerDrivenDesign.pdf
  • Fast-Path Neural Branch Predictor
    • http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf


Talk Overview
  • Session covered in this presentation
  • Review papers
    • Architectural vulnerability factors
      • Introduction
      • Proposed technique
      • Soft error terminology
      • Computing AVFs
      • Results
      • Conclusion
    • L2-Miss-Driven Variable Supply-Voltage Scaling
      • Introduction
      • Proposed Solution
      • Transitions
      • Results
      • Achievements
Session Covered
  • Voltage Scaling & Transient Faults
    • Methodology to compute Architectural Vulnerability Factors
    • VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power
Architectural Vulnerability Factors (S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, T. Austin)
  • Single-event upsets from particle strikes have become a key challenge in microprocessor design.
  • Soft errors due to cosmic rays are making an impact in industry.
    • In 2000, Sun Microsystems acknowledged cosmic ray strikes on unprotected cache memories as the cause of random crashes at major customer sites in its flagship Enterprise server line
    • The fear of cosmic ray strikes prompted Fujitsu to protect 80% of its 200,000 latches in its recent SPARC processor with some form of error detection
  • Designers require accurate estimates of processor error rates to make appropriate cost/reliability trade-offs.
  • All existing protection approaches introduce a significant penalty in performance, power, die size, and design time
  • Tools and techniques to estimate processor transient error rates are not readily available or fully understood.
  • Estimates are needed early in the design cycle.
  • In this Paper :
    • Define architectural vulnerability factor (AVF)
    • Identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution
Proposed technique
  • Not all faults in a micro-architectural structure affect the final outcome of a program.
  • Architectural Vulnerability factor (AVF)
    • probability that a fault in that particular structure will result in an error in the final output of the program
  • The overall error rate = product of raw fault rate and AVF.
  • Can examine the relative contributions of various structures
    • identify cost-effective areas to employ fault protection techniques
  • Tracks the subset of processor state bits required for architecturally correct execution (ACE)
    • fault in a storage cell containing one of these bits affects output
  • For example, a branch predictor’s AVF is 0%
    • predictor bits are always un-ACE bits.
  • Bits in the committed program counter (PC) are always ACE bits, so the committed PC has an AVF of 100%
Soft error terminology
  • Error budget expressed in terms of:
    • Mean Time Between Failures (MTBF).
    • Failures In Time (FIT) - inversely related to MTBF.
  • Errors are often classified as:
    • Undetected - silent data corruption (SDC)
    • Detected - detected unrecoverable errors (DUE)
  • Effective FIT rate for a structure is the product of its raw circuit FIT rate and the structure’s vulnerability factor
  • effective FIT rate per bit is influenced by several vulnerability factors
    • also known as de-rating factors or soft error sensitivity factors
  • Examples include timing vulnerability factor for latches and AVF
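The terminology above can be turned into a short calculation. A minimal sketch, assuming the usual convention that one FIT is one failure per billion device-hours; the function names and the sample rates are illustrative, not taken from the paper:

```python
HOURS_PER_YEAR = 24 * 365
FIT_HOURS = 1e9  # one FIT is one failure per billion device-hours


def effective_fit(raw_fit: float, *vulnerability_factors: float) -> float:
    """Derate a raw circuit FIT rate by each vulnerability factor (each in [0, 1])."""
    fit = raw_fit
    for factor in vulnerability_factors:
        if not 0.0 <= factor <= 1.0:
            raise ValueError("vulnerability factors must lie in [0, 1]")
        fit *= factor
    return fit


def mtbf_years(fit: float) -> float:
    """Convert a FIT rate to mean time between failures, in years."""
    return FIT_HOURS / fit / HOURS_PER_YEAR
```

For instance, `effective_fit(1000.0, 0.5, 0.28)` applies a hypothetical timing vulnerability factor of 0.5 and an AVF of 0.28 to a raw rate of 1000 FIT, and `mtbf_years` shows how the derated rate translates into years between failures.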
Identifying Un-ACE Bits
  • Bits that do not affect final program output
  • Analyzed a uniprocessor system
  • Micro-architectural Un-ACE bits
    • Idle or Invalid State.
    • Mis-speculated State.
    • Predictor Structures.
    • Ex-ACE State.
  • Architectural Un-ACE Bits
    • NOP instructions.
    • Performance-enhancing instructions.
    • Predicated-false instructions.
    • Dynamically dead instructions.
    • Logical masking.
Computing AVF
  • AVFs for storage cells - fraction of time an upset in that cell will cause a visible error in the final output of a program
  • AVF equation for a hardware structure
    • The average AVF of all the bits in that structure:

$$\mathrm{AVF} = \frac{\sum \text{residency (in cycles) of all ACE bits in the structure}}{\text{total number of bits in the hardware structure} \times \text{total execution cycles}}$$

  • Little’s Law:
    • N = B × L, where
      • N = average number of bits in a box,
      • B = average bandwidth per cycle into the box, and
      • L = average latency of an individual bit through the box.
    • Applied to the ACE bits of a structure, with B_ace the average ACE-bit bandwidth and L_ace the average ACE-bit latency:

$$\mathrm{AVF} = \frac{B_{ace} \times L_{ace}}{\text{total number of bits in the hardware structure}}$$
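The two AVF expressions give the same number when the bandwidth and latency are measured over the same run. A minimal sketch with hypothetical function names and made-up residency data:

```python
def avf_from_residency(ace_residency_cycles, total_bits, total_cycles):
    """AVF = sum of ACE-bit residency cycles / (bits in structure * execution cycles)."""
    return sum(ace_residency_cycles) / (total_bits * total_cycles)


def avf_from_littles_law(ace_bandwidth, ace_latency, total_bits):
    """AVF = (B_ace * L_ace) / bits in structure, using N = B * L from Little's Law."""
    return (ace_bandwidth * ace_latency) / total_bits


# Made-up example: 80 ACE bits pass through a 64-bit structure during a
# 1000-cycle run, and each stays resident for 200 cycles.
residencies = [200] * 80
direct = avf_from_residency(residencies, total_bits=64, total_cycles=1000)

b_ace = len(residencies) / 1000              # ACE bits entering per cycle
l_ace = sum(residencies) / len(residencies)  # average ACE residency in cycles
via_little = avf_from_littles_law(b_ace, l_ace, total_bits=64)
```

Both `direct` and `via_little` come out to 0.25 here, because B_ace × L_ace is the average number of ACE bits resident in the structure.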

Computing AVFs using a Performance Model
  • AVFs computed for two structures, the instruction queue and the execution units, using the Asim performance model framework
  • Need following information
    • Sum of all residence cycles of all ACE bits of the objects resident in the structure during the execution of the program,
    • Total execution cycles for which we observe the ACE bits’ residence time, and
    • Total number of bits in a hardware structure.
  • AVF algorithm
    • Record the residence time of the instruction in the structure as an instruction flows through different structures in the pipeline
    • Update the structures the instruction flowed through
    • Put the instruction in a post-commit analysis window to
      • Determine if the instruction is dynamically dead or
      • Determine if there are any bits that are logically masked
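The post-commit analysis step can be sketched for the simplest case, register results that are overwritten before anyone reads them. This is an illustrative assumption about how such a window might work, not the Asim implementation: the `Instr` type and the labels are hypothetical, only register destinations are tracked, and transitive deadness, memory results, and logical masking are left out.

```python
from typing import NamedTuple, Optional, Sequence


class Instr(NamedTuple):
    dest: Optional[str]       # destination register, or None
    srcs: Sequence[str] = ()  # source registers


def classify_register_writes(window: Sequence[Instr]) -> list:
    """Label each committed instruction in the window as 'dead', 'live', or 'unknown'.

    'dead'    : its destination is overwritten before any later instruction reads it
    'live'    : a later instruction reads its destination first; instructions
                without a destination are conservatively treated as live
    'unknown' : the window ends before either happens
    """
    labels = []
    for i, instr in enumerate(window):
        if instr.dest is None:
            labels.append("live")
            continue
        label = "unknown"
        for later in window[i + 1:]:
            # Check reads first so that r1 = r1 + 1 counts as a use of r1.
            if instr.dest in later.srcs:
                label = "live"
                break
            if later.dest == instr.dest:
                label = "dead"
                break
        labels.append(label)
    return labels
```

A larger window leaves fewer instructions in the 'unknown' bucket, at the cost of more state to track after commit.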
Methodology for evaluation
  • Use an Itanium2®-like IA64 processor [14] scaled to current technology
  • Modeled in detail in Asim performance model framework.
  • Program-level Decomposition
    • About 45% of the instructions are ACE instructions; the remaining 55% are un-ACE instructions
    • Some of these un-ACE instructions still contain ACE bits, such as the opcode bits of prefetch instructions
    • UNKNOWN and NOT_PROCESSED instructions account for about 1% of the total instructions
    • NOPs, predicated false instructions, and prefetch instructions account for 26%, 6.7%, and 1.5%, respectively.
    • FDD_reg and FDD_mem denote first-level dynamically dead instructions whose results are written back to registers and memory, respectively
      • Account for about 9.4% and 2% of the dynamic instructions
      • IA64 has a large number of registers
    • TDD_reg and TDD_mem (transitively dynamically dead instructions) account for 6.6% and 1.6% of the dynamic instructions
AVF for instruction queue
  • Shows what percentage of cycles a storage cell in the instruction queue contains ACE and un-ACE bits.
  • Instruction queue contains an ACE bit about 28% of the time.
    • Thus AVF of the instruction queue is 28%.
  • Floating point programs, in general, have higher AVFs compared to integer programs (31% vs. 25%, respectively)
    • Long-latency instructions and few branch mispredictions
    • Use the instruction queue more effectively than integer programs, leading to a higher AVF
  • Apply Little’s Law:
    • Number of ACE instructions in the queue = ACE IPC (the bandwidth) × ACE latency (the average number of cycles an instruction is in the ACE state)
  • The ACE IPC and ACE latency come from the performance model
AVFs for the Execution Units
  • Four integer pipes and two floating point pipes
    • 50% control latches and 50% datapath latches
  • About 11% of the cycles are spent processing ACE instructions
    • Significantly lower than the instruction queue's 28%, because:
      • Instructions must wait in the instruction queue
      • Speculatively issued instructions succeeding cache-miss loads must replay through the instruction queue
      • The floating point pipes are mostly idle while executing integer code
    • Implemented logical masking functions for a small but important subset
  • Estimated AVFs using a novel approach that tracks bits required for architecturally correct execution (ACE) and un-ACE bits
  • Computed the AVF for the instruction queue and execution units of an Itanium2®-like IA64 processor.
  • Further refinement could lower the AVF estimates, but its contribution is expected to be small
  • Can estimate the FIT rate of an entire processor early in the design cycle
  • Can help designers choose the appropriate error detection or correction schemes
  • Can lower the FIT rate of the chip iteratively by adding more and more error protection, using AVF estimates as a guide.
L2-Miss Driven VSV for low power (H. Li, C. Cher, T. N. Vijaykumar, K. Roy)
  • Idea: upon an L2 miss, the pipeline performs independent computations but almost always ends up stalled waiting for data, despite out-of-order issue and other latency-hiding techniques
  • During an L2 miss, scale down the supply voltage and carry out the independent computations at lower speed instead
  • Performance degrades, however, if there are sufficient independent computations that would otherwise overlap with the miss latency
  • Returning to full speed, however, will likely reduce power savings if there are multiple misses and insufficient independent computations to overlap with them
Proposed solution
  • Two state machines tracking parallelism on the fly
  • Scale down voltage depending on parallelism of the two events
  • Factors considered
    • Circuit level complexities reducing VSV to two voltages
    • Stability
    • Signal propagation speed issues
    • Energy overhead issues in RAMs and Register files
  • Average reduction of processor power is 7% while performance degradation is 0.9%
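A two-level controller of this kind can be sketched as follows. This is a hypothetical illustration of the idea, not the paper's state machines: the class name, thresholds, window lengths, and exact transition conditions are all assumptions, and the circuit-level costs of switching are ignored.

```python
class VsvController:
    """Toy two-level supply-voltage controller driven by L2 misses."""

    def __init__(self, low_issue_threshold: int = 1, down_window: int = 10,
                 high_issue_threshold: int = 3, up_window: int = 3):
        self.low_issue_threshold = low_issue_threshold
        self.down_window = down_window
        self.high_issue_threshold = high_issue_threshold
        self.up_window = up_window
        self.voltage = "HIGH"
        self._streak = 0  # consecutive cycles meeting the pending transition condition

    def step(self, outstanding_l2_misses: int, issued_this_cycle: int) -> str:
        """Advance one cycle and return the supply level ('HIGH' or 'LOW')."""
        if self.voltage == "HIGH":
            # Down FSM: during a miss, count cycles with little independent work.
            low_ilp = (outstanding_l2_misses > 0
                       and issued_this_cycle <= self.low_issue_threshold)
            self._streak = self._streak + 1 if low_ilp else 0
            if self._streak >= self.down_window:
                self.voltage, self._streak = "LOW", 0
        else:
            # Up FSM: return to full speed once misses drain or work piles up.
            busy = issued_this_cycle >= self.high_issue_threshold
            self._streak = self._streak + 1 if busy else 0
            if outstanding_l2_misses == 0 or self._streak >= self.up_window:
                self.voltage, self._streak = "HIGH", 0
        return self.voltage
```

Requiring a streak of cycles before each transition is one way to address the stability concern listed above, since a single noisy cycle cannot flip the supply level.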

High-To-Low transition

Low-To-High transition

VSV - Achievements
  • Power savings with minimal performance degradation
  • Complexity of circuits taken into consideration
  • FSMs track the level of parallelism between independent operations and the delay caused by an L2 cache miss
  • VSV achieves 4% reduction in power for all SPEC2K benchmarks
  • VSV achieves 12% for the benchmarks with high L2 miss rates
  • Any questions or feedback?