Highlights of the 36th Annual International Symposium on Microarchitecture, December 2003

### Highlights of the 36th Annual International Symposium on Microarchitecture, December 2003

### 36th Annual International Symposium on Microarchitecture - A Review

Theo Theocharides

Embedded and Mobile Computing Center

Department of Computer Science and Engineering

The Pennsylvania State University

Acknowledgements:

K. Bernstein, T. Austin, D. Blaauw, L. Peh, D. Jimenez

Introduction

- The International Symposium on Microarchitecture is the premier forum for discussing new microarchitecture and software techniques
- Brings together researchers in processor architecture, compilers, and systems for technical interaction on traditional MICRO topics
- Special emphasis on optimizations that take advantage of application-specific opportunities
- Draws on both the microarchitecture and embedded architecture communities

- http://www.microarch.org
- http://www.microarch.org/micro36/

Symposium Outline

- Session 1: Voltage Scaling & Transient Faults
- Session 2: Cache
- Session 3: Power and Energy Efficient Architectures
- Session 4: Application-Specific Optimization and Analysis
- Session 5: Dynamic Optimization Systems
- Session 6: Dynamic Program Analysis and Optimization
- Session 7: Branch, Value, and Scheduling Optimization
- Session 8: Dataflow, Data Parallel, and Clustered Architectures
- Session 9: Secure and Network Processors
- Session 10: Scaling Design

Highlights

Keynote Speech

- Caution Flag Out: Microarchitecture's Race for Power Performance
- Kerry Bernstein, IBM T. J. Watson Research Center

Interesting Papers

- Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, D. Ernst, et al.
- Power-Driven Design of Router Microarchitectures in On-Chip Networks, H. Wang, Li-Shiuan Peh, S. Malik
- Fast Path-Based Neural Branch Prediction, D. Jimenez

Workshops and Tutorials

- 5th Workshop on Media and Streaming Processors (MSP)
- 3rd Workshop on Power-Aware Computer Systems (PACS)
- 2nd Workshop on Application Specific Processors (WASP)
- Tutorial: Challenges in Embedded Computing
- Tutorial: Open Research Compiler (ORC): Proliferation of Technologies and Tools
- Tutorial: Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design
- Tutorial: Network Processors
- Tutorial: Architectural Exploration with Liberty

Keynote Speech

- Given by Kerry Bernstein, IBM T.J. Watson Research Center
- Microarchitecture and technology relationship
- We cannot continue to scale down to achieve higher frequencies without any catch
- Increasing pipeline depth does not necessarily help
- Power consumption, process variation, soft errors, die area erosion becoming more and more important
- Keynote explored how past technologies have influenced high speed microarchitectures
- Keynote showed how characteristics of proposed new devices and interconnects for lithographies beyond 90nm may shape future machine design.
- Given the present issues and incoming trends, role of microarchitecture in extending CMOS performance will be more important than ever

Issues in summary:

- Feature size
- Device count (transistors per chip)
- Pipeline depth
- Power consumption increases non-linearly with scaling
- Power grows when we reduce the FO4 delay
- Delay and power affected by process variation
- Cooling creates more problems
- Cost of power diverges from performance gain

Repairs

- Monitor-based Full Chip Voltage, Clock Throttling
- Voltage Islands
- Technology aid required here
- Latency required
- Low-activity FET count increase

- Clock Gating
- So far has been a nice solution…

- Pipeline depth optimization
- Performance accelerators for ASICs (DSPs, GPUs, etc.)
- They need power anyway, so at least make them efficient
- Software solutions should be developed here

- Compute-Informed Power Management
- Instruction Stream
- Dynamic Resource Assertion
- Power Aware OS
- Thermal Modeling

New Ideas

- “Evolutionary”
- Strained Silicon
- High-K Gate Dielectrics
- Hybrid Crystal Silicon
- Increase current drive/micron of device
- Allow transistor density improvement
- Introduce Features which enable active static power management

- “Revolutionary”
- Double Gated MOSFETs
- 3D Integration
- Molecular Computing
- Reduce Power Density without architectural management
- Eliminate power dependence on frequency
- Return the industry to threshold and supply voltage scaling

Keynote Conclusions

- New technologies will likely help, but not necessarily
- Power is by far the predominant factor in scaling – we need to see what new technologies can give us
- Staying ahead requires power-aware systems

Razor Project (T. Austin, D. Blaauw, T. Mudge)

- Designers and architects have scaled the supply voltage down only as far as the point where no errors occur under any worst-case condition
- Very conservative voltage scaling
- IDEA!
- Instead of trying to avoid ALL errors, ALLOW some errors to happen and correct them!
- Major argument: scaling the supply voltage down by almost 0.25 V gives an average error rate of less than 5%
- Instead of spending energy, logic, design effort, and time on avoiding every error, allow a very small percentage of errors and gain large power savings
- The cost of fixing errors is minimal when the error percentage is kept under control

Razor Advantages

- Eliminate safety margins
- Process variation, IR-drop, temperature fluctuation, data-dependent latencies, model uncertainty

- Operate at sub-critical voltage for optimal trade-off between:
- Energy gain from voltage scaling
- Energy overhead from dynamic error correction

- Tune voltage for average instruction data
- Exploit delay dependence in data

- Tolerate delay degradation due to infrequent noise events
- SER, capacitive, inductive noise, charge sharing, floating body effect…
- Most severe noise also least frequent
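
The trade-off between the energy gained from voltage scaling and the energy spent on error correction can be sketched numerically. The following is a minimal sketch with made-up models, not data from the Razor paper: energy per instruction is assumed to scale with the square of the supply voltage, the error rate is a hypothetical exponential in how far the supply drops below an assumed first-error voltage, and every constant is an illustrative assumption.

```python
import math

V_NOMINAL = 1.8       # assumed worst-case-safe supply voltage (volts)
V_FIRST_ERROR = 1.6   # assumed voltage below which timing errors start to appear
RECOVERY_COST = 10.0  # assumed energy of one recovery, in error-free instructions


def error_rate(v):
    """Hypothetical fraction of instructions that suffer a timing error at supply v."""
    if v >= V_FIRST_ERROR:
        return 0.0
    return min(1.0, 0.001 * math.expm1(20.0 * (V_FIRST_ERROR - v)))


def energy_per_instruction(v):
    """Relative energy: V^2 scaling plus the overhead of correcting errors."""
    base = (v / V_NOMINAL) ** 2
    return base * (1.0 + RECOVERY_COST * error_rate(v))


def best_voltage(lo=1.0, hi=V_NOMINAL, steps=81):
    """Scan a grid of supply voltages and return the one with the least energy."""
    candidates = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(candidates, key=energy_per_instruction)
```

Under these assumptions the minimum lies below the first-error voltage: a small, non-zero error rate costs less than the safety margin it replaces, which is the sub-critical operating point the slide refers to.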

Power-driven Design of Router Microarchitectures in On-chip Networks (Hangsheng Wang, Li-Shiuan Peh, Sharad Malik)

- Investigates on-chip network microarchitectures from a power-driven perspective
- Power-efficient network microarchitectures:
- segmented crossbar, cut-through crossbar and write-through buffer

- Studies and uncovers the power saving potential of an existing network architecture: Express cube
- Reduction in network power of up to 44.9%,
- NO degradation in network performance
- Improved latency and throughput in some cases.

Power in NoC

- E_wrt is the average energy dissipated when writing a flit into the input buffer
- E_rd is the average energy dissipated when reading a flit from the input buffer
- E_buf = E_wrt + E_rd is the average buffer energy
- E_arb is the average arbitration energy
- E_xb is the average crossbar traversal energy
- E_lnk is the average link traversal energy
- H is the number of hops traversed by this flit
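
The slide's equation did not survive in the transcript. A form consistent with the definitions above is E_flit = (E_buf + E_arb + E_xb + E_lnk) × H, with E_buf = E_wrt + E_rd. The sketch below evaluates that expression; the class name, field names, and the numbers in the usage note are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class RouterEnergy:
    """Average per-flit, per-hop energies in any consistent unit (e.g. pJ)."""
    e_wrt: float  # buffer write
    e_rd: float   # buffer read
    e_arb: float  # arbitration
    e_xb: float   # crossbar traversal
    e_lnk: float  # link traversal

    @property
    def e_buf(self):
        """Average buffer energy: one write plus one read."""
        return self.e_wrt + self.e_rd

    def flit_energy(self, hops):
        """Energy for one flit to traverse `hops` routers and links."""
        return (self.e_buf + self.e_arb + self.e_xb + self.e_lnk) * hops
```

For example, `RouterEnergy(1.0, 1.0, 0.5, 2.0, 1.5).flit_energy(4)` sums to 6.0 units per hop and 24.0 units over four hops. Each architectural method in the talk targets one term: the crossbar variants reduce E_xb, the write-through buffer reduces E_buf, and the express cube reduces H.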

Architectural Methods

- Segmented crossbar
- Cut-through crossbar
- Write-through input buffer
- Express cube

Segmented Crossbar

Schematic of a matrix crossbar and a segmented crossbar. F is the flit size in bits, dw is the track width, and E, W, N, S are the ports.

Cut-through crossbar

Schematic of cut-through crossbars. F is the flit size, dw is the track width, and E, W, N, S are the ports.

Write-through buffer

- Bypassing without overlapping
- Bypassing with overlapping
- Schematic of a write-through input buffer.

Express cube topology and microarchitecture

Power savings and conclusions

- Importance of a power-driven approach to on-chip network design
- Need to investigate the interactions between traffic patterns and On Chip Network architectures
- Need to reach a systematic design methodology for on-chip networks

Fast Path-Based Neural Branch Prediction (D. Jimenez)

- Paper presented a new neural branch predictor
- both more accurate and much faster than previous neural predictors

- Accuracy far superior to conventional predictors
- Latency comparable to predictors from industrial designs
- Improves the instructions-per-cycle (IPC) rate of an aggressively clocked microarchitecture by 16%

Latency - Accuracy Gain

Rather than being done all at once (above), computation is staggered (below)

- Train a neural network with path history, and update it dynamically.
- Choose the weight vectors according to the path leading up to the branch rather than branch address alone
- Directly reduces latency (can begin prior to the prediction – see figure on the left)
- Improves accuracy as the predictor incorporates path information
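
The idea of choosing weights by path can be illustrated with a much-simplified perceptron predictor. This is a sketch under stated assumptions, not the paper's algorithm: it computes the whole sum at prediction time instead of staggering it along the path, and the table size, history length, and training threshold are arbitrary illustrative values.

```python
class PathPerceptron:
    """Simplified perceptron predictor whose weights are selected by path."""

    def __init__(self, rows=256, history=8, theta=30):
        self.rows = rows          # number of weight rows (assumed)
        self.h = history          # history length (assumed)
        self.theta = theta        # training threshold (assumed)
        self.bias = [0] * rows
        self.w = [[0] * history for _ in range(rows)]
        self.outcomes = [1] * history  # +1 taken, -1 not taken, newest first
        self.path = [0] * history      # addresses of recent branches, newest first

    def _output(self, pc):
        # The bias comes from the current branch; the i-th weight comes from
        # the row of the i-th most recent branch on the path.
        y = self.bias[pc % self.rows]
        for i in range(self.h):
            y += self.w[self.path[i] % self.rows][i] * self.outcomes[i]
        return y

    def predict(self, pc):
        """Return True if the branch at `pc` is predicted taken."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the actual outcome, then shift the histories."""
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.theta:
            self.bias[pc % self.rows] += t
            for i in range(self.h):
                self.w[self.path[i] % self.rows][i] += t * self.outcomes[i]
        self.outcomes = [t] + self.outcomes[:-1]
        self.path = [pc] + self.path[:-1]
```

Because the i-th weight depends only on a branch seen i steps earlier, its contribution is known before the current branch is fetched. That property is what allows the staggered computation described above.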

Comparative Results – Misprediction rate

IPC per hardware cost

- Faster and more accurate than existing neural branch predictors

Conclusion

- Overview of MICRO36
- Conference lasted 5 days – impossible to review in half an hour!
- If you are interested, you should read the proceedings online at http://www.microarch.org/micro36

The Call For Papers for MICRO37 is available, at

http://www.microarch.org/micro37

DEADLINE FOR PAPER SUBMISSION: May 28th, 2004

Links to the papers reviewed

- Razor
- http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf

- NoC Router Power-Driven Design
- http://www.microarch.org/micro36/html/pdf/wang-PowerDrivenDesign.pdf

- Fast-Path Neural Branch Predictor
- http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf

Questions?

THANK YOU !

Rajaraman Ramanarayanan

Talk Overview

- Session covered in this presentation
- Review papers
- Architectural vulnerability factors
- Introduction
- Proposed technique
- Soft error terminology
- Computing AVFs
- Results
- Conclusion

- L2-Miss-Driven Variable Supply-Voltage Scaling
- Introduction
- Proposed Solution
- Transitions
- Results
- Achievements

Session Covered

- Voltage Scaling & Transient Faults
- Methodology to compute architectural vulnerability factors
- VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power

Architectural Vulnerability Factors (S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, T. Austin)

- Single-event upsets from particle strikes have become a key challenge in microprocessor design.
- Soft errors due to cosmic rays are making an impact in industry.
- In 2000, Sun Microsystems acknowledged cosmic ray strikes on unprotected cache memories as the cause of random crashes at major customer sites in its flagship Enterprise server line
- The fear of cosmic ray strikes prompted Fujitsu to protect 80% of its 200,000 latches in its recent SPARC processor with some form of error detection

- Designers require accurate estimates of processor error rates to make appropriate cost/reliability trade-offs.

Introduction

- All existing approaches introduce a significant penalty in performance, power, die size, and design time
- Tools and techniques to estimate processor transient error rates are not readily available or fully understood.
- Estimates are needed early in the design cycle.
- In this paper:
- Define the architectural vulnerability factor (AVF)
- Identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution

Proposed technique

- Not all faults in a micro-architectural structure affect the final outcome of a program.
- Architectural Vulnerability factor (AVF)
- probability that a fault in that particular structure will result in an error in the final output of the program

- The overall error rate = product of raw fault rate and AVF.
- Can examine the relative contributions of various structures
- identify cost-effective areas to employ fault protection techniques

- Tracks the subset of processor state bits required for architecturally correct execution (ACE)
- fault in a storage cell containing one of these bits affects output

- For example, a branch predictor’s AVF is 0%
- predictor bits are always un-ACE bits.

- Bits in the committed PC are always ACE bits, so the committed PC has an AVF of 100%

Soft error terminology

- Error budget expressed in terms of:
- Mean Time Between Failures (MTBF).
- Failures In Time (FIT) - inversely related to MTBF.

- Errors are often classified as:
- Undetected - silent data corruption (SDC)
- Detected - detected unrecoverable errors (DUE)

- Effective FIT rate for a structure is the product of its raw circuit FIT rate and the structure’s vulnerability factor
- effective FIT rate per bit is influenced by several vulnerability factors
- also known as de-rating factors or soft error sensitivity factor

- Examples include timing vulnerability factor for latches and AVF
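
The relationships above reduce to simple arithmetic. The sketch below assumes the usual definition of one FIT as one failure per 10^9 hours of operation; the function names and the example numbers are illustrative, not taken from the paper.

```python
HOURS_PER_BILLION = 1e9  # one FIT is one failure per 10^9 hours


def mtbf_hours(fit):
    """Mean time between failures, in hours, for a given FIT rate."""
    return HOURS_PER_BILLION / fit


def effective_fit(raw_fit, *vulnerability_factors):
    """Raw circuit FIT rate de-rated by each vulnerability factor in turn."""
    result = raw_fit
    for factor in vulnerability_factors:
        if not 0.0 <= factor <= 1.0:
            raise ValueError("vulnerability factors must lie in [0, 1]")
        result *= factor
    return result
```

For example, a structure with a hypothetical raw rate of 1000 FIT, a timing vulnerability factor of 0.5, and an AVF of 0.25 has an effective rate of 125 FIT, which corresponds to an MTBF of 8 million hours.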

Silent data corruption in the future

Identifying Un-ACE Bits

- Bits that do not affect final program output
- Analyzed a uniprocessor system
- Micro-architectural Un-ACE bits
- Idle or Invalid State.
- Mis-speculated State.
- Predictor Structures.
- Ex-ACE State.

- Architectural Un-ACE Bits
- NOP instructions.
- Performance-enhancing instructions.
- Predicated-false instructions.
- Dynamically dead instructions.
- Logical masking.

Computing AVF

- AVF for a storage cell: the fraction of time an upset in that cell will cause a visible error in the final output of a program
- AVF equation for a hardware structure
- the average AVF of all the bits in that structure
- $$\mathrm{AVF} = \frac{\sum \text{residency (in cycles) of all ACE bits in the structure}}{\text{total number of bits in the hardware structure} \times \text{total execution cycles}}$$

- Little’s Law:
- $N = B \times L$, where
- $N$ = average number of bits in a box,
- $B$ = average bandwidth per cycle into the box, and
- $L$ = average latency of an individual bit through the box.

- $$\mathrm{AVF} = \frac{B_{ace} \times L_{ace}}{\text{total number of bits in the hardware structure}}$$
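
The two AVF expressions, one based on summed ACE-bit residency and one based on Little's Law, should agree for the same workload. The sketch below checks this on a toy case; the function names and all numbers are illustrative assumptions.

```python
def avf_from_residency(ace_residency_cycles, total_bits, total_cycles):
    """AVF as summed ACE-bit residency over (bits in structure x cycles)."""
    return sum(ace_residency_cycles) / (total_bits * total_cycles)


def avf_from_littles_law(ace_bandwidth, ace_latency, total_bits):
    """AVF as the average number of resident ACE bits (N = B x L) over bits."""
    return ace_bandwidth * ace_latency / total_bits


# Toy case: a 64-bit structure observed for 1000 cycles, through which
# 200 ACE bits pass, each resident for 80 cycles.
residencies = [80] * 200
by_residency = avf_from_residency(residencies, total_bits=64, total_cycles=1000)
by_little = avf_from_littles_law(ace_bandwidth=200 / 1000, ace_latency=80,
                                 total_bits=64)
```

Both forms give 0.25 here: on average 16 of the 64 bits hold ACE state at any time.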

Computing AVFs using a Performance Model

- AVFs computed for two structures, the instruction queue and the execution units, using the Asim performance model framework
- Need the following information
- Sum of all residence cycles of all ACE bits of the objects resident in the structure during the execution of the program,
- Total execution cycles for which we observe the ACE bits’ residence time, and
- Total number of bits in a hardware structure.

- AVF algorithm
- Record the residence time of the instruction in the structure as an instruction flows through different structures in the pipeline
- Update the structures the instruction flowed through
- Put the instruction in a post-commit analysis window to
- Determine if the instruction is dynamically dead or
- Determine if there are any bits that are logically masked
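
The flow above can be sketched on a toy committed-instruction stream. This is a heavily simplified illustration, not the Asim implementation: it treats every bit of an instruction as ACE or un-ACE together, detects only register results that are overwritten before being read, and ignores transitive deadness, memory, and logical masking. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Committed:
    dest: Optional[str]     # destination register, if any
    srcs: Tuple[str, ...]   # source registers
    residency: int          # cycles spent in the structure
    bits: int               # bits the instruction occupies in the structure


def mark_dead(stream):
    """Return one flag per instruction: True if its result is overwritten unread."""
    dead = [False] * len(stream)
    pending = {}  # register -> index of the latest write not yet read
    for i, ins in enumerate(stream):
        for reg in ins.srcs:
            pending.pop(reg, None)  # the value was read, so its writer is live
        if ins.dest is not None:
            if ins.dest in pending:
                dead[pending[ins.dest]] = True  # overwritten before any read
            pending[ins.dest] = i
    # Writes still pending when the window ends are conservatively kept live.
    return dead


def structure_avf(stream, total_bits, total_cycles):
    """AVF from the residency of instructions not found to be dead."""
    dead = mark_dead(stream)
    ace = sum(ins.residency * ins.bits
              for ins, is_dead in zip(stream, dead) if not is_dead)
    return ace / (total_bits * total_cycles)
```

Keeping unresolved writes live errs toward a higher AVF, so the estimate stays an upper bound when the analysis window is too short to decide.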

Methodology for evaluation

- Use an Itanium2®-like IA64 processor [14] scaled to current technology
- Modeled in detail in Asim performance model framework.

Results – Program level Decomposition

Results

- Program-level Decomposition
- About 45% of the instructions are ACE instructions; the remaining 55% are un-ACE instructions
- Some of these un-ACE instructions still contain ACE bits, such as the opcode bits of prefetch instructions
- UNKNOWN and NOT_PROCESSED instructions account for about 1% of the total instructions
- NOPs, predicated-false instructions, and prefetch instructions account for 26%, 6.7%, and 1.5%, respectively.
- FDD_reg and FDD_mem denote first-level dynamically dead instructions whose results are written back to registers and memory, respectively
- Account for about 9.4% and 2% of the dynamic instructions
- IA64 has a large number of registers

- TDD_reg and TDD_mem (transitively dynamically dead) account for 6.6% and 1.6% of the dynamic instructions

AVF for instruction queue

- Shows what percentage of cycles a storage cell in the instruction queue contains ACE and un-ACE bits.
- Instruction queue contains an ACE bit about 28% of the time.
- Thus AVF of the instruction queue is 28%.

- Floating point programs, in general, have higher AVFs compared to integer programs (31% vs. 25%, respectively)
- Long-latency instructions and few branch mispredictions
- Use the instruction queue more effectively than integer programs, leading to a higher AVF

- Apply Little’s law:
- Number of ACE instructions in the queue = ACE IPC (the bandwidth) × ACE latency (the average number of cycles an instruction is in the ACE state)
- The ACE IPC and ACE latency come from our performance model

AVFs for the Execution Units

- Four integer pipes and two floating point pipes
- 50% control latches and 50% datapath latches

- 11% of the cycles are spent processing ACE instructions
- Significantly lower than for the instruction queue
- Instructions must wait in the instruction queue
- Speculatively issued instructions succeeding cache-miss loads must replay through the instruction queue
- The floating point pipes are mostly idle while executing integer code

- Implemented logical masking functions for a small but important subset

Conclusion

- Estimated AVFs using a novel approach that tracks bits required for architecturally correct execution (ACE) and un-ACE bits
- Computed the AVF for the instruction queue and execution units of an Itanium2®-like IA64 processor.
- Further refinement could lower the AVF estimates, but its contribution is expected to be small
- Can estimate the FIT rate of an entire processor early in the design cycle
- Can help designers choose the appropriate error detection or correction schemes
- Can lower the FIT rate of the chip iteratively by adding more and more error protection, using AVF estimates as a guide.

L2-Miss Driven VSV for low power (H. Li, C. Cher, T. N. Vijaykumar, K. Roy)

- Idea: Upon an L2 miss, the pipeline performs independent computations but almost always ends up stalled, waiting for data, despite out-of-order issue and other latency-hiding techniques
- During an L2 miss, scale down the supply voltage and carry out the independent computations at lower speed instead
- However, performance degrades if there are sufficient independent computations that would otherwise overlap with the miss latency
- Returning to full speed, however, will likely reduce power savings if there are multiple misses and insufficient independent computations to overlap with them

Proposed solution

- Two state machines tracking parallelism on the fly
- Scale the voltage up or down depending on the level of parallelism
- Factors considered
- Circuit-level complexity, which limits VSV to two supply voltages
- Stability
- Signal propagation speed issues
- Energy overhead issues in RAMs and register files

- Average reduction of processor power is 7% while performance degradation is 0.9%
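
A two-level controller driven by L2 misses and issue activity can be sketched as follows. This is an illustrative policy only, not the state machines from the paper: the scale-down and scale-up conditions, the cycle thresholds, and all names are assumptions, and voltage transition latency is ignored.

```python
from enum import Enum


class Vdd(Enum):
    HIGH = "high"
    LOW = "low"


class VsvController:
    """Hypothetical two-voltage policy driven by L2 misses and issue activity."""

    def __init__(self, idle_cycles_to_scale_down=3, busy_cycles_to_scale_up=3):
        self.down_after = idle_cycles_to_scale_down  # assumed threshold
        self.up_after = busy_cycles_to_scale_up      # assumed threshold
        self.level = Vdd.HIGH
        self._idle = 0
        self._busy = 0

    def step(self, outstanding_l2_misses, issued):
        """Advance one cycle; `issued` is the number of instructions issued."""
        if self.level is Vdd.HIGH:
            # Scale down only when a miss is outstanding and nothing issues,
            # i.e. there is too little independent work to hide the miss.
            if outstanding_l2_misses > 0 and issued == 0:
                self._idle += 1
            else:
                self._idle = 0
            if self._idle >= self.down_after:
                self.level, self._idle, self._busy = Vdd.LOW, 0, 0
        else:
            # Scale up when all misses have returned or work is flowing again.
            self._busy = self._busy + 1 if issued > 0 else 0
            if outstanding_l2_misses == 0 or self._busy >= self.up_after:
                self.level, self._idle, self._busy = Vdd.HIGH, 0, 0
        return self.level
```

Requiring several consecutive idle or busy cycles before switching is one way to avoid oscillating between the two voltages, which relates to the stability factor listed above.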

VSV Structure

VSV - Results

VSV - Achievements

- Power savings with minimal performance degradation
- Complexity of circuits taken into consideration
- FSMs track the level of parallelism between independent operations and the delay caused by an L2 cache miss
- VSV achieves a 4% reduction in power across all SPEC2K benchmarks
- VSV achieves a 12% reduction in power for the benchmarks with high L2 miss rates

Questions

- Any questions or feedback?
