Highlights of the 36th Annual International Symposium on Microarchitecture, December 2003 Theo Theocharides Embedded and Mobile Computing Center Department of Computer Science and Engineering The Pennsylvania State University Acknowledgements: K. Bernstein, T. Austin, D. Blaauw, L. Peh, D. Jimenez
Introduction • The International Symposium on Microarchitecture is the premier forum for discussing new microarchitecture and software techniques • A venue where processor architecture, compilers, and systems meet for technical interaction on traditional MICRO topics • Special emphasis on optimizations that take advantage of application-specific opportunities • Spans the microarchitecture and embedded architecture communities • http://www.microarch.org • http://www.microarch.org/micro36/
Symposium Outline • Session 1: Voltage Scaling & Transient • Session 2: Cache • Session 3: Power and Energy Efficient Architectures • Session 4: Application-Specific Optimization and Analysis • Session 5: Dynamic Optimization Systems • Session 6: Dynamic Program Analysis and Optimization • Session 7: Branch, Value, and Scheduling Optimization • Session 8: Dataflow, Data Parallel, and Clustered Architectures • Session 9: Secure and Network Processors • Session 10: Scaling Design
Highlights Keynote Speech • Caution Flag Out: Microarchitecture's Race for Power Performance • Kerry Bernstein, IBM T. J. Watson Research Center Interesting Papers • Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, D. Ernst, et al. • Power-Driven Design of Router Microarchitectures in On-Chip Networks, H. Wang, Li-Shiuan Peh, S. Malik • Fast Path-Based Neural Branch Prediction, D. Jimenez
Workshops and Tutorials • 5th Workshop on Media and Streaming Processors (MSP) • 3rd Workshop on Power-Aware Computer Systems (PACS) • 2nd Workshop on Application Specific Processors (WASP) • Tutorial: Challenges in Embedded Computing • Tutorial: Open Research Compiler (ORC): Proliferation of Technologies and Tools • Tutorial: Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design • Tutorial: Network Processors • Tutorial: Architectural Exploration with Liberty
Keynote Speech • Given by Kerry Bernstein, IBM T.J. Watson Research Center • Microarchitecture and technology relationship • We cannot continue to scale down to achieve higher frequencies without any catch • Increasing pipeline depth does not necessarily help • Power consumption, process variation, soft errors, die area erosion becoming more and more important • Keynote explored how past technologies have influenced high speed microarchitectures • Keynote showed how characteristics of proposed new devices and interconnects for lithographies beyond 90nm may shape future machine design. • Given the present issues and incoming trends, role of microarchitecture in extending CMOS performance will be more important than ever
Issues in summary: • Feature size • Device count (transistors per chip) • Pipeline depth • Power consumption increases non-linearly with scaling • Power grows when we reduce the FO4 delay • Delay and power are affected by process variation • Cooling creates more problems • Cost of power diverges from performance gain
Repairs • Monitor-based Full Chip Voltage, Clock Throttling • Voltage Islands • Technology aid required here • Latency required • Low-activity FET count increase • Clock Gating • So far has been a nice solution… • Pipeline depth optimization • Performance accelerators for ASICs (DSPs, GPUs, etc.) • As in, they need power anyway, so at least make them efficient • Software solutions should be developed here • Compute-Informed Power Management • Instruction Stream • Dynamic Resource Assertion • Power Aware OS • Thermal Modeling
New Ideas • “Evolutionary” • Strained Silicon • High-K Gate Dielectrics • Hybrid Crystal Silicon • Increase current drive/micron of device • Allow transistor density improvement • Introduce Features which enable active static power management • “Revolutionary” • Double Gated MOSFETs • 3D Integration • Molecular Computing • Reduce Power Density without architectural management • Eliminate power dependence on frequency • Return the industry to threshold and supply voltage scaling
Keynote Conclusions • New technologies will likely help, but not necessarily • Power is by far the predominant factor in scaling – we need to see what new technologies can give us • Staying ahead requires power-aware systems
Razor Project (T. Austin, D. Blaauw, T. Mudge) • We (designers/architects) have scaled the supply voltage down only to the point where operation is proven error-free under all possible worst cases • Very conservative voltage scaling • IDEA! • Instead of trying to avoid ALL errors, ALLOW some errors to happen and correct them! • Major argument: scaling the supply voltage down by almost 0.25 V gives an average error rate of less than 5% • Instead of spending energy, logic, design effort, and time on avoiding every error, allow a very small error rate and gain large power savings • Cost of fixing errors is minimal when the error rate is kept under control
Razor Advantages • Eliminate safety margins • Process variation, IR-drop, temperature fluctuation, data-dependent latencies, model uncertainty • Operate at sub-critical voltage for optimal trade-off between: • Energy gain from voltage scaling • Energy overhead from dynamic error correction • Tune voltage for average instruction data • Exploit delay dependence in data • Tolerate delay degradation due to infrequent noise events • SER, capacitive, inductive noise, charge sharing, floating body effect… • Most severe noise also least frequent
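The trade-off between energy gain from voltage scaling and energy overhead from error correction can be illustrated with a toy model. This is a sketch under stated assumptions, not the model used in the Razor paper: dynamic energy is taken to scale with V², the error rate is assumed to grow exponentially below a hypothetical first-error voltage, and each error is assumed to cost a fixed recovery energy. Every constant and function name below is hypothetical.

```python
import math

# All constants are hypothetical, chosen only to make the trade-off visible.
V_NOMINAL = 1.8       # worst-case-safe supply voltage (V)
V_FIRST_ERROR = 1.6   # voltage below which timing errors start to appear (V)
RECOVERY_COST = 10.0  # recovery energy per error, in instruction-energies


def error_rate(v):
    """Assumed error model: no errors at or above V_FIRST_ERROR,
    exponential growth below it, capped at 1."""
    if v >= V_FIRST_ERROR:
        return 0.0
    return min(1.0, 0.001 * math.exp(20.0 * (V_FIRST_ERROR - v)))


def relative_energy(v):
    """Energy per useful instruction, normalised to operation at V_NOMINAL."""
    switching = (v / V_NOMINAL) ** 2  # dynamic energy assumed ~ V^2
    return switching * (1.0 + RECOVERY_COST * error_rate(v))


def best_voltage(lo=1.2, hi=V_NOMINAL, steps=600):
    """Grid search for the supply voltage with the lowest relative energy."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=relative_energy)


if __name__ == "__main__":
    v = best_voltage()
    print(f"best V = {v:.3f} V, error rate = {error_rate(v):.4f}, "
          f"relative energy = {relative_energy(v):.3f}")
```

Under these assumptions the minimum lies below the first-error voltage: tolerating a small error rate costs less than the V² savings it buys, which is the sub-critical operating point the slide describes.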
Power-driven Design of Router Microarchitectures in On-chip Networks (Hangsheng Wang, Li-Shiuan Peh, Sharad Malik) • Investigates on-chip network microarchitectures from a power-driven perspective • Power-efficient network microarchitectures: • segmented crossbar, cut-through crossbar and write-through buffer • Studies and uncovers the power saving potential of an existing network architecture: Express cube • Reduction in network power of up to 44.9% • NO degradation in network performance • Improved latency-throughput in some cases
Power in NoC • $E_{wrt}$ is the average energy dissipated when writing a flit into the input buffer • $E_{rd}$ is the average energy dissipated when reading a flit from the input buffer • $E_{buf} = E_{wrt} + E_{rd}$ is the average buffer energy • $E_{arb}$ is the average arbitration energy • $E_{xb}$ is the average crossbar traversal energy • $E_{lnk}$ is the average link traversal energy • $H$ is the number of hops traversed by this flit
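The equation these terms belong to did not survive on the slide. A minimal sketch, assuming the per-flit energy is the sum of the per-hop components multiplied by the hop count, i.e. (E_buf + E_arb + E_xb + E_lnk) × H, which is how the definitions read. The class name, function name, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouterEnergy:
    """Average per-hop energy components of a router (arbitrary units)."""
    e_wrt: float  # write a flit into the input buffer
    e_rd: float   # read a flit from the input buffer
    e_arb: float  # arbitration
    e_xb: float   # crossbar traversal
    e_lnk: float  # link traversal

    @property
    def e_buf(self) -> float:
        # E_buf = E_wrt + E_rd, as defined on the slide
        return self.e_wrt + self.e_rd

    def per_hop(self) -> float:
        # Energy a flit dissipates at each hop
        return self.e_buf + self.e_arb + self.e_xb + self.e_lnk


def flit_energy(router: RouterEnergy, hops: int) -> float:
    """Assumed model: per-hop energy multiplied by the number of hops H."""
    return router.per_hop() * hops


if __name__ == "__main__":
    r = RouterEnergy(e_wrt=1.0, e_rd=2.0, e_arb=0.5, e_xb=3.0, e_lnk=4.0)
    print(flit_energy(r, hops=4))
```

Written this way, the architectural methods map onto individual terms: the crossbar variants target E_xb, the write-through buffer targets E_buf, and the express cube reduces H.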
Architectural Methods • Segmented crossbar • Cut-through crossbar • Write-through input buffer • Express cube
Segmented Crossbar Schematic of a matrix crossbar and a segmented crossbar. F is flit size in bits, dw is track width, E, W, N, S are ports.
Cut-through crossbar Schematic of cut-through crossbars F is flit size, dw is track width, E, W, N, S are ports
Write-through buffer • Bypassing without overlapping • Bypassing with overlapping • Schematic of a write-through input buffer.
Power savings and conclusions • Importance of a power-driven approach to on-chip network design • Need to investigate the interactions between traffic patterns and On Chip Network architectures • Need to reach a systematic design methodology for on-chip networks
Fast Path-Based Neural Branch Prediction (D. Jimenez) • Paper presented a new neural branch predictor • Both more accurate and much faster than previous neural predictors • Accuracy far superior to conventional predictors • Latency comparable to predictors from industrial designs • Improves the instructions-per-cycle (IPC) rate of an aggressively clocked microarchitecture by 16%
Latency - Accuracy Gain • Figure: rather than being done all at once (above), computation is staggered (below) • Train a neural network with path history, and update it dynamically. • Choose the weight vectors according to the path leading up to the branch rather than branch address alone • Directly reduces latency (can begin prior to the prediction – see figure on the left) • Improves accuracy as the predictor incorporates path information
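The idea of choosing weights by path can be sketched with a small perceptron-style predictor. This is a simplified illustration, not the paper's hardware design: it computes the dot product all at once rather than staggering it across earlier branches, and the table size, history length, and training threshold are hypothetical. The key point it shows is that the weight for history position j is looked up with the address of the j-th most recent branch, not with the current branch address.

```python
class PathPerceptron:
    """Simplified perceptron predictor with path-selected weights."""

    def __init__(self, rows=64, history=8, theta=20):
        self.rows = rows
        self.h = history
        self.theta = theta  # training threshold (hypothetical value)
        # w[r][0] is a bias weight; w[r][j] (j >= 1) pairs with history slot j
        self.w = [[0] * (history + 1) for _ in range(rows)]
        self.path = [0] * history       # addresses of recent branches, newest first
        self.outcomes = [-1] * history  # their outcomes: +1 taken, -1 not taken

    def _output(self, pc):
        # Bias comes from the current branch; every other weight comes from
        # the row of the branch that occupied that position on the path.
        y = self.w[pc % self.rows][0]
        for j in range(self.h):
            y += self.outcomes[j] * self.w[self.path[j] % self.rows][j + 1]
        return y

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Perceptron rule: train on a misprediction or a low-confidence output
        if (y >= 0) != taken or abs(y) <= self.theta:
            self.w[pc % self.rows][0] += t
            for j in range(self.h):
                self.w[self.path[j] % self.rows][j + 1] += t * self.outcomes[j]
        # Shift the new branch into the path and outcome histories
        self.path = [pc] + self.path[:-1]
        self.outcomes = [t] + self.outcomes[:-1]


if __name__ == "__main__":
    p = PathPerceptron()
    misses = 0
    for i in range(200):
        taken = (i % 2 == 0)  # a branch that alternates taken / not taken
        misses += p.predict(5) != taken
        p.update(5, taken)
    print("mispredictions:", misses)
```

Because each term of the sum depends only on an earlier branch's address and outcome, a hardware version can accumulate the terms as those branches are fetched, which is where the latency reduction comes from.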
IPC per hardware cost • Faster and more accurate than existing neural branch predictors
Conclusion • Overview of MICRO36 • Conference lasted 5 days – impossible to review in half an hour! • If you are interested, you should read the proceedings on-line at http://www.microarch.org/micro36 • The Call For Papers for MICRO37 is available at http://www.microarch.org/micro37 • DEADLINE FOR PAPER SUBMISSION: May 28th, 2004
Links to the papers reviewed • Razor • http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf • NoC Router Power-Driven Design • http://www.microarch.org/micro36/html/pdf/wang-PowerDrivenDesign.pdf • Fast-Path Neural Branch Predictor • http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf
Questions? THANK YOU !
36th Annual International Symposium on Micro-Architecture - A Review Rajaraman Ramanarayanan
Talk Overview • Session covered in this presentation • Review papers • Architectural vulnerability factors • Introduction • Proposed technique • Soft error terminology • Computing AVFs • Results • Conclusion • L2-Miss-Driven Variable Supply-Voltage Scaling • Introduction • Proposed Solution • Transitions • Results • Achievements
Session Covered • Voltage Scaling & Transient Faults • Methodology to compute Architectural Vulnerability Factors • VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power
Architectural Vulnerability Factors (S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, T. Austin) • Single-event upsets from particle strikes have become a key challenge in microprocessor design. • Soft errors due to cosmic rays are making an impact in industry. • In 2000, Sun Microsystems acknowledged cosmic ray strikes on unprotected cache memories as the cause of random crashes at major customer sites in its flagship Enterprise server line • The fear of cosmic ray strikes prompted Fujitsu to protect 80% of its 200,000 latches in its recent SPARC processor with some form of error detection • Designers require accurate estimates of processor error rates to make appropriate cost/reliability trade-offs.
Introduction • All existing approaches introduce a significant penalty in performance, power, die size, and design time • Tools and techniques to estimate processor transient error rates are not readily available or fully understood. • Estimates are needed early in the design cycle. • In this paper: • Define architectural vulnerability factor (AVF) • Identify numerous cases, such as pre-fetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution
Proposed technique • Not all faults in a micro-architectural structure affect the final outcome of a program. • Architectural Vulnerability Factor (AVF) • Probability that a fault in that particular structure will result in an error in the final output of the program • The overall error rate = product of raw fault rate and AVF. • Can examine the relative contributions of various structures • Identify cost-effective areas to employ fault protection techniques • Tracks the subset of processor state bits required for architecturally correct execution (ACE) • A fault in a storage cell containing one of these bits affects output • For example, a branch predictor’s AVF is 0% • Predictor bits are always un-ACE bits. • Bits in the committed PC are always ACE bits, so the committed PC has an AVF of 100%
Soft error terminology • Error budget expressed in terms of: • Mean Time Between Failures (MTBF). • Failures In Time (FIT) - inversely related to MTBF. • Errors are often classified as: • Undetected - silent data corruption (SDC) • Detected - detected unrecoverable errors (DUE) • Effective FIT rate for a structure is the product of its raw circuit FIT rate and the structure’s vulnerability factor • effective FIT rate per bit is influenced by several vulnerability factors • also known as de-rating factors or soft error sensitivity factor • Examples include timing vulnerability factor for latches and AVF
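The relations on this slide reduce to simple arithmetic. A minimal sketch, assuming the standard definition of one FIT as one failure per 10^9 device-hours; the function names and the example numbers are hypothetical.

```python
import math

HOURS_PER_BILLION = 1e9     # one FIT = one failure in 10^9 hours of operation
HOURS_PER_YEAR = 24 * 365


def effective_fit(raw_fit, vulnerability_factors):
    """Effective FIT = raw circuit FIT multiplied by each de-rating factor
    (e.g. a timing vulnerability factor and an AVF), each in [0, 1]."""
    fit = raw_fit
    for factor in vulnerability_factors:
        fit *= factor
    return fit


def mtbf_years(fit):
    """MTBF is inversely related to FIT; a FIT rate of zero never fails."""
    if fit == 0:
        return math.inf
    return HOURS_PER_BILLION / fit / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical structure: raw rate 1000 FIT, timing factor 0.5, AVF 0.2
    fit = effective_fit(1000.0, [0.5, 0.2])
    print(f"effective FIT = {fit:.1f}, MTBF = {mtbf_years(fit):.0f} years")
```

A structure with an AVF of 0%, such as the branch predictor mentioned earlier, contributes nothing to the effective FIT rate regardless of its raw fault rate.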
Identifying Un-ACE Bits • Bits that do not affect final program output • Analyzed a uniprocessor system • Micro-architectural Un-ACE bits • Idle or Invalid State. • Mis-speculated State. • Predictor Structures. • Ex-ACE State. • Architectural Un-ACE Bits • NOP instructions. • Performance-enhancing instructions. • Predicated-false instructions. • Dynamically dead instructions. • Logical masking.
Computing AVF • AVF of a storage cell: fraction of time an upset in that cell will cause a visible error in the final output of a program • AVF equation for a hardware structure: the average AVF of all bits in that structure • $\mathrm{AVF} = \dfrac{\sum \text{residency (in cycles) of all ACE bits in the structure}}{\text{total number of bits in the hardware structure} \times \text{total execution cycles}}$ • Little’s Law: $N = B \times L$, where • $N$ = average number of bits in a box, • $B$ = average bandwidth per cycle into the box, and • $L$ = average latency of an individual bit through the box. • $\mathrm{AVF} = \dfrac{B_{ace} \times L_{ace}}{\text{total number of bits in the hardware structure}}$
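Both forms of the AVF equation can be checked against each other numerically. A minimal sketch with hypothetical function names and a made-up structure: 64 bits observed for 1000 cycles, with 2 ACE bits entering per cycle and each staying 8 cycles.

```python
def avf_from_residency(ace_residency_cycles, total_bits, total_cycles):
    """AVF = sum of ACE-bit residencies (in cycles) / (bits x cycles).

    ace_residency_cycles holds one entry per ACE bit that passed through
    the structure: the number of cycles that bit was resident.
    """
    return sum(ace_residency_cycles) / (total_bits * total_cycles)


def avf_from_littles_law(b_ace, l_ace, total_bits):
    """AVF = (B_ace x L_ace) / bits, using Little's Law N = B x L to get
    the average number of ACE bits resident in the structure."""
    return (b_ace * l_ace) / total_bits


if __name__ == "__main__":
    total_bits, total_cycles = 64, 1000
    # 2 ACE bits enter every cycle, each resident for 8 cycles
    residencies = [8] * (2 * total_cycles)
    print(avf_from_residency(residencies, total_bits, total_cycles))
    print(avf_from_littles_law(b_ace=2, l_ace=8, total_bits=total_bits))
```

The two calls agree because Little's Law gives an average of 16 ACE bits resident in a 64-bit structure, which is the same quantity the residency sum averages over the run.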
Computing AVFs using a Performance Model • AVFs computed for two structures (the instruction queue and execution units) using the Asim performance model framework • Need the following information • Sum of all residence cycles of all ACE bits of the objects resident in the structure during the execution of the program, • Total execution cycles for which we observe the ACE bits’ residence time, and • Total number of bits in a hardware structure. • AVF algorithm • Record the residence time of the instruction in the structure as an instruction flows through different structures in the pipeline • Update the structures the instruction flowed through • Put the instruction in a post-commit analysis window to • Determine if the instruction is dynamically dead or • Determine if there are any bits that are logically masked
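The bookkeeping in the AVF algorithm can be sketched as a small tracker. This is an illustration under simplifying assumptions, not Asim code: it classifies a whole instruction as ACE or un-ACE when it leaves the post-commit window, whereas the real analysis works per bit (some un-ACE instructions still carry ACE bits). The class and method names are hypothetical.

```python
from collections import defaultdict


class AvfTracker:
    """Accumulates ACE bit-cycles per structure, deferring classification
    until an instruction leaves the post-commit analysis window."""

    def __init__(self, structure_bits):
        self.structure_bits = dict(structure_bits)  # structure -> total bits
        self.pending = defaultdict(list)            # instr -> [(structure, bits, cycles)]
        self.ace_bit_cycles = defaultdict(int)      # structure -> ACE bit-cycles

    def record_residency(self, instr_id, structure, bits, enter_cycle, exit_cycle):
        # Step 1: record how long the instruction occupied the structure
        self.pending[instr_id].append((structure, bits, exit_cycle - enter_cycle))

    def leave_window(self, instr_id, is_ace):
        # Steps 2-3: once the window has shown whether the instruction was
        # dynamically dead or masked, credit (or discard) its residency
        for structure, bits, cycles in self.pending.pop(instr_id, []):
            if is_ace:
                self.ace_bit_cycles[structure] += bits * cycles

    def avf(self, structure, total_cycles):
        return self.ace_bit_cycles[structure] / (
            self.structure_bits[structure] * total_cycles
        )


if __name__ == "__main__":
    t = AvfTracker({"instruction_queue": 100})
    t.record_residency(1, "instruction_queue", bits=50, enter_cycle=0, exit_cycle=10)
    t.record_residency(2, "instruction_queue", bits=50, enter_cycle=0, exit_cycle=10)
    t.leave_window(1, is_ace=True)
    t.leave_window(2, is_ace=False)  # e.g. found to be dynamically dead
    print(t.avf("instruction_queue", total_cycles=20))
```

Deferring the credit is the reason for the post-commit window: whether an instruction is dynamically dead is only known after later instructions have been observed.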
Methodology for evaluation • Use an Itanium2®-like IA64 processor scaled to current technology • Modeled in detail in the Asim performance model framework.
Results • Program-level Decomposition • We get about 45% ACE instructions. The remaining 55% of the instructions are un-ACE instructions • Some of these un-ACE instructions still contain ACE bits, such as the op-code bits of pre-fetch instructions • UNKNOWN and NOT_PROCESSED instructions account for about 1% of the total instructions • NOPs, predicated false instructions, and prefetch instructions account for 26%, 6.7%, and 1.5%, respectively. • FDD_reg and FDD_mem denote first-level dynamically dead instructions whose results are written back to registers and memory, respectively • They account for about 9.4% and 2% of the dynamic instructions • IA64 has a large number of registers • TDD_reg and TDD_mem (transitively dynamically dead) account for 6.6% and 1.6% of the dynamic instructions
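As a quick arithmetic check, the category percentages quoted on this slide can be summed and compared with the stated un-ACE share of about 55%. The dictionary keys are just labels for the slide's categories, and grouping UNKNOWN and NOT_PROCESSED with the un-ACE total is an assumption made here for the check.

```python
# Percentages of dynamic instructions as quoted on the slide
NON_ACE_BREAKDOWN = {
    "UNKNOWN + NOT_PROCESSED": 1.0,
    "NOP": 26.0,
    "predicated false": 6.7,
    "prefetch": 1.5,
    "FDD_reg": 9.4,
    "FDD_mem": 2.0,
    "TDD_reg": 6.6,
    "TDD_mem": 1.6,
}


def total_percent(breakdown):
    """Sum the per-category percentages."""
    return sum(breakdown.values())


if __name__ == "__main__":
    total = total_percent(NON_ACE_BREAKDOWN)
    print(f"listed categories sum to {total:.1f}% "
          f"(slide states about 55% un-ACE, 45% ACE)")
```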