
Microarchitectural Techniques to Exploit Repetitive Computations and Values



Presentation Transcript


  1. Thesis Defense (Barcelona, December 14, 2005) Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente Advisors: Antonio González and Jordi Tubella

  2. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  3. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  4. Motivation • Repetition is relatively common, even with aggressive compilers • Programs are general by design • real-world programs • operating systems • Often designed with future expansion and code reuse in mind • Input sets have little variation

  5. Types of Repetition: z = F (x, y) • Repetition of computations • Repetition of values

  6. Repetitive Computations [chart: percentage of repetitive computations per benchmark, 0%–100%] Spec CPU2000, 500 million instructions

  7. Types of Repetition: z = F (x, y) • Repetition of computations • Repetition of values

  8. Repetitive Values [chart: percentage of repetitive destination values per benchmark, 0%–100%] Spec CPU2000, 500 million instructions, analysis of destination value

  9. Objectives • To improve the memory system: exploit value repetition of store instructions • redundant store instructions • non redundant data cache • To speed-up the execution of instructions: exploit computation repetition of all instructions • redundant computation buffer (ILR) • trace-level reuse (TLR) • trace-level speculative multithreaded architecture (TLS)

  10. Experimental Framework • Methodology • Analysis of benchmarks • Definition of proposal • Evaluation of proposal • Tools • Atom • Cacti 3.0 • Simplescalar Tool Set • Benchmarks • Spec CPU95 • Spec CPU2000

  11. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  12. Techniques to Improve the Memory System • Exploit value repetition • Redundant stores • Non redundant cache

  13. Redundant Store Instructions • Memory holds Value X at address @i; STORE (@i, Value Y) • If (Value X == Value Y) then the store is redundant and does NOT modify memory • Contributions • Concept of redundant stores • Analysis of repetition into the same storage location • Redundant stores applied to reduce memory traffic • Main results • 15%-25% of store instructions are redundant • 5%-20% memory traffic reduction Molina, González, Tubella, “Reducing Memory Traffic via Redundant Store Instructions”, HPCN’99
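The redundant-store check on this slide can be sketched in software. This is an illustrative model only (class and method names are invented here, not from the thesis): each store compares the incoming value against the value already held at the address and suppresses the write on a match, saving memory traffic.

```python
class RedundantStoreFilter:
    """Toy model of silent-store elimination: drop stores that would
    rewrite the value already held at the target address."""

    def __init__(self):
        self.memory = {}          # address -> current value
        self.stores = 0
        self.redundant = 0

    def store(self, addr, value):
        self.stores += 1
        if self.memory.get(addr) == value:
            self.redundant += 1   # same value already there
            return False          # write suppressed: no memory traffic
        self.memory[addr] = value
        return True

f = RedundantStoreFilter()
f.store(0x100, 7)     # first write: goes to memory
f.store(0x100, 7)     # redundant: value unchanged, write dropped
f.store(0x100, 9)     # value changed: goes to memory
print(f.redundant, "of", f.stores, "stores were redundant")
```

With the 15%-25% redundancy the slide reports, filtering stores this way is what yields the 5%-20% traffic reduction.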

  14. Non Redundant Data Cache • Conventional data cache: line Tag X holds Value A = 1234 and Value B = FFFF; line Tag Y holds Value C = 0000 and Value D = 1234 • If (Value A == Value D) then Value Repetition • Contributions • Analysis of repetition in several storage locations • Non redundant data cache (NRC) • Main results • On average, a value is stored 4 times at any given time • NRC: -32% area, -13% energy, -25% latency, +5% miss rate Molina, Aliagas, García, Tubella, González, “Non Redundant Data Cache”, ISLPED’03 Aliagas, Molina, García, González, Tubella, “Value Compression to Reduce Power in Data Caches”, EUROPAR’03
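The NRC idea can be sketched as follows, under the assumption (names are illustrative, not the thesis implementation) that cache words hold small pointers into a shared value table, so a value that appears in several locations is stored only once:

```python
class NonRedundantCache:
    """Toy model: words store indices into a shared value table
    instead of full values, deduplicating repeated values."""

    def __init__(self):
        self.value_table = []   # each distinct value stored once
        self.pointers = {}      # (tag, word) -> index into value_table

    def write(self, tag, word, value):
        if value in self.value_table:
            idx = self.value_table.index(value)   # reuse shared entry
        else:
            self.value_table.append(value)
            idx = len(self.value_table) - 1
        self.pointers[(tag, word)] = idx

    def read(self, tag, word):
        return self.value_table[self.pointers[(tag, word)]]

cache = NonRedundantCache()
cache.write("X", 0, 0x1234)
cache.write("X", 1, 0xFFFF)
cache.write("Y", 0, 0x0000)
cache.write("Y", 1, 0x1234)    # duplicate of ("X", 0): shares one entry
print(len(cache.value_table))  # 3 distinct values back 4 stored words
```

If, as the slide reports, each value is stored 4 times on average, the value table can be much smaller than a conventional data array, which is where the area and energy savings come from.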

  15. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  16. Techniques to Speed-up Instruction Execution • Exploit computation repetition: data value reuse and data value speculation • Avoid serialization caused by data dependences • Determine results of instructions without executing them • Target is to speed-up the execution of programs

  17. Data Value Reuse • NON SPECULATIVE !!! • Buffers previous inputs and their corresponding outputs • Only possible if the computation has been done in the past • Inputs have to be ready at reuse test time

  18. Data Value Speculation • SPECULATIVE !!! • Predicts values as a function of past history • Needs to confirm the speculation at a later point • Avoids the reuse test but introduces misspeculation penalty
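The contrast between the two techniques can be illustrated with a last-value predictor, one common form of value speculation (this sketch and its names are illustrative, not the thesis mechanism): a guess is supplied immediately, with no reuse test, but must be verified once the real result is known, and a mismatch costs a recovery penalty.

```python
class LastValuePredictor:
    """Toy model: predict an instruction's result as its last
    observed result, then verify when the real result arrives."""

    def __init__(self):
        self.history = {}             # pc -> last observed result
        self.mispredictions = 0

    def predict(self, pc):
        return self.history.get(pc)   # None means no prediction yet

    def verify(self, pc, actual):
        predicted = self.history.get(pc)
        correct = predicted == actual
        if predicted is not None and not correct:
            self.mispredictions += 1  # hardware would squash and recover
        self.history[pc] = actual     # always learn the newest value
        return correct

p = LastValuePredictor()
p.verify(0x400, 10)            # first execution: learn 10
print(p.predict(0x400))        # 10: speculate the same result next time
p.verify(0x400, 10)            # prediction confirmed
p.verify(0x400, 12)            # value changed: one misspeculation
print(p.mispredictions)        # 1
```

Reuse, by contrast, would only act once the inputs are ready and match a stored computation exactly, so it never pays this penalty.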

  19. Data value reuse and data value speculation can each be applied at instruction level or trace level • Instruction level: applied to a SINGLE instruction

  20. Data value reuse and data value speculation can each be applied at instruction level or trace level • Trace level: applied to a GROUP of instructions

  21. Taxonomy • Data value reuse: instruction level, trace level • Data value speculation: instruction level, trace level

  22. Instruction Level Reuse (ILR) • Redundant Computation Buffer (RCB): a reuse table indexed at decode & rename in the pipeline (fetch → decode & rename → OOO execution → commit) • Contributions • Redundant Computation Buffer (RCB) • Performance potential of ILR • Main results • Ideal ILR speed-up of 1.5 • RCB speed-up of 1.1 (outperforms previous proposals) Molina, González, Tubella, “Dynamic Removal of Redundant Computations”, ICS’99
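An instruction-level reuse table in the spirit of the RCB can be sketched like this (capacity handling and indexing are simplified assumptions, not the RCB design): the table maps an instruction, identified by its PC, plus its input operand values to the result computed last time, and a hit lets the pipeline skip execution entirely.

```python
class ReuseTable:
    """Toy model of instruction-level reuse: remember (pc, inputs)
    -> result and skip execution on an exact match."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.table = {}                # (pc, inputs) -> result

    def lookup(self, pc, inputs):
        # reuse test: inputs must be ready and match a stored computation
        return self.table.get((pc, inputs))

    def update(self, pc, inputs, result):
        if len(self.table) < self.entries:   # crude capacity limit
            self.table[(pc, inputs)] = result

rt = ReuseTable()
pc = 0x1000                        # say: add r3 = r1 + r2
rt.update(pc, (4, 5), 4 + 5)       # record the computation at commit
print(rt.lookup(pc, (4, 5)))       # 9: hit, execution can be skipped
print(rt.lookup(pc, (4, 6)))       # None: miss, must execute
```

Because the test is exact, this is the non-speculative path: a hit is always safe to consume, which is why the RCB speed-up (1.1) is bounded by how often inputs are ready and matching.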

  23. Trace Level Reuse (TLR) • A trace is a dynamic sequence of instructions (I1 I2 I3 I4 I5 I6) reused as a single unit • Contributions • Trace Level Reuse • Initial design issues for integrating TLR • Performance potential of TLR • Main results • Ideal TLR speed-up of 3.6 • 4K-entry table: 25% of reuse, average trace size of 6 González, Tubella, Molina, “Trace-Level Reuse”, ICPP’99
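Lifting reuse from one instruction to a trace can be sketched as follows (a toy model under assumed names, not the thesis design): a whole dynamic sequence is summarized by its live-input values and live-output values, so a matching live-input set lets all of I1..I6 be skipped at once.

```python
class TraceReuseTable:
    """Toy model of trace-level reuse: a trace is identified by its
    start PC and live-input values; a hit yields its live-output
    values and the PC where execution should resume."""

    def __init__(self):
        self.table = {}  # (start_pc, live_inputs) -> (live_outputs, next_pc)

    def record(self, start_pc, live_inputs, live_outputs, next_pc):
        self.table[(start_pc, live_inputs)] = (live_outputs, next_pc)

    def try_reuse(self, start_pc, live_inputs):
        # non-speculative: every live-input value must be ready and equal
        return self.table.get((start_pc, live_inputs))

trt = TraceReuseTable()
# a 6-instruction trace that reads r1, r2 and writes r5
trt.record(0x2000, (("r1", 3), ("r2", 8)), (("r5", 11),), 0x2018)
hit = trt.try_reuse(0x2000, (("r1", 3), ("r2", 8)))
print(hit)   # live outputs plus resume PC: the whole trace is skipped
```

With the slide's average trace size of 6, each hit removes six instructions rather than one, which is why the ideal speed-up (3.6) is much larger than for instruction-level reuse.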

  24. Trace Level Speculation (TLS) • Two orthogonal issues • Microarchitecture support for trace speculation (TSMA) • Control and data speculation techniques: static analysis based on profiling info • Compiler analysis to support TSMA • Contributions • Trace Level Speculative Multithreaded Architecture • Main results • Speed-up of 1.38 with 20% of misspeculations Molina, González, Tubella, “Trace-Level Speculative Multithreaded Architecture (TSMA)”, ICCD’02 Molina, González, Tubella, “Compiler Analysis for TSMA”, INTERACT’05 Molina, Tubella, González, “Reducing Misspeculation Penalty in TSMA”, ISHPC’05

  25. Objectives & Proposals • To improve the memory system • Redundant store instructions • Non redundant data cache • To speed-up the execution of instructions • Redundant computation buffer (ILR) • Trace-level reuse buffer (TLR) • Trace-level speculative multithreaded architecture (TLS)

  26. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  27. Motivation • Caches occupy close to 50% of the total die area • Caches are responsible for a significant part of the total power dissipated by a processor

  28. Data Value Repetition [chart: percentage of repetitive values vs percentage of time] Spec CPU2000, 1 billion instructions, 256KB data cache

  29. Conventional Cache • Line Tag X holds Value A = 1234 and Value B = FFFF; line Tag Y holds Value C = 0000 and Value D = 1234 • If (Value A == Value D) then Value Repetition

  30. Non Redundant Data Cache • Pointer Table and Value Table: Tag X and Tag Y point into a shared Value Table holding 1234, FFFF, 0000 • Storing each distinct value once yields a die area reduction

  31. Non Redundant Data Cache • Additional hardware: pointers • Each Pointer Table entry (Tag X, Tag Y) references a shared Value Table entry (1234, FFFF, 0000)

  32. Non Redundant Data Cache • Additional hardware: counters • Each Value Table entry keeps a counter (here 1, 2, 1) of how many pointers reference it

  33. Data Value Inlining • Some values can be represented with a small number of bits (narrow values) • Narrow values can be inlined into the pointer area • Simple sign extension is applied • Benefits • enlarges the effective capacity of the Value Table • reduces latency • reduces power dissipation
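The narrow-value test behind data value inlining can be sketched as a sign-extension check (bit widths here are illustrative parameters, not the thesis configuration): if sign-extending the low pointer-width bits reproduces the full word, the value fits in the pointer field itself and needs no Value Table entry.

```python
def is_narrow(value, pointer_bits=8, word_bits=32):
    """Return True if `value` survives truncation to `pointer_bits`
    bits followed by sign extension back to `word_bits` bits."""
    mask = (1 << word_bits) - 1
    low = value & ((1 << pointer_bits) - 1)
    if low & (1 << (pointer_bits - 1)):   # sign bit of the narrow field
        low -= 1 << pointer_bits          # sign-extend to full width
    return (low & mask) == (value & mask)

print(is_narrow(0x0000))        # True: small value, inlined
print(is_narrow(0x007F))        # True: largest positive narrow value
print(is_narrow(0xFFFFFFFF))    # True: -1 sign-extends correctly
print(is_narrow(0x1234))        # False: needs a Value Table entry
```

Every value that passes this test frees a Value Table slot, which is how inlining enlarges the table's effective capacity.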

  34. Non Redundant Data Cache • Data Value Inlining: narrow values (e.g. 0, 1, 2, F) are stored directly in the Pointer Table entry instead of occupying a Value Table entry

  35. Miss Rate vs Die Area [chart: miss ratio vs die area (0.1–1.0 cm2) for L2 caches of 256KB, 512KB, 1MB, 2MB and 4MB; configurations CONV, VT20, VT30, VT50] Spec CPU2000, 1 billion instructions

  36. Results • Caches ranging from 256 KB to 4 MB

  37. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  38. Trace Level Speculation • Avoids serialization caused by data dependences • Skips multiple instructions in a row • Predicts values based on the past • Solves the live-input test • Introduces penalties due to misspeculations

  39. Trace Level Speculation • Two orthogonal issues • microarchitecture support for trace speculation • control and data speculation techniques • prediction of initial and final points • prediction of live output values • Trace Level Speculative Multithreaded Architecture (TSMA) • does not introduce significant misspeculation penalties • Compiler Analysis • based on static analysis that uses profiling data

  40. Trace Level Speculation with Live-Output Test • Instruction flow passes through instruction speculation, instruction execution, and instruction validation • Live-output update drives trace speculation; on a trace misspeculation, detection and recovery actions are taken • ST: speculative thread, NST: non-speculative thread

  41. TSMA Block Diagram • Front end: fetch engine, branch predictor, instruction cache, decode & rename, trace speculation engine • Per thread (ST and NST): instruction window, load/store queue, reorder buffer, architectural register file • Shared: functional units, look ahead buffer, verification engine • Memory: speculative data cache (L1SDC) and non-speculative data caches (L1NSDC, L2NSDC)

  42. Compiler Analysis • Focuses on • developing effective trace selection schemes for TSMA • based on static analysis that uses profiling data • Trace Selection • Graph Construction (CFG & DDG) • Graph Analysis

  43. Graph Analysis • Two important issues • initial and final point of a trace • maximize trace length & minimize misspeculations • predictability of live output values • prediction accuracy and utilization degree • Three basic heuristics • Procedure Trace Heuristic • Loop Trace Heuristic • Instruction Chaining Trace Heuristic

  44. Trace Speculation Engine • Traces are communicated to the hardware • at program loading time • filling a special hardware structure (trace table) • Each entry of the trace table contains • initial PC • final PC • live-output values information • branch history • frequency counter
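The trace table entry described on this slide can be sketched as a record (field names mirror the bullet list; the types and the lookup structure are assumptions for illustration, not the 128-entry 4-way hardware):

```python
from dataclasses import dataclass

@dataclass
class TraceTableEntry:
    initial_pc: int        # where the trace starts
    final_pc: int          # where execution resumes after speculation
    live_outputs: dict     # live-output values information
    branch_history: int    # history used to qualify the trace
    frequency: int = 0     # how often this trace was profiled

class TraceTable:
    """Toy model: 4-way set-associative in the thesis, a plain
    PC-indexed dictionary here."""

    def __init__(self):
        self.entries = {}  # initial_pc -> TraceTableEntry

    def load(self, entry):
        # traces are communicated to the hardware at program loading time
        self.entries[entry.initial_pc] = entry

    def lookup(self, pc):
        return self.entries.get(pc)

tt = TraceTable()
tt.load(TraceTableEntry(0x4000, 0x4040, {"r7": 42}, branch_history=0b0110))
hit = tt.lookup(0x4000)
print(hit.final_pc - hit.initial_pc)   # 64: bytes spanned by the trace
```

When fetch reaches an initial PC with a matching entry, the engine spawns the speculative thread at `final_pc` with the predicted live-output values.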

  45. Simulation Parameters • Base microarchitecture • out of order machine, 4 instructions per cycle • I cache: 16KB, D cache: 16KB, L2 shared: 256KB • bimodal predictor • 64-entry ROB, FUs: 4 int, 2 div, 2 mul, 4 fps • TSMA additional structures • each thread: I window, reorder buffer, register file • speculative data cache: 1KB • trace table: 128 entries, 4-way set associative • look ahead buffer: 128 entries • verification engine: up to 8 instructions per cycle

  46. Speedup [chart: speed-up per benchmark, ranging from 1.00 to 1.45] Spec CPU2000, 250 million instructions

  47. Misspeculations [chart: misspeculation rate per benchmark] Spec CPU2000, 250 million instructions

  48. Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed-up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work

  49. Conclusions • Repetition is very common in programs • Can be applied • to improve the memory system • to speed-up the execution of instructions • Investigated several alternatives • Novel cache organizations • Instruction level reuse approach • Trace level reuse concept • Trace level speculation architecture

  50. Future Work • Value repetition in instruction caches • Profiling to support data value reuse schemes • Traces starting at different PCs • Value prediction in TSMA • Multiple speculations in TSMA • Multiple threads in TSMA
