Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)
David J. Lilja, Department of Electrical and Computer Engineering, University of Minnesota (lilja@ece.umn.edu)

Presentation Transcript


  1. Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)
David J. Lilja
Department of Electrical and Computer Engineering, University of Minnesota
lilja@ece.umn.edu

  2. Acknowledgements
• Graduate students (who did the real work): Ying Chen, Resit Sendag, Joshua Yi
• Faculty collaborator: Douglas Hawkins (School of Statistics)
• Funders: National Science Foundation, IBM, HP/Compaq, Minnesota Supercomputing Institute

  3. Problem #1
• Speculative execution is becoming more popular: branch prediction, value prediction, speculative multithreading
• Potentially higher performance
• But what is the impact on the memory system? Does speculation pollute the cache/memory hierarchy and lead to more misses?

  4. Problem #2
• Computer architecture research relies on simulation
• Simulation is slow: years to simulate the SPEC CPU2000 benchmarks
• Simulation can be wildly inaccurate: did I really mean to build that system?
• Results are difficult to reproduce
• We need statistical rigor

  5. Outline (Part 1)
• The Superthreaded Architecture
• The Wrong Execution Cache (WEC)
• Experimental methodology
• Performance of the WEC [Chen, Sendag, Lilja, IPDPS 2003]

  6. Hard-to-Parallelize Applications
• Early-exit loops
• Pointers and aliases
• Complex branching behavior
• Small basic blocks
• Small loop counts
→ Hard to parallelize with conventional techniques.
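As a hypothetical illustration (this example is not from the talk), an early-exit loop hides its trip count from the compiler, which is one reason conventional static parallelization fails:

```python
def find_first(data, target):
    """Early-exit loop: the trip count depends on the data, so a
    conventional parallelizing compiler cannot prove at compile time
    how many iterations are safe to run concurrently."""
    for i, x in enumerate(data):
        if x == target:
            return i  # early exit ends the loop at an unknown iteration
    return -1
```

Speculative multithreading can still run later iterations in parallel, as long as there is a way to squash them if the exit is taken first.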

  7. Introducing "Maybe" Dependences
• Is there a data dependence? Pointer aliasing? The possible answers are yes, no, and maybe
• "Maybe" allows aggressive compiler optimizations: when in doubt, parallelize
• A run-time check corrects any wrong assumption.
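The run-time check can be sketched as follows (the function and data structures here are illustrative, not the actual hardware mechanism): the predecessor thread forwards the addresses it may write, and the successor thread compares its speculative load addresses against them.

```python
def check_maybe_dependence(forwarded_store_addrs, speculative_load_addrs):
    """If any speculative load in the successor thread touches an
    address the predecessor forwarded (a 'maybe' dependence that
    turned out to be real), the successor must re-execute;
    otherwise its speculative work commits."""
    violated = bool(set(forwarded_store_addrs) & set(speculative_load_addrs))
    return "re-execute" if violated else "commit"
```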

  8. Thread Pipelining Execution Model
(Diagram: threads i, i+1, and i+2 each pass through four stages, with Fork and Sync operations linking successive threads.)
• CONTINUATION – values needed to fork the next thread
• TARGET STORE – forward addresses of maybe dependences
• COMPUTATION – forward addresses and computed data as needed
• WRITE-BACK

  9. The Superthreaded Architecture
(Diagram: a shared instruction cache feeds multiple superscalar cores; each core has its own registers, program counter, execution unit, and communication/dependence buffer, and all cores share a data cache.)

  10. Wrong Path Execution Within a Superscalar Core
(Diagram legend: the predicted path is executed speculatively; when the prediction result is wrong, execution resumes on the correct path, the instructions already issued down the wrong path constitute wrong path execution, and the remaining instructions are not ready to be executed.)

  11. Wrong Thread Execution
(Diagram: in the sequential region between two parallel regions, all wrong threads from the previous parallel region are killed; within a parallel region, the successor threads are marked as wrong threads, and a wrong thread kills itself.)

  12. How Could Wrong Thread Execution Help Improve Performance?

    for (i = 0; i < 10; i++) {
        ...
        for (j = 0; j < i; j++) {
            ...
            x = y[j];
            ...
        }
        ...
    }

When i=4, the parallelized inner loop runs j=0,1,2,3 on thread units TU1–TU4, touching y[0]–y[3], while the speculatively forked wrong threads also touch y[4], y[5], … When i=5, the loop runs j=0,…,4 and finds y[0]–y[4] already cached, while the wrong threads touch y[5], y[6], … The wrong threads thus prefetch data for later iterations.
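A toy model (not the SIMCA simulator; the function name and the idealized never-evicting cache are illustrative) of why the wrong threads in the nested loop above help: threads running past the current inner-loop bound touch y elements that the next outer iteration will need, converting its misses into hits.

```python
def inner_loop_misses(wrong_threads_enabled):
    """Count first-reference misses on y[j] in the nested loop of
    slide 12, modeling the cache as a set that never evicts."""
    cache, misses = set(), 0
    for i in range(10):
        for j in range(i):          # correct iterations read y[0..i-1]
            if j not in cache:
                misses += 1
                cache.add(j)
        if wrong_threads_enabled:
            cache.add(i)            # a wrong thread speculatively loads y[i]
    return misses
```

In this idealized model every first touch of y[j] misses in the baseline, but with wrong threads enabled each element has already been loaded one outer iteration earlier.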

  13. Operation of the WEC
(Diagram: contrasts correct execution with wrong execution.)

  14. Processor Configurations for Simulations
(Table: features and configurations of SIMCA, the SIMulator for the Superthreaded Architecture.)

  15. Parameters for Each Thread Unit

  16. Characteristics of the Parallelized SPEC2000 Benchmarks

  17. Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks Baseline configuration

  18. Performance of the wth-wp-wec Configuration on Top of the Parallel Execution

  19. Performance Improvements Due to the WEC

  20. Sensitivity to L1 Data Cache Size

  21. Sensitivity to WEC Size Compared to a Victim Cache

  22. Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)

  23. Additional Loads and Reduction of Misses (%)

  24. Conclusions for the WEC
• Allow loads to continue executing even after they are known to be incorrectly issued, but do not let them change processor state
• 45.5% average reduction in the number of misses
• 9.7% average improvement on top of parallel execution
• 4% average improvement over a victim cache
• 5.6% average improvement over next-line prefetching
• Cost: 14% additional loads and minor hardware complexity
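The WEC policy summarized above can be sketched as a minimal model (class and method names are invented for illustration; the real design is a hardware structure beside the L1 data cache): wrongly issued loads are allowed to complete, but their fills go into the small side cache so they cannot pollute L1, and a later correct-path miss that hits in the side cache is serviced without going to memory.

```python
class WrongExecutionCache:
    """Toy sketch of the WEC idea, modeling caches as sets."""

    def __init__(self):
        self.l1, self.wec = set(), set()

    def load(self, addr, wrong_path):
        if wrong_path:
            # Known-wrong load: let it execute, but fill only the WEC.
            if addr not in self.l1 and addr not in self.wec:
                self.wec.add(addr)
            return "wrong-path fill"
        if addr in self.l1:
            return "L1 hit"
        if addr in self.wec:
            # Correct-path miss serviced by the WEC; promote the line.
            self.wec.discard(addr)
            self.l1.add(addr)
            return "WEC hit"
        self.l1.add(addr)
        return "miss"
```

In this sketch a wrong-path load of an address followed by a correct-path load of the same address turns what would have been a full miss into a WEC hit.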

  25. A Typical Computer Architecture Study
• Find an interesting problem/performance bottleneck, e.g. memory delays
• Invent a clever idea for solving it (this is the hard part)
• Implement the idea in a processor/system simulator (the part grad students usually like best)
• Run simulations on n "standard" benchmark programs (time-consuming and boring)
• Compare performance with and without your change: execution time, cycles per instruction (CPI), etc.

  26. Problem #2 – Simulation in Computer Architecture Research
• Simulators are an important tool for computer architecture research and design
• Low cost
• Faster than building a new system
• Very flexible

  27. Performance Evaluation Techniques Used in ISCA Papers
* Some papers used more than one evaluation technique.

  28. Simulation is Very Popular, But …
• Current simulation methodology is not formal, rigorous, or statistically based
• There are never enough simulations
• We design a new processor based on a few seconds of actual execution time
• What are the benchmark programs really exercising?

  29. An Example – Sensitivity Analysis
• Which parameters should be varied? Which should be fixed?
• What range of values should be used for each variable parameter?
• What values should be used for the constant parameters?
• Are there interactions between the variable and fixed parameters?
• What is the magnitude of those interactions?

  30. Let’s Introduce Some Statistical Rigor
• Decreases the number of errors in modeling, implementation, setup, and analysis
• Helps find errors more quickly
• Provides greater insight into the processor and the effects of an enhancement
• Provides objective confidence in results
• Provides statistical support for conclusions

  31. Outline (Part 2)
• A statistical technique for:
• Examining the overall impact of an architectural change
• Classifying benchmark programs
• Ranking the importance of processor/simulation parameters
• Reducing the total number of simulation runs
[Yi, Lilja, Hawkins, HPCA 2003]

  32. A Technique to Limit the Number of Simulations
• Plackett and Burman designs (1946): multifactorial designs, originally proposed for mechanical assemblies
• Estimates the effects of the main factors only; interactions are ignored
• Logically minimal number of experiments to estimate the effects of m input parameters (factors)
• Requires O(m) experiments instead of O(2^m) or O(v^m)

  33. Plackett and Burman Designs
• PB designs exist only in sizes that are multiples of 4
• Requires X experiments for m parameters, where X is the next multiple of 4 greater than m
• PB design matrix: rows = configurations; columns = each parameter’s value in each configuration (high/low = +1/−1)
• The first row comes from the Plackett and Burman paper; each subsequent row is a circular right shift of the preceding row; the last row is all −1
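The construction described on this slide can be written down directly (a sketch; the seed row used here is the standard PB-8 generator row for seven two-level factors in eight runs):

```python
def pb_matrix(seed):
    """Plackett-Burman design matrix: the first row is the generator
    row from the 1946 paper, each subsequent row is a circular right
    shift of the previous one, and the final row is all -1."""
    n = len(seed)
    rows = [seed[:]]
    for _ in range(n - 1):
        prev = rows[-1]
        rows.append([prev[-1]] + prev[:-1])  # circular right shift
    rows.append([-1] * n)                    # last row: all low values
    return rows

# PB-8: 7 two-level factors in 8 experiments
design = pb_matrix([+1, +1, +1, -1, +1, -1, -1])
```

Every column of the resulting matrix is balanced (equal numbers of +1 and −1) and pairwise orthogonal, which is what allows each main effect to be estimated independently of the others.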

  34. PB Design Matrix
(Slides 34–39 build up the design matrix step by step.)

  40. PB Design
• Only the magnitude of an effect is important; its sign is meaningless
• In the example, the most → least important effects are: [C, D, E] → F → G → A → B
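Computing the effects behind a ranking like the one above is a dot product per column (a sketch with made-up responses; in the case study the response is total execution time):

```python
def pb_effects(design, responses):
    """Effect of factor j = sum over runs of (factor j's +1/-1 value
    in that run) * (measured response of that run). Only the
    magnitude matters when ranking factors."""
    m = len(design[0])
    return [sum(row[j] * y for row, y in zip(design, responses))
            for j in range(m)]

# 4-run PB design for 3 factors; the responses below were chosen so
# that they depend only on the first factor.
design = [[+1, +1, -1], [-1, +1, +1], [+1, -1, +1], [-1, -1, -1]]
responses = [10, -10, 10, -10]
```

With these responses the first factor shows a large effect and the other two show none, which is exactly the separation the design is meant to expose.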

  41. Case Study #1
• Determine the most significant parameters in a processor simulator.

  42. Determine the Most Significant Processor Parameters
• Problem: a simulator has many parameters; how do we choose parameter values, and how do we decide which parameters are most important?
• Approach: choose reasonable upper/lower bounds for each parameter, then rank the parameters by their impact on total execution time.

  43. Simulation Environment
• SimpleScalar simulator, sim-outorder 3.0
• Selected SPEC CPU2000 benchmarks: gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
• MinneSPEC reduced input sets
• Compiled with gcc (PISA) at -O3

  44. Functional Unit Values

  45. Memory System Values, Part I

  46. Memory System Values, Part II

  47. Processor Core Values

  48. Determining the Most Significant Parameters
1. Run simulations to find the response, with the input parameters at their high/low or on/off values.

  49. Determining the Most Significant Parameters
2. Calculate the effect of each parameter across the configurations.

  50. Determining the Most Significant Parameters
3. For each benchmark, rank the parameters in descending order of effect (1 = most important, …).
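Step 3 can be sketched as a small helper (illustrative code, not from the paper): for each benchmark, rank the factors by the magnitude of their effects, since only magnitude matters.

```python
def rank_by_effect(effects):
    """Given one benchmark's list of effects, return a rank per
    factor: 1 = largest |effect|, 2 = next largest, and so on."""
    order = sorted(range(len(effects)), key=lambda j: -abs(effects[j]))
    ranks = [0] * len(effects)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks
```

Repeating this per benchmark and then averaging (or otherwise combining) the ranks across benchmarks yields the overall importance ordering of the simulator parameters.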
