
Dynamic Region Selection for Thread Level Speculation


Presentation Transcript


1. Dynamic Region Selection for Thread Level Speculation. Presented by: Jeff Da Silva, Stanley Fung, Martin Labrecque. Feb 6, 2004. Builds on research by Chris Colohan (CMU) and Greg Steffan.

2. Multithreading on a Chip is here TODAY! Simultaneous multithreading (ALPHA 21464, Intel Xeon, Pentium IV) and chip multiprocessors (IBM Power4/5, SUN MAJC, UltraSPARC IV) put multiple threads of execution on a single chip, from desktops to supercomputers... but what can we do with them? [Figure: several processors sharing caches on one chip]

3. Improving Performance with a Chip Multiprocessor. With a bunch of independent applications, a chip multiprocessor improves throughput (total work per second). [Figure: independent applications mapped onto separate processors and caches; each application's execution time is unchanged]

4. Improving Performance with a Chip Multiprocessor. With a single application, we need parallel threads to reduce execution time. [Figure: one application spread across multiple processors and caches, shrinking its execution time]

5. Thread-Level Speculation: the Basic Idea. Exploit available thread-level parallelism. [Figure: loop iterations run as speculative threads; when a store through *p conflicts with an earlier speculative load through *q, a violation is detected, TLS recovers, and the offending thread re-executes] A sketch of the kind of loop TLS targets follows below.
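
To make this concrete, here is a minimal C sketch of the kind of loop TLS targets; the function, data layout, and pointer names are illustrative assumptions, not code from the talk.

    #include <stddef.h>

    /* Each iteration runs as a speculative thread. If one iteration stores
     * through *p after a later (speculative) iteration has already loaded
     * through *q, and p happens to equal q, the TLS hardware flags a
     * violation and the later iteration is squashed and re-executed. */
    void process(int *data, size_t n, int *p, int *q)
    {
        for (size_t i = 0; i < n; i++) {   /* iteration == speculative thread */
            int v = *q;                    /* speculative load through *q */
            data[i] = v + 1;
            if (data[i] > 100)
                *p = 0;                    /* store through *p: violates if p
                                              aliases q across iterations */
        }
    }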

6. Support for TLS: What Do We Need? Three key elements of every TLS system:
• Break programs into speculative threads, to maximize thread-level parallelism
• Track data dependences, to determine whether speculation was safe
• Recover from failed speculation, to ensure correct execution

7. Support for TLS: What Do We Need?
• Lots of research has been done on TLS hardware: tracking data dependences and recovering from violations
• We focus on how to select regions to run in parallel
• A region is any segment of code that you want to speculatively parallelize
• For this work, region == loop and iterations == speculative threads

8. Why is static region selection hard?
• It requires extensive profiling information
• Regions can be nested:

    for ( i = 1 to N ) {          <= 2x faster in parallel
        ....
        for ( j = 1 to N ) {      <= 3x faster in parallel
            ....
            for ( k = 1 to N ) {  <= 4x faster in parallel
                ....
            }
        }
    }

  Which loop should we parallelize?
• Dynamic behaviour
Dynamic Region Selection is a potential solution.

9. Dynamic Region Selection
• The compiler transforms all candidate regions into parallel and sequential versions
• Through dynamic profiling, we decide which regions are to be run in parallel
• Key questions:
  • Is there any dynamic behaviour between region instances?
  • What is a good algorithm for selecting regions?
  • Are there performance trade-offs for doing dynamic profiling?
  • Is there any dynamic behaviour within region instances? (not the focus of this research)

10. Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work

11. Current Compilation for TLS. [Figure: the program's loop-nesting tree (LoopA through LoopH) shown twice, once labelled Sequential and once labelled Parallel]

12. DRS Compilation. [Figure: the same loop-nesting tree (LoopA through LoopH), now carrying both a sequential and a parallel version of each candidate region]

13. DRS Compilation: (1) extract a candidate region. [Figure: one region, E, lifted out of the loop tree]

14. DRS Compilation: (1) extract a candidate region; (2) create sequential and parallel versions of the region (clone). [Figure: the extracted region E duplicated into the two versions]

15. DRS Compilation: (1) extract a candidate region; (2) create sequential and parallel versions of the region (clone); (3) add some extra overhead to monitor the region's performance.

16. DRS Compilation: (1) extract a candidate region; (2) create sequential and parallel versions of the region (clone); (3) add some extra overhead to monitor the region's performance; (4) introduce a DRS algorithm to make the decision at runtime.

17. DRS Compilation (infrastructure by Colohan): extract the candidate region, clone sequential and parallel versions, add monitoring overhead, and introduce a DRS algorithm to make the decision at runtime. A sketch of the generated dispatch code follows below.
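
As a rough illustration of steps 2 through 4, the code the compiler emits around each region could look like the following C sketch; every name here (drs_choose, drs_record, region_seq, region_par) is a hypothetical stand-in, not the actual runtime interface.

    #include <time.h>

    typedef enum { RUN_SEQ, RUN_PAR } drs_choice;

    extern drs_choice drs_choose(int region_id);                   /* step 4: DRS algorithm    */
    extern void drs_record(int region_id, drs_choice c, double t); /* step 3: monitoring       */
    extern void region_seq(void);                                  /* step 2: sequential clone */
    extern void region_par(void);                                  /* step 2: parallel clone   */

    void region_dispatch(int region_id)
    {
        drs_choice c = drs_choose(region_id);  /* decide at runtime  */
        clock_t start = clock();               /* time this instance */
        if (c == RUN_PAR)
            region_par();
        else
            region_seq();
        drs_record(region_id, c, (double)(clock() - start) / CLOCKS_PER_SEC);
    }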

18. Characterizing TLS Region Behaviour. [Figure: two speedup-versus-time plots, Constant and Periodic, each with the 1x break-even line marked]

19. Characterizing TLS Region Behaviour. [Figure: two more speedup-versus-time plots, Continuous Improvement and Continuous Degradation, each with the 1x break-even line marked]

20. DRS Algorithms
• Sample Twice
• Continuous Monitoring
• Continuous Resample
• Path Sensitive Sampling

21. Sample Twice Algorithm: effective if behaviour is constant. When a region is encountered:
• 1st time: run the sequential version and record its execution time t1
• 2nd time: run the parallel version (if possible) and record its execution time tp
• Subsequent instances: if tp < t1, run the parallel version; else run the sequential version
Note that using execution time as the metric assumes the amount of work done from instance to instance stays relatively constant. Using throughput (IPC) as the metric would remove this assumption but adds complexity. A sketch of the decision rule follows below. [Figure: the Constant speedup plot]
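
As an illustration, the decision rule might look like this C sketch; the region_stats record and all names are assumptions, not the actual implementation.

    /* One record per region; the caller bumps runs and fills in the
     * measured time after each instance completes. */
    typedef struct {
        int    runs;     /* how many instances of this region have executed */
        double t_seq;    /* t1: execution time of the sequential sample     */
        double t_par;    /* tp: execution time of the parallel sample       */
    } region_stats;

    /* Returns 1 if the next instance should run the parallel version. */
    int sample_twice(const region_stats *r)
    {
        if (r->runs == 0) return 0;    /* 1st instance: sample sequential */
        if (r->runs == 1) return 1;    /* 2nd instance: sample parallel   */
        return r->t_par < r->t_seq;    /* then stick with the faster one  */
    }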

22. Sample Twice Example. [Figure: a timeline of region instances, annotated 'Sample Sequential?', then 'Sample Parallel?', then 'Decided']

23. Continuous Monitoring: effective if behaviour is continuously degrading.
• An extension of the sample-twice method: continuously monitor all regions and re-evaluate the decision whenever the speedup changes
• Since it does little more than keep monitoring, the extra overhead is essentially free
When a region is encountered:
• 1st time: run the sequential version and record its execution time t1
• 2nd time: run the parallel version (if possible) and record its execution time tp
• Subsequent instances: if tp < t1, run the parallel version and update tp; else run the sequential version and update t1
A sketch follows below. [Figure: the Continuous Improvement and Continuous Degradation speedup plots]
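
In the same illustrative style as the sample-twice sketch, continuous monitoring just refreshes t1 and tp on every instance so the decision can flip if behaviour drifts; all names are assumed.

    typedef struct { int runs; double t_seq, t_par; } cm_stats;

    /* Called as each instance finishes: ran_parallel says which version just
     * ran, elapsed is its measured time. Returns 1 if the next instance
     * should run the parallel version (the very first instance defaults to
     * sequential before any call is made). */
    int continuous_monitor(cm_stats *r, int ran_parallel, double elapsed)
    {
        if (ran_parallel) r->t_par = elapsed;   /* update tp */
        else              r->t_seq = elapsed;   /* update t1 */
        r->runs++;
        if (r->runs == 1) return 1;      /* after the seq sample, try parallel */
        return r->t_par < r->t_seq;      /* re-evaluated on every instance     */
    }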

24. Continuous Monitoring Example. [Figure: a timeline of region instances with the samples evolving: (t1 = NA, tp = NA), (t1 = 5, tp = 3), (t1 = 5, tp = 4), (t1 = 5, tp = 6), (t1 = 5, tp = NA), (t1 = 4, tp = 6), annotated 'Sample Sequential?', 'Sample Parallel?', 'Decided']

25. Continuous Resample: effective if behaviour is continuously changing.
• Continuously resample by periodically flushing the recorded values t1 and tp
• This adds new overhead
• This algorithm has not yet been explored
A possible shape for it is sketched below. [Figure: the Continuous Improvement and Continuous Degradation speedup plots]
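
Since the slides leave this algorithm unexplored, the following is only a speculative sketch of what "flushing t1 and tp periodically" could mean; RESAMPLE_PERIOD and all names are invented for illustration.

    #define RESAMPLE_PERIOD 1000   /* instances between flushes (hypothetical) */

    typedef struct { int runs; double t_seq, t_par; } cr_stats;

    /* Same calling convention as the continuous-monitoring sketch; the very
     * first instance defaults to sequential before any call is made. */
    int continuous_resample(cr_stats *r, int ran_parallel, double elapsed)
    {
        if (ran_parallel) r->t_par = elapsed;
        else              r->t_seq = elapsed;
        r->runs++;
        if (r->runs >= RESAMPLE_PERIOD)
            r->runs = 0;             /* flush: restart the two-phase sampling */
        if (r->runs == 0) return 0;  /* re-sample the sequential version      */
        if (r->runs == 1) return 1;  /* re-sample the parallel version        */
        return r->t_par < r->t_seq;  /* steady state: pick the faster version */
    }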

26. Path Sensitive Sampling. If the behaviour is periodic, a means of filtering is required. One intuitive solution is to sample when the invocation path or region nesting path changes. [Figure: the Periodic speedup plot]

27. Path Sensitive Sampling
• Sample when the region nesting path changes
• Assumes the behaviour stays the same as long as the invocation path does not change

    void foo() { while(cond) moo(); }
    void bar() { while(cond) moo(); }
    void moo() { while(cond) moo(); }

[Figure: the periodic speedup plot with phases labelled foo_while, bar_while, and moo_while] A sketch of per-path bookkeeping follows below.
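
One way to realize this, sketched under the assumption that the caller's return address is a good-enough proxy for the invocation path (a real system might hash the whole nesting path), is to keep a separate sample pair per path; all names and sizes are hypothetical.

    #include <stddef.h>

    #define MAX_PATHS 16

    typedef struct {
        void  *path;            /* invocation-path key (NULL = free slot) */
        int    runs;
        double t_seq, t_par;    /* per-path t1 and tp samples             */
    } ps_entry;

    static ps_entry table[MAX_PATHS];

    /* Look up (or create) the sample record for this invocation path. */
    ps_entry *path_stats(void *path_key)
    {
        for (int i = 0; i < MAX_PATHS; i++) {
            if (table[i].path == path_key)
                return &table[i];         /* known path: reuse its samples  */
            if (table[i].path == NULL) {
                table[i].path = path_key; /* new path: start sampling fresh */
                return &table[i];
            }
        }
        return &table[0];                 /* table full: degrade gracefully */
    }

    /* At region entry the generated code could call, e.g. with GCC:
     *   ps_entry *r = path_stats(__builtin_return_address(0));
     * and then apply the sample-twice rule to r's t_seq and t_par. */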

28. Results – Static Analysis. [Chart: average number of per-path instances for all regions]

29. Interesting Region in IJPEG. [Figure: number of speculative threads per region instance, plotted over program execution]

30. Interesting Region in Perl. [Figure: number of instructions per region instance, plotted over program execution]

31. Experimental Framework
• SPEC benchmarks
• TLS compiler
• MIPS architecture
• TLS profiler and simulator

32. Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work

  33. Is there any dynamic behavior between region instances?

34. Results – Dynamic Behavior. Regions with high coverage have low instruction variance between instances.

35. Results – Dynamic Behavior. Regions with high coverage have low violation variance between instances.

36. Results – Dynamic Behavior. Regions with high coverage have low speculative-thread-count variance between instances.

  37. What is a good algorithm for selecting regions?

38. [Figure: performance relative to the static ‘optimal’ selection, from slower to faster] Continuous monitoring is 1% better on average than sample twice, and about 10% worse than static ‘optimal’ selection.

  39. How often did we agree with the ‘optimal’ selection?

40. [Figure: agreement with the static ‘optimal’ selection] Sample twice agrees 57% of the time, on average; continuous monitoring agrees 43% of the time, on average. The levels of agreement are close → no dynamic behavior?

  41. Agreeing with static ‘optimal’ gives better performance? Another sign of no dynamic behaviour?

42. Sample twice often leaves regions undecided; overall, undecided regions represent low coverage.

43. Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work

44. Conclusions
• This is an unexplored research topic (as far as we know)
Is there any dynamic behavior between region instances?
• We have good indications that there isn't much of it
What is the best algorithm for selecting regions?
• Continuous monitoring does 1% better than sample twice
• It comes within 10% of the static ‘optimal’, without any sampling done beforehand!
Any performance trade-offs for doing dynamic profiling?
• Code size is increased by at most 30%
• The runtime performance overhead is believed to be negligible
Is there any dynamic behavior within a region instance?
• We don't know yet

45. Open Questions
• The dynamic optimal is the theoretical optimal
  • How close are we to the dynamic optimal?
  • How close is the static ‘optimal’ to the dynamic optimal?
• How do the other proposed algorithms perform?
• What should be implemented in hardware/software?

  46. Questions?

  47. AUXILIARY SLIDES

48. Results – Potential Study. [Figure: execution time versus invocation (IJPEG)]

49. Results – Potential Study. [Figure: execution time versus invocation (CRAFTY)]

50. Results – Potential Study. [Figure: execution time versus invocation (LI)]
