
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors


Presentation Transcript


  1. Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors MASS Lab. Kim Ik Hyun.

  2. Outline • Introduction • Software infrastructures for experiments • Experimental Framework • Performance Evaluation • Conclusions

  3. Introduction • Background • The speed gap between the processor and the memory system results in large memory latency • What is a helper thread? • A pre-execution technique • It prefetches cache blocks ahead of the main thread to tolerate memory latency

  4. Helper thread • Must detect dynamic program behavior at run time • the same static load incurs a different number of cache misses in different time phases • Must be invoked judiciously to avoid potential performance degradation • hyper-threaded processor resources are shared or partitioned in multi-threading mode • Needs a low-overhead thread synchronization mechanism • helper threads need to be activated and synchronized very frequently

  5. Outline • Introduction • Software infrastructures for experiments • Experimental Framework • Performance Evaluation • Conclusions

  6. Software infrastructures for experiments - compiler • Compiler to construct helper threads • Step 1: identify the loads that incur a large number of cache misses and also account for a large fraction of the total execution time • Step 2: select the loop to target • Step 3: identify the pre-computation slices and the live-in variables • Step 4: place trigger points in the original program and generate the helper thread code (the four steps are sketched below)
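To make the four steps concrete, here is a minimal C sketch on a pointer-chasing loop; the function names, the trigger placement, and the volatile-load prefetch idiom are illustrative assumptions, not the compiler's actual output:

    typedef struct node { struct node *next; long key; } node_t;

    /* Steps 1-2: node->key below is the delinquent load, and this loop
     * is the region selected for helper threading. */
    long sum_keys(node_t *head)
    {
        long sum = 0;
        /* Step 4: a trigger point placed here would wake the helper
         * with the live-in value `head` before the loop starts. */
        for (node_t *n = head; n != NULL; n = n->next)
            sum += n->key;                 /* frequent L2 misses */
        return sum;
    }

    /* Steps 3-4: generated helper; only the backward slice of the address
     * computation (n = n->next) survives, with `head` as the sole live-in. */
    void helper_sum_keys(void *live_in)
    {
        for (node_t *n = (node_t *)live_in; n != NULL; n = n->next)
            (void)*(volatile long *)&n->key;   /* touch the line, discard value */
    }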

  7. Delinquent load identification • Identify the top cache-missing loads, known as delinquent loads, through profile feedback (Intel VTune performance analyzer) • A compiler module identifies the delinquent loads and keeps track of the cycle costs associated with them • The delinquent loads that account for a large portion of the entire execution time are selected as targets for helper threads.

  8. Loop selection • The key criterion is to minimize the overhead of thread management • One goal is to minimize the number of helper thread invocations • this is accomplished by ensuring that the trip count of the outer loop encompassing the candidate loop is small • Another goal is that the helper thread, once invoked, runs for an adequate number of cycles • so it is desirable to choose a loop that iterates a reasonably large number of times.

  9. Loop selection algorithm • The analysis starts from the innermost loop that contains the delinquent loads • and keeps searching for the next outer loop • until the loop trip count exceeds a threshold • or until the next outer loop's trip count is less than twice the trip count of the currently processed loop • Searching also ends when the analysis reaches the outermost loop within the procedure boundary (sketched in code below)
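A C-style sketch of that outward walk; the loop_info structure, the profiled trip_count field, and TRIP_THRESHOLD are assumptions for illustration:

    #define TRIP_THRESHOLD 1000     /* illustrative cutoff for "runs long enough" */

    typedef struct loop_info {
        struct loop_info *outer;    /* NULL at the procedure boundary */
        long trip_count;            /* profiled average trip count */
    } loop_info;

    /* Walk outward from the innermost loop containing the delinquent loads. */
    loop_info *select_loop(loop_info *innermost)
    {
        loop_info *cur = innermost;
        while (cur->outer != NULL) {                  /* stop at the outermost loop */
            if (cur->trip_count > TRIP_THRESHOLD)     /* already iterates enough */
                break;
            if (cur->outer->trip_count < 2 * cur->trip_count)
                break;                                /* outer loop adds little */
            cur = cur->outer;
        }
        return cur;
    }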

  10. Slicing • Identifies the instructions to be executed in the helper threads • Step 1: within the selected loop, the compiler module starts from a delinquent load and traverses the dependence edges backwards • Step 2: only the statements that affect the address computation of the delinquent load are selected • Step 3: all stores to heap objects or global variables are removed from the slice (a before/after sketch follows below)
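A minimal before/after sketch of the slice, assuming an indirect-indexing loop; the volatile load standing in for a prefetch is an illustrative idiom:

    /* Before slicing: the loop mixes the address computation with other work. */
    void before(long *data, int *index, long *hist, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            int j = index[i];          /* feeds the delinquent load's address */
            sum += data[j];            /* delinquent load */
            hist[j]++;                 /* store to a global/heap object */
        }
        (void)sum;
    }

    /* After slicing: only statements affecting data[j]'s address survive
     * (steps 1-2), and the store to hist[] is removed (step 3) so the
     * helper is side-effect free. */
    void slice(long *data, int *index, long n)
    {
        for (long i = 0; i < n; i++) {
            int j = index[i];
            (void)*(volatile long *)&data[j];   /* prefetch only */
        }
    }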

  11. Live-in variable identification and Code generation • The live-in variables to the helper thread are identified • The constructed helper threads are attached to the application program as separate code.

  12. Helper thread execution • When the main thread encounters a trigger point • it first passes the function pointer of the corresponding helper thread and the live-in variables, and wakes up the helper thread • the helper thread indirectly jumps to the designated helper thread code region • reads in the live-ins, and starts execution (sketched below)
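A minimal sketch of that handshake; the trigger mailbox and the wake_helper()/wait_for_trigger() primitives are assumptions (slide 22 shows a possible Win32 implementation of the primitives):

    /* Shared mailbox written by the main thread at a trigger point. */
    struct trigger {
        void (*helper_fn)(void *);   /* pointer to the designated helper code region */
        void *live_ins;              /* live-in values for the helper to read */
    } g_trigger;

    void wake_helper(void);          /* sync primitives; see slide 22 */
    void wait_for_trigger(void);

    /* Main thread, at a compiler-inserted trigger point. */
    void fire_trigger(void (*fn)(void *), void *live_ins)
    {
        g_trigger.helper_fn = fn;
        g_trigger.live_ins  = live_ins;
        wake_helper();
    }

    /* Helper thread main loop: sleep until triggered, then jump
     * indirectly to the helper code region and start execution. */
    void helper_main(void)
    {
        for (;;) {
            wait_for_trigger();
            g_trigger.helper_fn(g_trigger.live_ins);
        }
    }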

  13. Software infrastructures for experiments - EmonLite • A light-weight mechanism to monitor dynamic events such as cache misses at very fine sampling granularity • Profiles through direct use of the performance monitoring events supported on Intel processors • Allows the compiler to instrument any location of the program code to read directly from the Performance Monitoring Counters (PMCs) • Supports dynamic optimizations such as dynamic throttling of helper thread activation and termination.

  14. EmonLite vs. VTune • VTune provides only a summary of the sampling profile for the entire program execution, whereas EmonLite provides the chronology of the performance monitoring events • VTune's sampling-based profiling relies on buffer overflow of the PMCs to trigger an event exception handler registered with the OS; EmonLite reads the counter values directly from the PMCs by executing four assembly instructions.

  15. Components of EmonLite • emonlite_begin() • initializes and programs a set of EMON-related Machine Specific Registers (MSRs) • emonlite_sample() • reads the counter values from the PMCs and is inserted in the user code of interest (a sketch follows below)
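A sketch of what emonlite_sample() could look like in GCC inline assembly; it assumes the OS enables user-level RDPMC (CR4.PCE) and that emonlite_begin(), which needs ring-0 access to program the EMON MSRs, has already run:

    /* Read performance-monitoring counter `idx` with RDPMC: roughly the
     * four instructions the previous slide mentions (mov ecx; rdpmc;
     * two movs to save EDX:EAX). */
    static inline unsigned long long emonlite_sample(unsigned int idx)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
        return ((unsigned long long)hi << 32) | lo;
    }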

  16. Implementation of EmonLite • Step 1: the delinquent loads are first identified and appropriate loops are selected • Step 2: the compiler inserts the instrumentation code into the user program • Step 3: the compiler inserts code to read the PMC values once every few iterations.

  17. Example of EmonLite code instrumentation
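A minimal sketch of the instrumentation pattern from the previous slide, assuming the emonlite_sample() sketch above; SAMPLE_PERIOD and the record_sample() logger are hypothetical:

    #define SAMPLE_PERIOD 64        /* power of two; read PMCs every 64 iterations */

    extern unsigned long long emonlite_sample(unsigned int idx);  /* slide 15 sketch */
    extern void record_sample(unsigned long long misses);         /* hypothetical logger */

    void instrumented_loop(long *data, int *index, long n)
    {
        long sum = 0;
        unsigned long long prev = emonlite_sample(0);   /* emonlite_begin() ran earlier */

        for (long i = 0; i < n; i++) {
            sum += data[index[i]];                      /* loop under study */
            if ((i & (SAMPLE_PERIOD - 1)) == 0) {       /* once every few iterations */
                unsigned long long now = emonlite_sample(0);
                record_sample(now - prev);              /* one point in the chronology */
                prev = now;
            }
        }
        (void)sum;
    }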

  18. Example usage of EmonLite

  19. Outline • Introduction • Software infrastructures for experiments • Experimental Framework • Performance Evaluation • Conclusions

  20. Experimental Framework • System configuration

  21. Experimental Framework • Hardware resource management in Intel hyper-threaded processors

  22. Experimental Framework • Two thread synchronization mechanisms • the Win32 API events, SetEvent() and WaitForSingleObject(), can be used for thread management • a hardware mechanism, actually implemented in real silicon as an experimental feature, provides lighter-weight synchronization (the Win32 variant is sketched below)
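A minimal Win32 sketch implementing the wake_helper()/wait_for_trigger() primitives assumed on slide 12; the auto-reset event makes each SetEvent() release exactly one waiter:

    #include <windows.h>

    static HANDLE g_wake;   /* auto-reset: each SetEvent() releases one wait */

    void sync_init(void)
    {
        g_wake = CreateEvent(NULL, FALSE, FALSE, NULL);
    }

    void wake_helper(void)      { SetEvent(g_wake); }                      /* main thread */
    void wait_for_trigger(void) { WaitForSingleObject(g_wake, INFINITE); } /* helper */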

  23. Experimental Framework • Benchmarks • MCF and BZIP2 from SPEC CINT2000 • ART from SPEC CFP2000 • MST and EM3D from the Olden benchmark suite • Best compiler options • VTune Analyzer

  24. Thread pinning • The Windows OS periodically reschedules a user thread onto different logical processors • a user thread and its helper thread, as two OS threads, could potentially compete with each other to be scheduled on the same logical processor • the compiler adds a call to the Win32 API SetThreadAffinityMask() to manage thread affinity (sketched below)
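A minimal sketch of such pinning; the assumption that logical CPUs 0 and 1 are the two logical processors of one hyper-threaded physical processor is illustrative:

    #include <windows.h>

    /* Pin the main thread and its helper onto distinct logical processors
     * of one physical package, so the OS scheduler cannot place them on
     * the same logical processor. */
    void pin_threads(HANDLE main_thread, HANDLE helper_thread)
    {
        SetThreadAffinityMask(main_thread,   (DWORD_PTR)1 << 0);
        SetThreadAffinityMask(helper_thread, (DWORD_PTR)1 << 1);
    }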

  25. Thread pinning • Normalized execution time without thread pinning

  26. Helper threading scenarios • Static trigger • Loop-based trigger • inter-thread synchronization occurs only once per instance of the targeted loop • Sample-based trigger • the helper thread is invoked once for every few iterations of the targeted loop

  27. Helper threading scenarios • Dynamic trigger • helper threads may not always be beneficial • built on top of the sample-based trigger • the main thread dynamically decides whether or not to invoke a helper thread for a particular sample period (sketched below)
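A sketch tying the earlier pieces together: the main thread gates helper invocation on the miss count of the previous sample period. MISS_THRESHOLD, the 128-iteration period, and the loop body are illustrative assumptions:

    #define MISS_THRESHOLD 4096ULL      /* illustrative cutoff per sample period */

    extern unsigned long long emonlite_sample(unsigned int idx);   /* slide 15 sketch */
    extern void fire_trigger(void (*fn)(void *), void *live_ins);  /* slide 12 sketch */
    extern void helper_fn(void *live_ins);                         /* generated helper */

    void main_loop(long *data, int *index, long n)
    {
        unsigned long long prev = emonlite_sample(0);
        long sum = 0;

        for (long i = 0; i < n; i++) {
            if ((i & 127) == 0) {                         /* sample-period boundary */
                unsigned long long now = emonlite_sample(0);
                if (now - prev > MISS_THRESHOLD)          /* last period was miss-heavy */
                    fire_trigger(helper_fn, &index[i]);   /* prefetch for this period */
                prev = now;
            }
            sum += data[index[i]];                        /* main computation */
        }
        (void)sum;
    }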

  28. Outline • Introduction • Software infrastructures for experiments • Experimental Framework • Performance Evaluation • Conclusions

  29. Performance Evaluation • Speedup of static trigger

  30. Performance Evaluation • (L/S = loop-based/sample-based trigger; O/H = OS-based/hardware-based synchronization) • LO vs. LH: LH is faster by 1.8% • SO vs. SH: SH is faster by 5.5% • LO vs. SO: LO is faster (except on EM3D) • LH vs. SH • As the thread synchronization cost becomes even lower, the sample-based trigger is expected to be more effective.

  31. Dynamic behavior of performance events with and without helper threading

  32. Outline • Introduction • Software infrastructures for experiments • Experimental Framework • Performance Evaluation • Conclusions

  33. Conclusions • Impediments to speedup • Potential resource contention with the main thread must be minimized so as not to degrade the performance of the main thread • Dynamic throttling of helper thread invocation is important for achieving effective prefetching benefit without suffering potential slowdown • Having very light-weight thread synchronization and switching mechanisms is crucial.

  34. Conclusions • Future work • run-time mechanisms to build a practical dynamic throttling framework • lighter-weight user-level thread synchronization mechanisms.
