
The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism

Mihai Burcea, J. Gregory Steffan, Cristiana Amza. University of Toronto. MSPC 2008.


Presentation Transcript


  1. The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism. Mihai Burcea, J. Gregory Steffan, Cristiana Amza. University of Toronto. MSPC 2008.

  2. Getting the Most Out of Your CPUs • Chip multiprocessors (CMPs) are ubiquitous • How do we exploit all this parallelism? • How do we improve sequential applications? [Pictured: AMD Barcelona quad-core, Intel Kentsfield quad-core.]

  3. Optimistic Parallelism • Flavors: Transactional Memory (TM), Thread-Level Speculation (TLS) • Implementations: hardware, software, hybrid • Common required support: buffering speculative memory changes; tracking and detecting memory access conflicts

  4. Traditional Access Tracking. Most approaches use some fixed granularity: • Hardware TM/TLS: cache-line size (typically 32/64/128 bytes) • Software TLS: word- or object-level • Software TM: word, page, or object granularity • Hybrid TM: mixture of the above (in HW/SW) • Is fixed granularity the best approach?

  5. Can We Reduce the Overhead of Dependence Tracking? Key intuition: the “best” granularity likely varies within and across benchmarks. [Figure: granularity axis running from fine (too much overhead) to coarse (too many false conflicts).]

  6. False Conflicts when Using Uniform Coarse Granularity. [Figure: false conflicts measured in a TLS simulator for cache-line sizes of 32, 64, and 128 bytes.] A uniform coarse-grain approach suffers false conflicts.
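
To make the notion of a false conflict concrete, here is a minimal sketch in Python. It is not taken from the presentation or its simulator; the addresses and the helper names `block_id` and `conflicts` are hypothetical, and it assumes tracking blocks are aligned to their own size.

```python
def block_id(addr: int, granularity: int) -> int:
    """Map a byte address to the index of the tracking block containing it."""
    return addr // granularity


def conflicts(write_addr: int, read_addr: int, granularity: int) -> bool:
    """A conflict is reported when both accesses fall in the same block."""
    return block_id(write_addr, granularity) == block_id(read_addr, granularity)


# Hypothetical accesses: one thread writes the word at 0x1000 while
# another reads a *different* word 16 bytes away, at 0x1010.
WRITE_ADDR, READ_ADDR = 0x1000, 0x1010

for g in (4, 32, 64, 128):
    # At 4 bytes no conflict is reported; at 32, 64, and 128 bytes the two
    # words share a block, so a conflict is reported although no data is shared.
    print(f"{g:>3} bytes: conflict reported = {conflicts(WRITE_ADDR, READ_ADDR, g)}")
```

Any conflict reported at 32 bytes or more in this example is false, since the two threads never touch the same word.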

  7. Is there potential for a variable-granularity approach?

  8. Goals Of Our Work • Show potential for Variable-Granularity Access Tracking (VGAT) • Finest grain too expensive; which coarse grain? • Show that ideal granularity varies across and within applications • Suggests need for dynamic, adaptive scheme • Show significant reduction in number of tracked memory ranges when using VGAT

  9. Related Work • Hardware TLS/TM track accesses at cache-line size (32/64/128 bytes): Stampede (Steffan et al., ACM Trans. 2005), Speculative Versioning Cache (Vijaykumar et al., HPCA 1998), Unbounded TM (Ananian et al., HPCA 2005), LogTM (Moore et al., HPCA 2006) • Software TLS: word (Cintra et al., PPoPP 2003), object (Pickett et al., LCPC 2005) • Software TM: word (McRT-STM, Saha et al., PPoPP 2006), page (Manassiev et al., PPoPP 2006), object (RSTM, Marathe et al., PLDI 2006; DSTM, Herlihy et al., PODC 2003) • Most systems use a fixed or object grain, which is not necessarily the best

  10. Related Work: Bulk Disambiguation (Ceze et al., ISCA 2006) • Encodes read/write sets into signatures • Detects conflicts by performing (fast) operations on signatures • Hashing (encoding) addresses into signatures admits false positives • Reduces conflict-detection traffic, but increases false conflicts • Our goal: minimize false conflicts
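
The trade-off above can be illustrated with a toy signature. This is only a sketch of the general idea of hashed address sets: the 16-bit width, the single modulo hash, and the names `encode` and `may_conflict` are assumptions made for illustration, not the encoding used by Bulk.

```python
SIG_BITS = 16  # hypothetical signature width, far smaller than a real design


def encode(addresses, bits: int = SIG_BITS) -> int:
    """Fold a set of byte addresses into a fixed-width bit-vector signature."""
    sig = 0
    for addr in addresses:
        word = addr >> 2            # word address (4-byte words assumed)
        sig |= 1 << (word % bits)   # one toy hash function: modulo the width
    return sig


def may_conflict(sig_a: int, sig_b: int) -> bool:
    """Intersect two signatures; a non-empty result means 'possible conflict'."""
    return (sig_a & sig_b) != 0


write_sig = encode({0x1000})

# Word 0x1004 maps to a different bit: no conflict is reported.
print(may_conflict(write_sig, encode({0x1004})))  # False

# Word 0x1040 is a different word that hashes to the same bit as 0x1000:
# the intersection is non-empty, which is a false positive.
print(may_conflict(write_sig, encode({0x1040})))  # True
```

The intersection is a single bitwise AND regardless of set size, which is what makes the check cheap, while aliasing in the hash is what introduces false conflicts.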

  11. Variable Granularity Access Tracking • Approaches: vary granularity across time (parts of applications, i.e., speculative code regions) and across space (ranges of memory) • Can potentially reduce: tracking storage, tracking traffic, commit latency, false conflicts

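  12. Impact on Conflicts of Increasing Granularity. [Figure: starting from the true (actual) conflicts, increasing the granularity is still acceptable while the number of conflicts stays the same; beyond that point, extra (false) conflicts appear.] Ideal granularity: the coarsest granularity that incurs no false conflicts.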
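
The definition of ideal granularity can be sketched as a short search. This is an illustrative sketch, not the authors' measurement code: it assumes word-aligned addresses, power-of-two block sizes from 4 bytes up to a 4096-byte page, and it counts a conflict as a read whose block was also written (read-after-write only). The names `conflicting_reads` and `ideal_granularity` are hypothetical.

```python
WORD = 4      # finest granularity considered (bytes)
PAGE = 4096   # coarsest granularity considered (bytes)


def conflicting_reads(writes, reads, granularity):
    """Reads whose tracking block also contains at least one written address."""
    written_blocks = {w // granularity for w in writes}
    return {r for r in reads if r // granularity in written_blocks}


def ideal_granularity(writes, reads, finest=WORD, coarsest=PAGE):
    """Coarsest power-of-two granularity that reports only the true conflicts."""
    true_conflicts = conflicting_reads(writes, reads, finest)
    ideal = finest
    g = finest * 2
    while g <= coarsest:
        if conflicting_reads(writes, reads, g) != true_conflicts:
            break  # coarser blocks would add false conflicts
        ideal = g
        g *= 2
    return ideal


# One true conflict (0x1000) plus an unrelated read 64 bytes away:
# blocks up to 64 bytes keep them apart, 128-byte blocks merge them.
print(ideal_granularity({0x1000}, {0x1000, 0x1040}))  # 64

# An unrelated read in the adjacent word forces word-level tracking.
print(ideal_granularity({0x1000}, {0x1004}))          # 4
```

Because aligned power-of-two blocks nest, the reported conflict set can only grow as the granularity doubles, so stopping at the first mismatch is sufficient.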

  13. Measuring the Potential for VGAT

  14. Experimental Framework • TLS simulator (CMU) • Subset of SPECint2000 benchmarks • Instrumented for TLS • TLS regions mostly loop-based • TLS regions pre-selected based on 32-byte read and 4-byte write granularity • Focus on specific aspects: simulate the first billion instructions; track only read-after-write dependences • Speculative code regions were pre-selected for 32 bytes, so our results are conservative

  15. Variable Granularity at Code Region Level. [Figure: three speculative code regions, each running between a fork and a join, track the memory they access at a per-region granularity: Region 1 at 4 bytes, Region 2 at 32 bytes, Region 3 at 8 bytes.]

  16. Ideal Granularity at Code Region Level. [Figure: ideal granularity per code region, with word-level, cache-line-level, and page-level (4 KB) marked. Code regions with no conflicts are not shown in the figure (in parentheses).] Ideal granularity varies significantly between code regions.

  17. Variable Granularity Across Memory Ranges. [Figure: the memory accessed by each of three speculative code regions, each running between a fork and a join, is split into ranges tracked at 4, 8, or 32 bytes.]

  18. Ideal Granularity Across Memory Ranges • Cache-line size is sometimes good, sometimes not • Word-level is rarely necessary • Page-level is often sufficient • Ideal granularity varies widely across memory ranges
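
One way to picture per-range granularity is a lookup table from address ranges to block sizes. The sketch below is purely illustrative: the range boundaries, the granularities assigned to them, and the names `RANGES`, `granularity_for`, and `tracking_key` are all hypothetical, and each range is assumed to start at a multiple of its granularity.

```python
import bisect

# Hypothetical table: (range start address, tracking granularity in bytes).
# Each range extends up to the start of the next one.
RANGES = [
    (0x0000, 4096),  # page-level tracking
    (0x8000, 32),    # cache-line-level tracking
    (0x9000, 4),     # word-level tracking
]
STARTS = [start for start, _ in RANGES]


def granularity_for(addr: int) -> int:
    """Granularity of the range containing addr (addr >= 0 assumed)."""
    index = bisect.bisect_right(STARTS, addr) - 1
    return RANGES[index][1]


def tracking_key(addr: int):
    """Key under which an access is tracked: (granularity, block index)."""
    g = granularity_for(addr)
    return (g, addr // g)


# Two words in the same 32-byte block share one tracked element ...
print(tracking_key(0x8000) == tracking_key(0x8010))  # True
# ... while neighbouring words in the word-level range stay separate.
print(tracking_key(0x9000) == tracking_key(0x9004))  # False
```

A sorted table with binary search keeps the lookup logarithmic in the number of ranges; a real system would also need a policy for choosing and updating the entries.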

  19. Can VGAT improve performance?

  20. Reducing the Number of Tracked Elements by Using Variable Granularity. [Figure: chart with data labels 51, 61, 31, 458, 50, 35, 9, 5, 3.] VGAT can reduce the number of tracked elements by more than 3x!
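
The arithmetic behind a reduction in tracked elements can be sketched as follows. The access pattern is made up for illustration and does not reproduce the measured numbers on the slide; `tracked_elements` is a hypothetical helper, and blocks are assumed to be aligned to their size.

```python
def tracked_elements(addresses, granularity: int) -> int:
    """Number of distinct tracking blocks touched at a given granularity."""
    return len({addr // granularity for addr in addresses})


# Hypothetical region that touches 64 consecutive words (256 bytes).
accesses = [0x2000 + 4 * i for i in range(64)]

fine = tracked_elements(accesses, 4)     # one element per word
coarse = tracked_elements(accesses, 32)  # one element per 32-byte block

print(fine, coarse, fine / coarse)  # 64 8 8.0
```

The coarser setting only pays off where it stays at or below the ideal granularity for that range, since beyond that point it starts reporting false conflicts.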

  21. Ongoing Work • Should memory-centric or code-centric accesses determine granularity? • Dynamic, adaptive system for deciding granularity based on iterative sampling • How best to use and store profile information • May tolerate some percentage of false conflicts • Hardware TLS: reduce conflict-detection traffic, possibly power • Software TM (lock-based): reduce the number of locks (saving space and time); reduce lock contention

  22. Conclusions (for Stampede TLS) • TM/TLS systems with only fixed coarse granularity may suffer many false conflicts (2x–4x on average) • Variable granularity can reduce false conflicts and tracking overhead (3x–35x reduction in tracked ranges) • Ideal granularity varies widely across memory ranges and speculative code regions

  23. Thank you! Questions?
