The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism

Mihai Burcea, J. Gregory Steffan, Cristiana Amza • University of Toronto • MSPC 2008

Getting the Most Out of Your CPUs • Ubiquitous CMPs (pictured: AMD Barcelona quad-core, Intel Kentsfield quad-core) • How do we exploit all this parallelism? • How do we improve sequential applications?

Optimistic Parallelism • Flavors: • Transactional Memory (TM) • Thread-Level Speculation (TLS) • Implementations: hardware, software, hybrid • Common required support: • Buffering speculative memory changes • Tracking and detecting memory access conflicts

Traditional Access Tracking • Most approaches use some fixed granularity • Hardware TM/TLS: cache-line size • Typically 32/64/128 bytes • Software TLS: word- or object-level • Software TM: word, page, or object granularity • Hybrid TM: mixture of the above (in HW/SW) • Is fixed granularity the best approach?

Can We Reduce the Overhead of Dependence Tracking? • Key intuition: the "best" granularity likely varies within and across benchmarks • Fine granularity: too much overhead • Coarse granularity: too many false conflicts

False Conflicts when Using Uniform Coarse Granularity • Measured in a TLS simulator; 32/64/128 = cache-line sizes (bytes) • A uniform coarse-grain approach suffers false conflicts
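To make the notion of a false conflict concrete, here is a minimal sketch, not the simulator used in the talk. It assumes word-aligned 4-byte accesses and counts a later thread's read as a conflict whenever it falls in the same tracking block as an earlier thread's write. The function name `flagged_reads` and all addresses are made up for illustration.

```python
def flagged_reads(writes, reads, granularity):
    """Return the reads that share a granularity-sized block with any write."""
    written_blocks = {w // granularity for w in writes}
    return {r for r in reads if r // granularity in written_blocks}

# Hypothetical byte addresses of 4-byte words (illustrative only).
earlier_writes = {0x1000, 0x1040, 0x1084}
later_reads = {0x1000, 0x1060, 0x10C0}

# At word granularity every flagged read is a true read-after-write conflict.
true_conflicts = flagged_reads(earlier_writes, later_reads, 4)

for line_size in (32, 64, 128):
    false_conflicts = flagged_reads(earlier_writes, later_reads, line_size) - true_conflicts
    print(f"{line_size:>3}-byte lines: {len(false_conflicts)} false conflict(s)")
```

With these made-up addresses the number of false conflicts grows from 0 at 32 bytes to 2 at 128 bytes, which is the effect the slide describes.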

Is there potential for a variable-granularity approach?

Goals Of Our Work • Show potential for Variable-Granularity Access Tracking (VGAT) • Finest grain too expensive; which coarse grain? • Show that ideal granularity varies across and within applications • Suggests need for dynamic, adaptive scheme • Show significant reduction in number of tracked memory ranges when using VGAT

Related Work • Hardware TLS / TM: track accesses at cache-line size (32/64/128 bytes) • Stampede (Steffan et al., ACM Trans. 2005), Speculative Versioning Cache (Vijaykumar et al., HPCA 1998) • Unbounded TM (Ananian et al., HPCA 2005), LogTM (Moore et al., HPCA 2006) • Software TLS: • Word (Cintra et al., PPoPP 2003) • Object (Pickett et al., LCPC 2005) • Software TM: • Word (McRT-STM, Saha et al., PPoPP 2006) • Page (Manassiev et al., PPoPP 2006) • Object: RSTM (Marathe et al., PLDI 2006), DSTM (Herlihy et al., PODC 2003) • Most systems use fixed or object granularity, which is not necessarily the best

Related Work – Bulk Disambiguation • Ceze et al., ISCA 2006 • Encode read/write sets into signatures • Detect conflicts by performing operations on signatures (fast) • Hashing (encoding) addresses into signatures introduces false positives • Reduces conflict-detection traffic, but increases false conflicts • Our goal: minimize false conflicts
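The trade-off in signature-based detection can be illustrated with a small sketch. This is a simplified Bloom-filter-style bitmask with a single modulo hash, chosen for illustration; it is not the encoding used by Bulk. `SIGNATURE_BITS`, `signature`, and `may_conflict` are hypothetical names.

```python
SIGNATURE_BITS = 16  # deliberately tiny so that aliasing is easy to trigger

def signature(addresses, bits=SIGNATURE_BITS):
    """Fold a set of byte addresses into one fixed-size bitmask."""
    sig = 0
    for addr in addresses:
        word_index = addr >> 2           # assume 4-byte words
        sig |= 1 << (word_index % bits)  # single hash: word index modulo size
    return sig

def may_conflict(sig_a, sig_b):
    """A non-empty intersection means the sets *may* overlap."""
    return (sig_a & sig_b) != 0

write_sig = signature({0x1000})
print(may_conflict(write_sig, signature({0x1004})))  # different bit, no conflict
print(may_conflict(write_sig, signature({0x1040})))  # aliases to the same bit
```

The second check reports a conflict even though the two addresses differ: the comparison is a single bitwise AND, but distinct addresses can hash to the same bit.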

Variable Granularity Access Tracking • Approaches: vary granularity across • Time: parts of apps. (speculative code regions) • Space: ranges of memory • Can potentially reduce: • Tracking storage • Tracking traffic • Commit latency • False conflicts

Impact on Conflicts of Increasing Granularity • Finest granularity: only true (actual) conflicts • Coarser: same number of conflicts, still OK • Coarser still: extra (false) conflicts! • Ideal granularity: the coarsest granularity that incurs no false conflicts
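The definition of ideal granularity can be sketched as a search over candidate sizes. This is a minimal illustration under stated assumptions: accesses are word-aligned 4-byte words, candidates are nested powers of two, and a false conflict is a read that shares a block with a write although its own word was never written. The names and the candidate list are hypothetical, not taken from the authors' tooling.

```python
def has_false_conflict(writes, reads, granularity):
    """True if some read is flagged only because a neighbouring word was written."""
    written_blocks = {w // granularity for w in writes}
    return any(r not in writes and r // granularity in written_blocks
               for r in reads)

def ideal_granularity(writes, reads, candidates=(4, 8, 16, 32, 64, 128, 4096)):
    """Return the coarsest candidate that incurs no false conflicts."""
    ideal = candidates[0]
    for g in candidates[1:]:
        # Candidates are nested, so once a false conflict appears
        # it persists at every coarser size and the search can stop.
        if has_false_conflict(writes, reads, g):
            break
        ideal = g
    return ideal

# Hypothetical access sets (byte addresses), for illustration only.
print(ideal_granularity({0x1000, 0x1040, 0x1084}, {0x1000, 0x1060, 0x10C0}))
print(ideal_granularity({0x1000}, {0x2000}))
```

The first call finds that 32 bytes is the coarsest safe size for those addresses; the second has no nearby accesses at all, so even page-sized tracking is safe.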

Measuring the Potential for VGAT

Experimental Framework • TLS simulator (CMU) • Subset of SPECint2000 benchmarks • Instrumented for TLS • TLS regions mostly loop-based • TLS regions pre-selected based on 32-byte reading and 4-byte writing granularity • Focus on specific aspects: • Simulate first billion instructions • Track only Read-After-Write dependences • Speculative code regions were pre-selected for 32 bytes, so our results are conservative!

Variable Granularity at Code Region Level • [Figure: three speculative code regions, each running between a fork and a join; the memory accessed by each region is tracked at that region's own granularity: Region 1 at 4 bytes, Region 2 at 32 bytes, Region 3 at 8 bytes]

Ideal Granularity at Code Region Level • [Figure: ideal granularity per code region, spanning word level, cache-line level, and page level (4 KB)] • Code regions with no conflicts not shown in figure (in parentheses) • Ideal granularity varies significantly between code regions

Variable Granularity Across Memory Ranges • [Figure: three speculative code regions, each running between a fork and a join; the memory accessed by each region is split into ranges tracked at 4, 8, or 32 bytes]

Ideal Granularity Across Memory Ranges • Cache-line size sometimes good, sometimes not • Word-level rarely necessary • Page-level often sufficient • Ideal granularity varies widely across memory ranges

Can VGAT improve performance?

Reducing the Number of Tracked Elements by Using Variable Granularity • [Figure: chart of tracked elements; surviving data labels: 51, 61, 31, 458, 50, 35, 9, 5, 3] • VGAT can reduce the number of tracked elements by more than 3x!
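Why variable granularity shrinks the number of tracked elements can be shown with simple arithmetic. The sketch below assumes each memory range is block-aligned and already has a known ideal granularity, and it compares against a uniform scheme that must use the finest of those granularities to avoid false conflicts everywhere. The range sizes are invented for illustration and are not the measured results from the talk.

```python
def tracked_elements(ranges, fixed_granularity=None):
    """Count tracking entries for (size_bytes, ideal_granularity) ranges.

    With fixed_granularity set, every range is tracked at that size;
    otherwise each range is tracked at its own ideal granularity.
    """
    total = 0
    for size_bytes, ideal in ranges:
        g = fixed_granularity if fixed_granularity is not None else ideal
        total += -(-size_bytes // g)  # integer ceiling division
    return total

# Hypothetical ranges: (bytes touched, ideal granularity in bytes).
ranges = [(4096, 4096), (256, 32), (64, 4)]

finest = min(ideal for _, ideal in ranges)
fixed = tracked_elements(ranges, fixed_granularity=finest)
variable = tracked_elements(ranges)
print(f"fixed at {finest} bytes: {fixed} elements; variable: {variable} elements")
```

Here the uniform scheme needs 1104 entries while the variable scheme needs 25, because the page-sized range collapses to a single entry.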

Ongoing Work • Should memory-centric or code-centric accesses determine granularity? • Dynamic, adaptive system for deciding granularity based on iterative sampling • How best to use and store profile information • May tolerate some percentage of false conflicts • Hardware TLS • Reduce conflict-detection traffic, possibly power • Software TM (lock-based) • Reduce number of locks to save space and time • Reduce lock contention
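One possible shape for such an adaptive policy is sketched below. It is purely hypothetical and not the authors' design: it assumes each sampling interval can report how many conflicts were flagged and how many of those were false, and it halves or doubles the granularity against a tolerance threshold. All names, bounds, and the 5% default are invented for illustration.

```python
MIN_GRANULARITY = 4      # word level (assumed lower bound)
MAX_GRANULARITY = 4096   # page level (assumed upper bound)

def next_granularity(current, flagged, false_flagged, tolerance=0.05):
    """Pick the granularity for the next sampling interval.

    Refine (halve) when the share of false conflicts exceeds the
    tolerance; otherwise coarsen (double) to cut tracking overhead.
    """
    false_rate = false_flagged / flagged if flagged else 0.0
    if false_rate > tolerance:
        return max(MIN_GRANULARITY, current // 2)
    return min(MAX_GRANULARITY, current * 2)

# Hypothetical per-interval samples: (flagged conflicts, false conflicts).
granularity = 64
for flagged, false_flagged in [(10, 5), (10, 0), (0, 0)]:
    granularity = next_granularity(granularity, flagged, false_flagged)
    print(granularity)
```

A nonzero tolerance reflects the slide's point that some percentage of false conflicts may be acceptable in exchange for coarser tracking.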

Conclusions (for Stampede TLS) • TM/TLS systems with only fixed coarse granularity may suffer many false conflicts • 2x – 4x on average • Variable granularity can reduce false conflicts and tracking overhead • 3x – 35x reduction in tracked ranges • Ideal granularity varies widely across memory ranges and speculative code regions

Thank you! Questions?
