

  1. Speculative Parallelization of Applications on Multicores
     Rajiv Gupta, Chen Tian, Min Feng, Vijay Nagarajan

  2. Speculative Parallelization
     • Goal: Exploit parallelism that is frequently observed but not guaranteed to be present, due to dependences:
       • Dependences due to Cold Code
       • Dependences that are Harmless
       • Dependences that are Silent

  3. Outline
     • Thread-based Execution Model
       • Non-speculative thread and state
       • Speculative threads and state
       • Committing Results
       • Rollback-free Recovery
     • Software Only – Coarse-grained parallelism
       • Speculative parallelization of loops
       • Speedups on a real machine

  4. Execution Model
     • Main Thread
       • Performs Non-speculative Computation
         • Non-Parallelizable Code
         • Parts of Parallelized Code
       • Controls Parallel Threads
         • Initialization & Memory Allocation
         • Termination & Mis-speculation Checks
         • Commit Results In-Order
     • Multiple Parallel Threads
       • Perform Speculative Computations
       • E.g., Speculative Loop Bodies

  5. Execution Model
     [Figure: the static code of each iteration is split into prologue, speculative body, and epilogue; sequential execution runs them in order: p1 sp1 e1, p2 sp2 e2, p3 sp3 e3, p4 sp4 e4.]

  6. Execution Model
     [Figure: parallel execution vs. sequential execution; the main thread runs the prologues and epilogues while parallel threads P1 and P2 run the speculative bodies.]
     • In-order commit.
     • At any time, only two threads are executing, so the main thread doesn't require a separate core.

  7. Memory State
     [Figure: the main thread owns D space; each parallel thread has its own P space and C space.]
     • Non-Speculative State (D space)
       • Maintained by the main thread.
     • Speculative State (P space)
       • Allocated by the main thread and used by parallel threads.
       • Results will be either committed to D space or discarded.
     • Coordinating State (C space)
       • Version numbers for variables in D space.
       • Mapping table for variables in P space (see the data-structure sketch below).
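To make the three spaces concrete, here is a minimal data-structure sketch in C; all type and field names (version_entry_t, map_entry_t, spec_state_t) are illustrative assumptions, not the authors' implementation, and the mapping-table fields follow the list on the next slide.

    #include <stddef.h>

    #define MAX_VARS 1024                /* illustrative capacity */

    /* C space, main-thread side: one version number per D-space variable
     * that parallel threads may read or write. */
    typedef struct {
        void *d_addr;                    /* address of the variable in D space */
        long  version;                   /* bumped on every committed write    */
    } version_entry_t;

    /* C space, parallel-thread side: one mapping-table entry per variable
     * copied into this thread's P space. */
    typedef struct {
        void   *d_addr;                  /* D space address                      */
        void   *p_addr;                  /* P space address of the local copy    */
        size_t  size;                    /* size in bytes                        */
        long    version;                 /* D-space version seen at copy-in time */
        int     written;                 /* write-flag: copy out only if set     */
    } map_entry_t;

    /* Speculative state the main thread allocates for each parallel thread. */
    typedef struct {
        char        *p_space;            /* private memory for speculative copies */
        size_t       p_used;             /* bump-allocation offset into p_space   */
        map_entry_t  map[MAX_VARS];      /* mapping table (C space)               */
        int          map_len;
    } spec_state_t;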

  8. Copy Operations
     • Naïve scheme
       • Copy-in: copy values from D space to P space when work is assigned.
       • Copy-out: copy variable values from P space to D space when the speculation check succeeds (both operations are sketched below).
     • Optimized scheme
       • Use profiling to discover the access pattern of variables in the speculative loop body: In-Out, Only-in, Only-out, and Thread-local.
       • Unknown: variables untouched in the profiling run; these are copied on-the-fly through message passing.
     • Mapping table
       • Holds the mapping information of the copied variables: D space address, P space address, size, version, and write-flag.
       • Updated when variables are copied into P space.
       • Referred to when variables are copied back to D space.
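A minimal sketch of the two copy operations, reusing the spec_state_t and map_entry_t types from the sketch above; the caller (the main thread) is assumed to pass in the variable's current version, and table overflow and alignment are ignored.

    #include <string.h>

    /* Copy-in: copy one variable from D space into this thread's P space and
     * record it in the mapping table along with its current version number. */
    void copy_in(spec_state_t *st, void *d_addr, size_t size, long d_version) {
        map_entry_t *e = &st->map[st->map_len++];
        e->d_addr  = d_addr;
        e->p_addr  = st->p_space + st->p_used;   /* bump-allocate inside P space */
        e->size    = size;
        e->version = d_version;
        e->written = 0;
        memcpy(e->p_addr, d_addr, size);
        st->p_used += size;
    }

    /* Copy-out: called by the main thread only after the mis-speculation check
     * succeeds; commits the written copies from P space back to D space. */
    void copy_out(spec_state_t *st) {
        for (int i = 0; i < st->map_len; i++) {
            map_entry_t *e = &st->map[i];
            if (e->written)
                memcpy(e->d_addr, e->p_addr, e->size);
        }
    }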

  9. Mis-speculation Check
     • Version number – maintained by the main thread
       • One for each variable that is potentially read/written by parallel threads.
       • The version number is copied into the mapping table when the corresponding variable is copied into P space.
     • Mis-speculation check – performed by the main thread (a sketch follows below)
       • For every entry in the mapping table, compare its version number with the one maintained by the main thread.
       • If all are the same, the speculation succeeds:
         • Perform the copy-out operations.
         • Update the version numbers accordingly.
       • If any version number differs, the speculation fails because some earlier thread has changed this variable's value:
         • Re-execute the speculative body with the latest value.
     • Value-based dependence check → Rollback-Free Recovery
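A sketch of the check itself, continuing the types and copy_out() above; current_d_version() and bump_d_version() are hypothetical accessors for the version table the main thread maintains.

    extern long current_d_version(void *d_addr);
    extern void bump_d_version(void *d_addr);

    /* Returns 1 if the thread's speculation succeeded, 0 if it must re-execute. */
    int speculation_check(spec_state_t *st) {
        /* Compare every mapping-table entry's version with the current one. */
        for (int i = 0; i < st->map_len; i++)
            if (current_d_version(st->map[i].d_addr) != st->map[i].version)
                return 0;                /* an earlier thread committed a newer value */

        /* Success: commit the results and update the version numbers. */
        copy_out(st);
        for (int i = 0; i < st->map_len; i++)
            if (st->map[i].written)
                bump_d_version(st->map[i].d_addr);
        return 1;
    }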

  10. On-the-fly Copying
     • Access checks consult the mapping table at:
       • Loads and Stores
       • Pointer Assignments
       (a sketch of such a check follows below)
     • Reducing the Overhead of Access Checks
       • Stack & Global variables: based upon classification (In-Out, Only-in, Only-out, Thread-local).
       • Heap: optimizations beyond classification.
         • Locally created objects require no checks.
         • Once an object is copied, its other fields are accessed without checking.
         • Copy-on-write only: no checks needed at Loads; since the version number is not copied on a read, mis-speculation detection is implicitly carried out by another copied variable.
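A sketch of an access check, again reusing the types above; lookup_map() and request_copy_in() are hypothetical helpers, the latter standing in for the on-the-fly copy obtained through message passing.

    extern map_entry_t *lookup_map(spec_state_t *st, void *d_addr);
    extern map_entry_t *request_copy_in(spec_state_t *st, void *d_addr, size_t size);

    /* Redirect a D-space address referenced by the speculative body to its
     * P-space copy, copying the variable on first touch. */
    void *access_check(spec_state_t *st, void *d_addr, size_t size, int is_store) {
        map_entry_t *e = lookup_map(st, d_addr);
        if (e == NULL)
            e = request_copy_in(st, d_addr, size);  /* copy value + version on demand */
        if (is_store)
            e->written = 1;              /* remember to copy this variable out on commit */
        return e->p_addr;                /* the load/store operates on the local copy    */
    }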

  11. Other Enhancements
     • Reducing Thread Idling
       • Scenario: an earlier thread finishes its task, but the main thread has not yet finished assigning tasks to later threads, and hence cannot handle this earlier thread.
       • Performance fell when 4 or more parallel threads were used.
       • Solution: assign more work to each thread by loop unrolling (see the batching sketch below).
     • Reducing the Mis-speculation Rate
       • Scenario: the value of a speculative variable being used by a thread is changed by an earlier thread, and hence the speculation fails.
       • For benchmark 181.mcf, the mis-speculation rate becomes higher when more threads are used.
       • Solution: delay the copying of some variables using the on-the-fly mechanism; this increases the chance of getting the latest version.
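One way to picture the unrolling fix is the batching sketch below: the main thread hands each parallel thread a batch of iterations instead of one, so it keeps up with faster threads. BATCH and all helper functions are illustrative assumptions, not the authors' code.

    #define BATCH 4                      /* illustrative unroll factor */

    extern int  more_iterations(void);
    extern void run_prologue(int iter);                /* e.g. read the next input line */
    extern void assign_task(int thread, int first, int count);  /* send "start" message */

    /* Hand the given thread the next BATCH iterations as a single task. */
    void assign_batched_task(int thread, int *next_iter) {
        int first = *next_iter, count = 0;
        while (count < BATCH && more_iterations()) {
            run_prologue(*next_iter);    /* prologues still run in order on the main thread */
            (*next_iter)++;
            count++;
        }
        if (count > 0)
            assign_task(thread, first, count);
    }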

  12. Speculative Parallelization – an example from 197.parser

     while () {
       line = read_one_line(input_file);
       if (line cannot be parsed) {
         error_num++;
       } else {
         result = parse(line);
       }
       line_handled++;
       print(result);
     }

     • Prologue
       • Input statements (e.g. fgets).
       • Loop counters.
     • Epilogue
       • Output statements (e.g. printf).
       • Statements highly dependent on the previous iteration, e.g. line_handled.
     • Speculative body
       • The remainder.
       • The loop-carried dependence on error_num rarely manifests itself.

  13. Main Thread

     // Create the threads and initialize their tasks
     for (i = 0; i < Num_Proc; i++) {
       allocate P and C space for thread i;
       Prologue code;
       create thread i to execute thrd_func(i);
     }
     reset i = 0;

     while () {
       // Handle mis-speculation
       while (speculation_check(i) == FAIL) {
         update P and C space for thread i;
         re-execute thrd_func(i);
       }
       // In-order commit
       commit result and execute Epilogue code;
       // Assign a new iteration
       Prologue code;
       update P and C space for thread i;
       ask thread i to execute thrd_func(i);
       i = (i+1) % Num_Proc;
     }
     wait for all threads' completion and execute Epilogue code;

     (Original sequential loop, for comparison: while () { Prologue code; Speculative body code; Epilogue code })
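The same dispatch loop in compilable-style C; every extern is a hypothetical stand-in for one line of the slide's pseudocode, and speculation_check() is the sketch from slide 9, which already performs the copy-out when it succeeds.

    extern spec_state_t *state_of(int thread);       /* P and C space of a thread    */
    extern int  speculation_check(spec_state_t *st);
    extern void refresh_p_and_c_space(int thread);
    extern void reexecute_thrd_func(int thread);     /* synchronous re-execution     */
    extern void execute_epilogue(void);
    extern void execute_prologue(void);
    extern void start_thread(int thread);            /* send the "start" message     */
    extern int  more_iterations(void);
    extern void join_all_threads(void);

    void main_thread_loop(int Num_Proc) {
        int i = 0;
        while (more_iterations()) {
            /* (waiting for thread i's "finish" message is implicit, as on the slide) */
            while (!speculation_check(state_of(i))) {   /* handle mis-speculation    */
                refresh_p_and_c_space(i);
                reexecute_thrd_func(i);
            }
            execute_epilogue();          /* in-order commit of thread i's iteration  */
            execute_prologue();          /* prologue of the next iteration           */
            refresh_p_and_c_space(i);
            start_thread(i);             /* assign the new iteration to thread i     */
            i = (i + 1) % Num_Proc;
        }
        join_all_threads();              /* wait for completion, run final epilogues */
    }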

  14. Parallel Thread

     void *thrd_func(i) {
       while (1) {
         wait for the "start" message;
         Speculative body code;
         send "finish" message;
       }
     }

     (Original sequential loop, for comparison: while () { Prologue code; Speculative body code; Epilogue code })

     • Access checks precede/follow Loads, Stores, and Pointer Assignments in the speculative body.
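A minimal pthread sketch of the "start"/"finish" handshake, assuming one task slot per parallel thread; the slides do not say how the messages are implemented, so the mutex-plus-condition-variable scheme and all names here are assumptions.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             state;           /* 0 = idle, 1 = "start" sent, 2 = "finish" sent */
    } task_slot_t;

    /* Body of each parallel thread: wait for "start", run the speculative body
     * against this thread's P space, then report "finish" to the main thread. */
    void *thrd_func(void *arg) {
        task_slot_t *slot = (task_slot_t *)arg;
        for (;;) {
            pthread_mutex_lock(&slot->lock);
            while (slot->state != 1)             /* wait for the "start" message */
                pthread_cond_wait(&slot->cond, &slot->lock);
            pthread_mutex_unlock(&slot->lock);

            /* ... speculative body code, instrumented with access checks ... */

            pthread_mutex_lock(&slot->lock);
            slot->state = 2;                     /* send the "finish" message */
            pthread_cond_signal(&slot->cond);
            pthread_mutex_unlock(&slot->lock);
        }
        return NULL;
    }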

  15. Experimental Setup
     [Toolchain figure: a Pin-based profiling tool takes the binary and a small input and produces the dependence graph and access patterns; objdump provides symbols; the LLVM compiler infrastructure applies a transformation template to the source code and, using the -native option, emits the x86 binary.]
     • Machine: Dell PowerEdge 1900, two Intel Xeon quad-cores, 3 GHz, 16 GB.

  16. Experimental Setup
     • Benchmarks
       • 5 SPEC benchmarks: 197.parser, 181.mcf, 130.li, 256.bzip2 & 255.vortex.
       • 1 MiBench benchmark: CRC32 – best speedup achieved among all benchmarks.
     • Variables in the speculative body (obtained via profiling).
     • Machine: Dell PowerEdge 1900 server with two quad-core processors, 3 GHz, and 16 GB.

  17. Execution Speedups
     • All benchmarks get their best speedup when 8 threads are used.
     • The highest speedups range from 3.7 to 7.8 across all benchmarks.

  18. Thread Idling

  19. Delayed Copying
     • Without delayed copying, the mis-speculation rate of 181.mcf increases from 0.7% to 17.5% as the number of parallel threads increases from 2 to 8.
     • With delayed copying, the mis-speculation rate of 181.mcf stays below 10%.
     • The mis-speculation rate of the other benchmarks is less than 2%.

  20. Copy Optimization
     • Three schemes considered:
       1. All: all variables copied before the parallel thread starts work.
          • Unnecessary copying occurs.
       2. On-the-fly: all variables copied on-the-fly via message passing.
          • Every variable must be checked to see if it has been copied into P space.
       3. Opt.: profiling used to determine when to copy.

  21. Copy Optimization
     • The experiment shows the results when 4 threads are used.
     • Opt. outperforms the other two schemes.
     • On-the-fly outperforms All when heap accesses dominate (bzip2, mcf).

  22. Overhead – Instruction Count
     • Overhead breakdown per core when 8 threads are used.
     • No more than 7% of total instructions are used for operations related to the execution model.

  23. Overhead – Memory Space
     • For most benchmarks, the space overhead is around 2-3x.
     • For 256.bzip2, a large chunk of the heap needs to be copied to P space.
