
Program Demultiplexing: Data-flow based Speculative Parallelization



Presentation Transcript


  1. Program Demultiplexing: Data-flow based Speculative Parallelization Saisanthosh Balakrishnan, Guri Sohi, University of Wisconsin-Madison

  2. Speculative Parallelization • Construct threads from a sequential program • Loops, methods, … • Execute the threads speculatively • Hardware support to enforce program order • Application domain • Irregularly parallel programs • Why it matters now • Single-core performance gains are incremental

  3. Speculative Parallelization Execution • Execution model • Fork threads (T1, T2, T3, T4) in program order for execution • Commit tasks in that order (control-flow speculative parallelization) • Limitation • Reaching distant parallelism

  4. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  5. Program Demultiplexing Framework • Trigger • Begins execution of the handler • Handler • Sets up the execution: parameters • Demultiplexed execution • Speculative; stored in the execution buffer (EB) • At the call site of M() • Search the EB for a matching execution • Dependence violations • Invalidate executions • [Figure: sequential execution vs. PD execution of M(), with trigger, handler, and EB]

  6. Program Demultiplexing Highlights • Method granularity • Well-defined unit: parameters, stack for local communication • Trigger forks the execution • A means of reaching a distant method • Different from the call site • Independent speculative executions • No control dependence with other executions • Triggers lead to unordered execution • Not according to program order
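The kind of method PD targets can be sketched in C. This is a hypothetical function, not from the benchmarks: its inputs arrive through parameters, local work stays on the stack, and the global state it touches is small, so its read and write sets are easy to track.

```c
/* Hypothetical candidate for demultiplexed execution: all inputs come in
 * through parameters, local work uses only the stack, and the only shared
 * state touched is the arrays passed in -- small, trackable read/write sets. */
int bounding_box_area(const int *xs, const int *ys, int n)
{
    int xmin = xs[0], xmax = xs[0], ymin = ys[0], ymax = ys[0];
    for (int i = 1; i < n; i++) {
        if (xs[i] < xmin) xmin = xs[i];
        if (xs[i] > xmax) xmax = xs[i];
        if (ys[i] < ymin) ymin = ys[i];
        if (ys[i] > ymax) ymax = ys[i];
    }
    return (xmax - xmin) * (ymax - ymin);
}
```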

  7. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  8. Example: 175.vpr, update_bb()

     ..
     x_from = block[b_from].x;
     y_from = block[b_from].y;
     find_to (x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
     ..
     for (k = 0; k < num_nets_affected; k++) {
         inet = nets_to_update[k];
         if (net_block_moved[k] == FROM_AND_TO)
             continue;
         ..
         if (net[inet].num_pins <= SMALL_NET) {
             get_non_updateable_bb (inet, &bb_coord_new[bb_index]);
         } else {
             if (net_block_moved[k] == FROM)
                 update_bb (inet, &bb_coord_new[bb_index],      /* Call Site 1 */
                            &bb_edge_new[bb_index],
                            x_from, y_from, x_to, y_to);
             else
                 update_bb (inet, &bb_coord_new[bb_index],      /* Call Site 2 */
                            &bb_edge_new[bb_index],
                            x_to, y_to, x_from, y_from);
         }
         ..
         bb_index++;
     }

  9. Handlers • Provide parameters to the execution • Achieve separation of call site and execution • Handler code • Slice of dependent instructions from the call site • Many variants possible • Example call site: update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index], x_from, y_from, x_to, y_to);
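A handler can be pictured as the backward slice that recomputes the call's arguments, hoisted away from the call site. A minimal C sketch: the `bb_args` struct and the `handler_h1` function are illustrative inventions, and `block` / `nets_to_update` are small stand-ins mirroring the update_bb example; the real handlers are extracted automatically from the binary.

```c
struct handler_block { int x, y; };
struct bb_args { int inet, x_from, y_from, x_to, y_to; };

/* Stand-ins for the program's global state (values are arbitrary). */
static struct handler_block block[4] = { {0, 0}, {5, 7}, {2, 3}, {9, 9} };
static int nets_to_update[4] = { 10, 11, 12, 13 };

/* Hypothetical handler: the slice of instructions the update_bb call
 * depends on, gathered into one routine that packages the arguments. */
struct bb_args handler_h1(int b_from, int k, int x_to, int y_to)
{
    struct bb_args a;
    a.x_from = block[b_from].x;    /* copied from the sequential code */
    a.y_from = block[b_from].y;
    a.inet   = nets_to_update[k];  /* which net the call operates on */
    a.x_to   = x_to;               /* produced earlier by find_to() */
    a.y_to   = y_to;
    return a;
}
```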

  10. Handlers Example • Same update_bb code as slide 8, with handlers H1 and H2 marked at the two update_bb call sites

  11. Triggers • Fork demultiplexed execution • Usually when method and handler are ready • i.e. when data dependencies satisfied • Begins execution of the handler

  12. Identifying Triggers • Generate a memory profile • Identify the trigger point: where the program state for H + M becomes available • Collect over many executions for good coverage • Represent trigger points with instruction attributes • PCs, memory write addresses
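One plausible encoding of a trigger point, following the attributes the slide names (PC plus memory write address); the struct layout and matching rule are assumptions for illustration, not the paper's exact mechanism.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical trigger-point encoding: a trigger fires when a committed
 * instruction at a given PC writes a given memory address. */
struct trigger {
    uint64_t pc;          /* PC of the committed store */
    uint64_t write_addr;  /* address that store must write */
};

/* Check one committed store against a registered trigger. */
bool trigger_fires(const struct trigger *t, uint64_t pc, uint64_t addr)
{
    return t->pc == pc && t->write_addr == addr;
}
```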

  13. Triggers Example • Same update_bb code as slide 8, with triggers T1 and T2 marked ahead of the call sites handled by H1 and H2 • Minimum of 400 cycles from trigger to call site; 90 cycles per execution

  14. Handlers Example … (2) • Stack references • Same update_bb code as slide 8, with T1 and T2 marked

  15. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  16. Hardware Support Outline • Support for triggers • Demultiplexed execution • Maintaining executions • Storage • Invalidation • Committing (some of these are dealt with in other speculative parallelization proposals)

  17. Support for Triggers • Triggers are registered with hardware • ISA extensions • Similar to debug watchpoints • Evaluation of triggers • Only by Committed instructions • PC, address • Fast lookup with filters

  18. Demultiplexed Execution • Hardware: a typical MP system (main processor P0 plus auxiliary processors P1–P3, each with a private cache) • Private cache holds speculative data • Extend each cache line with an "access" bit • Misses serviced by the main processor • No communication with other executions • On completion • Collect the read set (R): accessed lines • Collect the write set (W): dirty lines • Invalidate the write set in the cache
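The read/write-set collection step can be modeled in a few lines of C. This is a software sketch of the hardware behavior; the `cache_line` layout is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the private-cache extension: each line carries an
 * "access" bit and a dirty bit. */
struct cache_line { bool accessed, dirty; };

/* On completion of a demultiplexed execution, scan the private cache:
 * accessed lines form the read set R, dirty lines the write set W. */
void collect_sets(struct cache_line *c, size_t nlines,
                  size_t *read_set, size_t *nread,
                  size_t *write_set, size_t *nwrite)
{
    *nread = *nwrite = 0;
    for (size_t i = 0; i < nlines; i++) {
        if (c[i].accessed) read_set[(*nread)++] = i;   /* R: accessed lines */
        if (c[i].dirty)    write_set[(*nwrite)++] = i; /* W: dirty lines   */
    }
    /* the write-set lines are then invalidated in the cache */
    for (size_t i = 0; i < nlines; i++)
        if (c[i].dirty) c[i].accessed = c[i].dirty = false;
}
```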

  19. Execution Buffer Pool • Holds speculative executions • Each entry contains • Method and parameters • Read set and write set (<tag>, <data> pairs) • Return value • Alternatives • Use the cache instead: may be more efficient, similar to other proposals • Not the focus of this paper
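A possible C rendering of one buffer entry, matching the fields listed on the slide; the sizes, field names, and fixed-capacity arrays are assumptions (the evaluation assumes an unbounded pool).

```c
#include <stdint.h>

#define MAX_SET 32   /* arbitrary capacity for this sketch */

/* One cache line captured in a read or write set: <tag> plus <data>. */
struct set_entry { uint64_t tag; uint8_t data[64]; };

/* Hypothetical layout of one execution-buffer entry. */
struct eb_entry {
    uint64_t method;                     /* identifies M() */
    uint64_t params[8];                  /* parameters set up by the handler */
    struct set_entry read_set[MAX_SET];
    uint64_t nread;
    struct set_entry write_set[MAX_SET];
    uint64_t nwrite;
    uint64_t retval;                     /* return value of the execution */
    int valid;                           /* cleared on dependence violation */
};
```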

  20. Invalidating Executions • For a committed store address • Search the read and write sets • Invalidate matching executions
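A sketch of the invalidation check in C. The `exec` layout and the linear search are illustrative only; the slides note that real hardware would use filters for fast lookup.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical per-execution record: addresses in its read and write sets. */
struct exec {
    uint64_t read_set[8], write_set[8];
    size_t nread, nwrite;
    bool valid;
};

/* A committed store address is matched against every speculative
 * execution's read and write sets; any hit invalidates that execution. */
void invalidate_on_store(struct exec *pool, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].valid) continue;
        for (size_t j = 0; j < pool[i].nread; j++)
            if (pool[i].read_set[j] == addr) pool[i].valid = false;
        for (size_t j = 0; j < pool[i].nwrite; j++)
            if (pool[i].write_set[j] == addr) pool[i].valid = false;
    }
}
```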

  21. Using Executions • For a given call site • Search by method name and parameters • Get the write and read sets • Commit: if accessed by the program • Use: if accessed by another method (nested methods)
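The call-site lookup can be sketched similarly (hypothetical `execution` layout): a hit returns the buffered execution whose write set can then be committed, and a miss means the method runs normally.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

/* Hypothetical buffered execution: method id, parameters, validity. */
struct execution {
    uint64_t method;
    uint64_t params[4];
    bool valid;
};

/* At a call site, search the buffer for a valid execution of the same
 * method with the same parameters. */
struct execution *eb_lookup(struct execution *pool, size_t n,
                            uint64_t method, const uint64_t params[4])
{
    for (size_t i = 0; i < n; i++)
        if (pool[i].valid && pool[i].method == method &&
            memcmp(pool[i].params, params, 4 * sizeof(uint64_t)) == 0)
            return &pool[i];   /* hit: reuse this speculative execution */
    return NULL;               /* miss: execute the method normally */
}
```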

  22. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  23. Reaching Distant Parallelism • [Figure: fork point and call site of M(), with distances A and B marked]

  24. Performance evaluation • Performance benefits limited by • Methods in program • Handler implementation

  25. Summary of Other Results (refer to the paper) • Method sizes • 10s to 1000s of instructions; usually low 100s • Demultiplexed execution overheads • Common case 1.1x to 2.0x • Trigger points • 1 to 3; outliers exist (macro usage) • Handler length • 10 to 50 instructions on average • Cache lines • Read: ~20s; written: ~10s • Demultiplexed executions • Held for an average of 100s of cycles

  26. Conclusions • Method granularity • Exploits modularity in the program • Trigger and handler allow the "earliest" possible execution • Data-flow based • Unordered execution • Reaches distant parallelism • Orthogonal to other speculative parallelization techniques • Can be used to further speed up demultiplexed execution

  27. Backup

  28. Average trigger points per call site • Small set of trigger points for a given call site • Defines reachability from trigger to the call site

  29. Evaluation • Full-system execution-based simulator • Intel x86 ISA and Virtutech Simics • 4-wide out-of-order processors • 64K Level 1 caches (2 cycle), 1 MB Level 2 (12 cycle) • MSI coherence • Software toolchain • Modified gcc compiler and lancet tool • Debugging information, CFG, program dependence graph • Simulator-based memory profile • Generates triggers and handlers • No mis-speculations occur

  30. Reaching Distant Parallelism • A = cycles between fork and call site of M()

  31. Execution Buffer Entries • Storage requirements • Max case 284 KB • Minimize entries by better scheduling • Avg. cycles held, per benchmark: 900, 590, 70, 520, 413, 244, 160, 308

  32. Read and Write Set • [Chart: cache lines read and cache lines written, per benchmark]

  33. Demultiplexed Execution Overheads • Overheads due to the handler and to cache misses during demultiplexed execution • Common case between 1.1x and 2.0x • Small methods → high overheads • [Chart: execution time overhead per benchmark]

  34. Length of Handlers • Handler instruction count overhead, per benchmark: 14%, 10%, 9%, 100%, 16%, 4%, 40%, 4%

  35. Method sizes

  36. Methods • Runtime includes frequently called methods

                      crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
     Methods              24   16     9    8      12     10      11   11
     Call sites           59   27     9   84      26    106      20    —
     Exec. time (%)       85   90    51   30      55     92      88   99

  37. Loop-level Parallelization (Mitosis) • Unit: loop iterations • Live-ins from a p-slice • Similar to a handler • Fork instruction • Restricted: same basic-block level, same method • Program-order dependent • Ordered forking

  38. Method-level Parallelization • Unit: method continuations • The program after the method returns • Orthogonal to PD

  39. Reaching Distant Parallelism • B/A > 1 (%), per benchmark: crafty 60, gap 72, gzip 30, mcf 80, parser 70, twolf 40, vortex 63, vpr 47 • [Figure: intervals A and B around M1() and M2()]

  40. Reaching Distant Parallelism • B = call time to earliest execution time (1 outstanding) • C / B = R1 • C_no-params / C = R2 • [Figure: intervals A, B, C around M1() and M2()]

  41. Issues with the Stack • The stack pointer is position-dependent • The handler has to insert parameters at the right position • The same stack address can denote different variables • Affects triggers • Different stack pointers in the program and the execution • The stack may be discarded • Committing requires relocating stack results • Example: parameters passed by reference
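The relocation issue can be illustrated with a small C sketch (hypothetical layout: both stacks are modeled as byte arrays, and the by-reference result sits at the same offset from each stack pointer, but on different stacks).

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical illustration of stack-result relocation: the demultiplexed
 * execution ran with its own stack, so a result written through a
 * by-reference parameter sits at a different address than the program's
 * stack slot and must be copied over at commit time. */
void commit_stack_result(uint8_t *exec_stack, uint64_t exec_sp,
                         uint8_t *prog_stack, uint64_t prog_sp,
                         uint64_t offset, size_t size)
{
    /* same offset from the stack pointer, different stacks */
    memcpy(prog_stack + prog_sp + offset,
           exec_stack + exec_sp + offset, size);
}
```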

  42. Benchmarks • SPECint2000 benchmarks • C programs • Did not evaluate gcc, perl, bzip2, and eon • Written with no intention of creating concurrency • No specific or clean programming style • Many methods perform several tasks • May offer fewer opportunities

  43. Hardware System • Intel x86 simulation • Virtutech Simics based full-system, Bochs decoder • 4-processors at 3 GHz • Simple memory system • Micro-architecture model • 4-wide out of order without cracking into micro-ops • Branch predictors • 32K L1 (2-cycle), 1 MB L2 (12-cycle) • MSI, 15-cycle communication cache to cache • Infinite Execution buffer pool

  44. Software • Modified gcc-compiler tool chain and lancet tool • Extract from compiled binary • Debugging information • CFG, Program Dependence Graph • Software • Dynamic information from simulator • Generates handler, trigger for call site as encountered • Control-flow in handler not included [ongoing work] • Perfect control transfer from trigger to method • Handler doesn’t execute if a branch leads to not calling the method

  45. Generating Handlers • Cannot easily identify and demarcate handler code • Heuristic to demarcate • Terminate the slice when a load address is from the heap • The handler then has • Loads and stores to the stack • No stores to the heap • Limitation • A heuristic; doesn't always work
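The heap-termination heuristic can be sketched as a scan over the backward slice. The `op` record and the address-range heap test are stand-ins for the real binary analysis; this only illustrates the stopping rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* One operation in the backward slice, scanning back from the call. */
struct op { bool is_load, is_store; uint64_t addr; };

static bool in_heap(uint64_t addr, uint64_t heap_lo, uint64_t heap_hi)
{
    return addr >= heap_lo && addr < heap_hi;
}

/* Returns how many slice ops enter the handler: the slice stops growing at
 * the first heap access, so the handler keeps only stack references and
 * register/ALU work, and never stores to the heap. */
int handler_length(const struct op *slice, int n,
                   uint64_t heap_lo, uint64_t heap_hi)
{
    int len = 0;
    for (int i = 0; i < n; i++) {
        if ((slice[i].is_load || slice[i].is_store) &&
            in_heap(slice[i].addr, heap_lo, heap_hi))
            break;   /* heuristic: terminate at a heap access */
        len++;       /* stack loads/stores and ALU ops stay in the handler */
    }
    return len;
}
```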

  46. Generating Handlers • 1: Specify parameters to the method • Pushed onto the stack by the program • Introduces a dependency that prevents separation • 2: Compute the parameters • The program does this near the call site • Need to identify that code • Must deal with • Use of the stack • Control flow • Inter-method dependence

     1: G = F (N)
     2: if (…)
     3:   X = G + 2
     4: else
     5:   X = G * 2
     6: M (X)

  47. Control-flow in Handlers • Depends on the call site's control flow • Example: handler for D, with the call site in C() at BB 3 • Include the loop: BB 4 back to BB 1 • Include the branch in BB 1 • Inclusion depends on the trigger • Multiple iterations, different triggers • Ongoing work • [Figure: CFG of C with basic blocks 1–4, and the call-graph edge C → D]

  48. Other Dependencies in Handlers • C calls D; A or B calls C • The dependence (X) extends up the call graph: A(X) or B(X) → C(X) → D(X) • May need multiple handlers if there are multiple call sites

  49. Buffering Handler Writes • General case • Writes in the handler must be buffered • Provided to the execution • Discarded after the execution • Current implementation • Only stack writes

  50. Methods for Speculative Execution • Well encapsulated • Defined by parameters and return value • Stack for local computation, heap for global state • Often perform specific tasks • Access limited global state, limiting side-effects
