
Program Demultiplexing: Data-flow based Speculative Parallelization



Presentation Transcript


  1. Program Demultiplexing: Data-flow based Speculative Parallelization Saisanthosh Balakrishnan, Guri Sohi, University of Wisconsin-Madison

  2. Speculative Parallelization • Construct threads from a sequential program • Loops, methods, … • Execute the threads speculatively • Hardware support to enforce program order • Application domain • Irregularly parallel programs • Why it matters now • Single-core performance gains are incremental

  3. Speculative Parallelization Execution • Execution model • Fork threads (T1, T2, T3, T4) in program order for execution • Commit tasks in that order (control-flow speculative parallelization) • Limitation • Reaching distant parallelism

  4. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  5. Program Demultiplexing Framework • Trigger • Begins execution of the handler • Handler • Sets up the execution: parameters • Demultiplexed execution • Speculative; stored in the execution buffer (EB) • At the call site of M() • Search the EB for a matching execution • Dependence violations • Invalidate executions • [Figure: sequential execution vs. PD execution of M(), with trigger, handler, and EB]

  6. Program Demultiplexing Highlights • Method granularity • Well-defined unit: parameters, stack for local communication • Trigger forks the execution • A means of reaching a distant method • Different from the call site • Independent speculative executions • No control dependence with other executions • Triggers lead to unordered execution • Not according to program order
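The kind of method PD targets can be sketched in C. This is a hypothetical function, not from the benchmarks: its inputs arrive through parameters, local work stays on the stack, and the global state it touches is small, so its read and write sets are easy to track.

```c
/* Hypothetical candidate for demultiplexed execution: all inputs come in
 * through parameters, local work uses only the stack, and the only shared
 * state touched is the arrays passed in -- small, trackable read/write sets. */
int bounding_box_area(const int *xs, const int *ys, int n)
{
    int xmin = xs[0], xmax = xs[0], ymin = ys[0], ymax = ys[0];
    for (int i = 1; i < n; i++) {
        if (xs[i] < xmin) xmin = xs[i];
        if (xs[i] > xmax) xmax = xs[i];
        if (ys[i] < ymin) ymin = ys[i];
        if (ys[i] > ymax) ymax = ys[i];
    }
    return (xmax - xmin) * (ymax - ymin);
}
```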

  7. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  8. Example: 175.vpr, update_bb()

     ..
     x_from = block[b_from].x;
     y_from = block[b_from].y;
     find_to (x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
     ..
     for (k = 0; k < num_nets_affected; k++) {
         inet = nets_to_update[k];
         if (net_block_moved[k] == FROM_AND_TO)
             continue;
         ..
         if (net[inet].num_pins <= SMALL_NET) {
             get_non_updateable_bb (inet, &bb_coord_new[bb_index]);
         } else {
             if (net_block_moved[k] == FROM)
                 update_bb (inet, &bb_coord_new[bb_index],      /* Call Site 1 */
                            &bb_edge_new[bb_index],
                            x_from, y_from, x_to, y_to);
             else
                 update_bb (inet, &bb_coord_new[bb_index],      /* Call Site 2 */
                            &bb_edge_new[bb_index],
                            x_to, y_to, x_from, y_from);
         }
         ..
         bb_index++;
     }

  9. Handlers • Provide parameters to the execution • Achieve separation of call site and execution • Handler code • Slice of dependent instructions from the call site • Many variants possible • Example call site: update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index], x_from, y_from, x_to, y_to);
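A handler can be pictured as the backward slice that recomputes the call's arguments, hoisted away from the call site. A minimal C sketch: the `bb_args` struct and the `handler_h1` function are illustrative inventions, and `block` / `nets_to_update` are small stand-ins mirroring the update_bb example; the real handlers are extracted automatically from the binary.

```c
struct handler_block { int x, y; };
struct bb_args { int inet, x_from, y_from, x_to, y_to; };

/* Stand-ins for the program's global state (values are arbitrary). */
static struct handler_block block[4] = { {0, 0}, {5, 7}, {2, 3}, {9, 9} };
static int nets_to_update[4] = { 10, 11, 12, 13 };

/* Hypothetical handler: the slice of instructions the update_bb call
 * depends on, gathered into one routine that packages the arguments. */
struct bb_args handler_h1(int b_from, int k, int x_to, int y_to)
{
    struct bb_args a;
    a.x_from = block[b_from].x;    /* copied from the sequential code */
    a.y_from = block[b_from].y;
    a.inet   = nets_to_update[k];  /* which net the call operates on */
    a.x_to   = x_to;               /* produced earlier by find_to() */
    a.y_to   = y_to;
    return a;
}
```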

  10. Handlers Example • Same update_bb code as slide 8, with handlers H1 and H2 marked at the two update_bb call sites

  11. Triggers • Fork demultiplexed execution • Usually when method and handler are ready • i.e. when data dependencies satisfied • Begins execution of the handler

  12. Identifying Triggers • Generate a memory profile • Identify the trigger point: where the program state for H + M becomes available • Collect over many executions for good coverage • Represent trigger points with instruction attributes • PCs, memory write addresses
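One plausible encoding of a trigger point, following the attributes the slide names (PC plus memory write address); the struct layout and matching rule are assumptions for illustration, not the paper's exact mechanism.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical trigger-point encoding: a trigger fires when a committed
 * instruction at a given PC writes a given memory address. */
struct trigger {
    uint64_t pc;          /* PC of the committed store */
    uint64_t write_addr;  /* address that store must write */
};

/* Check one committed store against a registered trigger. */
bool trigger_fires(const struct trigger *t, uint64_t pc, uint64_t addr)
{
    return t->pc == pc && t->write_addr == addr;
}
```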

  13. Triggers Example • Same update_bb code as slide 8, with triggers T1 and T2 marked ahead of the call sites handled by H1 and H2 • Minimum of 400 cycles from trigger to call site; 90 cycles per execution

  14. Handlers Example … (2) • Stack references • Same update_bb code as slide 8, with T1 and T2 marked

  15. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  16. Hardware Support Outline • Support for triggers • Demultiplexed execution • Maintaining executions • Storage • Invalidation • Committing (some of these are dealt with in other speculative parallelization proposals)

  17. Support for Triggers • Triggers are registered with hardware • ISA extensions • Similar to debug watchpoints • Evaluation of triggers • Only by Committed instructions • PC, address • Fast lookup with filters

  18. Demultiplexed Execution • Hardware: a typical MP system (main processor P0 plus auxiliary processors P1–P3, each with a private cache) • Private cache holds speculative data • Extend each cache line with an "access" bit • Misses serviced by the main processor • No communication with other executions • On completion • Collect the read set (R): accessed lines • Collect the write set (W): dirty lines • Invalidate the write set in the cache
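The read/write-set collection step can be modeled in a few lines of C. This is a software sketch of the hardware behavior; the `cache_line` layout is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the private-cache extension: each line carries an
 * "access" bit and a dirty bit. */
struct cache_line { bool accessed, dirty; };

/* On completion of a demultiplexed execution, scan the private cache:
 * accessed lines form the read set R, dirty lines the write set W. */
void collect_sets(struct cache_line *c, size_t nlines,
                  size_t *read_set, size_t *nread,
                  size_t *write_set, size_t *nwrite)
{
    *nread = *nwrite = 0;
    for (size_t i = 0; i < nlines; i++) {
        if (c[i].accessed) read_set[(*nread)++] = i;   /* R: accessed lines */
        if (c[i].dirty)    write_set[(*nwrite)++] = i; /* W: dirty lines   */
    }
    /* the write-set lines are then invalidated in the cache */
    for (size_t i = 0; i < nlines; i++)
        if (c[i].dirty) c[i].accessed = c[i].dirty = false;
}
```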

  19. Execution Buffer Pool • Holds speculative executions • Each entry contains • Method and parameters • Read set and write set (<tag>, <data> pairs) • Return value • Alternatives • Use the cache instead: may be more efficient, similar to other proposals • Not the focus of this paper
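A possible C rendering of one buffer entry, matching the fields listed on the slide; the sizes, field names, and fixed-capacity arrays are assumptions (the evaluation assumes an unbounded pool).

```c
#include <stdint.h>

#define MAX_SET 32   /* arbitrary capacity for this sketch */

/* One cache line captured in a read or write set: <tag> plus <data>. */
struct set_entry { uint64_t tag; uint8_t data[64]; };

/* Hypothetical layout of one execution-buffer entry. */
struct eb_entry {
    uint64_t method;                     /* identifies M() */
    uint64_t params[8];                  /* parameters set up by the handler */
    struct set_entry read_set[MAX_SET];
    uint64_t nread;
    struct set_entry write_set[MAX_SET];
    uint64_t nwrite;
    uint64_t retval;                     /* return value of the execution */
    int valid;                           /* cleared on dependence violation */
};
```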

  20. Invalidating Executions • For a committed store address • Search the read and write sets • Invalidate matching executions
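A sketch of the invalidation check in C. The `exec` layout and the linear search are illustrative only; the slides note that real hardware would use filters for fast lookup.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical per-execution record: addresses in its read and write sets. */
struct exec {
    uint64_t read_set[8], write_set[8];
    size_t nread, nwrite;
    bool valid;
};

/* A committed store address is matched against every speculative
 * execution's read and write sets; any hit invalidates that execution. */
void invalidate_on_store(struct exec *pool, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].valid) continue;
        for (size_t j = 0; j < pool[i].nread; j++)
            if (pool[i].read_set[j] == addr) pool[i].valid = false;
        for (size_t j = 0; j < pool[i].nwrite; j++)
            if (pool[i].write_set[j] == addr) pool[i].valid = false;
    }
}
```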

  21. Using Executions • For a given call site • Search by method name and parameters • Get the write and read sets • Commit: if accessed by the program • Use: if accessed by another method (nested methods)
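The call-site lookup can be sketched similarly (hypothetical `execution` layout): a hit returns the buffered execution whose write set can then be committed, and a miss means the method runs normally.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

/* Hypothetical buffered execution: method id, parameters, validity. */
struct execution {
    uint64_t method;
    uint64_t params[4];
    bool valid;
};

/* At a call site, search the buffer for a valid execution of the same
 * method with the same parameters. */
struct execution *eb_lookup(struct execution *pool, size_t n,
                            uint64_t method, const uint64_t params[4])
{
    for (size_t i = 0; i < n; i++)
        if (pool[i].valid && pool[i].method == method &&
            memcmp(pool[i].params, params, 4 * sizeof(uint64_t)) == 0)
            return &pool[i];   /* hit: reuse this speculative execution */
    return NULL;               /* miss: execute the method normally */
}
```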

  22. Outline • Program Demultiplexing Overview • Program Demultiplexing Execution Model • Hardware Support • Evaluation

  23. Reaching Distant Parallelism • [Figure: fork point and call site of M(), with distances A and B marked]

  24. Performance evaluation • Performance benefits limited by • Methods in program • Handler implementation

  25. Summary of Other Results (refer to the paper) • Method sizes • 10s to 1000s of instructions; usually low 100s • Demultiplexed execution overheads • Common case 1.1x to 2.0x • Trigger points • 1 to 3; outliers exist (macro usage) • Handler length • 10 to 50 instructions on average • Cache lines • Read: ~20s; written: ~10s • Demultiplexed executions • Held for an average of 100s of cycles

  26. Conclusions • Method granularity • Exploits modularity in the program • Trigger and handler allow the "earliest" possible execution • Data-flow based • Unordered execution • Reaches distant parallelism • Orthogonal to other speculative parallelization techniques • Can be used to further speed up demultiplexed execution

  27. Backup

  28. Average trigger points per call site • Small set of trigger points for a given call site • Defines reachability from trigger to the call site

  29. Evaluation • Full-system execution-based simulator • Intel x86 ISA and Virtutech Simics • 4-wide out-of-order processors • 64K Level 1 caches (2 cycle), 1 MB Level 2 (12 cycle) • MSI coherence • Software toolchain • Modified gcc compiler and lancet tool • Debugging information, CFG, program dependence graph • Simulator-based memory profile • Generates triggers and handlers • No mis-speculations occur

  30. Reaching Distant Parallelism • A = cycles between fork and call site of M()

  31. Execution Buffer Entries • Storage requirements • Max case 284 KB • Minimize entries by better scheduling • Avg. cycles held, per benchmark: 900, 590, 70, 520, 413, 244, 160, 308

  32. Read and Write Set • [Chart: cache lines read and cache lines written, per benchmark]

  33. Demultiplexed Execution Overheads • Overheads due to the handler and to cache misses during demultiplexed execution • Common case between 1.1x and 2.0x • Small methods → high overheads • [Chart: execution time overhead per benchmark]

  34. Length of Handlers • Handler instruction count overhead, per benchmark: 14%, 10%, 9%, 100%, 16%, 4%, 40%, 4%

  35. Method sizes

  36. Methods • Runtime includes frequently called methods

                      crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
     Methods              24   16     9    8      12     10      11   11
     Call sites           59   27     9   84      26    106      20    —
     Exec. time (%)       85   90    51   30      55     92      88   99

  37. Loop-level Parallelization (Mitosis) • Unit: loop iterations • Live-ins from a p-slice • Similar to a handler • Fork instruction • Restricted: same basic-block level, same method • Program-order dependent • Ordered forking

  38. Method-level Parallelization • Unit: method continuations • The program after the method returns • Orthogonal to PD

  39. Reaching Distant Parallelism • B/A > 1 (%), per benchmark: crafty 60, gap 72, gzip 30, mcf 80, parser 70, twolf 40, vortex 63, vpr 47 • [Figure: intervals A and B around M1() and M2()]

  40. Reaching Distant Parallelism • B = call time to earliest execution time (1 outstanding) • C / B = R1 • C_no-params / C = R2 • [Figure: intervals A, B, C around M1() and M2()]

  41. Issues with the Stack • The stack pointer is position-dependent • The handler has to insert parameters at the right position • The same stack address can denote different variables • Affects triggers • Different stack pointers in the program and the execution • The stack may be discarded • Committing requires relocating stack results • Example: parameters passed by reference
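The relocation issue can be illustrated with a small C sketch (hypothetical layout: both stacks are modeled as byte arrays, and the by-reference result sits at the same offset from each stack pointer, but on different stacks).

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical illustration of stack-result relocation: the demultiplexed
 * execution ran with its own stack, so a result written through a
 * by-reference parameter sits at a different address than the program's
 * stack slot and must be copied over at commit time. */
void commit_stack_result(uint8_t *exec_stack, uint64_t exec_sp,
                         uint8_t *prog_stack, uint64_t prog_sp,
                         uint64_t offset, size_t size)
{
    /* same offset from the stack pointer, different stacks */
    memcpy(prog_stack + prog_sp + offset,
           exec_stack + exec_sp + offset, size);
}
```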

  42. Benchmarks • SPECint2000 benchmarks • C programs • Did not evaluate gcc, perl, bzip2, and eon • Written with no intention of creating concurrency • No specific or clean programming style • Many methods perform several tasks • May offer fewer opportunities

  43. Hardware System • Intel x86 simulation • Virtutech Simics based full-system, Bochs decoder • 4-processors at 3 GHz • Simple memory system • Micro-architecture model • 4-wide out of order without cracking into micro-ops • Branch predictors • 32K L1 (2-cycle), 1 MB L2 (12-cycle) • MSI, 15-cycle communication cache to cache • Infinite Execution buffer pool

  44. Software • Modified gcc-compiler tool chain and lancet tool • Extract from compiled binary • Debugging information • CFG, Program Dependence Graph • Software • Dynamic information from simulator • Generates handler, trigger for call site as encountered • Control-flow in handler not included [ongoing work] • Perfect control transfer from trigger to method • Handler doesn’t execute if a branch leads to not calling the method

  45. Generating Handlers • Cannot easily identify and demarcate handler code • Heuristic to demarcate • Terminate the slice when a load address is from the heap • The handler then has • Loads and stores to the stack • No stores to the heap • Limitation • A heuristic; doesn't always work
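The heap-termination heuristic can be sketched as a scan over the backward slice. The `op` record and the address-range heap test are stand-ins for the real binary analysis; this only illustrates the stopping rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* One operation in the backward slice, scanning back from the call. */
struct op { bool is_load, is_store; uint64_t addr; };

static bool in_heap(uint64_t addr, uint64_t heap_lo, uint64_t heap_hi)
{
    return addr >= heap_lo && addr < heap_hi;
}

/* Returns how many slice ops enter the handler: the slice stops growing at
 * the first heap access, so the handler keeps only stack references and
 * register/ALU work, and never stores to the heap. */
int handler_length(const struct op *slice, int n,
                   uint64_t heap_lo, uint64_t heap_hi)
{
    int len = 0;
    for (int i = 0; i < n; i++) {
        if ((slice[i].is_load || slice[i].is_store) &&
            in_heap(slice[i].addr, heap_lo, heap_hi))
            break;   /* heuristic: terminate at a heap access */
        len++;       /* stack loads/stores and ALU ops stay in the handler */
    }
    return len;
}
```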

  46. Generating Handlers • 1: Specify parameters to the method • Pushed onto the stack by the program • Introduces a dependency that prevents separation • 2: Compute the parameters • The program does this near the call site • Need to identify that code • Must deal with • Use of the stack • Control flow • Inter-method dependence

     1: G = F (N)
     2: if (…)
     3:   X = G + 2
     4: else
     5:   X = G * 2
     6: M (X)

  47. Control-flow in Handlers • Depends on the call site's control flow • Example: handler for D, with the call site in C() at BB 3 • Include the loop: BB 4 back to BB 1 • Include the branch in BB 1 • Inclusion depends on the trigger • Multiple iterations, different triggers • Ongoing work • [Figure: CFG of C with basic blocks 1–4, and the call-graph edge C → D]

  48. Other Dependencies in Handlers • C calls D; A or B calls C • The dependence (X) extends up the call graph: A(X) or B(X) → C(X) → D(X) • May need multiple handlers if there are multiple call sites

  49. Buffering Handler Writes • General case • Writes in the handler must be buffered • Provided to the execution • Discarded after the execution • Current implementation • Only stack writes

  50. Methods for Speculative Execution • Well encapsulated • Defined by parameters and return value • Stack for local computation, heap for global state • Often perform specific tasks • Access limited global state, limiting side-effects
