
Potential of Dynamic Binary Parallelization






Presentation Transcript


  1. Potential of Dynamic Binary Parallelization Jing Yang, Kevin Skadron, Mary Lou Soffa, and Kamin Whitehouse Department of Computer Science University of Virginia UCAS-7, February 26, New Orleans, Louisiana

  2. Why Automatic Parallelization? • Bridge the gap between parallel hardware and sequential software • Manual parallelization • Typically yields the best speedups • Time-consuming • Error-prone: data races and memory-consistency complexities • Difficult to understand or refactor for parallelization

  3. Why Dynamic Binary Parallelization? • Source code is sometimes unavailable • Legacy software • Third-party software • Y2K crisis: up to 60% of source code was missing • Programs are assembled and defined at run time • Shared libraries, virtual functions, plugins, and dynamically generated code • Components written in different languages • Exploit runtime information

  4. Trace-Based Dynamic Binary Parallelization • State of the art • Distributed superscalar design • Dynamic CFG transformation • Instruction window size vs. spurious dependencies • Combine the best of both worlds • Long traces: large instruction window • Atomic execution: no control dependencies • High speculation accuracy: low rollback overhead • High execution coverage: Amdahl's Law
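Amdahl's Law is why execution coverage matters: whatever fraction of the run stays sequential bounds the overall speedup, no matter how many cores execute the parallelized traces. A minimal illustration in Python (the function name is ours, not from the talk):

```python
def amdahl_speedup(coverage: float, cores: int) -> float:
    """Overall speedup when a fraction `coverage` of execution is
    parallelized across `cores` cores (Amdahl's Law)."""
    return 1.0 / ((1.0 - coverage) + coverage / cores)

# Even with effectively unlimited cores, 80% coverage caps speedup at 5x.
print(amdahl_speedup(0.8, 10**9))  # ~5.0
```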

  5. Conceptual Overview of T-DBP [Figure: flowchart of the T-DBP pipeline. Core 1 runs the sequential execution and repeatedly predicts the next candidate trace; on a successful prediction the trace is dispatched for parallelized execution on Cores 2-7, otherwise it is skipped and sequential execution continues. A dispatched trace either succeeds or aborts back to sequential execution.]

  6. Evaluation of T-DBP Prototype • Is there room for further improvement? • How does runtime information help? • Cross boundaries between application and library code! • Only respect dependencies on the actual execution path!

  7. Limit Study Setup • SPEC CPU2000: test input • Unlimited number of cores • Perfect speculation accuracy • Always identify the most frequently repeating patterns of instructions

  8. Limit Study Process • Record execution sequences • Analyze execution sequences → traces • Parallelize execution sequences • Model parallel execution time • Verify parallel execution sequences

  9. Record Execution Sequences • Dynamic binary instrumentation • Basic block: execution sequence • Effective address of loads and stores: memory disambiguation • Values of loads: deterministic replay • Reduce overhead • Double buffering: time • VPC3 compression algorithm: disk space
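The double-buffering idea on this slide can be sketched as follows. This is a hypothetical, single-threaded simplification (class and parameter names are ours); a real recorder would hand each full buffer to a background worker that compresses it with VPC3 and writes it to disk while instrumentation keeps appending to the other buffer:

```python
class TraceRecorder:
    """Double-buffered trace recording sketch: instrumentation appends
    into one buffer while the other is flushed, so flushing cost can be
    hidden behind continued execution."""

    def __init__(self, capacity, flush):
        self.bufs = [[], []]
        self.active = 0
        self.capacity = capacity
        self.flush = flush  # e.g. compress + write to disk

    def record(self, event):
        self.bufs[self.active].append(event)
        if len(self.bufs[self.active]) == self.capacity:
            full = self.bufs[self.active]
            self.bufs[self.active] = []
            self.active ^= 1          # switch to the other buffer
            self.flush(full)          # real recorder: background thread

flushed = []
rec = TraceRecorder(capacity=2, flush=flushed.append)
for event in range(5):
    rec.record(event)
# flushed now holds the two full buffers: [[0, 1], [2, 3]]
```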

  10. Analyze Execution Sequences • Offline dictionary-based algorithm How to emulate the handicap of static parallelization? Only combine adjacent basic blocks if both of them belong to application code or both of them belong to library code !
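One way to picture the offline dictionary-based analysis, together with the application/library restriction used to emulate static parallelization, is a frequency count over basic-block n-grams. This is a hypothetical sketch (function and variable names are ours), not the actual algorithm used in the study:

```python
from collections import Counter

def candidate_traces(blocks, max_len=4, same_module_only=True):
    """blocks: list of (block_id, module) pairs in execution order.
    Returns basic-block subsequences ranked by frequency.  When
    `same_module_only` is set, a subsequence may not mix application
    and library blocks, emulating the handicap of static
    parallelization described on the slide."""
    counts = Counter()
    for i in range(len(blocks)):
        for n in range(2, max_len + 1):
            window = blocks[i:i + n]
            if len(window) < n:
                break  # ran off the end of the recorded sequence
            if same_module_only and len({m for _, m in window}) > 1:
                break  # crossed the application/library boundary
            counts[tuple(b for b, _ in window)] += 1
    return counts.most_common()

seq = [("A", "app"), ("B", "app"), ("L1", "lib")] * 2
print(candidate_traces(seq)[0])  # (('A', 'B'), 2)
```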

  11. Parallelize Execution Sequences • Dynamic critical path scheduling algorithm • Build the dependency graph • Pick the next ready instruction with the smallest value of ALST – AEST • Schedule the instruction so that it does not delay the ALST of all scheduled instructions • Continue if not all instructions are scheduled
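The steps above can be sketched as follows, assuming one-cycle instructions: AEST (absolute earliest start time) and ALST (absolute latest start time) are computed over the dependency graph, and instructions are prioritized by their slack, ALST - AEST (zero slack means the instruction lies on the critical path). This is a simplified, hypothetical rendering of dynamic critical path scheduling, not the study's implementation:

```python
def dcp_order(deps):
    """deps: dict mapping each instruction to the list of instructions
    it depends on.  Returns a slack-ordered priority list and the
    critical-path execution time, with one cycle per instruction."""
    aest = {}
    def earliest(i):
        if i not in aest:
            aest[i] = max((earliest(p) + 1 for p in deps[i]), default=0)
        return aest[i]
    for i in deps:
        earliest(i)
    length = max(aest.values()) + 1   # whole-trace execution time

    succs = {i: [] for i in deps}
    for i in deps:
        for p in deps[i]:
            succs[p].append(i)
    alst = {}
    def latest(i):
        if i not in alst:
            alst[i] = min((latest(s) - 1 for s in succs[i]),
                          default=length - 1)
        return alst[i]
    for i in deps:
        latest(i)

    # Least slack first; ties broken by earliest start time.
    order = sorted(deps, key=lambda i: (alst[i] - aest[i], aest[i]))
    return order, length

# Two independent chains I1->I2 and I3->I4 overlap: 2 cycles total.
order, length = dcp_order({"I1": [], "I2": ["I1"],
                           "I3": [], "I4": ["I3"]})
print(order, length)
```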

  12. How to Emulate the Handicap of Static Parallelization? [Figure: (a) a simple CFG containing the instructions I1: R1 = R4, I2: R0 = R1, I3: R0 = R2, I4: R3 = R0, and I5: R2 = 2; (b) parallelization on the CFG takes 3 clock cycles; (c) parallelization on the trace takes 2 clock cycles.]
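The point of the figure is that on the CFG, I4's read of R0 may be satisfied by either I2 or I3 depending on the path taken, so both chains constrain it; on the recorded trace, only the executed definition of R0 reaches I4. A small sketch that reproduces the 3-cycle vs. 2-cycle counts under the one-cycle-per-instruction model (the dependency encodings below are our reading of the figure):

```python
def cycles(deps):
    """Critical-path length in cycles for a dependency graph, with
    each instruction taking one cycle."""
    memo = {}
    def finish(i):
        if i not in memo:
            memo[i] = 1 + max((finish(p) for p in deps[i]), default=0)
        return memo[i]
    return max(finish(i) for i in deps)

# On the CFG, I4 may read R0 from either I2 or I3, so both chains
# must complete first: the chain I1 -> I2 -> I4 takes 3 cycles.
cfg = {"I1": [], "I2": ["I1"], "I3": [], "I4": ["I2", "I3"]}
# On the trace, only the executed definition (I3) feeds I4, so the
# chains I1 -> I2 and I3 -> I4 run in parallel.
trace = {"I1": [], "I2": ["I1"], "I3": [], "I4": ["I3"]}
print(cycles(cfg), cycles(trace))  # 3 2
```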

  13. Model Parallel Execution Time • Instruction: one clock cycle • Pipelining • Inter-core synchronization: one clock cycle • Operand network • Synchronization array • Execution time of a parallelized trace • Maximum AEST of all instructions + one

  14. Verify Parallel Execution Sequences • Link into a single executable • Basic blocks • Traces: one possibility of linearization • Load into the original address space • Replay on a real machine

  15. Experimental Configurations • T-DBP: unconstrained • T-DBP-1: does not cross boundaries between application and library code • T-DBP-2: does not cross boundaries between application and library code; respects all true dependencies in the CFG

  16. Results of Integer Benchmarks [Chart: speedups on the SPEC CPU2000 integer benchmarks; the three configurations (T-DBP, T-DBP-1, T-DBP-2) reach 9.19, 6.56, and 4.52, respectively.]

  17. Results of Floating Point Benchmarks [Chart: speedups on the SPEC CPU2000 floating-point benchmarks; the three configurations (T-DBP, T-DBP-1, T-DBP-2) reach 22.35, 17.12, and 9.36, respectively.]

  18. Conclusion • There is much room for further improvement • Runtime information helps a lot!
