
Compiling with multicore

Compiling with multicore. Jeehyung Lee, 15-745, Spring 2009. Papers: Automatic Thread Extraction with Decoupled Software Pipelining (fully automatic, fine-grained pipelining); A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs (semi-automatic).


Presentation Transcript


  1. Compiling with multicore Jeehyung Lee 15-745 Spring 2009

  2. Papers • Automatic Thread Extraction with Decoupled Software Pipelining • Fully automatic • Fine grained pipelining • A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs • Semi-automatic • Coarse grained pipelining

  3. First paper • Automatic Thread Extraction with Decoupled Software Pipelining • Guilherme Ottoni, Ram Rangan, Adam Stoler and David August • From Princeton University

  4. What is the paper about? • Despite the increasing use of multiprocessors, many single-threaded applications do not benefit • Let the compiler automatically extract threads and exploit lurking pipeline parallelism • Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)

  5. Why decoupled pipelining? • Example: linked list traversal

  6. Why decoupled pipelining? • DOACROSS • Cost: iterations * (LD latency + communication latency)

  7. Why decoupled pipelining? • DSWP • One-way pipelining • Cost: iterations * LD latency
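
The two cost expressions on slides 6 and 7 can be put side by side. This is a rough model rather than a formula from the paper: the symbols $N$ (iteration count), $L_{LD}$ (load latency), and $L_{comm}$ (inter-core communication latency) are introduced here, and the sample numbers are hypothetical.

```latex
\begin{align*}
T_{\text{DOACROSS}} &\approx N \,(L_{LD} + L_{comm}) \\
T_{\text{DSWP}}     &\approx N \, L_{LD} \\[4pt]
\text{e.g. } N = 1000,\ L_{LD} = 3,\ L_{comm} = 10:\quad
T_{\text{DOACROSS}} &\approx 1000 \times 13 = 13000 \text{ cycles} \\
T_{\text{DSWP}}     &\approx 1000 \times 3 = 3000 \text{ cycles}
\end{align*}
```

Under this model the communication latency sits on the critical path of every DOACROSS iteration, while in DSWP it is taken off the per-iteration path because data flows in one direction only.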

  8. DSWP • Flow of data (dependences) is acyclic among cores • With inter-core queues, threads can be decoupled • Efficiency + high tolerance for latency
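
The decoupling idea can be sketched with two threads and a queue. This is only an illustration of the structure, not the paper's implementation: the `Node` class, the function names, and the queue size are all made up here, and Python's `queue.Queue` stands in for the inter-core queue. Because of Python's global interpreter lock, the sketch shows the data flow but will not show a real speedup.

```python
import queue
import threading

class Node:
    """Hypothetical singly linked list node."""
    def __init__(self, val, next=None):
        self.val = val
        self.next = next

def build_list(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

SENTINEL = None  # marks the end of the stream

def traverse(head, q):
    # Stage 1: the pointer-chasing recurrence (node = node.next).
    # It never waits on stage 2 unless the queue is full.
    node = head
    while node is not None:
        q.put(node)
        node = node.next
    q.put(SENTINEL)

def work(q, results):
    # Stage 2: the loop body, consuming nodes produced by stage 1.
    while True:
        node = q.get()
        if node is SENTINEL:
            break
        node.val += 1
        results.append(node.val)

def run_pipeline(values):
    head = build_list(values)
    q = queue.Queue(maxsize=32)  # bounded buffer between the stages
    results = []
    t1 = threading.Thread(target=traverse, args=(head, q))
    t2 = threading.Thread(target=work, args=(q, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Data flows only from `traverse` to `work`, so the dependence between the two threads is acyclic and the traversal thread can run ahead by up to the queue capacity.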

  9. DSWP Algorithm • Build dependence graph • Find strongly connected components (SCCs) • Create DAG of SCCs • Partition DAG • Split code into partitions • Add flows to partitions

  10. Build dependence graph Include every traditional dependence (data, control, and memory) & extensions

  11. Find SCC • SCC: instructions that form a dependence cycle in a loop • Instructions in an SCC cannot be parallelized

  12. Create DAG of SCCs • Merge the instructions within each SCC into one node and update the dependence arrows
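
The SCC and DAG steps can be sketched as follows. The code uses Tarjan's algorithm, which is one standard way to find SCCs; the slides do not say which algorithm the paper uses. The dependence graph at the bottom is a hypothetical stand-in for the linked list loop, with made-up instruction names.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps each node to a list of successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop its members off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

def condense(graph):
    """Merge each SCC into one node and keep only edges between SCCs."""
    sccs = strongly_connected_components(graph)
    owner = {v: comp for comp in sccs for v in comp}
    dag = {comp: set() for comp in sccs}
    for v, succs in graph.items():
        for w in succs:
            if owner[v] != owner[w]:
                dag[owner[v]].add(owner[w])
    return dag

# Hypothetical dependence graph: "next" and "test" form the loop recurrence.
example_graph = {
    "next": ["test", "load"],
    "test": ["next"],
    "load": ["add"],
    "add": ["store"],
    "store": [],
}
example_dag = condense(example_graph)
```

Here `next` and `test` depend on each other and collapse into a single DAG node, while `load`, `add`, and `store` each remain their own node.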

  13. Partition DAG • Partition DAG nodes into n partitions (n <= # of processors) • Use a heuristic to maximize load balance • Decide the # of partitions (threads) • Start filling partition 1 with nodes from the top of the DAG • When the partition is full (estimated by # of cycles), move on to the next partition • Find the best # of threads and its partitioning
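
A minimal sketch of the fill-in heuristic described on this slide. It assumes the DAG nodes are already listed in topological order with an estimated cycle count each, treats a partition as full once it would exceed an even share of the total, and picks the thread count with the smallest bottleneck stage. Those choices, the function names, and the sample costs are assumptions made here; the paper's actual heuristic may weigh other factors such as communication cost.

```python
def greedy_partition(nodes, n_threads):
    """nodes: list of (name, est_cycles) in topological order."""
    total = sum(cycles for _, cycles in nodes)
    target = total / n_threads  # even share of the work per partition
    partitions = [[]]
    load = 0
    for name, cycles in nodes:
        is_last = len(partitions) == n_threads
        # Open a new partition when the current one would overflow.
        if partitions[-1] and not is_last and load + cycles > target:
            partitions.append([])
            load = 0
        partitions[-1].append(name)
        load += cycles
    return partitions

def bottleneck(partitions, cost):
    """Cycle count of the slowest stage, which bounds pipeline throughput."""
    return max(sum(cost[n] for n in part) for part in partitions)

def best_partition(nodes, max_threads):
    """Try every thread count and keep the one with the smallest bottleneck."""
    cost = dict(nodes)
    candidates = [greedy_partition(nodes, n) for n in range(1, max_threads + 1)]
    # Ties are broken in favour of fewer threads.
    return min(candidates, key=lambda p: (bottleneck(p, cost), len(p)))

# Hypothetical SCC nodes with estimated cycle counts.
example_nodes = [("A", 4), ("B", 3), ("C", 2), ("D", 5)]
example_result = best_partition(example_nodes, 3)
```

With these sample costs, two partitions of 7 cycles each are as good as any three-way split, so the sketch settles on two threads.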

  14. Split code and insert flows (done!) • For each partition, insert the basic blocks relevant to the SCC nodes it contains • Add code for the dependence flows

  15. Result • 19.4% speedup on important benchmark loops, 9.2% overall • When core bandwidth is halved • Single-threaded code slows down by 17.1% • DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core • Promising enabler for Thread-Level Parallelism (TLP)?

  16. Second Paper • A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs • William Thies, Vikram Chandrasekhar and Saman Amarasinghe • From MIT

  17. What is the paper about? • Despite the increasing use of multiprocessors, many single-threaded… (Repeated) • Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code • Let people define the pipeline, and learn practical dependences at runtime

  18. What is the paper about? • Despite the increasing use of multiprocessors, many single-threaded… (Repeated) • Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code • Let people define stages, and learn practical dependences at runtime …for streaming applications

  19. Interface • Add annotations in the body of the top-level loop

  20. Dynamic analysis • The system creates a stream graph according to annotations. How do they find dependencies?

  21. Dynamic analysis • Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages

  22. Dynamic analysis • Run the application on training examples, and record every relevant store-load pair across pipeline boundaries • This gives us the practical dependences
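
The recording step can be illustrated with a toy tracer. This is a sketch of the idea only, not the paper's tooling: `TracedMemory`, the stage names, and the three-stage training loop are all hypothetical, and a real system would instrument actual loads and stores rather than calls to a Python class.

```python
class TracedMemory:
    """Toy memory that remembers which stage last stored to each address."""
    def __init__(self):
        self.data = {}
        self.last_writer = {}
        self.stage = None
        self.dependences = set()

    def enter_stage(self, name):
        self.stage = name

    def store(self, addr, value):
        self.data[addr] = value
        self.last_writer[addr] = self.stage

    def load(self, addr):
        writer = self.last_writer.get(addr)
        # A load of data stored by another stage crosses a pipeline boundary.
        if writer is not None and writer != self.stage:
            self.dependences.add((writer, self.stage, addr))
        return self.data[addr]

def run_training(inputs):
    """Run a hypothetical three-stage loop body over the training inputs."""
    mem = TracedMemory()
    for x in inputs:
        mem.enter_stage("decode")
        mem.store("sample", x * 2)
        mem.enter_stage("filter")
        mem.store("filtered", mem.load("sample") + 1)
        mem.enter_stage("output")
        mem.store("result", mem.load("filtered"))
    return mem.dependences
```

The recorded set contains only dependences that actually occurred on the training inputs, which is why the approach relies on the stable dataflow of streaming applications.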

  23. Interface • Program shows a complete stream graph • User decides whether he/she likes this pipelining or not • If yes, done! • Else, redo the annotations and iterate until satisfied

  24. Actual pipelining • When compiled, annotation macros emit code that forks the original program for each pipeline stage

  25. Result • Average 2.78x speedup, max 3.89x on 4-core • Seems unsound but practical (?)
