
Towards a Toolchain for Pipeline-Parallel Programming on CMPs



Presentation Transcript


  1. Towards a Toolchain for Pipeline-Parallel Programming on CMPs Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado at Boulder 2007.03.11 John Giacomoni

  2. Problem • Uniprocessor (UP) performance at “end of life” • Chip-Multiprocessor systems • Individual cores less powerful than a UP • Asymmetric and heterogeneous • 10s-100s-1000s of cores • How to program? (Figures: Intel 2x2-core, MIT RAW 16-core, 100-core, 400-core)

  3. Programmers… • Programmers are: • Bad at explicitly parallel programming • “Better” at sequential programming • Solutions? • Hide parallelism • Compilers • Sequential libraries? • Math, iteration, searching, and ??? Routines

  4. Using Multi-Core • Task Parallelism • Desktop • Data Parallelism • Web serving • Split/Join, MapReduce, etc… • Pipeline Parallelism • Video Decoding • Network Processing

  5. Joining the Minority Chorus We believe that the best strategy for developing parallel programs may be to evolve them from sequential implementations. Therefore we need a toolchain that assists programmers in converting sequential programs into parallel ones. This toolchain will need to support all four conversion stages: identification, implementation, verification, and runtime system support.

  6. The Toolchain • Identification • LoopProf and LoopSampler • ParaMeter • Implementation • Concurrent Threaded Pipelining • Verification • Runtime system support

  7. LoopProf / LoopSampler • Thread-level parallelism benefits from coarse-grain information • Not provided by gprof, et al. • Visualize relationship between functions and hot loops • No recompilation • LoopSampler is effectively overhead free

  8. Partial Loop/Call Graph • Boxes are functions; ovals are loops

  9. ParaMeter • Dynamic Instruction Number vs. Ready Time graph • Visualize dependence chains • Fast random access of trace information • Compact representation • Trace Slicing • Moving forward or backwards in a trace based on a flow (control, dependences, etc) • Requires information from disparate trace locations • Variable Liveness Analysis

  10. DIN vs. Ready Time

  DIN  Instruction      Mem Addr  Ready Time
  14   mov r3,r4                  14
  13   st 0(r1),r5      0xbee4    13
  12   add r5,r3,r4               12
  11   bnz r2,loop                11
  10   addi r1,r1,4               10
  9    addi r2,r2,-1              9
  8    mov r4,r5                  8
  7    mov r3,r4                  7
  6    st 0(r1),r5      0xbee0    6
  5    add r5,r3,r4               5
  4    mov r3,1                   4
  3    mov r4,1                   3
  2    mov r2,8                   2
  1    li r1,&fibarr              1

  11. Multiple Dependence Chains • DIN vs. Ready Time plot for 254.gap (IA64, gcc, inf)

  12. Handling the Information Glut • Challenges: trace size, trace complexity, need for fast random access • Solution: Binary Decision Diagrams • Compression ratios: 16-60x • 10^9 instructions in 1 GB

  13. Implementation • Well researched • Task-Parallel • Data-Parallel • More work to be done • Pipeline-Parallel • Concurrent Threaded Pipelining • FastForward • DSWP • Stream languages • StreamIt

  14. Concurrent Threaded Pipelining • Pipeline-Parallel organization • Each stage bound to a processor • Sequential data flow • Data hazards are a problem • Software solution • FastForward

  15. Threaded Pipelining • Figure: concurrent vs. sequential organization

  16. Related Work • Software architectures • Click, SEDA, Unix pipes, sysv queues, etc… • Locking queues take >= 1200 cycles (600 ns) • Additional overhead for cross-domain communication • Compiler extracted pipelines • Decoupled Software Pipelining (DSWP) • Modified IMPACT compiler • Communication operations <= 100 cycles (50 ns) • Assumes hardware queues • Decoupled Access/Execute Architectures

  17. FastForward • Portable, software-only framework • ~70-80 cycles (35-40 ns) per queue operation • Core-core & die-die • Architecturally tuned CLF queues • Works with all consistency models • Temporal slipping & prefetching to hide die-die communication • Cross-domain communication • Kernel/Process/Thread

  18. Network Scenario: FShm • How do we protect? • GigE Network Properties: • 1,488,095 frames/sec • 672 ns/frame • Frame dependencies

  19. Verification • Characterize run-time behavior with static analysis • Test generation • Code verification • Post-mortem root-fault analysis • Identify the frontier of states leading to an observed fault • Use formal methods to find fault-lines

  20. RuntimeSystem Support • Hardware virtualization • Asymmetric and heterogeneous cores • Cores may not share main memory (GPU) • Pipelined OS services • Pipelines may cross process domains • FShm • Each domain should keep its private memory • Protection • Need label for each pipeline • Co/gang-scheduling of pipelines

  21. Questions?
