
Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q

This presentation discusses the acceleration of the asynchronous message-driven programming paradigm on the IBM Blue Gene/Q machine. It covers the Charm++ programming model, the Blue Gene/Q architecture, optimizations of the Charm++ runtime, and performance results.


Presentation Transcript


  1. Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q Sameer Kumar* IBM T J Watson Research Center, Yorktown Heights, NY Yanhua Sun, Laxmikant Kale Department of Computer Science, University of Illinois at Urbana-Champaign IPDPS

  2. Overview • Charm++ programming model • Blue Gene/Q machine • Programming models and messaging libraries on Blue Gene/Q • Optimization of Charm++ on BG/Q • Performance results • Summary IPDPS

  3. Charm++ Programming Model • Asynchronous message-driven programming • Users decompose the problem (over-decomposition) • Intelligent runtime: task-to-processor mapping, communication load balancing, fault tolerance • Overlap of computation and communication via asynchronous communication • Execution driven by available message data IPDPS
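To make the message-driven execution model concrete, here is a minimal, self-contained C++ sketch (an illustrative analogue, not Charm++ itself; the Message and Scheduler names are hypothetical) in which progress is driven entirely by which messages are available in a queue rather than by a fixed control flow:

```cpp
#include <cstdio>
#include <deque>
#include <functional>

// Each message names the handler (entry method) to run when it is delivered.
struct Message {
    std::function<void()> handler;
};

struct Scheduler {
    std::deque<Message> queue;          // messages that have arrived and are ready

    void send(std::function<void()> h) { queue.push_back({std::move(h)}); }

    // The scheduler loop: progress is driven entirely by available messages,
    // not by a fixed sequence of blocking sends and receives.
    void run() {
        while (!queue.empty()) {
            Message m = std::move(queue.front());
            queue.pop_front();
            m.handler();                // execute the entry method for this message
        }
    }
};

int main() {
    Scheduler sched;
    sched.send([] { std::printf("work unit A executed\n"); });
    sched.send([] { std::printf("work unit B executed\n"); });
    sched.run();
    return 0;
}
```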

  4. Charm++ Runtime System • Non-SMP mode • One process per hardware thread • Each process has a separate Charm++ scheduler • SMP mode • A single process (or a few processes) per network node • Multiple threads executing Charm++ schedulers in the same address space • Lower space overheads, as read-only data structures are not replicated • Communication threads can drive network progress • Communication within the node via pointer exchange IPDPS

  5. Blue Gene/Q IPDPS

  6. Blue Gene/Q Architecture • Network: integrated, scalable 5D torus with virtual cut-through routing, hardware assists for collective and barrier functions, FP addition support in the network, RDMA, and an integrated on-chip Message Unit with 272 concurrent endpoints; 2 GB/s raw bandwidth on all 10 links in each direction (i.e., 4 GB/s bidirectional), 1.8 GB/s user bandwidth after protocol overhead; 5D nearest-neighbor exchange measured at 1.76 GB/s per link (98% efficiency) • Processor architecture: implements 64-bit PowerISA v2.06 at 1.6 GHz @ 0.8 V, 4-way simultaneous multi-threading, quad FPU, 2-way concurrent issue, in-order execution with dynamic branch prediction • Node architecture: large multi-core SMPs with 64 threads/node and relatively little memory per thread (16 GB per node shared by 64 threads) IPDPS

  7. New Hardware Features • Scalable L2 atomics • Atomic operations can be invoked on 64-bit words in DDR • Several operations supported, including load-increment, store-add, store-XOR, etc. • Bounded atomics supported • Wait on pin • A thread can arm a wakeup unit and go to wait • Core resources such as load/store pipeline slots and arithmetic units are not used while waiting • The thread is awakened by • A network packet • A store to a memory location that results in an L2 invalidate • An inter-processor interrupt (IPI) IPDPS
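As an illustrative analogue of the wait-on-pin mechanism (not the BG/Q wakeup-unit SPI, which is hardware specific), the C++20 sketch below parks a thread on a memory word and wakes it with a store plus a notify, similar in spirit to a thread being awakened by an L2 invalidate, a network packet, or an IPI:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for a BG/Q wakeup unit: a thread waits on a memory
// word instead of spinning, and is woken when another thread writes to it.
std::atomic<int> wakeup_word{0};

void waiting_thread() {
    // Park until the word changes from 0; no load/store pipeline slots or
    // arithmetic units are burned while blocked (unlike a spin loop).
    wakeup_word.wait(0);
    std::printf("woken by store to watched word: %d\n", wakeup_word.load());
}

int main() {
    std::thread t(waiting_thread);
    // Simulate the wakeup event (network packet arrival, IPI) with a store + notify.
    wakeup_word.store(1);
    wakeup_word.notify_one();
    t.join();
    return 0;
}
```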

  8. PAMI Messaging Library on BG/Q • Software stack: applications and middleware (MPICH2 2.x and IBM MPI 2.x via the ADI/MPCI layers; CAF, X10, UPC, GA/ARMCI, Charm++, GASNet, and APGAS runtimes) sit above the PAMI API; platform-specific messaging implementations for BG/Q, Intel x86, and PERCS are built on the system software (MU SPI, HAL API) • PAMI: Parallel Active Messaging Interface IPDPS

  9. Point-to-point Operations • Active messages • A registered dispatch handler is called on the remote node • PAMI_Send_immediate for short transfers • PAMI_Send • One-sided remote DMA • PAMI_Get, PAMI_Put: the application initiates RDMA with a remote virtual address • PAMI_Rget, PAMI_Rput: the application first exchanges memory regions before starting the RDMA transfer IPDPS
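The sketch below illustrates the eager-versus-rendezvous choice these calls expose, using hypothetical send_eager, exchange_region, and rdma_get helpers rather than the real PAMI signatures: short payloads travel with the header (as with PAMI_Send_immediate), while large transfers first exchange a registered memory region and then pull the data one-sidedly (as with PAMI_Rget):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical registered memory region exchanged before an RDMA transfer.
struct MemRegion { const char* base; std::size_t len; };

// Stub "transport" calls: the real PAMI functions take context, endpoint, and
// descriptor arguments that are omitted in this sketch.
void send_eager(const char* data, std::size_t len) {
    (void)data;                                    // payload would travel with the header
    std::printf("eager send of %zu bytes\n", len);
}
MemRegion exchange_region(const char* data, std::size_t len) {
    std::printf("exchanging a %zu-byte memory region\n", len);
    return {data, len};
}
void rdma_get(const MemRegion& r, char* dst) {
    std::memcpy(dst, r.base, r.len);               // one-sided pull from the exposed region
    std::printf("RDMA get of %zu bytes complete\n", r.len);
}

// Pick the protocol by message size, as a PAMI-based runtime would.
void send_message(const char* data, std::size_t len, char* recv_buf) {
    const std::size_t kEagerLimit = 128;           // illustrative threshold, not a PAMI constant
    if (len <= kEagerLimit) {
        send_eager(data, len);                     // cf. PAMI_Send_immediate
    } else {
        MemRegion r = exchange_region(data, len);  // rendezvous step, cf. PAMI_Rget/PAMI_Rput
        rdma_get(r, recv_buf);
    }
}

int main() {
    std::vector<char> big(4096, 'x'), dst(4096);
    send_message("hello", 5, dst.data());
    send_message(big.data(), big.size(), dst.data());
    return 0;
}
```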

  10. Multi-threading in PAMI • Multi-context communication • Gives several threads on a multi-core architecture concurrent access to the network • Eliminates contention for shared resources • Enables parallel send and receive operations on different contexts via different BG/Q injection and reception FIFOs • Endpoint addressing scheme • Communication is between network endpoints, not processes, threads, or tasks • Multiple contexts progressed by multiple communication threads • Communication threads on BG/Q wait on pin • L2 writes or network packets can awaken communication threads with very low overhead • Work is posted to PAMI contexts via PAMI_Context_post • Posted work goes into a concurrent L2 atomic queue • Work functions are advanced by the main or communication threads IPDPS
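A hedged sketch of the post-work pattern follows (illustrative only; it mirrors the shape of PAMI_Context_post and PAMI_Context_advance but uses a mutex-protected deque instead of a concurrent L2 atomic queue): a worker thread enqueues a work function onto a per-context queue, and whichever thread advances that context executes it:

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative per-context work queue; the real runtime posts to a concurrent
// L2 atomic queue rather than a mutex-protected deque.
struct Context {
    std::mutex m;
    std::deque<std::function<void()>> work;

    void post(std::function<void()> fn) {          // cf. PAMI_Context_post
        std::lock_guard<std::mutex> g(m);
        work.push_back(std::move(fn));
    }
    void advance() {                               // cf. PAMI_Context_advance
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> g(m);
            batch.swap(work);
        }
        for (auto& fn : batch) fn();               // run the posted work functions
    }
};

int main() {
    Context ctx;
    // A worker thread posts a send as a work function...
    ctx.post([] { std::printf("send issued from a posted work function\n"); });
    // ...and the thread advancing the context (here a communication thread) executes it.
    std::thread comm([&] { ctx.advance(); });
    comm.join();
    return 0;
}
```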

  11. Charm++ Port over PAMI on BG/Q IPDPS

  12. Charm++ Port and Optimizations • Ported the Converse machine interface to make PAMI API calls • Explored various optimizations • Lockless queues • Scalable memory allocation • Concurrent communication • Allocating multiple PAMI contexts • Multiple communication threads driving multiple PAMI contexts • Optimizing short messages • Many-to-many (CmiDirectManytomany) IPDPS

  13. Lockless Queues • Concurrent producer-consumer array-based queues built on L2 atomic increments • An overflow queue is used when the L2 queue is full • Threads in the same process can send messages via concurrent enqueues IPDPS
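A minimal sketch of this design, assuming std::atomic compare-and-swap in place of the BG/Q bounded L2 fetch-and-increment: multiple producer threads enqueue message pointers into a fixed-size ring, a single consumer (the scheduler thread) dequeues them, and pushes that find the ring full spill into a lock-protected overflow queue:

```cpp
#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>

// Multi-producer / single-consumer bounded queue with an overflow list,
// mirroring the slide's design; capacity and types are illustrative.
class LocklessQueue {
    static constexpr unsigned long kCap = 1024;    // ring capacity
    std::atomic<void*> slots[kCap];                // message pointers, nullptr = empty
    std::atomic<unsigned long> head{0}, tail{0};
    std::mutex overflow_m;
    std::deque<void*> overflow;                    // used only when the ring is full
public:
    LocklessQueue() { for (auto& s : slots) s.store(nullptr, std::memory_order_relaxed); }

    void push(void* msg) {                         // may be called by any producer thread
        unsigned long t = tail.load(std::memory_order_relaxed);
        for (;;) {
            if (t - head.load(std::memory_order_acquire) >= kCap) {
                std::lock_guard<std::mutex> g(overflow_m);
                overflow.push_back(msg);           // ring full: spill to the overflow queue
                return;
            }
            // CAS stands in for the bounded L2 fetch-and-increment on BG/Q.
            if (tail.compare_exchange_weak(t, t + 1, std::memory_order_acq_rel)) break;
        }
        slots[t % kCap].store(msg, std::memory_order_release);
    }

    void* pop() {                                  // called only by the consumer thread
        unsigned long h = head.load(std::memory_order_relaxed);
        if (h != tail.load(std::memory_order_acquire)) {
            void* msg = slots[h % kCap].exchange(nullptr, std::memory_order_acquire);
            if (msg) { head.store(h + 1, std::memory_order_release); return msg; }
        }
        std::lock_guard<std::mutex> g(overflow_m);
        if (overflow.empty()) return nullptr;
        void* msg = overflow.front();
        overflow.pop_front();
        return msg;
    }
};

int main() {
    LocklessQueue q;
    int payload = 42;
    q.push(&payload);
    std::printf("dequeued value: %d\n", *static_cast<int*>(q.pop()));
    return 0;
}
```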

  14. Scalable Memory Allocation • System software on BG/Q uses the glibc shared-arena allocator • Malloc • Find an available arena and lock it • Allocate and return the memory buffer • Release the lock • Free • Find the arena the buffer was allocated from • Lock the arena, free the buffer in that arena, and unlock • Frees result in thread contention • This can slow down the short malloc/free calls typical of Charm++ applications such as NAMD IPDPS

  15. Scalable Memory Allocation (2) • Optimize via memory pools of short buffers • L2 atomic queues for fast concurrent thread access • Allocate • Dequeue from the Charm++ thread's local memory pool if a buffer is available • If the pool is empty, allocate via glibc malloc • Deallocate • Enqueue to the owner thread's pool via a lockless enqueue • Release via glibc free if the owner thread's pool is full IPDPS
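A simplified sketch of the pooled allocator described above (the block size, reclaim strategy, and names are illustrative assumptions, not the actual Charm++ implementation): allocation pops from the owning thread's local freelist, a free from any thread pushes the buffer back onto the owner's atomic return stack with a lockless enqueue, and glibc malloc remains the fallback when the pool is empty:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct Block { Block* next; };                     // buffer reused as a freelist link

struct ThreadPool {
    static constexpr std::size_t kBlockSize = 256; // short-message buffer size
    Block* local = nullptr;                        // freelist touched only by the owner
    std::atomic<Block*> returned{nullptr};         // buffers freed by other threads

    void* allocate() {
        if (!local)                                // reclaim remotely freed buffers in one shot
            local = returned.exchange(nullptr, std::memory_order_acquire);
        if (local) { Block* b = local; local = b->next; return b; }
        return std::malloc(kBlockSize);            // pool empty: fall back to glibc malloc
    }

    // Called from any thread: lockless push onto the owner's return stack.
    // (A full implementation would release to glibc free once the pool is full.)
    void deallocate(void* p) {
        Block* b = static_cast<Block*>(p);
        Block* old = returned.load(std::memory_order_relaxed);
        do {
            b->next = old;
        } while (!returned.compare_exchange_weak(old, b, std::memory_order_release,
                                                 std::memory_order_relaxed));
    }
};

int main() {
    ThreadPool pool;                               // one such pool per Charm++ thread
    void* a = pool.allocate();
    pool.deallocate(a);                            // in practice often called from another thread
    void* b = pool.allocate();                     // reuses the returned buffer
    std::printf("buffer reused: %s\n", a == b ? "yes" : "no");
    return 0;
}
```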

  16. Multiple Contexts and Communication Threads • Maximize concurrency in sends and receives • Charm++ SMP mode creates multiple PAMI contexts • Sub-groups of Charm++ worker threads are associated with a PAMI context • For example, at 64 threads/node we use 16 PAMI contexts, with sub-groups of 4 threads accessing each context • PAMI library calls are protected via critical sections • Worker threads advance PAMI contexts when idle • This mode is suitable for compute-bound applications • SMP mode with communication threads • Each PAMI context is advanced by a different communication thread • Charm++ worker threads post work via PAMI_Context_post and do not advance PAMI contexts themselves • This mode is suitable for communication-bound applications IPDPS
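The sketch below illustrates the first mode with the assumed 64-worker / 16-context configuration from the slide: a simple thread-to-context mapping plus a try-lock critical section lets an idle worker advance its shared context, while the second mode would instead dedicate one communication thread per context:

```cpp
#include <cstdio>
#include <mutex>
#include <vector>

// Illustrative mapping of worker threads to communication contexts:
// with 64 workers and 16 contexts, sub-groups of 4 workers share one context.
struct CommContext {
    std::mutex cs;                    // critical section guarding the PAMI-like context
    void advance() { /* poll and progress the network context here */ }
};

const int kWorkers = 64, kContexts = 16;

int context_of(int worker_tid) {      // sub-groups of kWorkers/kContexts threads per context
    return worker_tid / (kWorkers / kContexts);
}

// Mode 1 (compute-bound): an idle worker tries to advance its own context.
void worker_idle_poll(int tid, std::vector<CommContext>& ctxs) {
    CommContext& c = ctxs[context_of(tid)];
    if (c.cs.try_lock()) {            // skip if another worker already holds the context
        c.advance();
        c.cs.unlock();
    }
}

// Mode 2 (communication-bound) would instead dedicate one communication thread
// per context, with workers only posting work to it (see the previous sketch).

int main() {
    std::vector<CommContext> ctxs(kContexts);
    for (int tid = 0; tid < kWorkers; ++tid) worker_idle_poll(tid, ctxs);
    std::printf("worker 0 -> context %d, worker 63 -> context %d\n",
                context_of(0), context_of(63));
    return 0;
}
```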

  17. Optimize Short Messages • CmiDirectManytomany • Charm++ interface to optimize a burst of short messages • Message buffer addresses and sizes are registered ahead of time • Communication operations are kicked off via a start call • A completion callback notifies the Charm++ scheduler when data has been fully sent and received • Charm++ scheduling and header overheads are eliminated • We parallelize burst sends of several short messages by posting work to multiple communication threads • Worker threads call PAMI_Context_post with a work function • Work functions execute PAMI_Send_immediate calls to move data on the network • On the receiver, data is moved directly into registered destination buffers IPDPS
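A hypothetical sketch of the register / start / completion-callback flow (this is not the actual CmiDirectManytomany API, whose functions and signatures differ; a memcpy stands in for the network transfer):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical shape of a many-to-many handle; the real CmiDirectManytomany
// functions differ, but follow the same register / start / callback flow.
struct ManyToMany {
    struct Buf { char* data; std::size_t len; };
    std::vector<Buf> sends, recvs;                 // buffers registered ahead of time
    std::function<void()> on_complete;             // notifies the Charm++-style scheduler

    void register_send(char* data, std::size_t len) { sends.push_back({data, len}); }
    void register_recv(char* data, std::size_t len) { recvs.push_back({data, len}); }

    // A single start call kicks off the whole burst of short messages, so
    // per-message scheduling and header overheads are paid once, not per message.
    void start() {
        for (std::size_t i = 0; i < sends.size() && i < recvs.size(); ++i)
            std::memcpy(recvs[i].data, sends[i].data,
                        std::min(sends[i].len, recvs[i].len));  // stand-in for the network transfer
        if (on_complete) on_complete();            // data fully sent and received
    }
};

int main() {
    char src[16] = "short message", dst[16] = {0};
    ManyToMany m2m;
    m2m.register_send(src, sizeof src);
    m2m.register_recv(dst, sizeof dst);
    m2m.on_complete = [&] { std::printf("burst complete, received: %s\n", dst); };
    m2m.start();
    return 0;
}
```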

  18. Performance Results IPDPS

  19. Converse Internode Ping Pong Latency IPDPS

  20. Converse Intranode Ping Pong Latency IPDPS

  21. Scalable Memory Allocation • 64 threads on a node allocate and free 100 buffers in each iteration IPDPS

  22. Performance Impact of L2 Atomic Queues • NAMD APoA1 benchmark: speedups of 1.5x and 2.7x IPDPS

  23. NAMD Application on 512 Nodes • Time profile with 48 worker threads and no communication threads per node • Time profile with 32 worker threads and 8 communication threads per node IPDPS

  24. PME Optimization with CmiDirectManytomany (1024 Nodes) IPDPS

  25. 3D Complex-to-Complex FFT • Forward + backward 3D FFT time in microseconds IPDPS

  26. NAMD APoA1 Benchmark Performance Results • Best BG/Q time step: 0.68 ms/step IPDPS

  27. Summary • Presented several optimizations for the Charm++ runtime on the Blue Gene/Q machine • SMP mode outperforms non-SMP mode • Best performance on BG/Q with 1 to 4 processes per node and 16 to 64 threads per process • Best time step of 0.68 ms/step for the NAMD application with the APoA1 benchmark IPDPS

  28. Thank You. Questions? IPDPS
