
The Computational Plant, 9th ORAP Forum, Paris (CNRS)


Presentation Transcript


  1. The Computational Plant. 9th ORAP Forum, Paris (CNRS). Rolf Riesen, Sandia National Laboratories, Scalable Computing Systems Department. March 21, 2000

  2. Distributed and Parallel Systems (Tech Report SAND98-2221). Heterogeneous distributed systems vs. homogeneous massively parallel systems; example systems on the slide: Internet, SETI@home, Legion/Globus, Berkeley NOW, Beowulf, Cplant, ASCI Red Tflops. Distributed systems: • Gather (unused) resources • Steal cycles • System SW manages resources • System SW adds value • 10% - 20% overhead is OK • Resources drive applications • Time to completion is not critical • Time-shared. Massively parallel systems: • Bounded set of resources • Apps grow to consume all cycles • Application manages resources • System SW gets in the way • 5% overhead is maximum • Apps drive purchase of equipment • Real-time constraints • Space-shared

  3. Massively Parallel Processors • Intel Paragon • 1,890 compute nodes • 3,680 i860 processors • 143/184 GFLOPS • 175 MB/sec network • SUNMOS lightweight kernel • Intel TeraFLOPS • 4,576 compute nodes • 9,472 Pentium II processors • 2.38/3.21 TFLOPS • 400 MB/sec network • Puma/Cougar lightweight kernel

  4. Cplant Goals • Production system • Multiple users • Scalable (easy to use buzzword) • Large scale (proof of the above) • General purpose for scientific applications (not Beowulf dedicated to a single user) • 1st step: Tflops look and feel for users

  5. Cplant Strategy • Hybrid approach combining commodity cluster technology with MPP technology • Build on the design of the TFLOPS: • large systems should be built from independent building blocks • large systems should be partitioned to provide specialized functionality • large systems should have significant resources dedicated to system maintenance

  6. Why Cplant? • Modeling and simulation, essential to stockpile stewardship, require significant computing power • Commercial supercomputers are a dying breed • Pooling of SMPs is expensive and more complex • The commodity PC market is closing the performance gap • Web services and e-commerce are driving high-performance interconnect technology

  7. Cplant Approach • Emulate the ASCI Red environment • Partition model (functional decomposition) • Space sharing (reduce turnaround time) • Scalable services (allocator, loader, launcher) • Ephemeral user environment • Complete resource dedication • Use existing software when possible • Red Hat distribution, Linux/Alpha • Software developed for ASCI Red

  8. Conceptual Partition View (diagram): compute, service, file I/O, net I/O, and system support partitions; users connect through the service partition, /home sits behind file I/O, and sys admin access goes through system support

  9. User View (diagram): users rlogin to alaska and a load-balancing daemon places them on one of the service-partition nodes alaska0 through alaska4; the file I/O partition (/home) and the net I/O partition sit behind the service partition

  10. System Support Hierarchy (diagram): an sss1 station with admin access holds the master copy of the system software; each scalable unit has an sss0 station with the in-use copy of the system software, and the unit's nodes NFS-mount their root from sss0

  11. Scalable Unit (diagram): compute and service nodes attached to 16-port Myrinet switches (8 Myrinet LAN cables), plus an sss0 station, power controllers, terminal servers, and 100BaseT hubs; the diagram legend distinguishes Myrinet, power, serial, and Ethernet links, with the Ethernet side connecting to the system support network

  12. “Virtual Machines” (diagram): an SU configuration database on sss1 groups scalable units into virtual machines (Production, Alpha, Beta) and uses rdist to push system software down to each unit's sss0, which holds the in-use copy; nodes NFS-mount their root from sss0

  13. Runtime Environment • yod - Service node parallel job launcher • bebopd - Compute node allocator • PCT - Process control thread, compute node daemon • pingd - Compute node status tool • fyod - Independent parallel I/O

  14. Phase I - Prototype (Hawaii) • 128 Digital PWS 433a (Miata) • 433 MHz 21164 Alpha CPU • 2 MB L3 Cache • 128 MB ECC SDRAM • 24 Myrinet dual 8-port SAN switches • 32-bit, 33 MHz LANai-4 NIC • Two 8-port serial cards per SSS-0 for console access • I/O - Six 9 GB disks • Compile server - 1 DEC PWS 433a • Integrated by SNL

  15. Phase II Production (Alaska) • 400 Digital PWS 500a (Miata) • 500 MHz Alpha 21164 CPU • 2 MB L3 Cache, 192 MB RAM • 16-port Myrinet switch • 32-bit, 33 MHz LANai-4 NIC • 6 DEC AS1200 with 12 RAIDs (0.75 TB) as parallel file servers • 1 DEC AS4100 compile & user file server • Integrated by Compaq • 125.2 GFLOPS on MPLINPACK (350 nodes) • would place 53rd on June 1999 Top 500

  16. Phase III Production (Siberia) • 624 Compaq XP1000 (Monet) • 500 MHz Alpha 21264 CPU • 4 MB L3 Cache • 256 MB ECC SDRAM • 16-port Myrinet switch • 64-bit, 33 MHz LANai-7 NIC • 1.73 TB disk I/O • Integrated by Compaq and Abba Technologies • 247.6 GFLOPS on MPLINPACK (572 nodes) • would place 40th on Nov 1999 Top 500

  17. Phase IV (Antarctica?) • ~1350 DS10 Slates (NM + CA) • 466 MHz EV6, 256 MB RAM • Myrinet, 33 MHz 64-bit LANai 7.x • Will be combined with Siberia for a ~1600-node system • Red, black, green switchable

  18. Myrinet Switch • Based on a 64-port Clos switch • 8x2 16-port switches in a 12U rack-mount case • 64 LAN cables to nodes • 64 SAN cables (96 links) to mesh • (diagram: a 16-port switch with 4 nodes)

  19. One Switch Rack = One Plane • 4 Clos switches in one rack • 256 nodes per plane (8 racks) • Wrap-around in x and y direction • 128+128 links in z direction • (diagram: x, y, z axes; the 4 nodes per switch are not shown)

  20. Cplant 2000: “Antarctica” (diagram): compute nodes swing between red, black, or green, with sections connected to the classified, unclassified, and open networks; wrap-around and z links and nodes not shown

  21. Cplant 2000: “Antarctica” cont. • 1056 + 256 + 256 nodes, ~1600 nodes, ~1.5 TFLOPS • 320 “64-port” switches + 144 16-port switches from Siberia • 40 + 16 system support stations

  22. Portals • Data movement layer from SUNMOS and PUMA • Flexible building blocks for supporting many protocols • Elementary constructs that support MPI semantics well • Low-level message-passing layer (not a wire protocol) • API intended for library writers, not application programmers • Tech report SAND99-2959

  23. Interface Concepts • One-sided operations • Put and Get • Zero copy message passing • Increased bandwidth • OS Bypass • Reduced latency • Application Bypass • No polling, no threads • Reduced processor utilization • Reduced software complexity
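
A minimal C sketch of the one-sided put idea behind these bullets, using illustrative names (pt_md_bind, pt_put) and types rather than the real Portals 3.0 API: the application binds a memory descriptor, the put is handed to the NIC (OS bypass), and completion is reported through an event queue rather than by polling or progress threads (application bypass).

    /*
     * Hypothetical one-sided put, Portals-style.  pt_md_bind, pt_put and the
     * types below are illustrative, not the real Portals 3.0 API.
     */
    #include <stdio.h>

    typedef struct { int nid, pid; } pt_process_t;            /* target process    */
    typedef struct { void *start; size_t length; } pt_md_t;   /* memory descriptor */

    /* Bind a region of user memory so the NIC can move it without a copy. */
    static pt_md_t pt_md_bind(void *buf, size_t len) {
        pt_md_t md = { buf, len };
        return md;
    }

    /* One-sided put: data flows from our descriptor to (target, portal index).
     * A real implementation hands this to the NIC (OS bypass) and reports
     * completion through an event queue, so neither side polls or runs a
     * progress thread (application bypass). */
    static int pt_put(pt_md_t md, pt_process_t target, int portal_index) {
        printf("put %zu bytes to nid %d, portal %d\n",
               md.length, target.nid, portal_index);
        return 0;
    }

    int main(void) {
        char buf[64] = "hello";
        pt_md_t md = pt_md_bind(buf, sizeof buf);
        pt_process_t target = { 1, 0 };
        pt_put(md, target, 4);               /* 4 is an arbitrary portal index */
        return 0;
    }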

  24. MPP Network: Paragon and Tflops (diagram): the network interface is on the memory bus; two processors and memory share the memory bus, with the second processor usable as a message-passing or computational co-processor, and the network attached directly to the bus

  25. Commodity: Myrinet (diagram): the network is far from the memory; the processor and memory sit on the memory bus, a bridge leads to the PCI bus, and the NIC on the PCI bus connects to the network, with OS bypass going directly between the application and the NIC

  26. “Must” Requirements • Common protocols (MPI, system protocols) • Portability • Scalability to 1000’s of nodes • High-performance • Multiple process access • Heterogeneous processes (binaries) • Runtime independence • Memory protection • Reliable message delivery • Pairwise message ordering

  27. “Will” Requirements • Operational API • Zero-copy MPI • Myrinet • Sockets implementation • Unrestricted message size • OS Bypass, Application Bypass • Put/Get

  28. “Will” Requirements (cont.) • Packetized implementations • Receive uses start and length • Receiver managed • Sender managed • Gateways • Asynchronous operations • Threads

  29. “Should” Requirements • No message alignment restrictions • Striping over multiple channels • Socket API • Implement on ST • Implement on VIA • No consistency/coherency • Ease of use • Topology information

  30. Portal Addressing (diagram): a portal table indexes match lists; match-list entries reference memory descriptors, which describe memory regions in application space and may point to an event queue; an operational boundary separates the Portal API space from the application space
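
A hypothetical C sketch of how the objects named on this slide might relate; the struct layouts and field names are assumptions for illustration, not the real Portals 3.0 headers.

    /* Hypothetical data structures for the objects named on this slide;
     * field names are assumptions, not the real Portals 3.0 headers.      */
    #include <stddef.h>

    typedef struct {                             /* entry in an event queue      */
        int    kind;                             /* e.g. put arrived, MD unlinked*/
        size_t mlength;                          /* bytes actually deposited     */
    } pt_event_t;

    typedef struct pt_eq {                       /* event queue                  */
        pt_event_t *ring;
        size_t      size, head, tail;
    } pt_eq_t;

    typedef struct {                             /* memory descriptor            */
        void    *start;                          /* memory region in app space   */
        size_t   length;
        int      unlink_on_use;                  /* unlink after one operation?  */
        pt_eq_t *eq;                             /* optional event queue         */
    } pt_md_t;

    typedef struct pt_me {                       /* entry in a match list        */
        unsigned long match_bits, ignore_bits;
        int           source_nid, source_pid;    /* who may match this entry     */
        pt_md_t      *md;                        /* descriptor behind the entry  */
        struct pt_me *next;
    } pt_me_t;

    typedef struct {                             /* one portal table per process */
        pt_me_t *match_list[64];                 /* indexed by portal number     */
    } pt_portal_table_t;

    int main(void) {
        pt_portal_table_t table = { { NULL } };  /* empty portal table           */
        (void)table;
        return 0;
    }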

  31. Portal Address Translation (flowchart): on message arrival, walk the match list; get the next match entry and test whether it matches and whether its first memory descriptor accepts the message; if so, perform the operation, record an event if an event queue is attached, and unlink the MD and/or the match entry if they are marked for unlinking; if no entry accepts the message, discard it and increment the drop count
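
A minimal C sketch of this translation walk, under the assumption that a non-matching entry leads to the next entry while a matching entry whose first memory descriptor rejects the message causes a discard; all types and names are illustrative, not the real Portals implementation.

    /* Sketch of the address-translation walk; illustrative types only.      */
    #include <stdio.h>

    struct md     { int accepts; int unlink_after_use; };
    struct me     { int matches; int unlink_when_done; struct md *md; struct me *next; };
    struct portal { struct me *match_list; long drop_count; };

    /* Returns 1 if the incoming message was consumed, 0 if it was discarded. */
    static int portal_translate(struct portal *p, int have_event_queue)
    {
        struct me *e;
        for (e = p->match_list; e != NULL; e = e->next) {
            if (!e->matches)
                continue;                        /* no match: next match entry  */
            if (!e->md->accepts)
                break;                           /* first MD rejects: discard   */
            /* ... perform the put/get operation on the memory descriptor ...  */
            if (have_event_queue)
                printf("record event\n");
            if (e->md->unlink_after_use)
                printf("unlink MD\n");
            if (e->unlink_when_done)
                printf("unlink ME\n");
            return 1;
        }
        p->drop_count++;                         /* discard message, count drop */
        return 0;
    }

    int main(void) {
        struct md     d = { 1, 0 };
        struct me     m = { 1, 0, &d, NULL };
        struct portal p = { &m, 0 };
        printf("consumed = %d\n", portal_translate(&p, 1));
        return 0;
    }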

  32. Implementing MPI • Short message protocol • Send message (expect receipt) • Unexpected messages • Long message protocol • Post receive • Send message • On ACK or Get, release message • Event includes the memory descriptor
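
A hedged C sketch of the short/long decision described above; the eager threshold, helper names, and printed messages are assumptions for illustration, not the actual MPICH-over-Portals code.

    /* Sketch of the short/long send decision; threshold and names assumed.  */
    #include <stdio.h>

    #define SHORT_MSG_MAX 8192   /* assumed eager threshold, not from the talk */

    /* Short protocol: send eagerly; if no receive is posted, the message
     * lands in one of the receiver's unexpected-message buffers.            */
    static void send_short(const void *buf, size_t len, int dest) {
        (void)buf;
        printf("eager send of %zu bytes to rank %d\n", len, dest);
    }

    /* Long protocol: the receiver posts its buffer first; the sender sends
     * the message, and the send buffer is released when the ACK (or the
     * receiver's Get) completes, as reported by an event that includes the
     * memory descriptor.                                                    */
    static void send_long(const void *buf, size_t len, int dest) {
        (void)buf;
        printf("rendezvous send of %zu bytes to rank %d\n", len, dest);
    }

    static void mpi_style_send(const void *buf, size_t len, int dest) {
        if (len <= SHORT_MSG_MAX)
            send_short(buf, len, dest);
        else
            send_long(buf, len, dest);
    }

    int main(void) {
        static char small[256], big[64 * 1024];
        mpi_style_send(small, sizeof small, 1);
        mpi_style_send(big, sizeof big, 1);
        return 0;
    }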

  33. Implementing MPI (diagram): the receive match list starts with the pre-posted receives, followed by a mark entry (match none), then match-any entries whose short, unlink-on-use memory descriptors cover the unexpected-message buffers and feed a shared event queue; the list ends with a match-any, zero-length, truncating, no-ACK entry

  34. Flow Control • Basic flow control • Drop messages that the receiver is not prepared for • Long messages might waste network resources • Good performance for well-behaved MPI apps • Managing the network – packets • Packet size big enough to hold a short message • First packet is an implicit RTS • Flow control ACK can indicate that message will be dropped
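
A small C sketch of the implicit-RTS idea on this slide, assuming an illustrative packet size and helper names: the receiver answers the first packet with an ACK that tells the sender whether the rest of the message would be dropped, so long messages do not waste network resources.

    /* Sketch of the first-packet-as-implicit-RTS idea; sizes and names assumed. */
    #include <stdio.h>

    #define PKT_PAYLOAD 8192     /* assumed: big enough to hold a short message */

    enum ack_status { ACK_ACCEPT, ACK_WILL_DROP };

    /* Receiver side: on the first packet, decide whether a matching memory
     * descriptor can take the whole message.                                */
    static enum ack_status on_first_packet(size_t msg_len, size_t posted_space) {
        return (msg_len <= posted_space) ? ACK_ACCEPT : ACK_WILL_DROP;
    }

    int main(void) {
        size_t msg_len = 10 * PKT_PAYLOAD;       /* a long, packetized message */
        /* Only stream the remaining packets if the implicit RTS was accepted. */
        if (on_first_packet(msg_len, 4 * PKT_PAYLOAD) == ACK_WILL_DROP)
            printf("receiver not ready: message will be dropped, stop sending\n");
        else
            printf("receiver ready: stream the remaining packets\n");
        return 0;
    }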

  35. Portals 3.0 Status • Currently testing Cplant Release 0.5 • Portals 3.0 kernel module using the RTS/CTS module over Myrinet • Port of MPICH 1.2.0 over Portals 3.0 • TCP/IP reference implementation ready • Port to LANai begun

  36. http://www.cs.sandia.gov/cplant
