
Major Systems at ANL


Presentation Transcript


  1. Bill Gropp, www.mcs.anl.gov/~gropp (virtual Remy Evard). Major Systems at ANL

  2. Current User Facilities
     • Chiba City – Linux Cluster for Scalability
       OASCR funded. Installed in 1999. 512 CPUs, 256 nodes, Myrinet, 2 TB storage.
       Mission: address scalability issues in system software, open source software, and applications code.
     • Jazz – Linux Cluster for ANL Apps
       ANL funded. Installed in 2002. Achieved 1.1 TF sustained. 350 CPUs, Myrinet, 20 TB storage.
       Mission: support and enhance the ANL application community. 50 projects. On the DOE Science Grid.
     • TeraGrid – Linux Cluster for NSF Grid Users
       NSF funded as part of DTF and ETF. 128 IA-64 CPUs for computing, 192 IA-32 CPUs for visualization.
       Mission: production grids, grid application code, visualization service.

  3. Current Testbeds
     • Advanced Architectures Testbed
       ANL LDRD funding. Established in 2002. Experimental systems: FPGAs, hierarchical architectures, ...
       Mission: explore programming models and hardware architectures for future architectures.
     • Grid and Networking Testbeds
       I-WIRE: Illinois-funded dark fiber. Participation in a large number of Grid projects.
       Facilities at ANL include DataGrid, the Distributed Optical Testbed, and others.
       Mission: Grids and networks as an enabling technology for petascale science.
     • Visualization and Collaboration Facilities
       AccessGrid, ActiveMural, Linux CAVE, others.

  4. Chiba City – the Argonne Scalable Cluster
     1 of 2 rows of Chiba City.
     • 256 computing nodes, 512 PIII CPUs
     • 32 visualization nodes
     • 8 storage nodes, 4 TB of disk
     • Myrinet interconnect
     Mission: scalability and open source software testbed.
     http://www.mcs.anl.gov/chiba/

  5. Systems Software Challenges
     • Scale invariance
       System services need to scale to arbitrarily large systems (e.g., I/O, scheduling, monitoring, process management, error reporting, diagnostics).
       Self-organizing services provide one path to scale invariance.
     • Fault tolerance
       System services need to provide sustained performance in spite of hardware failures.
       No single point of control; peer-to-peer redundancy.
     • Autonomy
       System services should be self-configuring, auto-updating, and self-monitoring.
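To make "self-organizing" and "no single point of control" concrete, here is a minimal sketch of peer discovery by broadcast heartbeat: every node announces itself, builds its own membership view, and ages out peers that go silent. The port, message format, and timeouts are invented for the example; this is not the Chiba City software.

    # Minimal self-organizing membership sketch (illustrative only; the port,
    # message format, and timeouts are invented for this example).
    import json, socket, time

    PORT = 50515                      # hypothetical well-known broadcast port
    HEARTBEAT_SECS = 5                # how often each node announces itself
    DEAD_AFTER = 3 * HEARTBEAT_SECS   # drop peers that stop announcing

    def run_node(node_name):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        sock.settimeout(HEARTBEAT_SECS)
        peers = {}                    # peer name -> last time we heard from it
        while True:
            # Every node announces itself the same way, so there is no master.
            sock.sendto(json.dumps({"node": node_name}).encode(),
                        ("<broadcast>", PORT))
            try:
                data, _ = sock.recvfrom(4096)
                peers[json.loads(data)["node"]] = time.time()
            except socket.timeout:
                pass
            # Forget peers that have missed several heartbeats (fault tolerance).
            now = time.time()
            peers = {n: t for n, t in peers.items() if now - t < DEAD_AFTER}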

  6. Testbed Uses
     • System Software
     • MPI Process Management
     • Parallel Filesystems
     • Cluster Distribution Testing
     • Network Research
     • Virtual Node Tests
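For MPI process management at scale, a useful smoke test is simply launching one task per CPU and checking that every rank starts and reports in. The sketch below uses mpi4py as a stand-in for whatever MPI stack and launcher the testbed actually ran; the script name in the comment is hypothetical.

    # Trivial MPI startup check: every rank reports its host, rank 0 tallies them.
    # Launch with something like: mpiexec -n 512 python startup_check.py
    from mpi4py import MPI
    import socket

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Gather (rank, hostname) pairs on rank 0.
    reports = comm.gather((rank, socket.gethostname()), root=0)

    if rank == 0:
        hosts = {h for _, h in reports}
        print(f"{size} ranks started on {len(hosts)} hosts")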

  7. Testbed Software Development
     • Largely based on the SSS component architecture and interfaces
     • Existing resource management software didn't meet needs
     • The SSS component architecture allowed easy substitution of system software where required
     • Simple interfaces allow fast implementation of custom components (resource manager)
     • Open architecture allows implementation of extra components based on local requirements (file staging)
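As an illustration of how a site-specific component might slot into such an architecture, the sketch below is a hypothetical file-staging component that registers itself with a service directory and then serves stage-in requests. The host name, ports, and XML message formats are invented for the example and are not the actual SSS wire protocol.

    # Hypothetical file-staging component in the spirit of the SSS design:
    # register with a service directory, then serve stage-in requests.
    # The addresses and XML message formats here are invented for illustration.
    import shutil, socket
    from xml.etree import ElementTree as ET

    SERVICE_DIRECTORY = ("sd.chiba.example", 5100)   # hypothetical address
    LISTEN_PORT = 5199

    def register():
        msg = '<register component="file-stager" host="%s" port="%d"/>' % (
            socket.gethostname(), LISTEN_PORT)
        with socket.create_connection(SERVICE_DIRECTORY) as s:
            s.sendall(msg.encode())

    def serve():
        srv = socket.socket()
        srv.bind(("", LISTEN_PORT))
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            with conn:
                # For the sketch, assume one small request fits in a single recv.
                req = ET.fromstring(conn.recv(65536).decode())
                if req.tag == "stage-in":
                    # Copy a file from shared storage to a node-local path.
                    shutil.copy(req.get("src"), req.get("dest"))
                    conn.sendall(b"<ok/>")

    if __name__ == "__main__":
        register()
        serve()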

  8. Chiba City Implementation
     • Meta Services: Meta Scheduler, Meta Monitor, Meta Manager
     • Scheduler*, Job Queue Manager*, Process Manager*, Node State Manager*
     • Service Directory*, Event Manager*, Communication Library*
     • Node Configuration & Build Manager*, Hardware Infrastructure Manager*
     • System & Job Monitor, Accounting, Allocation Management*, Usage Reports
     • Validation & Testing, Checkpoint / Restart

  9. Software Deployment Testing
     • Beta software run in production
     • Testbed software stack
     • Configuration management tools
     • Global process manager
     • Cluster distribution installation testing
     • Friendly users provide useful feedback during the development process

  10. The ANL LCRC Computing Cluster (http://www.lcrc.anl.gov)
      350 computing nodes:
      • 2.4 GHz Pentium IV; 50% with 2 GB RAM, 50% with 1 GB RAM
      • 80 GB local scratch disk
      • Linux
      10 TB global working disk:
      • 8 dual 2.4 GHz Pentium IV servers
      • 10 TB SCSI JBOD disks
      • PVFS file system
      10 TB home disk:
      • 8 dual 2.4 GHz Pentium IV servers
      • 10 TB Fiber Channel disks
      • GFS between servers, NFS to the nodes
      Network:
      • Myrinet 2000 to all systems
      • Fast Ethernet to the nodes, GigE aggregation
      • 1 Gb to ANL
      Support:
      • 4 front-end nodes (2x 2.4 GHz PIV)
      • 8 management systems
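The PVFS working space is the natural target for parallel I/O from the compute nodes. The sketch below shows a common MPI-IO pattern, each rank writing its own block of one shared file; the /pvfs/working mount point and the use of mpi4py are assumptions for the example rather than documented LCRC usage.

    # Each rank writes its own block of a shared file on the PVFS working space.
    # The path and mpi4py usage are illustrative assumptions, not LCRC documentation.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1 << 20, rank, dtype=np.int32)    # 4 MB of data per rank

    fh = MPI.File.Open(comm, "/pvfs/working/demo.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Write_at_all(rank * block.nbytes, block)       # collective write, disjoint offsets
    fh.Close()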

  11. LCRC enables analysis of complex systems:
      • Catalysis in Nanoporous Materials
      • 3D Numerical Reactor
      • Regional Aerosol Impacts
      • Spatio-Temporal Chaos

  12. LCRC enables studies of system dynamics:
      • Neocortical Seizure Simulation
      • Lattice Quantum Chromodynamics
      • Aerodynamic Drag for Heavy Vehicles
      • Sediment Transport

  13. Jazz Usage – Capacity and Load
      • We've reached the practical capacity limit given the job mix.
      • There are always jobs in the queue. Wait time varies enormously, averaging ~1 hour.

  14. Jazz Usage – Accounts
      • Constant growth of ~15 users a month.

  15. Jazz Usage – Projects
      • Steady addition of ~6 new projects a month.

  16. FY2003 LCRC Usage by Domain
      A wide range of lab missions.

  17. Jazz Usage by Domain over time

  18. Jazz Usage – Large Projects (>5000 hrs)
      PETSc, Startup Projects, Ptools, QMC - PHY, Nanocatalysis - CNM, Sediment - MCS, Protein - NE, Neocortex Sim - MCS, Numerical Reactor - NE, Heights EUV - ET, Lattice QCD - HEP, Compnano - CNM, Foam - MCS, COLUMBUS - CHM, Climate - MCS, Chaos - MCS, Aerosols - ER

  19. ETF Hardware Deployment, Fall 2003 (http://www.teragrid.org)
      • ANL: 64 2p Madison compute nodes, 96 Pentium 4 visualization nodes with 96 GeForce4 graphics pipes, Myrinet, 20 TB storage
      • Caltech: 52 2p Itanium2 nodes, 20 2p Madison nodes, 32 Pentium 4 nodes, Myrinet, 100 TB DataWulf
      • NCSA: 256 2p Itanium2 nodes, 670 2p Madison nodes, Myrinet, 230 TB FCS SAN
      • SDSC: 128 2p Itanium2 nodes, 256 2p Madison nodes, Myrinet, 1.1 TF Power4 Federation, 500 TB FCS SAN
      • PSC: 750 4p Alpha EV68 nodes with Quadrics, 128p EV7 Marvel, 4p Vis, 16 2p (ER) Itanium2 with Quadrics, 75 TB storage

  20. ETF ANL: 1.4 TF Madison/Pentium IV, 20 TB, Viz; 30 Gbps to TeraGrid network
      • Compute: .5 TF Madison, 64 nodes (2p Madison, 4 GB memory, 2x73 GB disk); 250 MB/s/node x 64 nodes to the Myrinet fabric
      • Visualization: .9 TF Pentium IV, 96 nodes (2p 2.4 GHz, 4 GB RAM, 73 GB disk, Radeon 9000); 96 visualization streams; 250 MB/s/node x 96 nodes
      • Storage: 8 storage nodes (2p 2.4 GHz, 4 GB RAM, 2x FC), 20 TB; storage I/O over Myrinet and/or GbE
      • Interactive nodes (login, FTP): 4 2p PIV nodes, 4 4p Madison nodes
      • Fabrics: Myrinet and GbE; viz I/O over Myrinet and/or GbE to viz devices; connection to the TG network
