
RP Operations


Presentation Transcript

  1. RP Operations TeraGrid Annual Review 6 April 2009 M. Levine

  2. Structure
  • A huge amount of information in the annual report
    • Explicit discussion of RP Operations in Sec. 8.3 & 8.4
    • Voluminous tables and graphs in Appendix B
    • Numerous tidbits scattered through the report
      • Science projects credit specific RP support
      • Activities discussions credit specific RP expertise [e.g. Data (7.1) & Visualization (7.2)]
  • Outline
    • Summary & augmentations of Sec. 8 & Appendix B
    • Highlight specific items
    • General comments on the roles of RPs in HP-IT in this era of commoditization
  • People from the RPs & GIG are here to help with questions

  3. Major System Developments in 2008
  • Sites
    • Indiana -- LONI -- NCAR -- NCSA -- NICS -- ORNL -- PSC -- Purdue -- SDSC -- TACC -- ANL
  • New systems
    • Ranger@TACC (Sun Constellation Cluster, 63k cores, 580 Tf)
    • Kraken@NICS (Cray XT, 18k cores, 160→608 Tf)
    • Dell clusters @LONI & @Purdue
    • SGI Altix @PSC
    • FPGA system @Purdue
    • Vis system @TACC
    • Increased and integrated Condor pool @Purdue & @Indiana
  • System retirements at
    • NCSA -- PSC -- Purdue -- SDSC -- TACC

  4. User community & usage growth (Table 18-1)(strong growth in capability & usage)

  5. Compute (27) & Storage (11) Resources
  • Compute
    • 1.1 Indiana Big Red • 1.2 Quarry • 1.3 LONI Queen Bee • 1.4 NCAR Frost • 1.5 NCSA Abe • 1.6 Cobalt • 1.7 Tungsten • 1.8 Mercury • 1.9 NICS Kraken • 1.10 ORNL NSTG Cluster • 1.11 PSC BigBen • 1.12 Pople • 1.13 Rachel • 1.14 Purdue Steele • 1.15 Lear • 1.16 Condor Pool • 1.17 TeraDRE • 1.18 Brutus • 1.19 SDSC DataStar p655 • 1.20 DataStar p690 • 1.21 Blue Gene • 1.22 IA64 Cluster • 1.23 TACC Ranger • 1.24 Lonestar • 1.25 Spur (vis) • 1.26 Maverick • 1.27 UC/ANL IA64 & IA32 Cluster
  • Storage
    • 2.1 Indiana Data Capacitor • 2.2 HPSS • 2.3 NCSA Mass Storage System • 2.4 NICS HPSS • 2.5 PSC FAR HSM • 2.6 Bessemer • 2.7 SDSC GPFS-WAN • 2.8 HPSS • 2.9 SAM-QFS • 2.10 Storage Resource Broker (Collections Disk Space) • 2.11 TACC Ranch

  6. Compute (27) & Storage (11) Resources
  • From the Appendix, you know that they represent a broad spectrum of
    • Architectures • Strengths • Vendors
  • And therefore a wide range of capabilities
  • As has been mentioned, this heterogeneity improves the TG's ability to support a wide range of science & engineering research

  7. Compute Resources

  8. Storage Resources

  9. Breadth!
  • Computational strength (individual systems): 0.31 to 521 (608) Tf [recall graph from previous presentations]
  • Total memory: 0.5 to 126 (132) TB
  • Memory per core: 0.256 to 32 GB/core
  • Storage capacity: 2.2 to 41.3 PB
  • Architectures: clusters -- MPPs -- SMPs -- experimental
  • Compute vendors: Cray -- Dell -- HP -- IBM -- Nvidia -- SGI -- Sun
  • Storage vendors: HPSS -- SGI -- STK -- Sun
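As a hedged aside, the spread in these figures is easy to make concrete. The endpoint values below are copied from the bullets on this slide; the ratio computation is my own illustration, not something from the report:

```python
# Ratios between the largest and smallest values quoted on the slide.
peak_tf_min, peak_tf_max = 0.31, 608             # individual-system peak, Tf
mem_per_core_min, mem_per_core_max = 0.256, 32   # GB per core

compute_spread = peak_tf_max / peak_tf_min           # roughly 2000x in peak compute
memory_spread = mem_per_core_max / mem_per_core_min  # 125x in memory per core
```

Three orders of magnitude in peak compute across concurrently operated systems is the quantitative face of the heterogeneity argument on the next slides.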

  10. Usage by Discipline (major users)

  11. NSF Divisions (Coverage: B.4)
  • ASC Advanced Scientific Computing • AST Astronomical Sciences • ATM Atmospheric Sciences • BIR Biological Instrumentation and Resources • BNS Behavioral and Neural Sciences • CCR Computing and Computation Research • CDA Cross-Disciplinary Activities • CHE Chemistry • CTS Chemical and Thermal Systems • DDM Design and Manufacturing Systems • DEB Environmental Biology • DMR Materials Research • DMS Mathematical Sciences • EAR Earth Sciences • ECS Electrical and Communications Systems • ERC Engineering Research Centers • IBN Integrative Biology and Neuroscience • IRI Informatics, Robotics and Intelligent Systems • MCB Molecular and Cellular Biology • MSS Mechanical and Structural Systems • NCR Networking and Communications Research • OCE Ocean Sciences • PHY Physics • SES Social and Economic Sciences

  12. Distribution of Users: Institutions (477, 1/3)

  13. Distribution of Users: Institutions (2/3)

  14. Distribution of Users: Institutions (3/3)

  15. Distribution of Users: Geographical (unweighted) (Appendix B, Summaries) Apologies to AK & HI!

  16. Distribution of Users: Geographical (weighted) (Appendix B, Summaries) Apologies to AK & HI!
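The difference between the two maps can be sketched in a few lines: the unweighted map counts each user once per state, while the weighted map sums each user's usage, so a few heavy consumers can dominate. The sample records below are invented for illustration, not data from the report:

```python
from collections import Counter, defaultdict

# Hypothetical (home state, SUs used) records -- not figures from Appendix B.
users = [("PA", 1_200_000), ("PA", 300_000), ("CA", 2_500_000), ("TX", 50_000)]

# Unweighted: every user counts once, so small consumers show up equally.
unweighted = Counter(state for state, _ in users)

# Weighted: each state's share is the total usage of its users.
weighted = defaultdict(int)
for state, sus in users:
    weighted[state] += sus
```

Here PA leads the unweighted count (2 users) while CA leads the weighted one, which is exactly the kind of shift the two slides contrast.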

  17. RP Roles
  • One might get the idea that the TG simply makes available a range of systems
    • An impression reinforced by the voluminous, dry statistics
    • Not a bad assumption for "routine resources"
    • Not applicable to most TG systems, which are typically cultivating new ground in scale or functionality
  • In this age of commoditization of most parts of IT systems, with extreme price pressure and thin margins
    • Vendors are not able to innovate or support as in the past, and are often particularly weak in integrative areas which fall through the cracks between vendor domains
  • Yet the TG strives to excel in both areas, and the research community benefits from both innovation and support
  • The GIG works on broad, integrative issues; the more specialized and detailed burdens fall ever more on the RPs
    • Specifically, that burden is borne by the RP technical staff
    • Their importance is often insufficiently emphasized
    • (My impression is that the UK, for example, is more conscious of this issue.)

  18. RP Roles
  • RP activities span the range
    • Hardware • Configuration • System & operational software • Application support
  • Many times leading to TG-wide and vendor improvements
  • Although the annual report does not focus on these activities, the careful reader will notice them in the descriptions of
    • Science projects • Activities discussions

  19. RP Roles
  • Standing up new machines, and adapting them to the needs of their user community, requires care, expertise & experience
  • The following 3 examples demonstrate a range of machine capabilities and scheduling adaptations
  • Working well under varying conditions is NOT simply a matter of the jobs submitted

  20. Range of Usage Styles • System & user driven • Wide range • Focus: large • Focus: long

  21. RP Activities
  • Heterogeneity is an important provider of a wide range of capabilities for a wide range of research requirements
  • Selection of RP activities demonstrating breadth:
    • LONI -- PSC -- University of Chicago & ANL -- SDSC -- NCAR -- NICS -- TACC -- Indiana -- Purdue

  22. LONI (Louisiana Optical Network Initiative)
  • LONI: statewide network and distributed computing system (12 HPC systems)
  • Centerpiece LONI system: Queen Bee (QB)
    • 668-node Dell PowerEdge 1950 IB system, 50 TF peak; each node has 8 2.33-GHz Intel "Clovertown" Xeon cores, 8 GB RAM, 36 GB disk
    • 60 TB /work and 60 TB /project Lustre disk, 300 TB NCSA archival storage (through subcontract)
    • 50% allocated through TeraGrid, allocated in a pool with NCSA's Abe (users get one allocation they can use on either system)
    • Began running jobs for TeraGrid users on 1 February 2008
    • 10.7M SUs were used by TeraGrid users to run 24.6K jobs in CY2008
    • 2 planned, 6 unplanned outages (456 hours) -> 94.3% availability
    • 0 security incidents in CY2008 impacting TeraGrid usage
  • QB is an early test site for IU's Lustre-WAN and gateways
  • LONI also develops and tests software (on QB and other TG systems)
    • HARC -- Highly Available Resource Co-allocators (reserves and co-allocates multiple resources, including compute and network)
    • PetaShare -- a distributed file system being developed across LONI
    • SAGA -- Simple API for Grid Applications (lets application programmers use distributed computing without worrying about the specific implementation)
    • Cactus -- scientific framework for astrophysics, fluid dynamics, etc.
    • GridChem science gateway -- biology applications being added
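The availability figure is consistent with the other numbers on the slide under one reading: since QB began TeraGrid production on 1 February, the CY2008 service window is 335 days (2008 was a leap year). This reconstruction is my own; the report may compute it differently:

```python
# Reconstructing the slide's 94.3% availability figure.
# Assumption: service window runs 1 Feb - 31 Dec 2008 (335 days).
service_hours = 335 * 24          # 8040 hours
outage_hours = 456                # 2 planned + 6 unplanned outages
availability = 1 - outage_hours / service_hours
print(f"{availability:.1%}")      # → 94.3%
```

Over a full 366-day year the same 456 outage hours would give about 94.8%, so the 94.3% figure supports the February start date.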

  23. PSC: SMP & XT3/5
  • Introduced Pople & Rachel
    • Large shared-memory systems; 21% of TG SMP capability
    • Developed SMP scheduling techniques
  • PSC's XT3 innovations (2004) benefit XT5 & TG (2009)
    • PSC externalized XT3 login & I/O
    • Improved TG system performance & stability
    • Cray adopted them for the XT5 standard product

  24. UC/ANL TeraGrid Resource Provider
  • Visualization Hardware
    • Provides clusters with dedicated graphics hardware, 20 TB of online storage, 30 Gb connection to the TeraGrid backbone network
  • Visualization Gateway
    • Develops and hosts the TeraGrid Visualization Gateway
    • New in 2008: volume rendering service; dynamic accounts for community user access
  • Visualization Support
    • Provides visualization support to researchers from a range of scientific domains, including astrophysics, fluid dynamics, life sciences, applied mathematics

  25. SDSC RP Operations
  • Global File Systems
    • Large production GPFS-WAN (0.9 PB)
    • Supporting IU & PSC Lustre-WAN efforts
  • User-Settable Advance Reservations and Co-/Meta-Scheduling
  • Archival Storage Systems, including dual-site option (36 PB capacity, ~5 PB used)
  • TeraGrid Linux Cluster (IBM/Intel IA-64)
  • Advanced Support for TeraGrid Applications (ASTA)
  • Education, Outreach & Training
  • SDSC staff hold a number of key positions within TG (ADs, WG leads, Inca, documentation, etc.)
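A toy sketch of the co-scheduling idea mentioned above: given each resource's free windows, find a common slot that both could promise as an advance reservation. The function name, interval representation, and greedy search are all my own simplifications of what a real co-/meta-scheduler does:

```python
def first_common_window(free_a, free_b, need):
    """Find the earliest window of `need` hours that lies inside a free
    interval (start, end) on BOTH resources; return None if none exists.
    Illustrative sketch only -- real co-schedulers also handle queues,
    priorities, and reservation negotiation."""
    for a0, a1 in sorted(free_a):
        for b0, b1 in sorted(free_b):
            start, end = max(a0, b0), min(a1, b1)   # overlap of the two intervals
            if end - start >= need:
                return (start, start + need)
    return None
```

For example, with one machine free during hours (0, 4) and (10, 20) and another free during (2, 6) and (12, 18), a 3-hour co-allocation first fits at (12, 15).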

  26. NCAR TeraGrid RP Developments
  • Current Cyberinfrastructure
    • 5.7 TFlops / 2048-core Blue Gene/L system
    • 100 TB storage cluster
    • 1.5 PB SL8500 tape archive with 1 PB media running HPSS/HSI
    • Sun DAV visualization node
    • Chronopolis data preservation testbed
  • New Cyberinfrastructure Developments
    • Possible Blue Gene/L expansion to a 22.8 TF system with 8192 cores
    • 100 TFLOPS MRI system @CU with NCAR as 10% stakeholder
    • DAV node upgrade to a more capable 8-core system with larger memory
  • Science Gateway developments
    • Asteroseismic Modeling Portal (AMP): the successful launch of the Kepler spacecraft increases the potential scientific impact of this gateway

  27. National Institute for Computational Sciences: Kraken and Krakettes!
  • The proprietary Cray interconnect, SeaStar2, provides excellent scaling, with numerous tightly coupled applications running at 32K and 64K cores on the XT5
  • NICS is specializing in true capability applications, plus high-performance file and archival systems

  28. TACC Major Achievements in 2008
  • Deployed Ranger, first TeraGrid Track2 system
    • 100s of millions of SUs, far greater than all of TG in 2007
    • Users scaling to the full system for unprecedented science
  • Deployed Spur, most powerful TG vis system
    • Tightly coupled to the Ranger HPC system, with parallel I/O
    • Already over-requested, demonstrating pent-up demand for large-scale vis
  • Supporting most of the largest science projects in TG
    • Ex. the Hurricane Gustav/Ike NOAA project demonstrated TG end-to-end science support with societal impact
  • TG User Portal added file management and other capabilities
    • A heavily used online resource for most TG users
  • Leveraging local TACC s/w development, projects, staff

  29. Indiana University RP Highlights
  • Particular community needs: Big Red is particularly suited for molecular dynamics codes -- the #1 use of the system
  • Workflows & Lustre-WAN: the SC07 Bandwidth Challenge led to software enhancements and to Lustre-WAN as a production service
    • Deployed as production at IU and PSC; testing at LONI, TACC & SDSC
    • Support for workflows that integrate Lustre, Big Red, and other RPs (40% decrease in end-to-end time for LEAD workflows based on use of Lustre-WAN)
  • New service -- Virtual Machine hosting: Quarry now serves 17 projects running a total of 25 VMs, e.g. TeraGrid Information Services; LEAD (Linked Environments for Atmospheric Discovery); data collections (e.g. FlyBase)

  30. Purdue RP Operations Highlights
  • High-throughput computing
    • Expanded campus participation in the Condor pool (IU and Wisconsin joined; others are in process)
    • Support automatic job routing from the Purdue Condor pool to remote sites (OSG, TeraGrid, etc.)
    • Need a scalable job-reporting tool to better serve users -- working with IU and Wisconsin's Condor team
  • Steele -- 893 nodes (7144 cores), 66 Tflops peak
    • Began production 5/2008, replacing Lear; installed in less than a day and began running TG jobs as the cluster was being installed
  • Brutus (FPGA)
    • Integrated into TeraGrid in 2/2008 as an experimental resource
    • Supporting users who develop programming tools to make FPGA-based acceleration easier to adopt, and those who will use the FPGA-accelerated BLAST application through Condor scheduling
  • The Wispy cloud -- experimental, a collaboration with the UC/ANL Nimbus group
    • Made available to TG users in fall 2008; users can transfer virtual machine images to Wispy and run them the same way as submitting a job to the grid
    • Using Wispy to help configure an OSG environment for non-OSG TG resources (collaboration with RENCI, UC/ANL)
  • OSG collaboration -- standardize methods for advertising support for parallel (MPI) jobs, to make users' lives easier
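The job-routing idea above -- try the local Condor pool first, overflow to remote grids -- can be sketched in a few lines. The pool names and the greedy first-fit policy are my own illustrative assumptions, not how the Condor JobRouter is actually configured:

```python
def route_job(job_cores, pools):
    """Pick the first pool (in preference order) with enough free slots.
    `pools` is a list of (name, free_slots) pairs; names are hypothetical."""
    for name, free_slots in pools:
        if free_slots >= job_cores:
            return name
    return None   # no capacity anywhere; the job waits in the local queue
```

For instance, an 8-core job finding only 4 free local slots would overflow to the first remote pool that can take it, which is the behavior the slide describes for OSG and TeraGrid targets.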
