
RP Operations

TeraGrid Annual Review

6 April 2009

M. Levine

Structure

  • A huge amount of information in the annual report
    • Explicit discussion of RP Operations in Sec. 8.3 & 8.4
    • Voluminous tables and graphs in Appendix B
    • Numerous tidbits scattered through the report
      • Science projects credit specific RP support
      • Activities discussions credit specific RP expertise [e.g. Data (7.1) & Visualization (7.2)]
  • Outline
    • Summary & augmentations of Sec. 8 & Appendix B
    • Highlight specific items
    • General comments on the roles of RP’s in HP-IT in this era of commoditization
  • People from the RP’s & GIG here to help with questions
Major System Developments in 2008
  • Sites
    • Indiana -- LONI -- NCAR -- NCSA
    • NICS -- ORNL -- PSC -- Purdue
    • SDSC -- TACC -- ANL
  • New systems
    • Ranger@TACC (Sun Constellation Cluster 63k cores, 580 Tf)
    • Kraken@NICS (Cray XT, 18k cores, 166 Tf)
    • Dell clusters @LONI & @Purdue
    • SGI Altix@PSC
    • FPGA system @Purdue
    • Vis system @TACC
    • Increased and integrated Condor Pool @Purdue & @Indiana
  • System retirements at
    • NCSA -- PSC -- Purdue -- SDSC
    • TACC
Compute (27) & Storage (11) Resources
  • Compute
    • 1.1 Indiana Big Red
    • 1.2 Quarry
    • 1.3 LONI Queen Bee
    • 1.4 NCAR Frost
    • 1.5 NCSA Abe
    • 1.6 Cobalt
    • 1.7 Tungsten
    • 1.8 Mercury
    • 1.9 NICS Kraken
    • 1.10 ORNL NSTG Cluster
    • 1.11 PSC BigBen
    • 1.12 Pople
    • 1.13 Rachel
    • 1.14 Purdue Steele
    • 1.15 Lear
    • 1.16 Condor Pool
    • 1.17 TeraDRE
    • 1.18 Brutus
    • 1.19 SDSC DataStar p655
    • 1.20 DataStar p690
    • 1.21 Blue Gene
    • 1.22 IA64 Cluster
    • 1.23 TACC Ranger
    • 1.24 Lonestar
    • 1.25 Spur (vis)
    • 1.26 Maverick
    • 1.27 UC/ANL IA64 & IA32 Cluster
  • Storage
    • 2.1 Indiana Data Capacitor
    • 2.2 HPSS
    • 2.3 NCSA Mass Storage System
    • 2.4 NICS HPSS
    • 2.5 PSC FAR HSM
    • 2.6 Bessemer
    • 2.7 SDSC GPFS-WAN
    • 2.8 HPSS
    • 2.9 SAM-QFS
    • 2.10 Storage Resource Broker (Collections Disk Space)
    • 2.11 TACC Ranch
Compute (27) & Storage (11) Resources
  • From the Appendix, you know that they represent a broad spectrum of
    • Architectures
    • Strengths
    • Vendors
  • And therefore a wide range of
    • Capabilities
  • As has been mentioned, this heterogeneity improves the TG's ability to support a wide range of Science & Engineering Research
Breadth!
  • Computational strength (individual systems)
    • 0.31 to 521 (608) Tf [**Recall graph from previous presentations**]
  • Total memory
    • 0.5 to 126 (132) TB
  • Memory per core
    • 0.256 to 32 GB/core
  • Storage capacity
    • 2.2 to 41.3 PB
  • Architectures
    • Clusters -- MPP’s
    • SMP’s -- Experimental
  • Compute vendors
    • Cray -- Dell -- HP -- IBM
    • Nvidia -- SGI -- Sun
  • Storage vendors
    • HPSS -- SGI
    • STK -- Sun
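A quick back-of-the-envelope sketch of how wide these ranges actually are, using only the figures on this slide (the variable names are mine, not from the report):

```python
# Sketch: dynamic range implied by the breadth figures above.
# All numbers are taken from the slide; only the ratios are computed here.
specs = {
    "peak Tf":         (0.31, 608),   # computational strength (incl. upgrade)
    "total memory TB": (0.5, 132),
    "memory/core GB":  (0.256, 32),
    "storage PB":      (2.2, 41.3),
}
for name, (lo, hi) in specs.items():
    print(f"{name}: {hi / lo:,.0f}x spread")
```

The largest system exceeds the smallest by roughly three orders of magnitude in peak Tf, which is the quantitative sense in which the TG resource mix is "broad."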
NSF Divisions (Coverage: B.4)
  • ASC Advanced Scientific Computing
  • AST Astronomical Sciences
  • ATM Atmospheric Sciences
  • BIR Biological Instrumentation and Resources
  • BNS Behavioral and Neural Sciences
  • CCR Computing and Computation Research
  • CDA Cross-Disciplinary Activities
  • CHE Chemistry
  • CTS Chemical and Thermal Systems
  • DDM Design and Manufacturing Systems
  • DEB Environmental Biology
  • DMR Materials Research
  • DMS Mathematical Sciences
  • EAR Earth Sciences
  • ECS Electrical and Communications Systems
  • ERC Engineering Research Centers
  • IBN Integrative Biology and Neuroscience
  • IRI Informatics, Robotics and Intelligent Systems
  • MCB Molecular and Cellular Biology
  • MSS Mechanical and Structural Systems
  • NCR Networking and Comm’ns Research
  • OCE Ocean Sciences
  • PHY Physics
  • SES Social and Economic Sciences
Distribution of Users: Geographical (unweighted)

Appendix B, Summaries

Apologies to AK & HI!

Distribution of Users: Geographical (weighted)

Appendix B, Summaries

Apologies to AK & HI!
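The unweighted/weighted distinction on these two slides can be made concrete with a toy example (all data below is invented for illustration): unweighted counts each user once; weighted scales each user by usage.

```python
# Toy illustration (data invented) of unweighted vs. usage-weighted
# geographic distributions of users.
from collections import Counter

# (state, SUs consumed) per user -- hypothetical values
users = [("PA", 120_000), ("PA", 500), ("CA", 80_000), ("TX", 2_000)]

unweighted = Counter(state for state, _ in users)  # head count per state
weighted = Counter()                               # SU-weighted per state
for state, sus in users:
    weighted[state] += sus

print(unweighted)  # PA counts twice, regardless of usage
print(weighted)    # PA's total is dominated by its single large user
```

A state with many small users ranks high unweighted but can drop sharply once usage is weighted in, which is why the two maps differ.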

RP Roles
  • One might get the idea that the TG simply makes available a range of systems
    • Reinforced by the voluminous, dry statistics
    • Not a bad assumption for “routine resources”
    • Not applicable to most TG systems, which are typically cultivating new ground in scale or functionality
  • In this age of commoditization of most parts of IT systems, extreme price pressure, and thin margins
    • Vendors are not able to innovate or support as in the past, and are often particularly weak in integrative areas that fall through the cracks between vendor domains
    • Yet the TG strives to excel in both areas, and
    • The research community benefits from both innovation and support
  • The GIG works on broad, integrative issues; the more specialized and detailed burdens fall ever more on the RP’s.
    • Specifically, that burden is borne by the RP technical staff.
    • Their importance is often insufficiently emphasized.
    • (My impression is that the UK, for example, is more conscious of this issue.)
RP Roles
  • RP activities span the range
    • Hardware
    • Configuration
    • System & operational software
    • Application support
    • Often leading to TG-wide and vendor-level improvements
  • Although the annual report does not focus on these activities, the careful reader will notice them in the descriptions of
    • Science projects
    • Activities discussions
RP Roles
  • Standing up new machines and
    • adapting them to the needs of their user community
    • requires care, expertise & experience.
  • The following three examples demonstrate a range of machine capabilities and scheduling adaptations.
  • Working well under varying conditions is NOT simply a matter of the jobs submitted.
Range of Usage Styles
  • System & user driven
    • Wide range
    • Focus: large
    • Focus: long
RP Activities

Heterogeneity is an important provider of a wide range of capabilities for a wide range of research requirements.

Selection of RP activities demonstrating breadth


University of Chicago & ANL



Indiana -- Purdue

LONI (Louisiana Optical Network Initiative)

LONI: statewide network and distributed computing system (12 HPC systems)

Centerpiece LONI system: Queen Bee (QB)

668-node Dell PowerEdge 1950 IB system, 50 TF peak; each node has eight 2.33-GHz Intel “Clovertown” Xeon cores, 8 GB RAM, 36 GB disk

60 TB /work and 60 TB /project Lustre disk, 300 TB NCSA archival storage (through subcontract)

50% allocated through TeraGrid, allocated in pool with NCSA’s Abe (users get one allocation they can use on either system)

Began running jobs for TeraGrid users on 1 February 2008

10.7M SUs were used by TeraGrid users to run 24.6K jobs in CY2008

2 planned, 6 unplanned outages (456 hours) -> 94.3% availability

0 security incidents in CY2008 impacting TeraGrid usage
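The availability figure above follows from the outage hours, under one assumption not stated on the slide: that availability is computed over QB's TeraGrid service period, 1 February through 31 December 2008.

```python
# Hedged check of the 94.3% availability figure. Assumption (not on the
# slide): the denominator is QB's TeraGrid service period, 1 Feb - 31 Dec 2008.
from datetime import date

outage_hours = 456  # 2 planned + 6 unplanned outages, per the slide
period_hours = (date(2009, 1, 1) - date(2008, 2, 1)).days * 24  # 8040 h

availability = 100 * (period_hours - outage_hours) / period_hours
print(f"{availability:.1f}% available")  # -> 94.3% available
```

Over a full calendar year the same outage total would give about 94.8%, so the Feb–Dec service period appears to be the basis used.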

QB is an early test site for IU’s LustreWAN and gateways

LONI also develops and tests software (on QB and other TG systems)

HARC – Highly Available Resource Co-allocators (reserves and co-allocates multiple resources, including compute and network)

PetaShare – a distributed file system being developed across LONI

SAGA – Simple API for Grid Applications (lets application programmers use distributed computing without worrying about the specific implementation)

Cactus – scientific framework for astrophysics, fluid dynamics, etc.

GridChem science gateway – biology applications being added

PSC: SMP & XT3/5
  • Introduced Pople & Rachel
    • Large shared memory systems
    • 21% of TG SMP capability
    • Developed SMP scheduling techniques
  • PSC’s XT3 innovations (2004) benefit XT5 & TG (2009)
    • PSC externalized XT3 login & IO
    • Improved TG system performance & stability
    • Cray adopted them in the standard XT5 product
TeraGrid Resource Provider
Visualization Hardware

Provides clusters with dedicated graphics hardware, 20 TB of online storage, and a 30 Gb/s connection to the TeraGrid backbone network

Visualization Gateway

Develops and hosts TeraGrid Visualization Gateway

New in 2008

Volume rendering service

Dynamic accounts for community user access

Visualization Support

Provides visualization support to researchers from a range of scientific domains, including: Astrophysics, Fluid Dynamics, Life Sciences, Applied Mathematics

SDSC RP Operations
  • Global File Systems
    • Large production GPFS-WAN
    • Supporting IU & PSC Lustre-WAN efforts
  • User-Settable Advance Reservations and Co-/Meta-Scheduling
  • Archival Storage Systems, including dual-site option
  • Advanced Support for TeraGrid Applications (ASTA)
  • Education, Outreach & Training
  • SDSC staff hold a number of key positions within TG (ADs, WG leads, Inca, documentation, etc.)


  • TeraGrid Linux Cluster: IBM/Intel IA-64, 0.9 PB
  • Archival Systems: 36 PB capacity (~5 PB used)

NCAR TeraGrid RP Developments
  • Current Cyberinfrastructure
    • 5.7 TFlops/2048 core Blue Gene/L system
    • 100 TB storage cluster
    • 1.5 PB SL8500 Tape archive with 1 PB media running HPSS/HSI
    • Sun DAV visualization node
    • Chronopolis data preservation testbed
  • New Cyberinfrastructure Developments
    • Possible Blue Gene/L Expansion to 22.8 TF system with 8192 cores
    • 100 TFLOPS MRI System @ CU with NCAR 10% stakeholder
    • DAV node upgrade to more capable 8 core system with larger memory
  • Science Gateway developments
    • Asteroseismic Modeling Portal (AMP). Successful launch of the Kepler spacecraft increases potential scientific impact of this gateway.


National Institute for Computational Sciences

Kraken and Krakettes!

Proprietary Cray Interconnect, SeaStar2, provides excellent scaling, with numerous tightly coupled applications running at 32K and 64K cores on the XT5.

NICS is specializing in true capability applications, plus high-performance file and archival systems.

TACC Major Achievements in 2008
  • Deployed Ranger, first TeraGrid Track2 system
    • 100s of millions of SUs, far greater than all of TG in 2007
    • Users scaling to full system for unprecedented science
  • Deployed Spur, most powerful TG vis system
    • Tightly coupled to Ranger HPC system, parallel I/O
    • Already over-requested, demonstrating pent-up demand for large-scale vis
  • Supporting most of the largest science projects in TG
    • Ex. Hurricane Gustav/Ike NOAA project demonstrated TG end-to-end science support with societal impact
  • TG User Portal added file management, other capabilities
    • A heavily used online resource for most TG users
    • Leveraging local TACC s/w development, projects, staff
Indiana University RP Highlights
  • Particular community needs. Big Red is particularly suited for Molecular Dynamics codes, the #1 use of the system
  • Workflows & Lustre-WAN. The SC07 Bandwidth Challenge led to software enhancements and to Lustre-WAN as a production service. Deployed in production at IU and PSC; in testing at LONI, TACC, and SDSC. Support for workflows that integrate Lustre, Big Red, and other RPs (40% decrease in end-to-end time for LEAD workflows based on use of Lustre-WAN)
  • New service: Virtual Machine hosting. Quarry is now serving 17 projects running a total of 25 VMs, e.g.: TeraGrid Information Services; LEAD (Linked Environments for Atmospheric Discovery); Data Collections (e.g. FlyBase)
Purdue RP Operations Highlights

High-throughput computing

Expanded campus participation in the Condor pool (IU and Wisconsin have joined; others are in process).

Support automatic job routing from the Purdue Condor pool to remote sites (OSG, TeraGrid, etc.).

A scalable job-reporting tool is needed to better serve users; working with IU and Wisconsin’s Condor team.
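
To make the high-throughput workflow concrete, here is a minimal, hypothetical Condor submit description of the kind routed through such a pool (the executable and file names are invented; only the submit-file syntax is standard Condor):

```
# Hypothetical Condor submit file for a high-throughput run (names invented).
universe                = vanilla
executable              = analyze.sh
arguments               = chunk_$(Process).dat
output                  = out.$(Process)
error                   = err.$(Process)
log                     = run.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 100
```

`queue 100` submits 100 independent jobs; the pool (and, with routing, remote OSG/TeraGrid sites) decides where each one runs.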

  • Steele - 893 nodes (7144 cores), 66 Tflops peak
    • Began production 5/2008, replaced Lear. Installed in less than a day and began running TG jobs as the cluster was being installed.
  • Brutus (FPGA)
    • Integrated into TeraGrid in 2/2008 as an experimental resource. Supporting users who develop programming tools to make FPGA-based acceleration easier to adopt and those who will use the FPGA-accelerated BLAST application through Condor scheduling.
  • The Wispy cloud – experimental, a collaboration with the UC/ANL Nimbus group.
    • Made available to TG users in fall 2008. Users can transfer virtual machine images to Wispy and run them the same way as submitting a job to the grid. Using Wispy to help configure an OSG environment for non-OSG TG resources (collaboration with RENCI, UC/ANL).
  • OSG collaboration: standardize methods for advertising support for parallel (MPI) jobs – make users’ lives easier.