RP Operations

TeraGrid Annual Review

6 April 2009

M. Levine


Structure

  • A huge amount of information in the annual report

    • Explicit discussion of RP Operations in Sec. 8.3 & 8.4

    • Voluminous tables and graphs in Appendix B

    • Numerous tidbits scattered through the report

      • Science projects credit specific RP support

      • Activities discussions credit specific RP expertise [e.g. Data (7.1) & Visualization (7.2)]

  • Outline

    • Summary & augmentations of Sec. 8 & Appendix B

    • Highlight specific items

    • General comments on the roles of RPs in HP-IT in this era of commoditization

  • People from the RPs & GIG here to help with questions


Major System Developments in 2008

  • Sites

    • Indiana, LONI, NCAR, NCSA

    • NICS, ORNL, PSC, Purdue

    • SDSC, TACC, ANL

  • New systems

    • Ranger@TACC (Sun Constellation Cluster, 63k cores, 580 Tf)

    • Kraken@NICS (Cray XT, 18k cores, 160–608 Tf)

    • Dell clusters @LONI & @Purdue

    • SGI Altix @PSC

    • FPGA system @Purdue

    • Vis system @TACC

    • Increased and integrated Condor Pool @Purdue & @Indiana

  • System retirements at

    • NCSA, PSC, Purdue, SDSC, TACC


User community & usage growth (Table 18-1) (strong growth in capability & usage)


Compute (27) & Storage (11) Resources

  • Compute

    • 1.1 Indiana Big Red
    • 1.2 Quarry
    • 1.3 LONI Queen Bee
    • 1.4 NCAR Frost
    • 1.5 NCSA Abe
    • 1.6 Cobalt
    • 1.7 Tungsten
    • 1.8 Mercury
    • 1.9 NICS Kraken
    • 1.10 ORNL NSTG Cluster
    • 1.11 PSC BigBen
    • 1.12 Pople
    • 1.13 Rachel
    • 1.14 Purdue Steele
    • 1.15 Lear
    • 1.16 Condor Pool
    • 1.17 TeraDRE
    • 1.18 Brutus
    • 1.19 SDSC DataStar p655
    • 1.20 DataStar p690
    • 1.21 Blue Gene
    • 1.22 IA64 Cluster
    • 1.23 TACC Ranger
    • 1.24 Lonestar
    • 1.25 Spur (vis)
    • 1.26 Maverick
    • 1.27 UC/ANL IA64 & IA32 Cluster

  • Storage

    • 2.1 Indiana Data Capacitor
    • 2.2 HPSS
    • 2.3 NCSA Mass Storage System
    • 2.4 NICS HPSS
    • 2.5 PSC FAR HSM
    • 2.6 Bessemer
    • 2.7 SDSC GPFS-WAN
    • 2.8 HPSS
    • 2.9 SAM-QFS
    • 2.10 Storage Resource Broker (Collections Disk Space)
    • 2.11 TACC Ranch


    Compute (27) & Storage (11) Resources

    • From the Appendix, you know that they represent a broad spectrum of

      • Architectures

      • Strengths

      • Vendors

    • And therefore a wide range of

      • Capabilities

    • As has been mentioned, this heterogeneity improves the TG's ability to support a wide range of Science & Engineering research


    Compute Resources


    Storage Resources


    Breadth!

    • Computational strength (individual systems)

      • 0.31 to 521 (608) Tf [**Recall graph from previous presentations**]

    • Total memory

      • 0.5 to 126 (132) TB

    • Memory per core

      • 0.256 to 32 GB/core

    • Storage capacity

      • 2.2 to 41.3 PB

    • Architectures

      • Clusters, MPPs
      • SMPs, Experimental

    • Compute vendors

      • Cray, Dell, HP, IBM
      • Nvidia, SGI, Sun

    • Storage vendors

      • HPSS, SGI
      • STK, Sun


    Usage by Discipline (major users)


    NSF Divisions (Coverage: B.4)

    • ASC Advanced Scientific Computing
    • AST Astronomical Sciences
    • ATM Atmospheric Sciences
    • BIR Biological Instrumentation and Resources
    • BNS Behavioral and Neural Sciences
    • CCR Computing and Computation Research
    • CDA Cross-Disciplinary Activities
    • CHE Chemistry
    • CTS Chemical and Thermal Systems
    • DDM Design and Manufacturing Systems
    • DEB Environmental Biology
    • DMR Materials Research
    • DMS Mathematical Sciences
    • EAR Earth Sciences
    • ECS Electrical and Communications Systems
    • ERC Engineering Research Centers
    • IBN Integrative Biology and Neuroscience
    • IRI Informatics, Robotics and Intelligent Systems
    • MCB Molecular and Cellular Biology
    • MSS Mechanical and Structural Systems
    • NCR Networking and Comm'ns Research
    • OCE Ocean Sciences
    • PHY Physics
    • SES Social and Economic Sciences


    Distribution of Users: Institutions (477, 1/3)


    Distribution of Users: Institutions (2/3)


    Distribution of Users: Institutions (3/3)


    Distribution of Users: Geographical (unweighted)

    Appendix B, Summaries

    Apologies to AK & HI!


    Distribution of Users: Geographical (weighted)

    Appendix B, Summaries

    Apologies to AK & HI!


    RP Roles

    • One might get the idea that the TG simply makes available a range of systems

      • Reinforced by the voluminous, dry statistics

      • Not a bad assumption for “routine resources”

      • Not applicable to most TG systems, which are typically cultivating new ground in scale or functionality.

    • In this age of commoditization of most parts of IT systems, with extreme price pressure and thin margins:

      • Vendors are not able to innovate or support as in the past, and are often particularly weak in integrative areas that fall through the cracks between vendor domains

      • Yet the TG strives to excel in both areas, and

      • The research community benefits from both innovation and support

    • The GIG works on broad, integrative issues; the more specialized and detailed burdens fall ever more on the RPs.

      • Specifically, that burden is borne by the RP technical staff.

      • Their importance is often insufficiently emphasized.

      • (My impression is that the UK, for example, is more conscious of this issue.)


    RP Roles

    • RP activities span the range

      • Hardware

      • Configuration

      • System & operational software

      • Application support

      • Many times leading to TG-wide and vendor improvements

    • Although the annual report does not focus on these activities, the careful reader will notice them in the descriptions of

      • Science projects

      • Activities discussions


    RP Roles

    • Standing up new machines and adapting them to the needs of their user community requires care, expertise & experience.

    • The following 3 examples demonstrate a range of machine capabilities and scheduling adaptations.

    • Working well under varying conditions is NOT simply a matter of the jobs submitted.


    Range of Usage Styles

    • System & user driven

      • Wide range

      • Focus: large

      • Focus: long


    RP Activities

    Heterogeneity is an important provider of a wide range of capabilities for a wide range of research requirements.

    Selection of RP activities demonstrating breadth

    LONI & PSC

    University of Chicago & ANL

    SDSC & NCAR

    NICS & TACC

    Indiana & Purdue


    LONI (Louisiana Optical Network Initiative)

    LONI: statewide network and distributed computing system (12 HPC systems)

    Centerpiece LONI system: Queen Bee (QB)

    668-node Dell PowerEdge 1950 IB system, 50 TF peak; each node has eight 2.33-GHz Intel “Clovertown” Xeon cores, 8 GB RAM, 36 GB disk

    60 TB /work and 60 TB /project Lustre disk, 300 TB NCSA archival storage (through subcontract)

    50% allocated through TeraGrid, allocated in pool with NCSA’s Abe (users get one allocation they can use on either system)

    Began running jobs for TeraGrid users on 1 February 2008

    10.7M SUs were used by TeraGrid users to run 24.6K jobs in CY2008

    2 planned, 6 unplanned outages (456 hours) -> 94.3% availability (see the quick check after this slide)

    0 security incidents in CY2008 impacting TeraGrid usage

    QB is an early test site for IU’s LustreWAN and gateways

    LONI also develops and tests software (on QB and other TG systems)

    HARC – Highly Available Resource Co-allocators (reserves and co-allocates multiple resources, including compute and network)

    PetaShare – a distributed file system being developed across LONI

    SAGA – Simple API for Grid Applications (lets application programmers use distributed computing without worrying about the specific implementation)

    Cactus – scientific framework for astrophysics, fluid dynamics, etc.

    GridChem science gateway – biology applications being added
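
    A quick consistency check of the Queen Bee numbers above, as a minimal sketch. The 4 double-precision flops per cycle per Clovertown core and the availability window (1 February through 31 December 2008, since QB began running TeraGrid jobs on 1 February) are assumptions of this sketch, not figures from the slide.

    ```python
    # Rough check of the Queen Bee figures quoted above.
    # Assumed (not on the slide): Clovertown Xeons retire 4 double-precision
    # flops per core per cycle; availability is measured over 1 Feb - 31 Dec 2008.
    nodes, cores_per_node, ghz, flops_per_cycle = 668, 8, 2.33, 4
    peak_tf = nodes * cores_per_node * ghz * flops_per_cycle / 1e3
    print(f"peak ~ {peak_tf:.1f} Tf")           # ~49.8 Tf, consistent with "50 TF peak"

    window_hours = (366 - 31) * 24              # 2008 was a leap year; window starts 1 Feb
    outage_hours = 456
    availability = 1 - outage_hours / window_hours
    print(f"availability ~ {availability:.1%}") # ~94.3%, matching the quoted figure
    ```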


    PSC: SMP & XT3/5

    • Introduced Pople & Rachel

      • Large shared memory systems

      • 21% of TG SMP capability

      • Developed SMP scheduling techniques

    • PSC’s XT3 innovations (2004) benefit XT5 & TG (2009)

      • PSC externalized XT3 login & IO

      • Improved TG system performance & stability

      • Cray adopted this for the standard XT5 product


    TeraGrid Resource Provider (UC/ANL)

    Visualization Hardware

    Provides clusters with dedicated graphics hardware, 20 TB of online storage, and a 30 Gb/s connection to the TeraGrid backbone network

    Visualization Gateway

    Develops and hosts TeraGrid Visualization Gateway

    New in 2008

    Volume rendering service

    Dynamic accounts for community user access

    Visualization Support

    Provides visualization support to researchers from a range of scientific domains, including: Astrophysics, Fluid Dynamics, Life Sciences, Applied Mathematics


    SDSC RP Operations

    • Global File Systems

      • Large production GPFS-WAN

      • Supporting IU & PSC Lustre-WAN efforts

    • User-Settable Advance Reservations and Co-/Meta-Scheduling

    • Archival Storage Systems, including dual-site option

    • Advanced Support for TeraGrid Applications (ASTA)

    • Education, Outreach & Training

    • SDSC staff hold a number of key positions within TG (ADs, WG leads, Inca, documentation, etc.)

    GPFS-WAN: 0.9 PB

    TeraGrid Linux Cluster: IBM/Intel IA-64

    Archival Systems: 36 PB capacity (~5 PB used)


    NCAR TeraGrid RP Developments

    • Current Cyberinfrastructure

      • 5.7 TFlops/2048 core Blue Gene/L system

      • 100 TB storage cluster

      • 1.5 PB SL8500 Tape archive with 1 PB media running HPSS/HSI

      • Sun DAV visualization node

      • Chronopolis data preservation testbed

    • New Cyberinfrastructure Developments

      • Possible Blue Gene/L Expansion to 22.8 TF system with 8192 cores (see the scaling check after this slide)

      • 100 TFLOPS MRI System @ CU with NCAR 10% stakeholder

      • DAV node upgrade to more capable 8 core system with larger memory

    • Science Gateway developments

      • Asteroseismic Modeling Portal (AMP). Successful launch of the Kepler spacecraft increases potential scientific impact of this gateway.
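
    The Blue Gene/L figures above are internally consistent; a minimal sketch of the arithmetic (the ~2.8 Gflops/core reference point for BG/L is my own, not from the slide).

    ```python
    # Consistency check of the NCAR Blue Gene/L numbers quoted above.
    current_tf, current_cores = 5.7, 2048
    expanded_tf, expanded_cores = 22.8, 8192

    # Per-core peak of the current system, ~2.78 Gf/core (BG/L's ~2.8 Gf/core).
    print(f"per-core peak ~ {current_tf * 1e3 / current_cores:.2f} Gf")

    # The proposed expansion scales cores and peak by the same factor.
    print(f"{expanded_cores / current_cores:.0f}x cores, {expanded_tf / current_tf:.0f}x peak")
    ```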


    National Institute for Computational Sciences

    Kraken and Krakettes!

    Proprietary Cray Interconnect, SeaStar2, provides excellent scaling, with numerous tightly coupled applications running at 32K and 64K cores on the XT5.

    NICS is specializing in true capability applications, plus high-performance file and archival systems.


    TACC Major Achievements in 2008

    • Deployed Ranger, first TeraGrid Track2 system

      • 100s of millions of SUs, >> greater than all TG in 2007

      • Users scaling to full system for unprecedented science

    • Deployed Spur, most powerful TG vis system

      • tightly coupled to Ranger HPC system, parallel I/O

      • Already over-requested, demonstrating pent-up demand for large-scale vis

    • Supporting most of the largest science projects in TG

      • Ex. Hurricane Gustav/Ike NOAA project demonstrated TG end-to-end science support with societal impact

    • TG User Portal added file management, other capabilities

      • A heavily used online resource for most TG users

      • Leveraging local TACC s/w development, projects, staff


    Indiana University RP Highlights

    • Particular community needs: Big Red is particularly suited for molecular dynamics codes, the #1 use of the system

    • Workflows & Lustre-WAN. The SC07 Bandwidth Challenge led to software enhancements and to Lustre-WAN as a production service. Deployed as production at IU and PSC; in testing at LONI, TACC, and SDSC. Support for workflows that integrate Lustre, Big Red, and other RPs (40% decrease in end-to-end time for LEAD workflows based on use of Lustre-WAN)

    • New service – Virtual Machine hosting. Quarry is now serving 17 projects running a total of 25 VMs, e.g.: TeraGrid Information Services; LEAD (Linked Environments for Atmospheric Discovery); Data Collections (e.g. FlyBase)


    Purdue RP Operations Highlights

    High-throughput computing

    Expanded campus participation in the Condor pool (IU and Wisconsin joined, others are in process).

    Support automatic job routing from the Purdue Condor pool to remote sites (OSG, TeraGrid, etc.).

    Need scalable job reporting tool to better serve users - working with IU and Wisconsin’s Condor team.

    • Steele - 893 nodes (7144 cores), 66 Tflops peak

      • Began production 5/2008, replaced Lear. Installed in less than a day and began running TG jobs as the cluster was being installed. (A quick consistency check of the Steele figures follows this list.)

    • Brutus (FPGA)

      • Integrated into TeraGrid in 2/2008 as an experimental resource. Supporting users who develop programming tools to make FPGA-based acceleration easier to adopt and those who will use the FPGA-accelerated BLAST application through Condor scheduling.

    • The Wispy cloud – experimental, a collaboration with UC/ANL Nimbus group.

      • Made available to TG users in fall 2008. Users can transfer virtual machine images to Wispy and run them the same way as submitting a job to the grid. Using Wispy to help configure an OSG environment for non-OSG TG resources (collaboration with RENCI, UC/ANL).

    • OSG collaboration - Standardize methods for advertising support for parallel (MPI) jobs – make users' lives easier.
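
    A similar back-of-the-envelope check of the Steele figures in the list above; the ~2.3 GHz quad-core Xeon clock and 4 flops per cycle per core are assumptions of this sketch, not numbers from the slide.

    ```python
    # Rough check of the Steele numbers: 893 nodes, 7144 cores, 66 Tflops peak.
    # Assumed (not on the slide): ~2.3 GHz quad-core Xeons, 4 DP flops/cycle/core.
    nodes, cores = 893, 7144
    print(cores / nodes)                 # 8.0 cores per node

    ghz, flops_per_cycle = 2.3, 4
    peak_tf = cores * ghz * flops_per_cycle / 1e3
    print(f"peak ~ {peak_tf:.0f} Tf")    # ~66 Tf, in line with the quoted 66 Tflops peak
    ```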

