
PDSF and the Alvarez Clusters


Presentation Transcript


  1. PDSF and the Alvarez Clusters
  Presented by Shane Canon, NERSC/PDSF, canon@nersc.gov

  2. NERSC Hardware
  National Energy Research Scientific Computing Center, http://www.nersc.gov
  One of the nation's top unclassified computing resources, funded by the DOE for over 25 years with the mission of providing computing and network services for research. NERSC is located at Lawrence Berkeley Laboratory in Berkeley, CA, http://www.lbl.gov
  High Performance Computing Resources, http://hpcf.nersc.gov
  - IBM SP cluster, 2000+ processors, 1.2+ TB RAM, 20+ TB cluster filesystem
  - Cray T3E, 692 processors, 177 GB RAM
  - Cray PVP, 64 processors, 3 GW RAM
  - PDSF, 160 compute nodes, 281 processors, 7.5 TB disk space
  - HPSS, 6 StorageTek silos, 880 TB of near-line and offline storage, soon to be expanded to a full petabyte

  3. NERSC Facilities
  New Oakland Scientific Facility
  - 20,000 sq. ft. data center
  - 24x7 operations team
  - OC48 (2.5 Gbit/sec) connection to LBL/ESNet
  - Options on a 24,000 sq. ft. expansion

  4. NERSC Internet Access
  ESNet Headquarters, http://www.es.net/
  - Provides leading-edge networking to DOE researchers
  - Backbone has an OC12 (622 Mbit/sec) connection to CERN
  - Backbone connects key DOE sites
  - Headquartered at Lawrence Berkeley
  - Location assures prompt response

  5. Cluster Design
  • Embarrassingly parallel
  • Commodity networking
  • Commodity parts
  • Buy "at the knee" (see the sketch after this list)
  • No modeling
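
"Buy at the knee" refers to buying parts near the bend of the price/performance curve, where each additional increment of speed starts costing disproportionately more. A toy calculation, with clock speeds and prices invented purely for illustration, shows the idea:

```python
"""Toy 'buy at the knee' calculation: compare dollars per unit of
performance across processor options and stop buying where the curve
turns steep. Clock speeds and prices are invented for illustration."""

# (clock in MHz, unit price in dollars) for hypothetical CPU options.
OPTIONS = [(700, 180), (866, 240), (1000, 450), (1130, 900)]

for mhz, price in OPTIONS:
    print("%4d MHz: $%.3f per MHz" % (mhz, float(price) / mhz))
# In this made-up table the cost per MHz climbs sharply above 866 MHz:
# that bend is the "knee" where extra speed stops being worth the price.
```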

  6. Issues with Cluster Configuration
  • Maintaining consistency
  • Scalability
    • System
    • Human
  • Adaptability/flexibility
  • Community tools

  7. Cluster Configuration: Present
  • Installation
    • Home grown (nfsroot/tar image)
  • Configuration management (see the rsync sketch below)
    • Rsync/RPM
    • Cfengine
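
As a rough illustration of the rsync side of this approach (not PDSF's actual scripts; the node names, master tree, and target path below are hypothetical), a configuration push might look something like:

```python
#!/usr/bin/env python
"""Sketch of pushing a master configuration tree to compute nodes with
rsync over ssh, in the spirit of the rsync/RPM approach named above.
Node names and paths are invented for illustration."""

import subprocess

# Hypothetical node list; a real site would generate this from its
# cluster database or hosts file.
NODES = ["pdsf%03d" % n for n in range(1, 5)]

# Master copy of the configuration fragment kept on the admin host.
MASTER_TREE = "/admin/config-master/etc/"

def push_config(node):
    """Mirror the master tree onto one node; return rsync's exit code."""
    return subprocess.call([
        "rsync", "-az", "--delete",   # archive mode, compress, exact mirror
        "-e", "ssh",                  # transport over ssh
        MASTER_TREE,
        "%s:/etc/cluster/" % node,    # hypothetical target path on the node
    ])

if __name__ == "__main__":
    for node in NODES:
        status = push_config(node)
        print("%s: %s" % (node, "ok" if status == 0 else "rsync failed"))
```

The --delete flag is what keeps each node an exact mirror of the master tree: hand edits made on individual nodes are overwritten on the next push, which is how consistency is preserved.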

  8. Cluster Configuration: Future
  • Installation
    • Kickstart (or systemimager/systeminstaller)
  • Configuration management
    • RPM
    • Cfengine
    • Database (see the kickstart-generation sketch below)
  • Resource management
    • Integrate with configuration management
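
One way to read the "database" item is that a single node database would drive both installation and configuration. The sketch below illustrates that idea by generating per-node kickstart files from a node table; the sqlite schema, file layout, and kickstart template are hypothetical, not the actual PDSF tooling:

```python
#!/usr/bin/env python
"""Sketch of generating per-node kickstart files from a small node
database, illustrating the single-data-source idea above. The sqlite
schema and the kickstart template are hypothetical."""

import os
import sqlite3

# Minimal kickstart skeleton; real kickstart files carry many more directives.
KS_TEMPLATE = """install
nfs --server=%(install_server)s --dir=/export/redhat
lang en_US
keyboard us
rootpw --iscrypted %(root_hash)s
%%packages
@ Base
%(extra_packages)s
"""

def generate_kickstarts(db_path, out_dir):
    """Write one <hostname>.ks file per row of the hypothetical nodes table."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT hostname, install_server, root_hash, extra_packages FROM nodes"
    )
    for hostname, install_server, root_hash, extra_packages in rows:
        ks = KS_TEMPLATE % {
            "install_server": install_server,
            "root_hash": root_hash,
            "extra_packages": extra_packages,
        }
        with open(os.path.join(out_dir, hostname + ".ks"), "w") as out:
            out.write(ks)
    conn.close()

if __name__ == "__main__":
    generate_kickstarts("nodes.db", "/export/kickstart")  # hypothetical paths
```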

  9. NERSC Staff
  NERSC and LBL have dedicated, experienced staff in the fields of high performance computing, Grid computing and mass storage.
  Researchers
  - Will Johnston, Head of the Distributed Systems Dept. and Grid researcher, http://www.itg.lbl.gov/; Project Manager for the NASA Information Power Grid, http://www.nas.nasa.gov/IPG
  - Arie Shoshani, Head of Scientific Data Management, http://gizmo.lbl.gov/DM.html; researches mass storage issues related to scientific computing
  - Doug Olson, Project Coordinator, Particle Physics Data Grid, http://www.ppdg.net/; coordinator for STAR computing at PDSF
  - Dave Quarrie, Chief Software Architect, ATLAS, http://www.nersc.gov/aboutnersc/bios/henpbios.html
  - Craig Tull, Offline Software Framework/Control; coordinator for ATLAS computing at PDSF
  NERSC High Performance Computing Department, http://www.nersc.gov/aboutnersc/hpcd.html
  - Advanced Systems Group evaluates and vets HW/SW for production computing (4 FTE)
  - Computing Systems Group manages infrastructure for computing (9 FTE)
  - Computer Operations & Support provides 24x7x365 support (14 FTE)
  - Networking and Security Group provides networking and security (3 FTE)
  - Mass Storage Group manages the near-line and off-line storage facilities (5 FTE)

  10. PDSF & STAR
  PDSF has been working with STAR since 1998, http://www.star.bnl.gov/
  - Data collection occurs at Brookhaven, and DSTs are sent to NERSC
  - PDSF is the primary offsite computing facility for STAR
  - The collaboration carries out DST analysis and simulations at PDSF
  - STAR has 37 collaborating institutions (too many for arrows!)

  11. PDSF Philosophy
  PDSF is a Linux cluster built from commodity hardware and open source software, http://pdsf.nersc.gov
  - Our mission is to provide the most effective distributed computer cluster possible that is suitable for experimental HENP applications
  - The PDSF acronym came from the SSC lab in 1995, along with the original equipment
  - Architecture tuned for "embarrassingly parallel" applications
  - Uses LSF 4.1 for batch scheduling (see the job-submission sketch below)
  - AFS access, and access to HPSS for mass storage
  - High speed (Gigabit Ethernet) access to the HPSS system
  - One of several Linux clusters at LBL:
    - The Alvarez cluster has a similar architecture, but supports a Myrinet cluster interconnect
    - The NERSC PC Cluster project by the Future Technology Group is an experimental cluster, http://www.nersc.gov/research/FTG/index.html
    - A genome cluster at LBL for research into the fruit fly genome
  - 152 compute nodes, 281 processors, 7.5 TB of storage
  - Cluster uptime for the year 2000 was > 98%; for the most recently measured period (January 2001), cluster utilization for batch jobs was 78%
  - The cluster has had zero downtime due to security issues
  - PDSF and NERSC have a track record of solid security balanced with unobtrusive practices
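
Because the workload is embarrassingly parallel, each input file can simply become an independent batch job. Below is a hedged sketch of submitting such jobs through LSF's bsub; the queue name, data paths, and analysis executable are invented placeholders, while the bsub options (-q, -J, -o) are standard LSF:

```python
#!/usr/bin/env python
"""Sketch of farming out an embarrassingly parallel analysis as
independent LSF batch jobs via bsub. Queue name, file layout, and
the analysis executable are hypothetical placeholders."""

import glob
import os
import subprocess

QUEUE = "medium"                            # hypothetical queue name
INPUTS = glob.glob("/data/dst/*.root")      # hypothetical DST files

for path in INPUTS:
    job_name = "dst-" + os.path.basename(path)
    subprocess.call([
        "bsub",
        "-q", QUEUE,                        # target queue
        "-J", job_name,                     # job name
        "-o", job_name + ".log",            # where LSF writes the job output
        "analyze_dst", path,                # hypothetical analysis program
    ])
```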

  12. More About PDSF
  PDSF uses a common resource pool for all projects
  - PDSF supports multiple experiments: STAR, ATLAS, BABAR, D0, Amanda, E871, E895, E896 and CDF
  - Multiple projects have access to the computing resources, and the software available supports all experiments
  - The actual level of access is determined by the batch scheduler using fair-share rules (see the toy fair-share sketch below)
  - Each project's investment goes into purchasing hardware and support infrastructure for the entire cluster
  - The use of a common configuration decreases management overhead, lowers administration complexity, and increases the availability of usable computing resources
  - Use of commodity Intel hardware keeps us vendor neutral and lowers the cost to all of our users
  - Low cost and easy access to hardware make it possible to update configurations relatively quickly to support new computing requirements
  - Because the physical resources available are always greater than any individual contributor's investment, there is usually some excess capacity for sudden peaks in usage, and always a buffer to absorb sudden hardware failures
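
The fair-share idea can be illustrated with a toy priority calculation: projects that have recently consumed less than their configured share get boosted, and heavy users get throttled. This is only an illustrative sketch, not LSF's actual fair-share algorithm, and the group shares and usage figures are invented:

```python
"""Toy illustration of fair-share scheduling: dispatch priority rises for
groups that have recently used less than their configured share and falls
for heavy users. Not LSF's actual formula; all numbers are invented."""

# Configured shares (the fraction of the cluster each project "owns").
SHARES = {"star": 0.50, "atlas": 0.20, "babar": 0.15, "d0": 0.15}

# Hypothetical recent usage, as a fraction of delivered CPU cycles.
RECENT_USE = {"star": 0.62, "atlas": 0.05, "babar": 0.18, "d0": 0.10}

def priority(group):
    """Higher when a group has consumed less than its configured share."""
    used = max(RECENT_USE.get(group, 0.0), 1e-6)   # avoid division by zero
    return SHARES[group] / used

if __name__ == "__main__":
    for group in sorted(SHARES, key=priority, reverse=True):
        print("%-6s priority %.2f" % (group, priority(group)))
```

Under these made-up numbers ATLAS would dispatch first and STAR last, which is the behavior that lets a common pool absorb one project's usage peak without starving the others.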
