NERSC Site Report

Presentation Transcript


  1. NERSC Site Report HEPiX October 20, 2003 TRIUMF

  2. LBL, NERSC, and PDSF
  • LBL manages the NERSC Center for DOE
  • PDSF is the production Linux cluster at NERSC, used primarily for HEP science
  • This site report touches on activities of interest to the HEPiX community at each of these levels

  3. PDSF - New Hardware
  • 96 dual-Athlon systems
  • 8 storage nodes, ~18 TB formatted
  • All gigabit-attached (Dell switches)
  • Purchased two Opteron systems for testing

  4. PDSF Projects
  • HostDB (presentation later)
  • Sun GridEngine evaluation
    • Met all requirements (long list)
    • Putting it into semi-production on retired nodes
  • Grid certificate DN kernel module
  • 1-wire based monitoring and control network (see the sketch after this list)
  • High-availability server
    • Uses the heartbeat code
  • IDE-based Fibre Channel array
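As an illustration of what a 1-wire monitoring and control network can report, the sketch below polls Dallas 1-wire temperature sensors through the Linux w1 sysfs interface. This is only a minimal sketch: the sysfs layout shown is the later stock-kernel interface, and the sensor naming is assumed for the example, not PDSF's actual implementation.

```python
import glob

# Poll Dallas 1-wire temperature sensors via the Linux w1 sysfs interface.
# The /sys/bus/w1/devices/28-* layout is the modern kernel interface and is
# shown only as an illustration; PDSF's 2003 setup differed.
def read_temperatures():
    temps = {}
    for dev in glob.glob("/sys/bus/w1/devices/28-*/w1_slave"):
        with open(dev) as f:
            lines = f.read().splitlines()
        # First line ends with "YES" when the CRC check passes;
        # second line ends with "t=<millidegrees Celsius>".
        if lines and lines[0].endswith("YES"):
            millideg = int(lines[1].split("t=")[1])
            sensor_id = dev.split("/")[-2]
            temps[sensor_id] = millideg / 1000.0
    return temps

if __name__ == "__main__":
    for sensor, celsius in sorted(read_temperatures().items()):
        print(f"{sensor}: {celsius:.1f} C")
```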

  5. PDSF - Other news
  • Aztera
    • Zambeel folded
    • StorAd is making a best effort to support the system
  • New user groups
    • KamLAND
    • e896
    • ALICE

  6. IBM SP
  • Upgraded
    • 208 nodes added (16-way Nighthawk II)
    • Additional 20 TB of disk
  • Total system
    • 10 Tflop/s peak
    • 7.8 TB of memory
    • 44 TB of GPFS storage

  7. Mass Storage
  • Hardware
    • New DataDirect disk cache
    • New tape drives allow high-capacity (200 GB) cartridges
  • Software
    • Currently running HPSS 4.3
    • Testing 5.1
  • Testing
    • DMAPI
    • htar command (see the sketch after this list)
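As a hedged illustration of the htar testing mentioned above: htar packs a directory of files into a single tar-format archive written directly into HPSS, along with an index for member-level retrieval. The sketch below wraps the standard htar create and list invocations from Python; the archive name and source directory are hypothetical examples.

```python
import subprocess

# Bundle a local run directory into a single archive stored in HPSS via htar.
# "/home/user/run42" and "runs/run42.tar" are hypothetical example paths.
def archive_to_hpss(local_dir, hpss_archive):
    # -c: create archive, -v: verbose, -f: archive name (created in HPSS)
    subprocess.run(["htar", "-cvf", hpss_archive, local_dir], check=True)

def list_hpss_archive(hpss_archive):
    # -t: list the table of contents of an existing archive
    subprocess.run(["htar", "-tvf", hpss_archive], check=True)

if __name__ == "__main__":
    archive_to_hpss("/home/user/run42", "runs/run42.tar")
    list_hpss_archive("runs/run42.tar")
```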

  8. Grid Activities
  • GridFTP and gatekeeper deployed on all production systems (except the gatekeeper on Seaborg, which is coming soon)
  • Integrating the account management system with grid certificates
  • Testing a myproxy-based system (see the sketch after this list)
  • Portal
    • Web interface to HPSS
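To illustrate the kind of myproxy-based workflow under test, the sketch below retrieves a short-lived proxy credential with myproxy-logon and then moves a file over GridFTP with globus-url-copy. Both are standard MyProxy/Globus client tools; the server name, GridFTP endpoint, username, and paths are hypothetical.

```python
import subprocess

# Hypothetical MyProxy server and GridFTP endpoint, for illustration only.
MYPROXY_SERVER = "myproxy.example.gov"
GRIDFTP_SOURCE = "gsiftp://dtn.example.gov/scratch/user/data.tar"
LOCAL_DEST = "file:///tmp/data.tar"

def get_proxy(username, lifetime_hours=12):
    # Retrieve a delegated proxy credential from the MyProxy server
    # (prompts for the MyProxy passphrase on stdin).
    subprocess.run(
        ["myproxy-logon", "-s", MYPROXY_SERVER,
         "-l", username, "-t", str(lifetime_hours)],
        check=True,
    )

def gridftp_fetch(src, dest):
    # GridFTP transfer authenticated with the proxy obtained above.
    subprocess.run(["globus-url-copy", src, dest], check=True)

if __name__ == "__main__":
    get_proxy("hepix_user")
    gridftp_fetch(GRIDFTP_SOURCE, LOCAL_DEST)
```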

  9. Networking
  • Jumbo frame support to ESnet
  • Looking for other sites to test jumbo frames across the WAN (see the sketch after this list)
  • New production router (Juniper)
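One simple way to verify a jumbo-frame path end to end is to ping with a payload just under the 9000-byte MTU and fragmentation prohibited; any hop still limited to 1500-byte frames makes the probes fail. The sketch below wraps that check. The target hostname is hypothetical; -M, -s, and -c are standard Linux ping flags.

```python
import subprocess

# 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
JUMBO_PAYLOAD = 9000 - 28

def jumbo_path_ok(host, count=5):
    # -M do: prohibit fragmentation (path MTU discovery),
    # -s: ICMP payload size, -c: number of probes.
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(JUMBO_PAYLOAD), "-c", str(count), host],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    # "remote-site.example.org" is a hypothetical peer for the WAN jumbo test.
    host = "remote-site.example.org"
    print(f"jumbo path to {host}:", "OK" if jumbo_path_ok(host) else "blocked or lossy")
```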

  10. GUPFS
  • Hardware testbed:
    • 3Par Data
    • Yotta Yotta
    • Dell EMC
    • Dot Hill
    • Data Direct (soon)
    • Panasas
  • Interconnect hardware:
    • Topspin (IB)
    • Infinicon (IB)
    • Cisco (iSCSI)
    • Qlogic (iSCSI)
    • Adaptec (iSCSI)
    • Myrinet 2000
    • Various FC
  • Filesystems:
    • ADIC license
    • GPFS license
    • GFS 5.2 license
    • Lustre
  • Test clients (see the I/O sketch after this list):
    • Dual-processor 2.2 GHz Xeons
    • 2 GB memory
    • 2 PCI-X
    • Local HD for OS
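For context on how such test clients are typically exercised, the sketch below is a minimal single-client streaming write/read check against a mounted shared filesystem. It is not one of the GUPFS project's actual benchmarks; the mount point and file size are hypothetical, and the read pass will largely measure the client page cache unless the file exceeds memory or the cache is dropped first.

```python
import os
import time

# Hypothetical mount point of the shared filesystem under test.
MOUNT_POINT = "/gupfs/testfs"
FILE_SIZE_MB = 1024
BLOCK = b"\0" * (1 << 20)  # 1 MiB write blocks

def write_throughput(path, size_mb):
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())  # include the time to reach stable storage
    return size_mb / (time.time() - start)

def read_throughput(path):
    size_mb = os.path.getsize(path) / (1 << 20)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return size_mb / (time.time() - start)

if __name__ == "__main__":
    test_file = os.path.join(MOUNT_POINT, "stream_test.dat")
    print(f"write: {write_throughput(test_file, FILE_SIZE_MB):.1f} MB/s")
    print(f"read:  {read_throughput(test_file):.1f} MB/s")
    os.remove(test_file)
```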

  11. Distributed Systems Dept.
  • Net100 (http://www.net100.org/) - Built on Web100 (PSC, NCAR, NCSA) and NetLogger (LBNL), Net100 modifies operating systems to respond dynamically to network conditions and adjust network transfers, sending data as fast as the network will allow (see the sketch after this list).
  • Self-Configuring Network Monitor (SCNM) (http://dsd.lbl.gov/Net-Mon/Self-Config.html) - Provides accurate, comprehensive, and on-demand application-to-application monitoring capabilities throughout the interior of the interconnecting network domains.
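Net100's adjustments happen inside the kernel, but the core idea it automates, sizing the TCP window to the path's bandwidth-delay product so a long, fat pipe stays full, can be shown with a small calculation. The bandwidth and round-trip-time figures below are hypothetical examples.

```python
# Illustration of the bandwidth-delay-product sizing that Net100-style tuning
# automates: the TCP window must cover bandwidth * round-trip time to keep
# a long, high-bandwidth path full. Values below are hypothetical examples.

def bdp_bytes(bandwidth_mbps, rtt_ms):
    # Bandwidth-delay product = bandwidth (bytes/s) * round-trip time (s).
    return int(bandwidth_mbps * 1e6 / 8 * rtt_ms / 1e3)

if __name__ == "__main__":
    # e.g. a 1 Gb/s path with a 70 ms round-trip time across the WAN
    needed = bdp_bytes(1000, 70)
    print(f"required TCP window (BDP): {needed / (1 << 20):.1f} MiB")

    # A host stuck at a default 64 KiB window caps far below line rate:
    default_window = 64 * 1024
    rate_mbps = default_window * 8 / 70e-3 / 1e6
    print(f"throughput with a 64 KiB window: {rate_mbps:.1f} Mb/s")
```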

  12. Distributed Systems (cont’d)
  • NetLogger (http://www-didc.lbl.gov/NetLogger/)
  • pyGlobus (http://dsd.lbl.gov/gtg/projects/pyGlobus/) - Python interface to the Globus Toolkit; the LIGO gravity-wave experiment uses it to replicate TB/day of data around the US with the LIGO Data Replicator (http://www.lsc-group.phys.uwm.edu/LDR/)
  • DOEGrids.org - PKI for the DOE science community, part of a federation supporting international scientific collaborations

  13. Repaired Hardware
  • Systems from 2000 suffered a widespread failure (half of 90 systems)
  • Had the broken systems inspected by the LBL Electronics Shop
  • Discovered 4 bad capacitors (~$2)
  • Prepped systems can be repaired for ~$20/board
  • 16 systems repaired so far
  • Plan to eventually repair all systems from the batch
