
Presentation Transcript


1. National Center for Atmospheric Research, Computation and Information Systems Laboratory. Facilities and Support Overview, Feb 14, 2010. Si Liu, NCAR/CISL/OSD/USS Consulting Service Group

2. CISL's Mission for User Support
"CISL will provide a balanced set of services to enable researchers to securely, easily, and effectively utilize community resources" (CISL Strategic Plan). CISL also supports special colloquia, workshops, and computational campaigns, giving these groups of users special privileges and access to facilities and services above and beyond normal service levels.

3. CISL Facility
Navigation and usage of the facility requires a basic familiarity with a number of its functional aspects:
  • Computing systems: Bluefire, Frost, Lynx, and the future NWSC machine
  • Usage: batch and interactive
  • Data archival: MSS, HPSS, GLADE
  • Data analysis and visualization: Mirage and Storm
  • Allocations and security
  • User support

4. Allocations
  • Allocations are granted in General Accounting Units (GAUs).
  • Monitor GAU usage through the CISL portal: http://cislportal.ucar.edu/portal (requires UCAS password).
  • Charges are assessed overnight and are available for review for runs that complete by midnight.
  • GAUs charged = wallclock hours used * number of nodes used * number of processors per node * computer charging factor * queue charging factor
  • The computer charging factor for Bluefire is 1.4.
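As a worked illustration (the run length, node count, and queue factor below are hypothetical values; only the 1.4 Bluefire factor comes from the slide), a 2-hour run on 2 nodes of 32 processors in a queue with factor 1.0 would be charged:

  # hypothetical run: 2 wallclock hours, 2 nodes, 32 processors/node,
  # computer factor 1.4 (Bluefire), queue factor 1.0
  echo "2 2 32 1.4 1.0" | awk '{print "GAUs charged =", $1*$2*$3*$4*$5}'
  # prints: GAUs charged = 179.2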

5. Security
  • CISL firewall
    • Internal networks are separated from the external Internet.
    • Protects servers from malicious attacks from external networks.
  • Secure Shell
    • Use SSH for local and remote access to all CISL systems.
  • One-time passwords for protected systems
    • CryptoCard
    • Yubikey
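As a minimal sketch of the login path (the username is a placeholder, and bluefire.ucar.edu is assumed here as the login host):

  # log in with SSH; X11 forwarding is optional
  ssh -X username@bluefire.ucar.edu
  # systems protected by one-time passwords will prompt for a
  # CryptoCard or Yubikey response instead of a static password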

6. Computing System - Bluefire
  • IBM clustered Symmetric MultiProcessing (SMP) system
  • Operating system: AIX (IBM-proprietary UNIX)
  • Batch system: Load Sharing Facility (LSF)
  • File system: General Parallel File System (GPFS)
  • 127 32-way 4.7 GHz nodes
    • 4,064 POWER6 processors
    • SMT enabled (64 SMT threads per node)
    • 76.4 TFLOPS
  • 117 compute nodes (70.4 TFLOPS peak)
    • 3,744 POWER6 processors (32 per node)
    • 69 compute nodes have 64 GB memory
    • 48 compute nodes have 128 GB memory
  • 10 other nodes
    • 2 interactive/login nodes (256 GB memory)
    • 2 debugging and share-queue nodes (256 GB memory)
    • 4 GPFS/VSD nodes
    • 2 service nodes

7. Computing System - Bluefire, continued
  • Memory
    • L1 cache: 128 KB per processor.
    • L2 cache: 4 MB per processor, on-chip.
    • L3 cache: 32 MB off-chip per two-processor chip, shared by the two processors on the chip.
    • 48 nodes contain 128 GB of shared memory; 69 nodes contain 64 GB of shared memory.
  • Disk storage
    • 150 terabytes of usable file system space
    • 5 GB home directory; backed up and not subject to scrubbing.
    • 400 GB /ptmp directory; not backed up.
  • High-speed interconnect: InfiniBand switch
    • 4X InfiniBand DDR links capable of 2.5 GB/sec with 1.3 microsecond latency
    • 8 links per node

8. Computing System - Frost
  • Four-rack IBM Blue Gene/L system
  • Operating system: SuSE Linux Enterprise Server 9
  • Batch system: Cobalt (from ANL)
  • File system: the scratch space (/ptmp) is GPFS; home directories are NFS-mounted.
  • 4,096 compute nodes (22.936 TFLOPS peak)
    • 1,024 compute nodes per rack
    • Each compute node has two 700 MHz PowerPC 440 CPUs and 512 MB of memory; each CPU has two floating-point units (FPUs).
    • One I/O node for every 32 compute nodes
  • Four front-end cluster nodes
    • IBM OpenPower 720 server
    • Four POWER5 1.65 GHz CPUs
    • 8 GB of memory

9. Computing System - Lynx
  • Single-cabinet massively parallel processing (MPP) supercomputer
  • Operating system: Cray Linux Environment
    • Compute Node Linux (CNL), based on SuSE Linux SLES 10
  • Batch system:
    • MOAB workload manager
    • Torque (aka OpenPBS) resource manager
    • Cray's ALPS (Application Level Placement Scheduler)
  • File system: Lustre
  • 912 compute processors (8.026 TFLOPS peak)
    • 76 compute nodes, 12 processors per node (two hex-core 2.2 GHz AMD Opteron chips)
    • Each processor has 1.3 GB of memory, totaling 1.216 TB of memory in the system.
  • 10 I/O nodes
    • Each has a single dual-core 2.6 GHz AMD Opteron chip and 8 GB of memory.
    • 2 are login nodes; 4 are reserved for system functions.
    • 4 are for external Lustre file system and GPFS file system testing.

10. A Job Script on Bluefire

  #!/bin/csh
  # LSF batch script to run an MPI application
  #BSUB -P 12345678            # project number (required)
  #BSUB -W 1:00                # wall clock time (hours:minutes)
  #BSUB -n 64                  # number of MPI tasks
  #BSUB -R "span[ptile=64]"    # run 64 tasks per node
  #BSUB -J matadd_mpi          # job name
  #BSUB -o matadd_mpi.%J.out   # output filename
  #BSUB -e matadd_mpi.%J.err   # error filename
  #BSUB -q regular             # queue

  # edit exec header to enable using 64K pages
  ldedit -bdatapsize=64K -bstackpsize=64K -btextpsize=64K matadd_mpi.exe

  # set this env for launch as default processor binding
  setenv TARGET_CPU_LIST "-1"

  mpirun.lsf /usr/local/bin/launch ./matadd_mpi.exe

For more examples, see the /usr/local/examples directory.

11. Submitting, Deleting, and Monitoring Jobs on Bluefire
  • Job submission
    • bsub < script
  • Monitor jobs
    • bjobs
    • bjobs -u all
    • bjobs -q regular -u all
    • bhist -n 3 -a
  • Deleting a job
    • bkill [jobid]
  • System batch load
    • batchview
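A minimal end-to-end session might look like the following (the script name and job ID are made up for illustration):

  bsub < matadd_mpi.lsf        # submit the LSF script from the previous slide
  bjobs                        # list your own jobs; note the numeric job ID, e.g. 123456
  bjobs -q regular -u all      # see everything waiting or running in the regular queue
  bkill 123456                 # delete the job if it is no longer needed
  batchview                    # overall batch load on the system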

  12. "Big 3“ to get a best performance on bluefire • Simultaneous Multi-Threading(SMT) • a second, on-board "virtual" processor • 64 virtual cpus in each node • Multiple page size support • The default page size: 4 KB. • 64-KB page size when running the 64-bit kernel. • Large pages (16 MB) and "huge" pages (16 GB) • Processor binding

13. Submitting, Deleting, and Monitoring Jobs on Frost
  • Submitting a job
    • cqsub -n 28 -c 55 -m vn -t 00:10:00 example
  • Check job status
    • cqstat
  • Deleting a job
    • cqdel [jobid]
  • Altering jobs
    • qalter -t 60 -n 32 --mode vn 118399

14. A Job Script on Lynx

  #!/bin/bash
  #PBS -q regular
  #PBS -l mppwidth=60
  #PBS -l walltime=01:30:00
  #PBS -N example
  #PBS -e testrun.$PBS_JOBID.err
  #PBS -o testrun.$PBS_JOBID.out

  cd $PBS_O_WORKDIR
  aprun -n 60 ./testrun

15. Submitting, Deleting, and Monitoring Jobs on Lynx
  • Submitting a job
    • qsub [batch_script]
  • Check job status
    • qstat -a
  • Deleting a job
    • qdel [jobid]
  • Hold or release a job
    • qhold [jobid]
    • qrls [jobid]
  • Change attributes of a submitted job
    • qalter

16. NCAR Archival System
  • NCAR Mass Store Subsystem (MSS)
    • Currently stores 70 million files, 11 petabytes of data
  • Library of Congress (printed collection)
    • 10 terabytes = 0.01 petabytes
    • The Mass Store holds roughly 800 times the Library of Congress.
  • Growing by 2-6 terabytes of new data per day
  • Data holdings are increasing exponentially:
    • 1986 - 2 TB
    • 1997 - 100 TB
    • 2002 - 1,000 TB
    • 2004 - 2,000 TB
    • 2008 - 5,000 TB
    • 2010 - 8,000 TB

17. Migrating from the NCAR MSS to HPSS
  • The MSS and its interfaces will be replaced by the High Performance Storage System (HPSS) in March 2011.
  • When the conversion is complete, the existing storage hardware currently in use will still be in place, but it will be under HPSS control.
  • Users will not have to copy any of their files from the MSS to HPSS. There will be a period of downtime.
  • The MSS Group will be working closely with users during this transition.

18. The Differences Between MSS and HPSS
  • What will be going away:
    • msrcp and all DCS libraries
    • The MSS ftp server
    • All DCS metadata commands for listing and manipulating files
    • Retention periods, the weekly "purge" and email purge notices, the "trash can" for purged/deleted files, and the "msrecover" command
    • Read and write passwords on MSS files
  • What we will have instead:
    • The Hierarchical Storage Interface (HSI), the primary interface NCAR will support for data transfers to/from HPSS, along with metadata access and data management
    • GridFTP (under development)
    • HPSS files have NO expiration date. They remain in the archive until they are explicitly deleted; once deleted, they cannot be recovered.
    • POSIX-style permission bits for controlling access to HPSS files and directories
    • Persistent directories
    • Longer filenames (HPSS full pathnames can be up to 1,023 characters)
    • Higher maximum file size (HPSS files can be up to 1 terabyte)

19. Charges for MSS and HPSS
The current MSS charging formula is:
  GAUs charged = 0.0837*R + 0.0012*A + N*(0.1195*W + 0.205*S)
where
  • R = gigabytes read
  • W = gigabytes created or written
  • A = number of disk drive or tape cartridge accesses
  • S = data stored, in gigabyte-years
  • N = number of copies of the file (1 if economy reliability is selected, 2 if standard reliability is selected)
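As an illustration only (the usage volumes below are invented, not a real charge record): reading 50 GB, 10 cartridge accesses, writing 20 GB at standard reliability (N = 2), and 5 gigabyte-years of storage would come to about 11 GAUs:

  # R=50 GB read, A=10 accesses, W=20 GB written, S=5 GB-years, N=2 copies
  echo "50 10 20 5 2" | \
    awk '{print "GAUs charged =", 0.0837*$1 + 0.0012*$2 + $5*(0.1195*$3 + 0.205*$4)}'
  # prints: GAUs charged = 11.027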

20. HPSS Usage
  • To request an account:
    • TeraGrid users should send email to help@teragrid.org
    • NCAR users should contact CISL Customer Support: cislhelp@ucar.edu
  • HPSS uses HSI as its POSIX-compliant interface.
  • HSI uses Kerberos as its authentication mechanism.

21. Kerberos (http://www2.cisl.ucar.edu/docs/hpss/kerberos)
  • Authentication service
    • Authentication: validates who you are
    • Service: with a server, a set of functions, etc.
  • Kerberos operates in a domain
    • The default is UCAR.EDU
  • The domain is served by a KDC (Key Distribution Center)
    • Server: dog.ucar.edu
    • Different pieces (not necessary to get into this)
  • Users of the service
    • Individual people
    • Kerberized services (like HPSS)

22. Kerberos Commands
  • kinit - authenticate with the KDC and get your TGT
  • klist - list your ticket cache
  • kdestroy - clear your ticket cache
  • kpasswd - change your password
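A typical sequence before an HPSS transfer might be (the principal name is a placeholder; see the Kerberos documentation page above for the exact form your account uses):

  kinit username@UCAR.EDU   # authenticate with the KDC; prompts for your Kerberos password
  klist                     # verify that a valid ticket (TGT) is in your cache
  hsi                       # HSI sessions can now authenticate with the Kerberos ticket
  kdestroy                  # clear the ticket cache when finished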

23. Hierarchical Storage Interface (HSI)
  • POSIX-like interface
  • Different ways to invoke HSI
    • Command-line invocation: hsi cmd
      • hsi get myhpssfile (reads from your default directory on HPSS)
      • hsi put myunixfile (writes to your default directory on HPSS)
    • Open an HSI session
      • hsi to get in and establish a session; end, exit, or quit to get out
      • A restricted, shell-like environment
    • hsi "in cmdfile"
      • Runs a file of commands scripted in cmdfile
  • Navigating HPSS while in an HSI session: pwd, cd, ls

24. Data Transfer
  • Writing data: the put command
    • [HSI]/home/user1> put file.01
    • [HSI]/home/user1> put file.01 : new.hpss.file
  • Reading data: the get command
    • [HSI]/home/user1> get file.01
    • [HSI]/home/user1> get file.01 : new.hpss.file
Detailed documentation for HSI can be found at http://www.ucar.edu/docs/hpss
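Building on the hsi "in cmdfile" form from slide 23, a scripted transfer could be sketched as follows (all file and directory names are placeholders, and the trailing "end" is assumed to close the session):

  # write a list of HSI commands to a file
  cat > cmdfile <<'EOF'
  put model_output.nc : runs/model_output.nc
  get restart.nc : runs/restart.nc
  end
  EOF
  hsi "in cmdfile"          # run the transfers non-interactively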

25. GLADE Centralized File Service
  • High-performance shared file system technology
  • Shared work spaces across CISL's HPC resources
  • Multiple spaces based on different requirements:
    • /glade/home/username
    • /glade/scratch/username
    • /glade/proj*
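For example, a job might stage its work in the scratch space while keeping inputs in the home space (the directory and file names below are invented for illustration):

  cd /glade/scratch/$USER/myrun            # scratch work space; "myrun" is a placeholder
  cp /glade/home/$USER/model/input.nml .   # pull input from the home space
  ./model.exe                              # model.exe is a placeholder executable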

26. Data Analysis & Visualization
  • Data analysis and visualization: high-end servers available 7 x 24 for interactive data analysis, data post-processing, and visualization.
  • Data sharing: shared data access within the lab, plus access to the NCAR archival systems and NCAR data sets.
  • Remote visualization: access DASG's visual computing platforms from the convenience of your office using DASG's TCP/IP-based remote image delivery service.
  • Visualization consulting: consult with DASG staff on your visualization problems.
  • Production visualization: DASG staff can in some instances generate images and/or animations of your data on your behalf.

27. Software Configuration (Mirage-Storm)
  • Development environments
    • Intel C, C++, F77, F90
    • GNU C, C++, F77, tools
    • TotalView debugger
  • Software tools
    • VAPOR
    • NCL, NCO, NCARG
    • IDL
    • Matlab
    • Paraview
    • ImageMagick
    • VTK

28. User Support
  • CISL homepage: http://www2.cisl.ucar.edu/
  • CISL documentation: http://www2.cisl.ucar.edu/docs/user-documentation
  • CISL help
    • Call (303) 497-1200
    • Email cislhelp@ucar.edu
    • Submit an ExtraView ticket
  • CISL Consulting Service
    • NCAR Mesa Lab, Area 51/55, Floor 1B

29. Information We Need From You
  • Machine name
  • Job ID and date
  • Nodes the job ran on, if known
  • Description of the problem
  • Commands you typed
  • Error code you are getting
These are best provided in an ExtraView ticket or via email to cislhelp@ucar.edu.

30. Working on Mirage
  • Log on to mirage
    • ssh -X -l username mirage3.ucar.edu
    • One-time password using CryptoCard or Yubikey
    • Use 'free' or 'top' to see whether there are currently enough resources.
  • Enabling your Yubikey token
    • Your Yubikey has been activated and is ready for use.
    • The Yubikey is activated by the warmth of your finger, not the pressure of pushing the button.
  • Using your Yubikey token
    • When you are logging in to mirage, your screen displays a prompt: Token_Response:
    • Enter your PIN (do not hit Enter), then touch the Yubikey button. This inserts a new one-time password (OTP) and a return.
