Phoenix Training Session: Introduction for CHIPP sysadmins
Contents
• Scope of the course
• Documentation
• Monitoring
• Remote access
• Handling services
• Shared filesystems
• Visit the Machine Room
Scope of the course
• Give you enough information to bring the cluster back into production in an emergency.
• Emergency = the sysadmins are offsite, or they ask you for help.
• Not intended to give you a full understanding of everything, nor to enable you to install or configure new services.
• Covers: ARC, CREAM, Torque, Moab, dCache, BDII, WNs, NFS, Lustre, GPFS
• Does not cover: Monitoring, APEL, UI, Cfengine, VOBoxes, Argus (too soon)
• Interactive session: please ask questions and give us your feedback!
Documentation
• Everything should be on the twiki: https://wiki.chipp.ch/twiki/bin/view/LCGTier2
• It will be our course documentation.
• But documentation is never enough, and it only matches reality in a dream world.
• The Users Section is not really interesting for us.
• The Logs Section is the place to look for things that happened, such as meetings and issues.
• The fun part is in the Technical Section.
Monitoring
• Overview in the twiki: Technical -> Monitoring
• PhoenixMonOverview is our main source of information. Ganglia complements it.
• Both lose detail as time goes by.
• Our Nagios instances are useless right now.
• You can click on some graphs to get extended information.
• The first section links to the VO tests.
Remote Access
• You should have your root private key for the cluster on username@pub.lcg.cscs.ch and your grid certificate on ui64.
• If the agent does not work, try killall ssh-agent and log in again. The agent is forwarded by default.
• Use ssh tunnels to access private interfaces, such as the dCache GUI (-L 22223:storage02:22223).
• For a hardware reboot, use ireset from xen12. Be careful.
• Network traffic is un-firewalled within the cluster.
• You can use dsh for mass operations. /etc/dsh/groups lists the available groups:
xen12# dsh -g WN "service pbs_mom restart"
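The tunnel option from the slide above can be wrapped in a small helper. This is a minimal sketch, not the site's own tooling: the function name tunnel_cmd is hypothetical, the login is the username@pub.lcg.cscs.ch placeholder from the slide, and the -N flag (open the tunnel without running a remote command) is an addition not mentioned in the slides.

```shell
#!/bin/sh
# Placeholder login on the public gateway, as written on the slide.
GATEWAY="username@pub.lcg.cscs.ch"

# Print the ssh command that forwards a local port to a host on the
# cluster's private network via the gateway.
#   $1 = local port, $2 = internal host, $3 = port on the internal host
# -N is an assumption added here: it keeps the session open for the
# tunnel only, without starting a remote shell.
tunnel_cmd() {
    printf 'ssh -N -L %s:%s:%s %s\n' "$1" "$2" "$3" "$GATEWAY"
}
```

For the dCache GUI case from the slide, `tunnel_cmd 22223 storage02 22223` prints the command to run; once the tunnel is up, the interface should be reachable on local port 22223. Printing the command rather than executing it makes it easy to review before use.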
Handling services
• Every service is different. Use your logic!
• One script to rule all grid services: grid-service.
• Two tools to check general status:
  • chk_CREAM-CEs submits a job to the cscs queue on cream01/02 and polls for the results.
  • chk_SE-lcgtools copies a file in and out of dCache and registers it, using BDII information.
• Use them from ui64 with your certificate.
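Running both status checks in one pass could look like the sketch below. The wrapper run_checks is hypothetical, and it assumes that each check tool can be run without arguments and exits with a non-zero status on failure; the slides do not state either, so verify this against the tools on ui64 before relying on it.

```shell
#!/bin/sh
# Run each named check command in turn and print one status line per check.
# Assumption: a check signals failure through a non-zero exit status.
# Returns non-zero if at least one check failed.
run_checks() {
    failed=0
    for check in "$@"; do
        if "$check"; then
            echo "OK: $check"
        else
            echo "FAILED: $check"
            failed=1
        fi
    done
    return "$failed"
}
```

On ui64, with a valid certificate, the call would be `run_checks chk_CREAM-CEs chk_SE-lcgtools`. The checks' own output is left visible on purpose, since it is what you need when something fails.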
Shared filesystems
• nfs01/02 host 9 DRBD LVMs with heartbeat.
  • /experiment_software for each VO
  • /shared for gridmapdir, vo_tags, and the Torque/Moab High Availability locks and logs
• mds1/2 and oss{1-4}{1-2} host /lustre/scratch
  • /home symlink on all WNs and CEs (/home/egee)
  • /tmpdir_pbs symlink on some WNs
• gpfs01-03 host /gpfs
  • /tmpdir_pbs symlink on some other WNs
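When a node misbehaves, a quick first step is to confirm that the shared filesystems listed above are actually mounted. The sketch below reads a mounts table in the /proc/mounts format; the function names is_mounted and check_mounts are hypothetical, and which paths are real mount points on a given node (for example /gpfs only on some WNs) is an assumption to check per host.

```shell
#!/bin/sh
# Succeed if path $1 is listed as a mount point (second field) in the
# mounts table $2, which defaults to /proc/mounts.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Print one status line per path and return non-zero if any is missing.
#   $1 = mounts table, remaining arguments = mount points to check
check_mounts() {
    table="$1"; shift
    missing=0
    for mp in "$@"; do
        if is_mounted "$mp" "$table"; then
            echo "mounted: $mp"
        else
            echo "MISSING: $mp"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible call on a worker node is `check_mounts /proc/mounts /lustre/scratch /gpfs`. Combined with dsh, the same check could be run across a whole group, provided the functions are available on the remote nodes.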