Managing A Large Farm: CSF

Managing A Large Farm: CSF Andrew Sansum 26 November 2002

Overview • Will cover many of the large scale issues associated with big CPU/disk farms • Intent is to provoke discussion rather than provide answers: • I don’t claim to be an expert! • Many RAL solutions are dated but new staff will soon be making changes.

Large Farms: The BIG differences • BIG is not beautiful - • A small mistake can proliferate: • problems can multiply, • many components can become involved. • THINK before you make changes! • Manual login on 500 nodes is a major disaster! • Funding bodies often expect big farms to be run more professionally.

Hardware Specification • Good quality hardware is vital. • Go with a reputable company • Evaluate quality of solution. • Check for component compatibility • Consider long warranties or be prepared for major interventions yourself (eg replace all the fans)

Power Requirements • Is there enough (steady state). Right plugs!! • Cope with surge on power up (think about power sequencing). • What impact do PSUs have on power supply (cf. SLAC) - neutral current imbalance - higher order harmonics… • Remote/Automated power up/down is nice (eg APC units) • Worry about equipment on different phases
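The power-sequencing point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each node draws a fixed inrush current while starting and a lower steady current afterwards, and that each batch has settled before the next one starts. The function name and all figures are hypothetical, not measurements from any real farm.

```python
def plan_power_up(n_nodes, steady_amps, inrush_amps, limit_amps):
    """Return batch sizes for a staggered power-up that stays under limit_amps.

    Assumption: nodes already running draw steady_amps each, and every node
    in the batch being started draws inrush_amps until it settles.
    """
    if n_nodes * steady_amps > limit_amps:
        raise ValueError("steady-state load exceeds the supply limit")
    batches = []
    powered = 0
    while powered < n_nodes:
        # Current left over once the running nodes are accounted for.
        headroom = limit_amps - powered * steady_amps
        batch = min(n_nodes - powered, int(headroom // inrush_amps))
        if batch < 1:
            raise ValueError("no headroom to start another node")
        batches.append(batch)
        powered += batch
    return batches


if __name__ == "__main__":
    # Hypothetical figures: 10 nodes, 1 A steady, 4 A inrush, 16 A circuit.
    print(plan_power_up(10, 1.0, 4.0, 16.0))
```

The batches shrink as more nodes come up, which is why a farm that is comfortable in steady state can still trip a breaker if everything is switched on at once.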

Cooling • Cooling must be sufficient! • Must be able to cope with local hot spots. • If cooling fails - things get hot very fast - monitoring/automated shutdown.
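A minimal sketch of the monitoring/automated shutdown idea follows. It assumes per-node temperature readings are already being collected somehow (the collection method is not specified in the talk), and the thresholds and host names are hypothetical. Working per node rather than from a room average is what lets it catch local hot spots.

```python
def classify_temperatures(readings, warn_c=35.0, critical_c=45.0):
    """Split hosts into (warn, shutdown) lists by temperature in Celsius.

    readings: dict mapping host name -> temperature.
    warn_c and critical_c are illustrative thresholds, not recommendations.
    """
    warn, shutdown = [], []
    for host, temp in sorted(readings.items()):
        if temp >= critical_c:
            shutdown.append(host)   # candidates for automated power-down
        elif temp >= warn_c:
            warn.append(host)       # worth alerting on, not yet critical
    return warn, shutdown


if __name__ == "__main__":
    sample = {"node01": 28.5, "node02": 37.0, "node03": 51.2}  # made-up data
    print(classify_temperatures(sample))
```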

Installation • Netboot/PXE avoids need for manual insertion of floppies. • Use something like kickstart to: • Speed up installation task • Maintain record of configuration • Allow automated reconfiguration • LCFG not recommended - but maybe successors?

Configuration Management • Autorpm is useful for maintaining updates, but update from local managed copy - control changes! • Test changes before rolling out!!!!!!!! • Need to ensure coherent, reproducible configuration - tricky! • LCFG is good at this but cumbersome • Kickstart needs great care - update kickstart AND systems independently?
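One way to check for a coherent configuration is to compare each node's installed packages against a reference list. The sketch below assumes the package lists have already been gathered into dictionaries (for example from a package query on each node). The function name, host names and version strings are hypothetical.

```python
def config_drift(reference, nodes):
    """Report packages whose version differs from the reference.

    reference: dict package -> version (the intended configuration).
    nodes: dict host -> dict package -> version (what is installed).
    Returns dict host -> {package: (wanted, found)}; None means "absent".
    """
    drift = {}
    for host, installed in nodes.items():
        diffs = {}
        for name in reference.keys() | installed.keys():
            wanted, found = reference.get(name), installed.get(name)
            if wanted != found:
                diffs[name] = (wanted, found)
        if diffs:
            drift[host] = diffs
    return drift


if __name__ == "__main__":
    ref = {"kernel": "2.4.18", "openssh": "3.1"}          # made-up versions
    farm = {
        "node01": {"kernel": "2.4.18", "openssh": "3.1"},
        "node02": {"kernel": "2.4.9", "openssh": "3.1", "extra": "1.0"},
    }
    print(config_drift(ref, farm))
```

Running a check like this before and after a rollout shows whether the change reached every node and nothing else moved.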

Management Tools • Very simple at RAL. Local parallel ssh • Parallel rsh/ssh commands: prsh seems popular. • Project C3 seems worth a look • Oscar bundles many interesting tools together
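For readers who have not seen a parallel ssh wrapper, the following is a rough sketch of the idea, not the RAL tool or prsh. It assumes key-based, non-interactive ssh access to every node. The function names and defaults are hypothetical; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# BatchMode=yes makes ssh fail rather than prompt for a password.
SSH_PREFIX = ("ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10")


def run_one(host, command, prefix=SSH_PREFIX, timeout=60):
    """Run command on one host; return (host, return code or None, output)."""
    try:
        proc = subprocess.run([*prefix, host, command],
                              capture_output=True, text=True, timeout=timeout)
        return host, proc.returncode, proc.stdout.strip()
    except (subprocess.TimeoutExpired, OSError) as exc:
        # A hung or unreachable node must not stall the whole sweep.
        return host, None, str(exc)


def run_on_nodes(hosts, command, max_workers=32, **kwargs):
    """Run command on all hosts concurrently; return host -> (rc, output)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda h: run_one(h, command, **kwargs), hosts)
        return {host: (rc, out) for host, rc, out in results}
```

A typical call would be `run_on_nodes(hosts, "uptime")`, followed by listing the hosts whose return code is not 0. Limiting `max_workers` keeps a 500-node sweep from swamping the management host.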

Exception monitoring • Need to spot problems before users do. • Run daemon or crontab checking for errors. On detection: • Notify: SURE, Bigbrother,... (not email!) • Automated fixup (Daemon restart, filesystem cleanup ...) • Automated Drain/Remove from configuration. Automated power down/up. Automated DNS updates.
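The detect, fix up, then drain sequence above might be wired together roughly as follows. This is a sketch under stated assumptions: the checks, fixups, `notify` and `drain` callables are all placeholders to be supplied by the site (for example a notifier that posts to the local monitoring system rather than sending email), and none of the names come from the talk.

```python
def handle_node(host, checks, fixups, notify, drain):
    """Run each check on host; try a fixup; drain and notify if it persists.

    checks: dict name -> callable(host) returning True when healthy.
    fixups: dict name -> callable(host) attempting an automated repair.
    notify(host, name) and drain(host) are site-supplied callables.
    Returns a list of the actions taken, for logging.
    """
    actions = []
    for name, check in checks.items():
        if check(host):
            continue
        fix = fixups.get(name)
        if fix is not None:
            fix(host)                      # e.g. restart daemon, clean /tmp
            actions.append(f"fixup:{name}")
            if check(host):
                continue                   # repaired, carry on checking
        drain(host)                        # take the node out of production
        notify(host, name)
        actions.append(f"drain:{name}")
        break                              # node is out; stop checking it
    return actions
```

Run from cron or a daemon on a short interval, a loop like this means most routine faults are either repaired or isolated before a user job lands on the broken node.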

Incident Tracking • Keep track of significant interventions. • Which hosts keep crashing. Dates, times errors etc. • What disks failed - serial numbers of returns - returns outstanding ... • Keep track of tasks outstanding: eg: why is csflnx231 currently offline - who is fixing it ...
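Even a very small structured log answers the questions on this slide. The sketch below is one possible record layout, not a description of any system in use at RAL. The field names and the sample entries are hypothetical (the host name csflnx231 is borrowed from the slide purely as an example).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    host: str
    date: str            # ISO date string, e.g. "2002-11-26"
    kind: str            # e.g. "crash", "disk", "offline"
    note: str = ""       # error text, disk serial number, return reference
    owner: str = ""      # who is fixing it
    resolved: bool = False


def crash_counts(incidents):
    """Count crashes per host, to show which hosts keep crashing."""
    return Counter(i.host for i in incidents if i.kind == "crash")


def outstanding(incidents):
    """Return incidents that still need action."""
    return [i for i in incidents if not i.resolved]


if __name__ == "__main__":
    log = [  # made-up entries
        Incident("csflnx231", "2002-11-20", "crash", resolved=True),
        Incident("csflnx231", "2002-11-25", "crash", owner="operator"),
        Incident("csflnx017", "2002-11-22", "disk", note="awaiting return"),
    ]
    print(crash_counts(log).most_common(1))
    print([i.host for i in outstanding(log)])
```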

Hardware Management • Many systems eventually means: • Many system crashes. • Many hardware failures. • Consider purchasing 3 years warranty. On-site is easier. • Define a standard hardware (re)certification procedure. Make use of junior staff (operators, postgrads, gran, ...!)

Utilisation/Capacity planning • Monitor everything you can conveniently manage. • MRTG is standard network monitoring • Ganglia appears to be popular for system utilisation etc. • PBS accounting records (or process accounting).
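As an illustration of using PBS accounting records for capacity planning, the sketch below totals wall-clock time per user. It assumes the commonly seen record layout `date time;record_type;job_id;key=value key=value ...`, with job-end records marked `E` and a `resources_used.walltime=HH:MM:SS` attribute. Check this against the local PBS version before relying on it. The function names and sample lines are hypothetical.

```python
from collections import defaultdict


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_by_user(lines):
    """Sum walltime (seconds) per user over job-end ('E') records.

    Assumed layout: 'timestamp;type;job_id;key=value key=value ...'.
    Lines that do not match are skipped rather than treated as errors.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.strip().split(";", 3)
        if len(fields) != 4 or fields[1] != "E":
            continue
        attrs = dict(item.split("=", 1)
                     for item in fields[3].split() if "=" in item)
        user = attrs.get("user")
        walltime = attrs.get("resources_used.walltime")
        if user and walltime:
            totals[user] += hms_to_seconds(walltime)
    return dict(totals)


if __name__ == "__main__":
    sample = [  # made-up records in the assumed layout
        "11/26/2002 10:00:00;E;101.server;user=alice resources_used.walltime=01:00:00",
        "11/26/2002 11:00:00;S;102.server;user=bob",
        "11/26/2002 12:30:00;E;102.server;user=bob resources_used.walltime=00:30:15",
    ]
    print(walltime_by_user(sample))
```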

Conclusions • Careful planning, specification and hardware selection can pay dividends. • Get smart or invest in lots of staff. • Monitor so you know what is going on. • Many issues raised - few solutions offered. Wide range of experience out in the UK HEPSYSMAN community. Make use of it!
