
Managing A Large Farm: CSF

Managing A Large Farm: CSF. Andrew Sansum, 26 November 2002. Overview: will cover many of the large-scale issues associated with big CPU/disk farms. The intent is to provoke discussion rather than provide answers: I don’t claim to be an expert!


Presentation Transcript


  1. Managing A Large Farm: CSF Andrew Sansum 26 November 2002

  2. Overview • Will cover many of the large scale issues associated with big CPU/disk farms • Intent is to provoke discussion rather than provide answers: • I don’t claim to be an expert! • Many RAL solutions are dated but new staff will soon be making changes.

  3. Large Farms: The BIG differences • BIG is not beautiful - • A small mistake can proliferate: • problems can multiply, • many components can become involved. • THINK before you make changes! • Manual login on 500 nodes is a major disaster! • Funding bodies often expect big farms to be run more professionally.

  4. Hardware Specification • Good quality hardware is vital. • Go with a reputable company • Evaluate quality of solution. • Check for component compatibility • Consider long warranties or be prepared for major interventions yourself (eg replace all the fans)

  5. Power Requirements • Is there enough power (steady state)? Right plugs!! • Cope with surge on power up (think about power sequencing). • What impact do PSUs have on power supply (cf. SLAC) - neutral current imbalance - higher order harmonics… • Remote/Automated power up/down is nice (eg APC units) • Worry about equipment on different phases
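
The power sequencing point can be sketched as a staggered power-up schedule: nodes are switched on in small batches with a pause between batches, so the inrush current of the whole farm never arrives at once. This is only a planning sketch; the batch size, the delay and the node names are assumptions, and actually switching outlets (for example on a remote power unit) is left out.

```python
from typing import List, Sequence, Tuple


def power_up_schedule(
    nodes: Sequence[str], batch_size: int, delay_s: float
) -> List[Tuple[float, List[str]]]:
    """Return (start offset in seconds, nodes to switch on) for each batch.

    batch_size and delay_s are hypothetical tuning values; sensible numbers
    depend on the PSUs and on the circuit the nodes hang off.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    schedule = []
    for start in range(0, len(nodes), batch_size):
        batch_index = start // batch_size
        # Each batch starts delay_s seconds after the previous one.
        schedule.append((batch_index * delay_s, list(nodes[start:start + batch_size])))
    return schedule


if __name__ == "__main__":
    farm = [f"node{i:03d}" for i in range(10)]  # hypothetical host names
    for offset, batch in power_up_schedule(farm, batch_size=4, delay_s=15.0):
        print(f"t+{offset:5.1f}s  power on: {', '.join(batch)}")
```

Keeping the schedule as plain data means it can be reviewed before anything is switched on.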

  6. Cooling • Cooling must be sufficient! • Must be able to cope with local hot spots. • If cooling fails - things get hot very fast - monitoring/automated shutdown.
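
Because things get hot very fast once cooling fails, a shutdown decision probably should not wait for the absolute limit to be reached. A minimal sketch, assuming room temperature readings arrive as (timestamp, degrees C) pairs: it projects the latest rate of rise a few minutes ahead and asks for shutdown if the projection crosses the limit. The thresholds and the lookahead are made-up values, not recommendations.

```python
from typing import List, Tuple


def cooling_action(
    readings: List[Tuple[float, float]],
    warn_c: float = 30.0,
    shutdown_c: float = 40.0,
    lookahead_s: float = 300.0,
) -> str:
    """Decide 'ok', 'warn', 'shutdown' or 'unknown' from temperature readings.

    readings holds (timestamp in seconds, temperature in C), oldest first.
    All thresholds are illustrative assumptions.
    """
    if not readings:
        return "unknown"  # no data is itself worth an alarm
    t_now, temp_now = readings[-1]
    rate = 0.0  # degrees C per second, from the two most recent readings
    if len(readings) >= 2:
        t_prev, temp_prev = readings[-2]
        if t_now > t_prev:
            rate = (temp_now - temp_prev) / (t_now - t_prev)
    # Only a rising trend is projected forward; a falling one is ignored.
    projected = temp_now + max(rate, 0.0) * lookahead_s
    if temp_now >= shutdown_c or projected >= shutdown_c:
        return "shutdown"
    if temp_now >= warn_c:
        return "warn"
    return "ok"
```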

  7. Installation • Netboot/PXE avoids need for manual insertion of floppies. • Use something like kickstart to: • Speed up installation task • Maintain record of configuration • Allow automated reconfiguration • LCFG not recommended - but maybe successors?
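
One way to get a record of configuration out of a kickstart-style install is to generate each node's file from a single template and store a fingerprint of what was generated. The sketch below assumes this approach; the template is an illustrative fragment rather than a complete or validated kickstart file, and the profile name, package names and `/etc/farm-profile` path are hypothetical.

```python
import hashlib
from string import Template
from typing import Sequence

# Illustrative fragment only: a real kickstart file needs more directives
# (partitioning, root password, installation source, ...).
KICKSTART_TEMPLATE = Template("""\
lang en_GB
keyboard uk
timezone Europe/London
network --bootproto=dhcp --hostname=$hostname
%packages
$packages
%post
echo "installed from profile $profile" > /etc/farm-profile
""")


def render_kickstart(hostname: str, profile: str, packages: Sequence[str]) -> str:
    """Fill the template for one node; the same inputs always give the same text."""
    return KICKSTART_TEMPLATE.substitute(
        hostname=hostname, profile=profile, packages="\n".join(packages)
    )


def config_fingerprint(text: str) -> str:
    """SHA-256 of the rendered file, suitable for logging next to the hostname."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    ks = render_kickstart("node001", "worker", ["openssh-server", "ntp"])
    print(ks)
    print("fingerprint:", config_fingerprint(ks))
```

If two nodes of the same profile ever show different fingerprints, the template or its inputs changed between installs.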

  8. Configuration Management • Autorpm is useful for maintaining updates, but update from local managed copy - control changes! • Test changes before rolling out!!!!!!!! • Need to ensure coherent, reproducible configuration - tricky! • LCFG is good at this but cumbersome • Kickstart needs great care - update kickstart AND systems independently?
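
Checking that the configuration is coherent can start with something as simple as comparing each node's package list against a reference. A minimal sketch, assuming package inventories have already been collected as name-to-version mappings (how they are collected is not shown, and the package names in the example are hypothetical):

```python
from typing import Dict, Optional, Tuple

PackageSet = Dict[str, str]  # package name -> installed version
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]  # name -> (wanted, found)


def find_drift(reference: PackageSet, actual: PackageSet) -> Drift:
    """List packages whose version differs from, or is missing on either side of, the reference."""
    drift: Drift = {}
    for name in sorted(set(reference) | set(actual)):
        wanted = reference.get(name)  # None: not in the reference at all
        found = actual.get(name)      # None: not installed on the node
        if wanted != found:
            drift[name] = (wanted, found)
    return drift


def farm_drift(reference: PackageSet, nodes: Dict[str, PackageSet]) -> Dict[str, Drift]:
    """Return only the nodes that deviate from the reference."""
    report = {}
    for host, packages in nodes.items():
        drift = find_drift(reference, packages)
        if drift:
            report[host] = drift
    return report


if __name__ == "__main__":
    reference = {"kernel": "2.4.18", "openssh": "3.4"}
    nodes = {
        "node001": {"kernel": "2.4.18", "openssh": "3.4"},
        "node002": {"kernel": "2.4.9", "openssh": "3.4", "telnet": "0.17"},
    }
    for host, drift in farm_drift(reference, nodes).items():
        print(host, drift)
```

The same comparison run against a test node before a rollout gives a cheap check that an update did what was intended and nothing more.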

  9. Management Tools • Very simple at RAL. Local parallel ssh • Parallel rsh/ssh commands: prsh seems popular. • Project C3 seems worth a look • Oscar bundles many interesting tools together
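
A local parallel ssh wrapper of the kind mentioned above can be quite small. The following is a sketch under assumptions, not the RAL tool: it relies on key-based login already working (`BatchMode=yes` makes ssh fail rather than prompt), and the parallelism limit and timeouts are arbitrary. The `runner` parameter is a hypothetical hook so the fan-out logic can be exercised without touching real hosts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Result = Tuple[int, str]  # (exit status, combined stdout and stderr)


def ssh_runner(host: str, command: str, timeout_s: float = 60.0) -> Result:
    """Run one command on one host over ssh, never prompting for a password."""
    argv = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, command]
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return (255, str(exc))  # treat a hang or a missing ssh binary as failure
    return (proc.returncode, proc.stdout + proc.stderr)


def run_everywhere(
    hosts: Iterable[str],
    command: str,
    runner: Callable[[str, str], Result] = ssh_runner,
    max_parallel: int = 32,
) -> Dict[str, Result]:
    """Run command on every host, at most max_parallel at a time."""
    host_list = list(hosts)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(lambda h: runner(h, command), host_list)
        return dict(zip(host_list, results))


def failed_hosts(results: Dict[str, Result]) -> List[str]:
    """Hosts whose command exited non-zero, which is usually the interesting list."""
    return sorted(host for host, (status, _) in results.items() if status != 0)
```

Reporting only the failures keeps the output readable at 500 nodes; the full per-host results remain available in the returned mapping.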

  10. Exception monitoring • Need to spot problems before users do. • Run daemon or crontab checking for errors. On detection: • Notify: SURE, Bigbrother,... (not email!) • Automated fixup (Daemon restart, filesystem cleanup ...) • Automated Drain/Remove from configuration. Automated power down/up. Automated DNS updates.
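
The detect, notify, fix-up sequence above can be sketched as a small loop intended to be run from cron or a daemon. Everything here is an assumption for illustration: the `Check` structure, the `notify` callback (standing in for whatever monitoring system receives the alert) and the disk-space probe are not taken from the RAL setup.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]                    # returns True when healthy
    fixup: Optional[Callable[[], None]] = None   # optional automated repair


def disk_ok(path: str, min_free_bytes: int) -> bool:
    """Example probe: is there at least min_free_bytes free on path's filesystem?"""
    return shutil.disk_usage(path).free >= min_free_bytes


def run_checks(checks: Iterable[Check], notify: Callable[[str], None]) -> List[str]:
    """Run every check; return names of checks still failing after any fix-up."""
    unresolved = []
    for check in checks:
        if check.probe():
            continue
        notify(f"{check.name}: FAILED")
        if check.fixup is not None:
            check.fixup()
            if check.probe():  # re-test rather than trusting the fix-up
                notify(f"{check.name}: recovered after automated fixup")
                continue
        unresolved.append(check.name)
    return unresolved


if __name__ == "__main__":
    checks = [Check("scratch space", lambda: disk_ok(".", 1024 ** 3))]
    still_broken = run_checks(checks, notify=print)
    print("unresolved:", still_broken)
```

A node with unresolved checks is a natural candidate for the automated drain or removal step.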

  11. Incident Tracking • Keep track of significant interventions. • Which hosts keep crashing. Dates, times, errors etc. • What disks failed - serial numbers of returns - returns outstanding ... • Keep track of tasks outstanding: eg: why is csflnx231 currently offline - who is fixing it ...
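
Even a very small structured log answers the questions above (which hosts keep crashing, what is outstanding, who is fixing it). A minimal sketch, with a hypothetical record layout and an arbitrary threshold for what counts as a repeat offender:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class Incident:
    host: str
    when: datetime
    summary: str
    owner: Optional[str] = None  # who is fixing it, if anyone
    closed: bool = False


def repeat_offenders(incidents: Iterable[Incident], min_count: int = 3) -> List[str]:
    """Hosts with at least min_count recorded incidents."""
    counts = Counter(incident.host for incident in incidents)
    return sorted(host for host, n in counts.items() if n >= min_count)


def outstanding(incidents: Iterable[Incident]) -> List[Incident]:
    """Incidents not yet closed, in the order they were recorded."""
    return [incident for incident in incidents if not incident.closed]


if __name__ == "__main__":
    log = [
        Incident("csflnx231", datetime(2002, 11, 20, 9, 30), "offline, no ping"),
    ]
    for incident in outstanding(log):
        who = incident.owner or "nobody assigned"
        print(f"{incident.host}: {incident.summary} ({who})")
```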

  12. Hardware Management • Many systems eventually means: • Many system crashes. • Many hardware failures • Consider purchasing 3 years warranty. On-site is easier. • Define standard hardware (re) certification procedure. Make use of junior staff (operators, postgrads, gran, ...!)

  13. Utilisation/Capacity planning • Monitor everything you can conveniently manage. • MRTG is standard network monitoring • Ganglia appears to be popular for system utilisation etc. • PBS accounting records (or process accounting).
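
PBS accounting records lend themselves to simple per-user usage summaries. The sketch below assumes the usual text layout of one record per line, `date time;record type;job id;key=value key=value ...`, with job-end records marked `E` and wall time given as `resources_used.walltime=HH:MM:SS`; check this against the accounting files of the PBS version actually in use, and note that values containing spaces are not handled.

```python
from collections import defaultdict
from typing import Dict, Iterable


def hms_to_seconds(text: str) -> int:
    """Convert 'HH:MM:SS' (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_by_user(lines: Iterable[str]) -> Dict[str, int]:
    """Sum wall-clock seconds per user over job-end ('E') records."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # not a job-end record, or not in the assumed layout
        fields = dict(
            item.split("=", 1) for item in parts[3].split() if "=" in item
        )
        user = fields.get("user")
        walltime = fields.get("resources_used.walltime")
        if user is None or walltime is None:
            continue
        totals[user] += hms_to_seconds(walltime)
    return dict(totals)
```

Passing an open accounting file straight in works, since the function only iterates over lines.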

  14. Conclusions • Careful planning, specification and hardware selection can pay dividends. • Get smart or invest in lots of staff • Monitor so you know what is going on. • Many issues raised - few solutions offered. Wide range of experience out in the UK HEPSYSMAN community. Make use of it!
