
Patrol/Ranger Update




  1. Patrol/Ranger Update
     Chuck Boeheim
     Assistant Director, SLAC Computer Services

  2. History
     • Patrol originated in 1994
     • Originally only to renice processes
     • Extended to monitor filesystems, daemons, and to perform more notifications/repairs
     • Downloaded by over 300 sites, in production use in about 20 known sites

  3. Limitations
     • Original rules language simple, columnar:
         PC afs[0-9]* 50 log,mail(unix-admin)
     • Difficult to extend to express complexities
     • E.g., renice processes using more than 20% of the CPU if the load average is over 3
     • Written in Perl4, limited by not having complex data structures

  4. The Rewrite
     • Update to Perl5
     • Introduce new rules language
     • Introduce extensible data collectors
     • Rename to System Ranger

  5. Rules file structure
     • Config section supplies local customizations
     • Ruleset sections define data collectors and the set of rules to be applied to them
     • Message section defines message texts

  6. Config section
     • Supplies the common customizations made at other sites
         config {
           optsfile(/etc/tailor.opts)
           path(/usr/ucb:/bin:/usr/bin)
           mailfrom('The System Ranger <root>')
           mailreply('Unix Admins <unix-admin>')
         }

  7. Rulesets
     • Rulesets name a set of rules and associate them with a data collector
         Ruleset(anyname) collector(process) {
           list of rules...
         }
     • Built-in data collectors are: System, Process, Daemon, User, Filesystem, File, Service
     • Custom collectors are planned

  8. Rules
     • A rule is a set of function calls in braces
         Rule { cpu(gt,50) kill() log() }
     • Functions return SUCCESS or FAILURE
     • FAILURE skips the remainder of the rule; execution passes to the next rule
     • A rule that succeeds ends processing of the ruleset unless the CONTINUE function appears in it
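The control flow on this slide can be sketched in Python. This is an illustration only, not Ranger's Perl implementation: `run_rule`, `run_ruleset`, the `CONTINUE` marker object, the helper names, and the representation of a rule as a list of callables are all assumptions made for the example.

```python
# Hypothetical model: a rule is a list of callables that take a record
# (a dict) and return True (SUCCESS) or False (FAILURE).
CONTINUE = object()  # marker standing in for the CONTINUE function

actions = []  # records which action functions ran, for demonstration


def cpu_gt(limit):
    """Comparison: succeed if the record's CPU percentage exceeds limit."""
    return lambda rec: rec["cpu"] > limit


def act(name):
    """Action: note that it ran and report SUCCESS."""
    def fn(rec):
        actions.append(name)
        return True
    return fn


def run_rule(rule, record):
    """Run functions in order; the first FAILURE skips the rest."""
    for fn in rule:
        if fn is CONTINUE:
            continue
        if not fn(record):
            return False
    return True


def run_ruleset(rules, record):
    """Apply rules in order. A rule that succeeds ends processing
    unless it contains CONTINUE. Returns how many rules succeeded."""
    succeeded = 0
    for rule in rules:
        if run_rule(rule, record):
            succeeded += 1
            if CONTINUE not in rule:
                break
    return succeeded


rules = [
    [cpu_gt(50), act("kill"), act("log")],
    [cpu_gt(25), act("log")],
]
run_ruleset(rules, {"cpu": 30})  # first rule fails at cpu(gt,50); second logs
```

In this model a process at 60% CPU is killed and logged by the first rule, and the second rule is never consulted because the first one succeeded without CONTINUE.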

  9. Rules
     • The word OR may connect functions
         Rule { cpu(gt,50) or size(gt,20M) kill() }
     • A sequence of functions in braces returns SUCCESS or FAILURE for the entire sequence
         Rule {{cpu(gt,50) kill()} or cpu(gt,25) log }
     • A sequence of functions in brackets always returns SUCCESS
         Rule { cpu(gt,50) [size(gt,10M) kill] log }
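One way to picture the three grouping forms is as small combinators. The sketch below is hypothetical Python, not Ranger code: `seq`, `either`, `always`, `gt`, and `act` are invented names, and it assumes that OR joins the items on either side of it before the sequence continues, which the slide examples suggest but do not state.

```python
actions = []  # records which action functions ran, for demonstration


def gt(field, limit):
    """Comparison: succeed if record[field] exceeds limit."""
    return lambda rec: rec[field] > limit


def act(name):
    """Action: note that it ran and report SUCCESS."""
    def fn(rec):
        actions.append(name)
        return True
    return fn


def seq(*fns):
    """Braces: succeed only if every function succeeds; stop at first FAILURE."""
    return lambda rec: all(fn(rec) for fn in fns)


def either(*fns):
    """OR: try alternatives left to right; stop at first SUCCESS."""
    return lambda rec: any(fn(rec) for fn in fns)


def always(*fns):
    """Brackets: run the sequence, but report SUCCESS regardless."""
    def run(rec):
        seq(*fns)(rec)
        return True
    return run


MB = 2 ** 20  # assumption: "M" means 2**20 bytes

# Rule {{cpu(gt,50) kill()} or cpu(gt,25) log }
brace_rule = seq(either(seq(gt("cpu", 50), act("kill")), gt("cpu", 25)),
                 act("log"))

# Rule { cpu(gt,50) [size(gt,10M) kill] log }
bracket_rule = seq(gt("cpu", 50),
                   always(gt("size", 10 * MB), act("kill")),
                   act("log"))
```

Under these assumptions, `brace_rule` kills and logs a process above 50% CPU but only logs one between 25% and 50%, while `bracket_rule` logs every process above 50% CPU and kills only those that are also larger than 10M.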

  10. Selection Functions
      • Apply to specific machines:
          • host
          • option
          • arch
          • test
      • Apply to specific instances:
          • user
          • group
          • name
      • All tests may be negative or positive, e.g., host(icarus) or user(!root)
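A selection test with optional negation might look like the following Python sketch. It is not taken from Ranger: the function name `selects` is invented, and it assumes that arguments are regular expressions matched against the whole value, which patterns such as www.* and afs[0-9]+ elsewhere in the slides hint at but do not confirm.

```python
import re


def selects(pattern, value):
    """Return True if value is selected by pattern.

    A leading '!' inverts the test, as in user(!root).
    Assumption: the pattern is a regular expression that must match
    the entire value.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    matched = re.fullmatch(pattern, value) is not None
    return matched != negate


selects("icarus", "icarus")  # positive test on a host name
selects("!root", "alice")    # negative test: any user except root
```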

  11. Comparison Functions
      • Determine when thresholds crossed
          • cpu - percent of CPU
          • size - memory or file size or rate of change
          • time - total CPU time
      • Or test global values
          • loadavg, numusers, numprocs, uptime
      • Have optional first argument specifying comparison: gt, lt, eq, etc.
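A threshold comparison of this kind can be sketched in Python as below. The names `parse_threshold` and `compare` are hypothetical, and the unit meanings are assumptions: the slides use suffixes such as 20M and 6h without defining them, so the sketch takes M as 2**20 bytes and h as 3600 seconds. The slides also do not say which comparison applies when the first argument is omitted, so the sketch requires it.

```python
import operator

# Comparison names that appear in the slides (gt, lt, eq, ne).
COMPARATORS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "ne": operator.ne,
}

# Assumed unit multipliers; the slides do not define them.
SUFFIXES = {"M": 2 ** 20, "h": 3600}


def parse_threshold(text):
    """Convert '20M' or '6h' or '50' to a plain number."""
    text = str(text)
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def compare(op, value, threshold):
    """Return True if `value op threshold` holds, e.g. compare('gt', 60, 50)."""
    return COMPARATORS[op](value, parse_threshold(threshold))


compare("gt", 25 * 2 ** 20, "20M")  # like size(gt,20M) for a 25 MB process
compare("gt", 7 * 3600, "6h")       # like time(gt,6h) for 7 hours of CPU time
```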

  12. Action Functions
      • Specify some action to perform
          • log
          • mail
          • page
          • kill, signal (by pid or name)
          • nice

  13. Sample Process Rules
        Rule { host(www.*) pct(gt,10) or size(gt,20M)
               mail(PROC_REPORT,www-monitor) mcons(info) log }
        Rule { {time(gt,6h) kill mail(OVERLIM, $user)} or
               {time(gt,4h) mail(WARN2, $user)} or
               {time(gt,2h) mail(WARN1, $user)} }
        Message OVERLIM <<EOF
        The CPU limit for $host is 6 hours.
        Your process $pid $cmd has been terminated for exceeding the limit.
        EOF
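The $name placeholders in a message text behave much like Python's `string.Template`, which makes for a quick way to see what a filled-in message would look like. This is only an analogy, not how Ranger expands messages, and the pid and command name used here are made-up sample values.

```python
from string import Template

# Text of the OVERLIM message from the slide, joined into one string.
OVERLIM = Template(
    "The CPU limit for $host is 6 hours. "
    "Your process $pid $cmd has been terminated for exceeding the limit."
)

# Hypothetical values for a single offending process.
text = OVERLIM.substitute(host="icarus", pid=4242, cmd="simulate")
print(text)
```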

  14. Sample Filesystem Rules
        Rule { name(/u[0-9]) pct(gt,99,90+1) page(admin)}
        Rule { host(afs[0-9]+) name(/vicep.*)
               { host(afs07) name(/vicepg) } or
               { host(afs08) name(/vicepf) } or
               { pct(gt,98) mail(FSFULL, admin) } }
        Message FSFULL <<EOF
        File system $name is $pct% full, grew by $delta%.
        EOF

  15. Sample File Rules
        Rule { name(/var/adm*) size(gt,1M) page(admin) }
        Rule { name(/etc/passwd) md5() mail(PSWDCHG, admin) }
        Message PSWDCHG <<EOF
        File $name has been changed!
        EOF
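The md5() test in the second rule presumably succeeds when a file's checksum differs from the one seen previously. A minimal Python sketch of that idea follows; the function name `md5_changed`, the `known` dictionary used as state, and the choice to treat the first sighting of a file as "unchanged" are all assumptions, since the slides do not describe how Ranger stores or compares digests.

```python
import hashlib


def md5_changed(path, known):
    """Return True if the file's MD5 digest differs from the recorded one.

    `known` maps path -> last digest seen and is updated on every call.
    Assumption: a file seen for the first time is not reported as changed.
    """
    with open(path, "rb") as fh:
        digest = hashlib.md5(fh.read()).hexdigest()
    previous = known.get(path)
    known[path] = digest
    return previous is not None and previous != digest
```

A monitoring loop would call `md5_changed("/etc/passwd", known)` on each pass and send the PSWDCHG message when it returns True.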

  16. Sample Daemon Rules
        Rule { name(nfsd) number(ne,8) page(admin) }
        Rule { name(pud) number(lt,1) restart(pud) }
        Rule { name(amd) number(gt,1) page(admin) }
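The daemon rules amount to counting running processes by name and comparing each count with an expected value. The Python sketch below mirrors the three rules on the slide; the function `daemon_alerts`, the table layout, and the sample process list are hypothetical, and the actions are returned as labels rather than performed.

```python
import operator
from collections import Counter

COMPARATORS = {"ne": operator.ne, "lt": operator.lt, "gt": operator.gt}

# (daemon name, comparison, expected count, action) taken from the slide.
DAEMON_RULES = [
    ("nfsd", "ne", 8, "page"),
    ("pud", "lt", 1, "restart"),
    ("amd", "gt", 1, "page"),
]


def daemon_alerts(process_names, rules=DAEMON_RULES):
    """Return (daemon, action) pairs for every rule whose test succeeds."""
    counts = Counter(process_names)  # missing names count as 0
    return [
        (name, action)
        for name, op, expected, action in rules
        if COMPARATORS[op](counts[name], expected)
    ]


# Hypothetical process table: 8 nfsd, no pud, 2 amd.
running = ["nfsd"] * 8 + ["amd", "amd"]
alerts = daemon_alerts(running)
```

With this sample table, nfsd passes quietly, pud is flagged for restart because none is running, and amd is flagged for a page because more than one copy is running.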

  17. Sample User Rules
      • Still somewhat experimental
        Rule { user(!root) number(gt,3) pct(gt,50) mail(CPUHOG, admin) }
        Message CPUHOG <<EOF
        User $user has $number processes using $pct% of the CPU on $host.
        EOF

  18. Why Ranger?
      • Some automatic monitoring is needed
      • Commercial packages are complex and expensive
      • Ranger does a lot in a small package
      • Because it’s cool

  19. Availability
      • Needs a bit more shakedown at SLAC before distribution
      • Look for it at http://www.slac.stanford.edu/~boeheim
      • Will be starting a mailing list; send email to be included
