This presentation on the Performance and Exception Monitoring (PEM) project covers the inadequacies of current monitoring systems, views and global metrics, Goal / Question / Metric analysis and correlations, the framework and its scalability, project status, and a survey of existing tools.
The Performance and Exception Monitoring Project
Tim Smith, IT/PDP
Contents
• Requirements
• Current systems inadequacies
• Views + global metrics
• GQM + correlations
• Framework
• Scalability issues
• Project Status
• Tools survey
• Details from Alessandro…
Tim Smith: FNAL workshop
Current systems inadequacies
• Independent alarm/monitoring systems
• System snapshot requires multiple displays
• Independent agents which: monitor local / monitor remote / restart / alarm
• Calculate the same information multiple times and use it differently
• Host based – no correlations
• Hosts complain about the perceived problem, not the real one
• Operator only follows precise instructions
• Automation! (+ manual Remedy entry)
• Separate static config DBs for alarms and machines
Visions of the Future
• One tool, many purposes… Views:
• End-to-end, user, sysadmin, resource planning
• 1000s of PCs per cluster
• Living with failures + scalable solutions!
• Assure a service: a quorum of machines, NOT the full complement
• High level correlations; impact on a service
• Quality of Service measures; Global Metrics
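The "quorum of machines, not full complement" idea can be sketched as follows. This is only an illustration under assumptions: the `NodeStatus` type, the `service_assured` function, the node names and the quorum value of 850 are all hypothetical and do not come from the PEM project.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeStatus:
    """Health of one cluster node (hypothetical structure)."""
    name: str
    healthy: bool


def service_assured(nodes: Sequence[NodeStatus], quorum: int) -> bool:
    """Return True when at least `quorum` nodes are healthy.

    The service is judged on the quorum, not on every node being up,
    so individual failures do not raise a service-level alarm.
    """
    healthy = sum(1 for node in nodes if node.healthy)
    return healthy >= quorum


# Illustrative cluster of 1000 PCs in which every tenth node has failed.
nodes = [NodeStatus(f"pc{i:04d}", healthy=(i % 10 != 0)) for i in range(1000)]
print(service_assured(nodes, quorum=850))  # True: 900 healthy nodes >= 850
```

With this kind of check, an alarm concerns the service only when the healthy count drops below the quorum, rather than each time a single host fails.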
Global Metrics
• Honour Service Definitions
• “Availability of usable 3000 CUs batch”: machines up + FATMEN + LSF + licence server
• “Availability of an interactive facility”: ASIS available + low trivial response time
• “Job turnaround time expectations”
• “Time to service tape request”
+ Disk/Network bandwidths + CPU/Memory utilisations
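A global metric such as “Availability of usable 3000 CUs batch” combines several conditions into one service-level answer. The sketch below is a minimal illustration, not the project's implementation: the `BatchSnapshot` fields, the function names and the sample values are assumptions, and the slide's "lic. Serv." is read here as a licence server.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class BatchSnapshot:
    """One monitoring sample of the batch service (field names are hypothetical)."""
    cu_by_node_up: Dict[str, float]  # CU capacity of each node currently up
    fatmen_ok: bool
    lsf_ok: bool
    licence_server_ok: bool


def batch_usable(snap: BatchSnapshot, required_cu: float = 3000.0) -> bool:
    """The batch service is usable only if enough CUs are up AND every dependency works."""
    enough_cpu = sum(snap.cu_by_node_up.values()) >= required_cu
    return enough_cpu and snap.fatmen_ok and snap.lsf_ok and snap.licence_server_ok


def availability(snapshots: Iterable[BatchSnapshot]) -> float:
    """Fraction of samples in which the service was usable (0.0 if there are no samples)."""
    results = [batch_usable(s) for s in snapshots]
    return sum(results) / len(results) if results else 0.0


samples = [
    BatchSnapshot({"n1": 2000.0, "n2": 1500.0}, True, True, True),   # usable
    BatchSnapshot({"n1": 2000.0}, True, True, True),                  # too few CUs
    BatchSnapshot({"n1": 2000.0, "n2": 1500.0}, True, False, True),  # LSF down
    BatchSnapshot({"n1": 2000.0, "n2": 1500.0}, True, True, True),   # usable
]
print(availability(samples))  # 0.5
```

The point of the design is that the metric is expressed per service, so a single host failure matters only when it pushes the usable capacity below the service definition.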
Goal / Question / Metric
• PDP Services, e.g. Monitor quality of Interactive Service
• Sufficient nodes?
• Low enough load?
• Slow to respond to commands?
• Contactable via network
• Network daemons alive
• No nologin
• Free ptys
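One possible way to encode the Goal / Question / Metric breakdown above is sketched here. It rests on assumptions: the four node-level metrics are taken to define whether a node counts as usable, the thresholds `MIN_NODES`, `MAX_LOAD` and `MAX_LATENCY_S` are invented for illustration, and the third question is phrased positively ("Fast enough response to commands?") so that True always means the goal is being met.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NodeSample:
    """Metrics sampled on one interactive node (hypothetical field names)."""
    reachable: bool           # contactable via network
    daemons_alive: bool       # network daemons alive
    nologin: bool             # True if a nologin file blocks logins
    free_ptys: int
    load: float
    command_latency_s: float


def node_usable(n: NodeSample) -> bool:
    """A node counts towards the service only if all four metrics are good."""
    return n.reachable and n.daemons_alive and not n.nologin and n.free_ptys > 0


# Hypothetical thresholds, chosen only to make the example concrete.
MIN_NODES, MAX_LOAD, MAX_LATENCY_S = 2, 2.0, 1.0


def usable(nodes: List[NodeSample]) -> List[NodeSample]:
    return [n for n in nodes if node_usable(n)]


def mean_load(nodes: List[NodeSample]) -> float:
    ok = usable(nodes)
    return sum(n.load for n in ok) / len(ok) if ok else float("inf")


def worst_latency(nodes: List[NodeSample]) -> float:
    return max((n.command_latency_s for n in usable(nodes)), default=float("inf"))


# Goal: monitor quality of the Interactive Service. Each question maps to a check.
QUESTIONS: Dict[str, Callable[[List[NodeSample]], bool]] = {
    "Sufficient nodes?": lambda ns: len(usable(ns)) >= MIN_NODES,
    "Low enough load?": lambda ns: mean_load(ns) <= MAX_LOAD,
    "Fast enough response to commands?": lambda ns: worst_latency(ns) <= MAX_LATENCY_S,
}


def evaluate(nodes: List[NodeSample]) -> Dict[str, bool]:
    """Answer every question for the current set of node samples."""
    return {text: check(nodes) for text, check in QUESTIONS.items()}


a = NodeSample(True, True, False, 5, 0.5, 0.2)
b = NodeSample(True, True, False, 3, 1.5, 0.4)
c = NodeSample(True, True, True, 8, 0.1, 0.1)  # nologin set, so not usable
print(evaluate([a, b, c]))
```

Keeping the questions in a table separate from the metric collection means new questions can reuse metrics that are already gathered, instead of each agent computing the same information again.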
Correlations
• Examples:
• Web server on “SUN cluster”
• Interactive Service
Framework Diagram
Scalability
• Avoid bottlenecks by allowing for multiplicity of all components
• Guiding principle: avoid constraining the PEM design by “possible” performance worries
Project Status
• Approval as divisional project
• Interest in EFF and GRID projects
• Documents Produced:
• User Requirements
• Tools survey
• Goal / Question / Metric
• Analysis (end April)
• Design (end May)
• http://cern.ch/proj-pem > Progress > Analysis
Tools Survey
• Enterprise / Cluster Management: Tivoli, Unicenter TNG, Patrol, PCP, SCADA, Alinka, SCMS, MosixMON
• Public Domain Tools: MAT, GAP, Ranger (SLAC), VAMOS (DESY), rls (IN2P3)
• Building blocks: SNMP (Scotty, Advent, MRTG, UCD), JDMK, PIKT, NetLogger, bonobo