“Operating a Grid Service”

“Operating a Grid Service”. James Casey, IT-GD, CERN. Karlsruhe, 20th October 2005. FZK T1-T2 Workshop. Services in LCG/EGEE. Many existing services run already that have more than local scope: RB, MYPX, SE, …

Presentation Transcript


  1. “Operating a Grid Service”. James Casey, IT-GD, CERN. Karlsruhe, 20th October 2005. FZK T1-T2 Workshop.

  2. Services in LCG/EGEE
     • Many existing services run already that have more than local scope
       • RB, MYPX, SE, …
     • SC3 has added new services on top of LCG-2 in response to the LCG Baseline Service Working Group
       • SRM SEs, LFC, gLite FTS
     • gLite pre-production service has many more candidate services to move into full production
       • gLite WMS, glite-I/O, FireMan, …
     • More new services will come for SC4
       • Do we know what services are needed for analysis?
         • PROOF, xrootd, …
     • Observation on new services:
       • Rapid change rate required to mature them
       • Both in terms of software and operational procedures

  3. Partners in a grid-wide Service
     • Many partners involved in managing a service
       • Deployment team for release management, packaging
         • And they handle liaison with dev teams for bug fixing
       • Site admins for first level support and fabric management
       • CIC-on-duty for monitoring at the grid level
       • GGUS for end user support
       • Experiment support teams to aid integration of experiment frameworks
       • VO administrators are clients for service and application level monitoring
     • All partners need certain information in order to manage the service
       • Probably only the first two teams above get the right level of detail

  4. Service Operational Ticklist
     • First level support procedures
       • How to start/stop/restart service
       • How to check it’s up
       • Which logs are useful to send to CIC/Developers
         • and where they are
     • SFT Tests
       • Client validation
       • Server validation
       • Procedure to analyse these
         • error messages and likely causes
     • Tools for CIC to spot problems
       • GIIS monitor validation rules (e.g. only one “global” component)
       • Definition of normal behaviour
       • Metrics
       • CIC Dashboard
       • Alarms
     • Deployment Info
       • RPM list
       • Configuration details (for yaim)
       • Security audit
     • User support procedures (GGUS)
       • Troubleshooting guides + FAQs
       • User guides
     • Operations Team Training
       • Site admins
       • CIC personnel
       • GGUS personnel
     • Monitoring
       • Service status reporting
       • Performance data
     • Accounting
       • Usage data
     • Service Parameters
       • Scope: Global/Local/Regional
       • SLAs
       • Impact of service outage
       • Security implications
     • Contact Info
       • Developers
       • Support Contact
       • Escalation procedure to developers
     • Interoperation
       • What effect does an upgrade have?
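
As one illustration of the “How to check it’s up” item, a first-level check could start by testing whether the service's port accepts TCP connections. The sketch below is a minimal Python example under that assumption, not the procedure used for any LCG/EGEE service. The helper name `is_port_open` is hypothetical, and a real check would also need to exercise the service's own protocol.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only: it shows that something is
    listening on the port, not that the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A site admin might call, for example, `is_port_open("fts.example.org", 8443)`, where both the host name and the port number are placeholders rather than real service endpoints.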

  5. gLite FTS ticklist satisfaction
     • First level support procedures
       • How to start/stop/restart service
       • How to check it’s up
       • Which logs are useful to send to CIC/Developers
         • and where they are
     • SFT Tests
       • Client validation
       • Server validation
       • Procedure to analyse these
         • error messages and likely causes
     • Tools for CIC to spot problems
       • GIIS monitor validation rules (e.g. only one “global” component)
       • Definition of normal behaviour
       • Metrics
       • CIC Dashboard
       • Alarms
     • Deployment Info
       • RPM list
       • Configuration details (for yaim)
       • Security audit
     • User support procedures (GGUS)
       • Troubleshooting guides + FAQs
       • User guides
     • Operations Team Training
       • Site admins
       • CIC personnel
       • GGUS personnel
     • Monitoring
       • Service status reporting
       • Performance data
     • Accounting
       • Usage data
     • Service Parameters
       • Scope: Global/Local/Regional
       • SLAs
       • Impact of service outage
       • Security implications
     • Contact Info
       • Developers
       • Support Contact
       • Escalation procedure to developers
     • Interoperation
       • What effect does an upgrade have?
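
One way to track ticklist satisfaction for a service is to hold the ticklist as a small data structure and report which items are still open. The Python sketch below assumes this approach. It includes only three of the categories, the grouping of items under categories is inferred from the order of the slide text, and the set of satisfied items is invented for illustration; it is not the actual gLite FTS status. The names `TICKLIST`, `missing_items` and `satisfaction` are hypothetical.

```python
from typing import Dict, List, Set

# A few ticklist categories and items, taken from the slide text.
# The category grouping is an assumption based on the order of the items.
TICKLIST: Dict[str, List[str]] = {
    "First level support procedures": [
        "How to start/stop/restart service",
        "How to check it's up",
        "Which logs are useful to send to CIC/Developers",
    ],
    "Deployment Info": [
        "RPM list",
        "Configuration details (for yaim)",
        "Security audit",
    ],
    "Accounting": ["Usage data"],
}


def missing_items(satisfied: Set[str]) -> Dict[str, List[str]]:
    """Map each category to the ticklist items not yet satisfied."""
    return {
        category: [item for item in items if item not in satisfied]
        for category, items in TICKLIST.items()
    }


def satisfaction(satisfied: Set[str]) -> float:
    """Fraction of all ticklist items that are satisfied (0.0 to 1.0)."""
    all_items = [item for items in TICKLIST.values() for item in items]
    return sum(item in satisfied for item in all_items) / len(all_items)


# Hypothetical status, for illustration only; not the real gLite FTS result.
example_satisfied = {"How to start/stop/restart service", "RPM list"}
print(missing_items(example_satisfied))
print(f"{satisfaction(example_satisfied):.0%} of items satisfied")
```

Keeping the open items grouped by category makes it easier to see which partner (site admins, CIC, GGUS, developers) would need to supply the missing information.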

  6. Summary
     • We need to monitor status of current services
       • … and define the procedure to move a new service into full operation mode
     • Issue: We mentioned who needed the information
       • But not who provides it…
     • Harry will cover some local site issues next
       • Including hardware and fabric, which we haven’t even mentioned

  7. Discussion