
GridPP: Running a Production Grid



  1. GridPP: Running a Production Grid Stephen Burke CLRC/RAL On behalf of the GridPP Deployment & Operations Team UK e-Science All-hands, Nottingham, 21st September 2006

  2. Overview • EGEE, LCG and GridPP • Middleware • Deployment & Operations • Conclusions Running a Production Grid - All-hands

  3. EGEE, LCG and GridPP

  4. EGEE • Major EU Grid project: 2004-08 (in two phases) • Successor to the European DataGrid (EDG) project, 2001-04 • 32 countries, 91 partners, €37 million + matching funding • Associated with several Grid projects outside Europe • Expected to be succeeded by a permanent European e-infrastructure • Supports many areas of e-science, but currently High Energy Physics is the major user • Biomedical research is also a pioneer • Currently ~3000 users in 200 Virtual Organisations • Currently 195 sites, 28,689 CPUs, 18.4 PB of storage • Values taken from the information system – beware of GIGO (garbage in, garbage out)! Running a Production Grid - All-hands

  5. EGEE/LCG Google map Running a Production Grid - All-hands

  6. (W)LCG • The computing services for the LHC (Large Hadron Collider) at CERN in Geneva are provided by the LHC Computing Grid (LCG) project • LHC starts running in ~1 year • Four experiments, all very large • ~5000 users at 500 sites worldwide, 15-year lifetime • Expect ~15 PB/year, plus similar volumes of simulated data • Processing requirement is ~100,000 CPUs • Must transfer ~100 MB/s per site – sustained for 15 years! (see the quick check below) • Running a series of Service Challenges to ramp up to full scale • LCG uses the EGEE infrastructure, but also the Open Science Grid (OSG) in the US and other Grid infrastructures • Hence WLCG = Worldwide LCG Running a Production Grid - All-hands
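
A quick back-of-envelope check of the transfer figure above, as a minimal Python sketch using only the numbers quoted on this slide:

    # Back-of-envelope check of the sustained transfer rate quoted above.
    SECONDS_PER_YEAR = 365 * 24 * 3600           # ~3.15e7 s

    rate_mb_per_s = 100                          # ~100 MB/s per site (from the slide)
    pb_per_year = rate_mb_per_s * SECONDS_PER_YEAR / 1e9   # 1 PB = 1e9 MB

    print(f"~{pb_per_year:.1f} PB per site per year")       # ~3.2 PB/year
    # The right order of magnitude for distributing ~15 PB/year of experiment
    # data plus a similar volume of simulated data and replicas.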

  7. Organisation • EGEE sites are organised by region • GridPP is part of the UK/Ireland region • Also NGS + Grid Ireland • Each region has a Regional Operations Centre (ROC) to look after the sites in the region • Overall operations co-ordination rotates weekly between ROCs • LCG divides sites into Tier 1/2/3 • + CERN as Tier 0 • A function of size and quality of service (QoS) • Tier 1 needs >97% availability, max 24-hour response • Tier 2: 95% availability, 72-hour response (a rough conversion of these targets to allowed downtime is sketched below) • Tier 3 are local facilities, no specific targets • ROC ≈ Tier 1: RAL is both Running a Production Grid - All-hands
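
For illustration, the availability targets above translate into allowed downtime roughly as follows (a minimal sketch; the official WLCG availability calculation uses its own measurement windows):

    # Rough conversion of the Tier 1/2 availability targets into allowed downtime.
    HOURS_PER_MONTH = 30 * 24   # ~720 h

    targets = {"Tier 1": 0.97, "Tier 2": 0.95}
    for tier, availability in targets.items():
        allowed = (1 - availability) * HOURS_PER_MONTH
        print(f"{tier}: at most ~{allowed:.0f} hours of downtime per month")
    # Tier 1: ~22 h/month, Tier 2: ~36 h/month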

  8. GridPP • Grid for UK Particle Physics • Two phases: 2001-04 and 2004-07 • Proposal for phase 3 to 2011 • Part of EGEE and LCG • Working towards interoperability with NGS • 20 sites, 4354 CPUs, 298 TB of storage • Currently supports 33 VOs, including some non-particle-physics VOs • But not many non-PP users from the UK – any volunteers? • For LCG, sites are grouped into four “virtual” Tier 2s • Plus RAL as Tier 1 • Grouping is largely administrative; the Grid sites remain separate • Runs the UK/Ireland ROC (with NGS) • Grid Operations Centre (GOC) @ RAL (with NGS) • Grid-wide configuration, monitoring and accounting repository/portal • Operations and User Support shifts (working hours only) Running a Production Grid - All-hands

  9. GridPP sites Running a Production Grid - All-hands

  10. Middleware

  11. Site services • Basis is Globus (still GT2, GT4 soon) and Condor, as packaged in the Virtual Data Toolkit (VDT) – also used by NGS • EGEE/LCG/EDG middleware distribution now under the gLite brand name • Computing Element (CE): Globus gatekeeper + batch system + batch workers • In transition from Globus to Condor-C • Storage Element (SE): Storage Resource Manager (SRM) + GridFTP + other data transports + storage system (disk-only or disk+tape) • Three SRM implementations in GridPP • Berkeley Database Information Index (BDII): LDAP server publishing CE + SE + site + service information according to the GLUE schema (see the query sketch below) • Relational Grid Monitoring Architecture (R-GMA) server: publishes GLUE-schema, monitoring, accounting and user information • VOBOX: container for VO-specific services (aka “edge services”) Running a Production Grid - All-hands
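
As an example of what a BDII exposes, the sketch below queries the GLUE 1.x schema over LDAP using the ldap3 Python library. The hostname is a placeholder; port 2170, anonymous binds and the base DN shown are the usual BDII conventions.

    # Minimal sketch: list Computing Elements published by a top-level BDII.
    # The hostname is a placeholder; attribute names come from the GLUE 1.x schema.
    from ldap3 import Server, Connection

    server = Server("ldap://lcg-bdii.example.org:2170")
    conn = Connection(server, auto_bind=True)   # BDIIs normally allow anonymous binds

    conn.search(
        search_base="mds-vo-name=local,o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID", "GlueCEInfoTotalCPUs", "GlueCEStateWaitingJobs"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEInfoTotalCPUs, entry.GlueCEStateWaitingJobs)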

  12. Core services • Workload Management System (WMS), aka Resource Broker: accepts jobs, dispatches them to sites and manages their lifecycle • Logging & Bookkeeping (L&B): primarily logs lifecycle events for jobs (the usual state sequence is sketched below) • MyProxy: stores long-lived credentials • LCG File Catalogue (LFC): maps logical file names to physical replicas on SEs • File Transfer Service (FTS): provides managed, reliable file transfers • BDII: aggregates information from site BDIIs • R-GMA schema/registry: stores table definitions and lists of producers/consumers • VO Membership Service (VOMS) server: stores VO group/role assignments • User Interface (UI): provides user client tools for the Grid services Running a Production Grid - All-hands
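
To make “manages their lifecycle” concrete, the sketch below lists the usual gLite job states that the WMS drives a job through and the Logging & Bookkeeping service records; it is purely illustrative and not part of any gLite client API.

    # Illustrative model of the job lifecycle recorded by Logging & Bookkeeping.
    JOB_STATES = [
        "SUBMITTED",   # accepted by the WMS / Resource Broker
        "WAITING",     # queued in the WMS, waiting for matchmaking
        "READY",       # matched to a CE, being transferred there
        "SCHEDULED",   # queued in the site batch system behind the CE
        "RUNNING",     # executing on a worker node
        "DONE",        # finished, successfully or not
        "CLEARED",     # output retrieved by the user
    ]
    TERMINAL = {"DONE", "CLEARED", "ABORTED", "CANCELLED"}

    def is_finished(state: str) -> bool:
        """True once L&B reports a terminal state for the job."""
        return state in TERMINAL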

  13. Grid services • Some extra services are needed to allow the Grid to be operated effectively • Mostly unique instances, not part of the gLite distribution • Grid Operations Centre DataBase (GOCDB): stores information about each site, including contact details, status and a node list • Queried by other tools to generate configuration, monitoring etc. • Accounting (APEL): publishes information about CPU and storage use • Various monitoring tools, including: • gstat (Grid status) – collects data from the information system and does sanity checks • Site Availability Monitoring (SAM) – runs regular test jobs at every site, raises alerts and measures availability over time • GridView – collects and displays information about file transfers • Real Time Monitor – displays job movements and records statistics • Freedom of Choice for Resources (FCR): allows the view of resources in a BDII to be filtered according to VO-specific criteria, e.g. SAM test failures (see the filtering sketch below) • Operations portal: aggregates monitoring and operational information, broadcast email tool, news, VO information, … Running a Production Grid - All-hands
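
A minimal sketch of the kind of filtering FCR performs: given the CEs a VO is prepared to use and the latest SAM results, hide anything failing a test the VO marks as critical. The data layout and names here are hypothetical; the real FCR acts on the resource view presented by a top-level BDII.

    # Hypothetical sketch of FCR-style filtering on SAM test results.
    def visible_ces(ces, sam_results, critical_tests):
        """Return only the CEs whose critical SAM tests all passed."""
        ok = []
        for ce in ces:
            results = sam_results.get(ce, {})
            if all(results.get(test) == "ok" for test in critical_tests):
                ok.append(ce)
        return ok

    sam_results = {
        "ce01.site-a.example": {"job-submit": "ok", "replica-mgmt": "ok"},
        "ce02.site-b.example": {"job-submit": "error", "replica-mgmt": "ok"},
    }
    print(visible_ces(list(sam_results), sam_results, ["job-submit", "replica-mgmt"]))
    # -> ['ce01.site-a.example']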

  14. SAM monitoring Running a Production Grid - All-hands

  15. GridView Running a Production Grid - All-hands

  16. Middleware issues • We need to operate a large production system with 24*7*365 availability • Middleware development is usually done on small, controlled test systems, but the production system is much larger in many dimensions, more heterogeneous and not under any central control • Much of the middleware is still immature, with a significant number of bugs, and developing rapidly • Documentation is sometimes lacking or out of date • There are therefore a number of issues which must be managed by deployment and operational procedures, for example: • The rapid rate of change, and occasional lack of backward compatibility, requires careful management of code deployment • Porting to new hardware, operating systems etc. can be time-consuming • Components are often developed in isolation, so integration of new components can take time • Configuration can be very complex, and only a small subset of possible configurations produces a working system • Fault tolerance, error reporting and logging are in need of improvement • Remote management and diagnostic tools are generally undeveloped Running a Production Grid - All-hands

  17. Deployment & Operations

  18. Configuration • We have tried many installation & configuration tools over the years • Configuration is complex, but system managers don’t like complex tools! • Most configuration flexibility needs to be “frozen” • Admins don’t understand all the options anyway • Many configuration changes will break something • The more an admin has to type, the more chances for a mistake • The current method preferred by most sites is YAIM (Yet Another Installation Method): • bash scripts • simple configuration of key parameters only (a sketch of this key=value style follows below) • doesn’t always have enough flexibility, but good enough for most cases Running a Production Grid - All-hands
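
The YAIM input is essentially a flat file of KEY=value parameters (site-info.def). The sketch below reads such a file and checks that a few parameters are present; the parameter names are illustrative examples of YAIM variables, not a complete or authoritative list.

    # Illustrative check of a YAIM-style site-info.def (flat KEY=value file).
    # The required names below are examples, not the full YAIM parameter set.
    REQUIRED = ["SITE_NAME", "CE_HOST", "BDII_HOST", "VOS"]

    def read_site_info(path):
        """Parse KEY=value lines, ignoring comments and blank lines."""
        params = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip().strip('"')
        return params

    params = read_site_info("site-info.def")
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        print("Missing parameters:", ", ".join(missing))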

  19. Release management • There is a constant tension between the desire to upgrade to get new features and the desire to have a stable system • Need to be realistic about how long it takes to get new things into production • We have so far had a few “big bang” releases per year, but these have some disadvantages • Anything which misses a release has to wait a long time, hence there is pressure to include untested code • Releases can be held up by problems in any area, hence are usually late • They involve a lot of work for system managers, so it may be several months before all sites upgrade • We are now moving to incremental releases, updating each component as it completes integration and testing • Have to avoid dependencies between component upgrades • Releases go first to a 10%-scale pre-production Grid • Updates every couple of weeks • The system becomes more heterogeneous • Still some big bangs – e.g. a new OS • Seems OK so far – time will tell! Running a Production Grid - All-hands

  20. VO support • If sites are going to support a large number of VOs the configuration has to be done in a standard way • Largely true, but not perfect: adding a VO needs changes in several areas • Configuration parameters for VOs should be available on the operations portal, although many VOs still need to add their data • It needs to be possible to install VO-specific software, and maybe services, in a standard way • Software is ~OK: NFS-shared area, writeable by specific VO members, with publication in the information system • Services still under discussion: concerns about security and support • VOs often expect to have dedicated contacts at sites (and vice versa) • May be necessary in some cases but does not scale • Operations portal stores contacts, but site -> VO may not reach the right people – need contacts by area • Not too bad, but still needs some work to find a good modus vivendi Running a Production Grid - All-hands

  21. Availability • LCG requires high availability, but the intrinsic failure rate is high • Most of the middleware does not deal gracefully with failures • Some failure modes can lead to “black holes” • Must fix/mask failures via operational tools so users don’t see them • Several monitoring tools have been developed, including test jobs run regularly at sites • On-duty operators look for problems and submit tickets to sites • Currently ~50 tickets per week (cf. ~200 sites) • The FCR tool allows sites failing specified tests to be made “invisible” • New sites must be certified before they become visible • Persistently failing sites can be decertified • Sites can be removed temporarily for scheduled downtime • Performance is monitored over time • The situation has improved a lot, but we still have some way to go Running a Production Grid - All-hands

  22. Conclusions

  23. Lessons learnt • “Good enough” is not good enough • Grids are good at magnifying problems, so must try to fix everything • Exceptions are the norm • 15,000 nodes with an MTBF of 5 years means 15,000 / (5 × 365) ≈ 8 failures a day • Also 15,000 ways to be misconfigured! • Something somewhere will always be broken • But middleware developers tend to assume that everything will work • It needs a lot of manpower to keep a big system going • Bad error reporting can cost a lot of time • And reduce people’s confidence • Very few people understand how the whole system works • Or even a large subset of it • Easy to do things which look reasonable but have a bad side-effect • Communication between sites and users is an n×m problem • Need to collapse it to n+m Running a Production Grid - All-hands

  24. Summary • LHC turns on in 1 year – we must focus on delivering a high QOS • Grid middleware is still immature, developing rapidly and in many cases a fair way from production quality • Experience is that new middleware developments take ~ 2 years to reach the production system, so LHC will start with what we have now • The underlying failure rate is high – this will always be true with so many components, so middleware and operational procedures must allow for it • We need procedures which can manage the underlying problems, and present users with a system which appears to work smoothly at all times • Considerable progress has been made, but there is more to do • GridPP is running a major part of the EGEE/LCG Grid, which is now a very large system operated as a high-quality service, 24*7*365 • We are living in interesting times! Running a Production Grid - All-hands
