GridPP: Running a Production Grid

Stephen Burke

CLRC/RAL

On behalf of the GridPP Deployment & Operations Team

UK e-Science All-hands, Nottingham, 21st September 2006

Overview
  • EGEE, LCG and GridPP
  • Middleware
  • Deployment & Operations
  • Conclusions

EGEE
  • Major EU Grid project: 2004-08 (in two phases)
    • Successor to the European DataGrid (EDG) project, 2001-04
    • 32 countries, 91 partners, €37 million + matching funding
    • Associated with several Grid projects outside Europe
    • Expected to be succeeded by a permanent European e-infrastructure
  • Supports many areas of e-science, but currently High Energy Physics is the major user
    • Biomedical research is also a pioneer
    • Currently ~3000 users in 200 Virtual Organisations
  • Currently 195 sites, 28,689 CPUs, 18.4 PB of storage
    • Values taken from the information system – beware of GIGO (garbage in, garbage out)!

EGEE/LCG Google map

(W)LCG
  • The computing services for the LHC (Large Hadron Collider) at CERN in Geneva are provided by the LHC Computing Grid (LCG) project
    • LHC starts running in ~ 1 year
    • Four experiments, all very large
    • ~5000 users at 500 sites worldwide, 15-year lifetime
  • Expect ~15 PB/year, plus similar volumes of simulated data
  • Processing requirement is ~100,000 CPUs
  • Must transfer ~100 MB/s per site – sustained for 15 years!
  • Running a series of Service Challenges to ramp up to full scale
  • LCG uses the EGEE infrastructure, but also the Open Science Grid (OSG) in the US and other Grid infrastructures
    • Hence WLCG = Worldwide LCG

Organisation
  • EGEE sites are organised by region
    • GridPP is part of UK/Ireland
      • Also NGS + Grid Ireland
    • Each region has a Regional Operation Centre (ROC) to look after the sites in the region
    • Overall operations co-ordination rotates weekly between ROCs
  • LCG divides sites into Tier 1/2/3
    • + CERN as Tier 0
    • A function of size and quality of service (QoS)
    • Tier 1: >97% availability, 24-hour maximum response time
    • Tier 2: 95% availability, 72-hour response
    • Tier 3: local facilities, no specific targets
  • ROC ≈ Tier 1: RAL is both

GridPP
  • Grid for UK Particle Physics
    • Two phases: 2001-04 and 2004-07
    • Proposal for phase 3 to 2011
    • Part of EGEE and LCG
      • Working towards interoperability with NGS
  • 20 sites, 4,354 CPUs, 298 TB of storage
  • Currently supports 33 VOs, including some non-PP
    • But not many non-PP from the UK – any volunteers?
  • For LCG, sites are grouped into four “virtual” Tier 2s
    • Plus RAL as Tier 1
    • Grouping is largely administrative, the Grid sites remain separate
  • Runs UK-Ireland ROC (with NGS)
  • Grid Operations Centre (GOC) @ RAL (with NGS)
    • Grid-wide configuration, monitoring and accounting repository/portal
  • Operations and User Support shifts (working hours only)

GridPP sites

Site services
  • Basis is Globus (still GT2, GT4 soon) and Condor, as packaged in the Virtual Data Toolkit (VDT) – also used by NGS
  • EGEE/LCG/EDG middleware distribution now under the gLite brand name
  • Computing Element (CE): Globus gatekeeper + batch system + batch workers
    • In transition from Globus to Condor-C
  • Storage Element (SE): Storage Resource Manager (SRM) + GridFTP + other data transports + storage system (disk-only or disk+tape)
    • Three SRM implementations in GridPP
  • Berkeley Database Information Index (BDII): LDAP server publishing CE + SE + site + service information according to the GLUE schema (see the query sketch after this list)
  • Relational Grid Monitoring Architecture (R-GMA) server: publishing GLUE schema, monitoring, accounting, user information
  • VOBOX: Container for VO-specific services (aka “edge services”)
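
As a hedged illustration of how these site services are consumed, the sketch below queries a site BDII over LDAP for the Computing Elements it publishes. The host and site names are assumptions for the example; the port, base-DN convention and attribute names follow the GLUE 1.x / BDII conventions of the time.

    # Sketch: list the CEs a site BDII publishes under the GLUE 1.x schema.
    # Host and site names are hypothetical; BDIIs conventionally listen on
    # port 2170 with base DN "mds-vo-name=<site>,o=grid".
    ldapsearch -x -LLL \
      -H ldap://site-bdii.example.ac.uk:2170 \
      -b "mds-vo-name=UKI-EXAMPLE,o=grid" \
      '(objectClass=GlueCE)' \
      GlueCEUniqueID GlueCEStateStatus GlueCEStateFreeCPUs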

Core services
  • Workload Management System (WMS), aka Resource Broker: accepts jobs, dispatches them to sites and manages their lifecycle (a submission sketch follows this list)
  • Logging & Bookkeeping: primarily logs lifecycle events for jobs
  • MyProxy: stores long-lived credentials
  • LCG File Catalogue (LFC): maps logical file names to local names on SEs
  • File Transfer Service (FTS): provides managed, reliable file transfers
  • BDII: aggregates information from site BDIIs
  • R-GMA schema/registry: stores table definitions and lists of producers/consumers
  • VO Membership Service (VOMS) server: stores VO group/role assignments
  • User Interface (UI): provides user client tools for the Grid services
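
To make the user-side flow concrete, here is a hedged sketch of running a trivial job through these services from a UI. The VO (dteam) and file names are illustrative; edg-job-submit was the LCG-2 broker client, with glite-job-submit its gLite WMS counterpart. First, a minimal JDL file, hello.jdl:

    Executable    = "/bin/hostname";
    StdOutput     = "std.out";
    StdError      = "std.err";
    OutputSandbox = {"std.out", "std.err"};

Then obtain a VOMS proxy and submit:

    voms-proxy-init --voms dteam          # short-lived proxy carrying VOMS attributes
    edg-job-submit --vo dteam hello.jdl   # returns a job ID for status queries
                                          # and output retrieval via the UI tools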

Grid services
  • Some extra services are needed to allow the Grid to be operated effectively
    • Mostly unique instances, not part of the gLite distribution
  • Grid Operations Centre DataBase (GOCDB): stores information about each site, including contact details, status and a node list
    • Queried by other tools to generate configuration, monitoring etc
  • Accounting (APEL): publishes information about CPU and storage use
  • Various monitoring tools, including:
    • gstat (Grid status) - collects data from the information system, does sanity checks
    • Site Availability Monitoring (SAM) - runs regular test jobs at every site, raises alerts and measures availability over time
    • GridView – collects and displays information about file transfers
    • Real Time Monitor – displays job movements, and records statistics
  • Freedom of Choice for Resources (FCR): allows the view of resources in a BDII to be filtered according to VO-specific criteria, e.g. SAM test failures
  • Operations portal: aggregates monitoring and operational information, broadcast email tool, news, VO information, …

SAM monitoring

GridView

Middleware issues
  • We need to operate a large production system with 24×7×365 availability
  • Middleware development is usually done on small, controlled test systems, but the production system is much larger in many dimensions, more heterogeneous and not under any central control
  • Much of the middleware is still immature, with a significant number of bugs, and developing rapidly
    • Documentation is sometimes lacking or out of date
  • There are therefore a number of issues which must be managed by deployment and operational procedures, for example:
    • The rapid rate of change and sometimes lack of backward compatibility requires careful management of code deployment
    • Porting to new hardware, operating systems etc can be time consuming
    • Components are often developed in isolation, so integration of new components can take time
    • Configuration can be very complex, and only a small subset of possible configurations produce a working system
    • Fault tolerance, error reporting and logging are in need of improvement
    • Remote management and diagnostic tools are generally undeveloped

Configuration
  • We have tried many installation & configuration tools over the years
  • Configuration is complex, but system managers don’t like complex tools!
  • Most configuration flexibility needs to be “frozen”
    • Admins don’t understand all the options anyway
    • Many configuration changes will break something
    • The more an admin has to type, the more chances for a mistake
  • Current method preferred by most sites is YAIM (Yet Another Installation Method), sketched after this list:
    • bash scripts
    • simple configuration of key parameters only
    • doesn’t always have enough flexibility, but good enough for most cases
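
A hedged sketch of the YAIM approach: a flat key=value site-info.def read by bash configuration scripts. The variable names below are typical of YAIM in this period, but the values, paths and the exact node-type token are illustrative assumptions:

    # site-info.def fragment (illustrative values)
    SITE_NAME=UKI-EXAMPLE
    CE_HOST=ce.example.ac.uk
    SE_HOST=se.example.ac.uk
    BDII_HOST=bdii.example.ac.uk
    VOS="atlas cms lhcb dteam"
    WN_LIST=/opt/lcg/yaim/etc/wn-list.conf   # worker nodes, one per line

    # then configure a node for its role with the bundled bash scripts, e.g.
    ./configure_node site-info.def CE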

Release management
  • There is a constant tension between the desire to upgrade to get new features, and the desire to have a stable system
    • Need to be realistic about how long it takes to get new things into production
  • We have so far had a few “big bang” releases per year, but these have some disadvantages
    • Anything which misses a release has to wait for a long time, hence there is pressure to include untested code
    • Releases can be held up by problems in any area, hence are usually late
    • They involve a lot of work for system managers, so it may be several months before all sites upgrade
  • We are now moving to incremental releases, updating each component as it completes integration and testing
    • Have to avoid dependencies between component upgrades
    • Releases go first to a 10%-scale pre-production Grid
    • Updates every couple of weeks
    • The system becomes more heterogeneous
    • Still some big bangs – e.g. new OS
    • Seems OK so far - time will tell!

VO support
  • If sites are going to support a large number of VOs the configuration has to be done in a standard way
    • Largely true, but not perfect: adding a VO needs changes in several areas
    • Configuration parameters for VOs should be available on the operations portal, although many VOs still need to add their data
  • It needs to be possible to install VO-specific software, and maybe services, in a standard way (see the configuration sketch after this list)
    • Software is ~OK: an NFS-shared area, writable by designated VO members, with its location published in the information system
    • Services still under discussion: concerns about security and support
  • VOs often expect to have dedicated contacts at sites (and vice versa)
    • May be necessary in some cases but does not scale
    • Operations portal stores contacts, but site → VO messages may not reach the right people – need contacts by area
    • Not too bad, but still needs some work to find a good modus vivendi
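
To show what "adding a VO in a standard way" means at the configuration level, here is a hedged sketch using YAIM-style per-VO variables. The VO_<NAME>_* naming convention is assumed from YAIM of this era, and all values are illustrative; the software directory is the NFS-shared area described above, whose location the site publishes in the information system:

    # Per-VO site configuration (illustrative values)
    VO_DTEAM_SW_DIR=/opt/exp_soft/dteam     # NFS-shared software area, writable
                                            # by the VO's software managers
    VO_DTEAM_DEFAULT_SE=se.example.ac.uk    # default Storage Element for the VO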

Availability
  • LCG requires high availability, but the intrinsic failure rate is high
    • Most of the middleware does not deal gracefully with failures
    • Some failure modes can lead to “black holes”
    • Must fix/mask failures via operational tools so users don’t see them
  • Several monitoring tools have been developed, including test jobs run regularly at sites
  • On-duty operators look for problems, and submit tickets to sites
    • Currently ~50 tickets per week (cf. ~200 sites)
  • FCR tool allows sites failing specified tests to be made “invisible”
    • New sites must be certified before they become visible
    • Persistently failing sites can be decertified
    • Sites can be removed temporarily for scheduled downtime
  • Performance is monitored over time
    • The situation has improved a lot, but we still have some way to go

Lessons learnt
  • “Good enough” is not good enough
    • Grids are good at magnifying problems, so must try to fix everything
  • Exceptions are the norm
    • 15,000 nodes with a 5-year MTBF → 15,000 / (5 × 365) ≈ 8 failures a day
      • Also 15,000 ways to be misconfigured!
    • Something somewhere will always be broken
      • But middleware developers tend to assume that everything will work
    • It needs a lot of manpower to keep a big system going
  • Bad error reporting can cost a lot of time
    • And reduce people’s confidence
  • Very few people understand how the whole system works
    • Or even a large subset of it
    • Easy to do things which look reasonable but have a bad side-effect
  • Communication between sites and users is an n×m problem
    • Need to collapse it to n+m

Summary
  • LHC turns on in 1 year – we must focus on delivering a high QoS
  • Grid middleware is still immature, developing rapidly and in many cases a fair way from production quality
  • Experience is that new middleware developments take ~ 2 years to reach the production system, so LHC will start with what we have now
  • The underlying failure rate is high – this will always be true with so many components, so middleware and operational procedures must allow for it
  • We need procedures which can manage the underlying problems, and present users with a system which appears to work smoothly at all times
    • Considerable progress has been made, but there is more to do
  • GridPP is running a major part of the EGEE/LCG Grid, which is now a very large system operated as a high-quality service, 24×7×365
  • We are living in interesting times!
