slide1 n.
Skip this Video
Download Presentation
EGEE is a project funded by the European Union under contract IST-2003-508833

Loading in 2 Seconds...

play fullscreen
1 / 32

EGEE is a project funded by the European Union under contract IST-2003-508833 - PowerPoint PPT Presentation

  • Uploaded on

“ LCG Operation During the Data Challenges ” Markus Schulz, IT-GD, CERN “Discussion on Operation Models”. EGEE is a project funded by the European Union under contract IST-2003-508833. Outline. Building LCG-2 Data Challenges (very brief) Problems (not so brief)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'EGEE is a project funded by the European Union under contract IST-2003-508833' - truman

Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

“LCG Operation During the Data Challenges” Markus Schulz, IT-GD,“Discussion on Operation Models”

EGEE is a project funded by the European Union under contract IST-2003-508833

  • Building LCG-2
  • Data Challenges (very brief)
  • Problems (not so brief)
  • Operating LCG
    • how it was planned
    • how it happened to be done
    • how it felt
  • What’s next?
  • I will skip many slides to leave room for discussions

Comment /Shout in REALTIME!!!!!

HEPIX 2004 BNL CERN IT-GD 22 October 2004 2

  • December 2003 LCG-2
    • Full set of functionality for DCs, first MSS integration
    • Deployed in January to 8 core sites (less sites less trouble)
    • DCs started in February -> testing in production
    • Large sites integrate resources into LCG (MSS and farms)
    • Introduced a pre-production service for the experiments
    • Alternative packaging (tool based and generic installation guides)
  • Mai 2004 -> now monthly incremental releases
    • Not all releases are distributed to external sites
    • Improved services, functionality, stability and packing step by step
    • Timely response to experiences from the data challenges

HEPIX 2004 BNL CERN IT-GD 22 October 2004 3

lcg 2 status 22 10 2004
LCG-2 Status 22 10 2004

new interested sites should look here: release



82 Sites

~9400 CPUs

~6.5 PByte

HEPIX 2004 BNL CERN IT-GD 22 October 2004 4

integrating sites
Integrating Sites
  • Sites contact GD Group or Regional Center
  • Sites go to therelease page
  • Sites decide on manual or tool based installation (LCFGng)
    • documentation for both available
    • WN and UI from next release on tar-ball based release
      • almost trivial install of WNs and UIs
  • Sites provide security and contact information
  • Sites install and use provided tests for debugging
    • support from regional centers or CERN
  • CERN GD certifies site and adds it to the monitoring and information system
    • sites are daily re-certified and problems traced in SAVANNAH
  • Large sites have integrated their local batch systems in LCG-2
  • Adding new sites is now quite smooth
    • problem is keeping large number of sites correctly configured

worked 80+ times

failed 3-5 times

HEPIX 2004 BNL CERN IT-GD 22 October 2004 5

data challenges
Data Challenges
  • Large scale production effort of the LHC experiments
    • test and validate the computing models
    • produce needed simulated data
    • test experiments production frame works and software
    • test the provided grid middleware
    • test the services provided by LCG-2
  • All experiments used LCG-2 for part of their production

HEPIX 2004 BNL CERN IT-GD 22 October 2004 6

data challenges1

Phase I

    • 120k Pb+Pb events produced in 56k jobs
    • 1.3 million files (26TByte) in Castor@CERN
    • Total CPU: 285 MSI-2k hours (2.8 GHz PC working 35 years)
    • ~25% produced on LCG-2
  • Phase II (underway)
    • 1 million jobs, 10 TB produced, 200TB transferred ,500 MSI2k hours CPU
    • ~15% on LCG-2
Data Challenges
  • Phase I
    • 7.7 Million events fully simulated (Geant 4) in 95.000 jobs
    • 22 TByte
    • Total CPU: 972 MSI-2k hours
    • >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 7

data challenges2

~30 M events produced

  • 25Hz reached
    • (only once for a full day)
Data Challenges

HEPIX 2004 BNL CERN IT-GD 22 October 2004 8

data challenges3

Phase I

    • 186 M events 61 TByte
    • Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)
    • Up to 5600 concurrent running jobs in LCG-2
Data Challenges

3-5 106/day





LCG in


1.8 106/day



HEPIX 2004 BNL CERN IT-GD 22 October 2004 9

problems during the data challenges
Problems during the data challenges
  • All experiments encountered on LCG-2 similar problems
  • LCG sites suffering from configuration and operational problems
    • not adequate resources on some sites (hardware, human..)
    • this is now the main source of failures
  • Load balancing between different sites is problematic
    • jobs can be “attracted” to sites that have no adequate resources
    • modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS
  • Identification and location of problems in LCG-2 is difficult
    • distributed environment, access to many logfiles needed…..
    • status of monitoring tools
  • Handling thousands of jobs is time consuming and tedious
    • Support for bulk operation is not adequate
  • Performance and scalability of services
    • storage (access and number of files)
    • job submission
    • information system
    • file catalogues
  • Services suffered from hardware problems (no fail over services)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 10

DC summary

outstanding middleware issues
Outstanding Middleware Issues
  • Collection: Outstanding Middleware Issues
    • Important: 1st systematic confrontation of required functionalities with capabilities of the existing middleware
      • Some can be patched, worked around,
      • Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE)
  • Middleware is now not perfect but quite stable
    • Much has been improved during DC’s
      • A lot of effort still going into improvements and fixes
      • Big hole is missing space management on SE’s
        • especially for Tier 2 sites

HEPIX 2004 BNL CERN IT-GD 22 October 2004 11

operational issues selection
Operational issues (selection)
  • Slow response from sites
    • Upgrades, response to problems, etc.
    • Problems reported daily – some problems last for weeks
  • Lack of staff available to fix problems
    • Vacation period, other high priority tasks
  • Various mis-configurations (see next slide)
  • Lack of configuration management – problems that are fixed reappear
  • Lack of fabric management (mostly smaller sites)
    • scratch space, single nodes drain queues, incomplete upgrades, ….
  • Lack of understanding
    • Admins reformat disks of SE …
  • Provided documentation often not read (carefully)
    • new activity started to develop “hierarchical” adaptive documentation
    • simpler way to install middleware on farm nodes (even remotely in user space)
  • Firewall issues –
    • often less than optimal coordination between grid admins and firewall maintainers
  • PBS problems
    • Scalability, robustness (switching to torque helps)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 12

site mis configurations
Site (mis) - configurations
  • Site mis-configuration was responsible for most of the problems that occurred during the experiments Data Challenges. Here is a non-complete list of problems:
    • – The variable VO <VO> SW DIR points to a non existent area on WNs.
    • – The ESM is not allowed to write in the area dedicated to the software installation
    • – Only one certificate allowed to be mapped to the ESM local account
    • – Wrong information published in the information system (Glue Object Classes not linked)
    • – Queue time limits published in minutes instead of seconds and not normalized
    • – /etc/ not properly configured. Shared libraries not found.
    • – Machines not synchronized in time
    • – Grid-mapfiles not properly built
    • – Pool accounts not created but the rest of the tools configured with pool accounts
    • – Firewall issues
    • – CA files not properly installed
    • – NFS problems for home directories or ESM areas
    • – Services configured to use the wrong/no Information Index (BDII)
    • – Wrong user profiles
    • – Default user shell environment too big
  • Only partly related to middleware complexity

integrated all common small problems into



HEPIX 2004 BNL CERN IT-GD 22 October 2004 13

running services
Running Services
  • Multiple instances of core services for each of the experiments
    • separates problems, avoids interference between experiments
    • improves availability
    • allows experiments to maintain individual configuration (information system)
    • addresses scalability to some degree
  • Monitoring tools for services currently not adequate
    • tools under development to implement control system
  • Access to storage via load balanced interfaces
    • CASTOR
    • dCache
  • Services that carry “state” are problematic to restart on new nodes
    • needed after hardware problems, or security problems
  • “State Transition” between partial usage and full usage of resources
    • required change in queue configuration (faire share, individual queues/VO)
    • next release will come with description for fair share configuration (smaller sites)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 14

DC summary

support during the dcs
Support during the DCs
  • User (Experiment) Support:
    • GD at CERN worked very close with the experiments production managers
    • Informal exchange (e-mail, meetings, phone)
      • “No Secrets” approach, GD people on experiments mail lists and vice versa
        • ensured fast response
      • tracking of problems tedious, but both sites have been patient
      • clear learning curve on BOTH sites
      • LCG GGUS (grid user support) at FZK became operational after start of the DCs
        • due to the importance of the DCs the experiments switch slowly to the new service
      • Very good end user documentation by GD-EIS
      • Dedicated testbed for experiments with next LCG-2 release
        • rapid feedback, influenced what made it into the next release
  • Installation (Site) Support:
    • GD prepared releases and supported sites (certification, re-certification)
    • Regional centres supported their local sites (some more, some less)
    • Community style help via mailing list (high traffic!!)
    • FAQ lists for trouble shooting and configuration issues: TaipeiRAL

HEPIX 2004 BNL CERN IT-GD 22 October 2004 15

support during the dcs1
Support during the DCs
  • Operations Service:
    • RAL (UK) is leading sub-project on developing operations services
    • Initial prototype
      • Basic monitoring tools
      • Mail lists for problem resolution
      • Working on defining policies for operation, responsibilities (draft document)
      • Working on grid wide accounting
    • Monitoring:
      • GridICE (development of DataTag Nagios-based tools)
      • GridPP job submission monitoring
      • Information system monitoring and consitency check
    • CERN GD daily re-certification of sites (including history)
      • escalation procedure under development
      • tracing of site specific problems via problem tracking tool
      • tests core services and configuration

HEPIX 2004 BNL CERN IT-GD 22 October 2004 16

screen shots
Screen Shots

HEPIX 2004 BNL CERN IT-GD 22 October 2004 17

screen shots1
Screen Shots

HEPIX 2004 BNL CERN IT-GD 22 October 2004 18

problem handling plan










Problem HandlingPLAN


Triage: VO / GRID


GGUS (Remedy)



HEPIX 2004 BNL CERN IT-GD 22 October 2004 19

problem handling operation most cases




Problem HandlingOperation (most cases)



Rollout Mailing List
















HEPIX 2004 BNL CERN IT-GD 22 October 2004 20

problem tracking
Problem Tracking
  • Middleware problems: SAVANNAH LCG-OPERATION
  • Re-certification: SAVANNAH LCG-SITES
  • Many (MOST) problems only tracked by e-mail
  • Much confusion on where to put problems
  • Training needed to get reasonable 1st level user support
    • canned answers
    • experts need to focus on more complex tasks
  • Unification of FAQs (RAL, Taipei, Italy, …)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 21

egee impact on operations
EGEE Impact on Operations
  • The available effort for operations from EGEE is now ramping up:
    • LCG GOC (RAL)  EGEE CICs and ROCs, + Taipei
      • Hierarchical support structure
    • Regional Operations Centres (ROC)
      • One per region (9)
      • Front-line support for deployment, installation, users
    • Core Infrastructure Centres (CIC)
      • Four (+ Russia next year)
      • Evolve from GOC – monitoring, troubleshooting, operational “control”
        • “24x7” in a 8x5 world ????
      • Also providing VO-specific and general services
    • EGEE NA3 organizes training for users and site admins
  • “NOW” at HEPiX
    • Address common issues, experiences
  • “Operations and Fabric Workshop”
    • CERN 1-3 Nov

HEPIX 2004 BNL CERN IT-GD 22 October 2004 22

part ii
  • Operation models
    • How much can be delegated to whom?
      • autonomy/ availability
    • What are the consequences?
      • cost for 24/7 with 8x5 staff
    • One/multiple models for all sites/regions?
    • One model for site integration, update, user support, security, operation?
      • latency, efficiency, distribution of workload …..
    • One size fits all?
    • Next slides are meant to stimulate discussions not give answers

HEPIX 2004 BNL CERN IT-GD 22 October 2004 23

cics and rocs and operations
CICs and ROCs and Operations
  • Core Infrastructure Centers (CICs)
    • run services like RBs, Information Indices, VO/VOMS, Catalogues
    • are the distributed Grid Operation Center (GOC)
    • and more….
  • Regional Operation Centers (ROCs)
    • coordinate activities in their region
    • give support to regional RCs
    • coordinate setup/upgrades
    • and more..
  • Resource Centers (RC)
    • computing and storage
  • Operation Management Center (OMC)
    • coordination

HEPIX 2004 BNL CERN IT-GD 22 October 2004 24

model i strict hierarchy
Model I Strict Hierarchy
  • CICs locates a problem with a RC or CIC in a region
    • triggered by monitoring/ user alert
  • CIC enters the problem into the problem tracking tool and assigns it to a ROC
  • ROC receives a notification and works on solving the problem
    • region decides locally what the ROC can to do on the RCs.
      • This can include restarting services etc.
      • The main emphasis is that the region decides on the depth of the interaction.
      • ===> different regions, different procedures
    • CICs NEVER contact a site
      • .====> ROCs need to be staffed all the time
    • ROC does it is fully responsible for ALL the sites in the region

HEPIX 2004 BNL CERN IT-GD 22 October 2004 25

model i strict hierarchy1
Model I Strict Hierarchy
  • Pro:
    • Best model to transfer knowledge to the ROCs
      • all information flows through them
    • Different regions can have their own policies
      • this can reflect different administrative relation of sites in a region.
    • Clear responsibility
      • until it is discovered it is the CICs fault then it is always the ROCs fault
  • Cons:
    • High latency
      • even for trivial operations we have to pass through the ROCs
    • ROCs have to be staffed (reachable) all the time.$$$$
    • Regions will develop their own tools
      • parallel strands, less quality
    • Excluded for handling security

HEPIX 2004 BNL CERN IT-GD 22 October 2004 26

model ii direct com local contr
Model IIDirect Com. Local Contr.
  • ROCs are active in:
    • the follow-up of problems that take longer to handle
    • setup of sites
  • CICs are active in:
    • handling problems that can be solved by simple interactions
      • communicated directly between CICs and RCs
        • ROCs are informed on all interactions between CICs and RCs
        • all problems are entered into the problem tracking tool.
      • restarting of services, etc. are handled by the RCs

HEPIX 2004 BNL CERN IT-GD 22 October 2004 27

model ii direct com local contr1
Model IIDirect Com. Local Contr.
  • Pros:
    • Resources are not lost for trivial reasons
    • Principe of local control is maintained
    • ROCs are in the loop,
      • but weak ROCs can't create too severe delays
    • No complex tools for communication management needed
      • mail + IRC sufficient
  • Cons:
    • RCs need to be reachable at all times
      • not realistic, and very expensive €€€€€€€€€€
    • CICs have to be aware of the level of maturity of O(100) RCs
    • ROCs have to monitor what is going on to learn the trade
    • Language problems between the CICs and sysadmins
    • Unclear responsibility
      • "This was reported" / "Why didn't the CICs fix it them self"

HEPIX 2004 BNL CERN IT-GD 22 October 2004 28

model iii direct com direct contr
Model IIIDirect Com. Direct Contr.
  • Like Model II with some modifications
    • CICs have access to the services on the RCs
      • can, if the RC is not staffed, manage some of the services
      • site publishes at any time
        • whether the local support is reachable or not
        • what actions are permitted by the CICs.
      • all interactions are logged and reported to RC and ROC
        • Some tools that allow very controlled (limited) access like this are under development (GSI enabled remote SUDO)
  • Variation with ROCs only interaction (IIIa)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 29

model iii direct com direct contr1
Model IIIDirect Com. Direct Contr.
  • Pros:
    • Resources are not lost for trivial reasons
    • ROCs are in the loop,
      • but weak ROCs can't create too severe delays
    • One set of tools for remote operation
      • some uniformity ---> chance for better quality
    • Site decides at any time on balance between local/remote operation
    • RCs can be run for (short) time unattended
  • Cons:
    • Set of tools for secure limited remote operation respecting the sites policies has to be put in place
    • ROCs have to monitor what is going to learn the trade
    • Unclear responsibility
      • "This was reported" / "Why didn't the CICs fix it them self"

HEPIX 2004 BNL CERN IT-GD 22 October 2004 30

sample usecases
Sample UseCases
  • User reports jobs failing on one site
  • User reports jobs failing on some/all sites
  • Monitoring shows site dropping in and out of the IS
  • An acute security incident
  • Upgrading to a new version
  • Post mortem after the security incidents
  • …….
  • Good preparation for the Operations Workshop

HEPIX 2004 BNL CERN IT-GD 22 October 2004 31

  • LCG-2 services have been supporting the data challenges
    • Many middleware problems have been found – many addressed
    • Middleware itself is reasonably stable
  • Biggest outstanding issues are related to providing and maintaining stable operations
  • Future middleware has to take this into account:
    • Must be more manageable, trivial to configure and install
    • Management and monitoring must be built into services from the start on
  • Outcome of the workshop in November is crucial for EGEE operation

HEPIX 2004 BNL CERN IT-GD 22 October 2004 32