This presentation is the property of its rightful owner.
Sponsored Links
1 / 32

EGEE is a project funded by the European Union under contract IST-2003-508833 PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

“ LCG Operation During the Data Challenges ” Markus Schulz, IT-GD, CERN [email protected] “Discussion on Operation Models”. EGEE is a project funded by the European Union under contract IST-2003-508833. Outline. Building LCG-2 Data Challenges (very brief) Problems (not so brief)

Download Presentation

EGEE is a project funded by the European Union under contract IST-2003-508833

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Egee is a project funded by the european union under contract ist 2003 508833

“LCG Operation During the Data Challenges” Markus Schulz, IT-GD, [email protected] on Operation Models”

EGEE is a project funded by the European Union under contract IST-2003-508833



  • Building LCG-2

  • Data Challenges (very brief)

  • Problems (not so brief)

  • Operating LCG

    • how it was planned

    • how it happened to be done

    • how it felt

  • What’s next?

  • I will skip many slides to leave room for discussions

Comment /Shout in REALTIME!!!!!

HEPIX 2004 BNL CERN IT-GD 22 October 2004 2



  • December 2003 LCG-2

    • Full set of functionality for DCs, first MSS integration

    • Deployed in January to 8 core sites (less sites less trouble)

    • DCs started in February -> testing in production

    • Large sites integrate resources into LCG (MSS and farms)

    • Introduced a pre-production service for the experiments

    • Alternative packaging (tool based and generic installation guides)

  • Mai 2004 -> now monthly incremental releases

    • Not all releases are distributed to external sites

    • Improved services, functionality, stability and packing step by step

    • Timely response to experiences from the data challenges

HEPIX 2004 BNL CERN IT-GD 22 October 2004 3

Lcg 2 status 22 10 2004

LCG-2 Status 22 10 2004

new interested sites should look here: release



82 Sites

~9400 CPUs

~6.5 PByte

HEPIX 2004 BNL CERN IT-GD 22 October 2004 4

Integrating sites

Integrating Sites

  • Sites contact GD Group or Regional Center

  • Sites go to therelease page

  • Sites decide on manual or tool based installation (LCFGng)

    • documentation for both available

    • WN and UI from next release on tar-ball based release

      • almost trivial install of WNs and UIs

  • Sites provide security and contact information

  • Sites install and use provided tests for debugging

    • support from regional centers or CERN

  • CERN GD certifies site and adds it to the monitoring and information system

    • sites are daily re-certified and problems traced in SAVANNAH

  • Large sites have integrated their local batch systems in LCG-2

  • Adding new sites is now quite smooth

    • problem is keeping large number of sites correctly configured

worked 80+ times

failed 3-5 times

HEPIX 2004 BNL CERN IT-GD 22 October 2004 5

Data challenges

Data Challenges

  • Large scale production effort of the LHC experiments

    • test and validate the computing models

    • produce needed simulated data

    • test experiments production frame works and software

    • test the provided grid middleware

    • test the services provided by LCG-2

  • All experiments used LCG-2 for part of their production

HEPIX 2004 BNL CERN IT-GD 22 October 2004 6

Data challenges1

  • Phase I

    • 120k Pb+Pb events produced in 56k jobs

    • 1.3 million files (26TByte) in [email protected]

    • Total CPU: 285 MSI-2k hours (2.8 GHz PC working 35 years)

    • ~25% produced on LCG-2

  • Phase II (underway)

    • 1 million jobs, 10 TB produced, 200TB transferred ,500 MSI2k hours CPU

    • ~15% on LCG-2

Data Challenges

  • Phase I

    • 7.7 Million events fully simulated (Geant 4) in 95.000 jobs

    • 22 TByte

    • Total CPU: 972 MSI-2k hours

    • >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 7

Data challenges2

  • ~30 M events produced

  • 25Hz reached

    • (only once for a full day)

Data Challenges

HEPIX 2004 BNL CERN IT-GD 22 October 2004 8

Data challenges3

  • Phase I

    • 186 M events 61 TByte

    • Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)

    • Up to 5600 concurrent running jobs in LCG-2

Data Challenges

3-5 106/day





LCG in


1.8 106/day



HEPIX 2004 BNL CERN IT-GD 22 October 2004 9

Problems during the data challenges

Problems during the data challenges

  • All experiments encountered on LCG-2 similar problems

  • LCG sites suffering from configuration and operational problems

    • not adequate resources on some sites (hardware, human..)

    • this is now the main source of failures

  • Load balancing between different sites is problematic

    • jobs can be “attracted” to sites that have no adequate resources

    • modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS

  • Identification and location of problems in LCG-2 is difficult

    • distributed environment, access to many logfiles needed…..

    • status of monitoring tools

  • Handling thousands of jobs is time consuming and tedious

    • Support for bulk operation is not adequate

  • Performance and scalability of services

    • storage (access and number of files)

    • job submission

    • information system

    • file catalogues

  • Services suffered from hardware problems (no fail over services)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 10

DC summary

Outstanding middleware issues

Outstanding Middleware Issues

  • Collection: Outstanding Middleware Issues

    • Important: 1st systematic confrontation of required functionalities with capabilities of the existing middleware

      • Some can be patched, worked around,

      • Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE)

  • Middleware is now not perfect but quite stable

    • Much has been improved during DC’s

      • A lot of effort still going into improvements and fixes

      • Big hole is missing space management on SE’s

        • especially for Tier 2 sites

HEPIX 2004 BNL CERN IT-GD 22 October 2004 11

Operational issues selection

Operational issues (selection)

  • Slow response from sites

    • Upgrades, response to problems, etc.

    • Problems reported daily – some problems last for weeks

  • Lack of staff available to fix problems

    • Vacation period, other high priority tasks

  • Various mis-configurations (see next slide)

  • Lack of configuration management – problems that are fixed reappear

  • Lack of fabric management (mostly smaller sites)

    • scratch space, single nodes drain queues, incomplete upgrades, ….

  • Lack of understanding

    • Admins reformat disks of SE …

  • Provided documentation often not read (carefully)

    • new activity started to develop “hierarchical” adaptive documentation

    • simpler way to install middleware on farm nodes (even remotely in user space)

  • Firewall issues –

    • often less than optimal coordination between grid admins and firewall maintainers

  • PBS problems

    • Scalability, robustness (switching to torque helps)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 12

Site mis configurations

Site (mis) - configurations

  • Site mis-configuration was responsible for most of the problems that occurred during the experiments Data Challenges. Here is a non-complete list of problems:

    • – The variable VO <VO> SW DIR points to a non existent area on WNs.

    • – The ESM is not allowed to write in the area dedicated to the software installation

    • – Only one certificate allowed to be mapped to the ESM local account

    • – Wrong information published in the information system (Glue Object Classes not linked)

    • – Queue time limits published in minutes instead of seconds and not normalized

    • – /etc/ld.so.conf not properly configured. Shared libraries not found.

    • – Machines not synchronized in time

    • – Grid-mapfiles not properly built

    • – Pool accounts not created but the rest of the tools configured with pool accounts

    • – Firewall issues

    • – CA files not properly installed

    • – NFS problems for home directories or ESM areas

    • – Services configured to use the wrong/no Information Index (BDII)

    • – Wrong user profiles

    • – Default user shell environment too big

  • Only partly related to middleware complexity

integrated all common small problems into



HEPIX 2004 BNL CERN IT-GD 22 October 2004 13

Running services

Running Services

  • Multiple instances of core services for each of the experiments

    • separates problems, avoids interference between experiments

    • improves availability

    • allows experiments to maintain individual configuration (information system)

    • addresses scalability to some degree

  • Monitoring tools for services currently not adequate

    • tools under development to implement control system

  • Access to storage via load balanced interfaces

    • CASTOR

    • dCache

  • Services that carry “state” are problematic to restart on new nodes

    • needed after hardware problems, or security problems

  • “State Transition” between partial usage and full usage of resources

    • required change in queue configuration (faire share, individual queues/VO)

    • next release will come with description for fair share configuration (smaller sites)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 14

DC summary

Support during the dcs

Support during the DCs

  • User (Experiment) Support:

    • GD at CERN worked very close with the experiments production managers

    • Informal exchange (e-mail, meetings, phone)

      • “No Secrets” approach, GD people on experiments mail lists and vice versa

        • ensured fast response

      • tracking of problems tedious, but both sites have been patient

      • clear learning curve on BOTH sites

      • LCG GGUS (grid user support) at FZK became operational after start of the DCs

        • due to the importance of the DCs the experiments switch slowly to the new service

      • Very good end user documentation by GD-EIS

      • Dedicated testbed for experiments with next LCG-2 release

        • rapid feedback, influenced what made it into the next release

  • Installation (Site) Support:

    • GD prepared releases and supported sites (certification, re-certification)

    • Regional centres supported their local sites (some more, some less)

    • Community style help via mailing list (high traffic!!)

    • FAQ lists for trouble shooting and configuration issues: TaipeiRAL

HEPIX 2004 BNL CERN IT-GD 22 October 2004 15

Support during the dcs1

Support during the DCs

  • Operations Service:

    • RAL (UK) is leading sub-project on developing operations services

    • Initial prototype http://www.grid-support.ac.uk/GOC/

      • Basic monitoring tools

      • Mail lists for problem resolution

      • Working on defining policies for operation, responsibilities (draft document)

      • Working on grid wide accounting

    • Monitoring:

      • GridICE (development of DataTag Nagios-based tools)

      • GridPP job submission monitoring

      • Information system monitoring and consitency check http://goc.grid.sinica.edu.tw/gstat/

    • CERN GD daily re-certification of sites (including history)

      • escalation procedure under development

      • tracing of site specific problems via problem tracking tool

      • tests core services and configuration

HEPIX 2004 BNL CERN IT-GD 22 October 2004 16

Screen shots

Screen Shots

HEPIX 2004 BNL CERN IT-GD 22 October 2004 17

Screen shots1

Screen Shots

HEPIX 2004 BNL CERN IT-GD 22 October 2004 18

Problem handling plan










Problem HandlingPLAN


Triage: VO / GRID


GGUS (Remedy)



HEPIX 2004 BNL CERN IT-GD 22 October 2004 19

Problem handling operation most cases




Problem HandlingOperation (most cases)



Rollout Mailing List
















HEPIX 2004 BNL CERN IT-GD 22 October 2004 20

Problem tracking

Problem Tracking


  • Middleware problems: SAVANNAH LCG-OPERATION

  • Re-certification: SAVANNAH LCG-SITES

  • Many (MOST) problems only tracked by e-mail

  • Much confusion on where to put problems

  • Training needed to get reasonable 1st level user support

    • canned answers

    • experts need to focus on more complex tasks

  • Unification of FAQs (RAL, Taipei, Italy, …)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 21

Egee impact on operations

EGEE Impact on Operations

  • The available effort for operations from EGEE is now ramping up:

    • LCG GOC (RAL)  EGEE CICs and ROCs, + Taipei

      • Hierarchical support structure

    • Regional Operations Centres (ROC)

      • One per region (9)

      • Front-line support for deployment, installation, users

    • Core Infrastructure Centres (CIC)

      • Four (+ Russia next year)

      • Evolve from GOC – monitoring, troubleshooting, operational “control”

        • “24x7” in a 8x5 world ????

      • Also providing VO-specific and general services

    • EGEE NA3 organizes training for users and site admins

  • “NOW” at HEPiX

    • Address common issues, experiences

  • “Operations and Fabric Workshop”

    • CERN 1-3 Nov

HEPIX 2004 BNL CERN IT-GD 22 October 2004 22

Part ii


  • Operation models

    • How much can be delegated to whom?

      • autonomy/ availability

    • What are the consequences?

      • cost for 24/7 with 8x5 staff

    • One/multiple models for all sites/regions?

    • One model for site integration, update, user support, security, operation?

      • latency, efficiency, distribution of workload …..

    • One size fits all?

    • Next slides are meant to stimulate discussions not give answers

HEPIX 2004 BNL CERN IT-GD 22 October 2004 23

Cics and rocs and operations

CICs and ROCs and Operations

  • Core Infrastructure Centers (CICs)

    • run services like RBs, Information Indices, VO/VOMS, Catalogues

    • are the distributed Grid Operation Center (GOC)

    • and more….

  • Regional Operation Centers (ROCs)

    • coordinate activities in their region

    • give support to regional RCs

    • coordinate setup/upgrades

    • and more..

  • Resource Centers (RC)

    • computing and storage

  • Operation Management Center (OMC)

    • coordination

HEPIX 2004 BNL CERN IT-GD 22 October 2004 24

Model i strict hierarchy

Model I Strict Hierarchy

  • CICs locates a problem with a RC or CIC in a region

    • triggered by monitoring/ user alert

  • CIC enters the problem into the problem tracking tool and assigns it to a ROC

  • ROC receives a notification and works on solving the problem

    • region decides locally what the ROC can to do on the RCs.

      • This can include restarting services etc.

      • The main emphasis is that the region decides on the depth of the interaction.

      • ===> different regions, different procedures

    • CICs NEVER contact a site

      • .====> ROCs need to be staffed all the time

    • ROC does it is fully responsible for ALL the sites in the region

HEPIX 2004 BNL CERN IT-GD 22 October 2004 25

Model i strict hierarchy1

Model I Strict Hierarchy

  • Pro:

    • Best model to transfer knowledge to the ROCs

      • all information flows through them

    • Different regions can have their own policies

      • this can reflect different administrative relation of sites in a region.

    • Clear responsibility

      • until it is discovered it is the CICs fault then it is always the ROCs fault

  • Cons:

    • High latency

      • even for trivial operations we have to pass through the ROCs

    • ROCs have to be staffed (reachable) all the time.$$$$

    • Regions will develop their own tools

      • parallel strands, less quality

    • Excluded for handling security

HEPIX 2004 BNL CERN IT-GD 22 October 2004 26

Model ii direct com local contr

Model IIDirect Com. Local Contr.

  • ROCs are active in:

    • the follow-up of problems that take longer to handle

    • setup of sites

  • CICs are active in:

    • handling problems that can be solved by simple interactions

      • communicated directly between CICs and RCs

        • ROCs are informed on all interactions between CICs and RCs

        • all problems are entered into the problem tracking tool.

      • restarting of services, etc. are handled by the RCs

HEPIX 2004 BNL CERN IT-GD 22 October 2004 27

Model ii direct com local contr1

Model IIDirect Com. Local Contr.

  • Pros:

    • Resources are not lost for trivial reasons

    • Principe of local control is maintained

    • ROCs are in the loop,

      • but weak ROCs can't create too severe delays

    • No complex tools for communication management needed

      • mail + IRC sufficient

  • Cons:

    • RCs need to be reachable at all times

      • not realistic, and very expensive €€€€€€€€€€

    • CICs have to be aware of the level of maturity of O(100) RCs

    • ROCs have to monitor what is going on to learn the trade

    • Language problems between the CICs and sysadmins

    • Unclear responsibility

      • "This was reported" / "Why didn't the CICs fix it them self"

HEPIX 2004 BNL CERN IT-GD 22 October 2004 28

Model iii direct com direct contr

Model IIIDirect Com. Direct Contr.

  • Like Model II with some modifications

    • CICs have access to the services on the RCs

      • can, if the RC is not staffed, manage some of the services

      • site publishes at any time

        • whether the local support is reachable or not

        • what actions are permitted by the CICs.

      • all interactions are logged and reported to RC and ROC

        • Some tools that allow very controlled (limited) access like this are under development (GSI enabled remote SUDO)

  • Variation with ROCs only interaction (IIIa)

HEPIX 2004 BNL CERN IT-GD 22 October 2004 29

Model iii direct com direct contr1

Model IIIDirect Com. Direct Contr.

  • Pros:

    • Resources are not lost for trivial reasons

    • ROCs are in the loop,

      • but weak ROCs can't create too severe delays

    • One set of tools for remote operation

      • some uniformity ---> chance for better quality

    • Site decides at any time on balance between local/remote operation

    • RCs can be run for (short) time unattended

  • Cons:

    • Set of tools for secure limited remote operation respecting the sites policies has to be put in place

    • ROCs have to monitor what is going to learn the trade

    • Unclear responsibility

      • "This was reported" / "Why didn't the CICs fix it them self"

HEPIX 2004 BNL CERN IT-GD 22 October 2004 30

Sample usecases

Sample UseCases

  • User reports jobs failing on one site

  • User reports jobs failing on some/all sites

  • Monitoring shows site dropping in and out of the IS

  • An acute security incident

  • Upgrading to a new version

  • Post mortem after the security incidents

  • …….

  • Good preparation for the Operations Workshop

HEPIX 2004 BNL CERN IT-GD 22 October 2004 31



  • LCG-2 services have been supporting the data challenges

    • Many middleware problems have been found – many addressed

    • Middleware itself is reasonably stable

  • Biggest outstanding issues are related to providing and maintaining stable operations

  • Future middleware has to take this into account:

    • Must be more manageable, trivial to configure and install

    • Management and monitoring must be built into services from the start on

  • Outcome of the workshop in November is crucial for EGEE operation

HEPIX 2004 BNL CERN IT-GD 22 October 2004 32

  • Login