
ALICE Activities on the EGEE infrastructure




  1. ALICE Activities on the EGEE infrastructure Stefano Bagnasco (INFN Torino) EGEE User Forum, Manchester (UK) May 10, 2007

  2. The ALICE Computing Model • Three kinds of grid activity • Monte Carlo production on the Grid, at Tier-1s and Tier-2s • Scheduled batch analysis on the Grid • End-user interactive analysis using PROOF and the Grid • CERN (T0) • Does: first-pass reconstruction; calibration and alignment • Stores: one copy of RAW, calibration data and first-pass ESDs • T1s • Do: reconstruction and scheduled batch analysis • Store: second collective copy of RAW, one copy of all data to be kept, disk replicas of ESDs and AODs • T2s • Do: simulation and end-user analysis • Store: disk replicas of AODs and ESDs • Resources are shared • No “localization” of groups • Fair share: group/site contribution and consumption are regulated by the accounting system • Prioritisation of jobs in the central ALICE queue • Data access only through the Grid • No backdoor access to data • No “private” processing on shared resources • No “private” resources outside of the Grid ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  3. Strategy • Minimize intrusiveness • Limit the impact on the host computer centres • Use standard services whenever possible • Centralize information • Minimise the need to “synchronise” information sources • Single “Task Queue” handling policies and priorities • Site configurations managed centrally • Virtualize resources • Job agents provide a standard environment across different systems • Xrootd as a uniform file access protocol • AliEn shell as a common user interface • Automatize operations • Provide extensive monitoring ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco
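
To make the "single Task Queue" idea concrete, here is a minimal sketch of priority-ordered matchmaking between waiting jobs and a site's capabilities. All names and fields are hypothetical illustrations of the pattern, not the actual AliEn implementation.

```python
# Hypothetical sketch of central Task Queue matchmaking (not AliEn code).
from dataclasses import dataclass
from typing import Optional

@dataclass
class WaitingJob:
    job_id: int
    priority: int                       # set centrally from group/site accounting
    required_packages: set              # e.g. {"AliRoot::v4-04"}
    close_se: Optional[str] = None      # preferred SE if input data is already placed

@dataclass
class SiteCapabilities:
    installed_packages: set
    local_se: str

def match_job(queue: list, site: SiteCapabilities) -> Optional[WaitingJob]:
    """Return the highest-priority waiting job this site can run, or None."""
    for job in sorted(queue, key=lambda j: j.priority, reverse=True):
        if not job.required_packages <= site.installed_packages:
            continue                    # required software not installed here
        if job.close_se and job.close_se != site.local_se:
            continue                    # input data is close to a different SE
        return job
    return None
```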

  4. Job submission cycle • The user submits a job to the ALICE central services; an Optimizer performs matchmaking against close SEs and available software and updates the Task Queue (ALICE Job Catalogue) • The VO-Box submits a generic job agent to the site through the LCG Resource Broker (RB) and Computing Element (CE) • On the worker node (WN) the agent checks its environment (“Env OK?”): if not, it dies with grace • Otherwise it asks the Task Queue for a work-load, retrieves it, installs the needed packages via PackMan, executes the job, sends back the job result and registers the output in the ALICE File Catalogue ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco
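
Read as pseudo-code, the cycle above is a pull loop run by the job agent on the worker node. The sketch below uses hypothetical helper objects (task_queue, packman, sandbox); it illustrates the pattern, not the AliEn API.

```python
# Sketch of the job-agent pull cycle on a worker node (hypothetical helpers).
def run_job_agent(task_queue, packman, sandbox):
    """Pull work-loads from the central Task Queue until none match, then exit."""
    if not sandbox.environment_ok():              # the "Env OK?" check in the diagram
        return                                    # die with grace
    while True:
        workload = task_queue.request_workload(sandbox.capabilities())
        if workload is None:                      # nothing in the queue matches this site
            break
        packman.install(workload.packages)        # PackMan provides the experiment software
        result = sandbox.execute(workload)        # run the actual production or user job
        task_queue.report(workload.job_id, result)
        sandbox.register_output(result.files)     # outputs go to the ALICE File Catalogue
```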

  5. Storage strategy • Worker nodes (WN) access all storage through xrootd: a VOBOX::SA xrootd manager redirects clients to xrootd (worker) servers sitting in front of each back-end • Plain disk behind xrootd: available • DPM with SRM and xrootd (worker): being deployed • CASTOR (MSS) with SRM and xrootd (worker): prototype being validated • dCache (MSS) with SRM and an xrootd emulation (worker): being deployed ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco
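
Because xrootd is the single access protocol regardless of the back-end, a client only ever deals with a root:// URL pointing at the manager. A minimal sketch using the standard xrdcp client; the redirector host and file path below are invented for illustration.

```python
# Copy a file through an xrootd redirector with xrdcp (host and path are examples).
import subprocess

def fetch_via_xrootd(manager_host: str, remote_path: str, local_path: str) -> None:
    url = f"root://{manager_host}//{remote_path}"   # the manager redirects to a worker server
    subprocess.run(["xrdcp", url, local_path], check=True)

# fetch_via_xrootd("xrootd-manager.example.org",
#                  "alice/sim/2007/run123/AliESDs.root", "/tmp/AliESDs.root")
```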

  6. Extensive monitoring • Standard SAM tests to check LCG service availability are incorporated in the VO-box • Available to Grid Support and ALICE (via MonALISA, ML) • The status of the VO-box, ALICE and WLCG services is monitored through ML • Sites are encouraged to check their status through these pages • Alarm system established • Need to automate as much as possible ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  7. Production management • Production is managed by a small team • Most are also developers • Patricia Mendez Lorenzo, Latchezar Betev, Costin Grigoras, Catalin Cirstoiu, Pablo Saiz, Andreas Joachim Peters, Predrag Buncic, Stefano Bagnasco • …plus a handful of regional experts for France, Germany, Russia, the Nordic countries, etc. • End users never see the underlying Grid • All access is through AliEn • First-line support is provided by ALICE experts • No direct contact between users and GGUS • The model is currently working, but we will need to test it with a large number of users • Need to automate as much as possible ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  8. Efficiency • Central AliEn services availability is now more than 95% • Remote site efficiencies have vastly improved since the beginning • The JobAgent mechanism compensates for site problems anyway • Still not using storage at most sites • A failover mechanism for the RB automatically handles most RB failures • Largest sources of problems: • RB problems • Storage (just CASTOR, currently) • IS problems • VO-Box failures, unscheduled downtimes, … ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco
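
The RB failover mentioned above amounts to trying a list of Resource Brokers in order until one accepts the submission. A hedged sketch of that logic follows; the submission command and broker hostnames are placeholders, not the actual VO-box implementation.

```python
# Failover across a list of Resource Brokers (placeholder command and hosts).
import subprocess

RESOURCE_BROKERS = ["rb1.example.org", "rb2.example.org", "rb3.example.org"]

def submit_with_failover(jdl_file: str) -> str:
    """Try each RB in turn; return the one that accepted the job agent."""
    for rb in RESOURCE_BROKERS:
        try:
            subprocess.run(["submit-job-agent", "--broker", rb, jdl_file],
                           check=True, timeout=300)
            return rb
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            continue                    # this RB is down or overloaded: try the next one
    raise RuntimeError("all Resource Brokers failed")
```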

  9. Grid Data Challenge • The longest-running Data Challenge in ALICE • A comprehensive test of the ALICE Computing Model • Already running for 9 months non-stop: approaching the data-taking regime of operation • Participating: 55 computing centres on 4 continents: 6 Tier-1s, 49 Tier-2s • 7 MSI2k·hours, i.e. about 1500 CPUs running continuously • 685K Grid jobs in total • 530K production • 53K DAQ • 102K user • 40M events • 0.5 PB generated, reconstructed and stored • User analysis ongoing ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  10. FTS tests • FTS tests T0→T1, September-December • Design goal of 300 MB/s reached but not maintained • 0.7 PB of DAQ data registered ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  11. Growth in one year ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  12. Last year’s contributions 43% Tier-1s 57% Tier-2s ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  13. Last month running profile ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  14. Conclusions • Job management is currently OK • Chaotic analysis jobs may be a challenge • Storage is still an issue • But we are close to having a fully working solution • VO-Box maintenance has become much easier • Better monitoring, some common problems handled automatically, better understanding of the whole system ALICE Activities on the EGEE Infrastructure - Stefano Bagnasco

  15. Atlas event production on the EGEE infrastructure Xavier Espinal (PIC) EGEE Users Forum, Manchester (UK) 10 May 2007

  16. Outline • Introduction • Grid and simulated production • Events produced and CPU consumption • Simulated production • Review • Job and WCT efficiencies • Common errors • Operations team • Structure, shift system and operations • Next steps • Conclusions Atlas Activities on the EGEE Infrastructure -Xavier Espinal

  17. Grid and simulated production • ATLAS is one of the four LHC experiments: • The experiment's requirements for the next quarter are 400 TB of storage and 3.5 MSI2k of CPU • Simulated events are produced all over the EGEE infrastructure: • A distributed system, relying on the Grid to profit from the distributed resources • Simulated production is crucial: • To provide simulated data for physics studies • To validate the computing model, the data model and the ATLAS software suite • To use for studies and comparison with the future data • More than 100 million events produced since Jan 2006 • And it relies on Grid tools… • LCG-UTILS (stage-in and stage-out of files) • FTS (data management) • glite-WMS • LCG-CE (job handling) • SRM (storage management) • LFC (file catalogue) Atlas Activities on the EGEE Infrastructure -Xavier Espinal
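
As a rough illustration of how a production job strings these tools together, the sketch below stages a file in with lcg-cp, leaves a placeholder for the payload, and stages the output out (registering it in the LFC) with lcg-cr. The hostnames, LFNs and option set are illustrative and should be checked against the lcg-utils version actually deployed.

```python
# Indicative job-wrapper sketch around lcg-utils (hosts, paths and flags are illustrative).
import os
import subprocess

os.environ["LFC_HOST"] = "lfc.example.org"          # LFC used to resolve lfn: names

def stage_in(lfn: str, local: str) -> None:
    subprocess.run(["lcg-cp", "--vo", "atlas", f"lfn:{lfn}", f"file://{local}"],
                   check=True)

def stage_out(local: str, lfn: str, dest_se: str) -> None:
    # Copies the file to the destination SE and registers the replica in the LFC.
    subprocess.run(["lcg-cr", "--vo", "atlas", "-d", dest_se, "-l", f"lfn:{lfn}",
                    f"file://{local}"], check=True)

# stage_in("/grid/atlas/mc/evgen.pool.root", "/tmp/evgen.pool.root")
# ... run the simulation / reconstruction step here ...
# stage_out("/tmp/simul.pool.root", "/grid/atlas/mc/simul.pool.root", "srm.example.org")
```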

  18. Simulated production review • Review: events produced since 1 Jan 2006 (plot of finished jobs per day, peaking around 30k) • Annotations on the plot: disk space crisis, ramp-up challenge, operations team started in Nov 2006 Atlas Activities on the EGEE Infrastructure -Xavier Espinal

  19. Job and CPU efficiencies • Efficiencies: job and wall-clock time (WCT) • Loss of WCT is mainly caused by data management problems while staging files in and out: • Stage-in: sleeps and retries when getting the file (SE problems); hung commands can take 30 minutes before being aborted • Stage-out: the job fails to store its output, losing all the consumed CPU time • Errors coming from wrong configuration (WN, executors, …) abort the job almost instantaneously and barely consume CPU • WCT efficiency is the relevant parameter: wasted CPU is expensive and annoying • WCT efficiency for EGEE has increased almost continuously: • Average: 76% (2006) → 86% (2007 so far) • Job efficiency is also increasing: • Average: 40% (2006) → 55% (2007 so far) Atlas Activities on the EGEE Infrastructure -Xavier Espinal
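
The 30-minute hangs mentioned above are the usual argument for giving each copy attempt its own timeout and bounding the retries. A generic sketch (not ATLAS production code); the command passed in would be, for example, an lcg-cp invocation.

```python
# Bounded retries with a per-attempt timeout, so a single hung copy cannot eat 30 minutes.
import subprocess
import time

def copy_with_retries(cmd: list, attempts: int = 3, timeout_s: int = 300) -> bool:
    for attempt in range(attempts):
        try:
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return True
        except subprocess.TimeoutExpired:
            pass                                  # hung command killed after timeout_s
        except subprocess.CalledProcessError:
            pass                                  # SE refused or the transfer failed
        time.sleep(30 * (attempt + 1))            # back off before the next attempt
    return False
```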

  20. Job and CPU efficiencies • Finished vs. failed WCT (plot), period: December '06 to now • Lost WCT is under control even with a job efficiency of about 50% • Period efficiencies: WCT = 86%, job = 54% Atlas Activities on the EGEE Infrastructure -Xavier Espinal
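
For clarity (the slide does not spell them out), the two efficiencies quoted here follow the usual definitions; the gap between them reflects the fact that most failed jobs die early and waste little wall-clock time:

```latex
\varepsilon_{\mathrm{WCT}} = \frac{\mathrm{WCT}_{\mathrm{finished}}}{\mathrm{WCT}_{\mathrm{finished}} + \mathrm{WCT}_{\mathrm{failed}}}
\qquad
\varepsilon_{\mathrm{job}} = \frac{N_{\mathrm{finished}}}{N_{\mathrm{finished}} + N_{\mathrm{failed}}}
```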

  21. Most common errors • Typical WCT error pie chart (Dec '06 to now; slice labels on the chart: 43%, 39%, 26%, 15%, 10%) • Stage-out: lcg-cr failed (SE problems); frequently transient errors; the job finished but all CPU is lost • Wrong WN/site configuration • ATLAS-specific software errors • Stage-in: lcg-cp failed; no file replicas found; md5sum errors Atlas Activities on the EGEE Infrastructure -Xavier Espinal

  22. Operations team • The operations team takes care of ATLAS production on EGEE • Operation is based on an organized shift system • Shifters work together during a whole week: a production coordinator, 2 “senior” shifters and 2 “trainees” are on duty: • Production coordinators: control task assignment and cloud production and monitor the overall production activity • Shifters are separated into two working groups: • 1) Workload management: performs job babysitting • 2) DDM management: controls the correct data flow (job inputs and outputs) • Meetings: • The operations meeting is held weekly by phone conference • In person every three months (ATLAS software weeks) • Active dedicated mailing list among the members and the ATLAS computing community • The eLog web system is used to track incidents • An extremely useful and fruitful experience! Atlas Activities on the EGEE Infrastructure -Xavier Espinal

  23. Next steps • A continuous ramp-up of production is expected: • No longer humanly scalable, so… • …the next step is to start automating error spotting and reporting • Keep working on monitoring: • Workload management monitoring has improved a lot and is extremely useful • Data management monitoring is still a bit “dark” • Last February a fruitful collaboration began with the CERN IT division to investigate the most frequent and relevant errors • Reported and reviewed monthly within the ATLAS LCG-EGEE Task Force meetings • Keep this collaboration and periodically track specific problems “to the end” • Automatic task assignment: • Flooded with requests • After a task is defined by the physics coordinators it has to be assigned to a certain cloud • Task assignment has to be done in an intelligent way to minimize data movement Atlas Activities on the EGEE Infrastructure -Xavier Espinal
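
As a toy illustration of "intelligent" assignment that minimizes data movement (not the actual ATLAS production system), one can simply send each task to the cloud that already hosts the largest share of its input datasets:

```python
# Toy cloud-assignment heuristic: send the task where most of its input already sits.
def choose_cloud(input_datasets: set, replicas_by_cloud: dict) -> str:
    """replicas_by_cloud maps a cloud name to the set of datasets it hosts."""
    def locally_available(cloud: str) -> int:
        return len(input_datasets & replicas_by_cloud[cloud])
    return max(replicas_by_cloud, key=locally_available)

# choose_cloud({"dsA", "dsB"}, {"FR": {"dsA"}, "DE": {"dsA", "dsB"}, "UK": set()})
# -> "DE": both input datasets are already in the DE cloud, so nothing needs to move.
```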

  24. Conclusions • The production system showed that it could cope with the requirements during the production ramp-up challenge (Nov 06 - Jan 07) • Almost all disk space for ATLAS was quickly filled • Job and WCT efficiencies have been improving almost continuously: • A more experienced team • Infrastructure and system better known and more debugged • Improvements in LFC and lcg-utils • The operations group and shift system have proven to be extremely necessary • The monitoring pages have been improved a lot and ease the work of the shifters • Need to improve: • Data management, as the staging problems are the dominant ones • Looking forward to the new implementation of the SRM (Storage Resource Manager) interface, which should solve stability problems, mainly in the stage-in and stage-out of the files needed and produced by the jobs • glite-WMS is expected to handle the jobs in a more reliable way • Introduce different CE implementations that would not have scalability limitations • Automate error spotting and reporting • Automate task assignment Atlas Activities on the EGEE Infrastructure -Xavier Espinal

  25. The EGEE production team • Many people are involved, but not completely dedicated to production • Shifters contribute about 1 week/month, yielding a manpower of about 1.5 FTEs per week • EGEE production team: • Production coordinators: Simone Campana and Rod Walker • Monitoring: John Kennedy • Database tools: Suijian Zhou • Shift coordination: Xavier Espinal • Senior shifters: Silvia Resconi, Mei Wen, Alessandra Doria, John Kennedy, Luis March, Xavier Espinal, Suijian Zhou, Carl Gwilliam, Guido Negri • Trainee shifters: Elisabetta Vilucchi, Agnese Martini, Marcel Schroers, Kondo Gnanvo, Jaroslav Guenther, Miroslav Jahoda, Jordi Nadal, Lukas Masek • French cloud shifters team: Sandrine Laplace, Frederic Derue, Jerome Schwindling, Karim Bernardet, Terront Trujillo Atlas Activities on the EGEE Infrastructure -Xavier Espinal

  26. LHCb DC06 experience: results and lessons learned Roberto Santinelli (CERN) EGEE User Forum, Manchester, 9-11 May www.eu-egee.org

  27. Outlook • LHCb Data Challenge 06 (DC06) • DC06 results • DC06 lessons learned LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  28. LHCb DC06 • Original LHCb DC06 goals: • to produce simulated data for the “physics book” • to make a realistic test of the computing model, in order to mimic what LHCb will have to do with real data • Started in May/June 2006, it is still running • By DC06 we mean today the whole set of physics production and reprocessing activity on the WLCG infrastructure LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  29. The DC06 data flow • RAW data flows from the online farm to the T0 SE at 60 MB/s; RAW is redistributed via FTS over the T1s (NIKHEF/SARA, GridKA, RAL, CNAF, Lyon, PIC), each with its own VO-BOX (DIRAC services), with a share of (6.7 × 6) MB/s • Reconstruction and preselection criteria are applied over RAW data at the various T1s; DST+RAW+TAG (~12 MB/s each T1) are redistributed over all T1s • Today: MC generation from T2s, but also from T1s and CERN; simulation throughput: 5K jobs of 450 MB each, i.e. 20-30 MB/s; from 2008 onward: real data taking • The challenge IS NOT the total data throughput BUT reprocessing data produced and stored (according to the computing model) at a different time and location • A new and unknown brand of problems had/has to be faced (data access from the WN, replication and staging, data corruption, data availability and integrity at sites, etc.) • Other services in the picture: FTS, LFC, VO-BOX (DIRAC services) LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  30. Results (from the LHCb perspective) • DC06 starts: June '06 • Activity temporarily frozen in March-April '07: bug in the LHCb application plus T1 disk-space clean-up • Up to ~9700 concurrent jobs running on LCG! • Reconstruction vs. simulation prioritization mechanism in place in DIRAC LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  31. Results (from the LHCb perspective) • Plot annotations: Xmas break; bug in the GAUSS generator • Up to 2.5K concurrent jobs at the large T0/T1 sites (CERN, GridKA, CNAF)… • …but up to 1.5K jobs at some UK T2s (Manchester, QMUL) - congratulations! LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  32. Results (from the LHCb perspective) • During DC06, from June 2006 to March 2007, roughly 80% of the whole LHCb production was produced! LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  33. Results (from the EGEE perspective) • In the UK, IT and SW regions: LHCb was the main VO • January-February-March: the main VO was lhcb LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  34. Results (from the EGEE perspective) • LHCb used WLCG in a most efficient and extensive way during the last year • LHCb has used WLCG continuously over the last years without major interruptions! LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  35. Lessons: GGUS and site responsiveness • One of the first GGUS tickets from LHCb dates from more than 2 years ago! • GGUS is useful for traceability, for statistics and for acquiring know-how about grid problems • In order to speed up resolution and/or apply the right pressure, LHCb actively reports to the COD/ROC infrastructure and keeps a direct tie with sysadmins (a very efficient approach) • (Plot: GGUS tickets during DC06) LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  36. Lessons: Data Management System • Data access problems via remote protocol access from the WN • For dCache sites, srm_get wasn't staging the file • it just returned a TURL • Authorization issues associated with the use of VOMS • one associated with the gridmap file • configuration of gPlazma • Stability of dCache servers under heavy load • Recently discovered problem tied to a dCache bug • file registered but not physically on disk • Transfer problems (using lcg_utils, efficiency <50%) • Many instabilities with gsidcap doors • though the current situation is a lot better • Transfer failures to CASTOR • “Resource busy” message from CASTOR due to a corrupted entry (from previous transfers that timed out or failed) that CASTOR (for consistency) refuses to overwrite • ROOT (AA) seems to be completely disconnected from SRM • Need to manipulate the TURL format before passing it to ROOT: dcap:gsidcap://…, castor://, number of slashes, rfio format LHCb Activities on the EGEE Infrastructure -Roberto Santinelli
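
The TURL manipulation in the last bullet is essentially string rewriting done in the job before opening the file with ROOT. The sketch below shows the flavour of it; the rewrite rules are examples only, not the exact ones LHCb applied.

```python
# Illustrative TURL clean-up before handing a file to ROOT (example rules only).
def normalize_turl(turl: str) -> str:
    if turl.startswith("dcap:gsidcap://"):        # doubled protocol prefix from some SRMs
        turl = turl[len("dcap:"):]
    if turl.startswith("castor://"):              # example: open CASTOR files via rfio instead
        turl = "rfio://" + turl[len("castor://"):]
    # The number of slashes after the protocol also had to be adjusted per back-end;
    # that rule is protocol- and site-specific, so it is deliberately left out here.
    return turl

# normalize_turl("dcap:gsidcap://door.example.org:22128/pnfs/example.org/data/lhcb/file.dst")
# -> "gsidcap://door.example.org:22128/pnfs/example.org/data/lhcb/file.dst"
```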

  37. Lessons: Data Management System • SE instabilities in the last quarter of '06 • limited time during which all 7 T1s were running reconstruction fine • Backlog of replication (failover mechanism): 80K DSTs to be moved to their final destination • Sites upgrading their storage, but sometimes to an unstable version • SRM v2.2 deployment/testing • Relatively late deployment of SRM 2.2 w.r.t. real data taking • Interfaces to SRM v2.2 • Bulk operations not clearly supported • file removal, metadata queries, … • Also, currently no support for file pinning in GFAL or lcg-utils • Currently in SRM v1.1 there is no generic way to stage files nor an efficient and coherent way to remove files LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  38. Lessons: Deployment & Release Procedure • Use of LCG AA • Early client exposure • Allows LHCb to test early in the production environment • Quick feedback for client developers • E.g. lcg copy allowing SURL-to-SURL copies, requested in July '06, is only now being rolled out with gLite • Issues associated with compatibility • Recent problems associated with globus libraries and the use of lcg-utils • Early exposure to the VO in parallel to later certification • Useful to allow the VO to test in the production environment • E.g. LHCb is still not using the gLite RB in production • The version of the gLite RB provided to LHCb at CERN is known to have problems • In the past, time was spent testing a deployed version that was known to be problematic • Central VOMS servers • Missing an automatic and efficient mechanism for propagating changes in the central VOMS server to the sites • (the LHCb group mapping schema was honoured only 6 months after the request!) LHCb Activities on the EGEE Infrastructure -Roberto Santinelli

  39. Lessons: Information System • Consistency of the information published by different top-level BDIIs (FCR or not FCR, order of the various blocks, different pools of sites published) • This prevents a truly load-balanced service spread among different sites • Latency of information propagation causes flooding of CEs (the published free slots do not reflect the real situation on the LRMS) • The effect is amplified by using multiple RBs: the PIC use case • The introduction of VOViews helps quite a lot • Instability/scalability of the system • lcg-utils failing just because a (working) SE was not published • CEs appearing/disappearing from the BDII (while they were OK) • Content of the information • Disk space consumed and space left published by SRM, granularity of the information, OS and platform advertised in a coherent way (SL, SL4, SLC4, Scientific Linux 4, …) • Splitting (pseudo-)static information from dynamic information would be beneficial • the amount of information shipped over the network (and hence the latency) could be reduced • it would improve the stability of the system and allow more reliable access to the static information required by DM clients LHCb Activities on the EGEE Infrastructure -Roberto Santinelli
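
For context, clients and RBs read this information from the BDII over LDAP. Below is a hedged sketch of such a query; the port, LDAP base and Glue 1.x attribute names are the conventional ones of that era, quoted from memory, so treat them as assumptions to verify against the deployed schema.

```python
# Query CEs and their published free job slots from a top-level BDII over LDAP.
# Port 2170, base "mds-vo-name=local,o=grid" and the Glue 1.x attribute names are
# assumptions based on common usage of the time; verify before relying on them.
import subprocess

def published_free_slots(bdii_host: str) -> str:
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", f"ldap://{bdii_host}:2170",
           "-b", "mds-vo-name=local,o=grid",
           "(objectClass=GlueCE)",
           "GlueCEUniqueID", "GlueCEStateFreeJobSlots"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# print(published_free_slots("top-bdii.example.org"))
```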
