
ADC Weekly Meeting, May 8 2012 – Annecy 2012 Technical Interchange Meeting Highlights



  1. ADC Weekly Meeting, May 8 2012
  Annecy 2012 Technical Interchange Meeting Highlights
  Simone Campana – CERN IT/ES

  2. Introduction
  • 2-day meeting (Wed PM to Fri AM)
  • 4 sessions:
    • Data Management
    • Production System
    • Analysis
    • Networking
  • Plus one session with invited speakers: Intel, EPFL (Miguel Branco)
  • Many thanks to the session conveners for the material
  ADC Weekly, 8/5/2012

  3. TIM April 2012 – Data Management Session
  Highlights and Action Items for ADC Weekly, CERN, 8 May 2012
  S. Campana, V. Garonne, I. Ueda

  4. Data Management
  • Storage Federations
    • xrootd is the only realistic solution for the medium term
    • The use case focuses on failover for data access
    • More advanced use cases can be explored in the future ("repairing" data, file-level caching)
  • CMS experience in pre-production (failed-access recovery)
    • CMS has spent a lot of time (and will spend more) on CMSSW I/O tuning (reducing the number of reads and increasing read-ahead hits) – key for success in WAN access
  • ATLAS experience in USATLAS R&D
    • Automated tools for WAN tests on top of HC
    • Integration of the xrootd federation with PanDA is in progress
  • Many open questions
    • Security, monitoring, content publication
    • MB recommended creating topical working groups
  • ATLAS will try to expand the experience with xrootd federations outside the US
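The failover use case mentioned above can be illustrated with a minimal sketch: try the local storage replica first and, only on failure, read the file through the federation redirector. All hostnames, paths and the opener callback are illustrative, not real ATLAS endpoints or tools.

```python
# Minimal sketch of "failover for data access": attempt the local replica,
# then fall back to the federation redirector. Hosts/paths are made up.

LOCAL_PREFIX = "root://localse.example.org//atlas/"
FEDERATION_PREFIX = "root://federation-redirector.example.org//atlas/"

def open_with_failover(lfn, opener):
    """Try the local replica, then the federation; 'opener' does the real I/O."""
    for prefix in (LOCAL_PREFIX, FEDERATION_PREFIX):
        try:
            return opener(prefix + lfn)
        except IOError:
            continue  # local replica missing/unreadable: fall back to federation
    raise IOError("no replica of %s reachable" % lfn)
```

The point of the pattern is that the application never fails outright on a missing local replica; the more advanced use cases (repair, caching) would hook into the same fallback step.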

  5. Data Management
  • Transfer Services
    • FTS will remain the baseline transfer service
    • FTS3 will cure known architectural issues (channel concept, plugin support for protocols)
    • FTS3 prototype in June, multi-VO testing
  • Point-to-point protocols
    • gridFTP as baseline; the new version and session reuse will help reduce overheads
    • Xrootd is an alternative; needs to be supported on all systems (see also the discussion on federations)
    • HTTP is a serious option; needs more integration and testing
  • SRM
    • Functionalities will be slowly replaced
    • A core set of functionalities will remain (access to MSS)
    • Positive experience with BeStMan + gridFTP + Lustre at OU SWT2
  • Interesting analysis from DDM Tracer data; further studies suggested

  6. Data Management
  • Rucio
    • Architecture and prototype API now available
    • Rucio demo in June, prototype in October
  • Case sensitivity
    • Would like to move to case-sensitive dataset and file names in DDM (UNIX-like)
    • No strong online or offline objections; will try to agree at the June SW week
  • Rucio scope
    • Proposal presented, but possible issues with the usage of "campaigns"
    • Being re-thought; the DDM team will present a new proposal soon
  • Naming convention for files at sites in Rucio
    • Controversial discussion (less intuitive organization of files at sites for local access)
    • Being re-iterated within ADC and with Data Prep and PhysCoord (ICB?)

  7. TIM April 2012 – ProdSys Session
  Highlights and Action Items for ADC Weekly, CERN, 8 May 2012
  K. De, A. Filipcic, A. Klimentov, R. Walker and A. Vaniachine

  8. Production System and Grid Data Processing
  • Progress since the TIM in Dubna
    • APF status: PY factory to be replaced, still manual config files, pending integration with AGIS, fair-share policy implementation
    • HLT task requests: real-time definition of tasks and jobs
    • Multi-cloud production widely used; Tier-2 usage
  • Short-term plans
    • Job submission vs. resource heterogeneity
    • AKTR et al. overload: processing 10k+ task requests with 90k+ output datasets
    • The previous overload happened about a year ago, at the time of the TIM in Dubna; not clear why these rare events (overloads and TIMs) are correlated in time
    • Monitoring and better integration with the SSB
  Alexei Klimentov – TIM Highlights
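One way to picture the fair-share policy mentioned for the factory is: given target shares per activity and recent usage, submit the next pilot for the activity that lags its target the most. This is a generic sketch, not the APF implementation; activity names and counters are illustrative.

```python
# Generic fair-share sketch (not the actual APF code): pick the activity whose
# fraction of recent usage falls furthest below its target share.

def pick_next(shares, usage):
    """shares: {activity: target fraction}; usage: {activity: recent pilot count}."""
    total = sum(usage.values()) or 1  # avoid division by zero with no history
    def deficit(activity):
        # positive deficit = activity is under-served relative to its share
        return shares[activity] - usage.get(activity, 0) / total
    return max(shares, key=deficit)
```

Repeatedly applying this rule drives the observed usage fractions toward the configured shares.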

  9. Dynamic Job Definition (JEDI)
  • JEDI core foundations
    • No predefined (and pre-assigned) jobs
    • Task request: database templates
    • "Late" dataset registration
  • Reassessment of PandaDB and ProdDB
    • Understand the benefits of redundancy
    • Separation of concerns
    • Task post-processing
  • If you do not like the name "JEDI", the alternative is "PDJD" … Panda Dynamic Job Definition

  10. Dynamic Evolution for Tasks (DEfT)
  • The rate of task requests grows exponentially
  • Linear growth in users and support requests
  • Growing list of requirements and use cases
    • New use cases: HLT, FTK, user analysis tasks
  • First ideas about the new architecture and how JEDI and DEfT will be developed
  • ProdSys technical meeting in Ljubljana (June 2012) to discuss JEDI and DEfT development

  11. ProdSys session II
  • Rucio/DDM and ProdSys/PanDA overlaps
    • What we want to keep and what we want to drop
  • Multi-core jobs
    • Ready for full grid production in a simple scenario
  • glideinWMS studies
    • Work in progress to find the limits of the various components

  12. TIM April 2012 – Distributed Analysis Session
  Highlights and Action Items for ADC Weekly, CERN, 8 May 2012
  F. Barreiro, D. Benjamin, D. Van Der Ster

  13. ATLAS & CMS Common Analysis Framework
  • Initiative from CERN IT-ES, ATLAS and CMS
  • Assess the potential of a common analysis solution based on PanDA and glideinWMS
  • Currently at the end of the Feasibility Study: http://cern.ch/go/9mNC
    • Compare and analyze the experiments' workflows and architectures
    • Identify dependencies, what can be reused, and potential show-stoppers
    • Study and compare sub-components: server sides, PanDA pilot and pilot factories, glideinWMS
    • Evaluate integration scenarios for PanDA and glideinWMS, ensuring no loss of functionality
  • Prepare a final document with conclusions and a proposal for a Proof of Concept
    • To be validated by the experiments
    • In case of a green light, used as input for the coming Functionality and Operations Studies

  14. Improving Job Efficiency
  • Server-side retries
    • Only 20% of failures are "retriable"
    • Normally OK at the 2nd attempt; a 3rd attempt is useless
    • Non-retriable failures are mostly "athena" (well… something else, but masked by athena); work will be done to account for those properly
  • proot
    • Main goal is to catch failures and categorize them properly (besides correctly setting the ROOT environment)
    • This is difficult if you do not "own" the event loop
    • So an EventLoop package and its grid driver have been developed
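The retry policy above boils down to two rules: only a small class of errors is worth retrying at all, and a second failure ends the retries (the slide notes a 3rd attempt is useless). A minimal sketch, with made-up error codes rather than real PanDA error categories:

```python
# Sketch of the server-side retry rule: retry only "retriable" errors,
# and never go beyond a 2nd attempt. Error codes are illustrative.

RETRIABLE = {"TEMP_STAGEIN_FAILURE", "WORKER_NODE_LOST", "TRANSIENT_SE_ERROR"}
MAX_ATTEMPTS = 2  # attempt 1 = original run, attempt 2 = the single retry

def should_retry(error_code, attempt_number):
    """Decide server-side whether a failed (sub)job should be resubmitted."""
    if error_code not in RETRIABLE:
        return False  # the ~80% of failures that are not worth resubmitting
    return attempt_number < MAX_ATTEMPTS  # retry once; no 3rd attempt
```

Making this decision on the server (rather than in client tools) is what allows the accounting of non-retriable "athena" failures to be fixed centrally.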

  15. Server-Side Tasks
  • Current issues
    • Many actions today happen client-side => slowness (data discovery, job splitting, dataset registration, retry)
    • No task concept in PanDA => complicated bookkeeping; user interest is in tasks rather than subjobs
  • Start moving client functionalities to the server side
    • Simplify client tools, centralize functionalities, improve bookkeeping
    • Introduce a task concept in PanDA (task/jobset table)
    • Modify clients to submit tasks/jobsets (instead of subjobs)
    • Implement subjob definition server-side
    • Evolve the PanDA server to handle subjobs and task/jobdef synchronization in the DB
  • Change bookkeeping tools
    • Interact with the task/jobdef table directly
    • Send retry commands to be executed by the server
  • Move toward server-side task management
    • Straightforward once job submission is moved server-side
    • The missing piece is task chaining
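The shift described above can be sketched in a few lines: the client submits one task, and the server defines the subjobs, here by simply chunking the input file list. The field names, the table layout and the chunk size are illustrative, not the actual PanDA task/jobset schema.

```python
# Sketch of server-side job splitting under a task concept: one submitted
# task is expanded into subjob rows. Schema and chunking are illustrative.

def split_task(task_id, input_files, files_per_subjob=5):
    """Define subjobs for a task server-side; returns rows for a task/jobset table."""
    subjobs = []
    for i in range(0, len(input_files), files_per_subjob):
        subjobs.append({
            "task_id": task_id,            # bookkeeping happens per task...
            "subjob_id": len(subjobs),     # ...while execution happens per subjob
            "inputs": input_files[i:i + files_per_subjob],
        })
    return subjobs
```

Because the split happens on the server, a retry command only needs to reference the task: the server already knows which subjobs exist and which inputs each one carried.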

  16. Pilot Plans/Ideas
  • Moving to "experiment" plugins
    • Refactor/clean the pilot code; provides a better platform for many contributors
    • Job recovery simplified
    • Could be used outside the US (UK interest)
    • Could be used for analysis (to be evaluated)
  • Stage-in/out
    • Stage-out retry to the T1 (instead of locally): under development
    • Stage-in retry from another source: leverage the xrootd federation
  • ErrorDiagnostic class in development and a DEBUG mode for pilots
    • Avoid "grepping" logfiles, modularize, etc.
    • Peeking capability
  • Many others … help needed
    • The common-solution initiative should bring in more contributors
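The stage-out retry idea above is easy to sketch: if the copy to the local storage element fails, the pilot retries against the cloud's Tier-1 rather than retrying the same local endpoint. Endpoint names and the copy callback are illustrative, not the real pilot mover API.

```python
# Sketch of "stage-out retry to the T1 instead of local": one local attempt,
# then one attempt against the Tier-1. Endpoints and callback are made up.

def stage_out(local_file, local_se, tier1_se, copy):
    """Copy a job output; on local-SE failure, fall back to the cloud's T1."""
    try:
        return copy(local_file, local_se)
    except IOError:
        # local SE unavailable: retrying locally would likely fail again,
        # so retry against the Tier-1 instead
        return copy(local_file, tier1_se)
```

The stage-in counterpart is symmetric, with the xrootd federation playing the role of the alternative source.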

  17. Conclusions
  • A very productive workshop
    • Some subjects probably deserved a bit more time for discussion
  • ADC software is by no means "frozen"
    • It needs to keep up with demand
  • Strong focus on commonalities for long-term sustainability
  • Several ideas/plans will be followed up in the coming months in ADCDev and ADCOps
    • Plus dedicated workshops (e.g. ProdSys in Ljubljana)
