
Grid services for CMS at CC-IN2P3


Presentation Transcript


  1. Grid services for CMS at CC-IN2P3
  David Bouvet, Pierre-Emmanuel Brinette, Pierre Girard, Rolf Rumler
  CMS visit – 30/11/2007

  2. Content
  • Deployment status
  • Tier-1 consolidation + tier-2 creation
  • Grid site infrastructure
  • Tier-2 site
  • Grid services for CMS
  • Major concerns
    • Global
    • Operational
    • Technical
  David Bouvet - CMS visit

  3. Deployment status
  • Official versions: 3.0.2 Update 36 (SL3) and 3.1 Update 06 (SL4_32)
  • Tier-1 site:
    • 1 load-balanced Top BDII (2 SL4 machines), for local use only until the official SL4 release is validated
    • 2 FTS: v2.0 and v1.5 (to be decommissioned)
    • 1 central LFC (Biomed) + 2 local LFCs
    • 1 VOMS server (Biomed, Auvergrid, Embrace, EGEODE, and local/regional VOs)
    • 1 regional MonBox
    • 1 site BDII
    • 2 SRM SEs + 1 test SRMv2 SE
    • 3 LCG CEs: 3.0.5, 3.0.13, 3.0.14 instead of 3.0.19, configured for the SL4/32-bit farm
    • partially updated:
      • UI/WN SL3: 3.0.22-2 instead of 3.0.27
      • UI/WN gLite 3.1: SL4/32
    • RLS/RMC and classic SEs phased out (June and September)
  • Tier-2 site:
    • LCG CE 3.0.11 for SL3, VOs ATLAS and CMS
    • LCG CE 3.0.14 for SL4, all VOs
    • publishes the 2 SRM SEs of the tier-1
  • PPS site:
    • test bed for local adaptations (BQS jobmanager, information provider)
    • 2 LCG CEs with the latest production release: 3.0.19-0
    • used for the local test VO (vo.rocfr.in2p3.fr)
    • FTS v2.0 and LFC for the LHC VOs

  4. Tier-1 consolidation + tier-2 creation
  • Site changes this year:
    • all CEs on the T1 (IN2P3-CC) configured for the SL4/32-bit farm
    • a new tier-2 site (IN2P3-CC-T2) dedicated to analysis facilities
      • just one CE still configured on the T2 for the SL3 farm (will disappear)
      • 2 CEs (1 per LHC VO)
      • 1 FTS (channel distribution)
      • 1 LFC (for replication of the LHCb central LFC)
      • 1 dCache SE (managed by the Storage group)
    • 2 "classic" SEs, the LDAP server (Auvergrid) and RLS/RMC (Biomed) phased out
  • Commitment to provide a load-balanced Top BDII (SL4) for France
  • Machine upgrades:
    • to V20Z or better, as SL4 versions of middleware components become available
  • Spare machines:
    • 1 spare CE (in case of hardware problems)
    • pre-installed VMs for Top BDII/site BDII/VOMS/LFC/FTS (for updates or in case of hardware problems)
      • used during the major power outages necessary for power-supply upgrades

  5.–7. Grid site infrastructure
  [Diagram, built up over three slides, successively highlighting the local LFC and the FTS: global services (Top BDII serving 7 sites, VOMS serving 7 VOs, central LFC for Biomed, MonBox), regional/local services (FTS and local LFC for the 4 LHC VOs, site BDII, LHC VO boxes), and the site fabric: computing elements in front of the BQS batch system and the Anastasie farm of worker nodes, and SRM storage elements in front of dCache and HPSS.]

  8. Tier-2 site
  • T1 and T2 are deployed in the same computing centre:
    • sharing the same computing farm and using the same LRMS
    • while remaining able to manage the production of each grid site separately
  • T1 site policy:
    • T1 job slots = (CMS job slots × #CPU_T1) / (#CPU_T1 + #CPU_T2)
    • VOMS role « lcgadmin »
    • VOMS role « production »
    • regular users
  • T2 site policy:
    • T2 job slots = (CMS job slots × #CPU_T2) / (#CPU_T1 + #CPU_T2)
    • VOMS role « lcgadmin »
    • VOMS role « production »
    • regular users
  • Mapping strategy revisited on our CEs:
    • by prohibiting account overlap between the local sites
    • by splitting the grid accounts into 2 subsets
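The slot formulas above can be sketched as a small Python function; the function name and the CPU counts in the example are illustrative, not real CC-IN2P3 numbers:

```python
def split_job_slots(vo_slots: int, cpu_t1: int, cpu_t2: int) -> tuple[int, int]:
    """Divide a VO's job slots between T1 and T2 pro rata to their CPU counts."""
    total = cpu_t1 + cpu_t2
    t1 = vo_slots * cpu_t1 // total   # integer share for the tier-1
    t2 = vo_slots - t1                # remainder goes to the tier-2
    return t1, t2

# e.g. 1000 CMS slots over a 600-CPU T1 and a 400-CPU T2 -> (600, 400)
```

Giving the remainder to the T2 keeps the two shares summing exactly to the VO's total, whatever the CPU ratio.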

  9. Tier-2 site: T1 only
  [Diagram: the T1 site BDII publishes CE01, CE02 and CE03, all governed by one CMS mapping policy: the VOMS roles « production » and « lcgadmin » are mapped to the dedicated accounts cmsgrid and cms050, and all other users to the pool accounts cms[001-049].]
  • Site policy covers:
    • AFS read/write access
    • BQS priorities
    • maximum number of job slots

  10. Tier-2 site: T1 + T2
  [Diagram: the T1 site BDII publishes CE01–CE03 and the T2 site BDII publishes CE04–CE05, each group under its own site policy and mapping policy. On the T1, the roles « production » and « lcgadmin » map to the dedicated accounts cmsgrid and cms050, and all other users to cms[001-024]; on the T2, the dedicated accounts are cmsgrid and cms049, and all other users map to cms[025-048].]
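The split of the regular-user pool can be sketched as follows. A real mapping service leases accounts first-come-first-served; here a hash of the user's DN stands in for that, just to show the two properties the slides require: the same DN always gets the same account, and the T1 and T2 pools never overlap. The hashing choice and function names are assumptions for illustration.

```python
import hashlib

# Pool ranges taken from the slide; the disjoint split is the point.
POOLS = {
    "T1": [f"cms{i:03d}" for i in range(1, 25)],    # cms001..cms024
    "T2": [f"cms{i:03d}" for i in range(25, 49)],   # cms025..cms048
}

def map_dn(dn: str, site: str) -> str:
    """Deterministically map a grid DN to a pool account of the given site."""
    pool = POOLS[site]
    h = int(hashlib.sha1(dn.encode()).hexdigest(), 16)
    return pool[h % len(pool)]
```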

  11. Grid services for CMS
  • Tier-1 site:
    • 2 FTS: v2.0 and v1.5 (to be decommissioned)
    • 1 SRM SE + 1 test SRMv2 SE
    • 1 LCG CE: 3.0.5 configured for SL4/32 bits, VO CMS only
      • partially updated
    • 2 VO boxes:
      • cclcgcms: PhEDEx + SQUID
      • cclcgcms02: will be used for SQUID
  • Tier-2 site:
    • LCG CE 3.0.11 for SL3, VOs ATLAS and CMS
    • LCG CE 3.0.14 for SL4, all VOs
    • publishes the 1 SRM SE of the tier-1

  12. Grid services for CMS: FTS
  • 2 FTS servers: v2.0 and v1.5
    • v1.5 will be decommissioned once CMS-France gives the green light
  • v2.0:
    • one single node for all agents (VO + channel) and all VOs
    • DB backend on an Oracle cluster
    • T1–IN2P3 channels
    • some T2/T3–IN2P3 channels (e.g. the Belgium and Beijing T2s, the IPNL T3)
    • IN2P3-STAR channels to fit the CMS data management requirements (transfers from anywhere to anywhere ⇨ a harder problem to solve)
    • one node will be added for channel distribution

  13. Grid services for CMS: CE
  • Tier-1 site:
    • 1 LCG CE: 3.0.5 configured for SL4/32 bits, VO CMS only
  • Tier-2 site:
    • LCG CE 3.0.11 for SL3, VOs ATLAS and CMS
    • LCG CE 3.0.14 for SL4, all VOs
  • Major concerns regarding CMS:
    • we strongly encourage the use of requirements in job submission instead of hard-coded hostnames or IP addresses ⇨ less impact on CMS when a node changes
    • the CMS critical-test policy seems too restrictive: a CE/SE is blacklisted with FCR at the first test failure ⇨ possibility to wait for 2 test failures?
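The recommendation above, requirements instead of hard-coded hostnames, could look like this in a gLite JDL sketch; the hostname pattern and executable name are illustrative, not CMS's actual configuration:

```
Executable   = "analysis.sh";

// Fragile: pins the job to one named CE; breaks whenever the node changes
// Requirements = other.GlueCEUniqueID == "some-ce.example.org:2119/jobmanager-bqs-long";

// More robust: let the WMS match any suitable CE of the site
Requirements = RegExp("in2p3\\.fr", other.GlueCEUniqueID);
```

With the pattern form, the site can retire or rename a CE without every CMS submission tool having to be updated.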

  14. Grid services for CMS: CE deployment plans (1)
  • Now
  [Diagram: four computing elements in front of the BQS batch system and the Anastasie computing farm of worker nodes.]
  • Problems:
    • fault tolerance as seen by a VO
    • on-the-fly updates are too difficult
      • by using a spare CE
      • by temporarily migrating a VO to other CEs
      • because the VOs refer to CEs by hostname
    • configurations are not uniform

  15. Grid services for CMS: CE deployment plans (2)
  • To?
  [Diagram: from the current single BQS cluster and Anastasie farm behind four CEs, towards two BQS clusters/farms, each with its own set of CEs.]

  16. Grid services for CMS: CE deployment plans (3)
  • Future? VO-oriented load balancing
  [Diagram: several computing elements behind a common mapping layer.]
  • Identified problems:
    • load balancing
      • depends on the VOs' CE-selection strategy ⇨ risk of overloading a particular CE
      • a solution based on the assumption that the VOs use the information system: via the IS, show a different cluster per CE (logical split of the BQS cluster)
    • account mapping
      • different users must not share the same account ⇨ split the account pool between tier-1 and tier-2
      • the same user should not be mapped to different accounts
  • Solutions:
    • share the gridmapdir between CEs
      • use GPFS?
    • use a centralized LCAS/LCMAPS database
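The gridmapdir sharing proposed above works roughly as follows: each pool account is an empty file, and leasing an account to a DN creates a hard link named after the (encoded) DN, so a link count of 1 means the account is free. This is a simplified sketch of that mechanism, not the actual LCAS/LCMAPS code; paths and the encoding are assumptions.

```python
import os
from urllib.parse import quote

def lease_account(gridmapdir: str, dn: str, pool: list[str]) -> str:
    """Return the pool account leased to `dn`, leasing a free one if needed."""
    key = os.path.join(gridmapdir, quote(dn, safe=""))
    if os.path.exists(key):                      # DN already mapped: find its account
        st = os.stat(key)
        for acct in pool:
            p = os.path.join(gridmapdir, acct)
            if os.path.exists(p) and os.path.samestat(st, os.stat(p)):
                return acct
    for acct in pool:                            # otherwise, first free account wins
        p = os.path.join(gridmapdir, acct)
        if os.stat(p).st_nlink == 1:             # no lease link yet
            os.link(p, key)
            return acct
    raise RuntimeError("pool exhausted")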

  17. And in addition… (1)
  • Job priorities by VO
    • temporary solution by modifying jobs already in the queue
    • to be implemented by a change to BQS (development ongoing)
    • necessary to better understand the VOMS organization of the VO
  • Job traceability
    • increase the visibility of the grid jobs submitted to the Centre
    • information needed by Operations:
      • grid job identifier
      • e-mail address of the submitter

  18. And in addition… (2)
  • gLite CE / CREAM
    • specific developments ongoing (Sylvain Reynaud)
    • deployment on our PPS?
  • SL4
    • more and more nodes available on SL4 for production
    • LCG-CE, gLite-BDII, UI, WN (installation planned on the PPS ASAP)
    • still awaiting an official SL4/64-bit WN/UI release
      • the current SL4/32 creates low-memory load problems

  19. Major concerns: global
  • Our first major concerns:
    • many service nodes
    • many different types of node
    • few people to administer them
    • little time before start-up
    • production quality to be maintained
  • As far as we can, we have…
    • to reuse
      • our practical abilities
      • the existing infrastructure when possible
      • but to add nodes if that eases operations
    • to avoid introducing too many VO-specific features
    • to acquire experience
    • to set up operational procedures
    • to adapt monitoring / administration tools

  20. Major concerns: global
  • Improving grid communication will be the challenge
    • multiple information sources: LCG, EGEE, VOs, regional and internal site communication
    • too much information and knowledge still comes by mail or from a mass of meetings ⇨ though a lot of progress has already been achieved
  • At CC, improving VO–site communication
    • one VO support contact appointed per LHC VO
      • speaks the VO's language with the site
      • speaks the site's language with the VO
    • this has proven a good way to improve communication
  • For CMS matters:
    • grid site administrators systematically discuss with Nelli Pukhaeva
    • Nelli Pukhaeva knows who is the best CC interlocutor for any CMS request
    • CMS-specific support mailing list: cms-support@cc.in2p3.fr

  21. Major concerns: operational
  • We set up operational procedures to suppress, or at least reduce, grid service outages
    • e.g. a CE update can be carried out without any outage:
      • set up a new CE and validate that it works, out of production
      • close the old CE and replace it with the new one
      • retire the old CE once its jobs have ended
  • But "bad" VO usage can interfere with that
    • e.g. job submission explicitly refers to a CE by specifying its hostname
      • time-consuming, because the supported VOs must be informed before taking any action on the CE
      • job submission will fail if the CE is out of production
  • Grid middleware theoretically enables those operations
    • but the problem certainly comes from the fact that the middleware does not allow requests like: "please submit my job to any allowed CE of site IN2P3-CC"
  • Intensive access during jobs to a data file in the VO_CMS_SW_DIR directory, which is on AFS ⇨ heavy load on AFS
    • solution we proposed to Peter Elmer: copy the file into the job's SCRATCH directory; we hope this will be solved soon
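The proposed workaround for the AFS load can be sketched in a few lines: copy the heavily-read file from the AFS-backed VO_CMS_SW_DIR into the job's local scratch area once, then read only the local copy. The helper name and the use of TMPDIR as the scratch location are assumptions for illustration.

```python
import os
import shutil

def localize(filename: str) -> str:
    """Copy a file from VO_CMS_SW_DIR to local scratch; return the local path."""
    src = os.path.join(os.environ["VO_CMS_SW_DIR"], filename)
    dst = os.path.join(os.environ.get("TMPDIR", "/tmp"), filename)
    shutil.copy2(src, dst)   # one read from AFS instead of many during the job
    return dst
```

Each job then pays a single AFS read at start-up instead of hammering AFS for the whole batch farm.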

  22. Major concerns: technical
  • Dealing with VOMS information
    • if I were a VO, I would be very enthusiastic about the new possibilities offered by VOMS
    • but as a site…
      • I need to know what behaviour is expected behind a VOMS role/group
      • I must find a technical solution to translate it into site policy
      • I may have to adapt the interface between the grid front-end and the local services to implement that behaviour
        • CE jobmanager ↔ BQS
        • CE information provider ↔ BQS
    • but this solution raises scalability problems for big sites
    • ASAP we must identify together the new requirements that VOMS will introduce
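The translation step described above, from a VOMS attribute to a site policy, can be sketched as a lookup keyed on the VO and role parsed from a FQAN. The policy table values (accounts, priorities) are invented placeholders, not the actual CC-IN2P3 mapping:

```python
# Hypothetical policy table: (vo, role) -> local policy entry.
POLICY = {
    ("cms", "production"): {"account": "cmsgrid", "priority": "high"},
    ("cms", "lcgadmin"):   {"account": "cms050",  "priority": "high"},
    ("cms", None):         {"account": "pool",    "priority": "normal"},
}

def site_policy(fqan: str) -> dict:
    """Translate a VOMS FQAN like '/cms/Role=production' into a site policy."""
    parts = fqan.lstrip("/").split("/")
    vo = parts[0]
    role = None
    for p in parts[1:]:
        if p.startswith("Role="):
            role = p.split("=", 1)[1]
    return POLICY.get((vo, role), POLICY[(vo, None)])
```

The scalability worry in the slide is that every new role or group a VO defines adds a row like this, on every service, at every site.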

  23. Thanks for your attention
