
Operational Experience with CMS Tier-2 Sites



  1. Operational Experience with CMS Tier-2 Sites. I. González Caballero (Universidad de Oviedo) for the CMS Collaboration

  2. Some relevant aspects of the CMS Computing Model
  • Data driven: big blocks of data are moved in a more or less controlled way, and jobs are sent to the data, not vice versa. Tools to handle the data and find where it is stored become very important.
  • Distributed: extensive use of GRID technology, profiting from the two most widespread GRID infrastructures: OSG and EGEE.
  • Hierarchical: the Tier-0 serves data to Tier-1s, which serve data to Tier-2s, which serve data to Tier-3s. Different workflows occur in different tiers, and different degrees of service and commitment are expected from each tier.
  Some figures for CMS
  • Event size (MB): RAW: 1 - RECO: 0.5 - AOD: 0.1
  • CPU required (SI2k/event): Simulation: 90 - Reconstruction: 25 - Analysis: 0.25
  Operational Experience with CMS Tier-2 Sites - CHEP 2009
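The per-event figures above translate directly into capacity estimates. A minimal sketch of that arithmetic; the event sizes and CPU numbers are the ones quoted on the slide, while the helper function and the one-billion-event example are illustrative:

```python
# Per-event figures quoted on the slide.
EVENT_SIZE_MB = {"RAW": 1.0, "RECO": 0.5, "AOD": 0.1}
CPU_SI2K_PER_EVENT = {"simulation": 90.0, "reconstruction": 25.0, "analysis": 0.25}

def storage_tb(n_events: int, tier: str) -> float:
    """Disk needed to hold n_events of one data tier, in TB (1 TB = 1e6 MB)."""
    return n_events * EVENT_SIZE_MB[tier] / 1e6

# Illustrative example: hosting the AOD of one billion events.
print(storage_tb(1_000_000_000, "AOD"))  # 100.0 (TB)
```

This is why a Tier-2 can realistically host AOD-level samples for large datasets, while RAW data stays at the Tier-0/Tier-1s.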

  3. CMS Computing Model (diagram, credit les.robertson@cern.ch): raw data from the event filter (selection & reconstruction) flows through reconstruction and event reprocessing into event summary data and processed data; analysis objects (extracted by physics topic) feed batch and interactive physics analysis; detector simulation, event simulation and analysis run at the Tier-2s.

  4. CMS Computing Model: Tier-2 tasks
  • Tier-2s account for 1/3 of the total CMS resources
  • More than 40 sites in 22 countries
  • They are expected to provide resources for: production of all the simulation the collaboration needs, and user data analysis
  MC production requires… (centrally controlled activity)
  • GRID environment
  • A working Storage Element that understands SRM
  • Ability to transfer data to Tier-1s
  • CMS software (CMSSW) installed at the site
  Data analysis requires… (user driven activity, hence bursty)
  • GRID environment
  • CMS software (CMSSW)
  • A working Storage Element that understands SRM, with enough space to host the needed datasets
  • Ability to transfer data from Tier-1s

  5. CMS Tier-2 requirements
  A CMS Tier-2 needs the following GRID infrastructure:
  • A GRID computing cluster: OSG or EGEE
  • A storage cluster (CASTOR, dCache, DPM, GPFS…) with an SRMv2 frontend such as StoRM
  • GRID interfaces to both clusters
  • Local monitoring tools: batch, storage, accounting, …
  Plus the following CMS services:
  • PhEDEx, to manage data transfers: connects sites through SRMv2; the FTS service at the Tier-1s is used to schedule transfers; runs on a dedicated mid-size machine
  • FroNTier: a Squid proxy to cache alignment and calibration constants locally; one small machine per 800 slots
  Besides these, a site may operate other, non-mandatory services to improve the local users' experience:
  • A login facility for local users: User Interfaces, interactive access to locally stored data, …
  • Local GRID and CMS services: local WMS, CRAB Server, local Data Bookkeeping Service (DBS), …
  Related talk by R. Egeland: PhEDEx Data Service (Thur, 16:30)

  6. CMS Data Handling: Transfers at CMS Tier-2s
  The CMS model depends heavily on an efficient data transfer system, and CMS has a very flexible transfer topology:
  • Any Tier-2 downloads and uploads data from any Tier-1
  • Tier-2 to Tier-2 transfers are also allowed (though not encouraged); they are interesting for Tier-2s associated with the same physics groups
  This adds complexity to the operation of the Tier-2 network:
  • Multiple SRM connections must be managed by the sites
  • The different latencies make optimization difficult
  • Operators are geographically spread across time zones, complicating communication
  A full metric to commission links (up and down) has been developed:
  • Based on expected data bandwidths and data transfer quality
  • Designed to avoid sites with problems overloading well-performing sites
  • Only commissioned links may be used to transfer CMS data
  (Topology diagram: T0 (CERN); T1s: ASGC, FZK, PIC, FNAL, CNAF; Tier-2s, including T2 (ES))
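The commissioning idea above can be sketched as a simple pass/fail check: a link qualifies only if it sustains an expected bandwidth at good transfer quality. The 20 MB/s and 80% thresholds below are illustrative assumptions, not the actual CMS metric:

```python
# Sketch of a link-commissioning check: require both a minimum sustained
# rate and a minimum fraction of successful transfer attempts.
# Thresholds are hypothetical placeholders.
def link_commissioned(avg_rate_mb_s: float, success: int, failure: int,
                      min_rate_mb_s: float = 20.0, min_quality: float = 0.8) -> bool:
    attempts = success + failure
    if attempts == 0:
        return False  # an untested link cannot be commissioned
    quality = success / attempts
    return avg_rate_mb_s >= min_rate_mb_s and quality >= min_quality

print(link_commissioned(35.0, 90, 10))  # True: fast enough, 90% quality
print(link_commissioned(35.0, 60, 40))  # False: only 60% of transfers succeed
```

Requiring quality as well as rate is what keeps a failing site from overloading a well-performing partner with endless retries.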

  7. CMS Data Handling: Commissioning links at Tier-2s
  • CMS Facility Operations has put a big effort into increasing the number of active links
  • The downlink mesh is almost full
  • Around 50% of the uplinks have been commissioned
  • At least two uplinks are mandatory for every Tier-2
  • The Debugging Data Transfers effort, still ongoing, is helping Tier-2s fill the mesh
  • Work is also ongoing to reduce dataset transfer latencies so data can be used sooner at sites
  For more details see the poster from J. Letts: Debugging Data Transfers in CMS (Thur - 024)

  8. CMS Data Handling: Transfers to and from Tier-2s
  • PhEDEx takes care of the transfers using a subscription mechanism
  • Transfers use SRMv2, scheduled with FTS
  • A set of agents take care of the different activities needed: download, upload, data consistency checks, etc.
  • PhEDEx also provides data validation and monitoring tools
  • The Tier-2s need to set up a UI machine; PhEDEx software is centrally distributed through apt-get
  • Local operators need to configure the agents; tuning them is not always trivial, but lots of documentation and examples are publicly available
  • An XML file (the Trivial File Catalog) takes care of converting LFNs to PFNs
  Transfers to CMS Tier-2s last year: 14,035 TB. Transfers from CMS Tier-2s last year: 4,787 TB.
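The LFN-to-PFN conversion mentioned above amounts to applying ordered regular-expression rules. A minimal sketch of that mechanism; the rule, storage path and host name below are made up for illustration and are not a real site's Trivial File Catalog:

```python
# Sketch of Trivial-File-Catalog-style name mapping: the first matching
# (protocol, pattern) rule rewrites a logical file name into a site-local
# physical file name. The rule below is a hypothetical example.
import re

TFC_RULES = [
    # (protocol, path-match regex, result template with \1 backreference)
    ("direct", r"/+store/(.*)", r"/pnfs/example.org/data/cms/store/\1"),
]

def lfn_to_pfn(lfn: str, protocol: str = "direct") -> str:
    for proto, pattern, result in TFC_RULES:
        if proto == protocol:
            m = re.fullmatch(pattern, lfn)
            if m:
                return m.expand(result)
    raise LookupError(f"no TFC rule matches {lfn!r} for protocol {protocol!r}")

print(lfn_to_pfn("/store/mc/sample/file.root"))
# /pnfs/example.org/data/cms/store/mc/sample/file.root
```

Because only the catalog differs between sites, the same logical names work everywhere, which is what lets CMSSW jobs and PhEDEx agents stay site-agnostic.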

  9. CMS Data Handling: Tier-2 storage distribution
  • MC Space - 20 TB: for MC samples produced before they are transferred to the Tier-1s
  • Central Space - 30 TB: intended for RECO samples of Primary Datasets
  • Physics Group Space - 60-90 TB: assigned to 1-3 physics groups; space allocated by the physics data manager
  • Local Storage Space - 30-60 TB: intended to benefit the geographically associated community
  • User Space - 0.5-1 TB per user: each CMS user is associated to a CMS Tier-2 site; big outputs from user jobs can be staged out to this area
  • Temporary Space - < 1 TB
  For more details see the poster from T. Kress: CMS Tier-2 Resource Management (Mon - 089)
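Summing the nominal allocations above gives a feel for the total storage a Tier-2 must provide. A quick calculation, taking the midpoint of each quoted range and assuming 30 local users as an illustrative number (the user count is not from the slide):

```python
# Nominal Tier-2 storage budget, midpoints of the ranges quoted on the slide.
ALLOCATIONS_TB = {
    "MC space": 20.0,
    "Central space": 30.0,
    "Physics group space": (60 + 90) / 2,   # 60-90 TB range
    "Local storage space": (30 + 60) / 2,   # 30-60 TB range
    "User space": 0.75 * 30,                # 0.5-1 TB per user, assumed 30 users
    "Temporary space": 1.0,
}

total = sum(ALLOCATIONS_TB.values())
print(f"nominal total: {total:.1f} TB")  # nominal total: 193.5 TB
```

So a mid-size Tier-2 storage element sits in the neighborhood of 150-250 TB depending on the number of hosted physics groups and users.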

  10. CMS Data Handling: Selecting the data at the Tier-2
  • Datasets for the Central Space are managed by the Data Operations team by subscribing the assigned samples
  • PAGs and DPGs usually appoint one or two persons responsible for subscribing data to the Physics Group Space at their "associated" Tier-2s
  • PhEDEx keeps track of the ownership of each dataset for these two disk areas, making it easy to follow the correct use of data at the sites
  • The MC Space is filled by the production jobs; data is requested for deletion as soon as it is transferred to the Tier-1s
  • Any single user in CMS can request a dataset to be placed on the Local Storage Space at any Tier-2
  • Sites are free to manage the use of the User Space the way they prefer: quotas, mail, etc.; users are usually close to the Tier-2
  CMS created the role of the Data Manager at each site, with special rights:
  • Reviews every single transfer or deletion request… and approves or denies it
  • Makes sure the data is in accordance with the site's commitments to the Physics and Detector Groups, and that there is enough space on the local Storage Element to store the data
  • At big sites this can be a quite time-consuming activity
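The Data Manager's review described above boils down to two checks per request: does the dataset fit the site's group commitments, and is there room on the storage element? A minimal sketch of that decision; all group names and sizes are hypothetical:

```python
# Sketch of a Data Manager's approval decision for a transfer request:
# approve only if the dataset belongs to a physics group the site supports
# and fits in the free space. Names and numbers are made up.
def approve_request(size_tb: float, group: str, free_tb: float,
                    supported_groups: set) -> bool:
    return size_tb <= free_tb and group in supported_groups

site_groups = {"top", "higgs"}  # groups this hypothetical Tier-2 supports
print(approve_request(5.0, "top", 12.0, site_groups))   # True
print(approve_request(5.0, "susy", 12.0, site_groups))  # False: wrong group
print(approve_request(20.0, "top", 12.0, site_groups))  # False: no space
```

The real workflow runs through the PhEDEx request pages rather than code like this, but the approval logic the Data Manager applies by hand has essentially this shape.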

  11. Computing at CMS Tier-2s
  Software installation is centrally managed by CMS:
  • The VO sgm role is used and is expected to have the highest priority on the queues
  • Due to some limitations in rpm under SLC4, CMSSW installation needs a 64-bit node
  • The installation of old CMSSW releases needs large amounts of memory on the installation node; improvements in newer releases reduce this requirement to O(100 MB)
  • The CMSSW procedure needs write access for all software managers: map all sgm grid logins to a single account
  • The installation area has to be shared among all Worker Nodes
  Data access from the WNs:
  • CMSSW understands the Trivial File Catalog, which is used to convert LFNs to PFNs
  • POSIX/dCache/RFIO protocols are supported
  Production workflow:
  • A nominal Tier-2 is expected to reserve half of its CPUs for MC production, managed through the VO production role
  GRID access for local users:
  • A User Interface needs to be set up, with CRAB manually installed on it; really easy to install using a tar file and an automatic configuration script
  Related talk by D. Spiga: Automatization of User Analysis Workflow in CMS (Thur, 17:10)

  12. Operating CMS Tier-2s: Central Aspects
  • Operating the more than 40 CMS Tier-2 sites is a complex task: they are geographically spread around the globe, in different time zones, with a wide variety of sizes, technologies, bandwidths…
  • Good means to communicate important news, configuration changes, requirements and problems are important:
  • A special Hypernews forum dedicated to Tier-2s, which at least one local operator at every site needs to follow
  • A Savannah squad per site has been created; each problem found at a site is assigned to the squad
  • A new metric has been developed to establish a site's capability to contribute efficiently to CMS: Site Readiness
  • Based on the number of commissioned links, fake analysis jobs (JobRobot) and Site Availability Monitoring (SAM) tests
  • Sites are then classified as READY, NOT-READY or WARNING (in danger of becoming NOT-READY)
  See the poster by J. Flix (Thur 040): The commissioning of CMS sites: improving the site reliability
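The three-way classification above can be sketched as a pair of threshold checks on the inputs the slide lists. The 80%/70% thresholds and the exact combination rule are illustrative assumptions; the real Site Readiness metric has its own definitions:

```python
# Sketch of READY / WARNING / NOT-READY classification from commissioned
# links, JobRobot efficiency and SAM availability. Thresholds are
# hypothetical placeholders, not the official Site Readiness values.
def site_readiness(commissioned_links: int, jobrobot_eff: float,
                   sam_availability: float) -> str:
    good = (commissioned_links >= 2 and jobrobot_eff >= 0.80
            and sam_availability >= 0.80)
    marginal = (commissioned_links >= 2 and jobrobot_eff >= 0.70
                and sam_availability >= 0.70)
    if good:
        return "READY"
    if marginal:
        return "WARNING"   # in danger of becoming NOT-READY
    return "NOT-READY"

print(site_readiness(5, 0.92, 0.95))  # READY
print(site_readiness(5, 0.75, 0.85))  # WARNING
print(site_readiness(1, 0.95, 0.95))  # NOT-READY
```

The WARNING band is the operationally useful part: it flags a site to its operators before it drops out of the pool of usable sites.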

  13. Monitoring CMS Tier-2s
  • Many tools have been developed to monitor the different aspects of a Tier-2 from the point of view of CMS, for both local and central operators
  • Workflows can be monitored through the CMS Dashboard; almost any aspect of analysis and production jobs can be checked:
  • Successful/cancelled/aborted jobs, by user, by site, by application, by dataset, by CE, …, and by GRID or application error code
  • All aspects of data handling can be monitored through the wide variety of PhEDEx Web Server plots and tables: transfer rates and volumes, quality of the transfers, errors detected and the reasons for those errors, latencies, routing details, …
  • SAM tests and Site Readiness offer their own set of tools, integrated in the Dashboard

  14. CMS Tier-2 Workflows: Production
  • Production uses a special tool developed by CMS: ProdAgent
  • Completely centralized: no local operator intervention is needed in the operation
  • Data is produced at Tier-2s and automatically uploaded to the Tier-1s
  More than 2 billion events produced during the last 12 months
  See the poster by F. Van Lingen (Tue 014): CMS production and processing system - Design and experiences

  15. CMS Tier-2 Workflows: User Analysis
  More than 7.5 million user analysis jobs executed at Tier-2s:
  • On produced samples
  • And on real data: cosmics recorded with full and no magnetic field
  60.4% of the jobs finished successfully; 39.6% ended in error

  16. Future plans…
  • The main goal in the near future is to completely integrate all the CMS Tier-2s into CMS computing operations, using dedicated task forces to help sites meet the Site Readiness metrics
  • Improve the availability and reliability of the sites to further increase the efficiency of both analysis and production activities
  • Complete the data transfer mesh by commissioning the missing links, especially Tier-2 to Tier-1 links, and continuously re-check the links already commissioned
  • Improve the deployment of CMS software, loosening the requisites at the sites
  • Install CRAB Servers at more sites:
  • The CRAB Server takes care of some routine user interactions with the GRID, improving the user experience
  • It improves the accounting and helps spot problems and bugs in CMS software
  • A new powerful machine and special software need to be installed by local operators
  • CMS is building the tools to allow users to share their data with other users or groups; this will impact the way data is handled at the sites

  17. Conclusions
  • Tier-2 sites play a very important role in the CMS Computing Model: they are expected to provide one third of the CMS computing resources
  • CMS Tier-2 sites handle a mix of centrally controlled activity (MC production) and chaotic workflows (user analysis); CPU shares need to be set appropriately to ensure enough resources for each workflow
  • CMS has built the tools to facilitate the day-by-day handling of data at the sites: the PhEDEx agents running at every site transfer data in an unattended way, and a Data Manager appointed at every site links CMS central data operations with the local management
  • CMS has established metrics to validate the availability and readiness of the Tier-2s to contribute efficiently to the collaboration's computing needs, by verifying their ability to transfer and analyze data
  • A large number of monitoring tools have been developed by CMS to watch every aspect of a Tier-2, in order to better identify and correct the problems that may appear
  • CMS Tier-2s have proved to be well prepared for massive MC production, dynamic data transfer, and efficient data serving to local GRID clusters, and able to provide our physicists with the infrastructure and the computing power to perform their analyses efficiently
  CMS Tier-2s have a crucial role to play in the coming years in the experiment, and are already well prepared for the LHC collisions and the CMS data taking

  18. The End. Thank you very much!

  19. DRAFT • Abstract: In the CMS computing model, about one third of the computing resources are located at Tier-2 sites, which are distributed across the countries in the collaboration. These sites are the primary platform for user analyses; they host datasets that are created at Tier-1 sites, and users from all CMS institutes submit analysis jobs that run on those data through grid interfaces. They are also the primary resource for the production of large simulation samples for general use in the experiment. As a result, Tier-2 sites have an interesting mix of organized experiment-controlled activities and chaotic user-controlled activities. CMS currently operates about 40 Tier-2 sites in 22 countries, making the sites a far-flung computational and social network. We describe our operational experience with the sites, touching on our achievements, the lessons learned, and the challenges for the future.

  20. CMS Tier-2 Workflows: User Analysis (backup)
  • Part of the analysis jobs ran on real data: cosmics at full and no magnetic field
  • Jobs run at each Tier-2 during the last 12 months: 7,589,336 in total (60.4% OK, 39.6% ERR)
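The percentages above turn back into absolute counts with a one-line calculation (the total is the figure quoted on the slide):

```python
# Success/failure split of user analysis jobs, from the slide's figures.
total_jobs = 7_589_336
ok_frac, err_frac = 0.604, 0.396

ok_jobs = round(total_jobs * ok_frac)   # successful jobs
err_jobs = total_jobs - ok_jobs         # failed jobs
print(ok_jobs, err_jobs)  # roughly 4.58 million OK, 3.01 million in error
```

So roughly three million job failures in a year: even a "good" success rate at this scale generates an operational load that the monitoring and Site Readiness machinery described earlier exists to manage.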
