
CMS Stress Test Report, Marco Verlato (INFN-Padova)


INFN-GRID Testbed Meeting, 17 January 2003. CMS Stress Test Report, Marco Verlato (INFN-Padova). Motivations and goals. Purpose of the “stress test”: verify how well the EDG middleware suits CMS production, and verify the portability of the CMS production environment to a grid environment.


INFN-GRID Testbed Meeting

17 January 2003

CMS Stress Test Report

Marco Verlato (INFN-Padova)

Motivations and goals

  • Purpose of the “stress test”:

    • Verify how well the EDG middleware suits CMS production

    • Verify the portability of the CMS production environment to a grid environment

    • Produce a reasonable amount of the PRS requested events

  • Goals

    • Aim for 1 million events (only FZ files, no Objectivity)

    • Measure performance, efficiency and the causes of job failures

    • Try to make the system stable

  • Organization

    • Operations started on November 30th and ended at Christmas (~3 weeks)

    • The joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)

    • Mailing list: <>

Software and middleware

  • The CMS software used is the official production software

    • CMKIN and CMSIM: installed as RPMs at all sites

  • EDG Middleware releases:

    • 1.3.4 (before 9/12)

    • 1.4.0 (after 9/12)

  • Tools used (on EDG “User Interface”)

    • Modified IMPALA/BOSS system to allow for Grid submission of jobs

    • Scripts and ad-hoc tools to:

      • Replicate files

      • Collect monitoring information from EDG and from the jobs






[Diagram not recovered. Surviving labels: job output filtering, runtime monitoring, write data, CMS sw.]




  • Production is managed from 4 UIs:

    • Bologna / CNAF

    • Ecole Polytechnique

    • Imperial College

    • Padova

      Using several UIs reduces the bottleneck due to the BOSS DB

  • Several RBs see the same Computing and Storage Elements:

    • CERN (dedicated to CMS) (EP UI)

    • CERN (common to all applications) (backup!)

    • CNAF (common to all applications) (Padova UI)

    • CNAF (dedicated to CMS) (CNAF UI)

    • Imperial College (dedicated to CMS and BABAR) (IC UI)

      Using several RBs reduces the bottleneck due to intensive use of a single RB and the 512-owner limit in Condor-G
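The UI-to-RB assignment listed above can be written down as a small lookup table. The sketch below is illustrative only: the names `RB_FOR_UI`, `BACKUP_RB` and `choose_rb` are hypothetical, and the fallback rule (use the common CERN RB when the primary one is unavailable) is an assumption drawn from the "backup!" note on the slide, not a documented procedure.

```python
# Primary Resource Broker for each User Interface, as listed on the slide.
RB_FOR_UI = {
    "EP": "CERN (dedicated to CMS)",
    "Padova": "CNAF (common to all applications)",
    "CNAF": "CNAF (dedicated to CMS)",
    "IC": "Imperial College (dedicated to CMS and BABAR)",
}

# The slide marks this RB as "backup!"; using it as a fallback is an assumption.
BACKUP_RB = "CERN (common to all applications)"


def choose_rb(ui, unavailable=frozenset()):
    """Return the RB a UI would submit to, falling back to the backup RB."""
    primary = RB_FOR_UI[ui]
    if primary not in unavailable:
        return primary
    if BACKUP_RB not in unavailable:
        return BACKUP_RB
    raise RuntimeError("no Resource Broker available")


if __name__ == "__main__":
    print(choose_rb("EP"))
    print(choose_rb("Padova", {"CNAF (common to all applications)"}))
```

Spreading the four UIs over different RBs in this way is what keeps any single RB from carrying the whole submission load.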


Data management

  • Two practical approaches:

    • Bologna, Padova: FZ files (~230 MB each) are stored directly at CNAF, Legnaro

    • EP, IC: FZ files are stored where they have been produced and later replicated to a dedicated SE at CERN. Goal: to test the creation of replicas of files

  • All sites use disk for the file storage, but:

    • CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2)

    • HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS
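The two approaches and the mass-storage copies can be summarised as a per-file sequence of steps. This is a rough sketch under stated assumptions: the function name `storage_steps` and the step strings are hypothetical, the slide does not say which of CNAF or Legnaro each Italian UI writes to (so the sketch leaves that open), and the ordering of the HPSS copy relative to replication is assumed.

```python
DIRECT_STORE_UIS = {"Bologna", "Padova"}   # store FZ files directly
REPLICATE_LATER_UIS = {"EP", "IC"}         # store locally, replicate later


def storage_steps(ui, production_site):
    """Return the ordered storage steps for one FZ file (illustrative)."""
    if ui in DIRECT_STORE_UIS:
        return ["store directly at CNAF or Legnaro"]
    if ui in REPLICATE_LATER_UIS:
        steps = [f"store at {production_site}"]
        if production_site == "Lyon":
            # Files stored in Lyon are automatically copied into HPSS.
            steps.append("copy to HPSS")
        steps.append("replicate to dedicated SE at CERN")
        # Files replicated to CERN are automatically copied into CASTOR.
        steps.append("copy to CASTOR")
        return steps
    raise ValueError(f"unknown UI: {ui}")


if __name__ == "__main__":
    for step in storage_steps("IC", "Lyon"):
        print(step)
```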

Online Monitoring (MDS based)

Events vs. time (CMKIN)

Events vs. time (CMSIM)

~7 sec/event average

~2.5 sec/event peak (12-14 Dec)
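To put the two CMSIM rates in perspective, they can be converted to events per day and compared with the 1 million event target. The sketch assumes the quoted figures are aggregate rates for the whole testbed (one event completed every N seconds overall) and that the rate is sustained around the clock; neither assumption is stated on the slide, and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(sec_per_event):
    """Events completed in one day at a sustained aggregate rate."""
    return SECONDS_PER_DAY / sec_per_event


def days_needed(n_events, sec_per_event):
    """Days needed to produce n_events at a sustained aggregate rate."""
    return n_events * sec_per_event / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, rate in (("average", 7.0), ("peak", 2.5)):
        print(f"{label}: {events_per_day(rate):.0f} events/day, "
              f"{days_needed(1_000_000, rate):.0f} days for 1M events")
```

Under these assumptions the average rate corresponds to roughly 12,300 events per day, so the 1 million event target would take about 81 days, well beyond the ~3 weeks of operations; the peak rate would need about 29 days.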

Final results (preliminary!)

Main issues


  • 29/11 – 2/12: reasonably smooth

  • 3/12 – 5/12: “inefficiency” due to CMS week

  • 6/12: RC problems begin; new collections created; Nagios monitoring online

  • 7/12 – 8/12: II in very bad shape

  • 9/12 – 10/12: deployment of 1.4.0; still problems with RC; CNAF and Legnaro resources not available; problems with CNAF RB

  • 11/12: Top level MDS stuck because of a CE in Lyon

  • 14/12 – 15/12: II stuck, most submitted jobs aborted

  • 16/12: failure in grid-mapfile update due to NIKHEF VO ldap server not reachable


  • Job failures are dominated by:

    • Standard output of job wrapper does not contain useful data:

      • many different causes

      • mainly affects “long jobs”

      • some patches with possible solutions implemented

    • Replica Catalog stops responding: no real solution, but we will soon use RLS

    • Information System (GRIS,GIIS,dbII): hopefully R-GMA will solve these problems

  • Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)

  • Short term actions:

    • EDG-1.4.3 was released on 14/1 and deployed on the PRODUCTION testbed

    • The test continues in “no-stress” mode:

      • in parallel with the review preparation (testbed will remain stable)

      • it will measure the effect of the new GRAM-PBS script and JSS-Maradona patches
