
CMS Stress Test Report, Marco Verlato (INFN-Padova)





INFN-GRID Testbed Meeting

17 January 2003

CMS Stress Test Report

Marco Verlato (INFN-Padova)


Motivations and goals


  • Purpose of the “stress test”:

    • Verify how well the EDG middleware suits CMS production

    • Verify the portability of the CMS production environment to a grid environment

    • Produce a reasonable amount of the PRS requested events

  • Goals

    • Aim for 1 million events (only FZ files, no Objectivity)

    • Measure performance, efficiency and the reasons for job failures

    • Try to make the system stable

  • Organization

    • Operations started on November 30th and ended at Christmas (~3 weeks)

    • The joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)

    • Mailing list: <[email protected]>
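
The goal of measuring efficiencies and the reasons for job failures can be sketched as a simple tally. This is a minimal illustration, not the tooling used in the test: the record layout, the `summarize` helper and the sample jobs are all hypothetical, and the numbers are invented for the example.

```python
from collections import Counter


def summarize(jobs):
    """Return (efficiency, failure_counts) for a list of job records.

    Each record is a hypothetical dict with keys "id", "status"
    ("done" or "failed") and "reason" (a string, or None if done).
    """
    total = len(jobs)
    if total == 0:
        return 0.0, Counter()
    done = sum(1 for job in jobs if job["status"] == "done")
    failures = Counter(job["reason"] for job in jobs if job["status"] == "failed")
    return done / total, failures


# Invented sample records, for illustration only.
jobs = [
    {"id": 1, "status": "done", "reason": None},
    {"id": 2, "status": "failed", "reason": "replica catalog not responding"},
    {"id": 3, "status": "done", "reason": None},
    {"id": 4, "status": "failed", "reason": "information system stuck"},
    {"id": 5, "status": "failed", "reason": "replica catalog not responding"},
]

efficiency, failures = summarize(jobs)
print(f"efficiency: {efficiency:.0%}")
for reason, count in failures.most_common():
    print(f"{count:3d}  {reason}")
```

Efficiency here is simply completed jobs over submitted jobs; a real measurement would also need to decide how to count resubmitted jobs.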


Software and middleware


  • The CMS software used is the official production version

    • CMKIN and CMSIM: installed as RPMs at all sites

  • EDG Middleware releases:

    • 1.3.4 (before 9/12)

    • 1.4.0 (after 9/12)

  • Tools used (on EDG “User Interface”)

    • Modified IMPALA/BOSS system to allow for Grid submission of jobs

    • Scripts and ad-hoc tools to:

      • Replicate files

      • Collect monitoring information from EDG and from the jobs


[Figure: system diagram. Recoverable labels: UI, IMPALA, BOSS DB, RefDB, parameters, JDL, Grid services, CE, WN, JobExecuter, dbUpdator, CMS sw, SE, write data, RC, data registration, job output filtering, runtime monitoring.]


Resources


  • The production is managed from 4 UIs:

    • Bologna / CNAF

    • Ecole Polytechnique

    • Imperial College

    • Padova

      Using several UIs reduces the bottleneck due to the BOSS DB

  • Several RBs see the same Computing and Storage Elements:

    • CERN (dedicated to CMS)(EP UI)

    • CERN (common to all applications)(backup!)

    • CNAF (common to all applications)(Padova UI)

    • CNAF (dedicated to CMS)(CNAF UI)

    • Imperial College (dedicated to CMS and BABAR)(IC UI)

      Using several RBs reduces the bottleneck due to intensive use of the RB and the 512-owner limit in Condor-G
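
The UI-to-RB pairing listed above can be written down as a small lookup table. This is only a sketch: the pairing itself comes from the slide, but the dictionary keys, the `pick_rb` helper and the idea of falling back to the backup RB when the primary is unavailable are illustrative assumptions, not a description of how jobs were actually routed.

```python
# Pairing taken from the slide; key names are shorthand chosen for this sketch.
RB_FOR_UI = {
    "EP": "CERN (dedicated to CMS)",
    "Padova": "CNAF (common to all applications)",
    "CNAF": "CNAF (dedicated to CMS)",
    "IC": "Imperial College (dedicated to CMS and BABAR)",
}

# The slide marks this RB as "backup!".
BACKUP_RB = "CERN (common to all applications)"


def pick_rb(ui, unavailable=()):
    """Return the RB for a UI, or the backup RB if the primary is unavailable.

    The fallback rule is an assumption made for illustration.
    """
    primary = RB_FOR_UI[ui]
    if primary in unavailable:
        return BACKUP_RB
    return primary


print(pick_rb("Padova"))
print(pick_rb("CNAF", unavailable={"CNAF (dedicated to CMS)"}))
```

Spreading the UIs over different RBs keeps each broker's load lower than it would be if all four UIs submitted through a single one.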




Data management


  • Two practical approaches:

    • Bologna, Padova: FZ files (~230 MB sized) are directly stored at CNAF, Legnaro

    • EP, IC: FZ files are stored where they have been produced and later replicated to a dedicated SE at CERN. Goal: to test the creation of file replicas

  • All sites use disk for the file storage, but:

    • CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2)

    • HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS



Online Monitoring (MDS based)



Events vs. time (CMKIN)



Events vs. time (CMSIM)

~7 sec/event average

~2.5 sec/event peak (12-14 Dec)
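
To put the two rates in perspective, they can be converted to events per day. This sketch assumes the figures are aggregate production rates (one event completed somewhere on the testbed every N seconds) rather than per-CPU simulation times; the slide does not say which, and the helper name `events_per_day` is introduced here for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def events_per_day(sec_per_event):
    """Convert a rate in seconds per event into events per day."""
    return SECONDS_PER_DAY / sec_per_event


average = events_per_day(7.0)  # average rate quoted on the slide
peak = events_per_day(2.5)     # peak rate quoted on the slide

print(f"average: about {average:,.0f} events/day")
print(f"peak:    about {peak:,.0f} events/day")
```

Under that assumption the average corresponds to roughly 12,000 events per day and the peak to roughly 35,000 events per day.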



Final results (preliminary!)


Main issues



Chronology


  • 29/11 – 2/12: reasonably smooth

  • 3/12 – 5/12: “inefficiency” due to CMS week

  • 6/12: RC problems begin; new collections created; Nagios monitoring online

  • 7/12 – 8/12: II in very bad shape

  • 9/12 – 10/12: deployment of 1.4.0; still problems with RC; CNAF and Legnaro resources not available; problems with CNAF RB

  • 11/12: Top level MDS stuck because of a CE in Lyon

  • 14/12 – 15/12: II stuck, most submitted jobs aborted

  • 16/12: failure in grid-mapfile update due to NIKHEF VO ldap server not reachable


Conclusions


  • Job failures are dominated by:

    • Standard output of job wrapper does not contain useful data:

      • many different causes

      • mainly affects “long jobs”

      • some patches with possible solutions implemented

    • Replica Catalog stops responding: no real solution, but we will soon use RLS

    • Information System (GRIS,GIIS,dbII): hopefully R-GMA will solve these problems

  • Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)

  • Short term actions:

    • EDG-1.4.3 was released on 14/1 and deployed on the PRODUCTION testbed

    • The test continues in “no-stress” mode:

      • in parallel with the review preparation (testbed will remain stable)

      • it will measure the effect of the new GRAM-PBS script and JSS-Maradona patches

