Matthias Kasemann CERN/DESY. The CMS Computing System: getting ready for Data Analysis. CMS achievements 2006. Magnet & Cosmics Test (August 06) Detector Lowering (January 07). CMS achievements 2006 : Physics TDRs.

Matthias Kasemann CERN/DESY

The CMS Computing System: getting ready for Data Analysis

CMS achievements 2006

Magnet & Cosmics Test (August 06)

Detector Lowering (January 07)

CMS achievements 2006: Physics TDRs

  • Feb 2006: Volume I of the P-TDR; describes detector performance and software.

  • Jun 2006: Volume II describes the physics performance.

  • The two volumes constitute the culmination of our plans for data analysis in CMS with up to 30 fb⁻¹ of data.

    • The special study of detector commissioning and data analysis during the startup of CMS has been deferred to 2007.

  • This activity mobilized hundreds of collaborators during the past two years, and many useful lessons have been learned.

CMS: Computing highlights 2006

  • Main computing/software milestones:

    • Magnet Test Cosmic challenge (Apr 06)

    • Computing Software and Analysis Challenge 06 (Nov 06)

  • 2006: a year of fundamental software changes

    • New simulation and reconstruction software packages released

      • Very positive feedback from users

    • Developed procedures for release integration, building and distribution.

      • Control release tools, Hypernews, Nightly builds, Tag collector, WorkBook,…

    • Design control of all interfaces and data formats in place

      • CMSSW framework, framework-light, ROOT available for data access

  • Integration with CMS detector and commissioning activities

    • Strong connections with various detector groups – key for commissioning

    • Validation software packages and validation procedure in place – crucial for startup preparation

Major Milestone in 2006: CSA06

  • Combined Computing, Software, and Analysis challenge (CSA06)

  • A “25% of 2008” data challenge of the CMS data handling model and computing operations

    • Integrated test of full end-to-end chain of the complete system, from (simulated) raw data to analysis at Tier-1 and Tier-2 centers.

    • Launched on Oct 2, 2006, after many months of preparation and the development of about 0.5M lines of software in the new CMSSW framework.

    • Completed 6 weeks later, having achieved all technical goals of the challenge. Code ran with a negligible crash rate and without any memory problems on all samples.

  • By the end of CSA06: Tier-0 centre reconstructed > 200M events; >1 Petabyte of data shipped across network between Tier-0, Tier-1, and Tier-2 centers.

    • Excellent collaboration with IT department was an important factor in the success of the challenge

    • World-wide distributed system of regional Tier1 and Tier2 centers

CSA06: T0 Goals & Achievements

  • Goal: prompt reconstruction at 40 Hz

    • Achieved: 50 Hz for 2 weeks, then 100 Hz

    • Peak rate: >300 Hz for >10 hours

    • 207M events total

  • Goal: uptime of 80% over the best 2 weeks

    • Achieved: 100% over 4 weeks

  • Use of Frontier for DB access to prompt reconstruction conditions

    • The CSA challenge was the first opportunity to test this on a large scale with developed reconstruction software

    • Initial difficulties encountered during commissioning, but patches and reduced logging allowed full inclusion into CSA

  • CPU use

    • Max CPU efficiency: 96% of 1400 CPUs over ~12 hours

  • Explored realistic T0 operations, upgrading and intervening on a running system

CSA06: T0T1 Transfers

Last week’s averages hit350MB/s (daily) 650MB/s (hourly)i.e. exceeded 2008 levels for ~10 days (with some backlog observed)

  • Goal was to sustain 150 MB/s to T1s

    • Twice the expected 40 Hz output rate
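The relation between the transfer goal and the reconstruction rate can be checked with a little arithmetic. The sketch below assumes that "twice the expected 40 Hz output rate" means the nominal T0 output is half of the 150 MB/s goal. The implied per-event size and the daily volume are derived from that reading and are not figures stated on the slide; the function names are illustrative only.

```python
# Back-of-the-envelope check of the T0 -> T1 transfer goal.
# Inputs from the slide: 150 MB/s goal, 40 Hz nominal rate, factor of 2.
GOAL_MB_PER_S = 150.0     # sustained export goal to the T1s
NOMINAL_RATE_HZ = 40.0    # target prompt-reconstruction rate
GOAL_FACTOR = 2.0         # goal is "twice the expected output rate"

SECONDS_PER_DAY = 86400


def expected_output_mb_per_s(goal=GOAL_MB_PER_S, factor=GOAL_FACTOR):
    """Nominal T0 output rate implied by the goal (assumption: goal / 2)."""
    return goal / factor


def implied_event_size_mb(rate_hz=NOMINAL_RATE_HZ):
    """Average exported data volume per event implied by the numbers above."""
    return expected_output_mb_per_s() / rate_hz


def daily_volume_tb(mb_per_s):
    """Data volume moved in one day at a sustained rate, in decimal TB."""
    return mb_per_s * SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    print(f"nominal output : {expected_output_mb_per_s():.1f} MB/s")
    print(f"implied size   : {implied_event_size_mb():.3f} MB/event")
    print(f"daily at goal  : {daily_volume_tb(GOAL_MB_PER_S):.2f} TB/day")
```

Under these assumptions the nominal output is 75 MB/s, which corresponds to a little under 2 MB exported per event at 40 Hz.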

[Figure: monthly T1 transfer plot. Annotations: target rate; min bias only at start; signals start; T0 rate: 54, 110, 170, 160 Hz]

CSA06: Individual T0 - T1 Performance

Goals and achievements

  • 6 of 7 Tier-1s exceed 90% availability for 30 days

  • U.S. T1 (FNAL) hit 2X goal

  • 5 sites stored data to MSS (tape)

CSA06: Jobs Execution on the Grid

  • > 50K jobs/day submitted on all but one day in final week

    • > 30K/day robot jobs

    • 90% job completion efficiency

    • Robot jobs have same mechanics as user job submissions via CRAB

    • Mostly T2 centers as expected

      • OSG carries large proportion

    • Scaling issues encountered, but subsequently solved

CSA06: Prompt Tracker Alignment

  • Determine new alignment:

  • Run “HIP” algorithm on multiple CPUs at CERN over dedicated alignment skim from T0

  • 1 million events: ~4 h on 20 CPUs

  • Write new alignment into offline DB at T0 (ORCOFF)

  • Distribute offline DB to T1/T2s
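The quoted alignment turnaround gives a feel for the CPU cost of the prompt alignment step. The sketch below takes only the slide's numbers (1 million events, ~4 h, 20 CPUs) and assumes, for illustration, that the work is spread evenly over the CPUs; the helper names are hypothetical.

```python
# Rough CPU cost of the alignment run quoted above.
EVENTS = 1_000_000   # events in the alignment skim (from the slide)
WALL_HOURS = 4.0     # approximate wall-clock time (from the slide)
N_CPUS = 20          # CPUs used at CERN (from the slide)


def cpu_hours(wall_hours=WALL_HOURS, n_cpus=N_CPUS):
    """Total CPU time, assuming all CPUs are busy for the whole run."""
    return wall_hours * n_cpus


def events_per_cpu_second(events=EVENTS):
    """Average processing rate per CPU under the same assumption."""
    return events / (cpu_hours() * 3600)


if __name__ == "__main__":
    print(f"total cost : {cpu_hours():.0f} CPU-hours")
    print(f"throughput : {events_per_cpu_second():.2f} events/s per CPU")
```

With these inputs the run costs about 80 CPU-hours, or roughly 3.5 events per second per CPU.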

[Figure: TIB DS modules, positions. Results 2 days after AlCaReco!]

Closing the loop: analysis of re-reconstructed Z → μ+μ- data at a T1/T2 site (grid jobs at T1-PIC)

Three scenarios:

[Figure panels: 1 GLB + 1 tracker track; 2 GLB tracks; 1 GLB + 1 STA track]

CSA06: Physics Analysis Demonstrations

  • These demonstrations proved to be useful training exercises for collaborators in the new software and computing tools.

  • Muon:

    • Extraction of W

    • Di-Muon reconstruction efficiency

      • Z, J/+-

      • Northwestern and Purdue groups and T2 activity

  • Tau:

    • Selection of Ztau tau l+jet

    • Tau mis-id study from Z+jet

    • Tau tagging efficiency

CSA06 Summary

  • All goals were met

    • T0 prompt reconstruction producing RECO, AOD, and AlCaReco, with Frontier access, at 100% efficiency for 207M events

    • Export to T1 @ 150 MB/s and higher

    • Data reduction (skim) production at T1s performed, transferred to T2s

    • Re-reconstruction demonstrated at 6 T1 centers

    • Job load exceeded 50K/day

    • Alignment/Calibration/Physics analyses widely demonstrated

  • CSA06 was a huge enterprise

    • Commissioned the CMS data-handling workflow @ 25% scale

    • Everything worked down to the final analysis plots

    • Many lessons can be drawn for the future as we prepare for data-handling operations, and more things to commission

      • DAQ Storage Manager → T0

      • Support of global data-taking during detector commissioning

Some Lessons from CSA06

  • CMS needs some development work to ease the operations load

  • Strong engagement with OSG, WLCG and sites was extremely useful

    • Grid service and site problems were addressed promptly.

    • FTS at CERN was carefully monitored, with response when needed

    • CASTOR support at CERN was excellent

    • Support from CERN IT was key to the success

  • Data management needs an automatic way to ensure consistency across all components

  • Scale testing continues to be an extremely important activity

CMS Outlook and Perspectives for 2007

  • Lower the entire detector and commission it underground.

  • Prepare final distributed computing and software system and physics analysis capability.

  • The initial* CMS detector will be ready for collisions at 900 GeV at the end of 2007.

  • The low luminosity detector will be ready for collisions at design energy in mid-2008.

  • *The initial CMS detector is the low luminosity detector minus ECAL endcaps and pixels. Both will be installed during the 07/08 winter shutdown.

CMS computing goals in 2007

  • Demonstrate Physics Analysis performance using final software with high statistics.

    • Major MC production of up to 200M events started last week

    • Analysis starts in June, finishes by September

  • Regular data taking: Detector – HLT – TAPE - T0 - T1

    • At regular intervals, 3-4 days per month, starting in May

    • Month of October: MTCC3

      Readout of (successively more) components; data will be processed and distributed to T1s

Computing Commissioning Plans 2007

Start large MC


  • February

    • Deploy PhEDEx 2.5

    • T0-T1, T1-T1, T1-T2 independent transfers

    • Restart job robot

    • Start work on SAM

    • FTS full deployment

  • March

    • SRM v2.2 tests start

    • T0-T1(tape)-T2 coupled transfers (same data)

    • Measure data serving at sites (esp. T1)

    • Production/analysis share at sites verified

  • April

    • Repeat transfer tests with SRM v2.2, FTS v2

    • Scale up job load

    • gLite WMS test completed (synch. with Atlas)

  • May

    • Start ramping up to CSA07

  • July

    • CSA07

[Timeline annotations: Event Filter tests; start analysis; start global data-taking runs; Global Detector Run; LHC Eng. run]

Motivations for CSA07

There are two important goals for 2007, the last year of preparations for physics and analysis:

1) Scaling

We need to reach 100% of system scale and functionality by spring of 2008

  • CSA06 demonstrated between 25% and 50% depending on the metric

2) Transition to sustainable operations

This spans all areas of computing

  • Data management

  • Job processing

  • User Support

  • Site configuration and consistency

In the past, functionality was valued higher than the operations load

  • As we prepare for long-term support this emphasis needs to change

CSA07 Goals: Increase Scale

CMS demonstrated 25% performance in 2006. We have two more factors of 2 to ramp up before data taking in 2008

  • The data transfer between Tier-0 and Tier-1 reached about 50% of scale

    • Very successful test, but some signs of system stress were visible

  • Job submission rate reached 25%.

We plan another formal challenge in 2007

  • A > 50% challenge in the summer of 2007

    • Extend the system to include the HLT farm

    • Add elements like simulation production

    • Increase user load

    • Run concurrent with other experiments stressing the system
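The "two more factors of 2" statement follows directly from the scale fractions quoted above. A minimal sketch of that arithmetic, using only the 25% and 50% figures from the slides; the dictionary keys and the function name are illustrative:

```python
import math

# Fraction of the 2008 scale demonstrated in CSA06, per metric (from the slides).
CSA06_SCALE = {
    "job submission rate": 0.25,
    "T0-T1 data transfer": 0.50,
}


def doublings_needed(current_fraction, target_fraction=1.0):
    """Number of factor-of-2 steps from the current to the target scale."""
    return math.log2(target_fraction / current_fraction)


if __name__ == "__main__":
    for metric, fraction in CSA06_SCALE.items():
        steps = doublings_needed(fraction)
        print(f"{metric}: {fraction:.0%} of scale, {steps:.0f} doubling(s) to go")
```

A metric at 25% needs two doublings to reach full scale; a >50% challenge in 2007 would cover the first of them.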

CMS Computing Model & Resources

CMS Tier-1 centers:

CSA07 Workflow

CSA07 success metrics

CSA07 Goals for Tier-1s

In the Computing Model the Tier-1 centers perform 4 functions:

  • Archive data, both real data and simulation from Tier-2 centers

  • Execute skimming and selection for users and groups on the data

  • Re-reconstruction of raw data

  • Serving data samples to Tier-2 centers for further analysis

    As we transition to operations we should bring the Tier-1 centers into alignment with their core functionality

CSA07: expectations of Tier-2s

MC Production at Tier-2s

  • Tier-2s were a significant contributor to the 25M events/month for CSA06

  • When the experiment is running, the Tier-2s are the only dedicated simulation resources and the expectation is 100M events per month

    • CMS now produces 30M events/month; the goal for CSA07 is 50M
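The production figures above imply a steep ramp. The sketch below lists the four monthly rates quoted on the slide and computes the step-to-step growth factors; the stage labels are shorthand introduced here, not official names.

```python
# Monthly MC production rates quoted on the slide, in millions of events.
PRODUCTION_M_PER_MONTH = [
    ("CSA06", 25),
    ("now (time of talk)", 30),
    ("CSA07 goal", 50),
    ("data taking", 100),
]


def growth_factors(series):
    """Ratio of each stage's rate to the previous stage's rate."""
    return [
        (label, rate / prev_rate)
        for (_, prev_rate), (label, rate) in zip(series, series[1:])
    ]


if __name__ == "__main__":
    for label, factor in growth_factors(PRODUCTION_M_PER_MONTH):
        print(f"{label}: x{factor:.2f} over the previous stage")
```

Read this way, production has to roughly double twice between CSA06 and steady data taking, with CSA07 as the intermediate step.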

Analysis submission

  • The Tier-2s are expected to support communities

    • Either local groups or regions of interest

    • Only implemented in a couple of specific communities

  • Unlike Tier-1 data subscriptions and processing expectations, which are largely specified by the experiment centrally, the Tier-2s have control over the data and the activity

    CMS will work to improve the reliability and availability of the Tier-2 centers

Tier-2 Analysis goals in 2007

Tier-2s are the primary analysis resource controlled by physicists

  • The activities are intended to be controlled by user communities

    Up to now most of the analysis has been hosted at the Tier-1 sites

    CMS will enlarge analysis support by hosting important physics samples exclusively at Tier-2 centers

  • We have roughly 10-15 sites that have sufficient disk and CPU resources to support multiple datasets

    • Skims in CSA06 were about 500 GB

    • The largest of the raw samples was ~8 TB

  • Force the migration of analysis to Tier-2s by hosting data at Tier-2s

Transition to operations in 2007, Goals

We plan to measure the transition to operations with concrete metrics

Site availability: SAM tests (Site Availability Monitor)

  • Put CMS functions in the site functional testing

    • Analysis submissions

    • Production

    • Frontier

    • Data Transfer

  • Measure the site availability

  • The WLCG goal for the Tier-1s in early 2007 is 90%

    • We should establish a goal for Tier-2s; 80% seems reasonable

  • Goals for summer of 07 would be 95% and 90% respectively
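The availability goals above can be turned into a simple pass/fail check. The sketch below is illustrative only: it assumes availability is the fraction of passed functional tests over a period, which may differ from how SAM actually computes it, and the test results are made-up placeholder data. The Tier-2 figure for early 2007 is the goal proposed on the slide, not an established one.

```python
# Availability targets from the slide, as fractions.
TARGETS = {
    "early 2007": {"Tier-1": 0.90, "Tier-2": 0.80},   # Tier-2 value is proposed
    "summer 2007": {"Tier-1": 0.95, "Tier-2": 0.90},
}


def availability(results):
    """Fraction of passed tests (assumed definition, not the SAM algorithm)."""
    if not results:
        raise ValueError("no test results")
    return sum(results) / len(results)


def meets_target(tier, results, period):
    """True if the measured availability reaches the target for that tier."""
    return availability(results) >= TARGETS[period][tier]


if __name__ == "__main__":
    # Hypothetical site: 9 of 10 functional tests passed.
    results = [True] * 9 + [False]
    for period in TARGETS:
        ok = meets_target("Tier-1", results, period)
        print(f"{period}: {availability(results):.0%} -> {'ok' if ok else 'below target'}")
```

A site passing 9 of 10 tests would meet the early-2007 Tier-1 goal but fall short of the summer goal.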

Prepare CMS for Analysis: Summary

  • 2006 was a very successful year for CMS software and computing

  • 2007 promises to be a very busy year for Computing and Offline

  • Commissioning and integration remain major tasks in 2007

    • Balancing the needs of physics, computing, and detector will be a logistics challenge

  • Transition to Operations has started; data operations group formed

  • Facilities will be ramping up resources to be ready for pilot run and the 2008 physics run

  • An increased number of CMS people will be involved in the facilities, commissioning and operations to prepare for CMS analysis

