USCMS Tier-2 in Wisconsin

  • UW High Energy Physics

    • Dan Bradley

    • Sridhara Dasu

    • Vivek Puttabuddhi

    • Steve Rader

    • Don Reeder

    • Wesley Smith

  • UW Computer Science

    • Miron Livny

    • Sean Murphy

    • Erik Paulson

    • Alain Roy

    • + The Condor Team

Users of Wisconsin Tier-2

  • Focus has been on trigger studies and datasets more easily produced outside of official production channels.

  • CMS Users come from institutions world-wide

  • Log in to use our resources

Datasets in UW-HEP dCache

  • At this time, only locally simulated data:

    11 TB

    39 datasets

    360k files
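As a rough sanity check on these figures, the averages they imply can be worked out directly. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and treats the rounded slide figures as exact, so the results are approximate.

```python
# Figures quoted on the slide (rounded values, treated here as exact).
TOTAL_BYTES = 11e12   # 11 TB, assuming decimal terabytes
N_DATASETS = 39
N_FILES = 360_000

avg_file_mb = TOTAL_BYTES / N_FILES / 1e6        # mean file size in MB
files_per_dataset = N_FILES / N_DATASETS         # mean number of files per dataset
avg_dataset_gb = TOTAL_BYTES / N_DATASETS / 1e9  # mean dataset size in GB

print(f"average file size:    {avg_file_mb:.1f} MB")
print(f"files per dataset:    {files_per_dataset:.0f}")
print(f"average dataset size: {avg_dataset_gb:.0f} GB")
```

Under these assumptions the store averages roughly 30 MB per file and about 9,200 files per dataset.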

Reconstructed Events

Digitized Local Datasets

Production History for Local and Official Digitization

Analysis Objects

  • Primary analysis job is L1 Trigger Ntuple maker


Campus Condor Flocks

GLOW Equipment @ HEP

  • GLOW equipment in 3 racks:

  • Storage Servers:

These are in addition to the older 70-node set of 2.4 GHz Xeon CPU servers and the 10 TB RAID5 servers.

Condor Configuration

  • UW-HEP

    • Peaceful preemption of resource claims is achieved with MaxJobRetirementTime = 4 days

    • This requires Condor 6.7.x, which is still a development branch.


    • Each group has highest priority on a fixed set of machines. (Achieved with machine RANK)

    • Wish list: hierarchical matchmaking so groups can divide resources internally.

    • Idle machines are distributed via Condor’s usual fair sharing algorithm (with preemption).

      • Bulk of our resources are used by others on GLOW

    • Some groups use long-running job slots so their jobs are suspended instead of being killed during preemption. Others use Condor’s checkpoint libs.
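The interplay between machine RANK and the retirement window can be pictured with a toy model. This is a sketch under stated assumptions, not Condor's actual matchmaking: `RunningJob`, `machine_rank`, and `resolve_preemption` are hypothetical names, the RANK expression is reduced to "prefer the owning group", the retirement window is assumed to be counted from the job's start time, and fair-share preemption between users is ignored.

```python
from dataclasses import dataclass

# MaxJobRetirementTime from the slide, expressed in seconds.
FOUR_DAYS = 4 * 24 * 60 * 60


@dataclass
class RunningJob:
    group: str        # group that submitted the job
    started_at: int   # time the job started, in seconds
    remaining: int    # seconds of work left when the challenger arrives (hypothetical input)


def machine_rank(owner_group: str, job_group: str) -> int:
    """Toy stand-in for a machine RANK expression: prefer the owning group."""
    return 1 if job_group == owner_group else 0


def resolve_preemption(owner_group, running, challenger_group, now,
                       max_retirement=FOUR_DAYS):
    """Return (wait_seconds, killed) for the challenger, or None if it cannot preempt."""
    if machine_rank(owner_group, challenger_group) <= machine_rank(owner_group, running.group):
        return None  # the machine does not prefer the challenger
    # Assumption: the retirement window is counted from the job's start time.
    deadline = running.started_at + max_retirement
    natural_end = now + running.remaining
    if natural_end <= deadline:
        return (natural_end - now, False)  # job finishes on its own: peaceful
    return (max(deadline - now, 0), True)  # window expires first: job is evicted


# A guest job one day into its run with one hour left; the owning group arrives.
guest = RunningJob(group="other", started_at=0, remaining=3600)
print(resolve_preemption("hep", guest, "hep", now=86_400))  # (3600, False)
```

In this model a four-day window means most guest jobs finish before the owning group takes the machine back, at the cost of the owner possibly waiting for the remainder of the window.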

Grid Services

  • Currently Grid3-based.

    • Gatekeeper:

    • Overflow jobs flock to GLOW and CS

    • All compute nodes are currently RHEL3-compatible, but we cannot rely on a common Linux version in the future.

    • Recently solved several stability problems and have been sustaining modest load (100-200 running jobs) with no problems.

    • AFS provides cross-campus shared filesystem for grid jobs.

  • Will upgrade to OSG in mid to late May.
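The overflow behaviour mentioned on this slide can be illustrated with a small model. It is a simplification that assumes jobs fill the local pool first and then spill into the flock targets in the order listed. The function name `place_jobs` and all slot counts are hypothetical, and real Condor flocking is driven by matchmaking rather than a simple count.

```python
def place_jobs(n_jobs, pools):
    """Distribute n_jobs over an ordered list of (pool_name, free_slots).

    The first pool is the local one; later pools receive only the overflow.
    Jobs that fit nowhere are reported under the key "idle".
    """
    placement = {}
    remaining = n_jobs
    for name, free_slots in pools:
        taken = min(remaining, free_slots)
        placement[name] = taken
        remaining -= taken
    placement["idle"] = remaining
    return placement


# Hypothetical free-slot counts, for illustration only.
pools = [("UW-HEP", 200), ("GLOW", 250), ("CS", 100)]
print(place_jobs(500, pools))
# {'UW-HEP': 200, 'GLOW': 250, 'CS': 50, 'idle': 0}
```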

dCache Issues

  • Our SRM service is not yet functional

    • It used to work in RPM 1.2.2-4, so something went wrong in our upgrade to 1.2.2-7-3.

    • Interim solution has been 3rd party gridftp to Fermilab, but this service has degraded compared to what used to be available via (20 MB/s sustained vs. negligible).

  • Problems scaling digi with PU ( DSTs)

    • CPU usage drops to 0 at around 100 digi jobs and degrades badly at even a fraction of that.

    • xrootd running on the same hardware and accessing the same files sustains 280 jobs without noticeable CPU fall-off.

    • dCache was moving 70 times as much data per event!? However, we can’t get it to scale to a level where pool nodes are load bound anyway.

    • Clearly we have something badly tuned!

Internal Services

  • Extensive Nagios and NRG-based monitoring



    • e.g. alerts about degraded RAIDs, dCache pool services, load on gatekeeper, temperature in machine room, etc.

  • System software managed by kickstart + cfengine

    • Has worked well for us, but it puts us at odds with the dream of a common ROCKS-based solution.

  • JugMaster data production


    • Basic idea: “persistent DAG in a database”

    • Provides highly scalable queue that may in turn submit jobs to multiple Condor schedds in a fault-tolerant way.

    • This service will become less important as MCPS is used.
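The "persistent DAG in a database" idea can be sketched with SQLite. This is not JugMaster's actual schema or code: the table layout, the function names, and the node names (`generate`, `digitize`, `ntuple`) are all assumptions chosen for illustration. The point is that dependency state lives in the database, so the set of runnable nodes can be recomputed after a crash or restart.

```python
import sqlite3

# Hypothetical schema: one row per DAG node, one row per dependency edge.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node (
    name  TEXT PRIMARY KEY,
    state TEXT NOT NULL DEFAULT 'waiting'   -- waiting | done
);
CREATE TABLE IF NOT EXISTS edge (
    parent TEXT NOT NULL REFERENCES node(name),
    child  TEXT NOT NULL REFERENCES node(name),
    PRIMARY KEY (parent, child)
);
"""


def open_dag(path=":memory:"):
    """Open (or create) the DAG store; pass a file path to make it persistent."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_node(db, name, parents=()):
    """Insert a node together with the edges from its parents."""
    db.execute("INSERT INTO node (name) VALUES (?)", (name,))
    db.executemany("INSERT INTO edge (parent, child) VALUES (?, ?)",
                   [(parent, name) for parent in parents])
    db.commit()


def mark_done(db, name):
    db.execute("UPDATE node SET state = 'done' WHERE name = ?", (name,))
    db.commit()


def ready_nodes(db):
    """Waiting nodes whose parents are all done, i.e. what could be submitted next."""
    rows = db.execute("""
        SELECT n.name FROM node n
        WHERE n.state = 'waiting'
          AND NOT EXISTS (
              SELECT 1 FROM edge e JOIN node p ON p.name = e.parent
              WHERE e.child = n.name AND p.state != 'done')
        ORDER BY n.name
    """).fetchall()
    return [row[0] for row in rows]


db = open_dag()
add_node(db, "generate")
add_node(db, "digitize", parents=["generate"])
add_node(db, "ntuple", parents=["digitize"])
print(ready_nodes(db))   # ['generate']
mark_done(db, "generate")
print(ready_nodes(db))   # ['digitize']
```

A submitter process could poll `ready_nodes` and hand each result to whichever schedd is available. Because the state is committed on every change, restarting the submitter loses no progress.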

Immediate TODO list

  • Today we have only a taste of Tier-2-like operations, with ~17 users.

    • We need to understand why some users are less active. Is it something we can fix?

    • We must solve our SRM deployment issues and get connected to the PhEDEx network.

      • Ability to export/import official data

    • Scale grid-based simulation production to the levels previously sustained (>500 simultaneous jobs)

      • Resume locally managed production if necessary (i.e. if Craig is too busy)

Tier-2 Recruiting/Purchasing

  • Crucial item is to fully staff Tier-2

    • System Manager: S. Rader 50% leveraged from UW DOE and 50% Tier-2 appointment

      • Use savings to hire a technical assistant

      • Good experience in the past: Raj, Iyer, Vivek

    • Operations: new person recruited

      • We will train this person to lead operations

        • Physics analysis jobs and simulation support

    • We will begin recruitment of a software developer to work on DISUN issues with Condor team

  • Equipment

    • Must demonstrate that we can efficiently and fully utilize the resources we have before making new purchases

    • Later this year we will commission a new server room for Tier-2 equipment

      • Expect new equipment purchases in Fall

Related Activities

  • Trigger Fault Studies

    • Working with Knowledge Management group: B. Chen, L. Chen, R. Ramakrishnan

    • Trying to automatically detect when trigger behavior changes unexpectedly.

    • Can identify some possible causes.

  • Rapid-response Adaptive Computing Environment: RACE

    • Support burst computations for high priority tasks.

    • Want to claim full UW campus grid on short notice, ~7000K SI2000 in 2007.

    • Provide Glidein-style workspace.


  • VDT

    • Foundation for Grid3 and OSG

    • Integrating Globus, Condor, MonALISA, and many others.

  • NSF Middleware Initiative (NMI)

    • Synergy with VDT.

    • Common build and testing system for many platforms

  • Condor

    • Condor-G, Condor-C, DAGMan