
USCMS Tier-2 in Wisconsin

  • UW High Energy Physics

  • Dan Bradley

  • Sridhara Dasu

  • Vivek Puttabuddhi

  • Steve Rader

  • Don Reeder

  • Wesley Smith

  • UW Computer Science

  • Miron Livny

  • Sean Murphy

  • Erik Paulson

  • Alain Roy

  • + The Condor Team



Users of Wisconsin Tier-2

  • Focus has been on trigger studies and on datasets that are more easily produced outside of official production channels.

  • CMS users come from institutions worldwide.

  • They log in to use our resources.



Datasets in UW-HEP dCache

  • At this time, only locally simulated data:

    11 TB

    39 datasets

    360k files

http://www.hep.wisc.edu/cgi-bin/cms/CMSJug.cgi



Reconstructed Events

[Charts: Digitized Local Datasets; Production History for Local and Official Digitization]



Analysis Objects

  • Primary analysis job is L1 Trigger Ntuple maker



Condor

Campus Condor Flocks
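Jobs that cannot be matched locally overflow ("flock") to the other campus pools. Below is a minimal sketch, in Condor configuration syntax, of the knobs involved; the hostnames are illustrative placeholders, not our actual pool machines.

    # Submit-side pool: let idle jobs flock to the other campus Condor pools.
    # Central-manager hostnames below are placeholders.
    FLOCK_TO = glow-cm.example.wisc.edu, cs-cm.example.wisc.edu

    # Execute-side pools list the submit machines they accept flocked jobs from.
    FLOCK_FROM = submit.example.wisc.edu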



GLOW Equipment @ HEP

  • GLOW equipment in 3 racks:

  • Storage Servers:

These are in addition to the older 70-node 2.4 GHz Xeon CPU servers and 10 TB of RAID5 storage servers.



Condor Configuration

  • UW-HEP

    • Peaceful preemption of resource claims is achieved with MaxJobRetirementTime = 4 days (see the configuration sketch after this list).

    • This requires Condor 6.7.x, which is still a development branch.

  • GLOW

    • Each group has highest priority on a fixed set of machines. (Achieved with machine RANK)

    • Wish list: hierarchical matchmaking so groups can divide resources internally.

    • Idle machines are distributed via Condor’s usual fair sharing algorithm (with preemption).

      • Bulk of our resources are used by others on GLOW

    • Some groups use long-running job slots so their jobs are suspended instead of being killed during preemption. Others use Condor’s checkpointing libraries.
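A minimal sketch of the policy described above, in Condor configuration syntax. The owner name in the RANK expression is illustrative only; the actual GLOW configuration assigns each group priority on its own set of machines.

    # Peaceful preemption: a claimed job gets up to 4 days (in seconds) to
    # finish before the claim is finally preempted (requires Condor 6.7.x).
    MaxJobRetirementTime = 4 * 24 * 60 * 60

    # Machine RANK gives one group top priority on its own machines; a job
    # with a higher RANK preempts lower-ranked claims.  "cmsprod" is an
    # illustrative owner name.
    RANK = (Owner == "cmsprod")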



Grid Services

  • Currently Grid3-based.

    • Gatekeeper: cmsgrid.hep.wisc.edu (a sample submit description targeting it follows this list)

    • Overflow jobs flock to GLOW and CS

    • All compute nodes are currently RHEL3-compatible, but we cannot rely on a common Linux version in the future.

    • We recently solved several stability problems and have been sustaining a modest load (100-200 running jobs) without trouble.

    • AFS provides cross-campus shared filesystem for grid jobs.

    • Will upgrade to OSG in mid-to-late May.
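For illustration, a sketch of a Condor-G submit description that routes a job through the gatekeeper named above; the jobmanager name and executable are assumptions, not our production setup.

    # Hypothetical Condor-G submit description targeting the Grid3 gatekeeper.
    universe        = globus
    globusscheduler = cmsgrid.hep.wisc.edu/jobmanager-condor
    executable      = run_analysis.sh
    output          = job.out
    error           = job.err
    log             = job.log
    queue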



dCache Issues

  • Our SRM service is not yet functional

    • It used to work in RPM 1.2.2-4, so something went wrong in our upgrade to 1.2.2-7-3.

    • The interim solution has been third-party GridFTP transfers to Fermilab, but this service has degraded compared with what used to be available via cmsgridftp.fnal.gov (20 MB/s sustained then vs. negligible now).

  • Problems scaling digitization with pile-up (DSTs)

    • CPU usage drops to zero at around 100 digitization jobs and degrades badly at even a fraction of that load.

    • xrootd running on the same hardware and accessing the same files sustains 280 jobs without noticeable CPU fall-off.

    • dCache was moving 70 times as much data per event!? In any case, we cannot get it to scale to a level where the pool nodes are load-bound.

    • Clearly we have something badly tuned!



Internal Services

  • Extensive Nagios and NRG-based monitoring

    • http://noc.hep.wisc.edu/nagios/

    • http://noc.hep.wisc.edu/nrg/

    • e.g. alerts about degraded RAIDs, dCache pool services, load on gatekeeper, temperature in machine room, etc.

  • System software managed by kickstart + cfengine

    • Has worked well for us, but it puts us at odds with the dream of a common ROCKS-based solution.

  • JugMaster data production

    • http://www.hep.wisc.edu/cgi-bin/cms/CMSJug.cgi

    • Basic idea: “persistent DAG in a database”

    • Provides a highly scalable queue that can in turn submit jobs to multiple Condor schedds in a fault-tolerant way.

    • This service will become less important as MCPS comes into use.



Immediate TODO list

  • Today we have only a taste of Tier-2-like operations, with ~17 users.

    • We need to understand why some users are less active. Is it something we can fix?

    • We must solve our SRM deployment issues and get connected to the PhEDEx network.

      • Ability to export/import official data

    • Scale grid-based simulation production to the levels previously sustained (>500 simultaneous jobs).

      • Resume locally managed production if necessary (i.e. if Craig is too busy)



Tier-2 Recruiting/Purchasing

  • The crucial item is to fully staff the Tier-2

    • System Manager: S. Rader, 50% leveraged from UW DOE and 50% Tier-2 appointment

      • Use savings to hire a technical assistant

      • Good experience in the past: Raj, Iyer, Vivek

    • Operations: new person recruited

      • We will train this person to lead operations

        • Physics analysis jobs and simulation support

    • We will begin recruiting a software developer to work on DISUN issues with the Condor team

  • Equipment

    • We must demonstrate that we can efficiently and fully utilize the resources we have before making new purchases

    • Later this year we will commission a new server room for Tier-2 equipment

      • Expect new equipment purchases in Fall



Related Activities

  • Trigger Fault Studies

    • Working with Knowledge Management group: B. Chen, L. Chen, R. Ramakrishnan

    • Trying to automatically detect when trigger behavior changes unexpectedly.

    • Can identify some possible causes.

  • Rapid-response Adaptive Computing Environment: RACE

    • Support burst computations for high-priority tasks.

    • Want to be able to claim the full UW campus grid on short notice (~7000k SI2000 in 2007).

    • Provide Glidein-style workspace.



Middleware

  • VDT

    • Foundation for Grid3 and OSG

    • Integrating Globus, Condor, MonALISA, and many others.

  • NMI (NSF Middleware Initiative)

    • Synergy with VDT.

    • Common build and testing system for many platforms

  • Condor

    • Condor-G, Condor-C, DAGMan

