
US CMS Testbed



Presentation Transcript


  1. US CMS Testbed

  2. Large Hadron Collider • Supercollider on French-Swiss border • Under construction, completion in 2006. (Based on slide by Scott Koranda at NCSA)

  3. Compact Muon Solenoid • Detector / experiment for the LHC • Search for the Higgs boson, other fundamental physics

  4. Still Under Development • Developing software to process enormous amount of data generated • For testing and prototyping, the detector is being simulated now • Simulating events (particle collisions) • We’re involved in the United States portion of the effort

  5. Storage and Computational Requirements • Simulating and reconstructing millions of events per year, batches of around 150,000 (about 10 CPU months) • Each event requires about 3 minutes of processor time • A single run will generate about 300 GB of data
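The "10 CPU months" figure follows directly from the per-event cost quoted above; a quick back-of-the-envelope check (plain Python, not part of the original talk):

```python
# Sanity check of the slide's numbers: 150,000 events per batch
# at about 3 CPU-minutes each.
events_per_batch = 150_000
minutes_per_event = 3

cpu_minutes = events_per_batch * minutes_per_event   # 450,000 CPU-minutes
cpu_hours = cpu_minutes / 60                         # 7,500 CPU-hours
cpu_months = cpu_hours / (30 * 24)                   # ~10.4 CPU-months

print(f"{cpu_hours:.0f} CPU-hours = {cpu_months:.1f} CPU-months")
```

which matches the "about 10 CPU months" per 150,000-event batch on the slide.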

  6. Before Condor-G and Globus • Runs are hand assigned to individual sites • Manpower intensive to organize run distribution and collect results • Each site has staff managing their runs • Manpower intensive to monitor jobs, CPU availability, disk space, etc.

  7. Before Condor-G and Globus • Use existing tool (MCRunJob) to manage tasks • Not “Grid-Aware” • Expects reliable batch system

  8. UW High Energy Physics: A special case • Was a site being assigned runs • Modified local configuration to flock to UW Computer Science Condor pool • When possible used standard universe to increase available computers • During one week used 30,000 CPU hours.
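Flocking of this kind is enabled in the Condor configuration on both sides; a minimal sketch (the hostnames are illustrative, not the actual UW machine names):

```
# On the HEP submit machine: pools to flock to when local
# resources are busy
FLOCK_TO = condor.cs.wisc.edu

# On the CS pool's central manager: submit machines allowed
# to flock in
FLOCK_FROM = hep-submit.physics.wisc.edu
```

The standard universe mentioned on the slide additionally gives checkpointing and remote system calls, which is what made the extra, non-dedicated machines safe to use.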

  9. Our Goal • Move the work onto “the Grid” using Globus and Condor-G

  10. Why the Grid? • Centralize management of simulation work • Reduce manpower at individual sites

  11. Why Condor-G? • Monitors and manages tasks • Reliability in unreliable world

  12. Lessons Learned • The grid will fail • Design for recovery

  13. The Grid Will Fail • The grid is complex • The grid is new and untested • Often beta, alpha, or prototype. • The public Internet is out of your control • Remote sites are out of your control

  14. The Grid is Complex • Our system has 16 layers • A minimal Globus/Condor-G system has 9 layers • Most layers stable and transparent • MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job

  15. Design for Recovery • Provide recovery at multiple levels to minimize lost work • Be able to start a particular task over from scratch if necessary • Never assume that a particular step will succeed • Allocate lots of debugging time

  16. Now • Single master site sends jobs to distributed worker sites. • Individual sites provide configured Globus node and batch system • 300+ CPUs across a dozen sites. • Condor-G acts as reliable batch system and Grid front end

  17. How? MOP. • Monte Carlo Distributed Production System • Pretends to be local batch system for MCRunJob • Repackages jobs to run on a remote site

  18. CMS Testbed Big Picture • [Architecture diagram: on the master site, MCRunJob feeds MOP, which hands jobs to DAGMan and Condor-G; Condor-G submits via Globus to each worker site, where Condor runs the real work]

  19. DAGMan, Condor-G, Globus, Condor • DAGMan - Manages dependencies • Condor-G - Monitors the job on master site • Globus - Sends jobs to remote site • Condor - Manages job and computers at remote site
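A Condor-G job of that era was described by an ordinary Condor submit file using the Globus universe; a hedged sketch (the gatekeeper host, jobmanager name, and file names are invented for illustration):

```
universe        = globus
globusscheduler = gatekeeper.worker-site.example/jobmanager-condor
executable      = mop_wrapper.sh
output          = job.out
error           = job.err
log             = job.log
queue
```

Condor-G then tracks the job through the Globus gatekeeper exactly as a local Condor job would be tracked through a local schedd, which is what lets it act as the reliable batch front end described above.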

  20. Recovery: Condor • Automatically recovers from machine and network problems on the execute cluster

  21. Recovery: Condor-G • Automatically monitors for and retries a number of possibly transient errors • Recovers from a down master site, down worker sites, and a down network • After a network outage, can reconnect to still-running jobs

  22. Recovery: DAGMan • If a particular task fails permanently, notes it and allows easy retry • Can retry automatically; we don't
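DAGMan's behavior here is controlled per node in the DAG input file; a small sketch (file and node names are invented):

```
# simulate.dag -- one simulation step, then collect the results
JOB simulate simulate.sub
JOB collect  collect.sub
PARENT simulate CHILD collect

# DAGMan *could* retry a failed node automatically, e.g.:
#   RETRY simulate 3
# As the slide notes, the testbed instead left retries to a
# human, resubmitting failed nodes via the rescue DAG that
# DAGMan writes out on failure.
```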

  23. Globus • Globus software under rapid development • Use old software and miss important updates • Use new software and deal with version incompatibilities

  24. Fall of 2002: First Test • Our first run gave us two weeks to do about 10 days of work (given available CPUs at the time) • We had problems: a power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, Globus failures

  25. It Worked! • The system recovered automatically from many problems • Relatively low human intervention: approximately one full-time person

  26. Since Then • Improved automatic recovery for more situations • Generated 1.5 million events (about 30 CPU years) in just a few months • Currently gearing up for even larger runs starting this summer

  27. Future Work • Expanding the grid with more machines • Use Condor-G's scheduling capabilities to automatically assign jobs to sites • Officially replace the previous system this summer

  28. Thank You! • http://www.cs.wisc.edu/condor • adesmet@cs.wisc.edu
