US CMS Software and Computing Project US CMS Collaboration Meeting at FSU, May 2002


Presentation Transcript


  1. US CMS Software and Computing Project • US CMS Collaboration Meeting at FSU, May 2002 • Lothar A T Bauerdick/Fermilab, Project Manager

  2. Scope and Deliverables • Provide Computing Infrastructure in the U.S. — that needs R&D • Provide software engineering support for CMS • Mission is to develop and build “User Facilities” for CMS physics in the U.S. • To provide the enabling IT infrastructure that will allow U.S. physicists to fully participate in the physics program of CMS • To provide the U.S. share of the framework and infrastructure software • Tier-1 center at Fermilab provides computing resources and support • User Support for the “CMS physics community”, e.g. software distribution, help desk • Support for Tier-2 centers, and for the physics analysis center at Fermilab • Five Tier-2 centers in the U.S. • Together they will provide the same CPU/disk resources as the Tier-1 • Facilitate “involvement of collaboration” in S&C development • Prototyping and test-bed effort very successful • Universities will “bid” to host a Tier-2 center • taking advantage of existing resources and expertise • Tier-2 centers to be funded through NSF program for “empowering Universities” • Proposal to the NSF submitted Nov 2001

  3. Project Milestones and Schedules • Prototyping, test-beds, R&D started in 2000: “Developing the LHC Computing Grid” in the U.S. • R&D systems, funded in FY2002 and FY2003 • Used for “5% data challenge” (end 2003) → release Software and Computing TDR (technical design report) • Prototype T1/T2 systems, funded in FY2004 • for “20% data challenge” (end 2004) → end “Phase 1”, Regional Center TDR, start deployment • Deployment: 2005-2007, 30%, 30%, 40% costs • Fully Functional Tier-1/2 funded in FY2005 through FY2007 • ready for LHC physics run → start of Physics Program • S&C Maintenance and Operations: 2007 on

  4. US CMS S&C Since UCR • Consolidation of the project, shaping up the R&D program • Project Baselined in Nov 2001: Workplan for CAS, UF, Grids endorsed • CMS has become “lead experiment” for Grid work → Koen, Greg, Rick • US Grid Projects PPDG, GriPhyN and iVDGL • EU Grid Projects DataGrid, DataTAG • LHC Computing Grid Project • Fermilab UF team, Tier-2 prototypes, US CMS testbed • Major production efforts, PRS support • Objectivity goes, LCG comes • We do have a working software and computing system! → Physics Analysis • CCS will drive much of the common LCG Application Area • Major challenges to manage and execute the project • Since fall 2001 we knew the LHC start would be delayed → new date April 2007 • Proposal to NSF in Oct 2001, things are probably moving now • New DOE funding guidance (and lack thereof from NSF) is starving us in 2002-2004 • Very strong support for the Project from individuals in CMS, Fermilab, Grids, FA

  5. Other New Developments • NSF proposal guidance AND DOE guidance are combined (S&C+M&O) • That prompted a change in US CMS line management: the Program Manager will oversee both the Construction Project and the S&C Project • New DOE guidance for S&C+M&O is much below S&C baseline + M&O request • Europeans have achieved major UF funding, significantly larger relative to the U.S. • LCG started, expects U.S. to partner with European projects • LCG Application Area possibly imposes issues on the CAS structure • Many developments and changes that invalidate or challenge much of what the PM tried to achieve • Opportunity to take stock of where we stand in US CMS S&C before we try to understand where we need to go

  6. Vivian has left S&C • Thanks and appreciation for Vivian’s work of bringing the UF project to the successful baseline • New scientist position opened at Fermilab for UF L2 manager and physics! • Other assignments • Hans Wenzel, Tier-1 Manager • Jorge Rodrigez, U.Florida pT2 L3 manager • Greg Graham, CMS GIT Production Task Lead • Rick Cavenaugh, US CMS Testbed Coordinator

  7. Project Status • User Facilities status and successes: • US CMS prototype systems: Tier-1, Tier-2, testbed • Intense collaboration with US Grid projects, Grid-enabled MC production system • User Support: facilities, software, operations for PRS studies • Core Application Software status and successes: • See Ian’s talk • Project Office started • Project Engineer hired, to work on WBS, Schedule, Budget, Reporting, Documenting • SOWs in place w/ CAS Universities — MOUs, subcontracts, invoicing are coming • In process of signing the MOUs • Have a draft of MOU with iVDGL on prototype Tier-2 funding

  8. Successful Base-lining Review • “The Committee endorses the proposed project scope, schedule, budgets and management plan” • Endorsement for the “scrubbed” project plan following the DOE/NSF guidance: $3.5M DOE + $2M NSF in FY2003 and $5.5M DOE + $3M NSF in FY2004!

  9. CMS Produced Data in 2001 • Simulated Events, TOTAL = 8.4 M: Caltech 2.50 M, FNAL 1.65 M, Bristol/RAL 1.27 M, CERN 1.10 M, INFN 0.76 M, Moscow 0.43 M, IN2P3 0.31 M, Helsinki 0.13 M, Wisconsin 0.07 M, UCSD 0.06 M, UFL 0.05 M • Reconstructed w/ Pile-Up, TOTAL = 29 TB: CERN 14 TB, FNAL 12 TB, Caltech 0.60 TB, Moscow 0.45 TB, INFN 0.40 TB, Bristol/RAL 0.22 TB, UCSD 0.20 TB, IN2P3 0.10 TB, Wisconsin 0.05 TB, Helsinki –, UFL 0.08 TB • These fully simulated data samples are essential for physics and trigger studies → Technical Design Report for DAQ and Higher Level Triggers • TYPICAL EVENT SIZES • Simulated: 1 CMSIM event = 1 OOHit event = 1.4 MB • Reconstructed: 1 “10^33” event = 1.2 MB, 1 “2x10^33” event = 1.6 MB, 1 “10^34” event = 5.6 MB
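As a rough cross-check of the volumes above, the per-event sizes and the 8.4 M event total can simply be multiplied out. A minimal Python sketch; the single-luminosity assumption for the reconstructed sample is mine, for illustration only:

```python
# Rough cross-check of the 2001 data volumes, using the per-event sizes
# quoted on this slide. The single-luminosity assumption for the
# reconstructed sample is only for illustration.

SIM_EVENTS = 8.4e6            # fully simulated events in 2001
SIM_MB = 1.4                  # 1 CMSIM event = 1 OOHit event = 1.4 MB
RECO_MB = {"1e33": 1.2, "2e33": 1.6, "1e34": 5.6}

print("Simulated volume: %.1f TB" % (SIM_EVENTS * SIM_MB / 1e6))
for lumi, mb in RECO_MB.items():
    print("If all reconstructed at %s: %.1f TB" % (lumi, SIM_EVENTS * mb / 1e6))
# The quoted 29 TB reconstructed total reflects a mix of samples and luminosities.
```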

  10. Production Operations • Production Efforts are Manpower intensive! • Fermilab Tier-1 Production Operations: ∑ = 1.7 FTE, a sustained effort to fill those 8 roles + the system support people that need to help if something goes wrong!!! • At Fermilab (US CMS, PPDG) • Greg Graham, Shafqat Aziz, Yujun Wu, Moacyr Souza, Hans Wenzel, Michael Ernst, Shahzad Muzaffar + staff • At U Florida (GriPhyN, iVDGL) • Dimitri Bourilkov, Jorge Rodrigez, Rick Cavenaugh + staff • At Caltech (GriPhyN, PPDG, iVDGL, USCMS) • Vladimir Litvin, Suresh Singh et al • At UCSD (PPDG, iVDGL) • Ian Fisk, James Letts + staff • At Wisconsin • Pam Chumney, R. Gowrishankara, David Mulvihill + Peter Couvares, Alain Roy et al • At CERN (USCMS) • Tony Wildish + many

  11. US CMS Prototypes and Test-beds • Tier-1 and Tier-2 Prototypes and Test-beds operational • Facilities for event simulation, including reconstruction • Sophisticated processing for pile-up simulation • User cluster and hosting of data samples for physics studies • Facilities and Grid R&D

  12. Tier-1 Equipment: Chimichanga, Chocolat, Chalupa, Winchester RAID, IBM servers, snickers, Dell servers, CMSUN1

  13. Tier-1 Equipment: frys (user), gyoza (test), popcorns (MC production)

  14. Using the Tier-1 System: User System • Until the Grid becomes reality (maybe soon!) people who want to use computing facilities at Fermilab need to obtain an account • That requires registration as a Fermilab user (DOE requirement) • We will make sure that turn-around times are reasonably short; have not heard complaints yet • Go to http://computing.fnal.gov/cms/ and click on the "CMS Account" button that will guide you through the process • Step 1: Get a Valid Fermilab ID • Step 2: Get a fnalu account and CMS account • Step 3: Get a Kerberos principal and crypto card • Step 4: Information for first-time CMS account users http://consult.cern.ch/writeup/form01/ • Got > 100 users, currently about 1 new user per week

  15. US CMS User Cluster • To be released June 2002! • nTuple, Objy analysis etc. • R&D on “reliable i/a service”: OS: Mosix? batch system: FBSNG? storage: disk farm? • Hardware (diagram): FRY1-FRY8 nodes, BIGMAC switch, 100 Mbps and GigaBit links, SCSI 160, 250 GB RAID

  16. User Access to Tier-1 Data • Hosting of Jets/Met data • Muons will be coming soon • Setup (diagram): 1 TB IDE RAID, AMD server with AMD/Enstore interface serving Objects, Enstore STKEN silo (> 10 TB), Snickers, network, users • Working on providing a powerful disk cache • Host redirection protocol allows adding more servers --> scaling + load balancing (see the sketch below)
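To illustrate the host-redirection idea mentioned above (a front-end answers each request with the address of a back-end data server, so capacity grows by adding servers), here is a minimal sketch. The server names and the least-loaded policy are invented for illustration, not the actual Enstore/dCache behaviour:

```python
# Minimal sketch of host redirection for a scalable disk cache:
# a front-end picks a back-end server for each request, so adding
# servers adds capacity and balances load. Names and policy are invented.

open_transfers = {"cache01": 0, "cache02": 0, "cache03": 0}  # host -> active transfers

def redirect(filename: str) -> str:
    """Return the back-end host the client should fetch `filename` from."""
    host = min(open_transfers, key=open_transfers.get)  # pick the least-loaded server
    open_transfers[host] += 1                            # account for the new transfer
    return host

for f in ["jets_met_001.db", "jets_met_002.db", "muons_001.db"]:
    print(f, "->", redirect(f))
```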

  17. US CMS T2 Prototypes and Test-beds • Tier-1 and Tier-2 Prototypes and Test-beds operational

  18. California Prototype Tier-2 Setup • UCSD and Caltech

  19. Benefits Of US Tier-2 Centers • Bring computing resources close to user communities • Provide dedicated resources to regions (of interest and geographical) • More control over localized resources, more opportunities to pursue physics goals • Leverage Additional Resources, which exist at the universities and labs • Reduce computing requirements of CERN (supposed to account for 1/3 of total LHC facilities!) • Help meet the LHC Computing Challenge • Provide diverse collection of sites, equipment, expertise for development and testing • Provide much needed computing resources • US-CMS plans for about 2 FTE at each Tier-2 site + Equipment funding • supplemented with Grid, University and Lab funds (BTW: no I/S costs in US CMS plan) • Problem: How do you run a center with only two people that will have much greater processing power than CERN has currently? • This involves facilities and operations R&D to reduce the operations personnel required to run the center, e.g. investigating cluster management software

  20. U.S. Tier-1/2 System Operational • CMS Grid Integration and Deployment on U.S. CMS Test Bed • Data Challenges and Production Runs on Tier-1/2 Prototype Systems • “Spring Production 2002” finishing: Physics, Trigger, Detector studies • Produce 10M events and 15 TB of data, also 10M minimum-bias, fully simulated including pile-up, fully reconstructed • Large assignment to U.S. CMS • Successful Production in 2001: • 8.4M events fully simulated, including pile-up, 50% in the U.S. • 29 TB of data processed, 13 TB in the U.S.

  21. US CMS Prototypes and Test-beds • All U.S. CMS S&C Institutions are involved in DOE and NSF Grid Projects • Integrating Grid software into CMS systems • Bringing CMS Production onto the Grid • Understanding the operational issues • CMS directly profits from Grid funding • Deliverables of Grid Projects become useful for LHC in the “real world” • Major successes: MOP, GDMP

  22. Grid-enabled CMS Production • Successful collaboration with Grid Projects! • MOP (Fermilab, U.Wisconsin/Condor): • Remote job execution Condor-G, DAGman • GDMP (Fermilab, European DataGrid WP2) • File replication and replica catalog (Globus) • Successfully used on CMS testbed • First real CMS Production use finishing now!
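As an illustration of the kind of workflow MOP hands to Condor-G and DAGMan, the sketch below writes a DAGMan description from Python. The particular chain of steps (stage-in, CMKIN, CMSIM, GDMP publication) and the node/submit-file names are placeholders chosen for illustration, not the actual MOP scripts:

```python
# Sketch of a MOP-like production chain expressed as a Condor DAGMan file:
# one node per step, PARENT/CHILD lines enforce the ordering. Node and
# submit-file names are placeholders for illustration.

STEPS = ["stage_in", "cmkin", "cmsim", "gdmp_publish"]

def write_dag(path: str) -> None:
    with open(path, "w") as dag:
        for step in STEPS:
            dag.write(f"JOB {step} {step}.sub\n")            # one Condor-G submit file per step
        for parent, child in zip(STEPS, STEPS[1:]):
            dag.write(f"PARENT {parent} CHILD {child}\n")    # run the steps in order

write_dag("production.dag")
# then: condor_submit_dag production.dag
```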

  23. Recent Successes with the Grid • Grid Enabled CMS Production Environment NB: MOP = “Grid-ified” IMPALA, vertically integrated CMS application • Brings together US CMS with all three US Grid Projects • PPDG: Grid developers (Condor, DAGman), GDMP (w/ WP2), • GriPhyN: VDT, in the future also the virtual data catalog • iVDGL: pT2 sites and US CMS testbed • CMS Spring 2002 production assignment of 200k events to MOP • Half-way through, next week transfer back to CERN • This is being considered a major success — for US CMS and Grids! • Many bugs in Condor and Globus found and fixed • Many operational issues that needed and still need to be sorted out • MOP will be moved into production Tier-1/Tier-2 environment

  24. Successes: Grid-enabled Production • Major Milestone for US CMS and PPDG • From PPDG internal review of MOP: • “From the Grid perspective, MOP has been outstanding. It has both legitimized the idea of using Grid tools such as DAGMAN, Condor-G, GDMP, and Globus in a real production environment outside of prototypes and trade show demonstrations. Furthermore, it has motivated the use of Grid tools such as DAGMAN, Condor-G, GDMP, and Globus in novel environments leading to the discovery of many bugs which would otherwise have prevented these tools from being taken seriously in a real production environment. • From the CMS perspective, MOP won early respect for taking on real production problems, and is soon ready to deliver real events. In fact, today or early next week we will update the RefDB at CERN which tracks production at various regional centers. This has been delayed because of the numerous bugs that, while being tracked down, involved several cycles of development and redeployment. The end of the current CMS production cycle is in three weeks, and MOP will be able to demonstrate some grid enabled production capability by then. We are confident that this will happen. It is not necessary at this stage to have a perfect MOP system for CMS Production; IMPALA also has some failover capability and we will use that where possible. However, it has been a very useful exercise and we believe that we are among the first team to tackle Globus and Condor-G in such a stringent and HEP specific environment.”

  25. Successes: File Transfers • In 2001 we were observing typical rates for large data transfers of e.g. CERN - FNAL 4.7 GB/hour • After network tuning, using Grid tools (Globus URL copy) we gained a factor of 10! • Today we are transferring 1.5 TByte of simulated data from UCSD to FNAL • at rates of 10 MByte/second! That almost saturates the network interface out of Fermilab (155 Mbps) and at UCSD (FastEthernet)… • The ability to transfer a TeraByte in a day is crucial for the Tier-1/Tier-2 system (see the arithmetic below) • Many operational issues remain to be solved • GDMP is a grid tool for file replication, developed jointly between the US and EU • “show case” application for EU Data Grid WP2: data replication • Needs more work and strong support → VDT team (PPDG, GriPhyN, iVDGL) • e.g. CMS “GDMP heartbeat” for debugging new installations and monitoring old ones • Installation and configuration issues — releases of underlying software like Globus • Issues with site security and e.g. firewalls • Uses Globus Security Infrastructure, which demands a “VO” Certification Authority infrastructure for CMS • Etc. … • This needs to be developed, tested, deployed and shows that the US CMS testbed is invaluable!
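The quoted rates translate directly into transfer times; a quick arithmetic sketch that only restates the slide's own numbers:

```python
# Back-of-the-envelope numbers for the transfer rates quoted on this slide.

rate_mb_s = 10.0                                               # sustained rate after tuning
print("Per day: %.2f TB" % (rate_mb_s * 86400 / 1e6))          # ~0.86 TB/day
print("1.5 TB UCSD -> FNAL: %.0f hours" % (1.5e6 / rate_mb_s / 3600))
print("Share of FastEthernet at UCSD: %.0f%%" % (rate_mb_s * 8 / 100 * 100))
```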

  26. DOE/NSF Grid R&D Funding for CMS

  27. Farm Setup • As long as ample storage is available, the problem scales well • Almost any computer can run the CMKIN and CMSIM steps using the CMS binary distribution system (US CMS DAR) • This step is “almost trivially” put on the Grid — almost… (see the sketch below)
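The "almost trivially" parallel structure comes from the fact that each CMKIN/CMSIM chunk needs only the DAR binaries, a run number and a random seed. A hypothetical sketch of the bookkeeping; the job-description fields below are invented for illustration:

```python
# Sketch of splitting an event-generation/simulation request into
# independent CMKIN+CMSIM jobs. Each job needs only the DAR binaries,
# a run number and a seed, which is what makes this step Grid-friendly.
# The job-description fields are invented for illustration.

def make_jobs(total_events: int, events_per_job: int, first_run: int = 1):
    jobs = []
    for i, start in enumerate(range(0, total_events, events_per_job)):
        jobs.append({
            "run": first_run + i,
            "nevents": min(events_per_job, total_events - start),
            "seed": 10_000 + first_run + i,     # distinct seed per job
        })
    return jobs

for job in make_jobs(total_events=100_000, events_per_job=500)[:3]:
    print(job)
```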

  28. e.g. on the 13.6 TF / $53M TeraGrid? • TeraGrid/DTF partners: NCSA/PACI (8 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, Argonne • Site resources: HPSS and UniTree storage, external networks • www.teragrid.org

  29. Farm Setup for Reconstruction • The first step of the reconstruction is Hit Formatting, where simulated data is taken from the Fortran files, formatted and entered into the Objectivity database. • The process is sufficiently fast and involves enough data that more than 10-20 jobs will bog down the database server (see the throttling sketch below).
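The 10-20 job limit amounts to capping the number of concurrent writers against the Objectivity server. A minimal sketch of such a throttle; the limit of 15 and the job body are assumptions, not the actual production scripts:

```python
import threading

# Minimal sketch of capping concurrent hit-formatting jobs so the
# Objectivity database server is not overloaded; the limit of 15 is
# taken from the 10-20 figure above, the job body is a placeholder.

DB_SLOTS = threading.Semaphore(15)

def hit_formatting_job(fz_file: str) -> None:
    with DB_SLOTS:                     # wait for a free database slot
        print("formatting", fz_file)   # placeholder for the real formatting step

threads = [threading.Thread(target=hit_formatting_job, args=(f"run{i}.fz",))
           for i in range(40)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```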

  30. Pile-up simulation! • Unique at LHC due to high Luminosity and short bunch-crossing time • Up to 200 “Minimum Bias” events overlaid on interesting triggers • Leads to “Pile-up” in the detectors → needs to be simulated! This makes a CPU-limited task (event simulation) VERY I/O intensive! (see the sketch below)
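To make the I/O point concrete: pile-up digitization overlays each signal event with on the order of 200 minimum-bias events drawn from a pre-simulated pool. A schematic sketch; the event representation and the Poisson choice are assumptions for illustration, not the ORCA implementation:

```python
import numpy as np

# Schematic sketch of pile-up overlay: each signal event is combined with
# a Poisson-distributed number of minimum-bias events (mean 200 at full
# luminosity, per the slide) drawn from a pre-simulated pool.

rng = np.random.default_rng(1)

def overlay(signal_event: dict, minbias_pool: list) -> dict:
    n_pu = rng.poisson(200)                           # crossings at full luminosity
    picks = rng.integers(0, len(minbias_pool), n_pu)  # reuse pool events
    signal_event["pileup"] = [minbias_pool[i] for i in picks]
    return signal_event

pool = [{"id": i} for i in range(1000)]               # stand-in minimum-bias pool
ev = overlay({"id": "signal_0"}, pool)
print(len(ev["pileup"]), "pile-up events attached")
# Reading ~200 extra events per signal event is what turns a CPU-limited
# task into a very I/O-intensive one.
```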

  31. Farm Setup for Pile-up Digitization • The most advanced production step is digitization with pile-up • The response of the detector is digitized, the physics objects are reconstructed and stored persistently, and at full luminosity 200 minimum bias events are combined with the signal events • Due to the large number of minimum bias events, multiple Objectivity AMS data servers are needed. Several configurations have been tried.

  32. Objy Server Deployment: Complex • 4 Production Federations at FNAL (uses catalog only to locate database files) • 3 FNAL servers plus several worker nodes used in this configuration • 3 federation hosts with attached RAID partitions • 2 lock servers • 4 journal servers • 9 pileup servers

  33. Example of CMS Physics Studies • Resolution studies for jet reconstruction • Full detector simulation essential to understand jet resolutions • Indispensable to design realistic triggers and understand rates at high luminosity • (Plots) QCD 2-jet events, no FSR, no pile-up, no track reconstruction, no HCAL noise vs. QCD 2-jet events with FSR, full simulation with tracks and HCAL noise

  34. Pile up & Jet Energy Resolution • Jet energy resolution • Pile-up contributions to jets are large and have large variations • Can be estimated event-by-event from the total energy in the event • Large improvement if pile-up correction applied (red curve) • e.g. 50% → 35% at ET = 40 GeV • Physics studies depend on full, detailed detector simulation → realistic pile-up processing is essential!

  35. Tutorial at UCSD • Very successful 4-day tutorial with ~40 people attending • Covering use of CMS software, including CMKIN/CMSIM, ORCA, OSCAR, IGUANA • Covering physics code examples from all PRS groups • Covering production tools and environment and Grid tools • Opportunity to get people together • UF and CAS engineers with PRS physicists • Grid developers and CMS users • The tutorials have been very well thought through and are very useful for self-study, so they will be maintained • It is amazing what we already can do with CMS software • E.g. impressive to see the IGUANA visualization environment, including “home made” visualizations • However, our system is (too?, still too?) complex • We maybe need more people to take a day off and go through the self-guided tutorials

  36. FY2002 UF Funding • Excellent initial effort and DOE support for User Facilities • Fermilab established as Tier-1 prototype and major Grid node for LHC computing • Tier-2 sites and testbeds are operational and are contributing to production and R&D • Headstart for U.S. efforts has pushed CERN commitment to support remote sites • The FY2002 funding has given major headaches to the PM • DOE funding of $2.24M was insufficient to ramp the Tier-1 to baseline size • The NSF contribution is unknown as of today • According to plan we should have more people and equipment at the Fermilab T1 • Need some 7 additional FTEs and more equipment funding • This has been strongly endorsed by the baseline reviews • All European RC (DE, FR, IT, UK, even RU!) have support at this level of effort

  37. Plans For 2002 - 2003 • Finish Spring Production challenge by June • User Cluster, User Federations • Upgrade of facilities ($300k) • Develop CMS Grid environment toward LCG Production Grid • Move CMS Grid environment from testbed to facilities • Prepare for first LCG-USUF milestone, November? • Tier-2 / iVDGL milestones w/ ATLAS, SC2002 • LCG-USUF Production Grid milestone in May 2003 • Bring Tier-1/Tier-2 prototypes up to scale • Serving the user community: user cluster, federations, Grid-enabled user environment • UF studies with persistency framework • Start of physics DCs and computing DCs • CAS: LCG “everything is on the table, but the table is not empty” • persistency framework - prototype in September 2002 • Release in July 2003 • DDD and OSCAR/Geant4 releases • New strategy for visualization / IGUANA • Develop distributed analysis environment w/ Caltech et al

  38. Funding for UF R&D Phase • There is lack of funding and lack of guidance for 2003-2005 • NSF proposal guidance AND DOE guidance are combined (S&C+M&O) • New DOE guidance for S&C+M&O is much below S&C baseline + M&O request • Fermilab USCMS projects oversight has proposed minimal M&O for 2003-2004 and large cuts for S&C given the new DOE guidance • The NSF has “ventilated the idea” of applying a rule of 81/250 × DOE funding • This would lead to very serious problems in every year of the project: we would lack 1/3 of the requested funding ($14.0M/$21.2M)

  39. DOE/NSF funding shortfall

  40. FY2003 Allocation à la Nov2001

  41. Europeans Achieved Major UF Funding • Funding for European User Facilities in their countries now looks significantly larger than UF funding in the U.S. • This statement is true relative to the size of their respective communities • It is in some cases even true in absolute terms!! • Given our funding situation: are we going to be a partner for those efforts? • BTW: USATLAS proposes major cuts in UF/Tier-1 “pilot flame” at BNL

  42. How About The Others: DE

  43. How About The Others: IT

  44. How About The Others: RU

  45. How About The Others: UK

  46. How About The Others: FR

  47. FY2002 - FY2004 Are Critical in US • Compared to European efforts the US CMS UF efforts are very small • In FY2002 the US CMS Tier-1 is sized at 4 kSI CPU and 5 TB storage • The Tier-1 effort is 5.5 FTE. In addition there are 2 FTE CAS and 1 FTE Grid • S&C baseline 2003/2004: Tier-1 effort needs to be at least $1M/year above FY2002 to sustain the UF R&D and become a full part of the LHC Physics Research Grid • Need some 7 additional FTEs, more equipment funding at the Tier-1 • Part of this effort would go directly into user support • Essential areas are insufficiently covered now, need to be addressed in 2003 at the latest • Fabric management • Storage resource management • Networking • System configuration management • Collaborative tools • Interfacing to Grid i/s • System management & operations support • This has been strongly endorsed by the S&C baseline review Nov 2001 • All European RC (DE, FR, IT, UK, even RU!) have support at this level of effort

  48. The U.S. User Facilities Will Seriously Fall Back Behind European Tier-1 Efforts Given The Funding Situation! • To keep US leadership and not to put US-based science at a disadvantage, additional funding is required: at least $1M/year at Tier-1 sites

  49. LHC Computing Grid Project • $36M project 2002-2004, half equipment half personnel: “Successful” RRB • Expect to ramp to >30 FTE in 2002, and ~60 FTE in 2004 • About $2M / year equipment • e.g. UK delivers 26.5% of LCG funding AT CERN ($9.6M) • The US CMS has requested $11.7M IN THE US + CAS $5.89M • Current allocation (assuming CAS, iVDGL) would be $7.1M IN THE US • Largest personnel fraction in LCG Applications Area • “All” personnel to be at CERN • “People staying at CERN for less than 6 months are counted at a 50% level, regardless of their experience.” • CCS will work on LCG AA projects → US CMS will contribute to LCG • This brings up several issues that US CMS S&C should deal with • Europeans have decided to strongly support the LCG Application Area • But at the same time we do not see more support for the CCS efforts • CMS and US CMS will have to do at some level a rough accounting of LCG AA vs CAS and LCG facilities vs US UF

  50. Impact of LHC Delay • Funding shortages in FY2001 and FY2002 have already led to significant delays • Others have done more --- we are seriously understaffed and are not doing enough now • We lack 7 FTE already this year, and will only be able to start hiring in FY2003 • This has led to delays and will further delay our efforts • Long-term • do not know; predictions of equipment costs are too uncertain to evaluate possible cost savings due to a delay of roughly a year • However, schedules become more realistic • Medium term • Major facilities (LCG) milestones shift by about 6 months • 1st LCG prototype grid moved to end of 2002 --> more realistic now • End of R&D moves from end 2004 to mid 2005 • Detailed schedule and work plan expected from LCG project and CMS CCS (June) • No significant overall cost savings for the R&D phase • We are already significantly delayed, and not even at half the effort of what other countries are doing (UK, IT, DE, RU!!) • Catching up on our delayed schedule is feasible if we can manage to hire 7 people in FY2003 and manage to support this level of effort in FY2004 • Major issue with lack of equipment funding • Re-evaluation of equipment deployment will be done during 2002 (PASTA)
