
Proposal to BNL/DOE to Use ACCRE Farm for PHENIX Real Data Reconstruction in 2008 - 2010


Presentation Transcript


  1. Proposal to BNL/DOE to Use ACCRE Farm for PHENIX Real Data Reconstruction in 2008 - 2010. Charles Maguire for the VU PHENIX group; Carie Kennedy, Paul Sheldon, and Alan Tackett for ACCRE. Analysis Meeting

  2. Executive Summary
  • A proposal to BNL/DOE is being made to fund PHENIX data reconstruction at Vanderbilt’s ACCRE farm during 2008-2010
  • Two funding scenarios are envisioned, with 3-year total costs of $200K and $300K, respectively, depending on the scope of the work at ACCRE
  • This proposal builds on the experience gained at Vanderbilt in Run6 and Run7 doing near-real-time data reconstruction
  • A concurrent proposal is being submitted to the CMS-HI collaboration in a competition to site their U.S. compute center at ACCRE
    • The CMS-HI compute center is ~5 times larger than the PHENIX $300K proposal
    • Time scale 2008 - 2011, competing with MIT and Iowa bids, to be decided Feb. ‘08
  • PHENIX will gain
    • The benefit of VU-subsidized costs and time-leveraged computing at ACCRE
    • Efficient use of manpower in a large PHENIX group, and excellent service from ACCRE
    • Advantages in keeping pace with CMS-HEP’s technological breakthroughs
    • If DOE invests in ACCRE for CMS-HI, PHENIX may share upgrades (tech solutions)

  3. Computing and Research at VU
  • ACCRE: Advanced Computing Center for Research and Education
    • $8.7M capital project initially funded by the Vanderbilt Provost’s office in 2004
    • $2.0M in subsequent infusions from NIH and NSF
    • Currently ACCRE has more than 1600 CPUs, with space and power to grow to 4000 CPUs
    • Additional CPUs at ACCRE to be purchased as part of new faculty hire start-up funding packages
    • Expecting in 2008 a dedicated Internet2 connection speed of 2.5 Gbits/second
    • ACCRE is implementing a dedicated link of 10 Gbits/second to the ESNET POP in Nashville
    • ACCRE has a fast internal network for disk I/O, plus a tape archive facility; Vanderbilt University will now deliver 1 Gbit/second to faculty desktop machines as justified
  • RHI and HEP Research at Vanderbilt
    • PHENIX group at VU currently at 10 members: 3 faculty, 3 post-docs, 4 graduate students; 2 of the students are currently deputy production managers (part of the Run7 reco team at ACCRE)
    • Group has formally joined CMS-HI as of May 2007; anticipating an eventual 40% group FTE presence in CMS-HI (compute center a major factor)
    • VU HEP group in CMS as of 2006, with Paul Sheldon overall group leader for all CMS at Vanderbilt; the HEP group supports the installation of CMSSW at Vanderbilt for both HEP and HI use
    • The HEP group is also the leader of the NSF-funded REDDNet project to deploy 500 TBytes of disk; REDDNet will make use of the L-Store toolkit for high-performance disk I/O (couples to ROOT I/O)

  4. ACCRE Organization
  • Steering Committee: Paul Sheldon (Chair), Dave Piston, Ron Schrimpf
  • First External Advisory Committee: Paul Avery, Dick Landis, Doug Post
  • Internal Advisory Committee: Dennis Hall (University), Jeff Balser (VUMC), Co-Chairs
  • Faculty Advisory Group (ongoing): Robert Weller, Chair
  • Faculty Study Group (short-term): Marylyn Ritchie, Chair
  • ACCRE Staff Management Team
    • Technical Director: Alan Tackett
    • Finance/User Management: Carie Lee Kennedy
    • Education/Outreach: Rachel Gibbons
    • Technical Staff: Mathew Binkley, Bobby Brown, Kevin Buterbaugh, Laurence Dawson, Santiago de Ledesma, Andrew Lyons, Kelly McCauley
    • Support: Gretchen Green

  5. Near-Real Time Run 7 PHENIX Data Reconstruction at ACCRE, April - June 2007
  [Data-flow diagram between the RCF (RHIC Computing Facility) and ACCRE; 45 TB of disk and 200 CPUs continuously available for Run7 reconstruction]
  • PRDFs (raw data files) transferred from RCF to VU by GridFTP at 30 MBytes/sec
  • Dedicated GridFTP server (Firebird) with 4.4 TB of buffer disk space; internal FDT transfers at 45 MB/s
  • Reconstruction: 200 jobs/cycle, 12 hours/job, 770 GBytes of PRDFs per cycle
  • nanoDSTs (reconstruction output) returned to RCF by GridFTP at 23 MB/sec
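  As a quick consistency check on these rates (a back-of-the-envelope estimate, not a number taken from the slide), the 30 MBytes/sec GridFTP link from BNL can deliver one 770 GByte cycle of PRDFs well within the 12-hour job cycle:

    \[ t_{\mathrm{cycle}} \;=\; \frac{770\ \mathrm{GB}}{30\ \mathrm{MB/s}} \;\approx\; 2.6\times 10^{4}\ \mathrm{s} \;\approx\; 7.1\ \mathrm{hours} \;<\; 12\ \mathrm{hours\ per\ reconstruction\ cycle} \]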

  6. Run7 Data I/O and Reconstruction Experience
  • I/O processes were completely automated (impossible to do otherwise)
    • Over 100 Perl scripts written at VU to automate and monitor the activity; these scripts coordinated computer operations on 4 different systems
    • 5810 PRDF file segments automatically transferred from BNL to the local server, containing 30.3 TBytes = ~275M events
    • The GridFTP receiving server at Vanderbilt was completely stable after defective vendor-supplied disk-management software was replaced in mid-April
    • Saw sustained 100 MBytes/second I/O on this server (input from BNL + output to BNL)
  • Reconstruction (also completely automated and web-monitored)
    • PRDFs arriving in April, but final reconstruction build not ready until June 4, mostly due to the special circumstance for Run7: four new detector subsystems were brought on-line for Run 7 reconstruction
    • Plenty of disk space at ACCRE to archive the PRDFs until June; no PRDFs were actually deleted until August 1
    • Main difficulty started in mid-June after the first sets of reco output arrived at RCF
      • GridFTP transfers began to fail to NFS disk systems that had suddenly become very busy
      • Wrote “fault-tolerant” scripts to re-start GridFTP from the failure point, and a “horse race” competition script to locate the currently least busy destination disk at RCF (a sketch follows below)
    • Processed in 3 weeks the PRDF files that were received during 11 weeks
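  A minimal Perl sketch of the flavor of these recovery scripts (illustrative only, not the original VU production code; the host names, file paths, probe method, and retry limit are assumptions):

    #!/usr/bin/perl
    # Illustrative sketch only: re-issue a GridFTP transfer until it succeeds,
    # after first racing the candidate RCF destination disks to find the least
    # busy one. Host names, paths, and the probe file are hypothetical.
    use strict;
    use warnings;

    my $source  = 'gsiftp://firebird.accre.vanderbilt.edu/buffer/run7/nanoDST_seg0001.root';
    my @targets = map { "gsiftp://rftpexp.rhic.bnl.gov/phenix/nfsdisk$_/run7/" } 1 .. 4;

    # "Horse race": time a small probe copy to each candidate disk and keep the
    # fastest (i.e. currently least busy) destination.
    my ($best, $best_time) = ('', 1e9);
    for my $target (@targets) {
        my $t0 = time;
        next if system('globus-url-copy', "$source.probe", $target) != 0;
        my $dt = time - $t0;
        ($best, $best_time) = ($target, $dt) if $dt < $best_time;
    }
    die "no destination disk reachable\n" unless $best;

    # Fault-tolerant transfer: retry until globus-url-copy exits successfully.
    # (The real scripts resumed from the failure point; this sketch just retries.)
    my $attempt = 0;
    until (system('globus-url-copy', $source, $best) == 0) {
        die "giving up after 10 attempts\n" if ++$attempt >= 10;
        warn "transfer failed (attempt $attempt), retrying in 60 s\n";
        sleep 60;
    }
    print "transferred $source to $best\n";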

  7. Proposal Scenarios
  www.phenix.bnl.gov/phenix/WWW/p/draft/maguire/bnl_sep07/doeSupplemental08September2007.pdf.gz
  • Proposed to reconstruct 25-30% of PHENIX Min Bias data by 2010
    • First scenario ($200K total) is only the Central Arm data done in near-real time
    • Second scenario ($300K total) will do both Central and Muon Arm data; the PRDFs are already at ACCRE, so only more CPUs are needed, not much more disk
    • Needed number of CPUs and disk (tape) space is scaled from the Run7 Au+Au project
  • Alternate scenarios also possible (not in the proposal document)
    • Alternate scenarios depend on decisions of the production managers at BNL; not adamant on doing one kind of project (“factory manager/home office” analogy)
    • Do near-real time Level2 reco (as we did in Run6), if it is not done elsewhere
    • Do quick reco of 10% of the data, and then wait ~2 months to do a repass and more
    • Do traditional slow playback of data stored on tape, …
  • Important to us to have a multi-year plan and commitment
    • Some disk/tape resources can be “lent” by ACCRE in the first year and paid for in the second year
    • If we are selected as the CMS-HI compute center, then we want to know and plan well for what the scope of our responsibilities to both projects will be during the same year
    • Out-year plans could still be modified based on prior-year experiences

  8. Proposal Costs
  www.phenix.bnl.gov/phenix/WWW/p/draft/maguire/bnl_sep07/doeSupplemental08September2007.pdf.gz
  • Time Scales
    • Assume an 8-10% growth per year in scope, i.e. Run 8 at about 1.5 - 2 times Run7
    • ACCRE will allow us to time-leverage the acquired CPUs: for example, if the PHENIX work will all be done in 4 months (near-real time), then we can use at least three times as many CPUs as in our nominal fair-share quota
    • At any given time, there is always the opportunity to use more CPUs if they are not being used by others exhausting their fair-share quotas
  • Capital Costs
    • Total cost to purchase 110 (204 in scenario 2) processors is $68,200 ($126,480)
    • Total cost to purchase 70 (77 in scenario 2) TBytes is $49,000 ($53,900)
  • Operating Costs
    • Operating costs (power, air-conditioning, 24/7 service support) are charged on an FTE basis according to the amount and kind of hardware purchased by a group
    • The FTE amount for scenario 1 in the 3rd year is 0.40 persons; the FTE amount for scenario 2 in the 3rd year is 0.54 persons
    • The FTE salary charge at VU for this category is $65,000/FTE + 25.5% fringe; overhead is added at 53.5%, giving a 3rd-year support cost of $103K ($141K in scenario 2) (the loaded per-FTE rate is worked out below)
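  For reference, simple compounding of the quoted fringe and overhead percentages gives the fully loaded cost of one support FTE (a back-of-the-envelope figure; the 3rd-year scenario totals quoted above come from the full proposal document and are not re-derived here):

    \[ \$65{,}000 \times (1 + 0.255) \times (1 + 0.535) \;\approx\; \$125\mathrm{K\ per\ fully\ loaded\ FTE\text{-}year} \]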

  9. Proposal Scenario 1 (Central Arm reco only)
  www.phenix.bnl.gov/phenix/WWW/p/draft/maguire/bnl_sep07/doeSupplemental08September2007.pdf.gz

  10. Proposal Scenario 2 (Add Muon Arm reco)
  www.phenix.bnl.gov/phenix/WWW/p/draft/maguire/bnl_sep07/doeSupplemental08September2007.pdf.gz

  11. Proposal Additional Details
  www.phenix.bnl.gov/phenix/WWW/p/draft/maguire/bnl_sep07/doeSupplemental08September2007.pdf.gz
  • Half FTE to Develop Globus Grid Middleware for PHENIX
    • Other half of the support comes from Vanderbilt
    • Real person (already identified) who is expert in software
    • Early project to develop Grid-based simulation job submission (a sketch follows below)
    • Will work with ACCRE techs to advance the IBP depot disk system; IBP: Internet Backplane Protocol, designed for Tier3 sites in CMS, enables remote disk access in a distributed university environment
    • Half FTE cost is $32.5K per year + fringe and overhead
  • Relationship to the CMS-HI Compute Center Proposal
    • No double counting of hardware resources (CPUs, disk, tape)
    • RHIC and LHC run out of phase (winter shutdowns for LHC)
    • Intense input transfer rates from RHIC and LHC will be in different seasons
    • The 10 Gbits/second specified for CMS should vastly exceed PHENIX’s need
    • CMS-HI ramp-up rate projected as 25%, 50%, 100% in 2009-2011
    • Raw data volume for CMS-HI pegged at 300 TBytes (~1 month of running)
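  As an indication of the kind of Grid-based submission this position would develop, here is a minimal Perl sketch built on standard Globus GRAM commands (the gatekeeper contact string, executable path, and arguments are placeholders, not the actual PHENIX setup):

    #!/usr/bin/perl
    # Illustrative sketch: submit one PHENIX simulation job to an ACCRE Globus
    # GRAM gatekeeper and poll it to completion. All names here are placeholders.
    use strict;
    use warnings;

    my $gatekeeper = 'vmps01.accre.vanderbilt.edu/jobmanager-pbs';   # assumed contact string
    my $executable = '/home/phenix/bin/run_simulation.sh';           # hypothetical wrapper script
    my @arguments  = ('--events', '10000', '--output', '/scratch/phenix/sim001');

    # globus-job-submit prints a job-contact URL that identifies the GRAM job.
    my $job = `globus-job-submit $gatekeeper $executable @arguments`;
    chomp $job;
    die "submission failed\n" unless $job =~ m{^https://};

    # Poll GRAM until the job reaches a terminal state.
    while (1) {
        my $state = `globus-job-status $job`;
        chomp $state;
        print "job state: $state\n";
        last if $state eq 'DONE' or $state eq 'FAILED';
        sleep 300;   # check every five minutes
    }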

  12. Advantages For PHENIX
  • Run 6 and Run 7 Experiences
    • Early looks at the performance of the detectors; good PR for PHENIX to BNL and DOE; near-real-time decisions possible in principle
    • Pressure to get the reconstruction libraries in order sooner rather than later
  • Funding considerations
    • DOE may be disposed to give additional money to university groups if a good case can be made (and I think it can here) that a significant subsidy is being made for operations costs
    • There is sufficient manpower at Vanderbilt to carry out this work; there are cost savings in bringing the CPUs to the available manpower, instead of the manpower to the CPUs
    • The training of students in this work will give (has already given) benefit to PHENIX in preparing future deputy (or full) production managers
  • It is the future model of large-scale data reconstruction and analysis
    • The idea of shared computing responsibilities is central to the LHC computing model
    • Fast network speeds, up to 10 Gbits/second, no longer mandate a centralized facility
    • If the CMS-HI compute center is located at ACCRE, there will be additional technical support able to work on common problems of data I/O and management
    • Even if the CMS-HI compute center is not at ACCRE, ACCRE will still be reaping the advances in technology brought about by the CMS-HEP group here

  13. Three Year Planning Proposal 2008-2010: Budget Policies
  • DOE Treats RHIC Computing as a Single Line Item in BNL’s Budget
    • DOE will not fund university groups separately to do computing for PHENIX or STAR
    • It is BNL management’s decision how to disburse the RHIC computing line item
    • BNL management listens to recommendations from PHENIX and STAR
    • Hence PHENIX must first approve a recommendation to fund computing at ACCRE
  • ACCRE Will Not Provide PHENIX with “Free Computing” After 2007
    • Strict Federal indirect-cost overhead rules apply (after ACCRE’s start-up “grace” period)
    • ACCRE cannot give away to PHENIX/DOE what it will charge NIH and NSF researchers; all research groups are treated with the same accounting rules
  • Is There a Net Gain for PHENIX Computing at a University?
    • Yes, Vanderbilt is not charging the real ops cost of computing, so a subsidy is in effect. Vanderbilt is committed long term to a 50% share of total support costs at ACCRE. A university provost will always tell you that he/she loses money on research. Research at a major university is a loss-leader to attract the best students and faculty, to build the university’s reputation, and thence to attract more endowment contributions. That, and generating Ph.D.’s, is why DOE is happy to see new faculty lines in groups.

  14. Executive Summary Again
  • A proposal to BNL/DOE is being made to fund PHENIX data reconstruction at Vanderbilt’s ACCRE farm during 2008-2010
  • Two funding scenarios are envisioned, with 3-year total costs of $200K and $300K, respectively, depending on the scope of the work at ACCRE
    • Alternate scope scenarios are possible year-by-year depending on PHENIX production needs
    • A 3-year plan optimizes resource allocations and readiness for the second- and third-year efforts
  • This proposal builds on the experience gained at Vanderbilt in Run6 and Run7 doing near-real-time data reconstruction
  • A concurrent proposal is being submitted to the CMS-HI collaboration in a competition to site their U.S. compute center at ACCRE
    • This proposal will cite the PHENIX efforts in Run6 and Run7 as advantages over the two competitors
    • The CMS-HI compute center is ~5 times larger than the PHENIX $300K proposal
    • Time scale 2008 - 2011, competing with MIT and Iowa bids, to be decided Feb. ‘08
  • PHENIX will gain
    • The benefit of VU-subsidized costs and time-leveraged computing at ACCRE
    • Efficient use of now-expert manpower in a large PHENIX group, and excellent service from ACCRE
    • Advantages in keeping pace with CMS-HEP’s technological breakthroughs
    • If DOE invests in ACCRE for CMS-HI, PHENIX may share upgrades (tech solutions)
