
Computing review preparations



  1. Computing review preparations

  2. The Charge • Are the resources required to transmit and store the data adequately understood? • Are the resources required to process the data to a form suitable for physics analyses adequately understood? • Is the plan for developing software processes and framework adequately understood?

  3. PHENIX → sPHENIX The sPHENIX framework is adopted from PHENIX, leveraging • 18 years of experience in taking, reconstructing, and analyzing data • Current framework started in 2003 with continuous improvements • Mature reconstruction/analysis framework, no difference between online/offline reconstruction • Flexibility to accommodate changing subsystems (PHENIX never had two runs with identical configurations) • Coordinated analysis (“analysis train”) standard in PHENIX since 2004 • Resulting in hundreds of publications • Our users bought into it (unlike Athena at ATLAS)

  4. sPHENIX vs. PHENIX • PHENIX’s largest dataset is 3 PB; sPHENIX will be ~30 times larger • Number of events ~5 times larger (Run 14 Au+Au dataset ~20B events; all runs combined ~50B events) • Shift from small events with fast reconstruction (PHENIX production is often I/O bound) to large events with slow reconstruction, a more CPU-bound regime • Simulation based on Geant4, run and analyzed within the same framework • The PHENIX configuration changed year by year; sPHENIX in year 1 will look like sPHENIX in year 5 (at least that’s the plan)

  5. Are the resources required to transmit and store the data adequately understood?

  6. Keeping the general PHENIX concept [Diagram: data flows from the FEMs in the Interaction Region through the data concentration stage (DCMs/DCM2s) in the Rack Room into the Event Builder (SEBs → 10+ Gigabit Crossbar Switch → ATPs), then to the Buffer Boxes and on to HPSS at the RHIC Computing Facility] This has served us well for the past 16 years. Especially the buffer boxes have played a significant role in accommodating the data rates of 2003…2010 (before better networks, better cache disks, …).

  7. Buffer Boxes (2007) [Plot: data rate over time from 2007, with the average rate indicated] • Level the ebb and flow from the collider (decaying luminosity during a store, downtime, etc.) – transfer the average rather than the peak throughput • Ride out short interruptions of the HPSS tape storage system • Make data available locally for online monitoring and calibrations for ~70 hours
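As a rough sanity check (not on the slide, and mixing this slide’s ~70 hour retention window with the ~100 Gbit/s post-duty-factor sPHENIX rate quoted a few slides later): holding 70 hours of data locally implies a buffer-box disk pool of roughly 3 PB.

```cpp
// Back-of-the-envelope buffer-box sizing; an editorial sketch, not a design
// number from the talk. Inputs: ~100 Gbit/s aggregate rate (later slide),
// ~70 h of local retention (this slide).
#include <cstdio>

int main() {
    const double rate_gbyte_s = 100.0 / 8.0;  // 100 Gbit/s -> 12.5 GB/s
    const double hours_local  = 70.0;         // local retention window
    const double buffer_pb = rate_gbyte_s * hours_local * 3600.0 / 1.0e6;
    printf("Disk for %.0f h of local data: ~%.1f PB\n",
           hours_local, buffer_pb);           // ~3.2 PB
    return 0;
}
```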

  8. We achieved LHC-era data rates in 2004 PHENIX has never been afraid of high data rates and volumes – we did a 1.5 GB/s fully compressed data stream years before the LHC turn-on (my opening slide from CHEP 2006), and I personally claim a portion of that fame. We fully intend not to sacrifice data and to “take all we can”. The current design goal is an event rate of 15 kHz. This is possible right now, and will be even less of an issue in 2020+.

  9. Before you wonder… We will virtually always max out our maximum event rate of ~15 kHz. The absolute data rates are determined almost exclusively by the event multiplicity, that is, by how many particles are produced in a collision on average. More particles in an event mean a larger event size; we cannot “trade” smaller event sizes for more events. That is why the Au+Au runs produce by far the largest datasets – the most particles produced.

  10. The large data producers in 200 GeV Au+Au • Monolithic Active Pixel Sensors (MAPS) ~35 Gbit/s • Intermediate Silicon Strip Tracker (INTT) ~7 Gbit/s • Compact Time Projection Chamber (TPC) ~80 Gbit/s • Calorimeters (primarily EMCal, hadronic calorimeters) ~8 Gbit/s • Total ~130 Gbit/s • After applying the RHIC × sPHENIX “short-term duty factor” in steady-state running: ~100 Gbit/s

  11. 100 Gbit/s. This is 2022. • Tape drives: 2022-era LTO9 → 500 MB/s → ~24 LTO9 drives? LTO10??? → significantly fewer • Total data volumes (70% additional duty factor, 20 TB/tape, silo contains 10,000 tapes):
  – 16 “physics weeks” Au+Au → 80 PB → 4000 LTO9 tapes
  – 23 “physics weeks” p+p / p+Au → 60 PB → 3000 tapes
  – 23 “physics weeks” Au+Au → 115 PB → 5600 tapes
  – 23 “physics weeks” p+p → 42 PB → 2100 tapes
  – like 2024: 5600 tapes
  • Network: we have plenty of “dark fiber” – any issues lighting it up?
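These numbers follow from straightforward arithmetic on the quoted inputs; the sketch below reproduces them (it lands slightly above the slide’s 24 drives / 80 PB / 4000 tapes, presumably because the slide rounds).

```cpp
// Reproducing the slide's tape arithmetic from its stated inputs:
// 100 Gbit/s logging rate, 500 MB/s per LTO9 drive, 70% additional
// duty factor, 20 TB per tape. An editorial cross-check only.
#include <cstdio>

int main() {
    const double rate_gb_s  = 100.0 / 8.0;  // 100 Gbit/s -> 12.5 GB/s
    const double drive_gb_s = 0.5;          // LTO9: ~500 MB/s
    printf("LTO9 drives to keep up: ~%.0f\n",
           rate_gb_s / drive_gb_s);         // ~25 (slide: 24)

    const double duty      = 0.70;          // additional duty factor
    const double weeks     = 16.0;          // "physics weeks" of Au+Au
    const double seconds   = weeks * 7 * 24 * 3600.0;
    const double volume_pb = rate_gb_s * duty * seconds / 1.0e6;
    const double tape_tb   = 20.0;
    printf("16 weeks Au+Au: ~%.0f PB -> ~%.0f tapes\n",
           volume_pb, volume_pb * 1000.0 / tape_tb);  // ~85 PB, ~4200 tapes
    return 0;
}
```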

  12. Just a pat on our backs • We have been optimizing our handling of tapes (file sizes, access patterns, stagings) • We now get the highest performance out of tape storage in the RCF, which has been recognized at computing conferences (the LTO5 rated speed is 140 MB/s; we get 139.5 MB/s for extended periods of time using 12 drives) • From what I hear, IBM is using our performance to show what is possible • IBM is interested in helping us store 400 PB

  13. Are the resources required to process the data to a form suitable for physics analyses adequately understood?

  14. Are the resources required to process the data to a form suitable for physics analyses adequately understood? See the next talk

  15. Is the plan for developing software processes and framework adequately understood?

  16. What do we have • Online monitoring – PHENIX ran at 6 kHz; a factor of 2.5 higher rate is no problem • Automated calibration • Fast online reconstruction • Frameworks are adjusted for specific needs, but the interfaces to analysis modules are identical • Code is simple enough that collaborators can easily contribute

  17. Online Monitoring [Diagram: events flow from the DAQ (or from a file) into an ET pool; Monitor 1, 2, 3, … processes attach to the pool and fill histograms; separate Draw Clients display them] • Histograms saved run by run for archiving and HTML output • Shift crew and detector experts deal only with the clients; multiple clients possible • Combine multiple monitors in a single plot (e.g. vertex) • Monitors autostarted by cron jobs (since Run 4, I guess)
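For illustration only – the class and function names below are hypothetical, not the actual PHENIX online-monitoring API – here is a minimal sketch of the split the diagram describes: monitors fill histograms from the event stream, and draw clients only fetch and display them.

```cpp
// Hypothetical sketch of the monitor / draw-client split; names are invented.
#include <map>
#include <string>

struct Event   { /* decoded event payload */ };
struct Histo1D { void Fill(double) {} };

// One monitor per subsystem; it fills histograms but never draws them.
class VertexMonitor {
    Histo1D hvtx;  // e.g. reconstructed z-vertex distribution
public:
    void process_event(const Event&) {
        hvtx.Fill(0.0);  // a real monitor would fill the measured vertex here
    }
    // Histograms are exposed by name, so clients need no subsystem internals.
    std::map<std::string, Histo1D*> histos{{"vertex_z", &hvtx}};
};

// A draw client (local or remote) fetches histograms by name; the shift crew
// and detector experts interact only with clients, never the event stream.
Histo1D* fetch_histogram(VertexMonitor& mon, const std::string& name) {
    auto it = mon.histos.find(name);
    return it == mon.histos.end() ? nullptr : it->second;
}
```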

  18. OnCal • No nice flow chart yet • Runs over just-taken data • Provides calibrations “good enough” for the pattern recognition, which does the heavy lifting in terms of CPU • Also runs Q/A; results stored in a DB for lookup by the reconstruction

  19. Calibrations lookup • Conditions DB accessed via ODBC – independent of the underlying SQL implementation • Calibrations stored as blobs – fast lookup and retrieval • PHENIX needs 2-3 servers; time precision in seconds • Banks keyed by insert time, bank id, and validity range (begin/end)
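A minimal sketch of the lookup rule this keying implies (hypothetical types, not the real PHENIX ODBC layer): among all banks with the requested id whose validity range covers the requested time, the one with the latest insert time wins.

```cpp
// Hypothetical in-memory model of the conditions-DB lookup; the real system
// does this with an SQL query over blob-valued bank tables via ODBC.
#include <cstdint>
#include <vector>

struct CalBank {
    int     bank_id;
    int64_t insert_time;            // seconds; PHENIX precision is seconds
    int64_t valid_begin, valid_end; // validity range [begin, end)
    std::vector<uint8_t> blob;      // opaque calibration payload
};

const CalBank* lookup(const std::vector<CalBank>& banks, int bank_id, int64_t t) {
    const CalBank* best = nullptr;
    for (const auto& b : banks) {
        if (b.bank_id != bank_id || t < b.valid_begin || t >= b.valid_end)
            continue;   // wrong bank, or time outside the validity range
        if (!best || b.insert_time > best->insert_time)
            best = &b;  // a later insert supersedes earlier calibrations
    }
    return best;
}
```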

  20. Fast reconstruction flow [Diagram: data moves from the buffer boxes to HPSS and into dCache; the calibration processes and the reconstruction read from dCache, with calibrations looked up in the conditions DB. Open question: buffer space, transfers per day?]

  21. Structure of Fun4All [Diagram: the Fun4AllServer at the center; input managers feed it raw data (PRDF), DSTs, HepMC/Oscar records, simulated PRDFs, or empty events; registered user analysis modules work on the data objects; a histogram manager and calibrations (PostgreSQL DB, ROOT file) hang off the server; output managers write DSTs and ROOT files] Lightweight, standard C++, modular, flexible
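A minimal sketch of how a user plugs into this structure, following the Fun4All pattern just described; exact headers and signatures vary between Fun4All versions, so treat the details as illustrative rather than authoritative.

```cpp
// Illustrative Fun4All-style analysis setup; verify headers and signatures
// against the actual Fun4All release before use.
#include <Fun4AllServer.h>
#include <Fun4AllDstInputManager.h>
#include <SubsysReco.h>

// A user module: the server calls Init() once and process_event() per event.
class MyAnalysis : public SubsysReco {
public:
    MyAnalysis() : SubsysReco("MyAnalysis") {}
    int Init(PHCompositeNode*) override { return 0; }           // book histograms
    int process_event(PHCompositeNode*) override { return 0; }  // per-event work
};

void run_analysis(const char* dstfile, int nevents = 0) {
    Fun4AllServer* se = Fun4AllServer::instance();
    se->registerSubsystem(new MyAnalysis());   // plug the analysis module in

    auto* in = new Fun4AllDstInputManager("DSTIN");
    in->fileopen(dstfile);                     // same interface for data and sim
    se->registerInputManager(in);

    se->run(nevents);                          // 0 = process the whole input
    se->End();
}
```

The same server/module interface is used online and offline, which is why there is no difference between online and offline reconstruction.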

  22. The Analysis Taxi [Diagram: users sign up via a web form with their library source, ROOT macro, and output directory; a gatekeeper handles compilation, tagging, and verification; analysis jobs are submitted to 15,000 Condor slots at the RCF, read from an 8 PB dCache system holding all datasets since Run 3 online, and write their output to GPFS] • Fully automated • Modules running over the same dataset are combined to save resources • Provides immediate access to all PHENIX datasets since Run 3 • Turnaround time typically hours • Vastly improves PHENIX analysis productivity • Relieves users from managing thousands of Condor jobs • Keeps records for the analysis “paper trail”

  23. Development process • No dedicated software group; the collaboration provides the manpower • We work in small teams • Constant communication • Code is hosted on GitHub (Microsoft) • Nightly builds • Continuous integration (Jenkins) coming

  24. Code development • Extensive use of code-checking tools: Cppcheck, Coverity, Insure++, Valgrind • Code management: Git – pull requests, followed by (fast) code review • Written-up coding conventions • Extensive use of 3rd-party software (e.g. GenFit, RAVE)

  25. Message to convey • We do have the infrastructure (frameworks) to run our stuff, which allows us to concentrate on the real problems • The frameworks can be adapted to future needs (multi-threading) but are simple enough for collaborators to contribute • We need buy-in from the RCF (more of a direct handle on resources than just some generic Condor pool) • And – yes – we know what we are doing and have an understanding of our needs
