
The DØ Computing Model



1. The DØ Computing Model
• Overview
  • The picture
  • Planning history
  • Status of acquisitions
  • Performance
• More detail
  • On the current operation
  • On the R & D
• General Status
• Future plan

Wyatt Merritt, DØ Collaboration Meeting Plenary Session

2. Overview
• The data handling system
  • SAM → ENSTORE → Robot(s)
  • High bandwidth into robot
• The offline user computing systems
  • dØmino - O(20 TB) disk
  • linux analysis server(s) - O(2 TB) disk
  • linux development machines - O(0.2 TB)
  • build cluster
  • ClueDØ
  • remote linux machines
  • non-development desktops
• Associated systems
  • Fermilab production farm (raw data reconstruction)
  • Remote production farms (simulation)
  • Database servers

3. [System diagram: Detector, Robot, dØmino (27 TB), Linux farms, Linux compute server, Analysis Cluster 1, lxbld, ClueDØ (~0.2 TB) and ClueDØ server, database servers, and NT desktops, connected by a high-speed network; Monte Carlo is handled remotely. Link rates of 12.5 Mb/s and 150 Mb/s and a ~1 TB disk are also marked.]

4. Planning history
• Original plan: January '97
• DØ internal review: February '97
• External review: Von Rüden Committee
  • Mar '97, Oct '97, Jun '98, Jan '99, Jun '99
• Funding profile (DMNAG - joint with CDF) approved '97
• Plan updates
  • January '99 for VR IV
  • Global Computing Model reports ('98-'99) [addition of Analysis Servers to plan]
• Plan implementation '97-'01
• Run II Computing and Software Project: co-leaders + Computing Planning Board

5. Status of acquisitions
• Analysis cpu
  • dØmino: 192-processor O2000 - complete (except memory additions)
  • Desktops: responsibility of institutions
  • Analysis clusters/servers: 1 purchased of (6?)
• Reconstruction cpu
  • 200 processors acquired of 400 planned [40 Hz cap @ current reco cpu perf.; 80 Hz @ target reco perf.]
• Disk storage
  • 30 TB total - complete (plan was 15 TB); see allocation slide

6. Disk space in the offline systems
Total available disk space: 30 TB. Of this, 3 TB are on D0test, d0lxac1, and d0lxbld; 27 TB are on D0MINO.

Disk space on D0MINO (all units are TB):

                                     Available   Allocated   Used
  Scratch, releases & other config.      1           1          1
  SAM cache                              6           6       variable
  DST/mDST                              12          12       variable
  Project disks                          4           2.6      ~2.0?
  Tmp (group space)                      2           0.9        ?
  Contingency                            2
  TOTAL                                 27          22.5
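The table's totals can be cross-checked with a few lines of arithmetic. This sketch copies the Available and Allocated columns verbatim from the table (the variable Used column is omitted):

```python
# Cross-check of the D0MINO disk allocation table (all values in TB).
available = {
    "scratch, releases & other config": 1,
    "SAM cache": 6,
    "DST/mDST": 12,
    "project disks": 4,
    "tmp (group space)": 2,
    "contingency": 2,
}
allocated = {
    "scratch, releases & other config": 1,
    "SAM cache": 6,
    "DST/mDST": 12,
    "project disks": 2.6,
    "tmp (group space)": 0.9,
}

print(sum(available.values()))            # 27, matching the slide's total
print(round(sum(allocated.values()), 1))  # 22.5, matching the slide's total
```

Both sums reproduce the slide's totals, so the table is internally consistent; the 2 TB contingency is the only available-but-unallocated block beyond the project/tmp slack.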

7. Status of acquisitions, cont'd
• Robotic tape storage
  • 1 ADIC robot (750 TB capacity) - complete
    • 18 Mammoth II tape drives - will be retired
    • 6 LTO drives - now
  • 2 STK robots (600 TB capacity) - FY02
    • 9 STK 9940 drives - FY02
  • Post-shutdown stopgap: use existing STKen w/ 4 drives
• Database servers - complete
  • 2 SUN systems w/ 600 GB disk

8. Performance
• Farm production stats
• dØmino cpu & mem stats
• AC1 cpu & mem stats
• SAM & encp stats
• Disk usage stats
• Conclusion - chief needs:
  • More memory for dØmino
  • More reliable tape drives
  • More farm nodes
  • More linux cpu
• Open questions - DB server upgrades?

9. Farm Production Statistics
• See web link from Main DØ Computing for weekly reports
• Week of 08/31 - 09/06:
  • 800,000 events processed, 140,000 of them from data collected in that week
  • 1.9 M events collected in that week
• Problems in this week:
  • encp problem (code change from ENSTORE)
  • disk failure on dØbbin (the farm IO server)
  • several other problems as well...
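A back-of-the-envelope reading of those three numbers shows how well the farm kept up with data taking that week; the keep-up/backlog interpretation is mine, but the figures are taken directly from the slide:

```python
# Farm keep-up for the week of 08/31 - 09/06, using only the quoted figures.
events_processed_total = 800_000   # all events reconstructed that week
events_from_that_week = 140_000    # of those, from data taken that same week
events_collected = 1_900_000       # events collected that week

# Fraction of the week's new data reconstructed within the same week:
keep_up = events_from_that_week / events_collected          # ~7%

# Fraction of the week's processing that went to older (backlog) data:
backlog_share = 1 - events_from_that_week / events_processed_total

print(f"keep-up {keep_up:.0%}, backlog share {backlog_share:.0%}")
```

In other words, most of that week's farm capacity went to working down earlier data, which is consistent with the "more farm nodes" need listed on the Performance slide.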

10. The Current Operation
• Code release model
• Mapping activities to systems
• ClueD0 operation
• Remote farm operation
• Role of the ORB

11. The code release model
• Weekly test releases
• Production releases every three months
• Weekly subsystem coordinators meeting; minutes to the d0rug mailing list
• Rules for interface changes
• Schedules for big disruptive changes (e.g. switch to KAI 4.0)

12. Mapping activities to systems
• Code development: your Linux box, if possible; d0mino is the backup solution
• Large sample processing: a SAM station - d0mino, lxac1, special farm allocation (gtr), (ClueD0 - in R&D)
• Small sample processing: create a derived dataset on a SAM station, transfer to desktop
• Office work / web browsing: use your desktop!
• Remote users: new position to address needs

13. Mapping activities to systems: disk usage
• Home areas - backed up; you can ask for up to 250 MB (possibly more for good reason). BUT they are NFS-mounted - don't use them for data files!
• TMP areas - not backed up. Code development and/or data files, allocated per institution; 37 institutions are using it so far. A good place to start if you are not working with a well-defined project.
• PRJ areas - not backed up. Code development and/or data files, allocated per project. 3 large pools (commissioning, algorithm development, simulation), plus physics and ID groups and some smaller projects.
• Web pages - DØ Main Computing (SAM Data Handling section) --> general description of where data samples are stored in our system

14. ClueD0 Operation
• The current population: 111 nodes with 138 CPUs and a total memory of 37 GB; 396 users
• Rules for joining and policies can be found at:
  • http://www-clued0.fnal.gov/clued0/
  • http://www-clued0.fnal.gov/clued0/policies.html
• Current difficulties from the lack of Red Hat 7.1 builds are being actively worked on

15. Monte Carlo Production Status
• Current software - mcp07
  • p07.00.05a generator, DØgstar, DØsim
  • p08.12.00 DØreco, recoanalyze
• 950 kevents generated at reco level
• Run IIb simulation is a major effort
• Will move to p08.13.00 to remove a memory leak
• Future releases - p09.10.00
  • Problem running DØgstar under investigation
  • Plate level available
• p10 certification will be available by the end of the month

16. The Offline Resources Board
• Charge: allocate offline resources according to the experiment's priorities
  • Project & tmp disk
  • Sample priorities for simulation on remote farms
  • Partitions in SAM cache
  • Batch queues
• Chair: Nick Hadley
• Web page: http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html
• Institutions which have no tmp disk allocation and have active users: email hadley@fnal.gov - 18 GB will be allocated

17. R & D
• Analysis clusters - one in service
• ClueD0 servers (a relocated analysis cluster) - software being tested; networking strategy being developed
• Compute servers for dØmino (a user-accessible farm) - 2 nodes available for tests
• Remote farms for raw data reconstruction and analysis
• Remote desktop analysis

18. Institutional contributions
• Desktop seats
• Backup tapes
• Remote simulation capacity
• Disk for dØmino via budget code - issues:
  • How to allocate between project & tmp?
  • Lifetime for a contribution?
  • Unit of contribution: 1 rack of disk
• Analysis cluster for Feynman via budget code - similar issues
• Analysis cluster for ClueDØ - all the above issues, plus SAM bandwidth, networking, sysadmin, ...

19. General Status - Where are the limits/problems?
• Online
  • Max rate tested: 40 Hz to tape
  • Max rate sustained for a shift, to date: ~25 Hz to tape
  • Max rate expected with next iteration: 60 Hz to tape
  • Final limitation: tape budget (FY02 = ~400 TB)
• Running p10 on the farms
  • Processes raw data @ 23 sec/event
  • Thanks to the Alg Group - worked out of the box on raw data
  • Limits: ~2-3 Hz with current nodes & cpu performance of reco
  • Output size: HUGE - writing too much tape, breaking the DB model, and using more than the allocated network and disk resources all down the line
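The rate limits quoted above follow from simple throughput arithmetic. The sketch below reproduces them; the 23 s/event figure and the ~400 TB tape budget are from the slide, while the node count and the assumption of year-round running are illustrative guesses, not slide numbers:

```python
SECONDS_PER_EVENT = 23.0  # p10 reco time per raw event (quoted on the slide)

def farm_rate_hz(n_nodes, sec_per_event=SECONDS_PER_EVENT):
    """Aggregate keep-up rate (Hz) of a farm of identical single-job nodes."""
    return n_nodes / sec_per_event

# The slide quotes ~2-3 Hz with current nodes; at 23 s/event that
# corresponds to roughly 46-69 nodes (the node count is inferred here):
rate = farm_rate_hz(60)  # ~2.6 Hz

# Tape-budget side: ~400 TB in FY02 against a 60 Hz logging rate implies a
# sustainable average event size (assuming year-round running) of about:
SECONDS_PER_YEAR = 3.15e7
bytes_per_event = 400e12 / (60 * SECONDS_PER_YEAR)

print(f"{rate:.1f} Hz, {bytes_per_event / 1e3:.0f} kB/event")
# prints "2.6 Hz, 212 kB/event"
```

The second figure makes the "Output size: HUGE" complaint concrete: at 60 Hz the tape budget only supports events averaging a couple of hundred kB, so any growth in output size eats directly into the affordable logging rate.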

20. Expected Farm Performance

21. General Status - Where are the limits/problems?
• SAM/ENSTORE status
  • Working for many months with servers on automatic recovery
  • Not all features complete (pick events)
  • 5 GB interfaces - can deliver 150 MB/sec to dØmino
• Robot status
  • Design rates met, but robustness severely limited by the Mammoth II drive error rate - plan switchover by end of shutdown
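Taking the quoted 150 MB/s aggregate at face value, and reading "5 GB interfaces" as five parallel network links into dØmino (an assumption, not stated on the slide), the per-link share and data-staging times work out as follows:

```python
AGG_MB_PER_S = 150.0  # aggregate ENSTORE -> dØmino rate (quoted on the slide)
N_INTERFACES = 5      # assumption: "5 GB interfaces" = five parallel links

per_interface = AGG_MB_PER_S / N_INTERFACES  # MB/s carried by each link

# At the aggregate rate, staging a 1 TB sample (an illustrative size,
# comparable to one project-disk area) from the robot takes about:
hours_per_tb = 1e6 / AGG_MB_PER_S / 3600  # 1 TB = 1e6 MB

print(f"{per_interface:.0f} MB/s per link, {hours_per_tb:.1f} h per TB")
# prints "30 MB/s per link, 1.9 h per TB"
```

So even with the design rate met, refreshing a multi-TB SAM cache is an hours-long operation, which is why cache partitioning (slide 16) and disk allocation matter.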

22. Future Plan
• Major purchases still in FY02
  • New robot and reliable drives
  • New farm nodes
  • More memory for dØmino
  • *Some* linux cpu
• Continue R&D for linux analysis strategies
  • Hope to establish the effectiveness and practicality of the three proposed models: AC, CS, AC@DØ
• Operational improvements
  • SAM personnel @ DØ
• RECO: continue with current release schedules; emphasize quality control and testing for releases; push on cpu, memory, and output-size issues
