The DØ Computing Model

  • Overview

    • The picture

  • Planning history

  • Status of acquisitions

  • Performance

  • More detail

    • On the current operation

    • On the R & D

    • General Status

  • Future plan

Wyatt Merritt DØ Collaboration Meeting Plenary Session


Overview

  • High bandwidth into robot

  • The data handling system

    • SAM / ENSTORE / Robot(s)

  • The offline user computing systems

    • dØmino - O (20 TB) disk

    • linux analysis server(s) - O (2 TB) disk

    • linux development machines - O (0.2 TB)

      • build cluster

      • ClueDØ

      • remote linux machines

    • non-development desktops

  • Associated systems

    • Fermilab production farm (raw data reconstruction)

    • Remote production farms (simulation)

    • Database servers



[Figure: overview diagram of the computing systems. Boxes: Detector, Robot, Linux Farms, dØmino, Linux Compute Server, Analysis Cluster 1, ClueDØ, ClueDØ Server, lxbld, Database Servers, NT Desktops, Monte Carlo, High speed Network. Annotations: ~1 TB, ~0.2 TB, 27 TB, 12.5 Mb/s, 150 Mb/s, "Handled remotely".]


Planning history

  • Original plan January ‘97

    • DØ Internal Review February ‘97

  • External review: Von Rüden Committee

    • Mar ‘97, Oct ‘97, Jun ‘98, Jan ‘99, Jun ‘99

    • Funding profile (DMNAG - Joint with CDF) approved ‘97

  • Plan updates

    • January ‘99 for VR IV

    • Global Computing Model reports (‘98-’99) [Addition of Analysis Servers to plan]

  • Plan implementation ‘97 - ‘01

    • Run II Computing and Software Project: co-leaders + Computing Planning Board



Status of acquisitions

  • Analysis cpu

    • dØmino: 192 proc O2000 complete (except add memory)

    • Desktops: responsibility of institutions

    • Analysis Clusters/Servers - 1 purchased of (6?)

  • Reconstruction cpu

    • 200 processors acquired of 400 planned [40 Hz cap @ current reco cpu perf.; 80 Hz @ target reco perf.]

  • Disk storage

    • 30 TB total - complete (plan was 15 TB)

    • See allocation slide


Disk space in the offline systems

  • Total available disk space: 30 TB

    • 3 TB are on D0test, d0lxac1, d0lxbld

    • 27 TB are on D0MINO

Disk space on D0MINO (all units are TB)

| Area | Available | Allocated | Used |
|---|---|---|---|
| Scratch, releases & other config. | 1 | 1 | 1 |
| SAM cache | 6 | 6 | variable |
| DST/mDST | 12 | 12 | variable |
| Project disks | 4 | 2.6 | ~2.0? |
| Tmp (group space) | 2 | 0.9 | ? |
| Contingency | 2 | | |
| TOTAL | 27 | 22.5 | |
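The disk space figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: the dictionary name `D0MINO_DISK_TB` and the helper `column_total` are hypothetical, and the numbers are simply copied from the slide.

```python
# Disk areas on D0MINO in TB, copied from the slide: (available, allocated).
# None marks a cell that the slide leaves empty.
D0MINO_DISK_TB = {
    "Scratch, releases & other config.": (1.0, 1.0),
    "SAM cache": (6.0, 6.0),
    "DST/mDST": (12.0, 12.0),
    "Project disks": (4.0, 2.6),
    "Tmp (group space)": (2.0, 0.9),
    "Contingency": (2.0, None),
}


def column_total(table, column):
    """Sum one column (0 = available, 1 = allocated), skipping empty cells."""
    return sum(row[column] for row in table.values() if row[column] is not None)


available_tb = column_total(D0MINO_DISK_TB, 0)
allocated_tb = column_total(D0MINO_DISK_TB, 1)
unallocated_tb = available_tb - allocated_tb

print(f"available:   {available_tb:g} TB")
print(f"allocated:   {allocated_tb:g} TB")
print(f"unallocated: {unallocated_tb:g} TB")
```

The sums reproduce the TOTAL row (27 TB available, 22.5 TB allocated). The difference of 4.5 TB is the contingency plus the unallocated parts of the project and tmp areas.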


Status of acquisitions, cont’d

  • Robotic tape storage

    • 1 ADIC robot (750 TB capacity) - complete

    • 18 Mammoth II tape drives - will be retired

    • 6 LTO drives - now

    • 2 STK robots (600 TB capacity) - FY02

    • 9 STK 9940 drives - FY02

    • Post shutdown stopgap - use existing STKen w/ 4 drives

  • Database servers - complete

    • 2 SUN systems w/ 600 GB disk


Performance

  • Farm production stats

  • dØmino cpu & mem stats

  • AC1 cpu & mem stats

  • SAM & encp stats

  • Disk usage stats

  • Conclusion: Chief needs

    • More memory for dØmino

    • More reliable tape drives

    • More farm nodes

    • More linux cpu

  • Open questions - DB server upgrades?


Farm Production Statistics

  • See web link from Main DØ Computing for weekly reports

    • Week of 08/31 - 09/06: 800,000 events processed, 140,000 of them from data collected in that week

    • 1.9 M events collected in that week

  • Problems in this week:

    • encp problem (code change from ENSTORE)

    • disk failure on dØbbin (the farm IO server)

    • several other problems as well...
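To put the weekly numbers in perspective, they can be converted to average rates. This is a rough sketch under stated assumptions: the rates are averaged over a full 7-day week (the slides do not give live time), and the net backlog figure assumes every processed event was a first-pass reconstruction. All variable and function names are hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600

# Numbers from the weekly report for 08/31 - 09/06.
events_collected = 1_900_000
events_processed = 800_000
processed_from_this_week = 140_000


def average_rate_hz(n_events, seconds=SECONDS_PER_WEEK):
    """Average event rate over a period, assuming the whole period is used."""
    return n_events / seconds


collect_hz = average_rate_hz(events_collected)
process_hz = average_rate_hz(events_processed)

# Share of this week's data that the farm also reconstructed this week.
same_week_fraction = processed_from_this_week / events_collected

# Net growth of unprocessed events (assumes no reprocessing is included).
backlog_growth = events_collected - events_processed

print(f"collected: {collect_hz:.2f} Hz, processed: {process_hz:.2f} Hz")
print(f"same-week fraction: {same_week_fraction:.1%}")
print(f"net backlog growth: {backlog_growth:,} events")
```

Under these assumptions the farm processed about 1.3 Hz on average against about 3.1 Hz collected, and roughly 7% of the week's data was reconstructed within the same week.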


The Current Operation

  • Code release model

  • Mapping activities to systems

  • ClueD0 operation

  • Remote farm operation

  • Role of the ORB


The code release model

  • Weekly test releases

  • Production releases every three months

  • Weekly subsystem coordinators meeting: minutes to d0rug mailing list

  • Rules for interface changes

  • Schedules for big disruptive changes (e.g. switch to KAI 4.0)


Mapping activities to systems

  • Code development: your Linux box, if possible; d0mino is the backup solution

  • Large sample processing: a SAM station

    • d0mino, lxac1, special farm allocation (gtr), (ClueD0 - in R&D)

  • Small sample processing: create derived DS on SAM station, transfer to desktop

  • Office/Web browsing: use your desktop!

  • Remote users: new position to address needs


Mapping activities to systems

  • Disk usage

    • Home areas - backed up; you can ask for up to 250 MB (possibility of more for good reason) BUT NFS-mounted - don’t use for data files!

    • TMP areas - not backed up. Code development and/or data files, allocated per institution. 37 institutions are using it so far. A good place to start off if you are not working with a well-defined project.

    • PRJ areas - not backed up. Code development and/or data files, allocated per project. 3 large pools: commissioning, algorithm development, simulation, plus physics and ID groups and some smaller projects.

  • Web pages - DØ Main Computing (SAM Data Handling section) --> General description of where data samples are stored in our system


ClueD0 Operation

  • The current population is:

    • 111 nodes with 138 CPUs and a total memory of 37 GB

    • 396 users

  • Rules for joining and policies can be found at:

    • http://www-clued0.fnal.gov/clued0/

    • http://www-clued0.fnal.gov/clued0/policies.html

  • Current difficulties from the lack of Redhat 7.1 builds are being actively worked on


Monte Carlo Production Status

  • Current Software – mcp07

    • p07.00.05a Generator, DØgstar, DØsim

    • p08.12.00 DØreco, recoanalyze

    • 950 kevents generated at reco level

    • Run IIB Simulation is a major effort

    • Will move to p08.13.00 to remove memory leak

  • Future Releases – p09.10.00

    • Problem running DØgstar under investigation

    • Plate level available

    • p10 certification will be available by the end of the month



The Offline Resources Board

  • Charge: Allocate offline resources according to the experiment’s priorities

    • Project & tmp disk

    • Sample priorities for simulation on remote farms

    • Partitions in SAM cache

    • Batch queues

  • Chair: Nick Hadley

  • Web page: http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html

  • Institutions which have no tmp disk allocation and have active users

    • email to hadley@fnal.gov - 18 GB will be allocated



R & D

  • Analysis clusters - one in service

  • ClueD0 servers (a relocated analysis cluster) - software being tested; networking strategy being developed

  • Compute servers for dØmino (a user-accessible farm) - 2 nodes available for tests

  • Remote farms for raw data reconstruction and analysis

  • Remote desktop analysis


Institutional contributions

  • Desktop seats

  • Backup tapes

  • Remote simulation capacity

  • Disk for dØmino via budget code - issues

    • How to allocate between project & tmp?

    • Lifetime for contribution?

    • Unit of contribution: 1 rack of disk

  • Analysis cluster for Feynman via budget code

    • Similar issues

  • Analysis cluster for ClueDØ - all the above issues + SAM bandwidth, networking, sysadmin, ...


General Status - Where are the limits/problems?

  • Online

    • Max rate tested: 40 Hz to tape

    • Max rate sustained for a shift, to date: ~25 Hz to tape

    • Max rate expected with next iteration: 60 Hz to tape

    • Final limitation: tape budget (FY02 = ~400 TB)

  • Running p10 on the farms

    • Processes raw data @ 23 sec/event

    • Thanks to Alg Group - worked out of box on raw data

    • Limits: ~2-3 Hz w/ current nodes & cpu perf of reco

    • Output size: HUGE - writing too much tape, breaking DB model, using more than allocated network and disk resources all down the line
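The relation between per-event reconstruction time, processor count, and sustainable rate can be sketched as follows. This is an idealized estimate, not a statement about the actual farm: it assumes every processor reconstructs events back to back with no I/O or scheduling overhead, and the function names are hypothetical. Only the 23 sec/event figure, the processor counts, and the quoted rates come from the slides.

```python
import math


def farm_rate_hz(n_processors, sec_per_event):
    """Ideal aggregate rate if every processor is busy all the time."""
    return n_processors / sec_per_event


def processors_needed(rate_hz, sec_per_event):
    """Smallest processor count that keeps up with rate_hz in the ideal case."""
    return math.ceil(rate_hz * sec_per_event)


P10_SEC_PER_EVENT = 23.0  # p10 reconstruction time quoted on the slide

# Ideal ceilings for the acquired (200) and planned (400) processor counts.
for n in (200, 400):
    print(f"{n} processors -> {farm_rate_hz(n, P10_SEC_PER_EVENT):.1f} Hz")

# Processors required to match the quoted rates at 23 sec/event.
for rate in (2, 3, 25):
    print(f"{rate} Hz -> {processors_needed(rate, P10_SEC_PER_EVENT)} processors")
```

Under these assumptions, 200 processors top out below 9 Hz at 23 sec/event, and matching the ~25 Hz sustained online rate would take 575 processors, which illustrates why reco cpu performance is listed among the limits.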


Expected Farm Performance


General Status - Where are the limits/problems?

  • SAM/ENSTORE status

    • Working for many months with servers on automatic recovery

    • Not all features complete (pick events)

    • 5 GB interfaces: can deliver 150 MB/sec to dØmino

  • Robot status

    • Design rates met, but robustness severely limited by Mammoth II (M II) drive error rate - plan switchover by end of shutdown
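As a rough illustration of what 150 MB/sec means in practice, the sketch below estimates how long it would take to refill the 6 TB SAM cache on dØmino at that rate. It assumes a steady transfer at the quoted rate and decimal units (1 TB = 10^6 MB). The function name is hypothetical and the result is an order-of-magnitude estimate only.

```python
def transfer_time_hours(size_tb, rate_mb_per_s):
    """Hours to move size_tb at a steady rate, with 1 TB = 1e6 MB."""
    seconds = size_tb * 1_000_000 / rate_mb_per_s
    return seconds / 3600


SAM_CACHE_TB = 6          # SAM cache allocation from the disk space slide
DELIVERY_MB_PER_S = 150   # delivery rate to dØmino quoted on this slide

hours = transfer_time_hours(SAM_CACHE_TB, DELIVERY_MB_PER_S)
print(f"refilling {SAM_CACHE_TB} TB at {DELIVERY_MB_PER_S} MB/s takes about {hours:.1f} h")
```

At the quoted rate a full cache turnover takes on the order of 11 hours, so cache contents can in principle be cycled about twice a day.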


Future Plan

  • Major purchases still in FY02

    • New robot and reliable drives

    • New farm nodes

    • More memory for dØmino

    • *Some* linux cpu

  • Continue R&D for linux analysis strategies

    • Hope to establish effectiveness and practicality of the three proposed models: AC, CS, AC@DØ

  • Operational improvements

    • SAM personnel @ DØ

    • RECO: continue with current release schedules; emphasize quality control and testing for releases; push on cpu, memory, output size issues