
ATLAS DC2 Pile-up Jobs on LCG

Atlas DC Meeting

February 2005

[email protected]


Pile-up tasks

  • Jobs defined in 3 tasks:

    • 210 dc2.003002.lumi10.A2_z_mumu.task

    • 307 dc2.003026.lumi10.A0_top.task

    • 308 dc2.003004.lumi10.A3_z_tautau.task

  • Minimum-bias input files (700 GB) were distributed to selected sites using DQ

  • Each job used 8 minimum-bias input files (~250 MB each), downloaded from the close SE, plus 1 signal input file

  • Each job required 1 GB of RAM

5 sites involved

  • golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgatlasprod

  • lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite

  • lcgce01.triumf.ca:2119/jobmanager-lcgpbs-atlas

  • lcgce02.ifae.es:2119/jobmanager-lcgpbs-atlas

  • t2-ce-01.roma1.infn.it:2119/jobmanager-lcgpbs-infinite

Number of jobs per site

Status

JOBSTATUS   NJOBS
failed       3702
finished     5703
pending       323
running        64

21 jobs have JOBSTATUS finished but CURRENTSTATE ABORTED. These are probably initial tests: ENDTIME = 23-SEP-04, 30-SEP-04 and 07-OCT-04
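
The status counts can be turned into fractions with a few lines of Python. This is only a sketch: the dictionary restates the JOBSTATUS counts from this slide, and the two ratios (finished over all jobs, and finished over finished plus failed) are assumed definitions that the slide itself does not give.

```python
# Job counts copied from the JOBSTATUS table on this slide.
status_counts = {"failed": 3702, "finished": 5703, "pending": 323, "running": 64}

total_jobs = sum(status_counts.values())
completed = status_counts["finished"] + status_counts["failed"]

# Assumed definitions (not stated on the slide).
finished_of_all = status_counts["finished"] / total_jobs
finished_of_completed = status_counts["finished"] / completed

print(f"total jobs: {total_jobs}")
print(f"finished / all jobs: {finished_of_all:.1%}")
print(f"finished / (finished + failed): {finished_of_completed:.1%}")
```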

Why such big differences in efficiency?

PRAGUE: 48% TW: 70%

ATTEMPT   NJOBS
1          2442
2           466
3           244
4           291
5           130
6            71
7            66
8            52
9            48
10           26
11            7

ATTEMPT   NJOBS
1          2662
2           361
3           184

  • Other differences:

    • RB on TW

    • lexor running on UI on TW

    • many signal files stored on SE on TW
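
One possible way to relate the ATTEMPT tables to an efficiency is to divide the number of jobs by the total number of attempts they consumed. The sketch below does this under a loud assumption: that each row counts jobs whose last attempt number is ATTEMPT. The slide does not say this, nor does it say which table belongs to which site, so the tables are simply named by their order.

```python
# ATTEMPT -> NJOBS, copied from the two tables on this slide (in slide order).
first_table = {1: 2442, 2: 466, 3: 244, 4: 291, 5: 130, 6: 71,
               7: 66, 8: 52, 9: 48, 10: 26, 11: 7}
second_table = {1: 2662, 2: 361, 3: 184}


def jobs_per_attempt(table):
    """Ratio of jobs to total attempts.

    ASSUMPTION (not from the slide): each row counts jobs whose final
    attempt number equals ATTEMPT, so a job in row k used k attempts.
    """
    jobs = sum(table.values())
    attempts = sum(attempt * njobs for attempt, njobs in table.items())
    return jobs / attempts


for name, table in (("first", first_table), ("second", second_table)):
    print(f"{name} table: {sum(table.values())} jobs, "
          f"jobs/attempts = {jobs_per_attempt(table):.1%}")
```

Under this assumption the first table gives about 48% and the second about 81%. The first value is close to the 48% quoted for PRAGUE, but the second does not match the 70% quoted for TW, so the efficiency on the slide may well be defined differently.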

Failures

  • It is not easy to get the cause of a failure from proddb

    • VALIDATIONDIAGNOSTIC is quite difficult to parse with a script:

    • <workernode>t2-wn-36.roma1.infn.it</workernode><retcode>1</retcode><time>0m2.360s</time><error>STAGE-IN failed: WARNING: No FILE or RFIO access for existing replicasWARNING: Replication of sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to close SE failed: Error in replicating PFN sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to t2-se-01.roma1.infn.it: lcg_aa: File existslcg_aa: File existsGiving up after attempting replication TWICE.WARNING: Could not stage input file sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Gridftp copy failed from gsiftp://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to file:/home/atlassgm/globus-tmp.t2-wn-36.17931.0/WMS_t2-wn-36_018404_https_3a_2f_2flcg00124.grid.sinica.edu.tw_3a9000_2fKv9HpVIUkMLTBBe-Ia3xLA/dc2.003002.simul.A2_z_mumu._01477.pool.root: the server sent an error response: 550 550 /castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Invalid argument.EDGFileCatalog: level[Always] Disconnected</error><stageOut>No log for stageout phase</stageOut>

    • Middleware (mw) failures:

      • <JobInfo>Job RetryCount (0) hit</JobInfo>
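
A rough first pass over VALIDATIONDIAGNOSTIC strings could look like the sketch below. It assumes the field is a flat run of `<tag>value</tag>` pairs with no root element, as in the examples in this talk, and therefore uses a regular expression rather than an XML parser. The function names and the category labels are hypothetical, and the sample string is one of the diagnostics quoted later in these slides.

```python
import re

# Matches simple <tag>value</tag> pairs; the diagnostic has no single root
# element, so a regex is used here instead of an XML parser.
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
TIME_RE = re.compile(r"(\d+)m([\d.]+)s")


def parse_diagnostic(text):
    """Return the tagged fields of one VALIDATIONDIAGNOSTIC string as a dict."""
    return dict(FIELD_RE.findall(text))


def elapsed_seconds(fields):
    """Convert a <time> value such as '66m58.650s' to seconds, or None."""
    match = TIME_RE.fullmatch(fields.get("time", ""))
    if match is None:
        return None
    return int(match.group(1)) * 60 + float(match.group(2))


def classify(fields):
    """Coarse failure category; the category names are hypothetical."""
    error = fields.get("error", "")
    if "JobInfo" in fields:
        return "middleware"
    if error.startswith("STAGE-IN failed"):
        return "stage-in"
    if "AthenaCrash" in error:
        return "athena-crash"
    if error.startswith("Transformation error"):
        return "transformation"
    if fields.get("retcode") == "0":
        return "ok"
    return "unknown"


sample = ("<workernode>lcg00172.grid.sinica.edu.tw</workernode><retcode>2</retcode>"
          "<time>0m23.660s</time><error>Transformation error: -------- Problem report "
          "-------[SOFTWARE]AthenaCrash================================</error>"
          "<stageOut>No log for stageout phase</stageOut>")
fields = parse_diagnostic(sample)
print(fields["workernode"], classify(fields), elapsed_seconds(fields))
```

Because the error text itself has lost its line breaks, anything finer than these coarse categories would probably need per-message patterns.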

Some Jobs with many Attempts

JOBDEFINITIONID=459795

  • Attempt 1: 09-NOV-04

    • <workernode>t2-wn-42.roma1.infn.it</workernode><retcode>1</retcode><time>0m43.250s</time><error>Transformation error: -------- Problem report -------[Unknown Problem]AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog================================-------- Problem report -------[Unknown Problem]PileUpEventLoopMgrWARNING Original event selector has no events================================</error><stageOut>No log for stageout phase</stageOut>

  • ...

  • Attempt 11: 15-DEC-04

    • <workernode>goliasx76.farm.particle.cz</workernode><retcode>1</retcode><time>0m41.460s</time><error>Transformation error: -------- Problem report -------[Unknown Problem]AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog================================-------- Problem report -------[Unknown Problem]PileUpEventLoopMgrWARNING Original event selector has no events================================</error><stageOut>No log for stageout phase</stageOut>

  • JOBDEFINITIONID=456843

    • Attempt 1:

    • <workernode>t2-wn-37.roma1.infn.it</workernode><retcode>1</retcode><time>0m2.830s</time><error>STAGE-IN failed: WARNING: No FILE or RFIO access for existing replicasWARNING: Replication of srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to close SE failed: Error in replicating PFN srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to t2-se-01.roma1.infn.it: lcg_aa: File existslcg_aa: File existsGiving up after attempting replication TWICE.WARNING: Could not stage input file srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6: Get TURL failed: lcg_gt: Communication error on sendEDGFileCatalog: level[Always] Disconnected</error><stageOut>No log for stageout phase</stageOut>

    • Attempt 2:

    • <workernode>lcg00172.grid.sinica.edu.tw</workernode><retcode>2</retcode><time>0m23.660s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>

    • ...

    • Attempt 9:

    • <workernode>goliasx44.farm.particle.cz</workernode><retcode>2</retcode><time>0m23.340s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>

  • JOBDEFINITIONID=504139

    • Attempt 1:

    • <workernode>t2-wn-48.roma1.infn.it</workernode><retcode>2</retcode><time>66m58.650s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>

    • Attempt 2:

    • <workernode>lcg00144.grid.sinica.edu.tw</workernode><retcode>2</retcode><time>66m56.800s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>

    • the same up to attempt 5

    • Attempt 6: mw failure

    • Attempt 7:

    • <workernode>goliasx60.farm.particle.cz</workernode><retcode>0</retcode><time>152m53.780s</time>

  • ???

Job properties

  • There is no exact match between a job in the Oracle DB and an entry in the PBS log file

  • STARTTIME and ENDTIME are just hints

  • Some jobs on golias:

    • 1232 finished jobs in December registered in proddb

    • 1299 jobs selected from the December PBS logs, using cuts on CPU time and virtual memory

  • Nodes: 3.06 GHz Xeon, 2 GB RAM

  • Histograms are based on information from the PBS log files
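
Selecting jobs from PBS logs with cuts on CPU time and virtual memory might be sketched as follows. Everything specific here is an assumption: the code expects OpenPBS/Torque-style accounting lines (`date time;record type;job id;key=value ...`), the two sample lines are invented for illustration, and the cut values are hypothetical because the talk does not state the ones actually used.

```python
def hms_to_seconds(value):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line):
    """Parse one accounting line; return a dict for 'E' (job end) records, else None."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    attrs = dict(item.split("=", 1) for item in parts[3].split() if "=" in item)
    return {
        "jobid": parts[2],
        "cput_s": hms_to_seconds(attrs["resources_used.cput"]),
        "walltime_s": hms_to_seconds(attrs["resources_used.walltime"]),
        "vmem_kb": int(attrs["resources_used.vmem"].lower().replace("kb", "")),
    }


# Hypothetical cuts; the real values are not given in the talk.
MIN_CPUT_S = 3600       # at least 1 h of CPU time
MIN_VMEM_KB = 500_000   # at least ~500 MB of virtual memory


def passes_cuts(record):
    return record["cput_s"] >= MIN_CPUT_S and record["vmem_kb"] >= MIN_VMEM_KB


# Invented sample lines, for illustration only.
sample_lines = [
    "12/15/2004 10:42:07;E;12345.example-server;user=atlassgm queue=lcgatlasprod "
    "resources_used.cput=02:10:05 resources_used.mem=612000kb "
    "resources_used.vmem=905000kb resources_used.walltime=02:31:40",
    "12/15/2004 10:43:11;E;12346.example-server;user=atlassgm queue=lcgatlasprod "
    "resources_used.cput=00:00:02 resources_used.mem=3000kb "
    "resources_used.vmem=9000kb resources_used.walltime=00:00:41",
    "12/15/2004 10:44:00;S;12347.example-server;user=atlassgm queue=lcgatlasprod",
]

records = [r for r in map(parse_end_record, sample_lines) if r is not None]
selected = [r for r in records if passes_cuts(r)]
print([r["jobid"] for r in selected])
```

Since the batch system carries no job name, a selection like this can only approximate the set of production jobs, which fits the mismatch between the proddb and PBS counts on this slide.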

Some jobs (6) ran successfully on a machine with only 1 GB of RAM, but the wall time was 20 h, probably because of heavy swapping

  • WN -> SE -> NFS server

  • WN has the same NFS mount – could it be used directly?

Conclusions

  • There is no job name in the local batch system, so jobs are difficult to identify

  • The version of the lexor executor should be recorded in proddb

  • proddb responds very slowly; these queries were run on atlassg, which has a snapshot of proddb from Feb 8

  • Log files should be studied before MAXATTEMPT is increased

  • proddb should be cleaned
