
ATLAS MC production errors per site




Presentation Transcript


  1. ATLAS MC production errors per site

  2. Overview
     • MC production monitoring reminder
     • 2008 statistics
     • 2008 errors
     • Comparison with 2007
     Eric Lançon

  3. Panda server at CERN
     • Please stop using the Panda monitor at BNL
     • Use the CERN instance instead: http://panda.cern.ch/

  4. http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/site-admin?cloud=&grouping=site
     • Enter a site name, for example GRIF

  5. Eric Lancon

  6. Click GRIF to get statistics per CE

  7. Eric Lancon

  8. Click + to get error messages

  9. Eric Lancon

  10. Click FR to get details of jobs running, allocated, etc.

  11. Pilots (3hrs): number of pilots on the site in the last 3 hours
      Assigned: job assigned to the site, input not yet available there
      Activated: input available, waiting for a pilot
      Failed: failures in the last 12 hours

  12. ATLAS MC production - 2008

  13. Statistics on FR-cloud sites: 1,636,410 jobs

  14. Eric Lancon

  15. Some variations between sites and within the year

  16. Source of errors
      • ATLAS software errors
        • Should not happen at T2s
        • T1 only used for tests
      • Panda problems
        • Communication between pilots and database
        • Bugs in pilot code
      • Site problems
        • ATLAS software setup (although it has previously been checked)
        • Local storage problem
      • Archiving of results at T1
        • Shipping and storing

  17. ATTENTION – ACHTUNG
      • Internal ATLAS error types will be used
      • They have changed during the year
        • Example:
          • EXEPANDA_DQ2PUT_FILECOPYERROR
          • WRAPOSG_DQ2PUT_FILECOPYERROR
        • Same error (unable to register output file), but it will appear twice
      • Finally… I am not sure I understand all the errors
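The double counting described on slide 17 could be avoided by mapping renamed error codes onto one label before counting. The following is a minimal sketch of that idea, not the procedure used for the presentation: only the two error codes come from the slide, while the canonical label `DQ2PUT_FILECOPYERROR`, the helper names and the sample list are hypothetical.

```python
from collections import Counter

# Hypothetical alias table. The two keys are the codes quoted on the slide;
# the shared canonical label is an assumption made for this illustration.
ERROR_ALIASES = {
    "EXEPANDA_DQ2PUT_FILECOPYERROR": "DQ2PUT_FILECOPYERROR",
    "WRAPOSG_DQ2PUT_FILECOPYERROR": "DQ2PUT_FILECOPYERROR",
}


def canonical_error(code: str) -> str:
    """Return the canonical label for a code, or the code itself if unknown."""
    return ERROR_ALIASES.get(code, code)


def count_errors(codes):
    """Count errors after merging codes that describe the same failure."""
    return Counter(canonical_error(code) for code in codes)


# Made-up sample: the same failure reported under its two names,
# plus one unrelated, hypothetical code.
sample = [
    "EXEPANDA_DQ2PUT_FILECOPYERROR",
    "WRAPOSG_DQ2PUT_FILECOPYERROR",
    "SOME_OTHER_ERROR",
]
merged = count_errors(sample)
print(merged)
```

With such a table the two names contribute to a single entry, so a per-site ranking of errors is not split by a mid-year renaming.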

  18. ATLAS software; input file problem; black-hole site for a short period

  19. Site / Site Analysis
      • T1 not considered
        • Serves as ATLAS software validation
        • Does reconstruction (N inputs -> 1 output), whereas T2s mainly do simulation (1 input -> 1 output)

  20. Transfer timeout, Tokyo -> T1

  21. Pilot communication lost; input storage; output storage; transfer timeout, GRIF -> T1; killed by batch system

  22. Output storage; LFC problem; pilot communication lost; ATLAS software error; killed by batch system

  23. Missing ATLAS software; output storage; LFC problem; input storage; ATLAS software error

  24. Killed by batch system; input storage; ATLAS software error

  25. Missing ATLAS software; input storage; ATLAS software error

  26. ATLAS software error; output storage; input storage; pilot communication lost (/afs)

  27. Errors per quarter
      • Performed for some sites only… apologies
      • Some sites have almost the same errors throughout the year (sometimes site-independent)
      • Some sites have different errors over the year (stability)

  28. Eric Lancon

  29. Eric Lancon

  30. Eric Lancon

  31. Eric Lancon

  32. <ATLAS>: all ATLAS
      CC-Lyon: T1 part
      T2/T3: FR sites except the T1
      • Job efficiency
        • Shows all problems (configuration, inputs, output)
      • CPU efficiency
        • 80% = 20% wasted resources
        • Shows mainly output problems
      2008 update of the tables presented at the CP-LCG-France, Feb. 2008
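The arithmetic behind "80% = 20% wasted resources" on slide 32 can be sketched as below. The slide does not define the two efficiencies, so the definitions here are assumptions: job efficiency as successful jobs over all jobs, and CPU efficiency as CPU time spent in successful jobs over all CPU time. All numbers are illustrative and not taken from the presentation.

```python
def efficiency(useful: float, total: float) -> float:
    """Return useful / total, rejecting an empty or inconsistent sample."""
    if total <= 0 or not 0 <= useful <= total:
        raise ValueError("need 0 <= useful <= total and total > 0")
    return useful / total


# Illustrative numbers only; they are NOT taken from the presentation.
jobs_ok, jobs_total = 900, 1000          # successful jobs vs. all jobs
cpu_ok_h, cpu_total_h = 4000.0, 5000.0   # CPU hours in successful jobs vs. all CPU hours

job_eff = efficiency(jobs_ok, jobs_total)    # assumed definition of job efficiency
cpu_eff = efficiency(cpu_ok_h, cpu_total_h)  # assumed definition of CPU efficiency
wasted = 1.0 - cpu_eff                       # share of CPU time spent in failed jobs

print(f"job efficiency: {job_eff:.0%}")
print(f"CPU efficiency: {cpu_eff:.0%}, wasted: {wasted:.0%}")
```

Under these assumed definitions, a job that fails only while storing its output has already used its full CPU time, which would be consistent with the slide's remark that CPU efficiency mainly reflects output problems.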

  33. Some conclusions
      • Errors are very much site dependent
        • Except output storage
      • Site errors can only be reduced by
        • Careful attention to ATLAS jobs on site
        • By someone at the site
      • Only big errors are spotted by ATLAS central operations and by FR-ATLAS
        • The others need site operations

  34. FT tests
      • DDM
        • 'Weekly' transfer tests
        • http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site?name=&statsInterval=4&fromDate=2009-01-12%2012:40&toDate=2009-01-12%2016:40&activity=2
      • MC production
        • Weekly frequency, but still little used
        • http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/overview?task-flag=functional%20test&period=last-9-days&grouping=cloud
      • Analysis
        • ST to be renewed regularly; frequency?
        • stagein
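The dashboard links on slide 34 carry their selection in the query string, with `%20` standing for a space. As a small sketch using only the Python standard library, the DDM link can be decoded to see which time window it selects. The variable names are illustrative, and the meaning of parameters such as `statsInterval` and `activity` is not stated on the slide, so none is assumed here.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# DDM dashboard link as quoted on slide 34.
URL = ("http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site"
       "?name=&statsInterval=4&fromDate=2009-01-12%2012:40"
       "&toDate=2009-01-12%2016:40&activity=2")

# keep_blank_values=True preserves the empty "name=" parameter.
query = parse_qs(urlsplit(URL).query, keep_blank_values=True)
params = {key: values[0] for key, values in query.items()}

# The dates decode to "YYYY-MM-DD HH:MM" once %20 becomes a space.
fmt = "%Y-%m-%d %H:%M"
start = datetime.strptime(params["fromDate"], fmt)
end = datetime.strptime(params["toDate"], fmt)
window = end - start

print(params)
print("selected window:", window)
```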
