ATLAS Report


Presentation Transcript


  1. ATLAS Report 14 April 2010 RWL Jones

  2. The Good News
  • At the IoP meeting before Easter, Dave Charlton said the best thing about the Grid was that there was nothing to say about it
  • It is good he thinks so!
  • But it is also a little hopeful!
  • But stick with the good news for now…
  • People are doing real work on the Grid, and in large numbers

  3. Data throughput
  [Plot: total ATLAS data throughput via Grid (Tier-0, Tier-1s, Tier-2s), MB/s per day, with cosmics, beam splashes, first collisions and the end of data-taking marked]
  • ATLAS was pumping data out at a high rate, despite the low luminosity
  • We are limited by trigger rate and event size
  • We also increased the number of first-data replicas
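
The trigger-rate and event-size limit can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the 200 Hz RAW rate, 1.6 MB event size and replica count are assumed nominal values, not figures taken from this report.

```python
# Back-of-the-envelope Tier-0 export rate: trigger rate x event size x replicas.
# All numbers are assumed nominal values, not figures from this report.
trigger_rate_hz = 200       # assumed nominal RAW trigger rate
raw_event_size_mb = 1.6     # assumed average RAW event size (MB)
n_first_replicas = 2        # assumed number of first-pass replicas shipped out

raw_rate_mb_s = trigger_rate_hz * raw_event_size_mb
export_rate_mb_s = raw_rate_mb_s * n_first_replicas

print(f"RAW out of Tier-0: {raw_rate_mb_s:.0f} MB/s")
print(f"Shipped with {n_first_replicas} replicas: {export_rate_mb_s:.0f} MB/s")
```

Nothing in this estimate depends on the luminosity, which is why the throughput stays high even with few collisions per second.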

  4. 900 GeV reprocessing
  • We are reprocessing swiftly and often
  • We need prompt and continuous response
  • This dataset is tiny compared with what is to come!

  5. Reprocessed data to the UK
  • Only one outstanding data-transfer issue for the UK by the end of the Christmas period

  6. Tier 2 Nominal Shares 2010

  7. Event data flow to the UK
  • Data distribution has been very fast despite event size
  • Most datasets available in Tier-2s in a couple of hours

  8. High Energy Data Begins
  [Plots: UK Tier-1 throughput (MB/s), UK Tier-2 throughput (MB/s), UK Tier-2 transfer volume (GB)]

  9. Analysis
  • This is going generally well
  • Previously recognised good sites are generally doing well
  • Workload is uneven, depending on data distribution
  • This is not yet in ‘equilibrium’ as the data is sparse
  • Remember, datasets for specific streams go to specific sites, so usage reflects those hosting minimum-bias samples
  • However, the fact that UK sites were favoured (e.g. Glasgow) also reflects good performance
  • The 7 TeV data will move things more towards equilibrium

  10. Data Placement
  • There are issues in the current ATLAS distribution
  • There should be a full set in each cloud
  • This has not always happened because of a bug
  • We need to be more responsive to site performance & capacity
  • At the moment, the UK has been patching in extra copies ‘manually’
  • ATLAS has followed UK ideas and has introduced ‘primary’ and ‘secondary’ copies of datasets
    • Secondary copies live only while space permits
    • This improves access – the UK typically has 1.6 copies
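
As a minimal sketch of the primary/secondary bookkeeping described above, the snippet below records replicas with a primary flag and computes the average number of copies per dataset (the quantity behind the 1.6 figure). The Replica class, the dataset and site names, and the function are illustrative assumptions, not the real ATLAS DDM/DQ2 implementation.

```python
from dataclasses import dataclass
from collections import defaultdict

# Illustrative replica record; not the real ATLAS DDM (DQ2) schema.
@dataclass
class Replica:
    dataset: str
    site: str
    primary: bool   # primaries are pinned; secondaries live only while space permits

def average_copies(replicas):
    """Average number of copies per dataset, counting primaries and secondaries."""
    counts = defaultdict(int)
    for r in replicas:
        counts[r.dataset] += 1
    return sum(counts.values()) / len(counts) if counts else 0.0

# Illustrative example: two datasets, three replicas, so 1.5 copies on average.
replicas = [
    Replica("data10_900GeV.MinBias.ESD", "UKI-SCOTGRID-GLASGOW", primary=True),
    Replica("data10_900GeV.MinBias.ESD", "UKI-NORTHGRID-LANCS", primary=False),
    Replica("data10_900GeV.Egamma.AOD", "UKI-LT2-QMUL", primary=True),
]
print(f"average copies per dataset: {average_copies(replicas):.1f}")
```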

  11. The UK and Data Placement
  • The movement and placement of data must be managed
  • Overload of the data management system slows the system down for everyone
  • Unnecessary multiple copies waste disk space and will prevent a full set being available
  • Some multiple copies will be a good idea, to balance loads
  • We have a group for deciding the data placement:
    • UK Physics Co-ordinator, UK deputy spokesman, Tony Doyle (UK data ops), Roger Jones (UK ops), aided by Stewart, Love & Brochu
  • The UK Physics Co-ordinator consults the institute physics reps
  • The initial data plan follows the matching of trigger type to site from previous exercises
  • We will make second copies until we run short of space; then the second copies will be removed *at little notice*
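
The last bullet is essentially a cache-eviction policy for disk space: secondary copies are removed, least-recently-used first, when a site runs short. The sketch below illustrates that policy under assumed record fields and an assumed free-space threshold; it is not the actual UK or ATLAS deletion tooling.

```python
# Minimal sketch of evicting secondary dataset copies when space runs short.
# Field names, sizes and the threshold are illustrative assumptions.

MIN_FREE_TB = 10.0  # assumed amount of free space we want to restore

def evict_secondaries(replicas, free_tb):
    """Remove least-recently-used secondary copies until free space reaches MIN_FREE_TB.

    `replicas` is a list of dicts with keys: dataset, primary, size_tb, last_access.
    Returns (datasets_removed, new_free_tb). Primary copies are never touched.
    """
    secondaries = sorted(
        (r for r in replicas if not r["primary"]),
        key=lambda r: r["last_access"],          # oldest access first
    )
    removed = []
    for r in secondaries:
        if free_tb >= MIN_FREE_TB:
            break
        free_tb += r["size_tb"]
        removed.append(r["dataset"])
    return removed, free_tb

replicas = [
    {"dataset": "data10.MinBias.ESD", "primary": True,  "size_tb": 8.0, "last_access": 5},
    {"dataset": "data10.MinBias.ESD", "primary": False, "size_tb": 8.0, "last_access": 1},
    {"dataset": "data10.Egamma.AOD",  "primary": False, "size_tb": 2.0, "last_access": 3},
]
print(evict_secondaries(replicas, free_tb=4.0))
```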

  12. But dataset x is not in the UK
  • In general, this should not be the case, unless it is RAW
  • Access it elsewhere (unless it is RAW or less-popular ESD)
  • The job goes to the data, not the data to you
  • We can copy small amounts to the UK on request
    • E.g. my Higgs candidate in RAW or ESD
  • But we must manage it – specify:
    • What the need for the data is (activity, which physics and performance group)
    • Why it is not already covered by a physics or performance group area
    • How big it will be *at a maximum*
    • How the data will be used (what sort of job is to be run, database access, etc.)
  • We are still surprised to see requests for datasets that are freely available on the Grid in the UK to be copied to ‘their’ Tier 2
  • Local requests should go to Tier 3 (non-GridPP) space
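
As an illustration of the information a managed copy request needs to carry, the record below mirrors the four "specify" points above. The ReplicationRequest class, its field names and the example values are assumptions for illustration, not an existing ATLAS or GridPP tool.

```python
from dataclasses import dataclass

# Illustrative request record mirroring the information asked for above;
# not an existing ATLAS or GridPP tool.
@dataclass
class ReplicationRequest:
    dataset: str          # dataset name as registered on the Grid
    activity: str         # what the data is needed for (physics/performance group)
    justification: str    # why an existing group area does not already cover it
    max_size_gb: float    # upper bound on the volume to be copied
    usage: str            # what sort of job will run, database access, etc.

    def validate(self):
        problems = []
        if self.max_size_gb <= 0:
            problems.append("maximum size must be positive")
        if not self.justification:
            problems.append("explain why a group area does not already cover this")
        return problems

req = ReplicationRequest(
    dataset="data10_900GeV.physics_MinBias.ESD",   # illustrative name
    activity="Minimum-bias performance studies",
    justification="Only partial ESD coverage in the relevant group area",
    max_size_gb=500.0,
    usage="Athena analysis jobs with conditions DB access",
)
print(req.validate() or "request looks complete")
```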

  13. Site responsibilities
  • Sites are either:
    • Supporting important datasets
    • Supporting physics groups
  • Reliability is vital
    • We need to be in the high 90s at least
    • This means paying a lot of attention to ATLAS monitoring and not just to SAM tests
    • The switch to a SAM Nagios-based system is potentially useful, but there are many bugs to be ironed out
    • Sites just have to look pro-actively at the ATLAS dashboards (blame this on the infrastructure people again)
  • We are reviewing middleware, but the sites must play their part
  • Local monitoring is important
    • It should not be users who spot site problems first!
    • Sites must also look at ATLAS monitoring, not just SAM tests – they are not enough
    • ATLAS is working to help with this…
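
A minimal sketch of the "high 90s" reliability check, assuming per-site lists of test pass/fail results and a 95% target; the threshold and input format are illustrative, not the official availability calculation.

```python
# Minimal sketch of a 'high 90s' reliability check from per-test results.
# The 0.95 threshold and the input format are illustrative assumptions.

RELIABILITY_TARGET = 0.95

def site_reliability(results):
    """Fraction of passed tests; `results` is a list of booleans (True = passed)."""
    return sum(results) / len(results) if results else 0.0

def flag_unreliable(sites):
    """Return site names whose reliability falls below the target."""
    return [name for name, results in sites.items()
            if site_reliability(results) < RELIABILITY_TARGET]

sites = {
    "UKI-SCOTGRID-GLASGOW": [True] * 98 + [False] * 2,   # 98% - fine
    "UKI-EXAMPLE-T2":       [True] * 90 + [False] * 10,  # 90% - flagged
}
print(flag_unreliable(sites))
```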

  14. Monitoring & Validation
  • ATLAS is working to improve the monitoring
  • Learn more from the user jobs:
    • We focus on “active” probing of the sites
    • But “passive” yet automatic observation of the user jobs would lead to a better understanding of what is happening at the sites
  • The current ADC metrics for analysis are the HammerCloud tests using the GangaRobot
    • These tests are heavy but fairly reliable
    • They reflect the computing model and needs in the data-taking era
  • Reminder:
    • About 55% of CPU is for ATLAS-wide analysis
    • About 100% of disk is for ATLAS-wide analysis
    • About 0% of either is for local use!
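
A sketch of the "passive" idea: instead of relying only on dedicated test jobs, summarise the outcomes of real user jobs per site. The job-record format below is an assumption; the actual ADC monitoring is considerably more elaborate.

```python
from collections import defaultdict

# Sketch of "passive" monitoring: summarise real user-job outcomes per site.
# The job-record format is an assumption; real ADC monitoring is more elaborate.

def per_site_success(jobs):
    """jobs: iterable of (site, status) pairs, status in {'finished', 'failed'}."""
    totals = defaultdict(lambda: [0, 0])           # site -> [finished, failed]
    for site, status in jobs:
        totals[site][0 if status == "finished" else 1] += 1
    return {site: ok / (ok + bad) for site, (ok, bad) in totals.items()}

jobs = [
    ("UKI-SCOTGRID-GLASGOW", "finished"),
    ("UKI-SCOTGRID-GLASGOW", "finished"),
    ("UKI-SCOTGRID-GLASGOW", "failed"),
    ("UKI-LT2-QMUL", "finished"),
]
print(per_site_success(jobs))
```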

  15. GangaRobot Today
  • ~8 tests per site per day, with a mix of:
    • A few different releases
    • Different access modes
    • MC and real data
    • Conditions DB access
  • All are defined/prepared by Johannes Elmshauser
  • Results on the GR web pages and in SAM
    • Non-critical; sites usually ignore it
  • Auto-blacklisting on EGEE/WMS
  • Twice-daily email report sent to DAST containing:
    • Sites with failures
    • Sites with no job submitted (brokerage error, e.g. no release)
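
A sketch of how the twice-daily report could be assembled from per-site test results, separating sites with failures from sites where no job was submitted at all. The input format is an assumption, not the real GangaRobot/DAST reporting code.

```python
# Sketch of the twice-daily report: which sites had failures and which had no
# jobs submitted at all. The input format is an illustrative assumption.

def build_report(results):
    """results: dict site -> list of test statuses ('ok', 'failed');
    an empty list means no job was submitted to that site."""
    failing = sorted(s for s, r in results.items() if any(x == "failed" for x in r))
    no_jobs = sorted(s for s, r in results.items() if not r)
    return {"sites_with_failures": failing, "sites_with_no_job_submitted": no_jobs}

results = {
    "UKI-SCOTGRID-GLASGOW": ["ok"] * 8,
    "UKI-NORTHGRID-LANCS":  ["ok"] * 6 + ["failed"] * 2,
    "UKI-EXAMPLE-T2":       [],   # brokerage error, e.g. release not installed
}
print(build_report(results))
```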

  16. ATLAS Validation – GR2
  • New tool, GR2, under development to validate sites
    • Lighter load on sites – GR2 is HC in ‘gentle mode’
  • Concept of test templates (release, analysis, dataset pattern, [sites])
    • Defined by ADC
  • Still has bugs
    • Installations need to be clearly defined and installed
    • Test samples need to be in place
  • This will almost certainly be the framework for future metrics
    • The metrics themselves require more experience to define
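
To make the test-template concept concrete, here is a minimal sketch built only from the fields named on the slide (release, analysis, dataset pattern, optional site list). The class, the expansion logic and the example values are assumptions, not the real GR2 implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sketch of a GR2-style test template, based only on the fields named on the
# slide. The class and expansion logic are assumptions, not the real GR2 code.
@dataclass
class TestTemplate:
    release: str                       # e.g. an Athena release to test against
    analysis: str                      # the analysis job to run
    dataset_pattern: str               # pattern selecting input datasets
    sites: Optional[List[str]] = None  # None means "all analysis sites"

    def expand(self, all_sites):
        """One (site, template) test per targeted site."""
        targets = self.sites if self.sites is not None else all_sites
        return [(site, self) for site in targets]

template = TestTemplate(
    release="15.6.9",                              # illustrative release number
    analysis="muon_performance_check",             # illustrative analysis name
    dataset_pattern="data10_7TeV.*.Muons.*AOD*",   # illustrative pattern
)
print(len(template.expand(["UKI-SCOTGRID-GLASGOW", "UKI-LT2-QMUL"])), "tests defined")
```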

  17. Installations
  • Our sites have apparently been ‘off’ because of missing releases
  • ATLAS central is also slow at responding to problems with non-CERN set-ups
  • A major clean-up is underway
  • An auto-retry installer is under development
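
An "auto-retry installer" is essentially retry-with-backoff wrapped around the release installation step. The sketch below shows that pattern with an assumed installer callable and illustrative retry limits; it is not the actual ATLAS installation system.

```python
import time

# Retry-with-backoff pattern behind an "auto-retry installer". The installer
# callable and the retry limits are illustrative assumptions.

def install_with_retry(install_release, release, max_attempts=3, initial_delay_s=60):
    """Call install_release(release) until it succeeds or attempts run out."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            install_release(release)
            return True
        except RuntimeError as err:     # assumed failure signal from the installer
            print(f"attempt {attempt} for {release} failed: {err}")
            if attempt < max_attempts:
                time.sleep(delay)
                delay *= 2              # exponential backoff between attempts
    return False

# Example with a stand-in installer that fails once, then succeeds.
attempts = {"n": 0}
def fake_installer(release):
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient SE/network error")

print(install_with_retry(fake_installer, "AtlasProduction 15.6.9", initial_delay_s=1))
```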

  18. PANDA & WMS
  • There are now two distinct groups of users:
    • Those who use the PANDA back-end
    • Those who use the WMS
  • There is less monitoring of the WMS, and less control
  • Some (e.g. Graeme) favour a tight focus on the PANDA approach
    • I am not sure this is possible
    • However, ATLAS clearly has more feedback and more control if this route is taken
  • Do not be surprised!

  19. Middleware
  • Sites cannot be made 100% reliable with the current middleware
  • Many options are being considered
    • In particular, data management may reduce from 3 layers to 2
    • If so, this would effectively remove the LFC
    • Radical options are also being considered
  • BUT ATLAS is involved in recognizing the limitations of the system today and making it work
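
One way to picture the "3 layers to 2" idea: today a logical file name is resolved to physical replicas through a per-file catalogue (the LFC), whereas a 2-layer scheme could construct the physical path deterministically from the dataset and file names, removing that lookup. The sketch below is schematic, with assumed catalogue contents and an assumed path convention; it is not ATLAS's actual design.

```python
import hashlib

# Schematic contrast of 3-layer vs 2-layer file lookup. The catalogue contents
# and the deterministic path convention are illustrative assumptions only.

# 3 layers: dataset catalogue -> file catalogue (LFC) -> storage
lfc = {"data10.MinBias.ESD._0001.pool.root":
       "srm://se.example.ac.uk/atlas/data10/0f/3a/data10.MinBias.ESD._0001.pool.root"}

def resolve_with_lfc(lfn):
    return lfc[lfn]                      # one extra catalogue lookup per file

# 2 layers: build the path deterministically from the names, no per-file catalogue
def resolve_deterministic(dataset, lfn, se="srm://se.example.ac.uk/atlas"):
    h = hashlib.md5(f"{dataset}:{lfn}".encode()).hexdigest()
    return f"{se}/{dataset}/{h[0:2]}/{h[2:4]}/{lfn}"

lfn = "data10.MinBias.ESD._0001.pool.root"
print(resolve_with_lfc(lfn))
print(resolve_deterministic("data10.MinBias.ESD", lfn))
```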

  20. Conclusion
  • We are now finally dealing with real data
  • We are still learning
  • We must all work hard to make things work
  • Many thanks for everyone’s effort so far
    • But the work continues for 20 years!
  • The UK has been heavily used and involved in first physics studies
    • This is partly because of data location
    • But also because we are a reliable cloud
  • We can all celebrate this at the dinner tonight
    • But please keep an eye on your sites on your smart phones!
