Extending the ATLAS PanDA Workload Management System For New Big Data Applications
XXIV International Symposium on Nuclear Electronics and Computing
Alexei Klimentov, Brookhaven National Laboratory
September 12, 2013, Varna

Main topics: Introduction, the Large Hadron Collider at CERN, the ATLAS experiment
The Large Hadron Collider (LHC), one of the largest truly global scientific projects ever built, marks an exciting turning point in particle physics.
Exploration of a new energy frontier
Proton-proton and Heavy Ion collisions
at centre-of-mass energies (E_CM) up to 14 TeV
27 km circumference
→ collisions every 50 ns
= 20 MHz crossing rate
≈ 35 pp interactions per crossing – pile-up
→ ≈ 10⁹ pp interactions per second!
≈ 1600 charged particles produced
Candidate Higgs decay to four electrons recorded by ATLAS in 2012.
We use experiments to inquire about what “reality” (nature) does
We intend to fill this gap
ATLAS Physics Goals
174 Universities and Labs
From 38 countries
More than 1200 students
Big Data Workshop. Knoxville, TN
150 million sensors deliver data …
Higgs Selection using the Trigger
Not all detector information is available at this stage
Improved ability to reject background
High-quality reconstruction algorithms, using information from all detectors
Total of 15,000 cores in 1,500 machines
Reduce data volume in stages
Select ONLY ‘interesting’ events
Initial data rate (50 ns bunch spacing): 40,000,000 events/s
Selected and stored
We really do throw away 99.9999% of LHC data before writing it to persistent storage
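The staged reduction above comes down to simple arithmetic. A minimal sketch: only the 40,000,000 events/s input rate is from the slide; the per-level output rates below are assumed round numbers used purely for illustration.

```python
# Staged trigger selection: each level keeps a shrinking fraction of events.
# Only the 40,000,000 events/s input rate is from the slide; the per-level
# output rates are assumed round numbers for illustration.
rates = [
    ("collisions",   40_000_000),  # events/s delivered by the LHC (slide)
    ("level-1",          75_000),  # hardware trigger output (assumed)
    ("level-2",           3_000),  # software trigger output (assumed)
    ("event filter",        400),  # events/s written to storage (assumed)
]

# Reduction factor achieved by each successive stage.
for (prev_name, prev_rate), (name, rate) in zip(rates, rates[1:]):
    print(f"{prev_name} -> {name}: x{prev_rate / rate:,.0f} reduction")

# Overall selectivity of the whole chain.
overall = rates[0][1] / rates[-1][1]
print(f"overall: 1 event kept in {overall:,.0f}")
```

Multiplying the per-stage factors reproduces the overall rejection; with these assumed rates, one event in 100,000 survives to storage.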
Big Data in High Energy Physics
Like looking for a single drop of water from the Geneva Jet d’Eau over 2+ days
The ATLAS Data Challenge
Starting from this event…
We are looking for this “signature”
Selectivity: 1 in 10¹³
Like looking for one person in a thousand times the world’s population
Or for a needle in 20 million haystacks!
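The analogies above can be sanity-checked with trivial arithmetic; a sketch, assuming a world population of roughly 7×10⁹ (only the 1-in-10¹³ selectivity and the 20 million haystacks come from the slides):

```python
# Sanity-checking the selectivity analogies.
selectivity = 1e13   # one interesting event in 10^13 (slide)
world_pop = 7e9      # assumed 2013 world population
haystacks = 20e6     # from the slide's needle-in-haystacks analogy

print(round(selectivity / world_pop))  # 1429 -> roughly "a thousand world populations"
print(selectivity / haystacks)         # 500000.0 -> implied straws per haystack
```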
Letter of Benjamin Franklin to Lord Kames, April 11, 1767. Franklin warned the British official what would happen if the English kept trying to control the colonists by imposing taxes, like the Stamp Act: they would revolt.
The political, scientific and literary papers of Franklin comprise approx. 40 volumes containing approx. 100,000 documents.
George W. Bush Presidential Library:
200 million e-mails
4 million photographs
Library of Congress
Business e-mails sent per year
to Facebook each year.
Google search index
ATLAS uses the grid computing paradigm to organize distributed resources
A few years ago ATLAS started a Cloud Computing R&D project to explore virtualization and clouds
Now we are evaluating how high-performance computers and supercomputers can be used for data processing and analysis
ATLAS data volume on the Grid sites (multiple replicas)
End of Run 1
Data Distribution patterns are discipline dependent.
(ATLAS RAW data volume ~3 PB/year, ATLAS Data Volume on Grid sites 131.5 PBytes)
Now far fewer copies, due to demand-driven temporary replication – but much more reliance on wide-area networks
Data Management System (DQ2)
Number of concurrently running PanDA jobs (daily average). Aug 2012- Aug 2013
Number of completed PanDA jobs (daily average; max 1.7M jobs/day), Aug 2012 – Aug 2013
Data Transfer Volume in TB (weekly average). Aug 2012- Aug 2013
CERN to BNL data transfer time: 3.7 h on average to export data from CERN (6 PBytes transferred in total)
PanDA was able to cope with the increasing LHC luminosity and ATLAS data-taking rate
Adapted to the evolution of the ATLAS computing model
Two leading experiments in HEP and astroparticle physics (CMS and AMS) have chosen PanDA as their workload management system for data processing and analysis.
ALICE is interested in evaluating PanDA for Grid MC production and Leadership Computing Facilities.
PanDA was chosen as a core component of the Common Analysis Framework by the CERN-IT/ATLAS/CMS project
PanDA was cited in the document titled “Fact sheet: Big Data across the Federal Government” prepared by the Executive Office of the President of the United States as an example of successful technology already in place at the time of the “Big Data Research and Development Initiative” announcement
I. Bird (WLCG Project Leader), presentation at the HPC and supercomputing workshop for Future Science Applications (BNL, June 2013)
Spikes in demand for computational resources
Can significantly exceed available ATLAS Grid resources
Lack of resources slows down pace of discovery
Failed and Finished Jobs
Reached a throughput of 15K jobs per day
Most job failures occurred during the start-up and scale-up phase
Slide from Ken Read
ATLAS PanDA Coming to the Oak Ridge Leadership Computing Facility
PanDA & multicore jobs scenario
PanDA on the Oak Ridge Leadership Computing Facility
In the static data distribution scenario there was no need to consider the network as a resource in the WMS
…and dramatic changes in computing models
We would like to benefit from new networking capabilities and to integrate networking services with PanDA. We are starting to consider the network as a resource, in the same way as CPUs and data storage.
In BigPanDA we will use information on how much bandwidth is available and can be reserved before data movement is initiated.
In the task definition, the user will specify the data volume to be transferred and the deadline by which the task should be completed. The calculation of (i) how much bandwidth to reserve, (ii) when to reserve it, and (iii) along which path to reserve, will be carried out by Virtual Network On Demand (VNOD).
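The reservation arithmetic implied by such a task definition can be sketched as follows. This is a hypothetical illustration, not VNOD's actual interface: the function names, the TB/Gb/s units, and the 0.8 link-efficiency factor are all assumptions.

```python
# Hypothetical sketch: from a task's data volume and deadline, derive the
# minimum sustained bandwidth to reserve. Names, units, and the assumed
# 0.8 link-efficiency padding are illustrative, not the real VNOD API.

def required_bandwidth_gbps(volume_tb: float, hours_to_deadline: float,
                            link_efficiency: float = 0.8) -> float:
    """Minimum sustained bandwidth (Gb/s) to move volume_tb terabytes
    before the deadline, padded by an assumed link efficiency."""
    bits = volume_tb * 8e12                # TB -> bits
    seconds = hours_to_deadline * 3600.0   # hours -> seconds
    return bits / seconds / link_efficiency / 1e9

def can_meet_deadline(volume_tb: float, hours_to_deadline: float,
                      available_gbps: float) -> bool:
    """Decide whether an available path can satisfy the task's deadline."""
    return required_bandwidth_gbps(volume_tb, hours_to_deadline) <= available_gbps

# Example task: move 100 TB within 24 hours over a 100 Gb/s path.
bw = required_bandwidth_gbps(100, 24)
print(f"reserve ~{bw:.1f} Gb/s")        # ~11.6 Gb/s sustained
print(can_meet_deadline(100, 24, 100))  # True
```

The decision of when and along which path to reserve would layer path selection and scheduling on top of this basic volume/deadline calculation.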
Many thanks to J. Boyd, K. De, B. Kersevan, T. Maeno, R. Mount, P. Nilsson, D. Oleynik, S. Panitkin, A. Petrosyan, A. Prescott, K. Read and T. Wenaus for slides and materials used in this talk