The Computing Infrastructure for the LHC: The ATLAS Point of View. Simone Campana, CERN IT/GS
ATLAS Event Data Model • RAW (1.6 MB/ev) Raw Data: output of the Event Filter Farm (HLT) in byte-stream format • ESD (1 MB/ev) Event Summary Data: output of the event reconstruction (tracks, hits, calorimeter cells and clusters, combined reconstruction objects, etc.). Used for calibration, alignment, refitting … • AOD (150 KB/ev) Analysis Object Data: reduced representation of the events, suitable for analysis. Reconstructed “physics objects” (electrons, muons, jets …) • DPD (20 KB/ev) Derived Physics Data: reduced information for ROOT-specific analysis.
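The per-event sizes above translate directly into storage volumes. The following is a minimal sketch of that arithmetic; the function name, the decimal unit convention (1 MB = 1000 KB) and the event count are assumptions made for illustration and do not come from the slides.

```python
# Per-event sizes quoted on the slide, converted to MB
# (assumption: decimal units, 1 MB = 1000 KB).
EVENT_SIZE_MB = {"RAW": 1.6, "ESD": 1.0, "AOD": 0.15, "DPD": 0.02}


def volume_tb(n_events: int, fmt: str) -> float:
    """Return the storage volume in TB for n_events stored in format fmt."""
    return n_events * EVENT_SIZE_MB[fmt] / 1e6  # MB -> TB


# Hypothetical sample of one million events, for illustration only.
N_EVENTS = 1_000_000
for fmt in EVENT_SIZE_MB:
    print(f"{fmt}: {volume_tb(N_EVENTS, fmt):.2f} TB")
```

The ratio between formats is what matters here: under these numbers a DPD copy is 80 times smaller than the RAW data it derives from.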
ATLAS Tiers Organization • All Tier-1s have a predefined (software) channel with CERN and with every other Tier-1 • Tier-2s are associated with one Tier-1 and form the cloud • Tier-2s have a predefined channel with the parent Tier-1 only • “Tier Cloud Model” unit: 1 T1 + n T2/T3 [Figure: tier cloud diagram. Site labels: NG, PIC, RAL, CNAF, SARA, CERN, ASGC, LYON, FZK, TRIUMF, BNL; further labels: T3, FR Cloud, TWT2, LYON, CCPM, GRIF, Tokyo, LPC, Melbourne, Romania, BNL, Beijing, Clermont, NET2, LAPP, NW, GL, SW, BNL Cloud, SLAC]
Detector Data Distribution (original processing, T0 -> T1) • RAW data -> mass storage at CERN • RAW data -> Tier-1 centers • complete dataset distributed among T1s • ESD -> Tier-1 centers • 2 copies of ESD distributed worldwide • AOD -> each Tier-1 center • 1 full set per T1 • T2: 100% of AOD, small fraction of ESD and RAW [Figure: site labels NG, PIC, RAL, SARA, CNAF, CERN, ASGC, LYON, FZK, TRIUMF, BNL]
Reprocessed Data Distribution (reprocessing, T1 -> T1) • Each T1 reconstructs its own RAW • Produces new ESD, AOD • Ships: • ESD to associated T1 • AOD to all other T1s [Figure: site labels NG, PIC, RAL, SARA, CNAF, ASGC, LYON, FZK, TRIUMF, BNL]
ATLAS Tier-2 activities • Monte Carlo production (ESD, AOD) • Ships RAW, ESD, AOD to associated T1 • Physics Analysis • Gets (ESD) AOD from associated T1 [Figure: site labels PIC, NG, RAL, CNAF, SARA, ASGC, LYON, FZK, TRIUMF, BNL; GRIF, Tokyo, Beijing, Romania, Clermont]
ATLAS and Grid Middleware • ATLAS resources are distributed across different Grid infrastructures • EGEE, OSG, NorduGrid • Most of the Grid services are shared across the different Grids • SRM interface for Storage Elements • With different backend storage implementations • LCG File Catalog • At all ATLAS T1s, contains information on file replicas in the cloud • File Transfer Service at every T1 • Baseline transfer service to import data at any site of the cloud • VOMS • To administer VO membership • CondorG • For job dispatching • The ATLAS computing framework guarantees Grid interoperability
The DDM in a nutshell The Distributed Data Management … • … enforces the concept of dataset • Logical collection of files • Dataset contents and location stored in central catalogs • File information stored on local File Catalogs (LFC) at T1s • … based on a subscription model • Datasets are subscribed to sites • A series of services enforce the subscription • Lookup data location in LFC • Trigger data movement via FTS • Validate data transfer
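The subscription model described above (look up, move, validate) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ATLAS DDM code: the `Catalog` class, the function names and the in-memory "transfer" are all hypothetical stand-ins for the central catalogs, the LFC and FTS.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy stand-in for the dataset catalogs and the per-site file catalogs."""
    contents: dict[str, set[str]]  # dataset name -> logical file names
    replicas: dict[str, set[str]] = field(default_factory=dict)  # site -> files held


def lookup_missing(cat: Catalog, dataset: str, site: str) -> set[str]:
    """Step 1: find which files of the dataset the site does not hold yet."""
    return cat.contents[dataset] - cat.replicas.get(site, set())


def transfer(cat: Catalog, files: set[str], site: str) -> None:
    """Step 2: pretend to move files; a real system would call a transfer service."""
    cat.replicas.setdefault(site, set()).update(files)


def enforce_subscription(cat: Catalog, dataset: str, site: str) -> bool:
    """Enforce one subscription and return True if the dataset is complete at site."""
    missing = lookup_missing(cat, dataset, site)
    if missing:
        transfer(cat, missing, site)
    # Step 3: validate by checking the catalog again.
    return not lookup_missing(cat, dataset, site)


# Hypothetical dataset and site names, for illustration only.
cat = Catalog(contents={"ds1": {"f1", "f2", "f3"}}, replicas={"LYON": {"f1"}})
print(enforce_subscription(cat, "ds1", "LYON"))
```

The point of the design is that a subscription states the desired end state (dataset complete at site), so the same enforcement step can be repeated safely after a failure.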
Testing Data Distribution: CCRC08 • Week 1: Data Distribution Functional Test • to make sure all files get where we want them to go • between Tier-0 and Tier-1’s, for disk and tape • Week 2: Tier-1 to Tier-1 tests • similar rates as between Tier-0 and Tier-1 • more difficult to control and monitor centrally • Week 3: Throughput test • try to maximize throughput but still following the model • Tier-0 to Tier-1 and Tier-1 to Tier-2 • Week 4: Final, all tests together • also artificial extra load from simulation production
Week-4: Full Exercise • The aim is to test the full transfer matrix • Emulate the full load T0->T1 + T1->T1 + T1->T2 • Considering 14h of data taking • Considering full-steam reprocessing at 200Hz • On top of this, add the burden of Monte Carlo production • Attempt to run as many jobs as one can • This also means transfers T1->T2 and T2->T1 • Four-day exercise divided into two phases • First two days: functionality (lower rate) • Last two days: throughput (full steam)
Transfer ramp-up • Test of backlog recovery • First data generated over 12 hours and subscribed in bulk • 12h backlog recovered in 90 minutes! [Plot: T0->T1s throughput in MB/s]
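The backlog figure implies a lower bound on how much faster than the data generation rate the transfers must have run. The sketch below works that out; the function name is hypothetical, and because the slide does not say whether new data kept arriving during the 90 minutes, both cases are shown.

```python
def catch_up_factor(backlog_h: float, recovery_h: float, still_producing: bool) -> float:
    """Transfer rate needed to clear the backlog, as a multiple of the production rate."""
    # If data keep arriving during recovery, that extra data must be moved as well.
    produced_h = backlog_h + (recovery_h if still_producing else 0.0)
    return produced_h / recovery_h


print(catch_up_factor(12, 1.5, still_producing=False))  # 8.0
print(catch_up_factor(12, 1.5, still_producing=True))   # 9.0
```

Under either assumption, clearing 12 hours of data in 90 minutes means moving data at roughly eight to nine times the rate at which it was generated.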
Week-4: T0->T1s data distribution • Suspect datasets: datasets are complete (OK) but show double registration • Incomplete datasets: effect of the power cut at CERN on Friday morning
Week-4: T1-T1 transfer matrix • YELLOW boxes: effect of the power cut • DARK GREEN boxes: double registration problem • Compared with week 2 (3 problematic sites): very good improvement
Week-4: T1->T2s transfers • SIGNET: ATLAS DDM configuration issue (LFC vs RLS) • CSTCDIE: joined very late. Prototype. • Many T2s oversubscribed (should get 1/3 of AOD)
Throughputs • T0->T1 transfers [Plot: throughput in MB/s against expected rate] • Problem at load generator on 27th • Power cut on 30th • T1->T2 transfers [Plot: throughput in MB/s] show a time structure • Datasets subscribed: • upon completion at T1 • every 4 hours
Week-4: Concurrent Production [Plots: number of running jobs; number of jobs per day]
Week-4: metrics • We said: • T0->T1: sites should demonstrate that they can import 90% of the subscribed datasets (complete datasets) within 6 hours from the end of the exercise • T1->T2: a complete copy of the AODs at the T1 should be replicated among the T2s within 6 hours from the end of the exercise • T1-T1 functional challenge: sites should demonstrate that they can import 90% of the subscribed datasets (complete datasets) within 6 hours from the end of the exercise • T1-T1 throughput challenge: sites should demonstrate that they can sustain the rate during nominal-rate reprocessing, i.e. F*200Hz, where F is the MoU share of the T1 • Every site (cloud) met the metric • Despite the power cut • Despite the “double registration problem” • Despite competition from production activities
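The two kinds of metric above (a 90% completion threshold and an F*200Hz rate) can be written down as small checks. This is a sketch under assumptions: the function names and the example share are hypothetical, the per-event sizes are the ones from the Event Data Model slide, and the conversion to MB/s is illustrative only, since the slide does not state which formats were counted toward the T1-T1 rate.

```python
# Per-event sizes in MB from the Event Data Model slide.
EVENT_SIZE_MB = {"RAW": 1.6, "ESD": 1.0, "AOD": 0.15}
NOMINAL_RATE_HZ = 200  # nominal rate quoted in the metric


def t1_event_rate_hz(share: float) -> float:
    """Event rate a T1 must sustain: F * 200 Hz, with F its MoU share."""
    return share * NOMINAL_RATE_HZ


def t1_throughput_mb_s(share: float, fmt: str) -> float:
    """Illustrative data rate in MB/s if every event is shipped in format fmt."""
    return t1_event_rate_hz(share) * EVENT_SIZE_MB[fmt]


def meets_import_metric(complete: int, subscribed: int, threshold: float = 0.9) -> bool:
    """True if the fraction of complete datasets reaches the threshold."""
    return subscribed > 0 and complete / subscribed >= threshold


# Hypothetical 10% MoU share, for illustration only.
print(f"{t1_event_rate_hz(0.10):.1f} Hz")
print(f"{t1_throughput_mb_s(0.10, 'ESD'):.1f} MB/s")
```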
Disk Space (month) • ATLAS “moved” 1.4 PB of data in May 2008 • 1 PB deleted in EGEE+NDGF in << 1 day • Possibly another 250 TB deleted in OSG • Deletion agent at work: uses SRM+LFC bulk methods • Deletion rate is more than good (but those were big files)
Lessons learned from CCRC08 • The Data Distribution framework seems to be in good shape and ready for data taking • A few things need attention: • FTS servers at T1s need global tuning of parameters • Some bugs found in ATLAS DDM services • Now fixed • In at least 3 cases, a network problem or inefficiency has been discovered • Monitoring …
A few words about the FDR • FDR = Full Dress Rehearsal • Test the full chain, from the HLT to the analysis at T2s • Same set of Monte Carlo data (approx. 8 TB) in byte-stream format, injected every day into the T0 machinery • Data (RAW and reprocessed) distributed and handled as real data • FDR2 data exports (June 2008) • Much less challenging than CCRC08 in terms of distributed computing • 6 hours of data per day to be distributed in 24h • Three days of RAW data have been distributed in less than 4 hours • All datasets (RAW and derived) complete at every T1 and T2 (one exception for T2)
Data Export after CCRC08 and FDR • Data Distribution functional test: • To test data transfers: • Tier-0 to all Tier-1s, tape and disk (RAW, ESD, AOD) • all Tier-1s to all other Tier-1s (AOD, DPD) • each Tier-1 to all Tier-2s in the same cloud (AOD, DPD) • muon calibration streams from Tier-0 to some special Tier-2s • Completely automated: • at 5% of nominal rate, fake generated data from T0 • starts every Monday at midday, stops the following Sunday at midnight • central deletion of test data everywhere • reports weekly statistics • Data taking • Mostly cosmics … • RAW data exported to T1s (for custodial storage) • ESD exported to 2 T1s following the Computing Model • Some data kept permanently on disk at CERN
Activity after CCRC08 • Most inefficiencies are due to scheduled downtimes
Simulation Production • Bursty activity, mainly depending on software readiness • Main samples: fdr2, 10TeV, 900 GeV and validations • Runs in Tier-2’s but also in Tier-1’s • no competition yet with analysis (T2) and re-processing (T1) • Average of 10k simultaneous jobs, peaks of 25k jobs • All production now submitted through Panda system
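Monte Carlo Production [Diagram: ProdDB, Bamboo, Panda server, scheduler, and worker nodes at site A and site B. Labels: jobs pass from ProdDB through Bamboo to the server; the scheduler sends pilots to the sites via gLite and Condor-G; pilots pull jobs from the server over https and run them]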
Panda in a nutshell • Job definitions are hosted in the Production Database • The agent “Bamboo” polls jobs from ProdDB and feeds the Panda server • The Panda server manages all job information centrally • Priority control • Resource allocation • Job scheduling • A job scheduler dispatches pilot jobs to sites • Using various mechanisms: local batch system commands, gLite WMS, CondorG • Pilot jobs are pre-scheduled to Grid sites • Pilots pull “real jobs” from the Panda server as soon as suitable CPUs become available • Output data are aggregated at T1s using DDM
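The pull model described above, where a central server holds prioritized jobs and pilots fetch them once a CPU is free, can be sketched as follows. This is a toy model, not the Panda implementation: the class, its methods, the job names and the optional per-job site restriction are assumptions made for illustration.

```python
import heapq
import itertools


class ToyPandaServer:
    """Toy central job queue: higher priority is served first, ties in FIFO order."""

    def __init__(self):
        self._queue = []
        self._counter = itertools.count()  # unique tie-breaker, preserves FIFO order

    def submit(self, job_name, priority, site=None):
        """Register a job; site=None means it may run anywhere (an assumption)."""
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._queue, (-priority, next(self._counter), job_name, site))

    def pull(self, pilot_site):
        """Called by a pilot once it holds a CPU: return the best runnable job or None."""
        skipped = []
        job = None
        while self._queue:
            entry = heapq.heappop(self._queue)
            if entry[3] is None or entry[3] == pilot_site:
                job = entry[2]
                break
            skipped.append(entry)  # not runnable at this site, keep for later
        for entry in skipped:
            heapq.heappush(self._queue, entry)
        return job


# Hypothetical jobs and sites, for illustration only.
server = ToyPandaServer()
server.submit("simul_A", priority=1)
server.submit("reco_B", priority=5, site="LYON")
server.submit("simul_C", priority=3)
print(server.pull("BNL"))
```

Because the job is bound to a CPU only when a pilot asks for it, a pilot that lands on a broken worker node wastes the pilot, not the real job.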
Simulation Production [Plots: running jobs, monthly statistics; errors; number of jobs per day]
Simulation Production Functional Test • submits one real MC task as a test to each cloud every Monday • 5000 events at 25 events/job, i.e. 200 jobs of ~6 hours each • jobs should run in each of the Tier-2s (and the Tier-1) in the cloud • low priority so as not to interfere with real production • task aborted on Thursday • kills remaining jobs and removes all output • statistics generated: efficiency, brokering, problem sites
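The size of the weekly test task follows from the numbers above. A minimal sketch of that arithmetic, with hypothetical function names and treating the "~6 hours" figure as exact for the sake of the estimate:

```python
def jobs_in_task(total_events: int, events_per_job: int) -> int:
    """Number of jobs needed to cover the task, rounding up for a partial last job."""
    return -(-total_events // events_per_job)  # integer ceiling division


def task_job_hours(total_events: int, events_per_job: int, hours_per_job: float) -> float:
    """Total job-hours of the task across all its jobs."""
    return jobs_in_task(total_events, events_per_job) * hours_per_job


print(jobs_in_task(5000, 25))       # 200
print(task_job_hours(5000, 25, 6))  # 1200
```

So each cloud receives roughly 1200 job-hours of low-priority test work per week, assuming all 200 jobs run to completion before the Thursday abort.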
Reprocessing • Reprocessing is “just” a special case of a production system job • Handled by Panda • Runs at T1s only (first-order approximation) • However… • Needs to prestage files (RAW data) from tape at T1s • Needs to access the detector conditions data on Oracle racks at T1s • Current issues: • pre-staging not quite working yet • Software exists, being tested • Every T1 has a different storage setup, performance, etc. • conditions database access not quite working yet • each job opens several connections to the database at the beginning of the job • Too many concurrent jobs overload the database. Being investigated.
Analysis • The ATLAS analysis model is “jobs go to data” • Analysis mostly runs on DPDs and AODs • Initially, heavy access to ESD and possibly RAW • Currently, 2 frameworks for analysis: Ganga and pAthena • Both fully integrated with ATLAS DDM for data co-location • Will possibly be merged into a single tool • Already a single support team
Ganga • Client-based analysis framework • Central core component • Multiple plug-ins to take advantage of various job submission systems • gLite WMS • CondorG • Local batch system (LSF, PBS) • Analysis functional tests • Multi-VO project
pAthena • Server-based analysis framework • Full usage of the Panda infrastructure • Very advanced monitoring • Offers job prioritization and user shares [Figures: monitoring per user; worldwide pAthena activity (last month)]
User Storage Space • ATLAS uses the srmv2 interface everywhere now • Offers the possibility to partition the space (space tokens) depending on the use case • For central activities • DATADISK and DATATAPE for real data • MCDISK, MCTAPE and PRODDISK for Simulation Production • For Group analysis (GROUPDISK) • Ideally, quota management per group • In reality, only global quota, little possibility to configure group based ACLs. Need policing. • User analysis • USERDISK • scratch space for job output, cannot guarantee lifetime • LOCALGROUPDISK • not ATLAS pledged resources, “home” space for users • Same limitation as for GROUPDISK
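The partitioning by use case listed above amounts to a lookup table from space token to activity. A minimal sketch follows; the token names are the ones on the slide, while the dictionary layout, the use-case labels and the helper function are hypothetical.

```python
# Space tokens from the slide, mapped to the activity they serve.
SPACE_TOKENS = {
    "DATADISK": "real data",
    "DATATAPE": "real data",
    "MCDISK": "simulation production",
    "MCTAPE": "simulation production",
    "PRODDISK": "simulation production",
    "GROUPDISK": "group analysis",
    "USERDISK": "user analysis",
    "LOCALGROUPDISK": "user analysis",
}

# Per the slide, USERDISK is scratch space with no guaranteed lifetime.
NO_LIFETIME_GUARANTEE = {"USERDISK"}


def tokens_for(use_case: str) -> list[str]:
    """Return the space tokens serving a use case, in alphabetical order."""
    return sorted(t for t, u in SPACE_TOKENS.items() if u == use_case)


print(tokens_for("simulation production"))
```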
We started exporting … and we saw issues. • Effect of concurrent data access from centralized transfers and user activity (overload of disk server) [Plots: data export throughput in MB/s; number of errors]
Conclusions • Computing for the LHC experiments is extremely challenging • Very demanding use case • The system is complex and relies on many external components • Centralized Data Distribution works reliably • Tested in many challenges and in real life • The Monte Carlo Production framework is also reliable • But this is not true for data reprocessing • Database access and data prestaging need attention • User data analysis represents the real challenge now • Does not follow a particular pattern (non-organized by definition) • Not always possible to protect production from users, or users from other users • Never “tested” at the real scale • The EGEE Grid offers the necessary baseline services and the infrastructure for ATLAS data taking • Improvements in the area of storage are foreseen in the near future, based on the experiments’ input and lessons learned.