
Facility Status and Resource Requirements



Presentation Transcript


  1. Facility Status and Resource Requirements Michael Ernst, BNL US ATLAS Facilities Workshop at OSG All-Hands Meeting Harvard Medical School March 7, 2011

  2. Outline • Status of computing in ATLAS • Overview and developments since last Facility Meeting • Tier 1/2/3 facilities overview • Job completion metrics for production & analysis • Challenges for higher luminosity running • Resource Projections • Summary

  3. U.S. ATLAS Physics Support & Computing
  2.1, 2.9 Management (Wenaus/Willocq)
  2.2 Software (Luehring)
  2.2.1 Coordination (Luehring)
  2.2.2 Core Services (Calafiura)
  2.2.3 Data Management (Malon)
  2.2.4 Distributed Software (Wenaus)
  2.2.5 Application Software (Neubauer)
  2.2.6 Infrastructure Support (Undrus)
  2.2.7 Analysis Support (retired; redundant)
  2.2.8 Multicore Processing (Calafiura)
  2.3 Facilities and Distributed Computing (Ernst)
  2.3.1 Tier 1 Facilities (Ernst)
  2.3.2 Tier 2 Facilities (Gardner)
  2.3.3 Wide Area Network (McKee)
  2.3.4 Grid Tools and Services (Gardner)
  2.3.5 Grid Production (De)
  2.3.6 Facility Integration (Gardner)
  2.3.7 Tier 3 Coordination (Yoshida/Benjamin)
  2.4 Analysis Support (Cochran/Yoshida)
  2.4.1 Physics/Performance Forums (Black)
  2.4.2 Analysis Tools (Cranmer)
  2.4.3 Analysis Support Centers (Ma)
  2.4.4 Documentation (Luehring)

  4. ATLAS Computing Status • A ‘tremendous’ first year of data taking: in machine & detector performance, data volumes, processing loads, analysis activity, and physics output • Computing operated at levels far beyond STEP09, which was considered the nominal required performance • All Tier 1s delivered, and Tier 2s were prominent and crucial • Computing delivered well as an enabler for physics analysis • Processing completed and validated on schedule for conferences • Low latency from data taking to physics output (e.g. ICHEP) • The U.S. contributed reliably and made innovations toward an improved computing model • The U.S. Tier 1 and Tier 2s are among the most successful in ATLAS (e.g. in analysis performance) • PanDA distributed production/analysis performed and scaled well • Provided critical new tools for coping with data volumes

  5. Accomplishments – Facilities and Distributed Computing • Most successful Tier-1 in ATLAS • Availability, data and CPU delivered, production performance, analysis performance • When ATLAS needs it done now, they send it to the U.S. Tier-1 • U.S. Tier-2s are also the best in the ATLAS Tier-2 complex • All U.S. Tier 2s and the Tier 1 are in the top 20 (of ~75) sites that do 75% of ATLAS analysis work • U.S.-led dedicated ATLAS-wide Tier-3 support in 2010 • Tier-3s are a distinct but integral component of ATLAS computing, focused on end-user analysis • In the U.S. they are tightly integrated with the Facility (Integration) Program • Doug Benjamin is the ATLAS Tier-3 Technical Coordinator • Thanks to OSG, a crucial part of the success • We are closely involved in planning OSG’s next 5 years • This was only possible due to the unique collaborative spirit and effort of many people in the Facilities with distributed leadership, coordinated under the Facilities Integration Program

  6. Tier 3s • DOE- and NSF-funded institutions received their ARRA funds last fall, and the majority are operational or close to it • Tier 3 coordinators Doug and Rik recently contacted all 44 US institutions for a Tier 3 status update and to offer help: 26 functional sites, 7 in the process of setting up, 1 just received hardware, 2 more waiting on hardware, 1 planning, 7 no news • Documentation, tools and procedures developed by the Tier 3 team: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasTier3g • Closely coupled to the Analysis Support program; asked Alden to help with Tier 3 priorities • Currently working on supporting PanDA for analysis at Tier 3 sites • NSF approval of the DYNES networking project is beneficial for Tier 3s; 13 US ATLAS participants • The U.S. is a leader in the integrated ATLAS Tier 3 effort, with technical leadership from the U.S. in key Tier 3-directed efforts: CVMFS as a ‘global’ filesystem for software distribution and (conditions) data access; federated storage across sites based on xrootd for transparent distributed data access • Tomorrow is fully devoted to Tier-3 planning and operations
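
As a rough illustration of the CVMFS-based software distribution mentioned above, the sketch below is the kind of minimal check a Tier 3 site might run to confirm that the ATLAS repository is mounted. The mount point /cvmfs/atlas.cern.ch is the standard ATLAS repository path; the sub-directory layout and the check itself are assumptions and not part of the AtlasTier3g tooling linked on the slide.

```python
"""Minimal Tier-3 sanity check: is the ATLAS CVMFS repository visible?

Illustrative sketch only; this is not part of the official AtlasTier3g tooling.
"""
import os
import sys

ATLAS_REPO = "/cvmfs/atlas.cern.ch"                    # standard ATLAS CVMFS mount point
SOFTWARE_DIR = os.path.join(ATLAS_REPO, "repo", "sw")  # assumed layout, may differ per release

def main():
    if not os.path.isdir(ATLAS_REPO):
        sys.exit("CVMFS repository %s not mounted -- check the cvmfs/autofs configuration" % ATLAS_REPO)
    print("CVMFS repository mounted: %s" % ATLAS_REPO)
    if os.path.isdir(SOFTWARE_DIR):
        entries = sorted(os.listdir(SOFTWARE_DIR))
        print("Top-level software entries: %s" % ", ".join(entries[:10]))
    else:
        print("Software directory %s not found (repository layout may differ)" % SOFTWARE_DIR)

if __name__ == "__main__":
    main()
```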

  7. Production Job Completion • Single-attempt success rate in the US is typically 96%
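
To put the single-attempt success rate in context, a short back-of-the-envelope calculation shows how automatic resubmission pushes the effective completion rate close to 100%, under the idealized assumption that attempts fail independently; the retry limit used here is illustrative, not an ATLAS setting.

```python
# Back-of-the-envelope: effective job completion rate with automatic retries,
# assuming each attempt succeeds independently with the quoted ~96% rate.
p_single = 0.96          # single-attempt success rate from the slide
max_attempts = 3         # illustrative retry limit, not an ATLAS/PanDA setting

p_fail_all = (1.0 - p_single) ** max_attempts
print("P(job never succeeds in %d attempts) = %.4f%%" % (max_attempts, 100 * p_fail_all))
print("Effective completion rate            = %.4f%%" % (100 * (1 - p_fail_all)))
# With p = 0.96 and 3 attempts, only ~0.0064% of jobs never succeed.
```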

  8. Analysis vs. Production (Johannes @ ADC Retreat)

  9. Distributed Analysis • Several open data management issues are associated with analysis

  10. Output Merging • At the ADC Retreat the point was raised (Wei): “USERDISK is used for analysis job outputs, including data, ntuples and .log.tgz files. The last two are typically small; the .log.tgz files normally range from 10 KB to 1 MB. I guess as users refine their analysis programs by repeatedly running them, they generate a large number of temporary outputs they will never look at. At SLAC, we have 4.64 million files in our LFC catalog, of which 3.00 million belong to USERDISK (and 670 TB vs. 88 TB).” • This is only one example of numerous data management issues • Organization and management of the USERDISK/GROUPDISK/LOCALGROUPDISK areas
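
One way to read the SLAC numbers above (interpreting them as 670 TB in the catalog overall versus 88 TB in USERDISK; the slide leaves the pairing implicit) is as an average-file-size comparison, which makes the case for output merging concrete:

```python
# Rough average-file-size comparison from the SLAC numbers quoted above.
# Interpretation assumed here: 670 TB total catalog volume vs. 88 TB in USERDISK;
# the slide does not state the pairing explicitly.
TB = 1e12  # decimal terabytes, good enough for a rough estimate

total_files, total_volume = 4.64e6, 670 * TB
user_files,  user_volume  = 3.00e6,  88 * TB

avg_user  = user_volume / user_files
avg_other = (total_volume - user_volume) / (total_files - user_files)

print("USERDISK:  %.0f%% of files, %.0f%% of volume, avg file ~%.0f MB"
      % (100 * user_files / total_files, 100 * user_volume / total_volume, avg_user / 1e6))
print("Elsewhere: avg file ~%.0f MB" % (avg_other / 1e6))
# Under this reading, USERDISK holds ~65% of the files but only ~13% of the bytes,
# with an average file size around 30 MB versus ~350 MB elsewhere.
```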

  11. Optimization

  12. Tier-1 CPU Usage in 2010 • Required: 226 kHS06, Pledged: 250 kHS06

  13. Tier-2 CPU Usage in 2010 • Required: 278 kHS06, Pledged: 281 kHS06

  14. LHC – Preliminary Luminosity Projections (S. Myers) • The 2011 run is challenging, as there are uncertainties that will have a significant impact on resource requirements: integrated luminosity (could vary by 3x) and trigger rate (200 -> 400 Hz) • There will be a run in 2012 • ATLAS Management is in the process of changing the Computing Model from massive data pre-placement to dynamic caching, which was developed in the US (more in Torre’s talk) • We need to be flexible and nimble in view of limited resources, e.g. by exploring mechanisms to temporarily increase them • (Chart annotations: “ultimate” and “reasonable” (?) luminosity scenarios)

  15. Challenges for Higher Luminosity Running • LHC/ATLAS in 2011: 200 days of running, 30%-40% duty cycle, 300-400 Hz trigger rate, higher energy (pileup) • Confident that the Facilities in the U.S. and the workflow management system will cope with the increased load • The principal challenge is fitting the computing into the available resources • In view of the pledges in place for 2011 and budget cuts, significant changes are needed • Move from full ESDs on disk to a rolling buffer (10% of ESDs available on disk at Tier 1s); rely on AODs and filtered ESDs • Expand the use of caching – extend PanDA’s dynamic caching to manage Tier 2 storage as well as Tier 1 (a generic sketch of the caching principle follows below) • Remove the hierarchy from the computing model • Break cloud boundaries and maximize Tier-2 resource usage (CPU and …) • Tier 2s as a storage resource • Based on 2010 experience we have good reason to believe that this provides a promising path forward without compromising physics
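
The rolling-buffer and dynamic-caching changes above amount to keeping only recently used data on disk and evicting the rest as space is needed. The sketch below is a generic least-recently-used eviction loop under that assumption; it is not PanDA's actual caching algorithm, and the dataset names, sizes and quota are made up for illustration.

```python
# Generic illustration of usage-driven caching with LRU eviction.
# NOT the PanDA dynamic-caching implementation, just the basic idea:
# keep the most recently accessed datasets on disk within a fixed quota.
from collections import OrderedDict

class DatasetCache:
    def __init__(self, quota_tb):
        self.quota = quota_tb           # disk space available for cached datasets (TB)
        self.used = 0.0
        self.datasets = OrderedDict()   # name -> size, ordered oldest-access first

    def access(self, name, size_tb):
        """Record an access; fetch the dataset if absent, evicting LRU entries as needed."""
        if name in self.datasets:
            self.datasets.move_to_end(name)       # mark as most recently used
            return
        while self.used + size_tb > self.quota and self.datasets:
            victim, victim_size = self.datasets.popitem(last=False)
            self.used -= victim_size
            print("evict %s (%.1f TB)" % (victim, victim_size))
        self.datasets[name] = size_tb
        self.used += size_tb
        print("cache %s (%.1f TB), %.1f/%.1f TB used" % (name, size_tb, self.used, self.quota))

# Made-up access pattern to show the behaviour.
cache = DatasetCache(quota_tb=10.0)
for ds, size in [("ESD.A", 4.0), ("AOD.B", 3.0), ("ESD.C", 5.0), ("AOD.B", 3.0)]:
    cache.access(ds, size)
```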

  16. New Data Replication Policy (J. Shank’s presentation to the CB on Mar 4)

  17. Computing Capacity Requirements • Revised ATLAS computing requirements, taking account of planning revisions (ESD reductions etc.), were first presented to the collaboration in late February • Based on a 400 Hz trigger rate, 200 running days, 30%-40% LHC efficiency • 2.1B events, 2.7 PB raw, 8.8 PB total derived data • 1 RAW copy on disk, limited derived copies • Rely on dynamic usage-driven replication to caches at Tier-1s and Tier-2s to meet global needs • The data placement practice of 2010 would yield a 27 PB total volume • MC is in addition: 2010 experience at the Tier-1 puts real:MC at 1:1 – 2:1; predicted for the Tier-2s in 2011: real/MC ~1.4 • Lots of changes expected • Success relies heavily on the reliability and performance of the sites contributing resources to the worldwide ATLAS Computing Facility • We were largely insulated in the US due to a ~complete data inventory
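
The headline numbers on this slide follow from straightforward arithmetic on the stated running assumptions. The sketch below reproduces them; the implied per-event raw size is derived here for illustration and is not quoted on the slide.

```python
# Reproducing the slide's headline numbers from its running assumptions.
trigger_rate_hz = 400        # Hz
running_days    = 200
lhc_efficiency  = 0.30       # lower end of the quoted 30%-40% range
seconds_per_day = 86400

events = trigger_rate_hz * running_days * seconds_per_day * lhc_efficiency
print("Recorded events: %.2fB" % (events / 1e9))          # ~2.1 billion, as on the slide

raw_volume_pb = 2.7                                        # from the slide
print("Implied raw event size: ~%.2f MB" % (raw_volume_pb * 1e15 / events / 1e6))
# The ~1.3 MB/event figure is a derived illustration, not an official number.
```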

  18. Tier-1 CPU

  19. Tier-1 Disk

  20. Tier-2 CPU • Huge increase in user activities vs. simulation: 1.2 : 1 in 2010, 4.4 : 1 in 2012 • All Tier-2 sites must be fully prepared to run analysis at >75% of their total capacity • This has a major impact on the performance of the storage system and the network (note that simulation stays ~constant)
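
A quick calculation connects the quoted analysis:simulation ratios to the ">75% of capacity" statement, under the slide's assumption that the simulation load stays roughly constant.

```python
# Analysis share of Tier-2 CPU implied by the quoted analysis:simulation ratios,
# assuming the simulation load stays roughly constant as stated on the slide.
for year, ratio in [(2010, 1.2), (2012, 4.4)]:
    share = ratio / (ratio + 1.0)
    print("%d: analysis:simulation = %.1f:1  ->  analysis share ~%.0f%% of capacity"
          % (year, ratio, 100 * share))
# 2010: ~55%; 2012: ~81%, consistent with sites needing to run analysis at >75% of capacity.
```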

  21. Tier-2 Disk

  22. US Capacity Projection (1/4)

  23. US Capacity Projection (2/4)

  24. US Capacity Projection (3/4)

  25. US Capacity Projection (4/4)

  26. Summary of Resource Estimates

  27. Capacity Requirements – Budget Implications • These capacity levels represent a substantial reduction from previously estimated and budgeted U.S. capacity • ATLAS was able to be more aggressive than anticipated, and PanDA-based dynamic caching has proven very effective • Savings are limited to the Tier 1 in 2011 and 2012; Tier 2s see a substantial ramp in disk space in 2012 to accommodate analysis • Detailed planning and budgets remain to be worked out, but in broad outline, with U.S. Tier 1 capacity reduced to just the new ATLAS requirements, computing funding needs roughly match the target based on low-guidance funding level • We have extrapolated the new ATLAS estimates to 2015 and find this holds true in the out years • But it is not without costs…

  28. Budget Requirements – Program Implications • Tier 2s under the low scenario… • Where do they stand under the different guidance levels? • Costs including ancillary equipment needs, replacement • At what guidance level are we able to add IL/NCSA? • Why is achieving the nominal (at least) scenario important? • At the Tier-1 restoration of some US-dedicated capacity • Risk reduction and more realistic planning/budgeting for mandatory replacements and upgrades • Exciting opportunity to introduce a major NSF Supercomputing Center as a third MWT2 site to the Tier 2 program, affiliated with a strong ATLAS group at Illinois • Would significantly strengthen U.S. ATLAS Facilities expertise base, and strengthen couplings to OSG, XD (next generation Teragrid)

  29. Activities essential to scaling & sustaining ATLAS computing (from Torre’s list) • Computing R&D activities and (collaborative, e.g. ATLAS/WLCG, OSG) plans: • Virtualization and cloud computing incl. CVMFS (active) • Multi/many-core computing (active, supplemental DOE support) • Campus grids and inter-campus bridging (active) • Bring to the campus what has worked so well over the wide area in OSG • ‘Intelligent’ cache-based distributed storage (active) • Efficient use of disk through greater reliance on the network, federated xrootd • Hierarchical storage incorporating SSDs (active) • Highly scalable ‘NoSQL’ databases (active) • Tools from the ‘cloud giants’: Cassandra, HBase/Hadoop, SimpleDB… • ‘Flatter’ point-to-point networking model (active) • Validation, diagnostics, monitoring of (especially) T2-T2 networking • GPU computing (active in ATLAS, not in the US) • Managing complexity in distributed computing (active) • Monitoring, diagnostics, error management, automation

  30. Cloud-enabled Elasticity • The home resource expands elastically: cloud providers “join” home/dedicated resources • Ability to add resources on demand • Virtual machines are deployed on demand, thereby establishing a proven environment that applications can run in • Dynamically create a PanDA production site • Use cloud resources as long as needed, turn them off when done • Scalable infrastructure • Addresses primarily the computational side • Simulated event production is an ideal candidate: computationally intensive with small I/O requirements • Current cloud service offerings lack adequate technical capabilities and attractive pricing for data-intensive processing tasks • May require “hybrids”, a combination of Grids and Clouds, for the next few years
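
The elasticity described above reduces to a control loop: watch the home queue, provision cloud workers when it backs up, and release them when it drains. The sketch below is a deliberately generic illustration of that loop; the thresholds, step size and simulated queue depths are assumptions, and no real PanDA or cloud-provider API is used.

```python
# Deliberately generic elasticity loop: grow into the cloud when the local queue
# backs up, shrink when it drains. All numbers are illustrative.
SCALE_UP_THRESHOLD = 1000    # queued jobs above which cloud workers are added
SCALE_DOWN_THRESHOLD = 100   # queued jobs below which cloud workers are released
WORKERS_PER_STEP = 50

def elastic_decision(queued_jobs, cloud_workers):
    """Return the change in cloud worker count for the current queue depth."""
    if queued_jobs > SCALE_UP_THRESHOLD:
        return WORKERS_PER_STEP                       # provision more cloud VMs
    if queued_jobs < SCALE_DOWN_THRESHOLD and cloud_workers > 0:
        return -min(WORKERS_PER_STEP, cloud_workers)  # release idle cloud VMs
    return 0

# Simulated queue depths sampled over a day (placeholder numbers).
cloud_workers = 0
for queued in [200, 1500, 2600, 1800, 900, 400, 80, 50]:
    delta = elastic_decision(queued, cloud_workers)
    cloud_workers += delta
    print("queued=%4d  change=%+4d  cloud workers=%d" % (queued, delta, cloud_workers))
```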

  31. Proposed Cloud Roadmap J. Hover

  32. From Local to Global Data Access Using existing solutions to make data globally accessible – a pragmatic approach • Use Xrootd (SLAC, CERN) to build a federated storage system based on autonomous storage systems at sites • Supports file copies & direct/sparse access across all sites and across heterogeneous storage systems • Work in collaboration w/ US CMS and OSG C. Waldman
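
To make the federation concrete from the user's side: a file in the federated namespace is opened through a single redirector URL rather than a site-specific path, and the redirector finds a site that actually holds the file. The example below is a minimal sketch using PyROOT's TFile.Open; the redirector hostname and file path are hypothetical placeholders.

```python
# Illustrative read through a federated xrootd namespace using PyROOT.
# The redirector hostname and file path are hypothetical placeholders;
# the point is that the user addresses the federation, not an individual site.
import ROOT

url = ("root://global-redirector.example.org//atlas/dq2/"
       "data10_7TeV/AOD/someDataset/AOD.example.pool.root")   # placeholder path

f = ROOT.TFile.Open(url)        # the redirector locates a site that holds the file
if f and not f.IsZombie():
    print("Opened %s, size %.1f MB" % (url, f.GetSize() / 1e6))
    f.Close()
else:
    print("Could not open the file via the federation")
```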

  33. … and improve the Performance by adding a Caching Layer • Reading a 1 GB ATLAS AOD file over HTTP • *Though IO is asynchronous, the performance is limited by the wide-area network latency
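
The latency limitation flagged on this slide can be made quantitative with a simple bound: with N outstanding requests of size s and a round-trip time t, throughput is at most roughly N·s/t. The numbers below are illustrative assumptions, not measurements from the slide.

```python
# Simple model of why WAN latency limits remote reads even with asynchronous IO:
# throughput <= (outstanding requests x request size) / round-trip time.
# All numbers are illustrative assumptions, not measurements from the slide.
rtt_s        = 0.100    # assumed wide-area round-trip time, 100 ms
request_mb   = 1.0      # assumed read-ahead block size per request
file_size_mb = 1000.0   # the 1 GB AOD file mentioned on the slide

for outstanding in (1, 4, 16):
    throughput = outstanding * request_mb / rtt_s            # MB/s
    print("%2d outstanding requests: ~%5.0f MB/s, ~%4.0f s to read the file"
          % (outstanding, throughput, file_size_mb / throughput))
# A nearby cache removes the WAN round trip from this bound, which is the
# motivation for adding the caching layer.
```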

  34. Network Evolution – “Flattening” today’s Hierarchical Model • Hierarchical (MONARC) Model (restricted) • Compromise: proposed implementation based on regional exchange points (LHCONE), organized per region (e.g. the US) • Ultimate: pull model based on unrestricted access across sites

  35. Summary • Computing was a big success as an enabler for physics, on its own metrics but also on the ultimate metric of timely physics output • The Facilities, the Tier-1 and the Tier-2 centers, have performed well in initial LHC data taking and analysis • Production and Analysis Operations Coordination provides seamless integration with ATLAS world-wide computing operations • A very effective Integration Program is in place to ensure readiness in view of the steep ramp-up of analysis operations with real data • Excellent contribution of Tier-2 sites to high-volume production (event simulation, reprocessing) and analysis • The U.S. ATLAS Computing Facilities need sufficient funding to stay on track to meet the ATLAS performance and capacity requirements • Tier-2 funding is uncertain beyond 2011; a proposal for a Cooperative Agreement with the NSF for 2012 – 2016 was submitted in Dec 2010 • The Tier-1 equipment target has been reduced to a minimum; extended equipment lifetime increases risk • 2011 will also mark the rise of the Tier 3s in the U.S. • U.S. ATLAS is actively pursuing the continuation of OSG • Overall, the Facilities in the U.S. have performed very well during the 2010 run, and I have no doubt that this will hold for 2011 and beyond!
