
Panda: Production and Distributed Analysis System

Panda is a production and distributed analysis system for ATLAS, serving as a one-stop shop for all ATLAS users in the US. It manages ATLAS production jobs, regional/group/user production jobs, and distributed analysis jobs.


Presentation Transcript


  1. Panda: US ATLAS Production and Distributed Analysis System. Xin Zhao, Brookhaven National Laboratory

  2. Outline • Panda background • Design and key features • Core components • Panda Server • DDM in Panda • JobScheduler and Pilots • Panda in production • Conclusion and more information

  3. What is Panda? • Panda – Production ANd Distributed Analysis system • ATLAS prodsys executor for OSG (diagram on next page) • "One-stop shopping" for all ATLAS users in the U.S. • Manages ATLAS production jobs, regional/group/user production jobs, and distributed analysis jobs

  4. [Diagram: the ATLAS Prodsys: supervisors read ProdDB and drive three executors, NG exe, OSG exe ("Panda") and LCG exe, which submit to CEs and use SE/RLS storage at the sites, with data handled by the DMS (DQ2)]

  5. What is Panda? (cont'd) • Written in Python; the development team is from BNL, UTA, UC, OU, ANL and LBL, led by Torre Wenaus (BNL) and Kaushik De (UTA) • Started August 2005 • Full redesign based on previous DC2/Rome production experience, to achieve the performance, scalability and ease of operation needed for ATLAS data taking (up to 100-200K jobs/day) • In production since Dec. 2005 • Ambitious development milestones met • Still in rapid development

  6. Panda design and key features • Architecture (diagram on next page) • Core components • Panda Server: job brokerage and dispatcher • Data Management System (DDM): ATLAS data management services running on top of Grid SEs • JobScheduler and Pilots: acquisition of Grid CE resources • Key features --- "pull model" • Data-driven workflow, tightly integrated with the ATLAS data management system (Don Quijote 2): pull data to the SE of the targeted site • Late binding of jobs to worker nodes via the "pilot job" scheme: pull the job payload to acquired CE worker nodes • Data movement (stage-in and stage-out) is decoupled from job processing

  7. Panda Architecture (diagram)

  8. Core components (I): Panda Server • Apache-based • Communication via HTTP/HTTPS • Multi-process • Global info kept in a memory-resident database • No dependence on special grid middleware [Diagram: API clients (job submitter, pilot, DQ2 callback, monitor) talk HTTP/HTTPS to Apache (mod_python) child processes, each running an embedded Python interpreter, backed by a MySQL DB holding job info]
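
Since every interaction with the server is plain HTTP(S), a client (submitter, pilot or monitor) is just an HTTPS caller with a grid certificate. Below is a minimal sketch of such a client, assuming a hypothetical server URL, method name and parameter names; it is not the real Panda client API.

```python
# Minimal sketch of a Panda-style HTTPS client. The URL, the "getJob" method
# and the parameter names are illustrative placeholders, not the real API.
import ssl
import urllib.parse
import urllib.request

PANDA_URL = "https://pandaserver.example.org:25443/server/panda"  # placeholder

def call_panda(method, params, cert=None, key=None):
    """POST a request to one of the server's methods and return the raw reply."""
    ctx = ssl.create_default_context()
    if cert and key:                      # grid proxy / user certificate
        ctx.load_cert_chain(cert, key)
    data = urllib.parse.urlencode(params).encode()
    req = urllib.request.Request(f"{PANDA_URL}/{method}", data=data)
    with urllib.request.urlopen(req, context=ctx, timeout=60) as resp:
        return resp.read().decode()

# e.g. a pilot asking the dispatcher for work (hypothetical method and fields):
# reply = call_panda("getJob", {"siteName": "BNL_ATLAS_1", "label": "managed"},
#                    cert="/tmp/x509up_u1234", key="/tmp/x509up_u1234")
```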

  9. Panda Server (cont'd) • MySQL cluster backend • A memory-resident MySQL database records current/recent job processing activity • Longer-term information is stored in a disk-resident DB • Brokerage • Decides where jobs (and their associated data) are sent, based on job characteristics, data locality, priorities, user/group role, and site resources and capacities matched to job needs • Dynamic site information gathering via OSG information systems still needs improvement • Manages the data/workflow pipeline • Asks DDM to 'dispatch' the dataset associated with a set of jobs to a site • Receives notification of transfer completion, then releases the jobs • Receives notification of job completion, then asks DDM to transfer outputs to their destination • The dispatcher hands released jobs to sites upon pilot requests
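
To make the brokerage idea concrete, here is a toy sketch: score each usable site by data locality and free capacity and pick the best. The Site/Job records and the weights are invented for illustration; the real brokerage weighs more inputs (priorities, user/group roles, detailed site capability matching).

```python
# Simplified brokerage sketch: prefer sites that already hold the input dataset
# and still have free slots. Names and weights are illustrative only.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set            # datasets already resident at the site's SE

@dataclass
class Job:
    input_dataset: str
    priority: int            # would order competing jobs; unused in this single-job sketch

def broker(job, sites):
    """Return the best site for a job, or None if nothing is usable."""
    def score(site):
        locality = 2 if job.input_dataset in site.datasets else 0   # data already on site
        capacity = min(site.free_slots, 10) / 10.0                  # room to run
        return locality + capacity
    usable = [s for s in sites if s.free_slots > 0]
    return max(usable, key=score, default=None)

sites = [Site("BNL", 120, {"ds1"}), Site("UTA", 5, set())]
print(broker(Job("ds1", priority=100), sites).name)   # -> BNL
```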

  10. Core components (II): DDM • Don Quijote 2 (DQ2) is the data management system for ATLAS • Supports all three ATLAS Grid flavors (OSG, LCG and NorduGrid) • Supports all file-based data (event data, conditions, …) • Manages all data flows: EF -> T0 -> Grid Tiers -> Institutes -> laptops/end users • DQ2 architecture (diagram on next page) • Central catalog services • Local site services

  11. DQ2 Architecture (diagram)

  12. DDM (cont'd) • Principal features • Bulk operations supported • Data movement and lookup are done in units of "datasets" and "datablocks" • Dataset: a collection of logical files • Datablock: an immutable dataset, designed specifically for replication and global data discovery • Data movement is triggered by "subscription" • A client subscription specifies which data should be transferred to which sites • The DQ2 local site service finds subscriptions and "pulls" the requested data to the local site • Scalable global data discovery and access via a catalog hierarchy • Physical file information is available and managed locally only • Uses GSI authentication; supports SRM (v1.1), GridFTP, gLite FTS, HTTP, cp
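
The subscription model can be summarized in a few lines. The following toy sketch mimics the concepts (dataset, subscription, polling site service); it is not the DQ2 API, and the data structures are invented for illustration.

```python
# Toy subscription model: a client records that a dataset should exist at a
# site, and the site's local service later notices the subscription and pulls
# the files. Stand-in for the DQ2 concepts, not the real DQ2 interface.
subscriptions = []        # stands in for the central subscription catalog

def subscribe(dataset, site):
    subscriptions.append({"dataset": dataset, "site": site, "done": False})

def site_service_cycle(site, transfer):
    """One polling cycle of a site service: pull every pending dataset."""
    for sub in subscriptions:
        if sub["site"] == site and not sub["done"]:
            transfer(sub["dataset"], site)      # srmcp/GridFTP/FTS in real life
            sub["done"] = True                  # would also register local replicas

subscribe("user.analysis.dataset.001", "BNL")
site_service_cycle("BNL", lambda ds, site: print(f"pulling {ds} to {site}"))
```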

  13. DDM (cont'd) • DDM in Panda (diagram next page): data management is done asynchronously with job processing • Decouples data movement issues from job failures: the lack of reliable, robust data movement/storage services was one of the major causes of job failures seen in previous Data Challenges • Allows "just in time" job launch and execution, good for latency-sensitive jobs, e.g. distributed analysis • Issue • A DQ2 site service is a prerequisite for a site to run Panda jobs • The service is hard to install and involves many manual steps by site admins; right now only USATLAS sites have it • Use an OSG edge service box later • Or have one site service serve several sites
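
The "asynchronous" coupling amounts to: jobs are held until DQ2 reports that their dispatch data has arrived. Here is a minimal sketch of that release-on-callback step; the state names and callback shape are assumptions, not the real Panda schema.

```python
# Sketch of "dispatch, then release on callback": jobs wait in a holding state
# until the transfer-complete callback for their dispatch datablock arrives.
held_jobs = {"dispatch.block.42": ["job1001", "job1002"]}   # waiting on data
activated = []

def dq2_transfer_callback(datablock):
    """Called (e.g. via an HTTP callback) when a dispatch datablock has arrived."""
    for job in held_jobs.pop(datablock, []):
        activated.append(job)        # now eligible to be handed to a pilot

dq2_transfer_callback("dispatch.block.42")
print(activated)                     # -> ['job1001', 'job1002']
```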

  14. DDM (cont'd) [Diagram: DQ2-based data handling in Panda: the broker in the Panda server places subscriptions in the central datasets catalog service; DQ2 site services at the local sites (site A, and site B, e.g. BNL with dCache/HPSS) manage the subscriptions, pull dispatched input data to the CE, send a callback to the dispatcher when transfers complete, and transfer outputs to the destination SE]

  15. Core components (III): JobScheduler and Pilots • Panda's interface to Grid CE resources • Acquisition of CPU slots is pre-scheduled, before the job payload is available • JobScheduler: sends "pilots" to sites constantly via Condor-G • Pilot: "pulls" the job payload from the Panda server to the CPU • A "CPU slot" holder for the Panda server • An ordinary batch job for the local batch system • A sandbox for one or multiple real ATLAS jobs • Workflow (diagram on next page)

  16. Workflow of the Panda JobScheduler and pilot [Diagram: the submit process uses OSG site information and Condor-G (Condor Schedd) to place pilots on remote sites; a pilot running on a remote worker node then: asks the dispatcher (https) for a new job, sets up the runtime environment, stages input from the local DDM site service to the workdir (srmcp/dccp), forks the real job and monitors it, stages output from the workdir back to local DDM, and sends a final status update to the dispatcher]
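
A condensed sketch of that pilot workflow follows; each step maps to one stage in the diagram. The dispatcher/ddm objects and their method names are placeholders for the real pilot code, not its actual interface.

```python
# Pilot workflow sketch: one pass through the stages shown in the diagram above.
import os
import subprocess
import tempfile

def run_pilot(dispatcher, ddm):
    workdir = tempfile.mkdtemp(dir=os.environ.get("WNTMP", "/tmp"))   # OSG $WNTMP
    job = dispatcher.get_job()                 # 1. ask the dispatcher for a payload
    if job is None:
        return                                 # no work: exit immediately, free the slot
    ddm.stage_in(job["inputs"], workdir)       # 2. input from local DDM to workdir
    proc = subprocess.Popen(job["command"], cwd=workdir, shell=True)
    rc = proc.wait()                           # 3. fork the real job and monitor it
    ddm.stage_out(workdir, job["outputs"])     # 4. output from workdir to local DDM
    dispatcher.update_status(job["id"], "finished" if rc == 0 else "failed")
```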

  17. JobScheduler • Implementation • Condor-G based • The common "Grid scheduler" today • Comes with some useful features, e.g. GridMonitor, which reduces the load on the gatekeeper and has become standard practice on OSG • An infinite loop sends pilots to all usable sites at a steady rate (like the Condor GridExerciser application) • Currently submits two types of pilots: production pilots and user analysis pilots • The submission rate is configurable; currently 5 pilots every 3 minutes, always keeping 30 queued pilot jobs at each remote site • Static CE site information (e.g. $APP, $DATA, …) is collected from OSG information services • Automate this later, and share a single site information database across all Panda components (ref. slide 9)
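
The submission loop reduces to a simple control rule: top up each site to the target number of queued pilots, a few at a time. The sketch below encodes that rule with the numbers from this slide; the condor_q/condor_submit interactions are only indicated as comments and the function hooks are assumptions.

```python
# Scheduler-loop sketch: keep ~TARGET_QUEUED idle pilots at each usable site,
# submitting in small batches each cycle (5 pilots per 3 minutes per the slide).
import time

TARGET_QUEUED = 30
BATCH_SIZE = 5
CYCLE_SECONDS = 180

def scheduler_loop(sites, count_queued_pilots, submit_pilot):
    while True:
        for site in sites:
            queued = count_queued_pilots(site)          # e.g. parse condor_q output
            for _ in range(min(BATCH_SIZE, max(0, TARGET_QUEUED - queued))):
                submit_pilot(site)                      # e.g. condor_submit a Grid universe pilot
        time.sleep(CYCLE_SECONDS)
```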

  18. JobScheduler (cont'd) • Scalability issue • Goal: • run ~100k jobs/day • support 2000 jobs (running plus pending) at any time • From the Condor-G developers: supporting 500-1000 jobs (running plus pending) per site from one submit host is doable • USATLAS production has never reached that range with the currently available CE resources • Multiple submit hosts • Extra operational manpower • Local submission • a fallback that can bypass the whole cross-domain issue • needs more involvement from site administrators, and is difficult for shifters to maintain remotely, unless OSG provides edge service boxes

  19. Pilot • Implementation • Connects to the Panda server over https to retrieve new jobs and update job status • outbound connectivity (or a proxy server) from the worker nodes is required on the CE • Sets up the job work directory on worker node local disk (OSG $WNTMP) • Calls the DDM local site service to stage data in and out between worker node local disk and the CE's local storage system • Lease-based fault tolerance: job heartbeat messages are sent to the Panda server every 30 minutes; the Panda server fails a job and re-submits it if there are no updates within 6 hours • Debugging and logging: a tarball of the workdir is saved into DDM for all jobs (finished and failed) • Doesn't consume CPU if no real job is available (exits immediately from the worker node)
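
The lease-based fault tolerance can be illustrated in a few lines: the pilot refreshes a per-job lease with each heartbeat, and the server re-queues any job whose lease is stale. The timings match the slide (30 minutes, 6 hours); the storage and re-submission details are simplified assumptions.

```python
# Lease sketch: heartbeats refresh a timestamp; jobs with no update within the
# lease timeout are declared lost and handed to a re-queue hook.
import time

HEARTBEAT_INTERVAL = 30 * 60        # pilot side: seconds between heartbeats
LEASE_TIMEOUT = 6 * 60 * 60         # server side: declare a job lost after this

last_heartbeat = {}                 # job id -> timestamp of last heartbeat

def record_heartbeat(job_id, now=None):
    last_heartbeat[job_id] = now if now is not None else time.time()

def expire_lost_jobs(requeue, now=None):
    now = now if now is not None else time.time()
    for job_id, seen in list(last_heartbeat.items()):
        if now - seen > LEASE_TIMEOUT:
            del last_heartbeat[job_id]
            requeue(job_id)          # mark failed and create a retry

record_heartbeat("job1001", now=0)
expire_lost_jobs(lambda j: print("re-queue", j), now=LEASE_TIMEOUT + 1)   # -> re-queue job1001
```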

  20. General issues in the JobScheduler and pilot scheme • Security and authentication concerns • Late binding of the job payload to the worker node confuses site accounting/auditing • The JobScheduler operator's certificate is used for authentication/authorization with remote Grid sites, since the operator pushes pilots into the CEs • Real submitters' jobs go directly to pilots without passing through site authentication/authorization with their own certificates • Forward the certificate/identity so worker node process identities can be switched to match the real users • Collaborate with CMS/CDF and other OSG groups

  21. General issues (cont'd) • Latency-sensitive DA (distributed analysis) jobs • Panda reduces latency by bypassing the usual obstacles in acquiring SE/CE resources • Input data is pre-staged into CE local storage using DQ2 • Late binding of the job payload to CPU slots (pilots) • Allocation of pilot slots between production and DA jobs • Currently 10% of the pilots are allocated to DA users by the JobScheduler • No guarantee of a "steady, adequate" DA pilot stream to the Panda server • A "soft" allocation, controlled at the Panda level, not directly at the batch system level on the CEs • Problems occur when long-running production jobs (walltime ~2 days) occupy all available CPUs: no new pilot requests arrive at all, so any pre-allocation of slots is ineffective

  22. Issues (cont'd): DA pilot delivery • Alternative approach I --- short queue • DA jobs are short (~1 hour at most) • This is why/how they can ask for low latencies • A dedicated "short" queue for DA jobs • Traditional batch system model, like an "HOV" lane on the highway • The DA pilot stream then gets a free slot roughly every <job walltime>/<# of CPUs> on average (see the sketch after this slide) • Drawback: the "short" queue could sit idle if there are not enough DA jobs to run • Shared with other VO jobs? Policy could change from site to site • Requires CE site configuration; to be deployed first on ATLAS-owned resources
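
A quick back-of-the-envelope reading of that estimate: if jobs of walltime T run on N CPUs of the queue, a slot (and hence a DA pilot start) becomes available on average every T/N. The numbers below are illustrative only, not measurements from the system.

```python
# Worked example of the <walltime>/<CPUs> estimate; the queue size is hypothetical.
walltime_minutes = 60      # DA jobs are short, ~1 hour at most
short_queue_cpus = 20      # hypothetical size of the dedicated short queue
interval = walltime_minutes / short_queue_cpus
print(f"a DA slot frees up roughly every {interval:.0f} minutes")   # ~3 minutes
```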

  23. DA pilot delivery (cont'd) • Alternative II --- multitasking pilot • Runs both a long production job and a short analysis job in parallel • Asks for new analysis jobs one after another until the production job finishes, then releases the CPU resource • The production job could be checkpointed and suspended, but that is not the initial approach [Diagram: the pilot forks/execs the production job and successive analysis jobs while a monitor thread retrieves/updates info, sends job status updates to the Panda server, and cleans up]
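
The multitasking pilot idea reduces to: run the long production payload in the background and keep fetching short analysis jobs until it finishes. The sketch below assumes placeholder hooks for fetching and reporting jobs; real resource isolation (memory, CPU shares) is exactly the concern raised on the next slide.

```python
# Multitasking-pilot sketch: production job in the background, analysis jobs
# pulled one after another until production completes. Hook names are invented.
import subprocess
import time

def multitasking_pilot(get_analysis_job, report, prod_command, workdir):
    prod = subprocess.Popen(prod_command, cwd=workdir, shell=True)   # long production job
    while prod.poll() is None:                 # while production is still running...
        anal = get_analysis_job()              # ...ask the dispatcher for a short DA job
        if anal is None:
            time.sleep(60)                     # nothing available yet; ask again shortly
            continue
        rc = subprocess.call(anal["command"], cwd=workdir, shell=True)
        report(anal["id"], rc)                 # report each analysis job as it completes
    report("production-job", prod.wait())      # then report the production job and free the CPU
```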

  24. Multitasking pilot (cont'd) • Alternative approach II --- multitasking pilot • A "virtual short queue" inside the pilot pool • Very short latency; always ready to pick up a DA job • No special configuration required on the CE • But there are still concerns: • Resource contention on worker nodes, particularly memory • ATLAS jobs usually consume a lot of memory (hundreds of MB) • Conflicts with resource usage policies; breaks the "fair" sharing of local batch systems with other users' jobs • Try first on ATLAS-owned resources

  25. General issues (cont'd) • We are actively testing both the "short queue" and "multitasking pilot" approaches now on USATLAS sites, with performance comparison, cost evaluation, … • Collaborate with Condor and CMS on "just-in-time workload management" • Exploring common ground in the context of OSG's planned program of "middleware extensions" development • Condor has many functionalities in place, like "multitasking pilot" functionality through its VM system, and the Master-Worker framework for low-latency job dispatch • Extend/generalize Panda into a Condor-enabled generic WM system, deployed on OSG? • A SciDAC-II proposal has been submitted for this, together with the Condor team and the USCMS group

  26. Panda in production • Steady utilization of ~500 USATLAS CPUs for months (as long as jobs are available) • Reached >9000 jobs/day in a brief "scaling test", ~4 times the previous level; no scaling limit found • Lowest failure rate among all ATLAS executors (<10%) • Half the shift manpower compared to the previous system

  27. Conclusion • Newly designed and implemented distributed production/analysis system for US ATLAS, now in operation • Designed for 'one stop shopping' distributed processing for US ATLAS, also interfaced to ATLAS production • Based on an internally managed queue/brokerage, pilot jobs and 'just-in-time' workload delivery • Closely aligned with ATLAS DDM • Shows a dramatic increase in throughput/scalability and a decrease in operations workload • Analysis systems are in place but latencies need to be improved • May spawn a more generic effort in collaboration with Condor, CMS and other OSG groups

  28. More information • Panda • https://uimon.cern.ch/twiki/bin/view/Atlas/Panda • Panda monitor/browser • http://gridui01.usatlas.bnl.gov:28243/ • ATLAS DDM (DQ2) • https://uimon.cern.ch/twiki/bin/view/Atlas/DistributedDataManagement

  29. Thanks to the Condor team for constant, prompt responses and assistance with our system tuning, troubleshooting, and discussions of new feature implementations!
