Panda: US ATLAS Production and Distributed Analysis System
Xin Zhao
Brookhaven National Laboratory

Outline
  • Panda background
  • Design and key features
  • Core components
    • Panda Server
    • DDM in Panda
    • JobScheduler and Pilots
  • Panda in production
  • Conclusion and more information

What is Panda?
  • Panda – Production ANd Distributed Analysis system
  • ATLAS prodsys executor for OSG (diagram next page)
  • “One stop shopping” for all ATLAS users in the U.S.
    • handles managed ATLAS production jobs, regional/group/user production jobs, and distributed analysis jobs

[Diagram: the ATLAS Prodsys. Jobs from ProdDB are handled by supervisors, which feed the NorduGrid, OSG and LCG executors; “Panda” is the OSG executor. Each Grid contributes CEs for processing and SE/RLS for storage, with the DMS (DQ2) managing data.]

What is Panda? (cont’d)
  • Written in Python; the development team spans BNL, UTA, UC, OU, ANL and LBL, led by Torre Wenaus (BNL) and Kaushik De (UTA)
  • Started August 2005
    • full redesign based on previous DC2/Rome production experience, to achieve the performance, scalability and ease of operation needed for ATLAS data taking (up to 100-200K jobs/day)
  • In production since Dec. 2005
    • Ambitious development milestones met
    • Still in rapid development

Panda Design and key features
  • Architecture (diagram on next page)
  • Core components
    • Panda Server: job brokerage and dispatching
    • Distributed Data Management (DDM): ATLAS data management services running on top of Grid SEs
    • JobScheduler and Pilots: acquisition of Grid CE resources
  • Key features --- the “pull model” (sketched below)
    • Data-driven workflow, tightly integrated with the ATLAS data management system (Don Quijote 2): pull data to the SE of the targeted site
    • Late binding of jobs to worker nodes via the “pilot job” scheme: pull the job payload to acquired CE worker nodes
    • Data movement (stage-in and stage-out) is decoupled from job processing
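The “pull model” can be illustrated with a minimal sketch (hypothetical names, not actual Panda code): jobs wait in a central queue, and a payload is bound to a worker node only when a pilot that already holds a CPU slot asks for work.

    # Minimal illustration of the pull model (hypothetical names, not Panda code).
    import queue

    job_queue = queue.Queue()          # centrally defined jobs, bound to no CPU yet

    def submit(job):
        """Submitter side: define a job; no worker node is chosen at this point."""
        job_queue.put(job)

    def pull_job(worker_node):
        """Pilot side: a pilot already holding a CPU slot pulls the next payload."""
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return None                # no payload: the pilot exits and frees the slot
        job["worker"] = worker_node    # late binding happens here
        return job

    # Example: a job is defined centrally, then a pilot on node "wn042" picks it up.
    submit({"id": 1, "command": "athena myJobOptions.py"})
    print(pull_job("wn042"))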

Panda Architecture

Core components (I): Panda Server

[Diagram: Panda server architecture. Clients (job submitters, pilots, the monitor, DQ2 callbacks) connect over HTTP/HTTPS to an Apache (mod_python) front end; child processes running embedded Python interpreters handle the requests and access the DB (job info, etc.) through the MySQL API.]
  • Apache-based
  • Communication via HTTP/HTTPS (a minimal sketch of the interface follows the list)
  • Multi-process
  • Global info in the memory-resident database
  • No dependence on special grid middleware
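As a rough illustration of this interface, the toy handler below serves a job-status query over HTTP (the endpoint, port and parameter names are assumptions; the real server runs as mod_python handlers under Apache with HTTPS and a MySQL backend rather than an in-memory dict).

    # Toy sketch of a Panda-server-style HTTP interface (assumed endpoint names).
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs
    import json

    JOBS = {1001: {"site": "BNL_ATLAS_1", "state": "activated"}}   # stand-in for the DB

    class PandaHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /server/panda/getStatus?jobID=1001
            url = urlparse(self.path)
            if url.path.endswith("/getStatus"):
                job_id = int(parse_qs(url.query).get("jobID", ["0"])[0])
                body = json.dumps(JOBS.get(job_id, {"state": "unknown"})).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # HTTPS termination and multi-process dispatch are left to Apache in reality.
        HTTPServer(("localhost", 25080), PandaHandler).serve_forever()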

Panda Server (cont’d)
  • MySQL cluster backend
    • A memory-resident MySQL database records current/recent job-processing activity
    • Longer-term information is stored in a disk-resident DB
  • Brokerage (see the sketch below)
    • Manages where jobs (and associated data) are sent, based on job characteristics, data locality, priorities, user/group role, and site resources and capacities matched to job needs
      • Need to improve dynamic site-information gathering via OSG information systems
  • Manage the data/work flow pipeline
    • Ask DDM to ‘dispatch’ the dataset associated with a set of jobs to a site
    • Receive notification of transfer completion, then release the jobs
    • Receive notification of job completion, then ask DDM to transfer outputs to their destination
  • The dispatcher hands released jobs to sites upon pilot requests
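A brokerage decision of this kind could look roughly like the sketch below (purely illustrative scoring; the real brokerage also weighs priorities, user/group roles and detailed capability matching).

    # Illustrative brokerage sketch: prefer sites that already hold the input
    # dataset, then the site with the most free capacity (not the real algorithm).
    def broker(job, sites):
        """sites: e.g. [{"name": "BNL", "datasets": {"ds1"}, "free_cpus": 120}, ...]"""
        def score(site):
            has_data = job["input_dataset"] in site["datasets"]
            return (has_data, site["free_cpus"])      # locality first, then capacity
        candidates = [s for s in sites if s["free_cpus"] > 0]
        return max(candidates, key=score)["name"] if candidates else None

    # Once a site is chosen, Panda asks DDM to dispatch the dataset there and only
    # releases the jobs after the transfer-completion callback arrives.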

Core components (II): DDM
  • Don Quijote 2 is the data management system for ATLAS
    • Supports all three ATLAS Grid flavors (OSG, LCG and NorduGrid)
    • Supports all file-based data (event data, conditions, …)
    • Manages all data flows: EF -> T0 -> Grid Tiers -> institutes -> laptops/end users
  • DQ2 architecture (diagram next page)
    • Central catalog services
    • Local site services

DQ2 Architecture

DDM (cont’d)
  • Principal features
    • Bulk operations supported
      • data movement and lookup are done at the level of “datasets” and “datablocks”
        • Dataset: a collection of logical files
        • Datablock: an immutable dataset, specifically designed for replication and global data discovery
    • Data movement is triggered by “subscription” (see the sketch below)
      • A client subscription specifies which data should be transferred to which site
      • The DQ2 local site service finds its subscriptions and “pulls” the requested data to the local site
    • Scalable global data discovery and access via a catalog hierarchy
      • Physical file information is available and managed locally only
    • Uses GSI authentication; supports SRM (v1.1), GridFTP, gLite FTS, HTTP and cp
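The subscription model can be sketched as follows (function names are placeholders, not the actual DQ2 client API): a client registers interest in a dataset, and the site service later pulls whatever files it is missing.

    # Conceptual sketch of DQ2-style subscriptions (placeholder names only).
    subscriptions = []        # stand-in for the central subscription catalog

    def subscribe(dataset_name, files, site):
        """Client side: request that a dataset be replicated to a site."""
        subscriptions.append({"dataset": dataset_name, "files": files,
                              "site": site, "complete": False})

    def site_service_poll(site, local_replicas, transfer):
        """Site-service side: find our subscriptions and pull any missing files."""
        for sub in subscriptions:
            if sub["site"] == site and not sub["complete"]:
                for lfn in sub["files"]:
                    if lfn not in local_replicas:
                        transfer(lfn, site)      # e.g. via SRM/GridFTP/FTS
                sub["complete"] = True           # would trigger the callback to Panda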

DDM (cont’d)
  • DDM in Panda (diagram next page): Data management is done asynchronously with job processing
    • Decouples data movement issues from job failures: the lack of reliable, robust data movement/storage services was one of the major causes of job failures in previous Data Challenges
    • Allows “just in time” job launch and execution, good for latency-sensitive jobs, e.g. distributed analysis
  • Issue
    • A DQ2 site service is a prerequisite for a site to run Panda jobs
      • The service is hard to install and involves many manual steps by the site admin; right now only US ATLAS sites have it
      • Use an OSG edge service box later
      • Or make one site service serve several sites

DDM (cont’d)

[Diagram: DQ2-based data handling in Panda. The Panda server’s broker and dispatcher register subscriptions in the central DQ2 catalogs (datasets catalog, subscription management); the DQ2 site service at each site (Site A, or Site B such as BNL with dCache/HPSS behind its SE) pulls the subscribed data to the local SE and sends a callback to the server, after which job outputs are transferred to their destination site.]

Core components (III): JobScheduler and Pilots
  • Panda’s interface with Grid CE resources
    • Acquisition of CPU slots is pre-scheduled, before the job payload is available
  • JobScheduler: sends “pilots” to sites constantly via Condor-G
  • Pilot: “pulls” the job payload from the Panda server to the CPU
    • A “CPU slot” holder for the Panda server
    • An ordinary batch job for the local batch system
    • A sandbox for one or multiple real ATLAS jobs
  • Workflow (diagram on next page; a Python sketch follows it)

Workflow of Panda JobScheduler and pilot

[Diagram: the scheduler submits pilots to Grid sites through Condor-G (Condor schedd), using OSG site information. A pilot running on a remote site asks the Panda job dispatcher for a new job over https, sets up the runtime environment, stages input from the local DDM site service to its workdir (srmcp/dccp), forks the real job and monitors it, stages output from the workdir back to the local DDM, and sends a final status update to the dispatcher.]
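A skeleton of that workflow in Python might look like the following (the dispatcher URL, method names and the stage-in/out helpers are assumptions for illustration, not the actual pilot code).

    # Skeleton of the pilot workflow above (assumed URL, method names and helpers).
    import json, os, subprocess, tempfile
    import urllib.parse, urllib.request

    DISPATCHER = "https://pandaserver.example.org:25443/server/panda"   # hypothetical

    def call(method, **params):
        """POST a request to the job dispatcher and decode the reply."""
        data = urllib.parse.urlencode(params).encode()
        with urllib.request.urlopen(f"{DISPATCHER}/{method}", data) as resp:
            return json.loads(resp.read())

    def stage_in(files, workdir):     # placeholder: dccp/srmcp from the local SE
        pass

    def stage_out(files, workdir):    # placeholder: copy outputs back to the local SE
        pass

    def run_pilot(site):
        job = call("getJob", siteName=site)
        if not job:                   # no payload available: exit and free the slot
            return
        workdir = tempfile.mkdtemp(dir=os.environ.get("WNTMP", "/tmp"))   # OSG $WNTMP
        stage_in(job.get("input_files", []), workdir)     # local DDM -> workdir
        rc = subprocess.call(job["command"], shell=True, cwd=workdir)
        stage_out(job.get("output_files", []), workdir)   # workdir -> local DDM
        call("updateJob", jobID=job["id"], state="finished" if rc == 0 else "failed")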

JobScheduler
  • Implementation
    • Condor-G based
      • Common “Grid Scheduler” today
      • Comes with some good features, e.g. GridMonitor, which reduces the load on the gatekeeper and has become standard practice on OSG
    • Infinite loop sending pilots to all usable sites at a steady rate (like the Condor GridExerciser application); see the sketch below
      • Currently it submits two types of pilots: production pilots and user analysis pilots
    • The submission rate is configurable, currently 5 pilots per 3 minutes, and the scheduler always keeps ~30 queued pilot jobs at each remote site
    • Static CE site information (e.g. $APP, $DATA, …) is collected from OSG information services
      • Automate this later, and share a single site-information database across all Panda components (cf. the brokerage notes under “Panda Server (cont’d)”)
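A schematic version of that submission loop is sketched below (site names and the submit files are illustrative, and the condor_q parsing needed to count queued pilots is omitted).

    # Schematic pilot-submission loop: 5 pilots every 3 minutes per site, topping
    # the queue up towards ~30 idle pilots at each site.
    import subprocess, time

    SITES = ["BNL_ATLAS_1", "UTA_DPCC"]             # hypothetical site list
    PILOTS_PER_CYCLE, CYCLE_SECONDS, TARGET_QUEUED = 5, 180, 30

    def queued_pilots(site):
        """Would count idle pilots for `site`, e.g. by parsing condor_q output."""
        return 0

    def submit_pilot(site):
        # The submit file would point grid_resource at the site's gatekeeper,
        # set executable to the pilot wrapper, and enable the GridMonitor.
        subprocess.run(["condor_submit", f"pilot_{site}.sub"], check=False)

    while True:
        for site in SITES:
            shortfall = TARGET_QUEUED - queued_pilots(site)
            for _ in range(min(PILOTS_PER_CYCLE, max(shortfall, 0))):
                submit_pilot(site)
        time.sleep(CYCLE_SECONDS)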

JobScheduler (cont’d)
  • Scalability Issue
    • Goal:
      • to run ~100k jobs/day
      • support 2000 jobs (running plus pending) at any time
    • According to the Condor-G developers, supporting 500~1000 jobs (running plus pending) per site from one submit host is doable
      • US ATLAS production has never reached that range with the currently available CE resources
    • Multiple submit hosts
      • Extra operational manpower
    • Local submission
      • a fallback that bypasses the whole cross-domain issue
      • needs more involvement from site administrators, and is difficult for shifters to maintain remotely, unless OSG provides edge service boxes

Pilot
  • Implementation
    • Connects to the Panda server through https to retrieve new jobs and update job status
      • outbound connectivity from worker nodes (possibly through a proxy server) is required on the CE
    • Sets up the job work directory on the worker node’s local disk (OSG $WNTMP)
    • Calls the DDM local site service to stage data in and out between the worker node’s local disk and the CE local storage system
    • Lease-based fault tolerance: job heartbeat messages are sent to the Panda server every 30 minutes; the Panda server fails a job and re-submits it if there are no updates within 6 hours (see the sketch below)
    • Debugging and logging: makes a tarball of the workdir and saves it into DDM, for all jobs (finished and failed)
    • Doesn’t consume CPU if no real job is available (exits immediately from the worker node)
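A minimal sketch of that lease mechanism (names are illustrative; in practice the updates go over the same https interface as everything else):

    # Minimal sketch of lease-based fault tolerance: 30-minute heartbeats from the
    # pilot; the server fails and requeues any job silent for more than 6 hours.
    import time

    HEARTBEAT_SECONDS = 30 * 60
    LEASE_SECONDS = 6 * 3600

    def heartbeat_loop(job_id, send_update, stop_event):
        """Pilot side: run in a thread (stop_event is a threading.Event) while the
        real job executes."""
        while not stop_event.wait(HEARTBEAT_SECONDS):
            send_update(job_id, state="running", timestamp=time.time())

    def reap_stale_jobs(jobs, requeue):
        """Server side: called periodically over the job table."""
        now = time.time()
        for job_id, job in jobs.items():
            if job["state"] == "running" and now - job["last_heartbeat"] > LEASE_SECONDS:
                job["state"] = "failed"
                requeue(job_id)          # resubmit with a fresh lease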

General issues in JobScheduler and pilot scheme
  • Security and authentication concerns
    • Late binding of the job payload to the worker node confuses site accounting/auditing
      • The JobScheduler operator’s certificate is used for authentication/authorization with remote Grid sites, since the scheduler pushes pilots into CEs
      • Real submitters’ jobs go directly to pilots without passing through site authentication/authorization with their own certificates
    • Forwarding the certificate/identity to switch worker-node process identities to match the real users
      • Collaborate with CMS/CDF and other OSG groups

General issues (cont’d)
  • Latency-sensitive DA (distributed analysis) jobs
    • Panda reduces latency by bypassing the usual obstacles of acquiring SE/CE resources
      • Pre-stage input data into CE local storage using DQ2
      • Late binding of the job payload to CPU slots (pilots)
    • Allocation of pilot slots between production and DA jobs
      • Currently 10% of the pilots are allocated to DA users by the JobScheduler (see the sketch below)
      • No guarantee of a “steady, adequate” DA pilot stream to the Panda server
        • “soft” allocation, controlled at the Panda level, not directly at the batch-system level on the CEs
        • The problem occurs when long-running production jobs (walltime ~2 days) occupy all available CPUs --- no new pilot requests arrive at all, so any pre-allocation of slots has no effect
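In the scheduler, the “soft” allocation amounts to a per-pilot choice of job type, roughly as below (the fraction is configurable; this is illustrative only). Because the choice is made when pilots are submitted, it has no effect once long production jobs occupy every CPU and no new pilots can start.

    # Sketch of the "soft" 10% allocation of pilots to distributed analysis.
    import random

    DA_FRACTION = 0.10    # configurable share of pilots reserved for analysis

    def choose_pilot_type():
        return "analysis" if random.random() < DA_FRACTION else "production"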

Issues (cont’d): DA pilot delivery
  • Alternative approach I --- short queue
    • DA jobs are short (~1 hour at most)
      • This is why/how they can ask for low latencies
    • Dedicated “short” queue for DA jobs
      • Traditional batch-system model, like an “HOV” lane on the highway
      • A DA pilot slot frees up on average every ~<job walltime>/<# of CPUs> (e.g. 1-hour jobs on a 60-CPU short queue free a slot about once a minute)
      • Drawback: the “short” queue could sit “idle” if there are not enough DA jobs to run
        • Share it with other VOs’ jobs? Policy could vary from site to site
      • Requires a CE site configuration change, to be deployed first on ATLAS-owned resources

DA Pilot Delivery (cont’d)
  • Alternative II --- multitasking pilot
    • Runs a long production job and a short analysis job in parallel
    • Asks for new analysis jobs one after another until the production job finishes, then releases the CPU resource (see the sketch below)
    • The production job could be checkpointed and suspended, but that is not the initial approach

[Diagram: multitasking pilot. The pilot forks/execs the payloads (one prod job and successive anal jobs running in parallel); a monitor thread retrieves/updates job info with the Panda server, sends job status updates, and cleans up when the jobs finish.]
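A rough sketch of the multitasking-pilot loop follows (illustrative only; checkpointing or suspending the production job is not attempted here).

    # Rough sketch of a multitasking pilot: keep the long production job running
    # and serve short analysis jobs in parallel until the production job finishes.
    import subprocess, time

    def multitasking_pilot(prod_command, get_analysis_job):
        prod = subprocess.Popen(prod_command, shell=True)
        while prod.poll() is None:                 # production job still running
            anal = get_analysis_job()              # ask the dispatcher for a DA job
            if anal is not None:
                subprocess.call(anal["command"], shell=True)   # run it to completion
            else:
                time.sleep(60)                     # nothing to do yet; poll again
        # production job finished: clean up, report status, release the CPU slot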

Multitasking pilot (cont’d)
  • Alternative approach II --- multitasking pilot
    • A “virtual short queue” within the pilot pool
    • Very short latency, always ready for DA job pickup
    • No special configuration required on the CE
    • But there are still concerns:
      • Resource contention on worker nodes, particularly memory
        • ATLAS jobs usually consume a lot of memory (hundreds of MB)
      • Conflicts with resource-usage policies, breaking “fair” sharing of local batch systems with other users’ jobs
        • Try first on ATLAS-owned resources

General issues (cont’d)
  • We are actively testing both the “short queue” and “multitasking pilot” approaches on US ATLAS sites, with performance comparisons and cost evaluation…
  • Collaborating with Condor and CMS on “just-in-time workload management”
    • Exploring common ground in the context of OSG’s planned program of “middleware extensions” development
    • Condor has much of the needed functionality in place, e.g. “multitasking pilot” capability through its VM system and the Master-Worker framework for low-latency job dispatch
    • Extending/generalizing Panda into a Condor-enabled generic WM system, deployed to OSG?
    • A SciDAC-II proposal has been submitted for this, together with Condor team and USCMS group

Panda in production
  • Steady utilization of ~500 US ATLAS CPUs over months (as long as jobs are available)
  • Reached >9000 jobs/day in a brief “scaling test”, ~4 times the previous rate; no scaling limit found
  • Lowest failure rate among all ATLAS executors (<10%)
  • Half the shift manpower compared to the previous system

Conclusion
  • Newly designed and implemented distributed production/analysis system for US ATLAS now in operation
  • Designed for ‘one stop shopping’ distributed processing for US ATLAS, also interfaced to ATLAS production
  • Based on internally managed queue/brokerage, pilot jobs and ‘just-in-time’ workload delivery
  • Closely aligned with ATLAS DDM
  • Shows dramatic increase in throughput/scalability and decrease in operations workload
  • Analysis system is in place, but latencies need to be improved
  • May spawn a more generic effort in collaboration with Condor, CMS and other OSG groups

More information
  • Panda
    • https://uimon.cern.ch/twiki/bin/view/Atlas/Panda
  • Panda monitor/browser
    • http://gridui01.usatlas.bnl.gov:28243/
  • ATLAS DDM (DQ2)
    • https://uimon.cern.ch/twiki/bin/view/Atlas/DistributedDataManagement

Thanks to the Condor team for constant, prompt responses and assistance with our system tuning, troubleshooting and new-feature discussions!
