panda status report
Download
Skip this Video
Download Presentation
PanDA Status Report

Loading in 2 Seconds...

play fullscreen
1 / 27

PanDA Status Report - PowerPoint PPT Presentation


  • 150 Views
  • Uploaded on

PanDA Status Report. Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014. Overview. We are nearing end of ANSE project ~6 months Review goals/scope of PanDA work in ANSE Assess progress so far PanDA work started ~1 year ago Plans for completion of current work

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' PanDA Status Report' - palma


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
panda status report

PanDA Status Report

Kaushik De

Univ. of Texas at Arlington

ANSE Meeting, NashvilleMay 13, 2014

overview
Overview
  • We are nearing end of ANSE project ~6 months
  • Review goals/scope of PanDA work in ANSE
  • Assess progress so far
    • PanDAwork started ~1 year ago
  • Plans for completion of current work
  • Plans for new work
    • Discuss tomorrow
  • Synergy with other projects
    • Artem is co-funded by DOE-ASCR BigPanDA project
    • BigPanDA continues for ~9 months after ANSE ends
    • What happens after 2015?

Kaushik De

panda goals
PanDA Goals
  • Explicit integration of Networking with PanDA
    • Never before attempted for any WMS
    • PanDA has many implicit assumptions about networking
    • Goal 1: Use network information directly in PanDA workflow
    • Goal 2: Attempt direct control (provisioning) through PanDA
  • ANSE + DOE-ASCR
    • Picked few well defined topics
    • Set up infrastructure and interactions with other projects
    • Develop and deploy software
    • Evaluation metrics
  • Deliver new capabilities forLHC experiments
    • This is not only R&D – use in production environment

Kaushik De

panda steps
PanDA Steps
  • Collect network information
  • Storage and access
  • Using network information
  • Using dynamic circuits

Kaushik De

sources of network information
Sources of Network Information
  • DDM Sonar measurements
    • Actual transfer rates for files between all sites (Tier 1 and Tier 2)
    • This information is normally used for site white/blacklisting
    • Measurements available for small, medium, and large files
  • perfSonar (PS) measurements
    • perfSonar provides dedicated network monitoring data
    • All WLCG sites are being instrumented with PS boxes
    • US sites are already instrumented and monitored
  • Federated XRootD (FAX) measurements
    • Read-time ofremote files are measured for pairs of sites
  • This is not an exclusive list – just a starting point

http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false

Kaushik De

ddm sonar
DDM Sonar

Kaushik De

perfsonar
perfSonar

Kaushik De

slide8
FAX

Kaushik De

data repositories
Data Repositories
  • Three levels of data storage and access
  • Native data repositories
    • Historical data stored from collectors
    • SSB – site status board for sonar and perfSonar data
    • FAX data is kept independently and uploaded
  • AGIS (ATLAS Grid Information System)
    • Most recent / processed data only – updated periodically
    • Mixture of push/pull – moving to JSON API (pushed only)
  • schedConfigDB
    • Internal Oracle DB used by PanDA for fast access
    • Uses standard ATLAS collector

Kaushik De

using network information
Using Network Information
  • Pick a few use cases
    • Important to PanDA users
    • Enhance workload management through use of network
    • Should provide clear metrics for success/failure
  • Case 1: Improve User Analysis workflow
  • Case 2: Improve Tier 1 to Tier 2 workflow

Kaushik De

improving user analysis
Improving User Analysis
  • In PanDA, user jobs go to data
    • Typically, user jobs are IO intensive – hence constrain jobs to data
    • Note - almost any user payload is allowed by PanDA
    • User analysis jobs are routed automatically to T1/T2 sites
  • For popular data, bottlenecks develop
    • If data isonly at a few sites, user jobs have long wait times
    • PD2P was implemented 3 years ago to solve this problem
    • Additional copies are made asynchronously by PanDA
    • Waiting jobs are automatically re-brokered to new sites
    • But bottlenecks still take time to clear up
  • Can we do something else using network information?
    • Why not use FAX?
    • First we need to develop network metrics for efficient use of FAX

Kaushik De

faster user analysis through fax
Faster User analysis through FAX
  • First use case for network integration with PanDA
  • PanDA brokerage will use concept of ‘nearby’ sites
    • Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)
    • Add network transfer cost to brokerage weight
    • Jobs will be sent to the site with best weight – not necessarily the site with local data
    • If nearby site has less wait time, access the data through FAX

Kaushik De

first tests
First Tests
  • Tested in production for ~1 day in March, 2014
    • Useful for debugging and tuning direct access infrastructure
    • We got first results on network aware brokerage
  • Job distribution
    • 4748 jobs from 20 user tasks which required data from congested U.S. Tier 1 site were automatically brokered to U.S. Tier 1/2 sites

Kaushik De

conclusions for case 1
Conclusions for Case 1
  • Network data collection working well
    • Additional algorithms to combine network data will be tried
    • HC tests working well – but PS data not robust yet
  • PanDA brokerage worked well
    • Achieved goal of reducing wait time
    • Well balanced local vs remote access
    • Will fine tune after more data on performance
  • Waiting for final implementation
    • But we have no data on actual performance of successful jobs
    • Need to test and validate sites for this mode of data access
    • First tests in March had 100% failure rate (FAX deployment related)
    • Second test 1 week ago also did not go well
    • Expect third test soon

Kaushik De

managing data rates
Managing Data Rates
  • Tests have shown direct access rates need to be managed
  • Parameters for WAN throttling implemented in PanDA
    • Throttling at brokerage level is easy (eg. ratio FAX jobs/non FAX jobs), but does not guarantee throttling during execution
    • Throttling during dispatch is not scalable when million jobs are dispatched daily (scale may be higher in the future)
    • Throttling may also be done at pilot level
    • PanDAhas implemented a mixed approach to throttling, being tested now

Kaushik De

cloud selection
Cloud Selection
  • Second use case for network integration with PanDA
  • Optimize choice of T1-T2 pairings (cloud selection)
    • In ATLAS, production tasks are assigned to Tier 1’s
    • Tier 2’s are attached to a Tier 1 cloud for data processing
    • Any T2 may be attached to multiple T1’s
    • Currently, operations team makes this assignment manually
    • This could/should be automated using network information
    • For example, each T2 could be assigned to a native cloud by operations team, and PanDA will assign to other clouds based on network performance metrics

Kaushik De

ddm sonar data
DDM Sonar Data

http://aipanda021.cern.ch/networking/t1tot2d_matrix/

Kaushik De

tier 1 view
Tier 1 View

Kaushik De

tier 2 view
Tier 2 View

Kaushik De

conclusion for case 2
Conclusion for Case 2
  • Working well in real time
  • Currently implementing archival information
    • Keep data for last ‘n’ Tier 1 – Tier 2 associations
    • Necessary to check robustness of approach
    • Algorithm may use the historical information in the future
  • Expect to deploy this summer
    • Hopefully ~1 month

Kaushik De

summary
Summary
  • First 2 use cases for network integration with PanDA working well
    • Work will be completed this summer
    • Metrics showing usefulness of approach will be available in Fall
    • On track for timely final report to ANSE

Kaushik De

ad