Panda status report
This presentation is the property of its rightful owner.
Sponsored Links
1 / 27

PanDA Status Report PowerPoint PPT Presentation


  • 117 Views
  • Uploaded on
  • Presentation posted in: General

PanDA Status Report. Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014. Overview. We are nearing end of ANSE project ~6 months Review goals/scope of PanDA work in ANSE Assess progress so far PanDA work started ~1 year ago Plans for completion of current work

Download Presentation

PanDA Status Report

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Panda status report

PanDA Status Report

Kaushik De

Univ. of Texas at Arlington

ANSE Meeting, NashvilleMay 13, 2014


Overview

Overview

  • We are nearing end of ANSE project ~6 months

  • Review goals/scope of PanDA work in ANSE

  • Assess progress so far

    • PanDAwork started ~1 year ago

  • Plans for completion of current work

  • Plans for new work

    • Discuss tomorrow

  • Synergy with other projects

    • Artem is co-funded by DOE-ASCR BigPanDA project

    • BigPanDA continues for ~9 months after ANSE ends

    • What happens after 2015?

Kaushik De


Panda goals

PanDA Goals

  • Explicit integration of Networking with PanDA

    • Never before attempted for any WMS

    • PanDA has many implicit assumptions about networking

    • Goal 1: Use network information directly in PanDA workflow

    • Goal 2: Attempt direct control (provisioning) through PanDA

  • ANSE + DOE-ASCR

    • Picked few well defined topics

    • Set up infrastructure and interactions with other projects

    • Develop and deploy software

    • Evaluation metrics

  • Deliver new capabilities forLHC experiments

    • This is not only R&D – use in production environment

Kaushik De


Panda steps

PanDA Steps

  • Collect network information

  • Storage and access

  • Using network information

  • Using dynamic circuits

Kaushik De


Sources of network information

Sources of Network Information

  • DDM Sonar measurements

    • Actual transfer rates for files between all sites (Tier 1 and Tier 2)

    • This information is normally used for site white/blacklisting

    • Measurements available for small, medium, and large files

  • perfSonar (PS) measurements

    • perfSonar provides dedicated network monitoring data

    • All WLCG sites are being instrumented with PS boxes

    • US sites are already instrumented and monitored

  • Federated XRootD (FAX) measurements

    • Read-time ofremote files are measured for pairs of sites

  • This is not an exclusive list – just a starting point

http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false

Kaushik De


Ddm sonar

DDM Sonar

Kaushik De


Perfsonar

perfSonar

Kaushik De


Panda status report

FAX

Kaushik De


Panda status report

Kaushik De


Data repositories

Data Repositories

  • Three levels of data storage and access

  • Native data repositories

    • Historical data stored from collectors

    • SSB – site status board for sonar and perfSonar data

    • FAX data is kept independently and uploaded

  • AGIS (ATLAS Grid Information System)

    • Most recent / processed data only – updated periodically

    • Mixture of push/pull – moving to JSON API (pushed only)

  • schedConfigDB

    • Internal Oracle DB used by PanDA for fast access

    • Uses standard ATLAS collector

Kaushik De


Panda status report

Kaushik De


Using network information

Using Network Information

  • Pick a few use cases

    • Important to PanDA users

    • Enhance workload management through use of network

    • Should provide clear metrics for success/failure

  • Case 1: Improve User Analysis workflow

  • Case 2: Improve Tier 1 to Tier 2 workflow

Kaushik De


Improving user analysis

Improving User Analysis

  • In PanDA, user jobs go to data

    • Typically, user jobs are IO intensive – hence constrain jobs to data

    • Note - almost any user payload is allowed by PanDA

    • User analysis jobs are routed automatically to T1/T2 sites

  • For popular data, bottlenecks develop

    • If data isonly at a few sites, user jobs have long wait times

    • PD2P was implemented 3 years ago to solve this problem

    • Additional copies are made asynchronously by PanDA

    • Waiting jobs are automatically re-brokered to new sites

    • But bottlenecks still take time to clear up

  • Can we do something else using network information?

    • Why not use FAX?

    • First we need to develop network metrics for efficient use of FAX

Kaushik De


Faster user analysis through fax

Faster User analysis through FAX

  • First use case for network integration with PanDA

  • PanDA brokerage will use concept of ‘nearby’ sites

    • Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)

    • Add network transfer cost to brokerage weight

    • Jobs will be sent to the site with best weight – not necessarily the site with local data

    • If nearby site has less wait time, access the data through FAX

Kaushik De


First tests

First Tests

  • Tested in production for ~1 day in March, 2014

    • Useful for debugging and tuning direct access infrastructure

    • We got first results on network aware brokerage

  • Job distribution

    • 4748 jobs from 20 user tasks which required data from congested U.S. Tier 1 site were automatically brokered to U.S. Tier 1/2 sites

Kaushik De


Brokerage results

Brokerage Results

Kaushik De


Conclusions for case 1

Conclusions for Case 1

  • Network data collection working well

    • Additional algorithms to combine network data will be tried

    • HC tests working well – but PS data not robust yet

  • PanDA brokerage worked well

    • Achieved goal of reducing wait time

    • Well balanced local vs remote access

    • Will fine tune after more data on performance

  • Waiting for final implementation

    • But we have no data on actual performance of successful jobs

    • Need to test and validate sites for this mode of data access

    • First tests in March had 100% failure rate (FAX deployment related)

    • Second test 1 week ago also did not go well

    • Expect third test soon

Kaushik De


Managing data rates

Managing Data Rates

  • Tests have shown direct access rates need to be managed

  • Parameters for WAN throttling implemented in PanDA

    • Throttling at brokerage level is easy (eg. ratio FAX jobs/non FAX jobs), but does not guarantee throttling during execution

    • Throttling during dispatch is not scalable when million jobs are dispatched daily (scale may be higher in the future)

    • Throttling may also be done at pilot level

    • PanDAhas implemented a mixed approach to throttling, being tested now

Kaushik De


Cloud selection

Cloud Selection

  • Second use case for network integration with PanDA

  • Optimize choice of T1-T2 pairings (cloud selection)

    • In ATLAS, production tasks are assigned to Tier 1’s

    • Tier 2’s are attached to a Tier 1 cloud for data processing

    • Any T2 may be attached to multiple T1’s

    • Currently, operations team makes this assignment manually

    • This could/should be automated using network information

    • For example, each T2 could be assigned to a native cloud by operations team, and PanDA will assign to other clouds based on network performance metrics

Kaushik De


Ddm sonar data

DDM Sonar Data

http://aipanda021.cern.ch/networking/t1tot2d_matrix/

Kaushik De


Tier 1 view

Tier 1 View

Kaushik De


More t1 information

More T1 Information

Kaushik De


Tier 2 view

Tier 2 View

Kaushik De


Improving site association

Improving Site Association

Kaushik De


More t2 information

More T2 Information

Kaushik De


Conclusion for case 2

Conclusion for Case 2

  • Working well in real time

  • Currently implementing archival information

    • Keep data for last ‘n’ Tier 1 – Tier 2 associations

    • Necessary to check robustness of approach

    • Algorithm may use the historical information in the future

  • Expect to deploy this summer

    • Hopefully ~1 month

Kaushik De


Summary

Summary

  • First 2 use cases for network integration with PanDA working well

    • Work will be completed this summer

    • Metrics showing usefulness of approach will be available in Fall

    • On track for timely final report to ANSE

Kaushik De


  • Login