ATLAS Site Status Board Automatic queue exclusion based on downtimes

ATLAS Site Status Board: automatic queue exclusion based on downtimes. Topics: ATLAS site topology; site exclusion algorithm; test results; first real exclusion and recovery. Authors: C. Borrego, S. Campana, S. Gayazov, A. DiGirolamo, X. Espinal, E. Magradze, L. Rinaldi, J. Schovancova, G. Stewart, M. Wrigth


Presentation Transcript


  1. ATLAS Site Status Board: Automatic queue exclusion based on downtimes
     • ATLAS site topology
     • Site exclusion algorithm
     • Test results
     • First real exclusion and recovery
     C. Borrego, S. Campana, S. Gayazov, A. DiGirolamo, X. Espinal, E. Magradze, L. Rinaldi, J. Schovancova, G. Stewart, M. Wrigth
     atlas-adc-ssb-devs@cern.ch
     21st Feb 2012

  2. ATLAS site topology
     • Based on information from AGIS, Schedconfig
     • Mapping between various ATLAS site naming conventions
       • AGIS (based on GOCDB/OIM), Panda, DDM
       • Populated “exception file”
     • ATLAS site-oriented topology
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
     • ATLAS Panda queue-oriented topology
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues.json
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues_dict.json
     • In touch with Pilot factory monitoring developers to get the mapping between queues and resources as the Pilot factories see it
       • Will enable us to map ANALY queues to downtimes of a CE

  3. Site exclusion
     • Queue exclusion based on downtime of an SE, CE (, LFC)
     • The exclusion tool underwent thorough testing before it was put into production for the first queues
     [Diagram, labels as extracted: GOCDB and OIMDB feed AGIS (site downtime information). The DDM exclusion collector fetches SE downtime from AGIS; the site exclusion collector fetches SE/CE/LFC downtime from AGIS. Example events: Site A SE downtime starts (Site A: SE excluded); Site B SE downtime over (Site B: SE recovered); Site C SE downtime starts (Site C: CEs excluded); Site D LFC downtime starts (Site D: SE(s) excluded, CE(s) excluded); Site E CE(s) downtime starts (Site E: CE(s) excluded). Legend: “In production”, “In testing phase”. Date on diagram: 18 Oct 2011.]

  4. Site exclusion algorithm
     • Fetch ongoing and future downtimes from AGIS
     • Map downtimes from sites to queues (topology!)
       • SRM downtime: action with every queue type (ANALY, prod)
       • CE downtime: action only with prod queues
     • Decide exclusion/recovery action, consider
       • time of downtime
       • queue type (production, analysis, “special”)
       • current queue status
       • current queue comment
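The mapping step on this slide can be sketched as a small lookup from the service in downtime to the queue types it affects. This is a minimal illustration, not the tool's code: the `Queue` record, the `queues_affected` helper, and the string labels for services and queue types are all hypothetical. The slide does not say how “special” queues are treated, so they are left untouched here.

```python
from dataclasses import dataclass

# Hypothetical labels: the slide names SRM/CE downtimes and ANALY/prod queues,
# but not the identifiers used inside the exclusion tool.
AFFECTED_QUEUE_TYPES = {
    "SRM": {"production", "analysis"},  # SRM downtime: act on every queue type
    "CE": {"production"},               # CE downtime: act only on prod queues
}


@dataclass(frozen=True)
class Queue:
    name: str
    site: str
    queue_type: str  # "production", "analysis" or "special"


def queues_affected(downtime_site, service, queues):
    """Return the queues at downtime_site that a downtime of service touches."""
    affected_types = AFFECTED_QUEUE_TYPES.get(service, set())
    return [
        q for q in queues
        if q.site == downtime_site and q.queue_type in affected_types
    ]


queues = [
    Queue("ANALY_TESTSITE", "TESTSITE", "analysis"),
    Queue("TESTSITE-prod", "TESTSITE", "production"),
    Queue("OTHERSITE-prod", "OTHERSITE", "production"),
]
print([q.name for q in queues_affected("TESTSITE", "CE", queues)])
print([q.name for q in queues_affected("TESTSITE", "SRM", queues)])
```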

  5. Exclusion of a production queue
     • 12 hr in advance of a downtime:
       • setoffline with comment “set.offline.by.SSB” if queue is:
         • Online with any possible comment
         • Brokeroff with comment “set.brokeroff.by.SSB”
         • Test with comment “HC.Test.Me”
       • Otherwise do not touch that queue!
     • When downtime starts:
       • Make sure that the queue is set offline when appropriate
       • See the rules above for the T-12h .. T interval
     • End of downtime / downtime disappears (recovery):
       • settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
       • Otherwise do not touch that queue!
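The production-queue rules above can be written as one decision function. This is a sketch under assumptions: the function name, its signature, and the convention that `hours_until_start` is `None` once the downtime has ended or disappeared (and zero or negative while it is ongoing) are all hypothetical. The lowercase status strings follow the FAKE_STATES list from the test slides.

```python
OFFLINE_COMMENT = "set.offline.by.SSB"
BROKEROFF_COMMENT = "set.brokeroff.by.SSB"
TEST_COMMENT = "HC.Test.Me"


def may_set_offline(status, comment):
    """True if the queue is in a state the slide allows us to set offline."""
    return (
        status == "online"  # any comment
        or (status == "brokeroff" and comment == BROKEROFF_COMMENT)
        or (status == "test" and comment == TEST_COMMENT)
    )


def production_action(status, comment, hours_until_start=None):
    """Return (command, comment) for a production queue, or None to leave it.

    hours_until_start is None when no ongoing or upcoming downtime is known
    (the downtime ended or disappeared); it is <= 0 during a downtime.
    """
    if hours_until_start is None:
        # Recovery: only touch queues that the tool itself set offline.
        if status == "offline" and comment == OFFLINE_COMMENT:
            return ("settest", TEST_COMMENT)
        return None
    if hours_until_start <= 12 and may_set_offline(status, comment):
        return ("setoffline", OFFLINE_COMMENT)
    return None  # otherwise do not touch that queue


print(production_action("online", "dummy", 11.5))
print(production_action("offline", "dummy", None))
```

Keying the recovery on the exact comment is what keeps the tool from re-enabling a queue that an operator set offline by hand.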

  6. Exclusion of an analysis queue
     • 6 hr in advance of a downtime:
       • setbrokeroff with comment “set.brokeroff.by.SSB” if queue is:
         • Online with any possible comment
         • Brokeroff with comment “set.brokeroff.by.SSB”
         • Offline with comment “set.offline.by.SSB”
       • Otherwise do not touch that queue!
     • 2 hr in advance of a downtime and during downtime:
       • setoffline with comment “set.offline.by.SSB” if queue is:
         • Online with any possible comment
         • Brokeroff with comment “set.brokeroff.by.SSB”
         • Test with comment “HC.Test.Me”
       • Otherwise do not touch that queue!
     • End of downtime / downtime disappears (recovery):
       • settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
       • Otherwise do not touch that queue!
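The two-stage rule for analysis queues (brokeroff at T-6h, offline at T-2h) can be sketched the same way. The function name and the `hours_until_start` convention (`None` after the downtime has ended or disappeared, zero or negative during it) are assumptions made for this illustration; the state/comment combinations are taken literally from the slide.

```python
OFFLINE_COMMENT = "set.offline.by.SSB"
BROKEROFF_COMMENT = "set.brokeroff.by.SSB"
TEST_COMMENT = "HC.Test.Me"


def analysis_action(status, comment, hours_until_start=None):
    """Return (command, comment) for an analysis queue, or None to leave it."""
    if hours_until_start is None:
        # Recovery: only queues that the tool itself set offline go to test.
        if status == "offline" and comment == OFFLINE_COMMENT:
            return ("settest", TEST_COMMENT)
        return None

    if hours_until_start <= 2:
        # T-2h and during the downtime: stop running jobs at the site.
        if (
            status == "online"
            or (status == "brokeroff" and comment == BROKEROFF_COMMENT)
            or (status == "test" and comment == TEST_COMMENT)
        ):
            return ("setoffline", OFFLINE_COMMENT)
        return None

    if hours_until_start <= 6:
        # T-6h .. T-2h: stop brokering new jobs but let running ones finish.
        if (
            status == "online"
            or (status == "brokeroff" and comment == BROKEROFF_COMMENT)
            or (status == "offline" and comment == OFFLINE_COMMENT)
        ):
            return ("setbrokeroff", BROKEROFF_COMMENT)
        return None

    return None  # more than 6 hours ahead: do not touch the queue


print(analysis_action("online", "", 5))
print(analysis_action("brokeroff", "set.brokeroff.by.SSB", 1.5))
```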

  7. Testing the exclusion idea - 1
     • Assembled test data:
       • 2 flavours of production queues (only 1 enabled),
       • 2 flavours of analysis queues (only 1 enabled)
     • Phase space of queue status contains every possible combination of [queue type, queue status, queue comment]:
       • FAKE_QUEUE_TYPES (x) FAKE_QUEUE_PREFIXES (x) (x) FAKE_STATES (x) FAKE_COMMENTS, where
       • FAKE_QUEUE_TYPES=[DEFAULT_QUEUE_TYPE_PRODUCTION, DEFAULT_QUEUE_TYPE_ANALYSIS, DEFAULT_QUEUE_TYPE_SPECIAL]
       • FAKE_QUEUE_PREFIXES={DEFAULT_QUEUE_TYPE_PRODUCTION: ['testsite-testsitece02-at2testsite-pbs_test', 'testsite-testsitece03-at2testsite-pbs_test'], DEFAULT_QUEUE_TYPE_ANALYSIS: ['ANALY', 'ANALY2'], DEFAULT_QUEUE_TYPE_SPECIAL: ['SPECIAL1', 'SPECIAL2']}
       • FAKE_STATES=['online', 'offline', 'test', 'brokeroff']
       • FAKE_COMMENTS=['', 'dummy', 'set.offline.by.SSB', 'set.offline.by.SSB.dummy', 'set.brokeroff.by.SSB', 'set.brokeroff.by.SSB.dummy', 'set.online.by.SSB', 'set.online.by.SSB.dummy', 'HC.Test.Me', 'HC.Test.Me.dummy']
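The phase space described on this slide can be enumerated with `itertools.product`. The lists are copied from the slide; the string values given to the three `DEFAULT_QUEUE_TYPE_*` constants and the `phase_space` helper are assumptions for illustration. The sketch enumerates every prefix, without modelling which flavours were enabled in the actual test.

```python
from itertools import product

# The slide names these constants but not their values; the strings are assumed.
DEFAULT_QUEUE_TYPE_PRODUCTION = "production"
DEFAULT_QUEUE_TYPE_ANALYSIS = "analysis"
DEFAULT_QUEUE_TYPE_SPECIAL = "special"

FAKE_QUEUE_TYPES = [
    DEFAULT_QUEUE_TYPE_PRODUCTION,
    DEFAULT_QUEUE_TYPE_ANALYSIS,
    DEFAULT_QUEUE_TYPE_SPECIAL,
]
FAKE_QUEUE_PREFIXES = {
    DEFAULT_QUEUE_TYPE_PRODUCTION: [
        "testsite-testsitece02-at2testsite-pbs_test",
        "testsite-testsitece03-at2testsite-pbs_test",
    ],
    DEFAULT_QUEUE_TYPE_ANALYSIS: ["ANALY", "ANALY2"],
    DEFAULT_QUEUE_TYPE_SPECIAL: ["SPECIAL1", "SPECIAL2"],
}
FAKE_STATES = ["online", "offline", "test", "brokeroff"]
FAKE_COMMENTS = [
    "", "dummy",
    "set.offline.by.SSB", "set.offline.by.SSB.dummy",
    "set.brokeroff.by.SSB", "set.brokeroff.by.SSB.dummy",
    "set.online.by.SSB", "set.online.by.SSB.dummy",
    "HC.Test.Me", "HC.Test.Me.dummy",
]


def phase_space():
    """Yield every (queue type, prefix, state, comment) combination."""
    for queue_type in FAKE_QUEUE_TYPES:
        # Prefixes depend on the queue type, so the product is taken per type.
        for prefix, state, comment in product(
            FAKE_QUEUE_PREFIXES[queue_type], FAKE_STATES, FAKE_COMMENTS
        ):
            yield (queue_type, prefix, state, comment)


combos = list(phase_space())
print(len(combos))  # 3 types x 2 prefixes x 4 states x 10 comments = 240
```

The “.dummy” variants of each comment are useful because they check that the tool matches comments exactly and not by prefix.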

  8. Testing the exclusion idea - 2
     • “Dashboard” with the timeline for each queue class from the phase space
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.html
     • Log with detailed actions described
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.log
     • Test downtimes:
       • SRM: from 2012-02-05 23:30 UTC to 2012-02-06 02:00 UTC
       • SRM: from 2012-02-06 04:30 UTC to 2012-02-06 06:00 UTC
       • SRM: from 2012-02-07 04:30 UTC to 2012-02-07 06:00 UTC
       • CE: for each queue from 2012-02-06 8am to 9am UTC
     • The exclusion algorithm does what is expected and when it is expected!
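To check a timeline like the one on the dashboard, it helps to compute when each action should first become due for a given downtime start. The sketch below applies the lead times from the exclusion rules (12 h for production, 6 h and 2 h for analysis) to the first test SRM downtime. The `expected_thresholds` helper and its dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Lead times taken from the exclusion rules for production and analysis queues.
PROD_OFFLINE_LEAD = timedelta(hours=12)
ANALY_BROKEROFF_LEAD = timedelta(hours=6)
ANALY_OFFLINE_LEAD = timedelta(hours=2)


def expected_thresholds(downtime_start):
    """Return the earliest time at which each exclusion action becomes due."""
    return {
        "production_setoffline": downtime_start - PROD_OFFLINE_LEAD,
        "analysis_setbrokeroff": downtime_start - ANALY_BROKEROFF_LEAD,
        "analysis_setoffline": downtime_start - ANALY_OFFLINE_LEAD,
    }


# First test SRM downtime from the slide: starts 2012-02-05 23:30 UTC.
start = datetime(2012, 2, 5, 23, 30, tzinfo=timezone.utc)
thresholds = expected_thresholds(start)
for action, when in thresholds.items():
    print(action, when.isoformat())
```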

  9. Real actions
     • After thorough testing and improving log debugging features for operations
     • We started taking real actions for several queues
       • https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/33952
     • The exclusion tool does what is expected and when it is expected!
       • Tested with ifae and UKI-SCOTGRID-DURHAM, which have downtimes today.
       • Next in the pipeline is SFU-LCG2.

  10. Operational experience - 1
     • Every action is logged, so it is easier to debug what went wrong if a problem occurs.
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher.log
     • Found a few minor issues on the way
       • Fetched only future downtimes from AGIS.
         • Fixed. Now fetching ongoing and future downtimes.
       • Disabled all real queues for the past night
         • Fixed. Now all queues from elog:33952 are enabled again.
     • The exclusion tool takes only actions we intend it to take!
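The first issue listed on this slide comes down to which timestamp the downtime filter compares against. A minimal sketch, assuming each downtime is a record with hypothetical `start` and `end` fields (this is not the AGIS schema): filtering on the end time keeps ongoing downtimes, whereas filtering on the start time drops them.

```python
from datetime import datetime, timezone


def future_only(downtimes, now):
    """The behaviour described as the bug: ongoing downtimes are dropped."""
    return [d for d in downtimes if d["start"] > now]


def ongoing_and_future(downtimes, now):
    """The behaviour after the fix: keep every downtime that has not ended."""
    return [d for d in downtimes if d["end"] > now]


utc = timezone.utc
now = datetime(2012, 2, 6, 5, 0, tzinfo=utc)
downtimes = [
    {"site": "past", "start": datetime(2012, 2, 5, 23, 30, tzinfo=utc),
     "end": datetime(2012, 2, 6, 2, 0, tzinfo=utc)},
    {"site": "ongoing", "start": datetime(2012, 2, 6, 4, 30, tzinfo=utc),
     "end": datetime(2012, 2, 6, 6, 0, tzinfo=utc)},
    {"site": "future", "start": datetime(2012, 2, 7, 4, 30, tzinfo=utc),
     "end": datetime(2012, 2, 7, 6, 0, tzinfo=utc)},
]
print([d["site"] for d in future_only(downtimes, now)])
print([d["site"] for d in ongoing_and_future(downtimes, now)])
```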

  11. Operational experience - 2
     • Found a few minor issues on the way
       • Fetched only future downtimes from AGIS.
         • Fixed. Now fetching ongoing and future downtimes.
       • Disabled all real queues for the past night
         • Fixed. Now all queues from elog:33952 are enabled again.
     • The exclusion tool takes only actions we intend it to take!

  12. Summary
     • Using ATLAS site topology
       • http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
     • First real exclusions and recoveries successful!
     • Next steps:
       • Add more queues to real actions
       • Add more configurability (now: system-wide)
     Questions? atlas-adc-ssb-devs@cern.ch
