
Wuppertal Post Mortem



Presentation Transcript


  1. Wuppertal Post Mortem ATLAS FAX Meeting November 4, 2013 Andrew Hanushevsky, SLAC http://xrootd.org

  2. The Issues • Files frequently cannot be found in Wuppertal • The problem seems to extend to other DE sites • This appears to be an endpoint issue • The endpoint cmsd does not respond in time • The popularized redirector restart is not a solution • A restart never solves endpoint problems (see the lookup check after this slide)
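A quick way to reproduce the lookup failure by hand is to query the endpoint directly with the XRootD client tools. This is only a sketch; the redirector and proxy host names and the file path below are placeholders, not taken from the slides.

    # Ask the regional redirector to locate a file (placeholder host and path)
    xrdfs atlas-xrd-de.example.org:1094 locate /atlas/rucio/some/dataset/file.root

    # Or stat the same file through the Wuppertal proxy endpoint itself;
    # a slow or stalled cmsd typically shows up here as a long hang or a timeout
    xrdfs wuppertal-proxy.example.org:1094 stat /atlas/rucio/some/dataset/file.root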

  3. Endpoint Problems I • Large delays in LFC communications • Log notes communication delays of minutes • Exceeds timeout when locating a file • Leads to monitoring exception • Under-provisioned VM • VM bumps against resource limit and stalls • Log indicates memory, file descriptor, and thread limits • Various limits reached at various times • Exacerbated by long LFC communication delays • Excessive queuing inside cmsd and xrootd (see the limit check after this slide)
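When an under-provisioned VM is suspected, the configured limits and the current usage of the cmsd and xrootd processes can be compared directly from /proc. A sketch, assuming a single cmsd instance on the host:

    # Find the cmsd process and inspect the limits actually in effect
    pid=$(pgrep -o cmsd)
    cat /proc/$pid/limits                    # soft/hard limits for the running process

    # Compare current usage against those limits
    ls /proc/$pid/fd | wc -l                 # open file descriptors vs. "Max open files"
    grep Threads /proc/$pid/status           # thread count vs. "Max processes"
    grep VmRSS   /proc/$pid/status           # resident memory on the VM

    # Repeat with pgrep -o xrootd for the proxy server itself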

  4. Endpoint Problems II • Possible VM hypervisor issues • Use of VM's in an I/O-heavy workload is problematic • Proxy server is essentially all I/O • Some hypervisors behave badly • E.g. “… under XEN environment; rapidly oscillating bottleneck between CPU and disk I/O limits system scalability and the I/O utilization indicates that I/O jobs are batched.” • http://www.cc.gatech.edu/systems/projects/Elba/pub/jpark_5982_SCC13.pdf • XEN is used in Wuppertal (generally in DE?) • Additional network I/O performance issues noted… • http://www.vmware.com/pdf/hypervisor_performance.pdf

  5. Mitigations & Suggestions • Over provision VM’s, minimally • 4GB RAM (may need muchmore in SL6) • Set large file descriptor limit • ulimit –n 16000 (may need hard limit increase) • Set large thread limit • ulimit –u 2048 (may need hard limit increase) • Disable proxy cache • pss.setoptReadCacheSize 0 • Avoid SL6 malloc issue (cmsd & xrootd) • MALLOC_ARENA_MAX=4 • Or use tcmalloc or jemalloc (via LD_PRELOAD) • Ideally, run proxy server on real hardware • I/O intensive production workload in a VM, really?

  6. Other Problems I • Monitoring granularity may lead to panic • A 1-hour granularity yields a misleading display • A failure at report capture shows red for a long time • Difficult to diagnose what to look at • This is a real time-waster • We really need a real-time display • Or a different metric • Color indicating the % of failures • The test should run much more often • No need to actually copy, an open will suffice (see the probe sketch after this slide)
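One way the lighter, more frequent test could look: a periodic metadata check against a known file instead of a full copy, with the result fed to whatever display is used. This is a sketch only; the host, path, and scheduling are placeholders, and a stat stands in here for a lightweight open.

    # Run every few minutes from cron; success/failure goes to the monitoring feed
    if xrdfs wuppertal-proxy.example.org:1094 stat /atlas/rucio/known/test/file.root >/dev/null 2>&1
    then
        echo "$(date -u +%FT%TZ) OK"
    else
        echo "$(date -u +%FT%TZ) FAIL"
    fi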

  7. Other Problems II • LFC is in a critical path • That it will never scale is a significant problem • Fortunately, a mitigation is on the way • Only a mitigation due to space token usage • Requires multiple FS lookups • Likely to hit dCache more often than not • Multiple redirector VM's on the same physical hardware • These are load-finicky Microsoft VM's • My personal observation • Single hardware/hypervisor glitch -> multiple failures

  8. Recommendations • Require minimal provisioning • Resource limits • VM requirements • Strongly recommend real hardware • Revisit monitoring page • Develop a real-time display • Move redirectors to affected regions
