
Wuppertal Post Mortem



Presentation Transcript


  1. Wuppertal Post Mortem ATLAS FAX Meeting November 4, 2013 Andrew Hanushevsky, SLAC http://xrootd.org

  2. The Issues • Files frequently cannot be found in Wuppertal • The problem seems to extend to other DE sites • This appears to be an endpoint issue • The endpoint cmsd does not respond in time • The popularized redirector restart is not a solution • A restart never solves endpoint problems (see the lookup check after this slide)
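A quick way to reproduce the lookup failure by hand is to query the endpoint directly with the XRootD client tools. This is only a sketch; the redirector and proxy host names and the file path below are placeholders, not taken from the slides.

    # Ask the regional redirector to locate a file (placeholder host and path)
    xrdfs atlas-xrd-de.example.org:1094 locate /atlas/rucio/some/dataset/file.root

    # Or stat the same file through the Wuppertal proxy endpoint itself;
    # a slow or stalled cmsd typically shows up here as a long hang or a timeout
    xrdfs wuppertal-proxy.example.org:1094 stat /atlas/rucio/some/dataset/file.root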

  3. Endpoint Problems I • Large delays in LFC communications • Log notes communication delays of minutes • Exceeds timeout when locating a file • Leads to monitoring exception • Under-provisioned VM • VM bumps against resource limit and stalls • Log indicates memory, file descriptor, and thread limits • Various limits reached at various times • Exacerbated by long LFC communication delays • Excessive queuing inside cmsd and xrootd (see the limit check after this slide)
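When an under-provisioned VM is suspected, the configured limits and the current usage of the cmsd and xrootd processes can be compared directly from /proc. A sketch, assuming a single cmsd instance on the host:

    # Find the cmsd process and inspect the limits actually in effect
    pid=$(pgrep -o cmsd)
    cat /proc/$pid/limits                    # soft/hard limits for the running process

    # Compare current usage against those limits
    ls /proc/$pid/fd | wc -l                 # open file descriptors vs. "Max open files"
    grep Threads /proc/$pid/status           # thread count vs. "Max processes"
    grep VmRSS   /proc/$pid/status           # resident memory on the VM

    # Repeat with pgrep -o xrootd for the proxy server itself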

  4. Endpoint Problems II • Possible VM hypervisor issues • Use of VM's in an I/O-heavy workload is problematic • Proxy server is essentially all I/O • Some hypervisors behave badly • E.g. “… under XEN environment; rapidly oscillating bottleneck between CPU and disk I/O limits system scalability and the I/O utilization indicates that I/O jobs are batched.” • http://www.cc.gatech.edu/systems/projects/Elba/pub/jpark_5982_SCC13.pdf • XEN is used in Wuppertal (generally in DE?) • Additional network I/O performance issues noted… • http://www.vmware.com/pdf/hypervisor_performance.pdf

  5. Mitigations & Suggestions • Over provision VM’s, minimally • 4GB RAM (may need muchmore in SL6) • Set large file descriptor limit • ulimit –n 16000 (may need hard limit increase) • Set large thread limit • ulimit –u 2048 (may need hard limit increase) • Disable proxy cache • pss.setoptReadCacheSize 0 • Avoid SL6 malloc issue (cmsd & xrootd) • MALLOC_ARENA_MAX=4 • Or use tcmalloc or jemalloc (via LD_PRELOAD) • Ideally, run proxy server on real hardware • I/O intensive production workload in a VM, really?

  6. Other Problems I • Monitoring granularity may lead to panic • A 1-hour granularity yields a misleading display • A failure at report capture shows red for a long time • Difficult to diagnose what to look at • This is a real time-waster • We really need a real-time display • Or a different metric • Color indicating the % of failures • The test should run much more often • No need to actually copy, an open will suffice (see the probe sketch after this slide)
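One way the lighter, more frequent test could look: a periodic metadata check against a known file instead of a full copy, with the result fed to whatever display is used. This is a sketch only; the host, path, and scheduling are placeholders, and a stat stands in here for a lightweight open.

    # Run every few minutes from cron; success/failure goes to the monitoring feed
    if xrdfs wuppertal-proxy.example.org:1094 stat /atlas/rucio/known/test/file.root >/dev/null 2>&1
    then
        echo "$(date -u +%FT%TZ) OK"
    else
        echo "$(date -u +%FT%TZ) FAIL"
    fi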

  7. Other Problems II • LFC is in a critical path • That it will never scale is a significant problem • Fortunately, a mitigation is on the way • Only a mitigation due to space token usage • Requires multiple FS lookups • Likely to hit dCache more often than not • Multiple redirector VM's on the same physical hardware • These are load-finicky Microsoft VM's • My personal observation • Single hardware/hypervisor glitch -> multiple failures

  8. Recommendations • Require minimal provisioning • Resource limits • VM requirements • Strongly recommend real hardware • Revisit monitoring page • Develop a real-time display • Move redirectors to affected regions
