This case study explores persistent multicast problems encountered by high-performance network applications in the HAVnet project, as seen from the University of Wisconsin - La Crosse. It details multicast setup delays and failures observed while applying high-performance networks to anatomical and surgical education. It covers the debugging methods used, the engagement with network operators and vendor support, and the troubleshooting that resolved some of the problems, along with the multicast reliability problems that remain open.
Case Study: Debugging Multicast Problems from an Applications Perspective • Steven Senger, Ph.D. • Dept. of Computer Science • University of Wisconsin - La Crosse
HAVnet Project • Parvati Dev, PI, Stanford SUMMIT • National Library of Medicine, NGI & SII programs since 1999. • Applications of high-performance networks to anatomical and surgical education. • http://havnet.stanford.edu • http://visu.uwlax.edu
Other Apps and Components • Information Channels • Multicast based announcement/discovery mechanism. • Supports other app requirements such as logging. • Access Grid
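A multicast-based announcement/discovery mechanism of the kind listed above can be sketched roughly as follows. This is a minimal illustration, not the HAVnet Information Channels implementation: the group address `239.1.2.3`, port `5007`, the JSON message layout, and all function names are assumptions made for the example.

```python
import json
import socket

# Hypothetical group and port, chosen only for illustration.
GROUP = "239.1.2.3"
PORT = 5007


def encode_announcement(name, address, port):
    """Serialize a service announcement as UTF-8 JSON (assumed format)."""
    message = {"name": name, "address": address, "port": port}
    return json.dumps(message).encode("utf-8")


def decode_announcement(payload):
    """Parse an announcement produced by encode_announcement."""
    return json.loads(payload.decode("utf-8"))


def announce(payload, group=GROUP, port=PORT, ttl=16):
    """Send one announcement datagram to the multicast group."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        # The TTL bounds how many router hops the announcement may cross.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))


def open_listener(group=GROUP, port=PORT):
    """Return a UDP socket that has joined the group on the default interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: 4-byte group address followed by 4-byte interface address.
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

A receiver would call `open_listener()` and pass each datagram from `recvfrom` to `decode_announcement`. Because discovery depends on the group join succeeding end to end, this kind of mechanism is itself exposed to the setup delays described in the following slides.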
Potholes Along the Way • Stanford / CENIC: multicast setup delay. • WiscNet: conflict between sender and receiver. • Michigan / Merit: multicast setup delay; inbound flow stops after 209 secs.
Stanford / CENIC … • Longstanding problem (observed in '01). • Large delays (~15 min) in multicast setup. • Tested among Stanford, La Crosse, and NLM: significant delays except between La Crosse and NLM. • Originally thought to be at the Stanford border and RP. • '04 hardware/IOS upgrades at Stanford. • Situation improved.
Stanford / CENIC … • Only Michigan to Stanford delayed, ~6 min. • Oct '04: phone calls among Stanford, CENIC, vendor support, and La Crosse. Escalated through 3 layers of vendor support. • Test/debug sessions every couple of weeks through March '05. • Identified as an MSDP propagation delay related to encapsulation/decapsulation of data received by MSDP.
Stanford / CENIC • Delay occurred at each CENIC router. • At some point problem had been internally found and resolved by vendor. • Solution: upgrade OS on CENIC routers.
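From the application side, a setup delay like the ones above can be quantified by timing the gap between joining a group and receiving the first datagram. The sketch below is one way to do that, under assumptions: the function names are hypothetical, the join uses the default interface, and the sender is taken to be already transmitting when the receiver joins.

```python
import socket
import time


def join_group(group, port):
    """Join a multicast group and return (socket, join timestamp)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock, time.monotonic()


def first_packet_delay(sock, joined_at, timeout_s, clock=time.monotonic):
    """Seconds from the join until the first datagram, or None on timeout."""
    sock.settimeout(timeout_s)
    try:
        sock.recvfrom(65535)
    except socket.timeout:
        return None
    return clock() - joined_at
```

With a timeout comfortably above the worst delay seen (for example 20 minutes for the ~15 min case), repeated runs from each site give a per-path delay figure that can be compared before and after a router upgrade.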
La Crosse / WiscNet … • First observed spring '05 using Access Grid. • La Crosse sender and Stanford receiver OK. • Starting a La Crosse receiver breaks the flow. • WiscNet identified problem router. • Vendor support engaged. • Discovered that an rpd restart is sufficient to fix. • Recurs every 2 months.
La Crosse / WiscNet … • When failing: upstream interface on router gets set to unreasonable value; sender continues to send data in encapsulated PIM-register messages; router never sends register-stop messages.
La Crosse / WiscNet • Problem has survived router chassis upgrade. • No solution as yet.
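The failure signature described for WiscNet (PIM Register messages that are never answered by a Register-Stop) can be checked mechanically against captured PIM traffic. The sketch below assumes the input is a list of raw PIM payloads, each beginning at the PIM header, already extracted from a capture by some other tool; the function names are hypothetical. It relies only on the PIMv2 header layout, in which the first byte carries the version in the high four bits and the message type in the low four bits, with type 1 for Register and type 2 for Register-Stop.

```python
from collections import Counter

PIM_VERSION = 2
PIM_REGISTER = 1
PIM_REGISTER_STOP = 2


def pim_type(pim_payload):
    """Return the PIM message type from a payload starting at the PIM header."""
    if len(pim_payload) < 4:
        raise ValueError("PIM header is 4 bytes; payload too short")
    version = pim_payload[0] >> 4
    if version != PIM_VERSION:
        raise ValueError(f"unexpected PIM version {version}")
    return pim_payload[0] & 0x0F


def registers_unanswered(payloads):
    """True if the capture holds Register messages but no Register-Stop."""
    counts = Counter(pim_type(p) for p in payloads)
    return counts[PIM_REGISTER] > 0 and counts[PIM_REGISTER_STOP] == 0
```

Run over a capture window of a minute or more, a `True` result matches the symptom on the slide and is a cue to check the router state before the periodic restart masks it.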
U. Michigan / Merit … • Discovered after CENIC problem solved. • Small delay in setup for Michigan to Stanford. • Varies between 0 and 60 sec. • Similar behavior for Milwaukee to Stanford. • Does not appear to be in CENIC?
U. Michigan / Merit … • Presence of other receivers seems to change the setup delay. • Merit engaged in isolating problem. • No solution as yet.
U. Michigan / Merit • Discovered Jan '06 using Access Grid. • Traffic from Stanford to MCBI/Merit starts correctly but stops after 208 seconds. • When stopped, IPLSng shows as pruned. • Merit identified a problem with a switch in Chicago not allowing streams to set up correctly. • Problem resolved with OS upgrade.
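A flow that starts correctly and then stops at a repeatable offset, as in the 208-second case above, is easier to report when the offset is measured rather than eyeballed. The following sketch computes that offset from a list of packet arrival timestamps; the function name, the 5-second gap threshold, and the idea of logging arrivals in the receiving application are all assumptions for illustration.

```python
def stall_offset(arrivals, observed_until, max_gap_s=5.0):
    """Seconds from the first packet to the last packet before a stall.

    arrivals: increasing packet arrival times in seconds.
    observed_until: time at which observation ended.
    Returns None if no gap longer than max_gap_s occurred.
    """
    if not arrivals:
        return None
    # Treat the end of observation as a sentinel so that a flow which
    # stops and never resumes is still detected.
    points = list(arrivals) + [observed_until]
    for prev, cur in zip(points, points[1:]):
        if cur - prev > max_gap_s:
            return prev - points[0]
    return None
```

An offset that comes out the same across repeated runs suggests a timer expiring somewhere along the path rather than random loss, which narrows what to ask the network operator to look at.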
Diagnostic Help • Debugging strategies • Tools • Monitoring