
Network-aware OS

Network-aware OS. DOE/MICS Project Review August 18, 2003. Tom Dunigan thd@ornl.gov Matt Mathis mathis@psc.edu Brian Tierney bltierney@lbl.gov. Roadmap. www.net100.org. Motivation & Background Net100 project components Web100 network probes & sensors


Presentation Transcript


  1. Network-aware OS DOE/MICS Project Review August 18, 2003 Tom Dunigan thd@ornl.gov Matt Mathis mathis@psc.edu Brian Tierney bltierney@lbl.gov

  2. Roadmap www.net100.org • Motivation & Background • Net100 project components • Web100 • network probes & sensors • protocol analysis and tuning • Results • TCP tuning daemon • Tuning experiments • Ongoing & future research • DOE-funded project (Office of Science) • $2.6M, 3 yrs beginning 9/01 • LBNL, ORNL, PSC, NCAR • Net100 project objectives: (network-aware operating systems) • measure, understand, and improve end-to-end network/application performance • tune network protocols and applications (grid and bulk transfer) • emphasis: TCP bulk transfer over high delay/bandwidth nets

  3. Motivation • Poor network application performance • High bandwidth paths, but apps slow • Is it application? OS? network? … Yes • Often need a network “wizard” • Changing: bandwidths • 9.6 Kbs … 1.5 Mbs … 45 … 100 … 1000 … ? Gbs • Unchanging: TCP • speed of light (RTT) • packet size (MSS/MTU) still 1500 bytes • TCP congestion control • TCP is lossy by design! • 2x overshoot at startup, sawtooth • Recovery proportional to MSS/RTT² • recovery after a loss can be very slow on today’s high delay/bandwidth links -- unacceptable on tomorrow’s links: • 10 Gbs cross country: recovery time > 1 hr!! [Figure: ORNL to NERSC ftp, GigE/OC12 (600 Mbs), 80 ms RTT. Plot labels: instantaneous bandwidth, average bandwidth, early startup losses, 8 Mbs, 40 seconds, linear recovery at 0.5 Mb/s.]
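
The "recovery time > 1 hr" figure on this slide can be checked with a short calculation. The sketch below assumes textbook AIMD (cwnd halved on loss, then one segment added per RTT), a 100 ms cross-country RTT, and 1500-byte segments. The RTT value and the function names are assumptions made for illustration, and effects such as delayed ACKs are ignored.

```python
def aimd_growth_bps_per_s(rtt_s, mss_bytes=1500):
    """Rate at which throughput climbs during additive increase.

    One extra segment per RTT adds 8*MSS/RTT bits/s of throughput every
    RTT seconds, i.e. 8*MSS/RTT^2 bits/s per second.
    """
    return 8 * mss_bytes / rtt_s ** 2


def aimd_recovery_seconds(bandwidth_bps, rtt_s, mss_bytes=1500):
    """Time for standard TCP to regain full rate after a single loss.

    cwnd is halved, then grows by one segment per RTT, so recovery
    takes (cwnd / 2) round trips.
    """
    cwnd_segments = bandwidth_bps * rtt_s / (8 * mss_bytes)
    return (cwnd_segments / 2) * rtt_s


if __name__ == "__main__":
    # Assumed path: 10 Gb/s, 100 ms RTT (not stated on the slide).
    seconds = aimd_recovery_seconds(10e9, 0.100)
    print(f"10 Gb/s, 100 ms RTT: about {seconds / 60:.0f} minutes to recover")
    print(f"growth at 80 ms RTT: {aimd_growth_bps_per_s(0.080) / 1e6:.2f} Mb/s per second")
```

Under these assumptions the 10 Gb/s case comes out a little over an hour, in line with the slide's claim, while an OC12 path at 80 ms recovers in a few minutes.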

  4. TCP 101 • adaptable and fair • flow-controlled by sender/receiver buffer sizes • self-clocking with positive ACK’s of in-sequence data • sensitive to packet size (MTU) and RTT • slow start -- +1 packet per each packet ACK’d (exponential) • congestion window (cwnd)-- max packets that can be in flight • packet loss: 3 dup ACKs or timeout (AIMD) • cut cwnd in half (Multiplicative Decrease) • add 1 packet to cwnd per RTT (Additive Increase) • Workarounds: • parallel streams • non-TCP (UDP) applications • Net100 (no changes to applications)
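
The slow-start and AIMD rules listed on this slide can be sketched as a per-RTT simulation of the congestion window. This is a simplified model, not the Linux implementation: it works in whole RTTs, treats every loss as a fast-retransmit event (no timeouts), and the function and parameter names are invented for illustration.

```python
def simulate_cwnd(rounds, loss_rounds, ssthresh=64.0, ai=1.0, md=0.5):
    """Return cwnd (in segments) at the start of each RTT.

    loss_rounds: set of RTT indices in which a loss is detected
    ai: segments added per RTT in congestion avoidance
    md: fraction of cwnd removed on loss (0.5 means cut in half)
    """
    cwnd = 1.0
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease, then continue in congestion avoidance.
            cwnd = max(1.0, cwnd * (1.0 - md))
            ssthresh = cwnd
        elif cwnd < ssthresh:
            # Slow start: +1 segment per ACK doubles cwnd every RTT.
            cwnd = min(cwnd * 2.0, ssthresh)
        else:
            # Additive increase: +ai segments per RTT.
            cwnd += ai
    return trace


if __name__ == "__main__":
    for rtt, cwnd in enumerate(simulate_cwnd(10, {6}, ssthresh=16.0)):
        print(rtt, cwnd)
```

The `ai` and `md` arguments correspond to the (MD, AI) pairs quoted on later slides, for example (0.5, 1) for standard TCP.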

  5. Net100 components • Web100 Linux kernel (NSF) • instrumented TCP stack (IETF MIB draft) • Path characterization • Network Tool Analysis Framework (NTAF) • both active and passive measurement tools • database of measurements • TCP protocol analysis and tuning • simulation/emulation • ns • TCP-over-UDP (atou) • NISTNet • kernel tuning extensions • tuning daemon

  6. Web100 • NSF funded (PSC/NCAR/NCSA) web100.org • Modified Linux kernel • instrumented kernel to read/set TCP variables for a specific flow • readable: RTT, counts (bytes, pkts, retransmits, dups), state (SACKs, windowscale, cwnd, ssthresh) • settable: buffer sizes • 100+ TCP variables (IETF MIB) (/proc/web100/) • GUI to display/modify a flow’s TCP variables, real-time • API for network-aware applications or tuning daemon • Net100 extensions: • additional tuning variables and algorithms • event notification • Java bandwidth tester http://firebird.ccs.ornl.gov:7123

  7. Network Tool Analysis Framework (NTAF) • Configure and launch network tools • measure bandwidth/latency (iperf, pchar, pipechar) • augment tools to report Web100 data • Collect and transform tool results • use NetLogger to transform results into a common format • Save results for short-term auto-tuning and archive for later analysis • compare predicted to actual performance • measure effectiveness of tools and auto-tuning • provide data that can be used to predict future performance • invaluable for comparing tools (pathload/pchar/netest) • Net100 hosts at: LBNL, ORNL, PSC, NCAR, NERSC, SLAC, UT, CERN, Amsterdam, ANL

  8. TCP flow visualization - Web interface for data archive and visualization

  9. Monitoring Tool Comparison

  10. TCP tuning • “enable” high speed • need buffer = bandwidth*RTT - autotune • ORNL/NERSC (80 ms, OC12) need 6 MB • faster slow-start • avoid losses • modified slow-start • reduce bursts • anticipate loss (ECN, Vegas?) • reorder threshold • speed recovery • bigger MTU or “virtual MSS” • modified AIMD (0.5,1) (Floyd, Kelly) • delayed ACKs, initial window, slow-start increment • avoid congestion collapse, be fair (?) … intranets, QoS • Net100: ns simulation, NISTNet emulation, “almost TCP over UDP” (atou), WAD/Internet [Figure: ns simulation, 500 Mbs link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover!]
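
The "need 6 MB" figure follows from buffer = bandwidth*RTT. The sketch below computes it for a nominal OC12 rate of 622.08 Mb/s and shows the standard socket options an application would otherwise have to set by hand, which is the step Net100 aims to take off the application's plate. The helper names are illustrative, and the kernel is free to clamp or adjust whatever size is requested.

```python
import socket

OC12_BPS = 622.08e6  # nominal OC-12 line rate in bits per second


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return int(bandwidth_bps * rtt_s / 8)


def request_socket_buffers(sock, nbytes):
    """Ask for nbytes of send and receive buffer; return what was granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


if __name__ == "__main__":
    need = bdp_bytes(OC12_BPS, 0.080)
    print(f"buffer needed for OC12 at 80 ms: {need / 1e6:.1f} MB")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("granted (snd, rcv):", request_socket_buffers(s, need))
```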

  11. TCP Tuning Daemon • Work-around Daemon (WAD) • tune unknowing sender/receiver at startup and/or during flow • Web100 kernel extensions • pre-set windowscale to allow dynamic tuning • uses netlink to alert daemon of socket open/close (or poll) • besides existing Web100 buffer tuning, new tuning parameters and algorithms • knobs to disable Linux 2.4 caching, burst mgt., and sendstall • config file with static tuning data • mode specifies dynamic tuning (AIMD options, NTAF buffer size, concurrent streams) • daemon periodically polls NTAF for fresh tuning data • can do out-of-kernel tuning (e.g., Floyd) • written in C (also Python version) • WAD config file: [bob] src_addr: 0.0.0.0 src_port: 0 dst_addr: 10.5.128.74 dst_port: 0 mode: 1 sndbuf: 2000000 rcvbuf: 100000 wadai: 6 wadmd: 0.3 maxssth: 100 divide: 1 reorder: 9 sendstall: 0 delack: 0 floyd: 1 kellyai: 0
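
The config entry on this slide has an INI-like shape: a bracketed name followed by `key: value` pairs. As an illustration only, the sketch below reads that entry with Python's standard `configparser`. It assumes the file holds one key per line, which the flattened slide text does not confirm, and it is not the WAD's own parser; `load_wad_entries` and `_coerce` are made-up names.

```python
import configparser

# The [bob] entry from the slide, laid out one key per line (assumed layout).
SAMPLE = """\
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
kellyai: 0
"""


def _coerce(raw):
    """Convert a value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def load_wad_entries(text):
    """Parse config text into {entry_name: {key: value}}."""
    parser = configparser.ConfigParser(delimiters=(":",))
    parser.read_string(text)
    return {
        name: {key: _coerce(raw) for key, raw in parser[name].items()}
        for name in parser.sections()
    }


if __name__ == "__main__":
    for name, entry in load_wad_entries(SAMPLE).items():
        print(name, entry)
```

A daemon built this way could match each new flow's destination against `dst_addr` and apply the matching entry's buffer and AIMD settings.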

  12. Experimental results • Evaluating the tuning daemon in the wild • emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet) • tests over: 10GigE/OC192, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup • tests over NISTNet testbed (speed, loss, delay) • Various TCP tuning options • buffer tuning (static and dynamic/NTAF) • AIMD mods (including Floyd, Kelly, static, virtual MSS, and autotuning) • slow-start mods • parallel streams vs single tuned NISTNet host

  13. Buffer tuning • Classic buffer tuning • network-challenged app. gets 10 Mbs • same app., WAD/NTAF tuned buffer gets 143 Mbs (ORNL to PSC, OC12, 80 ms RTT) • Autotuning buffers (kernel) • Linux 2.4, Feng’s Dynamic Right Sizing • Net100 autotuning • receiver estimates RTT • receiver advertises window 2 times data recv’d in RTT • buffer size grows dynamically to 2x bandwidth*RTT • separate application buffers from kernel buffers [Figure: ORNL to PSC, OC192, 30 ms RTT]
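
The receiver-side autotuning rule on this slide (advertise a window twice the data received in the last RTT) can be sketched as follows. This is a toy model rather than the Net100 kernel code: the class name, the grow-only behaviour, and the initial and maximum window sizes are all assumptions.

```python
class ReceiverAutotuner:
    """Grow the advertised window to twice the data seen per RTT."""

    def __init__(self, initial_window=65536, max_window=16 * 1024 * 1024):
        self.window = initial_window
        self.max_window = max_window

    def on_rtt_elapsed(self, bytes_received):
        """Update the window after one estimated RTT; return the new window."""
        target = 2 * bytes_received
        if target > self.window:  # assumed: the window only grows
            self.window = min(target, self.max_window)
        return self.window


def simulate(bdp_bytes, rounds=20):
    """Sender delivers min(window, BDP) bytes per RTT; return the final window."""
    tuner = ReceiverAutotuner()
    window = tuner.window
    for _ in range(rounds):
        delivered = min(window, bdp_bytes)
        window = tuner.on_rtt_elapsed(delivered)
    return window


if __name__ == "__main__":
    print("final window:", simulate(6_000_000), "bytes")
```

In this model the window doubles each RTT while the sender is window-limited and settles at 2x bandwidth*RTT once the path is full, matching the slide's description.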

  14. Speeding recovery • Virtual MSS • tune TCP’s additive increase (WAD_AI) • add k segments per RTT during recovery • k=6 like GigE jumbo frame, but: • interrupt rate not reduced • doesn’t do k segments for initial window • Selectable TCP AIMD algorithms: • Floyd HS TCP: as cwnd grows, increase AI and decrease MD; do the reverse when cwnd shrinks • Kelly scalable TCP: use MD of 1/8 instead of 1/2 and add a percentage of cwnd (e.g., 1%) each RTT [Figure: Amsterdam-Chicago GigE via 10GigE, 100 ms RTT; label: UDP burst]
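
To compare the recovery options on this slide, the sketch below counts the RTTs needed to climb back to the pre-loss window under three rules: standard AIMD, virtual MSS with k=6, and Kelly's scalable TCP (MD 1/8, +1% of cwnd per RTT). It is a simplified per-RTT model with invented names. The 4000-segment window is an assumed round number close to an OC12 path at 80 ms, and Floyd's HS TCP is left out because its response table is not given on the slide.

```python
def rounds_to_recover(cwnd_before_loss, decrease, increase):
    """Count RTTs until cwnd returns to its pre-loss value.

    decrease: fraction of cwnd removed when the loss is detected
    increase: function mapping current cwnd to segments added per RTT
    """
    cwnd = cwnd_before_loss * (1.0 - decrease)
    rounds = 0
    while cwnd < cwnd_before_loss:
        cwnd += increase(cwnd)
        rounds += 1
    return rounds


if __name__ == "__main__":
    w = 4000.0  # segments in flight before the loss (assumed)
    print("standard (0.5, 1):  ", rounds_to_recover(w, 0.5, lambda c: 1.0))
    print("virtual MSS, k=6:   ", rounds_to_recover(w, 0.5, lambda c: 6.0))
    print("scalable (1/8, 1%): ", rounds_to_recover(w, 0.125, lambda c: 0.01 * c))
```

Multiplying each count by the RTT gives the recovery time, which shows why a larger additive increase or a gentler decrease matters most on long, fast paths.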

  15. WAD tuning • Modified slow-start and AI • often losses in slow-start • WAD-tuned Floyd slow-start and fixed AI (6) [Figure: ORNL to NERSC, OC12, 80 ms RTT] • WAD-tuned AIMD and slow-start • parallel streams AIMD (1/(2k),k) • exploit TCP’s fairness • WAD-tuned single stream (0.125,4) • WAD-tuned single stream (0.125,4) + Floyd slow-start [Figure: ORNL to CERN, OC12, 150 ms RTT]

  16. Workaround: parallel streams • Takes advantage of TCP’s fairness • Faster startup, k buffers • faster recovery • often only 1 stream loses a packet • MD: 1/(2k) rather than 1/2 • AI: k times faster linear phase • BUT • requires rewrite of applications • how many streams? Buffer size? • GridFTP, bbftp, psocket lib [Figure labels: Alice and Bob sharing; Clever Alice -- 3 streams; Bad girl ...]
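
The MD and AI figures for k parallel streams can be derived by looking at the aggregate window when only one stream takes a loss. The sketch below does that arithmetic; the function names are illustrative and the streams are assumed to share the window equally.

```python
def aggregate_after_single_loss(total_cwnd, k):
    """Aggregate cwnd of k equal streams after one stream halves its window."""
    per_stream = total_cwnd / k
    streams = [per_stream] * k
    streams[0] /= 2  # only one stream sees the loss
    return sum(streams)


def equivalent_aimd(k):
    """(MD, AI) a single stream would need to mimic k parallel streams."""
    return 1 / (2 * k), k


if __name__ == "__main__":
    total, k = 4000.0, 4
    after = aggregate_after_single_loss(total, k)
    print("aggregate decrease fraction:", (total - after) / total)
    print("equivalent single-stream (MD, AI):", equivalent_aimd(k))
```

For k=4 this gives (0.125, 4), the same pair quoted for the WAD-tuned single stream on the WAD tuning slide.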

  17. GridFTP tuning • Can a tuned single stream compete with parallel streams? • Mostly not with “equivalence” tuning, but sometimes… • Parallel streams have a slow-start advantage. • WAD can divide buffer among concurrent flows: fairer/faster? Tests inconclusive so far… • Testing on the real Internet is problematic. • Is there a “congestion metric”? Per unit of time? • Buffers: 64K I/O, 4MB TCP • Data/plots from Web100 tracer

      Flow       Mbs   congestion   re-xmits
      untuned     28        4           30
      tuned       74        5          295
      parallel    52       30          401

      untuned     25        7           25
      tuned       67        2          420
      parallel    88       17          440

  18. Ongoing Net100 research • more user-friendly WAD • invited to submit Web100/Net100 mods to Linux 2.6 • port of Web100 to FreeBSD (Web100 team) • base for AIX, SGI, Solaris, OSF • port to ORNL Cray X1 • Linux network front-end • added Net100 kernel, 4x improvement in wide-area TCP! • TCP Vegas • Vegas avoids loss (if RTT increasing, Vegas backs off) • can be configured to compete with standard TCP (Feng) • CalTech’s FAST • comparison with other “work arounds” • parallel streams • non-TCP (SABUL, FOBS, TSUNAMI, RBUDP, SCTP) • additional accelerants • slow-start initial/increment • reorder resilience • delayed ACKs

  19. TCP tuning for other OS’s • Reorder threshold • seeing more out-of-order packets (future: multipath?) • WAD tunes a bigger reorder threshold for the path • 40x improvement! • Linux 2.4 does a good job already • adjusts and caches reorder threshold • “undo” congestion avoidance • LBL to ORNL (using our TCP-over-UDP): dup3 case had 289 retransmits, but all were unneeded! • Delayed ACKs • WAD could turn off delayed ACKs • 2x improvement in recovery rate and slow-start • Linux 2.4 already turns off delayed ACKs for initial slow-start [Figure: ns simulation, 500 Mbs link, 80 ms RTT. Packet loss early in slow-start; standard TCP with delayed ACKs takes 10 minutes to recover! Note aggressive static AIMD (Floyd pre-tune).]
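
The effect of the reorder threshold can be illustrated with a toy model of duplicate-ACK counting. The sketch below replays a packet arrival order in which nothing is lost, only reordered, and counts how often the sender would fire a fast retransmit for a given threshold. It is a deliberately simplified model (cumulative ACKs only, no SACK, no timers), and the function name and sample arrival order are invented.

```python
def spurious_fast_retransmits(arrival_order, dup_threshold):
    """Count fast retransmits triggered by reordering alone.

    arrival_order: sequence numbers in the order the receiver sees them;
    every packet eventually arrives, so any retransmit is unneeded.
    """
    received = set()
    next_expected = 0
    dup_acks = 0
    retransmits = 0
    for seq in arrival_order:
        received.add(seq)
        if seq == next_expected:
            # In-order data: cumulative ACK advances past everything buffered.
            while next_expected in received:
                next_expected += 1
            dup_acks = 0
        else:
            # Out-of-order data: receiver repeats its last ACK.
            dup_acks += 1
            if dup_acks == dup_threshold:
                retransmits += 1
    return retransmits


if __name__ == "__main__":
    order = [0, 2, 3, 4, 5, 1, 6, 7]  # packet 1 arrives four packets late
    print("threshold 3:", spurious_fast_retransmits(order, 3))
    print("threshold 9:", spurious_fast_retransmits(order, 9))
```

With the standard threshold of three duplicate ACKs the late packet triggers an unneeded retransmit (and a window cut), while a larger threshold such as the `reorder: 9` value in the WAD config entry rides it out.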

  20. Planned Net100 research • improve ease of use (WAD → WAND) • analyze effectiveness/fairness of current tuning options • simulation • emulation • on the net (systematic tests) • NTAF probes -- characterizing a path to tune a flow • integration with SCNM • monitoring applications with Web100 • latest probe tools • additional tuning algorithms • identify non-congestive loss, ECN? • Tuning for dedicated path (lambda/10GigE) • parallel/multipath selection/tuning • WAD-to-WAD tuning • WAD caching • SGI/Linux • jumbo frame experiments… the quest for bigger and bigger MTUs

  21. Interactions • Scientific applications: SciDAC supernova and global climate; data grids (CERN, SLAC); radio telescopes (MIT) • Middleware: Globus/gridFTP; HSI/HPSS • Network measurement: Internet2 end-to-end; Pinger (Cottrell); Claffy/Dovrolis pathload; netest (Guojun); SCNM • Protocol research: Dynamic Right-Sizing (Feng); HS TCP (Floyd); Scalable TCP (Kelly); TCP Vegas (Feng, Low); Tsunami/SABUL/FOBS/RBUDP; parallel streams (Hacker) • OS vendors: Linux; IBM AIX/Linux; Cray X1 • Talks/papers/software: www.net100.org

  22. Summary • Novel approaches • non-invasive dynamic tuning of legacy applications • out-of-kernel tuning • using TCP to tune TCP • tuning per flow/destination based on recent path metrics or policy (QoS) • Effective evaluation framework • protocol analysis and tuning • network/application/OS debugging • path characterization tools, archive, and visualization tools • Performance improvements • WAD tuned: • buffers → 10x • AIMD → 2x to 10x • delayed ACK → 2x • slowstart → 3x • reorder → 40x • Timely -- needed for science on today’s and tomorrow’s networks
