
Root Cause Analysis of TCP Throughput: Methodology, Techniques, and Applications Matti Siekkinen

Ph.D. Defense, October 30, 2006, Institut Eurecom, Sophia Antipolis, France.


Presentation Transcript


  1. Root Cause Analysis of TCP Throughput: Methodology, Techniques, and Applications Matti Siekkinen Ph.D. Defense October 30, 2006 Institut Eurecom Sophia Antipolis, France

  2. Outline • Introduction and Motivation • Root cause analysis of TCP throughput: what and why? • Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS • Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes • Part 3: Case study on Performance Analysis of ADSL Clients • Conclusions • Contributions • Future work

  3. The Internet: over the last 5 years… • Traffic volumes and number of users have skyrocketed • Access link capacities have multiplied • Dominance shifted from Web+FTP to peer-to-peer applications • TCP is still the dominant transport protocol • Carries over 90% of traffic

  4. The Internet: questions raised • ISPs would like to know how clients are doing • What are the performance limitations that Internet applications are facing? • Why does a client with 4 Mbit/s ADSL access obtain a total download rate of only a few KB/s with eDonkey? • Why, after upgrading my link, do I see no improvement in throughput? • The Internet does not directly provide answers • The network is dumb! • Need techniques for traffic measurement and analysis

  5. Root Cause Analysis of TCP Throughput • What? • Analysis and inference of the reasons that prevent a given TCP connection from achieving a higher throughput • Reasons are called limitation causes • Why TCP? • TCP typically carries over 90% of all traffic

  6. Background • TCP Rate Analysis Tool (T-RAT) by Zhang et al. (SIGCOMM 2002) • Pioneering research work • Groundbreaking insights • It is not all congestion! • Opened up many questions • We implemented and tested it • Results are way off too often • Fundamental assumptions do not hold • T-RAT analyzes unidirectional traffic • Passively collected measurements • Usable in more cases (asymmetric paths) • The source of the problems

  7. Our approach • We analyze only passive traffic measurements • Capture and store all TCP/IP headers, analyze later off-line • Observe traffic at a single measurement point • Applicable in diverse situations • E.g. at the edge of an ISP’s network • Know all about clients’ downloads and uploads • Bidirectional packet traces • Connection level analysis

  8. Challenges (1/3) • Single measurement point anywhere along the path • Cannot/don’t want to control it • Complicates estimation of parameters (RTT and cwnd) • A: RTT ~ d1 → piece of cake… • B: RTT ~ d3+d4 → How to get d4? (Did ack2 trigger data2?) • [Figure: time-sequence diagram with measurement points A and B and the segment ack2]
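One well-known way to sidestep the "did ack2 trigger data2?" ambiguity is to estimate the RTT from the three-way handshake, where each segment is triggered by the previous one. The sketch below is only an illustration in the spirit of the d3+d4 decomposition; the slides do not say which of the RTT estimators this corresponds to, and the function name and the assumption that the probe timestamps all three handshake segments are hypothetical.

```python
def handshake_rtt(t_syn: float, t_synack: float, t_ack: float) -> tuple[float, float, float]:
    """Estimate RTT at a mid-path probe from handshake timestamps (seconds).

    Assumption (not from the slides): the probe sees the SYN, the SYN-ACK,
    and the final ACK of the same connection.
    """
    # Probe -> server -> probe: SYN passes the probe, SYN-ACK comes back.
    server_side = t_synack - t_syn
    # Probe -> client -> probe: SYN-ACK passes the probe, ACK comes back.
    client_side = t_ack - t_synack
    if server_side < 0 or client_side < 0:
        raise ValueError("timestamps must be in handshake order")
    # The two partial delays together approximate the end-to-end RTT.
    return server_side, client_side, server_side + client_side


server_side, client_side, rtt = handshake_rtt(0.000, 0.080, 0.095)
```

The estimate covers only one sample per connection and includes any processing delay at the end hosts, which is one reason several estimators are worth cross-validating.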

  9. Challenges (2/3) • A lot of data to analyze • Potentially millions of connections per trace • Deep analysis • For each connection of each trace • Compute a lot of metrics • Divide connections into pieces • Analyse separately and compute more metrics • Need to keep track of everything

  10. Challenges (3/3) • Find the right metrics to characterize all limitations • Not too many • Need to gather a lot of experience • Get it right! • Several methods for computing a particular metric • Choose the “best” for the situation • Try to maximize correctness of results • E.g. 5 ways to estimate RTTs • Careful validations • Benchmark with a lot of reference traces • Cross validate metrics

  11. Outline • Introduction and Motivation • Root cause analysis of TCP throughput: what and why? • Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS • Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes • Part 3: Case study on Performance Analysis of ADSL Clients • Conclusions • Contributions • Future work

  12. Why did we need InTraBase? • First try: ad-hoc scripts and specialized software tools (tcptrace et al.) • Problems: • Management • Data, metadata, and tools • Got lost with files containing data and ad-hoc scripts • Many metrics to compute and combine • Cumbersome analysis process • Iterative analysis • Data loses semantics and structure • Scalability • Cannot analyze large enough data sets • [Figure: analysis cycle with the steps Filter, Process, Combine, Store, Interpret]

  13. Our InTraBase approach • Store traffic measurements in files as base data • Upload base data into the db and process it within the db • Issue SQL queries • Object-relational DBMS: create functions for advanced processing • [Figure: architecture diagram. Labels: network link, tcpdump, IP, TCP, Web100, application, application logs, raw base data files, preprocess, database system (base data, metadata, functions, queries, results)]

  14. Benefits from a DBMS-based Approach • Organize and manage data, related metadata, analysis results and tools • Data becomes structured and has semantics • Processing and updating data is easier • Tools “understand” the data → higher-level programming • Searching is more efficient (indexes) • Store reusable intermediate results • It is easier to combine different data sources • E.g. across OSI layers

  15. Histogram of the packet inter-arrival times of the fastest connection • [Figure: tables packets and connections (columns shown include connection id, bytes, packets, tput, timestamp, start #seq, end #seq, flags, …) feeding iat(…) and plot_ts_hist(), which outputs histogram.pdf] • Query: SELECT plot_ts_hist('SELECT * FROM iat(t2.cnxid,t2.reverse,"packets")','histogram.pdf') FROM (SELECT cnxid,reverse FROM cnxs,(SELECT max(throughput) FROM cnxs) AS t1 WHERE cnxs.throughput=t1.max) AS t2;
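The same idea (keep connections and packets in tables, then derive inter-arrival times for the fastest connection with queries) can be tried out in a few lines. This is a minimal sketch using Python's built-in sqlite3 module rather than the object-relational DBMS used in InTraBase; the two-table schema, the column name `ts`, and the sample rows are assumptions made for illustration, and the inter-arrival times are computed in Python instead of a stored function such as iat().

```python
import sqlite3


def fastest_connection_iats(db: sqlite3.Connection) -> list[float]:
    """Return packet inter-arrival times of the highest-throughput connection."""
    row = db.execute(
        "SELECT cnxid, reverse FROM cnxs ORDER BY throughput DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return []
    # Fetch the packet timestamps of that connection and direction, in order.
    times = [ts for (ts,) in db.execute(
        "SELECT ts FROM packets WHERE cnxid = ? AND reverse = ? ORDER BY ts", row
    )]
    # Inter-arrival time = difference between consecutive timestamps.
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical toy data: two connections, the second one is the fastest.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE cnxs (cnxid INTEGER, reverse INTEGER, throughput REAL);
CREATE TABLE packets (cnxid INTEGER, reverse INTEGER, ts REAL);
INSERT INTO cnxs VALUES (1, 0, 120.0), (2, 0, 950.0);
INSERT INTO packets VALUES (1, 0, 0.0), (1, 0, 0.5),
                           (2, 0, 1.0), (2, 0, 1.25), (2, 0, 1.75);
""")
iats = fastest_connection_iats(db)
```

Keeping the per-connection table next to the per-packet table is what lets a single query pick the connection and a second one pull only its packets, instead of re-reading the whole trace.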

  16. Outline • Introduction and Motivation • Root cause analysis of TCP throughput: what and why? • Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS • Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes • Part 3: Case study on Performance Analysis of ADSL Clients • Conclusions • Contributions • Future work

  17. Scope • Study long lived TCP connections • Short connections are another topic • Dominated by slow start? • Assume FIFO scheduling • Necessary for link capacity estimations with packet dispersion techniques • Reasonable assumption for most traffic • May not hold for cable modem and 802.11 access networks
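The FIFO assumption matters because packet dispersion techniques read the bottleneck capacity off the spacing between packets that left the sender back to back. A minimal sketch of the basic packet-pair relation, capacity = packet size / dispersion, follows; the function name and the sample numbers are illustrative assumptions, and practical estimators filter many such samples rather than trusting a single pair.

```python
def packet_pair_capacity(packet_size_bytes: int, dispersion_s: float) -> float:
    """Capacity estimate in bit/s from the spacing of two back-to-back packets.

    Assumes FIFO queuing at the bottleneck and no cross traffic between the
    two packets, so the spacing equals the second packet's transmission time.
    """
    if dispersion_s <= 0:
        raise ValueError("dispersion must be positive")
    return packet_size_bytes * 8 / dispersion_s


# Two 1500-byte packets arriving 12 ms apart suggest a link of about 1 Mbit/s.
estimate = packet_pair_capacity(1500, 0.012)
```

With non-FIFO scheduling (the slide mentions cable modem and 802.11 access networks) the observed spacing no longer reflects a single link's transmission time, so the relation breaks down.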

  18. Limitation Causes for TCP Throughput • Application • Transport layer • TCP receiver • Receiver window limitation • TCP protocol • Slow start… • Network layer • Bottleneck link

  19. Application that sends larger bursts separated by idle periods • BitTorrent, HTTP/1.1 (persistent) • [Figure: timeline of transfer periods separated by periods with only keep-alive messages]

  20. Limitation Causes: Application • The application does not even attempt to use all network resources • TCP connections are partitioned into two periods: • Bulk Transfer Period (BTP): the application constantly provides data to transfer • Never runs out of data in buffer B1 • Application Limited Period (ALP): opposite of BTP • TCP has to wait for data because B1 is empty • [Figure: sender and receiver stacks (application, buffers, TCP) connected by the network, with buffer B1 marked]

  21. Limitation Causes: TCP Receiver • Receiver advertised window limits the rate • Max amount of outstanding bytes = min(cwnd,rwnd) → Sender is idle waiting for ACKs to arrive • Flow control • Sender application overflows receiving application • Buffer B2 is full • Configuration problem (unintentional) • Default receiver advertised window is set too low • Window scaling is not enabled • [Figure: sender and receiver stacks (application, buffers, TCP) connected by the network, with buffer B2 marked]
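Since at most min(cwnd, rwnd) bytes can be outstanding per round trip, the window puts a ceiling on throughput of roughly min(cwnd, rwnd) / RTT. The sketch below works this out for a receiver without window scaling, where the advertised window cannot exceed 65535 bytes; the function name and the sample values are assumptions for illustration.

```python
def window_limited_rate(cwnd_bytes: int, rwnd_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput in bit/s imposed by the TCP windows."""
    if rtt_s <= 0:
        raise ValueError("RTT must be positive")
    # The sender may have at most min(cwnd, rwnd) unacknowledged bytes in flight.
    outstanding = min(cwnd_bytes, rwnd_bytes)
    return outstanding * 8 / rtt_s


# No window scaling: rwnd is capped at 65535 bytes. With a 100 ms RTT the
# connection cannot exceed about 5.2 Mbit/s, however large cwnd grows.
ceiling = window_limited_rate(cwnd_bytes=1_000_000, rwnd_bytes=65535, rtt_s=0.100)
```

This is why a low default window or disabled window scaling shows up as a rate limitation on long or fast paths even when the network itself has spare capacity.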

  22. Limitation Causes: Network • Limitation is due to congestion at a bottleneck link • Shared bottleneck: obtain only a fraction of its capacity • Non-shared bottleneck: obtain all of its capacity

  23. Our Approach to Root Cause Analysis • Divide & Conquer • Partition connections into BTPs and ALPs • Filter out application impact • Analyze the bulk transfer periods for limitation by • TCP receiver • TCP protocol • Network • Methods are based on metrics computed from packet headers

  24. Why filter out application effect? • Many TCP/IP-level traffic studies do not account for application effect • RTTs, burstiness… • They try to study network properties but end up measuring application effect instead!

  25. Distinguishing BTPs from ALPs: Isolate & Merge algorithm • Phase 1: Isolate • Fact: TCP always tries to send MSS size packets • Consequence: small packets (size < MSS) and idle time indicate application limitation • Buffer between application and TCP is empty • [Figure: packet timeline showing MSS packets and packets smaller than MSS; ALPs are marked where there is a large fraction of small packets or idle time > RTT]
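The Isolate idea can be sketched as a single pass over the packets of one direction of a connection. This is a deliberately simplified illustration, not the algorithm from the thesis: it treats every packet smaller than the MSS and every gap longer than one RTT as a break, whereas the slide speaks of a large fraction of small packets. The function name, the input format, and the sample trace are assumptions.

```python
def isolate(packets: list[tuple[float, int]], mss: int, rtt: float) -> list[tuple[float, float]]:
    """Split a packet sequence into candidate bulk transfer periods (BTPs).

    packets: (timestamp_s, payload_bytes) pairs sorted by time.
    Returns (start, end) timestamps of runs of MSS-sized packets with no
    inter-packet gap longer than rtt. Everything in between is treated as
    application limited (simplifying assumption).
    """
    btps = []
    run = []  # timestamps of the current run of MSS-sized packets
    prev_ts = None
    for ts, size in packets:
        idle = prev_ts is not None and ts - prev_ts > rtt
        if idle and run:
            # An idle period longer than one RTT closes the current run.
            btps.append((run[0], run[-1]))
            run = []
        if size >= mss:
            run.append(ts)
        elif run:
            # A small packet suggests the send buffer ran empty.
            btps.append((run[0], run[-1]))
            run = []
        prev_ts = ts
    if run:
        btps.append((run[0], run[-1]))
    return btps


trace = [(0.00, 1460), (0.01, 1460), (0.02, 1460), (0.03, 200),
         (0.50, 1460), (0.51, 1460)]
periods = isolate(trace, mss=1460, rtt=0.1)
```

In this toy trace the small packet at 0.03 s ends the first run and the long idle gap precedes the second, so two candidate BTPs come out.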

  26. Distinguishing BTPs from ALPs: Isolate & Merge algorithm • Phase 2: Merge • Why? • After Isolate, BTPs may be separated by very short ALPs • Analyze impact of the application • How much do ALPs decrease overall throughput? • How? • Merge subsequent transfer periods separated by ALP to create a new BTP • Mergers controlled with drop parameter • Iterate until all possible mergers are performed
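The slide only states that mergers are controlled by a drop parameter, so the acceptance rule in the sketch below is an assumption: two BTPs and the ALP between them are merged when the throughput of the merged period is at least (1 - drop) times the throughput of the two BTPs taken without the ALP. The `Period` type and all names are hypothetical.

```python
from typing import NamedTuple


class Period(NamedTuple):
    kind: str      # "BTP" or "ALP"
    start: float   # seconds
    end: float     # seconds
    nbytes: int


def _tput(nbytes: int, duration: float) -> float:
    return nbytes / duration if duration > 0 else 0.0


def merge(periods: list[Period], drop: float) -> list[Period]:
    """Iteratively merge BTP-ALP-BTP triples (assumed acceptance rule, see above)."""
    periods = list(periods)
    merged_any = True
    while merged_any:  # iterate until no further merger is possible
        merged_any = False
        for i in range(len(periods) - 2):
            a, alp, b = periods[i:i + 3]
            if (a.kind, alp.kind, b.kind) != ("BTP", "ALP", "BTP"):
                continue
            # Throughput of the two BTPs alone versus the merged period.
            before = _tput(a.nbytes + b.nbytes, (a.end - a.start) + (b.end - b.start))
            total = a.nbytes + alp.nbytes + b.nbytes
            after = _tput(total, b.end - a.start)
            if after >= (1 - drop) * before:
                periods[i:i + 3] = [Period("BTP", a.start, b.end, total)]
                merged_any = True
                break  # restart the scan on the modified list
    return periods


example = [Period("BTP", 0.0, 10.0, 1_000_000),
           Period("ALP", 10.0, 10.5, 1_000),
           Period("BTP", 10.5, 20.5, 1_000_000)]
tolerant = merge(example, drop=0.05)  # short ALP costs about 2.4%: merged
strict = merge(example, drop=0.01)    # same ALP exceeds a 1% budget: kept apart
```

Sweeping the drop value in this way gives a feel for how much the application-limited periods reduce overall throughput, which is the question the slide raises.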

  27. BTP Analysis • Compute limitation scores for each BTP • 4 quantitative scores • [0,1] • We use retransmission rates, inter-arrival time patterns, path capacity, RTT etc. • Perform classification of BTPs into limitation causes • Map (combination of) limitation scores into a cause • Threshold-based scheme

  28. Classification scheme • 4 thresholds need to be set • [Figure: classification flowchart over the four scores: dispersion score, retransmission score, receiver window limitation score, b-score]

  29. Classification: calibrating the thresholds • Difficult task: Diversity vs. Control • Reference data needs to be representative & diverse enough • No simulations • Need to control experiments in some way to get what we want • Reference data with partially controlled experiments • Try to generate transfers limited by certain cause • FTP downloads from Fedora Core mirror sites • 232 sites covering all continents • Artificial bottleneck links with rshaper • network limitation • Nistnet to add delay • receiver limitation (Wr/RTT < bw) • Control the number of simultaneous downloads • unshared vs. shared bottleneck • [Figure: test setup. Labels: Australia, Japan, USA, Finland, Internet, Rshaper, Nistnet, Eurecom]
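The condition Wr/RTT < bw tells how much delay has to be added before a transfer becomes receiver limited: the RTT must exceed Wr/bw. A small worked sketch follows, with Wr taken as the receiver window in bytes and bw as the bottleneck rate in bit/s; the function name and the sample values are assumptions for illustration.

```python
def min_added_delay(wr_bytes: int, bw_bps: float, base_rtt_s: float) -> float:
    """Extra delay (s) beyond which Wr/RTT drops below the bottleneck rate bw."""
    if bw_bps <= 0:
        raise ValueError("bandwidth must be positive")
    # Receiver limitation requires Wr / RTT < bw, i.e. RTT > Wr / bw.
    threshold_rtt = wr_bytes * 8 / bw_bps
    return max(0.0, threshold_rtt - base_rtt_s)


# 65535-byte window, 10 Mbit/s bottleneck, 20 ms base RTT:
# the RTT must exceed about 52.4 ms, so roughly 32.4 ms must be added.
needed = min_added_delay(65535, 10_000_000, 0.020)
already_limited = min_added_delay(65535, 10_000_000, 0.100)
```

When the base RTT is already above the threshold the function returns 0.0, meaning the transfer would be receiver limited without any added delay.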

  30. Classification: calibrating the thresholds (example) • Bottleneck set at 1 Mbit/s, 1 download at a time • [Figure: plot annotated “set th1 here”]

  31. Outline • Introduction and Motivation • Root cause analysis of TCP throughput: what and why? • Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS • Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes • Part 3: Case study on Performance Analysis of ADSL Clients • Conclusions • Contributions • Future work

  32. Motivation • Stress test for our techniques • Do we learn useful things? • Knowing throughput limitations (=performance) is useful • ISPs want satisfied clients • Need to know what’s going on before things can be improved • Installed InTraBase at France Telecom to study traffic at their ADSL access network • Root cause analysis techniques implemented within InTraBase

  33. Measurement Setup • 24 hours of traffic on March 10, 2006 • 290 GB of TCP traffic • 64% downstream, 36% upstream • Observed packets from ~3000 clients, analyze only 1335 • Excluded clients did not generate enough traffic for RCA • [Figure: network diagram. Labels: Internet, access network, collect network; annotation: “Two pcap probes here”]

  34. Warming up… • Connections • Size distribution highly skewed • Use only 1% of them for RCA • Represent > 85% of all traffic • Clients • Heavy-hitters: 15% of clients generate 85-90% of traffic (up & down) • Low access link utilization • Why?

  35. Results of Limitation Analysis • Striking result • Application limits performance of over 80% of clients • What’s going on?

  36. Application analysis: Application limited traffic • Quite stable and symmetric volumes • Over 80% of all traffic • eDonkey and “other” dominate • [Figure: traffic volume plot with series labeled eDonkey, P2P, and other]

  37. Application analysis: Saturated access link • No recognized P2P • Asymmetric port 80/8080 downstream • Real Web traffic?

  38. Connecting the evidence… • Most clients’ performance limited by applications • Very low link utilizations for application limited traffic • Most of application limited traffic seems to be P2P • Peers often have asymmetric uplink and downlink capacities • P2P applications/users enforce upload rate limits → Most clients’ download performance seems to suffer from P2P clients drastically limiting their upload rates • [Figure: a downloading client (low utilization) and uploading clients (low capacity + rate limiter) connected through the Internet]

  39. Outline • Introduction and Motivation • Root cause analysis of TCP throughput: what and why? • Part 1: Methodology • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS • Part 2: Root cause analysis techniques • Taxonomy of TCP rate limitation causes • Our approach to infer limitation causes • Part 3: Case study on Performance Analysis of ADSL Clients • Conclusions • Contributions • Future work

  40. Conclusions: Claims and contributions • DBMSs provide powerful infrastructure for analysis of passive traffic measurements • Performance is good. • We can infer root causes for TCP throughput using • bidirectional packet traces at • single measurement point located anywhere on the TCP/IP path. • Today’s Internet applications interact in diverse ways with TCP • Bias/error in TCP/IP path analysis • Filter out their effects first • TCP root cause analysis techniques with DBMS-based analysis enable: • performance evaluation of applications, • evaluation of network utilization, and • identification of TCP configuration problems. • [Slide labels next to the four claims, in order: Part 1, Part 2, Part 2, Part 3]

  41. The case is not yet closed… • Short connections • Challenge previous “old” results with RCA • What about persistent connections? • Wireless traffic • Non-FIFO scheduling • Link-layer issues • Extended case study on ADSL clients • We saw a day, what about a week? • Trends, consistency

  42. Thank you! Questions?
