
Solving the TCP-incast Problem with Application-Level Scheduling


Presentation Transcript


  1. Solving the TCP-incast Problem with Application-Level Scheduling
  Maxim Podlesny, University of Waterloo
  Carey Williamson, University of Calgary

  2. Motivation
  • Emerging IT paradigms
    • Data centers, grid computing, HPC, multi-core
    • Cluster-based storage systems, SAN, NAS
    • Large-scale data management “in the cloud”
    • Data manipulation via “services-oriented computing”
  • Cost and efficiency advantages from IT trends, economies of scale, a specialization marketplace
  • Performance advantages from parallelism
    • Partition/aggregation, MapReduce, BigTable, Hadoop
  • Think RAID at Internet scale! (1000x)

  3. Problem Statement
  • High-speed, low-latency network (RTT ≤ 0.1 ms)
  • Highly-multiplexed link (e.g., 1000 flows)
  • Highly-synchronized flows on bottleneck link
  • Limited switch buffer size (e.g., 32 KB)
  Under these conditions, N synchronized flows overwhelm the switch buffer, causing TCP retransmission timeouts and TCP throughput degradation. How to provide high goodput for data center applications?

  4. Related Work
  • E. Krevat et al., “On Application-based Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems”, Proc. SuperComputing 2007
  • A. Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”, Proc. FAST 2008
  • Y. Chen et al., “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, Proc. WREN 2009
  • V. Vasudevan et al., “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication”, Proc. ACM SIGCOMM 2009
  • M. Alizadeh et al., “Data Center TCP”, Proc. ACM SIGCOMM 2010
  • A. Shpiner et al., “A Switch-based Approach to Throughput Collapse and Starvation in Data Centers”, Proc. IWQoS 2010

  5. Summary of Related Work
  • Data centers have specific network characteristics
  • The TCP-incast throughput collapse problem emerges
  • Possible solutions:
    • Tweak TCP timers and/or parameters for this environment
    • Redesign (or replace!) TCP in this environment
    • Rewrite applications for this environment (Facebook)
    • Increase switch buffer sizes (extra queueing delay!)
    • Smart edge coordination for uploads/downloads

  6. Data Center System Model
  (Figure: a client connects through a switch with small buffer B and link capacity C to N servers, labeled 1, 2, 3, …, N; packets have size S_DATA)
  • Logical data block (S) (e.g., 1 MB)
  • Server Request Unit (SRU) (e.g., 32 KB)

  7. Performance Comparisons
  • Internet vs. data center network:
    • Internet propagation delay: 10–100 ms
    • Data center propagation delay: 0.1 ms
  • Packet size 1 KB, link capacity 1 Gbps → packet transmission time ≈ 0.01 ms
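A quick sanity check of that transmission-time figure (a minimal sketch; the constants are the ones quoted on the slide):

```python
# Serialization (transmission) delay of one packet on the bottleneck link.
PACKET_SIZE_BITS = 1024 * 8        # 1 KB packet
LINK_CAPACITY_BPS = 1e9            # 1 Gbps link

tx_time_ms = PACKET_SIZE_BITS / LINK_CAPACITY_BPS * 1000
print(f"packet transmission time = {tx_time_ms:.4f} ms")  # 0.0082 ms, i.e. ~0.01 ms
```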

  8. Analysis Overview (1 of 2)
  • Determine maximum TCP flow concurrency (n) that can be supported without any packet loss
  • Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling

  9. Analysis Overview (2 of 2)
  • Determine maximum TCP flow concurrency (n) that can be supported without any packet loss:
    • Determine flow size in packets (based on SRU and MSS)
    • Determine maximum outstanding packets per flow (Wmax)
    • Determine max flow concurrency (based on B and Wmax)
  • Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling (see the sketch below)
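A minimal Python sketch of these steps; the function name, the MSS-sized-packet assumption, and the rule that the switch buffer B must hold the peak window of every concurrent flow are my assumptions, not from the slides (Wmax itself is derived on the next slide):

```python
import math

def schedule_params(sru_bytes, mss_bytes, buffer_bytes, n_servers, w_max):
    """Steps from the slide: flow size, max concurrency n, group count k."""
    flow_size_pkts = math.ceil(sru_bytes / mss_bytes)   # flow size from SRU and MSS
    n = buffer_bytes // (w_max * mss_bytes)             # flows whose peak windows fit in B
    k = math.ceil(n_servers / n)                        # groups of (at most) n servers
    return flow_size_pkts, n, k

# Example: 32 KB SRU, 1 KB MSS, 32 KB switch buffer, 64 servers, Wmax = 11
print(schedule_params(32_768, 1_024, 32_768, 64, 11))   # -> (32, 2, 32)
```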

  10. Determining Wmax
  • Recall TCP slow-start dynamics:
    • Initial TCP congestion window (cwnd) is 1 packet
    • Acks cause cwnd to double every RTT (1, 2, 4, 8, 16, …)
  • Consider a TCP transfer of an arbitrary SRU (e.g., 21 packets)
  • Determine the peak power-of-2 cwnd value (WA)
  • Determine the “residual window” for the last RTT (WB)
  • Wmax depends on both WA and WB (e.g., WA + WB/2)
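A sketch of this computation using the slide's example rule Wmax = WA + WB/2 (the exact combination rule in the paper may differ, and the handling of flows that fit in one window is my assumption):

```python
import math

def wmax(flow_size_pkts: int) -> int:
    """Peak outstanding packets of a slow-starting TCP flow.

    WA: last congestion window fully sent during slow start (a power of 2);
    WB: residual window needed for the final RTT.
    """
    sent, cwnd = 0, 1
    while sent + cwnd < flow_size_pkts:   # complete slow-start rounds
        sent += cwnd
        cwnd *= 2                         # cwnd doubles every RTT: 1, 2, 4, 8, ...
    if sent == 0:                         # flow fits in the very first window
        return flow_size_pkts
    wa = cwnd // 2                        # peak fully-sent power-of-2 window
    wb = flow_size_pkts - sent            # residual packets in the last RTT
    return wa + math.ceil(wb / 2)         # the slide's example rule: WA + WB/2

print(wmax(21))  # rounds send 1+2+4+8 = 15 pkts, WA = 8, WB = 6 -> Wmax = 11
```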

  11. Scheduling Overview
  (Figure: the N servers are partitioned into successive groups of n servers each, scheduled one group at a time)

  12. Scheduling Details
  • Lossless scheduling of server responses: at most n servers respond simultaneously, with k groups of responding servers scheduled in turn
  • Server i (1 ≤ i ≤ N) starts responding at:
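The start-time formula itself did not survive the transcript. A plausible reconstruction, assuming server i belongs to group ⌈i/n⌉ and consecutive groups are offset by the scheduled SRU completion time T̃ from slide 13 (the function name and this offset rule are my assumptions):

```python
def start_time(i: int, n: int, t_sched: float) -> float:
    """Start time of server i (1 <= i <= N): group g = ceil(i / n)
    begins at (g - 1) * t_sched, where t_sched is the SRU completion
    time used for scheduling."""
    group = (i - 1) // n + 1
    return (group - 1) * t_sched

# Example: n = 2 servers per group, scheduled SRU completion time 1.5 ms
print([start_time(i, 2, 1.5) for i in range(1, 7)])  # [0.0, 0.0, 1.5, 1.5, 3.0, 3.0]
```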

  13. Theoretical Results
  • Maximum goodput of an application in a data center with lossless scheduling is given by a closed-form expression in:
    • S – size of a logical data block
    • T – actual completion time of an SRU
    • T̃ – SRU completion time used for scheduling
    • k – how many groups of servers to use
    • dmax – real system scheduling variance
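The expression itself was lost with the slide image. One plausible reconstruction from the definitions above, assuming the k groups start T̃ apart, the last group takes T to finish its SRU, and d_max absorbs real-system scheduling variance:

\[ G_{\max} = \frac{S}{(k-1)\,\tilde{T} + T + d_{\max}} \]

This is a hedged sketch of the formula's shape consistent with the listed variables, not necessarily the paper's exact result.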

  14. Solution: Analytical Model Results

  15. Results for 10 KB Fixed SRU Size (1 of 2)

  16. Results for 10 KB Fixed SRU Size (2 of 2)

  17. Results for Varied SRU Size (1 MB / N)

  18. Effect of TCP Timer Granularity

  19. Summary and Conclusion
  • Application-level scheduling addresses TCP-incast throughput collapse
  • Main idea: schedule server responses so that no packet losses occur
  • Lossless scheduling yields the maximum achievable goodput
  • Goodput is non-monotonic and highly sensitive to network configuration parameters

  20. Future Work
  • Implementing and testing our solution in real data centers
  • Evaluating our solution for different application traffic scenarios
