
Solving the TCP-incast Problem with Application-Level Scheduling


Presentation Transcript


  1. Solving the TCP-incast Problem with Application-Level Scheduling
  Maxim Podlesny, University of Waterloo
  Carey Williamson, University of Calgary

  2. Motivation
  • Emerging IT paradigms
    • Data centers, grid computing, HPC, multi-core
    • Cluster-based storage systems, SAN, NAS
    • Large-scale data management “in the cloud”
    • Data manipulation via “services-oriented computing”
  • Cost and efficiency advantages from IT trends, economies of scale, a specialization marketplace
  • Performance advantages from parallelism
    • Partition/aggregation, MapReduce, BigTable, Hadoop
  • Think RAID at Internet scale! (1000x)

  3. Problem Statement
  • High-speed, low-latency network (RTT ≤ 0.1 ms)
  • Highly-multiplexed link (e.g., 1000 flows)
  • Highly-synchronized flows on bottleneck link
  • Limited switch buffer size (e.g., 32 KB)
  Under these conditions, N synchronized flows overwhelm the switch buffer, causing TCP retransmission timeouts and TCP throughput degradation. How to provide high goodput for data center applications?

  4. Related Work
  • E. Krevat et al., “On Application-based Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems”, Proc. SuperComputing 2007
  • A. Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”, Proc. FAST 2008
  • Y. Chen et al., “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, Proc. WREN 2009
  • V. Vasudevan et al., “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication”, Proc. ACM SIGCOMM 2009
  • M. Alizadeh et al., “Data Center TCP”, Proc. ACM SIGCOMM 2010
  • A. Shpiner et al., “A Switch-based Approach to Throughput Collapse and Starvation in Data Centers”, Proc. IWQoS 2010

  5. Summary of Related Work
  • Data centers have specific network characteristics
  • The TCP-incast throughput collapse problem emerges
  • Possible solutions:
    • Tweak TCP timers and/or parameters for this environment
    • Redesign (or replace!) TCP in this environment
    • Rewrite applications for this environment (Facebook)
    • Increase switch buffer sizes (extra queueing delay!)
    • Smart edge coordination for uploads/downloads

  6. Data Center System Model
  (Figure: a client connects through a switch with small buffer B and link capacity C to N servers, labeled 1, 2, 3, …, N; packets have size S_DATA)
  • Logical data block (S) (e.g., 1 MB)
  • Server Request Unit (SRU) (e.g., 32 KB)

  7. Performance Comparisons
  • Internet vs. data center network:
    • Internet propagation delay: 10–100 ms
    • Data center propagation delay: 0.1 ms
  • Packet size 1 KB, link capacity 1 Gbps → packet transmission time ≈ 0.01 ms
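A quick sanity check of that transmission-time figure (a minimal sketch; the constants are the ones quoted on the slide):

```python
# Serialization (transmission) delay of one packet on the bottleneck link.
PACKET_SIZE_BITS = 1024 * 8        # 1 KB packet
LINK_CAPACITY_BPS = 1e9            # 1 Gbps link

tx_time_ms = PACKET_SIZE_BITS / LINK_CAPACITY_BPS * 1000
print(f"packet transmission time = {tx_time_ms:.4f} ms")  # 0.0082 ms, i.e. ~0.01 ms
```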

  8. Analysis Overview (1 of 2)
  • Determine maximum TCP flow concurrency (n) that can be supported without any packet loss
  • Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling

  9. Analysis Overview (2 of 2)
  • Determine maximum TCP flow concurrency (n) that can be supported without any packet loss:
    • Determine flow size in packets (based on SRU and MSS)
    • Determine maximum outstanding packets per flow (Wmax)
    • Determine max flow concurrency (based on B and Wmax)
  • Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling (see the sketch below)
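A minimal Python sketch of these steps; the function name, the MSS-sized-packet assumption, and the rule that the switch buffer B must hold the peak window of every concurrent flow are my assumptions, not from the slides (Wmax itself is derived on the next slide):

```python
import math

def schedule_params(sru_bytes, mss_bytes, buffer_bytes, n_servers, w_max):
    """Steps from the slide: flow size, max concurrency n, group count k."""
    flow_size_pkts = math.ceil(sru_bytes / mss_bytes)   # flow size from SRU and MSS
    n = buffer_bytes // (w_max * mss_bytes)             # flows whose peak windows fit in B
    k = math.ceil(n_servers / n)                        # groups of (at most) n servers
    return flow_size_pkts, n, k

# Example: 32 KB SRU, 1 KB MSS, 32 KB switch buffer, 64 servers, Wmax = 11
print(schedule_params(32_768, 1_024, 32_768, 64, 11))   # -> (32, 2, 32)
```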

  10. Determining Wmax
  • Recall TCP slow-start dynamics:
    • Initial TCP congestion window (cwnd) is 1 packet
    • Acks cause cwnd to double every RTT (1, 2, 4, 8, 16, …)
  • Consider a TCP transfer of an arbitrary SRU (e.g., 21 packets)
  • Determine the peak power-of-2 cwnd value (WA)
  • Determine the “residual window” for the last RTT (WB)
  • Wmax depends on both WA and WB (e.g., WA + WB/2)
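A sketch of this computation using the slide's example rule Wmax = WA + WB/2 (the exact combination rule in the paper may differ, and the handling of flows that fit in one window is my assumption):

```python
import math

def wmax(flow_size_pkts: int) -> int:
    """Peak outstanding packets of a slow-starting TCP flow.

    WA: last congestion window fully sent during slow start (a power of 2);
    WB: residual window needed for the final RTT.
    """
    sent, cwnd = 0, 1
    while sent + cwnd < flow_size_pkts:   # complete slow-start rounds
        sent += cwnd
        cwnd *= 2                         # cwnd doubles every RTT: 1, 2, 4, 8, ...
    if sent == 0:                         # flow fits in the very first window
        return flow_size_pkts
    wa = cwnd // 2                        # peak fully-sent power-of-2 window
    wb = flow_size_pkts - sent            # residual packets in the last RTT
    return wa + math.ceil(wb / 2)         # the slide's example rule: WA + WB/2

print(wmax(21))  # rounds send 1+2+4+8 = 15 pkts, WA = 8, WB = 6 -> Wmax = 11
```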

  11. Scheduling Overview
  (Figure: the N servers are partitioned into successive groups of n servers each, scheduled one group at a time)

  12. Scheduling Details
  • Lossless scheduling of server responses: at most n servers respond simultaneously, with k groups of responding servers scheduled in turn
  • Server i (1 ≤ i ≤ N) starts responding at:
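The start-time formula itself did not survive the transcript. A plausible reconstruction, assuming server i belongs to group ⌈i/n⌉ and consecutive groups are offset by the scheduled SRU completion time T̃ from slide 13 (the function name and this offset rule are my assumptions):

```python
def start_time(i: int, n: int, t_sched: float) -> float:
    """Start time of server i (1 <= i <= N): group g = ceil(i / n)
    begins at (g - 1) * t_sched, where t_sched is the SRU completion
    time used for scheduling."""
    group = (i - 1) // n + 1
    return (group - 1) * t_sched

# Example: n = 2 servers per group, scheduled SRU completion time 1.5 ms
print([start_time(i, 2, 1.5) for i in range(1, 7)])  # [0.0, 0.0, 1.5, 1.5, 3.0, 3.0]
```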

  13. Theoretical Results
  • Maximum goodput of an application in a data center with lossless scheduling is given by a closed-form expression in:
    • S – size of a logical data block
    • T – actual completion time of an SRU
    • T̃ – SRU completion time used for scheduling
    • k – how many groups of servers to use
    • dmax – real system scheduling variance
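The expression itself was lost with the slide image. One plausible reconstruction from the definitions above, assuming the k groups start T̃ apart, the last group takes T to finish its SRU, and d_max absorbs real-system scheduling variance:

\[ G_{\max} = \frac{S}{(k-1)\,\tilde{T} + T + d_{\max}} \]

This is a hedged sketch of the formula's shape consistent with the listed variables, not necessarily the paper's exact result.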

  14. Solution: Analytical Model Results

  15. Results for 10 KB Fixed SRU Size (1 of 2)

  16. Results for 10 KB Fixed SRU Size (2 of 2)

  17. Results for Varied SRU Size (1 MB / N)

  18. Effect of TCP Timer Granularity

  19. Summary and Conclusion
  • Application-level scheduling addresses TCP-incast throughput collapse
  • Main idea: schedule server responses so that no packet losses occur
  • Lossless scheduling yields the maximum achievable goodput
  • Goodput is non-monotonic and highly sensitive to network configuration parameters

  20. Future Work
  • Implementing and testing our solution in real data centers
  • Evaluating our solution for different application traffic scenarios
