TCP Incast in Data Center Networks

A study of the problem and proposed solutions

Outline
  • TCP Incast - Problem Description
  • Motivation and challenges
  • Proposed Solutions
  • Evaluation of proposed solutions
  • Conclusion
  • References
TCP Incast – Problem Description
  • Incast terminology:
    • Barrier-synchronized workload
    • SRU (Server Request Unit)
    • Goodput vs. throughput
    • MTU
    • BDP
    • TCP acronyms such as RTT, RTO, CA, AIMD, etc.
TCP Incast – Problem

A typical deployment scenario in data centers

TCP Incast – Problem
  • Many-to-one barrier-synchronized workload:
    • The receiver requests k blocks of data from S storage servers.
    • Each block is striped across the S storage servers.
    • Each server responds with a fixed amount of data (fixed-fragment workload).
    • The client does not request block k+1 until all fragments of block k have been received.
  • Datacenter scenario:
      • k = 100
      • S = 1–48
      • fragment size: 256 KB
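The barrier in this workload can be made concrete with a minimal sketch. Everything below is idealized and illustrative (function and parameter names are mine, not from any real storage API); the point is only the control flow: one fragment per server, and no request for the next block until every fragment of the current block is in.

```python
# Minimal sketch of the barrier-synchronized ("incast") request pattern.
# Server and network behavior are idealized away; names are illustrative.

def fetch_blocks(num_blocks, num_servers, fragment_size):
    """Client requests blocks one at a time; block k+1 is not requested
    until every fragment of block k has arrived (the barrier)."""
    received = 0
    for block in range(num_blocks):
        # One fixed-size fragment per server; all must arrive before the
        # client moves on to the next block.
        fragments = [fragment_size for _ in range(num_servers)]
        received += sum(fragments)  # barrier: wait for all S fragments
    return received

# Datacenter scenario from the slide: k = 100 blocks, S = 48 servers,
# 256 KB fragments.
total = fetch_blocks(num_blocks=100, num_servers=48, fragment_size=256 * 1024)
```

Because every server sends its fragment at the same barrier, all S responses hit the bottleneck switch port nearly simultaneously, which is what sets up the buffer overflow described next.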
TCP Incast – Problem

Goodput Collapse

TCP Incast – Problem
  • Switch buffers are inherently small, typically 32 KB–128 KB per port
  • The bottleneck switch buffer is overwhelmed by the servers' synchronized sending, and the switch drops packets
  • RTT is typically 1–2 ms in datacenters, while the default RTOmin is 200 ms; this gap means dropped packets are not retransmitted quickly
  • All the other senders, having already delivered their data, must wait until the dropped packet is retransmitted
  • The large RTO delays that retransmission, so goodput collapses
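The bullets above boil down to simple arithmetic: the dead time imposed by one timeout is two orders of magnitude larger than a round trip, so the link sits idle for on the order of a hundred RTTs per drop. The numbers below are the representative values from the slide, not measurements.

```python
# Back-of-envelope arithmetic for the RTT/RTOmin gap described above.
rtt = 1.5e-3      # typical datacenter RTT, in the stated 1-2 ms range
rto_min = 200e-3  # common default minimum retransmission timeout (200 ms)

# One dropped packet stalls the barrier for roughly this many RTTs,
# during which the bottleneck link carries nothing useful.
idle_rtts = rto_min / rtt
```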
Outline
  • TCP Incast - Problem Description
  • Motivation and challenges
  • Proposed Solutions
  • Evaluation of proposed solutions
  • Conclusion
  • References
  • Internet datacenters support a myriad of services and applications.
    • Google, Microsoft, Yahoo, Amazon
  • The vast majority of datacenters use TCP for communication between nodes.
  • Companies like Facebook have adopted UDP as their transport-layer protocol to avoid TCP incast, pushing the responsibility for flow control up to application-layer protocols.
  • The unique workloads (e.g., MapReduce, Hadoop), scale, and environment of Internet datacenters violate the WAN assumptions under which TCP was originally designed.
    • Example: in a web-search application, many workers respond nearly simultaneously to a query; similarly, in MapReduce, key-value pairs from many mappers are transferred to the appropriate reducers during the shuffle stage.
Incast in Bing (Microsoft)

Ref : Slide from Albert Greenberg(Microsoft) presentation at SIGCOMM’10

Challenges
  • Solutions should require minimal changes to the TCP implementation
  • RTOmin cannot be reduced below about 1 ms, since operating systems struggle to support the high-resolution timers RTO would need
  • Both internal and external flows must be addressed
  • Large switch buffers are unaffordable because they are costly
  • A solution must be easily deployable and cost-effective
Outline
  • TCP Incast - Problem Description
  • Characteristics of the problem and challenges
  • Proposed Solutions
  • Evaluation of proposed solutions
  • Conclusion
  • References
Proposed Solutions

Solutions can be divided into:

  • Application-level solutions
  • Transport-layer solutions
  • Transport-layer solutions aided by switches' ECN and QCN capabilities

An alternative way to categorize the solutions:

  • Avoiding timeouts in TCP
  • Reducing RTOmin
  • Replacing TCP
  • Calling on lower-layer functionality, such as Ethernet flow control, for help
Understanding the Problem…
  • Collaborative study by EECS Berkeley and Intel Labs [1]
  • Their study focused on:
    • proving the problem is general,
    • deriving an analytical model, and
    • studying the impact of various modifications to TCP on incast behavior.

Different RTO Timers


  • The initial goodput minimum occurs at the same number of servers.
  • Smaller RTO timer values give a faster goodput "recovery" rate.
  • The rate of decrease after the local maximum is the same across different minimum RTO settings.

Decreasing the RTO yields a proportional increase in goodput

  • Surprisingly, a 1 ms RTO with delayed ACKs enabled performed better
  • With delayed ACKs disabled at a 1 ms RTO, the high rate of ACKs over-drives the sender's congestion window, causing fluctuations in the smoothed RTT


  • D: total amount of data to be sent (100 blocks of 256 KB)
  • L: total transfer time of the workload without any RTO events
  • R: number of RTO events during the transfer
  • S: number of servers
  • r: value of the minimum RTO timer
  • I: inter-packet wait time
  • R and I were modeled from empirically observed behavior

Net goodput: G = D / (L + R · r)
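A quick sketch of this first-order model, as I read the variable definitions above (in [1], R and the inter-packet wait time I come from empirical fits, which this sketch omits; the values of L and R below are assumed for illustration):

```python
# First-order goodput model: total data divided by ideal transfer time
# plus the time lost stalled in RTO events.
def net_goodput(D, L, R, r):
    """G = D / (L + R * r)."""
    return D / (L + R * r)

D = 100 * 256 * 1024  # 100 blocks of 256 KB
L = 2.0               # ideal transfer time in seconds (assumed)
R = 10                # number of RTO events during the transfer (assumed)

g_default = net_goodput(D, L, R, r=0.200)  # RTOmin = 200 ms
g_small = net_goodput(D, L, R, r=0.001)    # RTOmin = 1 ms
```

With the same number of timeouts, shrinking r from 200 ms to 1 ms nearly doubles goodput in this toy instance, which matches the observation that a smaller minimum RTO recovers goodput faster.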

Key Observations
  • A smaller minimum RTO timer value means a larger goodput value at the initial minimum.
  • The initial goodput minimum occurs at the same number of senders, regardless of the minimum RTO timer value.
  • The second-order goodput peak occurs at a higher number of senders for larger RTO timer values.
  • The smaller the RTO timer value, the faster the recovery between the goodput minimum and the second-order goodput maximum.
  • After the second-order goodput maximum, the slope of the goodput decrease is the same for all RTO timer values.
Application-Level Solution [5]
  • No changes required to the TCP stack or network switches
  • Based on scheduling server responses to the same data block so that no data loss occurs
  • Caveats:
    • Genuine retransmissions can still trigger cascading timeouts
    • Scheduling at the application level cannot be easily synchronized
    • Limited control over the transport layer
ICTCP: Incast Congestion Control for TCP in Data Center Networks [8]
  • Features
    • Solution based on dynamically modifying the congestion window
    • Can be implemented on the receiver side only
    • Focuses on avoiding packet losses before incast congestion occurs
    • Test implementation on Windows NDIS
    • Novelties in the solution:
      • Uses available bandwidth to coordinate receive-window increases across all incoming connections
      • Per-flow congestion control is performed independently, in slotted time on the scale of the RTT
      • Receive-window adjustment is based on the ratio of the difference between measured and expected throughput to the expected throughput

Design considerations

    • The receiver knows how much throughput it achieves and how much bandwidth is available
    • An overly controlled window mechanism may constrain TCP performance, while a loosely controlled one does not prevent incast congestion
    • Only low-latency flows (RTT less than 2 ms) are considered
    • Receive-window increases are determined by the available bandwidth
    • Receive-window-based congestion control should operate per flow
    • The receive-window-based scheme should adjust the window according to both link congestion and application requirements
ICTCP Algorithm
  • Control trigger: available bandwidth
    • Calculate the available bandwidth
    • Estimate the potential per-flow throughput increase before increasing the receive window
    • Time is divided into slots, each consisting of two sub-slots
    • For each network interface, measure the available bandwidth in the first sub-slot and spend the resulting quota on window increases in the second sub-slot
    • Ensure the total receive-window increase stays within the available bandwidth measured in the first sub-slot
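The quota computation can be sketched in one line. The `BW_A = max(0, α·C − BW_T)` form with α = 0.9 follows the ICTCP paper; the function name and the bytes-per-second units are my choices for illustration.

```python
# Available-bandwidth quota measured in the first sub-slot; window
# increases granted in the second sub-slot draw this quota down.
def available_bw(capacity, measured_traffic, alpha=0.9):
    """BW_A = max(0, alpha*C - BW_T): spare capacity on the interface,
    keeping a (1 - alpha) headroom to absorb bursts."""
    return max(0.0, alpha * capacity - measured_traffic)

# Example: a 1 Gbps interface (1.25e8 bytes/sec) currently carrying
# 1.0e8 bytes/sec of incoming traffic.
quota = available_bw(capacity=1.25e8, measured_traffic=1.0e8)
```

The α < 1 headroom is the design point: if increases were granted up to full capacity, a short burst would already overflow the bottleneck buffer before the next measurement sub-slot.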
ICTCP Algorithm
  • Per-connection control interval: 2×RTT
    • The shortest timescale on which a connection's throughput can be estimated for receive-window adjustment is one RTT
    • ICTCP therefore uses a control interval of 2×RTT per connection:
      • one RTT of latency for the adjusted window to take effect
      • one additional RTT to measure throughput with the newly adjusted window
    • For any TCP connection, if the current time falls in the second global sub-slot and more than 2×RTT has elapsed since its last receive-window adjustment, the window may be increased based on the newly observed throughput and the current available bandwidth.
ICTCP Algorithm
  • Window adjustment on a single connection
    • The receive window is adjusted based on the connection's measured incoming throughput
      • Measured throughput reflects the application's current demand on that TCP connection
      • Expected throughput is what the connection could achieve if it were constrained only by the receive window
    • Define the throughput-difference ratio d = (expected − measured) / expected
    • Adjust the receive window according to three conditions:
      • If d is small (the window is the bottleneck), increase the receive window by one MSS, provided the current time is in the second global sub-slot and there is enough quota of available bandwidth on the interface; decrease the quota correspondingly.
      • If d is large for three continuous RTTs, decrease the receive window by one MSS; the minimum receive window is 2×MSS.
      • Otherwise, keep the current receive window.
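A hedged sketch of this per-connection rule follows. The thresholds 0.1 and 0.5 correspond to the paper's γ₁ and γ₂; the quota bookkeeping and sub-slot timing checks are simplified away, so this is the decision logic only, not a full implementation.

```python
MSS = 1460  # bytes; typical Ethernet MSS

def adjust_rwnd(rwnd, b_measured, b_expected, quota, streak):
    """One ICTCP-style adjustment step for a single connection.
    Returns (new_rwnd, new_quota, new_decrease_streak)."""
    d = (b_expected - b_measured) / b_expected  # throughput-difference ratio
    if d <= 0.1 and quota >= MSS:
        # Window is the bottleneck: grow by one MSS, charge the quota.
        return rwnd + MSS, quota - MSS, 0
    if d > 0.5:
        # Application needs far less than the window allows; shrink only
        # after the condition holds for three continuous RTTs, and never
        # below the 2*MSS floor.
        if streak + 1 >= 3:
            return max(rwnd - MSS, 2 * MSS), quota, 0
        return rwnd, quota, streak + 1
    return rwnd, quota, 0  # in between: keep the current window
```

The asymmetry is deliberate: increases are gated by the shared bandwidth quota, while decreases require persistence over three RTTs so a transient dip in demand does not shrink the window.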
ICTCP Algorithm
  • Fairness controller for multiple connections
    • Fairness is considered only for low-latency flows
    • Windows are decreased for fairness only when BWA < 0.2·C
    • For a window decrease, the receive window is cut by one MSS on selected TCP connections
      • Select connections whose receive window is larger than the average window across all connections
    • For a window increase, fairness is achieved automatically by the window-adjustment rule
ICTCP Experimental Results
  • Testbed
    • 47 servers
    • 1 LB4G 48-port Gigabit Ethernet switch
    • Gigabit Ethernet Broadcom NIC at the hosts
    • Windows Server 2008 R2 Enterprise 64-bit version
Issues with ICTCP
  • ICTCP's scalability to a large number of TCP connections is an issue: the per-connection receive window would have to drop below 1 MSS, degrading TCP performance
  • Extending ICTCP to handle congestion in general cases, where sender and receiver are not under the same switch and the bottleneck link is not the last hop to the receiver, remains open
  • So does adapting ICTCP to future high-bandwidth, low-latency networks
DCTCP – Data Center TCP [6]
  • Features
    • A TCP-like protocol for data centers
    • Uses ECN (Explicit Congestion Notification) to provide multi-bit feedback to the end hosts
    • The claim is that DCTCP provides better throughput than TCP while using 90% less buffer space
    • Provides high burst tolerance and low latency for short flows
    • Can also handle a 10× increase in foreground and background traffic without a significant performance hit
  • Overview
    • Applications in data centers largely require:
      • low latency for short flows
      • high burst tolerance
      • high utilization for long flows
    • Short flows have real-time deadlines of roughly 10–100 ms
    • High utilization for long flows is essential, since these flows continuously refresh large internal data structures
    • The study analyzed production traffic from approximately 6,000 servers, about 150 TB of traffic over a one-month period
    • Query traffic (responses of 2 KB to 20 KB) experiences the incast impairment
  • Overview (contd.)
    • The proposed DCTCP uses the ECN capability available in most modern switches
    • Derives multi-bit feedback on congestion from the single-bit stream of ECN marks
    • The essence of the proposal is to keep switch buffer occupancies persistently low while maintaining high throughput
    • To control queue length at switches, it takes an Active Queue Management (AQM) approach that uses explicit feedback from congested switches
    • The claim is that only about 30 lines of code changed in TCP, plus the setting of a single parameter on switches, are needed
    • DCTCP focuses on three problems:
      • incast
      • queue buildup
      • buffer pressure


  • Algorithm
    • Concentrates on the extent of congestion rather than just its presence
    • Derives multi-bit feedback from a single-bit sequence of marks
    • Three components:
      • simple marking at the switch
      • ECN-echo at the receiver
      • a controller at the sender
  • Simple marking at the switch
    • An arriving packet is marked with the CE (Congestion Experienced) codepoint if the queue occupancy is greater than K (the marking threshold)
    • Marking is based on the instantaneous queue length, not an average
  • ECN-Echo at the receiver
    • In standard TCP, the receiver sets ECN-Echo on all ACKs until it receives CWR from the sender
    • A DCTCP receiver sets ECN-Echo only for packets that carry the CE codepoint
  • Controller at the sender
    • The sender maintains α, an estimate of the fraction of packets that are marked, updated once per window of data: α ← (1 − g)·α + g·F, where F is the fraction of marked packets in the last window and g is a fixed gain
    • α close to 0 indicates low congestion; α close to 1 indicates high congestion
    • While TCP cuts its window by half, DCTCP uses α to scale the cut:
      • cwnd ← cwnd × (1 − α/2)
  • Modeling the queue when the window reaches W* (the critical point at which the queue hits the marking threshold K)
  • The maximum queue size Q_max depends on the number of synchronously sending servers N
  • A lower bound for K can be derived: K > C × RTT / 7 (with C in packets per second)
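A worked instance of that bound, assuming the K > C·RTT/7 guideline from the DCTCP paper with C in packets per second (1500-byte packets, a 1 Gbps port, and a 100 µs RTT are all illustrative numbers, not from the slides):

```python
# Lower bound on the switch marking threshold K, in packets.
C_pkts = 1e9 / (1500 * 8)   # ~83,333 packets/sec on a 1 Gbps link
RTT = 100e-6                # assumed datacenter RTT: 100 microseconds
K_min = C_pkts * RTT / 7    # K must exceed this to avoid queue underflow
```

At these speeds the bound is only a couple of packets, which is why a single small threshold setting on the switch suffices at 1 Gbps; the bound grows linearly with line rate.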
  • How does DCTCP solve incast?
    • TCP suffers from timeouts when N > 10
    • DCTCP senders receive ECN marks early and slow their rate
    • DCTCP still suffers timeouts when N is large enough to overwhelm the static buffer
    • The remedy is dynamic buffering
Outline
  • TCP Incast - Problem Description
  • Motivation and challenges
  • Proposed Solutions
  • Evaluation of proposed solutions
  • Conclusion
  • References
Evaluation of Proposed Solutions
  • Application-level solution
    • Genuine retransmissions → cascading timeouts → congestion
    • Scheduling at the application level cannot be easily synchronized
    • Limited control over the transport layer
  • ICTCP: a solution that needs minimal change and is cost-effective
    • Scalability to a large number of TCP connections is an issue
    • Extending ICTCP to handle congestion in general cases has only a limited solution
    • ICTCP on future high-bandwidth, low-latency networks will need extra support from link-layer technologies
  • DCTCP: a solution that needs minimal change but requires switch support
    • DCTCP requires dynamic buffering for larger numbers of senders
Conclusion
  • No solution completely solves the problem, other than configuring a smaller RTOmin
  • Existing solutions pay little attention to foreground and background traffic together
  • We need solutions that are cost-effective, require minimal changes to the environment, and, of course, solve incast!
References
  • [1] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," in Proc. of ACM WREN, 2009.
  • [2] S. Kulkarni and P. Agrawal, "A Probabilistic Approach to Address TCP Incast in Data Center Networks," Distributed Computing Systems Workshops (ICDCSW), 2011.
  • [3] Peng Zhang, Hongbo Wang, and Shiduan Cheng, "Shrinking MTU to Mitigate TCP Incast Throughput Collapse in Data Center Networks," Communications and Mobile Computing (CMC), 2011.
  • [4] Yan Zhang and N. Ansari, "On Mitigating TCP Incast in Data Center Networks," IEEE INFOCOM, 2011.
  • [5] Maxim Podlesny and Carey Williamson, "An Application-Level Solution for the TCP-Incast Problem in Data Center Networks," IWQoS '11: Proceedings of the 19th International Workshop on Quality of Service, IEEE, June 2011.
  • [6] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan, "Data Center TCP (DCTCP)," SIGCOMM '10: Proceedings of ACM SIGCOMM, August 2010.
  • [7] Hongyun Zheng, Changjia Chen, and Chunming Qiao, "Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers," Communications and Mobile Computing (CMC), 2011.
  • [8] Haitao Wu, Zhenqian Feng, Chuanxiong Guo, and Yongguang Zhang, "ICTCP: Incast Congestion Control for TCP in Data Center Networks," Co-NEXT '10: Proceedings of the 6th International Conference, ACM, November 2010.