
Curbing Delays in Datacenters: Need Time to Save Time?

Mohammad Alizadeh

Sachin Katti, Balaji Prabhakar

Insieme Networks / Stanford University

Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency.

Datacenter Networks
  • Message latency is King: need very high throughput and very low latency

[Figure: datacenter fabric with 10-40 Gbps links, 1-5 μs latency, and 1000s of server ports, carrying web, app, cache, db, map-reduce, HPC, and monitoring traffic]

Transport in Datacenters
  • TCP widely used, but has poor performance
    • Buffer hungry: adds significant queuing latency

[Figure: queuing latency scale: TCP ~1-10 ms, DCTCP ~100 μs, target ~zero latency; baseline fabric latency is 1-5 μs. How do we get here?]

Reducing Queuing: DCTCP vs TCP

Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB.

[Figure: queue occupancy (KBytes) over time for senders S1...Sn sharing the switch, TCP vs. DCTCP]
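
For reference, the DCTCP reaction that produces the low queue occupancy in this experiment scales the window back in proportion to the fraction of ECN-marked packets instead of halving on every mark. Below is a minimal Python sketch of that sender-side logic (the class name and numbers are illustrative; g = 1/16 is the usual DCTCP gain, not something stated on this slide):

```python
# Sketch of a DCTCP-style sender: estimate the fraction of ECN-marked packets
# with an EWMA (alpha), then cut the window by alpha/2 instead of 1/2.

class DctcpSender:
    def __init__(self, cwnd_pkts=10.0, g=1.0 / 16):
        self.cwnd = cwnd_pkts    # congestion window, in packets
        self.alpha = 0.0         # running estimate of the marked fraction
        self.g = g               # EWMA gain

    def on_window_acked(self, acked_pkts, marked_pkts):
        """Call once per window of ACKs with the count of ECN-marked ACKs."""
        frac = marked_pkts / max(acked_pkts, 1)
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked_pkts > 0:
            # Gentle, proportional backoff keeps the queue near the marking threshold.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1       # additive increase, as in TCP

s = DctcpSender()
s.on_window_acked(acked_pkts=10, marked_pkts=3)
print(f"cwnd={s.cwnd:.2f}, alpha={s.alpha:.3f}")
```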

Towards Zero Queuing

[Figure: senders S1...Sn sending through switches that ECN-mark at 90% utilization]

Towards Zero Queuing (cont.)

ns2 simulation: 10 DCTCP flows, 10 Gbps switch, ECN marking at 9 Gbps (90% utilization).

[Figure: achieved throughput vs. target throughput; latency floor ≈ 23 μs]

Window-based Rate Control

[Figure: one sender and one receiver; C = 1, Cwnd = 1]

RTT = 10 → C×RTT = 10 pkts
Throughput = 1/RTT = 10%

Window-based Rate Control

[Figure: one sender and one receiver; C = 1, Cwnd = 1]

RTT = 2 → C×RTT = 2 pkts
Throughput = 1/RTT = 50%

Window-based Rate Control

[Figure: one sender and one receiver; C = 1, Cwnd = 1]

RTT = 1.01 → C×RTT = 1.01 pkts
Throughput = 1/RTT = 99%

Window-based Rate Control

[Figure: Sender 1 and Sender 2 share one receiver; C = 1, Cwnd = 1 at each sender]

RTT = 1.01 → C×RTT = 1.01 pkts

As propagation time → 0: queue buildup is unavoidable.
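
The pattern across these four slides follows from one expression: a window of Cwnd packets can deliver at most Cwnd packets per RTT, so utilization is min(1, Cwnd / (C×RTT)). A small sketch reproducing the slide numbers (function and variable names are mine):

```python
# Window-limited throughput: with Cwnd packets in flight, a sender delivers at
# most Cwnd packets per RTT, so utilization = min(1, Cwnd / (C * RTT)).

def utilization(cwnd_pkts, capacity, rtt):
    bdp = capacity * rtt                 # bandwidth-delay product, in packets
    return min(1.0, cwnd_pkts / bdp)

for rtt in (10, 2, 1.01):
    u = utilization(cwnd_pkts=1, capacity=1, rtt=rtt)
    print(f"RTT = {rtt:>5}: throughput = {u:.0%}")

# RTT =    10: throughput = 10%
# RTT =     2: throughput = 50%
# RTT =  1.01: throughput = 99%
```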

So What?

Window-based RC needs lag in the loop

Near-zero latency transport must:

  • Use timer-based rate control / pacing
  • Use small packet size
  • Both increase CPU overhead (not practical in software)
  • Possible in hardware, but complex (e.g., HULL NSDI’12)

Or…

Change the Problem!

Changing the Problem…

[Figure: two switch ports, one with a FIFO queue and one with a priority queue, each holding packets with priorities such as 1, 3, 4, 5, 7, 9]

  • FIFO queue: queue buildup is costly → need precise rate control
  • Priority queue: queue buildup is irrelevant → coarse rate control is OK (see the sketch below)
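
To make the contrast concrete, here is a small hedged example (the backlog below is invented): a 1-packet mouse flow arrives behind two larger flows. In a FIFO queue it departs last; in a priority queue keyed on remaining flow size it departs first, so the buildup does not hurt it.

```python
import heapq

# Each entry is one queued packet: (priority = remaining flow size, flow label).
# A latency-sensitive "mouse" packet arrives behind two elephant packets.
backlog = [(9, "elephant-A"), (7, "elephant-B"), (1, "mouse")]

def drain_fifo(pkts):
    """Serve packets in arrival order; return each flow's departure time."""
    done = {}
    for t, (prio, flow) in enumerate(pkts, start=1):
        done[flow] = t
    return done

def drain_priority(pkts):
    """Serve the smallest priority number (smallest remaining flow) first."""
    heap = list(pkts)
    heapq.heapify(heap)
    done, t = {}, 0
    while heap:
        prio, flow = heapq.heappop(heap)
        t += 1
        done[flow] = t
    return done

print("FIFO    :", drain_fifo(backlog))      # mouse departs last (t = 3)
print("Priority:", drain_priority(backlog))  # mouse departs first (t = 1)
```
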
DC Fabric: Just a Giant Switch

[Figure: hosts H1-H9 on the TX side, the datacenter fabric in the middle, and hosts H1-H9 on the RX side]

DC Fabric: Just a Giant Switch (cont.)

[Figure: the fabric abstracted as a single giant switch with TX ports H1-H9 and RX ports H1-H9]

DC transport = flow scheduling on a giant switch

  • Objective: minimize average flow completion time (FCT)

[Figure: the giant switch with TX ports H1-H9 and RX ports H1-H9, subject to ingress & egress capacity constraints]

“Ideal” Flow Scheduling

Problem is NP-hard [Bar-Noy et al.]

  • Simple greedy algorithm: 2-approximation (sketched below)

[Figure: flows matched between ingress ports 1-3 and egress ports 1-3]
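
A hedged sketch of one plausible greedy scheduler in this spirit (my reconstruction, not code from the talk): in each time unit, walk the flows in order of smallest remaining size and start every flow whose ingress and egress ports are both still free, then repeat until all flows finish.

```python
# Greedy flow scheduling on the "giant switch" abstraction: each ingress and
# egress port serves at most one flow at a time; shortest remaining flow wins.

def greedy_schedule(flows):
    """flows: {(src_port, dst_port): remaining_size}; returns finish time per flow."""
    flows = dict(flows)
    t, finish = 0, {}
    while flows:
        busy_src, busy_dst, running = set(), set(), []
        # Consider flows shortest-remaining-first (the greedy rule).
        for (src, dst), size in sorted(flows.items(), key=lambda kv: kv[1]):
            if src not in busy_src and dst not in busy_dst:
                busy_src.add(src)
                busy_dst.add(dst)
                running.append((src, dst))
        t += 1                                   # run the chosen flows for one unit
        for key in running:
            flows[key] -= 1
            if flows[key] == 0:
                finish[key] = t
                del flows[key]
    return finish

# Illustrative demand matrix (ports and sizes are made up).
demo = {(1, 1): 2, (2, 1): 3, (2, 2): 1, (3, 3): 4}
print(greedy_schedule(demo))   # e.g. {(2, 2): 1, (1, 1): 2, (3, 3): 4, (2, 1): 5}
```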

pFabric in 1 Slide

Packets carry a single priority #

  • e.g., prio = remaining flow size

pFabric Switches

  • Very small buffers (~10-20 pkts for a 10 Gbps fabric)
  • Send highest priority / drop lowest priority pkts

pFabric Hosts

  • Send/retransmit aggressively
  • Minimal rate control: just prevent congestion collapse
Key Idea

Decouple flow scheduling from rate control

  • Switches implement flow scheduling via local mechanisms
    • Queue buildup does not hurt performance → window-based rate control is OK
  • Hosts use simple window-based rate control (≈ TCP) to avoid high packet loss

[Figure: hosts H1-H9 connected through the fabric]

pFabric Switch
  • Priority scheduling: send the highest-priority packet first
  • Priority dropping: drop the lowest-priority packets first
  • Small “bag” of packets per port; prio = remaining flow size (see the sketch below)

[Figure: a switch port, fed by hosts H1-H9, holding a small bag of packets with priorities such as 1, 2, 3, 4, 5, 6, 7, 9]
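
A minimal sketch of the per-port behavior described on this slide (assumption: the buffer is modeled as a plain Python list and ties are ignored; a real switch works on a tiny on-chip packet store):

```python
# pFabric-style port: tiny bounded buffer, priority = remaining flow size
# (a smaller number means higher priority).
# Enqueue: if the buffer overflows, drop the worst-priority packet.
# Dequeue: transmit the best-priority packet.

class PFabricPort:
    def __init__(self, capacity_pkts=12):        # ~10-20 pkts per the earlier slide
        self.capacity = capacity_pkts
        self.buf = []                             # list of (prio, packet_id)

    def enqueue(self, prio, packet_id):
        self.buf.append((prio, packet_id))
        if len(self.buf) > self.capacity:
            worst = max(self.buf)                 # largest remaining size = lowest priority
            self.buf.remove(worst)                # may be the arriving packet itself
            return worst                          # report the dropped packet
        return None

    def dequeue(self):
        if not self.buf:
            return None
        best = min(self.buf)                      # smallest remaining size = highest priority
        self.buf.remove(best)
        return best

port = PFabricPort(capacity_pkts=3)
for prio in (9, 4, 7, 1):                         # fourth arrival overflows the buffer
    dropped = port.enqueue(prio, packet_id=f"pkt-prio-{prio}")
print("dropped:", dropped)                        # (9, 'pkt-prio-9') is pushed out
print("sent   :", port.dequeue())                 # (1, 'pkt-prio-1') goes first
```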

pFabric Switch Complexity
  • Buffers are very small (~2×BDP per port)
    • e.g., C = 10 Gbps, RTT = 15 µs → buffer ~ 30 KB
    • Today’s switch buffers are 10-30x larger

Priority Scheduling/Dropping

    • Worst case: minimum-size packets (64 B) → 51.2 ns to find the min/max of ~600 numbers
    • Binary comparator tree: 10 clock cycles
    • Current ASICs: clock ~ 1 ns
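
A quick back-of-the-envelope check of the numbers on this slide, using Python as a calculator (the 64 B worst case, the ~600 entries, and the ~1 ns clock come from the slide; the rest is arithmetic):

```python
import math

link_bps  = 10e9      # 10 Gbps port
pkt_bytes = 64        # minimum-size packet: the worst case for decision rate
entries   = 600       # ~600 buffered packets to compare (from the slide)

time_budget_ns = pkt_bytes * 8 / link_bps * 1e9   # serialization time of one 64 B packet
tree_depth     = math.ceil(math.log2(entries))    # levels in a binary comparator tree

print(f"time budget per decision: {time_budget_ns:.1f} ns")     # 51.2 ns
print(f"comparator tree depth   : {tree_depth} levels ~ {tree_depth} cycles at ~1 ns/cycle")
```
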
Why does this work?

Invariant for ideal scheduling:

At any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch.

  • Priority scheduling
    • High priority packets traverse fabric as quickly as possible
  • What about dropped packets?
    • Lowest priority → not needed till all other packets depart
    • Buffer > BDP → enough time (> RTT) to retransmit
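
A quick check of that last point with the numbers from the complexity slide (C = 10 Gbps, RTT = 15 µs, buffer ≈ 30 KB ≈ 2×BDP):

```python
link_bps  = 10e9          # 10 Gbps
rtt_s     = 15e-6         # 15 microseconds
buf_bytes = 30 * 1024     # ~30 KB, roughly 2 x BDP

drain_s = buf_bytes * 8 / link_bps   # time to drain a full buffer at line rate
print(f"buffer drain time ~ {drain_s * 1e6:.1f} us vs. RTT = {rtt_s * 1e6:.0f} us")
# Drain time (~24.6 us) exceeds the RTT (15 us), so a dropped lowest-priority
# packet can be retransmitted before every higher-priority packet has departed.
```
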
Evaluation (144-port fabric; Search traffic pattern)
  • Recall: “Ideal” is REALLY idealized!
      • Centralized with full view of flows
      • No rate-control dynamics
      • No buffering
      • No pkt drops
      • No load-balancing inefficiency
Mice FCT (<100 KB)

[Figure: normalized FCT for mice flows, average and 99th percentile]
Conclusion
  • Window-based rate control does not work at near-zero round-trip latency
  • pFabric: simple, yet near-optimal
    • Decouples flow scheduling from rate control
    • Allows use of coarse window-based rate control
  • pFabric is within 10-15% of “ideal” for realistic DC workloads (SIGCOMM’13)