Presentation Transcript

High-Fidelity Latency Measurements in Low-Latency Networks

Ramana Rao Kompella

Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)


Low Latency Applications

  • Many important data center applications require low end-to-end latencies (microseconds)

    • High Performance Computing – lose parallelism

    • Cluster Computing, Storage – lose performance

    • Automated Trading – lose arbitrage opportunities



Low Latency Applications

  • Many important data center applications require low end-to-end latencies (microseconds)

    • High Performance Computing – lose parallelism

    • Cluster Computing, Storage – lose performance

    • Automated Trading – lose arbitrage opportunities

  • Cloud applications

    • Recommendation Systems, Social Collaboration

    • All-up SLAs of 200 ms [Alizadeh Sigcomm10]

    • SLAs include back-end computation time, so network latencies get only a small slice of the budget



Latency Measurements are Needed

[Figure: a flow crosses ToR switches, edge routers, and core routers; one hop adds ~1 ms of latency. Which router causes the problem?]

Measurement within a router is necessary

At every router, high-fidelity measurements are critical to localize root causes

Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis

Vision: Knowledge Plane

[Figure: latency measurements are pushed from, or pulled out of, the data center network into a Knowledge Plane; its query interface answers queries from SLA diagnosis, routing/traffic engineering, and scheduling/job placement.]


Contributions Thus Far…

  • Aggregate Latency Estimation

    • Lossy Difference Aggregator – Sigcomm 2009

    • FineComb – Sigmetrics 2011

    • mPlane – ReArch 2009

  • Differentiated Latency Estimation

    • Multiflow Estimator – Infocom 2010

    • Reference Latency Interpolation – Sigcomm 2010

    • RLI across Routers – Hot-ICE 2011

    • Delay Sketching – (under review at Sigcomm 2011)

  • Scalable Query Interface

    • MAPLE – (under review at Sigcomm 2011)

Per-flow latency measurements at every hop

Per-Packet Latency Measurements



1) Per-Flow Measurements WITH REFERENCE LATENCY INTERPOLATION [Sigcomm 2010]



Obtaining Fine-Grained Measurements

  • Native router support: SNMP, NetFlow

    • No latency measurements

  • Active probes and tomography

    • Too many probes (~10,000 Hz) required, wasting bandwidth

  • Use expensive high-fidelity measurement boxes

    • London Stock Exchange uses Corvil boxes

    • Cannot place them ubiquitously

  • Recent work: LDA [Kompella09Sigcomm]

    • Computes average latency/variance accurately within a switch

    • Provides a good start but may not be sufficient to diagnose flow-specific problems



From Aggregates to Per-Flow

[Figure: packets of different flows share a switch queue over a measurement interval; some see small delays and others large delays, so a single average latency hides the difference.]

  • Observation: Significant amount of difference in average latencies across flows at a router

  • Goal of this paper: How to obtain per-flow latency measurements in a scalable fashion?



Measurement Model

[Figure: packets enter the router at ingress interface I and leave at egress interface E.]

  • Assumption: Time synchronization between router interfaces

  • Constraint: Cannot modify regular packets to carry timestamps

    • Requires intrusive changes to the router's forwarding path



Naïve Approach

[Figure: ingress I and egress E each record per-packet timestamps; subtracting and summing them per flow gives, e.g., total delays of 22 and 32 over two packets, i.e., avg. delay = 22/2 = 11 and avg. delay = 32/2 = 16.]

  • For each flow key,

    • Store timestamps for each packet at I and E

    • After a flow stops sending, I sends the packet timestamps to E

    • E computes individual packet delays

    • E aggregates average latency, variance, etc., for each flow (see the sketch after this list)

  • Problem: High communication costs

    • At 10 Gbps, a few million packets per second

    • Sampling reduces communication, but also reduces accuracy
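To make the communication cost concrete, here is a minimal Python sketch of the naive scheme; the record format, flow keys, and the timestamps in the example are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the naive approach. The record format (flow_key, packet_id) ->
# timestamp and the example numbers are illustrative, not from the paper.
from collections import defaultdict

def naive_per_flow_latency(ingress_records, egress_records):
    """Both arguments map (flow_key, packet_id) -> timestamp at I and E."""
    sums, counts = defaultdict(float), defaultdict(int)
    # Every ingress timestamp must be shipped to E, where per-packet delays are computed.
    for key, t_in in ingress_records.items():
        t_out = egress_records.get(key)
        if t_out is None:
            continue            # packet never seen at egress (e.g., dropped)
        flow = key[0]
        sums[flow] += t_out - t_in
        counts[flow] += 1
    return {flow: sums[flow] / counts[flow] for flow in counts}

# Two 2-packet flows whose total delays are 22 and 32, as in the slide's example.
ingress = {("f1", 1): 10, ("f1", 2): 15, ("f2", 1): 13, ("f2", 2): 18}
egress  = {("f1", 1): 20, ("f1", 2): 27, ("f2", 1): 30, ("f2", 2): 33}
print(naive_per_flow_latency(ingress, egress))   # {'f1': 11.0, 'f2': 16.0}
```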



A (Naïve) Extension of LDA

[Figure: ingress I and egress E maintain one LDA per flow of interest, each holding a packet count and a sum of timestamps; coordinating the corresponding LDAs yields per-flow latency.]

  • Maintaining LDAs with many counters for flows of interest

  • Problem: (Potentially) high communication costs

    • Proportional to the number of flows



Key Observation: Delay Locality

[Figure: packet delays D1, D2, D3 over time, with WD1, WD2, WD3 the average delays of the time windows containing them.]

True mean delay = (D1 + D2 + D3) / 3

Localized mean delay = (WD1 + WD2 + WD3) / 3

How close is the localized mean delay to the true mean delay as the window size varies?
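A minimal sketch of that comparison for a single flow, with hypothetical packets and a fixed window size (the paper evaluates this on real router traces):

```python
# Sketch of the delay-locality comparison for one flow. The packet samples and
# window size are hypothetical; the paper's evaluation uses real router traces.
def locality_check(all_packets, flow, window):
    """all_packets: list of (arrival_time, flow_key, delay) for ALL traffic."""
    # Window averages are computed over every packet that shares the window...
    windows = {}
    for t, _, d in all_packets:
        windows.setdefault(int(t // window), []).append(d)
    window_avg = {w: sum(ds) / len(ds) for w, ds in windows.items()}
    # ...while the flow's true mean uses only its own packets' delays.
    mine = [(t, d) for t, f, d in all_packets if f == flow]
    true_mean = sum(d for _, d in mine) / len(mine)
    localized_mean = sum(window_avg[int(t // window)] for t, _ in mine) / len(mine)
    return true_mean, localized_mean

pkts = [(0.1, "f1", 5.0), (0.2, "f2", 9.0), (0.9, "f1", 7.0),
        (1.2, "f1", 20.0), (1.4, "f2", 24.0)]
print(locality_check(pkts, "f1", window=1.0))   # (~10.7, 12.0): closer as windows shrink
```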


Key Observation: Delay Locality

[Figure: localized mean delay per key vs. true mean delay per key, on data sets from a real router and synthetic queueing models; the points hug the diagonal far more closely than the global mean does, with RMSRE = 1.72 for a 1 s window, 0.16 for 10 ms, and 0.054 for 0.1 ms.]


Exploiting Delay Locality

[Figure: reference packets carrying an ingress timestamp are injected into the packet stream at regular points in time; their delays anchor the delays of nearby regular packets.]

    • Reference packets are injected regularly at the ingress I

      • Special packets carrying ingress timestamp

      • Provides some reference delay values (substitute for window averages)

      • Used to approximate the latencies of regular packets



RLI Architecture

[Figure: at ingress I, (1) a Reference Packet Generator injects timestamped reference packets R among the regular packets 1, 2, 3; at egress E, (2) a Latency Estimator uses them to estimate the regular packets' latencies.]

    • Component 1: Reference Packet generator

    • Injects reference packets regularly

    • Component 2: Latency Estimator

    • Estimates packet latencies and updates per-flow statistics

    • Estimates directly at the egress with no extra state maintained at ingress side (reduces storage and communication overheads)



Component 1: Reference Packet Generator

    • Question: When to inject a reference packet?

    • Idea 1: 1-in-n: Inject one reference packet every n packets

      • Problem: low accuracy under low utilization

    • Idea 2: 1-in-τ: Inject one reference packet every τ seconds

      • Problem: bad in cases where short-term delay variance is high

    • Our approach: Dynamic injection based on utilization

      • High utilization → low injection rate

      • Low utilization → high injection rate

      • Adaptive scheme works better than fixed-rate schemes

      • Details in the paper (a rough sketch follows below)
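A rough sketch of utilization-driven injection; the linear mapping from utilization to injection probability is an illustrative assumption, not the paper's exact rule:

```python
# Sketch of utilization-driven reference-packet injection. The linear mapping from
# link utilization to injection probability is illustrative, not the paper's rule.
import random

def injection_probability(utilization, p_max=0.01, p_min=0.001):
    """Higher utilization -> lower injection rate, so reference traffic backs off
    exactly when the link and its queue are already under pressure."""
    utilization = min(max(utilization, 0.0), 1.0)
    return p_max - (p_max - p_min) * utilization

def maybe_inject_reference(utilization):
    return random.random() < injection_probability(utilization)

print(injection_probability(0.1))   # lightly loaded link: inject often
print(injection_probability(0.9))   # heavily loaded link: inject rarely
```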



Component 2: Latency Estimator

[Figure: the left and right reference packets, whose arrival times and delays are known, define a linear interpolation line; each regular packet, whose arrival time is known, gets the interpolated delay on that line as its estimated delay, with some error relative to its true delay.]

    • Question 1: How to estimate latencies using reference packets? (estimator sketch after this list)

      • Solution: Different estimators possible

        • Use only the delay of a left reference packet (RLI-L)

        • Use linear interpolation of left and right reference packets (RLI)

        • Other non-linear estimators possible (e.g., shrinkage)
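A minimal sketch of the RLI-L and linear-interpolation (RLI) estimators described above; variable names are mine:

```python
# Sketch of the two estimators: RLI-L uses only the left reference packet's delay,
# RLI linearly interpolates between the left and right reference packets.
def rli_l(left_ref, pkt_time):
    """left_ref: (arrival_time, delay) of the most recent reference packet."""
    return left_ref[1]

def rli_interpolate(left_ref, right_ref, pkt_time):
    (t_l, d_l), (t_r, d_r) = left_ref, right_ref
    if t_r == t_l:
        return d_l
    alpha = (pkt_time - t_l) / (t_r - t_l)     # packet's position between references
    return d_l + alpha * (d_r - d_l)

print(rli_interpolate((0.0, 10.0), (1.0, 30.0), pkt_time=0.25))   # 15.0
```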



Component 2: Latency Estimator

[Figure: regular packets wait in an interpolation buffer keyed by flow; when the right reference packet arrives, each buffered packet's delay is estimated and used to update the per-flow counters (packet count, sum of delays, sum of squared delays). Any flow-selection strategy can decide which flows get counters, and when a flow is exported its average latency is C2 / C1.]

    • Question 2: How to compute per-flow latency statistics? (see the counter sketch after this list)

    • Solution: Maintain 3 counters per flow at the egress side

      • C1: Number of packets

      • C2: Sum of packet delays

      • C3: Sum of squares of packet delays (for estimating variance)

      • To minimize state, can use any flow selection strategy to maintain counters for only a subset of flows
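A minimal sketch of the three per-flow counters and the statistics they yield, leaving the flow-selection policy out:

```python
# Sketch of the three per-flow counters kept at egress:
# C1 = packet count, C2 = sum of estimated delays, C3 = sum of squared delays.
from collections import defaultdict

counters = defaultdict(lambda: [0, 0.0, 0.0])      # flow_key -> [C1, C2, C3]

def update(flow_key, est_delay):
    c = counters[flow_key]
    c[0] += 1
    c[1] += est_delay
    c[2] += est_delay * est_delay

def export(flow_key):
    c1, c2, c3 = counters.pop(flow_key)
    mean = c2 / c1
    variance = c3 / c1 - mean * mean               # E[d^2] - (E[d])^2
    return mean, variance

update("f1", 4.0)
update("f1", 5.0)
print(export("f1"))                                # (4.5, 0.25)
```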



Experimental Setup

    • Data sets

      • No public data center traces with timestamps

      • Real router traces with synthetic workloads: WISC

      • Real backbone traces with synthetic queueing: CHIC and SANJ

    • Simulation tool: Open source NetFlow software – YAF

      • Supports reference packet injection mechanism

      • Simulates a queueing model with RED active queue management policy

    • Experiments with different link utilizations



Accuracy under High Link Utilization

[Figure: CDF of the relative error of per-flow latency estimates; the median relative error is 10-12%.]


Comparison with Other Solutions

[Figure: average relative error vs. link utilization, with a packet sampling rate of 0.1%; RLI's error is 1-2 orders of magnitude lower than the other solutions'.]


Overhead of RLI

    • Bandwidth overhead is low

      • less than 0.2% of link capacity

    • Impact to packet loss is small

      • Packet loss difference with and without RLI is at most 0.001% at around 80% utilization



Summary

    A scalable architecture to obtain high-fidelity per-flow latency measurements between router interfaces

    Achieves a median relative error of 10-12%

    Obtains 1-2 orders of magnitude lower relative error compared to existing solutions

    Measurements are obtained directly at the egress side



Contributions Thus Far…

    • Aggregate Latency Estimation

      • Lossy Difference Aggregator – Sigcomm 2009

      • FineComb – Sigmetrics 2011

      • mPlane – ReArch 2009

    • Differentiated Latency Estimation

      • Multiflow Estimator – Infocom 2010

      • Reference Latency Interpolation – Sigcomm 2010

      • RLI across Routers – Hot-ICE 2011

      • Virtual LDA – (under review at Sigcomm 2011)

    • Scalable Query Interface

      • MAPLE – (under review at Sigcomm 2011)



    2) Scalable PER-PACKET LATENCY MEASUREMENT ARCHITECTURE (Under REVIEW at SIGCOMM 2011)



MAPLE Motivation

    • LDA and RLI are ossified at their aggregation level

    • Not suitable for obtaining arbitrary sub-population statistics

      • Single packet delay may be important

    • Key Goal: How to enable a flexible and scalable architecture for packet latencies?



MAPLE Architecture

    • Timestamping not strictly required

      • Can work with RLI estimated latencies

[Figure: at each router, a Timestamp Unit records packet P1's delay D1 at time T1 into (1) a Packet Latency Store; (2) a Query Engine answers the Central Monitor's query Q(P1) with an answer A(P1).]


Packet Latency Store (PLS)

    • Challenge: How to store packet latencies in the most efficient manner?

    • Naïve idea: Hash tables do not scale well

      • At a minimum, require label (32 bits) + timestamp (32 bits) per packet

      • To avoid collisions, need a large number of hash table entries (~147 bits/pkt for a collision rate of 1%)

    • Can we do better?



Our Approach

    • Idea 1: Cluster packets

      • Typically only a few delay values dominate

      • Cluster packets into equivalence classes

      • Associate one delay value with a cluster

      • Choose cluster centers such that error is small

    • Idea 2: Provision storage

      • Naïvely, we can use one Bloom Filter per cluster (Partitioned Bloom Filter)

      • We propose a new data structure called Shared Vector Bloom Filter (SVBF) that is more efficient



Selecting Representative Delays

    • Approach 1: Logarithmic delay selection

      • Divide delay range into logarithmic intervals

        • E.g., 0.1-10,000 μs → 0.1-1 μs, 1-10 μs, … (see the sketch after this list)

      • Simple to implement, bounded relative error, but accuracy may not be optimal

    • Approach 2: Dynamic clustering

      • k-means (medians) clustering formulation

      • Minimizes the average absolute error of packet latencies (minimizes total Euclidean distance)

    • Approach 3: Hybrid clustering

      • Split centers equally across static and dynamic

      • Best of both worlds
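A small sketch of the logarithmic (static) delay selection from Approach 1; the interval bounds and the choice of geometric midpoints as representatives are illustrative assumptions:

```python
# Sketch of static logarithmic delay selection: split [d_min, d_max) into intervals
# growing by a constant factor and use each interval's geometric midpoint as its
# representative delay. All parameters are illustrative.
import math

def log_centers(d_min=0.1, d_max=10_000.0, per_decade=1):
    """Representative delays (microseconds) for logarithmically spaced intervals."""
    n = int(round(per_decade * math.log10(d_max / d_min)))
    centers = []
    for i in range(n):
        lo = d_min * 10 ** (i / per_decade)
        hi = d_min * 10 ** ((i + 1) / per_decade)
        centers.append(math.sqrt(lo * hi))         # geometric midpoint of [lo, hi)
    return centers

def nearest_center(delay, centers):
    return min(centers, key=lambda c: abs(c - delay))

centers = log_centers()                # ~[0.32, 3.2, 32, 316, 3162] microseconds
print(nearest_center(7.5, centers))    # a 7.5 us delay maps to the 1-10 us interval
```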



K-means

    • Goal: Determine k-centers every measurement cycle

      • Can be formulated as a k-means clustering algorithm

    • Problem 1: Running k-means typically hard

      • Basic algorithm has O(n^(k+1) log n) run time

      • Heuristics (Lloyd’s algorithm) also complicated in practice

    • Solution: Sampling and streaming algorithms

      • Use sampling to reduce n to pn

      • Use a streaming k-medians algorithm (approximate but sufficient)

    • Problem 2: Can’t find centers and record membership at the same time

    • Solution: Pipelined implementation

      • Use previous interval's centers as an approximation for this interval (sketched below)
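A rough sketch of the pipelined idea; the plain 1-D k-medians below stands in for the streaming algorithm of [Charikar STOC03], and the sampling rate and k are illustrative:

```python
# Sketch of the pipelined idea: packets in one epoch are classified against centers
# computed from a sample of the previous epoch. The plain 1-D k-medians below is a
# stand-in for the streaming algorithm; the sampling rate p and k are illustrative.
import random

def kmedians_1d(values, k, iters=20):
    """Lloyd-style 1-D k-medians (illustrative, not the streaming version)."""
    centers = sorted(random.sample(values, k))
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sorted(b)[len(b) // 2] if b else centers[i]
                   for i, b in enumerate(buckets)]
    return sorted(centers)

def run_epoch(packet_delays, prev_centers, p=0.05, k=4):
    # Classify this epoch's packets using the PREVIOUS epoch's centers (pipelining)...
    assignments = [min(prev_centers, key=lambda c: abs(c - d)) for d in packet_delays]
    # ...while a small sample of this epoch feeds the next epoch's centers.
    sample = [d for d in packet_delays if random.random() < p]
    next_centers = kmedians_1d(sample, k) if len(sample) >= k else prev_centers
    return assignments, next_centers
```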



Streaming k-Medians [Charikar STOC03]

[Figure: in software, np sampled packets from the i-th epoch pass through an online clustering stage producing O(k log(np)) centers, which an offline clustering stage reduces to the k centers used in the (i+1)-th epoch; in hardware, the storage data structure records the packets of the (i+2)-th epoch and is flushed to DRAM/SSD for archival after every epoch.]


Naïve: Partitioned BF (PBF)

[Figure: insertion: the packet's latency is matched in parallel against the centers c1-c4, and the packet's contents are hashed to set bits in the matched center's Bloom filter. Lookup: the packet's contents are hashed into every center's Bloom filter; the center whose filter has all queried bits set to 1 is returned.]
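A minimal sketch of a partitioned Bloom filter along those lines; filter sizes, hash count, and the hashing scheme are illustrative assumptions:

```python
# Sketch of a Partitioned Bloom Filter: one Bloom filter per cluster center.
# Filter size, hash count, and the SHA-256-based hashing are illustrative.
import hashlib

def _bit_positions(item, num_hashes, num_bits):
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
        yield int(digest, 16) % num_bits

class PartitionedBF:
    def __init__(self, centers, bits_per_filter=1 << 16, num_hashes=3):
        self.centers, self.m, self.k = centers, bits_per_filter, num_hashes
        self.filters = [bytearray(self.m) for _ in centers]   # one filter per center

    def insert(self, packet_id, delay):
        c = min(range(len(self.centers)), key=lambda i: abs(delay - self.centers[i]))
        for b in _bit_positions(packet_id, self.k, self.m):
            self.filters[c][b] = 1

    def lookup(self, packet_id):
        """Query every center's filter; all matching centers are returned."""
        bits = list(_bit_positions(packet_id, self.k, self.m))
        return [self.centers[i] for i, f in enumerate(self.filters)
                if all(f[b] for b in bits)]

pbf = PartitionedBF(centers=[5.0, 50.0, 500.0, 5000.0])
pbf.insert("pkt-42", delay=47.0)
print(pbf.lookup("pkt-42"))   # [50.0], plus any false-positive centers
```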


Problems with PBF

    • Provisioning is hard

      • Cluster sizes are not known a priori

      • Leads to over- or under-estimation of BF sizes

    • Lookup complexity is higher

      • Need the data structure to be partitioned every cycle

      • Need to lookup multiple random locations in the bitmap (based on number of hash functions)



Shared-Vector Bloom Filter

[Figure: insertion: the packet's contents are hashed (H1, H2) to bit positions while its latency is matched in parallel against the centers; at each position the bit offset by the matched center's id is set to 1. Lookup: the packet's contents are hashed to the same positions, a word spanning all center offsets is bulk-read at each position, and the words are ANDed; the surviving offset is the center id.]
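A minimal sketch of the shared-vector layout; sizes and hashing are illustrative assumptions, and the real design reads whole memory words rather than Python slices:

```python
# Sketch of the shared-vector layout: a single bit-vector where each hash selects a
# slot of len(centers) consecutive bits and the bit at offset <center id> is set.
# Sizes and hashing are illustrative; hardware would bulk-read each slot as one word.
import hashlib

def _slots(item, num_hashes, num_slots):
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
        yield int(digest, 16) % num_slots

class SVBF:
    def __init__(self, centers, num_slots=1 << 16, num_hashes=3):
        self.centers, self.s, self.k = centers, num_slots, num_hashes
        self.bits = bytearray(num_slots * len(centers))        # the shared vector

    def insert(self, packet_id, delay):
        c = min(range(len(self.centers)), key=lambda i: abs(delay - self.centers[i]))
        for slot in _slots(packet_id, self.k, self.s):
            self.bits[slot * len(self.centers) + c] = 1

    def lookup(self, packet_id):
        """AND the per-slot words together; surviving offsets are candidate centers."""
        alive = set(range(len(self.centers)))
        for slot in _slots(packet_id, self.k, self.s):
            word = self.bits[slot * len(self.centers):(slot + 1) * len(self.centers)]
            alive &= {i for i, bit in enumerate(word) if bit}
        return [self.centers[i] for i in alive]

svbf = SVBF(centers=[5.0, 50.0, 500.0, 5000.0])
svbf.insert("pkt-42", delay=47.0)
print(svbf.lookup("pkt-42"))   # [50.0], plus any false-positive centers
```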


Comparing PBF and SVBF

    • PBF

      − Lookup is not easily parallelizable

      − Provisioning is hard since the number of packets per BF is not known a priori

    • SVBF

      + One Bloom filter is used

      + Burst read at word granularity

    • COMB [Hao10 Infocom]

      + Single BF with groups of hash functions

      − More memory usage than SVBF and burst read not possible



Comparing Storage Needs

For the same classification failure rate of 1% and 50 centers (k = 50)

[Table: per-packet storage required by each data structure under these settings.]


Tie-Breaking Heuristic

    • Bloom filters have false positives

    • Lookups involve search across all BFs

      • So, multiple BFs may return a match

    • Tie-breaking heuristic returns the group that has the highest cardinality (sketched below)

      • Store a counter per center counting the number of packets assigned to it (cluster cardinality)

      • Works well in practice (especially with skewed delay distributions)
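A minimal sketch of the tie-breaking heuristic, building on the SVBF sketch above; the counter bookkeeping shown is an illustrative assumption:

```python
# Sketch of tie-breaking on top of the SVBF sketch above: count how many packets were
# assigned to each center at insertion time and, on a multi-match, return the center
# with the largest cluster. The bookkeeping shown is illustrative.
from collections import Counter

cardinality = Counter()                       # center value -> packets assigned to it

def insert_with_count(svbf, packet_id, delay):
    svbf.insert(packet_id, delay)
    nearest = min(svbf.centers, key=lambda c: abs(delay - c))
    cardinality[nearest] += 1

def lookup_with_tiebreak(svbf, packet_id):
    matches = svbf.lookup(packet_id)
    if not matches:
        return None                                        # no-match
    return max(matches, key=lambda c: cardinality[c])      # multi-match: largest cluster
```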



Estimation Accuracy

[Figure: CDF of the absolute error (μs) of per-packet latency estimates.]


Accuracy of Aggregates

[Figure: CDF of the relative error of aggregate latency estimates.]


MAPLE Architecture

[Figure: the Central Monitor sends a query Q(P1) about packet P1 to a router's (2) Query Engine and receives the answer A(P1).]


Query Interface

    • Assumption: Path of a packet is known

      • Possible to determine using forwarding tables

      • In OpenFlow-enabled networks, controller has the information

    • Query answer:

      • Latency estimate

      • Type: (1) Match, (2) Multi-Match, (3) No-Match



Query Bandwidth

[Figure: a query message carries a flow key and blocks of continuous IPIDs, e.g., f1 with ranges 1-5 and 20-35.]

    • Query method 1: Query using packet hash

      • Hashed using invariant fields in a packet header

      • High query bandwidth for aggregate latency statistics (e.g., flow-level latencies)

    • Query method 2: Query using flow key and IP identifier

      • Support range search to reduce query bandwidth overhead

      • Inserts: use flow key and IPID for hashing

      • Query: send a flow key together with ranges of contiguous IPIDs (sketched below)
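A minimal sketch of the range-compressed query message; the message format is an illustrative assumption:

```python
# Sketch of query method 2: compress a flow's per-packet queries into ranges of
# consecutive IP IDs. The message format is illustrative.
def compress_ipids(ipids):
    """Turn a set of IP IDs into a list of inclusive (start, end) ranges."""
    ranges = []
    for ipid in sorted(set(ipids)):
        if ranges and ipid == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], ipid)    # extend the current run
        else:
            ranges.append((ipid, ipid))           # start a new run
    return ranges

def build_query(flow_key, ipids):
    return {"flow": flow_key, "ipid_ranges": compress_ipids(ipids)}

print(build_query("f1", [1, 2, 3, 4, 5, 20, 21, 22]))
# {'flow': 'f1', 'ipid_ranges': [(1, 5), (20, 22)]}
```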



Query Bandwidth Compression

[Figure: CDF of the per-flow compression ratio; the median per-flow compression reduces query bandwidth by 90%.]


Storage

    • OC-192 interface

      • ~5 million packets per second

      • 60 Mbits per second (arithmetic sketched after this list)

      • Assuming 10% utilization, 6 Mbits per second

    • DRAM – 16 GB

      • 40 minutes of packets

    • SSD – 256 GB

      • 10 hours – enough time for diagnosis
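These figures can be sanity-checked with simple arithmetic; the ~12 bits of state per packet below is my reading of the slide's 5 M packets/s → 60 Mbit/s figures, not a number from the paper:

```python
# Back-of-the-envelope check of the storage numbers, assuming ~12 bits of
# measurement state per packet (implied by 5 Mpps -> 60 Mbit/s; not from the paper).
PKTS_PER_SEC = 5_000_000                    # fully loaded OC-192 interface
BITS_PER_PKT = 12                           # assumed per-packet cost
rate = PKTS_PER_SEC * BITS_PER_PKT          # bits of state generated per second
print(rate / 1e6, "Mbit/s")                 # 60.0 (6.0 at 10% utilization)

dram_bits = 16 * 8e9                        # 16 GB of DRAM
ssd_bits = 256 * 8e9                        # 256 GB of SSD
print(dram_bits / rate / 60, "minutes")     # ~36 minutes of packets (slide: ~40)
print(ssd_bits / rate / 3600, "hours")      # ~9.5 hours of packets (slide: ~10)
```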



Summary

    • RLI and LDA are ossified at their aggregation level

    • Proposed MAPLE as a mechanism to compute measurements across arbitrary sub-populations

      • Relies on clustering dominant delay values

      • Novel SVBF data structure to reduce storage and lookup complexity



Conclusion

    • Many applications demand low latencies

    • Network operators need high-fidelity tools for latency measurements

    • Proposed RLI for fine-grained per-flow measurements

    • Proposed MAPLE to:

      • Store per-packet latencies in a scalable way

      • Compose latency aggregates across arbitrary sub-populations

    • Many other solutions (papers on my web page)



Sponsors

    • CNS – 1054788: NSF CAREER: Towards a Knowledge Plane for Data Center Networks

    • CNS – 0831647: NSF NECO: Architectural Support for Fault Management

    • Cisco Systems: Designing Router Primitives for Monitoring Network Health
