
High-Fidelity Latency Measurements in Low-Latency Networks

Ramana Rao Kompella

Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)


Low Latency Applications

  • Many important data center applications require low end-to-end latencies (microseconds)

    • High Performance Computing – lose parallelism

    • Cluster Computing, Storage – lose performance

    • Automated Trading – lose arbitrage opportunities



Low Latency Applications

  • Many important data center applications require low end-to-end latencies (microseconds)

    • High Performance Computing – lose parallelism

    • Cluster Computing, Storage – lose performance

    • Automated Trading – lose arbitrage opportunities

  • Cloud applications

    • Recommendation Systems, Social Collaboration

    • All-up SLAs of 200ms [AlizadehSigcomm10]

    • Involve backend computation time, leaving little budget for network latencies



Latency Measurements are Needed

[Figure: a packet path crossing a ToR switch, edge routers, and core routers, with an observed latency of 1 ms — which router causes the problem?]

Measurement within a router is necessary

At every router, high-fidelity measurements are critical to localize root causes

Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis

Vision: Knowledge Plane

[Figure: a Knowledge Plane sits between the data center network and applications such as SLA diagnosis, routing/traffic engineering, and scheduling/job placement; latency measurements are pushed to it or pulled from the network, and applications access it through a query interface (query/response)]


Contributions Thus Far…

  • Aggregate Latency Estimation

    • Lossy Difference Aggregator – Sigcomm 2009

    • FineComb – Sigmetrics 2011

    • mPlane – ReArch 2009

  • Differentiated Latency Estimation

    • Multiflow Estimator – Infocom 2010

    • Reference Latency Interpolation – Sigcomm 2010

    • RLI across Routers – Hot-ICE 2011

    • Delay Sketching – (under review at Sigcomm 2011)

  • Scalable Query Interface

    • MAPLE – (under review at Sigcomm 2011)

Per-flow latency measurements at every hop

Per-Packet Latency Measurements



1) Per-Flow Measurements with Reference Latency Interpolation [Sigcomm 2010]



Obtaining Fine-Grained Measurements

  • Native router support: SNMP, NetFlow

    • No latency measurements

  • Active probes and tomography

    • Too many probes (~10,000 Hz) required, wasting bandwidth

  • Use expensive high-fidelity measurement boxes

    • London Stock Exchange uses Corvil boxes

    • Cannot place them ubiquitously

  • Recent work: LDA [Kompella09Sigcomm]

    • Computes average latency/variance accurately within a switch

    • Provides a good start but may not be sufficient to diagnose flow-specific problems



From Aggregates to Per-Flow

[Figure: average latency per time interval at a switch queue; some flows experience small delays while others experience large delays]

  • Observation: Average latencies differ significantly across flows at a router

  • Goal of this paper: How to obtain per-flow latency measurements in a scalable fashion?


Measurement Model

[Figure: a packet enters the router at ingress interface I and leaves at egress interface E]

  • Assumption: Time synchronization between router interfaces

  • Constraint: Cannot modify regular packets to carry timestamps

    • Would require intrusive changes to the router's forwarding path


Naïve Approach

[Figure: worked example — ingress I and egress E each record per-packet timestamps; egress subtracts them to get per-packet delays and averages them per flow (e.g., avg. delay = 22/2 = 11 for one flow and 32/2 = 16 for another)]

  • For each flow key (see the sketch below),

    • Store timestamps for each packet at I and E

    • After a flow stops sending, I sends the packet timestamps to E

    • E computes individual packet delays

    • E aggregates average latency, variance, etc. for each flow

  • Problem: High communication costs

    • At 10 Gbps, a few million packets per second

    • Sampling reduces communication, but also reduces accuracy
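A minimal sketch (not from the slides) of this naïve scheme, assuming perfectly synchronized clocks and that both interfaces see each flow's packets in the same order; all names are illustrative:

```python
from collections import defaultdict

# Naive per-flow latency measurement: both interfaces log a timestamp per packet,
# then the egress joins the two logs and averages per-packet delays per flow.
ingress_log = defaultdict(list)   # flow_key -> [ingress timestamps]
egress_log = defaultdict(list)    # flow_key -> [egress timestamps]

def record_ingress(flow_key, ts):
    ingress_log[flow_key].append(ts)

def record_egress(flow_key, ts):
    egress_log[flow_key].append(ts)

def per_flow_stats(flow_key):
    """Called once the flow stops sending and ingress ships its log to egress."""
    delays = [e - i for i, e in zip(ingress_log[flow_key], egress_log[flow_key])]
    mean = sum(delays) / len(delays)
    var = sum((d - mean) ** 2 for d in delays) / len(delays)
    return mean, var

# Example: two packets delayed by 10 and 12 time units -> average delay 11.
record_ingress("f1", 10); record_egress("f1", 20)
record_ingress("f1", 15); record_egress("f1", 27)
print(per_flow_stats("f1"))  # (11.0, 1.0)
```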


A (Naïve) Extension of LDA

[Figure: ingress I and egress E maintain one LDA per flow of interest — a packet count and a sum of timestamps — and coordinate to derive per-flow latency]

  • Maintaining LDAs with many counters for flows of interest

  • Problem: (Potentially) high communication costs

    • Proportional to the number of flows


    Key Observation: Delay Locality

    [Figure: per-packet delays D1, D2, D3 over time, each approximated by the average delay WD1, WD2, WD3 of the time window around it]

    True mean delay = (D1 + D2 + D3) / 3

    Localized mean delay = (WD1 + WD2 + WD3) / 3

    How close is localized mean delay to true mean delay as window size varies?
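A small illustrative sketch (my own, not from the talk) of the locality idea: each packet's delay is replaced by the average delay of all packets in its time window, and the resulting per-flow localized means are compared with the true per-flow means; the trace and window size are made up:

```python
from collections import defaultdict

# Delay locality: approximate each packet's delay by the mean delay of ALL packets
# in its time window, then compare per-flow true means against per-flow
# "localized" means built from those window averages.
def window_means(packets, window):
    buckets = defaultdict(list)
    for t, d, _flow in packets:
        buckets[int(t // window)].append(d)
    return {w: sum(ds) / len(ds) for w, ds in buckets.items()}

# (arrival time, true delay, flow key) for a handful of packets from two flows
packets = [(0.1, 5.0, "f1"), (0.2, 7.0, "f2"), (1.1, 9.0, "f1"),
           (1.6, 11.0, "f2"), (2.3, 6.0, "f1")]
window = 1.0
wm = window_means(packets, window)

for flow in ("f1", "f2"):
    mine = [(t, d) for t, d, f in packets if f == flow]
    true_mean = sum(d for _, d in mine) / len(mine)
    local_mean = sum(wm[int(t // window)] for t, _ in mine) / len(mine)
    print(flow, round(true_mean, 2), round(local_mean, 2))
# Shrinking the window makes the localized mean track the true mean more closely.
```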


    Key Observation: Delay Locality

    [Figure: local mean delay per key (ms) vs. true mean delay per key (ms), for data sets from a real router and synthetic queueing models; agreement improves as the window shrinks — RMSRE = 1.72 at 1 s, 0.16 at 10 ms, 0.054 at 0.1 ms — with the global mean shown for comparison]


    Exploiting Delay Locality

    [Figure: reference packets carrying an ingress timestamp are injected into the packet stream; their delays, plotted over time, serve as reference points for nearby regular packets]

    • Reference packets are injected regularly at the ingress I

      • Special packets carrying ingress timestamp

      • Provides some reference delay values (substitute for window averages)

      • Used to approximate the latencies of regular packets


    RLI Architecture

    [Figure: RLI architecture — at ingress I, 1) a Reference Packet Generator injects timestamped reference packets (R) in between regular packets; at egress E, 2) a Latency Estimator observes the interleaved stream and estimates regular-packet latencies]

    • Component 1: Reference Packet Generator

      • Injects reference packets regularly

    • Component 2: Latency Estimator

      • Estimates packet latencies and updates per-flow statistics

      • Estimates directly at the egress with no extra state maintained at the ingress side (reduces storage and communication overheads)


    Component 1: Reference Packet Generator

    • Question: When to inject a reference packet?

    • Idea 1: 1-in-n: Inject one reference packet every n packets

      • Problem: low accuracy under low utilization

    • Idea 2: 1-in-τ: Inject one reference packet every τ seconds

      • Problem: bad in cases where short-term delay variance is high

    • Our approach: Dynamic injection based on utilization (see the sketch below)

      • High utilization → low injection rate

      • Low utilization → high injection rate

      • Adaptive scheme works better than fixed-rate schemes

      • Details in the paper
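A minimal sketch of utilization-adaptive injection (my illustration, not the paper's exact policy); the rate bounds and the link_utilization() / send_reference_packet() hooks are assumptions:

```python
import time

# Utilization-adaptive reference packet injection (illustrative only):
# inject reference packets more often when the link is lightly loaded and
# back off as utilization rises, between two configurable rate bounds.
MIN_RATE_HZ = 10      # injection rate under very high utilization (assumed)
MAX_RATE_HZ = 1000    # injection rate under very low utilization (assumed)

def injection_interval(utilization):
    """Map link utilization in [0, 1] to a gap (seconds) between reference packets."""
    util = min(max(utilization, 0.0), 1.0)
    rate = MAX_RATE_HZ - (MAX_RATE_HZ - MIN_RATE_HZ) * util
    return 1.0 / rate

def reference_packet_loop(link_utilization, send_reference_packet, clock=time):
    """link_utilization() and send_reference_packet(ts) are supplied by the data path."""
    while True:
        ts = clock.time()
        send_reference_packet(ts)              # reference packet carries the ingress timestamp
        clock.sleep(injection_interval(link_utilization()))
```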


    Component 2: Latency Estimator

    [Figure: delay vs. time — for reference packets (R) both arrival time and delay are known, while for a regular packet only the arrival time is known; its delay is estimated from the linear interpolation line between the left (L) and right reference packets, with some error relative to the true delay]

    • Question 1: How to estimate latencies using reference packets?

      • Solution: Different estimators possible (see the sketch below)

        • Use only the delay of the left reference packet (RLI-L)

        • Use linear interpolation of left and right reference packets (RLI)

        • Other non-linear estimators possible (e.g., shrinkage)
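A minimal sketch of the two estimators named above; the function names and example values are illustrative:

```python
# Estimate a regular packet's delay from the two reference packets around it.
# (t_l, d_l): arrival time and delay of the left (earlier) reference packet
# (t_r, d_r): arrival time and delay of the right (later) reference packet
def rli_estimate(t_pkt, t_l, d_l, t_r, d_r):
    """Linear interpolation between the left and right reference delays (RLI)."""
    if t_r == t_l:
        return d_l
    slope = (d_r - d_l) / (t_r - t_l)
    return d_l + slope * (t_pkt - t_l)

def rli_l_estimate(d_l):
    """RLI-L: simply reuse the delay of the most recent (left) reference packet."""
    return d_l

# Example: reference packets at t=0 (delay 10 us) and t=4 (delay 18 us);
# a regular packet arriving at t=3 is estimated at 10 + 2*3 = 16 us.
print(rli_estimate(3.0, 0.0, 10.0, 4.0, 18.0))  # 16.0
```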


    Component 2: Latency Estimator

    [Figure: regular packets wait in an interpolation buffer keyed by flow; when the right reference packet (R) arrives, each buffered packet's delay is estimated (e.g., delays 4 and 5, squared delays 16 and 25) and the corresponding per-flow counters are updated; any flow selection strategy can be applied, and when a flow is exported, avg. latency = C2 / C1]

    • Question 2: How to compute per-flow latency statistics?

    • Solution: Maintain 3 counters per flow at the egress side (see the sketch below)

      • C1: Number of packets

      • C2: Sum of packet delays

      • C3: Sum of squares of packet delays (for estimating variance)

      • To minimize state, can use any flow selection strategy to maintain counters for only a subset of flows
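A minimal sketch of the three egress-side counters, assuming estimated delays (e.g., from the interpolation sketch above) are fed to it; flow keys and values are illustrative:

```python
from collections import defaultdict

# Three counters per flow at the egress: packet count, sum of estimated delays,
# and sum of squared estimated delays (for the variance).
flow_counters = defaultdict(lambda: [0, 0.0, 0.0])  # flow_key -> [C1, C2, C3]

def update_flow(flow_key, est_delay):
    c = flow_counters[flow_key]
    c[0] += 1                  # C1: number of packets
    c[1] += est_delay          # C2: sum of packet delays
    c[2] += est_delay ** 2     # C3: sum of squared delays

def export_flow(flow_key):
    c1, c2, c3 = flow_counters.pop(flow_key)
    mean = c2 / c1
    variance = c3 / c1 - mean ** 2
    return mean, variance

update_flow("f1", 4.0)
update_flow("f1", 5.0)
print(export_flow("f1"))   # (4.5, 0.25)
```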


    Experimental Setup

    • Data sets

      • No public data center traces with timestamps

      • Real router traces with synthetic workloads: WISC

      • Real backbone traces with synthetic queueing: CHIC and SANJ

    • Simulation tool: Open source NetFlow software – YAF

      • Supports reference packet injection mechanism

      • Simulates a queueing model with RED active queue management policy

    • Experiments with different link utilizations



    Accuracy under High Link Utilization

    [Figure: CDF of relative error under high link utilization; the median relative error is 10-12%]


    Comparison with Other Solutions

    [Figure: average relative error vs. link utilization, compared against packet sampling at a 0.1% sampling rate; RLI's error is 1-2 orders of magnitude lower]


    Overhead of RLI

    • Bandwidth overhead is low

      • less than 0.2% of link capacity

    • Impact to packet loss is small

      • Packet loss difference with and without RLI is at most 0.001% at around 80% utilization



    Summary

    A scalable architecture to obtain high-fidelity per-flow latency measurements between router interfaces

    Achieves a median relative error of 10-12%

    Obtains 1-2 orders of magnitude lower relative error compared to existing solutions

    Measurements are obtained directly at the egress side



    Contributions Thus Far…

    • Aggregate Latency Estimation

      • Lossy Difference Aggregator – Sigcomm 2009

      • FineComb – Sigmetrics 2011

      • mPlane – ReArch 2009

    • Differentiated Latency Estimation

      • Multiflow Estimator – Infocom 2010

      • Reference Latency Interpolation – Sigcomm 2010

      • RLI across Routers – Hot-ICE 2011

      • Virtual LDA – (under review at Sigcomm 2011)

    • Scalable Query Interface

      • MAPLE – (under review at Sigcomm 2011)



    2) Scalable Per-Packet Latency Measurement Architecture (under review at Sigcomm 2011)



    MAPLE Motivation

    • LDA and RLI are ossified in the aggregation level

    • Not suitable for obtaining arbitrary sub-population statistics

      • Single packet delay may be important

    • Key Goal: How to enable a flexible and scalable architecture for packet latencies ?



    MAPLE Architecture

    • Timestamping not strictly required

      • Can work with RLI estimated latencies

    [Figure: MAPLE architecture — at each router (A, B), a Timestamp Unit produces a timestamp T1 and delay D1 for packet P1, which are recorded in 1) the Packet Latency Store; a Central Monitor sends queries Q(P1) to 2) the Query Engine and receives answers A(P1)]


    Packet Latency Store (PLS)

    • Challenge: How to store packet latencies in the most efficient manner?

    • Naïve idea: Hash tables do not scale well

      • At a minimum, require label (32 bits) + timestamp (32 bits) per packet

      • To avoid collisions, need a large number of hash table entries (~147 bits/pkt for a collision rate of 1%)

    • Can we do better?


    Our Approach

    • Idea 1: Cluster packets

      • Typically few dominant values

      • Cluster packets into equivalence classes

      • Associate one delay value with a cluster

      • Choose cluster centers such that error is small

    • Idea 2: Provision storage

      • Naïvely, we can use one Bloom Filter per cluster (Partitioned Bloom Filter)

      • We propose a new data structure called Shared Vector Bloom Filter (SVBF) that is more efficient



    Selecting Representative Delays

    • Approach 1: Logarithmic delay selection (see the sketch below)

      • Divide delay range into logarithmic intervals

        • E.g., 0.1-10,000 μs → 0.1-1 μs, 1-10 μs, …

      • Simple to implement, bounded relative error, but accuracy may not be optimal

    • Approach 2: Dynamic clustering

      • k-means (medians) clustering formulation

      • Minimizes the average absolute error of packet latencies (minimizes total Euclidean distance)

    • Approach 3: Hybrid clustering

      • Split centers equally across static and dynamic

      • Best of both worlds

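A minimal sketch of Approach 1, assuming the 0.1-10,000 μs range above, decade-wide intervals, and a geometric midpoint as each interval's representative delay (the midpoint choice is my assumption):

```python
import math

LOW_US, HIGH_US = 0.1, 10_000.0                       # delay range from the slide
N_BUCKETS = int(round(math.log10(HIGH_US / LOW_US)))  # 5 decade-wide intervals

def log_bucket(delay_us):
    """Index of the decade interval [0.1-1), [1-10), [10-100), ... containing delay_us."""
    clamped = min(max(delay_us, LOW_US), HIGH_US)
    return min(int(math.log10(clamped / LOW_US)), N_BUCKETS - 1)

def bucket_center(idx):
    """Geometric midpoint of the idx-th decade, used as its representative delay."""
    lo = LOW_US * 10 ** idx
    return lo * math.sqrt(10)                         # ~0.32, 3.2, 32, 316, 3162 us

for d in (0.5, 7.0, 450.0):
    print(d, "->", round(bucket_center(log_bucket(d)), 2))
# The relative error is bounded because every interval spans exactly one decade.
```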


    K-means

    • Goal: Determine k-centers every measurement cycle

      • Can be formulated as a k-means clustering algorithm

    • Problem 1: Running k-means typically hard

      • Basic algorithm has O(n^(k+1) log n) run time

      • Heuristics (Lloyd’s algorithm) also complicated in practice

    • Solution: Sampling and streaming algorithms (see the sketch below)

      • Use sampling to reduce n to pn

      • Use a streaming k-medians algorithm (approximate but sufficient)

    • Problem 2: Can’t find centers and record membership at the same time

    • Solution: Pipelined implementation

      • Use previous interval’s centers as an approximation for this interval

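A simplified stand-in for the clustering step: packet sampling plus a plain Lloyd-style 1-D k-medians, not the streaming algorithm of [CharikarSTOC03] used in the talk; parameters and names are illustrative:

```python
import random

# Simplified center selection: sample packet delays, run a basic 1-D k-medians on
# the sample, and (pipelined) classify the current epoch's packets with the
# previous epoch's centers while fresh centers are computed for the next epoch.
def k_medians_1d(samples, k, iters=20):
    centers = sorted(random.sample(samples, k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in samples:
            groups[min(range(k), key=lambda i: abs(d - centers[i]))].append(d)
        centers = sorted(sorted(g)[len(g) // 2] if g else centers[i]
                         for i, g in enumerate(groups))
    return centers

def measurement_cycle(delays, prev_centers, k=4, sample_prob=0.01):
    # Classify this epoch's packets with last epoch's centers (pipelining).
    assignments = [min(prev_centers, key=lambda c: abs(c - d)) for d in delays]
    # Sample the epoch's delays and cluster the sample for the next epoch.
    sample = [d for d in delays if random.random() < sample_prob]
    if len(sample) < k:
        sample = delays[:max(100, k)]
    return assignments, k_medians_1d(sample, k)
```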


    Streaming k-Medians [CharikarSTOC03]

    [Figure: center-selection pipeline — the packet stream is sampled in software (np packets at the i-th epoch), an online clustering stage produces O(k log(np)) centers, and an offline clustering stage reduces them to k centers at the (i+1)-th epoch; in hardware, the storage data structure records packets in the (i+2)-th epoch and is flushed to DRAM/SSD after every epoch for archival]


    Naïve: Partitioned BF (PBF)

    [Figure: PBF insertion — the packet's latency is matched in parallel against the centers c1-c4, and bits are set in that center's Bloom filter by hashing the packet contents; PBF lookup — the packet contents are hashed and all Bloom filters are queried, and the center whose filter has all queried bits set to 1 is returned]
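A minimal sketch of a partitioned Bloom filter, with small illustrative sizes and Python's built-in hash standing in for hardware hash functions:

```python
# Partitioned Bloom filter: one Bloom filter per delay cluster. Insertion sets
# bits in the filter of the packet's closest center; lookup queries every filter.
class PBF:
    def __init__(self, centers, bits_per_filter=1024, num_hashes=3):
        self.centers = centers
        self.m = bits_per_filter
        self.k = num_hashes
        self.filters = [bytearray(self.m) for _ in centers]

    def _positions(self, packet_id):
        return [hash((packet_id, i)) % self.m for i in range(self.k)]

    def insert(self, packet_id, delay):
        c = min(range(len(self.centers)), key=lambda i: abs(delay - self.centers[i]))
        for pos in self._positions(packet_id):
            self.filters[c][pos] = 1

    def lookup(self, packet_id):
        """Return all centers whose filter matches (may be >1 due to false positives)."""
        positions = self._positions(packet_id)
        return [i for i, f in enumerate(self.filters)
                if all(f[p] for p in positions)]

pbf = PBF(centers=[1.0, 10.0, 100.0, 1000.0])
pbf.insert("pkt-42", delay=12.0)
print([pbf.centers[i] for i in pbf.lookup("pkt-42")])   # [10.0]
```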


    Problems with PBF

    • Provisioning is hard

      • Cluster sizes not known a priori

      • Over-estimation or under-estimation of BF sizes

    • Lookup complexity is higher

      • Need the data structure to be partitioned every cycle

      • Need to lookup multiple random locations in the bitmap (based on number of hash functions)



    Shared-Vector Bloom Filter

    [Figure: SVBF insertion — bit positions are located by hashing the packet contents (H1, H2), the packet's latency is matched in parallel against the centers c1-c4, and at each hashed position the bit offset by the index of the matched center is set to 1; SVBF lookup — the packet contents are hashed, a block of consecutive bits (one per center) is bulk-read at each position, the blocks are ANDed together, and the offset of the surviving bit is the center id]
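A minimal sketch of the shared-vector layout, under the same assumptions as the PBF sketch above (one group of k bits per hashed position; sizes are illustrative):

```python
# Shared-Vector Bloom Filter (SVBF) sketch: one shared bit vector organized as
# groups of k consecutive bits (one bit per center). Insertion sets, at each
# hashed group, the bit offset by the matched center's id; lookup bulk-reads each
# group and ANDs them, so the position of the surviving bit identifies the center.
class SVBF:
    def __init__(self, centers, num_groups=1024, num_hashes=3):
        self.centers = centers
        self.k = len(centers)
        self.groups = num_groups
        self.h = num_hashes
        self.bits = bytearray(self.groups * self.k)

    def _group_starts(self, packet_id):
        return [(hash((packet_id, i)) % self.groups) * self.k for i in range(self.h)]

    def insert(self, packet_id, delay):
        center = min(range(self.k), key=lambda i: abs(delay - self.centers[i]))
        for start in self._group_starts(packet_id):
            self.bits[start + center] = 1          # offset within the group = center id

    def lookup(self, packet_id):
        """AND the k-bit groups read at each hashed position; surviving offsets are candidate centers."""
        acc = [1] * self.k
        for start in self._group_starts(packet_id):
            word = self.bits[start:start + self.k]  # bulk read of one k-bit group
            acc = [a & b for a, b in zip(acc, word)]
        return [i for i, bit in enumerate(acc) if bit]

svbf = SVBF(centers=[1.0, 10.0, 100.0, 1000.0])
svbf.insert("pkt-42", delay=12.0)
print([svbf.centers[i] for i in svbf.lookup("pkt-42")])   # [10.0]
```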


    Comparing PBF and SVBF

    • PBF

      − Lookup is not easily parallelizable

      − Provisioning is hard since the number of packets per BF is not known a priori

    • SVBF

      + One Bloom filter is used

      + Burst read at word length

    • COMB [Hao10Infocom]

      + Single BF with groups of hash functions

      − More memory usage than SVBF and burst read not possible



    Comparing Storage Needs

    [Table: per-packet storage needs of the candidate data structures, for the same classification failure rate of 1% and 50 centers (k=50)]



    Tie-Breaking Heuristic

    • Bloom filters have false positives

    • Lookups involve search across all BFs

      • So, multiple BFs may return match

    • Tie-breaking heuristic: return the group with the highest cardinality (see the sketch below)

      • Store a counter per center that counts the number of packets matched to it (cluster cardinality)

      • Works well in practice (especially for skewed delay distributions)

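A small follow-on sketch of the tie-breaking rule, reusing the hypothetical SVBF class from the sketch above and adding one packet counter per center:

```python
from collections import Counter

# Per-center packet counts (cluster cardinalities), updated on every insertion.
cardinality = Counter()

def insert_with_count(svbf, packet_id, delay):
    center = min(range(svbf.k), key=lambda i: abs(delay - svbf.centers[i]))
    cardinality[center] += 1
    svbf.insert(packet_id, delay)

def lookup_with_tiebreak(svbf, packet_id):
    """On a multi-match (Bloom filter false positives), pick the most popular center."""
    matches = svbf.lookup(packet_id)
    if not matches:
        return None                                    # no-match
    return max(matches, key=lambda c: cardinality[c])  # match, or tie-broken multi-match
```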


    Estimation Accuracy

    [Figure: CDF of the absolute error (μs) of MAPLE's per-packet latency estimates]


    Accuracy of Aggregates

    [Figure: CDF of the relative error of latency aggregates computed from MAPLE]


    MAPLE Architecture

    [Figure: the Central Monitor sends queries Q(P1) to 2) the Query Engine at routers A and B and receives answers A(P1)]


    Query Interface

    • Assumption: Path of a packet is known

      • Possible to determine using forwarding tables

      • In OpenFlow-enabled networks, controller has the information

    • Query answer:

      • Latency estimate

      • Type: (1) Match, (2) Multi-Match, (3) No-Match



    Query Bandwidth

    [Figure: a query message carries a flow key and blocks of contiguous IPIDs, e.g. (f1, 1-5) and (f1, 20-35)]

    • Query method 1: Query using packet hash

      • Hashed using invariant fields in a packet header

      • High query bandwidth for aggregate latency statistics (e.g., flow-level latencies)

    • Query method 2: Query using flow key and IP identifier (see the sketch below)

      • Support range search to reduce query bandwidth overhead

      • Inserts: use flow key and IPID for hashing

      • Query: send a flow key together with ranges of contiguous IPIDs
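A minimal sketch of range-compressing a flow's IPIDs into query messages (an illustration of query method 2, not the paper's exact message format):

```python
# Compress a flow's IPIDs into contiguous ranges so that one query message
# covers many packets, e.g. [1,2,3,4,5,20,...,35] -> [(1, 5), (20, 35)].
def ipid_ranges(ipids):
    ranges = []
    for ipid in sorted(set(ipids)):
        if ranges and ipid == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], ipid)     # extend the current run
        else:
            ranges.append((ipid, ipid))            # start a new run
    return ranges

def build_query(flow_key, ipids):
    """One query entry per contiguous IPID block for the given flow key."""
    return [(flow_key, lo, hi) for lo, hi in ipid_ranges(ipids)]

print(build_query("f1", [1, 2, 3, 4, 5] + list(range(20, 36))))
# [('f1', 1, 5), ('f1', 20, 35)]
```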


    Query Bandwidth Compression

    [Figure: CDF of the per-flow compression ratio; the median compression per flow reduces query bandwidth by 90%]


    Storage

    • OC192 interface

      • ~5 million packets per second

      • 60 Mbits per second

      • Assuming 10% utilization, 6 Mbits per second (see the arithmetic sketch below)

    • DRAM – 16 GB

      • 40 minutes of packets

    • SSD – 256 GB

      • 10 hours – enough time for diagnosis

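A back-of-the-envelope check of these numbers, assuming roughly 12 bits of latency-store state per packet (my assumption, chosen to match the quoted 60 Mbits per second at roughly 5 million packets per second):

```python
# Rough storage arithmetic for the packet latency store on an OC-192 interface.
PKTS_PER_SEC = 5e6           # ~5 million packets per second (assumed average)
BITS_PER_PKT = 12            # assumed per-packet state, giving ~60 Mbit/s

write_rate_bps = PKTS_PER_SEC * BITS_PER_PKT          # 60e6 bits/s
dram_bits = 16 * 8 * 1e9                              # 16 GB of DRAM
ssd_bits = 256 * 8 * 1e9                              # 256 GB of SSD

print(write_rate_bps / 1e6, "Mbit/s")                 # 60.0
print(dram_bits / write_rate_bps / 60, "minutes")     # ~36 minutes (~40 min on the slide)
print(ssd_bits / write_rate_bps / 3600, "hours")      # ~9.5 hours (~10 h on the slide)
```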


    Summary

    • RLI and LDA are ossified in their aggregation level

    • Proposed MAPLE as a mechanism to compute measurements across arbitrary sub-populations

      • Relies on clustering dominant delay values

      • Novel SVBF data structure to reduce storage and lookup complexity



    Conclusion

    • Many applications demand low latencies

    • Network operators need high-fidelity tools for latency measurements

    • Proposed RLI for fine-grained per-flow measurements

    • Proposed MAPLE to:

      • Store per-packet latencies in a scalable way

      • Compose latency aggregates across arbitrary sub-populations

    • Many other solutions (papers on my web page)



    Sponsors

    • CNS – 1054788: NSF CAREER: Towards a Knowledge Plane for Data Center Networks

    • CNS – 0831647: NSF NECO: Architectural Support for Fault Management

    • Cisco Systems: Designing Router Primitives for Monitoring Network Health


