
Every Microsecond Counts: Tracking Fine-Grain Latencies with a Lossy Difference Aggregator






  1. Every Microsecond Counts: Tracking Fine-Grain Latencies with a Lossy Difference Aggregator
  Ramana Rao Kompella, Kirill Levchenko, Alex C. Snoeren, George Varghese

  2. Low latency networks
  • Networks with end-to-end microsecond latency guarantees are important for many applications
    • Automated Trading – lose arbitrage opportunities
    • High Performance Computing – lose parallelism
    • Cluster Computing, Storage – lose performance
  SIGCOMM 2009

  3. Obtaining fine-grain measurements
  “When considering how to reduce latency, the first step is to measure it.”
  -- Joanne Kinsella, Head of Portfolio, British Telecom
  SIGCOMM 2009

  4. Obtaining fine-grain measurements
  How can we obtain fine-grain measurements using simple, low-cost hardware primitives?
  • Native router support: SNMP, NetFlow
    • Coarse counters, per-flow statistics
    • No latency measurements
  • Active probes
    • Measuring microseconds requires too many probes
    • Wastes bandwidth and interferes with regular traffic
  • State of the art: expensive high-fidelity measurement boxes
    • The London Stock Exchange uses Corvil boxes
  SIGCOMM 2009

  5. Lossy Difference Aggregator (LDA)
  • Computes loss rate, average delay, delay variance, and loss distribution
  • Uses only a small amount of hardware (registers and hashing)
  • Measures real traffic, with no injected probes
  SIGCOMM 2009

  6. Measurement model
  [Figure: packets travel from Sender S through a Router to Receiver R]
  SIGCOMM 2009

  7. Measurement model
  [Figure: Sender S and Receiver R on either side of the path, each maintaining state (DS at S, DR at R)]
  • Packets always travel from S to R
    • R-to-S traffic is considered separately
  • Time is divided into equal bins (measurement intervals)
    • The interval depends on the granularity required (typically sub-second)
  • Both S and R maintain some state D about the packets they see
    • State is updated as each packet departs S or arrives at R
  • At the end of each interval, S transmits DS to R
  • R computes the required measurement as f(DS, DR) (see the sketch below)
  SIGCOMM 2009
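A minimal Python sketch of this sender/receiver pattern, assuming a generic per-packet state interface; the names MeasurementState and end_of_interval are illustrative, not from the paper:

```python
# Generic measurement model: both ends update state D per packet;
# at the end of each interval, R combines the two states.

class MeasurementState:
    """State D maintained at each end of the path."""

    def on_packet(self, pkt_id, timestamp):
        """Update D as a packet departs (at S) or arrives (at R)."""
        raise NotImplementedError


def end_of_interval(D_S, D_R, f):
    """S transmits D_S to R; R computes the measurement f(D_S, D_R)."""
    return f(D_S, D_R)
```

The later slides instantiate this pattern with progressively richer state: a single counter, a timestamp sum, and finally the LDA.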

  8. Computing loss
  [Figure: S counts 3 departing packets, R counts 2 arriving packets (one lost in transit); Loss = 3 − 2 = 1]
  • Loss rates, however small, are trivial to obtain (see the sketch below):
    • Store a packet counter at S and at R
    • S sends its counter value to R periodically
  SIGCOMM 2009
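A minimal sketch of the counter scheme, using the slide's numbers; PacketCounter and the variable names are illustrative:

```python
# One counter per end; the difference at the end of the interval is the loss.

class PacketCounter:
    def __init__(self):
        self.count = 0

    def on_packet(self):
        self.count += 1

S, R = PacketCounter(), PacketCounter()
for _ in range(3):               # S transmits 3 packets...
    S.on_packet()
for _ in range(2):               # ...but only 2 reach R
    R.on_packet()

# At the end of the interval, S sends its count to R, which computes:
loss = S.count - R.count         # 3 - 2 = 1
```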

  9. Computing delay (naïve)
  [Figure: departure times 10, 12, 15 at S; arrival times 23, 26, 35 at R; per-packet delays 13, 14, 20; Avg. Delay = 47/3 ≈ 15.7]
  Observation: computing delay is trivial if there is no loss.
  • A naïve first cut: timestamps (see the sketch below)
    • Store a timestamp for each packet at S and R
    • After every interval, S sends its packet timestamps to R
    • R computes the individual delays and averages them
  • Problem: high communication cost
    • 5 million packet timestamps require ~25,000 packets
    • Sampling reduces communication, but also reduces accuracy
  SIGCOMM 2009
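A sketch of the naïve timestamp scheme with the slide's example values, assuming FIFO ordering and no loss:

```python
# Both ends log every packet's timestamp; S ships its whole log to R
# at the end of the interval.

send_ts = [10, 12, 15]           # recorded at S on departure
recv_ts = [23, 26, 35]           # recorded at R on arrival (FIFO, no loss)

delays = [r - s for s, r in zip(send_ts, recv_ts)]   # [13, 14, 20]
avg_delay = sum(delays) / len(delays)                # 47 / 3 ≈ 15.7
```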

  10. Estimating delay under no loss
  [Figure: S accumulates 10 + 12 + 15 = 37; R accumulates 23 + 26 + 35 = 84; Avg. delay = (84 − 37) / 3 ≈ 15.7]
  Works great, as long as no packets are ever lost…
  • Observation: aggregation can reduce cost (see the sketch below)
    • Store the sum of the timestamps at S and R in a timestamp accumulator
    • After every interval, S sends its sum CS to R
    • R computes the average delay as (CR − CS) / N
    • Only one counter to keep and one packet to send
  SIGCOMM 2009
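A sketch of the timestamp-accumulator scheme, again with the slide's numbers; TimestampAccumulator is an illustrative name:

```python
# Each side keeps only a running sum and a count, so S sends a single
# value CS instead of N timestamps.

class TimestampAccumulator:
    def __init__(self):
        self.sum = 0       # running timestamp sum (CS at S, CR at R)
        self.count = 0     # number of packets seen (N)

    def on_packet(self, ts):
        self.sum += ts
        self.count += 1

S, R = TimestampAccumulator(), TimestampAccumulator()
for ts in (10, 12, 15):
    S.on_packet(ts)                      # CS = 37
for ts in (23, 26, 35):
    R.on_packet(ts)                      # CR = 84

assert S.count == R.count                # valid only when nothing was lost
avg_delay = (R.sum - S.sum) / R.count    # (84 - 37) / 3 ≈ 15.7
```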

  11. Delay in the presence of loss
  [Figure: a single accumulator is unusable when the packet counts do not match. The LDA stores a synopsis instead: packets hash into buckets, each holding a packet count and a timestamp sum. A bucket with matching counts (2 at both ends, sums 25 at S and 52 at R) contributes 52 − 25 = 27 over 2 packets; a bucket that absorbed a lost packet has mismatched counts and is discarded. Avg. delay = 27/2 = 13.5]
  • A (much) better idea: the Lossy Difference Aggregator (LDA) (see the sketch below)
    • A hash table of per-bucket packet counts and timestamp sums
    • Loss is spread across several buckets
    • Consistent hashing ensures packets hash to the same bucket at S and R
    • Buckets containing lost packets are discarded
  SIGCOMM 2009
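A single-bank LDA sketch without sampling, assuming CRC32 over invariant packet contents as the consistent hash and 1024 buckets; both choices are illustrative, not prescribed by the paper:

```python
import zlib

# Each bucket holds a (timestamp sum, packet count) pair. Buckets whose
# counts disagree between S and R absorbed a loss and are discarded.

class LDA:
    def __init__(self, num_buckets=1024):
        self.ts_sum = [0] * num_buckets
        self.count = [0] * num_buckets

    def on_packet(self, pkt_id: bytes, ts: int):
        # Hashing invariant packet contents sends each packet to the
        # same bucket at both S and R.
        b = zlib.crc32(pkt_id) % len(self.count)
        self.ts_sum[b] += ts
        self.count[b] += 1


def average_delay(S, R):
    delay_sum, pkt_count = 0, 0
    for b in range(len(S.count)):
        if S.count[b] == R.count[b]:   # no loss landed in this bucket
            delay_sum += R.ts_sum[b] - S.ts_sum[b]
            pkt_count += R.count[b]
    return delay_sum / pkt_count if pkt_count else None
```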

  12. Tuning LDA
  • Problem 1: high loss corrupts many buckets
    • Solution: sampling (see the sketch below)
    • Control the sampling rate so that not too many buckets are corrupted
    • For a given loss rate, we can analytically derive an optimal sampling probability that maximizes the number of delay samples
  • Problem 2: the loss rate is unpredictable
    • Solution: multiple banks tuned to different loss rates
    • Logarithmically many copies suffice in theory
    • Fewer are needed in practice (a two-bank LDA was sufficient in our evaluation)
  SIGCOMM 2009
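A sketch combining both tuning ideas, with hash-based sampling so that S and R make identical keep/drop decisions per packet; the bank sizes and sampling rates shown are illustrative, not the analytically optimal values derived in the paper:

```python
import zlib

# A bank is an LDA that processes only a sampled subset of packets.
# Using hash bits (rather than random coins) for sampling keeps the
# two ends consistent.

class LDABank:
    def __init__(self, num_buckets, sample_prob):
        self.ts_sum = [0] * num_buckets
        self.count = [0] * num_buckets
        self.sample_prob = sample_prob

    def on_packet(self, pkt_id: bytes, ts: int):
        h = zlib.crc32(pkt_id)
        # Low 16 hash bits decide sampling; both ends agree per packet.
        if (h & 0xFFFF) / 0x10000 >= self.sample_prob:
            return
        b = (h >> 16) % len(self.count)
        self.ts_sum[b] += ts
        self.count[b] += 1

# A two-bank LDA: one bank keeps every packet (effective when loss is
# low), the other samples 1 in 16 (effective when loss is high).
banks = [LDABank(1024, 1.0), LDABank(1024, 1.0 / 16)]
```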

  13. Comparison with active probes
  [Plot: relative error (log scale) vs. loss rate]
  SIGCOMM 2009

  14. Relative error with known loss rate
  [Plot: relative error vs. loss rate]
  SIGCOMM 2009

  15. Multiple bank LDA
  [Plot: relative error vs. loss rate]
  SIGCOMM 2009

  16. Computing variance using LDA
  • The aggregation idea used for average delay does not work here directly
  • Idea: keep a “plus-minus” counter (see the sketch below)
    • Easy to implement based on a packet hash [AMS96]
    • Each timestamp is added (or subtracted) with probability ½
    • Cross products cancel in expectation, since positive and negative terms are equally likely
  • Results: average std. deviation around 5%
  SIGCOMM 2009
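A sketch of the plus-minus estimator, using one bit of a CRC32 packet hash to choose the sign so that S and R agree per packet; a single signed counter is shown for simplicity, and the structural details are illustrative:

```python
import zlib

# AMS-style signed accumulator: each timestamp is added or subtracted
# according to a hash bit that both ends compute identically.

class SignedAccumulator:
    def __init__(self):
        self.signed_sum = 0
        self.count = 0

    def on_packet(self, pkt_id: bytes, ts: int):
        sign = 1 if zlib.crc32(pkt_id) & 1 else -1
        self.signed_sum += sign * ts
        self.count += 1


def variance_estimate(S, R, mean_delay):
    # R.signed_sum - S.signed_sum = sum_i sign_i * delay_i. Squaring it
    # leaves sum_i delay_i^2 in expectation (the cross terms cancel,
    # since each sign is +1 or -1 with equal probability), giving an
    # estimate of the second moment of the delay.
    second_moment = (R.signed_sum - S.signed_sum) ** 2 / R.count
    return second_moment - mean_delay ** 2
```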

  17. Summary
  • Low latency networks matter for automated trading and data centers
    • Active probes require extremely high probing frequencies
    • Specialized boxes are too expensive
  • LDA is simple to implement in modern routers
    • Requires 0.13 mm² (<1% of a 65 nm ASIC)
    • Uses only counters plus increment/decrement operations
    • Hash functions implemented using XOR arrays
    • Exploits FIFO ordering and fine-grain time synchronization
  • Compared with active probes
    • 25–500x lower relative error for a fixed communication budget
    • 50–60x lower overhead for a target error rate
  • Future work: scalable per-flow latency measurements
  SIGCOMM 2009

  18. Thanks! Questions?
  SIGCOMM 2009
