
Counter Stacks: Storage Workload Analysis via Streaming Algorithms


Presentation Transcript


  1. Counter Stacks: Storage Workload Analysis via Streaming Algorithms. Nick Harvey, University of British Columbia and Coho Data. Joint work with Zachary Drudi, Stephen Ingram, Jake Wires, Andy Warfield.

  2. Caching. What data to keep in fast memory? [Diagram: fast, low-capacity memory in front of slow, high-capacity memory.]

  3. Caching, historically. [Diagram: registers, RAM, disk.] Belady, 1966: FIFO, RAND, MIN. Denning, 1968: LRU.

  4. Caching, modern. [Diagram: the hierarchy now runs from registers and L1/L2/L3 (an associative map, from 1968) through RAM, SSD, and disk out to proxies, CDNs, and cloud storage, with LRU, consistent hashing, and similar policies at each level.] CPUs are >1000x faster; disk latency is <10x better; cache misses are increasingly costly.

  5. Challenge: Provisioning. How much cache should you buy to support your workload?

  6. Challenge: Virtualization • Modern servers are heavily virtualized • How should we allocate the physical cache among virtual servers to improve overall performance? • What is the “marginal benefit” of giving a server more cache?

  7. Understanding Workloads • Understanding workloads better can help administrators make provisioning decisions and software make allocation decisions • Storing a trace is costly: GBs per day • Analyzing and distilling traces is a challenge

  8. Hit Rate Curve. [Plot: hit rate versus cache size (GB), MSR Cambridge “TS” trace, LRU policy; annotations: “elbow”, “knee”/“working set”, and a region with not much marginal benefit of a bigger cache.] • Fix a particular workload and caching policy • If cache size is x, what would the hit rate be? • HRCs are useful for choosing an appropriate cache size

  9. Hit Rate Curve. [Plot: hit rate versus cache size (GB), MSR Cambridge “Web” trace, LRU policy; the “elbow”?, “knee”?, and “working set”? annotations have no clear answer.] • Real-world HRCs need not be concave or smooth • “Marginal benefit” is meaningless • “Working set” is a fallacy

  10. LRU Caching • Policy: an LRU cache of size x always contains the x most recently requested distinct symbols. Example trace: A B C A D A B … • Between the two requests for B there are 3 distinct symbols (C, A, D): the “reuse distance”. • If the cache size is >3, then B will still be in the cache during the second request for B, so the second request for B is a hit for cache size x iff x>3. • Inclusive: larger caches always include the contents of smaller caches.
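
To make the hit rule concrete, here is a minimal Python sketch (illustrative only, not part of the talk) that replays the example trace and reports each request's reuse distance under the position-in-the-LRU-list convention used on the next slide:

```python
def reuse_distances(trace):
    """Yield (block, d): d is the block's position in the LRU list (None on a cold miss)."""
    stack = []  # blocks ordered from most to least recently requested
    for block in trace:
        if block in stack:
            d = stack.index(block) + 1  # distinct blocks since its last request, itself included
            stack.remove(block)
        else:
            d = None                    # first request: a miss for every cache size
        stack.insert(0, block)
        yield block, d

for block, d in reuse_distances("ABCADAB"):
    print(block, "cold miss" if d is None else f"hit iff cache size >= {d}")
# The second request for B has d = 4 (C, A, D plus B itself),
# i.e. it is a hit exactly when the cache size is > 3, as stated above.
```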

  11. Mattson’s Algorithm. Requests: A B C A D A B … • Maintain an LRU cache of size n; this simulates caches of all sizes x ≤ n. • Keep a list of all blocks, sorted by most recent request time. • The reuse distance of a request is its position in that list. • If the distance is d, the request is a hit for all cache sizes ≥ d. • The hit rate curve is the CDF of the reuse distances. [Figure: the list after each request: A / B A / C B A / A C B / D A C B / A D C B / B A D C.]
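
A short Python sketch of Mattson's algorithm as described on this slide (illustrative, not the optimized C implementation mentioned later): keep the full LRU list, record each request's position, and read the HRC off the CDF of reuse distances.

```python
from collections import Counter

def mattson_hrc(trace):
    """Return {cache_size: hit_rate} for every cache size up to the number of distinct blocks."""
    lru = []             # all blocks, sorted by most recent request time
    hits = Counter()     # reuse distance -> number of hits at that distance
    total = 0
    for block in trace:
        total += 1
        if block in lru:
            hits[lru.index(block) + 1] += 1   # reuse distance = position in the list
            lru.remove(block)
        lru.insert(0, block)
    curve, running = {}, 0
    for size in range(1, len(lru) + 1):       # hit rate curve = CDF of reuse distances
        running += hits[size]
        curve[size] = running / total
    return curve

print(mattson_hrc("ABCADAB"))
# {1: 0.0, 2: 0.14..., 3: 0.28..., 4: 0.42...}: 0, 1, 2, 3 of the 7 requests hit.
```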

  12. Faster Mattson [Bennett-Kruskal 1975, Olken 1981, Almasi et al. 2001, ...]. Requests: A B C A D A B …; m = length of trace, n = # blocks. • Maintain a table mapping each block to the time of its last request. • The # of blocks whose last request time is ≥ t equals the # of distinct blocks seen since time t. • This count can be computed in O(log n) time with a balanced tree. • The HRC can therefore be computed in O(m log n) time. • Space is Θ(n).
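
A hedged sketch of this idea in Python: the balanced tree is replaced by a Fenwick (binary indexed) tree over request times, which answers the suffix count “how many blocks have last request time ≥ t” in logarithmic time. Indexing by time keeps the sketch short but costs O(m) space; the structure described above needs only the n active last-request times.

```python
class Fenwick:
    """Binary indexed tree: point update and prefix sum in O(log size)."""
    def __init__(self, size):
        self.tree = [0] * (size + 1)
    def add(self, i, delta):             # 1-indexed position
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i
    def prefix(self, i):                 # sum of positions 1..i
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

def reuse_distances_fast(trace):
    """Reuse distance of every request (None for cold misses), O(log m) per request."""
    m = len(trace)
    active = Fenwick(m)   # marks times that are currently some block's last request
    last = {}             # block -> time of its last request (1-indexed)
    out = []
    for t, block in enumerate(trace, start=1):
        if block in last:
            prev = last[block]
            # distinct blocks seen since time prev = # of last-request times >= prev
            out.append(active.prefix(m) - active.prefix(prev - 1))
            active.add(prev, -1)         # this block's last request time moves to t
        else:
            out.append(None)
        last[block] = t
        active.add(t, 1)
    return out

print(reuse_distances_fast("ABCADAB"))   # [None, None, None, 3, None, 2, 4]
```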

  13. Is linear space OK? • A modern disk is 8 TB, divided into 4 kB blocks ⇒ n = 2B. • The problem is worse in multi-disk arrays: a 60 TB JBOD ⇒ n = 15B. • If the algorithm for improving memory usage consumes 15 GB of RAM, that’s counterproductive!

  14. Is linear space OK? • We ran an optimized C implementation of Mattson on the MSR-Cambridge traces of 13 live servers over 1 week • Trace file is 20GB in size, 2.3B requests, 750M blocks (3TB) • Processing time: 1 hour • RAM usage: 92GB • Lesson: Cannot afford linear space to process storage workloads • Question: Can we estimate HRCs in sublinear space?

  15. Quadratic Space. Requests: A B C A D A B. For each request, keep the set of all subsequent items (items seen since the first request, items seen since the second request, …). • The reuse distance of a request is the size of the oldest set that grows. • The hit rate curve is the CDF of the reuse distances. [Figure: the nested sets for the example trace, with the repeated requests annotated “Reuse Distance = 1”, “Reuse Distance = 2”, and “Reuse Distance = 3”.]

  16. Quadratic Space. Requests: A B C A D A B. [Figure: the nested sets again; for the final request B, the minimum j with B not in the j-th set is j = 3, and v_j = 3.] For t = 1, …, m: receive request b_t; find the minimum j such that b_t is not in the j-th set; let v_j be the cardinality of the j-th set; record a hit at reuse distance v_j; insert b_t into all previous sets.
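
The loop on this slide, written out as a small Python sketch (the slide's pseudocode glosses over first-time requests, so the sketch skips them explicitly; the names are mine):

```python
from collections import Counter

def quadratic_hrc_counts(trace):
    """Hit counts by reuse distance, keeping one set per request (quadratic space)."""
    sets = []            # sets[j] = distinct blocks seen since the (j+1)-th request
    seen = set()
    hits = Counter()
    for b in trace:
        if b in seen:
            j = 0
            while b in sets[j]:          # minimum j such that b is not in the j-th set
                j += 1
            hits[len(sets[j])] += 1      # record a hit at reuse distance v_j
        for s in sets:                   # insert b into all previous sets...
            s.add(b)
        sets.append({b})                 # ...and start a new set for this request
        seen.add(b)
    return hits

print(quadratic_hrc_counts("ABCADAB"))   # one hit each at reuse distances 2, 1, and 3
```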

  17. More Abstract Version. Requests: A B C A D A B. [Figure: for the final request B, the cardinality changes across the sets are δ_j = 0 0 1 1 1 1, so δ_j − δ_{j−1} = 0 0 1 0 0 0, and v_j = 3 at the single 1.] For t = 1, …, m: let v_j be the cardinality of the j-th set; receive request b_t; let δ_j be the change in the j-th set’s cardinality when adding b_t; for j = 2, …, t: record (δ_j − δ_{j−1}) hits at reuse distance v_j; insert b_t into all previous sets. How should we represent these sets? A hash table?
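
The same computation phrased through cardinality changes, following this slide: δ_j is 0 for sets that already contain b_t and 1 afterwards, so δ_j − δ_{j−1} is 1 at exactly one index, and that index's v_j is the reuse distance. A small sketch (cold misses fall out automatically, since all δ_j are then 1 and there is no transition):

```python
from collections import Counter

def delta_based_counts(trace):
    """Record (delta_j - delta_{j-1}) hits at reuse distance v_j, per the slide."""
    sets = []
    hits = Counter()
    for b in trace:
        v = [len(s) for s in sets]                   # v_j: cardinalities before inserting b
        delta = [0 if b in s else 1 for s in sets]   # change in each set's cardinality
        for j in range(1, len(sets)):
            if delta[j] - delta[j - 1] == 1:         # the single 0 -> 1 transition
                hits[v[j]] += 1
        for s in sets:
            s.add(b)
        sets.append({b})
    return hits

print(delta_based_counts("ABCADAB"))   # same answer as before: hits at distances 2, 1, 3
```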

  18. Random Set Data Structures. Operations supported and space required (in bits):
  • Hash Table: Insert yes; Delete yes; Member? yes; Cardinality? yes; space Θ(n log n).
  • Bloom Filter: Insert yes; Delete no; Member? yes*; Cardinality? no; space Θ(n).
  • F0 Estimator: Insert yes; Delete no; Member? no; Cardinality? yes*; space O(log n).
  (* allowing some error.) The F0 estimator is also known as a “HyperLogLog”, “probabilistic counter”, or “distinct element estimator”.

  19. Subquadratic Space. Requests: A B C A D A B. Replace each set of subsequent items (items seen since the first request, since the second request, …) with an F0 estimator supporting Insert and a cardinality query. • The reuse distance is the size of the oldest set that grows (a cardinality query). • The hit rate curve is the CDF of the reuse distances. For t = 1, …, m: let v_j be the value of the j-th F0-estimator; receive request b_t; let δ_j be the change in the j-th F0-estimator when adding b_t; for j = 2, …, t: record (δ_j − δ_{j−1}) hits at reuse distance v_j.
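
A sketch of the subquadratic variant: the exact sets are replaced by small cardinality estimators, and hits are charged using the same δ-differences. The bottom-k estimator below is a stand-in of my own choosing; the matrix-based estimator the talk actually uses appears on the next slides.

```python
import hashlib
from collections import Counter

class BottomK:
    """Tiny distinct-count (F0) estimator: keep the k smallest hash values."""
    def __init__(self, k=64):
        self.k, self.smallest = k, []
    def insert(self, x):
        h = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        v = int.from_bytes(h, "big") / 2**64          # pseudo-uniform in [0, 1)
        if v not in self.smallest:
            self.smallest = sorted(self.smallest + [v])[:self.k]
    def cardinality(self):
        if len(self.smallest) < self.k:
            return len(self.smallest)                 # exact while the stream is small
        return int((self.k - 1) / self.smallest[-1])

def approx_hrc_counts(trace):
    """Slide 17's loop, with each set replaced by an F0 estimator."""
    estimators, hits = [], Counter()
    for b in trace:
        v = [e.cardinality() for e in estimators]     # v_j before inserting b
        for e in estimators:
            e.insert(b)
        delta = [e.cardinality() - v[i] for i, e in enumerate(estimators)]
        for j in range(1, len(estimators)):
            if delta[j] - delta[j - 1] >= 1:          # estimated 0 -> 1 transition
                hits[v[j]] += 1
        new = BottomK()
        new.insert(b)
        estimators.append(new)
    return hits

print(approx_hrc_counts("ABCADAB"))   # matches the exact counts on this tiny trace
```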

  20. Towards Sublinear Space. Requests: A B C A … [Figure: the chain of F0 estimators over the sets of subsequent items, each containing the next: ⊇ ⊇ ⊇ …] • Note that an earlier F0-estimator is a superset of a later one. • Can this be leveraged to achieve sublinear space?

  21. F0 Estimation [Flajolet-Martin ’83, Alon-Matias-Szegedy ’99, …, Kane-Nelson-Woodruff ’10] • Operations: Insert(x); Cardinality(), with (1+ε) multiplicative error • Space: log(n)/ε² bits; Θ(ε⁻² + log n) is optimal. [Figure: a {0,1} matrix with log n rows and ε⁻² columns.]

  22. F0 Estimation. Operations: Insert(x), Cardinality(). [Figure: to insert an element of the stream A B C A D A B …, hash function h (uniform) selects one of the ε⁻² columns, hash function g (geometric) selects one of the log n rows, and the corresponding bit is set to 1.]

  23.–26. F0 Estimation (continued). [Animation: the requests A B C A D A B … are inserted one by one, each setting the bit in the row chosen by g (geometric) and the column chosen by h (uniform).]

  27. F0 Estimation. Operations: Insert(x), Cardinality(). Suppose we insert n distinct elements. • The # of 1s in a column is the max of ≈ nε² geometric RVs, so it is ≈ log(nε²). • Averaging over all columns gives a concentrated estimate of log(nε²). • Exponentiating and scaling gives a concentrated estimate of n. [Figure: the matrix with log n rows and ε⁻² columns.]
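
A sketch of the matrix estimator these slides describe: a log n × ε⁻² bit matrix, a uniform column hash h, and a geometric row hash g. The estimate follows the slide's outline (take a per-column statistic, average, exponentiate, scale); the bias-correction constant 0.39701 is borrowed from the LogLog estimator, since the slides do not give one.

```python
import hashlib

class MatrixF0:
    """{0,1} matrix with `rows` ~ log n rows and `cols` ~ 1/eps^2 columns."""
    def __init__(self, cols=256, rows=32):
        self.rows, self.cols = rows, cols
        self.bits = [[0] * cols for _ in range(rows)]

    def _row_col(self, x):
        d = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
        col = int.from_bytes(d[:8], "big") % self.cols            # h: uniform column
        r = int.from_bytes(d[8:], "big") | (1 << (self.rows - 1)) # cap the row index
        row = (r & -r).bit_length() - 1                           # g: geometric row
        return row, col

    def insert(self, x):
        row, col = self._row_col(x)
        self.bits[row][col] = 1

    def cardinality(self):
        # The top set row of a column behaves like the max of ~ n*eps^2 geometric RVs,
        # i.e. ~ log2(n*eps^2); average over columns, exponentiate, scale by 1/eps^2.
        tops = []
        for c in range(self.cols):
            set_rows = [r for r in range(self.rows) if self.bits[r][c]]
            tops.append(1 + max(set_rows) if set_rows else 0)
        return int(0.39701 * self.cols * 2 ** (sum(tops) / self.cols))

est = MatrixF0()
for i in range(100000):
    est.insert(i)
print(est.cardinality())   # concentrated around 100000 (within roughly 10% for 256 columns)
```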

  28. F0 Estimation for a Chain • Operations: Insert(x); Cardinality(t), which estimates the # of distinct elements since the t-th insert • Space: log(n)/ε² words. [Figure: the same matrix, log n rows by ε⁻² columns.]

  29. F0 Estimation for a Chain. Operations: Insert(x), Cardinality(t). Space: log(n)/ε² words. [Figure: each insert from the stream A B C A D A B … is hashed by h (uniform, choosing a column) and g (geometric, choosing a row), and the corresponding matrix entry is updated.]

  30.–33. F0 Estimation for a Chain (continued). [Animation: the requests A B C A D A B … are inserted one by one; each insert updates the entry in the row chosen by g (geometric) and the column chosen by h (uniform).]

  34. F0 Estimation for a Chain. Operations: Insert(x), Cardinality(t). • The {0,1}-matrix consisting of all entries ≥ t is the same as the matrix for an F0 estimator that started at time t. • So, for any t, we can estimate the # of distinct elements since time t. [Figure: the matrix, log n rows by ε⁻² columns.]
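
One way to realize the chain in code, extending the previous sketch (my rendering of the slide, not the paper's Java implementation): each matrix cell stores the time of the latest insert that landed on it, so restricting to entries ≥ t recovers the bit matrix of an estimator started at time t, and Cardinality(t) reuses the same estimate.

```python
import hashlib

class ChainF0:
    """F0 estimation for a chain: cells hold insertion times instead of bits."""
    def __init__(self, cols=256, rows=32):
        self.rows, self.cols, self.now = rows, cols, 0
        self.time = [[0] * cols for _ in range(rows)]   # 0 means "never set"

    def _row_col(self, x):
        d = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
        col = int.from_bytes(d[:8], "big") % self.cols
        r = int.from_bytes(d[8:], "big") | (1 << (self.rows - 1))
        return (r & -r).bit_length() - 1, col

    def insert(self, x):
        self.now += 1
        row, col = self._row_col(x)
        self.time[row][col] = self.now        # later inserts overwrite earlier times

    def cardinality(self, t=1):
        """Estimate the number of distinct elements inserted at times >= t."""
        tops = []
        for c in range(self.cols):
            set_rows = [r for r in range(self.rows) if self.time[r][c] >= t]
            tops.append(1 + max(set_rows) if set_rows else 0)
        return int(0.39701 * self.cols * 2 ** (sum(tops) / self.cols))

est = ChainF0()
for i in range(100000):
    est.insert(i)          # times 1..100000: all distinct
for i in range(50000):
    est.insert(i)          # times 100001..150000: repeats of the first 50000 elements
print(est.cardinality(1))       # ~100000 distinct elements over the whole stream
print(est.cardinality(100001))  # ~50000 distinct elements since time 100001
```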

  35. Theorem: Let n = B·W. Let C : [n] → [0,1] be the true HRC and Ĉ : [n] → [0,1] the estimated HRC. Using O(B²·log(n)·log(m)/ε²) words of space, we can guarantee C((j−1)·W) − ε ≤ Ĉ(j·W) ≤ C(j·W) + ε for all j = 1, …, B, i.e. horizontal error at most W and vertical error at most ε. Here n = # distinct blocks, m = # requests, B = # “bins”, W = width of each “bin”. [Figure: C and Ĉ plotted as hit rate versus cache size from 0 to n, divided into B bins of width W.]

  36. Experiments: MSR-Cambridge traces of 13 live servers over 1 week • Trace file is 20GB in size, 2.3B requests, 750M blocks • Optimized C implementation of Mattson’s algorithm • Processing time: ~1 hour • RAM usage: ~92GB • Java implementation of our algorithm • Processing time: 17 minutes (2M requests per second) • RAM usage: 80MB (mostly the garbage collector)

  37. Experiments: MSR-Cambridge traces of 13 live servers over 1 week • Trace file has m = 2.3B requests, n = 750M blocks. [Plot: hit rate curves estimated by a heuristic and by counter stacks.]

  38. Experiments: MSR-Cambridge traces of 13 live servers over 1 week • Trace file has m = 585M requests, n = 62M blocks. [Plot: hit rate curves estimated by a heuristic and by counter stacks.]

  39. Experiments: MSR-Cambridge traces of 13 live servers over 1 week • Trace file has m = 75M requests, n = 20M blocks. [Plot: hit rate curves estimated by a heuristic and by counter stacks.]

  40. Conclusions • Workload analysis by measuring uniqueness over time. • The notion of a “working set” can be replaced by the “hit rate curve”. • HRCs can be estimated in sublinear space, quickly and accurately. • On some real-world data sets, the method’s accuracy is noticeably better than that of heuristics proposed in the literature.

  41. Open Questions • Does the algorithm use an optimal amount of space? Can it be improved to O(B·log(n)·log(m)/ε²) words of space? • We did not discuss runtime. Can we get a runtime independent of B and ε? • We take the difference of F0-estimators by subtraction, which seems crude. Is there a better approach? • Streaming has been used in networks, databases, etc. To date, it has not been used much in storage; there are potentially more uses here.
