
Characterizing and Exploiting Reference Locality in Data Stream Applications


Presentation Transcript


  1. Characterizing and Exploiting Reference Locality in Data Stream Applications Feifei Li, Ching Chang, George Kollios, Azer Bestavros Computer Science Department Boston University

  2. Data Stream Management System (DSMS) • [Diagram: an application issues a query (e.g. a join over two streams) to the DSMS; the query processor selects the tuples to keep in memory so as to maximize the query metrics, unselected tuples are dropped, and the result is returned to the application.]

  3. Observations • Storage / computation limitations: the full contents of the tuples of interest cannot be stored in memory. • Query processing under a memory constraint is therefore cast as a “caching” problem.

  4. “Caching” Problem in DSMS • Sliding window joins, where the sum of the window sizes is the available memory size. Which tuples should be stored to maximize the size of the join result? • Locality of reference properties (Denning & Schwartz).

  5. Locality-Aware Algorithms • [Chart comparing our locality-aware algorithms against previous algorithms.]

  6. Our Contributions • Cast query processing with memory constraint in DSMS as a “caching” problem and analyze the two causes of reference locality. • Provide a mathematical model that characterizes the reference locality in data streams, and a simple method to infer it. • Show how to improve the performance of data stream applications with locality-aware algorithms.

  7. Reference Locality - Definition • In a data stream, recently appearing tuples have a high probability of appearing in the near future.

  8. Inter-Arrival Distance (IAD) • A random variable that corresponds to the number of tuples separating consecutive appearances of the same tuple. Example stream: 2 2 4 10 4 10 7 7 4 …, whose consecutive repetitions give the IAD values 0, 1, 1, 0, 3.

  9. Calculate the distribution of the IAD • If xn and xn+k are consecutive appearances of the same value i (e.g. … i a b e c a i …), the distance is k. The per-value distances are combined using pi, the frequency of value i in the data stream.
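
A minimal sketch (not from the slides) of how the IAD values and their empirical CDF could be computed from a trace; the function names `iad_values` and `iad_cdf` are illustrative:

```python
from collections import defaultdict

def iad_values(stream):
    """Inter-arrival distances: for each consecutive pair of appearances of
    the same value, record the number of tuples in between."""
    last_seen, distances = {}, []
    for idx, v in enumerate(stream):
        if v in last_seen:
            distances.append(idx - last_seen[v] - 1)
        last_seen[v] = idx
    return distances

def iad_cdf(distances):
    """Empirical CDF of the IAD, as plotted on slides 14 and 16."""
    counts = defaultdict(int)
    for d in distances:
        counts[d] += 1
    cdf, running = {}, 0
    for d in sorted(counts):
        running += counts[d]
        cdf[d] = running / len(distances)
    return cdf

# The example stream from slide 8: the IADs are 0, 1, 1, 0, 3
print(iad_values([2, 2, 4, 10, 4, 10, 7, 7, 4]))   # [0, 1, 1, 0, 3]
```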

  10. Sources of Reference Locality • Long-term popularity vs. short-term correlation (cf. web traces, Bestavros and Crovella). For example, in stock traces: … MS MS MS IBM IBM GG IBM MS MS … shows reference locality due to long-term popularity, while … A MS A A MS GG GG MS IBM … (George’s company A listed today!) shows reference locality due to short-term correlation.

  11. Independent Reference Model • With the independent, identically-distributed (IID) assumption, each tuple is drawn independently from the popularity distribution: P(xn = i) = pi. Problem: this only captures reference locality due to a skewed popularity profile.

  12. Metrics of Reference Locality • How to distinguish the two causes of reference locality? Take the original data stream S (… A MS A A MS GG GG MS IBM …) and a random permutation of S (MS MS GG A MS IBM IBM MS IBM …), and compare the IAD distributions of the two: permutation preserves the popularity profile but destroys short-term correlation.
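
A sketch of the permutation comparison on this slide, using the IAD definition from slide 8; the toy stream and the use of the mean IAD as a one-number summary are my simplifications:

```python
import random, statistics

def mean_iad(stream):
    """Mean inter-arrival distance (same definition as on slide 8)."""
    last, dists = {}, []
    for idx, v in enumerate(stream):
        if v in last:
            dists.append(idx - last[v] - 1)
        last[v] = idx
    return statistics.mean(dists)

original = list("AMAAMGGMI" * 100)   # toy stream standing in for a stock trace
permuted = original[:]
random.shuffle(permuted)             # keeps the popularity profile, destroys short-term correlation

# If the original's mean IAD is clearly smaller than the permuted one's,
# the locality cannot be explained by the popularity profile alone.
print(mean_iad(original), mean_iad(permuted))
```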

  13. Stock Transaction Traces • Daily stock transaction data from INET ATS, Inc. [Figure: Zipf-like popularity profile (log-log scale).]

  14. Stock Transaction Traces • [Figure: CDF of the IAD for the original and randomly permuted traces.] The permuted trace still has strong reference locality, due to the skewed popularity distribution.

  15. Network OD Flow Traces • Network traces of Origin-Destination (OD) flows in two major networks: US Abilene and Sprint-Europe. [Figure: Zipf-like popularity profile (log-log scale).]

  16. Network OD Flow Traces • [Figure: CDF of the IAD for the original and randomly permuted traces.]

  17. Outline • Motivation • Reference Locality: source and metrics • A Locality-Aware Data Stream Model • Application of Locality-Aware Model • Max-subset Join • Approximate count estimation • Data summarization • Performance Study • Conclusion

  18. Locality-Aware Stream Model • Stream S with its recent h tuples xn-h … xn-1 (e.g. … 2 2 4 10 5 10 7 7) and its popularity distribution P. The next tuple xn may repeat one of the recent h tuples: e.g. xn = 5 with P(xn = xn-4) = a4.

  19. Locality-Aware Stream Model • Alternatively, xn may be drawn from the popularity distribution P of S: e.g. xn = 2 with P(xn = 2 from the popularity profile) = b·p(2).

  20. Locality-Aware Stream Model • Xn = Xn-i with probability ai, where 1 ≤ i ≤ h, and Xn = Y with probability b, where Y is an IID random variable w.r.t. P and a1 + … + ah + b = 1. Equivalently, P(xn = c) = a1·δ(xn-1, c) + … + ah·δ(xn-h, c) + b·p(c), where δ(xk, c) = 1 if xk = c, and 0 otherwise. • A similar model appears for the caching of web traces, e.g. Konstantinos Psounis et al.
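
A minimal generative sketch of this model with illustrative parameter values (the function name and interface are not from the paper): with probability ai the next tuple repeats xn-i, otherwise (probability b) it is an IID draw from P:

```python
import random

def generate_stream(n, a, b, pop_values, pop_weights, seed=0):
    """Generate n tuples from the locality-aware model.
    a: [a_1, ..., a_h]; b: probability of an IID draw; a_1 + ... + a_h + b == 1.
    pop_values / pop_weights: the popularity distribution P."""
    rng = random.Random(seed)
    h = len(a)
    # Bootstrap the history with IID draws from P.
    stream = rng.choices(pop_values, weights=pop_weights, k=h)
    for _ in range(n):
        r, acc, nxt = rng.random(), 0.0, None
        for i, ai in enumerate(a, start=1):   # repeat x_{n-i} with probability a_i
            acc += ai
            if r < acc:
                nxt = stream[-i]
                break
        if nxt is None:                        # otherwise an IID draw (probability b)
            nxt = rng.choices(pop_values, weights=pop_weights, k=1)[0]
        stream.append(nxt)
    return stream[h:]

# Example: h = 2, a = [0.3, 0.2], b = 0.5, Zipf-ish popularity over 4 values
s = generate_stream(20, a=[0.3, 0.2], b=0.5,
                    pop_values=["MS", "IBM", "GG", "A"],
                    pop_weights=[0.5, 0.25, 0.15, 0.1])
print(s)
```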

  21. Infer the Model • Expected value for xn given the recent h tuples: E[xn] = a1·xn-1 + … + ah·xn-h + b·E[Y]. • Make N observations and infer the (h+1) parameters ai and b with the least-squares method: minimize over a1, …, ah, b the squared error between each observed xn and its expected value.
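
A hedged sketch of such a least-squares fit, assuming numeric tuple values and approximating E[Y] by the stream mean; the paper's constrained formulation is not reproduced here:

```python
import numpy as np

def fit_locality_model(stream, h):
    """Fit a_1..a_h and b by least squares, predicting
    x_n as a_1*x_{n-1} + ... + a_h*x_{n-h} + b*mean(P)."""
    x = np.asarray(stream, dtype=float)
    mu = x.mean()                      # stand-in for E[Y] under the popularity profile
    rows, targets = [], []
    for n in range(h, len(x)):
        rows.append(np.concatenate([x[n - h:n][::-1], [mu]]))  # [x_{n-1}, ..., x_{n-h}, mu]
        targets.append(x[n])
    A, y = np.asarray(rows), np.asarray(targets)
    # Note: this unconstrained fit does not enforce a_i, b >= 0 or a_1+...+a_h+b == 1.
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[:h], coeffs[h]

# Toy usage on an integer-valued stream
a, b = fit_locality_model([2, 2, 4, 10, 5, 10, 7, 7, 4, 4, 2, 2], h=3)
print(a, b)   # b close to 1 would indicate locality mostly due to popularity
```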

  22. Model on Real Traces - Stock • b: the degree of reference locality due to long-term popularity; 1 - b: the degree of reference locality due to short-term correlation.

  23. Model on Real Traces - OD Flow

  24. Utilizing Model for Prediction • Given stream S with observed tuples … xn-h … xn-1 and the future tuples xn, xn+1, xn+2, …, xn+T: ET(e) is the expected number of occurrences of a tuple with value e in a future period of T. It can be computed using only T + 1 constants calculated from the locality model of S.
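
A sketch of one way to use the fitted model for prediction: the recursion P(xn+t = e) = a1·P(xn+t-1 = e) + … + ah·P(xn+t-h = e) + b·p(e) follows from the model on slide 20, and summing it over the period gives an estimate of ET(e). The closed form with T + 1 precomputed constants is not reproduced here; this sketch simply rolls the recursion forward:

```python
def expected_occurrences(history, T, a, b, p, e):
    """Estimate E_T(e): expected occurrences of value e over the next T tuples,
    by rolling the locality-aware model forward.
    history: the last h observed tuples (oldest first); p: popularity profile dict."""
    h = len(a)
    # probs[k] = P(tuple at offset k equals e); for k <= 0 it is the 0/1
    # indicator taken from the observed history.
    probs = {-i: (1.0 if history[-1 - i] == e else 0.0) for i in range(h)}
    total = 0.0
    for t in range(1, T + 1):
        pt = b * p.get(e, 0.0)
        for i in range(1, h + 1):
            pt += a[i - 1] * probs[t - i]
        probs[t] = pt
        total += pt
    return total

# Toy usage with illustrative parameter values matching the model of slide 20
print(expected_occurrences(history=["MS", "A"], T=5,
                           a=[0.3, 0.2], b=0.5,
                           p={"MS": 0.5, "IBM": 0.25, "GG": 0.15, "A": 0.1},
                           e="MS"))
```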

  25. Outline • Motivation • Reference Locality: source and metrics • A Locality-Aware Data Stream Model • Application of Locality-Aware Model • Max-subset Join • Approximate count estimation • Data summarization • Performance Study • Conclusion

  26. Approximate Sliding Window Join • Sliding window joins, where the sum of the window sizes is the available memory size. Which tuples should be stored to maximize the size of the join result?

  27. Existing Approaches • Metric: Max-subset • Previous approaches: • Random load shedding: poor performance (J. Kang et al., A. Das et al.) • Frequency model: IID assumption (A. Das et al.) • Age-based model: too strict an assumption (U. Srivastava et al.) • Stochastic model: not universal (J. Xie et al.)

  28. Marginal Utility • [Illustration: stream R with tuples 8, 10 at indices n-1, n, joined against stream S = … 6 5 10 8 10 12 10 …, with T = 5.]

  29. Calculate Marginal Utility • Based on the locality model, we can show that the marginal utility has a closed form in which F depends on the characteristic equation of Pi, which is a linear recursive sequence. [Illustration: streams R and S around tuple index n, with probabilities P1, P2, … for the unknown future tuples.]

  30. ELBA • Exact Locality-Based Algorithm (ELBA) • Based on the previous analysis, calculate the marginal utility of the tuples in the buffer and evict the victim with the smallest value. • Expensive.

  31. LBA • Locality-Based Algorithm (LBA) • Assume T is fixed; approximate the marginal utility based on the prediction power of the locality model. • Depends on only T + 1 constants that could be pre-computed.
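
A toy sketch of the eviction idea behind ELBA/LBA, not the paper's algorithm: `utility_fn` stands in for the model-based estimate of a buffered tuple's expected future matches:

```python
class LocalityAwareBuffer:
    """Toy sliding-window join buffer: when full, evict the tuple whose
    estimated marginal utility (expected future matches) is smallest."""
    def __init__(self, capacity, utility_fn):
        self.capacity = capacity
        self.utility_fn = utility_fn   # value -> estimated expected future matches
        self.tuples = []               # buffered tuple values

    def insert(self, value):
        if len(self.tuples) >= self.capacity:
            # Pick the current minimum-utility tuple as the victim (recomputed here;
            # ELBA recomputes exactly, LBA uses the T+1 precomputed constants).
            victim = min(range(len(self.tuples)),
                         key=lambda i: self.utility_fn(self.tuples[i]))
            if self.utility_fn(self.tuples[victim]) < self.utility_fn(value):
                self.tuples[victim] = value
            return
        self.tuples.append(value)

    def probe(self, value):
        """Return matches for a tuple arriving on the other stream."""
        return [t for t in self.tuples if t == value]

# Usage with a hand-written utility function (in LBA this comes from the locality model)
buf = LocalityAwareBuffer(capacity=3,
                          utility_fn=lambda v: {"MS": 4.0, "IBM": 2.0, "GG": 0.5}.get(v, 0.1))
for v in ["GG", "IBM", "MS", "GG", "A"]:
    buf.insert(v)
print(buf.tuples)
```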

  32. Space Complexity • A histogram stores both P over a domain of size D and the T + 1 constants. • The histogram's space usage is polylogarithmic: O(poly[log N]) space for N values (A. Gilbert et al.)

  33. Sliding window join: varying buffer size – OD Flow

  34. Sliding window join: varying buffer size - Stock

  35. Sliding window join: varying window size - Stock

  36. Conclusion • The reference locality property is important for query processing with memory constraints in data stream applications. • Most real data streams have strong temporal locality, i.e. short-term correlations. • How about spatial locality, i.e. correlation among different attributes of a tuple?

  37. Thanks!

  38. Approximate Count Estimation • Derive a much tighter space bound for the Lossy Counting algorithm (G. Manku et al.) using locality-aware techniques. • A tight space bound is important, as it tells us how much memory to allocate.
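
For context, a minimal sketch of the standard Lossy Counting algorithm (Manku & Motwani) to which the tighter, locality-aware bound applies; this is the textbook version, not the locality-aware analysis:

```python
import math

def lossy_counting(stream, epsilon):
    """Standard Lossy Counting: approximate frequency counts with error at most
    epsilon * N, using buckets of width ceil(1/epsilon)."""
    width = math.ceil(1 / epsilon)
    entries = {}            # value -> (count, delta)
    for n, v in enumerate(stream, start=1):
        bucket = math.ceil(n / width)
        if v in entries:
            count, delta = entries[v]
            entries[v] = (count + 1, delta)
        else:
            entries[v] = (1, bucket - 1)
        if n % width == 0:   # prune at bucket boundaries
            entries = {k: (c, d) for k, (c, d) in entries.items() if c + d > bucket}
    return entries

# Values with estimated count >= (s - epsilon) * N are reported for support threshold s.
print(lossy_counting([1, 2, 1, 3, 1, 2, 1, 1, 4, 1], epsilon=0.2))
```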

  39. Data Summarization • Define entropy over a window in a data stream using locality-aware techniques, instead of the standard entropy definition. Important for data summarization, change detection, etc. For example, the streams 1 2 3 2 1 2 3 1 3 … and 1 1 1 2 2 2 3 3 3 … have identical value frequencies but very different reference locality.
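
A quick check (sketch) of why a locality-aware definition is needed: the standard frequency-based entropy cannot distinguish the two example streams, even though their reference locality differs sharply:

```python
import math
from collections import Counter

def frequency_entropy(window):
    """Standard (order-insensitive) empirical entropy of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

s1 = [1, 2, 3, 2, 1, 2, 3, 1, 3]   # interleaved: weak short-term correlation
s2 = [1, 1, 1, 2, 2, 2, 3, 3, 3]   # clustered: strong short-term correlation

# Same value frequencies, so the standard entropy is identical for both streams,
# even though s2 has much stronger reference locality.
print(frequency_entropy(s1), frequency_entropy(s2))   # both log2(3) ~= 1.585
```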

  40. Data Stream Entropy • A higher degree of reference locality implies lower entropy.
