Space-efficient Tracking of Persistent Items in a Massive Data Stream

Bibudh Lahiri and Srikanta Tirthapura, Electrical & Computer Engg., Iowa State University. Jaideep Chandrashekar, Technicolor Labs, Palo Alto. ACM DEBS 2011.

Presentation Transcript


  1. Space-efficient Tracking of Persistent Items in a Massive Data Stream • Bibudh Lahiri and Srikanta Tirthapura, Electrical & Computer Engg., Iowa State University • Jaideep Chandrashekar, Technicolor Labs, Palo Alto • ACM DEBS 2011

  2. Temporal Persistence: A Not-so-Discussed Problem in Data Streams • Motivation from security, formulation as a problem in streams • Botnets, port scans, click fraud • Appear in a temporally regular manner • Do the damage, yet evade the radar • Not necessarily in large volume (stealthy) • Heavy-hitter algorithms do not work, because they look for high volume rather than regular recurrence

  3. State of the Art in Data Stream Research • Frequency moments, heavy hitters, entropy, variance • For these, it is enough to know how many times i ∈ {1, …, m} occurs in the stream, for all i • Persistence: When does i occur in the stream? In how many slots, in total?

  4. Persistent Behavior in Botnet Traffic • Giroire et al. [1] • Consecutive connections to the same destination are often separated by an hour or more • Most bots occur in 100% of slots in a window when slot length (s) = 1 hr • MyBot-8926 occurs in 100% of slots when s = 16 hrs! • Li et al. [2] • Periodic botnet events about every ½ hr • [1] “Exploiting Temporal Persistence to Detect Covert Botnet Channels”, RAID 2009 • [2] “Automating Analysis of Large-scale Botnet Probing Events”, ASIACCS 2009

  5. Problem Definition • Time is split into slots 1, 2, …, n of equal length • Stream S = {(d_i, t_i)}; d_i: item ID, t_i ∈ {1, 2, …, n} • Window S_l^r over [l, r] = {(d_i, t_i) ∈ S | l ≤ t_i ≤ r} • p_d(l, r) = persistence of d in S_l^r = number of distinct slots in [l, r] in which d appears • Example stream by slot: 1: a, d, b | 2: c, d, e | 3: a, c, d, b | 4: a, b, a, c | 5: b, c | 6: a, b, b | 7: c, c, d, c • Hence p_a(4,7) = 2, p_b(4,7) = 3, p_c(4,7) = 3, p_d(4,7) = 1

  6. Problem Definition • Item d is α-persistent in S_l^r if it appears in at least α(r−l+1) slots • Example stream by slot: 1: a, d, b | 2: c, d, e | 3: a, c, d, b | 4: a, b, a, c | 5: b, c | 6: a, b, b | 7: c, c, d, c • p_a(4,7) = 2, p_b(4,7) = 3, p_c(4,7) = 3, p_d(4,7) = 1 • With α = 0.5 the threshold over [4,7] is 2 slots, so a, b and c are α-persistent in [4,7]; d is not • Goal: To detect α-persistent items
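The definitions on slides 5 and 6 can be checked directly with an exact (not space-efficient) computation. The following is a minimal sketch; the function names `persistence` and `alpha_persistent` and the `STREAM` encoding are illustrative assumptions, not names from the talk.

```python
from collections import defaultdict

def persistence(stream, l, r):
    """Map each item to the number of distinct slots in [l, r] where it appears."""
    slots_seen = defaultdict(set)
    for d, t in stream:
        if l <= t <= r:
            slots_seen[d].add(t)  # a set ignores repeats within the same slot
    return {d: len(s) for d, s in slots_seen.items()}

def alpha_persistent(stream, l, r, alpha):
    """Items that appear in at least alpha * (r - l + 1) slots of [l, r]."""
    threshold = alpha * (r - l + 1)
    return {d for d, p in persistence(stream, l, r).items() if p >= threshold}

# The example stream from the slides: one string of item IDs per slot 1..7.
SLOTS = ["adb", "cde", "acdb", "abac", "bc", "abb", "ccdc"]
STREAM = [(d, t) for t, items in enumerate(SLOTS, start=1) for d in items]

print(persistence(STREAM, 4, 7))
print(alpha_persistent(STREAM, 4, 7, alpha=0.5))
```

This baseline keeps one set per distinct item, which is exactly the cost the rest of the talk tries to avoid.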

  7. Our Contributions • Lower bound: Exact tracking needs Ω(|D| · log nα) space • Approximate tracking: • Detect items with p_d ≥ (α−ε)n with high probability • Items with p_d < (α−ε)n are not reported as persistent

  8. Our Contributions • First algorithm for this problem with any provable guarantee • Small-space algorithm • Space complexity O(1/ε) for Zipfian distributions • Up to 85% less physical memory than the naïve algorithm • Typical FPR < 1%, FNR < 4%

  9. Talk Organization • Introduction • Fixed-window algorithm • Sliding-window algorithm • Evaluation

  10. Approximate Tracking • Detect items with p_d ≥ (α−ε)n with high probability (whp) • Do not report items with p_d < (α−ε)n • Fixed window: p_d computed over slots [1, n]

  11. Talk Organization • Introduction • Fixed-window algorithm • Sliding-window algorithm • Evaluation

  12. Intuition: Fixed-Window Algorithm • “Sample and count” • Sample a random element in the stream • Once sampled, count occurrences of the item exactly • Persistence: count only one occurrence per slot • Sampling method • Send every (d,t) through a hash-based filter on h(d,t) • Chance of passing the filter << 1 (in fact, 2/(εn))

  13. Intuition: Fixed-Window Algorithm • Same d, same t: h(d,t) remains the same • Re-occurrences in the same slot do not help • Same d, different t: the values h(d,t) are independent • The tuple (d, t_d, n_d) is initialized when (d,t) first passes the filter • Persistent item: enough chances to cross the filter • Transient item: fewer chances

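  14. Intuition: Fixed-Window Algorithm • [Figure: worked example on the stream a b b b c c a a f c a a over slots 1 to 4, with filter threshold 1/2] • For each arrival (d,t): is d already in the sketch S? • If no: start tracking d when h(d,t) < 1/2, otherwise drop the arrival • If yes: update the tuple for d when t_d < t • Sketch tuples (d, t_d, n_d) shown in the figure: (c,2,1), (c,4,2), (a,3,1), (a,4,2)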
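The sample-and-count idea on slides 12 to 14 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the BLAKE2-based `h`, the `seed` parameter, and the reporting threshold (α−ε)n are assumptions, the last one chosen so that an item with p_d < (α−ε)n can never be reported, since its count never exceeds p_d.

```python
import hashlib

def h(d, t, seed=0):
    """Deterministic pseudo-random value in [0, 1) for the pair (d, t) (assumed hash)."""
    digest = hashlib.blake2b(f"{seed}|{d}|{t}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") >> 11) / 2**53

def fixed_window_sketch(stream, tau, seed=0):
    """One pass over (d, t) pairs in slot order; returns {d: [t_d, n_d]}."""
    sketch = {}
    for d, t in stream:
        entry = sketch.get(d)
        if entry is not None:
            if entry[0] < t:          # first occurrence of d in a new slot
                entry[0] = t
                entry[1] += 1
        elif h(d, t, seed) < tau:     # (d, t) passes the filter: start counting d
            sketch[d] = [t, 1]
    return sketch

def report_persistent(sketch, alpha, eps, n):
    """Report tracked items whose count reaches (alpha - eps) * n (assumed threshold)."""
    threshold = (alpha - eps) * n
    return {d for d, (_, count) in sketch.items() if count >= threshold}

# Example stream from slide 5, one string of item IDs per slot 1..7.
SLOTS = ["adb", "cde", "acdb", "abac", "bc", "abb", "ccdc"]
STREAM = [(d, t) for t, items in enumerate(SLOTS, start=1) for d in items]

sketch = fixed_window_sketch(STREAM, tau=0.5)
print(report_persistent(sketch, alpha=0.7, eps=0.1, n=7))
```

Because `h` depends only on (d, t), repeated arrivals of an item in one slot get the same filter decision, which is what makes re-occurrences within a slot useless for passing the filter. With `tau=1.0` every item passes on first sight and the counts become exact.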

  15. Performance: Fixed-Window Algorithm • False negatives: p_d ≥ αn ⇒ Pr(reported transient) ≤ e^(−2) ≈ 13.5% • Drops to δ with O(log(1/δ)) parallel instances • p_d < (α−ε)n ⇒ d is never reported as persistent • Space = O(P · log(1/δ)/(εn)), where P = Σ_{d ∈ D(S)} p_d • Reduces to O(1/ε) for Zipfian distributions • Processing time per element: O(log(1/δ))
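The step from a per-instance miss probability of e^(−2) to a target δ can be made concrete with a little arithmetic. The sketch below assumes the parallel instances use independent hash functions and that an item is reported if any instance reports it, so the miss probabilities multiply; the function names are illustrative.

```python
import math

def instances_needed(delta):
    """Smallest k such that (e^-2)^k <= delta, under the independence assumption."""
    return max(1, math.ceil(math.log(1 / delta) / 2))

def miss_bound(k):
    """Upper bound on missing a persistent item with k independent instances."""
    return math.exp(-2 * k)

for delta in (0.1, 0.01, 0.001):
    k = instances_needed(delta)
    print(delta, k, miss_bound(k))
```

The count grows as ln(1/δ)/2, which matches the O(log(1/δ)) factor in both the space and the per-element processing time on the slide.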

  16. Talk Organization • Introduction • Fixed-window algorithm • Sliding-window algorithm • Evaluation

  17. Sliding-Window Algorithm • p_d^c: persistence of d in [c−n+1, c] • Detect items with p_d^c ≥ (α−ε)n whp • Do not report items with p_d^c < (α−ε)n • Intuition • Start a new fixed-window data structure S_t in every distinct slot t where d occurs • Won’t that take too much space? • No…

  18. Intuition: Sliding-Window Algorithm • Observations • d passes the filter and initializes S_t in only a few slots • In [c−n+1, …, j, …, c], if d first passes the filter in slot j, then S_j represents p_d^c most accurately • Note: We save the space for S_{c−n+1}, S_{c−n+2}, …, S_{j−1} • At c, we can discard any S_r where r ≤ c−n • Sketch is {(d, t, n_{d,t}, t_{d,t})} • Fields: item, slot when initialized, how many slots counted, most recent slot

  19. Intuition: Sliding-Window Algorithm • [Figure: worked example on the stream a b c c a a f c c a over slots 1 to 4, with filter threshold 1/2] • For each arrival (d,t): is (d,t) already in the sketch S? If not, a new tuple is started when h(d,t) < 1/2 • Sketch tuples (d, t, n_{d,t}, t_{d,t}) shown in the figure: (a,2,1,2), (a,2,2,3), (a,3,1,3), (a,3,2,4), (a,4,1,4), (c,2,1,2), (c,2,2,3), (c,3,1,3), (c,3,2,4)
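The sliding-window bookkeeping on slides 17 to 19 can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the class name `SlidingSketch`, the BLAKE2-based `h`, and the reporting threshold (α−ε)n are hypothetical, and the query simply uses, for each item, the earliest structure started inside the current window.

```python
import hashlib
from collections import defaultdict

def h(d, t, seed=0):
    """Deterministic pseudo-random value in [0, 1) for the pair (d, t) (assumed hash)."""
    digest = hashlib.blake2b(f"{seed}|{d}|{t}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") >> 11) / 2**53

class SlidingSketch:
    def __init__(self, n, tau, seed=0):
        self.n, self.tau, self.seed = n, tau, seed
        self.by_item = defaultdict(dict)  # d -> {start slot t: [n_dt, t_dt]}

    def add(self, d, t):
        starts = self.by_item[d]
        for entry in starts.values():     # count slot t once in each S_j tracking d
            if entry[1] < t:
                entry[0] += 1
                entry[1] = t
        if t not in starts and h(d, t, self.seed) < self.tau:
            starts[t] = [1, t]            # (d, t) passes the filter: start S_t

    def estimates(self, c):
        """Per item, the count held by its earliest structure started in [c-n+1, c]."""
        result = {}
        for d, starts in self.by_item.items():
            for s in [s for s in starts if s <= c - self.n]:
                del starts[s]             # discard S_r with r <= c - n
            if starts:
                result[d] = starts[min(starts)][0]
        return result

    def report(self, c, alpha, eps):
        threshold = (alpha - eps) * self.n  # assumed reporting threshold
        return {d for d, cnt in self.estimates(c).items() if cnt >= threshold}

# Example stream from slide 5, queried over the window [4, 7] (n = 4, c = 7).
SLOTS = ["adb", "cde", "acdb", "abac", "bc", "abb", "ccdc"]
sk = SlidingSketch(n=4, tau=0.5)
for t, items in enumerate(SLOTS, start=1):
    for d in items:
        sk.add(d, t)
print(sk.report(c=7, alpha=0.5, eps=0.1))
```

A structure started at slot j counts only slots from j onward, so its count never exceeds the item's true persistence in the window; with `tau=1.0` the earliest in-window structure starts at the item's first in-window slot and the estimate is exact.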

  20. Talk Organization • Introduction • Fixed-window algorithm • Sliding-window algorithm • Evaluation

  21. Evaluation • Typically skewed distribution • 885 million packets, 30-sec slots ⇒ 350 slots in ~3 hrs of data • Query windows: [1,100], [26,125], …, [251,350] • In the [1,100] window, ~570k distinct IPs, but ~500k of them occur in < 10 slots • Storing a counter for every distinct item is a waste of space

  22. Evaluation • FNR is mostly within 5%, even when ε = 0.49 for α = 0.7 • Even the highest FPR is < 3% • Small-space algorithm saves up to 85% space compared to the naïve algorithm • 445 MB instead of 3 GB

  23. Summary • Persistent items: an important problem in its own right • Motivation: botnet detection, port scans • Exact solution needs to store all distinct items • Approximate, small-space solutions for fixed and sliding windows • Asymptotically the same space for both • 70-85% saving in memory for typical values of α (0.5, 0.7) and ε (0.4α to 0.6α)

  24. Thank You!
