ABSTRACT

SUMMARY OF RESULTS

PROBLEM STATEMENTS

Result 1.1: We designed a one-pass algorithm that estimates FH to a (1 + ε)-factor in Õ(1) space, assuming FH is sufficiently large.

Result 1.2: If FH is too small to satisfy the assumption above, then we proved that no one-pass, Õ(1)-space algorithm can approximate it to even a constant factor.

Result 2.1: We designed a one-pass algorithm that estimates H to a (1 + ε)-factor in Õ(1) space if H is sufficiently large, or Õ(m2/3) space in general.

Result 2.2: We designed a two-pass algorithm that estimates H to a (1 + ε)-factor in Õ(1) space for any H.

We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space.

Our first result deals with a measure we call the “entropy norm” of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic space one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold.

Our second group of results are for estimating the empirical entropy of an input stream. We first present a sublinear space one-pass algorithm for this problem. For a stream of m items and a given real parameter α, our algorithm uses space Õ(m2α) and provides an approximation of 1/α in the worst case and (1 + ε) in “most” cases. We then present a two-pass polylogarithmic space (1 + ε)-approximation algorithm.

- We want to estimate the following statistics on the input stream in o(n), possibly Õ(1), space and a single pass.
- Entropy norm:
- Empirical entropy:

CONCLUSIONS

BACKGROUND AND MOTIVATIONS

Our complete results have been published as a technical report (DIMACS-TR-2005-33). We believe our algorithms will be of practical interest in data stream systems, as recent work in the networking community appear to be converging on entropy as a reasonable approach to anomaly detection [3, 5].

In future work, although we have proven our entropy norm algorithm to be optimal, it appears feasible to improve our last algorithm for estimating empirical entropy (Result 2.2) to complete in a single pass. It will also be of interest to study these problems on streams in the presence of deletions as well as insertions.

Already the focus of much recent research, algorithms for computational problems on data streams grow increasingly essential in today’s highly connected world. In this model, the input is a stream of “items” too long to be stored completely in memory, and a typical problem involves computing some statistic on this stream. The challenge is to design algorithms efficient not only in terms of running time, but also in terms of space: sublinear is a must and polylogarithmic is often the goal.

The quintessential need for such algorithms arises in analyzing IP network traffic at packet level on high speed routers. In monitoring IP traffic, one cares about anomalies, which in general are hard to define and detect since there are subtle intrusions and sophisticated dependence amongst network events and agents. An early attempt to capture the overall behavior of a data stream, now a classical problem in the area, was a family of statistics called frequency moments. If a stream of length m contains mi occurrences of item i (1 in), then its kth frequency moment, denoted Fk, is defined by . In their seminal paper, Alon et al. [1] showed that Fk can be estimated arbitrarily well in o(n) space for all integers k 0 and in Õ(1) space for k 2. Their results were later improved by Coppersmith and Kumar [2] and Indyk and Woodruff [4].

REFERENCES

[1] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. Proc. ACM STOC, 20-29, 1996.

[2] D. Coppersmith and R. Kumar. An improved data stream algorithm for frequency moments. ACM-SIAM SODA, 151-156, 2004.

[3] Y. Gu, A. McCallum, D. Towsley. Detecting anomalies in network traffic using maximum entropy estimation. Proc. Internet Measurement Conference, 2005.

[4] P. Indyk and D. Woodruff. Optimal approximations of the frequency moments of data streams. ACM STOC, 202-208, 2005.

[5] K. Xu, Z. Zhang, S. Bhattacharya. Profiling internet backbone traffic: behavior models and applications. Proc. ACM SIGCOMM, 2005.

Estimating Entropy and Entropy Norm on Data StreamsAmit Chakrabarti† Khanh Do Ba†,* S. Muthukrishnan‡† Department of Computer Science, Dartmouth College‡ Department of Computer Science, Rutgers University* Work done during DIMACS REU 2005; supported by a Dean of Faculty Undergraduate Research Grant

OUTLINE OF ALGORITHMS

- A subroutine computes the basic estimator: random variable X whose mean is the target quantity and whose variance is small. The algorithm uses this subroutine to maintain s1s2 independent basic estimators Xij, for 1 is1, 1 is2. It outputs a final estimatorY, defined by
- For entropy norm, our basic estimator subroutine is
- Input stream:A = a1, a2,…, am, where each ai {1,…, n}.
- Choose p uniformly at random from {1,…, m}.
- Let r = |{q : aq = ap, pqm}|.
- Let X = m[r lg r – (r – 1)lg(r – 1)], where 0 lg 0 = 0.

- For empirical entropy, the subroutine is identical except for line 3, where we now have