
Statistical Reconstruction of Largest Contributors to Network Traffic (Fisherman’s Dilemma)



Presentation Transcript


  1. Statistical Reconstruction of Largest Contributors to Network Traffic (Fisherman’s Dilemma). VALERY KANEVSKY, Agilent Laboratories

  2. Fisherman’s Dilemma. How does this catch represent the most numerous species in the sea? (Pike, Trout, Salmon.) Agilent Technologies

  3. [Figure: packet samples (Sample 1, Sample 2, ...) plotted by destination]

  4. Fisherman’s Formulation: If a certain % of the fish he catches (samples) belong to a set of species, how likely is it that fish from the same set of species constitute “almost” the same % of the entire sea?

  5. Mathematical Formulation. Let $S$ be a finite or countable set of features/characteristics of Internet traffic and $F$ an a priori probability distribution over $S$. Let $n$ be the (arbitrary) size of a sample drawn from $S$ with replacement, and $s$ the set of distinct features in it. Let $u$ be a subset of $s$ and $F_n(u)$ the empirical distribution, i.e., the fraction of the sample with features in $u$. If $u$ is a subset of high contributors observed in a sample, i.e., $F_n(u) \ge a + \varepsilon$ $(0 < a < 1,\ \varepsilon > 0)$, then how likely is it that $F(u) > a$? That is, what is the confidence of the inference $F_n(u) \ge a + \varepsilon \Rightarrow F(u) > a$, or in other words, what is the probability $P(F(u) > a)$, and how does it depend on the sample size $n$, the contribution level $a$, and the error margin $\varepsilon$?
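The quantities in this formulation can be made concrete with a small sketch. It computes the empirical fraction F_n(u) for a sample and builds one candidate set of high contributors by taking features in decreasing order of sample count until the level is reached. The function names and the toy sample are assumptions made for illustration; the slides do not prescribe how u is chosen.

```python
from collections import Counter

def empirical_fraction(sample, u):
    """F_n(u): fraction of the sample whose feature lies in the set u."""
    u = set(u)
    return sum(1 for x in sample if x in u) / len(sample)

def top_contributors(sample, level):
    """Features taken in decreasing order of sample count until F_n >= level.

    This greedy choice is an illustrative assumption, not the slides' method.
    """
    n = len(sample)
    chosen, covered = [], 0
    for feature, count in Counter(sample).most_common():
        chosen.append(feature)
        covered += count
        if covered / n >= level:
            break
    return set(chosen)

# Toy sample of n = 10 observations over three hypothetical features.
sample = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
u = top_contributors(sample, 0.75)   # level a + eps, e.g. a = 0.70, eps = 0.05
share = empirical_fraction(sample, u)
```

Whether the true share F(u) also exceeds a is exactly what the sample alone cannot tell, which is the question the slide poses.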

  6. What’s the difference? Classical statistics: given some a priori assumption about the underlying distribution and a sample, estimate the probability of an event $E$ (e.g., $E$ = {traffic related to a given set of features $w$ constitutes at least $a\%$ of the total}), along with the confidence interval and the corresponding confidence level. Current context: the set of features $w$ whose corresponding traffic constitutes at least $a\%$ of the total depends on a random sample, as opposed to being fixed as in the classical case.

  7. We do not estimate the “true” fraction $F(u)$ of the traffic related to a given set $u$ of features! Neither do we estimate the probability $P(F(u) > a)$ that the true fraction $F(u)$, for a given $u$, contributes at the level $a$, since the latter is either 1 or 0. We estimate the confidence in the inference $F_n(u) \ge a + \varepsilon \Rightarrow F(u) > a$.

  8. Statistical Game. Every sample of size $n$ yields a set of contributors to a certain level $a + \varepsilon$. After $N$ such samplings we obtain a collection: $set_1, \ldots, set_N$. 99% of the time these sets are contributors to the level $a$. What we want is, for a given $\varepsilon$, to find a sample size $n$ that makes the previous assertion true.
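The game can be played numerically when the true distribution is known. The sketch below repeatedly draws a sample with replacement, picks the top-ranked features that reach the level a + eps in the sample, and records how often their true share exceeds a. The greedy choice of the set, the function name, and the truncated geometric distribution used as input are all assumptions made for illustration.

```python
import random
from collections import Counter

def play_game(probs, n, a, eps, rounds, seed=0):
    """Fraction of rounds in which the sampled top set truly exceeds level a."""
    rng = random.Random(seed)
    features = list(range(len(probs)))
    hits = 0
    for _ in range(rounds):
        sample = rng.choices(features, weights=probs, k=n)
        chosen, covered = [], 0
        for feature, count in Counter(sample).most_common():
            chosen.append(feature)
            covered += count
            if covered / n >= a + eps:   # contributors to level a + eps
                break
        if sum(probs[f] for f in chosen) > a:   # true share F(u) > a
            hits += 1
    return hits / rounds

# Hypothetical a priori distribution: geometric with ratio 1/2, 20 features.
weights = [0.5 ** (i + 1) for i in range(20)]
total = sum(weights)
probs = [w / total for w in weights]

frequency = play_game(probs, n=2000, a=0.60, eps=0.05, rounds=200)
```

Varying n for a fixed eps shows how large a sample is needed before the observed frequency stays above the desired confidence.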

  9. Test underlying assumptions. We can’t test a theorem, provided it is correct, but we can test the underlying assumptions by looking at various packet traffic data records and finding out how the frequency of the inference $F_n(u) \ge a + \varepsilon \Rightarrow F(u) > a$ deviates from the value guaranteed by the theorem. LBL-TCP-3. Description: This trace contains two hours' worth of all wide-area TCP traffic between the Lawrence Berkeley Laboratory and the rest of the world. Format: The trace was reduced from tcpdump format to ASCII using the sanitize-tcp and sanitize-syn-fin scripts. The first script was used to produce lbl-tcp-3.tcp, which has six columns: timestamp, (renumbered) source host, (renumbered) destination host, source TCP port, destination TCP port, and number of data bytes (zero for "pure-ack" packets). The second script generated lbl-tcp-3.sf, which includes the same first five columns, plus TCP flags (SYN/FIN/RST/PSH etc.), sequence number, and acknowledgement number (0 for initial SYN).
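A trace in the six-column layout described above can be reduced to per-destination traffic shares with a few lines of code. This is a minimal sketch that assumes the columns are separated by whitespace; the function names and the sample records are hypothetical and are not taken from the actual trace.

```python
from collections import Counter

def traffic_by_destination(lines, by_bytes=False):
    """Count packets (or data bytes) per destination host.

    Assumes six whitespace-separated columns per record:
    timestamp, source host, destination host, source port,
    destination port, number of data bytes.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        _time, _src, dst, _sport, _dport, nbytes = fields
        totals[dst] += int(nbytes) if by_bytes else 1
    return totals

def shares(totals):
    """Convert per-destination totals into fractions of the whole."""
    grand_total = sum(totals.values())
    return {dst: count / grand_total for dst, count in totals.items()}

# Hypothetical records, made up for illustration.
records = [
    "0.010 1 2 23 2436 1",
    "0.020 2 1 2436 23 0",
    "0.035 3 2 20 1045 512",
    "0.050 4 5 25 1046 100",
]
packet_shares = shares(traffic_by_destination(records))
byte_totals = traffic_by_destination(records, by_bytes=True)
```

Computing the shares once over the full trace gives the reference distribution against which sampled contributor sets can be checked.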

  10. What do we need? We need a uniform estimate for the confidence of the inference $F_n(u) \ge a + \varepsilon \Rightarrow F(u) > a$, spread over a collection of subsets of features $u$ which may appear as subsets of “large contributors”. Warning! All else being equal, the larger the collection of subsets, the lower the confidence level may be.

  11. How to select a collection? A collection has to be: “well defined”; “tractable” (“easy to compute”); as small as possible.

  12. Minimal subsets. One candidate is the collection of all subsets of features found in a sample. Though obvious, this choice may not be terribly good. Can we do better? Definition: given a contribution level $a\%$, a subset of features $u$ is called minimal if there is no other subset in $s$ of smaller cardinality that contributes to the same level. For a given set of features there can be more than one minimal subset present in $s$, though their number, generally speaking, shrinks as $a\%$ approaches 100%.

  13. An answer: Let $p_1 \ge \ldots \ge p_k \ge \ldots$ be an ordered a priori distribution of features. For a given sample size $n$, $k$ is defined as the smallest solution of the inequality $p_1 + \ldots + p_k > 1 - 1/n$. Then $P(F(u) > a) > 1 - EL \cdot e^{-\varepsilon^{2} n}$, where $EL$ is the average number of minimal subsets and $EL < \mathrm{Constant} \cdot 2^{k}/k^{1/2}$, with $\mathrm{Constant} \approx 2.36$. Can $e^{-\varepsilon^{2} n}$ offset the growing term $2^{k}/k^{1/2}$? It depends on how fast $k$ grows with $n$.
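The two ingredients of this slide, k and the bound on EL, are easy to evaluate for a concrete distribution. The sketch below reads the inequality as "the cumulative mass of the k largest probabilities exceeds 1 - 1/n" and takes the constant as e^(13/12) (2/pi)^(1/2), the expression that appears on slide 15 and evaluates to about 2.36. The function names and the geometric test distribution are assumptions made for illustration.

```python
import math
from itertools import accumulate

# e^(13/12) * (2/pi)^(1/2), about 2.36 (expression taken from slide 15).
CONSTANT = math.exp(13 / 12) * math.sqrt(2 / math.pi)

def smallest_k(probs, n):
    """Smallest k with p_1 + ... + p_k > 1 - 1/n, probabilities sorted descending."""
    ordered = sorted(probs, reverse=True)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative > 1 - 1 / n:
            return k
    raise ValueError("distribution too short to cover mass 1 - 1/n")

def el_bound(k):
    """Upper bound Constant * 2^k / k^(1/2) on the average number of minimal subsets."""
    return CONSTANT * 2 ** k / math.sqrt(k)

# Hypothetical distribution p_i = 2^(-i), i = 1..60 (geometric with ratio 1/2).
probs = [0.5 ** i for i in range(1, 61)]
k = smallest_k(probs, n=1000)
bound = el_bound(k)
```

Because the bound doubles with every unit increase of k, the growth rate of k with n decides whether the exponential term can compensate.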

  14. Examples and Analysis. Share of traffic per destination (1 to 7). Sample 1: 20%, 15%, 10%, 15%, 30%, 5%, 5%. Sample 2: 0%, 25%, 5%, 15%, 35%, 5%, 15%. Contributors to 90%: (1,2,3,4,5) in Sample 1; (5,2,4,7) in Sample 2. Contributors to 60%: (5,1,3), (5,2,4), (5,1,2) in Sample 1; (5,2) in Sample 2.
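The contributor sets in this example can be reproduced by brute force from the definition of a minimal subset on slide 12: find the smallest cardinality at which some subset reaches the level, and list every subset of that size that does. The sketch below is an exhaustive search, practical only for a handful of features; the function name is an assumption, and shares are kept as integer percentages to avoid rounding issues.

```python
from itertools import combinations

def minimal_subsets(shares, level):
    """All subsets of smallest cardinality whose combined share reaches level.

    shares maps a feature to its share of the sample in percent.
    """
    features = sorted(shares)
    for size in range(1, len(features) + 1):
        found = [
            subset
            for subset in combinations(features, size)
            if sum(shares[f] for f in subset) >= level
        ]
        if found:
            return found
    return []

# Per-destination shares (percent) read from the slide.
sample1 = {1: 20, 2: 15, 3: 10, 4: 15, 5: 30, 6: 5, 7: 5}
sample2 = {1: 0, 2: 25, 3: 5, 4: 15, 5: 35, 6: 5, 7: 15}

sample1_at_60 = minimal_subsets(sample1, 60)
sample2_at_60 = minimal_subsets(sample2, 60)
sample1_at_90 = minimal_subsets(sample1, 90)
sample2_at_90 = minimal_subsets(sample2, 90)
```

Under this reading of the definition, Sample 1 at the 60% level also yields (1,4,5), which reaches 65% and is not among the sets listed on the slide.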

  15. Exponential distribution. Cumulative distribution function: $F(l) = 1 - \gamma^{l}$ $(0 < \gamma < 1)$. Confidence level: 99%. Error margin: $\varepsilon = 5\%$. $k(n) \approx \ln(n)/\ln(1/\gamma) + 1$. $EL < e^{13/12}\,(2/\pi)^{1/2}\,(\ln(1/\gamma))^{1/2}\, n^{\ln 2/\ln(1/\gamma)} / (\ln n)^{1/2}$. Example: when $\gamma = 1/2$, $n \approx 5150$.

  16. [Figure: confidence level, exponential distribution]

  17. Power Law. A qualitatively different result follows if the tail of the distribution is “heavier” than exponential, e.g., obeys the power law $F(l) = 1 - 1/l^{\alpha}$. In this case $k(n) = n^{1/\alpha}$ and $EL = O\!\left(2^{\,n^{1/\alpha}} / n^{1/(2\alpha)}\right)$. For an arbitrary value of $\varepsilon$ and when $\alpha > 1$, $k(n)$ grows more slowly than $n$, so the exponential term still prevails for large enough $n$. For instance, if $\alpha = 2$, under the same conditions as in the previous case, $n \approx 79{,}000$.

  18. [Figure: confidence level, power law]

  19. Weibull Distribution: $F(x) = 1 - e^{-x^{c}}$. In this case $k = (\ln n)^{1/c}$. To achieve the same 99% confidence level with $c = 0.3$, $n$ should be around 2,083,500. The remarkable increase in the sample size arises because, although the tail of the Weibull distribution goes to zero faster than, say, the power law with $\alpha = 2$, the decay kicks in only for large values of $n$. E.g., $e^{-n^{0.3}}$ becomes smaller than $1/n^{2}$ only when $n$ is about 22,000.
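The crossover quoted at the end of this slide can be checked directly. The Weibull tail exp(-n^0.3) drops below the power-law tail 1/n^2 exactly when n^0.3 exceeds 2 ln n, so the sketch below locates the sign change of n^c - 2 ln n by bisection. The function name and the bracketing interval are assumptions; the interval was chosen so that the difference is negative at its left end and positive at its right end.

```python
import math

def tail_crossover(c=0.3, power=2.0, lo=1e3, hi=1e6, tol=0.5):
    """n beyond which exp(-n^c) < n^(-power), i.e. n^c > power * ln(n).

    Plain bisection on f(n) = n^c - power * ln(n) over [lo, hi].
    """
    def f(n):
        return n ** c - power * math.log(n)

    if not (f(lo) < 0 < f(hi)):
        raise ValueError("interval does not bracket the crossover")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi

crossover = tail_crossover()
```

The root lands a little below 22,000, in line with the slide's "about 22,000".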

  20. [Figure: confidence level, Weibull distribution]
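The sample sizes quoted on slides 15, 17 and 19 can be approximated numerically. The sketch below rests on several assumptions: it reads the slide-13 bound as P > 1 - EL * exp(-eps^2 * n), bounds EL by e^(13/12) (2/pi)^(1/2) * 2^k / k^(1/2), and uses k(n) = ln(n)/ln(1/gamma) for the exponential case (without the "+1", as in the EL expression on slide 15), k(n) = n^(1/alpha) for the power law, and k(n) = (ln n)^(1/c) for the Weibull case. The parameter names gamma, alpha and c and the function names are labels chosen here. Everything is done in logarithms because 2^k overflows for the Weibull case.

```python
import math

# ln of the constant e^(13/12) * (2/pi)^(1/2), about ln(2.36).
LOG_CONSTANT = 13 / 12 + 0.5 * math.log(2 / math.pi)

def log_failure(n, k_of_n, eps):
    """ln of the assumed failure bound Constant * 2^k / sqrt(k) * exp(-eps^2 * n)."""
    k = k_of_n(n)
    return LOG_CONSTANT + k * math.log(2) - 0.5 * math.log(k) - n * eps ** 2

def required_n(k_of_n, eps=0.05, confidence=0.99):
    """Smallest n (within the final bracket) whose bound reaches the confidence.

    Doubles n until the bound holds, then bisects; this assumes the bound
    is decreasing in n over the final bracket, which holds for the three
    cases below.
    """
    target = math.log(1 - confidence)
    hi = 2
    while log_failure(hi, k_of_n, eps) > target:
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_failure(mid, k_of_n, eps) > target:
            lo = mid
        else:
            hi = mid
    return hi

n_exponential = required_n(lambda n: math.log(n) / math.log(2))   # gamma = 1/2
n_power_law = required_n(lambda n: n ** 0.5)                      # alpha = 2
n_weibull = required_n(lambda n: math.log(n) ** (1 / 0.3))        # c = 0.3
```

Under these assumptions the three results fall within a few percent of the values on the slides, which suggests the assumed form of the bound is at least close to the one used there.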

  21. Acknowledgments: Andrei Broido, Jim Davis, Sergey Nagaev, Graham Pollock, Joe Sventek, Lance Tatman.
