Mining Port-level IP Traffic Data

Mining Port-level IP Traffic Data. Errol Caby AT&T Labs. Outline. IP traffic metrics Exploring the relationship between IP traffic metrics Classifying IP traffic patterns Making IP traffic projections. IP Traffic Metrics.


Presentation Transcript


  1. Mining Port-level IP Traffic Data • Errol Caby • AT&T Labs

  2. Outline • IP traffic metrics • Exploring the relationship between IP traffic metrics • Classifying IP traffic patterns • Making IP traffic projections

  3. IP Traffic Metrics • IP services such as VPN (Virtual Private Network) are provided through ports, which are identified by IP address and circuit ID. High utilization (a high traffic level relative to the port’s bandwidth) may degrade these services. Consequently, it is valuable to analyze IP traffic data at the port level to identify ports that currently have high utilization and to predict those that will reach high utilization within a given period of time. • Two IP traffic metrics: • Monthly utilization • The monthly utilization of a circuit is the average of the daily peak utilization for the month, where utilization measures the fraction (or percent) of bandwidth used. • Hours of over-utilization • The hours of over-utilization of a circuit is the length of time (in hours) during a month that the utilization exceeds a specified threshold.
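The two metrics defined on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how often utilization is sampled, so the sketch assumes one utilization sample per hour (as a fraction of bandwidth), and the function names, the toy data and the 0.80 threshold are all hypothetical.

```python
from statistics import mean


def monthly_utilization(hourly_util_by_day):
    """Average of the daily peak utilization over the month.

    hourly_util_by_day: one list per day, each holding that day's
    utilization samples as fractions of bandwidth (assumed hourly).
    """
    return mean(max(day) for day in hourly_util_by_day)


def hours_over_utilization(hourly_util_by_day, threshold):
    """Count the hourly samples in the month that exceed the threshold.

    With one sample per hour (an assumption), the count is the number
    of hours of over-utilization.
    """
    return sum(1 for day in hourly_util_by_day for u in day if u > threshold)


# Toy "month" of three days with four samples each (hypothetical data).
month = [
    [0.20, 0.50, 0.90, 0.40],
    [0.10, 0.85, 0.60, 0.30],
    [0.30, 0.40, 0.70, 0.95],
]
m1 = monthly_utilization(month)                     # mean of 0.90, 0.85, 0.95
m2 = hours_over_utilization(month, threshold=0.80)  # three samples above 0.80
```

Because m1 only looks at each day's peak while m2 counts every hour above the threshold, two ports with similar m1 can still differ in m2.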

  4. Exploring The Relationship Between The Two IP Traffic Metrics • Let m1 and m2 denote the monthly utilization and hours of over-utilization metrics, respectively. (That is, if x is a port, then m1(x) and m2(x) denote its monthly utilization and hours of over-utilization, respectively.) We would like to examine the relationship between m1 and m2; in particular, we would like to find a mapping f such that • m1(x) = f(m2(x)) • for any port x. • The challenge • The available data consisted of the two traffic metrics evaluated on disjoint sets of ports, i.e., monthly utilization was calculated for one set of ports and hours of over-utilization was calculated for a different, disjoint set.

  5. Exploring The Relationship Between The Two IP Traffic Metrics (cont.) • A definition – consistency: • Let n1 and n2 be two metrics on a set W. We say that n1 and n2 are consistent on W if, for all u and v in W, n1(u) < n1(v) if and only if n2(u) < n2(v). • Assume that m1 and m2 (the monthly utilization and hours of over-utilization, respectively) are consistent on the set of ports y for which m2(y) > 0. Then some consequences are the following: • If Y is a set of ports y with m2(y) > 0 for all y in Y, then f maps the pth percentile of {m2(y) | y in Y} into the pth percentile of {m1(y) | y in Y}, i.e., f maps percentiles into corresponding percentiles. • Furthermore, suppose Y is a set of ports with m2(y) > 0 for all y in Y, and X is a set of ports on which m1 has been evaluated such that {m1(x) | x in X} and {m1(y) | y in Y} can be considered samples from the same distribution (note that the values in {m1(y) | y in Y} are assumed to be unknown, but the values in {m1(x) | x in X} are known). Then the mapping f can be determined from the above result. That is, if m2(y0) is the pth percentile of {m2(y) | y in Y}, then f(m2(y0)) can be estimated by the pth percentile of {m1(x) | x in X}.
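The percentile-to-percentile estimate of f can be sketched as follows. This is a minimal illustration, not the author's implementation: the percentile convention (rank positions spread evenly over [0, 1], with linear interpolation between order statistics) is an assumption, as are the function names and the toy samples. It also takes for granted the slide's assumption that the m1 values on X come from the same distribution as the unobserved m1 values on Y.

```python
from bisect import bisect_right


def percentile_rank(sorted_values, v):
    """Position of v in the sample as a fraction in [0, 1].

    The smallest sample value maps to 0 and the largest to 1.
    """
    n = len(sorted_values)
    k = bisect_right(sorted_values, v)  # number of sample values <= v
    return min(1.0, max(0.0, (k - 1) / (n - 1)))


def quantile(sorted_values, p):
    """Empirical p-quantile (0 <= p <= 1) with linear interpolation."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def estimate_f(m2_on_Y, m1_on_X):
    """Return an estimate of f: hours of over-utilization -> monthly utilization.

    m2_on_Y: m2 values (all > 0) observed on the port set Y.
    m1_on_X: m1 values observed on the disjoint port set X.
    Both samples need at least two values.
    """
    if len(m2_on_Y) < 2 or len(m1_on_X) < 2:
        raise ValueError("need at least two values in each sample")
    m2_sorted = sorted(m2_on_Y)
    m1_sorted = sorted(m1_on_X)

    def f(v):
        # Map v to its percentile in the m2 sample, then read off the
        # same percentile in the m1 sample.
        return quantile(m1_sorted, percentile_rank(m2_sorted, v))

    return f


# Hypothetical samples from two disjoint sets of ports.
f = estimate_f(m2_on_Y=[10, 1, 40, 5, 2], m1_on_X=[0.70, 0.50, 0.91, 0.62, 0.55])
```

Evaluating `f` at each observed m2 value yields estimated points of the mapping, which is the kind of point set a closed-form curve could then be fitted to.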

  6. Illustration – Exploring The Relationship Between The IP Traffic Metrics At The Circuit Level • Plot of estimated points of the mapping f. [Figure: Monthly Utilization plotted against Over-Utilization Hrs]

  7. Illustration – Exploring The Relationship Between The IP Traffic Metrics At The Circuit Level (cont.) • A closed form of the mapping f may be estimated through curve fitting. • A good fit was found using a curve of the form [equation not preserved in the transcript]. [Figure: Monthly Utilization plotted against Over-Utilization Hrs]

  8. Mining IP Traffic Patterns • Objective – devise an algorithm for mining the time series history of the monthly utilization for a large number of ports that: • classifies the time series pattern for each port • forecasts the monthly utilization a number of months out in the future port by port in order to identify ports whose utilization would soon exceed the over-utilization threshold • A desirable quality is that the algorithm be simple so that it runs quickly and so that there are few requirements on the computing environment (e.g., it does not require any sophisticated computing platform).

  9. Normalizing Utilization • The IP environment is dynamic; bandwidth may change. Consequently, since monthly utilization expresses the percent of the bandwidth used, adjustments to the monthly utilization are needed to get the true pattern of the traffic. • This can be done by normalizing monthly utilization, expressing it in terms of a single bandwidth for the entire time period considered.
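One way to read the normalization is to convert each month's utilization back to traffic (utilization times that month's bandwidth) and re-express it against a single reference bandwidth. The sketch below assumes exactly that; the choice of the most recent bandwidth as the default reference, the function name and the sample values are assumptions, not details given in the slides.

```python
def normalize_utilization(utilization, bandwidth, reference_bandwidth=None):
    """Re-express monthly utilization against one reference bandwidth.

    utilization: monthly utilization as fractions of that month's bandwidth.
    bandwidth:   the port's bandwidth in each month (same units throughout).
    reference_bandwidth: bandwidth to normalize to; defaults to the most
        recent month's bandwidth (an assumption made for this sketch).
    """
    if len(utilization) != len(bandwidth):
        raise ValueError("utilization and bandwidth must have the same length")
    if reference_bandwidth is None:
        reference_bandwidth = bandwidth[-1]
    # utilization * bandwidth recovers the traffic level; dividing by the
    # reference bandwidth puts every month on the same scale.
    return [u * bw / reference_bandwidth for u, bw in zip(utilization, bandwidth)]


# Hypothetical port whose bandwidth doubles from 10 to 20 Mbps in month 3.
raw = [0.60, 0.70, 0.40, 0.45]
bw = [10, 10, 20, 20]
adjusted = normalize_utilization(raw, bw)
```

In this toy series the raw utilization drops sharply when the bandwidth is upgraded, while the adjusted series rises steadily (about 0.30, 0.35, 0.40, 0.45).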

  10. Example – Normalizing Traffic Patterns • The plot on the left is the original time series of monthly utilization; the plot on the right is the normalized monthly utilization. • Note that the patterns are different. [Figure: left panel, Utilization vs. Month; right panel, Adj. Utilization vs. Month]

  11. Traffic Pattern Classification (cont.) • To describe a port’s traffic, the curve (from a small set of families of curves) that is closest to the traffic time series is found. This curve, together with the root-mean-square error, describes the traffic pattern (the curve gives the general trend of the traffic; the root-mean-square error captures the fluctuation about this trend). The traffic pattern can consequently be classified according to the family to which its curve belongs. • For simplicity, the families of curves considered were 2-parameter families of the form y = a*f(x) + b, where f is a function of x. It was found that the three functions f(x) = x, f(x) = x^2 and f(x) = log_e(x) were sufficient to capture many of the patterns that occur. The resulting three families of curves are: y = a*x + b (constant growth rate); y = a*x^2 + b (increasing growth rate); y = a*log_e(x) + b (slowing growth rate). • Since the curves (models) are linear in the parameters, the best-fitting curve in a family can be found by the usual least squares technique. Also, since the curves all have two parameters, the best-fitting curve overall can be found by choosing the one that minimizes the residual sum of squares (equivalently, the one with the highest R^2).
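The classification step can be sketched with ordinary least squares on the transformed variable f(x). This is a minimal illustration rather than the author's code: numbering the months 1, 2, 3, ... (so that log_e(x) is defined), the family labels, and selecting by root-mean-square error (which orders the fits the same way as the residual sum of squares) are assumptions.

```python
import math

# The three transforms f(x) named in the slide; the labels are hypothetical.
FAMILIES = {
    "linear": lambda x: x,         # y = a*x + b,        constant growth rate
    "quadratic": lambda x: x * x,  # y = a*x^2 + b,      increasing growth rate
    "log": math.log,               # y = a*log_e(x) + b, slowing growth rate
}


def fit_family(g, xs, ys):
    """Least-squares fit of y = a*g(x) + b. Returns (a, b, rmse)."""
    ts = [g(x) for x in xs]
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    stt = sum((t - t_mean) ** 2 for t in ts)
    sty = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    a = sty / stt
    b = y_mean - a * t_mean
    sse = sum((y - (a * t + b)) ** 2 for t, y in zip(ts, ys))
    return a, b, math.sqrt(sse / n)


def classify(ys):
    """Return (family_name, (a, b, rmse)) for the closest curve."""
    xs = range(1, len(ys) + 1)  # months numbered from 1 (assumption)
    fits = {name: fit_family(g, xs, ys) for name, g in FAMILIES.items()}
    best = min(fits, key=lambda name: fits[name][2])
    return best, fits[best]


# Hypothetical noise-free series with slowing growth.
series = [2.0 * math.log(x) + 1.0 for x in range(1, 13)]
family, (a, b, rmse) = classify(series)
```

The returned family name gives the pattern class, and the RMSE summarizes how much the series fluctuates around the fitted trend.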

  12. IP Traffic Projection • To evaluate how well the three families of curves succeeded in describing/differentiating traffic patterns and how well they predicted future traffic, • the set of points (say n points) in the available time series was divided into two sets, the first n – k and the last k points, where k < n – k. • the curve (from all three families of curves) that best fitted the first n – k points, i.e., minimized the residual sum of squares, was selected as the one describing the traffic pattern. • the mean absolute error between this curve and the traffic time series, calculated for the last k points, was then compared with the corresponding mean absolute errors for the best-fitting curves (based on the first n – k points) from the other two families of curves.
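The evaluation procedure on this slide can be sketched as a simple holdout check. The sketch rests on assumptions: months are numbered from 1, the family labels and function names are hypothetical, and the example series is synthetic. Its 22 points split into 17 and 5 only mirror the split sizes mentioned in the two examples.

```python
import math

FAMILIES = {
    "linear": lambda x: x,
    "quadratic": lambda x: x * x,
    "log": math.log,
}


def fit(g, xs, ys):
    """Least-squares fit of y = a*g(x) + b. Returns (a, b, sse)."""
    ts = [g(x) for x in xs]
    n = len(ts)
    t_mean, y_mean = sum(ts) / n, sum(ys) / n
    stt = sum((t - t_mean) ** 2 for t in ts)
    a = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / stt
    b = y_mean - a * t_mean
    sse = sum((y - (a * t + b)) ** 2 for t, y in zip(ts, ys))
    return a, b, sse


def holdout_check(ys, k):
    """Fit on the first n - k points and score each family on the last k.

    Returns (selected, best_on_holdout, results), where results maps each
    family to (training residual sum of squares, holdout mean absolute error).
    """
    n = len(ys)
    if not 0 < k < n - k:
        raise ValueError("need 0 < k < n - k")
    xs = list(range(1, n + 1))
    train_x, train_y = xs[: n - k], ys[: n - k]
    test_x, test_y = xs[n - k :], ys[n - k :]
    results = {}
    for name, g in FAMILIES.items():
        a, b, sse = fit(g, train_x, train_y)
        mae = sum(abs(y - (a * g(x) + b)) for x, y in zip(test_x, test_y)) / k
        results[name] = (sse, mae)
    selected = min(results, key=lambda name: results[name][0])
    best_on_holdout = min(results, key=lambda name: results[name][1])
    return selected, best_on_holdout, results


# Synthetic series: 22 months following a slowing-growth trend.
series = [3.0 * math.log(x) + 10.0 for x in range(1, 23)]
selected, best_on_holdout, results = holdout_check(series, k=5)
```

When `selected` equals `best_on_holdout`, the family chosen from the training months is also the one that projects the held-out months most accurately.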

  13. IP Traffic Projection – Example 1 • The best-fitting curve to the first 17 points is of the form y = a*log_e(x) + b. • The mean absolute error between this curve and the last 5 points of the traffic time series is smaller than the mean absolute errors of the best-fitting curves from the other families.

  14. IP Traffic Projection – Example 2 • The best-fitting curve to the first 17 points is of the form y = a*x^2 + b. • The mean absolute error between this curve and the last 5 points of the traffic time series is smaller than the mean absolute errors of the best-fitting curves from the other families.

  15. Conclusion • Testing the algorithm on a small set of ports has yielded results that suggest that the three families of 2-parameter curves may be sufficient to capture the key elements of the traffic patterns. • Full evaluation awaits the full-scale implementation of the algorithm.
