Computational Statistics – Graphical and Analytic Methods for Streaming Data


Edward J. Wegman

Center for Computational Statistics

George Mason University

Streaming Data: An Outline
• Streaming Data
• Introduction
• Internet Traffic – Prototypical Example
• Background on Internet Messaging
• Block Recursion and Evolutionary Graphics
• Geometric Quantization
• Analytic Methods
• Density Estimation for Streaming Data
Streaming Data
• Some data now come essentially without end
• Stock and commodities markets
• Supermarket scanner data
• Consumer price indices
• Telephone transaction data
• Fraud detection
• Internet traffic data
• Intrusion detection
• This will be my prototypical example
Scope of the Problem
• Most of us have seen IP addresses
• An IPv4 address is a 32-bit number, usually written as 4 dotted fields
• field1.field2.field3.field4
• These IP addresses uniquely identify a machine.
• In theory, there are 2^32 = 4,294,967,296 addressable machines
Types of Networks
• Class A – field1 identifies the network; fields 2–4 identify the specific host
• field1 is smaller than 127, e.g. 1.1.1.1
• Class B – field1.field2 identifies the network; field3.field4 identifies the specific host (field3 is sometimes used for the subnet)
• field1 is larger than 127, e.g. 130.103.40.210
• Class C – field1.field2.field3 identifies the network, field4 the host
• E.g. 192.9.200.15
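The classful decoding above can be sketched in a few lines of Python. The A/B split at 127 follows the slides; the B/C boundary at 192 is an assumption taken from the standard classful convention, which the slides give only by example.

```python
def classify_ipv4(addr):
    """Split a dotted-quad IPv4 address into network and host fields
    using the classful A/B/C rules described above."""
    fields = [int(f) for f in addr.split(".")]
    assert len(fields) == 4 and all(0 <= f <= 255 for f in fields)
    if fields[0] < 127:        # class A: field1 is the network
        return "A", fields[:1], fields[1:]
    elif fields[0] < 192:      # class B: field1.field2 is the network (assumed boundary)
        return "B", fields[:2], fields[2:]
    else:                      # class C: field1.field2.field3 is the network
        return "C", fields[:3], fields[3:]
```

For example, `classify_ipv4("130.103.40.210")` splits the address into the network `[130, 103]` and the host `[40, 210]`.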

Application Data passes down the stack: Application Layer → Protocol Layer → IP Layer → Hardware Layer

Common Protocols
• TCP=Transmission Control Protocol
• http, smtp
• UDP=User Datagram Protocol
• ICMP=Internet Control Message Protocol
• ping, other diagnostics

• Some Flag Types
• ACK – used to acknowledge receipt of a packet
• PSH – data should be pushed to application ASAP
• RST – reset
• SYN – synchronize connection so each host knows order of packets
• FIN – finish the connection

Possible TCP Session (HOST 1 ↔ HOST 2)

SYN → SYN/ACK → ACK to open the connection; PSH and ACK exchanges to transfer data; then FIN → FIN/ACK from each side to close the connection.
Ports
• There are some 2^16 = 65,536 ports for each host
• Some standard services use standard ports
• e.g. ftp – 21, ssh – 22, telnet – 23, smtp – 25, http – 80, pop3 – 110, nfs – 2049; even DirecTV and AOL have standard ports.
• Unprotected (open) ports allow possible intrusion
• Scanning for ports is a hacker attack strategy
Evolutionary Graphics from Streaming Internet Data

The major problem is to detect intrusions or other unwanted events in streaming Internet traffic.

Evolutionary Graphics from Streaming Internet Data
• Internet traffic is a prototypical example of streaming data and prefigures future streaming data types. I believe streaming data represents a fundamentally new data structure.
• We look only at time stamps, destination IP, destination port, source IP, source port, number of bytes, number of packets, duration of session.
• We ignore the data content of the packet, and seek to make inferences based only on the header data described above.
Evolutionary Graphics from Streaming Internet Data
• Streaming data arrives at such a rate that it is impossible to store the data.
• At George Mason University, we collect 26 terabytes of Internet header data per year.
Evolutionary Graphics from Streaming Internet Data
• I first want to describe evolutionary graphics
• In the simplest framework, the idea is to accumulate data for a very small epoch (even instantaneously), plot the new data, and discard the old.
• In practice for Internet traffic, the epoch may last for perhaps 10 milliseconds.
• Our initial suggestion is a Waterfall diagram. The first epoch is plotted at the top of the graphic. As additional data are accumulated during the second epoch, the graphic for the first epoch is pushed down and the second epoch is now plotted on the top. This continues until perhaps 1000 epochs have passed.
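The waterfall buffer described above can be sketched as a bounded queue of epochs; the class name and rendering details are illustrative, not the authors' implementation.

```python
from collections import deque

class Waterfall:
    """Hold the most recent `max_epochs` epochs. New epochs enter at the
    top and the oldest fall off the bottom, as in the waterfall diagram."""
    def __init__(self, max_epochs=1000):
        self.rows = deque(maxlen=max_epochs)   # rows[0] is the newest epoch

    def push_epoch(self, observations):
        self.rows.appendleft(list(observations))

    def render(self):
        # A real display would draw each row as a raster line;
        # here we just return the rows, newest first.
        return list(self.rows)
```

With `max_epochs=3`, pushing four epochs leaves only the three newest, oldest at the bottom.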
Evolutionary Graphics from Streaming Internet Data

Evolutionary Graphics with Explicit Dependence on Time

Evolutionary Graphics from Streaming Internet Data

Waterfall for Destination IP versus Time for only one hour

Evolutionary Graphics from Streaming Internet Data

Waterfall for Source Port versus time. Diagonals are characteristics of distinct operating systems.

Evolutionary Graphics from Streaming Internet Data
• The preponderance of relatively short sessions can be seen in the next figure, which displays the session durations as horizontal lines that extend from the start time to the end time. Because these data are collected in the order in which they occurred, the session start times range from time 0 (bottom line) to 59.971 (nearly the end of the hour).
• The second figure shows the same information, but each line is shifted back to 0. The data are censored after one hour. With continuously monitored data, the session duration lines would continue past the censoring point. Most data are not censored because 92.3% of the sessions lasted less than 30 seconds.
Evolutionary Graphics from Streaming Internet Data
• I want to introduce two critical ideas for streaming data.
• Block Recursion
• Instead of adjusting the statistic or the graphic by the new observation, keep a small number of observations in a moving window.
• Adjust the statistic or graphic by dropping off the effect of the oldest observation and inserting the effect of the newest.
• The graphic need not explicitly show dependence on time.
• Visual Analytics
• Be willing to dynamically transform variables so as to exploit structure.
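The block-recursion update can be sketched for a simple statistic such as the mean; the class name and window size are illustrative. The running total is adjusted by dropping the oldest observation's effect and inserting the newest's.

```python
from collections import deque

class BlockRecursiveMean:
    """Block recursion: keep a small moving window and update the
    statistic incrementally rather than recomputing it."""
    def __init__(self, window=1000):
        self.buf = deque()
        self.window = window
        self.total = 0.0

    def update(self, x):
        self.buf.append(x)
        self.total += x
        if len(self.buf) > self.window:
            self.total -= self.buf.popleft()   # drop the oldest effect
        return self.total / len(self.buf)
```

Each update costs O(1), so the statistic keeps pace with the stream.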
Evolutionary Graphics from Streaming Internet Data

Examples of Block Recursion Graphics that do not Explicitly Show a Time Variable

Evolutionary Graphics from Streaming Internet Data
• Most destination port numbers occur only once or twice during the hour; of the 380 distinct DPorts, 293 occurred only once, 47 occurred twice, 8 occurred 3 times, 5 occurred 4 times.
• Setting aside the “well-known” ports 0-1023, we plot the occurrence of destination ports numbered 1024 and above, which should arise more or less at random, and flag as unusual any Dport that is referenced more than 10 times.
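The counting behind this flagging rule can be sketched as follows; the function name and the form of the return value are illustrative.

```python
from collections import Counter

def skyline_flags(dports, threshold=10, well_known=1024):
    """Count destination-port occurrences, ignoring the well-known
    ports 0-1023, and flag any port seen more than `threshold` times."""
    counts = Counter(p for p in dports if p >= well_known)
    flagged = {p for p, c in counts.items() if c > threshold}
    return counts, flagged
```

The flagged set is what would turn red in the skyline plot described next.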
Evolutionary Graphics from Streaming Internet Data
• The next figure shows such a plot; once a destination port number occurs more than 10 times, the color changes to red, indicating potentially high traffic on this destination port.
• The construction of this plot resembles the tracing of a skyline, so we call it a “skyline plot.” A similar figure can be used to monitor SPort activity; however, the plot is more dense because this file contained 6,742 unique SPorts, versus only 380 distinct DPorts.
• Also, while most of the 380 DPorts appeared only once in the file, a SPort typically occurred four times, with 20% of the 6742 accessed source ports occurring between 14 and 88 times.
Evolutionary Graphics from Streaming Internet Data

Animation of this plot could easily be accomplished as new data enter the block and old data depart.

Evolutionary Graphics from Streaming Internet Data
• Source and destination IP addresses can also be monitored using a skyline plot. In the present data file, source IP addresses are numerous (2504 unique SIPs) and frequent (the median number of occurrences is 4, and 10% occur more than 135 times).
• The next figure shows this type of plot for source IP addresses in the first 10000 records, where the colors change as the number of hits exceeds multiples of 50.
• Four unusually frequent source IP addresses are immediately evident in this plot: 4837, 13626, 33428, and 65246, which occur 371, 422, 479, and 926 times, respectively, in the first 10,000 sessions.
• The limit for unusually frequent SIP addresses may depend upon the network and the time of day, so the limits on this “skyline” plot may need to be adjusted accordingly.
Evolutionary Graphics from Streaming Internet Data

Dynamic Transformation of Variables may be Extremely Helpful in Evolutionary Graphics. In General, for Streaming Internet Data, “Size Variables” Tend to Cluster Near Zero, so LOG and/or SQRT Transformations Are Helpful for Visual Analytics

Evolutionary Graphics from Streaming Internet Data

Number of Bytes versus Duration at full scale with no transformation.

Visual Analytics for Streaming Internet Traffic

Same data as the previous image with the log(1+sqrt(.)) transformation applied. Notice the additional structure visible.

Visual Analytics for Streaming Internet Traffic

Conditional plots show additional points of interest.

Evolutionary Graphics from Streaming Internet Data
• The next figure shows a barplot of the number of active sessions during each 30-second subset of this one-hour period (a time frame of 30 seconds is selected to minimize the correlation between counts in adjacent bars).
• The mean number of active sessions in any one 30-second interval during this hour is 923, and the standard deviation is about 140, suggesting an approximate upper 3-sigma limit of 1343 sessions.
• A square root transformation may be appropriate. The mean and standard deviation of the square root of the counts are 30.29 and 2.23, respectively, giving an approximate upper 3-sigma limit of (30.29 + 3 × 2.23)² ≈ 1367, very close to the limit on the raw counts, since a Poisson distribution with a high mean closely resembles a Gaussian.
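The quoted control limits can be reproduced directly from the stated means and standard deviations:

```python
# Upper 3-sigma limit on the raw 30-second session counts.
raw_limit = 923 + 3 * 140                # 1343 sessions

# Limit on the square-root scale, transformed back to the count scale.
sqrt_limit = (30.29 + 3 * 2.23) ** 2     # about 1367, close to the raw limit
```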
Visual Analytics for Streaming Internet Traffic
• Box plots are useful for displaying the relationship between two variables. In the next plot, we plot log.len versus log.byte.
• The first box contains the 2611 values for which Nbyte is zero; the next box contains 1216 values where 1 < Nbyte ≤ 365.
• The distributions are clearly skewed, with many large-valued outliers. The distribution of duration as a function of Nbyte is fairly smooth and has a reasonable upward trend.
Visual Analytics for Streaming Internet Traffic
• The dense set of 55 points in Panel (d) (i.e. 3.4 < log.pkt < 3.8) is plotted in the next figure; the points lie on or very near the line log.byte = 3 + log.pkt, and 43 of them correspond to DPort 43.
• The extent to which such a pattern could occur by chance alone should be investigated, particularly if they occur all within a few seconds of each other (these did not).
Visual Analytics for Streaming Internet Traffic
• This hour of internet activity involved 380 different destination ports, DPort.
• DPort 80 (web) is the most common, comprising 116,134 of the 135,605 records.
• The next most common destination port is DPort 443 (secure web, https), used 11,627 times.
• It is followed by DPort 25 (SMTP mail), accessed 6,186 times.
• Ports 554, 113, 10000, 8888 occur 200, 128, 97, 94 times, respectively.
• Displaying all 135,605 points on one plot is not very informative, so instead we provide conditional plots according to their destination ports.
Visual Analytics for Streaming Internet Traffic

The next figure plots log.byte versus log.len for only the web sessions. A few horizontal lines of the sort noticed in previous plots appear, but otherwise no real structure is apparent.

Visual Analytics for Streaming Internet Traffic

These same plots can be constructed when the data are conditioned by source IP address, SIP, as opposed to destination port number, DPort. The number of source IP addresses that may be active during a given hour of activity is likely to be very much higher than the number of destination ports; in this data set, only 380 unique destination ports were accessed, while 3548 source IPs are in the file.

Visual Analytics for Streaming Internet Traffic
• The three session “size” variables, log.len, log.pkt, and log.byte, are somewhat correlated and are amenable to a “control chart” procedure, where the statistic being plotted is a weighted linear combination of the previously plotted value (weight λ) and the current value of Hotelling's T² statistic (weight 1 − λ).
• Calculating Hotelling's T² statistic on three successive observations, denoted Ht, a multivariate exponentially weighted moving average (MEWMA) chart using λ = 0.5 is shown in the next figure (last 10,202 observations only).
• Most values (99.7%) are below 60; a successive run of observations above 60 might suggest abnormal session sizes.
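A minimal sketch of such a chart, assuming the mean and covariance of the three size variables are estimated from the block itself; the function name and these estimation details are illustrative, not the authors' code.

```python
import numpy as np

def mewma_t2(X, lam=0.5):
    """EWMA of per-observation Hotelling T^2 values:
    H_t = lam * H_{t-1} + (1 - lam) * T2_t."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - mu
    t2 = np.einsum("ij,jk,ik->i", d, S_inv, d)   # Hotelling T^2 per row
    h = np.empty_like(t2)
    h[0] = t2[0]
    for t in range(1, len(t2)):
        h[t] = lam * h[t - 1] + (1 - lam) * t2[t]
    return h
```

A run of smoothed values above the control limit would then suggest abnormal session sizes.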
Visual Analytics for Streaming Internet Traffic

This too is an evolutionary graphic, but with a distinctly visual-analytic interpretation.

Geometric Quantization
• An alternative approach to dealing with streaming data is to quantize
• Divide the data domain into small bins and assign new observations to the appropriate bin
• Allows conventional analysis
Geometric Quantization
• Squishing, Squashing, Thinning, Binning
• Squishing = # cases reduced
• Sampling = Thinning
• Quantization = Binning
• Squashing = # dimensions (variables) reduced
• Depending on goal, one of sampling or quantization may be preferable
Geometric Quantization

Thinning vs Binning

• People’s first thought about massive or streaming data is usually statistical subsampling
• Quantization is engineering’s success story
• Binning is the statistician’s quantization
Geometric Quantization
• Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels.
• Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels
• Ask a statistician how many bins to use, likely response is a few hundred, ask a CS data miner, likely response is 3
• For a terabyte data set, 10^6 bins
Geometric Quantization
• Binning, but at microresolution
• Conventions
• d = dimension
• k = # of bins
• n = sample size
• Typically k << n
Geometric Quantization
• Choose E[W|Q = yj] = mean of observations in jth bin = yj
• In other words, E[W|Q] = Q
• The quantizer is self-consistent
Geometric Quantization
• E[W] = E[Q]
• If  is a linear unbiased estimator, then so is E[|Q]
• If h is a convex function, then E[h(Q)] ≤E[h(W)].
• In particular, E[Q2] ≤E[W2] and var (Q) ≤ var (W).
• E[Q(Q -W)] = 0
• cov (W- Q) = cov (W) - cov (Q)
• E[W - P]2≥ E[W - Q]2 where P is any other quantizer.
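These identities are easy to verify numerically. A sketch, assuming equal-width bins with bin-mean representors (the self-consistent quantizer described above):

```python
import numpy as np

# W is the raw data; Q replaces each observation by its bin mean.
rng = np.random.default_rng(1)
w = rng.normal(size=10_000)

k = 50
a, b = w.min(), w.max() + 1e-9
j = (k * (w - a) / (b - a)).astype(int)          # bin index per observation
bin_mean = {m: w[j == m].mean() for m in np.unique(j)}
q = np.array([bin_mean[m] for m in j])

same_mean = np.isclose(w.mean(), q.mean())       # E[W] = E[Q]
smaller_var = q.var() <= w.var()                 # var(Q) <= var(W)
orthogonal = abs((q * (q - w)).mean()) < 1e-8    # E[Q(Q - W)] = 0
```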
Geometric Quantization
• Distortion is the error due to quantization.
• In simple terms, E[(W − Q)²].
• Distortion is minimized when the quantization regions, Sj, are most like a (hyper-)sphere.
Geometric Quantization
• Need space-filling tessellations
• Need congruent tiles
• Need as spherical as possible
Geometric Quantization
• In one dimension
• The only such polytope is the straight line segment (also bounded by a one-dimensional sphere).
• In two dimensions
• The only such polytopes are equilateral triangles, squares, and hexagons
Geometric Quantization
• In 3 dimensions
• Tetrahedrons (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron.
• In 4 dimensions
• 4 simplex, hypercube, 24 cell

Truncated octahedron tessellation

Geometric Quantization

Dimensionless Second Moment for 3-D Polytopes (* marks a space-filling tile):

Tetrahedron* .1040042…
Cube* .0833333…
Octahedron .0825482…
Hexagonal Prism* .0812227…
Rhombic Dodecahedron* .0787451…
Truncated Octahedron* .0785433…
Dodecahedron .0781285…
Icosahedron .0778185…
Sphere .0769670

Geometric Quantization

[Figures: tetrahedron, cube, octahedron, truncated octahedron, dodecahedron, icosahedron]
Geometric Quantization

[Figures: rhombic dodecahedron (http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html), 24-cell with cuboctahedron envelope, hexagonal prism]
Geometric Quantization
• Using 10^6 bins is computationally and visually feasible.
• Fast binning: for data in the range [a, b] and for k bins,

j = floor[k·(xi − a)/(b − a)]

gives the index of the bin for xi in one dimension.

• Computational complexity is 4n + 1 = O(n).
• Memory requirements drop to 3k: the location of each bin, the # of items in each bin, and the representor of each bin; i.e. storage complexity is 3k.
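A sketch of the one-pass fast-binning computation, keeping a count and a running bin mean (the representor) per bin; the clamp that sends x = b to the last bin is a detail assumed here.

```python
def fast_bin(xs, k, a, b):
    """One-pass fast binning on [a, b]: j = floor(k*(x - a)/(b - a)).
    Storage is O(k): per bin, a count and a running mean (the representor)."""
    counts = [0] * k
    reps = [0.0] * k
    for x in xs:
        j = min(int(k * (x - a) / (b - a)), k - 1)   # clamp x == b to the last bin
        counts[j] += 1
        reps[j] += (x - reps[j]) / counts[j]         # update the running bin mean
    return counts, reps
```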
Geometric Quantization
• In two dimensions
• Each hexagon is indexed by 3 parameters.
• Computational complexity is 3 times 1-D complexity,
• i.e. 12n + 3 = O(n).
• Complexity for squares is 2 times 1-D complexity.
• Ratio is 3/2.
• Storage complexity is still 3k.
Geometric Quantization
• In 3 dimensions
• For truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.
• Computational complexity is 28n+7 = O(n).
• Computational complexity for a cube is 12n+3.
• Ratio is 7/3.
• Storage complexity is still 3k.
Geometric Quantization
• Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions.
• Complexity is always O(n).
• Storage complexity is 3k.
• # tiles grows exponentially with dimension, the so-called curse of dimensionality.
• Higher-dimensional geometry is poorly known.
• Computational complexity grows faster than for the hypercube.
Geometric Quantization
• For purposes of simplicity, always use hypercube or d-dimensional simplices
• Computational complexity is always O(n).
• Methods for data adaptive tiling are available
• Storage complexity is 3k.
• # tiles grows exponentially with dimension.
• Both polytopes depart from spherical shape rapidly as d increases.
• Hypercube approach is known as datacube in computer science literature and is closely related to multivariate histograms in statistical literature.
Geometric Quantization
• Conclusions on Geometric Quantization
• Geometric approach good to 4 or 5 dimensions.
• Adaptive tilings may improve rate at which # tiles grows, but probably destroy spherical structure.
• Good for large n, but weaker for large d.
Geometric Quantization
• Alternate Strategy
• Form bins via clustering
• Known in the electrical engineering literature as vector quantization.
• Distance-based clustering is O(n²), which implies poor performance for large n.
• Not terribly dependent on dimension, d.
• Clusters may be very out of round, not even convex.
• Conclusion
• Cluster approach may work for large d, but fails for large n.
• Not particularly applicable to “massive” data mining.
Geometric Quantization
• Third strategy
• Density-based clustering
• Density estimation with kernel estimators is O(n).
• Uses modes m to form clusters
• Put xi in cluster  if it is closest to mode m.
• This procedure is distance based, but with complexity O(kn) not O(n2).
• Normal mixture densities may be an alternative approach.
• Roundness may be a problem.
• But quantization based on density-based clustering offers promise for both large d and large n.
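Assigning points to their nearest density mode can be sketched as follows; the modes are taken as given (in practice they would come from a kernel density estimate), and the one-dimensional setting is for illustration.

```python
import numpy as np

def mode_clusters(x, modes):
    """Density-based clustering sketch: assign each x_i to its nearest
    mode. Distances to k modes cost O(k*n), not O(n^2)."""
    x = np.asarray(x, dtype=float)
    modes = np.asarray(modes, dtype=float)
    return np.abs(x[:, None] - modes[None, :]).argmin(axis=1)
```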
Geometric Quantization
• Binning does not lose fine structure in tails as sampling might.
• Roundoff analysis applies.
• At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.
• Discretization: a finite number of bins yields discrete variables, which are more compatible with categorical data.
Analytic Methods – Density Estimation for Streaming Data
• Density Estimation allows for understanding the broad structure
• Traditional kernel and other methods are too computationally intensive for streaming data.
• Need recursive methods
Density Estimation Outline
• Introduction
• Asymptotic Results
• Discounting Old Data
• Case Study
Data Streaming Process
• Develop an initial estimate of the density using the first n pieces of data
• Update the orthogonal series coefficients when the next data element arrives
• Discount old data and update the orthogonal series coefficients

Data Streams
• Streaming Process
• Data structure is updated to include the new data element.
• “Old” data is discounted.
• Updated estimator is calculated using the new data element.
• The time to update the model must be kept relatively low, otherwise data will become “backed-up” in the data stream.
• Amount of data can become unbounded in size.
Density Estimation Using Orthogonal Series
• For a first approximation, let X1, X2,…, Xn represent a sequence of iid random variables from an unknown distribution f (x).
• Let f (x) satisfy the following standard conditions:
• f (x) is non-negative, integrates to one, and is a “smooth” function (i.e. f (x) and its first k derivatives are continuous).
Density Estimation Using Orthogonal Series
• Let f2(x) represent the estimate of f (x) using the first two elements (X1 and X2).
• Let f3(x) represent the estimate of f (x) using the first three elements (X1, X2 and X3).
• The goal is to create a sequence of estimates {fi(x)} that converges to f (x).
Density Estimation Using Orthogonal Series
• The estimate of f (x) can be obtained by Fourier expansion using any orthogonal set of basis functions (e.g. sines, cosines, wavelets, etc.).
• Let {φj(x)} be the set of orthogonal basis functions.
• The coefficients are bj = ∫ f (x) φj(x) dx, so that f (x) = Σj bj φj(x).
Density Estimation Using Orthogonal Series
• Čencov [1962] is credited with linking function approximation using orthogonal series with density estimation.
• If the unknown function f (x) is a density, then bj = E[φj(X)], so the orthogonal series coefficients can be approximated by the sample means (1/n) Σi φj(Xi).
• If fn(x) is an estimate of f (x) using n observations, the estimate of f (x) using n + 1 observations can be obtained by recursively updating the orthogonal series coefficients.
• Once the coefficients are updated, the density can be updated using the new orthogonal series coefficients.
• The coefficient estimates are unbiased.
• They converge almost surely.
• They have a limiting normal distribution.
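A minimal Python sketch of the recursive coefficient update bj(n+1) = (n·bj(n) + φj(xn+1))/(n+1), using a cosine basis on [0, 1] for illustration (the case study below uses Daubechies wavelets; the class name and number of terms are assumptions).

```python
import numpy as np

class RecursiveDensity:
    """Orthogonal-series density estimate whose coefficients are
    updated recursively as each new observation arrives."""
    def __init__(self, num_terms=10):
        self.J = num_terms
        self.b = np.zeros(num_terms)
        self.n = 0

    def _phi(self, x):
        # Cosine basis on [0, 1]: phi_0 = 1, phi_j = sqrt(2) cos(j pi x).
        j = np.arange(self.J)
        out = np.sqrt(2.0) * np.cos(j * np.pi * x)
        out[0] = 1.0
        return out

    def update(self, x):
        # b_j <- (n b_j + phi_j(x)) / (n + 1): the running mean of phi_j.
        self.b = (self.n * self.b + self._phi(x)) / (self.n + 1)
        self.n += 1

    def pdf(self, x):
        return float(self.b @ self._phi(x))
```

Each update costs O(J), independent of how much data has streamed past.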

Theorem 1.

Let {Xn} be a sequence of random variables with finite second moments. Define Yn = Xn − E[Xn]; thus {Yn} is a sequence of zero-mean random variables with finite second moments. Let Sn = Y1 + … + Yn, with S0 = 0. If the Yn are homoscedastic with common variance σ², then var(Sn) = nσ².

Asymptotic Results (Exact Rate of Strong Convergence – Orthogonal Series Coefficients)

Theorem 2.

Suppose the Yn satisfy a uniform moment bound and the partial-sum variances diverge to ∞ as n → ∞. Then there exists a Brownian motion {B(t)} that almost surely approximates the partial sums Sn at the variance scale, which yields the exact rate of strong convergence for the orthogonal series coefficient estimates.

Discounting Old Data
• If none of the old data is discounted, eventually the new data will have virtually no effect on the estimated coefficients.
• This can be seen by taking the limit of the new-observation weight 1/(n + 1) as n → ∞.
Discounting Old Data (Exponential Smoothing)
• The previous equation is the exact form of the recursive estimator, with 1/(n + 1) playing the role of 1 − θ and n/(n + 1) playing the role of θ.
• By adjusting the value of θ, one can control the amount of emphasis placed on the old data.
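A one-line sketch of this exponential discounting, writing the smoothing weight as θ (notation assumed; θ near 1 forgets old data slowly):

```python
import numpy as np

def smooth_update(b, x, theta, phi):
    """Exponentially smoothed coefficient update,
    b_j <- theta * b_j + (1 - theta) * phi_j(x),
    so old data are discounted geometrically."""
    return theta * np.asarray(b) + (1.0 - theta) * phi(x)
```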
Discounting Old Data (Exponential Smoothing)

Example – Matlab movie. Discounting old data by exponential smoothing.

Initial Estimate – 250 data points from N(0,1) density

Stream Size – 500 new points from a normal mixture distribution f(x) = 0.3 N(-1,0.4) + 0.7 N(1.5,0.5)

Exponential Weighting Factors – 0.99, 0.98 and 0.97.

Basis Functions – Daubechies 8 wavelets

Case Study (Internet Header Data – GMU)
• 135,605 records collected at GMU over the course of 1 hour.
• Sample data:

time duration SIP DIP DPort SPort Npacket Nbyte

1 39604.64 0.23 4367 54985 443 1631 9 3211

2 39604.64 0.27 18146 9675 3921 25 15 49

3 39604.65 0.04 18208 28256 1255 80 6 373

Case Study (Internet Header Data – GMU)
• Large amounts of data near the lower end of the scale required a data transformation.
• Consider the transformation y = log(1 + √x).
• Adding one ensures the same range.
• The square root spreads things out.
Case Study (Internet Header Data – GMU)

Example – Matlab movie. Nbytes from Internet header traffic data.

Initial Estimate – 1000 data points.

Window Size – 1000.

Stream Size – 1000 new points.

Thresholding – Hard.

Basis Functions – Daubechies 8 wavelets.

Case Study (Internet Header Data – GMU)

Detecting a change in the density:
• Develop an initial estimate of the density using the first 1000 observations
• Store the density
• Update the density when the next data element arrives
• Compare to the density stored 50 observations earlier
• Compute the difference

Case Study (Internet Header Data – GMU)
• After processing the entire data set (135,605 records), approximately 2,700 comparisons were made.
• The average MSE was 0.06945 with a standard deviation of 0.04679
• 47 differences (1.7%) exceeded the 3-sigma limit.
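The 3-sigma flagging of the density-comparison MSE values can be sketched as follows; the function name is illustrative, and the limit is computed from the whole run of comparisons as the quoted mean and standard deviation suggest.

```python
import numpy as np

def flag_changes(mses, nsigma=3):
    """Flag density-comparison MSE values exceeding mean + nsigma * sd,
    signalling a possible change in the underlying density."""
    mses = np.asarray(mses, dtype=float)
    limit = mses.mean() + nsigma * mses.std()
    return np.flatnonzero(mses > limit), limit
```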
Processing Time
• Testing was performed on an Intel Pentium 4, 2.2 GHz chip with 512 MB of RAM running Windows XP.
• Comparisons were made between Matlab and C++ without graphics.
Processing Time

Comparisons

• Building a model using 5000 pieces of data read from a file.
• Simulation of a data stream of size 1000 without discounting old data.
• Simulation of a data stream of size 1000 using a window size of 200 data points.
• Simulation of a data stream of size 1000 using a window size of 200 points and making density comparisons every 50 data points.
• Simulation of a data stream of size 1000 using exponential smoothing to discount old data.
Contact Information

Edward J. Wegman

Center for Computational Statistics, MS 6A2

College of Science, George Mason University

Fairfax, VA 22030-4444

Email: ewegman@gmail.com