Loading in 2 Seconds...
Loading in 2 Seconds...
Computational Statistics – Graphical and Analytic Methods for Streaming Data. Edward J. Wegman Center for Computational Statistics George Mason University. Streaming Data: An Outline. Streaming Data Introduction Internet Traffic – Prototypical Example Background on Internet Messaging
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Edward J. Wegman
Center for Computational Statistics
George Mason University
Application Header| Application Data
Protocol Header|Application Header|Application Data
IP Header|Protocol Header|Application Header|Application Data
Ethernet Header|IP Header|Protocol Header|Application Header|Application Data|Ethernet Trailer
The IP Header
TCP Packet Header
Possible TCP Session
The major problem is to detect intrusions or other unwanted events in streaming Internet traffic.
Evolutionary Graphics with Explicit Dependence on Time
Waterfall for Destination IP versus Time for only one hour
Waterfall for Source Port versus time. Diagonals are characteristics of distinct operating systems.
Examples of Block Recursion Graphics that do not Explicitly Show a Time Variable
Animation of this plot could easily be accomplished as new data enter the block and old data depart.
Dynamic Transformation of Variables may be Extremely Helpful in Evolutionary Graphics. In General, for Streaming Internet Data, “Size Variables” Tend to Cluster Near Zero, so LOG and/or SQRT Transformations Are Helpful for Visual Analytics
Number of Bytes versus Duration at full scale with no transformation.
Same data as previous image with log(1+sqrt(.))transformation applied. Notice the additional structure visible.
Conditional plots show additional points of interest.
An Evolutionary Graphic
The next figure plots log.byte versus log.len for only the web sessions. A few horizontal lines of the sort noticed in previous plots appear, but otherwise no real structure is apparent.
These same plots can be constructed when the data are conditioned by source IP address, SIP, as opposed to destination port number, DPort. The number of source IP addresses that may be active during a given hour of activity is likely to be very much higher than the number of destination ports; in this data set, only 380 unique destination ports were accessed, while 3548 source IPs are in the file.
This also is a evolutionary graphic, but with a distinctly visual analytic interpretation.
Thinning vs Binning
Truncated octahedron tessellation
Hexagonal Prism* .0812227…
Rhombic Dodecahedron* .0787451…
Truncated Octahedron* .0785433…
Dimensionless Second Moment for 3-D Polytopes
24 Cell with Cuboctahedron Envelope
j = fixed[k*(xi-a)/(b-a)]
gives the index of the bin for xi in one dimension.
Develop an Initial Estimate of the Density Using the First n Pieces of Data.
Update the Orthogonal Series Coefficients When the Next Data Element Arrives
Discount Old Data and Update the Orthogonal Series Coefficients
Let be a sequence of finite random variables with finite second moments. Define . . Thus, the sequence is a sequence of zero mean random variables with finite second moments. Let and
If the are homoscedastic, then
Let sup and as . In addition if
then there exists a Brownian motion such that
Example – Matlab movie. Discounting old data by exponential smoothing.
Initial Estimate – 250 data points from N(0,1) density
Stream Size – 500 new points from a normal mixture distribution f(x) = 0.3 N(-1,0.4) + 0.7 N(1.5,0.5)
Exponential Weighting Factors – 0.99, 0.98 and 0.97.
Basis Functions – Daubechies 8 wavelets
time duration SIP DIP DPort SPort Npacket Nbyte
1 39604.64 0.23 4367 54985 443 1631 9 3211
2 39604.64 0.27 18146 9675 3921 25 15 49
3 39604.65 0.04 18208 28256 1255 80 6 373
Example – Matlab movie. Nbytes from Internet header traffic data.
Initial Estimate – 1000 data points.
Window Size – 1000.
Stream Size – 1000 new points.
Thresholding – Hard,
Basis Functions – Daubechies 8 wavelets.
Detecting a change in the density.
Develop an Initial Estimate of the Density Using the First 1000 Observations
Update the Density When the Next Data Element Arrives
Compare to Stored Density 50 Observations Earlier
Edward J. Wegman
Center for Computational Statistics, MS 6A2
College of Science, George Mason University
Fairfax, VA 22030-4444