
Multisite Internet Data Analysis

Multisite Internet Data Analysis. Alfred O. Hero, Clyde Shih, David Barsic University of Michigan - Ann Arbor hero@eecs.umich.edu http://www.eecs.umich.edu/~hero. Network Data Collection Distributed Data Analysis Dimension Reduction Model-Based Data Analysis Conclusions.


Presentation Transcript


  1. Multisite Internet Data Analysis
     Alfred O. Hero, Clyde Shih, David Barsic
     University of Michigan - Ann Arbor
     hero@eecs.umich.edu
     http://www.eecs.umich.edu/~hero
     • Network Data Collection
     • Distributed Data Analysis
     • Dimension Reduction
     • Model-Based Data Analysis
     • Conclusions
     Research supported in part by: NSF CCR-0325571

  2. 1. Network Data Collection
     • Objectives
       • Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate the global network state while satisfying information privacy constraints.
       • Local: collection sites gather data relevant to the local network state and share information as necessary to enhance local analysis.
     • Types of data measured
       • Active: queries and requests, packet probes
       • Passive: netflow, router fields, honeypots, backscatter

  3. [Figure: network diagram. Labels: ISP 1, ISP 2, ISP 3, Monitoring Center, Data collection site, Local data collection and probing site, Data collector]

  4. Abilene Netflow Data
     [Figure: statistics by protocol for Dataset 1 and Dataset 2. Panels: No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes]

  5. Abilene Netflow Data
     [Figure: statistics by router for Dataset 1 and Dataset 2. Panels: No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes]

  6. Abilene Netflow Data

  7. Challenges and Approaches
     • Challenges
       • High-dimensional measurement space
       • Non-linear dependencies and non-stationarity
       • Privacy and proprietary concerns
       • Insufficient bandwidth for continuously sampled data
     • Approaches
       • Dimension reduction
       • Model-based distributed inference
       • Controlled information sharing
       • Hierarchical and modular collection/analysis

  8. Hierarchical Architecture

  9. 2. Distributed Data Analysis
     • Hypothesis: data collected at sites A, B, and C follow a statistical distribution defined over a lower-dimensional manifold.
     • Overall objective: find distributed strategies to perform reliable statistical inference with a minimum amount of data sharing.
     [Figure: Site A, Site B, Site C]

  10. 2.1 Distributed Dimension Reduction
     [Figure labels: Unknown Embedding, Unknown Manifold, Unknown Distribution, Sampling, Observed Sample]

  11. Geodesic Entropic Graphs: A Planar Sample and its Euclidean MST

  12. GMST Dimension Estimation
     GMST estimates: d = 13, H = 120 (bits)
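
The idea behind MST-based dimension estimation can be illustrated with a minimal sketch. It assumes edge exponent γ = 1 and uses the Euclidean MST rather than the geodesic MST (GMST) of the slides, so it is only appropriate for flat or nearly flat manifolds. The function names and the planar test sample are hypothetical. The sketch relies on the MST length growing like n^((d-1)/d), so the slope a of log-length against log-n gives d ≈ 1/(1 - a).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_length(points):
    """Total edge length of the Euclidean MST over the rows of `points`."""
    dist = squareform(pdist(points))          # dense pairwise distance matrix
    return minimum_spanning_tree(dist).sum()  # sum of the n-1 MST edge weights


def estimate_dimension(points, sizes, trials=5, seed=0):
    """Estimate intrinsic dimension from the growth rate of MST length.

    Assumes edge exponent gamma = 1: length ~ n**((d-1)/d), so the
    log-log slope a satisfies d = 1 / (1 - a).
    """
    rng = np.random.default_rng(seed)
    mean_lengths = []
    for n in sizes:
        # Average the MST length over several random subsamples of size n.
        lengths = []
        for _ in range(trials):
            idx = rng.choice(len(points), size=n, replace=False)
            lengths.append(mst_length(points[idx]))
        mean_lengths.append(np.mean(lengths))
    slope, _ = np.polyfit(np.log(sizes), np.log(mean_lengths), 1)
    return 1.0 / (1.0 - slope)


def planar_sample(n=1000, seed=1):
    """Hypothetical test data: a 2-D uniform sample embedded linearly in 3-D."""
    rng = np.random.default_rng(seed)
    uv = rng.uniform(size=(n, 2))
    embed = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])
    return uv @ embed


if __name__ == "__main__":
    d_hat = estimate_dimension(planar_sample(), sizes=[100, 200, 400, 800])
    print(f"estimated intrinsic dimension: {d_hat:.2f}")
```

For the planar sample the estimate should land near 2 even though the points live in three dimensions; on curved manifolds the Euclidean distances would need to be replaced by geodesic approximations.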

  13. Distributed GMST Estimator • Principal MST convergence result: • Distributed BHH (Aggregation rule): • Tight upper and lower bounds on limit: if exchange rooted dual graphs [Yukich:97] among sites BHH Theorem:
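
For reference, the formulas on this slide did not survive in the transcript. The standard form of the BHH (Beardwood-Halton-Hammersley) theorem for power-weighted MSTs is sketched below; the notation is an assumption and may differ from the slide. Here $\mathcal{X}_n$ is a set of $n$ i.i.d. samples with density $f$ on $\mathbb{R}^d$, $L_\gamma$ is the sum of MST edge lengths raised to the power $\gamma$ with $0 < \gamma < d$, and $\beta_{d,\gamma}$ is a constant that depends only on $d$ and $\gamma$.

```latex
\lim_{n \to \infty} \frac{L_\gamma(\mathcal{X}_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f(x)^{(d-\gamma)/d} \, dx
  \quad \text{(almost surely)}
```

The exponent $(d-\gamma)/d$ is what makes the growth rate of the MST length informative about the dimension $d$.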

  14. 2.2 Distributed Model-based Inference • Global likelihood model • Global M-estimator recursion: • Global Fisher score function • Local Fisher score functions
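
The formulas on this slide are missing from the transcript. As a hedged illustration of how global and local score functions can relate, assume the data at sites $i = 1, \dots, N$ are independent with local log-likelihoods $\ell_i(\theta)$. The symbols below and the Fisher-scoring step are assumptions for illustration, not necessarily the recursion shown on the slide.

```latex
\ell(\theta) = \sum_{i=1}^{N} \ell_i(\theta), \qquad
s(\theta) = \nabla_\theta \ell(\theta) = \sum_{i=1}^{N} s_i(\theta), \qquad
\theta_{k+1} = \theta_k + \Big( \sum_{i=1}^{N} F_i(\theta_k) \Big)^{-1} s(\theta_k)
```

Here $s_i = \nabla_\theta \ell_i$ is the local Fisher score and $F_i$ the local Fisher information. Under independence the global score is the sum of local scores, so each site only needs to share summary statistics rather than raw data.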

  15. Distributed M-estimator
     [Figure: sites A and B each run a loop: Compute, then k = k+1]
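
A minimal sketch of a distributed estimator of this kind follows. It assumes a one-parameter Poisson model with log-rate θ and independent data across sites, and it uses Fisher scoring as the update; none of these choices come from the slides, and all function names are hypothetical. Each site computes only its local score and local Fisher information at the current θ, and the update uses their sums.

```python
import numpy as np


def local_statistics(counts, theta):
    """Score and Fisher information of one site's Poisson(exp(theta)) sample."""
    rate = np.exp(theta)
    score = np.sum(counts - rate)      # derivative of the local log-likelihood
    information = len(counts) * rate   # local Fisher information
    return score, information


def distributed_fisher_scoring(site_data, theta0=0.0, iterations=25):
    """Iterate theta using only per-site summary statistics."""
    theta = theta0
    for _ in range(iterations):
        # Each site reports two scalars; raw data never leaves the site.
        stats = [local_statistics(x, theta) for x in site_data]
        total_score = sum(s for s, _ in stats)
        total_information = sum(f for _, f in stats)
        theta = theta + total_score / total_information
    return theta


def simulate_sites(num_sites=3, samples_per_site=200, true_theta=1.2, seed=0):
    """Hypothetical data: every site observes the same Poisson rate."""
    rng = np.random.default_rng(seed)
    rate = np.exp(true_theta)
    return [rng.poisson(rate, size=samples_per_site) for _ in range(num_sites)]


if __name__ == "__main__":
    sites = simulate_sites()
    print("distributed estimate:", distributed_fisher_scoring(sites))
    print("pooled MLE:          ", np.log(np.concatenate(sites).mean()))
```

In this model the likelihood is unimodal, so the distributed iteration reaches the same value as the MLE computed from pooled data. With a multimodal likelihood the iteration can stop at a local maximum instead.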

  16. Properties
     • Communication requirement: 2p bytes/update/site.
     • If the data are independent, stationary points of the global likelihood are attained.
     • All local MLEs are available to each site.
     • For a multimodal likelihood, improvement on the local MLEs can be achieved by aggregation under a mixture model.

  17. Global Likelihood Function
     [Figure: global likelihood function showing the global maximum and local maxima; local MLEs marked with x]

  18. Key Theoretical Result
     • The asymptotic distribution of local estimates is a Gaussian mixture dependent on the global likelihood
     • Parameters
     Proof: asymptotic normal theory of local maxima (Huber:67); see Blatt&Hero:2003

  19. Local Estimator Aggregation Algorithm
     [Flow diagram: Estimator 1, Estimator 2, ..., Estimator N → Estimation of Gaussian mixture parameters (FS, EM…) → Sample covariance analysis → Aggregation to final estimate]
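
The aggregation pipeline can be sketched roughly as follows, under several labeled assumptions. Plain two-cluster k-means stands in for the Gaussian mixture fit (FS, EM) named on the slide. The cluster is selected by the mean log-likelihood value each site reports, which is an assumed discriminant rather than the sample covariance analysis of the slide. The parameter is scalar, and all function names are hypothetical.

```python
import numpy as np


def two_means(values, iterations=50):
    """Plain 1-D k-means with k = 2; returns labels and cluster centers."""
    centers = np.array([values.min(), values.max()], dtype=float)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iterations):
        # Assign every estimate to its nearest center, then recompute centers.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels, centers


def aggregate_local_estimates(local_estimates, local_loglikes):
    """Average the cluster of local estimates with the best mean log-likelihood."""
    estimates = np.asarray(local_estimates, dtype=float)
    loglikes = np.asarray(local_loglikes, dtype=float)
    labels, _ = two_means(estimates)
    cluster_scores = []
    for k in range(2):
        members = labels == k
        # An empty cluster can never be selected.
        cluster_scores.append(loglikes[members].mean() if members.any() else -np.inf)
    best = int(np.argmax(cluster_scores))
    return estimates[labels == best].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical reports: 150 sites near the global maximum, 50 near a local one.
    est = np.concatenate([rng.normal(2.0, 0.1, 150), rng.normal(0.5, 0.1, 50)])
    ll = np.concatenate([rng.normal(-100, 1, 150), rng.normal(-120, 1, 50)])
    print("aggregated estimate:", aggregate_local_estimates(est, ll))
```

Averaging within the selected cluster reduces the variance of the final estimate relative to any single site's estimate, while discarding estimates that converged to the other mode.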

  20. Simple Example
     IID observation model:
     • Each site observes a 2-component Gaussian mixture
     • Identical component variances
     • Unknown mixing parameters
     • Unknown component means
     • 200 data collection sites
     • 100 samples/site
     • CEM2 algorithm implemented for estimation and aggregation
     [Figure: ambiguity function with the global maximum and local maximum marked]

  21. Clustering and Discrimination
     [Figure: scatter plot with axes m1 and m2, both ranging from 0 to 3. Labels: Local maximum, Global maximum, Inverse FIM, Empirically estimated covariances via CEM2]

  22. Validation of Key Result
     [Figure: QQ plots for Cluster 1 and Cluster 2]

  23. Conclusions
     • Lossless distributed dimension reduction and model-based inference require:
       • Reliable local inference methods
       • Aggregation rules for combining local statistics
     • Information sharing constraints?
     • Effects of bandwidth constraints - data compression?
     • Tracking in dynamical models?

  24. References
     • A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Application of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.
     • J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.
     • D. Blatt and A. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.
     • M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.
     • N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.

  25. Information Sharing Game

  26. Addition of Other Discriminants
     Value added due to transmission of likelihood values
