
How to do Fast Analytics on Massive Datasets


Presentation Transcript


  1. How to do Fast Analytics on Massive Datasets. Alexander Gray, Georgia Institute of Technology, Computational Science and Engineering, College of Computing. FASTlab: Fundamental Algorithmic and Statistical Tools

  2. The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
  • Arkadas Ozakin: research scientist, PhD Theoretical Physics
  • Dong Ryeol Lee: PhD student, CS + Math
  • Ryan Riegel: PhD student, CS + Math
  • Parikshit Ram: PhD student, CS + Math
  • William March: PhD student, Math + CS
  • James Waters: PhD student, Physics + CS
  • Hua Ouyang: PhD student, CS
  • Sooraj Bhat: PhD student, CS
  • Ravi Sastry: PhD student, CS
  • Long Tran: PhD student, CS
  • Michael Holmes: PhD student, CS + Physics (co-supervised)
  • Nikolaos Vasiloglou: PhD student, EE (co-supervised)
  • Wei Guan: PhD student, CS (co-supervised)
  • Nishant Mehta: PhD student, CS (co-supervised)
  • Wee Chin Wong: PhD student, ChemE (co-supervised)
  • Abhimanyu Aditya: MS student, CS
  • Yatin Kanetkar: MS student, CS
  • Praveen Krishnaiah: MS student, CS
  • Devika Karnik: MS student, CS
  • Prasad Jakka: MS student, CS

  3. Our mission: allow users to apply all the state-of-the-art statistical methods… …with orders-of-magnitude more computational efficiency, via fast algorithms + distributed computing.

  4. The problem: big datasets [figure: an N × D data matrix and M models]. Any of these could be large: N (#data), D (#features), M (#models).

  5. Core methods of statistics / machine learning / mining
  • Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
  • Density estimation: kernel density estimation O(N²), mixture of Gaussians O(N)
  • Regression: linear regression O(D³), kernel regression O(N²), Gaussian process regression O(N³)
  • Classification: nearest-neighbor classifier O(N²), nonparametric Bayes classifier O(N²), support vector machine
  • Dimension reduction: principal component analysis O(D³), non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³)
  • Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
  • Clustering: k-means O(N), hierarchical clustering O(N³), by dimension reduction
  • Time series analysis: Kalman filter O(D³), hidden Markov model, trajectory tracking
  • 2-sample testing: n-point correlation O(Nⁿ)
  • Cross-match: bipartite matching O(N³)

  6. 5 main computational bottlenecks: aggregations, GNPs (generalized N-body problems), graphical models, linear algebra, optimization. Each method in the inventory on the previous slide falls into one of these five bottleneck classes.

  7. How can we compute this efficiently? Multi-resolution data structures, e.g. kd-trees [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]. A construction sketch follows the level-by-level figures below.

  8.–13. A kd-tree: levels 1–6 [figures: the same point set partitioned by successive levels of recursive axis-aligned splits]
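
The level-by-level pictures correspond to a simple recursive construction: split on the widest dimension at the median and recurse on each half. Below is a minimal illustrative sketch in Python; KDNode, leaf_size, and all other names here are mine, not MLPACK's API. Each node also stores its bounding box and point count, which the pruning rules in later sketches rely on.

```python
import numpy as np

class KDNode:
    """One node of a kd-tree: a bounding box plus either two children
    (internal node) or the points themselves (leaf)."""
    def __init__(self, points, leaf_size=16):
        self.lo = points.min(axis=0)     # bounding-box corners,
        self.hi = points.max(axis=0)     # used later for pruning
        self.size = len(points)          # number of points under this node
        self.points = None
        self.left = self.right = None
        if len(points) <= leaf_size:
            self.points = points         # leaf: keep the points themselves
            return
        d = int(np.argmax(self.hi - self.lo))   # split the widest dimension
        order = np.argsort(points[:, d])
        mid = len(points) // 2                  # median split
        self.left = KDNode(points[order[:mid]], leaf_size)
        self.right = KDNode(points[order[mid:]], leaf_size)

# Example: build a tree on 10,000 random 2-D points.
root = KDNode(np.random.rand(10000, 2))
```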

  14. Computational complexity using fast algorithms
  • Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
  • Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N) (see the pruning sketch after this list)
  • Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
  • Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine
  • Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
  • Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
  • Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
  • Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
  • 2-sample testing: n-point correlation O(N^log n)
  • Cross-match: bipartite matching O(N) or O(1)
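
As an example of where the O(N)-or-O(1) kernel density estimation entry above comes from: a single-tree traversal bounds the kernel's value over a whole node's bounding box and prunes whenever that bound is tight enough. A minimal sketch, assuming the KDNode from the previous block, an Epanechnikov kernel, and a hypothetical tolerance eps; none of this is MLPACK's actual interface.

```python
def node_min_max_dist(node, q):
    """Min and max Euclidean distance from point q to node's bounding box."""
    nearest = np.clip(q, node.lo, node.hi)
    farthest = np.where(np.abs(q - node.lo) > np.abs(q - node.hi),
                        node.lo, node.hi)
    return np.linalg.norm(q - nearest), np.linalg.norm(q - farthest)

def kde_sum(node, q, h, eps=0.01):
    """Sum of Epanechnikov kernel values K(|q - x| / h) over all x in node."""
    kernel = lambda d: max(0.0, 1.0 - (d / h) ** 2)
    dmin, dmax = node_min_max_dist(node, q)
    if dmin > h:                          # exclusion: every kernel value is 0
        return 0.0
    klo, khi = kernel(dmax), kernel(dmin)
    if khi - klo < eps:                   # kernel nearly constant over node:
        return node.size * 0.5 * (klo + khi)   # prune with a midpoint bound
    if node.points is not None:           # leaf: exact base case
        return sum(kernel(np.linalg.norm(q - p)) for p in node.points)
    return kde_sum(node.left, q, h, eps) + kde_sum(node.right, q, h, eps)
```

The finite support of the Epanechnikov kernel makes the exclusion prune exact; the near-constant prune is where the tolerance-controlled approximation comes in.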

  15. Example: 3-point correlation runtime (biggest previous computation: 20K points). VIRGO simulation data, N = 75,000,000. Naïve: 5×10⁹ sec. (~150 years); multi-tree: 55 sec. (exact). Scaling in n: n=2: O(N), n=3: O(N^log 3), n=4: O(N²).
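
For contrast, the naïve computation that the 150-year figure refers to enumerates every triple of points. A hypothetical sketch of that O(N³) baseline; the function name and the single-bin matching rule are mine, not the talk's.

```python
from itertools import combinations

def naive_3pt_count(points, r, dr):
    """Count triangles whose three pairwise distances all lie in [r, r + dr)."""
    count = 0
    for a, b, c in combinations(points, 3):     # all N-choose-3 triples
        sides = (np.linalg.norm(a - b),
                 np.linalg.norm(b - c),
                 np.linalg.norm(a - c))
        if all(r <= s < r + dr for s in sides):
            count += 1
    return count
```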

  16. Our upcoming products
  • MLPACK (C++), Dec. 2008: first scalable comprehensive ML library
  • THOR (distributed), Apr. 2009: faster tree-based “MapReduce” for analytics
  • MLPACK-db, Apr. 2009: fast data analytics in relational databases (SQL Server)

  17. The end • It’s possible to scale up all statistical methods… • …with smart algorithms, not just brute force • Look for MLPACK (C++), THOR (distributed), MLPACK-db (RDBMS). Alexander Gray, agray@cc.gatech.edu (email is best; webpage sorely out of date)

  18.–33. Range-count recursive algorithm [animation frames: a depth-first traversal of the kd-tree for a spherical range count, pruning a subtree when its bounding box lies entirely inside the query ball (“Pruned! (inclusion)”) or entirely outside it (“Pruned! (exclusion)”)]. Fastest practical algorithm [Bentley 1975]; our algorithms can use any tree. A sketch of the recursion follows.
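
A minimal sketch of the recursion those frames animate, assuming the KDNode and node_min_max_dist helpers from the earlier sketches; the names are mine, not from the talk or [Bentley 1975].

```python
def range_count(node, q, r):
    """Count points within distance r of q, pruning whole subtrees."""
    dmin, dmax = node_min_max_dist(node, q)
    if dmin > r:
        return 0                  # Pruned! (exclusion): box misses the ball
    if dmax <= r:
        return node.size          # Pruned! (inclusion): box inside the ball
    if node.points is not None:   # leaf: test the points one by one
        return int(sum(np.linalg.norm(q - p) <= r for p in node.points))
    return range_count(node.left, q, r) + range_count(node.right, q, r)

# Example: count neighbors of the origin within radius 0.1.
print(range_count(root, np.zeros(2), 0.1))
```

Both prunes replace a whole subtree's worth of distance computations with a single bounding-box test, which is what takes the count from O(N) per query toward O(log N).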
