
Machine Learning on Massive Datasets


Presentation Transcript


  1. Machine Learning on Massive Datasets. Alexander Gray, Georgia Institute of Technology, College of Computing. FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

  2. The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory, www.fast-lab.org • Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS • Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics • Dong Ryeol Lee: PhD student, CS + Math • Ryan Riegel: PhD student, CS + Math • Sooraj Bhat: PhD student, CS • Wei Guan: PhD student, CS • Nishant Mehta: PhD student, CS • Parikshit Ram: PhD student, CS + Math • William March: PhD student, Math + CS • Hua Ouyang: PhD student, CS • Ravi Sastry: PhD student, CS • Long Tran: PhD student, CS • Ryan Curtin: PhD student, EE • Ailar Javadi: PhD student, EE • Abhimanyu Aditya: Affiliate, CS; MS CS • + 5-10 MS students and undergraduates

  3. Exponential growth in dataset sizes [Science, Szalay & J. Gray, 2001]
  • Data: CMB Maps: 1990 COBE 1,000; 2000 Boomerang 10,000; 2002 CBI 50,000; 2003 WMAP 1 Million; 2008 Planck 10 Million
  • Data: Local Redshift Surveys: 1986 CfA 3,500; 1996 LCRS 23,000; 2003 2dF 250,000; 2005 SDSS 800,000
  • Data: Angular Surveys: 1970 Lick 1M; 1990 APM 2M; 2005 SDSS 200M; 2010 LSST 2B

  4. 1993-1999: DPOSS • 1999-2008: SDSS • Coming: Pan-STARRS, LSST

  5. Happening everywhere! • Molecular biology (cancer): microarray chips • Network traffic (spam): fiber optics, 300M/day • Simulations (Millennium): microprocessors • Particle events (LHC): particle colliders, 1B, 1M/sec

  6. How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right? Astrophysicist: R. Nichol, Inst. Cosmol. Gravitation; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovsky, Caltech; G. Kulkarni, Inst. Cosmol. Gravitation; D. Wake, Inst. Cosmol. Gravitation; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astronomy; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics. Machine learning / statistics guy

  7. How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right? Astrophysicist: R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovsky, Caltech; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics. Machine learning / statistics guy: • Kernel density estimator O(N^2) • n-point spatial statistics O(N^n) • Nonparametric Bayes classifier O(N^2) • Support vector machine O(N^2) • Nearest-neighbor statistics O(N^2) • Gaussian process regression O(N^3) • Hierarchical clustering O(c^D T(N))

  8. How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right? Astrophysicist: R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovsky, Caltech; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics. Machine learning / statistics guy: • Kernel density estimator O(N^2) • n-point spatial statistics O(N^n) • Nonparametric Bayes classifier O(N^2) • Support vector machine O(N^3) • Nearest-neighbor statistics O(N^2) • Gaussian process regression O(N^3) • Hierarchical clustering O(N^3)

  9. How did galaxies evolve? • What was the early universe like? • Does dark energy exist? • Is our model (GR+inflation) right? Astrophysicist: R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovsky, Caltech; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics. Machine learning / statistics guy: • Kernel density estimator O(N^2) • n-point spatial statistics O(N^n) • Nonparametric Bayes classifier O(N^2) • Support vector machine O(N^3) • Nearest-neighbor statistics O(N^2) • Gaussian process regression O(N^3) • Hierarchical clustering O(N^3) • "But I have 1 million points!"

  10. The challenge: state-of-the-art statistical methods promise the best accuracy with the fewest assumptions, but we need them with orders-of-magnitude more efficiency, at large N (#data), D (#features), and M (#models). Reduce the data? Use a simpler model? Approximate with poor or no error bounds? → Poor results.

  11. What I like to think about… • The statistical problems and methods needed for answering scientific questions • The computational problems (bottlenecks) and methods involved in scaling all of them up to big datasets • The infrastructure-tailored software to make the application happen

  12. How to do Stats/Machine Learning on Very Large Survey Datasets? • Choose the appropriate statistical task and method for the scientific question • Use the fastest algorithm and data structure for the statistical method • Apply method to the data!

  13. How to do Stats/Machine Learning on Very Large Survey Datasets? • Choose the appropriate statistical task and method for the scientific question • Use the fastest algorithm and data structure for the statistical method • Apply method to the data!

  14. 10 data analysis problems, and scalable tools we’d like for them 1. Querying (e.g. characterizing a region of space): • spherical range-search O(N) • orthogonal range-search O(N) • spatial join O(N^2) • k-nearest-neighbors O(N) • all-k-nearest-neighbors O(N^2) [red: of high interest in astronomy] 2. Density estimation (e.g. comparing galaxy types): • mixture of Gaussians • kernel density estimation O(N^2) • kernel conditional density estimation O(N^3) • submanifold kernel density estimation O(N^2) [Ozakin and Gray, NIPS 2009] • hyperkernel density estimation O(N^4) [Sastry and Gray 2010, to be submitted] • L2 density estimation trees [Ram and Gray, under review]
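To make the O(N^2) entries above concrete, here is a brute-force kernel density estimator in Python; a minimal sketch for illustration (not the FASTlab implementation), in which every query point must touch every reference point:

    import numpy as np

    def naive_kde(queries, references, h):
        # Gaussian KDE: average of (2*pi*h^2)^(-d/2) * exp(-||q - r||^2 / (2 h^2))
        # over all reference points r. The loop over queries times the sum over
        # references is exactly the O(N^2) named above.
        d = queries.shape[1]
        norm = (2.0 * np.pi * h * h) ** (-d / 2.0)
        out = np.empty(len(queries))
        for i, q in enumerate(queries):
            sq_dists = np.sum((references - q) ** 2, axis=1)
            out[i] = norm * np.mean(np.exp(-0.5 * sq_dists / (h * h)))
        return out

    x = np.random.randn(1000, 2)          # 1,000 points in 2-D
    print(naive_kde(x, x, h=0.3)[:5])     # density estimates at the first 5 points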

  15. 10 data analysis problems, and scalable tools we’d like for them 3. Regression (e.g. photometric redshifts): • linear regression • kernel regression O(N^2) • Gaussian process regression/kriging O(N^3) 4. Classification (e.g. quasar detection, star-galaxy separation): • decision tree • k-nearest-neighbor classifier O(N^2) • nonparametric Bayes classifier O(N^2), aka KDA • support vector machine (SVM) O(N^3) • non-negative SVM O(N^3) [Guan and Gray 2010, under review] • isometric separation map O(N^3) [Vasiloglou, Gray, and Anderson, MLSP 2008 (Nominated for Best Paper)]
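As a concrete instance of the classifiers above, a brute-force k-nearest-neighbor classifier; a sketch assuming integer class labels 0..C-1, with the full distance scan that makes it O(N^2):

    import numpy as np

    def knn_classify(x_train, y_train, x_test, k=5):
        # For each test point: O(N) distances to all training points,
        # so O(N^2) overall for a comparably sized test set.
        preds = np.empty(len(x_test), dtype=int)
        for i, q in enumerate(x_test):
            nearest = np.argsort(np.sum((x_train - q) ** 2, axis=1))[:k]
            preds[i] = np.bincount(y_train[nearest]).argmax()  # majority vote
        return preds

    x = np.random.randn(200, 2)
    y = (x[:, 0] > 0).astype(int)
    print(knn_classify(x, y, np.array([[1.0, 0.0], [-1.0, 0.0]])))  # [1 0]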

  16. 10 data analysis problems, and scalable tools we’d like for them 5. Dimension reduction (e.g. galaxy or spectra characterization): • principal component analysis • non-negative matrix factorization • kernel PCA (including LLE, IsoMap, et al.) O(N^3) • maximum variance unfolding O(N^3) • rank-based manifolds O(N^3) [Ouyang and Gray, ICML 2008] • isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson, SDM 2009] • density-preserving maps O(N^3) [Ozakin and Gray 2010, to be submitted] 6. Outlier detection (e.g. new object types, data cleaning): • by density estimation, by dimension reduction • by robust Lp estimation [Ram, Riegel and Gray 2010, to be submitted]
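Of the dimension-reduction methods above, plain PCA has a closed-form linear-algebra solution; a minimal sketch via the SVD (the kernel variants replace this with an N x N eigenproblem, which is where their O(N^3) comes from):

    import numpy as np

    def pca_project(x, n_components):
        # Center the N x D data matrix, take its SVD, and project onto
        # the top principal directions. Cost ~O(N D^2) for N >= D.
        x_centered = x - x.mean(axis=0)
        u, s, vt = np.linalg.svd(x_centered, full_matrices=False)
        return x_centered @ vt[:n_components].T   # N x n_components

    x = np.random.randn(500, 10)
    print(pca_project(x, 2).shape)                # (500, 2)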

  17. 10 data analysis problems, and scalable tools we’d like for them 7. Clustering (e.g. automatic Hubble sequence): • by dimension reduction, by density estimation • k-means • mean-shift segmentation O(N^2) • hierarchical clustering (“friends-of-friends”) O(N^3) 8. Time series analysis (e.g. asteroid tracking, variable objects): • Kalman filter • hidden Markov model • trajectory tracking O(N^n) • functional independent component analysis [Mehta and Gray, SDM 2009] • Markov matrix factorization [Tran, Wong, and Gray 2010, under review] • generative mean map kernel [Mehta and Gray 2010, to be submitted]
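The astronomers’ “friends-of-friends” is single-linkage clustering cut at a fixed linking length; a brute-force union-find sketch (illustrative only) that shows where the quadratic cost comes from:

    import numpy as np

    def friends_of_friends(points, linking_length):
        # Link every pair of points closer than linking_length, then read
        # off connected components. All-pairs distances make this O(N^2);
        # tree-based versions avoid touching most pairs.
        n = len(points)
        parent = list(range(n))

        def find(i):                      # union-find with path compression
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(n):
            d = np.sqrt(np.sum((points[i + 1:] - points[i]) ** 2, axis=1))
            for j in np.nonzero(d < linking_length)[0] + i + 1:
                parent[find(j)] = find(i)

        return np.array([find(i) for i in range(n)])  # cluster label per point

    pts = np.random.rand(300, 2)
    print(len(np.unique(friends_of_friends(pts, 0.05))))  # number of clusters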

  18. 10 data analysis problems, and scalable tools we’d like for them 9. Feature selection and causality (e.g. which features predict star/galaxy): • LASSO regression • L1 SVM • Gaussian graphical model inference and structure search • discrete graphical model inference and structure search • fractional-norm SVM [Guan and Gray 2010, under review] • mixed-integer SVM [Guan and Gray 2010, to be submitted] 10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys): • minimum spanning tree O(N^3) • n-point correlation O(N^n) • bipartite matching/Gaussian graphical model inference O(N^3) [Riegel and Gray, in preparation]
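For the feature-selection entries, the standard off-the-shelf starting point is L1-regularized (LASSO) regression, whose sparse solution reads off the selected features; a minimal example assuming scikit-learn is available (not the fractional-norm or mixed-integer variants cited above):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 20))                  # 20 candidate features
    y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(size=500)

    # The L1 penalty drives most coefficients exactly to zero; the
    # survivors are the selected features (here, features 0 and 1).
    model = Lasso(alpha=0.1).fit(x, y)
    print(np.nonzero(model.coef_)[0])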

  19. Core methods of statistics / machine learning / mining • Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2) • Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3) • Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3) • Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3) • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3) • Outlier detection: by density estimation or dimension reduction • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3) • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n) • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models • 2-sample testing and matching: bipartite matching O(N^3), n-point correlation O(N^n)

  20. Now pretty fast (2011)… made fast by us (* fastest, ** fastest in some settings): • Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N) • Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))* • Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)* • Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2) • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)* • Outlier detection: by density estimation or dimension reduction • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N) • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))* • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models • 2-sample testing and matching: bipartite matching O(N)**, n-point correlation O(N^(log n))*
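To see the kind of jump this slide claims for nearest-neighbor queries, compare a brute-force scan with an off-the-shelf kd-tree such as scipy’s (shown purely as an illustration; the starred results above refer to the FASTlab algorithms, not scipy):

    import numpy as np
    from scipy.spatial import cKDTree

    data = np.random.randn(100_000, 3)
    tree = cKDTree(data)                       # one-time O(N log N) build

    # Single nearest-neighbor query: roughly O(log N) in low dimensions,
    # versus the O(N) scan through every point.
    dist, idx = tree.query(np.zeros(3), k=1)
    print(dist, idx)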

  21. How to do Stats/Machine Learning on Very Large Survey Datasets? • Choose the appropriate statistical task and method for the scientific question • Use the fastest algorithm and data structure for the statistical method • Apply method to the data!

  22. Core computational problems What are the basic mathematical operations, or bottleneck subroutines, that we can focus on developing fast algorithms for?

  23. Core computational problems • Generalized N-body problems • Graphical model inference • Linear algebra • Optimization

  24. 4 main computational bottlenecks: N-body, graphical models, linear algebra, optimization • Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2) • Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3) • Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3) • Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3) • Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3) • Outlier detection: by density estimation or dimension reduction • Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3) • Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n) • Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models • 2-sample testing and matching: bipartite matching O(N^3), n-point correlation O(N^n)

  25. Mathematical/practical challenges • Return answers with error guarantees • Rigorous theoretical runtimes • Ensure small constants (small actual runtimes, not just good asymptotic runtimes)

  26. Generalized N-body Problems I • How it appears: nearest-neighbor, spherical range-search, orthogonal range-search • Common methods: locality-sensitive hashing, kd-trees, metric trees, disk-based trees • Mathematical challenges: high dimensions, provable runtime, distribution-dependent analysis, parallel indexing • Mathematical topics: computational geometry, randomized algorithms

  27. First example: spherical range search. How can we compute this efficiently? kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
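A minimal kd-tree constructor, to make the figures on the next slides concrete; a sketch assuming median splits and cycling split dimensions (real implementations tune both, and add leaf buckets):

    import numpy as np

    class KDNode:
        def __init__(self, points, depth=0, leaf_size=16):
            self.points = points                       # points under this node
            self.lo = points.min(axis=0)               # bounding-box corners
            self.hi = points.max(axis=0)
            self.left = self.right = None
            if len(points) > leaf_size:
                dim = depth % points.shape[1]          # cycle through dimensions
                order = np.argsort(points[:, dim])
                mid = len(points) // 2                 # split at the median
                self.left = KDNode(points[order[:mid]], depth + 1, leaf_size)
                self.right = KDNode(points[order[mid:]], depth + 1, leaf_size)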

  28.–33. A kd-tree: levels 1 through 6 (figures showing successive levels of the space partition)

  34.–48. Range-count recursive algorithm (animation: the search descends the kd-tree; a node is Pruned! (inclusion) when its bounding box lies entirely inside the query sphere, and Pruned! (exclusion) when it lies entirely outside)

  49. Range-count recursive algorithm: the fastest practical algorithm [Bentley 1975]; our algorithms can use any tree
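The recursion those frames step through, written out against the KDNode sketch from slide 27’s example: compare the query sphere to each node’s bounding box, count the whole node on inclusion, skip it on exclusion, and recurse otherwise (a sketch of the idea, not the FASTlab code):

    import numpy as np

    def range_count(node, center, radius):
        # Nearest and farthest distance from the query center to the node's box.
        nearest = np.clip(center, node.lo, node.hi)
        d_min = np.sqrt(np.sum((center - nearest) ** 2))
        farthest = np.maximum(np.abs(center - node.lo), np.abs(center - node.hi))
        d_max = np.sqrt(np.sum(farthest ** 2))

        if d_min > radius:                 # Pruned! (exclusion)
            return 0
        if d_max <= radius:                # Pruned! (inclusion)
            return len(node.points)
        if node.left is None:              # leaf: check points one by one
            d = np.sqrt(np.sum((node.points - center) ** 2, axis=1))
            return int(np.sum(d <= radius))
        return (range_count(node.left, center, radius) +
                range_count(node.right, center, radius))

    pts = np.random.rand(10_000, 2)
    print(range_count(KDNode(pts), np.array([0.5, 0.5]), 0.1))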

  50. Generalized N-body Problems I • Interesting approach: cover trees [Beygelzimer et al 2004] • Provable runtime • Consistently good performance, even in higher dimensions • Interesting approach: learning trees [Cayton et al 2007] • Learning data-optimal data structures • Improves performance over kd-trees • Interesting approach: MapReduce [Dean and Ghemawat 2004] • Brute-force, but makes HPC automatic for a certain problem form • Interesting approach: approximation using spill-trees [Liu, Moore, Gray, and Yang, NIPS 2004] • Faster/more accurate than LSH • Interesting approach: approximation in rank [Ram, Ouyang and Gray, NIPS 2009] • Approximate NN in terms of distance conflicts with known theoretical results • Is approximation in rank feasible? • Faster/more accurate than LSH
