Jong Youl Choi Computer Science Department (jychoi@cs.indiana.edu) Machine Learning and Statistical Analysis
Motivations • Social bookmarking: users share socialized bookmarks and tags
Collaborative Tagging System • Motivations • Social indexing or collaborative annotation • Collect knowledge from people • Extract information • Challenges • Vast amount of data → efficient indexing scheme • Very dynamic → temporal analysis • Unsupervised data → clustering, inference
Outline • Principles of Machine Learning • Bayes’ theorem and maximum likelihood • Machine Learning Algorithms • Clustering analysis • Dimension reduction • Classification • Parallel Computing • General parallel computing architecture • Parallel algorithms
Machine Learning • Definition Algorithms or techniques that enable a computer (machine) to “learn” from data; related to many areas such as data mining, statistics, and information theory • Algorithm Types • Unsupervised learning • Supervised learning • Reinforcement learning • Topics • Models • Artificial Neural Network (ANN) • Support Vector Machine (SVM) • Optimization • Expectation-Maximization (EM) • Deterministic Annealing (DA)
Bayes’ Theorem • Posterior probability of θᵢ, given X: P(θᵢ | X) = P(X | θᵢ) P(θᵢ) / P(X) • θᵢ ∈ Θ : parameter • X : observations • P(θᵢ) : prior (or marginal) probability • P(X | θᵢ) : likelihood • Maximum Likelihood (ML) • Used to find the most plausible θᵢ ∈ Θ, given X • Computing maximum likelihood (ML) or log-likelihood → optimization problem
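A minimal sketch of the slide's idea (not part of the original deck): evaluate the likelihood over a grid of candidate parameters θᵢ, apply Bayes' theorem with a flat prior, and read off the maximum-likelihood estimate. All variable names and the unit-variance Gaussian likelihood are illustrative assumptions.

```python
import numpy as np

# Bayes' theorem over a discrete grid of candidate parameters theta_i,
# with a unit-variance Gaussian likelihood (an illustrative assumption).
X = np.array([1.2, 0.8, 1.1, 0.9, 1.3])           # observations
thetas = np.linspace(0.0, 2.0, 41)                 # candidate means theta_i
prior = np.full(thetas.shape, 1.0 / len(thetas))   # flat prior P(theta_i)

# log P(X | theta_i); working in log space avoids numerical underflow
log_lik = np.array([np.sum(-0.5 * (X - t) ** 2 - 0.5 * np.log(2 * np.pi))
                    for t in thetas])

# Posterior P(theta_i | X) ∝ P(X | theta_i) P(theta_i)
post = np.exp(log_lik) * prior
post /= post.sum()

theta_ml = thetas[np.argmax(log_lik)]    # maximum likelihood
theta_map = thetas[np.argmax(post)]      # maximum a posteriori
print(theta_ml, theta_map)
```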
Maximum Likelihood (ML) Estimation • Problem Estimate the hidden parameters θ = {μ, σ} from data drawn from k Gaussian distributions • Gaussian distribution: N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) • Maximum likelihood: θ_ML = argmax_θ P(X | θ) • With Gaussian (P = N), solve either by brute force or by a numerical method (Mitchell, 1997)
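The following hedged sketch illustrates the brute-force option mentioned above for a single Gaussian: scan a grid of (μ, σ) values for the maximum of the log-likelihood, and compare against the closed-form ML solution. The grid ranges and sample data are assumptions for illustration only.

```python
import numpy as np

# Brute-force ML estimation of mu and sigma for one Gaussian.
X = np.random.normal(loc=2.0, scale=1.5, size=500)

def log_likelihood(mu, sigma, X):
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                  - (X - mu) ** 2 / (2 * sigma ** 2))

mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
ll = np.array([[log_likelihood(m, s, X) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmax(ll), ll.shape)

print("grid-search ML:", mus[i], sigmas[j])
print("closed-form ML:", X.mean(), X.std())   # analytic solution for comparison
```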
EM Algorithm • Problems in ML estimation • Observation X is often not complete • Latent (hidden) variable Z exists • Hard to explore the whole parameter space • Expectation-Maximization algorithm • Objective: find the ML estimate over the latent distribution P(Z | X, θ) • Steps 0. Init – choose a random θ_old 1. E-step – compute the expectation over P(Z | X, θ_old) 2. M-step – find θ_new that maximizes the expected likelihood 3. Go to step 1 after updating θ_old ← θ_new
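A minimal EM sketch for a two-component 1-D Gaussian mixture, assuming the generic E-step/M-step loop above; the synthetic data, component count, and initial values are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])      # initial theta_old: component means
sigma = np.array([1.0, 1.0])    # standard deviations
pi = np.array([0.5, 0.5])       # mixing weights

for _ in range(100):
    # E-step: responsibilities P(Z | X, theta_old)
    dens = (pi * np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2 * np.pi)))
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: theta_new maximizing the expected complete-data log-likelihood
    Nk = resp.sum(axis=0)
    mu = (resp * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(X)

print(mu, sigma, pi)
```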
Clustering Analysis • Definition Grouping unlabeled data into clusters, for the purpose of inference of hidden structures or information • Dissimilarity measurement • Distance : Euclidean(L2), Manhattan(L1), … • Angle : Inner product, … • Non-metric : Rank, Intensity, … • Types of Clustering • Hierarchical • Agglomerative or divisive • Partitioning • K-means, VQ, MDS, … (Matlab helppage)
K-Means • Find K partitions with the total intra-cluster variance minimized • Iterative method • Initialization: randomized centers yᵢ • Assignment of each x to its nearest center (yᵢ fixed) • Update of each yᵢ to the mean of its cluster (x fixed) • Problem: can get trapped in local minima (MacKay, 2003)
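A minimal K-means sketch of the two alternating steps above (illustrative names, random 2-D data assumed for the usage line):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), k, replace=False)]          # random initialization
    for _ in range(n_iter):
        # assignment step: nearest center for each point (Y fixed)
        d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each center to the mean of its cluster (x fixed)
        Y = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                      for j in range(k)])
    return Y, labels

X = np.random.rand(200, 2)
centers, labels = kmeans(X, k=3)
```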
Deterministic Annealing (DA) • Deterministically avoid local minima • No stochastic process (random walk) • Trace the global solution by changing the level of randomness • Statistical Mechanics • Gibbs distribution • Helmholtz free energy F = D − TS • Average energy D = ⟨Eₓ⟩ • Entropy S = −Σ P(Eₓ) ln P(Eₓ) • F = −T ln Z • In DA, we minimize F (Maxima and Minima, Wikipedia)
Deterministic Annealing (DA) • Analogy to the physical annealing process • Control energy (randomness) by temperature (high → low) • Starting with high temperature (T = ∞) • Soft (or fuzzy) association probability • Smooth cost function with one global minimum • Lowering the temperature (T → 0) • Hard association • Full complexity is revealed and clusters emerge • Minimize F iteratively, using E(x, yⱼ) = ||x − yⱼ||²
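A hedged sketch of DA clustering in the spirit of the two slides above: soft association probabilities proportional to exp(−||x − yⱼ||²/T), centers that minimize the free energy at each temperature, and a geometric cooling schedule. The cooling factor, iteration counts, and perturbation size are illustrative assumptions.

```python
import numpy as np

def da_cluster(X, k, T0=10.0, T_min=0.01, alpha=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Y = X.mean(axis=0) + 1e-3 * rng.standard_normal((k, X.shape[1]))
    T = T0
    while T > T_min:
        for _ in range(20):
            # soft association: P(y_j | x) ∝ exp(-||x - y_j||^2 / T)
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            P = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            P /= P.sum(axis=1, keepdims=True)
            # centers minimizing the free energy F at this temperature
            Y = (P.T @ X) / P.sum(axis=0)[:, None]
        T *= alpha                                   # cooling: T -> 0 gives hard K-means
        Y += 1e-6 * rng.standard_normal(Y.shape)     # tiny perturbation so centers can split
    return Y

X = np.random.rand(300, 2)
centers = da_cluster(X, k=4)
```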
Dimension Reduction • Definition Process of transforming high-dimensional data into a low-dimensional representation to improve accuracy and understanding, or to remove noise • Curse of dimensionality • Volume (and hence complexity) grows exponentially as extra dimensions are added • Types • Feature selection: choose representatives (e.g., filters, …) • Feature extraction: map to a lower dimension (e.g., PCA, MDS, …) (Koppen, 2000)
Principal Component Analysis (PCA) • Finds a map of the data onto its principal components (PCs), an orthogonal space, such that y = Wᵀx where W ∈ ℝ^(d×h) (h ≪ d) • PCs – directions with the largest variances • Orthogonality • Linearity – optimal least mean-square error • Limitations? • Strict linearity • Assumes a specific distribution • Large-variance assumption (figure: PC1 and PC2 in the x1–x2 plane)
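A minimal PCA sketch via eigen-decomposition of the covariance matrix; the columns of W are the principal components, and the data dimensions in the usage line are assumptions for illustration.

```python
import numpy as np

def pca(X, h):
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # d x d covariance matrix
    eigval, eigvec = np.linalg.eigh(C)      # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :h]              # top-h PCs, largest variance first
    return Xc @ W                           # y = W^T x for every row x

X = np.random.rand(100, 10)
Y = pca(X, h=2)                             # 100 x 2 low-dimensional map
```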
Random Projection • Like PCA, reduces dimension by y = Rx, where R ∈ ℝ^(p×d) (p ≪ d) is a random matrix with i.i.d. columns • Johnson–Lindenstrauss lemma • When projecting onto a randomly selected subspace, pairwise distances are approximately preserved • Generating R • Hard to obtain an orthogonalized R • Gaussian R • Simple approach: choose rᵢⱼ ∈ {+√3, 0, −√3} with probabilities 1/6, 2/3, 1/6 respectively
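A hedged sketch of the sparse scheme just described, with entries drawn from {+√3, 0, −√3}; the scaling by 1/√p and the example dimensions are assumptions added for a runnable illustration.

```python
import numpy as np

def random_projection(X, p, seed=0):
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    R = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)],
                   size=(d, p), p=[1/6, 2/3, 1/6])
    return X @ R / np.sqrt(p)       # scale so pairwise distances are roughly preserved

X = np.random.rand(100, 1000)
Y = random_projection(X, p=50)      # distances in Y ≈ distances in X (J-L lemma)
```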
Multi-Dimensional Scaling (MDS) • Dimension reduction preserving the distance proximities observed in the original data set • Loss functions • Inner product • Distance • Squared distance • Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ • From Δ, find the inner-product matrix B (double centering) • From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
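A minimal classical-MDS sketch of the two steps above: double-center the squared distance matrix to get B, then recover coordinates from its top eigenvectors. The input data used to build Δ is an assumption for illustration.

```python
import numpy as np

def classical_mds(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # inner-product matrix (double centering)
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:dim]      # largest eigenvalues first
    L = np.sqrt(np.maximum(eigval[idx], 0))
    return eigvec[:, idx] * L                 # X' such that B = X'X'^T

X = np.random.rand(50, 5)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
X2 = classical_mds(D, dim=2)
```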
Multi-Dimensional Scaling (MDS) • SMACOF: minimizing STRESS • Majorization – for a complex f(x), find an auxiliary simple g(x, y) that majorizes it • Majorization for STRESS • Minimizing the majorizing function yields the update X ← n⁻¹ B(Y) Y, known as the Guttman transform (Cox, 2001)
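A hedged SMACOF sketch under the unit-weight assumption: repeated Guttman transforms X ← (1/n) B(Y) Y, which monotonically decrease STRESS. Iteration count and the synthetic distance matrix are illustrative assumptions.

```python
import numpy as np

def smacof(D, dim=2, n_iter=300, seed=0):
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))          # random initial configuration
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(dist, 1.0)            # avoid division by zero
        B = -D / dist                          # off-diagonal entries of B(X)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))    # diagonal makes rows sum to zero
        X = B @ X / n                          # Guttman transform
    return X

X0 = np.random.rand(40, 6)
D = np.linalg.norm(X0[:, None, :] - X0[None, :, :], axis=2)
X2 = smacof(D, dim=2)
```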
Self-Organizing Map (SOM) • Competitive and unsupervised learning process for clustering and visualization • Result: similar data move closer together in the model space • Learning • Choose the model vector mⱼ most similar to the input xᵢ • Update the winner and its neighbors by mₖ ← mₖ + α(t) h(t) (xᵢ − mₖ) • α(t) : learning rate • h(t) : neighborhood function (size) (figure: input space mapped onto the model grid)
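A minimal 1-D SOM sketch of the update rule above: find the best-matching model vector, then pull the winner and its grid neighbors toward the input with decaying learning rate and neighborhood size. The linear decay schedules and Gaussian neighborhood are assumptions.

```python
import numpy as np

def som(X, n_nodes=10, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.random((n_nodes, X.shape[1]))                  # model vectors m_k
    for t in range(n_epochs):
        alpha = 0.5 * (1 - t / n_epochs)                   # learning rate alpha(t)
        radius = max(1.0, n_nodes / 2 * (1 - t / n_epochs))  # neighborhood size
        for x in rng.permutation(X):
            j = np.argmin(np.linalg.norm(M - x, axis=1))   # best-matching unit m_j
            h = np.exp(-((np.arange(n_nodes) - j) ** 2) / (2 * radius ** 2))
            M += alpha * h[:, None] * (x - M)              # m_k += a(t) h(t) (x - m_k)
    return M

X = np.random.rand(200, 3)
models = som(X)
```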
Classification • Definition A procedure that divides data into a given set of categories, based on a training set, in a supervised way • Generalization vs. specialization • Hard to achieve both • Avoid overfitting (overtraining) • Early stopping • Holdout validation • K-fold cross-validation • Leave-one-out cross-validation (figure: training vs. validation error, underfitting and overfitting regions; Overfitting, Wikipedia)
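A hedged sketch of K-fold cross-validation, one of the validation schemes listed above; `train` and `error` stand for whatever model-specific fit and error functions are being validated and are purely illustrative.

```python
import numpy as np

def k_fold_cv(X, y, train, error, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                   # held-out validation fold
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[trn], y[trn])                    # fit on the other k-1 folds
        scores.append(error(model, X[val], y[val]))      # validation error
    return np.mean(scores)

# e.g., with a trivial constant-mean model:
# mse = k_fold_cv(X, y, train=lambda X, y: y.mean(),
#                 error=lambda m, X, y: ((y - m) ** 2).mean())
```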
Artificial Neural Network (ANN) • Perceptron: a computational unit with a binary threshold (weighted sum followed by an activation function) • Abilities • Linearly separable decision surface • Represents Boolean functions (AND, OR, NOT) • Network (multilayer) of perceptrons → various network architectures and capabilities (Jain, 1996)
Artificial Neural Network (ANN) • Learning weights – random initialization and iterative updating • Error-correction training rules • Difference between training data and output: E(t, o) • Gradient descent (batch learning) • With E = Σᵢ Eᵢ, update the weights along −∇E • Stochastic approach (on-line learning) • Update the gradient for each result • Various error functions • Add a weight-regularization term (λ Σ wᵢ²) to avoid overfitting • Add momentum (α Δwᵢ(n−1)) to expedite convergence
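A minimal sketch of the error-correction rules above for a single linear unit: batch gradient descent on E = ½ Σᵢ (tᵢ − oᵢ)², with an optional weight-decay term and momentum. Learning rate, decay, and momentum constants are illustrative assumptions.

```python
import numpy as np

def train_unit(X, t, lr=0.01, lam=1e-3, momentum=0.9, n_epochs=200):
    w = np.zeros(X.shape[1])
    dw_prev = np.zeros_like(w)
    for _ in range(n_epochs):
        o = X @ w                                 # unit output
        grad = -(t - o) @ X + lam * w             # dE/dw plus regularization term
        dw = -lr * grad + momentum * dw_prev      # gradient step with momentum
        w += dw
        dw_prev = dw
    return w

X = np.random.rand(100, 3)
t = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * np.random.randn(100)
w = train_unit(X, t)
```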
Support Vector Machine • Q: How to draw the optimal linear separating hyperplane? A: Maximize the margin • Margin maximization • The distance between H₊₁ and H₋₁ is 2/||w|| • Thus, ||w|| should be minimized
Support Vector Machine • Constrained optimization problem • Given a training set {xᵢ, yᵢ} (yᵢ ∈ {+1, −1}): • Minimize ½||w||² subject to yᵢ(w · xᵢ + b) ≥ 1 • Lagrangian equation with saddle points • Minimized w.r.t. the primal variables w and b • Maximized w.r.t. the dual variables αᵢ (all αᵢ ≥ 0) • xᵢ with αᵢ > 0 (not αᵢ = 0) is called a support vector (SV)
Support Vector Machine • Soft margin (non-separable case) • Slack variables ξᵢ ≥ 0; dual variables bounded by C (0 ≤ αᵢ ≤ C) • Optimization with this additional constraint • Non-linear SVM • Map non-linear input to a feature space • Kernel function k(x, y) = ⟨φ(x), φ(y)⟩ • Kernel classifier with support vectors sᵢ: f(x) = sgn(Σᵢ αᵢ yᵢ k(sᵢ, x) + b) (figure: mapping from input space to feature space)
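A hedged sketch of the kernel classifier just described: an RBF kernel plays the role of ⟨φ(x), φ(y)⟩, and prediction uses only the support vectors sᵢ with their weights αᵢ. The kernel choice and all names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), standing in for <phi(x), phi(y)>
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def predict(x, support_vectors, alphas, labels, b):
    # f(x) = sgn( sum_i alpha_i y_i k(s_i, x) + b )
    s = sum(a * y * rbf_kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```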
Parallel Computing • Memory Architecture • Shared memory – Symmetric Multiprocessor (SMP); OpenMP, POSIX threads (pthreads), MPI; easy to manage but expensive • Distributed memory – commodity, off-the-shelf processors; MPI; cost-effective but hard to maintain (Barney, 2007) • Decomposition Strategy • Task – e.g., Word, IE, … • Data – scientific problems • Pipelining – task + data
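A small hedged illustration of the data-decomposition strategy on a shared-memory machine: split the data into chunks, process each chunk in a separate worker process, and combine the partial results. The per-chunk work (a sum of squares) is an arbitrary placeholder.

```python
from multiprocessing import Pool
import numpy as np

def partial_sum(chunk):
    return np.square(chunk).sum()        # per-chunk work done by one worker

if __name__ == "__main__":
    data = np.random.rand(1_000_000)
    chunks = np.array_split(data, 8)     # data decomposition: one chunk per worker
    with Pool(processes=8) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```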
Parallel SVM • Shrinking • Recall: only support vectors (αᵢ > 0) are used in the SVM optimization • Predict whether data is SV or non-SV • Remove non-SVs from the problem space • Parallel SVM • Partition the problem • Each unit finds its support vectors • Merge data hierarchically • Loop until convergence (Graf, 2005)
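A hedged sketch of the partition-and-merge idea (a single pass, not the full iterated cascade of Graf, 2005): train an SVM on each partition, keep only its support vectors, merge the SV sets pairwise, and retrain. scikit-learn's SVC stands in for whichever SVM solver each unit would actually run.

```python
import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_parts=4):
    # partition the problem: one data chunk per processing unit
    parts = [(X[i], y[i])
             for i in np.array_split(np.random.permutation(len(X)), n_parts)]
    while len(parts) > 1:
        merged = []
        for j in range(0, len(parts) - 1, 2):
            Xm = np.vstack([parts[j][0], parts[j + 1][0]])
            ym = np.concatenate([parts[j][1], parts[j + 1][1]])
            svm = SVC(kernel="rbf").fit(Xm, ym)   # each unit finds its support vectors
            sv = svm.support_                      # indices of the support vectors
            merged.append((Xm[sv], ym[sv]))        # keep SVs, drop non-SVs
        if len(parts) % 2:                         # carry an unpaired partition forward
            merged.append(parts[-1])
        parts = merged                             # merge hierarchically
    return SVC(kernel="rbf").fit(*parts[0])        # final model on the merged SVs
```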
Thank you!! Questions?