
Machine Learning and Statistical Analysis

Presentation Transcript


  1. Jong Youl Choi Computer Science Department (jychoi@cs.indiana.edu) Machine Learning and Statistical Analysis

  2. Motivations • Social Bookmarking • Socialized bookmarks • Tags

  3. Collaborative Tagging System • Motivations • Social indexing or collaborative annotation • Collect knowledge from people • Extract information • Challenges • Vast amount of data → efficient indexing scheme • Very dynamic → temporal analysis • Unsupervised data → clustering, inference

  4. Outline • Principles of Machine Learning • Bayes’ theorem and maximum likelihood • Machine Learning Algorithms • Clustering analysis • Dimension reduction • Classification • Parallel Computing • General parallel computing architecture • Parallel algorithms

  5. Machine Learning • Definition : Algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas such as data mining, statistics, information theory, etc. • Algorithm Types • Unsupervised learning • Supervised learning • Reinforcement learning • Topics • Models • Artificial Neural Network (ANN) • Support Vector Machine (SVM) • Optimization • Expectation-Maximization (EM) • Deterministic Annealing (DA)

  6. Bayes’ Theorem • Posterior probability of θi given X : P(θi|X) = P(X|θi) P(θi) / Σj P(X|θj) P(θj) • θi ∈ Θ : parameter • X : observations • P(θi) : prior (or marginal) probability • P(X|θi) : likelihood • Maximum Likelihood (ML) • Used to find the most plausible θi ∈ Θ, given X • Computing the maximum likelihood (ML) or log-likelihood → an optimization problem
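
A minimal numpy sketch of Bayes’ theorem and maximum likelihood for a discrete parameter set Θ; the coin-bias values, uniform prior, and observations below are illustrative only, not from the slides:

import numpy as np

thetas = np.array([0.3, 0.5, 0.7])            # candidate parameters theta_i in Theta
prior = np.array([1/3, 1/3, 1/3])             # P(theta_i)
X = np.array([1, 1, 0, 1, 1])                 # observations (1 = heads, 0 = tails)

heads, tails = X.sum(), len(X) - X.sum()
likelihood = thetas**heads * (1 - thetas)**tails               # P(X | theta_i)
posterior = likelihood * prior / np.sum(likelihood * prior)    # Bayes' theorem

print("posterior P(theta_i | X):", posterior)
print("maximum-likelihood theta:", thetas[np.argmax(likelihood)])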

  7. Maximum Likelihood (ML) Estimation • Problem : Estimate the hidden parameters θ = {μ, σ} from data drawn from k Gaussian distributions • Gaussian distribution : P(x|μ, σ) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) • Maximum Likelihood : θML = argmaxθ ln P(X|θ) • With a Gaussian (P = N), solve either by brute force or by a numerical method (Mitchell, 1997)
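
For a single Gaussian the ML estimates have a closed form (the sample mean and standard deviation); a small sketch with synthetic data, assuming one component rather than the k-component mixture of the slide:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)      # toy data from one Gaussian

mu_hat = x.mean()                                  # ML estimate of mu
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())    # ML estimate of sigma (1/n form)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")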

  8. EM algorithm • Problems in ML estimation • Observation X is often not complete • A latent (hidden) variable Z exists • Hard to explore the whole parameter space • Expectation-Maximization algorithm • Objective : to find the ML estimate over the latent distribution P(Z|X, θ) • Steps 0. Init – Choose a random θold 1. E-step – Expectation P(Z|X, θold) 2. M-step – Find θnew which maximizes the likelihood 3. Go to step 1 after updating θold ← θnew
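
A compact EM sketch for a 1-D Gaussian mixture, following the E-step/M-step loop above; the component count, iteration limit, and random initialization are illustrative choices:

import numpy as np

def em_gmm_1d(x, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k)                          # random initial theta_old
    sigma, pi = np.full(k, x.std()), np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z | X, theta_old)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: theta_new maximizing the expected complete-data log-likelihood
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi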

  9. Clustering Analysis • Definition : Grouping unlabeled data into clusters in order to infer hidden structures or information • Dissimilarity measurement • Distance : Euclidean (L2), Manhattan (L1), … • Angle : inner product, … • Non-metric : rank, intensity, … • Types of Clustering • Hierarchical • Agglomerative or divisive • Partitioning • K-means, VQ, MDS, … (MATLAB help page)

  10. K-Means • Find K partitions with the total intra-cluster variance minimized • Iterative method • Initialization : randomized yi • Assignment of x (yi fixed) • Update of yi (x fixed) • Problem? → Can get trapped in local minima (MacKay, 2003)
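
The two alternating steps (assignment and update) in a short numpy sketch; the initialization, iteration count, and convergence handling are simplified:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), k, replace=False)].astype(float)     # random initial centroids y_i
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                                 # assignment of x (y_i fixed)
        for j in range(k):
            if np.any(labels == j):
                Y[j] = X[labels == j].mean(axis=0)                 # update of y_i (x fixed)
    return Y, labels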

  11. Deterministic Annealing (DA) • Deterministically avoid local minima • No stochastic process (random walk) • Trace the global solution by changing the level of randomness • Statistical Mechanics • Gibbs distribution • Helmholtz free energy F = D − TS • Average energy D = ⟨Ex⟩ • Entropy S = −Σ P(Ex) ln P(Ex) • F = −T ln Z • In DA, we minimize F (Maxima and Minima, Wikipedia)

  12. Deterministic Annealing (DA) • Analogy to the physical annealing process • Control energy (randomness) by temperature (high → low) • Starting with a high temperature (T = ∞) • Soft (or fuzzy) association probability • Smooth cost function with one global minimum • Lowering the temperature (T → 0) • Hard association • Full complexity revealed, clusters emerge • Minimization of F, using E(x, yj) = ‖x − yj‖², done iteratively
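
A rough sketch of DA-style clustering: soft (Gibbs) assignments at temperature T, centroid updates, and a cooling schedule; the schedule parameters are illustrative, and this omits the phase-transition analysis of full DA:

import numpy as np

def da_cluster(X, k, T0=10.0, Tmin=0.01, cooling=0.9, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), k, replace=False)].astype(float)     # initial centroids
    T = T0
    while T > Tmin:
        for _ in range(n_iter):
            # soft (Gibbs) association at temperature T, with E(x, y_j) = ||x - y_j||^2
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            P = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            P /= P.sum(axis=1, keepdims=True)
            Y = (P.T @ X) / P.sum(axis=0)[:, None]                # centroid update
        T *= cooling                   # lower the temperature: soft -> hard association
    return Y, P.argmax(axis=1)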

  13. Dimension Reduction • Definition : The process of transforming high-dimensional data into a low-dimensional representation, to improve accuracy, aid understanding, or remove noise • Curse of dimensionality • Complexity (volume) grows exponentially as extra dimensions are added • Types • Feature selection : choose representatives (e.g., filters, …) • Feature extraction : map to a lower dimension (e.g., PCA, MDS, …) (Koppen, 2000)

  14. Principal Component Analysis (PCA) • Finds a map of the data onto its principal components (PCs) in an orthogonal space, y = Wᵀx, where W ∈ ℝ^(d×h) (h ≪ d) • PCs – the variables with the largest variances • Orthogonality • Linearity – optimal least mean-square error • Limitations? • Strict linearity • Assumes a specific distribution • Large-variance assumption (Figure: PC1 and PC2 of 2-D data x1, x2)
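
PCA via the eigendecomposition of the covariance matrix, in a few lines of numpy; centering and the target dimension h are the only inputs:

import numpy as np

def pca(X, h):
    Xc = X - X.mean(axis=0)                        # center the data
    C = np.cov(Xc, rowvar=False)                   # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # ascending eigenvalues
    W = eigvecs[:, ::-1][:, :h]                    # columns = top-h principal components
    return Xc @ W                                  # projected coordinates y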

  15. Random Projection • Like PCA, reduces dimension by y = Rx, where R ∈ ℝ^(p×d) (p ≪ d) is a random matrix with i.i.d. columns • Johnson-Lindenstrauss lemma • When projecting onto a randomly selected subspace, distances are approximately preserved • Generating R • Hard to obtain an orthogonalized R • Gaussian R • Simple approach : choose rij ∈ {+√3, 0, −√3} with probabilities 1/6, 4/6, 1/6, respectively
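
The sparse {+√3, 0, −√3} construction in numpy, with a quick check that a pairwise distance is roughly preserved; the sizes and the 1/√p scaling convention are illustrative:

import numpy as np

def sparse_random_projection(X, p, seed=0):
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    # entries +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 4/6, 1/6 (unit variance per entry)
    R = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)], size=(d, p), p=[1/6, 4/6, 1/6])
    return X @ R / np.sqrt(p)                      # 1/sqrt(p) keeps distances comparable

X = np.random.default_rng(1).normal(size=(50, 1000))
Y = sparse_random_projection(X, p=64)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))   # approximately equal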

  16. Multi-Dimensional Scaling (MDS) • Dimension reduction preserving the distance proximities observed in the original data set • Loss functions • Inner product • Distance • Squared distance • Classical MDS : minimizing STRAIN, given the dissimilarity matrix Δ • From Δ, find the inner-product matrix B (double centering) • From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
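
Classical MDS in numpy: double centering of the squared dissimilarities to obtain B, then coordinates from its top eigenpairs; the target dimension p is an illustrative default:

import numpy as np

def classical_mds(Delta, p=2):
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (Delta ** 2) @ J                # double centering: inner-product matrix B
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:p]            # keep the p largest eigenvalues
    L = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * L                     # coordinates X' with B = X' X'^T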

  17. Multi-Dimensional Scaling (MDS) • SMACOF : minimizing STRESS • Majorization – for a complex f(x), find an auxiliary simple g(x, y) s.t. g(x, y) ≥ f(x) and g(y, y) = f(y) • Majorization for STRESS • Minimize tr(XᵀB(Y)Y), known as the Guttman transform (Cox, 2001)
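
A bare-bones SMACOF loop with unit weights, where each iteration applies the Guttman transform X ← (1/n) B(X) X; the random starting configuration and iteration count are arbitrary choices:

import numpy as np

def smacof(D, p=2, n_iter=200, seed=0):
    n = D.shape[0]
    X = np.random.default_rng(seed).normal(size=(n, p))      # random starting configuration
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(dist, 1.0)                           # avoid division by zero
        B = -D / dist                                         # B(X) from the majorization
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = B @ X / n                                         # Guttman transform (unit weights)
    return X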

  18. Self-Organizing Map (SOM) • Competitive and unsupervised learning process for clustering and visualization • Result : similar data end up closer in the model space • Learning • Choose the model vector mj most similar to xi • Update the winner and its neighbors by mk = mk + α(t) h(t)(xi – mk) • α(t) : learning rate • h(t) : neighborhood size (Figure: input vectors mapped onto the grid of model vectors)
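
A small SOM training sketch on a 2-D grid; the Gaussian neighborhood and the linear decay of the learning rate and neighborhood width are one common choice, not necessarily the one used in the slides:

import numpy as np

def som_train(X, rows=10, cols=10, n_epochs=20, alpha0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(rows * cols, X.shape[1]))             # model (codebook) vectors m_k
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    t, t_max = 0, n_epochs * len(X)
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            alpha = alpha0 * (1 - t / t_max)                   # decaying learning rate alpha(t)
            sigma = sigma0 * (1 - t / t_max) + 1e-3            # shrinking neighborhood width
            winner = np.argmin(((M - x) ** 2).sum(axis=1))     # best-matching model vector
            h = np.exp(-((coords - coords[winner]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            M += alpha * h[:, None] * (x - M)                  # m_k += alpha(t) h(t) (x_i - m_k)
            t += 1
    return M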

  19. Classification • Definition : A procedure that divides data into a given set of categories, based on a training set, in a supervised way • Generalization vs. specialization • Hard to achieve both • Avoid overfitting (overtraining) • Early stopping • Holdout validation • K-fold cross-validation • Leave-one-out cross-validation (Figure: training error keeps falling while validation error rises again, marking the underfitting/overfitting regions; Overfitting, Wikipedia)
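
K-fold cross-validation as a helper that averages scores from a user-supplied train_and_score callback (a hypothetical interface, shown only to illustrate the fold bookkeeping):

import numpy as np

def k_fold_cv(X, y, train_and_score, k=5, seed=0):
    # train_and_score(X_train, y_train, X_val, y_val) -> validation score (hypothetical callback)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train], y[train], X[val], y[val]))
    return float(np.mean(scores))      # average validation score over the k folds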

  20. Artificial Neural Network (ANN) • Perceptron : a computational unit applying a binary threshold (activation function) to a weighted sum • Abilities • Linearly separable decision surfaces • Represents boolean functions (AND, OR, NOT) • A network (multilayer) of perceptrons → various network architectures and capabilities (Jain, 1996)
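
A threshold perceptron representing AND and OR; the weights and biases below are one hand-picked choice that works, used only for illustration:

import numpy as np

def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0        # binary threshold on the weighted sum

AND = lambda x: perceptron(x, w=np.array([1.0, 1.0]), b=-1.5)
OR = lambda x: perceptron(x, w=np.array([1.0, 1.0]), b=-0.5)
for a in (0, 1):
    for c in (0, 1):
        print(a, c, "AND:", AND(np.array([a, c])), "OR:", OR(np.array([a, c])))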

  21. Artificial Neural Network (ANN) • Learning weights – random initialization and iterative updating • Error-correction training rules • Difference between training data and output : E(t, o) • Gradient descent (batch learning) • With E = Σ Ei, update Δwi = −η ∂E/∂wi • Stochastic approach (on-line learning) • Update the gradient for each result • Various error functions • Adding a weight-regularization term (λ Σ wi²) to avoid overfitting • Adding momentum (α Δwi(n−1)) to expedite convergence
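
On-line (stochastic) gradient descent for a single sigmoid unit with the two refinements mentioned above, L2 weight regularization and momentum; the squared-error objective and the hyperparameters are illustrative:

import numpy as np

def train_sigmoid_unit(X, y, lr=0.1, lam=1e-3, momentum=0.9, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    vw, vb = np.zeros_like(w), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):                         # on-line: one example at a time
            o = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))             # sigmoid output
            g = (o - y[i]) * o * (1 - o)                          # dE/d(net) for squared error
            gw, gb = g * X[i] + lam * w, g                        # plus L2 regularization term
            vw, vb = momentum * vw - lr * gw, momentum * vb - lr * gb   # momentum term
            w, b = w + vw, b + vb
    return w, b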

  22. Support Vector Machine • Q : How to draw the optimal linear separating hyperplane? → A : Maximize the margin • Margin maximization • The distance between H+1 and H−1 is 2/‖w‖ • Thus, ‖w‖ should be minimized

  23. Support Vector Machine • Constrained optimization problem • Given the training set {xi, yi} (yi ∈ {+1, −1}) : • Minimize : ½‖w‖² subject to yi(w·xi + b) ≥ 1 • Lagrangian equation with saddle points • Minimized w.r.t. the primal variables w and b • Maximized w.r.t. the dual variables αi (all αi ≥ 0) • xi with αi > 0 (not αi = 0) is called a support vector (SV)

  24. Support Vector Machine • Soft margin (non-separable case) • Slack variables ξi • Optimization with the additional constraint 0 ≤ αi ≤ C • Non-linear SVM • Map the non-linear input to a feature space • Kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩ • Kernel classifier with support vectors si (Figure: mapping from input space to feature space)
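
A soft-margin, kernelized SVM on toy data using scikit-learn (assumed to be installed); C controls the slack penalty and the RBF kernel plays the role of the feature-space map; the data and parameter values are illustrative:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)       # non-linearly separable toy labels

clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)   # soft margin C, RBF kernel k(x, y)
print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))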

  25. Parallel Computing • Memory Architecture • Shared memory : Symmetric Multiprocessor (SMP); OpenMP, POSIX threads (pthread), MPI; easy to manage but expensive (Barney, 2007) • Distributed memory : commodity, off-the-shelf processors; MPI; cost effective but hard to maintain (Barney, 2007) • Decomposition Strategy • Task – e.g., Word, IE, … • Data – scientific problems • Pipelining – task + data
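
A toy data-decomposition example with a Python process pool: the data are partitioned, each worker computes a partial result, and the results are merged; a real distributed-memory version would use MPI instead, and the workload here is illustrative:

import numpy as np
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    return float((chunk ** 2).sum())        # each worker handles one data partition

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=float)
    chunks = np.array_split(data, 4)        # data decomposition into 4 partitions
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)
    print("total:", sum(partials))          # merge the partial results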

  26. Parallel SVM • Shrinking • Recall : only support vectors (αi > 0) are used in the SVM optimization • Predict whether data are SVs or non-SVs • Remove non-SVs from the problem space • Parallel SVM • Partition the problem • Merge data hierarchically • Each unit finds its support vectors • Loop until convergence (Graf, 2005)
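
A simplified two-level sketch of the partition-and-merge idea (not the full cascade of Graf, 2005), using scikit-learn; it assumes every partition contains both classes, and the toy data are illustrative:

import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_parts=4, C=1.0, seed=0):
    parts = np.array_split(np.random.default_rng(seed).permutation(len(X)), n_parts)
    sv_X, sv_y = [], []
    for part in parts:
        clf = SVC(kernel="linear", C=C).fit(X[part], y[part])     # each unit finds its SVs
        sv_X.append(X[part][clf.support_])                        # keep only support vectors
        sv_y.append(y[part][clf.support_])
    X_sv, y_sv = np.vstack(sv_X), np.concatenate(sv_y)            # merge the partial results
    return SVC(kernel="linear", C=C).fit(X_sv, y_sv)              # retrain on the merged SVs

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print("final support vectors:", len(cascade_svm(X, y).support_))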

  27. Thank you!! Questions?
