Jong Youl Choi Computer Science Department (jychoi@cs.indiana.edu) Machine Learning and Statistical Analysis
Motivations • Social bookmarking: users share socialized bookmarks and tags
Collaborative Tagging System • Motivations • Social indexing or collaborative annotation • Collect knowledge from people • Extract information • Challenges • Vast amount of data → efficient indexing scheme • Very dynamic → temporal analysis • Unsupervised data → clustering, inference
Outline • Principles of Machine Learning • Bayes’ theorem and maximum likelihood • Machine Learning Algorithms • Clustering analysis • Dimension reduction • Classification • Parallel Computing • General parallel computing architecture • Parallel algorithms
Machine Learning • Definition Algorithms or techniques that enable a computer (machine) to “learn” from data; related to many areas such as data mining, statistics, and information theory • Algorithm Types • Unsupervised learning • Supervised learning • Reinforcement learning • Topics • Models • Artificial Neural Network (ANN) • Support Vector Machine (SVM) • Optimization • Expectation-Maximization (EM) • Deterministic Annealing (DA)
Bayes’ Theorem • Posterior probability of θᵢ, given X: P(θᵢ | X) = P(X | θᵢ) P(θᵢ) / P(X) • θᵢ ∈ Θ : parameter • X : observations • P(θᵢ) : prior (or marginal) probability • P(X | θᵢ) : likelihood • Maximum Likelihood (ML) • Used to find the most plausible θᵢ ∈ Θ, given X • Computing maximum likelihood (ML) or log-likelihood → optimization problem
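A minimal sketch of the slide's idea (not part of the original deck): evaluate the likelihood over a grid of candidate parameters θᵢ, apply Bayes' theorem with a flat prior, and read off the maximum-likelihood estimate. All variable names and the unit-variance Gaussian likelihood are illustrative assumptions.

```python
import numpy as np

# Bayes' theorem over a discrete grid of candidate parameters theta_i,
# with a unit-variance Gaussian likelihood (an illustrative assumption).
X = np.array([1.2, 0.8, 1.1, 0.9, 1.3])           # observations
thetas = np.linspace(0.0, 2.0, 41)                 # candidate means theta_i
prior = np.full(thetas.shape, 1.0 / len(thetas))   # flat prior P(theta_i)

# log P(X | theta_i); working in log space avoids numerical underflow
log_lik = np.array([np.sum(-0.5 * (X - t) ** 2 - 0.5 * np.log(2 * np.pi))
                    for t in thetas])

# Posterior P(theta_i | X) ∝ P(X | theta_i) P(theta_i)
post = np.exp(log_lik) * prior
post /= post.sum()

theta_ml = thetas[np.argmax(log_lik)]    # maximum likelihood
theta_map = thetas[np.argmax(post)]      # maximum a posteriori
print(theta_ml, theta_map)
```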
Maximum Likelihood (ML) Estimation • Problem Estimate the hidden parameters θ = {μ, σ} from data drawn from k Gaussian distributions • Gaussian distribution: N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) • Maximum likelihood: θ_ML = argmax_θ P(X | θ) • With Gaussian (P = N), solve either by brute force or by a numerical method (Mitchell, 1997)
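The following hedged sketch illustrates the brute-force option mentioned above for a single Gaussian: scan a grid of (μ, σ) values for the maximum of the log-likelihood, and compare against the closed-form ML solution. The grid ranges and sample data are assumptions for illustration only.

```python
import numpy as np

# Brute-force ML estimation of mu and sigma for one Gaussian.
X = np.random.normal(loc=2.0, scale=1.5, size=500)

def log_likelihood(mu, sigma, X):
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                  - (X - mu) ** 2 / (2 * sigma ** 2))

mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
ll = np.array([[log_likelihood(m, s, X) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmax(ll), ll.shape)

print("grid-search ML:", mus[i], sigmas[j])
print("closed-form ML:", X.mean(), X.std())   # analytic solution for comparison
```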
EM Algorithm • Problems in ML estimation • Observation X is often not complete • Latent (hidden) variable Z exists • Hard to explore the whole parameter space • Expectation-Maximization algorithm • Objective: find the ML estimate over the latent distribution P(Z | X, θ) • Steps 0. Init – choose a random θ_old 1. E-step – compute the expectation over P(Z | X, θ_old) 2. M-step – find θ_new that maximizes the expected likelihood 3. Go to step 1 after updating θ_old ← θ_new
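A minimal EM sketch for a two-component 1-D Gaussian mixture, assuming the generic E-step/M-step loop above; the synthetic data, component count, and initial values are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])      # initial theta_old: component means
sigma = np.array([1.0, 1.0])    # standard deviations
pi = np.array([0.5, 0.5])       # mixing weights

for _ in range(100):
    # E-step: responsibilities P(Z | X, theta_old)
    dens = (pi * np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2 * np.pi)))
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: theta_new maximizing the expected complete-data log-likelihood
    Nk = resp.sum(axis=0)
    mu = (resp * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(X)

print(mu, sigma, pi)
```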
Clustering Analysis • Definition Grouping unlabeled data into clusters, for the purpose of inference of hidden structures or information • Dissimilarity measurement • Distance : Euclidean(L2), Manhattan(L1), … • Angle : Inner product, … • Non-metric : Rank, Intensity, … • Types of Clustering • Hierarchical • Agglomerative or divisive • Partitioning • K-means, VQ, MDS, … (Matlab helppage)
K-Means • Find K partitions with the total intra-cluster variance minimized • Iterative method • Initialization: randomized centers yᵢ • Assignment of each x to its nearest center (yᵢ fixed) • Update of each yᵢ to the mean of its cluster (x fixed) • Problem: can get trapped in local minima (MacKay, 2003)
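A minimal K-means sketch of the two alternating steps above (illustrative names, random 2-D data assumed for the usage line):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), k, replace=False)]          # random initialization
    for _ in range(n_iter):
        # assignment step: nearest center for each point (Y fixed)
        d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each center to the mean of its cluster (x fixed)
        Y = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                      for j in range(k)])
    return Y, labels

X = np.random.rand(200, 2)
centers, labels = kmeans(X, k=3)
```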
Deterministic Annealing (DA) • Deterministically avoid local minima • No stochastic process (random walk) • Trace the global solution by changing the level of randomness • Statistical Mechanics • Gibbs distribution • Helmholtz free energy F = D − TS • Average energy D = ⟨Eₓ⟩ • Entropy S = −Σ P(Eₓ) ln P(Eₓ) • F = −T ln Z • In DA, we minimize F (Maxima and Minima, Wikipedia)
Deterministic Annealing (DA) • Analogy to the physical annealing process • Control energy (randomness) by temperature (high → low) • Starting with high temperature (T = ∞) • Soft (or fuzzy) association probability • Smooth cost function with one global minimum • Lowering the temperature (T → 0) • Hard association • Full complexity is revealed and clusters emerge • Minimize F iteratively, using E(x, yⱼ) = ||x − yⱼ||²
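A hedged sketch of DA clustering in the spirit of the two slides above: soft association probabilities proportional to exp(−||x − yⱼ||²/T), centers that minimize the free energy at each temperature, and a geometric cooling schedule. The cooling factor, iteration counts, and perturbation size are illustrative assumptions.

```python
import numpy as np

def da_cluster(X, k, T0=10.0, T_min=0.01, alpha=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Y = X.mean(axis=0) + 1e-3 * rng.standard_normal((k, X.shape[1]))
    T = T0
    while T > T_min:
        for _ in range(20):
            # soft association: P(y_j | x) ∝ exp(-||x - y_j||^2 / T)
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            P = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            P /= P.sum(axis=1, keepdims=True)
            # centers minimizing the free energy F at this temperature
            Y = (P.T @ X) / P.sum(axis=0)[:, None]
        T *= alpha                                   # cooling: T -> 0 gives hard K-means
        Y += 1e-6 * rng.standard_normal(Y.shape)     # tiny perturbation so centers can split
    return Y

X = np.random.rand(300, 2)
centers = da_cluster(X, k=4)
```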
Dimension Reduction • Definition Process of transforming high-dimensional data into a low-dimensional representation to improve accuracy and understanding, or to remove noise • Curse of dimensionality • Volume (and hence complexity) grows exponentially as extra dimensions are added • Types • Feature selection: choose representatives (e.g., filters, …) • Feature extraction: map to a lower dimension (e.g., PCA, MDS, …) (Koppen, 2000)
Principal Component Analysis (PCA) • Finds a map of the data onto its principal components (PCs), an orthogonal space, such that y = Wᵀx where W ∈ ℝ^(d×h) (h ≪ d) • PCs – directions with the largest variances • Orthogonality • Linearity – optimal least mean-square error • Limitations? • Strict linearity • Assumes a specific distribution • Large-variance assumption (figure: PC1 and PC2 in the x1–x2 plane)
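A minimal PCA sketch via eigen-decomposition of the covariance matrix; the columns of W are the principal components, and the data dimensions in the usage line are assumptions for illustration.

```python
import numpy as np

def pca(X, h):
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # d x d covariance matrix
    eigval, eigvec = np.linalg.eigh(C)      # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :h]              # top-h PCs, largest variance first
    return Xc @ W                           # y = W^T x for every row x

X = np.random.rand(100, 10)
Y = pca(X, h=2)                             # 100 x 2 low-dimensional map
```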
Random Projection • Like PCA, reduces dimension by y = Rx, where R ∈ ℝ^(p×d) (p ≪ d) is a random matrix with i.i.d. columns • Johnson–Lindenstrauss lemma • When projecting onto a randomly selected subspace, pairwise distances are approximately preserved • Generating R • Hard to obtain an orthogonalized R • Gaussian R • Simple approach: choose rᵢⱼ ∈ {+√3, 0, −√3} with probabilities 1/6, 2/3, 1/6 respectively
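A hedged sketch of the sparse scheme just described, with entries drawn from {+√3, 0, −√3}; the scaling by 1/√p and the example dimensions are assumptions added for a runnable illustration.

```python
import numpy as np

def random_projection(X, p, seed=0):
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    R = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)],
                   size=(d, p), p=[1/6, 2/3, 1/6])
    return X @ R / np.sqrt(p)       # scale so pairwise distances are roughly preserved

X = np.random.rand(100, 1000)
Y = random_projection(X, p=50)      # distances in Y ≈ distances in X (J-L lemma)
```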
Multi-Dimensional Scaling (MDS) • Dimension reduction preserving the distance proximities observed in the original data set • Loss functions • Inner product • Distance • Squared distance • Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ • From Δ, find the inner-product matrix B (double centering) • From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
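A minimal classical-MDS sketch of the two steps above: double-center the squared distance matrix to get B, then recover coordinates from its top eigenvectors. The input data used to build Δ is an assumption for illustration.

```python
import numpy as np

def classical_mds(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # inner-product matrix (double centering)
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:dim]      # largest eigenvalues first
    L = np.sqrt(np.maximum(eigval[idx], 0))
    return eigvec[:, idx] * L                 # X' such that B = X'X'^T

X = np.random.rand(50, 5)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
X2 = classical_mds(D, dim=2)
```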
Multi-Dimensional Scaling (MDS) • SMACOF: minimizing STRESS • Majorization – for a complex f(x), find an auxiliary simple g(x, y) that majorizes it • Majorization for STRESS • Minimizing the majorizing function yields the update X ← n⁻¹ B(Y) Y, known as the Guttman transform (Cox, 2001)
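A hedged SMACOF sketch under the unit-weight assumption: repeated Guttman transforms X ← (1/n) B(Y) Y, which monotonically decrease STRESS. Iteration count and the synthetic distance matrix are illustrative assumptions.

```python
import numpy as np

def smacof(D, dim=2, n_iter=300, seed=0):
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))          # random initial configuration
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(dist, 1.0)            # avoid division by zero
        B = -D / dist                          # off-diagonal entries of B(X)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))    # diagonal makes rows sum to zero
        X = B @ X / n                          # Guttman transform
    return X

X0 = np.random.rand(40, 6)
D = np.linalg.norm(X0[:, None, :] - X0[None, :, :], axis=2)
X2 = smacof(D, dim=2)
```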
Self-Organizing Map (SOM) • Competitive and unsupervised learning process for clustering and visualization • Result: similar data move closer together in the model space • Learning • Choose the model vector mⱼ most similar to the input xᵢ • Update the winner and its neighbors by mₖ ← mₖ + α(t) h(t) (xᵢ − mₖ) • α(t) : learning rate • h(t) : neighborhood function (size) (figure: input space mapped onto the model grid)
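A minimal 1-D SOM sketch of the update rule above: find the best-matching model vector, then pull the winner and its grid neighbors toward the input with decaying learning rate and neighborhood size. The linear decay schedules and Gaussian neighborhood are assumptions.

```python
import numpy as np

def som(X, n_nodes=10, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.random((n_nodes, X.shape[1]))                  # model vectors m_k
    for t in range(n_epochs):
        alpha = 0.5 * (1 - t / n_epochs)                   # learning rate alpha(t)
        radius = max(1.0, n_nodes / 2 * (1 - t / n_epochs))  # neighborhood size
        for x in rng.permutation(X):
            j = np.argmin(np.linalg.norm(M - x, axis=1))   # best-matching unit m_j
            h = np.exp(-((np.arange(n_nodes) - j) ** 2) / (2 * radius ** 2))
            M += alpha * h[:, None] * (x - M)              # m_k += a(t) h(t) (x - m_k)
    return M

X = np.random.rand(200, 3)
models = som(X)
```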
Classification • Definition A procedure that divides data into a given set of categories, based on a training set, in a supervised way • Generalization vs. specialization • Hard to achieve both • Avoid overfitting (overtraining) • Early stopping • Holdout validation • K-fold cross-validation • Leave-one-out cross-validation (figure: training vs. validation error, underfitting and overfitting regions; Overfitting, Wikipedia)
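A hedged sketch of K-fold cross-validation, one of the validation schemes listed above; `train` and `error` stand for whatever model-specific fit and error functions are being validated and are purely illustrative.

```python
import numpy as np

def k_fold_cv(X, y, train, error, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                   # held-out validation fold
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[trn], y[trn])                    # fit on the other k-1 folds
        scores.append(error(model, X[val], y[val]))      # validation error
    return np.mean(scores)

# e.g., with a trivial constant-mean model:
# mse = k_fold_cv(X, y, train=lambda X, y: y.mean(),
#                 error=lambda m, X, y: ((y - m) ** 2).mean())
```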
Artificial Neural Network (ANN) • Perceptron: a computational unit with a binary threshold (weighted sum followed by an activation function) • Abilities • Linearly separable decision surface • Represents Boolean functions (AND, OR, NOT) • Network (multilayer) of perceptrons → various network architectures and capabilities (Jain, 1996)
Artificial Neural Network (ANN) • Learning weights – random initialization and iterative updating • Error-correction training rules • Difference between training data and output: E(t, o) • Gradient descent (batch learning) • With E = Σᵢ Eᵢ, update the weights along −∇E • Stochastic approach (on-line learning) • Update the gradient for each result • Various error functions • Add a weight-regularization term (λ Σ wᵢ²) to avoid overfitting • Add momentum (α Δwᵢ(n−1)) to expedite convergence
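A minimal sketch of the error-correction rules above for a single linear unit: batch gradient descent on E = ½ Σᵢ (tᵢ − oᵢ)², with an optional weight-decay term and momentum. Learning rate, decay, and momentum constants are illustrative assumptions.

```python
import numpy as np

def train_unit(X, t, lr=0.01, lam=1e-3, momentum=0.9, n_epochs=200):
    w = np.zeros(X.shape[1])
    dw_prev = np.zeros_like(w)
    for _ in range(n_epochs):
        o = X @ w                                 # unit output
        grad = -(t - o) @ X + lam * w             # dE/dw plus regularization term
        dw = -lr * grad + momentum * dw_prev      # gradient step with momentum
        w += dw
        dw_prev = dw
    return w

X = np.random.rand(100, 3)
t = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * np.random.randn(100)
w = train_unit(X, t)
```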
Support Vector Machine • Q: How to draw the optimal linear separating hyperplane? A: Maximize the margin • Margin maximization • The distance between H₊₁ and H₋₁ is 2/||w|| • Thus, ||w|| should be minimized
Support Vector Machine • Constrained optimization problem • Given a training set {xᵢ, yᵢ} (yᵢ ∈ {+1, −1}): • Minimize ½||w||² subject to yᵢ(w · xᵢ + b) ≥ 1 • Lagrangian equation with saddle points • Minimized w.r.t. the primal variables w and b • Maximized w.r.t. the dual variables αᵢ (all αᵢ ≥ 0) • xᵢ with αᵢ > 0 (not αᵢ = 0) is called a support vector (SV)
Support Vector Machine • Soft margin (non-separable case) • Slack variables ξᵢ ≥ 0; dual variables bounded by C (0 ≤ αᵢ ≤ C) • Optimization with this additional constraint • Non-linear SVM • Map non-linear input to a feature space • Kernel function k(x, y) = ⟨φ(x), φ(y)⟩ • Kernel classifier with support vectors sᵢ: f(x) = sgn(Σᵢ αᵢ yᵢ k(sᵢ, x) + b) (figure: mapping from input space to feature space)
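A hedged sketch of the kernel classifier just described: an RBF kernel plays the role of ⟨φ(x), φ(y)⟩, and prediction uses only the support vectors sᵢ with their weights αᵢ. The kernel choice and all names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), standing in for <phi(x), phi(y)>
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def predict(x, support_vectors, alphas, labels, b):
    # f(x) = sgn( sum_i alpha_i y_i k(s_i, x) + b )
    s = sum(a * y * rbf_kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```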
Parallel Computing • Memory Architecture • Shared memory – Symmetric Multiprocessor (SMP); OpenMP, POSIX threads (pthreads), MPI; easy to manage but expensive • Distributed memory – commodity, off-the-shelf processors; MPI; cost-effective but hard to maintain (Barney, 2007) • Decomposition Strategy • Task – e.g., Word, IE, … • Data – scientific problems • Pipelining – task + data
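A small hedged illustration of the data-decomposition strategy on a shared-memory machine: split the data into chunks, process each chunk in a separate worker process, and combine the partial results. The per-chunk work (a sum of squares) is an arbitrary placeholder.

```python
from multiprocessing import Pool
import numpy as np

def partial_sum(chunk):
    return np.square(chunk).sum()        # per-chunk work done by one worker

if __name__ == "__main__":
    data = np.random.rand(1_000_000)
    chunks = np.array_split(data, 8)     # data decomposition: one chunk per worker
    with Pool(processes=8) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```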
Parallel SVM • Shrinking • Recall: only support vectors (αᵢ > 0) are used in the SVM optimization • Predict whether data is SV or non-SV • Remove non-SVs from the problem space • Parallel SVM • Partition the problem • Each unit finds its support vectors • Merge data hierarchically • Loop until convergence (Graf, 2005)
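A hedged sketch of the partition-and-merge idea (a single pass, not the full iterated cascade of Graf, 2005): train an SVM on each partition, keep only its support vectors, merge the SV sets pairwise, and retrain. scikit-learn's SVC stands in for whichever SVM solver each unit would actually run.

```python
import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_parts=4):
    # partition the problem: one data chunk per processing unit
    parts = [(X[i], y[i])
             for i in np.array_split(np.random.permutation(len(X)), n_parts)]
    while len(parts) > 1:
        merged = []
        for j in range(0, len(parts) - 1, 2):
            Xm = np.vstack([parts[j][0], parts[j + 1][0]])
            ym = np.concatenate([parts[j][1], parts[j + 1][1]])
            svm = SVC(kernel="rbf").fit(Xm, ym)   # each unit finds its support vectors
            sv = svm.support_                      # indices of the support vectors
            merged.append((Xm[sv], ym[sv]))        # keep SVs, drop non-SVs
        if len(parts) % 2:                         # carry an unpaired partition forward
            merged.append(parts[-1])
        parts = merged                             # merge hierarchically
    return SVC(kernel="rbf").fit(*parts[0])        # final model on the merged SVs
```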
Thank you!! Questions?