
Data mining @ Mahout






Presentation Transcript


  1. Data mining @ Mahout Reporter: terry

  2. What is Mahout? • Mahout's goal is to build scalable machine learning libraries. • What’s the meaning of “scalable”? • Scalable to reasonably large data sets. • Scalable to support your business case. • Scalable community • Notes: • The core libraries are highly optimized to give good performance even for non-distributed algorithms. • https://cwiki.apache.org/confluence/display/MAHOUT/Overview

  3. How does it help us? • Currently Mahout mainly supports four use cases: • Recommendation mining • takes users' behavior and from that tries to find items users might like. • Clustering • takes e.g. text documents and organizes them into groups of topically related documents. • Classification • learns from existing categorized documents what documents of a specific category look like, and can assign unlabelled documents to the (hopefully) correct category. • Frequent item-set mining • takes a set of item groups (terms in a query session, shopping-cart content) and identifies which individual items usually appear together.

  4. Classification in Mahout • Logistic Regression (SGD, Stochastic Gradient Descent) • A model that predicts the probability of an event by fitting data to the logistic (sigmoid) curve. • SGD • An online learning algorithm • Does on-line evaluation using cross-validation • An evolutionary system for hyper-parameter optimization http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en
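The model above can be sketched in a few lines of plain Python (an illustration of the math, not Mahout's actual SGD classes): a logistic regression prediction is just a weighted sum of the features passed through the logistic curve.

```python
import math

def sigmoid(z):
    # logistic function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, features):
    # probability of the positive class under a logistic regression model
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# with these weights the linear score is 0, so the probability is exactly 0.5
p = predict_proba([0.5, -0.25], [2.0, 4.0])
```

SGD trains such a model online by nudging the weights after each example in the direction that reduces the loss.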

  5. Classification in Mahout • Logistic Regression (SGD, Stochastic Gradient Descent)

  6. Classification in Mahout • SGD, Stochastic Gradient Descent • An optimization algorithm • A learning rate controls the size of each update step

  7. Classification in Mahout • SGD-Stochastic Gradient Descent

  8. Classification in Mahout • SGD, Stochastic Gradient Descent • Straight line: y = w·x + b • Input points: (x_i, y_i) • Applications: LMS (least mean squares) and backpropagation
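The line-fitting application can be sketched as follows: a small gradient-descent loop that minimizes squared error one point at a time (the LMS rule). This is an illustrative sketch, not Mahout's implementation.

```python
def sgd_lms(points, lr=0.05, epochs=200):
    # Fit a straight line y = w*x + b by stochastic gradient descent
    # on the squared error of one point at a time (the LMS rule).
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in points:
            err = (w * x + b) - y   # signed prediction error on this sample
            w -= lr * err * x       # step down the gradient for the slope
            b -= lr * err           # step down the gradient for the intercept
    return w, b

# noiseless input points on the line y = 2x + 1
pts = [(x / 10.0, 2 * (x / 10.0) + 1) for x in range(-10, 11)]
w, b = sgd_lms(pts)
```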

  9. Classification out of Mahout • AdaBoost • Reweights the training data at each round and combines the resulting weak classifiers into a strong classifier

  10. Classification out of Mahout • AdaBoost • Discrete AdaBoost

  11. Classification out of Mahout • AdaBoost • Real AdaBoost
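To make the boosting idea concrete, here is a small illustrative discrete AdaBoost on 1-D data, using threshold "decision stumps" as the weak classifiers. It is a sketch under made-up data, not Mahout code; all names are hypothetical.

```python
import math

def stump_predict(x, thr, sign):
    # decision stump on a 1-D feature: predicts +1 or -1
    return sign if x > thr else -sign

def adaboost(xs, ys, rounds=5):
    # Discrete AdaBoost with threshold stumps as weak learners.
    n = len(xs)
    w = [1.0 / n] * n                       # sample weights
    ensemble = []                           # list of (alpha, thr, sign)
    for _ in range(rounds):
        # pick the stump with the lowest weighted error
        best = None
        for thr in xs:
            for sign in (+1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if stump_predict(x, thr, sign) != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # weak-learner weight
        ensemble.append((alpha, thr, sign))
        # reweight: boost the weight of misclassified samples
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # weighted vote of all weak classifiers
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return 1 if score > 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
```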

  12. Classification in Mahout • Bayesian • Traditional Naive Bayes: simple & naïve • A simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions • The evidence in the denominator is constant for a given input, so it can be ignored when comparing classes

  13. Classification in Mahout • Bayesian • Bayes' theorem: P(C | F1, …, Fn) = P(C) P(F1, …, Fn | C) / P(F1, …, Fn) • Assume the features are conditionally independent given the class: P(Fi | C, Fj) = P(Fi | C) for i ≠ j • Then: P(C | F1, …, Fn) ∝ P(C) ∏i P(Fi | C)

  14. Classification in Mahout • Bayesian • Parameter estimation • MAP (maximum a posteriori) • P(C): the fraction of class C in the training set • P(Fi | C): estimated from the samples of class C in the training set

  15. Classification in Mahout • Bayesian • Example (sex classification) • Parameter estimation: fit a probability distribution for every feature in every class • The class priors: the fraction of each class in the training set

  16. Classification in Mahout • Bayesian • Example (sex classification) • The evidence (the denominator) is the same for both classes and can be dropped

  17. Classification in Mahout • Bayesian • Example( Sex Classification)

  18. Classification in Mahout • Bayesian • Example( Sex Classification)

  19. Classification in Mahout • Bayesian • Example (sex classification) • posterior(male): × • posterior(female): √ • The sample is classified as female
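The whole example can be sketched in a few lines of Python, assuming Gaussian-distributed features. The training data below is made up for illustration; only the structure of the calculation follows the slides: class prior times per-feature likelihoods, with the evidence dropped.

```python
import math

def gaussian_pdf(x, mean, var):
    # likelihood of x under a Gaussian with the given mean and variance
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_class(samples):
    # per-feature mean and (sample) variance for one class
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    vars_ = [sum((s[d] - means[d]) ** 2 for s in samples) / (n - 1)
             for d in range(dims)]
    return means, vars_

def posterior_score(x, prior, means, vars_):
    # unnormalized posterior: prior times the product of feature likelihoods
    score = prior
    for d in range(len(x)):
        score *= gaussian_pdf(x[d], means[d], vars_[d])
    return score

# hypothetical training data: (height, weight) per class
males   = [(6.0, 180.0), (5.9, 190.0), (5.6, 170.0), (5.9, 165.0)]
females = [(5.0, 100.0), (5.5, 150.0), (5.4, 130.0), (5.8, 150.0)]
m_mean, m_var = fit_class(males)
f_mean, f_var = fit_class(females)

sample = (5.4, 132.0)
p_male = posterior_score(sample, 0.5, m_mean, m_var)
p_female = posterior_score(sample, 0.5, f_mean, f_var)
label = "female" if p_female > p_male else "male"
```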

  20. Classification in Mahout • Bayesian • Extensions • Random Naïve Bayes • Random tree + Naïve Bayes • Bayes network • Conditional dependencies • Directed acyclic graph (DAG) • Nodes (variables) and edges (conditional dependencies)

  21. Classification in Mahout • Support Vector Machine (BLC) • Each object is considered as a point in an n-dimensional feature space • Each point is labeled '0' or '1' • Find a hyperplane that separates the objects • Linear separation in low dimensions leads to mistakes • Curse of dimensionality • Fewer features vs. free parameters • Impose structural constraints

  22. Classification in Mahout • Support Vector Machine • Linear classifier • Point vs. line • Line vs. plane • Plane vs. volume

  23. Classification in Mahout • Support Vector Machine • Maximum-margin hyperplane • The best line: red line • The worst line: green line • The medium line: blue line • The distance from the plane to the nearest point on each side is maximized

  24. Classification in Mahout • Support Vector Machine • Linear SVM • Find the maximum-margin hyperplane • Hyperplanes: w·x − b = 1 and w·x − b = −1; the separating hyperplane w·x − b = 0 lies between them, and the margin is 2/‖w‖

  25. Classification in Mahout • Support Vector Machine • Linear SVM

  26. Classification in Mahout • Support Vector Machine • Linear SVM • Minimize ½‖w‖² subject to y_i(w·x_i − b) ≥ 1, introducing Lagrange multipliers α_i ≥ 0

  27. Classification in Mahout • Support Vector Machine • Linear SVM Wrong? Maybe!

  28. Classification in Mahout • Support Vector Machine • Linear SVM generalized

  29. Classification in Mahout • Support Vector Machine • Linear SVM Support Vectors!

  30. Classification in Mahout • Support Vector Machine • Linear SVM • Soft margin
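A soft-margin linear SVM can be trained with a simple subgradient-descent loop on the hinge loss. This is an illustrative Pegasos-style sketch with a fixed step size and made-up data, not Mahout's solver.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.05, epochs=200):
    # Soft-margin linear SVM: subgradient descent on the per-sample
    # objective lam/2 * ||w||^2 + max(0, 1 - y * (w.x - b)).
    dims = len(points[0])
    w = [0.0] * dims
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            if margin < 1:
                # point is inside the margin (or misclassified): hinge is active
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b -= lr * y
            else:
                # only the regularizer contributes to the subgradient
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    # sign of the decision function w.x - b
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1

points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
          (0.0, 0.0), (-1.0, 0.0), (0.0, -1.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
```

The soft margin enters through the hinge term: points that violate the margin contribute to the gradient instead of being forbidden outright.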

  31. Classification in Mahout • Support Vector Machine • Non-linear SVM • The plain dot product is not enough • Replace it with a kernel function that implicitly maps points from a low-dimensional space to a high-dimensional one

  32. Classification in Mahout • Support Vector Machine • Non-linear SVM • Common kernels: • Polynomial (homogeneous): k(x, x′) = (x·x′)^d • Gaussian radial basis function: k(x, x′) = exp(−‖x − x′‖² / (2σ²)) • Hyperbolic tangent: k(x, x′) = tanh(κ x·x′ + c)
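The three common kernels listed above translate directly into code (a plain-Python sketch of the formulas, not a Mahout API):

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, d=2):
    # homogeneous polynomial kernel: (x.y)^d
    return dot(x, y) ** d

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian RBF kernel: exp(-||x - y||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def tanh_kernel(x, y, kappa=1.0, c=0.0):
    # hyperbolic tangent (sigmoid) kernel: tanh(kappa * x.y + c)
    return math.tanh(kappa * dot(x, y) + c)
```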

  33. Clustering in Mahout • K-Means Clustering • Partition n observations into k clusters • Observations: (x1, x2, …, xn) • Clusters: S = {S1, S2, …, Sk}, chosen to minimize the within-cluster sum of squares Σj Σ_{x∈Sj} ‖x − μj‖², where μj is the mean of the points in Sj

  34. Clustering in Mahout • K-Means Clustering • Step 1 (assignment): assign each point to the cluster with the nearest mean • Step 2 (update): recompute each mean as the centroid of its assigned points • Until the assignment no longer changes!

  35. Clustering in Mahout • K-Means Clustering a. The result may depend on the initial clusters b. It is usually fast, so run it several times with different initial conditions

  36. Clustering in Mahout • K-Means Clustering K=2? Or k=3? Which is better?
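The two alternating steps can be sketched directly in Python (an illustrative Lloyd's-algorithm loop, not Mahout's distributed implementation; the initial centers are supplied by the caller, since the result may depend on them):

```python
def kmeans(points, centers, max_iter=100):
    # Alternate the assignment and update steps until the assignment
    # no longer changes.
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    assignment = None
    for _ in range(max_iter):
        # Step 1 (assignment): nearest center by squared Euclidean distance
        new_assignment = [min(range(len(centers)),
                              key=lambda j: sqdist(pt, centers[j]))
                          for pt in points]
        if new_assignment == assignment:
            break                      # converged: assignment is stable
        assignment = new_assignment
        # Step 2 (update): each center moves to the mean of its points
        for j in range(len(centers)):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assignment

pts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(pts, [[0.0, 0.0], [10.0, 10.0]])
```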

  37. Clustering in Mahout • Fuzzy Clustering • Every point has a degree of belonging to each cluster • Most popular: FCM (Fuzzy C-Means) • Fuzzy (graded) belonging vs. determined (hard) belonging • Input: the data set • Return: cluster centers and a partition matrix of membership degrees

  38. Clustering in Mahout • Fuzzy Clustering • FCM • Step 1: initialization • Choose the number of clusters • The exponential weight (fuzzifier) • The termination criterion • An initial partition matrix

  39. Clustering in Mahout • Fuzzy Clustering • FCM • Step 2: calculate each cluster center as the membership-weighted mean of the points • Step 3: recalculate the partition matrix from the distances of the points to the new centers

  40. Clustering in Mahout • Fuzzy Clustering • FCM • Step 4: calculate the variation of the partition matrix; stop when it falls below the termination criterion, otherwise return to Step 2

  41. Clustering in Mahout • Fuzzy Clustering • FCM • Example
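Steps 1 through 4 can be sketched on 1-D data (an illustrative fuzzy c-means loop with fuzzifier m = 2 and made-up data, not Mahout's implementation):

```python
def fcm(points, c, m=2.0, eps=1e-5, max_iter=100):
    # Fuzzy c-means on 1-D data. U[i][j] is the degree to which point i
    # belongs to cluster j; each row of U sums to 1.
    n = len(points)
    # Step 1: a simple deterministic initialization of the partition matrix
    U = [[1.0 if j == i % c else 0.0 for j in range(c)] for i in range(n)]
    for _ in range(max_iter):
        # Step 2: membership-weighted cluster centers
        centers = []
        for j in range(c):
            num = sum(U[i][j] ** m * points[i] for i in range(n))
            den = sum(U[i][j] ** m for i in range(n))
            centers.append(num / den)
        # Step 3: update the partition matrix from the new distances
        new_U = []
        for i in range(n):
            dists = [abs(points[i] - cj) or 1e-12 for cj in centers]
            row = []
            for j in range(c):
                s = sum((dists[j] / dk) ** (2 / (m - 1)) for dk in dists)
                row.append(1.0 / s)
            new_U.append(row)
        # Step 4: stop when the partition matrix barely changes
        delta = max(abs(new_U[i][j] - U[i][j])
                    for i in range(n) for j in range(c))
        U = new_U
        if delta < eps:
            break
    return centers, U

pts = [0.0, 0.1, 0.2, 9.8, 9.9, 10.0]
centers, U = fcm(pts, c=2)
```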

  42. Clustering in Mahout • Spectral Clustering • Makes use of the spectrum of a similarity matrix • Input: data points and the number of clusters k • Similarity matrix: S, where s_ij measures the similarity of points i and j • Similarity graph: vertices are the points, weighted edges are the similarities

  43. Clustering in Mahout • Spectral Clustering • Different similarity graphs: • ε-neighborhood graph • ε is a distance threshold • K-nearest-neighbor graph • Directed graph or undirected graph • The fully connected graph

  44. Clustering in Mahout • Spectral Clustering • Unnormalized Laplacian matrix: L = D − W, where D is the degree matrix and W the weighted adjacency matrix • Properties: L is symmetric and positive semi-definite; its smallest eigenvalue is 0, with the constant vector as eigenvector; the multiplicity of eigenvalue 0 equals the number of connected components

  45. Clustering in Mahout • Spectral Clustering • Steps: build the similarity graph and its Laplacian L; compute the first k eigenvectors of L; stack them as the columns of a matrix U; treat each row of U as a point in R^k and cluster the rows with k-means
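For two clusters, the steps above reduce to looking at the sign of the second-smallest eigenvector (the Fiedler vector), which stands in for the k-means step. A numpy sketch on made-up data, not Mahout's implementation:

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    # Unnormalized spectral clustering into two clusters, using a fully
    # connected similarity graph with Gaussian weights.
    pts = np.asarray(points, dtype=float)
    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))      # similarity matrix
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                               # unnormalized Laplacian
    # eigh returns eigenvalues in ascending order; the eigenvector for
    # the second-smallest eigenvalue separates the two groups by sign
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)

pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = spectral_bipartition(pts)
```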

  46. Clustering in Mahout • Expectation maximization • Description • Observations: X • Latent data or missing values: Z • Unknown parameters: θ • Likelihood function: L(θ; X, Z) = p(X, Z | θ) • Goal: MLE of θ!

  47. Clustering in Mahout • Expectation maximization • Step 1 (expectation): compute the expected log-likelihood Q(θ | θ(t)) = E_{Z|X,θ(t)}[log L(θ; X, Z)] • Step 2 (maximization): choose θ(t+1) to maximize Q

  48. Clustering in Mahout • Expectation maximization • Hard EM • Initialize the parameters θ • Compute the best value for z • Derive a better θ • Soft EM • Determine the probability of each possible value for z

  49. Clustering in Mahout • Expectation maximization • Example (Gaussian mixture) • Input: observations x1, …, xn • Latent variables: z_i, the component that generated x_i • Soft EM: determine the posterior probability of each component for each point • Parameters: mixing weights, means, and covariances • Likelihood function: a weighted sum of Gaussian densities

  50. Clustering in Mahout • Expectation maximization • Example( Gaussian Mixture)
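The soft-EM updates for a toy two-component mixture can be sketched as follows (1-D, unit variances and equal mixing weights held fixed, only the two means fitted; the data is made up for illustration):

```python
import math

def em_gmm_1d(xs, mu, iters=50):
    # Soft EM for a two-component 1-D Gaussian mixture, fitting the means.
    def normal(x, m):
        # unit-variance Gaussian density
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

    mu1, mu2 = mu
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r1 = []
        for x in xs:
            p1, p2 = normal(x, mu1), normal(x, mu2)
            r1.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means
        w1 = sum(r1)
        w2 = len(xs) - w1
        mu1 = sum(r * x for r, x in zip(r1, xs)) / w1
        mu2 = sum((1 - r) * x for r, x in zip(r1, xs)) / w2
    return mu1, mu2

xs = [-2.2, -1.9, -2.0, 1.8, 2.1, 2.0]
mu1, mu2 = em_gmm_1d(xs, (-1.0, 1.0))
```

Hard EM would instead assign each point entirely to its most likely component before updating the means.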
