
Scalable Machine Learning


Presentation Transcript


  1. Scalable Machine Learning CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Apache Mahout ma·hout -\mə-ˈhau̇t\ - noun - A keeper and driver of an elephant

  3. Overview • Build a scalable machine learning library, in both data volume and processing • Began in 2008 as a subproject of Apache Lucene, then became a top-level Apache project in 2010 • Address issues commonly found in ML libraries: • Lack community, scalability, documentation/examples, Apache licensing • Not well-tested • Not research oriented • Not built on existing production-quality projects • Active Community

  4. But what is Machine Learning? • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Intro. to Machine Learning by E. Alpaydin) • Given a data set X, can we effectively predict Y by optimizing Z?

  5. Supervised vs. Unsupervised • Supervised: algorithms trained on labeled examples • I know these images are of cats and these are of dogs; tell me whether this new image is a cat or a dog • Unsupervised: algorithms trained on unlabeled examples • Group these images together by similarity, i.e. some kind of distance function

  6. Use Cases • Collaborative Filtering • Takes users' behavior and from that tries to find items users might like • Clustering • Takes things and puts them into groups of related things • Classification • Learns from existing categories to determine what things in a category look like, and assigns unlabeled things to the (hopefully) correct category • Frequent Itemset Mining • Analyzes items in a group and identifies which items frequently appear together

  7. Clustering • Dirichlet Process Clustering • Bayesian mixture modeling • K-Means Clustering • Partition n observations into k clusters (see the sketch below) • Fuzzy K-Means • Soft clusters where a point can be in more than one • Hierarchical Clustering • Hierarchy of clusters built bottom-up or top-down • Canopy Clustering • Preprocess data before K-Means or Hierarchical
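
To make the K-Means bullet concrete, here is a minimal, single-machine sketch of the algorithm in plain Java. It is not Mahout's distributed implementation; the points, k, and iteration count are made up for illustration.

import java.util.Arrays;

public class KMeansSketch {
  public static void main(String[] args) {
    double[][] points = {{1, 1}, {1.5, 2}, {0.5, 1}, {8, 8}, {8, 9}, {9, 11}};
    int k = 2;
    double[][] centroids = {points[0].clone(), points[3].clone()}; // naive initialization
    int[] assignment = new int[points.length];

    for (int iter = 0; iter < 10; iter++) {
      // Assignment step: attach each point to its nearest centroid
      for (int p = 0; p < points.length; p++) {
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = distance(points[p], centroids[c]);
          if (d < best) { best = d; assignment[p] = c; }
        }
      }
      // Update step: move each centroid to the mean of its assigned points
      for (int c = 0; c < k; c++) {
        double sumX = 0, sumY = 0;
        int count = 0;
        for (int p = 0; p < points.length; p++) {
          if (assignment[p] == c) { sumX += points[p][0]; sumY += points[p][1]; count++; }
        }
        if (count > 0) centroids[c] = new double[] {sumX / count, sumY / count};
      }
    }
    System.out.println(Arrays.deepToString(centroids)); // the two cluster centers
  }

  static double distance(double[] a, double[] b) {
    double dx = a[0] - b[0], dy = a[1] - b[1];
    return Math.sqrt(dx * dx + dy * dy);
  }
}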

  8. More Clustering • Latent Dirichlet Allocation • Cluster words into topics and documents into mixtures of topics • Mean Shift Clustering • Finding modes or clusters in 2-dimensional space, where number of clusters is unknown • Minhash Clustering • Quickly estimate similarity between two data sets • Spectral Clustering • Cluster points using eigenvectors of matrices derived from data

  9. Collaborative Filtering • Distributed Item-based Collaborative Filtering • Estimates a user's preference for one item by looking at their preferences for similar items • Collaborative Filtering using Parallel Matrix Factorization • Among the items a user has not yet seen, predict which items the user might prefer

  10. Classification • Bayesian • Classify objects into binary categories • Random Forests • Method for classification and regression by constructing a multitude of decision trees (figure: dog vs. cat classification example)

  11. Frequent Itemset Mining • Parallel FP Growth Algorithm • Analyzes items in a group and then identifies which items appear together

  12. Technical Requirements • Linux • Java 1.6 or greater • Maven • Hadoop (although not all algorithms are implemented to work on Hadoop clusters)

  13. Building Mahout for Hadoop 2
• Check out Mahout trunk with git
git clone -b trunk https://github.com/apache/mahout.git mahout-git
• Build with Maven, giving it the proper Hadoop and HBase versions
mvn install -DskipTests \
    -Dhadoop2 -Dhadoop2.version=2.2.0 \
    -Dhbase.version=0.98.0-hadoop2

  14. Algorithm Examples • Clustering w/ K-Means • Recommendation Generation

  15. K-Means Clustering Example • Let's cluster the Reuters data set together • A bunch (21,578, to be exact) of hand-classified news articles from the greatest year, 1987 • Steps! • Generate Sequence Files from the data • Generate Vectors from the Sequence Files • Run k-means

  16. K-Means Clustering: Convert the dataset into a Sequence File
• Download and extract the SGML files
$ wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
$ tar -xf reuters21578.tar.gz -C reuters-sgm/
• Extract content from SGML to text files
$ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm/ reuters-out/
$ hdfs dfs -put reuters-out .   # Takes a while...
• Use the seqdirectory tool to convert the text files into a Hadoop Sequence File
$ bin/mahout seqdirectory -i reuters-out -o reuters-out-seqdir -c UTF-8 -chunk 5

  17. Tangent: Writing to Sequence Files
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Say you have some documents array
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");
// Key and value are both Text: the document ID and the document content
SequenceFile.Writer writer =
    new SequenceFile.Writer(fs, conf, path, Text.class, Text.class);
for (int i = 0; i < MAX_DOCS; ++i) {
  writer.append(new Text(documents[i].getId()),
                new Text(documents[i].getContent()));
}
writer.close();
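
For completeness, a minimal sketch of reading those key/value pairs back, assuming the fs, path, and conf objects from the snippet above (this older SequenceFile.Reader constructor is deprecated in Hadoop 2 but still works):

// Read each (document ID, content) pair back out of the Sequence File
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
Text value = new Text();
while (reader.next(key, value)) {
  System.out.println(key + " => " + value.toString().length() + " characters");
}
reader.close();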

  18. Original File
$ cat reut2-000.sgm-30.txt
26-FEB-1987 15:43:14.36
U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

  19. Now, in the Sequence File
Key:    /reut2-000.sgm-30.txt
Value*: 26-FEB-1987 15:43:14.36 U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991
* The value contains the original newline characters

  20. K-Means Clustering: Generate Vectors from Sequence Files • Steps: • Compute the dictionary (assign an integer to each word) • Compute feature weights (a tf-idf sketch follows below) • Create a vector for each document using the word-integer mapping and feature weights • Or simply run:
$ mahout seq2sparse \
    -i mahout-work/reuters-out-seqdir/ \
    -o mahout-work/reuters-out-seqdir-sparse-kmeans
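
The "compute feature weights" step produces tf-idf-style weights; here is a minimal sketch of that formula in plain Java. The exact smoothing Mahout applies may differ, and all counts below are made up for illustration.

// tf-idf: term frequency in the document, scaled by how rare the term is
// across the corpus. Terms that are frequent here but rare overall get high weights.
int termFreqInDoc = 2;          // e.g. "tax" appears twice in the article
int numDocs = 21578;            // documents in the Reuters collection
int docsContainingTerm = 1500;  // hypothetical document frequency of "tax"

double tf = termFreqInDoc;
double idf = Math.log((double) numDocs / docsContainingTerm);
double weight = tf * idf;       // the value that ends up in the sparse vector
System.out.println("tf-idf weight = " + weight);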

  21. After seq2sparse
Key:   /reut2-000.sgm-30.txt
Value: {3960:1.0, 21578:1.0, 33629:1.0, 41511:1.0, 8361:1.0, 10882:1.0, 5405:1.0, 22224:1.0, 15528:1.0, 38507:2.0, 39687:1.0, 2737:1.0, 35909:1.0, 2962:1.0, 19078:1.0, 20362:1.0}

  22. Document to Integers to Vector (one document of many!)
Original text: 26-FEB-1987 15:43:14.36 U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991
Dictionary (word => integer): 14.36 => 2737, 15 => 2962, 1991 => 3960, 26 => 5405, 43 => 8361, 6.7 => 10882, billion => 15528, curbs => 19078, dlrs => 20362, estate => 21578, feb => 22224, raising => 33629, seek => 35909, tax => 38507, u.s => 39687, writers => 41511
Document vector (integer:weight): { 3960:1.0, 21578:1.0, 33629:1.0, 41511:1.0, 8361:1.0, 10882:1.0, 5405:1.0, 22224:1.0, 15528:1.0, 38507:2.0, 39687:1.0, 2737:1.0, 35909:1.0, 2962:1.0, 19078:1.0, 20362:1.0 }
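
As a rough programmatic view of the vector on this slide, here is a hedged sketch using Mahout's mahout-math classes. The class and method names (RandomAccessSparseVector, NamedVector, Vector.set) are from memory of the 0.x API, so verify them against your build; only a few of the sixteen terms are filled in.

import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// index = the dictionary integer for a term, value = its weight
Vector doc = new RandomAccessSparseVector(Integer.MAX_VALUE);
doc.set(38507, 2.0);  // "tax" appears twice
doc.set(21578, 1.0);  // "estate"
doc.set(20362, 1.0);  // "dlrs"
// ... and so on for the remaining terms in the dictionary above

// seq2sparse keeps the document key (the file name) alongside the vector
Vector named = new NamedVector(doc, "/reut2-000.sgm-30.txt");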

  23. K-Means Clustering: Run the kmeans program
$ mahout kmeans \
    -i reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c reuters-kmeans-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20
• Key parameters: distance measure, convergence delta, number of iterations, creating assignments (a sketch of the cosine distance used here follows below)
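
Since the command above selects CosineDistanceMeasure, here is a minimal, framework-free sketch of what cosine distance computes between two document vectors (plain Java, not Mahout's class; the toy vectors are made up):

// Cosine distance = 1 - cosine similarity. Documents that use words in the
// same proportions are close (distance near 0) regardless of their lengths.
static double cosineDistance(double[] a, double[] b) {
  double dot = 0, normA = 0, normB = 0;
  for (int i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Example over a tiny shared vocabulary:
// {2.0, 0.0, 1.0, 1.0} vs. {1.0, 0.0, 0.5, 0.5} => 0.0 (same direction)
// {2.0, 0.0, 1.0, 1.0} vs. {0.0, 3.0, 0.0, 0.0} => 1.0 (no words in common)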

  24.–27. K-Means Clustering
(Figures: four frames showing centroids c1, c2, and c3 and the surrounding points over successive assign/update iterations of the algorithm.)

  28. Inspect clusters
$ bin/mahout clusterdump \
    -s reuters-kmeans/clusters-9 \
    -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 10
One cluster's output:
:VL-16018{n=244 c=[0.1:0.054, 0.15:0.033, 0.16:0.038, 0.39:0.038, 0.5:0.104, 0.50:0.056, 0.66695:0.06
Top Terms:
    coffee    => 6.603762315922096
    ico       => 2.774544069024383
    quotas    => 2.553474168308446
    export    => 2.237574426854243
    bags      => 2.0811007140112703
    said      => 1.9646081743670292
    sales     => 1.8946112726555495
    year      => 1.7350609068010674
    brazil    => 1.7211509923465917
    producers => 1.6411271017105853

  29. FAQs • How to get rid of useless words? • Increase minSupport and/or decrease dfPercent • Use StopwordsAnalyzer • How to see document-to-cluster assignments? • Run the clustering process at the end of centroid generation using -cl • How to choose an appropriate weighting? • If it's long text, go with tf-idf; use normalization if documents differ in length • How to run this on a cluster? • Set the HADOOP_CONF directory to point to your Hadoop cluster's conf directory • How to scale? • Use a small value of k to partially cluster the data, then do full clustering on each cluster

  30. FAQs • How to choose k? • Figure it out based on the data you have, by trial and error • Or use Canopy Clustering and a distance threshold to figure it out • Or use Spectral Clustering • How to improve the similarity measurement? • Not all features are equal • A small weight difference on certain features can create a large semantic difference • Use WeightedDistanceMeasure (see the sketch below) • Or write a custom DistanceMeasure
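
To make the last bullets concrete, here is a minimal, framework-free sketch of a weighted Euclidean distance in plain Java. It illustrates the idea behind WeightedDistanceMeasure rather than implementing Mahout's DistanceMeasure interface, and the feature weights are made up.

// Weighted Euclidean distance: features with larger weights dominate the
// result, so a small numeric difference on an important feature matters more.
static double weightedDistance(double[] a, double[] b, double[] weights) {
  double sum = 0;
  for (int i = 0; i < a.length; i++) {
    double diff = a[i] - b[i];
    sum += weights[i] * diff * diff;
  }
  return Math.sqrt(sum);
}

// Example: feature 0 is ten times more important than feature 1
double[] weights = {10.0, 1.0};
double d = weightedDistance(new double[]{1.0, 5.0}, new double[]{2.0, 5.0}, weights);
// => sqrt(10) ≈ 3.16, versus 1.0 with uniform weights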

  31. Recommendations • Help users find items they might like based on their historical preferences • Based on an example by Sebastian Schelter in “Distributed Itembased Collaborative Filtering with Apache Mahout”

  32. Recommendations • A user-item rating matrix (items, per the prediction example below: The Matrix, Alien, Inception):
           The Matrix   Alien   Inception
Alice           5          1         4
Bob             ?          2         5
Peter           4          3         2

  33. Recommendations • Algorithm • Neighborhood-based approach • Works by finding similarly rated items in the user-item matrix (similarity measured with e.g. cosine, Pearson correlation, or the Tanimoto coefficient; a Tanimoto sketch follows below) • Estimates a user's preference towards an item by looking at his/her preferences towards similar items
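
As one concrete example of the similarity measures just listed, here is a minimal sketch of the Tanimoto coefficient between two items, computed over the sets of users who rated each (plain Java, not Mahout's implementation; the user IDs are made up):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Tanimoto coefficient = |A ∩ B| / (|A| + |B| - |A ∩ B|), ranging from 0 to 1
static double tanimoto(Set<Long> usersWhoRatedA, Set<Long> usersWhoRatedB) {
  Set<Long> both = new HashSet<>(usersWhoRatedA);
  both.retainAll(usersWhoRatedB);
  int common = both.size();
  return (double) common / (usersWhoRatedA.size() + usersWhoRatedB.size() - common);
}

// Example: two items rated by overlapping but not identical sets of users
double sim = tanimoto(new HashSet<>(Arrays.asList(1L, 2L, 3L)),
                      new HashSet<>(Arrays.asList(2L, 3L, 4L)));  // => 2 / 4 = 0.5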

  34. Recommendations • Prediction: estimate Bob's preference towards “The Matrix” • Look at all items that • a) are similar to “The Matrix” • b) have been rated by Bob => “Alien”, “Inception” • Estimate the unknown preference with a weighted sum

  35. Recommendations • MapReduce phase 1 • Map – Make user the key

  36. Recommendations • MapReduce phase 1 • Reduce – Create inverted index

  37. Recommendations • MapReduce phase 2 • Map – Isolate all co-occurred ratings (all cases where a user rated both items)

  38. Recommendations • MapReduce phase 2 • Reduce – Compute similarities
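
To make phases 1 and 2 concrete, here is a toy, single-machine sketch of the final step: computing a Pearson correlation from the co-occurred ratings of two items (plain Java, not the actual Mahout MapReduce code; the rating arrays are made up, and the ±0.47 values on the next slide come from the measure used in the original example, not from this sketch):

// Pearson correlation between two items over the users who rated both.
// x[i] and y[i] are the ratings the i-th common user gave item X and item Y.
static double pearson(double[] x, double[] y) {
  int n = x.length;
  double meanX = 0, meanY = 0;
  for (int i = 0; i < n; i++) { meanX += x[i] / n; meanY += y[i] / n; }
  double num = 0, denX = 0, denY = 0;
  for (int i = 0; i < n; i++) {
    double dx = x[i] - meanX, dy = y[i] - meanY;
    num += dx * dy;
    denX += dx * dx;
    denY += dy * dy;
  }
  return num / Math.sqrt(denX * denY);
}

// Toy co-ratings from four common users => strongly similar items (0.8)
double sim = pearson(new double[]{5, 4, 1, 2}, new double[]{4, 5, 2, 1});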

  39. Recommendations • Calculate the weighted sum, using the absolute values of the similarities in the denominator: (-0.47 × 2 + 0.47 × 5) / (0.47 + 0.47) = 1.5
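
A minimal sketch of that weighted-sum prediction in plain Java, using the similarity and rating values from this slide:

// Predict Bob's rating for "The Matrix" from his ratings of similar items,
// weighting each rating by the item-item similarity (absolute values in the
// denominator, matching the calculation above).
static double predict(double[] similarities, double[] ratings) {
  double num = 0, den = 0;
  for (int i = 0; i < ratings.length; i++) {
    num += similarities[i] * ratings[i];
    den += Math.abs(similarities[i]);
  }
  return num / den;
}

double bobMatrix = predict(new double[]{-0.47, 0.47},  // sim(Matrix, Alien), sim(Matrix, Inception)
                           new double[]{2, 5});        // Bob's ratings of Alien and Inception
// => (-0.47*2 + 0.47*5) / (0.47 + 0.47) = 1.5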

  40. Recommendations • The completed matrix, with Bob's predicted rating for The Matrix filled in:
           The Matrix   Alien   Inception
Alice           5          1         4
Bob            1.5         2         5
Peter           4          3         2

  41. Implementations in Mahout • ItemSimilarityJob • Computes all item similarities • Various configuration options: • Similarity measure to use (cosine, Pearson correlation, etc.) • Maximum number of similar items per item • Maximum number of co-occurrences to consider • Input: CSV file (userID, itemID, value) • Output: pairs of itemIDs with their associated similarity
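
As a hedged illustration, the CSV input for the toy matrix above might look like the following; the numeric IDs are made up (Alice=1, Bob=2, Peter=3; The Matrix=101, Alien=102, Inception=103), since the job only sees userID, itemID, and value:

1,101,5.0
1,102,1.0
1,103,4.0
2,102,2.0
2,103,5.0
3,101,4.0
3,102,3.0
3,103,2.0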

  42. Implementations in Mahout • RecommenderJob • Distributed item-based recommender • Various configuration options: • Similarity measure to use • Number of recommendations per user • Filter out some users or items • Input: CSV file (userID, itemID, value) • Output: userIDs with recommended itemIDs and their scores

  43. References • http://mahout.apache.org • http://isabel-drost.de/hadoop/slides/collabMahout.pdf • http://www.slideshare.net/OReillyOSCON/hands-on-mahout# • http://www.slideshare.net/urilavi/intro-to-mahout
