Enhancing Discovery with Solr and Mahout

Enhancing Discovery with Solr and Mahout. Grant Ingersoll Chief Scientist Lucid Imagination. Evolution. Minding the Intersection. Topics. Background Apache Mahout Apache Solr and Lucene Recommendations with Mahout Collaborative Filtering Discovery with Solr and Mahout Discussion.

Presentation Transcript


  1. Enhancing Discovery with Solr and Mahout Grant Ingersoll Chief Scientist Lucid Imagination

  2. Evolution

  3. Minding the Intersection

  4. Topics • Background • Apache Mahout • Apache Solr and Lucene • Recommendations with Mahout • Collaborative Filtering • Discovery with Solr and Mahout • Discussion

  5. Apache Lucene in a Nutshell • http://lucene.apache.org/java • Java based Application Programming Interface (API) for adding search and indexing functionality to applications • Fast and efficient scoring and indexing algorithms • Lots of contributions to make common tasks easier: • Highlighting, spatial, Query Parsers, Benchmarking tools, etc. • Most widely deployed search library on the planet

  6. Apache Solr in a Nutshell • http://lucene.apache.org/solr • Lucene-based Search Server + other features and functionality • Access Lucene over HTTP: • Java, XML, Ruby, Python, .NET, JSON, PHP, etc. • Most programming tasks in Lucene are taken care of in Solr • Faceting (guided navigation, filters, etc.) • Replication and distributed search support • Lucene Best Practices

  7. Apache Mahout in a Nutshell http://dictionary.reference.com/browse/mahout • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • The Three C’s: • Collaborative Filtering (recommenders) • Clustering • Classification • Others: • Frequent Item Mining • Primitive collections • Math stuff

  8. Recommendations with Mahout

  9. Recommenders • Collaborative Filtering (CF) • Provide recommendations solely based on preferences expressed between users and items • “People who watched this also watched that” • Content-based Recommendations (CBR) • Provide recommendations based on the attributes of the items and user profile • ‘Modern Family’ is a sitcom, Bob likes sitcoms • => Suggest Modern Family to Bob • Mahout geared towards CF, can be extended to do CBR • Classification can also be used for CBR • Aside: search engines can also solve these problems

  10. To Rate or Not? • In many instances, users don't provide actual ratings • Clicks, views, etc. • Non-Boolean ratings can also often introduce unnecessary noise • Even a low rating often has a positive correlation with highly rated items in the real world • Example: Should we recommend Frankenstein to Bob?

  11. Collaborative Filtering with Mahout • Extensive framework for collaborative filtering • Recommenders • User based • Item based • Slope One • Online and Offline support • Offline can utilize Hadoop [Figure: Recommendations for User X]

  12. User Similarity • What should we recommend for User 1? [Figure: Users 1-4 and Items 1-4]
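
The question on this slide can be sketched in a few lines of Python. This is not Mahout's implementation; it is a minimal illustration of user-based collaborative filtering, assuming cosine similarity over co-rated items and a similarity-weighted average for scoring. The `ratings` data and all function names are hypothetical.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: preference value}
ratings = {
    "user1": {"item1": 5.0, "item2": 3.0},
    "user2": {"item1": 5.0, "item2": 3.0, "item3": 4.0},
    "user3": {"item1": 1.0, "item2": 5.0, "item4": 2.0},
    "user4": {"item3": 2.0, "item4": 4.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend_user_based(user, ratings, how_many=2):
    """Score items the user has not rated by a similarity-weighted average."""
    totals, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], prefs)
        if sim <= 0:
            continue  # users with nothing in common contribute nothing
        for item, value in prefs.items():
            if item in ratings[user]:
                continue  # only score unseen items
            totals[item] = totals.get(item, 0.0) + sim * value
            weights[item] = weights.get(item, 0.0) + sim
    scored = [(item, totals[item] / weights[item]) for item in totals]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:how_many]


user_recs = recommend_user_based("user1", ratings)
print(user_recs)
```

With this toy data, user2 has the same tastes as user1 on the items they share, so user2's rating of item3 drives the top recommendation.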

  13. Item Similarity • What should we recommend for User 1? [Figure: Users 1-4 and Items 1-4]
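
Item-based recommendation asks the same question from the other direction: which unseen items look like the ones User 1 already has? The sketch below is illustrative only and is not taken from Mahout. It assumes Boolean preferences (views rather than ratings, as discussed on the "To Rate or Not?" slide) and uses the Tanimoto (Jaccard) coefficient between the sets of users who touched each item. The `views` data and function names are hypothetical.

```python
# Hypothetical Boolean preferences: user -> set of items viewed
views = {
    "user1": {"item1", "item2"},
    "user2": {"item1", "item2", "item3"},
    "user3": {"item2", "item3"},
    "user4": {"item4"},
}


def users_by_item(views):
    """Invert the data: item -> set of users who viewed it."""
    index = {}
    for user, items in views.items():
        for item in items:
            index.setdefault(item, set()).add(user)
    return index


def tanimoto(users_a, users_b):
    """Size of the intersection divided by size of the union."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def recommend_item_based(user, views, how_many=2):
    """Score each unseen item by its summed similarity to the user's items."""
    index = users_by_item(views)
    seen = views[user]
    scores = {}
    for candidate, fans in index.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(tanimoto(fans, index[s]) for s in seen)
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:how_many]


item_recs = recommend_item_based("user1", views)
print(item_recs)
```

Here item3 shares viewers with both of user1's items, while item4 shares none, so item3 ranks first.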

  14. Slope One • Intuition: There is a linear relationship between rated items • Y = mX + b where m = 1 • Solve for b up front based on existing ratings: b = (Y - X) • Find the average difference in preference value for every pair of items • Online can be very fast, but requires up-front computation and memory • Example: User A's ratings differ by 3.5 - 2 = 1.5, so the prediction for Item 1 for User B = 3 + 1.5 = 4.5
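
The arithmetic on this slide can be reproduced with a short sketch. It assumes the slide's numbers mean that User A rated Item 1 as 3.5 and Item 2 as 2, and that User B rated only Item 2 as 3; that reading, the data layout, and the function names are assumptions. The prediction step uses the common weighted form of Slope One, which may differ in detail from Mahout's implementation.

```python
# Assumed reading of the slide's example: user -> {item: rating}
slope_ratings = {
    "userA": {"item1": 3.5, "item2": 2.0},
    "userB": {"item2": 3.0},
}


def average_differences(ratings):
    """Up-front step: average (rating_i - rating_j) over users who rated both."""
    sums, counts = {}, {}
    for prefs in ratings.values():
        for i, r_i in prefs.items():
            for j, r_j in prefs.items():
                if i == j:
                    continue
                sums[(i, j)] = sums.get((i, j), 0.0) + (r_i - r_j)
                counts[(i, j)] = counts.get((i, j), 0) + 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts


def predict(user, target, ratings, diffs, counts):
    """Online step: Y = X + b for each rated item, weighted by pair support."""
    numerator = denominator = 0.0
    for item, value in ratings[user].items():
        pair = (target, item)
        if pair in diffs:
            numerator += (value + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


diffs, counts = average_differences(slope_ratings)
prediction = predict("userB", "item1", slope_ratings, diffs, counts)
print(prediction)  # 4.5, matching the slide
```

The `diffs` table is the part that must be computed and held in memory ahead of time; once it exists, each prediction is a handful of additions.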

  15. Online and Offline Recommendations • Online • Predates Hadoop • Designed to run on a single node • Matrix size of ~ 100M interactions • API for integrating with your application • Offline • Hadoop based • Designed to run on large cluster • Several approaches: • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

  16. RecommenderJob • Essentially does matrix multiplication using distributed techniques • $MAHOUT_HOME/bin/examples/asf-email-examples.sh [Figure: matrix multiplication]
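
To make the "matrix multiplication" idea concrete, here is a single-machine sketch. It assumes the item-item matrix holds plain co-occurrence counts and that a user's preferences are a Boolean vector; the real job distributes this work over Hadoop and supports other similarity measures, none of which is shown here. All data and names are hypothetical.

```python
items = ["item1", "item2", "item3"]

# Hypothetical interaction data: user -> items they interacted with
user_items = {
    "user1": ["item1", "item2"],
    "user2": ["item1", "item2", "item3"],
    "user3": ["item2", "item3"],
}


def cooccurrence_matrix(user_items, items):
    """Count how often each pair of distinct items appears for the same user."""
    pos = {item: k for k, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for basket in user_items.values():
        for a in basket:
            for b in basket:
                if a != b:
                    matrix[pos[a]][pos[b]] += 1
    return matrix


def multiply(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(row[k] * vector[k] for k in range(len(vector))) for row in matrix]


matrix = cooccurrence_matrix(user_items, items)
user1_vector = [1, 1, 0]  # user1 has item1 and item2, but not item3
scores = multiply(matrix, user1_vector)
print(dict(zip(items, scores)))
```

The highest score among items the user does not already have (item3 here) becomes the recommendation.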

  17. Discovery with Solr

  18. Discovery with Solr • Goals: • Guide users to results without having to guess at keywords • Encourage serendipity • Never show empty results • Out of the Box: • Faceting • Spell Checking • More Like This • Clustering (Carrot2) • Extend • Clustering (with Mahout) • Frequent Item Mining (with Mahout)

  19. Clustering • Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content • Solr has search result clustering • Pluggable • Default implementation uses Carrot2 • Mahout has Hadoop based large scale clustering • K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

  20. Discovery In Action • Pre-reqs: • Apache Ant 1.7.x, Subversion (SVN) • Command Line 1: • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk • cd solr-trunk/solr/ • ant example • cd example • java -Dsolr.clustering.enabled=true -jar start.jar • Command Line 2: • cd exampledocs; java -jar post.jar *.xml • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

  21. Solr + Mahout

  22. Basics • Most Mahout tasks are offline • Solr provides many touch points for integration: • ClusteringEngine • Clustering results • SearchComponent • Suggestions – Related searches, clusters, MLT, spellchecking • UpdateProcessor • Classification of documents • FunctionQuery

  23. Example: FrequentItemset Mining • Discover frequently co-occurring items • Use Case: Related Searches from Solr Logs • Hadoop and sequential versions • Parallel FP Growth • Input: • <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE • Comma, pipe also allowed as delimiters
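
As a rough illustration of what "frequently co-occurring items" means for input in this format, the sketch below counts itemsets by brute force. It is not FP-Growth and not Mahout code: it simply enumerates small itemsets per line and keeps those that meet a minimum support. The sample lines, the `min_support` and `max_size` parameters, and the function name are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input lines: space-delimited tokens, no document id
transactions = [
    "solr faceting",
    "solr faceting tutorial",
    "mahout clustering",
    "solr tutorial",
]


def frequent_itemsets(lines, min_support=2, max_size=2):
    """Count every itemset up to max_size and keep those seen often enough."""
    counts = Counter()
    for line in lines:
        tokens = sorted(set(line.split()))  # de-duplicate within a line
        for size in range(1, max_size + 1):
            for itemset in combinations(tokens, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent = frequent_itemsets(transactions)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

Brute-force enumeration grows quickly with line length, which is one reason to use an algorithm such as FP-Growth on real logs.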

  24. FIM on Solr Query Logs • Goal: • Extract user queries from Solr logs • Feed into FIM to generate Related Keyword Searches • Context: • Solr Query logs • bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg • bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce • bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
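
The extraction step can be approximated outside Mahout to see what it produces. The sketch below is an assumption-laden stand-in for the regex conversion: it uses a simplified pattern with a capturing group instead of the slide's lookaround regex, URL-decodes the `q` parameter, and writes one space-delimited line per query. The sample log lines are hypothetical and do not reflect the exact Solr log format.

```python
import re
from urllib.parse import unquote_plus

# Simplified variant of the slide's regex: capture the value of the q parameter
QUERY_PARAM = re.compile(r"[?&]q=([^&\s]*)")


def queries_to_fpg_lines(log_lines):
    """Turn each logged query into a space-delimited token line."""
    lines = []
    for line in log_lines:
        match = QUERY_PARAM.search(line)
        if not match:
            continue  # request without a q parameter
        tokens = unquote_plus(match.group(1)).split()
        if tokens:
            lines.append(" ".join(tokens))
    return lines


# Hypothetical request lines for illustration only
sample_log = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "GET /solr/select?rows=10&q=faceted%20search",
    "GET /solr/admin/ping",
]

fpg_lines = queries_to_fpg_lines(sample_log)
print(fpg_lines)
```

Each resulting line is one "transaction" of query terms, which is the shape the frequent itemset step expects.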

  25. Output • Key: Chris: Value: ([Chris, Hostetter],870), ([Chris],870), ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18), ([Search, Faceted, Chris, Hostetter, Webcast, Power],18), ([Search, Faceted, Chris, Hostetter],18), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12), ([Solr, new, Chris, Hostetter, webcast, along],12), ([Solr, new, Chris, Hostetter, webcast],12), ([Solr, new, Chris, Hostetter],12)

  26. Resources • http://lucene.apache.org • http://mahout.apache.org • http://manning.com/owen • http://manning.com/ingersoll • http://www.lucidimagination.com • grant@lucidimagination.com • @gsingers

  27. Appendix

  28. Mahout Overview • Applications • Examples • Genetic • Freq. Pattern Mining • Classification • Clustering • Recommenders • Utilities/Integration • Lucene/Vectorizer • Math • Vectors/Matrices/SVD • Collections (primitives) • Apache Hadoop • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
