Enhancing Discovery with Solr and Mahout

Grant Ingersoll, Chief Scientist, Lucid Imagination

Evolution

Minding the Intersection

Topics • Background • Apache Mahout • Apache Solr and Lucene • Recommendations with Mahout • Collaborative Filtering • Discovery with Solr and Mahout • Discussion

Apache Lucene in a Nutshell • http://lucene.apache.org/java • Java-based Application Programming Interface (API) for adding search and indexing functionality to applications • Fast and efficient scoring and indexing algorithms • Many contributed modules make common tasks easier: highlighting, spatial search, query parsers, benchmarking tools, etc. • Most widely deployed search library on the planet

Apache Solr in a Nutshell • http://lucene.apache.org/solr • Lucene-based Search Server + other features and functionality • Access Lucene over HTTP: • Java, XML, Ruby, Python, .NET, JSON, PHP, etc. • Most programming tasks in Lucene are taken care of in Solr • Faceting (guided navigation, filters, etc.) • Replication and distributed search support • Lucene Best Practices

Apache Mahout in a Nutshell • http://dictionary.reference.com/browse/mahout • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • The Three C’s: • Collaborative Filtering (recommenders) • Clustering • Classification • Others: • Frequent Item Mining • Primitive collections • Math stuff

Recommendations with Mahout

Recommenders • Collaborative Filtering (CF) • Provide recommendations based solely on preferences expressed between users and items • “People who watched this also watched that” • Content-based Recommendations (CBR) • Provide recommendations based on the attributes of the items and the user profile • ‘Modern Family’ is a sitcom, Bob likes sitcoms • => Suggest Modern Family to Bob • Mahout is geared towards CF and can be extended to do CBR • Classification can also be used for CBR • Aside: search engines can also solve these problems

To Rate or Not? • In many instances, users don’t provide actual ratings • Clicks, views, etc. • Non-Boolean ratings can also often introduce unnecessary noise • Even a low rating often has a positive correlation with highly rated items in the real world • Example: Should we recommend Frankenstein to Bob?
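A minimal sketch of the Boolean-preference idea: rating values are dropped, any recorded interaction counts as a preference, and users are compared with the Tanimoto (Jaccard) coefficient. The user names, item names, and ratings are hypothetical, and the helper functions are illustrative, not Mahout's API.

```python
def to_boolean(ratings):
    """Drop rating values: any recorded interaction counts as a preference."""
    return {user: set(items) for user, items in ratings.items()}


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items divided by all items."""
    union = items_a | items_b
    if not union:
        return 0.0
    return len(items_a & items_b) / len(union)


# Hypothetical explicit ratings; note user1's low rating for item2.
ratings = {
    "user1": {"item1": 5.0, "item2": 1.0},
    "user2": {"item1": 4.0, "item2": 4.5, "item3": 2.0},
}

boolean_prefs = to_boolean(ratings)
# Two shared items out of three distinct items overall.
similarity = tanimoto(boolean_prefs["user1"], boolean_prefs["user2"])
```

With Boolean data, user1's low rating of item2 still counts as overlap with user2, which is the effect the slide describes.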

Collaborative Filtering with Mahout • Extensive framework for collaborative filtering • Recommenders • User based • Item based • Slope One • Online and Offline support • Offline can utilize Hadoop • (Diagram: recommendations for User X)

User Similarity • What should we recommend for User 1? • (Diagram: Users 1 to 4 and the items they prefer among Items 1 to 4)
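The user-based approach can be sketched roughly as follows: find users whose preferences overlap with the target user's, then score the items those neighbours have that the target lacks. The preference data is hypothetical (the original diagram's edges are not recoverable), and the function names are illustrative rather than Mahout's.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two users' item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(prefs, user, top_n=3):
    """Score unseen items by the summed similarity of the users who have them."""
    seen = prefs[user]
    scores = defaultdict(float)
    for other, items in prefs.items():
        if other == user:
            continue
        sim = jaccard(seen, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical Boolean preferences: user -> items.
prefs = {
    "user1": {"item1", "item2"},
    "user2": {"item1", "item2", "item3"},
    "user3": {"item2", "item4"},
    "user4": {"item4"},
}

recs = recommend_user_based(prefs, "user1")
```

Here user2 is the closest neighbour of user1, so item3 ranks ahead of item4.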

Item Similarity • What should we recommend for User 1? • (Diagram: Users 1 to 4 and the items they prefer among Items 1 to 4)
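For comparison, a rough item-based sketch: items are similar when the same users prefer them, and a candidate item is scored by its similarity to the items the target user already has. The data is the same hypothetical set used for illustration only, and the helpers are not Mahout classes.

```python
from collections import defaultdict


def invert(prefs):
    """Turn user -> items into item -> users."""
    by_item = defaultdict(set)
    for user, items in prefs.items():
        for item in items:
            by_item[item].add(user)
    return dict(by_item)


def item_similarity(by_item, a, b):
    """Jaccard overlap of the user sets behind two items."""
    union = by_item[a] | by_item[b]
    return len(by_item[a] & by_item[b]) / len(union) if union else 0.0


def recommend_item_based(prefs, user, top_n=3):
    """Score each unseen item by its total similarity to the user's items."""
    by_item = invert(prefs)
    seen = prefs[user]
    scores = {}
    for candidate in by_item:
        if candidate in seen:
            continue
        score = sum(item_similarity(by_item, candidate, s) for s in seen)
        if score > 0:
            scores[candidate] = score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical Boolean preferences: user -> items.
prefs = {
    "user1": {"item1", "item2"},
    "user2": {"item1", "item2", "item3"},
    "user3": {"item2", "item4"},
    "user4": {"item4"},
}

recs = recommend_item_based(prefs, "user1")
```

Item-to-item similarities depend only on the preference data, not on the target user, so they can be computed ahead of time.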

Slope One • Intuition: there is a linear relationship between rated items • Y = mX + b, where m = 1 • Solve for b up front from existing ratings: b = Y - X • Find the average difference in preference value for every pair of items • Online recommendation can be very fast, but requires up-front computation and memory • Example: User A’s two ratings differ by 3.5 - 2 = 1.5, so the estimate for Item 1 for User B is 3 + 1.5 = 4.5
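The arithmetic on this slide can be reproduced with a small sketch of weighted Slope One. It assumes one reading of the slide's numbers: User A rated Item 1 at 3.5 and Item 2 at 2, and User B rated Item 2 at 3. Function names are illustrative, not Mahout's.

```python
from collections import defaultdict


def average_diffs(ratings):
    """Average rating difference (a minus b) for every ordered item pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for a, rating_a in user_ratings.items():
            for b, rating_b in user_ratings.items():
                if a != b:
                    sums[(a, b)] += rating_a - rating_b
                    counts[(a, b)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, dict(counts)


def predict(ratings, diffs, counts, user, target):
    """Estimate the user's rating of target from each item they have rated."""
    numerator = 0.0
    denominator = 0
    for item, rating in ratings[user].items():
        pair = (target, item)
        if pair in diffs:
            # Weight each estimate by how many users rated both items.
            numerator += (rating + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


# Assumed reading of the slide's example.
ratings = {
    "A": {"item1": 3.5, "item2": 2.0},
    "B": {"item2": 3.0},
}

diffs, counts = average_diffs(ratings)           # the up-front computation
prediction = predict(ratings, diffs, counts, "B", "item1")  # 3 + 1.5
```

The pairwise differences are the part that costs memory; once they exist, each prediction is a handful of additions.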

Online and Offline Recommendations • Online • Predates Hadoop • Designed to run on a single node • Matrix size of ~100M interactions • API for integrating with your application • Offline • Hadoop-based • Designed to run on a large cluster • Several approaches: • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

RecommenderJob • Essentially does matrix multiplication using distributed techniques • $MAHOUT_HOME/examples/bin/asf-email-examples.sh • (Diagram: matrix multiplied by a vector)
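A single-machine sketch of the matrix multiplication idea, under the assumption that the matrix is an item-to-item co-occurrence matrix and the vector holds one user's preferences. The data and function names are hypothetical; the distributed job itself is not shown.

```python
def cooccurrence_matrix(prefs, items):
    """Count, for each item pair, how many users prefer both items."""
    index = {item: i for i, item in enumerate(items)}
    size = len(items)
    matrix = [[0] * size for _ in range(size)]
    for user_items in prefs.values():
        for a in user_items:
            for b in user_items:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


def multiply(matrix, vector):
    """Plain matrix-vector product: one score per item."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


items = ["item1", "item2", "item3", "item4"]
# Hypothetical Boolean preferences: user -> items.
prefs = {
    "user1": {"item1", "item2"},
    "user2": {"item1", "item2", "item3"},
    "user3": {"item2", "item4"},
    "user4": {"item4"},
}

matrix = cooccurrence_matrix(prefs, items)
user1_vector = [1 if item in prefs["user1"] else 0 for item in items]
scores = multiply(matrix, user1_vector)
```

Items the user already has are filtered out afterwards; among the remaining ones, item3 scores higher than item4 for user1.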

Discovery with Solr

Discovery with Solr • Goals: • Guide users to results without having to guess at keywords • Encourage serendipity • Never show empty results • Out of the Box: • Faceting • Spell Checking • More Like This • Clustering (Carrot2) • Extend • Clustering (with Mahout) • Frequent Item Mining (with Mahout)

Clustering • Automatically group similar content together to help users discover related items and/or avoid repetitive content • Solr has search result clustering • Pluggable • Default implementation uses Carrot2 • Mahout has Hadoop-based large-scale clustering • K-Means, MinHash, Dirichlet, Canopy, Spectral, etc.

Discovery In Action
• Pre-reqs: Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
  • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  • cd solr-trunk/solr/
  • ant example
  • cd example
  • java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2:
  • cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Solr + Mahout

Basics • Most Mahout tasks are offline • Solr provides many touch points for integration: • ClusteringEngine: clustering results • SearchComponent: suggestions (related searches, clusters, MLT, spellchecking) • UpdateProcessor: classification of documents • FunctionQuery

Example: Frequent Itemset Mining • Discover frequently co-occurring items • Use case: related searches from Solr logs • Hadoop and sequential versions • Parallel FP-Growth • Input: • <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE • Comma and pipe are also allowed as delimiters
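To make the goal concrete, here is a brute-force sketch that parses lines in the input format above and counts itemsets that appear in at least a minimum number of transactions. It is not FP-Growth (which avoids enumerating every combination); the sample lines, `min_support`, and `max_size` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def parse_line(line):
    """Parse '<optional id>TAB<token> <token> ...' into a list of tokens."""
    _, _, tokens = line.rpartition("\t")
    # Space, comma, and pipe are all accepted as delimiters.
    return [t for t in re.split(r"[\s,|]+", tokens) if t]


def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Count every itemset up to max_size and keep the frequent ones."""
    counts = Counter()
    for tokens in transactions:
        unique = sorted(set(tokens))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical input lines; the document id before the tab is optional.
lines = [
    "1\tsolr faceting",
    "2\tsolr faceting tutorial",
    "solr,clustering",
    "mahout|clustering",
]

transactions = [parse_line(line) for line in lines]
frequent = frequent_itemsets(transactions)
```

In this toy data, "faceting" and "solr" co-occur twice, which is the kind of pair that could drive a related-searches suggestion.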

FIM on Solr Query Logs
• Goal:
  • Extract user queries from Solr logs
  • Feed into FIM to generate Related Keyword Searches
• Context:
  • Solr Query logs
• bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
• bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
• bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
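The regular expression passed to regexconverter can be tried out on its own. The sketch below applies the same pattern with Python's `re` module and URL-decodes the result; the request lines are hypothetical and simplified, and the decoding step is an assumption about what is wanted before mining, not a description of Mahout's transformer.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as on the slide: the text after "?q=" or "&q=",
# up to the next "&" or the end of the line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(log_lines):
    """Return the decoded q parameter from each line that has one."""
    queries = []
    for line in log_lines:
        match = QUERY_PATTERN.search(line)
        if match and match.group(0):
            queries.append(unquote_plus(match.group(0)))
    return queries


# Hypothetical, simplified request lines (not a real Solr log format).
sample = [
    "/solr/select?q=chris+hostetter&rows=10",
    "/solr/select?rows=5&q=faceted%20search",
    "/solr/admin/ping",
]

queries = extract_queries(sample)
```

Each extracted query then becomes one transaction of tokens for the frequent itemset step.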

Output • Key: Chris: Value: ([Chris, Hostetter],870), ([Chris],870), ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18), ([Search, Faceted, Chris, Hostetter, Webcast, Power],18), ([Search, Faceted, Chris, Hostetter],18), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12), ([Solr, new, Chris, Hostetter, webcast, along],12), ([Solr, new, Chris, Hostetter, webcast],12), ([Solr, new, Chris, Hostetter],12)

Resources • http://lucene.apache.org • http://mahout.apache.org • http://manning.com/owen • http://manning.com/ingersoll • http://www.lucidimagination.com • grant@lucidimagination.com • @gsingers

Appendix

Mahout Overview • (Diagram layers: Applications; Examples; Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders; Utilities/Integration: Lucene/Vectorizer; Math: Vectors/Matrices/SVD; Collections (primitives); Apache Hadoop) • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
