


Introducing Apache Mahout

Scalable Machine Learning for All!

Grant Ingersoll

Lucid Imagination


Overview

What is Machine Learning?

Mahout


Definition

“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”

Introduction to Machine Learning, by E. Alpaydin

Subset of Artificial Intelligence

Draws on many other fields: computer science, biology, mathematics, psychology, etc.


Types

Supervised

Using labeled training data, create a function that predicts the output for unseen inputs

Unsupervised

Using unlabeled data, create a function that predicts output by finding structure in the data, with no known correct answers to learn from

Semi-Supervised

Uses labeled and unlabeled data


Characterizations

Lots of Data

Identifiable Features in that Data

Too big/costly for people to handle

People still can help


Clustering

Unsupervised

Find Natural Groupings

Documents

Search Results

People

Genetic traits in groups

Many, many more uses
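
To make "find natural groupings" concrete, here is a minimal k-means sketch in plain Python. It is not Mahout code: the function name `kmeans`, the toy 2-D points, and the choice to seed the centers with the first k points are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumption for this sketch: use the first k points as initial centers.
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment


# Two visually obvious groups, interleaved in the input order.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, assignment = kmeans(points, k=2)
```

No labels are supplied anywhere; the groups emerge only from distances between points, which is what makes clustering unsupervised.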


Example: Clustering

Google News


Collaborative Filtering

Unsupervised

Recommend people and products

User-User

A user similar to you likes X, so you might too

Item-Item

People who bought X also bought Y
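
The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. This is an illustrative sketch only, not how Taste is implemented; the helper names `build_cooccurrence` and `also_bought` and the sample baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    co = defaultdict(Counter)
    for basket in baskets:
        # set() so that a repeated item in one basket is counted once.
        for a, b in permutations(set(basket), 2):
            co[a][b] += 1
    return co


def also_bought(co, item, n=3):
    """Return up to n items most often bought together with `item`."""
    return [other for other, _ in co.get(item, Counter()).most_common(n)]


baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["book", "lamp"],
    ["desk", "pen"],
]
co = build_cooccurrence(baskets)
suggestions = also_bought(co, "book", n=1)
```

A production recommender would normally normalize these counts so that very popular items do not dominate every list.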


Example: Collaborative Filtering

Amazon.com


Classification/Categorization

Many, many types

Spam Filtering

Named Entity Recognition

Phrase Identification

Sentiment Analysis

Classification into a Taxonomy
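
As an illustration of the first use case, spam filtering, here is a small multinomial Naive Bayes classifier with add-one smoothing. It is a sketch under simplifying assumptions (whitespace tokenization, a hypothetical `NaiveBayes` class, four made-up training messages), not Mahout's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of log likelihoods (add-one smoothed).
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train("spam", "win cash now")
nb.train("spam", "cheap cash offer")
nb.train("ham", "meeting at noon")
nb.train("ham", "lunch at noon tomorrow")
label = nb.classify("cash offer now")
```

Unlike clustering, every training example here carries a label, so this is supervised learning.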


Example: NER

NER?

Excerpt from Yahoo News


Example: Categorization


Information Retrieval

Learning Ranking Functions

Learning Spelling Corrections

User Click Analysis and Tracking


Other

Image Analysis

Robotics

Games

Higher level natural language processing

Many, many others


What is Apache Mahout?

A Mahout is an elephant trainer/driver/keeper, hence…

Hadoop (and other distributed techniques) + Machine Learning = Mahout


What?

Hadoop brings:

Map/Reduce API

HDFS

In other words, scalability and fault-tolerance

Mahout brings:

Library of machine learning algorithms

Examples
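
The Map/Reduce model that Hadoop provides can be pictured with an in-memory word count. The sketch below only imitates the three phases (map, shuffle, reduce) in a single Python process; it does not use the Hadoop API, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job(["mahout rides hadoop", "hadoop scales", "mahout learns"])
```

Because each map call sees one record and each reduce call sees one key, the work can be spread over many machines; that is where the scalability comes from.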


Why Mahout?

Many Open Source ML libraries either:

Lack Community

Lack Documentation and Examples

Lack Scalability

Lack the Apache License ;-)

Or are research-oriented


Why Mahout?

Intelligent Apps are the Present and Future

Thus, Mahout’s Goal is:

Scalable Machine Learning with Apache License


Current Status

What’s in it:

Simple Matrix/Vector library

Taste Collaborative Filtering

Clustering

Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet

Classifiers

Naïve Bayes

Complementary NB

Evolutionary

Integration with Watchmaker for fitness function


How?

Examples

Taste

Clustering

Classification

Evolutionary


Taste: Movie Recommendations

Given ratings by users of movies, recommend other movies

http://lucene.apache.org/mahout/taste.html#demo
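
The movie recommendation task can be sketched as a user-based recommender: find users whose ratings correlate with yours, then rank movies you have not seen by their similarity-weighted ratings. This is plain Python, not the Taste API; the functions `pearson` and `recommend` and the tiny ratings table are assumptions for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation over the movies both users rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[m] for m in common]
    ys = [b[m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0


def recommend(ratings, user, n=2):
    """Rank unseen movies using ratings from positively correlated users."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with dissimilar or unrelated taste
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                totals[movie] = totals.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    scored = {m: totals[m] / weights[m] for m in totals}
    return sorted(scored, key=scored.get, reverse=True)[:n]


ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Casablanca": 2, "Dune": 5, "Eraserhead": 2},
    "cat": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1, "Eraserhead": 5},
}
picks = recommend(ratings, "ann")
```

Here "ann" agrees with "bob" and disagrees with "cat", so only bob's ratings influence her recommendations.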


Taste Demo

http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true

http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true


Clustering: Synthetic Control Data

http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

Each clustering implementation has an example Job that can be run from <MAHOUT_HOME>/examples

org.apache.mahout.clustering.syntheticcontrol.*

Outputs clusters…


Classification: NB and CNB Examples

20 Newsgroups

http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups

Wikipedia

http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample


Evolutionary

Traveling Salesman

http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman

Class Discovery

http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
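
To show what an evolutionary approach to the Traveling Salesman problem involves, here is a deliberately small sketch: the fitness function is the tour length, and each generation keeps the fitter half of the population and refills it with mutated copies. It uses mutation only (no crossover) and does not use Watchmaker or Mahout; all names, parameters, and city coordinates are hypothetical.

```python
import math
import random


def tour_length(tour, cities):
    """Fitness: total length of the closed tour (lower is better)."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def mutate(tour, rng):
    """Return a copy of the tour with two randomly chosen stops swapped."""
    i, j = rng.sample(range(len(tour)), 2)
    child = list(tour)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(cities, population_size=30, generations=200, seed=1):
    rng = random.Random(seed)
    base = list(range(len(cities)))
    # Start from random permutations of the city indices.
    population = [rng.sample(base, len(base)) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, refill with mutated survivors.
        population.sort(key=lambda t: tour_length(t, cities))
        survivors = population[: population_size // 2]
        population = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return min(population, key=lambda t: tour_length(t, cities))


cities = [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0), (2, 1)]
best = evolve(cities)
best_length = tour_length(best, cities)
```

Evaluating the fitness of every candidate is the expensive and independent part of each generation, which is why it is a natural piece to distribute.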


What’s Next?

More Examples

Winnow/Perceptron (MAHOUT-85)

Text Clustering

Association Rules (MAHOUT-108)

Logistic Regression

Solr Integration (SOLR-769)

Google Summer of Code (GSoC)


When, Who

When? Now!

Mahout is growing

Who? You!

We want programmers who:

Are comfortable with math

Like to work on hard problems

We want others to:

Kick the tires


Where?

  • http://lucene.apache.org/mahout

  • Hadoop - http://hadoop.apache.org

  • http://cwiki.apache.org/MAHOUT

  • mahout-{user|dev}@lucene.apache.org

  • http://www.lucidimagination.com/search/p:mahout


Resources

“Programming Collective Intelligence” by Segaran

“Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank

“Taming Text” by Ingersoll and Morton

