
An Introduction to Machine Learning with Perl

February 3, 2003

O’Reilly Bioinformatics Conference

Ken Williams

ken@mathforum.org

Tutorial Overview
  • What is Machine Learning? (20’)
  • Why use Perl for ML? (15’)
  • Some theory (20’)
  • Some tools (30’)
  • Decision trees (20’)
  • SVMs (15’)
  • Categorization (40’)
References & Sources
  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997
  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999
  • Perl-AI list (perl-ai@perl.org)
What Is Machine Learning?
  • A subfield of Artificial Intelligence (but without the baggage)
  • Usually concerns some particular task, not the building of a sentient robot
  • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience
Typical ML Tasks
  • Clustering
  • Categorization
  • Recognition
  • Filtering
  • Game playing
  • Autonomous performance
Typical ML Tasks
  • Categorization

Typical ML Tasks
  • Recognition (face examples from the slide: Vincent Van Gogh, Michael Stipe, Mohammed Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland)

Typical ML Tasks
  • Recognition (speech/handwriting examples from the slide: "Little red corvette", "The kids are all right", "The rain in Spain", "Bort bort bort")

Typical ML Tasks
  • Game playing

Typical ML Tasks
  • Autonomous performance
Typical ML Buzzwords
  • Data Mining
  • Knowledge Management (KM)
  • Information Retrieval (IR)
  • Expert Systems
  • Topic detection and tracking
Who does ML?
  • Two main groups: research and industry
  • These groups do listen to each other, at least somewhat
  • Not many reusable ML/KM components, outside of a few commercial systems
  • KM is seen as a key component of big business strategy - lots of KM consultants
  • ML is an extremely active research area with relatively low “cost of entry”
When is ML useful?
  • When you have lots of data
  • When you can’t hire enough people, or when people are too slow
  • When you can afford to be wrong sometimes
  • When you need to find patterns
  • When you have nothing to lose
An aside on your presenter
  • Academic background in math & music (not computer science or even statistics)
  • Several years as a Perl consultant
  • Two years as a math teacher
  • Currently studying document categorization at The University of Sydney
  • In other words, a typical ML student
Why use Perl for ML?
  • CPAN - the viral solution™
  • Perl has rapid reusability
  • Perl is widely deployed
  • Perl code can be written quickly
  • Embeds both ways
  • Human-oriented development
  • Leaves your options open
But what about all the data?
  • ML techniques tend to use lots of data in complicated ways
  • Perl is great at data in general, but tends to gobble memory or forego strict checking
  • Two fine solutions exist:
    • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.) - see the sketch below
    • Use PDL or Inline (more on these later)
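A minimal sketch of the first option, using Params::Validate to enforce argument types the way a C signature would. The train() routine and its parameter names are invented for illustration:

use strict;
use warnings;
use Params::Validate qw(:all);

# Hypothetical training routine; the parameter names are made up for
# illustration.  validate() dies on unknown, missing, or mistyped arguments.
sub train {
    my %args = validate(@_, {
        examples   => { type => ARRAYREF },
        iterations => { type => SCALAR, default => 100 },
    });
    # ... real training would happen here ...
    return scalar @{ $args{examples} };
}

print train(examples => [1, 2, 3]), "\n";   # prints 3
# train(examples => "oops");                # would die: not an array reference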
Interfaces vs. Implementations
  • In ML applications, we need both data integrity and the ability to “play with it”
  • Perl wrappers around C/C++ structures/objects are a nice balance
  • Keeps high-level interfaces in Perl, low-level implementations in C/C++
  • Can be prototyped in pure Perl, with C/C++ parts added later
Some ML Theory and Terminology
  • ML concerns learning a target function from a set of examples
  • The target function is often called a hypothesis
  • Example: with a neural network, a trained network is a hypothesis
  • The set of all possible target functions is called the hypothesis space
  • The training process can be considered a search through the hypothesis space
Some ML Theory and Terminology
  • Each ML technique will
    • probably exclude some hypotheses
    • prefer some hypotheses over others
  • A technique’s exclusion & preference rules are called its inductive bias
  • If it ain’t biased, it ain’t learnin’
    • No bias = rote learning
    • Bias = generalization
  • Example: kids learning multiplication (understanding vs. memorization)
Some ML Theory and Terminology
  • Ideally, a ML technique will
    • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis
    • prefer the target hypothesis over others
  • Measuring the degree to which these criteria are satisfied is important and sometimes complicated
Evaluating Hypotheses
  • We often want to know how good a hypothesis is
    • 1) To know how it performs in the real world
    • 2) To improve the learning technique or tune its parameters
    • 3) To let the learner automatically improve the hypothesis
  • Usually evaluate on test data
    • Test data must be kept separate from training data
    • Test data used for purpose 3) is usually called validation or held-out data
    • Training, validation, and test data should not contaminate each other (a minimal splitting sketch follows below)
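A minimal illustration of keeping the three sets separate - just a shuffle-and-cut over placeholder data, not part of any module discussed here:

use strict;
use warnings;
use List::Util qw(shuffle);

# Placeholder "examples" - in practice these would be labelled instances
my @all_examples = (1 .. 100);

# Shuffle once, then cut into 70% training / 15% validation / 15% test,
# so the three sets never overlap.
my @shuffled = shuffle @all_examples;
my $n_train  = int(0.70 * @shuffled);
my $n_valid  = int(0.15 * @shuffled);

my @train      = @shuffled[0 .. $n_train - 1];
my @validation = @shuffled[$n_train .. $n_train + $n_valid - 1];
my @test       = @shuffled[$n_train + $n_valid .. $#shuffled];

printf "train=%d validation=%d test=%d\n",
       scalar @train, scalar @validation, scalar @test;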
Evaluating Hypotheses
  • Some standard statistical measures are useful
  • Error rate, accuracy, precision, recall, F1
  • Calculated using contingency tables
Evaluating Hypotheses
  • In the contingency table, a = items assigned to the category and correct, b = assigned but incorrect, c = correct but not assigned, d = neither (i.e., true positives, false positives, false negatives, true negatives)
  • Error = (b+c)/(a+b+c+d)
  • Accuracy = (a+d)/(a+b+c+d)
  • Precision = p = a/(a+b)
  • Recall = r = a/(a+c)
  • F1 = 2pr/(p+r)

Precision is easy to maximize by assigning nothing

Recall is easy to maximize by assigning everything

F1 combines precision and recall equally
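A small sketch computing these measures straight from the four contingency counts; the counts themselves are placeholders:

use strict;
use warnings;

# Placeholder contingency counts for one category:
# a = assigned & correct (true positives), b = assigned & incorrect (false
# positives), c = correct but not assigned (false negatives), d = neither.
my ($tp, $fp, $fn, $tn) = (57, 10, 23, 3910);

my $error     = ($fp + $fn) / ($tp + $fp + $fn + $tn);
my $accuracy  = ($tp + $tn) / ($tp + $fp + $fn + $tn);
my $precision = $tp / ($tp + $fp);
my $recall    = $tp / ($tp + $fn);
my $f1        = 2 * $precision * $recall / ($precision + $recall);

printf "P=%.3f  R=%.3f  F1=%.3f  Err=%.4f\n",
       $precision, $recall, $f1, $error;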

Evaluating Hypotheses
  • Example (from categorization)
  • Note that precision is higher than recall - indicates a cautious categorizer

Precision = 0.851, Recall = 0.711, F1 = 0.775

These scores depend on the task - can’t compare scores across tasks

Often useful to compare categories separately, then average (macro-averaging)

Evaluating Hypotheses
  • The Statistics::Contingency module (on CPAN) helps calculate these figures:

use Statistics::Contingency;

my $s = new Statistics::Contingency;
while (...) {
    ... Do some categorization ...
    $s->add_result($assigned, $correct);
}

print "Micro F1: ", $s->micro_F1, "\n";
print $s->stats_table;

Micro F1: 0.774803607797498
+-------------------------------------------------+
|   miR   miP  miF1   maR   maP  maF1   Err       |
| 0.243 0.843 0.275 0.711 0.851 0.775 0.006       |
+-------------------------------------------------+

Useful Perl Data-Munging Tools
  • Storable - cheap persistence and cloning
  • PDL - helps performance and design
  • Inline::C - tight loops and interfaces
Storable
  • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter)
  • Allows saving structures to disk:

use Storable qw(store retrieve dclone);

store($x, $filename);
$x = retrieve($filename);

  • Allows cloning of structures:

$y = dclone($x);

  • Not terribly interesting, but handy
PDL
  • Perl Data Language
  • On CPAN, of course (PDL-2.3.4.tar.gz)
  • Turns Perl into a data-processing language similar to Matlab
  • Native C/Fortran numerical handling
  • Compact multi-dimensional arrays
  • Still Perl at highest level
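A small taste of what that looks like in practice, assuming PDL is installed; the numbers are just a toy example:

use strict;
use warnings;
use PDL;

# Whole-array ("vectorized") operations, no explicit Perl loops
my $x = sequence(5);      # [0 1 2 3 4]
my $y = $x * $x + 1;      # elementwise: [1 2 5 10 17]

print "y    = $y\n";
print "sum  = ", $y->sum, "\n";   # 35
print "mean = ", $y->avg, "\n";   # 7
print "max  = ", $y->max, "\n";   # 17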
PDL demo

PDL experimentation shell:

ken% perldl

perldl> demo pdl

Extending PDL
  • PDL has an extension language, PDL::PP
    • Lets you write C extensions to PDL
    • Handles many gory details (data types, loop indexes, “threading”)

Extending PDL
  • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]

Extending PDL

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);

Extending PDL

The same function written with PDL::PP's $GENERIC() macro, so the temporary variable takes on whatever data type the input piddle has:

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);

Inline::C
  • Allows very easy embedding of C code in Perl modules
  • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl
  • Considered much easier than XS or SWIG
  • Developers are very enthusiastic and helpful
Inline::C basic syntax
  • A complete Perl script using Inline:

(taken from the Inline docs)

#!/usr/bin/perl
greet();

use Inline C => q{
  void greet() { printf("Hello, world\n"); }
};

Inline::C for writing functions
  • Find next prime number greater than $x

#!/usr/bin/perl

foreach (-2.7, 29, 30.33, 100_000) {
    print "$_: ", next_prime($_), "\n";
}

. . .

Inline::C for writing functions

use Inline C => q{
  int next_prime(double in) {
    // Implements a Sieve of Eratosthenes
    int *is_prime;
    int i, j;
    int candidate = ceil(in);

    if (in < 2.0) return 2;
    is_prime = malloc(2 * candidate * sizeof(int));
    for (i = 0; i < 2*candidate; i++) is_prime[i] = 1;

. . .

Inline::C for writing functions

    for (i = 2; i < 2*candidate; i++) {
      if (!is_prime[i]) continue;
      if (i >= candidate) { free(is_prime); return i; }
      for (j = i; j < 2*candidate; j += i) is_prime[j] = 0;
    }
    free(is_prime);
    return 0; // Should never get here
  }
};

Inline::C for wrapping libraries
  • We’ll create a wrapper for ‘libbow’, an IR package
  • Contains an implementation of the Porter word-stemming algorithm (i.e., the stem of 'trying' is 'try')

# A Perlish interface:
$stem = stem_porter($word);

# A C-like interface:
stem_porter_inplace($word);

Inline::C for wrapping libraries

package Bow::Inline;
use strict;

use Exporter;
use vars qw($VERSION @ISA @EXPORT_OK);

BEGIN {
  $VERSION = '0.01';
}

@ISA       = qw(Exporter);
@EXPORT_OK = qw(stem_porter stem_porter_inplace);

. . .

Inline::C for wrapping libraries

use Inline (C => 'DATA',
            VERSION => $VERSION,
            NAME    => __PACKAGE__,
            LIBS    => '-L/tmp/bow/lib -lbow',
            INC     => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );

1;

__DATA__
__C__

. . .

Inline::C for wrapping libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
  int retval;
  char* ptr = SvPV_nolen(word);

  retval = bow_stem_porter(ptr);
  SvCUR_set(word, strlen(ptr));
  return retval;
}

. . .

Inline::C for wrapping libraries

// A Perlish interface (returns undef if the word has no stem;
// declared SV* so an undef return is legal)
SV* stem_porter (char* word) {
  if (!bow_stem_porter(word)) return &PL_sv_undef;
  return newSVpv(word, 0);
}

// Don't know what the hell these are for in libbow,
// but it needs them.
const char *argp_program_version = "foo 1.0";
const char *program_invocation_short_name = "foofy";

When to use speed tools
  • A word of caution - don’t use C or PDL before you need to
  • Plain Perl is great for most tasks and usually pretty fast
  • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches
Decision Trees
  • Conceptually simple
  • Fast evaluation
  • Scrutable structures
  • Can be learned from training data
  • Can be difficult to build
  • Can “overfit” training data
  • Usually prefer simpler (i.e., smaller) trees - a small hand-written example follows below
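To see why trees are scrutable and fast to evaluate, here is a hand-written tree over the weather attributes used later in this section (Mitchell's "play tennis" example). Nothing here is generated by AI::DecisionTree; it is just nested tests:

use strict;
use warnings;

# A decision tree is just nested tests, so a classification costs only a
# few comparisons and the structure can be read off directly.
sub play_tennis {
    my %a = @_;    # outlook, humidity, wind
    if    ($a{outlook} eq 'overcast') { return 'yes' }
    elsif ($a{outlook} eq 'sunny')    { return $a{humidity} eq 'high'   ? 'no' : 'yes' }
    else                              { return $a{wind}     eq 'strong' ? 'no' : 'yes' }   # rain
}

print play_tennis(outlook => 'sunny', humidity => 'high', wind => 'weak'), "\n";   # no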
Decision Trees
  • Sample training data: the original slide showed a table of weather examples (attributes Outlook, Temperature, Humidity, Wind, with a yes/no outcome - the classic "play tennis" data)
Decision Trees
  • How do we build the tree from the training data?
  • We want to make the smallest possible trees
  • Which attribute (Outlook, Wind, etc.) is the best classifier?
  • We need a measurement of how much information a given attribute contributes toward the outcome.
  • We use information gain (IG), which is based on the entropy of the training instances.
  • The attribute with the highest IG is the “most helpful” classifier, and reduces entropy the most.
Decision Trees
  • Entropy comes from Information Theory, invented by Claude Shannon
  • Measures the uncertainty of a decision between alternative options
  • The probabilistically expected number of bits needed to specify the value of an attribute: Entropy(S) = -Σi pi log2(pi)
  • Here i ranges over the attribute's values and pi is the probability of seeing value i
Decision Trees

sub entropy {
    my %prob;
    $prob{$_}++ foreach @_;
    $_ /= @_ foreach values %prob;

    my $sum = 0;
    $sum += $_ * log($_) foreach values %prob;
    return -$sum / log(2);
}

Decision Trees
  • Gain(S, I) = Entropy(S) - Σi (|Si|/|S|) · Entropy(Si), where the Si are the subsets of S in which attribute I has value i
  • IG is the original entropy minus the entropy remaining once attribute I's value is known
  • Find argmaxI(Gain(S, I)) at each splitting node
  • To maximize IG, we can just minimize the second term on the right, since Entropy(S) is constant
  • This is the ID3 algorithm (J. R. Quinlan, 1986); a small sketch using the entropy sub above follows below
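A small sketch of the gain calculation, assuming the entropy() sub from the previous slide is in scope; the instance layout (hashrefs with attributes and result keys) is made up for illustration:

use strict;
use warnings;

# Gain(S, $attr): entropy of the outcomes minus the weighted entropy of
# each subset of instances sharing one value of $attr.
# (Assumes the entropy() sub defined on the previous slide.)
sub gain {
    my ($attr, @instances) = @_;
    my $before = entropy(map { $_->{result} } @instances);

    my %subset;
    push @{ $subset{ $_->{attributes}{$attr} } }, $_->{result} for @instances;

    my $after = 0;
    for my $value (keys %subset) {
        my @results = @{ $subset{$value} };
        $after += (@results / @instances) * entropy(@results);
    }
    return $before - $after;
}

# Toy example with two instances:
my @instances = (
    { attributes => { outlook => 'sunny',    wind => 'weak'   }, result => 'no'  },
    { attributes => { outlook => 'overcast', wind => 'strong' }, result => 'yes' },
);
printf "Gain(outlook) = %.3f\n", gain('outlook', @instances);   # 1.000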
Decision Trees
  • Decision trees in Perl are available with AI::DecisionTree (on CPAN)
  • Very simple OO interface
  • Currently implements ID3
    • Handles either consistent or noisy input
    • Can post-prune trees using a Minimum Message Length criterion
    • Doesn’t do cross-validation
    • Doesn’t handle continuous data
  • More robust feature sets are needed - patches welcome!
Decision Trees - Example

use AI::DecisionTree;

my $dtree = new AI::DecisionTree;

# Add training instances
$dtree->add_instance
  (attributes => {outlook     => 'sunny',
                  temperature => 'hot',
                  humidity    => 'high'},
   result => 'no');

$dtree->add_instance
  (attributes => {outlook     => 'overcast',
                  temperature => 'hot',
                  humidity    => 'normal'},
   result => 'yes');

# ... repeat for several more instances

Decision Trees - Example

# ... continued ...
$dtree->train;

# Find results for unseen instances
my $result = $dtree->get_result
  (attributes => {outlook     => 'sunny',
                  temperature => 'hot',
                  humidity    => 'normal'});

print "Result: $result\n";

SVMs
  • Another ML technique
  • Measures features quantitatively, induces a vector space
  • Finds the optimal decision surface - for separable data, the maximum-margin hyperplane
SVMs
  • Data may be inseparable
  • Same algorithms usually work, find “best” surface
  • Different surface shapes may be used (via different kernel functions)
  • Usually scales well with number of features, poorly with number of examples
SVMs - Example

use Algorithm::SVM;
use Algorithm::SVM::DataSet;

# Collect & format the data:
my @data;
for (...) {
    push @data, Algorithm::SVM::DataSet->new
      ( Label => $foo,
        Data  => \@bar );
}

# Train the SVM:
my $svm = Algorithm::SVM->new(Kernel => 'linear');
$svm->train(@data);

... continued ...

SVMs - Example

my $test = Algorithm::SVM::DataSet->new
  ( Label => undef,
    Data  => \@baz );

my $result = $svm->predict($test);
print "Predicted: $result\n";

Text Categorization
  • Text categorization, and categorization in general, is an extremely powerful ML technique
  • Generalizes well to many areas
    • Document management
    • Information Retrieval
    • Gene/protein identification
    • Spam filtering
  • Fairly simple concept
  • Lots of technical challenges
Text Categorization
  • AI::Categorizer (sequel to AI::Categorize) on CPAN
  • Addresses lots of tasks in text categorization
    • Format of documents (XML, text, database, etc.)
    • Support for structured documents (title, body, etc.)
    • Tokenizing of data into words
    • Linguistic stemming
    • Feature selection (1-grams, n-grams, statistically chosen)
    • Vector space modeling (TF/IDF methods)
    • Machine learning algorithm (Naïve Bayes, SVM, DecisionTree, kNN, etc.)
    • Machine learning parameters (different in each algorithm)
    • Hypothesis behavior (best-category only, or all matching categories)
AI::Categorizer Framework
  • KnowledgeSet embodies a set of documents and categories
AI::Categorizer Framework
  • Document is a (possibly structured) set of text data, belonging to 1 or more categories
AI::Categorizer Framework
  • Category is a named set containing 1 or more documents
AI::Categorizer Framework
  • Collection is a storage medium for document and category information (as text files, in DBI, XML files, etc.)
AI::Categorizer Framework
  • Feature Vector maps features (words) to weights (counts)
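The idea in miniature - a plain Perl hash mapping words to counts; this is only an illustration, not AI::Categorizer's actual FeatureVector class:

use strict;
use warnings;

# Build a simple word-count "feature vector" for one document
my $text = "the cat sat on the mat";

my %vector;
$vector{lc $_}++ for $text =~ /(\w+)/g;

print "$_ => $vector{$_}\n" for sort keys %vector;
# cat => 1, mat => 1, on => 1, sat => 1, the => 2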
AI::Categorizer Framework
  • Learner is a ML algorithm class (Naïve Bayes, kNN, Decision Tree, etc.)
AI::Categorizer Framework
  • Hypothesis is the learner’s “best guess” about document categories
AI::Categorizer Framework
  • Experiment collects and analyzes hypotheses
Using AI::Categorizer
  • Highest-level interface

use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);

# Run a complete experiment - training on a
# corpus, testing on a test set, printing a
# summary of results to STDOUT
$c->run_experiment;

Using AI::Categorizer
  • More detailed:

use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);

# Run the separate parts of $c->run_experiment
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;

Using AI::Categorizer
  • In an application:

# After training, use learner for categorizing
my $l = $c->learner;

while (...) {
    my $d = ...create a document...
    my $h = $l->categorize($d);
    print "Best category: ", $h->best_category;
}

Using AI::Categorizer
  • Uses the Class::Container package, so all parameters can go to the top-level object constructor:

my $c = new AI::Categorizer
  (save_progress => 'my_progress',
   data_root     => 'my_data',
   features_kept => 10_000,
   threshold     => 0.1,
  );

Using AI::Categorizer
  • The same constructor call, annotated on the original slide to show Class::Container routing each parameter to the class that actually uses it - the Categorizer, the KnowledgeSet, or the Learner (AI::Categorizer itself needn't know about these; it's transparent)

Naïve Bayes Categorization
  • Simple, fast machine learning technique
  • Let c1…cm represent all categories, and w1…wn the words of a given document
  • We want the most probable category, argmaxi p(ci | w1, …, wn)

Estimating this term directly is computationally infeasible - the data is too sparse

Naïve Bayes Categorization
  • Apply Bayes’ Theorem: p(ci | w1, …, wn) ∝ p(ci) · p(w1, …, wn | ci)
  • Then make the “naïve” independence assumption: p(w1, …, wn | ci) ≈ Πj p(wj | ci)
Naïve Bayes Categorization
  • The quantities p(ci) and p(wj|ci) can be calculated from training set
  • p(ci) is fraction of training set belonging to category ci
  • p(wj|ci) is fraction of words in ci that are wj
  • Must deal with unseen words, we don’t want any p(wj|ci) to be zero
  • Typically we pretend unseen words have been seen 0.5 times, or use some similar strategy
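A toy scorer following the recipe above - log-space sums, with unseen words pretended to have been seen 0.5 times. The counts and priors are made-up illustrations, and this is not AI::Categorizer's Naïve Bayes learner:

use strict;
use warnings;

# score(c) = log p(c) + sum_j log p(w_j | c), unseen words counted as 0.5
sub best_category {
    my ($words, $count, $prior) = @_;
    my ($best, $best_score);
    for my $cat (keys %$prior) {
        my $total = 0;
        $total += $_ for values %{ $count->{$cat} };

        my $score = log $prior->{$cat};
        for my $w (@$words) {
            my $n = $count->{$cat}{$w} || 0.5;    # smooth unseen words
            $score += log( $n / $total );
        }
        ($best, $best_score) = ($cat, $score)
            if !defined $best_score or $score > $best_score;
    }
    return $best;
}

# Illustrative word counts per category, and category priors
my %count = ( spam => { viagra => 10, meeting => 1  },
              ham  => { viagra => 1,  meeting => 12 } );
my %prior = ( spam => 0.4, ham => 0.6 );

print best_category([qw(viagra now)], \%count, \%prior), "\n";   # spam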
Naïve Bayes Sample Run

ken> perl eg/run_experiment.pl [options]

References
  • Ken Williams: ken@mathforum.org or kenw@ee.usyd.edu.au
  • Perl-AI list: perl-ai@perl.org
  • AI::Categorizer, AI::DecisionTree, Statistics::Contingency, Inline::C, PDL, Storable all on CPAN
  • libbow: http://www.cs.cmu.edu/~mccallum/bow
  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997
  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999
Extras, time permitting
  • AI::Categorizer parameters by class
  • AI::DecisionTree example
  • PDL::Sparse walkthrough
  • AI::NodeLib (incomplete implementation)