An Introduction to Machine Learning with Perl

February 3, 2003

O’Reilly Bioinformatics Conference

Ken Williams

ken@mathforum.org


Tutorial Overview

  • What is Machine Learning? (20’)

  • Why use Perl for ML? (15’)

  • Some theory (20’)

  • Some tools (30’)

  • Decision trees (20’)

  • SVMs (15’)

  • Categorization (40’)


References & Sources

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999

  • Perl-AI list (perl-ai@perl.org)


What Is Machine Learning?

  • A subfield of Artificial Intelligence (but without the baggage)

  • Usually concerns some particular task, not the building of a sentient robot

  • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience


Typical ML Tasks

  • Clustering

  • Categorization

  • Recognition

  • Filtering

  • Game playing

  • Autonomous performance


Typical ML Tasks

  • Recognition (faces): Vincent Van Gogh, Michael Stipe, Muhammad Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland


Typical ML Tasks

  • Recognition (speech): “Little red corvette”, “The kids are all right”, “The rain in Spain”, “Bort bort bort”


Typical ML Buzzwords

  • Data Mining

  • Knowledge Management (KM)

  • Information Retrieval (IR)

  • Expert Systems

  • Topic detection and tracking


Who does ML?

  • Two main groups: research and industry

  • These groups do listen to each other, at least somewhat

  • Not many reusable ML/KM components, outside of a few commercial systems

  • KM is seen as a key component of big business strategy - lots of KM consultants

  • ML is an extremely active research area with relatively low “cost of entry”


When is ML useful?

  • When you have lots of data

  • When you can’t hire enough people, or when people are too slow

  • When you can afford to be wrong sometimes

  • When you need to find patterns

  • When you have nothing to lose


An aside on your presenter

  • Academic background in math & music (not computer science or even statistics)

  • Several years as a Perl consultant

  • Two years as a math teacher

  • Currently studying document categorization at The University of Sydney

  • In other words, a typical ML student


Why use Perl for ML?

  • CPAN - the viral solution™

  • Perl has rapid reusability

  • Perl is widely deployed

  • Perl code can be written quickly

  • Embeds both ways

  • Human-oriented development

  • Leaves your options open


But what about all the data?

  • ML techniques tend to use lots of data in complicated ways

  • Perl is great at data in general, but tends to gobble memory or forgo strict checking

  • Two fine solutions exist:

    • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.; see the sketch below)

    • Use PDL or Inline (more on these later)
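For example, a minimal sketch of C-style argument discipline with Params::Validate (the train() routine and its parameters are invented for illustration):

    use Params::Validate qw(:all);

    sub train {
        my $self = shift;
        my %args = validate(@_, {
            examples => { type => ARRAYREF },               # required
            epochs   => { type => SCALAR, default => 10 },  # optional
        });
        # validate() croaks on missing, extra, or mistyped parameters
    }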


Interfaces vs. Implementations

  • In ML applications, we need both data integrity and the ability to “play with it”

  • Perl wrappers around C/C++ structures/objects are a nice balance

  • Keeps high-level interfaces in Perl, low-level implementations in C/C++

  • Can be prototyped in pure Perl, with C/C++ parts added later


Some ML Theory and Terminology

  • ML concerns learning a target function from a set of examples

  • The target function is often called a hypothesis

  • Example: with a neural network, a trained network is a hypothesis

  • The set of all possible target functions is called the hypothesis space

  • The training process can be considered a search through the hypothesis space


Some ML Theory and Terminology

  • Each ML technique will

    • probably exclude some hypotheses

    • prefer some hypotheses over others

  • A technique’s exclusion & preference rules are called its inductive bias

  • If it ain’t biased, it ain’t learnin’

    • No bias = rote learning

    • Bias = generalization

  • Example: kids learning multiplication (understanding vs. memorization)


Some ML Theory and Terminology

  • Ideally, an ML technique will

    • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis

    • prefer the target hypothesis over others

  • Measuring the degree to which these criteria are satisfied is important and sometimes complicated


Evaluating Hypotheses

  • We often want to know how good a hypothesis is

    • 1) To know how it performs in the real world

    • 2) To improve the learning technique or tune its parameters

    • 3) To let the learner automatically improve the hypothesis

  • Usually evaluate on test data

    • Test data must be kept separate from training data

    • Test data used for purpose 3) is usually called validation or held-out data.

    • Training, validation, and test data should not contaminate each other (see the split sketch below)
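One way to keep the three sets separate, sketched in plain Perl (the 70/15/15 proportions and the @instances array are illustrative assumptions):

    use List::Util qw(shuffle);

    my @data = shuffle @instances;   # randomize before splitting
    my $n    = @data;
    my @train      = @data[0 .. int(0.70 * $n) - 1];
    my @validation = @data[int(0.70 * $n) .. int(0.85 * $n) - 1];
    my @test       = @data[int(0.85 * $n) .. $n - 1];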


Evaluating Hypotheses

  • Some standard statistical measures are useful

  • Error rate, accuracy, precision, recall, F1

  • Calculated using contingency tables: a = assigned & correct, b = assigned & incorrect, c = not assigned but correct, d = not assigned & incorrect


Evaluating Hypotheses

  • Error = (b+c)/(a+b+c+d)

  • Accuracy = (a+d)/(a+b+c+d)

  • Precision = p = a/(a+b)

  • Recall = r = a/(a+c)

  • F1 = 2pr/(p+r)

Precision is easy to maximize by assigning nothing

Recall is easy to maximize by assigning everything

F1 combines precision and recall equally
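These five measures are one-liners in Perl; a minimal sketch, computed straight from the four contingency counts (the sub and variable names are illustrative):

    sub contingency_stats {
        # ($tp, $fp, $fn, $tn) correspond to (a, b, c, d) above
        my ($tp, $fp, $fn, $tn) = @_;
        my $total = $tp + $fp + $fn + $tn;
        my %s;
        $s{error}     = ($fp + $fn) / $total;
        $s{accuracy}  = ($tp + $tn) / $total;
        $s{precision} = $tp / ($tp + $fp);
        $s{recall}    = $tp / ($tp + $fn);
        $s{F1}        = 2 * $s{precision} * $s{recall}
                        / ($s{precision} + $s{recall});
        return %s;   # zero denominators (nothing assigned) left unhandled
    }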


Evaluating Hypotheses

  • Example (from categorization)

  • Note that precision is higher than recall - indicates a cautious categorizer

Precision = 0.851, Recall = 0.711, F1 = 0.775

These scores depend on the task - can’t compare scores across tasks

Often useful to score each category separately, then average the scores (macro-averaging)


Evaluating Hypotheses

  • The Statistics::Contingency module (on CPAN) helps calculate these figures:

    use Statistics::Contingency;
    my $s = new Statistics::Contingency;
    while (...) {
        ... Do some categorization ...
        $s->add_result($assigned, $correct);
    }
    print "Micro F1: ", $s->micro_F1, "\n";
    print $s->stats_table;

    Micro F1: 0.774803607797498
    +-------------------------------------------------+
    |   miR   miP  miF1   maR   maP  maF1   Err       |
    | 0.243 0.843 0.275 0.711 0.851 0.775 0.006       |
    +-------------------------------------------------+


Useful Perl Data-Munging Tools

  • Storable - cheap persistence and cloning

  • PDL - helps performance and design

  • Inline::C - tight loops and interfaces


Storable

  • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter)

  • Allows saving structures to disk:

    use Storable qw(store retrieve dclone);
    store($x, $filename);
    $x = retrieve($filename);

  • Allows cloning of structures:

    $y = dclone($x);

  • Not terribly interesting, but handy


PDL

  • Perl Data Language

  • On CPAN, of course (PDL-2.3.4.tar.gz)

  • Turns Perl into a data-processing language similar to Matlab

  • Native C/Fortran numerical handling

  • Compact multi-dimensional arrays

  • Still Perl at highest level
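For a small taste (a minimal sketch; the values are illustrative):

    use PDL;
    my $x = pdl(1, 2, 3, 4);   # compact, C-backed numeric array
    my $y = $x * 2 + 1;        # arithmetic is vectorized in C
    print "$y\n";              # [3 5 7 9]
    print $x->sum, "\n";       # 10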


PDL demo

PDL experimentation shell:

ken% perldl

perldl> demo pdl


Extending PDL

  • PDL has an extension language, PDL::PP

  • Lets you write C extensions to PDL

  • Handles many gory details (data types, loop indexes, “threading”)


Extending PDL

  • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]


Extending PDL

pp_def('sum_elements',
    Pars => 'a(n); [o]b();',
    Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
    tmp += $a();
%}
$b() = tmp;
EOF
);


Extending PDL

  • The same function made type-generic with the $GENERIC() macro, so it works for any PDL data type:

pp_def('sum_elements',
    Pars => 'a(n); [o]b();',
    Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
    tmp += $a();
%}
$b() = tmp;
EOF
);


Inline::C

  • Allows very easy embedding of C code in Perl modules

  • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl

  • Considered much easier than XS or SWIG

  • Developers are very enthusiastic and helpful


Inline::C basic syntax

  • A complete Perl script using Inline (taken from the Inline docs):

    #!/usr/bin/perl
    greet();   # already bound: 'use Inline' ran at compile time

    use Inline C => q{
        void greet() { printf("Hello, world\n"); }
    };


Inline::C for writing functions

  • Find the next prime number greater than $x:

    #!/usr/bin/perl
    foreach (-2.7, 29, 30.33, 100_000) {
        print "$_: ", next_prime($_), "\n";
    }
    . . .


Inline::C for writing functions

use Inline C => q{
    #include <math.h>     /* ceil() */
    #include <stdlib.h>   /* malloc(), free() */

    int next_prime(double in) {
        // Implements a Sieve of Eratosthenes
        int *is_prime;
        int i, j;
        int candidate = ceil(in);
        if (in < 2.0) return 2;
        is_prime = malloc(2 * candidate * sizeof(int));
        for (i = 0; i < 2*candidate; i++) is_prime[i] = 1;
. . .


Inline::C for writing functions

        for (i = 2; i < 2*candidate; i++) {
            if (!is_prime[i]) continue;
            if (i >= candidate) { free(is_prime); return i; }
            for (j = i; j < 2*candidate; j += i) is_prime[j] = 0;
        }
        free(is_prime);
        return 0;   // Should never get here
    }
};


Inline::C for wrapping libraries

  • We’ll create a wrapper for ‘libbow’, an IR package

  • Contains an implementation of the Porter word-stemming algorithm (i.e., the stem of 'trying' is 'try')

# A Perlish interface:
$stem = stem_porter($word);

# A C-like interface:
stem_porter_inplace($word);


Inline::C for wrapping libraries

package Bow::Inline;
use strict;
use Exporter;
use vars qw($VERSION @ISA @EXPORT_OK);

BEGIN {
    $VERSION = '0.01';
}

@ISA = qw(Exporter);
@EXPORT_OK = qw(stem_porter stem_porter_inplace);
. . .


Inline::C for wrapping libraries

use Inline (C => 'DATA',
            VERSION => $VERSION,
            NAME    => __PACKAGE__,
            LIBS    => '-L/tmp/bow/lib -lbow',
            INC     => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );
1;
__DATA__
__C__
. . .


Inline::C for wrapping libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
    int retval;
    char* ptr = SvPV_nolen(word);
    retval = bow_stem_porter(ptr);   // stems in place
    SvCUR_set(word, strlen(ptr));    // fix the SV's string length
    return retval;
}
. . .


Inline::C for wrapping libraries

// A Perlish interface: returns the stem, or undef on failure
char* stem_porter (char* word) {
    if (!bow_stem_porter(word)) return NULL;  // NULL maps to undef
    return word;
}

// Don't know what the hell these are for in libbow,
// but it needs them.
const char *argp_program_version = "foo 1.0";
const char *program_invocation_short_name = "foofy";


When to use speed tools

  • A word of caution - don’t use C or PDL before you need to

  • Plain Perl is great for most tasks and usually pretty fast

  • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches


Decision Trees

  • Conceptually simple

  • Fast evaluation

  • Scrutable structures

  • Can be learned from training data

  • Can be difficult to build

  • Can “overfit” training data

  • Usually prefer simpler, i.e. smaller trees


Decision Trees

  • Sample training data:


Decision Trees

  • How do we build the tree from the training data?

  • We want to make the smallest possible trees

  • Which attribute (Outlook, Wind, etc.) is the best classifier?

  • We need a measurement of how much information a given attribute contributes toward the outcome.

  • We use information gain (IG), which is based on the entropy of the training instances.

  • The attribute with the highest IG is the “most helpful” classifier, and reduces entropy the most.


Decision Trees

  • Entropy comes from Information Theory, invented by Claude Shannon

  • Measures the uncertainty of a decision between alternative options

  • It is the probabilistically expected number of bits necessary to specify the value of an attribute:

    Entropy(S) = -Σi pi log2(pi)

  • Here i ranges over the attribute’s values, and pi is the probability of seeing value i


Decision Trees

sub entropy {
    my %prob;
    $prob{$_}++ foreach @_;                      # count each outcome
    $_ /= @_ foreach values %prob;               # counts -> probabilities
    my $sum = 0;
    $sum += $_ * log($_) foreach values %prob;
    return -$sum / log(2);                       # convert to base 2: bits
}
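For instance, a 3-to-2 split of outcomes carries about 0.971 bits of uncertainty:

    print entropy(qw(yes yes yes no no)), "\n";   # about 0.971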


Decision Trees

  • Si are the subsets of S in which attribute I has value i

  • IG is the original entropy minus the entropy remaining once attribute I is known:

    Gain(S, I) = Entropy(S) - Σi (|Si| / |S|) Entropy(Si)

  • Find argmaxI Gain(S, I) at each splitting node

  • To maximize IG, we can just minimize the second term on the right, since Entropy(S) is constant

  • This is the ID3 algorithm (J. R. Quinlan, 1986); a Perl sketch of Gain follows
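A minimal sketch of Gain(S, I), reusing entropy() from the previous slide (it assumes each instance is a hashref of attribute values plus an 'outcome' key, an illustrative representation):

    sub gain {
        my ($attr, @instances) = @_;
        my $before = entropy(map $_->{outcome}, @instances);

        # Partition the outcomes by the value of attribute $attr
        my %subset;
        push @{ $subset{ $_->{$attr} } }, $_->{outcome} for @instances;

        # Entropy remaining after the split, weighted by subset size
        my $after = 0;
        $after += @$_ / @instances * entropy(@$_) for values %subset;

        return $before - $after;
    }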


Decision Trees

  • Decision trees in Perl are available with AI::DecisionTree (on CPAN)

  • Very simple OO interface

  • Currently implements ID3

    • Handles either consistent or noisy input

    • Can post-prune trees using a Minimum Message Length criterion

    • Doesn’t do cross-validation

    • Doesn’t handle continuous data

  • More robust feature sets are needed - patches welcome!


Decision Trees - Example

use AI::DecisionTree;
my $dtree = new AI::DecisionTree;

# Add training instances
$dtree->add_instance
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'high'},
     result => 'no');

$dtree->add_instance
    (attributes => {outlook     => 'overcast',
                    temperature => 'hot',
                    humidity    => 'normal'},
     result => 'yes');

# ... repeat for several more instances


Decision Trees - Example

# ... continued ...
$dtree->train;

# Find results for unseen instances
my $result = $dtree->get_result
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'normal'});
print "Result: $result\n";


SVMs

  • Another ML technique

  • Measures features quantitatively, induces a vector space

  • Finds the optimal decision surface


SVMs

  • Data may not be cleanly separable

  • The same algorithms usually still work, finding the “best” available surface

  • Different surface shapes (kernels) may be used

  • Usually scales well with the number of features, poorly with the number of examples


SVMs - Example

use Algorithm::SVM;
use Algorithm::SVM::DataSet;

# Collect & format the data:
my @data;
for (...) {
    push @data, Algorithm::SVM::DataSet->new
        ( Label => $foo,
          Data  => \@bar );
}

# Train the SVM:
my $svm = Algorithm::SVM->new(Kernel => 'linear');
$svm->train(@data);

... continued ...


SVMs - Example

my $test = Algorithm::SVM::DataSet->new
    ( Label => undef,
      Data  => \@baz );

my $result = $svm->predict($test);
print "Predicted: $result\n";


Text Categorization

  • Text categorization, and categorization in general, is an extremely powerful ML technique

  • Generalizes well to many areas

    • Document management

    • Information Retrieval

    • Gene/protein identification

    • Spam filtering

  • Fairly simple concept

  • Lots of technical challenges


Text Categorization

  • AI::Categorizer (sequel to AI::Categorize) on CPAN

  • Addresses lots of tasks in text categorization

    • Format of documents (XML, text, database, etc.)

    • Support for structured documents (title, body, etc.)

    • Tokenizing of data into words

    • Linguistic stemming

    • Feature selection (1-grams, n-grams, statistically chosen)

    • Vector space modeling (TF/IDF methods)

    • Machine learning algorithm (Naïve Bayes, SVM, DecisionTree, kNN, etc.)

    • Machine learning parameters (different in each algorithm)

    • Hypothesis behavior (best-category only, or all matching categories)


AI::Categorizer Framework

  • KnowledgeSet embodies a set of documents and categories


AI::Categorizer Framework

  • Document is a (possibly structured) set of text data, belonging to 1 or more categories


AI::Categorizer Framework

  • Category is a named set containing 1 or more documents


AI::Categorizer Framework

  • Collection is a storage medium for document and category information (as text files, in DBI, XML files, etc.)


AI::Categorizer Framework

  • Feature Vector maps features (words) to weights (counts)


AI::Categorizer Framework

  • Learner is a ML algorithm class (Naïve Bayes, kNN, Decision Tree, etc.)


AI::Categorizer Framework

  • Hypothesis is the learner’s “best guess” about document categories


AI::Categorizer Framework

  • Experiment collects and analyzes hypotheses


Using AI::Categorizer

  • Highest-level interface

    use AI::Categorizer;
    my $c = new AI::Categorizer(...parameters...);

    # Run a complete experiment - training on a
    # corpus, testing on a test set, printing a
    # summary of results to STDOUT
    $c->run_experiment;


Using AI::Categorizer

  • More detailed:

    use AI::Categorizer;
    my $c = new AI::Categorizer(...parameters...);

    # Run the separate parts of $c->run_experiment
    $c->scan_features;
    $c->read_training_set;
    $c->train;
    $c->evaluate_test_set;
    print $c->stats_table;


Using AI::Categorizer

  • In an application:

    # After training, use the learner for categorizing
    my $l = $c->learner;
    while (...) {
        my $d = ...create a document...
        my $h = $l->categorize($d);
        print "Best category: ", $h->best_category;
    }


Using AI::Categorizer

  • Uses the Class::Container package, so all parameters can go to the top-level object constructor:

    my $c = new AI::Categorizer
        (save_progress => 'my_progress',
         data_root     => 'my_data',
         features_kept => 10_000,
         threshold     => 0.1,
        );


Using AI::Categorizer

  • These parameters are actually declared by the contained classes (the Categorizer, KnowledgeSet, and Learner); Class::Container delivers each one to the class that declares it, so AI::Categorizer needn’t know about them - it’s transparent


Naïve Bayes Categorization

  • Simple, fast machine learning technique

  • Let c1…cm represent all categories, and w1…wn the words of a given document

  • We want the category ci that maximizes p(ci | w1, …, wn)

  • Estimating that term directly is computationally infeasible - the data is too sparse


Naïve Bayes Categorization

  • Apply Bayes’ Theorem, then assume words are independent given the category (the “naïve” part):

    p(ci | w1, …, wn) = p(ci) p(w1, …, wn | ci) / p(w1, …, wn) ∝ p(ci) Πj p(wj | ci)


Naïve Bayes Categorization

  • The quantities p(ci) and p(wj|ci) can be calculated from training set

  • p(ci) is fraction of training set belonging to category ci

  • p(wj|ci) is fraction of words in ci that are wj

  • Must deal with unseen words, since we don’t want any p(wj|ci) to be zero

  • Typically we pretend unseen words have been seen 0.5 times, or use some similar strategy (see the sketch below)
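A minimal Perl sketch of the whole scoring step, using the 0.5-count strategy (the training structures %prior, %count, and %cat_size are hypothetical names, assumed built from the training set; summing logs avoids floating-point underflow):

    sub best_category {
        my @words = @_;
        my ($best, $best_score);
        foreach my $cat (keys %prior) {
            my $score = log $prior{$cat};            # log p(ci)
            foreach my $w (@words) {                 # add each log p(wj|ci)
                my $seen = $count{$cat}{$w} || 0.5;  # unseen word: pretend 0.5
                $score += log($seen / $cat_size{$cat});
            }
            ($best, $best_score) = ($cat, $score)
                if !defined $best_score or $score > $best_score;
        }
        return $best;
    }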


Naïve Bayes Sample Run

ken> perl eg/run_experiment.pl [options]


References

  • Ken Williams: ken@mathforum.org or kenw@ee.usyd.edu.au

  • Perl-AI list: perl-ai@perl.org

  • AI::Categorizer, AI::DecisionTree, Statistics::Contingency, Inline::C, PDL, Storable all on CPAN

  • libbow: http://www.cs.cmu.edu/~mccallum/bow

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999


Extras, time permitting

  • AI::Categorizer parameters by class

  • AI::DecisionTree example

  • PDL::Sparse walkthrough

  • AI::NodeLib (incomplete implementation)

