An Introduction to Machine Learning with Perl

February 3, 2003

O’Reilly Bioinformatics Conference

Ken Williams

[email protected]

Tutorial Overview

  • What is Machine Learning? (20’)

  • Why use Perl for ML? (15’)

  • Some theory (20’)

  • Some tools (30’)

  • Decision trees (20’)

  • SVMs (15’)

  • Categorization (40’)

References & Sources

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680pp, 1999

  • Perl-AI list ([email protected])

What Is Machine Learning?

  • A subfield of Artificial Intelligence (but without the baggage)

  • Usually concerns some particular task, not the building of a sentient robot

  • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience

Typical ML Tasks

  • Clustering

  • Categorization

  • Recognition

  • Filtering

  • Game playing

  • Autonomous performance

Typical ML Tasks

(Each of the following slides illustrated one task with images.)

  • Clustering

  • Categorization

  • Recognition - faces (Vincent Van Gogh, Michael Stipe, Muhammad Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland) and speech (“Little red corvette”, “The kids are all right”, “The rain in Spain”, “Bort bort bort”)

  • Filtering

  • Game playing

  • Autonomous performance

Typical ML Buzzwords

  • Data Mining

  • Knowledge Management (KM)

  • Information Retrieval (IR)

  • Expert Systems

  • Topic detection and tracking

Who Does ML?

  • Two main groups: research and industry

  • These groups do listen to each other, at least some

  • Not many reusable ML/KM components, outside of a few commercial systems

  • KM is seen as a key component of big business strategy - lots of KM consultants

  • ML is an extremely active research area with relatively low “cost of entry”

When Is ML Useful?

  • When you have lots of data

  • When you can’t hire enough people, or when people are too slow

  • When you can afford to be wrong sometimes

  • When you need to find patterns

  • When you have nothing to lose

An Aside on Your Presenter

  • Academic background in math & music (not computer science or even statistics)

  • Several years as a Perl consultant

  • Two years as a math teacher

  • Currently studying document categorization at The University of Sydney

  • In other words, a typical ML student

Why Use Perl for ML?

  • CPAN - the viral solution™

  • Perl has rapid reusability

  • Perl is widely deployed

  • Perl code can be written quickly

  • Embeds both ways

  • Human-oriented development

  • Leaves your options open

But What About All the Data?

  • ML techniques tend to use lots of data in complicated ways

  • Perl is great at data in general, but tends to gobble memory or forgo strict checking

  • Two fine solutions exist:

    • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.)

    • Use PDL or Inline (more on these later)
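The first option above can be sketched in plain Perl: validate arguments by hand before touching them, the way a C prototype would force you to (Params::Validate automates exactly this). The `train` routine and its parameter names here are hypothetical.

```perl
# A sketch of "being as careful in Perl as in C": check argument
# types up front instead of trusting the caller. Hypothetical API.
use strict;
use warnings;

sub train {
    my %args = @_;
    die "train: 'examples' must be an array reference"
        unless ref($args{examples}) eq 'ARRAY';
    die "train: 'iterations' must be a positive integer"
        unless defined $args{iterations}
            && $args{iterations} =~ /^\d+$/
            && $args{iterations} > 0;
    return scalar @{ $args{examples} } * $args{iterations};
}

print train(examples => [1, 2, 3], iterations => 10), "\n";  # 30
```

Bad calls fail loudly at the boundary rather than corrupting state deep inside the learner.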

Interfaces vs. Implementations

  • In ML applications, we need both data integrity and the ability to “play with it”

  • Perl wrappers around C/C++ structures/objects are a nice balance

  • Keeps high-level interfaces in Perl, low-level implementations in C/C++

  • Can be prototyped in pure Perl, with C/C++ parts added later

Some ML Theory and Terminology

  • ML concerns learning a target function from a set of examples

  • The target function is often called a hypothesis

  • Example: with a Neural Network, each trained network is a hypothesis

  • The set of all possible target functions is called the hypothesis space

  • The training process can be considered a search through the hypothesis space

Some ML Theory and Terminology

  • Each ML technique will

    • probably exclude some hypotheses

    • prefer some hypotheses over others

  • A technique’s exclusion & preference rules are called its inductive bias

  • If it ain’t biased, it ain’t learnin’

    • No bias = rote learning

    • Bias = generalization

  • Example: kids learning multiplication (understanding vs. memorization)

Some ML Theory and Terminology

  • Ideally, an ML technique will

    • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis

    • prefer the target hypothesis over others

  • Measuring the degree to which these criteria are satisfied is important and sometimes complicated

Evaluating Hypotheses

  • We often want to know how good a hypothesis is

    1. To know how it will perform in the real world

    2. To improve the learning technique or tune its parameters

    3. To let a learner automatically improve the hypothesis

  • We usually evaluate on test data

    • Test data must be kept separate from training data

    • Test data used for purpose 3) is usually called validation or held-out data

    • Training, validation, and test data should not contaminate each other
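Keeping the three sets uncontaminated can be as simple as shuffling once and slicing into disjoint pieces. The 60/20/20 proportions below are illustrative, not from the slides.

```perl
# A minimal train/validation/test split: shuffle once, slice into
# disjoint sets. Proportions (60/20/20) are an assumption.
use strict;
use warnings;
use List::Util qw(shuffle);

my @instances = 1 .. 100;           # stand-ins for labeled examples
my @shuffled  = shuffle(@instances);

my @train      = @shuffled[0 .. 59];
my @validation = @shuffled[60 .. 79];
my @test       = @shuffled[80 .. 99];

printf "train=%d validation=%d test=%d\n",
       scalar @train, scalar @validation, scalar @test;  # train=60 validation=20 test=20
```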

Evaluating Hypotheses

  • Some standard statistical measures are useful

  • Error rate, accuracy, precision, recall, F1

  • Calculated using contingency tables

Evaluating Hypotheses

  • For a single category, the contingency table counts: a = assigned & correct, b = assigned & incorrect, c = not assigned but correct, d = not assigned & incorrect

  • Error = (b+c)/(a+b+c+d)

  • Accuracy = (a+d)/(a+b+c+d)

  • Precision = p = a/(a+b)

  • Recall = r = a/(a+c)

  • F1 = 2pr/(p+r)

Precision is easy to maximize by assigning nothing

Recall is easy to maximize by assigning everything

F1 combines precision and recall equally
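The formulas above translate directly into a small helper. The contingency counts in the usage example (8 true positives, 2 false positives, 4 misses, 86 true negatives) are made up for illustration.

```perl
# Compute the standard measures from the four contingency cells
# a (assigned & correct), b (assigned & incorrect),
# c (missed), d (correctly not assigned).
use strict;
use warnings;

sub measures {
    my ($a, $b, $c, $d) = @_;
    my $n = $a + $b + $c + $d;
    my $p = $a / ($a + $b);    # precision
    my $r = $a / ($a + $c);    # recall
    return {
        error     => ($b + $c) / $n,
        accuracy  => ($a + $d) / $n,
        precision => $p,
        recall    => $r,
        F1        => 2 * $p * $r / ($p + $r),
    };
}

# Hypothetical counts for one category:
my $m = measures(8, 2, 4, 86);
printf "P=%.3f R=%.3f F1=%.3f\n",
       $m->{precision}, $m->{recall}, $m->{F1};  # P=0.800 R=0.667 F1=0.727
```

Note how F1 sits between the two: a categorizer that gains precision by assigning less will pay for it in recall, and vice versa.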

Evaluating Hypotheses

  • Example (from categorization; the slide showed a table of per-category results)

  • Note that precision is higher than recall - this indicates a cautious categorizer

Precision = 0.851, Recall = 0.711, F1 = 0.775

These scores depend on the task - can’t compare scores across tasks

Often useful to compare categories separately, then average (macro-averaging)

Evaluating Hypotheses

  • The Statistics::Contingency module (on CPAN) helps calculate these figures:

    use Statistics::Contingency;
    my $s = new Statistics::Contingency;
    while (...) {
      ... Do some categorization ...
      $s->add_result($assigned, $correct);
    }
    print "Micro F1: ", $s->micro_F1, "\n";
    print $s->stats_table;

    # Output:
    Micro F1: 0.774803607797498
    | miR miP miF1 maR maP maF1 Err |
    | 0.243 0.843 0.275 0.711 0.851 0.775 0.006 |


Useful Perl Data-Munging Tools

  • Storable - cheap persistence and cloning

  • PDL - helps performance and design

  • Inline::C - tight loops and interfaces


Storable

  • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter)

  • Allows saving structures to disk:

    store($x, $filename);

    $x = retrieve($filename);

  • Allows cloning of structures:

    $y = dclone($x);

  • Not terribly interesting, but handy
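The three Storable calls above fit in one short round-trip script. The nested structure is an arbitrary example; the temp file is cleaned up automatically.

```perl
# Round-trip demonstration of Storable: save a nested structure to
# disk, read it back, and make an independent deep copy.
use strict;
use warnings;
use Storable qw(store retrieve dclone);
use File::Temp qw(tempfile);

my $x = { words => [qw(machine learning)], count => 2 };

my (undef, $filename) = tempfile(UNLINK => 1);
store($x, $filename);                 # serialize to disk
my $from_disk = retrieve($filename);  # read it back

my $y = dclone($x);                   # deep copy, independent of $x
push @{ $y->{words} }, 'perl';        # does not affect $x or $from_disk

print scalar @{ $from_disk->{words} }, " ", scalar @{ $y->{words} }, "\n";  # 2 3
```

A shallow copy (`%$x`) would have shared the inner array; `dclone` copies the whole graph, which matters when training data is nested several levels deep.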


PDL

  • Perl Data Language

  • On CPAN, of course (PDL-2.3.4.tar.gz)

  • Turns Perl into a data-processing language similar to Matlab

  • Native C/Fortran numerical handling

  • Compact multi-dimensional arrays

  • Still Perl at highest level

PDL Demo

PDL experimentation shell:

ken% perldl

perldl> demo pdl

Extending PDL

  • PDL has an extension language, PDL::PP

  • Lets you write C extensions to PDL

  • Handles many gory details (data types, loop indexes, “threading”)

Extending PDL

  • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]

Extending PDL

  • The PDL::PP definition of sum_elements:

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);



Extending PDL

  • The same definition made type-generic with $GENERIC():

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);



Inline::C

  • Allows very easy embedding of C code in Perl modules

  • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl

  • Considered much easier than XS or SWIG

  • Developers are very enthusiastic and helpful

Inline::C Basic Syntax

  • A complete Perl script using Inline (taken from the Inline docs):

use Inline C => q{
  void greet() { printf("Hello, world\n"); }
};
greet();


Inline::C for Writing Functions

  • Find next prime number greater than $x


    foreach (-2.7, 29, 30.33, 100_000) {
      print "$_: ", next_prime($_), "\n";
    }

    . . .

Inline::C for Writing Functions

use Inline C => q{
  int next_prime(double in) {
    // Implements a Sieve of Eratosthenes
    int *is_prime;
    int i, j;
    int candidate = ceil(in);
    if (in < 2.0) return 2;
    is_prime = malloc(2 * candidate * sizeof(int));
    for (i = 0; i < 2*candidate; i++) is_prime[i] = 1;
    . . .

Inline::C for Writing Functions

    for (i = 2; i < 2*candidate; i++) {
      if (!is_prime[i]) continue;
      if (i >= candidate) { free(is_prime); return i; }
      for (j = i; j < 2*candidate; j += i) is_prime[j] = 0;
    }
    return 0; // Should never get here
  }
};



Inline::C for Wrapping Libraries

  • We’ll create a wrapper for ‘libbow’, an IR package

  • Contains an implementation of the Porter word-stemming algorithm (i.e., the stem of 'trying' is 'try')

# A Perlish interface:
$stem = stem_porter($word);

# A C-like interface (stems in place):
stem_porter_inplace($word);

Inline::C for Wrapping Libraries

package Bow::Inline;

use strict;
use Exporter;
use vars qw($VERSION @ISA @EXPORT_OK);

$VERSION = '0.01';

@ISA = qw(Exporter);
@EXPORT_OK = qw(stem_porter stem_porter_inplace);

. . .

Inline::C for Wrapping Libraries

use Inline (C => 'DATA',
            LIBS => '-L/tmp/bow/lib -lbow',
            INC => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );

. . .

Inline::C for Wrapping Libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
  int retval;
  char* ptr = SvPV_nolen(word);
  retval = bow_stem_porter(ptr);
  SvCUR_set(word, strlen(ptr));
  return retval;
}

. . .

Inline::C for Wrapping Libraries

// A Perlish interface
SV* stem_porter (char* word) {
  if (!bow_stem_porter(word)) return &PL_sv_undef;
  return newSVpv(word, 0);
}

// Don't know what the hell these are for in libbow,
// but it needs them.
const char *argp_program_version = "foo 1.0";
const char *program_invocation_short_name = "foofy";

When to Use Speed Tools

  • A word of caution - don’t use C or PDL before you need to

  • Plain Perl is great for most tasks and usually pretty fast

  • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches
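The point above can be demonstrated: the `next_prime` function from the earlier Inline::C slides is easily prototyped in plain Perl. Trial division (a simplification of the sieve used in the C version) is plenty fast for small inputs; reach for C only when profiling says so.

```perl
# Plain-Perl prototype of the earlier next_prime() example, using
# trial division instead of a sieve. Returns the first prime >= ceil(x)
# (and 2 for anything below 2), matching the C version's behavior.
use strict;
use warnings;
use POSIX qw(ceil);

sub next_prime {
    my $n = ceil(shift);
    return 2 if $n < 2;
    CANDIDATE: while (1) {
        for (my $d = 2; $d * $d <= $n; $d++) {
            if ($n % $d == 0) { $n++; next CANDIDATE; }
        }
        return $n;  # no divisor found: $n is prime
    }
}

print join(" ", map { next_prime($_) } -2.7, 29, 30.33, 100_000), "\n";  # 2 29 31 100003
```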

Decision Trees

  • Conceptually simple

  • Fast evaluation

  • Scrutable structures

  • Can be learned from training data

  • Can be difficult to build

  • Can “overfit” training data

  • Usually prefer simpler, i.e. smaller trees

Decision Trees

  • Sample training data: a table of weather instances (Outlook, Temperature, Humidity, Wind) with a yes/no outcome

Decision Trees

  • How do we build the tree from the training data?

  • We want to make the smallest possible trees

  • Which attribute (Outlook, Wind, etc.) is the best classifier?

  • We need a measurement of how much information a given attribute contributes toward the outcome.

  • We use information gain (IG), which is based on the entropy of the training instances.

  • The attribute with the highest IG is the “most helpful” classifier, and reduces entropy the most.

Decision Trees

  • Entropy comes from Information Theory, invented by Claude Shannon

  • Measures the uncertainty of a decision between alternative outcomes

  • The probabilistically expected number of bits necessary to specify the value of an attribute:

    Entropy(S) = -Σi pi log2(pi)

  • i represents an attribute value, pi represents the probability of seeing that value

Decision Trees

sub entropy {
  my %prob;
  $prob{$_}++ foreach @_;
  $_ /= @_ foreach values %prob;
  my $sum = 0;
  $sum += $_ * log($_) foreach values %prob;
  return -$sum / log(2);
}
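Exercising the entropy sub (repeated here so the snippet is self-contained): a 50/50 label split carries exactly one bit of uncertainty, a uniform four-way split two bits, and a lopsided 3/1 split a fraction of a bit.

```perl
# Quick checks of the entropy() sub from the slide above.
use strict;
use warnings;

sub entropy {
  my %prob;
  $prob{$_}++ foreach @_;
  $_ /= @_ foreach values %prob;
  my $sum = 0;
  $sum += $_ * log($_) foreach values %prob;
  return -$sum / log(2);
}

printf "%.2f %.2f %.2f\n",
       entropy(qw(yes yes no no)),   # 50/50 split: 1 bit
       entropy(qw(a b c d)),         # uniform 4-way: 2 bits
       entropy(qw(yes yes yes no));  # 3/1 split: ~0.81 bits
```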


Decision Trees

  • Si is the subset of S whose instances have value i for attribute I

  • IG is the original entropy minus the entropy remaining after attribute I is known:

    Gain(S, I) = Entropy(S) - Σi (|Si| / |S|) Entropy(Si)

  • Find argmaxI Gain(S, I) at each splitting node

  • To maximize IG, we can just minimize the second term on the right, since Entropy(S) is constant

  • This is the ID3 algorithm (J. R. Quinlan, 1986)
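The gain computation above can be sketched in a few lines of Perl (the entropy sub from the earlier slide is repeated so this runs standalone). The toy instances are hypothetical: each is [outlook, play?], and outlook fully determines the label, so its gain equals the full starting entropy.

```perl
# Information gain for one attribute over a set of instances.
use strict;
use warnings;

sub entropy {
  my %prob;
  $prob{$_}++ foreach @_;
  $_ /= @_ foreach values %prob;
  my $sum = 0;
  $sum += $_ * log($_) foreach values %prob;
  return -$sum / log(2);
}

# Gain(S, I) = Entropy(S) - sum_i |Si|/|S| * Entropy(Si)
sub gain {
    my ($instances, $attr_index) = @_;
    my @labels = map { $_->[-1] } @$instances;
    my %subset;   # attribute value => list of labels in that subset
    push @{ $subset{ $_->[$attr_index] } }, $_->[-1] for @$instances;
    my $remainder = 0;
    for my $labels (values %subset) {
        $remainder += @$labels / @$instances * entropy(@$labels);
    }
    return entropy(@labels) - $remainder;
}

my @instances = (
    ['sunny',    'no' ],
    ['sunny',    'no' ],
    ['overcast', 'yes'],
    ['rain',     'yes'],
);

printf "Gain = %.2f\n", gain(\@instances, 0);  # Gain = 1.00
```

ID3 simply calls this for every remaining attribute at each node and splits on the winner.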

Decision Trees

  • Decision trees in Perl are available with AI::DecisionTree (on CPAN)

  • Very simple OO interface

  • Currently implements ID3

    • Handles either consistent or noisy input

    • Can post-prune trees using a Minimum Message Length criterion

    • Doesn’t do cross-validation

    • Doesn’t handle continuous data

  • More robust feature sets are needed - patches welcome!

Decision Trees - Example

use AI::DecisionTree;
my $dtree = new AI::DecisionTree;

# Add training instances
$dtree->add_instance
  (attributes => {outlook => 'sunny',
                  temperature => 'hot',
                  humidity => 'high'},
   result => 'no');

$dtree->add_instance
  (attributes => {outlook => 'overcast',
                  temperature => 'hot',
                  humidity => 'normal'},
   result => 'yes');

# ... repeat for several more instances

Decision Trees - Example

# ... continued ...
$dtree->train;

# Find results for unseen instances
my $result = $dtree->get_result
  (attributes => {outlook => 'sunny',
                  temperature => 'hot',
                  humidity => 'normal'});
print "Result: $result\n";


Support Vector Machines (SVMs)

  • Another ML technique

  • Measures features quantitatively, induces a vector space

  • Finds the optimal decision surface


  • Data may be inseparable

  • Same algorithms usually work, find “best” surface

  • Different surface shapes may be used

  • Usually scales well with number of features, poorly with number of examples

SVMs - Example

use Algorithm::SVM;
use Algorithm::SVM::DataSet;

# Collect & format the data:
my @data;
for (...) {
  push @data, Algorithm::SVM::DataSet->new
    (Label => $foo,
     Data  => \@bar);
}

# Train the SVM:
my $svm = Algorithm::SVM->new(Kernel => 'linear');
$svm->train(@data);

... continued ...

SVMs - Example

my $test = Algorithm::SVM::DataSet->new
  (Label => undef,
   Data  => \@baz);

my $result = $svm->predict($test);
print "Predicted: $result\n";

Text Categorization

  • Text categorization, and categorization in general, is an extremely powerful ML technique

  • Generalizes well to many areas

    • Document management

    • Information Retrieval

    • Gene/protein identification

    • Spam filtering

  • Fairly simple concept

  • Lots of technical challenges

Text Categorization

  • AI::Categorizer (sequel to AI::Categorize) on CPAN

  • Addresses lots of tasks in text categorization

    • Format of documents (XML, text, database, etc.)

    • Support for structured documents (title, body, etc.)

    • Tokenizing of data into words

    • Linguistic stemming

    • Feature selection (1-grams, n-grams, statistically chosen)

    • Vector space modeling (TF/IDF methods)

    • Machine learning algorithm (Naïve Bayes, SVM, DecisionTree, kNN, etc.)

    • Machine learning parameters (different in each algorithm)

    • Hypothesis behavior (best-category only, or all matching categories)

AI::Categorizer Framework

  • KnowledgeSet embodies a set of documents and categories

  • Document is a (possibly structured) set of text data, belonging to 1 or more categories

  • Category is a named set containing 1 or more documents

  • Collection is a storage medium for document and category information (as text files, in DBI, XML files, etc.)

  • FeatureVector maps features (words) to weights (counts)

  • Learner is a ML algorithm class (Naïve Bayes, kNN, Decision Tree, etc.)

  • Hypothesis is the learner’s “best guess” about document categories

  • Experiment collects and analyzes hypotheses

Using AI::Categorizer

  • Highest-level interface

    use AI::Categorizer;

    my $c = new AI::Categorizer(...parameters...);

    # Run a complete experiment - training on a
    # corpus, testing on a test set, printing a
    # summary of results to STDOUT
    $c->run_experiment;


Using AI::Categorizer

  • More detailed:

    use AI::Categorizer;

    my $c = new AI::Categorizer(...parameters...);

    # Run the separate parts of $c->run_experiment
    $c->scan_features;
    $c->read_training_set;
    $c->train;
    $c->evaluate_test_set;

    print $c->stats_table;

Using AI::Categorizer

  • In an application:

    # After training, use learner for categorizing
    my $l = $c->learner;
    while (...) {
      my $d = ...create a document...
      my $h = $l->categorize($d);
      print "Best category: ", $h->best_category;
    }


Using AI::Categorizer

  • Uses the Class::Container package, so all parameters can go to the top-level object constructor:

    my $c = new AI::Categorizer
      (save_progress => 'my_progress',
       data_root => 'my_data',
       features_kept => 10_000,
       threshold => 0.1,
       ...);


Using AI::Categorizer

  • In the original slide, arrows routed each of those constructor parameters to the class that consumes it - Categorizer, KnowledgeSet, or Learner

  • (AI::Categorizer needn’t know about these, it’s transparent)

Naïve Bayes Categorization

  • Simple, fast machine learning technique

  • Let c1…cm represent all categories, and w1…wn represent the words of a given document; we want the category ci maximizing p(ci | w1…wn)

  • Estimating that probability directly is computationally infeasible - the data is too sparse

Naïve Bayes Categorization

  • Apply Bayes’ Theorem: p(ci | w1…wn) = p(w1…wn | ci) p(ci) / p(w1…wn)

  • Assume the words occur independently given the category (the “naïve” part), so p(w1…wn | ci) = Πj p(wj | ci)

Naïve Bayes Categorization

  • The quantities p(ci) and p(wj|ci) can be calculated from training set

  • p(ci) is fraction of training set belonging to category ci

  • p(wj|ci) is fraction of words in ci that are wj

  • Must deal with unseen words - we don’t want any p(wj|ci) to be zero

  • Typically we pretend unseen words have been seen 0.5 times, or use some similar strategy
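The whole scheme, including the "pretend unseen words were seen 0.5 times" trick, fits in a short sketch. The two training categories and their words are hypothetical; log-probabilities are used to avoid underflow on long documents.

```perl
# Minimal Naïve Bayes sketch with 0.5 pseudo-counts for unseen words.
use strict;
use warnings;

# Hypothetical training data: category => words seen in its documents
my %training = (
    spam => [qw(buy cheap buy)],
    ham  => [qw(meeting notes meeting)],
);

# Count words per category
my (%count, %total);
for my $cat (keys %training) {
    for my $w (@{ $training{$cat} }) {
        $count{$cat}{$w}++;
        $total{$cat}++;
    }
}
my $ncat = keys %count;

sub classify {
    my @doc = @_;
    my ($best, $best_score);
    for my $cat (keys %count) {
        my $score = log(1 / $ncat);             # p(ci): uniform prior here
        for my $w (@doc) {
            my $seen = $count{$cat}{$w} || 0.5; # unseen word => "seen" 0.5 times
            $score += log($seen / $total{$cat});
        }
        ($best, $best_score) = ($cat, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

print classify(qw(buy cheap)), " ", classify(qw(meeting notes)), "\n";  # spam ham
```

Summing logs of p(wj|ci) is the same argmax as multiplying the probabilities, but it stays numerically stable even for documents with thousands of words.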

Naïve Bayes Sample Run

ken> perl eg/ [options]


  • Ken Williams: [email protected] or [email protected]

  • Perl-AI list: [email protected]

  • AI::Categorizer, AI::DecisionTree, Statistics::Contingency, Inline::C, PDL, Storable all on CPAN

  • libbow:

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999

Extras, Time Permitting

  • AI::Categorizer parameters by class

  • AI::DecisionTree example

  • PDL::Sparse walkthrough

  • AI::NodeLib (incomplete implementation)