An Introduction to Machine Learning with Perl


An Introduction to Machine Learning with Perl

February 3, 2003

O’Reilly Bioinformatics Conference

Ken Williams

[email protected]


Tutorial Overview

  • What is Machine Learning? (20’)

  • Why use Perl for ML? (15’)

  • Some theory (20’)

  • Some tools (30’)

  • Decision trees (20’)

  • SVMs (15’)

  • Categorization (40’)


References & Sources

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414 pp., 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999

  • Perl-AI list ([email protected])


What Is Machine Learning?

  • A subfield of Artificial Intelligence (but without the baggage)

  • Usually concerns some particular task, not the building of a sentient robot

  • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience


Typical ML Tasks

  • Clustering

  • Categorization

  • Recognition

  • Filtering

  • Game playing

  • Autonomous performance


Typical ML Tasks

  • Recognition (faces):

Vincent Van Gogh

Michael Stipe

Muhammad Ali

Ken Williams

Burl Ives

Winston Churchill

Grover Cleveland

Typical ML Tasks

  • Recognition (song lyrics):

“Little Red Corvette”

“The Kids Are Alright”

“The rain in Spain”

“Bort bort bort”


Typical ML Buzzwords

  • Data Mining

  • Knowledge Management (KM)

  • Information Retrieval (IR)

  • Expert Systems

  • Topic detection and tracking


Who does ML?

  • Two main groups: research and industry

  • These groups do listen to each other, at least somewhat

  • Not many reusable ML/KM components, outside of a few commercial systems

  • KM is seen as a key component of big business strategy - lots of KM consultants

  • ML is an extremely active research area with relatively low “cost of entry”


When is ML useful?

  • When you have lots of data

  • When you can’t hire enough people, or when people are too slow

  • When you can afford to be wrong sometimes

  • When you need to find patterns

  • When you have nothing to lose


An aside on your presenter

  • Academic background in math & music (not computer science or even statistics)

  • Several years as a Perl consultant

  • Two years as a math teacher

  • Currently studying document categorization at The University of Sydney

  • In other words, a typical ML student


Why use Perl for ML?

  • CPAN - the viral solution™

  • Perl has rapid reusability

  • Perl is widely deployed

  • Perl code can be written quickly

  • Embeds both ways

  • Human-oriented development

  • Leaves your options open


But what about all the data?

  • ML techniques tend to use lots of data in complicated ways

  • Perl is great at data in general, but tends to gobble memory or forgo strict checking

  • Two fine solutions exist:

    • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.)

    • Use PDL or Inline (more on these later)


Interfaces vs. Implementations

  • In ML applications, we need both data integrity and the ability to “play with it”

  • Perl wrappers around C/C++ structures/objects are a nice balance

  • Keeps high-level interfaces in Perl, low-level implementations in C/C++

  • Can be prototyped in pure Perl, with C/C++ parts added later


Some ML Theory and Terminology

  • ML concerns learning a target function from a set of examples

  • A candidate for the target function is called a hypothesis

  • Example: with a Neural Network, a trained network is a hypothesis

  • The set of all hypotheses a learner can consider is called the hypothesis space

  • The training process can be considered a search through the hypothesis space


Some ML Theory and Terminology

  • Each ML technique will

    • probably exclude some hypotheses

    • prefer some hypotheses over others

  • A technique’s exclusion & preference rules are called its inductive bias

  • If it ain’t biased, it ain’t learnin’

    • No bias = rote learning

    • Bias = generalization

  • Example: kids learning multiplication (understanding vs. memorization)


Some ML Theory and Terminology

  • Ideally, an ML technique will

    • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis

    • prefer the target hypothesis over the others

  • Measuring the degree to which these criteria are satisfied is important and sometimes complicated


Evaluating Hypotheses

  • We often want to know how good a hypothesis is

    1. To know how it performs in the real world

    2. To improve the learning technique or tune its parameters

    3. To let a learner automatically improve the hypothesis

  • Usually evaluate on test data

    • Test data must be kept separate from training data

    • Test data used for purpose 3 is usually called validation or held-out data

    • Training, validation, and test data should not contaminate each other


Evaluating Hypotheses

  • Some standard statistical measures are useful

  • Error rate, accuracy, precision, recall, F1

  • Calculated using contingency tables

Evaluating Hypotheses

  • In the 2×2 contingency table: a = true positives (assigned and correct), b = false positives, c = false negatives, d = true negatives

  • Error = (b+c)/(a+b+c+d)

  • Accuracy = (a+d)/(a+b+c+d)

  • Precision = p = a/(a+b)

  • Recall = r = a/(a+c)

  • F1 = 2pr/(p+r)

Precision is easy to maximize by assigning nothing

Recall is easy to maximize by assigning everything

F1 combines precision and recall equally
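As a worked check of the formulas above — using hypothetical counts a=20, b=5, c=10, d=65, not figures from the tutorial — the metrics take only a few lines of Perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical contingency counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives
my ($a, $b, $c, $d) = (20, 5, 10, 65);
my $total = $a + $b + $c + $d;

my $error     = ($b + $c) / $total;              # wrong assignments / all
my $accuracy  = ($a + $d) / $total;              # right assignments / all
my $precision = $a / ($a + $b);                  # right / all assigned
my $recall    = $a / ($a + $c);                  # right / all that should be assigned
my $f1        = 2 * $precision * $recall / ($precision + $recall);

printf "Error=%.3f Accuracy=%.3f P=%.3f R=%.3f F1=%.3f\n",
       $error, $accuracy, $precision, $recall, $f1;
```

With these counts it prints Error=0.150 Accuracy=0.850 P=0.800 R=0.667 F1=0.727 — note how F1 sits between precision and recall, closer to the lower of the two.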

Evaluating Hypotheses

  • Example (from categorization)

  • Note that precision is higher than recall, indicating a cautious categorizer

Precision = 0.851, Recall = 0.711, F1 = 0.775

These scores depend on the task - can’t compare scores across tasks

Often useful to compare categories separately, then average (macro-averaging)
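A small sketch of that difference, using two hypothetical categories (counts invented for illustration): macro-averaging scores each category and then averages, so every category counts equally; micro-averaging pools the counts first, so large categories dominate:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-category contingency counts (a=TP, b=FP, c=FN)
my %counts = (
    big_cat   => { a => 9, b => 1, c => 1 },
    small_cat => { a => 1, b => 1, c => 1 },
);

# F1 = 2pr/(p+r) simplifies to 2a/(2a+b+c)
sub f1 { my ($a, $b, $c) = @_; return 2*$a / (2*$a + $b + $c) }

# Macro-averaging: score each category, then average the scores
my @per_cat = map { f1(@{$counts{$_}}{qw(a b c)}) } sort keys %counts;
my $macro = 0;
$macro += $_ / @per_cat foreach @per_cat;

# Micro-averaging: pool the counts, then score once
my ($A, $B, $C) = (0, 0, 0);
for (values %counts) { $A += $_->{a}; $B += $_->{b}; $C += $_->{c} }
my $micro = f1($A, $B, $C);

printf "Macro F1 = %.3f, Micro F1 = %.3f\n", $macro, $micro;
# Macro F1 = 0.700, Micro F1 = 0.833 - micro-averaging favors the big category
```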


Evaluating Hypotheses

  • The Statistics::Contingency module (on CPAN) helps calculate these figures:

    use Statistics::Contingency;
    my $s = new Statistics::Contingency;
    while (...) {
        ... Do some categorization ...
        $s->add_result($assigned, $correct);
    }
    print "Micro F1: ", $s->micro_F1, "\n";
    print $s->stats_table;

    Micro F1: 0.774803607797498


    | miR miP miF1 maR maP maF1 Err |

    | 0.243 0.843 0.275 0.711 0.851 0.775 0.006 |



Useful Perl Data-Munging Tools

  • Storable - cheap persistence and cloning

  • PDL - helps performance and design

  • Inline::C - tight loops and interfaces



Storable

  • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter)

  • Allows saving structures to disk:

    use Storable qw(store retrieve dclone);
    store($x, $filename);
    $x = retrieve($filename);

  • Allows cloning of structures:

    $y = dclone($x);

  • Not terribly interesting, but handy
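A minimal, self-contained sketch of both operations (the hash contents and the use of File::Temp are illustrative, not from the slides):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(store retrieve dclone);
use File::Temp qw(tempfile);

my $x = { name => 'iris', counts => [3, 1, 4] };

# Round-trip the structure through a temporary file
my (undef, $filename) = tempfile(UNLINK => 1);
store($x, $filename);
my $copy = retrieve($filename);
print "Restored name: $copy->{name}\n";      # iris

# dclone makes a deep copy - mutating the clone leaves the original alone
my $y = dclone($x);
$y->{counts}[0] = 99;
print "Original still: $x->{counts}[0]\n";   # 3
```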

PDL


  • Perl Data Language

  • On CPAN, of course (PDL-2.3.4.tar.gz)

  • Turns Perl into a data-processing language similar to Matlab

  • Native C/Fortran numerical handling

  • Compact multi-dimensional arrays

  • Still Perl at highest level


PDL demo

PDL experimentation shell:

ken% perldl

perldl> demo pdl


Extending PDL

  • PDL has extension language PDL::PP

  • Lets you write C extensions to PDL

  • Handles many gory details (data types, loop indexes, “threading”)


Extending PDL

  • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]

Extending PDL

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);


Extending PDL

  • The same routine, generic over PDL’s data types:

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);

Inline::C


  • Allows very easy embedding of C code in Perl modules

  • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl

  • Considered much easier than XS or SWIG

  • Developers are very enthusiastic and helpful


Inline::C basic syntax

  • A complete Perl script using Inline:

(taken from Inline docs)



use Inline C => q{
  void greet() { printf("Hello, world\n"); }
};

greet();


Inline::C for writing functions

  • Find next prime number greater than $x


    foreach (-2.7, 29, 30.33, 100_000) {
        print "$_: ", next_prime($_), "\n";
    }

    . . .


Inline::C for writing functions

use Inline C => q{
  int next_prime(double in) {
    // Implements a Sieve of Eratosthenes
    int *is_prime;
    int i, j;
    int candidate = ceil(in);
    if (in < 2.0) return 2;
    is_prime = malloc(2 * candidate * sizeof(int));
    for (i = 0; i < 2*candidate; i++) is_prime[i] = 1;

. . .


Inline::C for writing functions

    for (i = 2; i < 2*candidate; i++) {
      if (!is_prime[i]) continue;
      if (i >= candidate) { free(is_prime); return i; }
      for (j = i; j < 2*candidate; j += i) is_prime[j] = 0;
    }
    return 0; // Should never get here
  }
};



Inline::C for wrapping libraries

  • We’ll create a wrapper for ‘libbow’, an IR package

  • Contains an implementation of the Porter word-stemming algorithm (i.e., the stem of 'trying' is 'try')

# A Perlish interface:

$stem = stem_porter($word);

# A C-like interface:

$changed = stem_porter_inplace($word);

Inline::C for wrapping libraries

package Bow::Inline;

use strict;

use Exporter;

use vars qw($VERSION @ISA @EXPORT_OK);


$VERSION = '0.01';


@ISA = qw(Exporter);

@EXPORT_OK = qw(stem_porter stem_porter_inplace);

. . .


Inline::C for wrapping libraries

use Inline (C => 'DATA',
            NAME    => 'Bow::Inline',
            VERSION => '0.01',
            LIBS    => '-L/tmp/bow/lib -lbow',
            INC     => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );

. . .


Inline::C for wrapping libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
  int retval;
  char* ptr = SvPV_nolen(word);
  retval = bow_stem_porter(ptr);
  SvCUR_set(word, strlen(ptr));
  return retval;
}

. . .


Inline::C for wrapping libraries

// A Perlish interface
char* stem_porter (char* word) {
  if (!bow_stem_porter(word)) return (char*) &PL_sv_undef;
  return word;
}
// Don't know what the hell these are for in libbow,

// but it needs them.

const char *argp_program_version = "foo 1.0";

const char *program_invocation_short_name = "foofy";


When to use speed tools

  • A word of caution - don’t use C or PDL before you need to

  • Plain Perl is great for most tasks and usually pretty fast

  • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches


Decision Trees

  • Conceptually simple

  • Fast evaluation

  • Scrutable structures

  • Can be learned from training data

  • Can be difficult to build

  • Can “overfit” training data

  • Usually prefer simpler (i.e., smaller) trees


Decision Trees

  • Sample training data: Mitchell’s (1997) “play tennis” weather set — each day records Outlook, Temperature, Humidity, and Wind, plus a yes/no outcome


Decision Trees

  • How do we build the tree from the training data?

  • We want to make the smallest possible trees

  • Which attribute (Outlook, Wind, etc.) is the best classifier?

  • We need a measurement of how much information a given attribute contributes toward the outcome.

  • We use information gain (IG), which is based on the entropy of the training instances.

  • The attribute with the highest IG is the “most helpful” classifier, and reduces entropy the most.

Decision Trees

  • Entropy comes from Information Theory, invented by Claude Shannon

  • Measures the uncertainty of a decision between alternate options

  • The expected number of bits needed to specify the value of an attribute:

    Entropy(S) = −Σi pi · log2(pi)

  • Here i ranges over the attribute’s values, and pi is the probability of seeing value i

Decision Trees

sub entropy {
    my %prob;
    $prob{$_}++ foreach @_;
    $_ /= @_ foreach values %prob;
    my $sum = 0;
    $sum += $_ * log($_) foreach values %prob;
    return -$sum / log(2);
}
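Applying the sub above (repeated here so the snippet stands alone) to a column of nine ‘yes’ and five ‘no’ outcomes — the class distribution in Mitchell’s play-tennis set — gives the familiar 0.940 bits:

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub entropy {
    my %prob;
    $prob{$_}++ foreach @_;
    $_ /= @_ foreach values %prob;
    my $sum = 0;
    $sum += $_ * log($_) foreach values %prob;
    return -$sum / log(2);
}

# Nine positive and five negative outcomes
my @outcomes = (('yes') x 9, ('no') x 5);
printf "Entropy: %.3f bits\n", entropy(@outcomes);          # 0.940

# Extremes: a sure thing has zero entropy, a coin flip has one bit
printf "Pure:    %.3f bits\n", entropy(('yes') x 5);        # 0.000
printf "50/50:   %.3f bits\n", entropy(qw(yes no yes no));  # 1.000
```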
Decision Trees

  • Gain(S, A) = Entropy(S) − Σv (|Sv|/|S|) · Entropy(Sv), where Sv is the subset of S whose attribute A has value v

  • Information gain (IG) is the original entropy minus the entropy remaining once attribute A’s value is known

  • Find argmaxA Gain(S, A) at each splitting node

  • To maximize IG, we can just minimize the second term on the right, since Entropy(S) is constant

  • This is the ID3 algorithm (J. R. Quinlan, 1986)
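A sketch of Gain(S, A) built on the entropy sub from the earlier slide. The generic gain routine is an illustration (not code from AI::DecisionTree), fed with the Wind attribute from Mitchell’s play-tennis data, where weak wind covers 6 yes / 2 no days and strong wind 3 yes / 3 no:

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub entropy {
    my %prob;
    $prob{$_}++ foreach @_;
    $_ /= @_ foreach values %prob;
    my $sum = 0;
    $sum += $_ * log($_) foreach values %prob;
    return -$sum / log(2);
}

# Gain(S, A): entropy of all outcomes minus the weighted entropy of
# each subset sharing an attribute value. @$instances holds
# [attribute_value, outcome] pairs.
sub gain {
    my ($instances) = @_;
    my (%subset, @outcomes);
    for my $i (@$instances) {
        push @{ $subset{ $i->[0] } }, $i->[1];
        push @outcomes, $i->[1];
    }
    my $remainder = 0;
    $remainder += @$_ / @outcomes * entropy(@$_) for values %subset;
    return entropy(@outcomes) - $remainder;
}

# The Wind attribute: weak wind 6 yes / 2 no, strong wind 3 yes / 3 no
my @wind = ((map [qw(weak yes)],   1..6), (map [qw(weak no)],   1..2),
            (map [qw(strong yes)], 1..3), (map [qw(strong no)], 1..3));

printf "Gain(S, Wind) = %.3f\n", gain(\@wind);   # 0.048
```

The low gain (0.048 bits out of 0.940) is why ID3 splits on Outlook, not Wind, at the root of this tree.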


Decision Trees

  • Decision trees in Perl are available with AI::DecisionTree (on CPAN)

  • Very simple OO interface

  • Currently implements ID3

    • Handles either consistent or noisy input

    • Can post-prune trees using a Minimum Message Length criterion

    • Doesn’t do cross-validation

    • Doesn’t handle continuous data

  • More robust feature sets are needed - patches welcome!

Decision Trees - Example

use AI::DecisionTree;
my $dtree = new AI::DecisionTree;

# Add training instances
$dtree->add_instance
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'high'},
     result => 'no');

$dtree->add_instance
    (attributes => {outlook     => 'overcast',
                    temperature => 'hot',
                    humidity    => 'normal'},
     result => 'yes');

# ... repeat for several more instances

Decision Trees - Example

# ... continued ...

# Train on the instances added above
$dtree->train;

# Find results for unseen instances
my $result = $dtree->get_result

(attributes => {outlook => 'sunny',

temperature => 'hot',

humidity => 'normal'});

print "Result: $result\n";

Support Vector Machines (SVMs)


  • Another ML technique

  • Measures features quantitatively, induces a vector space

  • Finds the optimal decision surface

Support Vector Machines (SVMs)


  • Data may be inseparable

  • Same algorithms usually work, find “best” surface

  • Different surface shapes may be used

  • Usually scales well with number of features, poorly with number of examples


SVMs - Example

use Algorithm::SVM;

use Algorithm::SVM::DataSet;

# Collect & format the data:

my @data;

for (...) {
    push @data, Algorithm::SVM::DataSet->new
        ( Label => $foo,
          Data  => \@bar );
}

# Train the SVM:
my $svm = Algorithm::SVM->new(Kernel => 'linear');
$svm->train(@data);

... continued ...


SVMs - Example

my $test = Algorithm::SVM::DataSet->new

( Label => undef,

Data => \@baz, );


my $result = $svm->predict($test);

print "Predicted: $result\n";


Text Categorization

  • Text categorization, and categorization in general, is an extremely powerful ML technique

  • Generalizes well to many areas

    • Document management

    • Information Retrieval

    • Gene/protein identification

    • Spam filtering

  • Fairly simple concept

  • Lots of technical challenges


Text Categorization

  • AI::Categorizer (sequel to AI::Categorize) on CPAN

  • Addresses lots of tasks in text categorization

    • Format of documents (XML, text, database, etc.)

    • Support for structured documents (title, body, etc.)

    • Tokenizing of data into words

    • Linguistic stemming

    • Feature selection (1-grams, n-grams, statistically chosen)

    • Vector space modeling (TF/IDF methods)

    • Machine learning algorithm (Naïve Bayes, SVM, DecisionTree, kNN, etc.)

    • Machine learning parameters (different in each algorithm)

    • Hypothesis behavior (best-category only, or all matching categories)


AI::Categorizer Framework


AI::Categorizer Framework

  • KnowledgeSet embodies a set of documents and categories


AI::Categorizer Framework

  • Document is a (possibly structured) set of text data, belonging to 1 or more categories


AI::Categorizer Framework

  • Category is a named set containing 1 or more documents


AI::Categorizer Framework

  • Collection is a storage medium for document and category information (as text files, in DBI, XML files, etc.)


AI::Categorizer Framework

  • Feature Vector maps features (words) to weights (counts)


AI::Categorizer Framework

  • Learner is a ML algorithm class (Naïve Bayes, kNN, Decision Tree, etc.)


AI::Categorizer Framework

  • Hypothesis is the learner’s “best guess” about document categories


AI::Categorizer Framework

  • Experiment collects and analyzes hypotheses


Using AI::Categorizer

  • Highest-level interface

    use AI::Categorizer;

    my $c = new AI::Categorizer(...parameters...);

    # Run a complete experiment - training on a
    # corpus, testing on a test set, printing a
    # summary of results to STDOUT
    $c->run_experiment;

Using AI::Categorizer

  • More detailed:

    use AI::Categorizer;

    my $c = new AI::Categorizer(...parameters...);

    # Run the separate parts of $c->run_experiment
    $c->scan_features;
    $c->read_training_set;
    $c->train;
    $c->evaluate_test_set;
    print $c->stats_table;


Using AI::Categorizer

  • In an application:

    # After training, use learner for categorizing
    my $l = $c->learner;
    while (...) {
        my $d = ...create a document...
        my $h = $l->categorize($d);
        print "Best category: ", $h->best_category, "\n";
    }

Using AI::Categorizer

  • Uses the Class::Container package, so all parameters can go to the top-level object constructor:

    my $c = new AI::Categorizer
        (save_progress => 'my_progress',
         data_root     => 'my_data',
         features_kept => 10_000,
         threshold     => 0.1,
        );

    (AI::Categorizer needn’t know about these, it’s transparent)

(Framework diagram: Class::Container routes each parameter to the Categorizer, KnowledgeSet, or Learner that declares it)


Naïve Bayes Categorization

  • Simple, fast machine learning technique

  • Let c1…cm represent all categories, and w1…wn represent the words of a given document

  • We want the most probable category, argmaxi p(ci | w1,…,wn)

  • Estimating that term directly is computationally infeasible - the data is too sparse


Naïve Bayes Categorization

  • Apply Bayes’ Theorem:

    p(ci | w1,…,wn) ∝ p(ci) · p(w1,…,wn | ci)

  • Then make the “naïve” independence assumption:

    p(w1,…,wn | ci) ≈ Πj p(wj | ci)


Naïve Bayes Categorization

  • The quantities p(ci) and p(wj|ci) can be calculated from the training set

  • p(ci) is the fraction of training documents belonging to category ci

  • p(wj|ci) is the fraction of the words in ci that are wj

  • Must deal with unseen words, we don’t want any p(wj|ci) to be zero

  • Typically we pretend unseen words have been seen 0.5 times, or use some similar strategy
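A minimal Naïve Bayes sketch tying these pieces together. The toy corpus is invented for illustration, the 0.5 pseudo-count handles unseen words as the slide describes, and scores are summed in log space to avoid underflow:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A toy training corpus (hypothetical): category => list of documents
my %train = (
    sport   => ['ball goal team', 'team win ball'],
    finance => ['stock market win', 'stock price market'],
);

# Count words per category
my (%word_count, %cat_words, $n_docs);
while (my ($cat, $docs) = each %train) {
    $n_docs += @$docs;
    for my $doc (@$docs) {
        for my $w (split ' ', $doc) {
            $word_count{$cat}{$w}++;
            $cat_words{$cat}++;
        }
    }
}

# Score log p(c) + sum of log p(w|c); unseen words counted 0.5 times
sub categorize {
    my ($doc) = @_;
    my ($best, $best_score);
    for my $cat (sort keys %train) {
        my $score = log( @{$train{$cat}} / $n_docs );   # log p(c)
        for my $w (split ' ', $doc) {
            my $count = $word_count{$cat}{$w} || 0.5;
            $score += log( $count / $cat_words{$cat} ); # log p(w|c)
        }
        ($best, $best_score) = ($cat, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

print categorize('ball team'),    "\n";   # sport
print categorize('stock market'), "\n";   # finance
```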


Naïve Bayes Sample Run

ken> perl eg/ [options]



  • Ken Williams: [email protected] or [email protected]

  • Perl-AI list: [email protected]

  • AI::Categorizer, AI::DecisionTree, Statistics::Contingency, Inline::C, PDL, Storable all on CPAN

  • libbow:

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414 pp., 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999


Extras, time permitting

  • AI::Categorizer parameters by class

  • AI::DecisionTree example

  • PDL::Sparse walkthrough

  • AI::NodeLib (incomplete implementation)
