An Introduction to Machine Learning with Perl

February 3, 2003

O’Reilly Bioinformatics Conference

Ken Williams

ken@mathforum.org


Tutorial Overview

  • What is Machine Learning? (20’)

  • Why use Perl for ML? (15’)

  • Some theory (20’)

  • Some tools (30’)

  • Decision trees (20’)

  • SVMs (15’)

  • Categorization (40’)


References & Sources

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999

  • Perl-AI list (perl-ai@perl.org)


What Is Machine Learning?

  • A subfield of Artificial Intelligence (but without the baggage)

  • Usually concerns some particular task, not the building of a sentient robot

  • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience


Typical ML Tasks

  • Clustering

  • Categorization

  • Recognition

  • Filtering

  • Game playing

  • Autonomous performance


Typical ML Tasks

  • Recognition (faces): Vincent Van Gogh, Michael Stipe, Muhammad Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland


Typical ML Tasks

  • Recognition (speech): “Little red corvette”, “The kids are all right”, “The rain in Spain”, “Bort bort bort”


Typical ML Buzzwords

  • Data Mining

  • Knowledge Management (KM)

  • Information Retrieval (IR)

  • Expert Systems

  • Topic detection and tracking


Who does ML?

  • Two main groups: research and industry

  • These groups do listen to each other, at least somewhat

  • Not many reusable ML/KM components, outside of a few commercial systems

  • KM is seen as a key component of big business strategy - lots of KM consultants

  • ML is an extremely active research area with relatively low “cost of entry”


When is ML useful?

  • When you have lots of data

  • When you can’t hire enough people, or when people are too slow

  • When you can afford to be wrong sometimes

  • When you need to find patterns

  • When you have nothing to lose


An aside on your presenter

  • Academic background in math & music (not computer science or even statistics)

  • Several years as a Perl consultant

  • Two years as a math teacher

  • Currently studying document categorization at The University of Sydney

  • In other words, a typical ML student


Why use Perl for ML?

  • CPAN - the viral solution™

  • Perl has rapid reusability

  • Perl is widely deployed

  • Perl code can be written quickly

  • Embeds both ways

  • Human-oriented development

  • Leaves your options open


But what about all the data?

  • ML techniques tend to use lots of data in complicated ways

  • Perl is great at data in general, but tends to gobble memory or forgo strict checking

  • Two fine solutions exist:

    • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.; see the sketch below)

    • Use PDL or Inline (more on these later)
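For example, a minimal sketch of C-style argument discipline with Params::Validate (the train() routine and its parameters are invented for illustration):

    use Params::Validate qw(:all);

    sub train {
        my $self = shift;
        my %args = validate(@_, {
            examples => { type => ARRAYREF },               # required
            epochs   => { type => SCALAR, default => 10 },  # optional
        });
        # validate() croaks on missing, extra, or mistyped parameters
    }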


Interfaces vs. Implementations

  • In ML applications, we need both data integrity and the ability to “play with it”

  • Perl wrappers around C/C++ structures/objects are a nice balance

  • Keeps high-level interfaces in Perl, low-level implementations in C/C++

  • Can be prototyped in pure Perl, with C/C++ parts added later


Some ML Theory and Terminology

  • ML concerns learning a target function from a set of examples

  • The target function is often called a hypothesis

  • Example: with a neural network, a trained network is a hypothesis

  • The set of all possible target functions is called the hypothesis space

  • The training process can be considered a search through the hypothesis space


Some ML Theory and Terminology

  • Each ML technique will

    • probably exclude some hypotheses

    • prefer some hypotheses over others

  • A technique’s exclusion & preference rules are called its inductive bias

  • If it ain’t biased, it ain’t learnin’

    • No bias = rote learning

    • Bias = generalization

  • Example: kids learning multiplication (understanding vs. memorization)


Some ML Theory and Terminology

  • Ideally, an ML technique will

    • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis

    • prefer the target hypothesis over others

  • Measuring the degree to which these criteria are satisfied is important and sometimes complicated


Evaluating Hypotheses

  • We often want to know how good a hypothesis is

    • 1) To know how it performs in the real world

    • 2) To improve the learning technique or tune its parameters

    • 3) To let the learner automatically improve the hypothesis

  • Usually evaluate on test data

    • Test data must be kept separate from training data

    • Test data used for purpose 3) is usually called validation or held-out data.

    • Training, validation, and test data should not contaminate each other (see the split sketch below)
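One way to keep the three sets separate, sketched in plain Perl (the 70/15/15 proportions and the @instances array are illustrative assumptions):

    use List::Util qw(shuffle);

    my @data = shuffle @instances;   # randomize before splitting
    my $n    = @data;
    my @train      = @data[0 .. int(0.70 * $n) - 1];
    my @validation = @data[int(0.70 * $n) .. int(0.85 * $n) - 1];
    my @test       = @data[int(0.85 * $n) .. $n - 1];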


Evaluating Hypotheses

  • Some standard statistical measures are useful

  • Error rate, accuracy, precision, recall, F1

  • Calculated using contingency tables: a = assigned & correct, b = assigned & incorrect, c = not assigned but correct, d = not assigned & incorrect


Evaluating Hypotheses

  • Error = (b+c)/(a+b+c+d)

  • Accuracy = (a+d)/(a+b+c+d)

  • Precision = p = a/(a+b)

  • Recall = r = a/(a+c)

  • F1 = 2pr/(p+r)

Precision is easy to maximize by assigning nothing

Recall is easy to maximize by assigning everything

F1 combines precision and recall equally
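These five measures are one-liners in Perl; a minimal sketch, computed straight from the four contingency counts (the sub and variable names are illustrative):

    sub contingency_stats {
        # ($tp, $fp, $fn, $tn) correspond to (a, b, c, d) above
        my ($tp, $fp, $fn, $tn) = @_;
        my $total = $tp + $fp + $fn + $tn;
        my %s;
        $s{error}     = ($fp + $fn) / $total;
        $s{accuracy}  = ($tp + $tn) / $total;
        $s{precision} = $tp / ($tp + $fp);
        $s{recall}    = $tp / ($tp + $fn);
        $s{F1}        = 2 * $s{precision} * $s{recall}
                        / ($s{precision} + $s{recall});
        return %s;   # zero denominators (nothing assigned) left unhandled
    }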


Evaluating Hypotheses

  • Example (from categorization)

  • Note that precision is higher than recall - indicates a cautious categorizer

Precision = 0.851, Recall = 0.711, F1 = 0.775

These scores depend on the task - can’t compare scores across tasks

Often useful to score each category separately, then average the scores (macro-averaging)


Evaluating Hypotheses

  • The Statistics::Contingency module (on CPAN) helps calculate these figures:

    use Statistics::Contingency;
    my $s = new Statistics::Contingency;
    while (...) {
        ... Do some categorization ...
        $s->add_result($assigned, $correct);
    }
    print "Micro F1: ", $s->micro_F1, "\n";
    print $s->stats_table;

    Micro F1: 0.774803607797498
    +-------------------------------------------------+
    |   miR   miP  miF1   maR   maP  maF1   Err       |
    | 0.243 0.843 0.275 0.711 0.851 0.775 0.006       |
    +-------------------------------------------------+


Useful Perl Data-Munging Tools

  • Storable - cheap persistence and cloning

  • PDL - helps performance and design

  • Inline::C - tight loops and interfaces


Storable

  • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter)

  • Allows saving structures to disk:

    use Storable qw(store retrieve dclone);
    store($x, $filename);
    $x = retrieve($filename);

  • Allows cloning of structures:

    $y = dclone($x);

  • Not terribly interesting, but handy


PDL

  • Perl Data Language

  • On CPAN, of course (PDL-2.3.4.tar.gz)

  • Turns Perl into a data-processing language similar to Matlab

  • Native C/Fortran numerical handling

  • Compact multi-dimensional arrays

  • Still Perl at highest level
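For a small taste (a minimal sketch; the values are illustrative):

    use PDL;
    my $x = pdl(1, 2, 3, 4);   # compact, C-backed numeric array
    my $y = $x * 2 + 1;        # arithmetic is vectorized in C
    print "$y\n";              # [3 5 7 9]
    print $x->sum, "\n";       # 10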


PDL demo

PDL experimentation shell:

ken% perldl

perldl> demo pdl


Extending PDL

  • PDL has an extension language, PDL::PP

  • Lets you write C extensions to PDL

  • Handles many gory details (data types, loop indexes, “threading”)


Extending PDL

  • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]


Extending PDL

pp_def('sum_elements',
    Pars => 'a(n); [o]b();',
    Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
    tmp += $a();
%}
$b() = tmp;
EOF
);


Extending PDL

  • The same function made type-generic with the $GENERIC() macro, so it works for any PDL data type:

pp_def('sum_elements',
    Pars => 'a(n); [o]b();',
    Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
    tmp += $a();
%}
$b() = tmp;
EOF
);


Inline::C

  • Allows very easy embedding of C code in Perl modules

  • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl

  • Considered much easier than XS or SWIG

  • Developers are very enthusiastic and helpful


Inline::C basic syntax

  • A complete Perl script using Inline (taken from the Inline docs):

    #!/usr/bin/perl
    greet();   # already bound: 'use Inline' ran at compile time

    use Inline C => q{
        void greet() { printf("Hello, world\n"); }
    };


Inline::C for writing functions

  • Find the next prime number greater than $x:

    #!/usr/bin/perl
    foreach (-2.7, 29, 30.33, 100_000) {
        print "$_: ", next_prime($_), "\n";
    }
    . . .


Inline::C for writing functions

use Inline C => q{
    #include <math.h>     /* ceil() */
    #include <stdlib.h>   /* malloc(), free() */

    int next_prime(double in) {
        // Implements a Sieve of Eratosthenes
        int *is_prime;
        int i, j;
        int candidate = ceil(in);
        if (in < 2.0) return 2;
        is_prime = malloc(2 * candidate * sizeof(int));
        for (i = 0; i < 2*candidate; i++) is_prime[i] = 1;
. . .


Inline::C for writing functions

        for (i = 2; i < 2*candidate; i++) {
            if (!is_prime[i]) continue;
            if (i >= candidate) { free(is_prime); return i; }
            for (j = i; j < 2*candidate; j += i) is_prime[j] = 0;
        }
        free(is_prime);
        return 0;   // Should never get here
    }
};


Inline::C for wrapping libraries

  • We’ll create a wrapper for ‘libbow’, an IR package

  • Contains an implementation of the Porter word-stemming algorithm (i.e., the stem of 'trying' is 'try')

# A Perlish interface:
$stem = stem_porter($word);

# A C-like interface:
stem_porter_inplace($word);


Inline::C for wrapping libraries

package Bow::Inline;
use strict;
use Exporter;
use vars qw($VERSION @ISA @EXPORT_OK);

BEGIN {
    $VERSION = '0.01';
}

@ISA = qw(Exporter);
@EXPORT_OK = qw(stem_porter stem_porter_inplace);
. . .


Inline::C for wrapping libraries

use Inline (C => 'DATA',
            VERSION => $VERSION,
            NAME    => __PACKAGE__,
            LIBS    => '-L/tmp/bow/lib -lbow',
            INC     => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );
1;
__DATA__
__C__
. . .


Inline::C for wrapping libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
    int retval;
    char* ptr = SvPV_nolen(word);
    retval = bow_stem_porter(ptr);   // stems in place
    SvCUR_set(word, strlen(ptr));    // fix the SV's string length
    return retval;
}
. . .


Inline::C for wrapping libraries

// A Perlish interface: returns the stem, or undef on failure
char* stem_porter (char* word) {
    if (!bow_stem_porter(word)) return NULL;  // NULL maps to undef
    return word;
}

// Don't know what the hell these are for in libbow,
// but it needs them.
const char *argp_program_version = "foo 1.0";
const char *program_invocation_short_name = "foofy";


When to use speed tools

  • A word of caution - don’t use C or PDL before you need to

  • Plain Perl is great for most tasks and usually pretty fast

  • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches


Decision Trees

  • Conceptually simple

  • Fast evaluation

  • Scrutable structures

  • Can be learned from training data

  • Can be difficult to build

  • Can “overfit” training data

  • Usually prefer simpler, i.e. smaller trees


Decision Trees

  • Sample training data:


Decision Trees

  • How do we build the tree from the training data?

  • We want to make the smallest possible trees

  • Which attribute (Outlook, Wind, etc.) is the best classifier?

  • We need a measurement of how much information a given attribute contributes toward the outcome.

  • We use information gain (IG), which is based on the entropy of the training instances.

  • The attribute with the highest IG is the “most helpful” classifier, and reduces entropy the most.


Decision Trees

  • Entropy comes from Information Theory, invented by Claude Shannon

  • Measures the uncertainty of a decision between alternative options

  • It is the probabilistically expected number of bits necessary to specify the value of an attribute:

    Entropy(S) = -Σi pi log2(pi)

  • Here i ranges over the attribute’s values, and pi is the probability of seeing value i


Decision Trees

sub entropy {
    my %prob;
    $prob{$_}++ foreach @_;                      # count each outcome
    $_ /= @_ foreach values %prob;               # counts -> probabilities
    my $sum = 0;
    $sum += $_ * log($_) foreach values %prob;
    return -$sum / log(2);                       # convert to base 2: bits
}
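For instance, a 3-to-2 split of outcomes carries about 0.971 bits of uncertainty:

    print entropy(qw(yes yes yes no no)), "\n";   # about 0.971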


Decision Trees

  • Si are the subsets of S in which attribute I has value i

  • IG is the original entropy minus the entropy remaining once attribute I is known:

    Gain(S, I) = Entropy(S) - Σi (|Si| / |S|) Entropy(Si)

  • Find argmaxI Gain(S, I) at each splitting node

  • To maximize IG, we can just minimize the second term on the right, since Entropy(S) is constant

  • This is the ID3 algorithm (J. R. Quinlan, 1986); a Perl sketch of Gain follows
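A minimal sketch of Gain(S, I), reusing entropy() from the previous slide (it assumes each instance is a hashref of attribute values plus an 'outcome' key, an illustrative representation):

    sub gain {
        my ($attr, @instances) = @_;
        my $before = entropy(map $_->{outcome}, @instances);

        # Partition the outcomes by the value of attribute $attr
        my %subset;
        push @{ $subset{ $_->{$attr} } }, $_->{outcome} for @instances;

        # Entropy remaining after the split, weighted by subset size
        my $after = 0;
        $after += @$_ / @instances * entropy(@$_) for values %subset;

        return $before - $after;
    }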


Decision Trees

  • Decision trees in Perl are available with AI::DecisionTree (on CPAN)

  • Very simple OO interface

  • Currently implements ID3

    • Handles either consistent or noisy input

    • Can post-prune trees using a Minimum Message Length criterion

    • Doesn’t do cross-validation

    • Doesn’t handle continuous data

  • More robust feature sets are needed - patches welcome!


Decision Trees - Example

use AI::DecisionTree;
my $dtree = new AI::DecisionTree;

# Add training instances
$dtree->add_instance
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'high'},
     result => 'no');

$dtree->add_instance
    (attributes => {outlook     => 'overcast',
                    temperature => 'hot',
                    humidity    => 'normal'},
     result => 'yes');

# ... repeat for several more instances


Decision Trees - Example

# ... continued ...
$dtree->train;

# Find results for unseen instances
my $result = $dtree->get_result
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'normal'});
print "Result: $result\n";


SVMs

  • Another ML technique

  • Measures features quantitatively, induces a vector space

  • Finds the optimal decision surface


SVMs

  • Data may not be cleanly separable

  • The same algorithms usually still work, finding the “best” available surface

  • Different surface shapes (kernels) may be used

  • Usually scales well with the number of features, poorly with the number of examples


SVMs - Example

use Algorithm::SVM;
use Algorithm::SVM::DataSet;

# Collect & format the data:
my @data;
for (...) {
    push @data, Algorithm::SVM::DataSet->new
        ( Label => $foo,
          Data  => \@bar );
}

# Train the SVM:
my $svm = Algorithm::SVM->new(Kernel => 'linear');
$svm->train(@data);

... continued ...


SVMs - Example

my $test = Algorithm::SVM::DataSet->new
    ( Label => undef,
      Data  => \@baz );

my $result = $svm->predict($test);
print "Predicted: $result\n";


Text Categorization

  • Text categorization, and categorization in general, is an extremely powerful ML technique

  • Generalizes well to many areas

    • Document management

    • Information Retrieval

    • Gene/protein identification

    • Spam filtering

  • Fairly simple concept

  • Lots of technical challenges


Text Categorization

  • AI::Categorizer (sequel to AI::Categorize) on CPAN

  • Addresses lots of tasks in text categorization

    • Format of documents (XML, text, database, etc.)

    • Support for structured documents (title, body, etc.)

    • Tokenizing of data into words

    • Linguistic stemming

    • Feature selection (1-grams, n-grams, statistically chosen)

    • Vector space modeling (TF/IDF methods)

    • Machine learning algorithm (Naïve Bayes, SVM, DecisionTree, kNN, etc.)

    • Machine learning parameters (different in each algorithm)

    • Hypothesis behavior (best-category only, or all matching categories)


AI::Categorizer Framework

  • KnowledgeSet embodies a set of documents and categories


AI::Categorizer Framework

  • Document is a (possibly structured) set of text data, belonging to 1 or more categories


AI::Categorizer Framework

  • Category is a named set containing 1 or more documents


AI::Categorizer Framework

  • Collection is a storage medium for document and category information (as text files, in DBI, XML files, etc.)


AI::Categorizer Framework

  • Feature Vector maps features (words) to weights (counts)


AI::Categorizer Framework

  • Learner is a ML algorithm class (Naïve Bayes, kNN, Decision Tree, etc.)


AI::Categorizer Framework

  • Hypothesis is the learner’s “best guess” about document categories


AI::Categorizer Framework

  • Experiment collects and analyzes hypotheses


Using AI::Categorizer

  • Highest-level interface

    use AI::Categorizer;
    my $c = new AI::Categorizer(...parameters...);

    # Run a complete experiment - training on a
    # corpus, testing on a test set, printing a
    # summary of results to STDOUT
    $c->run_experiment;


Using AI::Categorizer

  • More detailed:

    use AI::Categorizer;
    my $c = new AI::Categorizer(...parameters...);

    # Run the separate parts of $c->run_experiment
    $c->scan_features;
    $c->read_training_set;
    $c->train;
    $c->evaluate_test_set;
    print $c->stats_table;


Using AI::Categorizer

  • In an application:

    # After training, use the learner for categorizing
    my $l = $c->learner;
    while (...) {
        my $d = ...create a document...
        my $h = $l->categorize($d);
        print "Best category: ", $h->best_category;
    }


Using AI::Categorizer

  • Uses the Class::Container package, so all parameters can go to the top-level object constructor:

    my $c = new AI::Categorizer
        (save_progress => 'my_progress',
         data_root     => 'my_data',
         features_kept => 10_000,
         threshold     => 0.1,
        );


Using AI::Categorizer

  • These parameters are actually declared by the contained classes (the Categorizer, KnowledgeSet, and Learner); Class::Container delivers each one to the class that declares it, so AI::Categorizer needn’t know about them - it’s transparent


Naïve Bayes Categorization

  • Simple, fast machine learning technique

  • Let c1…cm represent all categories, and w1…wn the words of a given document

  • We want the category ci that maximizes p(ci | w1, …, wn)

  • Estimating that term directly is computationally infeasible - the data is too sparse


Naïve Bayes Categorization

  • Apply Bayes’ Theorem, then assume words are independent given the category (the “naïve” part):

    p(ci | w1, …, wn) = p(ci) p(w1, …, wn | ci) / p(w1, …, wn) ∝ p(ci) Πj p(wj | ci)


Naïve Bayes Categorization

  • The quantities p(ci) and p(wj|ci) can be calculated from training set

  • p(ci) is fraction of training set belonging to category ci

  • p(wj|ci) is fraction of words in ci that are wj

  • Must deal with unseen words, since we don’t want any p(wj|ci) to be zero

  • Typically we pretend unseen words have been seen 0.5 times, or use some similar strategy (see the sketch below)
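A minimal Perl sketch of the whole scoring step, using the 0.5-count strategy (the training structures %prior, %count, and %cat_size are hypothetical names, assumed built from the training set; summing logs avoids floating-point underflow):

    sub best_category {
        my @words = @_;
        my ($best, $best_score);
        foreach my $cat (keys %prior) {
            my $score = log $prior{$cat};            # log p(ci)
            foreach my $w (@words) {                 # add each log p(wj|ci)
                my $seen = $count{$cat}{$w} || 0.5;  # unseen word: pretend 0.5
                $score += log($seen / $cat_size{$cat});
            }
            ($best, $best_score) = ($cat, $score)
                if !defined $best_score or $score > $best_score;
        }
        return $best;
    }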


Naïve Bayes Sample Run

ken> perl eg/run_experiment.pl [options]


References

  • Ken Williams: ken@mathforum.org or kenw@ee.usyd.edu.au

  • Perl-AI list: perl-ai@perl.org

  • AI::Categorizer, AI::DecisionTree, Statistics::Contingency, Inline::C, PDL, Storable all on CPAN

  • libbow: http://www.cs.cmu.edu/~mccallum/bow

  • Machine Learning, Tom Mitchell. McGraw-Hill, 414pp, 1997

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp., 1999


Extras, time permitting

  • AI::Categorizer parameters by class

  • AI::DecisionTree example

  • PDL::Sparse walkthrough

  • AI::NodeLib (incomplete implementation)

