
Graphical model software for machine learning

Kevin Murphy

University of British Columbia

December, 2005

Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models
Supervised learning as Bayesian inference

[Figure: graphical model. Training: observed pairs (X1, Y1), …, (Xn, Yn), …, (XN, YN) inside a plate of size N. Testing: a query pair (X*, Y*) with X* observed and Y* to be predicted.]
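In symbols, the Bayesian view predicts by integrating out the shared parameters (the standard posterior-predictive form; w is generic notation for the parameters implied by the plate):

$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, w)\, p(w \mid \mathcal{D})\, dw, \qquad \mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$$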

Supervised learning as optimization

[Figure: the same training/testing diagram as above: training pairs (Xn, Yn), n = 1…N, and a test pair (X*, Y*).]
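By contrast, the optimization view fits a single point estimate and plugs it in (standard maximum conditional likelihood form):

$$\hat{w} = \arg\max_w \sum_{n=1}^N \log p(y_n \mid x_n, w), \qquad p(y_* \mid x_*, \mathcal{D}) \approx p(y_* \mid x_*, \hat{w})$$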

Example: logistic regression
  • Let yn ∈ {1,…,C} be given by a softmax (written out below)
  • Maximize conditional log likelihood
  • “Max margin” solution
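In standard notation, the softmax and the conditional log-likelihood are (w_c is the weight vector for class c):

$$p(y_n = c \mid x_n, w) = \frac{\exp(w_c^\top x_n)}{\sum_{c'=1}^C \exp(w_{c'}^\top x_n)}, \qquad \ell(w) = \sum_{n=1}^N \log p(y_n \mid x_n, w)$$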
Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models
1D chain CRFs for sequence labeling

A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, yn ∈ {1,…,C}^m

[Figure: chain CRF. The labels Yn1, Yn2, …, Ynm form a chain, and each is connected to the input Xn. Edge potentials ψij sit on the chain edges; local evidence potentials φi link each label to Xn.]
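The corresponding factorization, in standard chain-CRF notation (φi is the local evidence, ψ the edge potential):

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{m} \phi_i(y_i, x) \prod_{i=1}^{m-1} \psi_{i,i+1}(y_i, y_{i+1}, x)$$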

2D Lattice CRFs for pixel labeling

A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψij are image dependent.

2D Lattice MRFs for pixel labeling

A Markov Random Field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using potentials ψij(yi, yj), and we also have a per-pixel generative model of observations P(xi|yi). The model thus combines local evidence P(xi|yi), potential functions ψij, and a partition function Z.
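Putting the pieces together, the standard form of this model is:

$$p(y, x) = \frac{1}{Z} \prod_{\langle i,j \rangle} \psi_{ij}(y_i, y_j) \prod_i P(x_i \mid y_i)$$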

Tree-structured CRFs
  • Used in parts-based object detection
  • Yi is location of part i in image

[Figure: face model with parts eyeL, eyeR, nose, mouth connected in a tree.]

Fischler & Elschlager, "The representation and matching of pictorial structures”, PAMI’73

Felzenszwalb & Huttenlocher, "Pictorial Structures for Object Recognition," IJCV’05

General CRFs
  • In general, the graph may have arbitrary structure
  • eg for collective web page classification, nodes = URLs, edges = hyperlinks
  • The potentials are in general defined on cliques, not just edges
Factor graphs

Square nodes = factors (potentials)

Round nodes = random variables

Graph structure = bipartite

Potential functions
  • For the local evidence, we can use a discriminative classifier (trained iid)
  • For the edge compatibilities, we can use a maxent/log-linear form with pre-defined features (see below)
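A typical maxent/log-linear edge potential, with pre-defined features f_k and weights λ_k (the exact features are application dependent):

$$\psi_{ij}(y_i, y_j) = \exp\Big(\sum_k \lambda_k\, f_k(y_i, y_j, x)\Big)$$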
Restricted potential functions
  • For some applications (esp in vision), we often use a Potts model of the form sketched below
  • We can generalize this for ordered labels (eg discretization of continuous states)
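One common choice is the Potts potential, together with a truncated-distance generalization for ordered labels (λ > 0 controls the smoothing strength, d_max the truncation; a sketch of the usual definitions):

$$\psi_{ij}(y_i, y_j) = \exp\big(-\lambda\, \mathbb{I}(y_i \neq y_j)\big), \qquad \psi_{ij}(y_i, y_j) = \exp\big(-\lambda \min(|y_i - y_j|,\, d_{\max})\big)$$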
Learning CRFs
  • The log-likelihood sums, over training cases and cliques, the log of a normalized product of clique potentials, with parameters tied across cliques
  • Its gradient takes a simple form: gradient = features – expected features (sketched below)
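With a log-linear parameterization and parameters θ tied across cliques c, the standard expressions are (f denotes the clique feature vector):

$$\ell(\theta) = \sum_n \Big[ \sum_c \theta^\top f(y_{n,c}, x_n) - \log Z(x_n, \theta) \Big]$$

$$\nabla_\theta \ell = \sum_n \sum_c \Big[ f(y_{n,c}, x_n) - \mathbb{E}_{p(y_c \mid x_n, \theta)}\, f(y_c, x_n) \Big]$$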

Learning CRFs
  • Given the gradient ∇ℓ, one can find the global optimum using first or second order optimization methods, such as
    • Conjugate gradient
    • Limited memory BFGS
    • Stochastic meta descent (SMD)?
  • The bottleneck is computing the expected features needed for the gradient
Exact inference
  • For 1D chains, one can compute the pairwise marginals P(yi, yi+1 | x) exactly in O(N K^2) time using belief propagation (BP = the forwards-backwards algorithm; sketched below)
  • For restricted potentials (eg the Potts form above), one can do this in O(N K) time using FFT-like tricks
  • This can be generalized to trees.
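A minimal NumPy sketch of forwards-backwards on a chain with a single shared edge potential (a toy illustration; the function and variable names are not from any of the packages listed later). Replacing the sums with maxes gives the max-product (Viterbi) variant discussed on the next slide.

import numpy as np

def chain_marginals(local_ev, edge_pot):
    # local_ev: (N, K) local evidence phi_i(y_i); edge_pot: (K, K) shared psi(y_i, y_{i+1}).
    # Returns node marginals (N, K) and pairwise marginals (N-1, K, K) in O(N K^2) time.
    N, K = local_ev.shape
    alpha = np.zeros((N, K))                      # forward messages
    beta = np.zeros((N, K))                       # backward messages
    alpha[0] = local_ev[0] / local_ev[0].sum()    # normalize for numerical stability
    for i in range(1, N):
        alpha[i] = local_ev[i] * (alpha[i - 1] @ edge_pot)
        alpha[i] /= alpha[i].sum()
    beta[-1] = 1.0
    for i in range(N - 2, -1, -1):
        beta[i] = edge_pot @ (local_ev[i + 1] * beta[i + 1])
        beta[i] /= beta[i].sum()
    node_marg = alpha * beta
    node_marg /= node_marg.sum(axis=1, keepdims=True)
    pair_marg = np.zeros((N - 1, K, K))
    for i in range(N - 1):
        m = np.outer(alpha[i], local_ev[i + 1] * beta[i + 1]) * edge_pot
        pair_marg[i] = m / m.sum()
    return node_marg, pair_marg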
Sum-product vs max-product
  • We use sum-product to compute marginal probabilities needed for learning
  • We use max-product to find the most probable assignment (Viterbi decoding)
  • We can also compute max-marginals
Complexity of exact inference

In general, the running time is (N Kw), where w is the treewidthof the graph; this is the size of the maximal clique of the triangulatedgraph (assuming an optimal elimination ordering).

For chains and trees, w = 2.

For n × n lattices, w = O(n).

Learning intractable CRFs
  • We can use approximate inference and hope the gradient is “good enough”.
    • If we use max-product, we are doing “Viterbi training” (cf perceptron rule)
  • Or we can use other techniques, such as pseudo-likelihood, which does not need inference (definition sketched below).
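Pseudo-likelihood replaces the global conditional likelihood with a product of single-node full conditionals, each of which is cheap to normalize (standard definition; N(i) denotes the neighbours of node i):

$$\ell_{PL}(\theta) = \sum_n \sum_i \log p\big(y_{n,i} \mid y_{n, N(i)}, x_n, \theta\big)$$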
Software for inference and learning in 1D CRFs
  • Various packages
    • Mallet (McCallum et al) – Java
    • Crf.sourceforge.net (Sarawagi, Cohen) – Java
    • My code – matlab (just a toy, not integrated with BNT)
    • Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning).
  • Nothing standard, emphasis on NLP apps
Software for inference in general CRFs/ MRFs
  • Max-product : C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al
    • “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
  • Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference)
  • Sum-product: various other ad hoc pieces
    • My matlab BP code (MRF2)
    • Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs)
    • Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)
Software for learning general MRFs/CRFs
  • Hardly any!
    • Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc)
    • My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)
Structure of ideal toolbox

[Block diagram: a Generator/GUI/file produces a model; a learnEngine takes the model plus trainData (train) and returns a fitted model; an infEngine takes the model plus testData (infer) and answers queries with a probDist or an N-best list; a decisionEngine (decide) turns these into decisions; utilities visualize and summarize performance.]

Structure of BNT

[Block diagram: the same pipeline instantiated in BNT. Model = graphs + CPDs, built from a Generator/GUI/file (LeRay, Shan); data = cell arrays; learnEngine = EM, StructuralEM; infEngine = BPJtree, VarElim, JtreeVarElim, MCMC; queries = node ids; probDist = array, Gaussian, or samples (N-best with N = 1 gives the MAP); decisionEngine = LIMID, producing a policy; plus visualize and summarize utilities.]

Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models
Unsupervised learning: why?
  • Labeling data is time-consuming.
  • Often not clear what label to use.
  • Complex objects often not describable with a single discrete label.
  • Humans learn without labels.
  • Want to discover novel patterns/ structure.
Unsupervised learning: what?
  • Clusters (eg GMM)
  • Low dim manifolds (eg PCA)
  • Graph structure (eg biology, social networks)
  • “Features” (eg maxent models of language and texture)
  • “Objects” (eg sprite models in vision)
Unsupervised learning of objects from video

Frey and Jojic; Williams and Titsias; et al.

Unsupervised learning: issues
  • Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression).
  • Local minima (non convex objective).
  • Uses inference as a subroutine (can be slow, but no worse than discriminative learning)
Unsupervised learning: how?
  • Construct a generative model (eg a Bayes net).
  • Perform inference.
  • May have to use approximations such as maximum likelihood and BP.
  • Cannot use max likelihood for model selection…
A comparison of BN software

www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html

Popular BN software
  • BNT (matlab)
  • Intel’s PNL (C++)
  • Hugin (commercial)
  • Netica (commercial)
  • GMTk (free .exe from Jeff Bilmes)
Outline
  • Discriminative models for iid data
  • Beyond iid data: conditional random fields
  • Beyond supervised learning: generative models
  • Beyond optimization: Bayesian models
Bayesian inference: why?
  • It is optimal.
  • It can easily incorporate prior knowledge (esp. useful for small n, large p problems).
  • It properly reports confidence in output (useful for combining estimates, and for risk-averse applications).
  • It separates models from algorithms.
Bayesian inference: how?
  • Since we want to integrate, we cannot use max-product.
  • Since the unknown parameters are continuous, we cannot use sum-product.
  • But we can use EP (expectation propagation), which is similar to BP.
  • We can also use variational inference.
  • Or MCMC (eg Gibbs sampling; see below).
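For example, Gibbs sampling targets the posterior p(θ | D) ∝ p(D | θ) p(θ) by resampling one block of unknowns at a time (standard scheme; θ collects all unknowns):

$$\theta_j^{(t+1)} \sim p\big(\theta_j \mid \theta_1^{(t+1)}, \dots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \dots, \theta_d^{(t)}, \mathcal{D}\big), \qquad j = 1, \dots, d$$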
General purposeBayesian software
  • BUGS (Gibbs sampling)
  • VIBES (variational message passing)
  • Minka and Winn’s toolbox (infer.net)
Structure of ideal Bayesian toolbox

[Block diagram: the same pipeline as the ideal toolbox above: a Generator/GUI/file produces the model, a learnEngine trains it on trainData, an infEngine answers queries on testData with a probDist, a decisionEngine (decide) produces decisions, and utilities visualize and summarize performance.]