Exact and approximate inference in probabilistic graphical models

Kevin Murphy (MIT CSAIL / UBC CS & Stats)

www.ai.mit.edu/~murphyk/AAAI04

AAAI 2004 tutorial

SP2-1


Recommended reading

  • Cowell, Dawid, Lauritzen, Spiegelhalter, “Probabilistic Networks and Expert Systems”, 1999

  • Jensen 2001, “Bayesian Networks and Decision Graphs”

  • Jordan (due 2005) “Probabilistic graphical models”

  • Koller & Friedman (due 2005), “Bayes nets and beyond”

  • “Learning in Graphical Models”, edited by M. Jordan

SP2-2


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-3


2 reasons for approximate inference

  • Reason 1: low treewidth, but non-linear / non-Gaussian potentials

    • Chains, e.g., a non-linear dynamical system

    • Trees (no loops), e.g., (Bayesian) parameter estimation

  • Reason 2: high treewidth

    • Loopy graphs, e.g., an N = n x n grid

(Figure: example graphs, including a chain of hidden nodes X1, X2, X3 with observations Y1, Y2, Y3.)

SP2-4


Complexity of approximate inference

  • Approximating P(Xq|Xe) to within a constant factor for all discrete BNs is NP-hard [Dagum93].

    • In practice, many models exhibit “weak coupling”, so we may safely ignore certain dependencies.

  • Computing P(Xq|Xe) for all polytrees with discrete and Gaussian nodes is NP-hard [Lerner01].

    • In practice, some of the modes of the posterior will have negligible mass.

SP2-5


2 objective functions

  • Approximate the true posterior P(h|v) by a simpler distribution Q(h)

  • Variational: globally optimize all terms wrt the simpler Q, i.e., minimize D(Q||P); this objective is zero-forcing (P=0 => Q=0)

  • Expectation propagation (EP): sequentially optimize each term by moment matching, i.e., locally minimize D(P||Q); this objective is zero-avoiding (Q=0 => P=0)

(Figure: the two KL objectives illustrated with overlapping densities P and Q.)
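As a quick numeric illustration of the two directions (the distributions below are made up for illustration, not taken from the tutorial), here is a minimal Python sketch comparing D(Q||P) and D(P||Q) for a bimodal target P:

import numpy as np

def kl(a, b, eps=1e-12):
    # KL divergence D(a||b) = sum_x a(x) log a(x)/b(x), for discrete distributions.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

P = np.array([0.49, 0.01, 0.01, 0.49])        # bimodal "true" posterior
Q_mode = np.array([0.96, 0.02, 0.01, 0.01])   # covers one mode only
Q_broad = np.array([0.25, 0.25, 0.25, 0.25])  # spreads mass over everything

# D(Q||P) (the variational objective) prefers the mode-seeking Q_mode:
print(kl(Q_mode, P), "<", kl(Q_broad, P))
# D(P||Q) (the moment-matching objective used by EP's projections) prefers Q_broad:
print(kl(P, Q_mode), ">", kl(P, Q_broad))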

SP2-6


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

      • Variational

      • Loopy belief propagation

      • Expectation propagation

      • Graph cuts

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-7


Free energy

  • Variational goal: minimize D(Q||P) wrt Q, where Q has a simpler form than P

  • P(h,v) is easier to work with than P(h|v), so the free energy F(Q) is defined in terms of the joint (see below)

  • The free energy is an upper bound on the negative log-likelihood -log P(v)
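In symbols, the standard identity behind these bullets (reconstructed here; the slide's own equations are not in the transcript) is

\[
F(Q) \;=\; \sum_h Q(h)\,\log\frac{Q(h)}{P(h,v)}
     \;=\; \mathrm{KL}\big(Q(h)\,\|\,P(h\mid v)\big) \;-\; \log P(v)
     \;\ge\; -\log P(v),
\]

with equality iff Q(h) = P(h|v); minimizing F over Q therefore minimizes KL(Q||P) while only requiring the joint P(h,v).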

SP2-8


Point estimation

  • Use a degenerate (point-mass) approximation, Q(h) = δ(h, ĥ)

  • Minimize F, which for a point-mass Q reduces to minimizing -log P(ĥ, v), i.e., maximizing the joint

  • Iterative Conditional Modes (ICM):

    • For each iteration, for each hi: set hi to its best value with all other variables held fixed; only the factors in the Markov blanket of hi are involved (see the sketch below)

    • Example: K-means clustering

    • Ignores uncertainty in P(h|v), P(θ|v)

    • Tends to get stuck in local minima
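A minimal ICM sketch for a pairwise MRF (an Ising-style denoising model; the grid, potentials and data are illustrative assumptions, not an example from the tutorial):

import numpy as np

# ICM for a pairwise MRF with states {-1,+1}. Energy:
#   E(h) = -beta * sum_{i~j} h_i h_j  -  gamma * sum_i h_i y_i,
# where y is a noisy observation of the hidden image h.
rng = np.random.default_rng(0)
H, W, beta, gamma = 20, 20, 1.0, 1.5

truth = np.ones((H, W)); truth[:, W // 2:] = -1           # simple two-block image
y = np.where(rng.random((H, W)) < 0.2, -truth, truth)     # flip 20% of the pixels
h = y.copy()                                              # initialize at the data

def local_energy(h, y, i, j, s):
    # Only the factors in the Markov blanket of pixel (i, j) involve its value s.
    e = -gamma * s * y[i, j]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            e -= beta * s * h[ni, nj]
    return e

for sweep in range(10):                                   # coordinate-wise energy minimization
    for i in range(H):
        for j in range(W):
            h[i, j] = min((+1, -1), key=lambda s: local_energy(h, y, i, j, s))

print("fraction of pixels recovered:", np.mean(h == truth))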

SP2-9


Expectation Maximization (EM)

  • Point estimates for parameters (ML or MAP), full posterior for hidden vars.

  • E-step: minimize F(Q,P) wrt Q(h)

  • M-step: minimize F(Q,P) wrt θ (the parameters)

(In the E-step, Q(h) is obtained by exact inference; the M-step objective is the expected complete-data log-likelihood plus the log parameter prior.)
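As a concrete, self-contained illustration (not from the slides), a minimal EM sketch for a two-component 1-D Gaussian mixture, alternating the E-step (compute Q(h), the responsibilities) and the M-step (re-estimate θ):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.7, 200)])

pi = np.array([0.5, 0.5])           # mixing weights
mu = np.array([-1.0, 1.0])          # component means
sd = np.array([1.0, 1.0])           # component standard deviations

def log_gauss(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2

for it in range(50):
    # E-step: exact posterior responsibilities Q(h) for each data point.
    log_r = np.log(pi) + log_gauss(x[:, None], mu, sd)          # shape (N, 2)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood wrt the parameters.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights", pi.round(2), "means", mu.round(2), "stds", sd.round(2))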

SP2-10


EM: tricks of the trade

  • Generalized EM [Neal98]

    • Partial M-step: reduce F(Q,P) wrt θ [e.g., gradient method]

    • Partial E-step: reduce F(Q,P) wrt Q(h) [approximate inference]

  • Avoiding local optima

    • Deterministic annealing [Rose98]

    • Data resampling [Elidan02]

  • Speedup tricks

    • Combine with conjugate gradient [Salakhutdinov03]

    • Online/incremental updates [Bauer97, Neal98]

SP2-11


Variational Bayes (VB)

Ghahramani00,Beal02

  • Use a factorized posterior over both hidden variables and parameters: Q(h, θ) = Q(h) Q(θ)

  • For exponential family models with conjugate priors, this results in a generalized version of EM

    • E-step: modified inference to take into account uncertainty of parameters

    • M-step: optimize Q(θ) using expected sufficient statistics

  • Variational Message Passing [Winn04] automates this, assuming a fully factorized (mean field) Q

  • See also variational-Bayes.org
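For a fully factorized Q(h, θ) = Q(h) Q(θ), the standard coordinate updates (reconstructed here, since the slide's equations are images) alternate

\begin{align*}
\log Q(h)      &= \mathbb{E}_{Q(\theta)}\big[\log P(v, h, \theta)\big] + \text{const},\\
\log Q(\theta) &= \mathbb{E}_{Q(h)}\big[\log P(v, h, \theta)\big] + \text{const};
\end{align*}

for conjugate-exponential models the second update needs only the expected sufficient statistics computed in the first.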

SP2-12


Variational inference for discrete state models with high treewidth

  • We assume the parameters are fixed.

  • We assume Q(h) has a simple form, so that the required marginals and expectations under Q are easy to compute.

  • Mean field: Q(h) is fully factorized, Q(h) = ∏i Q(hi)

  • Structured variational: Q keeps some tractable substructure, e.g., a product of chains [Xing04]

(Figure: a grid MRF and two approximating families: mean field and a product of chains.)

SP2-13


Variational inference for MRFs

  • Probability is exp(-energy)

  • Free energy = average energy - entropy

SP2-14


Mean field for MRFs

  • Fully factorized approximation

  • Normalization constraint

  • Average energy

  • Entropy

  • Local minima satisfy coupled fixed-point equations, one per node (see the equations below)
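For a pairwise MRF with potentials ψi and ψij, one standard way to spell out these quantities and the resulting update (a reconstruction, not the slide's exact notation) is

\begin{align*}
Q(x) &= \prod_i q_i(x_i), \qquad \sum_{x_i} q_i(x_i) = 1,\\
F(Q) &= -\sum_i \sum_{x_i} q_i(x_i)\log\psi_i(x_i)
        -\sum_{(i,j)}\sum_{x_i,x_j} q_i(x_i)\,q_j(x_j)\log\psi_{ij}(x_i,x_j)
        +\sum_i \sum_{x_i} q_i(x_i)\log q_i(x_i),\\
q_i(x_i) &\propto \psi_i(x_i)\exp\Big(\sum_{j\in N(i)}\sum_{x_j} q_j(x_j)\log\psi_{ij}(x_i,x_j)\Big),
\end{align*}

where the last line is iterated node by node until convergence.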

SP2-15


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

      • Variational

      • Loopy belief propagation

      • Expectation propagation

      • Graph cuts

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-16


BP vs mean field for MRFs

  • Mean field updates: each node i passes the same information (its current marginal qi) to all of its neighbors

  • BP updates: every node i sends a different message to each neighbor j (a small code sketch follows below)

  • Empirically, BP is much better than MF (e.g., MF is not exact even for trees) [Weiss01]

  • BP is (attempting to) minimize the Bethe free energy [Yedidia01]
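A minimal sketch of synchronous loopy BP (sum-product) on a small Ising-style grid; the model, potentials and schedule here are illustrative assumptions rather than code from the tutorial:

import numpy as np

# Pairwise MRF on an n x n grid with binary states {0,1}.
# psi_i = random unary potentials, psi_ij favours agreement (strength beta).
n, beta, n_iters = 5, 0.5, 50
rng = np.random.default_rng(0)
nodes = [(i, j) for i in range(n) for j in range(n)]

def neighbors(v):
    i, j = v
    return [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= i + di < n and 0 <= j + dj < n]

unary = {v: np.exp(rng.normal(0, 0.5, size=2)) for v in nodes}   # psi_i(x_i)
pair = np.exp(beta * np.array([[1.0, -1.0], [-1.0, 1.0]]))       # psi_ij(x_i, x_j)
msgs = {(u, v): np.ones(2) / 2 for u in nodes for v in neighbors(u)}

for _ in range(n_iters):
    new = {}
    for (u, v) in msgs:
        # Product of u's unary potential and all messages into u except the one from v.
        prod = unary[u].copy()
        for w in neighbors(u):
            if w != v:
                prod *= msgs[(w, u)]
        m = pair.T @ prod              # sum over x_u of psi_uv(x_u, x_v) * prod(x_u)
        new[(u, v)] = m / m.sum()      # normalize for numerical stability
    msgs = new                         # synchronous ("flooding") update

beliefs = {}
for v in nodes:
    b = unary[v].copy()
    for w in neighbors(v):
        b *= msgs[(w, v)]
    beliefs[v] = b / b.sum()           # approximate marginal at node v

print("approximate marginal at (0, 0):", beliefs[(0, 0)].round(3))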

SP2-17


Bethe free energy

  • We assume the graph is a tree, in which case the following is exact

  • Constraints

    • Normalization

    • Marginalization

  • Average energy

  • Entropy

di = #neighbors for node i
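A standard statement of the Bethe free energy for a pairwise MRF, in terms of node and edge beliefs b_i, b_ij (reconstructed; the slide's equations are images):

\begin{align*}
F_{\mathrm{Bethe}} &= \sum_{(i,j)} \sum_{x_i,x_j} b_{ij}(x_i,x_j)\,E_{ij}(x_i,x_j)
                    + \sum_i \sum_{x_i} b_i(x_i)\,E_i(x_i)\\
                   &\quad + \sum_{(i,j)} \sum_{x_i,x_j} b_{ij}(x_i,x_j)\log b_{ij}(x_i,x_j)
                    - \sum_i (d_i - 1) \sum_{x_i} b_i(x_i)\log b_i(x_i),
\end{align*}

subject to normalization \(\sum_{x_i} b_i(x_i) = 1\) and marginalization \(\sum_{x_j} b_{ij}(x_i,x_j) = b_i(x_i)\), where E_i = -log ψ_i, E_ij = -log ψ_ij, and d_i is the number of neighbors of node i. On a tree this equals the exact free energy.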

SP2-18


BP minimizes Bethe free energy

Yedidia01

  • Theorem [Yedidia, Freeman, Weiss]: fixed points of BP are local stationary points of the Bethe free energy

  • BP may not converge; other algorithms can directly minimize F, but are slower.

  • If BP does not converge, it often means F is a poor approximation

SP2-19


Kikuchi free energy

(Figure: a 2x3 grid with nodes 1-6; the Bethe approximation clusters nodes by edges, while the Kikuchi approximation uses larger overlapping clusters.)

  • Cluster groups of nodes together

  • Energy per region

  • Free energy per region

  • Kikuchi free energy: a weighted sum of the regional free energies, with weights given by counting numbers

SP2-20


Counting numbers

(Figure: the 2x3 grid again, with its Bethe region graph and a Kikuchi region graph.)

  • Bethe region graph: large regions are the edges 12, 23, 14, 25, 36, 45, 56; small regions are the single nodes 1 2 3 4 5 6, with counting numbers C = -1, -2, -1, -1, -2, -1 (i.e., ci = 1 - di).

  • Kikuchi region graph: large regions 1245 and 2356, whose intersection is the region 25, with counting number C = 1 - (1+1) = -1.

  • F_Kikuchi is exact if the region graph contains 2 levels (regions and intersections) and has no cycles; this is equivalent to a junction tree!

SP2-21


Generalized BP

(Figure: a 3x3 grid with nodes 1-9, grouped into overlapping 2x2 regions 1245, 2356, 4578, 5689, with intersection regions 25, 45, 56, 58 and the single node 5.)

  • F_Kikuchi is no longer exact, but is more accurate than F_Bethe

  • Generalized BP can be used to minimize Fkikuchi

  • This method of choosing regions is called the “cluster variational method”

  • In the limit, we recover the junction tree algorithm.

Welling04

SP2-22


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

      • Variational

      • Loopy belief propagation

      • Expectation propagation

      • Graph cuts

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-23


Expectation Propagation (EP)

Minka01

  • EP = iterated assumed density filtering

  • ADF = recursive Bayesian estimation interleaved with projection step

  • Examples of ADF:

    • Extended Kalman filtering

    • Moment-matching (weak marginalization)

    • Boyen-Koller algorithm

    • Some online learning algorithms

SP2-24


Assumed Density Filtering (ADF)

(Figure: hidden variable x with observations Y1, ..., Yn. Recursive Bayesian estimation = sequential updating of the posterior.)

  • If p(yi|x) not conjugate to p(x), then p(x|y1:i) may not be tractably representable

  • So project posterior back to representable family

  • And repeat

(The algorithm alternates update and project steps; when the approximating family is exponential, the KL projection becomes moment matching.)

SP2-25


Expectation Propagation

  • ADF is sensitive to the order of updates.

  • ADF approximates each posterior myopically.

  • EP: iteratively approximate each term.

(Exact posterior: intractable. ADF: simple, non-iterative, inaccurate. EP: simple, iterative, accurate.)

After Ghahramani

SP2-26


Expectation Propagation

  • Input: factors f0, ..., fN defining p(x) ∝ ∏i fi(x) (see the updates spelled out below)

  • Initialize:

  • Repeat

    • For i=0..N

      • Deletion:

      • Projection:

      • Inclusion:

  • Until convergence

  • Output: q(x)

After Ghahramani
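A standard form of the per-factor updates [Minka01], reconstructed here because the slide's equations are images; q(x) ∝ ∏i f̃i(x) is the current approximation built from approximate factors f̃i:

\begin{align*}
\text{Deletion:}   \quad & q^{\setminus i}(x) \;\propto\; q(x)\,/\,\tilde f_i(x)\\
\text{Projection:} \quad & q^{\mathrm{new}}(x) \;=\; \arg\min_{q' \in \mathcal{F}}
                          \mathrm{KL}\big(q^{\setminus i}(x)\,f_i(x)\,\big\|\,q'(x)\big)
                          \quad\text{(moment matching within the family }\mathcal{F})\\
\text{Inclusion:}  \quad & \tilde f_i(x) \;\propto\; q^{\mathrm{new}}(x)\,/\,q^{\setminus i}(x),
                          \qquad q(x) \leftarrow q^{\mathrm{new}}(x).
\end{align*}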

SP2-27


BP is a special case of EP

  • BP assumes a fully factorized approximation q(x) = ∏k qk(xk)

  • At each iteration, for each factor i, for each node k, KL projection matches moments (computes marginals by absorbing from neighbors)

(Figure: factor-graph fragment with factors fi and fj connected to variable Xk and its neighbors Xn1 and Xn2.)

SP2-28


TreeEP

Minka03

  • TreeEP assumes q(x) is represented by a tree (regardless of “true” model topology).

  • We can use the Jtree algorithm to do the moment matching at each iteration.

  • Faster and more accurate than LBP.

  • Faster than, and comparably accurate to, GBP.

SP2-29


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

      • Variational

      • Loopy belief propagation

      • Expectation propagation

      • Graph cuts

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-30


MPE in MRFs

  • MAP estimation = energy minimization

  • Simplifications:

    • Only pairwise potentials: Eijk=0 etc

    • Special form for potentials

    • Binary variables xi ∈ {0,1}

SP2-31


Kinds of potential

  • Metric

  • Semi-metric: satisfies (2) & (3)

  • Piecewise constant, eg.

    • Potts model (metric)

  • Piecewise smooth, eg.

    • Semi-metric

    • Metric

  • Discontinuity-preserving potentials avoid oversmoothing
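The slide's conditions are images; the standard definitions they correspond to (following Boykov, Veksler & Zabih, stated here for reference rather than quoted from the slide, with the numbering chosen to match the bullets above) are

\begin{align*}
&(1)\;\; V(\alpha,\beta) \le V(\alpha,\gamma) + V(\gamma,\beta), \qquad
 (2)\;\; V(\alpha,\beta) = V(\beta,\alpha) \ge 0, \qquad
 (3)\;\; V(\alpha,\beta) = 0 \iff \alpha=\beta,
\end{align*}

so a metric satisfies (1)-(3), while a semi-metric satisfies (2) and (3) but may violate the triangle inequality (1). The usual examples are the Potts model V(α, β) = λ·[α ≠ β] (piecewise constant, a metric), the truncated quadratic min(K, |α − β|²) (piecewise smooth, a semi-metric) and the truncated linear min(K, |α − β|) (piecewise smooth, a metric).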

SP2-32


GraphCuts

(Figure: the two-terminal graph construction with source s, sink t and pixel nodes xi, xj; the edge capacities are C-A, B+C-A-D and C-D, where A = E(0,0), B = E(0,1), C = E(1,0), D = E(1,1).)

Kolmogorov04

  • Thm: we can find argmin E(x) for binary variables and pairwise potentials in at most O(N^3) time using a maxflow/mincut algorithm on the graph construction above iff the potentials are submodular, i.e., E(0,0) + E(1,1) ≤ E(0,1) + E(1,0) (equivalently A + D ≤ B + C).

  • Metric potentials (e.g., Potts) are always submodular.

  • Thm: the general case (e.g., non-binary or non-submodular) is NP-hard.

SP2-33


Finding strong local minimum

  • For the non-binary case, we can find the optimum wrt some large space of moves by iteratively solving binary subproblems.

  • α-expansion: any pixel can change its current label to α

  • αβ-swap: any pixel labeled α can switch to β, and vice versa

Picture from Zabih

SP2-34


Finding strong local minimum

  • Start with arbitrary assignment f

  • Done := false

  • While ~done

    • Done := true

    • For each label α

      • Find f' = argmin E(f'') over all labelings f'' within one α-expansion of f

      • If E(f') < E(f) then done := false; f := f'

Binary subproblem!
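A sketch of this outer loop in Python; solve_alpha_expansion_graphcut is a hypothetical helper standing in for the maxflow/mincut solution of the binary keep-vs-switch-to-α subproblem (it is not a real library call), and energy evaluates E(f):

def alpha_expansion(f, labels, energy, solve_alpha_expansion_graphcut):
    # Repeatedly sweep over labels; each sweep solves one binary subproblem per label.
    done = False
    while not done:
        done = True
        for alpha in labels:
            f_new = solve_alpha_expansion_graphcut(f, alpha)   # binary subproblem (graph cut)
            if energy(f_new) < energy(f):
                f, done = f_new, False                         # accept the move, keep cycling
    return f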

SP2-35


Properties of the 2 algorithms

  • α-expansion

    • Requires V to be submodular (e.g., metric)

    • O(L) per cycle

    • Within a factor of 2c(V) of the optimum

    • c=1 for Potts model

  • - swap

    • Requires V to be semi-metric

    • O(L2) per cycle

    • No comparable theoretical guarantee, but works well in practice

SP2-36


Summary of inference methods for pairwise MRFs

  • Marginals

    • Mean field

    • Loopy/ generalized BP (sum-product)

    • EP

    • Gibbs sampling

    • Swendsen-Wang

  • MPE/ Viterbi

    • Iterative conditional modes (ICM)

    • Loopy/generalized BP (max-product)

    • Graph cuts

    • Simulated annealing

See Boykov01, Weiss01 and Tappen03 for some empirical comparisons

SP2-37


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-38


Monte Carlo (sampling) methods

  • Goal: estimate expectations E_P[f(X)] under the target distribution P

  • e.g., posterior marginals P(Xi = xi | xE), obtained by taking f to be an indicator function

  • Draw N independent samples xr ~ P

  • Hard to draw (independent) samples from P

Accuracy is independent of the dimensionality of X

SP2-39


Importance Sampling

  • We sample from Q(x) and reweight

Require Q(x) > 0 wherever P(x) > 0

(Figure: unnormalized target P*(x) and proposal Q*(x).)
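A minimal sketch of self-normalized importance sampling with unnormalized densities P* and Q* (the 1-D target and proposal below are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

def p_star(x):            # unnormalized target: mixture of two Gaussian bumps
    return np.exp(-0.5 * (x + 2) ** 2) + 0.5 * np.exp(-0.5 * (x - 3) ** 2)

def q_star(x):            # unnormalized proposal: broad Gaussian, > 0 wherever p_star > 0
    return np.exp(-0.5 * (x / 4.0) ** 2)

x = rng.normal(0.0, 4.0, size=100_000)    # samples from Q, i.e. N(0, 4^2)
w = p_star(x) / q_star(x)                 # importance weights w_r = P*(x_r) / Q*(x_r)

print("E_P[X]   ~", np.sum(w * x) / np.sum(w))
print("P(X > 0) ~", np.sum(w * (x > 0)) / np.sum(w))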

SP2-40


Importance Sampling for BNs (likelihood weighting)

  • Input: CPDs P(Xi | Pa(Xi)), evidence xE

  • Output: weighted samples {(x^r, w^r)}, from which we can estimate P(Xq | xE)

  • For each sample r

    • wr = 1

    • For each node i in topological order

      • If Xi is observed

      • Then xi^r = xi^E; w^r = w^r * P(Xi = xi^E | Pa(Xi) = pa^r)

      • Else sample xi^r ~ P(Xi | Pa(Xi) = pa^r), where pa^r denotes the previously sampled values of Xi's parents
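A minimal sketch of likelihood weighting on the classic cloudy/sprinkler/rain/wet-grass network (the CPT values are the usual textbook numbers, used here purely as an illustration):

import numpy as np

rng = np.random.default_rng(0)

P_C = 0.5
def p_s(c): return 0.1 if c else 0.5                     # P(S=1 | C)
def p_r(c): return 0.8 if c else 0.2                     # P(R=1 | C)
def p_w(s, r): return [[0.0, 0.9], [0.9, 0.99]][s][r]    # P(W=1 | S, R)

def weighted_sample(evidence):
    # Sample unobserved nodes in topological order; clamp observed nodes and
    # multiply the weight by their CPD value given the sampled parents.
    w = 1.0
    c = bool(evidence.get("C", rng.random() < P_C))
    if "C" in evidence: w *= P_C if c else 1 - P_C
    s = bool(evidence.get("S", rng.random() < p_s(c)))
    if "S" in evidence: w *= p_s(c) if s else 1 - p_s(c)
    r = bool(evidence.get("R", rng.random() < p_r(c)))
    if "R" in evidence: w *= p_r(c) if r else 1 - p_r(c)
    wg = bool(evidence.get("W", rng.random() < p_w(s, r)))
    if "W" in evidence: w *= p_w(s, r) if wg else 1 - p_w(s, r)
    return {"C": c, "S": s, "R": r, "W": wg}, w

samples = [weighted_sample({"W": True}) for _ in range(50_000)]
den = sum(w for _, w in samples)
num = sum(w for x, w in samples if x["R"])
print("P(R=1 | W=1) ~", num / den)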

SP2-41


Drawbacks of importance sampling

(Figure: the sprinkler network over C, S, R, W, before and after evidence reversal.)

  • Sample given upstream evidence, weight by downstream evidence.

  • Evidence reversal = modify model to make all observed nodes be parents – can be expensive

  • Does not scale to high dimensional spaces, even if Q similar to P, since variance of weights too high.

SP2-42


Sequential importance sampling (particle filtering)

(Figure: state-space model with hidden states X1, X2, X3 and observations Y1, Y2, Y3.)

Arulampalam02,Doucet01

  • Apply importance sampling to a (nonlinear, nonGaussian) dynamical system.

  • Resample particles with probability proportional to their weights wt

    • Unlikely hypotheses get replaced
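A minimal bootstrap particle filter sketch for a 1-D nonlinear state-space model (the model is an illustrative assumption, not one from the tutorial):

import numpy as np

# Bootstrap particle filter: propose from the transition model, weight by the
# observation likelihood, then resample in proportion to the weights.
rng = np.random.default_rng(0)
T, N = 50, 1000

def f(x):                      # nonlinear transition mean
    return 0.5 * x + 8 * x / (1 + x ** 2)

x_true, ys = 0.0, []
for t in range(T):             # simulate a trajectory and its observations
    x_true = f(x_true) + rng.normal(0, 1.0)
    ys.append(x_true ** 2 / 20 + rng.normal(0, 1.0))

particles = rng.normal(0, 2.0, size=N)
for y in ys:
    particles = f(particles) + rng.normal(0, 1.0, size=N)     # propagate
    logw = -0.5 * (y - particles ** 2 / 20) ** 2               # Gaussian log-likelihood
    w = np.exp(logw - logw.max()); w /= w.sum()
    estimate = np.sum(w * particles)                           # filtered mean
    particles = particles[rng.choice(N, size=N, p=w)]          # resample: unlikely ones die

print("final filtered state estimate:", round(float(estimate), 2))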

SP2-43


Markov Chain Monte Carlo (MCMC)

Neal93,Mackay98

  • Draw dependent samples xt from a chain with transition kernel T(x’ | x), s.t.

    • P(x) is the stationary distribution

    • The chain is ergodic (any state can eventually be reached from any other, so the chain converges to the stationary distribution regardless of its starting point)

  • If T satisfies detailed balance, P(x) T(x' | x) = P(x') T(x | x'), then P = π, the stationary distribution

SP2-44


Metropolis Hastings

  • Sample a proposal x' ~ Q(x' | xt-1)

  • Accept the new state (set xt = x') with probability given by the rule below; otherwise set xt = xt-1

  • Satisfies detailed balance
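The acceptance probability (the standard Metropolis-Hastings rule, reconstructed because the slide's formula is an image) is

\[
a(x' \mid x_{t-1}) \;=\; \min\!\left(1,\;
\frac{P^*(x')\,Q(x_{t-1}\mid x')}{P^*(x_{t-1})\,Q(x'\mid x_{t-1})}\right),
\]

where P* is the (possibly unnormalized) target; if the proposal is rejected we set x_t = x_{t-1}.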

SP2-45


Gibbs sampling

  • Metropolis method where Q is defined in terms of conditionals P(Xi|X-i).

  • Acceptance rate = 1.

  • For a graphical model, we only need to condition on the Markov blanket of Xi

See BUGS software

SP2-46


Difficulties with MCMC

  • May take long time to “mix” (converge to stationary distribution).

  • Hard to know when chain has mixed.

  • Simple proposals exhibit random walk behavior.

    • Hybrid Monte Carlo (use gradient information)

    • Swendsen-Wang (large moves for Ising model)

    • Heuristic proposals

SP2-47


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

SP2-48


Comparison of deterministic and stochastic methods

  • Deterministic

    • fast but inaccurate

  • Stochastic

    • slow but accurate

    • Can handle arbitrary hypothesis space

  • Combine best of both worlds (hybrid)

    • Use smart deterministic proposals

    • Integrate out some of the states, sample the rest (Rao-Blackwellization)

    • Non-parametric BP (particle filtering for graphs)

SP2-49


Examples of deterministic proposals

  • State estimation

    • Unscented particle filter [Merwe00]

  • Machine learning

    • Variational MCMC [deFreitas01]

  • Computer vision

    • Data-driven MCMC [Tu02]

SP2-50


Example of Rao-Blackwellized particle filters

(Figure: switching state-space model with discrete switch nodes S1, S2, S3, continuous states X1, X2, X3 and observations Y1, Y2, Y3.)

  • Conditioned on the discrete switching nodes, the remaining system is linear Gaussian and can be integrated out using the Kalman filter.

  • Each particle contains a sampled value s_t^r and the mean/covariance of P(Xt | y_{1:t}, s_{1:t}^r)

SP2-51


Outline

  • Introduction

  • Exact inference

  • Approximate inference

    • Deterministic

    • Stochastic (sampling)

    • Hybrid deterministic/ stochastic

  • Summary

SP2-52


Summary of inference methods

(Table of methods organized into three columns: exact, deterministic approximation, and stochastic approximation.)

BP=belief propagation, EP = expectation propagation, ADF = assumed density filtering, EKF = extended Kalman filter, UKF = unscented Kalman filter, VarElim = variable elimination, Jtree= junction tree, EM = expectation maximization, VB = variational Bayes, NBP = non-parametric BP

SP2-53