
### Exact and approximate inference in probabilistic graphical models

Kevin Murphy (MIT CSAIL / UBC CS & Stats)

www.ai.mit.edu/~murphyk/AAAI04

AAAI 2004 tutorial

SP2-1

Recommended reading

- Cowell, Dawid, Lauritzen, Spiegelhalter, "Probabilistic Networks and Expert Systems", 1999
- Jensen, "Bayesian Networks and Decision Graphs", 2001
- Jordan, "Probabilistic Graphical Models" (due 2005)
- Koller & Friedman, "Bayes Nets and Beyond" (due 2005)
- "Learning in Graphical Models", edited by M. Jordan

SP2-2

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-3

2 reasons for approximate inference

- Low treewidth BUT non-linear / non-Gaussian
- Chains, e.g. a non-linear dynamical system (hidden states X1, X2, X3 with observations Y1, Y2, Y3)
- Trees (no loops), e.g. (Bayesian) parameter estimation
- High treewidth
- Loopy graphs, e.g. an N = n x n grid
SP2-4

Complexity of approximate inference

- Approximating P(X_q | X_e) to within a constant factor for all discrete BNs is NP-hard [Dagum93].
- In practice, many models exhibit "weak coupling", so we may safely ignore certain dependencies.
- Computing P(X_q | X_e) for all polytrees with discrete and Gaussian nodes is NP-hard [Lerner01].
- In practice, some of the modes of the posterior will have negligible mass.

SP2-5

2 objective functions

- Approximate the true posterior P(h|v) by Q(h)
- Variational: globally optimize all terms wrt a simpler Q; minimizes D(Q||P), so Q is forced to be zero wherever P is zero (tends to lock onto a single mode)
- Expectation propagation (EP): sequentially optimize each term; minimizes D(P||Q), so Q must be nonzero wherever P is nonzero (tends to cover all the modes)

[Figure: the true posterior P and the approximation Q under each objective; the two divergences are written out below]
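For reference, the two objectives are the standard KL divergences (reconstructed here; the slide showed them pictorially):

```latex
D(Q\|P) = \sum_h Q(h)\,\log\frac{Q(h)}{P(h\mid v)},
\qquad
D(P\|Q) = \sum_h P(h\mid v)\,\log\frac{P(h\mid v)}{Q(h)}.
```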

SP2-6

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-7

Free energy

- Variational goal: minimize D(Q||P) wrt Q, where Q has a simpler form than P
- P(h,v) is tractable to evaluate whereas P(h|v) is not, so we minimize the free energy F(Q), defined below in terms of P(h,v)
- The free energy is an upper bound on the negative log-likelihood
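A sketch of the identity behind these bullets, using the standard variational decomposition (a reconstruction, not copied from the slide):

```latex
F(Q) = \sum_h Q(h)\,\log\frac{Q(h)}{P(h,v)}
     = \underbrace{D\bigl(Q(h)\,\|\,P(h\mid v)\bigr)}_{\ge 0} - \log P(v)
     \;\ge\; -\log P(v),
```

so minimizing F over Q is equivalent to minimizing D(Q||P), and F upper-bounds the negative log-likelihood.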

SP2-8

Point estimation

- Use a degenerate (delta-function) posterior: Q(h) = δ(h, ĥ)
- Minimize F(Q,P) = -log P(ĥ, v) wrt ĥ (the entropy term vanishes for a point mass)
- Iterative Conditional Modes (ICM):
- For each iteration, for each h_i: set h_i to its most probable value given the factors in its Markov blanket, holding all other hidden variables fixed (see the sketch below)
- Example: K-means clustering
- Ignores uncertainty in P(h|v) and P(θ|v)
- Tends to get stuck in local minima
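A minimal Python sketch of ICM on a pairwise MRF (illustrative only; the data structures and names are my own, not the tutorial's):

```python
import numpy as np

def icm(node_pot, edge_pot, edges, num_iters=20, seed=0):
    """Iterative Conditional Modes for a pairwise MRF.

    node_pot : dict  i -> (K,) array of unary potentials psi_i(x_i)
    edge_pot : dict (i, j) -> (K, K) array of pairwise potentials psi_ij(x_i, x_j)
    edges    : list of (i, j) pairs, each stored once
    Returns a dict mapping each node to its (locally optimal) state.
    """
    rng = np.random.default_rng(seed)
    nbrs = {i: [] for i in node_pot}
    for (i, j) in edges:
        nbrs[i].append((j, edge_pot[(i, j)]))      # rows index x_i, cols index x_j
        nbrs[j].append((i, edge_pot[(i, j)].T))    # transposed so rows index x_j
    x = {i: int(rng.integers(len(p))) for i, p in node_pot.items()}   # random init
    for _ in range(num_iters):
        changed = False
        for i in node_pot:
            # score each candidate state of x_i using only the factors
            # in its Markov blanket, with all other nodes held fixed
            score = node_pot[i].astype(float).copy()
            for j, psi in nbrs[i]:
                score = score * psi[:, x[j]]
            best = int(np.argmax(score))
            if best != x[i]:
                x[i], changed = best, True
        if not changed:        # coordinate-wise local optimum reached
            break
    return x
```

Each update touches only the factors in the Markov blanket of h_i, which is why ICM is cheap but prone to local minima.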

SP2-9

Expectation Maximization (EM)

- Point estimates for parameters (ML or MAP), full posterior for hidden vars.
- E-step: minimize F(Q,P) wrt Q(h), i.e. set Q(h) = P(h | v, θ) using exact inference
- M-step: minimize F(Q,P) wrt the parameters θ, i.e. maximize the expected complete-data log-likelihood (plus the log of the parameter prior, for MAP); see the toy example below
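As a concrete toy illustration of the E/M split (my own example, not from the tutorial): EM for a two-component 1-D Gaussian mixture, where the E-step computes Q(h) = P(h | v, θ) exactly and the M-step re-estimates θ from expected sufficient statistics:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, num_iters=50):
    """EM for a 2-component 1-D Gaussian mixture (toy illustration)."""
    # crude initialization of theta = (mixing weight, means, std devs)
    pi, mu, sigma = 0.5, np.array([x.min(), x.max()]), np.array([x.std(), x.std()])
    for _ in range(num_iters):
        # E-step: Q(h) = P(h | v, theta) = responsibilities of each component
        lik = np.stack([(1 - pi) * norm.pdf(x, mu[0], sigma[0]),
                        pi       * norm.pdf(x, mu[1], sigma[1])])
        resp = lik / lik.sum(axis=0)                       # shape (2, N)
        # M-step: maximize the expected complete-data log-likelihood wrt theta
        Nk = resp.sum(axis=1)                              # expected counts
        mu = (resp @ x) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
        pi = Nk[1] / len(x)
    return pi, mu, sigma

# usage: data drawn from two Gaussians
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(data))
```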

SP2-10

EM: tricks of the trade

- Generalized EM [Neal98]
- Partial M-step: reduce (rather than fully minimize) F(Q,P) wrt θ [e.g., a few gradient steps]
- Partial E-step: reduce F(Q,P) wrt Q(h) [approximate inference]
- Avoiding local optima
- Deterministic annealing [Rose98]
- Data resampling [Elidan02]
- Speedup tricks
- Combine with conjugate gradient [Salakhutdinov03]
- Online/incremental updates [Bauer97, Neal98]

SP2-11

Variational Bayes (VB)

Ghahramani00, Beal02

- Use a factorized posterior over hidden variables and parameters: Q(h, θ) = Q(h) Q(θ)
- For exponential family models with conjugate priors, this results in a generalized version of EM
- E-step: modified inference to take into account uncertainty of parameters
- M-step: optimize Q(θ) using expected sufficient statistics
- Variational Message Passing automates this, assuming a fully factorized (mean field) Q [Winn04; see variational-Bayes.org]

SP2-12

Variational inference for discrete state models with high treewidth

- We assume the parameters are fixed.
- We assume Q(h) has a simple form, so the expectations needed to minimize F are easy to compute
- Mean field: Q(h) = Π_i Q_i(h_i) (fully factorized)
- Structured variational: Q retains tractable substructure, e.g. approximating a grid MRF by a product of chains [Xing04]

SP2-13

Variational inference for MRFs

- Probability is proportional to exp(-energy): P(x) = exp(-E(x)) / Z
- Free energy = average energy - entropy

SP2-14

Mean field for MRFs

- Fully factorized approximation
- Normalization constraint
- Average energy
- Entropy
- Local minima satisfy a set of coupled fixed-point equations (see the sketch below)
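A sketch of the equations these bullets refer to, for a pairwise MRF with potentials ψ_i and ψ_ij (standard mean-field form; assumed rather than copied from the slide):

```latex
Q(\mathbf{x}) = \prod_i q_i(x_i), \qquad \sum_{x_i} q_i(x_i) = 1
\\
F(Q) = \underbrace{-\sum_{(ij)}\sum_{x_i,x_j} q_i(x_i)\,q_j(x_j)\,\ln\psi_{ij}(x_i,x_j)
       -\sum_i\sum_{x_i} q_i(x_i)\,\ln\psi_i(x_i)}_{\text{average energy}}
       \;+\;\underbrace{\sum_i\sum_{x_i} q_i(x_i)\,\ln q_i(x_i)}_{-\,\text{entropy}}
\\
q_i(x_i) \propto \psi_i(x_i)\,
  \exp\Bigl(\sum_{j\in \mathrm{nbr}(i)}\sum_{x_j} q_j(x_j)\,\ln\psi_{ij}(x_i,x_j)\Bigr)
\quad\text{(fixed-point condition at local minima)}
```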

SP2-15

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-16

BP vs mean field for MRFs

- Mean field updates: each node i passes the same information (its current marginal Q_i) to all of its neighbors
- BP updates: every node i sends a different message to each neighbor j, excluding the information it received from j (a code sketch of the updates follows below)
- Empirically, BP is much better than MF (e.g., MF is not exact even for trees) [Weiss01]
- BP is (attempting to) minimize the Bethe free energy [Yedidia01]
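A minimal sketch of parallel sum-product loopy BP on a pairwise MRF (my own illustrative code; the names and data structures are assumptions, not the tutorial's):

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, edges, num_iters=50):
    """Parallel sum-product loopy BP for a pairwise MRF.

    node_pot : dict  i -> (K,) array of unary potentials
    edge_pot : dict (i, j) -> (K, K) array of pairwise potentials, indexed [x_i, x_j]
    edges    : list of (i, j) pairs, each stored once
    Returns approximate marginals (beliefs) b_i(x_i).
    """
    nbrs = {i: [] for i in node_pot}
    msg = {}
    for (i, j) in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        msg[(i, j)] = np.ones(len(node_pot[j])) / len(node_pot[j])
        msg[(j, i)] = np.ones(len(node_pot[i])) / len(node_pot[i])

    def pot(i, j):
        # pairwise potential oriented so rows index x_i and columns index x_j
        return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

    for _ in range(num_iters):
        new_msg = {}
        for (i, j) in msg:
            # product of node potential and incoming messages to i, excluding the one from j
            prod = node_pot[i].astype(float).copy()
            for k in nbrs[i]:
                if k != j:
                    prod = prod * msg[(k, i)]
            m = pot(i, j).T @ prod            # sum out x_i to get a message over x_j
            new_msg[(i, j)] = m / m.sum()
        msg = new_msg

    beliefs = {}
    for i in node_pot:
        b = node_pot[i].astype(float).copy()
        for k in nbrs[i]:
            b = b * msg[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```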

SP2-17

Bethe free energy

- We assume the graph is a tree, in which case the following is exact
- Constraints
- Normalization
- Marginalization
- Average energy
- Entropy

di = #neighbors for node i
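A sketch of the Bethe free energy the bullets above describe, in the standard form of Yedidia, Freeman & Weiss (a reconstruction; the slide's own equations are not reproduced here):

```latex
F_{\mathrm{Bethe}} =
  \underbrace{-\sum_{(ij)}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\,\ln\psi_{ij}(x_i,x_j)
              -\sum_i\sum_{x_i} b_i(x_i)\,\ln\psi_i(x_i)}_{\text{average energy}}
\\
 +\underbrace{\sum_{(ij)}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\,\ln b_{ij}(x_i,x_j)
              -\sum_i (d_i-1)\sum_{x_i} b_i(x_i)\,\ln b_i(x_i)}_{-\,\text{entropy}}
\\
\text{subject to}\quad \sum_{x_i} b_i(x_i)=1,
\qquad \sum_{x_j} b_{ij}(x_i,x_j)=b_i(x_i).
```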

SP2-18

BP minimizes Bethe free energy

Yedidia01

- Theorem [Yedidia, Freeman, Weiss]: fixed points of BP are stationary points of the Bethe free energy
- BP may not converge; other algorithms can directly minimize F, but are slower.
- If BP does not converge, it often means F is a poor approximation

SP2-19

Kikuchi free energy

[Figure: a 2x3 grid of nodes 1–6; Bethe regions are single edges, Kikuchi regions are larger clusters of nodes]

- Cluster groups of nodes together into regions
- Energy per region
- Free energy per region
- Kikuchi free energy: sum of the regional free energies, weighted by counting numbers so that nothing is double-counted

SP2-20

Counting numbers

[Figure: region graphs for the 2x3 grid. Bethe: top-level regions are the edges 12, 23, 14, 25, 36, 45, 56; below them are the single nodes 1–6 with counting numbers C = -1, -2, -1, -1, -2, -1 (each node counts 1 minus the number of edge regions containing it). Kikuchi: top-level regions 1245 and 2356, with intersection region 25 whose counting number is C = 1 - (1 + 1) = -1]

- F_Kikuchi is exact if the region graph contains 2 levels (regions and intersections) and has no cycles – equivalent to the junction tree!

SP2-21

Generalized BP

[Figure: a 3x3 grid of nodes 1–9 clustered into overlapping four-node regions (e.g. 2356, 4578, 5689) whose pairwise intersections are 25, 45, 56, 58]

- F_Kikuchi is no longer exact, but is more accurate than F_Bethe
- Generalized BP can be used to minimize F_Kikuchi
- This method of choosing regions is called the "cluster variational method" [Welling04]
- In the limit, we recover the junction tree algorithm.

SP2-22

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-23

Expectation Propagation (EP)

Minka01

- EP = iterated assumed density filtering
- ADF = recursive Bayesian estimation interleaved with projection step
- Examples of ADF:
- Extended Kalman filtering
- Moment-matching (weak marginalization)
- Boyen-Koller algorithm
- Some online learning algorithms

SP2-24

Assumed Density Filtering (ADF)

[Figure: hidden variable x with observations Y1, ..., Yn; recursive Bayesian estimation (sequential updating of the posterior) alternates update and project steps]

- If p(y_i|x) is not conjugate to p(x), then p(x|y_1:i) may not be tractably representable
- So project the posterior back to a representable family; for exponential families, projection becomes moment matching
- And repeat

SP2-25

Expectation Propagation

- ADF is sensitive to the order of updates.
- ADF approximates each posterior myopically.
- EP: iteratively approximate each term.

- Exact posterior: intractable
- ADF: simple, non-iterative, inaccurate
- EP: simple, iterative, accurate

After Ghahramani

SP2-26

Expectation Propagation

- Input:
- Initialize:
- Repeat
- For i=0..N
- Deletion:
- Projection:
- Inclusion:
- Until convergence
- Output: q(x)

After Ghahramani
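A sketch of the standard EP cycle that the deletion/projection/inclusion steps refer to, following Minka01, with factors f_i and approximating terms \tilde f_i such that q(x) ∝ ∏_i \tilde f_i(x) (a reconstruction, not the slide's own notation):

```latex
\text{Deletion:}\quad  q^{\setminus i}(x) \propto \frac{q(x)}{\tilde f_i(x)}
\\
\text{Projection:}\quad q^{\mathrm{new}}(x) = \arg\min_{q'\in\mathcal{F}}
   D\!\Bigl(\tfrac{1}{Z_i}\,f_i(x)\,q^{\setminus i}(x)\,\Big\|\,q'(x)\Bigr)
   \quad\text{(moment matching for exponential-family } \mathcal{F})
\\
\text{Inclusion:}\quad \tilde f_i(x) \leftarrow Z_i\,\frac{q^{\mathrm{new}}(x)}{q^{\setminus i}(x)},
\qquad q(x) \leftarrow q^{\mathrm{new}}(x)
```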

SP2-27

BP is a special case of EP

- BP assumes q(x) is fully factorized
- At each iteration, for each factor i, for each node k, the KL projection matches moments (computes marginals by absorbing messages from the neighbors)

[Figure: factor-graph fragment with factors f_i, f_j attached to node X_k and its neighbors X_n1, X_n2]

SP2-28

TreeEP

Minka03

- TreeEP assumes q(x) is represented by a tree (regardless of “true” model topology).
- We can use the Jtree algorithm to do the moment matching at each iteration.
- Faster and more accurate than LBP.
- Faster than, and comparably accurate to, GBP.

SP2-29

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-30

MPE in MRFs

- MAP estimation = energy minimization
- Simplifications:
- Only pairwise potentials: E_ijk = 0, etc.
- Special form for potentials
- Binary variables x_i ∈ {0,1}

SP2-31

Kinds of potential

- Metric
- Semi-metric: satisfies (2) & (3)
- Piecewise constant, e.g. the Potts model (a metric)
- Piecewise smooth (metric or semi-metric, depending on the form)
- Discontinuity-preserving potentials avoid oversmoothing

SP2-32

Graph cuts

Kolmogorov04

[Figure: s-t graph construction for a pairwise term over x_i, x_j, with edge weights C-A, C-D, and B+C-A-D, where A = E(0,0), B = E(0,1), C = E(1,0), D = E(1,1)]

- Thm: we can find argmin E(x) for binary variables and pairwise potentials in at most O(N^3) time using a maxflow/mincut algorithm on the graph above iff the potentials are submodular, i.e. E(0,0) + E(1,1) ≤ E(0,1) + E(1,0)
- Metric potentials (e.g. Potts) are always submodular.
- Thm: the general case (e.g. non-binary or non-submodular) is NP-hard.

SP2-33

Finding strong local minimum

- For the non-binary case, we can find the optimum wrt a large space of moves by iteratively solving binary subproblems.
- α-expansion: any pixel can change its label to α
- αβ-swap: any pixel labeled α can switch to β, and vice versa

Picture from Zabih

SP2-34

Finding strong local minimum

- Start with an arbitrary assignment f
- done := false
- While ~done
- done := true
- For each label α
- Find f' = argmin E(f') over all labelings one α-expansion away from f (a binary subproblem, solvable with a graph cut; see the sketch below)
- If E(f') < E(f) then done := false; f := f'
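The same loop as a hedged Python sketch; `solve_expansion_move` and `energy` are hypothetical callables standing in for the graph-cut binary subproblem and the energy function E(f):

```python
import numpy as np

def alpha_expansion(f, labels, energy, solve_expansion_move):
    """Outer loop of alpha-expansion (sketch).

    f                    : initial labeling, one label per pixel/node
    labels               : iterable of all labels
    energy(f)            : returns E(f)
    solve_expansion_move : hypothetical helper; given (f, alpha) it solves the
                           binary subproblem "keep current label vs. switch to
                           alpha" with a single graph cut and returns f'
    """
    f = np.asarray(f).copy()
    done = False
    while not done:
        done = True
        for alpha in labels:
            f_new = solve_expansion_move(f, alpha)   # binary subproblem!
            if energy(f_new) < energy(f):
                f, done = f_new, False
    return f
```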

SP2-35

Properties of the 2 algorithms

- α-expansion
- Requires V to be submodular (e.g. a metric)
- O(L) binary subproblems per cycle
- Within a factor of 2c(V) of the optimal energy; c = 1 for the Potts model
- αβ-swap
- Requires V to be a semi-metric
- O(L^2) binary subproblems per cycle
- No comparable theoretical guarantee, but works well in practice

SP2-36

Summary of inference methods for pairwise MRFs

- Marginals
- Mean field
- Loopy/ generalized BP (sum-product)
- EP
- Gibbs sampling
- Swendsen-Wang
- MPE/ Viterbi
- Iterative conditional modes (ICM)
- Loopy/generalized BP (max-product)
- Graph cuts
- Simulated annealing

See Boykov01, Weiss01 and Tappen03 for some empirical comparisons

SP2-37

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-38

Monte Carlo (sampling) methods

- Goal: estimate an expectation E_P[f(X)], e.g. a posterior marginal P(X_q | x_E)
- Draw N independent samples x^r ~ P and use the empirical average (1/N) Σ_r f(x^r)
- Hard to draw (independent) samples from P
- Accuracy is independent of the dimensionality of X

SP2-39

Importance Sampling for BNs (likelihood weighting)

- Input: CPDs P(X_i | X_pa(i)), evidence x_E
- Output: weighted samples (x^r, w^r) for estimating P(X_q | x_E); see the sketch below
- For each sample r
- w^r = 1
- For each node i in topological order
- If X_i is observed
- Then x_i^r = x_i^E; w^r = w^r * P(X_i = x_i^E | X_pa(i) = x_pa(i)^r)
- Else x_i^r ~ P(X_i | X_pa(i) = x_pa(i)^r)
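A minimal Python sketch of likelihood weighting on the classic sprinkler network (Cloudy, Sprinkler, Rain, WetGrass); the CPT values are the usual textbook ones and are used here only for illustration:

```python
import numpy as np
rng = np.random.default_rng(0)

# Sprinkler network: C -> S, C -> R, (S, R) -> W.  Textbook CPT values.
p_c = 0.5
p_s_given_c = {0: 0.5, 1: 0.1}
p_r_given_c = {0: 0.2, 1: 0.8}
p_w_given_sr = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}

def likelihood_weighting(num_samples=100_000, w_obs=1):
    """Estimate P(Rain=1 | WetGrass=w_obs) by likelihood weighting."""
    total, rain_weighted = 0.0, 0.0
    for _ in range(num_samples):
        w = 1.0
        c = int(rng.random() < p_c)            # sample unobserved nodes in topological order...
        s = int(rng.random() < p_s_given_c[c])
        r = int(rng.random() < p_r_given_c[c])
        pw = p_w_given_sr[(s, r)]              # ...and weight by the observed node's CPD
        w *= pw if w_obs == 1 else (1 - pw)
        total += w
        rain_weighted += w * r
    return rain_weighted / total

print(likelihood_weighting())   # roughly 0.7 for these CPTs
```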

SP2-41

Drawbacks of importance sampling

[Figure: a small BN with nodes C, S, R, W, shown before and after evidence reversal]

- Sample given upstream evidence, weight by downstream evidence.
- Evidence reversal = modify the model to make all observed nodes be parents – can be expensive
- Does not scale to high-dimensional spaces, even if Q is similar to P, since the variance of the weights is too high.

SP2-42

Sequential importance sampling (particle filtering)

Arulampalam02, Doucet01

[Figure: state-space model with hidden states X1, X2, X3 and observations Y1, Y2, Y3]

- Apply importance sampling to a (nonlinear, non-Gaussian) dynamical system; see the sketch below
- Resample particles with probability proportional to their weights w_t
- Unlikely hypotheses get replaced
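A minimal bootstrap-filter sketch (sequential importance sampling with resampling); the model, names, and parameters are my own toy assumptions, not the tutorial's:

```python
import numpy as np
rng = np.random.default_rng(0)

def bootstrap_filter(ys, num_particles, init_sample, transition_sample, obs_likelihood):
    """Sequential importance sampling with resampling (bootstrap filter).

    ys                : observations y_1..y_T
    init_sample(n)    : draw n particles from the prior p(x_1)
    transition_sample : particles at t-1 -> particles at t, sampled from p(x_t | x_{t-1})
    obs_likelihood    : (y_t, particles) -> p(y_t | x_t) for each particle
    Returns the filtered posterior means E[x_t | y_1:t].
    """
    particles = init_sample(num_particles)
    means = []
    for y in ys:
        particles = transition_sample(particles)         # propose from the dynamics
        w = obs_likelihood(y, particles)                  # weight by the observation
        w = w / w.sum()
        means.append(np.sum(w * particles))
        # resample particles with probability proportional to w_t;
        # unlikely hypotheses get replaced by copies of likely ones
        idx = rng.choice(num_particles, size=num_particles, p=w)
        particles = particles[idx]
    return np.array(means)

# usage on a toy nonlinear model: x_t = 0.9 x_{t-1} + sin(x_{t-1}) + noise, y_t = x_t + noise
T = 50
xs = np.zeros(T)
for t in range(1, T):
    xs[t] = 0.9 * xs[t - 1] + np.sin(xs[t - 1]) + rng.normal(0, 0.5)
ys = xs + rng.normal(0, 1.0, T)

est = bootstrap_filter(
    ys, 500,
    init_sample=lambda n: rng.normal(0, 1, n),
    transition_sample=lambda p: 0.9 * p + np.sin(p) + rng.normal(0, 0.5, p.shape),
    obs_likelihood=lambda y, p: np.exp(-0.5 * (y - p) ** 2),
)
```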

SP2-43

Markov Chain Monte Carlo (MCMC)

Neal93,Mackay98

- Draw dependent samples xt from a chain with transition kernel T(x’ | x), s.t.
- P(x) is the stationary distribution
- The chain is ergodic (every state is reachable from every other state)
- If T satisfies detailed balance, P(x) T(x'|x) = P(x') T(x|x'), then P is the stationary distribution

SP2-44

Metropolis Hastings

- Sample a proposal x' ~ Q(x' | x_{t-1})
- Accept the new state with probability min(1, [P(x') Q(x_{t-1} | x')] / [P(x_{t-1}) Q(x' | x_{t-1})]); otherwise stay at x_{t-1}
- Satisfies detailed balance (see the sketch below)
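A minimal random-walk Metropolis sketch for an unnormalized target density (my own toy example; with a symmetric proposal the Hastings ratio Q(x_{t-1}|x') / Q(x'|x_{t-1}) cancels):

```python
import numpy as np
rng = np.random.default_rng(0)

def metropolis_hastings(log_p, x0, num_samples, step=0.5):
    """Random-walk Metropolis for an unnormalized log-density log_p."""
    x, samples = x0, []
    for _ in range(num_samples):
        x_prop = x + step * rng.normal()            # sample x' ~ Q(x' | x_{t-1})
        accept_prob = min(1.0, np.exp(log_p(x_prop) - log_p(x)))
        if rng.random() < accept_prob:              # accept, else keep the old state
            x = x_prop
        samples.append(x)
    return np.array(samples)

# usage: sample from an (unnormalized) mixture of two Gaussians
log_p = lambda x: np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)
draws = metropolis_hastings(log_p, x0=0.0, num_samples=20_000)
```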

SP2-45

Gibbs sampling

- Metropolis method where Q is defined in terms of the full conditionals P(X_i | X_{-i}).
- Acceptance rate = 1.
- For graphical model, only need to condition on the Markov blanket

See BUGS software
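A minimal Gibbs-sampling sketch for an Ising-style grid MRF, where each full conditional depends only on the Markov blanket (my own toy example, not BUGS code):

```python
import numpy as np
rng = np.random.default_rng(0)

def gibbs_ising(shape=(30, 30), coupling=0.8, num_sweeps=200):
    """Gibbs sampling for an Ising model P(x) ∝ exp(J Σ_{(i,j)} x_i x_j), x_i ∈ {-1,+1}."""
    H, W = shape
    x = rng.choice([-1, 1], size=shape)
    for _ in range(num_sweeps):
        for i in range(H):
            for j in range(W):
                # sum over the Markov blanket of site (i, j): its grid neighbors
                s = 0
                if i > 0:     s += x[i - 1, j]
                if i < H - 1: s += x[i + 1, j]
                if j > 0:     s += x[i, j - 1]
                if j < W - 1: s += x[i, j + 1]
                # full conditional P(x_ij = +1 | Markov blanket); acceptance rate is 1
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * coupling * s))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

sample = gibbs_ising()
```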

SP2-46

Difficulties with MCMC

- May take long time to “mix” (converge to stationary distribution).
- Hard to know when chain has mixed.
- Simple proposals exhibit random walk behavior.
- Hybrid Monte Carlo (use gradient information)
- Swendsen-Wang (large moves for Ising model)
- Heuristic proposals

SP2-47

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/ stochastic

SP2-48

Comparison of deterministic and stochastic methods

- Deterministic
- fast but inaccurate
- Stochastic
- slow but accurate
- Can handle arbitrary hypothesis space
- Combine best of both worlds (hybrid)
- Use smart deterministic proposals
- Integrate out some of the states, sample the rest (Rao-Blackwellization)
- Non-parametric BP (particle filtering for graphs)

SP2-49

Examples of deterministic proposals

- State estimation
- Unscented particle filter [Merwe00]
- Machine learning
- Variational MCMC [deFreitas01]
- Computer vision
- Data-driven MCMC [Tu02]

SP2-50

Example of Rao-Blackwellized particle filters

[Figure: switching state-space model with discrete switch nodes S1, S2, S3, continuous states X1, X2, X3, and observations Y1, Y2, Y3]

- Conditioned on the discrete switching nodes, the remaining system is linear-Gaussian and can be integrated out using the Kalman filter.
- Each particle contains a sampled value s_t^r and the mean/covariance of P(X_t | y_{1:t}, s_{1:t}^r)

SP2-51

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/ stochastic
- Summary

SP2-52

Summary of inference methods

[Table of methods, organized into deterministic vs. stochastic approximations, omitted]

Abbreviations: BP = belief propagation, EP = expectation propagation, ADF = assumed density filtering, EKF = extended Kalman filter, UKF = unscented Kalman filter, VarElim = variable elimination, Jtree = junction tree, EM = expectation maximization, VB = variational Bayes, NBP = non-parametric BP

SP2-53
