
# Divergence measures and message passing






### Divergence measures and message passing

Tom Minka

Microsoft Research

Cambridge, UK

with thanks to the Machine Learning and Perception Group

• Example of message passing

• Interpreting message passing

• Divergence measures

• Message passing from a divergence measure

• Big picture


Estimation Problem

[Slide figure: a factor graph linking observed binary variables a–f (values 1, 0, 1, 0, 1) to unknown variables x, y, z, marked "?".]

[Slide figure: the same factor graph over x, y, z and a–f.]
Queries: marginals, the normalizing constant, the argmax.

Want to do these quickly.

Final results:

• Marginals of x, y, z: exact vs. BP (shown as plots on the slide)

• Normalizing constant: 0.45 (exact), 0.44 (BP)

• Argmax: (0,0,0) (exact), (0,0,0) (BP)

Interpreting message passing

Message Passing = Distributed Optimization

• Messages represent a simpler distribution q(x) that approximates p(x)

• A distributed representation

• Message passing = optimizing q to fit p

• q stands in for p when answering queries

• Parameters:

• What type of distribution to construct (approximating family)

• What cost to minimize (divergence measure)

• Pick an approximating family

• fully-factorized, Gaussian, etc.

• Pick a divergence measure

• Construct an optimizer for that measure

• usually fixed-point iteration

• Distribute the optimization across factors

Divergence measures

Let p, q be unnormalized distributions.

Kullback-Leibler (KL) divergence:

KL(p || q) = ∫ p(x) log(p(x)/q(x)) dx + ∫ (q(x) − p(x)) dx

Alpha-divergence (α is any real number):

D_α(p || q) = (1/(α(1−α))) ∫ [α p(x) + (1−α) q(x) − p(x)^α q(x)^(1−α)] dx

Both are asymmetric and convex; KL(p || q) is the limit of D_α as α → 1, and KL(q || p) the limit as α → 0.
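To make the two divergences concrete, here is a small numerical sketch (the grids and the particular Gaussians are invented for illustration) that evaluates both on a discretized pair of distributions and checks the standard limits: D_α approaches KL(p || q) as α → 1 and KL(q || p) as α → 0.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

p = gauss(x, 0.0, 1.0)
q = gauss(x, 1.0, 2.0)

def kl(p, q):
    # KL for unnormalized distributions: integral of p log(p/q) + q - p
    return np.sum(p * np.log(p / q) + q - p) * dx

def alpha_div(p, q, a):
    # D_alpha(p||q) = integral of [a p + (1-a) q - p^a q^(1-a)] / (a (1-a))
    return np.sum(a * p + (1 - a) * q - p ** a * q ** (1 - a)) * dx / (a * (1 - a))

print(alpha_div(p, q, 0.999), kl(p, q))   # alpha near 1 recovers KL(p||q)
print(alpha_div(p, q, 0.001), kl(q, p))   # alpha near 0 recovers KL(q||p)
```

The asymmetry is also visible here: `kl(p, q)` and `kl(q, p)` give different values for the same pair.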

[Slide figures: a Gaussian q fit to a non-Gaussian p by minimizing D_α(p || q), shown for α = −∞, 0, 0.5, 1, and ∞.]

• α ≤ 0 seeks the mode with largest mass (not tallest)

• zero-forcing: p(x) = 0 forces q(x) = 0

• underestimates the support of p

• α ≥ 1 stretches to cover everything

• inclusive: p(x) > 0 forces q(x) > 0

• overestimates the support of p

[Frey, Patrascu, Jaakkola, Moran 00]

[Slide figure: the α axis. Zero-forcing divergences sit at α ≤ 0, with MF at α = 0; inclusive (zero-avoiding) divergences sit at α ≥ 1, with BP and EP at α = 1 and TRW at α > 1; FBP and PEP allow arbitrary α.]

• If q is an exact minimum of the alpha-divergence:

• Normalizing constant: ∫ q(x) dx = ∫ p(x)^α q(x)^(1−α) dx (so for α = 1, ∫ q = ∫ p)

• If α = 1: a Gaussian q matches the mean and variance of p

• A fully factorized q matches the marginals of p

[Slide figure: a two-variable distribution p(x, y).]

• q is fully factorized, minimizes the α-divergence to p

• q has correct marginals only for α = 1 (BP)
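The marginals claim can be verified on a tiny discrete example (the 2×2 target table below is invented for illustration): fit a fully factorized q(x)q(y) by brute-force grid search under both KL directions and compare the optima to the true marginals, which are (0.5, 0.5) for this target.

```python
import numpy as np
from itertools import product

# correlated binary target: mass concentrated on (0,0) and (1,1)
p = np.array([[0.49, 0.01],
              [0.01, 0.49]])

def kl(a, b):
    # KL(a||b) for distributions given as arrays
    return float(np.sum(a * np.log(a / b)))

grid = np.linspace(0.01, 0.99, 99)
best = {}
for name, div in [("alpha=1 (BP-like)", lambda q: kl(p, q)),
                  ("alpha=0 (MF-like)", lambda q: kl(q, p))]:
    scores = []
    for a, b in product(grid, grid):
        q = np.outer([1 - a, a], [1 - b, b])  # fully factorized q(x)q(y)
        scores.append((div(q), a, b))
    best[name] = min(scores)[1:]

print(best)
# alpha=1 optimum sits at the true marginals (0.5, 0.5);
# alpha=0 locks onto one of the two modes instead
```

This is the bimodal picture in miniature: the α = 1 fit averages over both modes (matching marginals), while the α = 0 fit commits to one of them.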

[Slide figures: approximating a bimodal distribution for α = 1 (BP), α = 0 (MF), and α = 0.5; and a second bimodal example with α = 1.]

• Neither method is inherently superior – depends on what you care about

• A factorized approximation does not imply matching marginals (only for α = 1)

• Adding y to the problem can change the estimated marginal for x (though true marginal is unchanged)

Message passing from a divergence measure

• Write p as a product of factors: p(x) = ∏_a f_a(x)

• Approximate the factors one by one: f_a(x) → f̃_a(x)

• Multiply to get the approximation: q(x) = ∏_a f̃_a(x)

• Global divergence: D(p(x) || q(x))

• Local divergence: D(f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x)), where q^{\a}(x) = ∏_{b≠a} f̃_b(x) is the product of the other approximate factors

• Messages are passed between factors

• Messages are factor approximations: the message from factor a is f̃_a(x)

• Minimize the local divergence to get f̃_a(x)

• Send f̃_a(x) to the other factors

• Repeat until convergence
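As a concrete instance of this scheme, here is a minimal grid-based sketch in the style of EP (the 1-D model, its two factors, and the grid are invented for illustration; the local divergence is KL(p || q), so each local minimization is a moment match in the Gaussian family):

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

# target p(x) = f1(x) f2(x): a logistic "likelihood" factor times a Gaussian prior
factors = [lambda t: 1.0 / (1.0 + np.exp(-3 * t)),
           lambda t: np.exp(-t ** 2 / 2)]

# approximate factors stored as Gaussian natural parameters (precision, precision*mean)
sites = [np.zeros(2) for _ in factors]

def gauss_natural(tau, nu):
    # unnormalized Gaussian exp(nu*x - tau*x^2/2) on the grid
    return np.exp(nu * x - tau * x ** 2 / 2)

for _ in range(10):                                           # repeat until convergence
    for a, f in enumerate(factors):
        cav = sum(s for b, s in enumerate(sites) if b != a)   # q^{\a}: the other factors
        tilted = f(x) * gauss_natural(*cav)                   # f_a(x) q^{\a}(x)
        Z = np.sum(tilted) * dx
        m = np.sum(x * tilted) * dx / Z
        v = np.sum((x - m) ** 2 * tilted) * dx / Z
        # minimizing local KL = matching the mean and variance of the tilted distribution
        sites[a] = np.array([1 / v, m / v]) - cav             # divide out the cavity

tau, nu = sum(sites)
print("EP mean/var:  ", nu / tau, 1 / tau)

# exact answer on the same grid, for comparison
p = factors[0](x) * factors[1](x)
pm = np.sum(x * p) / np.sum(p)
pv = np.sum((x - pm) ** 2 * p) / np.sum(p)
print("exact mean/var:", pm, pv)
```

With only two factors and exact grid moments, the fixed point reproduces the exact mean and variance; with more factors, local and global minimization no longer coincide and the answers are only approximate.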

• Produces all six algorithms (MF, BP, EP, TRW, FBP, Power EP)

In general, local ≠ global

• but results are similar

• BP doesn’t minimize global KL, but comes close

[Slide figure: on the α axis, MF (α = 0) is the one case where local = global, so nothing is lost by message passing; elsewhere local ≠ global.]

• Which message-passing algorithm is best at minimizing the global D_α(p||q)?

• Procedure:

• Run FBP with various α_L

• Compute the global divergence for various α_G

• Find the best α_L (best algorithm) for each α_G

• Average over 20 graphs, random singleton and pairwise potentials: exp(w_ij x_i x_j)

• Mixed potentials (w ~ U(−1,1)):

• best α_L = α_G (local should match global)

• FBP with the same α is best at minimizing D_α

• BP is best at minimizing KL

Big picture

[Slide: the algorithms arranged on two axes, approximating family vs. divergence measure.]

| Algorithm | Approximating family | Divergence measure |
|---|---|---|
| MF | fully factorized | KL(q ‖ p) |
| Structured MF | exp family | KL(q ‖ p) |
| BP | fully factorized | KL(p ‖ q) |
| EP | exp family | KL(p ‖ q) |
| TRW | fully factorized | D_α(p ‖ q), α > 1 |
| FBP | fully factorized | D_α(p ‖ q) |
| Power EP | exp family | D_α(p ‖ q) |

Other families? (mixtures)

Other divergences?

Other message-passing algorithms: do they correspond to divergence measures?

• Generalized belief propagation [Yedidia,Freeman,Weiss 00]

• Iterated conditional modes [Besag 86]

• Max-product belief revision

• TRW-max-product [Wainwright,Jaakkola,Willsky 02]

• Laplace propagation [Smola,Vishwanathan,Eskin 03]

• Penniless propagation [Cano,Moral,Salmerón 00]

• Bound propagation [Leisink,Kappen 03]

• Understand existing message passing algorithms

• Understand local vs. global divergence

• New message passing algorithms:

• Specialized divergence measures

• Richer approximating families

• Other ways to minimize divergence