
Divergence measures and message passing

Tom Minka

Microsoft Research

Cambridge, UK

with thanks to the Machine Learning and Perception Group



Message-Passing Algorithms



Outline

  • Example of message passing

  • Interpreting message passing

  • Divergence measures

  • Message passing from a divergence measure

  • Big picture




Estimation Problem

[Figure: a factor graph with variable nodes x, y, z and factor nodes a, b, c, d, e, f]


Estimation Problem

[Figure: the same factor graph; x, y, z are binary (0/1) and unknown, marked "?"]


Estimation Problem

[Figure: the model collapsed onto the three variables x, y, z]


Estimation Problem

Queries: the marginals of x, y, z; the normalizing constant; the argmax (cf. the Belief Propagation results below)

Want to compute these quickly


Belief Propagation

[Figure: messages passed along the edges of the graph over x, y, z]


Belief Propagation

[Figure: the final messages and beliefs for x, y, z]


Belief Propagation

Marginals: [Figure: exact vs. BP marginals for x, y, z]

Normalizing constant: 0.45 (Exact), 0.44 (BP)

Argmax: (0, 0, 0) (Exact), (0, 0, 0) (BP)



Outline

  • Example of message passing

  • Interpreting message passing

  • Divergence measures

  • Message passing from a divergence measure

  • Big picture



Message Passing = Distributed Optimization

  • Messages represent a simpler distribution q(x) that approximates p(x)

    • A distributed representation

  • Message passing = optimizing q to fit p

    • q stands in for p when answering queries

  • Parameters:

    • What type of distribution to construct (approximating family)

    • What cost to minimize (divergence measure)



How to make a message-passing algorithm

  • Pick an approximating family

    • fully-factorized, Gaussian, etc.

  • Pick a divergence measure

  • Construct an optimizer for that measure

    • usually fixed-point iteration

  • Distribute the optimization across factors



Outline

  • Example of message passing

  • Interpreting message passing

  • Divergence measures

  • Message passing from a divergence measure

  • Big picture


Let p, q be unnormalized distributions.

Kullback-Leibler (KL) divergence:

$$KL(p \,\|\, q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx + \int \big(q(x) - p(x)\big)\,dx$$

Alpha-divergence (α is any real number):

$$D_\alpha(p \,\|\, q) = \frac{\displaystyle\int \alpha\, p(x) + (1-\alpha)\, q(x) - p(x)^\alpha q(x)^{1-\alpha}\,dx}{\alpha(1-\alpha)}$$

Asymmetric and convex.
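As a concrete reference for these definitions, here is a minimal numerical sketch for discrete, unnormalized distributions stored as positive NumPy arrays; the function names and the array representation are assumptions of this sketch, not anything from the talk.

    import numpy as np

    def kl(p, q):
        # KL divergence for unnormalized distributions, in the convention above:
        # KL(p||q) = sum p log(p/q) + sum (q - p).
        return np.sum(p * np.log(p / q)) + np.sum(q - p)

    def alpha_div(p, q, alpha):
        # Alpha-divergence D_alpha(p||q); it reduces to KL(q||p) as alpha -> 0
        # and to KL(p||q) as alpha -> 1 (handled explicitly to avoid 0/0).
        if np.isclose(alpha, 0.0):
            return kl(q, p)
        if np.isclose(alpha, 1.0):
            return kl(p, q)
        integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
        return np.sum(integrand) / (alpha * (1 - alpha))

Both functions return zero exactly when p equals q and are positive otherwise, which is the basic property that makes them usable as fit criteria.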



Examples of alpha-divergence
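For reference, the standard special cases of $D_\alpha$, each of which follows directly from the definition above, are:

$$
\begin{aligned}
D_{-1}(p\,\|\,q) &= \tfrac{1}{2}\int \frac{\big(q(x)-p(x)\big)^2}{p(x)}\,dx\\
\lim_{\alpha\to 0} D_\alpha(p\,\|\,q) &= KL(q\,\|\,p)\\
D_{1/2}(p\,\|\,q) &= 2\int \big(\sqrt{p(x)}-\sqrt{q(x)}\big)^2\,dx\\
\lim_{\alpha\to 1} D_\alpha(p\,\|\,q) &= KL(p\,\|\,q)\\
D_{2}(p\,\|\,q) &= \tfrac{1}{2}\int \frac{\big(p(x)-q(x)\big)^2}{q(x)}\,dx
\end{aligned}
$$

The α = 1/2 case is proportional to the squared Hellinger distance, and α = 2 to the χ² divergence.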


Minimum alpha-divergence

q is Gaussian, minimizes $D_\alpha(p\,\|\,q)$

α = −1


Minimum alpha-divergence

q is Gaussian, minimizes $D_\alpha(p\,\|\,q)$

α = 0


Minimum alpha-divergence

q is Gaussian, minimizes $D_\alpha(p\,\|\,q)$

α = 0.5


Minimum alpha-divergence

q is Gaussian, minimizes $D_\alpha(p\,\|\,q)$

α = 1


Minimum alpha-divergence

q is Gaussian, minimizes $D_\alpha(p\,\|\,q)$

α = ∞


Properties of alpha-divergence

  • α ≤ 0 seeks the mode with largest mass (not the tallest)

    • zero-forcing: p(x) = 0 forces q(x) = 0

    • underestimates the support of p

  • α ≥ 1 stretches to cover everything

    • inclusive: p(x) > 0 forces q(x) > 0

    • overestimates the support of p

[Frey,Patrascu,Jaakkola,Moran 00]


Structure of alpha space

[Figure: the real line of α values. For α ≤ 0 the divergence is zero-forcing, with MF at α = 0; for α ≥ 1 it is inclusive (zero-avoiding), with BP and EP at α = 1 and TRW at α > 1; FBP and PEP cover arbitrary α.]

Other properties

  • If q is an exact minimum of the alpha-divergence, the normalizing constant matches: $\int q(x)\,dx = \int p(x)^\alpha q(x)^{1-\alpha}\,dx$

  • If α = 1: a Gaussian q matches the mean and variance of p

    • a fully factorized q matches the marginals of p (checked in the sketch below)

Two-node example

[Figure: two variables x and y coupled by a single factor]

  • q is fully factorized, minimizes the α-divergence to p

  • q has correct marginals only for α = 1 (BP) (see the numerical check below)
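This claim is easy to check numerically. The sketch below (reusing alpha_div; the 2×2 joint p2 is an arbitrary example, not from the talk) fits an unnormalized product q(x, y) = qx(x) qy(y) and compares marginals:

    import numpy as np
    from scipy.optimize import minimize

    # A correlated joint over two binary variables (rows: x, columns: y).
    p2 = np.array([[0.35, 0.05],
                   [0.10, 0.50]])

    def fit_product(alpha):
        # Optimize log-parameters so q stays positive; the overall scale is free.
        def objective(theta):
            q = np.outer(np.exp(theta[:2]), np.exp(theta[2:]))
            return alpha_div(p2.ravel(), q.ravel(), alpha)
        theta = minimize(objective, x0=np.zeros(4), method="Nelder-Mead").x
        q = np.outer(np.exp(theta[:2]), np.exp(theta[2:]))
        return q / q.sum()

    for a in [0.0, 0.5, 1.0]:
        qa = fit_product(a)
        print(f"alpha = {a}: q marginal of x = {qa.sum(axis=1)},"
              f" true = {p2.sum(axis=1)}")

Up to optimizer tolerance, only α = 1 reproduces the true marginals (0.40, 0.60); the other values trade marginal accuracy for their own notion of fit.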


Two-node example

[Figure: fits to a bimodal distribution for α = 1 (BP), α = 0 (MF), and α = 0.5]


Two-node example

[Figure: fit to a bimodal distribution for α = 1]


Lessons

  • Neither method is inherently superior; it depends on what you care about

  • A factorized approximation does not imply matching marginals (that holds only for α = 1)

  • Adding y to the problem can change the estimated marginal for x (even though the true marginal is unchanged)



Outline

  • Example of message passing

  • Interpreting message passing

  • Divergence measures

  • Message passing from a divergence measure

  • Big picture



Distributed divergence minimization

  • Write p as a product of factors: $p(x) = \prod_a f_a(x)$

  • Approximate the factors one by one: $f_a(x) \approx \tilde f_a(x)$

  • Multiply to get the approximation: $q(x) = \prod_a \tilde f_a(x)$


Global divergence to local divergence

  • Global divergence: $\min_q\; D\big(p(x)\,\|\,q(x)\big)$

  • Local divergence: $\min_{\tilde f_a}\; D\big(f_a(x)\,q^{\backslash a}(x) \,\|\, \tilde f_a(x)\,q^{\backslash a}(x)\big)$, where $q^{\backslash a}(x) = \prod_{b \neq a} \tilde f_b(x)$ is the product of the other approximate factors


Message passing

  • Messages are passed between factors

  • Messages are factor approximations: the message from factor a is $\tilde f_a(x)$

  • Factor a receives the messages $\tilde f_b(x)$, $b \neq a$, from the other factors

    • Minimize the local divergence to get a new $\tilde f_a$

    • Send the new $\tilde f_a$ to the other factors

    • Repeat until convergence

  • Produces all six algorithms in the big picture below (a toy instance is sketched after this list)

Global divergence vs. local divergence

In general, local ≠ global

  • but the results are similar

  • BP doesn't minimize the global KL, but comes close

[Figure: the α line; at α = 0 (MF) local = global, so there is no loss from message passing, while elsewhere local ≠ global]


Experiment

  • Which message-passing algorithm is best at minimizing the global divergence $D_{\alpha_G}(p\,\|\,q)$?

  • Procedure:

    • Run FBP with various values of the local divergence parameter $\alpha_L$

    • Compute the global divergence for various $\alpha_G$

    • Find the best $\alpha_L$ (the best algorithm) for each $\alpha_G$


Results

  • Average over 20 graphs with random singleton and pairwise potentials $\exp(w_{ij} x_i x_j)$

  • Mixed potentials ($w \sim U(-1, 1)$):

    • best $\alpha_L = \alpha_G$ (the local divergence should match the global one)

    • FBP with the same α is best at minimizing $D_\alpha$

      • in particular, BP is best at minimizing KL(p||q)



Outline

  • Example of message passing

  • Interpreting message passing

  • Divergence measures

  • Message passing from a divergence measure

  • Big picture


Hierarchy of algorithms

  • Power EP: exp family, $D_\alpha(p\,\|\,q)$

  • EP: exp family, KL(p||q)

  • FBP: fully factorized, $D_\alpha(p\,\|\,q)$

  • TRW: fully factorized, $D_\alpha(p\,\|\,q)$ with α > 1

  • Structured MF: exp family, KL(q||p)

  • BP: fully factorized, KL(p||q)

  • MF: fully factorized, KL(q||p)


Matrix of algorithms

(rows: divergence measure; columns: approximating family)

                     fully factorized        exp family        other families? (e.g. mixtures)
  KL(q||p)           MF                      Structured MF
  KL(p||q)           BP                      EP
  D_α(p||q)          FBP (TRW for α > 1)     Power EP
  other divergences?



Other Message Passing Algorithms

Do they correspond to divergence measures?

  • Generalized belief propagation [Yedidia,Freeman,Weiss 00]

  • Iterated conditional modes [Besag 86]

  • Max-product belief revision

  • TRW-max-product [Wainwright,Jaakkola,Willsky 02]

  • Laplace propagation [Smola,Vishwanathan,Eskin 03]

  • Penniless propagation [Cano,Moral,Salmerón 00]

  • Bound propagation [Leisink,Kappen 03]



Future work

  • Understand existing message passing algorithms

  • Understand local vs. global divergence

  • New message passing algorithms:

    • Specialized divergence measures

    • Richer approximating families

  • Other ways to minimize divergence

