Convex Point Estimation using Undirected Bayesian Transfer Hierarchies

1 / 25

# Convex Point Estimation using Undirected Bayesian Transfer Hierarchies - PowerPoint PPT Presentation

Gal Elidan Ben Packer Geremy Heitz Daphne Koller Stanford AI Lab. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. Motivation. Problem : With few instances, learned models aren’t robust. Task : Shape modeling. Principal Components. std +1. MEAN.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Convex Point Estimation using Undirected Bayesian Transfer Hierarchies' - madaline-leon

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Gal Elidan Ben Packer

Geremy Heitz Daphne Koller

Stanford AI Lab

Motivation

Problem:

With few instances, learned

models aren’t robust

Shape modeling

Principal

Components

std +1

MEAN

MEAN

std -1

std +1

std -1

Transfer Learning

Shape is stabilized, but doesn’t

look like an elephant

Can we use rhinos to help elephants?

Principal

Components

std +1

MEAN

MEAN

std -1

std +1

std -1

Hierarchical Bayes

P(qroot)

qroot

P(qElephant|qroot)

P(qRhino|qroot)

qRhino

qElephant

P(DataRhino|qRhino)

P(DataElephant|qElephant)

Goals
• Transfer between related classes
• Probabilistic motivation
• Multilevel, complex hierarchies
• Simple, efficient computation
• Automatically learn what to transfer
Hierarchical Bayes

qroot

• Compute full posterior P(£|D)
• P(£c|£root) must be conjugate to P(D|£c)

qElephant

qRhino

Problem:

Often can’t perform full

Bayesian computations

Approx.: Point estimation

Best parameters are good enough;

don’t need full distribution

qroot

qElephant

• Empirical Bayes
• Point estimation
• Other approximations:
• Posterior as normal, sampling, etc.

qRhino

More Issues: Multiple Levels
• Conjugate priors usually can’t be extended to
• multiple levels (e.g., Dirichlet, inverse-Wishart)
• Exception: Thibeaux and Jordan (‘05)
More Issues: Restrictive Priors
• Example: inverse-Wishart
• Pseudocount restriction
• n >= d
• If d is large, N is small, signal from prior overwhelms data
• We show experiments with N=3, d=20

Normal-Inverse-Wishart parameters

Pseudocounts

Gaussian

parameters

m,L,k,n

mA,SA

mB,SB

N = # samples, d = dimension

Alternative: Shrinkage
• McCallum et al. (‘98)
• 1. Compute maximum likelihood at each node
• 2. “Shrink” each node toward its parent
• Linear combination of q and qparent
• Uses cross-validation of a
• Pros:
• Simple to compute
• Handles multiple levels
• Cons:
• Naive heuristic for transfer
• Averaging not always appropriate

qparent

q* = aq + (1-a)qparent

q

Undirected HB Reformulation

Probabilistic Abstraction Hierarchies

(Segal et al. ‘01)

Defines an undirected Markov

random field model over Q, D

Undirected Probabilistic Model

qroot

b: high

b: low

Divergence

Divergence

qRhino

qElephant

Fdata

Fdata

Divergence:

Encourage parameters to be similar to parents

Fdata:

Encourage parameters to explain data

Purpose of Reformulation
• Easy to specify
• Fdata can be likelihood, classification, or other objective
• Divergence can be L1-distance, L2-distance,

e-insensitive loss, KL divergence, etc.

• No conjugacy or proper prior restrictions
• Easy to optimize
• Convex over Q if Fdata is convex and Divergence is concave
Application: Text categorization

Bag-of-words model

Fdata : Multinomial log likelihood (regularized)

µi represents frequency of word i

Divergence: L2 norm

Newsgroup20

Dataset

Baselines

1. Maximum likelihood at each node (no hierarchy)

2. Cross-validate regularization (no hierarchy)

3. Shrinkage (McCallum et al. ’98, with hierarchy)

qparent

qB

qA

q* = aq + (1-a)qparent

q

Can It Handle Multiple Levels?

Newsgroup Topic Classification

0.7

0.65

0.6

0.55

Classification Rate

0.5

0.45

Max Likelihood (No regularization)

Shrinkage

Regularized Max Likelihood

0.4

Undirected HB

0.35

75

150

225

300

375

Total Number of Training Instances

Application: Shape Modeling

Mammals Dataset (Fink, ’05)

(Density estimation – test likelihood)

Instances represented by 60 x-y

coordinates of landmarks on outline

Divergence:

L2 norm over mean and variance

Covariance over landmarks

Mean landmark location

Regularization

MEAN

Principal

Components

Does Hierarchy Help?

Mammal Pairs

50

Regularized Max Likelihood

0

Bison-Rhino

-50

-100

Delta log-loss / instance

-150

Bison-Rhino

Elephant-Bison

-200

Elephant-Rhino

Giraffe-Bison

Giraffe-Elephant

-250

Giraffe-Rhino

Llama-Bison

Llama-Elephant

-300

Llama-Giraffe

Llama-Rhino

-350

6

10

20

30

Total Number of Training Instances

Unregularized max likelihood, shrinkage: Much worse, not shown

Transfer

Not all parameters deserve equal sharing

Degrees of Transfer

How do we estimate all these parameters?

Split q into subcomponents µiwith weights l

Allows for different strengths for different

subcomponents, child-parent pairs

Learning Degrees of Transfer
• Bootstrap approach

If and have a consistent relationship, want to encourage them to be similar

• Hyper-prior approach

Bayesian idea:

Put prior on ¸

Add ¸ as parameter to optimization along with £

Concretely: inverse-Gamma prior (forced to be positive)

• If likelihood is concave, entire objective is convex!

Prior on Degree of Transfer

Do Degrees of Transfer Help?

Mammal Pairs

15

Hyperprior

10

Bison-Rhino

5

Delta log-loss / instance

0

Regularized Max Likelihood

Bison-Rhino

-5

Elephant-Bison

Elephant-Rhino

Giraffe-Bison

Giraffe-Elephant

Giraffe-Rhino

-10

Llama-Bison

Llama-Elephant

Llama-Giraffe

Llama-Rhino

-15

6

10

20

30

Total Number of Training Instances

Degrees of Transfer

Distribution of DOT coefficients using Hyperprior

20

18

qroot

16

14

12

10

8

6

4

2

0

0

5

10

15

20

25

30

35

40

45

50

1/l

Weaker transfer

Stronger transfer

Summary
• Transfer between related classes
• Probabilistic motivation
• Multilevel, complex hierarchies
• Simple, efficient computation
• Refined transfer of components
Future Work
• Non-tree hierarchies

(multiple inheritance)

• Block degrees of transfer
• Structure learning

Gene Ontology (GO) network

WordNet Hierarchy

General undirected model doesn’t require tree structure

Part discovery