Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin
Reading people's minds
X: fMRI voxels → predict → Y: semantic features (Metal? Manmade? Found in house? ...)
(Application from Palatucci et al., 2009)
Predict independently? Yi ~ X, for all i
But the semantic features are correlated! E.g., Person? & Live in water?; Colorful? & Yellow?
We want to model conditional correlations.
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
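For concreteness (my notation, not from the slides): predicting each semantic feature independently assumes

$$P(Y \mid X) \;=\; \prod_i P(Y_i \mid X),$$

i.e., that the Yi are conditionally independent given X; modeling conditional correlations means dropping exactly that assumption.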
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: Avoid modeling P(X). In fMRI, X ≈ 500 to 10,000 voxels.
Conditional Random Fields (CRFs) encode conditional independence structure.
[Figure: graph over outputs Y1, ..., Y4.]
Pro: Avoid modeling P(X)
Conditional Random Fields (CRFs)
[Figure: graph over outputs Y1, ..., Y4.]
• Pro: Avoid modeling P(X)
• Con: The normalization Z(x) depends on X = x, so Z(x) must be computed for each inference
• Exact inference is intractable in general, and approximate inference is expensive
• Use tree CRFs! Pro: Fast, exact inference
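For reference, a tree CRF with edge set T factorizes as (standard CRF notation, assumed here rather than copied from the slides):

$$P(Y \mid X = x) \;=\; \frac{1}{Z(x)} \prod_{(i,j) \in T} \phi_{ij}(Y_i, Y_j, x), \qquad Z(x) \;=\; \sum_{y} \prod_{(i,j) \in T} \phi_{ij}(y_i, y_j, x).$$

Because T is a tree, Z(x) and all pairwise marginals can be computed exactly by belief propagation in time linear in the number of outputs.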
CRF Structure Learning
[Figure: graph over Y1, ..., Y4 illustrating structure learning and feature selection.]
Tree CRFs: fast, exact inference; avoid modeling P(X)
CRF Structure Learning
Use local inputs (scalable) instead of global inputs (not scalable).
Tree CRFs: fast, exact inference; avoid modeling P(X)
This work
Goals:
• Structured conditional models P(Y|X)
• Scalable methods: tree structures, local inputs Xij, max spanning trees
Outline:
• Gold standard
• Max spanning trees
• Generalized edge weights
• Heuristic weights
• Experiments: synthetic & fMRI
Related work (vs. our work): differs in the choice of edge weights and in the use of local inputs.
Chow-Liu
For generative models: weight each edge (Yi,Yj) by the mutual information I(Yi;Yj) and choose the max spanning tree.
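As a reminder of the underlying result (Chow and Liu, 1968; notation mine): with mutual-information edge weights, the maximum-weight spanning tree maximizes likelihood over all tree-structured distributions:

$$T^\star \;=\; \arg\max_{T \text{ tree}} \sum_{(i,j) \in T} I(Y_i; Y_j), \qquad I(Y_i; Y_j) \;=\; \sum_{y_i, y_j} P(y_i, y_j) \log \frac{P(y_i, y_j)}{P(y_i)\,P(y_j)}.$$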
Chow-Liu for CRFs?
For CRFs with global inputs: Global CMI (Conditional Mutual Information) edge weights.
Pro: "Gold standard". Con: I(Yi;Yj | X) is intractable for big X.
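Written out (again, notation mine), the Global CMI edge score is

$$\mathrm{Score}(i,j) \;=\; I(Y_i; Y_j \mid X) \;=\; \mathbb{E}_{P(x)}\!\left[ \sum_{y_i, y_j} P(y_i, y_j \mid x) \log \frac{P(y_i, y_j \mid x)}{P(y_i \mid x)\, P(y_j \mid x)} \right],$$

which requires estimating distributions conditioned on the entire input X, hence the intractability when X has hundreds or thousands of voxels.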
Where now?
Global CMI is the gold standard, but I(Yi;Yj | X) is intractable for big X.
Algorithmic framework (local inputs!):
• Given: data {(y(i),x(i))}
• Given: input mapping Yi → Xi
• Weight each potential edge (Yi,Yj) with Score(i,j)
• Choose the max spanning tree (a minimal sketch follows below)
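A minimal sketch of that framework in Python, assuming some user-supplied edge_score(i, j) (e.g., one of the heuristics below) and using networkx for the spanning tree; the names and structure are illustrative, not the authors' code:

```python
import itertools
import networkx as nx

def learn_tree_structure(n_outputs, edge_score):
    """Weight every candidate edge (Y_i, Y_j) with edge_score(i, j),
    then return the edges of the maximum spanning tree over the outputs."""
    g = nx.Graph()
    g.add_nodes_from(range(n_outputs))
    for i, j in itertools.combinations(range(n_outputs), 2):
        g.add_edge(i, j, weight=edge_score(i, j))
    tree = nx.maximum_spanning_tree(g, weight="weight")
    return sorted(tree.edges())
```

With |Y| outputs this scores O(|Y|^2) candidate edges; because each score only touches the local inputs Xij rather than all of X, the per-edge cost stays small, which is what makes the approach scalable.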
Generalized edge scores
Key step: weight edge (Yi,Yj) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj, e.g., Local Conditional Mutual Information.
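For example, the local conditional mutual information is exactly such a linear combination of (conditional) entropies, where Xij denotes the local inputs for edge (i,j) (a standard identity):

$$I(Y_i; Y_j \mid X_{ij}) \;=\; H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij}) - H(Y_i, Y_j \mid X_{ij}).$$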
Generalized edge scores
Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). No Local Linear Entropy Score can recover all such tree CRFs, even with exact entropies.
Heuristics
Three heuristic edge weights: Piecewise likelihood, Local CMI, DCI.
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007) proposed PWL for parameter learning; the main idea is to bound Z(X).
For tree CRFs, optimal parameters give an edge score with local inputs Xij that bounds the log likelihood.
• Fails on a simple counterexample
• Does badly in practice
• Helps explain other edge scores
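My reading of the piecewise idea (following Sutton and McCallum; the exact edge-score formula from the talk is not reproduced here): normalizing each edge factor locally upper-bounds the partition function, and therefore lower-bounds the conditional log likelihood. For nonnegative factors,

$$Z(x) \;\le\; \prod_{(i,j)} Z_{ij}(x), \quad Z_{ij}(x) \;=\; \sum_{y_i, y_j} \phi_{ij}(y_i, y_j, x), \qquad \text{so} \qquad \log P(y \mid x) \;\ge\; \sum_{(i,j)} \log \frac{\phi_{ij}(y_i, y_j, x)}{Z_{ij}(x)}.$$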
Piecewise likelihood (PWL): counterexample
[Figure: the true P(Y,X) is a chain Y1, Y2, ..., Yn with inputs X1, ..., Xn and one strong potential; PWL chooses edges (2,j) over the true edges (j,k), so it recovers the wrong tree.]
Local Conditional Mutual Information
Decomposable score with local inputs Xij.
Theorem: Local CMI bounds the log likelihood gain.
• Does pretty well in practice
• Can fail with strong potentials
Local Conditional Mutual Information: counterexample
[Figure: the true P(Y,X) is a chain Y1, Y2, ..., Yn with inputs X1, ..., Xn and one strong potential; with strong potentials, Local CMI can recover the wrong tree.]
Decomposable Conditional Influence (DCI)
Derived from PWL.
• Exact measure of gain for some edges
• Edge score with local inputs Xij
• Succeeds on the counterexample
• Does best in practice
Experiments: algorithmic details
Given: data {(y(i),x(i))}; input mapping Yi → Xi.
• Compute edge scores: regress P(Yij|Xij) (10-fold CV to choose regularization)
• Choose the max spanning tree
• Parameter learning: conjugate gradient on the L2-regularized log likelihood, with 10-fold CV to choose regularization (see the sketch below)
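A hedged sketch (not the authors' code) of the parameter-learning step: conjugate gradient on the L2-regularized negative log likelihood, with 10-fold CV over candidate regularization weights. Here neg_log_lik, grad_neg_log_lik, and theta0 are assumed to come from the tree-CRF model, and the candidate lambdas are placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import KFold

def fit_crf_params(neg_log_lik, grad_neg_log_lik, theta0, data, lambdas=(0.01, 0.1, 1.0)):
    """Minimize neg_log_lik(theta, data) + lam * ||theta||^2 by conjugate gradient,
    choosing lam by 10-fold cross-validation on held-out negative log likelihood."""
    def objective(theta, subset, lam):
        return neg_log_lik(theta, subset) + lam * np.dot(theta, theta)

    def gradient(theta, subset, lam):
        return grad_neg_log_lik(theta, subset) + 2.0 * lam * theta

    def fit(subset, lam):
        return minimize(objective, theta0, args=(subset, lam),
                        jac=gradient, method="CG").x

    best_lam, best_score = None, np.inf
    for lam in lambdas:
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=10, shuffle=True).split(data):
            theta = fit([data[k] for k in train_idx], lam)
            fold_scores.append(neg_log_lik(theta, [data[k] for k in val_idx]))
        if np.mean(fold_scores) < best_score:
            best_lam, best_score = lam, np.mean(fold_scores)
    return fit(data, best_lam), best_lam
```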
Synthetic experiments
[Figure: P(Y|X) is a chain over Y1, Y2, ..., Yn with inputs X1, ..., Xn; P(X) is a chain over X1, ..., Xn.]
Binary Y, X; tabular edge factors. Natural input mapping: Yi → Xi.
Synthetic experiments
[Figure: structures for P(Y|X) and P(X) that make the joint P(Y,X) tractable or intractable.]
P(Y,X): tractable & intractable. P(Y|X), P(X): chains & trees.
Synthetic experiments
[Figure: chain P(Y|X) over Y1, Y2, ..., Yn with inputs X1, ..., Xn and cross factors.]
P(Y,X): tractable & intractable. P(Y|X): chains & trees. Φ(Yij,Xij): with & without cross factors; associative (all positive & alternating +/-) & random factors.
(A toy data generator in this spirit is sketched below.)
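To make the setup concrete, here is a toy generator in the same spirit (binary Y and X, random tabular edge factors over local inputs); it is my own simplification, not the authors' generator, and it enumerates all 2^n outputs, so it is only usable for small n:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def sample_chain_crf(n=10, n_samples=50, strength=1.0):
    """Sample (y, x) pairs from a chain P(Y | X) with binary variables and
    random tabular edge factors, by brute-force enumeration of all 2^n outputs."""
    theta = strength * rng.normal(size=(n - 1, 2, 2, 2, 2))   # factor tables, one per edge
    ys = np.array(list(itertools.product([0, 1], repeat=n)))  # all 2^n output assignments
    data = []
    for _ in range(n_samples):
        x = rng.integers(0, 2, size=n)                         # X_i ~ Bernoulli(0.5)
        log_p = np.zeros(len(ys))
        for e in range(n - 1):                                 # sum log-factors edge by edge
            log_p += theta[e, ys[:, e], ys[:, e + 1], x[e], x[e + 1]]
        p = np.exp(log_p - log_p.max())
        p /= p.sum()                                           # exact P(y | x)
        data.append((ys[rng.choice(len(ys), p=p)], x))
    return data
```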
Synthetic: vary # train exs.
[Plots: tree structure, intractable P(Y,X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.]
Synthetic: vary model size Fixed 50 train exs., 1000 test exs.
fMRI experiments
X (500 fMRI voxels) → predict → Y (218 semantic features: Metal? Manmade? Found in house? ...) → decode (hand-built map) → Object (60 total: bear, screwdriver, ...)
Data and setup from Palatucci et al. (2009).
Zero-shot learning: can predict objects not in the training data (given the decoding).
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
fMRI experiments
Y, X real-valued; Gaussian factors, with A and C, b regularized separately.
Input mapping: regressed Yi ~ Y-i, X and chose the top K inputs (one way to do this is sketched below).
CV for parameter learning is very expensive, so CV was done on subject 0 only.
Two methods: CRF1 (K=10) and CRF2 (K=20 & added fixed).
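A hedged sketch of one way to realize that input mapping (ridge regression is my choice here; the talk does not say which regression was used): regress each Yi on the remaining outputs and all voxels, then keep the K features with the largest absolute coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

def choose_top_k_inputs(Y, X, k=10, alpha=1.0):
    """For each output Y_i, regress Y_i ~ (Y_{-i}, X) and keep the top-k
    features by absolute coefficient as that output's local inputs."""
    mapping = {}
    for i in range(Y.shape[1]):
        others = np.delete(Y, i, axis=1)                 # Y_{-i}
        features = np.hstack([others, X])                # candidate inputs
        coef = Ridge(alpha=alpha).fit(features, Y[:, i]).coef_
        mapping[i] = np.argsort(-np.abs(coef))[:k]       # indices into (Y_{-i}, X)
    return mapping
```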
fMRI experiments
Accuracy (for zero-shot learning): hold out objects i, j; predict Y(i)', Y(j)'. If ||Y(i) - Y(i)'||2 < ||Y(j) - Y(i)'||2, then we got i right.
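That criterion is simple enough to transcribe directly into code (hypothetical array arguments, one comparison per held-out pair):

```python
import numpy as np

def zero_shot_correct(y_true_i, y_true_j, y_pred_i):
    """Leave-two-out check: object i counts as correct if the prediction for i is
    closer (in Euclidean distance) to i's true semantic features than to j's."""
    return np.linalg.norm(y_true_i - y_pred_i) < np.linalg.norm(y_true_j - y_pred_i)
```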
fMRI experiments: results
[Plots: Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.]
Conclusion
• Scalable learning of CRF structure
• Analyzed edge scores for spanning tree methods: Local Linear Entropy Scores are imperfect
• Heuristics: pleasing theoretical properties and empirical success; we recommend DCI
Future work: templated CRFs; learning the edge score; assumptions on the model/factors which give learnability
Thank you!
Thank you!
References
• M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
• J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
• M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
• M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
• D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
• C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
• C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
• A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.
Future work: Templated CRFs
Example: WebKB (Craven et al., 1998). Given webpages {(Yi = page type, Xi = content)}, model P(Y|X=x) = P(pages' types | pages' content).
Use a template to choose a tree over pages and to instantiate parameters.
Learn the template: e.g., Score(i,j) = DCI(i,j), plus a parametrization.
Requires local inputs; potentially very fast.
Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g., from an expensive structure learning method), learn a function Score(Yi,Yj) to use in the MST algorithm.