Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin
Reading people's minds
X: fMRI voxels → predict → Y: semantic features (Metal? Manmade? Found in house? ...)
(Application from Palatucci et al., 2009)
Predict independently? Yi ~ X, for all i
But the semantic features are correlated! E.g., Person? & Live in water?; Colorful? & Yellow?
We want to model conditional correlations.
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
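For concreteness (my notation, not from the slides): predicting each semantic feature independently assumes

$$P(Y \mid X) \;=\; \prod_i P(Y_i \mid X),$$

i.e., that the Yi are conditionally independent given X; modeling conditional correlations means dropping exactly that assumption.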
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: Avoid modeling P(X). In fMRI, X ≈ 500 to 10,000 voxels.
Conditional Random Fields (CRFs) encode conditional independence structure.
[Figure: graph over outputs Y1, ..., Y4.]
Pro: Avoid modeling P(X)
Conditional Random Fields (CRFs)
[Figure: graph over outputs Y1, ..., Y4.]
• Pro: Avoid modeling P(X)
• Con: The normalization Z(x) depends on X = x, so Z(x) must be computed for each inference
• Exact inference is intractable in general, and approximate inference is expensive
• Use tree CRFs! Pro: Fast, exact inference
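For reference, a tree CRF with edge set T factorizes as (standard CRF notation, assumed here rather than copied from the slides):

$$P(Y \mid X = x) \;=\; \frac{1}{Z(x)} \prod_{(i,j) \in T} \phi_{ij}(Y_i, Y_j, x), \qquad Z(x) \;=\; \sum_{y} \prod_{(i,j) \in T} \phi_{ij}(y_i, y_j, x).$$

Because T is a tree, Z(x) and all pairwise marginals can be computed exactly by belief propagation in time linear in the number of outputs.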
CRF Structure Learning
[Figure: graph over Y1, ..., Y4 illustrating structure learning and feature selection.]
Tree CRFs: fast, exact inference; avoid modeling P(X)
CRF Structure Learning
Use local inputs (scalable) instead of global inputs (not scalable).
Tree CRFs: fast, exact inference; avoid modeling P(X)
This work
Goals:
• Structured conditional models P(Y|X)
• Scalable methods: tree structures, local inputs Xij, max spanning trees
Outline:
• Gold standard
• Max spanning trees
• Generalized edge weights
• Heuristic weights
• Experiments: synthetic & fMRI
Related work (vs. our work): differs in the choice of edge weights and in the use of local inputs.
Chow-Liu
For generative models: weight each edge (Yi,Yj) by the mutual information I(Yi;Yj) and choose the max spanning tree.
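As a reminder of the underlying result (Chow and Liu, 1968; notation mine): with mutual-information edge weights, the maximum-weight spanning tree maximizes likelihood over all tree-structured distributions:

$$T^\star \;=\; \arg\max_{T \text{ tree}} \sum_{(i,j) \in T} I(Y_i; Y_j), \qquad I(Y_i; Y_j) \;=\; \sum_{y_i, y_j} P(y_i, y_j) \log \frac{P(y_i, y_j)}{P(y_i)\,P(y_j)}.$$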
Chow-Liu for CRFs?
For CRFs with global inputs: Global CMI (Conditional Mutual Information) edge weights.
Pro: "Gold standard". Con: I(Yi;Yj | X) is intractable for big X.
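Written out (again, notation mine), the Global CMI edge score is

$$\mathrm{Score}(i,j) \;=\; I(Y_i; Y_j \mid X) \;=\; \mathbb{E}_{P(x)}\!\left[ \sum_{y_i, y_j} P(y_i, y_j \mid x) \log \frac{P(y_i, y_j \mid x)}{P(y_i \mid x)\, P(y_j \mid x)} \right],$$

which requires estimating distributions conditioned on the entire input X, hence the intractability when X has hundreds or thousands of voxels.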
Where now?
Global CMI is the gold standard, but I(Yi;Yj | X) is intractable for big X.
Algorithmic framework (local inputs!):
• Given: data {(y(i),x(i))}
• Given: input mapping Yi → Xi
• Weight each potential edge (Yi,Yj) with Score(i,j)
• Choose the max spanning tree (a minimal sketch follows below)
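A minimal sketch of that framework in Python, assuming some user-supplied edge_score(i, j) (e.g., one of the heuristics below) and using networkx for the spanning tree; the names and structure are illustrative, not the authors' code:

```python
import itertools
import networkx as nx

def learn_tree_structure(n_outputs, edge_score):
    """Weight every candidate edge (Y_i, Y_j) with edge_score(i, j),
    then return the edges of the maximum spanning tree over the outputs."""
    g = nx.Graph()
    g.add_nodes_from(range(n_outputs))
    for i, j in itertools.combinations(range(n_outputs), 2):
        g.add_edge(i, j, weight=edge_score(i, j))
    tree = nx.maximum_spanning_tree(g, weight="weight")
    return sorted(tree.edges())
```

With |Y| outputs this scores O(|Y|^2) candidate edges; because each score only touches the local inputs Xij rather than all of X, the per-edge cost stays small, which is what makes the approach scalable.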
Generalized edge scores
Key step: weight edge (Yi,Yj) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj, e.g., Local Conditional Mutual Information.
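For example, the local conditional mutual information is exactly such a linear combination of (conditional) entropies, where Xij denotes the local inputs for edge (i,j) (a standard identity):

$$I(Y_i; Y_j \mid X_{ij}) \;=\; H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij}) - H(Y_i, Y_j \mid X_{ij}).$$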
Generalized edge scores
Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). No Local Linear Entropy Score can recover all such tree CRFs, even with exact entropies.
Heuristics
Three heuristic edge weights: Piecewise likelihood, Local CMI, DCI.
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007) proposed PWL for parameter learning; the main idea is to bound Z(X).
For tree CRFs, optimal parameters give an edge score with local inputs Xij that bounds the log likelihood.
• Fails on a simple counterexample
• Does badly in practice
• Helps explain other edge scores
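My reading of the piecewise idea (following Sutton and McCallum; the exact edge-score formula from the talk is not reproduced here): normalizing each edge factor locally upper-bounds the partition function, and therefore lower-bounds the conditional log likelihood. For nonnegative factors,

$$Z(x) \;\le\; \prod_{(i,j)} Z_{ij}(x), \quad Z_{ij}(x) \;=\; \sum_{y_i, y_j} \phi_{ij}(y_i, y_j, x), \qquad \text{so} \qquad \log P(y \mid x) \;\ge\; \sum_{(i,j)} \log \frac{\phi_{ij}(y_i, y_j, x)}{Z_{ij}(x)}.$$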
Piecewise likelihood (PWL): counterexample
[Figure: the true P(Y,X) is a chain Y1, Y2, ..., Yn with inputs X1, ..., Xn and one strong potential; PWL chooses edges (2,j) over the true edges (j,k), so it recovers the wrong tree.]
Local Conditional Mutual Information
Decomposable score with local inputs Xij.
Theorem: Local CMI bounds the log likelihood gain.
• Does pretty well in practice
• Can fail with strong potentials
Local Conditional Mutual Information: counterexample
[Figure: the true P(Y,X) is a chain Y1, Y2, ..., Yn with inputs X1, ..., Xn and one strong potential; with strong potentials, Local CMI can recover the wrong tree.]
Decomposable Conditional Influence (DCI)
Derived from PWL.
• Exact measure of gain for some edges
• Edge score with local inputs Xij
• Succeeds on the counterexample
• Does best in practice
Experiments: algorithmic details
Given: data {(y(i),x(i))}; input mapping Yi → Xi.
• Compute edge scores: regress P(Yij|Xij) (10-fold CV to choose regularization)
• Choose the max spanning tree
• Parameter learning: conjugate gradient on the L2-regularized log likelihood, with 10-fold CV to choose regularization (see the sketch below)
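A hedged sketch (not the authors' code) of the parameter-learning step: conjugate gradient on the L2-regularized negative log likelihood, with 10-fold CV over candidate regularization weights. Here neg_log_lik, grad_neg_log_lik, and theta0 are assumed to come from the tree-CRF model, and the candidate lambdas are placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import KFold

def fit_crf_params(neg_log_lik, grad_neg_log_lik, theta0, data, lambdas=(0.01, 0.1, 1.0)):
    """Minimize neg_log_lik(theta, data) + lam * ||theta||^2 by conjugate gradient,
    choosing lam by 10-fold cross-validation on held-out negative log likelihood."""
    def objective(theta, subset, lam):
        return neg_log_lik(theta, subset) + lam * np.dot(theta, theta)

    def gradient(theta, subset, lam):
        return grad_neg_log_lik(theta, subset) + 2.0 * lam * theta

    def fit(subset, lam):
        return minimize(objective, theta0, args=(subset, lam),
                        jac=gradient, method="CG").x

    best_lam, best_score = None, np.inf
    for lam in lambdas:
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=10, shuffle=True).split(data):
            theta = fit([data[k] for k in train_idx], lam)
            fold_scores.append(neg_log_lik(theta, [data[k] for k in val_idx]))
        if np.mean(fold_scores) < best_score:
            best_lam, best_score = lam, np.mean(fold_scores)
    return fit(data, best_lam), best_lam
```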
Synthetic experiments
[Figure: P(Y|X) is a chain over Y1, Y2, ..., Yn with inputs X1, ..., Xn; P(X) is a chain over X1, ..., Xn.]
Binary Y, X; tabular edge factors. Natural input mapping: Yi → Xi.
Synthetic experiments
[Figure: structures for P(Y|X) and P(X) that make the joint P(Y,X) tractable or intractable.]
P(Y,X): tractable & intractable. P(Y|X), P(X): chains & trees.
Synthetic experiments
[Figure: chain P(Y|X) over Y1, Y2, ..., Yn with inputs X1, ..., Xn and cross factors.]
P(Y,X): tractable & intractable. P(Y|X): chains & trees. Φ(Yij,Xij): with & without cross factors; associative (all positive & alternating +/-) & random factors.
(A toy data generator in this spirit is sketched below.)
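To make the setup concrete, here is a toy generator in the same spirit (binary Y and X, random tabular edge factors over local inputs); it is my own simplification, not the authors' generator, and it enumerates all 2^n outputs, so it is only usable for small n:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def sample_chain_crf(n=10, n_samples=50, strength=1.0):
    """Sample (y, x) pairs from a chain P(Y | X) with binary variables and
    random tabular edge factors, by brute-force enumeration of all 2^n outputs."""
    theta = strength * rng.normal(size=(n - 1, 2, 2, 2, 2))   # factor tables, one per edge
    ys = np.array(list(itertools.product([0, 1], repeat=n)))  # all 2^n output assignments
    data = []
    for _ in range(n_samples):
        x = rng.integers(0, 2, size=n)                         # X_i ~ Bernoulli(0.5)
        log_p = np.zeros(len(ys))
        for e in range(n - 1):                                 # sum log-factors edge by edge
            log_p += theta[e, ys[:, e], ys[:, e + 1], x[e], x[e + 1]]
        p = np.exp(log_p - log_p.max())
        p /= p.sum()                                           # exact P(y | x)
        data.append((ys[rng.choice(len(ys), p=p)], x))
    return data
```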
Synthetic: vary # train exs.
[Plots: tree structure, intractable P(Y,X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.]
Synthetic: vary model size Fixed 50 train exs., 1000 test exs.
fMRI experiments
X (500 fMRI voxels) → predict → Y (218 semantic features: Metal? Manmade? Found in house? ...) → decode (hand-built map) → Object (60 total: bear, screwdriver, ...)
Data and setup from Palatucci et al. (2009).
Zero-shot learning: can predict objects not in the training data (given the decoding).
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
fMRI experiments
Y, X real-valued; Gaussian factors, with A and C, b regularized separately.
Input mapping: regressed Yi ~ Y-i, X and chose the top K inputs (one way to do this is sketched below).
CV for parameter learning is very expensive, so CV was done on subject 0 only.
Two methods: CRF1 (K=10) and CRF2 (K=20 & added fixed).
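A hedged sketch of one way to realize that input mapping (ridge regression is my choice here; the talk does not say which regression was used): regress each Yi on the remaining outputs and all voxels, then keep the K features with the largest absolute coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

def choose_top_k_inputs(Y, X, k=10, alpha=1.0):
    """For each output Y_i, regress Y_i ~ (Y_{-i}, X) and keep the top-k
    features by absolute coefficient as that output's local inputs."""
    mapping = {}
    for i in range(Y.shape[1]):
        others = np.delete(Y, i, axis=1)                 # Y_{-i}
        features = np.hstack([others, X])                # candidate inputs
        coef = Ridge(alpha=alpha).fit(features, Y[:, i]).coef_
        mapping[i] = np.argsort(-np.abs(coef))[:k]       # indices into (Y_{-i}, X)
    return mapping
```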
fMRI experiments
Accuracy (for zero-shot learning): hold out objects i, j; predict Y(i)', Y(j)'. If ||Y(i) - Y(i)'||2 < ||Y(j) - Y(i)'||2, then we got i right.
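That criterion is simple enough to transcribe directly into code (hypothetical array arguments, one comparison per held-out pair):

```python
import numpy as np

def zero_shot_correct(y_true_i, y_true_j, y_pred_i):
    """Leave-two-out check: object i counts as correct if the prediction for i is
    closer (in Euclidean distance) to i's true semantic features than to j's."""
    return np.linalg.norm(y_true_i - y_pred_i) < np.linalg.norm(y_true_j - y_pred_i)
```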
fMRI experiments: results
[Plots: Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.]
Conclusion
• Scalable learning of CRF structure
• Analyzed edge scores for spanning tree methods: Local Linear Entropy Scores are imperfect
• Heuristics: pleasing theoretical properties and empirical success; we recommend DCI
Future work: templated CRFs; learning the edge score; assumptions on the model/factors which give learnability
Thank you!
Thank you!
References
• M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
• J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
• M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
• M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
• D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
• C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
• C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
• A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.
Future work: Templated CRFs
Example: WebKB (Craven et al., 1998). Given webpages {(Yi = page type, Xi = content)}, model P(Y|X=x) = P(pages' types | pages' content).
Use a template to choose a tree over pages and to instantiate parameters.
Learn the template: e.g., Score(i,j) = DCI(i,j), plus a parametrization.
Requires local inputs; potentially very fast.
Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g., from an expensive structure learning method), learn a function Score(Yi,Yj) to use in the MST algorithm.