This thesis focuses on the development of robust methodologies for learning and inference in probabilistic graphical models, particularly in the context of noisy, high-dimensional data with local interactions. It addresses challenges in domains such as sensor networks, hypertext classification, and image segmentation, presenting innovative algorithms to improve accuracy and efficiency in reasoning about complex datasets. The work emphasizes the importance of computational aspects, offering strategies to construct tractable models while compensating for reduced expressive power. Key contributions include advancements in generative and discriminative settings, and enhanced inference techniques for real-world applications.
Query-Specific Learning and Inference for Probabilistic Graphical Models
Anton Chechetka
Thesis committee: Carlos Guestrin, Eric Xing, J. Andrew Bagnell, Pedro Domingos (University of Washington)
14 June 2011
Motivation
Fundamental problem: to reason accurately about noisy, high-dimensional data with local interactions.
Sensor networks
• noisy: sensors fail; noise in readings
• high-dimensional: many sensors, several measurements (temperature, humidity, …) per sensor
• local interactions: nearby locations have high correlations
Hypertext classification
• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics
Image segmentation
• noisy: local information is not enough (camera sensor noise, compression artifacts)
• high-dimensional: a variable for every patch
• local interactions: cows are next to grass, airplanes next to sky
Probabilistic graphical models
Noisy, high-dimensional data with local interactions call for probabilistic inference over many variables, with a graph encoding only the direct interactions; some variables form the query, others the evidence.
Graphical models semantics
Factorized distributions correspond to graph structure: the distribution is a product of factors over small subsets X_C of X, with separators between them in the graph, giving a compact representation.
[Figure: example graph over X1 … X7 with a highlighted separator]
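To make the factorized representation concrete, here is a minimal Python sketch (toy factors of my own choosing, not the thesis's code): the joint is a product of small factors, so storage grows with the factor sizes rather than with the full joint table.

```python
import itertools
import numpy as np

# Toy factorized distribution over three binary variables X1, X2, X3:
#   P(x1, x2, x3) = (1/Z) * psi_12(x1, x2) * psi_23(x2, x3)
# Each factor touches only a small subset of the variables,
# which is what makes the representation compact.
psi_12 = np.array([[2.0, 1.0], [1.0, 3.0]])   # factor over (X1, X2)
psi_23 = np.array([[1.0, 2.0], [2.0, 1.0]])   # factor over (X2, X3)

def unnormalized(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3]

# The normalization Z sums over all joint assignments
# (feasible only in toy examples like this one).
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))
print(unnormalized(0, 1, 1) / Z)   # P(X1=0, X2=1, X3=1)
```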
Graphical models workflow
Factorized distributions + graph structure: learn/construct structure, then learn/define parameters, then run inference to obtain P(Q | E=E).
Graphical models: fundamental problems
• Learn/construct structure: NP-complete
• Learn/define parameters: exp(|X|)
• Inference: #P-complete (exact), NP-complete (approximate)
Errors compound across the pipeline on the way to P(Q | E=E).
Domain-knowledge structures don't help
Domain knowledge-based structures do not support tractable inference (e.g. webpages).
This thesis: general directions
• Emphasizing the computational aspects of the graph
• Learn accurate and tractable models
• Compensate for reduced expressive power with exact inference and optimal parameters
• Gain significant speedups
• Inference speedups via better prioritization of computation
• Estimate the long-term effects of propagating information through the graph
• Use long-term estimates to prioritize updates
New algorithms for learning and inference in graphical models to make answering the queries better.
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Generative learning
• Useful when E is not known in advance
• Sensors fail unpredictably
• Measurements are expensive (e.g. user time), so we want adaptive evidence selection
Learning goal: P(Q, E); query goal: P(Q | E=E).
Tractable vs. intractable models: workflow
Tractable models: learn a simple tractable structure from domain knowledge + data; then optimal parameters and exact inference; then approximate P(Q | E=E).
Intractable models: construct an intractable structure from domain knowledge, or learn an intractable structure from data; then approximate algorithms with no quality guarantees; then approximate P(Q | E=E).
Tractability via low treewidth
• Exact inference (sum-product) is exponential in treewidth
• Treewidth is NP-complete to compute in general
• Low-treewidth graphs are easy to construct
• Convenient representation: junction tree
• Other tractable model classes exist too
Treewidth: size of the largest clique in an (optimally) triangulated graph, minus one.
Junction trees
• Cliques connected by edges with separators
• Running intersection property
• Finding the most likely junction tree of a given treewidth > 1 is NP-complete
• We will look for good approximations
[Figure: example junction tree with cliques such as {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} connected by separators such as {X4,X5}, {X1,X5}, {X1,X2}]
Independencies in low-treewidth distributions
P(X) factorizes according to a junction tree if and only if the corresponding conditional independencies hold, i.e. the conditional mutual information across each separator is zero; it works in the other way too! Example: with separator {X1, X5}, the variables X_A = {X2, X3, X7} on one side are independent of X_B = {X4, X6} on the other side given the separator.
Constraint-based structure learning
Look for junction trees where I(X_A, X_B | S) < ε holds across every separator (constraint-based structure learning). Over all variables X:
• enumerate all candidate separators S,
• partition the remaining variables into weakly dependent subsets (sketched below),
• find a consistent junction tree.
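A hedged sketch of the partitioning step only (the function name and the connected-components heuristic are my illustration, not necessarily the thesis's exact procedure; `cmi` stands in for any estimator of I(·, · | S)): variables whose pairwise conditional mutual information given the candidate separator exceeds a threshold are linked, and the connected components form the weakly dependent subsets.

```python
def weakly_dependent_partition(variables, separator, cmi, threshold):
    """Partition `variables` (minus the separator) into groups such that
    variables in different groups have pairwise conditional mutual
    information, given the separator, at most `threshold`.
    cmi(a, b, separator) is a caller-supplied estimator of I(a; b | separator)."""
    rest = [v for v in variables if v not in separator]
    # Link strongly dependent pairs.
    adj = {v: set() for v in rest}
    for i, a in enumerate(rest):
        for b in rest[i + 1:]:
            if cmi(a, b, separator) > threshold:
                adj[a].add(b)
                adj[b].add(a)
    # Connected components = weakly dependent subsets.
    components, seen = [], set()
    for v in rest:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components
```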
Mutual information complexity
I(X_A, X_B | S) = H(X_A | S) - H(X_A | X_B, S), where X_B is everything except X_A and S, and H(· | ·) is conditional entropy.
I(X_A, X_B | S) depends on all assignments to X: exp(|X|) complexity in general.
Our contribution: a polynomial-time upper bound.
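As a concrete reference for the quantities above, here is a brute-force empirical estimator of conditional mutual information (a sketch of my own, not the thesis's implementation); note that it enumerates joint assignments, which is exactly the exponential cost the contribution avoids.

```python
from collections import Counter
from math import log

def conditional_mutual_info(samples, A, B, C):
    """Empirical I(X_A; X_B | X_C) from `samples`, a list of tuples of
    discrete values indexed by variable.  A, B, C are lists of variable
    indices.  Cost grows with the number of joint assignments of A+B+C,
    i.e. exponentially in the number of variables involved."""
    n = len(samples)
    def counts(cols):
        return Counter(tuple(row[i] for i in cols) for row in samples)
    n_abc, n_ac, n_bc, n_c = counts(A + B + C), counts(A + C), counts(B + C), counts(C)
    mi = 0.0
    for key, cnt in n_abc.items():
        a = key[:len(A)]
        b = key[len(A):len(A) + len(B)]
        c = key[len(A) + len(B):]
        p_abc = cnt / n
        # I = sum_{a,b,c} p(a,b,c) * log[ p(a,b,c) p(c) / (p(a,c) p(b,c)) ]
        mi += p_abc * log(p_abc * (n_c[c] / n)
                          / ((n_ac[a + c] / n) * (n_bc[b + c] / n)))
    return mi
```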
Mutual info upper bound: intuition
• Computing I(A, B | C) directly is hard; computing I(D, F | C) for small subsets D of A and F of B with |D ∪ F| ≤ k is easy
• Polynomial number of small subsets
• Polynomial complexity for every pair
Any conclusions about I(A, B | C)? In general, no. If a good junction tree exists, yes.
Contribution: mutual info upper bound
Theorem: Suppose an ε-junction tree of treewidth k for P(A, B, C) exists. Let δ = max I(D, F | C) over subsets D of A and F of B with |D ∪ F| ≤ k + 1. Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε).
Mutual info upper bound: complexity
• Direct computation: exp(|A ∪ B ∪ C|) complexity
• Our upper bound: O(|A ∪ B|^(treewidth+1)) small subsets, exp(|C| + treewidth) time each
• |C| = treewidth for structure learning, so the complexity is polynomial in |A ∪ B ∪ C|
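Putting the theorem and the complexity claim together, a hedged sketch of how the bound could be evaluated (the function name is mine; it assumes the `conditional_mutual_info` estimator sketched earlier and an ε from the assumed ε-junction tree):

```python
from itertools import combinations

def mutual_info_upper_bound(samples, A, B, C, treewidth, eps):
    """Upper-bounds I(A; B | C) via delta = max I(D; F | C) over subsets
    with |D| + |F| <= treewidth + 1, returning |A u B u C| * (delta + eps),
    instead of the exponential direct computation.  Only polynomially many
    small subsets are touched."""
    k = treewidth + 1
    delta = 0.0
    for d_size in range(1, k):
        for f_size in range(1, k - d_size + 1):
            for D in combinations(A, d_size):
                for F in combinations(B, f_size):
                    delta = max(delta,
                                conditional_mutual_info(samples, list(D), list(F), C))
    return (len(A) + len(B) + len(C)) * (delta + eps)
```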
Guarantees on learned model quality
Theorem: Suppose a strongly connected ε-junction tree of treewidth k for P(X) exists. Then our algorithm will, with probability at least (1 - δ) for a chosen failure probability δ, find a junction tree satisfying the quality guarantee, using polynomially many samples and polynomial time.
Corollary: strongly connected junction trees are PAC-learnable.
Results – typical convergence time
[Plot: test log-likelihood vs. learning time (higher is better); good results early on in practice]
Results – log-likelihood
Baselines: OBS (local search in limited in-degree Bayes nets), Chow-Liu (most likely junction trees of treewidth 1), Karger-Srebro (constant-factor approximation junction trees).
[Plot: test log-likelihood of our method vs. these baselines (higher is better)]
Conclusions
• A tractable upper bound on conditional mutual information
• Graceful quality degradation and PAC-learnability guarantees
• Analysis of when dynamic programming works [in the thesis]
• Dealing with an unknown mutual information threshold [in the thesis]
• Speedups preserving the guarantees
• Further speedups without guarantees
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Discriminative learning
• Useful when the evidence variables E are always the same
• Non-adaptive, one-shot observation
• Image pixels → scene description
• Document text → topic, named entities
• Better accuracy than generative models
Learning goal: P(Q | E); query goal: P(Q | E=E).
Discriminative log-linear models
P(Q | E) ∝ exp( Σ_i w_i f_i(Q_i, E) ), with weights w learned from data, features f encoding domain knowledge, and an evidence-dependent normalization Z(E):
• don't sum over all values of E
• don't model P(E)
• no need for structure over E, only over the query
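A minimal sketch of such a conditional log-linear model over binary query variables (toy features and numbers of my own; not the thesis's code). The key point from the bullets: Z depends on the evidence but only sums over query assignments, and E itself needs no structure.

```python
import itertools
import numpy as np

def conditional_prob(q, e, weights, features, n_query_vars):
    """P(q | e) = exp(sum_i w_i * f_i(q, e)) / Z(e) with binary query variables.
    weights are learned from data; features encode domain knowledge;
    Z(e) is the evidence-dependent normalization over query assignments only."""
    def score(q_assign):
        return sum(w * f(q_assign, e) for w, f in zip(weights, features))
    Z = sum(np.exp(score(qq))
            for qq in itertools.product([0, 1], repeat=n_query_vars))
    return np.exp(score(tuple(q))) / Z

# Toy usage: two binary query variables, one real-valued evidence value.
features = [lambda q, e: q[0] * e,             # evidence-dependent feature
            lambda q, e: float(q[0] == q[1])]  # pairwise "link" feature over the query
print(conditional_prob((1, 1), e=0.7, weights=[1.5, 0.8],
                       features=features, n_query_vars=2))
```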
Model tractability still important
Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting too.
Tractability is determined by the structure over the query.
Simple local models: motivation
Locally, the dependence of the query on the evidence, Q = f(E), is almost linear. Exploiting evidence values overcomes the expressive power deficit of simple models. We will learn local tractable models.
[Figure: query Q vs. evidence E, with locally almost-linear regions]
Context-specific independence
Observation #2: use the evidence values at test time to tune the structure of the model (e.g. which edges to drop); do not commit to a single tractable model.
Low-dimensional dependencies in generative structure learning
Generative structure learning often relies only on low-dimensional marginals (over separators and cliques):
• junction trees: decomposable scores
• low-dimensional independence tests
• small changes to structure allow quick score recomputation
Discriminative structure learning, in contrast, needs inference in the full model for every datapoint, even for small changes in structure.
Leverage generative learning
Observation #3: generative structure learning algorithms have very useful properties; can we leverage them?
Observations so far
• The discriminative setting has extra information, including evidence values at test time
• We want to use them to learn local tractable models
• Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals of P(Q)
Approach:
1. use local conditionals P(Q | E=E) as "fake marginals" to learn local tractable structures
2. learn exact discriminative feature weights
Evidence-specific CRF overview
Approach: 1. use local conditionals P(Q | E=E) as "fake marginals" to learn local tractable structures; 2. learn exact discriminative feature weights.
Pipeline: local conditional density estimators P(Q | E) + evidence value E=E give P(Q | E=E); a generative structure learning algorithm turns these into a tractable structure for E=E; combined with feature weights w, this yields a tractable evidence-specific CRF.
Evidence-specific CRF formalism
Observation: an identically zero feature does not affect the model. Introduce extra "structural" parameters u and an evidence-specific structure indicator I(E, u) ∈ {0, 1} per feature, giving effective features w_i · f_i(Q, E) · I_i(E, u): the evidence-specific model is the fixed dense model with its evidence-specific feature values multiplied by an evidence-specific tree "mask", so different evidence values E=E1, E=E2, E=E3 yield different tree-structured models.
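A small sketch of the masking idea (the function name is mine): each feature is multiplied by a 0/1 indicator computed from the evidence, so features that are switched off simply vanish and the remaining ones form a tree-structured model for that particular evidence value.

```python
def masked_log_score(q, e, weights, features, edge_mask):
    """Unnormalized log-score of the evidence-specific model:
    sum_i w_i * f_i(q, e) * I_i(e, u).  `edge_mask` is the list of 0/1
    indicators I_i(e, u), assumed to come from running a tree-structure
    learner on the local conditionals for this evidence value; masked-out
    features drop out of the model entirely."""
    return sum(w * f(q, e) * m
               for w, f, m in zip(weights, features, edge_mask))
```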
Evidence-specific CRF learning
Learning proceeds in the same order as testing: local conditional density estimators P(Q | E) + evidence value E=E give P(Q | E=E); a generative structure learning algorithm turns these into a tractable structure for E=E; combined with feature weights w, this yields a tractable evidence-specific CRF.
Plug in generative structure learning
I(E, u) encodes the output of the chosen structure learning algorithm, so generative algorithms generalize directly:
• Generative: P(Qi, Qj) (pairwise marginals) + Chow-Liu algorithm = optimal tree
• Discriminative: P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
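A sketch of the Chow-Liu step common to both rows above (a Kruskal-style maximum spanning tree; the edge scores are pairwise mutual informations, either generative or conditional on the current evidence; the function name and toy numbers are mine):

```python
def chow_liu_tree(n_vars, edge_score):
    """Maximum-weight spanning tree over query variables 0..n_vars-1.
    edge_score: dict {(i, j): score}, where the score is I(Qi; Qj) in the
    generative case or an estimate of I(Qi; Qj | E=e) in the
    evidence-specific case, giving a possibly different tree for every e.
    Uses Kruskal's algorithm with a simple union-find."""
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for (i, j), _ in sorted(edge_score.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:          # adding this edge creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy usage: four query variables with made-up pairwise scores.
print(chow_liu_tree(4, {(0, 1): 0.9, (1, 2): 0.4, (0, 2): 0.8,
                        (2, 3): 0.6, (1, 3): 0.1}))
```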
Evidence-specific CRF learning: structure
• Choose a generative structure learning algorithm A (e.g. Chow-Liu)
• Identify the low-dimensional subsets Qβ that A may need (for Chow-Liu: all pairs (Qi, Qj))
• Replace the original problem over (E, Q) with low-dimensional pairwise problems over (E, Q1,Q2), (E, Q1,Q3), (E, Q3,Q4), …
Estimating low-dimensional conditionals
Use the same features as the baseline high-treewidth CRF, with their scope restricted to each low-dimensional model. End result: optimal u.
Evidence-specific CRF learning: weights
• Already chose the algorithm behind I(E, u)
• Already learned its parameters u, giving the "effective features"
• Only need to learn the feature weights w
• log P(Q | E, w, u) is concave in w, so there is a unique global optimum
Evidence-specific CRF learning: weights
For each training example, the fixed dense model multiplied by the evidence-specific tree "mask" is a tree-structured distribution, so the gradients with respect to w are exact and tractable; the overall (dense) gradient is the sum of these exact tree-structured gradients over the examples (E=E1, Q=Q1), (E=E2, Q=Q2), (E=E3, Q=Q3), …
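A brute-force sketch of that gradient for tiny binary-query examples (my own simplification: the expectations are computed by enumeration here, whereas the thesis relies on exact tree inference over each masked model):

```python
import itertools
import numpy as np

def loglik_gradient(data, weights, features, masks, n_query_vars):
    """Gradient of sum_m log P(q_m | e_m, w, u) with respect to w:
    for each example, observed feature value minus its expectation under
    the current masked model, times the mask.  `data` is a list of (q, e)
    pairs; `masks` holds the 0/1 feature mask chosen for each example's
    evidence.  Exact because each masked model is tree-structured
    (here, simply small enough to enumerate)."""
    grad = np.zeros(len(weights))
    assignments = list(itertools.product([0, 1], repeat=n_query_vars))
    for (q, e), mask in zip(data, masks):
        def score(qq):
            return sum(w * f(qq, e) * m
                       for w, f, m in zip(weights, features, mask))
        logits = np.array([score(qq) for qq in assignments])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        for i, (f, m) in enumerate(zip(features, mask)):
            expected = sum(p * f(qq, e) for p, qq in zip(probs, assignments))
            grad[i] += (f(tuple(q), e) - expected) * m
    return grad
```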
Results – WebKB
Task: text + links → webpage topic.
Compared: ignore links, standard dense CRF, max-margin model, and our work.
[Plots: prediction error and time (lower is better)]
Image segmentation – accuracy
Task: local segment features + neighboring segments → type of object.
Compared: ignore links, standard dense CRF, and our work.
[Plot: accuracy (higher is better)]
Image segmentation – time
Compared: ignore links, standard dense CRF, and our work.
[Plots: train time and test time (log scale, lower is better)]
Conclusions
• Using evidence values to tune low-treewidth model structure
• Compensates for the reduced expressive power
• Order of magnitude speedup at test time (sometimes at train time too)
• General framework for plugging in existing generative structure learners
• Straightforward relational extension [in the thesis]
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Why high-treewidth models?
• A dense model expressing laws of nature, e.g. protein folding
• Max-margin parameters don't work well (yet?) with evidence-specific structures