Integrating Topics and Syntax - Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

**Outline**
• Motivations – Syntactic vs. semantic modeling
• Formalization – Notations and terminology
• Generative Models – pLSI; Latent Dirichlet Allocation
• Composite Models – HMMs + LDA
• Inference – MCMC (Metropolis; Gibbs Sampling)
• Experiments – Performance and evaluations
• Summary – Bayesian hierarchical models

Discussions!

Han Liu

**Motivations**
• Statistical language modeling
  - Syntactic dependencies: short-range dependencies
  - Semantic dependencies: long-range dependencies
• Current models only consider one aspect
  - Hidden Markov Models (HMMs): syntactic modeling
  - Latent Dirichlet Allocation (LDA): semantic modeling
  - Probabilistic Latent Semantic Indexing (pLSI): semantic modeling

A model that captures both kinds of dependencies may be more useful!

**Problem Formalization**
• Word
  - A word is an item from a vocabulary indexed by {1, …, V}, represented as a unit-basis vector. The v-th word is represented by a V-vector w such that only the v-th element is 1, while the others are 0.
• Document
  - A document is a sequence of N words denoted by w = {w_1, w_2, …, w_N}, where w_i is the i-th word in the sequence.
• Corpus
  - A corpus is a collection of M documents, denoted by D = {w_1, w_2, …, w_M}.

**Latent Semantic Structure**
[Figure labels: Latent structure; Distribution over words; Words; Inferring latent structure; Prediction]

**Probabilistic Generative Models**
• Probabilistic Latent Semantic Indexing (pLSI)
  - Hofmann (1999), ACM SIGIR
  - Probabilistic semantic model
• Latent Dirichlet Allocation (LDA)
  - Blei, Ng, & Jordan (2003), J. of Machine Learning Res.
  - Probabilistic semantic model
• Hidden Markov Models (HMMs)
  - Baum & Petrie (1966), Ann. Math. Stat.
  - Probabilistic syntactic model

**Dirichlet vs. Multinomial Distributions**
• Dirichlet distribution (conjugate prior)
• Multinomial distribution

**Probabilistic LSI: Graphical Model**
[Figure: plate diagram with nodes d, z, w and plates N_d, D. Annotations: model the distribution over topics (d); topics as latent variables (z); generate a word from that topic (w)]

**Probabilistic LSI: Parameter Estimation**
• The log-likelihood of probabilistic LSI
• EM algorithm
  - E-step
  - M-step

**LDA: Graphical Model**
[Figure: plate diagram with nodes α, θ, z, β, φ, w and plates N_d, D, T. Annotations: sample a distribution over topics (θ); sample a topic (z); sample a word from that topic (w)]

**Latent Dirichlet Allocation**
• A variant of LDA developed by Griffiths (2003)
  - choose N | ξ ~ Poisson(ξ)
  - sample θ | α ~ Dir(α)
  - sample φ | β ~ Dir(β)
  - sample z | θ ~ Multinomial(θ)
  - sample w | z, φ(z) ~ Multinomial(φ(z))
• Model inference
  - all the Dirichlet priors are assumed to be symmetric
  - instead of variational inference and empirical Bayes parameter estimation, Gibbs sampling is adopted

**The Composite Model**
• An intuitive representation
[Figure: θ; semantic states z_1 … z_4; words w_1 … w_4; syntactic states s_1 … s_4. Labels: Semantic state: generate words from LDA; Syntactic states: generate words from HMMs]

**Composite Model: Graphical Model**
[Figure: plate diagram with nodes α, θ, c, π, γ, z, β, φ(z), w, φ(c), δ and plates N_d, C, T, M]

**Composite Model**
• All the Dirichlet priors are assumed to be symmetric
  - choose N | ξ ~ Poisson(ξ)
  - sample θ(d) | α ~ Dir(α)
  - sample φ(z_i) | β ~ Dir(β)
  - sample φ(c_i) | γ ~ Dir(γ)
  - sample π(c_{i-1}) | δ ~ Dir(δ)
  - sample z_i | θ(d) ~ Multinomial(θ(d))
  - sample c_i | π(c_{i-1}) ~ Multinomial(π(c_{i-1}))
  - sample w_i | z_i, φ(z_i) ~ Multinomial(φ(z_i)) if c_i = 1
  - sample w_i | c_i, φ(c_i) ~ Multinomial(φ(c_i)) otherwise

**Bayesian Inference**
• The EM algorithm can be applied to the composite model
  - treating θ, φ(z), φ(c), π(c) as parameters
  - log P(w | θ, φ(z), φ(c), π(c)) as the likelihood
  - too many parameters and too slow convergence
  - the Dirichlet priors are necessary assumptions!
• Markov Chain Monte Carlo (MCMC)
  - Instead of explicitly representing θ, φ(z), φ(c), π(c), we consider the posterior distribution over the assignment of words to topics or classes, P(z | w) and P(c | w)

**Markov Chain Monte Carlo**
• Sampling the posterior distribution according to a Markov chain
  - an ergodic (irreducible & aperiodic) Markov chain converges to a unique equilibrium distribution π(x)
  - try to sample the parameters according to a Markov chain whose equilibrium distribution π(x) is just the posterior distribution p(x)
• The key task is to construct a suitable transition kernel T(x, x')

**Metropolis-Hastings Algorithm**
• Sampling by constructing a reversible Markov chain
  - a reversible Markov chain guarantees the condition for the equilibrium distribution π(x)
  - the simultaneous Metropolis-Hastings algorithm follows an idea similar to rejection sampling

**Metropolis-Hastings Algorithm (cont.)**
• Algorithm
  - loop
  - sample x' from the proposal Q(x(t), x');
  - a = min{1, (π(x') / π(x(t))) · (Q(x', x(t)) / Q(x(t), x'))};
  - r = U(0,1);
  - if a < r: reject, x(t+1) = x(t); else: accept, x(t+1) = x';
  - end;
• Metropolis-Hastings intuition
[Figure: proposals x* from the current state x_t, with acceptance ratios r = 1.0 and r = p(x*)/p(x_t)]

**Metropolis-Hastings Algorithm**
• Why it works
• Single-site updating algorithm

**Gibbs Sampling**
• A special case of single-site updating Metropolis

**Gibbs Sampling for Composite Model**
θ, φ, π are all integrated out from the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm.

**Experiments**
• Corpora
  - Brown corpus: 500 documents, 1,137,466 words
  - TASA corpus: 37,651 documents, 12,190,931 word tokens
  - NIPS corpus: 1,713 documents, 4,312,614 word tokens
  - W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
• Experimental design
  - one class for sentence start/end markers {., ?, !}
  - T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMMs)
  - 4,000 iterations, with 2,000 burn-in and a lag of 100
  - 1st-, 2nd-, and 3rd-order Markov chains are considered

**Marginal Probabilities**
• Bayesian model comparison
  - P(w | M) is calculated using the harmonic mean of the likelihoods over the 2,000 iterations
  - used to evaluate the Bayes factors

**Part-of-Speech Tagging**
• Assessed performance on the Brown corpus
  - one set consisted of all Brown tags (297)
  - the other set collapsed the Brown tags into 10 designations
  - the 20th sample was used, evaluated by the Adjusted Rand Index
  - compared with DC on the 1,000 most frequent words, using 19 clusters

**Document Classification**
• Evaluated by a Naïve Bayes classifier
  - the 500 documents in Brown are classified into 15 groups
  - the topic vectors produced by LDA and the composite model are used to train the Naïve Bayes classifier
  - 10-fold cross-validation is used to evaluate the 20th sample
• Result (baseline accuracy: 0.09)
  - Trained on Brown: LDA (0.51); 1st-order composite model (0.45)
  - Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
  - Explanation: only about 20% of the words are allocated to the semantic component, too few to find correlations!

**Summary**
• Bayesian hierarchical models are natural for text modeling
• Simultaneously learning syntactic classes and semantic topics is possible through the combination of basic modules
• Discovering the syntactic and semantic building blocks forms the basis of more sophisticated representations
• Similar ideas could be generalized to other areas

**Discussions**
• Gibbs sampling vs. the EM algorithm?
• Hierarchical models reduce the number of parameters; what about model complexity?
• Equal priors for Bayesian model comparison?
• Do the 4 hyperparameters really have any effect?
• Probabilistic LSI does not have a normal distribution assumption, while probabilistic PCA assumes normality!
• EM is sensitive to local maxima; why does the Bayesian approach get through?
• Is the document classification experiment a good evaluation?
• Majority vote for tagging?
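
The Metropolis-Hastings loop from the slides can be sketched in a few lines of Python. This is only an illustrative sketch, not the sampler used in the experiments: the function names (`metropolis_hastings`, `propose`, `log_q`) are hypothetical, and the target is a toy standard normal rather than the composite model's posterior. The acceptance probability uses the standard Hastings correction Q(x', x) / Q(x, x'), computed in log space for numerical stability. The burn-in of 2,000 and lag of 100 simply mirror the settings quoted on the Experiments slide.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_q, x0, n_steps, rng):
    """Run a Metropolis-Hastings chain and return every visited state.

    log_target(x)    -- log of the (possibly unnormalized) target pi(x)
    propose(x, rng)  -- draws a candidate x' from the proposal Q(x, .)
    log_q(x, x_new)  -- log density of proposing x_new when at x
    All three callables are illustrative assumptions, not from the slides.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = propose(x, rng)
        # log of (pi(x') / pi(x)) * (Q(x', x) / Q(x, x'))
        log_ratio = (log_target(x_new) - log_target(x)
                     + log_q(x_new, x) - log_q(x, x_new))
        a = math.exp(min(0.0, log_ratio))  # a = min{1, ratio}
        r = rng.random()                   # r ~ U(0, 1)
        if r < a:
            x = x_new                      # accept the candidate
        # otherwise reject: the chain stays at x
        samples.append(x)
    return samples


# Toy target (assumption): standard normal, up to an additive constant.
def log_normal(x):
    return -0.5 * x * x


STEP = 1.0  # random-walk step size, chosen arbitrarily for the demo


def random_walk(x, rng):
    return x + rng.gauss(0.0, STEP)


def log_random_walk(x, x_new):
    # Symmetric Gaussian proposal; the normalizing constant cancels.
    return -0.5 * ((x_new - x) / STEP) ** 2


rng = random.Random(0)
chain = metropolis_hastings(log_normal, random_walk, log_random_walk,
                            x0=0.0, n_steps=4000, rng=rng)
kept = chain[2000::100]  # discard 2,000 burn-in steps, keep every 100th state
post_burn = chain[2000:]
mean = sum(post_burn) / len(post_burn)
var = sum((v - mean) ** 2 for v in post_burn) / len(post_burn)
print(len(kept), mean, var)
```

Because the random-walk proposal is symmetric, the Q terms cancel here and the loop reduces to the plain Metropolis rule; the `log_q` argument is kept so that an asymmetric proposal can be plugged in without changing the loop.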