Integrating Topics and Syntax - Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Outline
• Motivations – Syntactic vs. semantic modeling
• Formalization – Notation and terminology
• Generative Models – pLSI; Latent Dirichlet Allocation
• Composite Models – HMMs + LDA
• Inference – MCMC (Metropolis; Gibbs Sampling)
• Experiments – Performance and evaluations
• Summary – Bayesian hierarchical models
• Discussions!
Han Liu

Motivations
• Statistical language modeling
- Syntactic dependencies → short-range dependencies
- Semantic dependencies → long-range dependencies
• Current models consider only one aspect
- Hidden Markov Models (HMMs): syntactic modeling
- Latent Dirichlet Allocation (LDA): semantic modeling
- Probabilistic Latent Semantic Indexing (pLSI): semantic modeling
A model that captures both kinds of dependencies may be more useful!

Problem Formalization
• Word - A word is an item from a vocabulary indexed by {1, …, V}, represented as a unit-basis vector: the vth word is a V-vector w whose vth element is 1 and whose other elements are 0.
• Document - A document is a sequence of N words denoted by w = {w1, w2, …, wN}, where wi is the ith word in the sequence.
• Corpus - A corpus is a collection of M documents, denoted by D = {w1, w2, …, wM}.
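
The notation above can be made concrete with a minimal Python sketch. The toy vocabulary and the helper name `one_hot` are assumptions made for illustration only; they do not come from the slides or the paper.

```python
def one_hot(v, V):
    """Return the unit-basis V-vector for word index v (1-based, as on the slide)."""
    if not 1 <= v <= V:
        raise ValueError("word index must lie in 1..V")
    return [1 if j == v else 0 for j in range(1, V + 1)]

# Hypothetical toy vocabulary, not taken from the paper.
vocab = ["the", "network", "learns", "image"]
V = len(vocab)

# A document is a sequence of N word indices; here "the network learns".
document = [1, 2, 3]
doc_vectors = [one_hot(v, V) for v in document]

# A corpus is a collection of M documents.
corpus = [document, [1, 4]]
```

In practice only the indices are stored; the unit-basis vectors are a notational device.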

Latent Semantic Structure
[Figure: diagram with the labels "Latent Structure", "Words", "Distribution over words", "Inferring latent structure", and "Prediction"]

Probabilistic Generative Models
• Probabilistic Latent Semantic Indexing (pLSI)
- Hofmann (1999), ACM SIGIR
- Probabilistic semantic model
• Latent Dirichlet Allocation (LDA)
- Blei, Ng, & Jordan (2003), J. of Machine Learning Res.
- Probabilistic semantic model
• Hidden Markov Models (HMMs)
- Baum & Petrie (1966), Ann. Math. Stat.
- Probabilistic syntactic model

Dirichlet vs. Multinomial Distributions
• Dirichlet distribution (conjugate prior)
• Multinomial distribution
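
The formulas for this slide did not survive in the transcript. As a reminder, the standard textbook forms are given below; the symbols K (number of outcomes), n_k (counts), and N (total count) are notation chosen here, not recovered from the slide.

```latex
% Dirichlet density over a K-dimensional probability vector \theta
p(\theta \mid \alpha) =
  \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
  \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
  \qquad \theta_k \ge 0,\ \sum_{k=1}^{K} \theta_k = 1

% Multinomial probability of counts n_1, \dots, n_K with N = \sum_k n_k
P(n_1, \dots, n_K \mid \theta) =
  \frac{N!}{\prod_{k=1}^{K} n_k!} \prod_{k=1}^{K} \theta_k^{n_k}

% Conjugacy: the posterior is again Dirichlet, with the counts added to \alpha
p(\theta \mid n, \alpha) = \mathrm{Dir}(\theta \mid \alpha_1 + n_1, \dots, \alpha_K + n_K)
```

Conjugacy is what later allows θ and φ to be integrated out analytically in the Gibbs sampler.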

Probabilistic LSI: Graphical Model
[Figure: plate diagram with nodes d, z, w and plates Nd and D]
- d: model the distribution over topics
- z: topic as latent variable
- w: generate a word from that topic

Probabilistic LSI: Parameter Estimation
• The log-likelihood of probabilistic LSI
• EM algorithm
- E-step
- M-step
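
The equations on this slide were lost in the transcript. The block below gives the standard pLSI formulation as usually stated following Hofmann (1999); it is a reconstruction from general knowledge rather than from the slide, and n(d, w) (the count of word w in document d) is notation assumed here.

```latex
% Log-likelihood
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w),
\qquad
P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)

% E-step: posterior over the latent topic for each (d, w) pair
P(z \mid d, w) =
  \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')}

% M-step: re-estimate the parameters from expected counts
P(w \mid z) =
  \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, P(z \mid d, w')},
\qquad
P(z \mid d) =
  \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{w} n(d, w)}
```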

LDA: Graphical Model
[Figure: plate diagram with nodes α, θ, z, β, φ, w and plates Nd, D, and T]
- θ: sample a distribution over topics
- z: sample a topic
- w: sample a word from that topic

Latent Dirichlet Allocation
• A variant of LDA developed by Griffiths (2003)
- choose N | ξ ~ Poisson(ξ)
- sample θ | α ~ Dir(α)
- sample φ | β ~ Dir(β)
- sample z | θ ~ Multinomial(θ)
- sample w | z, φ(z) ~ Multinomial(φ(z))
• Model inference
- all Dirichlet priors are assumed to be symmetric
- instead of variational inference and empirical Bayes parameter estimation, Gibbs sampling is adopted
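
The generative steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy sizes, and the use of Knuth's method for the Poisson draw are all choices made here.

```python
import math
import random

def sample_dirichlet(rng, conc, k):
    """Symmetric Dirichlet(conc) draw of dimension k via normalized Gamma variates."""
    draws = [rng.gammavariate(conc, 1.0) for _ in range(k)]
    total = sum(draws)
    return [g / total for g in draws]

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small rates in this sketch."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate_lda_corpus(M, T, V, alpha, beta, xi, seed=0):
    """Generate M documents with T topics over a vocabulary of size V."""
    rng = random.Random(seed)
    # phi | beta ~ Dir(beta): one word distribution per topic
    phi = [sample_dirichlet(rng, beta, V) for _ in range(T)]
    corpus = []
    for _ in range(M):
        theta = sample_dirichlet(rng, alpha, T)   # theta | alpha ~ Dir(alpha)
        n_words = sample_poisson(rng, xi)         # N | xi ~ Poisson(xi)
        doc = []
        for _ in range(n_words):
            z = rng.choices(range(T), weights=theta)[0]    # z | theta
            w = rng.choices(range(V), weights=phi[z])[0]   # w | z, phi(z)
            doc.append((z, w))
        corpus.append(doc)
    return phi, corpus

# Toy sizes chosen for illustration only.
phi, corpus = generate_lda_corpus(M=5, T=3, V=10, alpha=0.5, beta=0.5, xi=20.0, seed=1)
```

Each document is returned as a list of (topic, word) index pairs so that the latent assignments stay visible; a real corpus would expose only the words.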

The Composite Model
• An intuitive representation
[Figure: nodes θ; z1, z2, z3, z4; w1, w2, w3, w4; s1, s2, s3, s4]
- Semantic state: generates words from LDA
- Syntactic states: generate words from HMMs

Composite Model: Graphical Model
[Figure: plate diagram with nodes α, θ, c, π, γ, z, β, φ(z), w, φ(c), δ and plates Nd, C, T, M]

The Composite Model: Generative process
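
The body of this slide is missing from the transcript. The sketch below follows the generative process as described in the paper: at every position a topic z is drawn from θ and a class c from the HMM transition row of the previous class; the word comes from the topic if c is the semantic class, and from the class otherwise. The index chosen for the semantic class, the toy sizes, and all function names are assumptions made for this illustration.

```python
import random

SEMANTIC = 0  # assumption: class index 0 stands for the semantic class

def sample_dirichlet(rng, conc, k):
    """Symmetric Dirichlet(conc) draw of dimension k via normalized Gamma variates."""
    g = [rng.gammavariate(conc, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def draw(rng, probs):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_document(rng, n_words, theta, phi_topic, phi_class, pi, start_class):
    """Return a list of (word, topic, class) triples for one document."""
    words, prev = [], start_class
    for _ in range(n_words):
        z = draw(rng, theta)       # topic z_i ~ theta, drawn at every position
        c = draw(rng, pi[prev])    # class c_i ~ pi[c_{i-1}], the HMM transition
        if c == SEMANTIC:
            w = draw(rng, phi_topic[z])   # semantic class: word comes from topic z
        else:
            w = draw(rng, phi_class[c])   # syntactic class: word comes from class c
        words.append((w, z, c))
        prev = c
    return words

rng = random.Random(0)
T, C, V = 4, 3, 12   # toy sizes, far smaller than those used in the experiments
theta = sample_dirichlet(rng, 0.5, T)
phi_topic = [sample_dirichlet(rng, 0.5, V) for _ in range(T)]
phi_class = [sample_dirichlet(rng, 0.5, V) for _ in range(C)]  # row SEMANTIC is unused
pi = [sample_dirichlet(rng, 0.5, C) for _ in range(C)]
doc = generate_document(rng, 30, theta, phi_topic, phi_class, pi, start_class=1)
```

The sketch omits the dedicated class for sentence start/end markers mentioned in the experimental design.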

Bayesian Inference
• The EM algorithm can be applied to the composite model
- treating θ, φ(z), φ(c), π(c) as parameters
- log P(w | θ, φ(z), φ(c), π(c)) as the likelihood
- too many parameters and too slow convergence
- the Dirichlet priors are necessary assumptions!
• Markov Chain Monte Carlo (MCMC)
- Instead of explicitly representing θ, φ(z), φ(c), π(c), we consider the posterior distribution over the assignments of words to topics or classes, P(z | w) and P(c | w)

Markov Chain Monte Carlo
• Sample from the posterior distribution using a Markov chain
- an ergodic (irreducible & aperiodic) Markov chain converges to a unique equilibrium distribution π(x)
- sample the parameters from a Markov chain whose equilibrium distribution is exactly the posterior distribution π(x)
• The key task is to construct a suitable transition kernel T(x, x')
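
The convergence claim above can be checked numerically on a small chain. The 3-state transition matrix below is hypothetical and chosen only so that all entries are positive, which makes the chain irreducible and aperiodic.

```python
def step(dist, T):
    """One step of the chain: new_dist[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

def run(dist, T, n_steps=200):
    """Push a starting distribution through n_steps transitions."""
    for _ in range(n_steps):
        dist = step(dist, T)
    return dist

# Hypothetical transition matrix; each row sums to 1.
T = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

# Two very different starting points end up at the same distribution.
from_first = run([1.0, 0.0, 0.0], T)
from_last = run([0.0, 0.0, 1.0], T)
```

Both runs agree to numerical precision, and applying `step` once more leaves the result unchanged, which is the defining property of the equilibrium distribution.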

Metropolis-Hastings Algorithm
• Sampling by constructing a reversible Markov chain
- reversibility (detailed balance) guarantees that π(x) is an equilibrium distribution of the chain
- the simultaneous-update Metropolis-Hastings algorithm rests on an idea similar to rejection sampling

Metropolis-Hastings Algorithm (cont.)
• Algorithm
loop
  sample x' from Q(x(t), x');
  a = min{1, (π(x') / π(x(t))) * (Q(x', x(t)) / Q(x(t), x'))};
  r = U(0,1);
  if a < r reject, x(t+1) = x(t);
  else accept, x(t+1) = x';
end;
• Metropolis-Hastings intuition
[Figure: proposals x* around the current state xt, labeled r = 1.0 and r = p(x*)/p(xt)]
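
A runnable version of the loop above, written in log space for numerical stability. This is a sketch under assumptions: the function signature is invented for illustration, and the example target (an unnormalized standard normal) and Gaussian random-walk proposal are not from the slides.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_q, x0, n_samples, seed=0):
    """Draw n_samples states; log_q(a, b) is the log density of proposing b from a."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        x_new = propose(rng, x)
        # log of pi(x') Q(x', x) / (pi(x) Q(x, x'))
        log_a = (log_target(x_new) - log_target(x)
                 + log_q(x_new, x) - log_q(x, x_new))
        # accept with probability min(1, a); otherwise keep the current state
        if rng.random() < math.exp(min(0.0, log_a)):
            x = x_new
        samples.append(x)
    return samples

# Example target: unnormalized standard normal. The normalizing constant is
# never needed because only ratios of the target appear in the acceptance rule.
log_target = lambda x: -0.5 * x * x
propose = lambda rng, x: x + rng.gauss(0.0, 1.0)
log_q = lambda a, b: -0.5 * (b - a) ** 2   # symmetric, so the Q terms cancel
samples = metropolis_hastings(log_target, propose, log_q, 0.0, 20000)
```

With a symmetric proposal the Q ratio equals 1 and the rule reduces to the original Metropolis algorithm; the general form matters only for asymmetric proposals.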

Metropolis-Hastings Algorithm
• Why it works
• Single-site updating algorithm
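
The derivation for "why it works" is missing from the transcript. The standard detailed-balance argument, written with Q(x, x') as the density of proposing x' from x, runs as follows for x ≠ x'.

```latex
% Transition kernel for x \neq x': propose, then accept
T(x, x') = Q(x, x') \min\left\{ 1,\ \frac{\pi(x')\, Q(x', x)}{\pi(x)\, Q(x, x')} \right\}

% Multiply by \pi(x); the result is symmetric in x and x'
\pi(x)\, T(x, x')
  = \min\left\{ \pi(x)\, Q(x, x'),\ \pi(x')\, Q(x', x) \right\}
  = \pi(x')\, T(x', x)

% Summing detailed balance over x shows that \pi is invariant under T
\sum_{x} \pi(x)\, T(x, x') = \pi(x') \sum_{x} T(x', x) = \pi(x')
```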

Gibbs Sampling
• A special case of single-site updating Metropolis-Hastings
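
In Gibbs sampling each variable is redrawn from its exact conditional given the others, so every proposal is accepted. A minimal sketch on a hypothetical joint distribution over two binary variables (the table and function names are assumptions for illustration):

```python
import random

# Hypothetical joint distribution over two binary variables (x, y).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_x_given_y(rng, y):
    """Draw x from P(x | y), computed from the joint table."""
    p1 = JOINT[(1, y)] / (JOINT[(0, y)] + JOINT[(1, y)])
    return 1 if rng.random() < p1 else 0

def sample_y_given_x(rng, x):
    """Draw y from P(y | x), computed from the joint table."""
    p1 = JOINT[(x, 1)] / (JOINT[(x, 0)] + JOINT[(x, 1)])
    return 1 if rng.random() < p1 else 0

def gibbs(n_samples, burn_in=1000, seed=0):
    """Return empirical frequencies of (x, y) after discarding burn-in sweeps."""
    rng = random.Random(seed)
    x, y = 0, 0
    counts = {state: 0 for state in JOINT}
    for t in range(burn_in + n_samples):
        x = sample_x_given_y(rng, y)   # single-site update of x
        y = sample_y_given_x(rng, x)   # single-site update of y
        if t >= burn_in:
            counts[(x, y)] += 1
    return {state: c / n_samples for state, c in counts.items()}

estimate = gibbs(50000)
```

The empirical frequencies in `estimate` should approach the entries of `JOINT` as the number of samples grows.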

Gibbs Sampling for Composite Model
θ, φ, π are all integrated out from the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm
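
The sampling equations are missing from the transcript. As an illustration of what integrating out θ and φ yields, the block below sketches the conditional for a topic assignment z_i, derived from the generative process under symmetric Dirichlet priors α and β. The count notation is an assumption made here and should be checked against the paper: n_t^(d_i) counts words in document d_i assigned to topic t, n_w^(t) counts occurrences of word w assigned to topic t among positions in the semantic class (c = 1 in the paper's indexing), all counts exclude position i, and W is the vocabulary size. The conditional for the class assignment c_i, which also involves the HMM transition counts, is not reproduced.

```latex
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{c}, \mathbf{w}) \propto
\begin{cases}
  n^{(d_i)}_{t} + \alpha
    & \text{if } c_i \neq 1 \\[6pt]
  \left( n^{(d_i)}_{t} + \alpha \right)
  \dfrac{n^{(t)}_{w_i} + \beta}{n^{(t)}_{\cdot} + W\beta}
    & \text{if } c_i = 1
\end{cases}
```

When the word is emitted by a syntactic class, it carries no information about the topic, so only the document-level topic counts matter.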

Experiments
• Corpora
- Brown corpus: 500 documents, 1,137,466 word tokens
- TASA corpus: 37,651 documents, 12,190,931 word tokens
- NIPS corpus: 1,713 documents, 4,312,614 word tokens
- W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
• Experimental design
- one class for sentence start/end markers {., ?, !}
- T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMMs)
- 4,000 iterations, with 2,000 burn-in and a lag of 100
- 1st-, 2nd-, and 3rd-order Markov chains are considered

Identifying function and content words

Comparative study on NIPS corpus (T = 100 & C = 50)

Identifying function and content words (NIPS)

Marginal probabilities
• Bayesian model comparison
- P(w | M) is estimated using the harmonic mean of the likelihoods over the 2,000 iterations
- these estimates are used to evaluate the Bayes factors
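
The harmonic mean estimate can be computed stably from per-sample log-likelihoods. The helper below is a generic sketch, not the authors' code; the function name is an assumption. It uses the log-sum-exp trick, since corpus log-likelihoods are large negative numbers whose exponentials underflow.

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logs.

    HM = S / sum_s exp(-ll_s), evaluated with log-sum-exp for stability.
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(neg)) - log_sum

# Small check with likelihoods 1 and 2: harmonic mean = 2 / (1 + 1/2) = 4/3.
small = math.exp(log_harmonic_mean([math.log(1.0), math.log(2.0)]))

# Values on the scale of a corpus log-likelihood do not underflow.
large = log_harmonic_mean([-1000.0, -1000.0])
```

The log Bayes factor between two models is then the difference of their `log_harmonic_mean` values. The estimator is known to have high variance, which is worth bearing in mind when interpreting such comparisons.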

Part of Speech Tagging
• Assessed performance on the Brown corpus
- One tag set consisted of all Brown tags (297)
- The other set collapsed the Brown tags into 10 designations
- The 20th sample was used, evaluated by the Adjusted Rand Index
- Compared with distributional clustering (DC) on the 1,000 most frequent words, using 19 clusters
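
For reference, the Adjusted Rand Index compares two partitions of the same items by counting pairs placed together, corrected for chance agreement. The implementation below is a generic sketch of the standard formula, not the evaluation code used in the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs that fall together in both partitions, and in each one separately.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions with renamed labels score 1.0.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that never agree on a pair score below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which items are grouped together, it can score unsupervised classes against gold tags without matching class labels to tag names.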

Document Classification
• Evaluated with a Naïve Bayes classifier
- The 500 documents in Brown are classified into 15 groups
- The topic vectors produced by LDA and the composite model are used to train the Naïve Bayes classifier
- 10-fold cross-validation is used to evaluate the 20th sample
• Results (baseline accuracy: 0.09)
- Trained on Brown: LDA (0.51); 1st-order composite model (0.45)
- Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
- Explanation: only about 20% of words are allocated to the semantic component, too few to find correlations!

Summary
• Bayesian hierarchical models are natural for text modeling
• Simultaneously learning syntactic classes and semantic topics is possible by combining basic modules
• Discovering the syntactic and semantic building blocks forms the basis of more sophisticated representations
• Similar ideas could be generalized to other areas

Discussions
• Gibbs sampling vs. the EM algorithm?
• Hierarchical models reduce the number of parameters; what about model complexity?
• Equal priors for Bayesian model comparison?
• Do the 4 hyperparameters really have any effect?
• Probabilistic LSI makes no normality assumption, while probabilistic PCA assumes normality!
• EM is sensitive to local maxima; why does the Bayesian approach get through?
• Is the document classification experiment a good evaluation?
• Majority vote for tagging?
