
Latent Dirichlet Allocation (LDA)



Presentation Transcript


  1. Latent Dirichlet Allocation (LDA) Week 8

  2. Motivation • What do we want to do with text corpora? Classification, novelty detection, summarization, and similarity/relevance judgments. • Given a text corpus or other collection of discrete data, we wish to: • Find a short description of the data. • Preserve the essential statistical relationships.

  3. Term Frequency – Inverse Document Frequency • tf-idf (Salton and McGill, 1983) • The term frequency count is compared to an inverse document frequency count. • Results in a t × d (term-by-document) matrix – thus reducing each document in the corpus to a fixed-length list of numbers. • Basic identification of sets of words that are discriminative for documents in the collection. • Used for search engines.
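As a concrete illustration of the scheme above, here is a minimal tf-idf sketch. The helper name and the log-based idf weighting are one common choice for illustration, not necessarily the exact variant of Salton and McGill:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a term-by-document tf-idf weighting (one dict per document).

    docs: list of token lists. Weight = tf(t, d) * log(N / df(t)).
    """
    N = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        matrix.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return matrix

docs = [["topic", "model", "topic"], ["model", "inference"]]
m = tfidf_matrix(docs)
# "topic" appears only in document 0, so it gets a positive weight there;
# "model" appears in every document, so its idf is log(2/2) = 0.
```

Note how a term occurring in all documents is zeroed out: tf-idf keeps exactly the words that discriminate between documents, as the slide says.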

  4. LSI (Deerwester et al., 1990) • Latent Semantic Indexing • Classic attempt at solving this problem in information retrieval • Uses SVD to reduce document representations • Models synonymy and polysemy • Computing SVD is slow • Non-probabilistic model

  5. pLSI (Hofmann, 1999) • A generative model. • Models each word in a document as a sample from a mixture model. • Each word is generated from a single topic; different words in the document may be generated from different topics. • Each document is represented as a list of mixing proportions for the mixture components.

  6. Exchangeability • A finite set of random variables is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N: p(w_1, …, w_N) = p(w_π(1), …, w_π(N)). • An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.

  7. bag-of-words Assumption • Word order is ignored. • “bag-of-words” – exchangeability, not i.i.d. • Theorem (De Finetti, 1935) – if (w_1, w_2, …) are infinitely exchangeable, then the joint probability has a representation as a mixture over some random variable θ.
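The mixture representation elided from the slide is, in the standard statement of de Finetti's theorem:

```latex
p(w_1, \ldots, w_N)
  = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta
```

That is, exchangeable words behave as if they were i.i.d. conditional on a latent parameter θ that is itself drawn from some distribution.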

  8. Notation and terminology • A word is an item from a vocabulary indexed by {1,…,V}. We represent words using unit-basis vectors: the vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. • A document is a sequence of N words denoted by w = (w_1, w_2, …, w_N), where w_n is the nth word in the sequence. • A corpus is a collection of M documents denoted by D = {w_1, w_2, …, w_M}.

  9. Supervised text categorization through Naïve Bayes • Generative model: first generate a document category, then the words in the document (unigram model). • Inference: obtain the posterior over document categories using Bayes' rule (argmax to choose the category). [Figure: Cat → w1, w2, w3, w4, w5, w6, …]

  10. What we’re doing today • Supervised categorization requires hand-labeling documents. • This can be extremely time-consuming. • Unlabeled documents are cheap. • So we’d really like to do unsupervised text categorization. • Today we’ll look at unsupervised learning within the Naïve Bayes model.

  11. Compact graphical model representations • We’re going to lean heavily on graphical model representations here. • We’ll use a more compact notation: rather than drawing Cat → w1, w2, w3, w4, …, wn explicitly, a “plate” around w with multiplicity n means “generate a word from Cat n times”. [Figure: Cat → w inside a plate of size n]

  12. Learning with unobserved categories • Now suppose that Cat isn’t observed. • We need to learn two distributions: • P(Cat) • P(w|Cat) • How do we do this? • We might use the method of maximum likelihood (MLE). • But it turns out that the likelihood surface is highly non-convex, and lots of information isn’t contained in a point estimate. • Alternative: Bayesian methods. [Figure: plate model with unobserved Cat → w, plate of size n]

  13. Bayesian document categorization [Figure: plate model with priors on P(Cat) and P(w|Cat); Cat → w inside a plate of size n_D, nested in a plate over D documents]

  14. Latent Dirichlet allocation • LDA is a generative probabilistic model of a corpus. The basic idea is that the documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.

  15. Dirichlet distribution • A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex, and has the following probability density on this simplex:
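The density elided from the slide is the standard Dirichlet density on the (k−1)-simplex:

```latex
p(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}
    \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},
  \qquad \theta_i \ge 0, \;\; \sum_{i=1}^{k} \theta_i = 1
```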

  16. Dirichlet priors • Multivariate equivalent of Beta distribution • Hyperparameters determine form of the prior
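To see how the hyperparameter shapes the prior, one can draw Dirichlet samples via the standard normalized-Gamma construction. This is a stdlib-only sketch; the function name is illustrative:

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories.

    Standard construction: draw k independent Gamma(alpha, 1) variates
    and normalize them to sum to 1.
    """
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
broad = sample_dirichlet(10.0, 5, rng)   # large alpha: near-uniform draws
sparse = sample_dirichlet(0.1, 5, rng)   # small alpha: mass piles onto few components
```

With a small concentration parameter, most probability mass typically lands on one or two components; this is exactly the sparsity that the later hyperparameter slides rely on.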

  17. Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003) • Main difference: one topic per word. [Figure: LDA plate model — θ(d) ~ Dirichlet(α), the distribution over topics for each document; z_i ~ Discrete(θ(d)), the topic assignment for each word; φ(j) ~ Dirichlet(β), the distribution over words for each topic (T of them); w_i ~ Discrete(φ(z_i)), the word generated from its assigned topic; plates over N_d words and D documents]

  18. The LDA equations
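The equations this slide refers to (eqs. 2–3 of Blei, Ng & Jordan, 2003) are:

```latex
% Joint distribution of a topic mixture theta, topic assignments z,
% and words w (eq. 2):
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document (eq. 3), integrating out theta
% and summing out z:
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
```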

  19. LDA and exchangeability • We assume that words are generated by topics and that those topics are infinitely exchangeable within a document. • By de Finetti’s theorem, the joint probability of a document’s words and topics is therefore a mixture over the topic proportions θ. • By marginalizing out the topic variables, we get eq. (3) on the previous slide.

  20. Unigram model

  21. Mixture of unigrams

  22. Probabilistic LSI
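For reference, the three comparison models on slides 20–22 (whose equations were images in the original deck) can be written, following the standard formulation, as:

```latex
% Unigram model: every word drawn i.i.d. from one multinomial
p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

% Mixture of unigrams: a single topic z per document
p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)

% pLSI: topics mixed within a document, but d is a dummy index
% tied to the training documents (no generative story for new docs)
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
```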

  23. A geometric interpretation [Figure: the word simplex]

  24. A geometric interpretation [Figure: the topic simplex, spanned by topics 1–3, embedded inside the word simplex]

  27. Inference • We want to compute the posterior distribution of the hidden variables given a document. • Unfortunately, this is intractable to compute in general. We write eq. (3) as:
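The elided equations, in the standard LDA formulation, are the posterior and the expanded form of eq. (3) that exposes the difficulty:

```latex
% Posterior over the hidden variables; the normalizer is eq. (3):
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

% Eq. (3) written out; the coupling of theta and beta inside the
% sum over topics makes this integral intractable:
p(\mathbf{w} \mid \alpha, \beta)
  = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)}
    \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right)
    \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V}
      (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
```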

  28. Variational Inference • To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables. • To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). • This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.

  29. Variational inference

  30. Parameter Estimation • Goal: estimate the parameters of the model from training data. • φ: the topic–word distributions. • θ: the document–topic distributions.

  31. The collapsed Gibbs sampler • Using conjugacy of Dirichlet and multinomial distributions, integrate out continuous parameters • Defines a distribution on discrete ensembles z

  32. The collapsed Gibbs sampler • Sample each z_i conditioned on z_-i. • This is nicer than your average Gibbs sampler: • memory: counts can be cached in two sparse matrices • optimization: no special functions, simple arithmetic • the distributions on θ and φ are analytic given z and w, and can later be found for each sample.
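A minimal collapsed Gibbs sampler matching this description. This is a sketch: function and variable names are mine, dense lists stand in for the sparse count matrices the slide mentions, and the full conditional is the standard (n_dt + α)(n_tw + β)/(n_t + Vβ) form:

```python
import random

def lda_gibbs(docs, T, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a toy scale.

    docs: list of lists of word ids in [0, V); T: number of topics.
    theta and phi are integrated out; we resample only the topic
    assignment z for each token, updating cached count matrices.
    """
    rng = random.Random(seed)
    ndt = [[0] * T for _ in docs]        # document-topic counts
    ntw = [[0] * V for _ in range(T)]    # topic-word counts
    nt = [0] * T                         # tokens per topic
    z = []
    for d, doc in enumerate(docs):       # random initial assignments
        zd = []
        for w in doc:
            t = rng.randrange(T)
            zd.append(t)
            ndt[d][t] += 1; ntw[t][w] += 1; nt[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove token i from the counts: condition on z_{-i}
                ndt[d][t] -= 1; ntw[t][w] -= 1; nt[t] -= 1
                # full conditional -- simple arithmetic, no special functions
                probs = [(ndt[d][k] + alpha) * (ntw[k][w] + beta)
                         / (nt[k] + V * beta) for k in range(T)]
                r = rng.random() * sum(probs)
                acc = 0.0
                for k, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t
                ndt[d][t] += 1; ntw[t][w] += 1; nt[t] += 1
    return z, ndt, ntw

docs = [[0, 1, 0, 2], [2, 3, 3], [0, 0, 1]]   # toy corpus, V = 4
z, ndt, ntw = lda_gibbs(docs, T=2, V=4, iters=50)
```

After sampling, the cached counts ndt and ntw are exactly what the later parameter-estimation slide needs to recover θ and φ.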

  33. Gibbs sampling in LDA [Figure: animation of per-token topic assignments being resampled over iterations 1, 2, …, 1000 (slides 33–41)]

  42. Parameter Estimation
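The point estimates recoverable from a Gibbs sample (in the Griffiths & Steyvers formulation, using the same counts the sampler caches) are:

```latex
% Topic-word distribution for topic j, word w:
\hat{\phi}_{w}^{(j)} = \frac{n_{w}^{(j)} + \beta}{n_{\cdot}^{(j)} + V\beta}
\qquad
% Document-topic distribution for document d, topic j:
\hat{\theta}_{j}^{(d)} = \frac{n_{j}^{(d)} + \alpha}{n_{\cdot}^{(d)} + T\alpha}
```

Here n_w^(j) counts how often word w is assigned to topic j, and n_j^(d) how many tokens in document d are assigned to topic j.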

  43. Effects of hyperparameters • α and β control the relative sparsity of θ and φ. • smaller α: fewer topics per document • smaller β: fewer words per topic • Good assignments z compromise in sparsity.

  44. Varying α • decreasing α increases sparsity [Figure]

  45. Varying β • decreasing β increases sparsity [Figure]

  46. Document modeling • Unlabeled data – our goal is density estimation. • Compute the perplexity of a held-out test set to evaluate the models – a lower perplexity score indicates better generalization.
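The perplexity referred to above is, for a test set of M documents:

```latex
\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\left\{ - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}
```

It is the exponentiated negative average per-word log-likelihood, so it decreases monotonically as the model assigns higher likelihood to held-out text.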

  47. Document modeling (cont.) – data used • C. Elegans Community abstracts • 5,225 abstracts • 28,414 unique terms • TREC AP corpus (subset) • 16,333 newswire articles • 23,075 unique terms • Held-out data – 10% • Removed terms – 50 stop words, words appearing once (AP)

  48. [Figure: example topics learned from the C. Elegans corpus, featuring words such as “nematode”]
