## Latent Dirichlet Allocation (LDA)


**Motivation**

• What do we want to do with text corpora? Classification, novelty detection, summarization, and similarity/relevance judgments.
• Given a text corpus or other collection of discrete data, we wish to:
• Find a short description of the data.
• Preserve the essential statistical relationships.

**Term Frequency – Inverse Document Frequency**

• tf-idf (Salton and McGill, 1983): each term-frequency count is weighted by an inverse document-frequency count.
• Results in a term-by-document matrix, thus reducing each document in the corpus to a fixed-length list of numbers.
• Gives a basic identification of sets of words that are discriminative for documents in the collection.
• Used for search engines.

**LSI (Deerwester et al., 1990)**

• Latent Semantic Indexing.
• A classic attempt at solving this problem in information retrieval.
• Uses SVD to reduce document representations.
• Models synonymy and polysemy.
• Computing the SVD is slow.
• A non-probabilistic model.

**pLSI (Hofmann, 1999)**

• A generative model.
• Models each word in a document as a sample from a mixture model.
• Each word is generated from a single topic; different words in the document may be generated from different topics.
• Each document is represented as a list of mixing proportions for the mixture components.

**Exchangeability**

• A finite set of random variables $\{x_1, \dots, x_N\}$ is said to be exchangeable if the joint distribution is invariant to permutation. If $\pi$ is a permutation of the integers from 1 to N:
$$p(x_1, \dots, x_N) = p(x_{\pi(1)}, \dots, x_{\pi(N)})$$
• An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.

**Bag-of-words assumption**

• Word order is ignored.
• "Bag-of-words" means exchangeability, not i.i.d.
• Theorem (De Finetti, 1935): if $(x_1, x_2, \dots)$ are infinitely exchangeable, then the joint probability has a representation as a mixture
$$p(x_1, \dots, x_N) = \int p(\theta) \prod_{n=1}^{N} p(x_n \mid \theta) \, d\theta$$
for some random variable $\theta$.

**Notation and terminology**

• A word is an item from a vocabulary indexed by $\{1, \dots, V\}$. We represent words using unit-basis vectors: the $v$th word is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
• A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \dots, w_N)$, where $w_n$ is the $n$th word in the sequence.
• A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$.

**Supervised text categorization with Naïve Bayes**

(Graphical model: a category node Cat with observed word nodes w1, w2, …, wn as children.)
• Generative model: first generate a document category, then the words in the document (unigram model).
• Inference: obtain the posterior over document categories using Bayes' rule (argmax to choose the category); see the sketch at the end of this section.

**What we're doing today**

• Supervised categorization requires hand-labeling documents.
• This can be extremely time-consuming.
• Unlabeled documents are cheap.
• So we'd really like to do unsupervised text categorization.
• Today we'll look at unsupervised learning within the Naïve Bayes model.

**Compact graphical model representations**

• We're going to lean heavily on graphical model representations here, so we'll use a more compact notation: instead of drawing Cat with separate children w1, …, wn, we draw a single node w inside a "plate" labeled n, meaning "generate a word from Cat n times."

**Unsupervised learning in the Naïve Bayes model**

• Now suppose that Cat isn't observed.
• We need to learn two distributions: P(Cat) and P(w|Cat).
• How do we do this? We might use the method of maximum likelihood (MLE).
• But it turns out that the likelihood surface is highly non-convex, and lots of information isn't contained in a point estimate.
• Alternative: Bayesian methods.

**Bayesian document categorization**

(Graphical model: priors are placed on P(Cat) and P(w|Cat); Cat generates the nD words of each document, replicated over the D documents in the corpus.)
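To make the supervised starting point concrete, here is a minimal sketch of a Naïve Bayes categorizer of the kind described above: a unigram generative model with pseudo-count smoothing standing in for the priors, and Bayes'-rule inference with an argmax over categories. The function names, smoothing constant, and toy corpus are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def train_nb(docs, labels, vocab_size, alpha=1.0):
    """Estimate P(Cat) and P(w | Cat) from labeled documents (add-alpha smoothing)."""
    cats = sorted(set(labels))
    prior = np.zeros(len(cats))
    word_probs = np.zeros((len(cats), vocab_size))
    for c_idx, c in enumerate(cats):
        cat_docs = [d for d, lab in zip(docs, labels) if lab == c]
        prior[c_idx] = len(cat_docs) / len(docs)
        counts = np.full(vocab_size, alpha)        # pseudo-counts act as a simple prior
        for d in cat_docs:
            for w in d:
                counts[w] += 1
        word_probs[c_idx] = counts / counts.sum()  # P(w | Cat = c)
    return cats, prior, word_probs

def classify(doc, cats, prior, word_probs):
    """Posterior over categories via Bayes' rule; argmax picks the category."""
    log_post = np.log(prior) + np.array(
        [np.log(word_probs[c_idx][doc]).sum() for c_idx in range(len(cats))]
    )
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return cats[int(np.argmax(post))], post

# Toy usage: documents are lists of word indices into a vocabulary of size 4.
docs = [[0, 1, 1], [0, 0, 2], [3, 2, 3], [2, 3, 3]]
labels = ["sports", "sports", "politics", "politics"]
cats, prior, word_probs = train_nb(docs, labels, vocab_size=4)
print(classify([3, 3, 2], cats, prior, word_probs))
```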
**Latent Dirichlet allocation**

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.

**Dirichlet distribution**

• A $k$-dimensional Dirichlet random variable $\theta$ can take values in the $(k-1)$-simplex, and has the following probability density on this simplex:
$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$

**Dirichlet priors**

• The multivariate equivalent of the Beta distribution.
• The hyperparameters determine the form of the prior.

**Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003)**

• Main difference from the earlier models: one topic per word.
• $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$: the distribution over topics for each document.
• $z_i \sim \mathrm{Discrete}(\theta^{(d)})$: the topic assignment for each word.
• $\phi^{(j)} \sim \mathrm{Dirichlet}(\beta)$, $j = 1, \dots, T$: the distribution over words for each topic.
• $w_i \sim \mathrm{Discrete}(\phi^{(z_i)})$: each word is generated from its assigned topic.
• (Plate diagram: $N_d$ words per document, $D$ documents, $T$ topics; $\alpha$ and $\beta$ are the Dirichlet priors.)

**LDA and exchangeability**

• We assume that words are generated by topics and that those topics are infinitely exchangeable within a document.
• By de Finetti's theorem:
$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n) \right) d\theta$$
• By marginalizing out the topic variables, we recover the marginal distribution of a document, $p(\mathbf{w} \mid \alpha, \beta)$ (Eq. 3 of Blei et al., 2003).

**A geometric interpretation**

(Figure, repeated over several slides: the word simplex with the topic simplex, spanned by topic 1, topic 2, and topic 3, embedded inside it.)

**Inference**

• We want to compute the posterior distribution of the hidden variables given a document:
$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$
• Unfortunately, this is intractable to compute in general. We write Eq. (3), the marginal probability of a document, as:
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta$$

**Variational inference**

• Provides an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
• Derives a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data, i.e. the marginal probability of the data given the model, with marginalization performed over the unobserved variables.
• This is typically used for model selection: the general idea is that a higher marginal likelihood for a given model indicates a better fit of the data by that model, and hence a greater probability that the model in question generated the data.

**Parameter estimation**

• Calculate the parameters of the model using training data:
• $\phi$: the topic–word distributions.
• $\theta$: the document–topic distributions.

**The collapsed Gibbs sampler**

• Using the conjugacy of the Dirichlet and multinomial distributions, integrate out the continuous parameters $\theta$ and $\phi$.
• This defines a distribution on discrete ensembles $\mathbf{z}$.

**The collapsed Gibbs sampler**

• Sample each $z_i$ conditioned on $\mathbf{z}_{-i}$ (see the sketch after this section).
• This is nicer than your average Gibbs sampler:
• memory: counts can be cached in two sparse matrices;
• optimization: no special functions, simple arithmetic;
• the distributions on $\theta$ and $\phi$ are analytic given $\mathbf{z}$ and $\mathbf{w}$, and can later be found for each sample.

**Gibbs sampling in LDA**

(Animation, repeated over several slides: the sampler sweeps through the word tokens, resampling each topic assignment, shown at iterations 1, 2, …, 1000.)

**Effects of hyperparameters**

• $\alpha$ and $\beta$ control the relative sparsity of $\theta$ and $\phi$:
• smaller $\alpha$: fewer topics per document;
• smaller $\beta$: fewer words per topic.
• Good assignments $\mathbf{z}$ are a compromise in sparsity.

**Varying the hyperparameters**

(Figures: sample topic assignments under different hyperparameter settings; decreasing $\alpha$ or $\beta$ increases sparsity.)
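The collapsed Gibbs sampler just described can be sketched in a few dozen lines. This is a minimal illustration, assuming the standard full conditional $p(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \propto (n_{d,t} + \alpha)\,(n_{t,w} + \beta)\,/\,(n_t + V\beta)$; the function name, count-matrix layout, and toy corpus are illustrative, and a practical implementation would use the sparse count caching mentioned above.

```python
import numpy as np

def lda_gibbs(docs, V, T, alpha=0.1, beta=0.01, iters=1000, seed=0):
    """docs: list of documents, each a list of word indices in {0, ..., V-1}."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dt = np.zeros((D, T), dtype=int)   # words in doc d assigned to topic t
    n_tw = np.zeros((T, V), dtype=int)   # times word w is assigned to topic t
    n_t = np.zeros(T, dtype=int)         # total words assigned to topic t
    z = []                               # topic assignment for every token

    # Random initialization of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = rng.integers(T, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            n_dt[d, t] += 1
            n_tw[t, w] += 1
            n_t[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                # Remove the token's current assignment (condition on z_-i).
                n_dt[d, t_old] -= 1
                n_tw[t_old, w] -= 1
                n_t[t_old] -= 1
                # Full conditional over topics: simple arithmetic on cached counts.
                p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t_new = rng.choice(T, p=p / p.sum())
                z[d][i] = t_new
                n_dt[d, t_new] += 1
                n_tw[t_new, w] += 1
                n_t[t_new] += 1

    # theta and phi are analytic given z and w (posterior means).
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)
    phi = (n_tw + beta) / (n_tw.sum(axis=1, keepdims=True) + V * beta)
    return theta, phi

# Toy usage: 4 tiny documents over a 6-word vocabulary, 2 topics.
docs = [[0, 1, 0, 2], [1, 0, 2, 2], [3, 4, 5, 3], [4, 5, 3, 4]]
theta, phi = lda_gibbs(docs, V=6, T=2, iters=200)
print(np.round(theta, 2))
```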
**Document modeling**

• Unlabeled data – our goal is density estimation.
• Compute the perplexity of a held-out test set to evaluate the models – a lower perplexity score indicates better generalization (see the sketch at the end of this section).

**Document modeling – cont.: data used**

• C. elegans Community abstracts: 5,225 abstracts, 28,414 unique terms.
• TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms.
• Held-out data: 10%.
• Removed terms: 50 stop words, and words appearing once (AP).
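As a rough illustration of the held-out evaluation above, the sketch below computes perplexity as $\exp\!\left(-\sum_d \log p(\mathbf{w}_d) \,/\, \sum_d N_d\right)$. For simplicity it scores each test document with fixed topic proportions rather than integrating them out as the variational bound does, and all names and toy numbers are illustrative assumptions.

```python
import numpy as np

def doc_log_likelihood(doc, theta_d, phi):
    """log p(w_d) with fixed per-document topic proportions theta_d and topics phi."""
    word_probs = theta_d @ phi               # p(w) = sum_t theta_d[t] * phi[t, w]
    return float(np.log(word_probs[doc]).sum())

def perplexity(test_docs, thetas, phi):
    """exp(- total log-likelihood / total number of test tokens); lower is better."""
    total_ll = sum(doc_log_likelihood(doc, thetas[d], phi)
                   for d, doc in enumerate(test_docs))
    total_tokens = sum(len(doc) for doc in test_docs)
    return float(np.exp(-total_ll / total_tokens))

# Toy usage with a 2-topic, 6-word model (e.g. theta and phi from the Gibbs sketch above).
phi = np.array([[0.30, 0.30, 0.30, 0.05, 0.03, 0.02],
                [0.02, 0.03, 0.05, 0.30, 0.30, 0.30]])
thetas = np.array([[0.9, 0.1], [0.1, 0.9]])
test_docs = [[0, 1, 2, 0], [3, 4, 5, 5]]
print(perplexity(test_docs, thetas, phi))
```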