
Correlated Topic Models By Blei and Lafferty (NIPS 2005)



Presentation Transcript


1. Correlated Topic Models, by Blei and Lafferty (NIPS 2005). Presented by Chunping Wang, ECE, Duke University, August 4th, 2006.

2. Outline
• Introduction
• Latent Dirichlet Allocation (LDA)
• Correlated Topic Models (CTM)
• Experimental Results
• Conclusions

3. Introduction (1)
• Topic models: generative probabilistic models that use a small number of distributions over a vocabulary to describe text collections and other discrete data (such as images). Latent variables are typically introduced to capture abstract notions such as topics.
• Applications: document modeling, text classification, image processing, collaborative filtering, etc.
• Latent Dirichlet Allocation (LDA): allows each document to exhibit multiple topics, but ignores correlations between topics.
• Correlated Topic Models (CTM): builds on LDA and addresses this limitation.

4. Introduction (2)
Notation and terminology (text collections):
• Word: the basic unit, drawn from a vocabulary of V distinct words. The vth word is represented by a V-dimensional unit-basis vector w with w^v = 1 and w^u = 0 for u ≠ v.
• Document: a sequence of N words.
• Corpus: a collection of M documents.
Assumptions:
• The words in a document are exchangeable;
• Documents are also exchangeable.
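A minimal sketch of the bag-of-words representation implied by the exchangeability assumption (the vocabulary and document here are toy examples of my own, not from the paper):

```python
import numpy as np

# Hypothetical toy vocabulary of V distinct words.
vocab = ["gene", "protein", "neuron", "model", "data"]
V = len(vocab)
word_to_id = {w: i for i, w in enumerate(vocab)}

# A document is a sequence of N words; exchangeability means only counts matter.
doc = ["gene", "protein", "gene", "data", "model"]

counts = np.zeros(V, dtype=int)
for w in doc:
    counts[word_to_id[w]] += 1   # unit-basis vectors summed into a count vector

print(counts)   # -> [2 1 0 1 1]
```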

5. Latent Dirichlet Allocation (LDA) (1)
[Graphical model figure: legend distinguishes fixed known parameters, fixed unknown parameters, and random variables (only w is observable).]
Generative process for each document w in a corpus D:
• Choose θ ~ Dirichlet(α)
• For each of the N words w_n:
  (a) Choose a topic index z_n ~ Multinomial(θ)
  (b) Choose a word w_n ~ p(w_n | z_n, β)
• θ is a document-level variable; z and w are word-level variables.
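A minimal sketch of this generative process (the sizes K, V, N and the values of α and β below are arbitrary choices for illustration, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K topics, vocabulary of V words, N words per document.
K, V, N = 3, 10, 20
alpha = np.full(K, 0.5)                      # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), size=K)     # one distribution over words per topic

def generate_lda_document():
    theta = rng.dirichlet(alpha)             # 1. theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)           # 2a. topic index z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])         # 2b. word w_n ~ p(w_n | z_n, beta)
        words.append(w)
    return words

print(generate_lda_document())
```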

6. Latent Dirichlet Allocation (LDA) (2)
Pros:
• The Dirichlet distribution is in the exponential family and conjugate to the multinomial distribution, so variational inference is tractable.
• The topic proportions θ are document-specific, so the variational parameters of θ can be regarded as the representation of a document, giving a reduced feature set.
• The topic indices z are sampled repeatedly within a document, so one document can be associated with multiple topics.
Cons:
• Because of the independence assumption implicit in the Dirichlet distribution, LDA is unable to capture correlations between different topics.

7. Correlated Topic Models (CTM) (1)
Key point: the topic proportions are drawn from a logistic normal distribution rather than a Dirichlet distribution.
Definition of the logistic normal distribution:
• Let R^k denote k-dimensional real space, and S^{k-1} the (k-1)-dimensional positive simplex defined by S^{k-1} = { θ = (θ_1, ..., θ_k) : θ_i > 0 for all i, Σ_i θ_i = 1 }.
• Suppose that x = (x_1, ..., x_{k-1}) follows a multivariate normal distribution over R^{k-1}. The logistic transformation from R^{k-1} to S^{k-1} can then be used to define a logistic normal distribution over S^{k-1}.

8. Correlated Topic Models (CTM) (2)
Logistic transformation: θ_i = exp(x_i) / (1 + Σ_{j=1}^{k-1} exp(x_j)) for i = 1, ..., k-1, and θ_k = 1 / (1 + Σ_{j=1}^{k-1} exp(x_j)).
Inverse (log-ratio) transformation: x_i = log(θ_i / θ_k), i = 1, ..., k-1.
The density function of θ is
p(θ; μ, Σ) = |2πΣ|^{-1/2} (θ_1 θ_2 ⋯ θ_k)^{-1} exp{ -(1/2) [log(θ_{-k}/θ_k) - μ]^T Σ^{-1} [log(θ_{-k}/θ_k) - μ] },
where log(θ_{-k}/θ_k) denotes the vector (log(θ_1/θ_k), ..., log(θ_{k-1}/θ_k)).
The logistic normal distribution is defined over the same simplex as the Dirichlet distribution, but it allows correlation between components.
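A minimal numerical sketch of drawing topic proportions through the logistic transformation (the values of μ and Σ are my own toy choices); unlike Dirichlet draws, the components can be explicitly correlated through the off-diagonal entries of Σ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for k = 3 topics, so x lives in R^{k-1} = R^2.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],        # positive off-diagonal => correlated components
                  [0.8, 1.0]])

def sample_logistic_normal(n):
    x = rng.multivariate_normal(mu, Sigma, size=n)    # x ~ N(mu, Sigma)
    e = np.exp(x)
    denom = 1.0 + e.sum(axis=1, keepdims=True)
    theta = np.hstack([e / denom, 1.0 / denom])       # logistic transformation
    return theta                                      # each row lies on the simplex

theta = sample_logistic_normal(5000)
print(theta[0], theta[0].sum())                       # one point on the simplex
print(np.corrcoef(theta[:, 0], theta[:, 1])[0, 1])    # induced correlation between components
```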

9. Correlated Topic Models (CTM) (3)
Generative process for each document w in a corpus D:
• Choose η ~ N(μ, Σ)
• For each of the N words w_n:
  (a) Choose a topic z_n ~ Multinomial(f(η)), where f(η)_i = exp(η_i) / Σ_j exp(η_j)
  (b) Choose a word w_n ~ Multinomial(β_{z_n})
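A sketch of the CTM generative process, mirroring the LDA sketch above (again with arbitrary toy sizes and parameters of my choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameters: K topics, vocabulary of V words, N words per document.
K, V, N = 3, 10, 20
mu = np.zeros(K)
Sigma = 0.5 * np.eye(K) + 0.4                 # covariance encoding topic correlations
beta = rng.dirichlet(np.ones(V), size=K)      # per-topic distributions over words

def generate_ctm_document():
    eta = rng.multivariate_normal(mu, Sigma)  # 1. eta ~ N(mu, Sigma)
    theta = np.exp(eta) / np.exp(eta).sum()   #    map to the simplex via f(eta)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)            # 2a. topic z_n ~ Mult(f(eta))
        w = rng.choice(V, p=beta[z])          # 2b. word w_n ~ Mult(beta_{z_n})
        words.append(w)
    return words

print(generate_ctm_document())
```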

10. Correlated Topic Models (CTM) (4)
Posterior inference (for η and z in each document) via variational inference, with the factorized variational distribution
q(η, z | λ, ν², φ) = ∏_{i=1}^{K} q(η_i | λ_i, ν_i²) ∏_{n=1}^{N} q(z_n | φ_n),
where each q(η_i | λ_i, ν_i²) is Gaussian and each q(z_n | φ_n) is multinomial.
Difficulty: the logistic normal is not conjugate to the multinomial (not exponential-family conjugate).
Solution: lower bound the objective by bounding the log normalizer with a first-order Taylor expansion (using the concavity of log):
log Σ_i exp(η_i) ≤ ζ^{-1} Σ_i exp(η_i) - 1 + log ζ, for any ζ > 0.
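A quick numerical check of this Taylor bound (the vector η below is an arbitrary toy example):

```python
import numpy as np

# Check: log(sum_i exp(eta_i)) <= zeta^{-1} * sum_i exp(eta_i) - 1 + log(zeta) for any zeta > 0.
eta = np.array([0.3, -1.2, 2.0])
lhs = np.log(np.exp(eta).sum())

for zeta in [0.5, np.exp(eta).sum(), 20.0]:
    rhs = np.exp(eta).sum() / zeta - 1.0 + np.log(zeta)
    print(f"zeta={zeta:8.3f}  log-sum-exp={lhs:.4f}  bound={rhs:.4f}  holds={lhs <= rhs + 1e-12}")
# The bound is tight when zeta equals sum_i exp(eta_i).
```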

11. Correlated Topic Models (CTM) (5)
Parameter estimation (for the model parameters μ, Σ, and β) by maximizing the likelihood of the entire corpus of documents, using variational EM:
1. (E-step) For each document, maximize the lower bound with respect to the variational parameters (λ, ν², φ, ζ);
2. (M-step) Maximize the lower bound on the likelihood of the entire corpus with respect to the model parameters μ, Σ, and β.
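As a rough sketch of the corpus-level M-step for μ and Σ under the factorized Gaussian variational posterior (the per-document E-step outputs λ_d and ν_d² below are simulated placeholders, not real fitted values, and the update shown is the standard expected-Gaussian maximum-likelihood form rather than a line-by-line reproduction of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical E-step output for D documents and K topics:
# lam[d] = variational means of eta_d, nu2[d] = variational variances.
D, K = 100, 3
lam = rng.normal(size=(D, K))
nu2 = rng.uniform(0.1, 0.5, size=(D, K))

# M-step sketch for the logistic normal parameters: mean of the variational
# means, and the expected covariance under the variational posterior.
mu_hat = lam.mean(axis=0)
diff = lam - mu_hat
Sigma_hat = (np.einsum("di,dj->ij", diff, diff) + np.diag(nu2.sum(axis=0))) / D

print(mu_hat)
print(Sigma_hat)
```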

  12. Experimental Results (1) Example: Modeling Science

  13. Experimental Results (2) Comparison with LDA - Document modeling

14. Experimental Results (3) Comparison with LDA – Collaborative filtering
To evaluate how well the models predict the remaining words after observing a portion of a document, a measure is needed to compare the two models: the predictive perplexity of the held-out words. Lower numbers denote more predictive power.
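A minimal sketch of how such a predictive perplexity could be computed; the interface here is hypothetical (in practice the per-word predictive probabilities would come from the fitted LDA or CTM model):

```python
import numpy as np

def predictive_perplexity(log_probs):
    """Perplexity of held-out words given per-word predictive log probabilities.

    log_probs: array of log p(w | observed portion of the document) for each
    held-out word, pooled over all test documents (hypothetical interface).
    """
    log_probs = np.asarray(log_probs)
    return float(np.exp(-log_probs.mean()))

# Toy illustration: a model assigning higher probability to the held-out words
# obtains a lower (better) perplexity.
print(predictive_perplexity(np.log([0.02, 0.05, 0.01])))   # weaker model
print(predictive_perplexity(np.log([0.10, 0.20, 0.08])))   # stronger model
```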

15. Conclusions
• The main contribution of this paper is that the CTM directly models correlations between topics via the logistic normal distribution.
• At the same time, the nonconjugacy of the logistic normal distribution adds complexity to the variational inference procedure.
• Like LDA, the CTM allows multiple topics for each document, and its variational parameters can serve as features of the document.

16. References
J. Aitchison and S. M. Shen. Logistic-normal distributions: some properties and uses. Biometrika, vol. 67, no. 2, pp. 261-272, 1980.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
