Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Chong Wang and David M. Blei, NIPS 2009
Discussion led by Chunping Wang, ECE, Duke University, March 26, 2010

Outline
• Motivations
• LDA and HDP-LDA
• Sparse Topic Models
• Inference Using Collapsed Gibbs sampling
• Experiments
• Conclusions

Motivations
• Topic modeling with the “bag of words” assumption
• An extension of the HDP-LDA model
• In the LDA and HDP-LDA models, the topics are drawn from an exchangeable Dirichlet distribution with a scale parameter. As this parameter approaches zero, topics become
  • sparse: most probability mass falls on only a few terms
  • less smooth: empirical counts dominate
• Goal: to decouple sparsity and smoothness so that both properties can be achieved at the same time.
• How: a Bernoulli variable is introduced for each term and each topic.
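The link between a small scale parameter and sparse topics can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names, the vocabulary size of 50, and the 90% mass threshold are all assumptions chosen for illustration. It draws topics from an exchangeable Dirichlet and counts how many terms are needed to cover most of the probability mass.

```python
import random


def sample_symmetric_dirichlet(scale, size, rng):
    """Draw from an exchangeable Dirichlet by normalizing Gamma(scale, 1) draws."""
    while True:
        draws = [rng.gammavariate(scale, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0.0:  # redraw in the rare case that every Gamma draw underflows
            return [x / total for x in draws]


def terms_covering(topic, mass=0.9):
    """Smallest number of terms whose probabilities sum to at least `mass`."""
    running = 0.0
    for count, p in enumerate(sorted(topic, reverse=True), start=1):
        running += p
        if running >= mass:
            return count
    return len(topic)


def mean_terms_for_mass(scale, vocab_size=50, draws=200, seed=0):
    """Average, over several sampled topics, of the number of terms covering 90% of the mass."""
    rng = random.Random(seed)
    counts = [
        terms_covering(sample_symmetric_dirichlet(scale, vocab_size, rng))
        for _ in range(draws)
    ]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    for scale in (0.05, 1.0, 10.0):
        print(scale, mean_terms_for_mass(scale))
```

With a small scale the mass concentrates on a handful of terms, and with a large scale it spreads over most of the vocabulary, so a single parameter controls both properties at once.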

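LDA and HDP-LDA
• LDA: specifies a distribution for each topic, each document, and each word.
• HDP-LDA: specifies a base measure, the topics, the weights, and a distribution for each document and each word.
• HDP-LDA is the nonparametric form of LDA, with the number of topics unbounded.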
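For reference, the two generative processes can be written out as follows. This is a sketch in one common notation, and every symbol in it is an assumption rather than something recovered from the slide: $\beta_k$ is a topic, $\theta_d$ the topic proportions of document $d$, $z_{di}$ and $w_{di}$ the topic assignment and word at position $i$, $\alpha$ the top-level weights, and $\gamma$, $\alpha_0$, $\lambda$, $\tau$ hyper-parameters.

```latex
\begin{align*}
\text{LDA:}\quad
  & \beta_k \sim \mathrm{Dirichlet}(\gamma), \quad k = 1, \dots, K \\
  & \theta_d \sim \mathrm{Dirichlet}(\alpha_0) \\
  & z_{di} \sim \mathrm{Mult}(\theta_d), \quad w_{di} \sim \mathrm{Mult}(\beta_{z_{di}}) \\[1ex]
\text{HDP-LDA:}\quad
  & \beta_k \sim \mathrm{Dirichlet}(\gamma), \quad k = 1, 2, \dots \\
  & \alpha \sim \mathrm{GEM}(\lambda) \\
  & \theta_d \sim \mathrm{DP}(\tau, \alpha) \\
  & z_{di} \sim \mathrm{Mult}(\theta_d), \quad w_{di} \sim \mathrm{Mult}(\beta_{z_{di}})
\end{align*}
```

The only structural difference is that HDP-LDA replaces the fixed number of topics $K$ with a countably infinite set whose weights are shared across documents.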

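Sparse Topic Models
• The size of the vocabulary is V.
• HDP-LDA: each topic is defined on the (V-1)-simplex.
• Sparse TM: each topic is defined on a sub-simplex specified by b, a V-length binary vector composed of V Bernoulli variables, with one selection proportion for each topic.
• Sparsity: the pattern of ones in b, controlled by the selection proportion.
• Smoothness: enforced over the terms with non-zero b's through the Dirichlet scale parameter.
• Sparsity and smoothness are therefore decoupled.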
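Drawing one sparse topic might look like the sketch below. It assumes a Beta(r, s) prior on the selection proportion and a Dirichlet scale named gamma; these names, and the function itself, are illustrative assumptions. Forcing one term on when no term is selected is a workaround of this sketch only, not something stated in the slides.

```python
import random


def sample_sparse_topic(vocab_size, r, s, gamma, rng):
    """Draw selectors b and a topic supported only on the selected terms."""
    pi = rng.betavariate(r, s)  # selection proportion for this topic
    b = [1 if rng.random() < pi else 0 for _ in range(vocab_size)]
    if sum(b) == 0:
        # Workaround of this sketch only: keep at least one term switched on.
        b[rng.randrange(vocab_size)] = 1
    # Dirichlet on the sub-simplex: Gamma draws for selected terms, zero elsewhere.
    draws = [rng.gammavariate(gamma, 1.0) if bit else 0.0 for bit in b]
    total = sum(draws)
    beta = [x / total for x in draws]
    return b, beta


if __name__ == "__main__":
    b, beta = sample_sparse_topic(10, r=1.0, s=1.0, gamma=0.5, rng=random.Random(1))
    print(b)
    print([round(p, 3) for p in beta])
```

Here the number of non-zero entries is governed by the selection proportion, while gamma only shapes how evenly the mass is spread over the selected terms.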

Inference Using Collapsed Gibbs sampling
As in HDP-LDA:
• Topic proportions and topic distributions are integrated out.
• The direct-assignment method based on the Chinese restaurant franchise (CRF) is used for the topic assignments and an augmented variable, the table counts.

Inference Using Collapsed Gibbs sampling
• Notation:
  • $n_{dk}$: # of customers (words) in restaurant d (document) eating dish k (topic)
  • $m_{dk}$: # of tables in restaurant d serving dish k
  • $n_{d\cdot}$, $m_{d\cdot}$, $m_{\cdot k}$, $m_{\cdot\cdot}$: marginal counts represented with dots
  • K, u: current # of topics and new topic index, respectively
  • $n_k^{(v)}$: # of times that term v has been assigned to topic k
  • $n_k^{(\cdot)}$: # of times that all the terms have been assigned to topic k
• The sampler uses the conditional density of a word under topic k given all data except that word.
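The word-level counts in this notation can be tallied directly from the current topic assignments. The sketch below is illustrative: the function name and the data layout (one list of term ids and one list of topic ids per document) are assumptions. The table counts are an augmented variable and cannot be derived from the assignments alone, so they are left out.

```python
from collections import Counter


def count_statistics(docs, assignments):
    """Tally counts from docs[d][i] (term id) and assignments[d][i] (topic id)."""
    n_dk = Counter()  # (document d, topic k) -> # of words in d assigned to k
    n_kv = Counter()  # (topic k, term v)     -> # of times v is assigned to k
    n_k = Counter()   # topic k               -> # of words assigned to k
    for d, (words, topics) in enumerate(zip(docs, assignments)):
        for v, k in zip(words, topics):
            n_dk[(d, k)] += 1
            n_kv[(k, v)] += 1
            n_k[k] += 1
    return n_dk, n_kv, n_k


if __name__ == "__main__":
    docs = [[0, 1, 1], [2, 0]]
    assignments = [[0, 0, 1], [1, 1]]
    n_dk, n_kv, n_k = count_statistics(docs, assignments)
    print(dict(n_dk), dict(n_kv), dict(n_k))
```

In a Gibbs sweep these counters would be decremented for the word being resampled and incremented again after its new topic is chosen.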

Inference Using Collapsed Gibbs sampling
• Recall the direct-assignment sampling method for HDP-LDA:
  • Sampling topic assignments: if a new topic is sampled, then sample a stick-breaking proportion, use it to split the unassigned weight between the new topic and the remainder, and increment K.
  • Sampling stick lengths
  • Sampling table counts
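The new-topic step of the direct-assignment scheme can be sketched as below, following the usual description of that method for the HDP. The function name, the `weights` layout (existing topic weights followed by the unassigned mass), and the `concentration` argument are assumptions of this sketch.

```python
import random


def add_new_topic(weights, concentration, rng):
    """Split the unassigned mass when a new topic is instantiated.

    `weights` holds the weights of the K existing topics followed by the
    unassigned mass as its last entry.
    """
    *existing, unassigned = weights
    nu = rng.betavariate(1.0, concentration)  # stick-breaking proportion
    new_topic_weight = nu * unassigned
    remaining = (1.0 - nu) * unassigned
    return existing + [new_topic_weight, remaining]


if __name__ == "__main__":
    print(add_new_topic([0.5, 0.3, 0.2], concentration=1.0, rng=random.Random(0)))
```

The existing weights are untouched and the total mass is preserved, so the returned list describes K + 1 topics plus a smaller unassigned remainder.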

Inference Using Collapsed Gibbs sampling
• Sampling topic assignments requires the conditional density of a word under each topic, which has one form for HDP-LDA and another for the sparse TM.
• Sampling the selectors b directly is straightforward. Instead, the authors integrate out b for faster convergence.
• Since there are $2^V$ possible values of b in total, this is the central computational challenge for the sparse TM.
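For intuition, the two conditional densities can be compared when the selectors are held fixed. This is a sketch under standard Dirichlet-multinomial conjugacy, not the collapsed expression from the slides: the function names and the scale name `gamma` are assumptions, `term_counts` is assumed to exclude the word being resampled, and counts are assumed to be zero for terms whose selector is off.

```python
def predictive_hdp_lda(v, term_counts, gamma):
    """Probability of term v under a topic with a Dirichlet(gamma) prior on all V terms."""
    vocab_size = len(term_counts)
    return (term_counts[v] + gamma) / (sum(term_counts) + vocab_size * gamma)


def predictive_given_selectors(v, term_counts, selectors, gamma):
    """Same quantity when the prior covers only the terms with selector 1."""
    if not selectors[v]:
        return 0.0  # a term that is switched off has no mass in this topic
    n_on = sum(selectors)
    return (term_counts[v] + gamma) / (sum(term_counts) + n_on * gamma)


if __name__ == "__main__":
    counts = [3, 0, 1, 0]
    selectors = [1, 0, 1, 1]
    for v in range(len(counts)):
        print(v, predictive_hdp_lda(v, counts, 0.5),
              predictive_given_selectors(v, counts, selectors, 0.5))
```

Integrating the selectors out means averaging the second function over every compatible selector vector. With V = 4 that is at most 2**4 = 16 vectors, but at the vocabulary sizes used in the experiments direct enumeration is infeasible, which is why a closed-form treatment is needed.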

Inference Using Collapsed Gibbs sampling
• Define the set of vocabulary terms that have word assignments in topic k.
• The conditional probability, written in terms of this set, depends on the selector proportions.

Inference Using Collapsed Gibbs sampling
• Sampling the Bernoulli parameter (using b as an auxiliary variable)
• Sampling hyper-parameters
  • one group: with Gamma(1,1) priors
  • another group: Metropolis-Hastings using a symmetric Gaussian proposal
• Estimating topic distributions from any single sample of z and b
  • define the set of terms with an “on” b
  • first sampling step: sparsity
  • second sampling step: smoothness on the selected terms
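Given one sample of the selectors and the term counts for a topic, both the selection proportion and the topic itself have simple conjugate draws. The sketch below assumes a Beta(r, s) prior on the selection proportion and a Dirichlet scale named gamma; these names and both functions are illustrative assumptions, and the slides' exact update may differ.

```python
import random


def sample_selection_proportion(selectors, r, s, rng):
    """Beta posterior for the selection proportion given the V Bernoulli selectors."""
    on = sum(selectors)
    off = len(selectors) - on
    return rng.betavariate(r + on, s + off)


def sample_topic(term_counts, selectors, gamma, rng):
    """Dirichlet draw restricted to the terms whose selector is on."""
    draws = [
        rng.gammavariate(term_counts[v] + gamma, 1.0) if selectors[v] else 0.0
        for v in range(len(selectors))
    ]
    total = sum(draws)  # assumes at least one selector is on
    return [x / total for x in draws]


if __name__ == "__main__":
    rng = random.Random(0)
    selectors = [1, 0, 1, 1]
    counts = [3, 0, 1, 0]
    print(sample_selection_proportion(selectors, 1.0, 1.0, rng))
    print(sample_topic(counts, selectors, 0.5, rng))
```

The selectors decide which entries can be non-zero, and gamma smooths the estimate only over those entries, so a selected term with no counts still receives some mass.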

Experiments
Four datasets:
• arXiv: online research abstracts, D = 2500, V = 2873
• Nematode Biology: research abstracts, D = 2500, V = 2944
• NIPS: NIPS articles between 1988-1999, V = 5005. 20% of words for each paper are used.
• Conf. abstracts: abstracts from CIKM, ICML, KDD, NIPS, SIGIR and WWW, between 2005-2008, V = 3733.
Two predictive quantities are computed, defined in terms of the topic complexity.
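Later slides report perplexity. A minimal sketch of the standard per-word held-out perplexity is given below; whether it matches the slides' exact definition is an assumption, and the function name and arguments are illustrative.

```python
import math


def per_word_perplexity(log_likelihoods, lengths):
    """exp of the negative average log-likelihood per held-out word.

    log_likelihoods[d] is the natural-log likelihood of held-out document d,
    and lengths[d] is its number of words.
    """
    return math.exp(-sum(log_likelihoods) / sum(lengths))


if __name__ == "__main__":
    # A model that spreads mass uniformly over 4 terms, scored on a 3-word document.
    print(per_word_perplexity([3 * math.log(0.25)], [3]))
```

Lower values are better: a uniform model over V terms scores a perplexity of V, so any useful model should fall well below the vocabulary size.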

Experiments
• Sparse TM: better perplexity, simpler models
• A larger Dirichlet scale parameter: smoother topics, fewer topics
• Similar # of terms

Experiments
When the Dirichlet scale parameter is small (<0.01):
• Lack of smoothness
• More topics are needed to explain all kinds of patterns of empirical word counts
• Infrequent words populate “noise” topics.

Conclusions
• A new topic model in the HDP-LDA framework, based on the “bag of words” assumption
• Main contributions:
  • Decoupling the control of sparsity and smoothness by introducing binary selectors for term assignments in each topic
  • Developing a collapsed Gibbs sampler in the HDP-LDA framework
• Held-out performance is better than that of HDP-LDA.
