Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

**Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process**
Chong Wang and David M. Blei, NIPS 2009
Discussion led by Chunping Wang, ECE, Duke University, March 26, 2010

**Outline**
• Motivations
• LDA and HDP-LDA
• Sparse Topic Models
• Inference Using Collapsed Gibbs Sampling
• Experiments
• Conclusions

**Motivations**
• Topic modeling with the "bag of words" assumption
• An extension of the HDP-LDA model
• In the LDA and HDP-LDA models, the topics are drawn from an exchangeable Dirichlet distribution with a scale parameter. As that parameter approaches zero, topics become
  • sparse: most probability mass on only a few terms
  • less smooth: empirical counts dominate
• Goal: decouple sparsity and smoothness so that both properties can be achieved at the same time
• How: a Bernoulli variable is introduced for each term in each topic

**LDA and HDP-LDA**
LDA
• topic: a distribution over the vocabulary, drawn from an exchangeable Dirichlet
• document: a distribution over topics, drawn from a Dirichlet
• word: a topic is drawn from the document's topic proportions, then the word is drawn from that topic
HDP-LDA
• topics drawn from the base measure; global topic weights drawn from a stick-breaking prior; each document draws its topic proportions from a DP centered on those weights
• Nonparametric form of LDA, with the number of topics unbounded

**Sparse Topic Models**
• The size of the vocabulary is V
• In (HDP-)LDA a topic is defined on a (V−1)-simplex; in the sparse TM it is defined on a sub-simplex specified by a V-length binary vector composed of V Bernoulli variables, with one selection proportion per topic
• Sparsity: the pattern of ones in the binary vector, controlled by the selection proportion
• Smoothness: enforced over the terms with non-zero selectors through the Dirichlet scale parameter
• Decoupled!

**Inference Using Collapsed Gibbs Sampling**
As in the HDP-LDA:
• Topic proportions and topic distributions are integrated out.
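Once the topic distribution is integrated out, a word's predictive probability under a topic reduces to a smoothed count ratio. A minimal sketch of this standard Dirichlet-multinomial collapsing for the smooth (non-sparse) case; in the sparse TM the analogous ratio runs only over the selected terms, and the selectors themselves are also summed out. The function name and the toy numbers here are illustrative, not the paper's code:

```python
def collapsed_word_prob(n_kv, n_k_dot, V, gamma):
    """Predictive probability of term v under topic k with the topic
    distribution integrated out: (n_kv + gamma) / (n_k. + V * gamma).
    n_kv   -- # of times term v is currently assigned to topic k
    n_k_dot-- total # of words currently assigned to topic k
    gamma  -- Dirichlet scale (smoothing) parameter over the vocabulary
    """
    return (n_kv + gamma) / (n_k_dot + V * gamma)

# toy check: with no observed counts the predictive is uniform (1/V)
V, gamma = 8, 0.5
probs = [collapsed_word_prob(0, 0, V, gamma) for _ in range(V)]
assert abs(sum(probs) - 1.0) < 1e-12
```

As the scale parameter shrinks toward zero, this ratio is dominated by the empirical counts, which is exactly the "less smooth" behavior the motivations slide describes.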
• The direct-assignment method based on the Chinese restaurant franchise (CRF) is used for the topic assignments, together with an augmented variable, the table counts.

**Inference Using Collapsed Gibbs Sampling**
• Notation:
  • n_dk: # of customers (words) in restaurant d (document) eating dish k (topic)
  • m_dk: # of tables in restaurant d serving dish k
  • marginal counts are represented with dots
  • K, u: current # of topics and the new-topic index, respectively
  • n_kv: # of times that term v has been assigned to topic k
  • n_k·: # of times that all terms have been assigned to topic k
  • f_k(w): conditional density of word w under topic k given all data except w

**Inference Using Collapsed Gibbs Sampling**
• Recall the direct-assignment sampling method for the HDP-LDA:
  • Sampling topic assignments; if a new topic is sampled, then sample its stick weight, increment K, and update the remaining stick mass
  • Sampling stick lengths
  • Sampling table counts
• Sampling topic assignments is straightforward for HDP-LDA, but not for the sparse TM. Instead of conditioning on the binary selectors, the authors integrate them out for faster convergence. Since there are 2^V possible selector vectors per topic, this is the central computational challenge for the sparse TM.

**Inference Using Collapsed Gibbs Sampling**
• Define the vocabulary set of terms that have word assignments in topic k; the resulting conditional probability of a topic assignment depends on the selector proportions.

**Inference Using Collapsed Gibbs Sampling**
• Sampling the Bernoulli parameter (using the selectors as an auxiliary variable):
  • sample the selectors conditioned on the current selection proportion;
  • sample the selection proportion conditioned on the selectors.
• Sampling hyper-parameters:
  • concentration parameters: Gamma(1,1) priors
  • Dirichlet scale parameter: Metropolis-Hastings using a symmetric Gaussian proposal
• Estimate topic distributions from any single sample of the topic assignments and selectors; over the set of terms with an "on" selector, sparsity comes from the selectors and smoothness is enforced on the selected terms.

**Experiments**
Four datasets:
• arXiv: online research abstracts, D = 2500, V = 2873
• Nematode Biology: research abstracts, D = 2500, V = 2944
• NIPS: NIPS articles from 1988-1999, V = 5005.
  20% of the words from each paper are used.
• Conf. abstracts: abstracts from CIKM, ICML, KDD, NIPS, SIGIR and WWW, 2005-2008, V = 3733.
Two predictive quantities are reported: held-out perplexity and topic complexity.

**Experiments**
• The sparse TM achieves better perplexity with simpler models.
• A larger Dirichlet scale parameter yields smoother topics, fewer topics, and a similar number of terms per topic.

**Experiments**
• In HDP-LDA the estimated Dirichlet scale parameter is small (< 0.01), indicating a lack of smoothness.
• More topics are then needed to explain all the patterns of empirical word counts.
• Infrequent words populate "noise" topics.

**Conclusions**
• A new topic model in the HDP-LDA framework, based on the "bag of words" assumption.
• Main contributions:
  • Decoupling the control of sparsity and smoothness by introducing binary selectors for term assignments in each topic;
  • Developing a collapsed Gibbs sampler in the HDP-LDA framework.
• Held-out performance is better than that of the HDP-LDA.
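To make the decoupling summarized above concrete, here is a toy generative draw of a single sparse topic: a Bernoulli selector per vocabulary term fixes the sparsity pattern, and a Dirichlet over the selected terms alone controls smoothness. This is an illustrative sketch under assumed parameter values, not the paper's code; the function name and the guard against an all-zero selector are my own additions:

```python
import numpy as np

def draw_sparse_topic(V, pi_k, gamma, rng):
    """Draw one topic under the sparse-TM generative idea:
    a Bernoulli(pi_k) selector per term (sparsity), then a
    Dirichlet(gamma) over the selected sub-simplex (smoothness)."""
    b = rng.binomial(1, pi_k, size=V)      # term selectors
    if b.sum() == 0:                       # guard: keep at least one term on
        b[rng.integers(V)] = 1
    on = np.flatnonzero(b)
    beta = np.zeros(V)
    beta[on] = rng.dirichlet(np.full(on.size, gamma))
    return b, beta

rng = np.random.default_rng(0)
b, beta = draw_sparse_topic(V=10, pi_k=0.3, gamma=1.0, rng=rng)
assert np.allclose(beta[b == 0], 0.0)      # sparsity: exact zeros off-support
assert abs(beta.sum() - 1.0) < 1e-9       # still a distribution over terms
```

Changing `pi_k` alters how many terms a topic uses, while changing `gamma` alters how evenly mass spreads over the selected terms, independently of each other.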