Generative Topic Models for Community Analysis
Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt
Objectives
• Cultural literacy for ML:
  • Q: What are “topic models”?
  • A1: a popular indoor sport for machine learning researchers
  • A2: a particular way of applying unsupervised learning of Bayes nets to text
• Quick historical survey of some sample papers in the area
Outline
• Part I: Introduction to Topic Models
  • Naive Bayes model
  • Mixture models
  • Expectation Maximization
  • PLSA
  • LDA
    • Variational EM
    • Gibbs sampling
• Part II: Topic Models for Community Analysis
  • Citation modeling with PLSA
  • Citation modeling with LDA
  • Author Topic Model
  • Author Topic Recipient Model
  • Modeling influence of citations
  • Mixed membership Stochastic Block Model
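Introduction to Topic Models
• Multinomial Naïve Bayes
  • For each document d = 1, …, M
    • Generate C_d ~ Mult(· | π)
    • For each position n = 1, …, N_d
      • Generate w_n ~ Mult(· | β, C_d)
[Plate diagram: class node C with word nodes W1, W2, W3, …, WN as children, inside a plate over the M documents; β parameterizes the words]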
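The generative story above can be simulated directly. The following is a minimal sketch using only the Python standard library; the names `pi` (class prior) and `beta` (per-class word probabilities), the toy values, and the helper functions are assumptions made for illustration and are not taken from the slides.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def generate_naive_bayes_corpus(pi, beta, doc_lengths, seed=0):
    """Sample a (class, words) pair for each document.

    pi[c]      : prior probability of class c (assumed name)
    beta[c][v] : probability of word id v under class c (assumed name)
    """
    rng = random.Random(seed)
    corpus = []
    for n_words in doc_lengths:
        c = sample_categorical(pi, rng)  # C_d ~ Mult(. | pi)
        # Every word of the document is drawn from the same class: w_n ~ Mult(. | beta, C_d)
        words = [sample_categorical(beta[c], rng) for _ in range(n_words)]
        corpus.append((c, words))
    return corpus


# Toy parameters (hypothetical): 2 classes, vocabulary of 3 word ids.
pi = [0.5, 0.5]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
corpus = generate_naive_bayes_corpus(pi, beta, doc_lengths=[5, 7, 4])
```

Each document gets exactly one class, which is the restriction the later models relax.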
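Introduction to Topic Models
• Naïve Bayes model: compact representation
[Plate diagram: the word nodes W1, W2, W3, …, WN collapse into a single node W inside a plate of size N, nested in the document plate of size M; β parameterizes the words in both versions]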
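Introduction to Topic Models
• Mixture model: unsupervised naïve Bayes model
• Joint probability of words and classes: P(w, C) = ∏_d [ π_{C_d} ∏_n β_{C_d, w_n} ]
• But classes are not visible, so they are summed out as a latent variable Z: P(w) = ∏_d Σ_z [ π_z ∏_n β_{z, w_n} ]
[Plate diagram: the observed class C is replaced by a latent node Z; W sits in a plate of size N inside the document plate of size M; β parameterizes the words]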
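Because the classes are hidden, the mixture model is usually fit with Expectation Maximization (listed in the outline). The sketch below is one plain-Python way to do that for a mixture of multinomials. The function name, the `smoothing` constant, and the toy documents are assumptions for illustration, and the smoothing is added only to keep logarithms finite.

```python
import math
import random


def em_multinomial_mixture(docs, n_classes, vocab_size,
                           n_iters=50, smoothing=1e-2, seed=0):
    """Fit class priors pi and word distributions beta by EM.

    docs is a list of documents, each a list of word ids.
    Returns (pi, beta, resp) where resp[d][z] approximates P(z | document d).
    """
    rng = random.Random(seed)
    n_docs = len(docs)
    pi = [1.0 / n_classes] * n_classes
    beta = []
    for _ in range(n_classes):
        row = [rng.random() + 0.5 for _ in range(vocab_size)]
        total = sum(row)
        beta.append([x / total for x in row])
    resp = [[1.0 / n_classes] * n_classes for _ in docs]

    for _ in range(n_iters):
        # E-step: posterior over the hidden class of each document (log space).
        for d, doc in enumerate(docs):
            log_post = [math.log(pi[z]) + sum(math.log(beta[z][w]) for w in doc)
                        for z in range(n_classes)]
            top = max(log_post)
            unnorm = [math.exp(lp - top) for lp in log_post]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]

        # M-step: re-estimate parameters from expected counts.
        for z in range(n_classes):
            weight = sum(resp[d][z] for d in range(n_docs))
            pi[z] = (weight + smoothing) / (n_docs + n_classes * smoothing)
            counts = [smoothing] * vocab_size
            for d, doc in enumerate(docs):
                for w in doc:
                    counts[w] += resp[d][z]
            total = sum(counts)
            beta[z] = [c / total for c in counts]
    return pi, beta, resp


# Toy corpus (hypothetical): two documents about word 0, two about word 2.
docs = [[0, 0, 0, 1], [0, 0, 1, 0], [2, 2, 2, 1], [2, 2, 1, 2]]
pi, beta, resp = em_multinomial_mixture(docs, n_classes=2, vocab_size=3)
```

Documents with the same word counts always receive the same responsibilities, since the model ignores word order.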
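Introduction to Topic Models
• Probabilistic Latent Semantic Analysis (PLSA) model
  • Select document d ~ Mult(π)
  • For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
  • θ_d is the topic distribution of document d
[Plate diagram: d → z → w, with θ_d attached to d; w and z sit in a plate of size N nested in a plate of size M]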
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis (PLSA) model
  • Learning using EM
  • Not a complete generative model
    • Has a distribution only over the training set of documents: no new document can be generated!
  • Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!
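EM for PLSA alternates between computing a posterior over topics for every (document, word) pair and re-estimating the per-document topic mixtures and per-topic word distributions. The following is a minimal sketch under assumed names: `theta[d][z]` for P(z | d), `beta[z][w]` for P(w | z), and a tiny `floor` constant that only guards against division by zero.

```python
import random


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa_em(counts, n_topics, n_iters=100, floor=1e-12, seed=0):
    """Fit PLSA by EM on a document-by-word count matrix.

    counts[d][w] is the number of times word w occurs in document d.
    Returns (theta, beta): theta[d][z] ~ P(z | d), beta[z][w] ~ P(w | z).
    """
    rng = random.Random(seed)
    n_docs, vocab_size = len(counts), len(counts[0])
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    beta = [normalize([rng.random() + 0.1 for _ in range(vocab_size)])
            for _ in range(n_topics)]

    for _ in range(n_iters):
        new_theta = [[floor] * n_topics for _ in range(n_docs)]
        new_beta = [[floor] * vocab_size for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(vocab_size):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = normalize([theta[d][z] * beta[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_theta[d][z] += counts[d][w] * post[z]
                    new_beta[z][w] += counts[d][w] * post[z]
        # M-step: normalize expected counts into distributions.
        theta = [normalize(row) for row in new_theta]
        beta = [normalize(row) for row in new_beta]
    return theta, beta


# Toy counts (hypothetical): 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0],
          [5, 2, 0, 0],
          [0, 0, 3, 4],
          [0, 0, 2, 5]]
theta, beta = plsa_em(counts, n_topics=2)
```

Note that `theta` has one row per training document. There is no rule for producing the row of an unseen document, which is the "not a complete generative model" limitation stated above.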
Introduction to Topic Models • PLSA topics (TDT-1 corpus)
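Introduction to Topic Models
• Latent Dirichlet Allocation (LDA)
  • For each document d = 1, …, M
    • Generate θ_d ~ Dir(· | α)
    • For each position n = 1, …, N_d
      • Generate z_n ~ Mult(· | θ_d)
      • Generate w_n ~ Mult(· | β_{z_n})
[Plate diagram: α → θ → z → w, with z and w in a plate of size N nested in a plate of size M]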
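The LDA generative process can be simulated with the standard library alone, drawing the Dirichlet sample by normalizing independent Gamma draws. This is a sketch only; the toy `alpha` and `beta` values and the function names are assumptions for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative process.

    alpha      : Dirichlet parameter, one entry per topic
    beta[k][v] : probability of word id v under topic k
    """
    rng = random.Random(seed)
    n_topics = len(alpha)
    vocab = range(len(beta[0]))
    corpus = []
    for n_words in doc_lengths:
        theta = sample_dirichlet(alpha, rng)  # theta_d ~ Dir(. | alpha)
        # One topic per word position: z_n ~ Mult(. | theta_d)
        topics = rng.choices(range(n_topics), weights=theta, k=n_words)
        # One word per position from its topic: w_n ~ Mult(. | beta_{z_n})
        words = [rng.choices(vocab, weights=beta[z])[0] for z in topics]
        corpus.append({"theta": theta, "topics": topics, "words": words})
    return corpus


# Toy parameters (hypothetical): 2 topics, vocabulary of 4 word ids.
alpha = [0.5, 0.5]
beta = [[0.45, 0.45, 0.05, 0.05],
        [0.05, 0.05, 0.45, 0.45]]
corpus = generate_lda_corpus(alpha, beta, doc_lengths=[6, 3])
```

Unlike PLSA, the topic mixture `theta` is itself drawn from a prior, so the same code generates as many new documents as requested.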
Introduction to Topic Models
• Latent Dirichlet Allocation (LDA)
  • Overcomes the issues with PLSA
    • Can generate any random document
  • Parameter learning:
    • Variational EM
      • Numerical approximation using lower bounds
      • Results in biased solutions
      • Convergence has numerical guarantees
    • Gibbs sampling
      • Stochastic simulation
      • Unbiased solutions
      • Stochastic convergence
Introduction to Topic Models
• Variational EM for LDA
  • Approximate the posterior by a simpler distribution
  • A convex function in each parameter!
Introduction to Topic Models
• Gibbs sampling
  • Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  • The sequence of samples forms a Markov chain
  • The stationary distribution of the chain is the joint distribution
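For LDA, one widely used variant is the collapsed Gibbs sampler, which resamples each word's topic from its conditional given all other assignments. The sketch below shows that variant, which may differ in detail from the sampler the slides had in mind. The symmetric prior `eta` on topic-word distributions is an assumption that does not appear on the LDA slide, as are the function name and toy data.

```python
import random


def lda_collapsed_gibbs(docs, n_topics, vocab_size,
                        alpha=0.1, eta=0.01, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors alpha and eta.

    docs is a list of documents, each a list of word ids.
    Returns (assignments, doc_topic, topic_word) after n_sweeps full passes.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)

    # Random initialization of every topic assignment.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional of z given all other assignments (unnormalized).
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + eta)
                           / (topic_total[k] + vocab_size * eta)
                           for k in range(n_topics)]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment.
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word


# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = lda_collapsed_gibbs(docs, n_topics=2, vocab_size=4)
```

Each inner step is one transition of the Markov chain described above. Only count tables are stored, so the conditional is cheap to evaluate even though the joint is not.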
Introduction to Topic Models • LDA topics
Introduction to Topic Models • LDA’s view of a document
Introduction to Topic Models
• Perplexity comparison of various models
[Figure: perplexity curves for the unigram model, the mixture model, PLSA, and LDA; lower is better]
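Perplexity is the exponential of the negative average per-token log-likelihood on held-out text, so a lower value means the model is less "surprised". A minimal sketch follows, using a unigram model as the scored model; the function names and toy data are assumptions for illustration.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(- total log-likelihood / total number of tokens)."""
    total_tokens = sum(doc_lengths)
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


def unigram_log_likelihood(doc, word_probs):
    """Log-probability of a document under a unigram model."""
    return sum(math.log(word_probs[w]) for w in doc)


# Toy check (hypothetical data): a uniform unigram model over 4 words.
word_probs = [0.25, 0.25, 0.25, 0.25]
held_out = [[0, 1, 2], [3, 3, 0, 1]]
log_liks = [unigram_log_likelihood(doc, word_probs) for doc in held_out]
value = perplexity(log_liks, [len(doc) for doc in held_out])
```

For a uniform model over V words the perplexity equals V (here 4), which gives a useful sanity baseline when comparing models.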
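Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Select document d ~ Mult(π)
• For each position n = 1, …, N_d
  • Generate z_n ~ Mult(· | θ_d)
  • Generate w_n ~ Mult(· | β_{z_n})
• For each citation j = 1, …, L_d
  • Generate z_j ~ Mult(· | θ_d)
  • Generate c_j ~ Mult(· | γ_{z_j})
[Plate diagram: d with θ_d feeds two topic nodes z; one generates words w in a plate of size N, the other generates citations c in a plate of size L, with γ parameterizing the citations; both sit in a plate of size M]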
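Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• PLSA likelihood: ∏_d ∏_{n=1}^{N_d} Σ_z θ_{d,z} β_{z, w_n}
• New likelihood: ∏_d [ ∏_{n=1}^{N_d} Σ_z θ_{d,z} β_{z, w_n} ] [ ∏_{j=1}^{L_d} Σ_z θ_{d,z} γ_{z, c_j} ]
• Learning using EM
[Plate diagram: same as the generative model, with words w in a plate of size N and citations c in a plate of size L sharing θ_d]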
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Heuristic: weight the content log-likelihood by α and the hyperlink log-likelihood by (1 − α)
• 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks
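The weighting heuristic can be written as a convex combination of a content term and a hyperlink term for each document. The sketch below assumes the names `theta_d` for the document's topic mixture, `beta` for topic-word probabilities, `gamma` for topic-citation probabilities, and `alpha` for the weight. Averaging each term over the document's words or citations is a choice made in this sketch so that the two terms are on a comparable scale.

```python
import math


def mixture_log_prob(item, theta_d, dist):
    """log sum_z theta_d[z] * dist[z][item]."""
    return math.log(sum(theta_d[z] * dist[z][item] for z in range(len(theta_d))))


def weighted_log_likelihood(words, citations, theta_d, beta, gamma, alpha):
    """alpha * (content term) + (1 - alpha) * (hyperlink term) for one document."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    content = sum(mixture_log_prob(w, theta_d, beta) for w in words)
    links = sum(mixture_log_prob(c, theta_d, gamma) for c in citations)
    # Per-item averages; max(..., 1) avoids dividing by zero for empty lists.
    content /= max(len(words), 1)
    links /= max(len(citations), 1)
    return alpha * content + (1.0 - alpha) * links


# Toy document (hypothetical): 2 topics, 3 word ids, 2 citable documents.
theta_d = [0.7, 0.3]
beta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
gamma = [[0.9, 0.1], [0.2, 0.8]]
words, citations = [0, 0, 1], [0]
content_only = weighted_log_likelihood(words, citations, theta_d, beta, gamma, alpha=1.0)
links_only = weighted_log_likelihood(words, citations, theta_d, beta, gamma, alpha=0.0)
balanced = weighted_log_likelihood(words, citations, theta_d, beta, gamma, alpha=0.5)
```

Setting `alpha=1.0` ignores hyperlinks entirely and `alpha=0.0` ignores content, matching the two extremes of the heuristic.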
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Classification performance
[Figure: classification performance plots, with axes running between content and hyperlink weighting]
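Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
• For each document d = 1, …, M
  • Generate θ_d ~ Dir(· | α)
  • For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
  • For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})
• Learning using variational EM
[Plate diagram: α → θ feeds two topic nodes z; one generates words w in a plate of size N, the other generates citations c in a plate of size L, with γ parameterizing the citations; both sit in a plate of size M]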
Hyperlink modeling using LDA[Erosheva, Fienberg, Lafferty, PNAS, 2004]
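Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• For each author a = 1, …, A
  • Generate θ_a ~ Dir(· | α)
• For each topic k = 1, …, K
  • Generate φ_k ~ Dir(· | β)
• For each document d = 1, …, M
  • For each position n = 1, …, N_d
    • Generate author x ~ Unif(· | a_d)
    • Generate z_n ~ Mult(· | θ_x)
    • Generate w_n ~ Mult(· | φ_{z_n})
[Plate diagram: observed author set a_d → x → z → w, with x, z, and w in a plate of size N nested in a plate of size M; θ in a plate of size A with prior α; φ in a plate of size K with prior β]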
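The author-topic generative process differs from LDA in that the topic mixture belongs to an author, and each word first picks one of the document's authors uniformly at random. A minimal standard-library sketch follows; the function names, the symmetric Dirichlet parameters, and the toy author lists are assumptions for illustration.

```python
import random


def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_author_topic_corpus(doc_authors, n_authors, n_topics, vocab_size,
                                 doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Sample (author, topic, word) triples for every word position.

    doc_authors[d] is the observed list of author ids of document d.
    """
    rng = random.Random(seed)
    # One topic mixture per author and one word distribution per topic.
    theta = [sample_symmetric_dirichlet(alpha, n_topics, rng) for _ in range(n_authors)]
    phi = [sample_symmetric_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for authors, n_words in zip(doc_authors, doc_lengths):
        tokens = []
        for _ in range(n_words):
            x = rng.choice(authors)                                  # x ~ Unif(a_d)
            z = rng.choices(range(n_topics), weights=theta[x])[0]    # z ~ Mult(theta_x)
            w = rng.choices(range(vocab_size), weights=phi[z])[0]    # w ~ Mult(phi_z)
            tokens.append((x, z, w))
        corpus.append(tokens)
    return theta, phi, corpus


# Toy setup (hypothetical): 3 authors, 2 topics, 5 word ids, 2 documents.
doc_authors = [[0, 1], [2]]
theta, phi, corpus = generate_author_topic_corpus(
    doc_authors, n_authors=3, n_topics=2, vocab_size=5, doc_lengths=[6, 4])
```

Because the topic is drawn from the chosen author's mixture, a learned `theta` gives a topic profile per author rather than per document.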
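Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Learning: Gibbs sampling
[Plate diagram: observed author set a_d → x → z → w, with x, z, and w in a plate of size N nested in a plate of size M; θ in a plate of size A with prior α; φ in a plate of size K with prior β]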
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Topic-Author visualization
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Learning: Gibbs sampling
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Datasets
  • Enron email data
    • 23,488 messages between 147 users
  • McCallum’s personal email
    • 23,488(?) messages with 128 authors
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Topic visualization: Enron set
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Topic visualization: McCallum’s data
Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007] • Citation influence model
Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007] • Citation influence graph for LDA paper
Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007] • Words in LDA paper assigned to citations
Link-PLSA-LDA: Topic Influence in Blogs (ICWSM 2008)
Ramesh Nallapati, Amr Ahmed, Eric Xing