
Generative Topic Models for Community Analysis


Presentation Transcript


  1. Generative Topic Models for Community Analysis Ramesh Nallapati

  2. Objectives • Provide an overview of topic models and their learning techniques • Mixture models, PLSA, LDA • EM, variational EM, Gibbs sampling • Convince you that topic models are an attractive framework for community analysis • 5 definitive papers 10-802: Guest Lecture

  3. Outline • Part I: Introduction to Topic Models • Naive Bayes model • Mixture Models • Expectation Maximization • PLSA • LDA • Variational EM • Gibbs Sampling • Part II: Topic Models for Community Analysis • Citation modeling with PLSA • Citation Modeling with LDA • Author Topic Model • Author Topic Recipient Model • Modeling influence of Citations • Mixed membership Stochastic Block Model 10-802: Guest Lecture

  4. Introduction to Topic Models • Multinomial Naïve Bayes • For each document d = 1,…,M • Generate Cd ~ Mult(· | π) • For each position n = 1,…,Nd • Generate wn ~ Mult(· | β, Cd) [Plate diagram: class node C with word nodes W1 … WN, repeated over M documents; class prior π, word distributions β] 10-802: Guest Lecture
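
A minimal simulation of this generative story, as a sketch only: the sizes, the class prior pi, and the per-class word distributions beta below are hypothetical placeholders, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C classes, V vocabulary words, M documents.
C, V, M = 3, 1000, 5
pi = np.full(C, 1.0 / C)                  # class prior (uniform here)
beta = rng.dirichlet(np.ones(V), size=C)  # per-class word distributions beta_c

docs = []
for d in range(M):
    c = rng.choice(C, p=pi)                 # C_d ~ Mult(. | pi)
    N_d = 100                               # document length, fixed for the sketch
    w = rng.choice(V, size=N_d, p=beta[c])  # each w_n ~ Mult(. | beta, C_d)
    docs.append((c, w))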

  5. Introduction to Topic Models • Naïve Bayes Model: Compact representation [Plate diagrams: the unrolled model with nodes C and W1 … WN, and the equivalent plate notation with W inside a plate of size N; both repeated over M documents, with word distributions β] 10-802: Guest Lecture

  6. Introduction to Topic Models • Multinomial naïve Bayes: Learning • Maximize the log-likelihood of observed variables w.r.t. the parameters: • Convex function: global optimum • Solution: 10-802: Guest Lecture
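
The slide's equations are not reproduced in the transcript; in standard notation, with n_dw the count of word w in document d, the log-likelihood and its closed-form maximizer are:

\ell(\pi,\beta) = \sum_{d=1}^{M}\Big(\log \pi_{C_d} + \sum_{w} n_{dw}\,\log \beta_{C_d,w}\Big),
\qquad
\hat{\pi}_c = \frac{\#\{d : C_d = c\}}{M},
\qquad
\hat{\beta}_{c,w} = \frac{\sum_{d : C_d = c} n_{dw}}{\sum_{d : C_d = c} N_d}.

(In practice the counts are usually smoothed, e.g. by adding one.)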

  7. Introduction to Topic Models • Mixture model: unsupervised naïve Bayes model • Joint probability of words and classes: • But classes are not visible: [Plate diagram: latent class Z with words W in a plate of size N, repeated over M documents; parameters π, β] 10-802: Guest Lecture
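
The joint and marginal probabilities referred to above were figures on the slide; written out in the same notation they are

p(w_d, z_d) = \pi_{z_d}\prod_{n=1}^{N_d}\beta_{z_d, w_n},
\qquad
p(w_d) = \sum_{z}\pi_{z}\prod_{n=1}^{N_d}\beta_{z, w_n},

and it is the log of the sum in p(w_d) that makes the likelihood non-convex, motivating EM on the next slides.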

  8. Introduction to Topic Models • Mixture model: learning • Not a convex function • No global optimum solution • Solution: Expectation Maximization • Iterative algorithm • Finds local optimum • Guaranteed to maximize a lower-bound on the log-likelihood of the observed data 10-802: Guest Lecture

  9. Introduction to Topic Models • Quick summary of EM: • Log is a concave function • Lower-bound is convex! • Optimize this lower-bound w.r.t. each variable instead [Figure: Jensen's inequality, log(0.5 x1 + 0.5 x2) ≥ 0.5 log(x1) + 0.5 log(x2), with an entropy term H(·)] 10-802: Guest Lecture
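
The bound sketched in the figure is Jensen's inequality; for any distribution q(z) over the latent class it gives the EM lower bound

\log p(w) \;=\; \log\sum_{z} q(z)\,\frac{p(w,z)}{q(z)}
\;\ge\; \sum_{z} q(z)\,\log p(w,z) \;+\; H(q),

with H(q) the entropy of q; the two-point case log(0.5 x1 + 0.5 x2) ≥ 0.5 log(x1) + 0.5 log(x2) is exactly the picture on the slide.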

  10. Introduction to Topic Models • Mixture model: EM solution • E-step and M-step (shown on the slide as equation figures; standard forms below) 10-802: Guest Lecture
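
For the mixture of multinomials the two steps take the standard form, with gamma_dz the posterior responsibility of class z for document d and n_dw the word counts:

\text{E-step: }\;\gamma_{dz} \;\propto\; \pi_z \prod_{n=1}^{N_d}\beta_{z,w_n},
\qquad
\text{M-step: }\;\pi_z \propto \sum_{d}\gamma_{dz},
\quad
\beta_{z,w} \propto \sum_{d}\gamma_{dz}\, n_{dw}.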

  11. Introduction to Topic Models 10-802: Guest Lecture

  12. Introduction to Topic Models • Probabilistic Latent Semantic Analysis Model • Select document d ~ Mult(·) • For each position n = 1,…,Nd • generate zn ~ Mult(· | θd) • generate wn ~ Mult(· | βzn) • θd: topic distribution of document d [Plate diagram: d → θd → zn → wn, with n = 1 … Nd in a plate repeated over M documents; word distributions β] 10-802: Guest Lecture
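
The PLSA likelihood implied by this generative story (the slide's own equation is not in the transcript) can be written, with n(d,w) the count of word w in document d, as

\mathcal{L} = \sum_{d}\sum_{w} n(d,w)\,\log\sum_{z} p(z \mid d)\,p(w \mid z)
            = \sum_{d}\sum_{w} n(d,w)\,\log\sum_{z}\theta_{dz}\,\beta_{zw}.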

  13. Introduction to Topic Models • Probabilistic Latent Semantic Analysis Model • Learning using EM • Not a complete generative model • Has a distribution over the training set of documents: no new document can be generated! • Nevertheless, more realistic than the mixture model • Documents can discuss multiple topics! 10-802: Guest Lecture

  14. Introduction to Topic Models • PLSA topics (TDT-1 corpus) 10-802: Guest Lecture

  15. Introduction to Topic Models 10-802: Guest Lecture

  16. Introduction to Topic Models • Latent Dirichlet Allocation • For each document d = 1,…,M • Generate θd ~ Dir(· | α) • For each position n = 1,…,Nd • generate zn ~ Mult(· | θd) • generate wn ~ Mult(· | βzn) [Plate diagram: α → θd → zn → wn, with n = 1 … Nd in a plate repeated over M documents; word distributions β] 10-802: Guest Lecture
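
As with the naïve Bayes sketch above, the LDA generative process can be simulated in a few lines; the hyperparameters and corpus sizes below are hypothetical, and beta is drawn at random only so the sketch runs end to end.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters.
K, V, M, N_d = 10, 1000, 5, 100
alpha, eta = 0.1, 0.01
beta = rng.dirichlet(np.full(V, eta), size=K)  # topic-word distributions beta_k

docs = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))   # theta_d ~ Dir(. | alpha)
    z = rng.choice(K, size=N_d, p=theta)       # z_n ~ Mult(. | theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(. | beta_{z_n})
    docs.append(w)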

  17. Introduction to Topic Models • Latent Dirichlet Allocation • Overcomes the issues with PLSA • Can generate any random document • Parameter learning: • Variational EM • Numerical approximation using lower-bounds • Results in biased solutions • Convergence has numerical guarantees • Gibbs Sampling • Stochastic simulation • unbiased solutions • Stochastic convergence 10-802: Guest Lecture

  18. Introduction to Topic Models • Variational EM for LDA • Approximate the posterior by a simpler distribution • A convex function in each parameter! 10-802: Guest Lecture
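
The "simpler distribution" is, in Blei et al.'s formulation, a fully factorized (mean-field) family; the slide's equations are not in the transcript, but the standard form is

q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma)\prod_{n=1}^{N_d} q(z_n \mid \phi_n),
\qquad
\log p(w \mid \alpha, \beta) \;\ge\; \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q],

and variational EM alternates between tightening this bound over (gamma, phi) for each document and maximizing it over (alpha, beta) for the whole corpus.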

  19. Introduction to Topic Models • Gibbs sampling • Applicable when joint distribution is hard to evaluate but conditional distribution is known • Sequence of samples comprises a Markov Chain • Stationary distribution of the chain is the joint distribution 10-802: Guest Lecture
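
The update itself is not shown on the slide; one common choice for LDA is the collapsed Gibbs sampler of Griffiths and Steyvers (2004), which resamples each token's topic from its conditional (writing the topic-word smoothing hyperparameter as eta to avoid clashing with the slides' beta):

P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\big(n^{-i}_{d,k} + \alpha\big)\,
\frac{n^{-i}_{k,w_i} + \eta}{n^{-i}_{k,\cdot} + V\eta},

where n^{-i}_{d,k} counts tokens in document d assigned to topic k and n^{-i}_{k,w} counts assignments of word w to topic k, both excluding the current token i.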

  20. Introduction to Topic Models • LDA topics 10-802: Guest Lecture

  21. Introduction to Topic Models • LDA’s view of a document 10-802: Guest Lecture

  22. Introduction to Topic Models • Perplexity comparison of various models (lower is better) [Figure: perplexity curves for the unigram model, mixture model, PLSA, and LDA] 10-802: Guest Lecture
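
For reference, perplexity on held-out documents is the exponentiated negative per-word log-likelihood, as defined in the LDA paper:

\text{perplexity}(D_{\text{test}}) = \exp\!\left(-\,\frac{\sum_{d}\log p(\mathbf{w}_d)}{\sum_{d} N_d}\right),

so lower values mean the model assigns higher probability to unseen text.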

  23. Introduction to Topic Models • Summary • Generative models for exchangeable data • Unsupervised models • Automatically discover topics • Well developed approximate techniques available for inference and learning 10-802: Guest Lecture

  24. Outline • Part I: Introduction to Topic Models • Naive Bayes model • Mixture Models • Expectation Maximization • PLSA • LDA • Variational EM • Gibbs Sampling • Part II: Topic Models for Community Analysis • Citation modeling with PLSA • Citation Modeling with LDA • Author Topic Model • Author Topic Recipient Model • Modeling influence of Citations • Mixed membership Stochastic Block Model 10-802: Guest Lecture

  25. Hyperlink modeling using PLSA 10-802: Guest Lecture

  26. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001] • Select document d ~ Mult(·) • For each position n = 1,…,Nd • generate zn ~ Mult(· | θd) • generate wn ~ Mult(· | βzn) • For each citation j = 1,…,Ld • generate zj ~ Mult(· | θd) • generate cj ~ Mult(· | γzj) [Plate diagram: θd generates topics for both words (plate of size Nd, word distributions β) and citations (plate of size Ld, citation distributions γ); repeated over M documents] 10-802: Guest Lecture

  27. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001] • PLSA likelihood: • New likelihood: • Learning using EM [Equations shown as figures; the new likelihood is written out below. Plate diagram as on the previous slide] 10-802: Guest Lecture
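
Written out in the same style as the PLSA likelihood earlier, with n(d,w) word counts and n(d,c) citation counts, the new likelihood adds a citation term that shares the document's topic mixture (a reconstruction of the figure, not the paper's exact typography):

\mathcal{L} = \sum_{d}\Big[\sum_{w} n(d,w)\,\log\sum_{z}\theta_{dz}\,\beta_{zw}
\;+\;\sum_{c} n(d,c)\,\log\sum_{z}\theta_{dz}\,\gamma_{zc}\Big].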

  28. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001] • Heuristic: weight the content and hyperlink terms by λ and (1−λ) • 0 ≤ λ ≤ 1 determines the relative importance of content and hyperlinks 10-802: Guest Lecture
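
With this heuristic the objective becomes a convex combination of the two terms; a sketch of the weighted log-likelihood (the paper's exact normalization may differ) is

\mathcal{L}_{\lambda} = \sum_{d}\Big[\lambda\sum_{w} n(d,w)\,\log\sum_{z}\theta_{dz}\,\beta_{zw}
\;+\;(1-\lambda)\sum_{c} n(d,c)\,\log\sum_{z}\theta_{dz}\,\gamma_{zc}\Big].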

  29. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001] • Experiments: Text Classification • Datasets: • WebKB • 6000 CS dept web pages with hyperlinks • 6 classes: faculty, course, student, staff, etc. • Cora • 2000 machine learning abstracts with citations • 7 classes: sub-areas of machine learning • Methodology: • Learn the model on complete data and obtain θd for each document • Test documents classified into the label of the nearest neighbor in the training set • Distance measured as cosine similarity in the θ space (see the sketch below) • Measure the performance as a function of λ 10-802: Guest Lecture
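
A minimal sketch of the classification step, assuming the fitted theta vectors are stacked row-wise in NumPy arrays; the function name and variables are hypothetical, not from the paper.

import numpy as np

def classify_nn_cosine(theta_test, theta_train, labels_train):
    """Assign each test document the label of the training document that is
    closest under cosine similarity in the topic (theta) space.
    labels_train must be a NumPy array so it can be fancy-indexed."""
    a = theta_test / np.linalg.norm(theta_test, axis=1, keepdims=True)
    b = theta_train / np.linalg.norm(theta_train, axis=1, keepdims=True)
    sims = a @ b.T                        # pairwise cosine similarities
    return labels_train[sims.argmax(axis=1)]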

  30. Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001] • Classification performance [Figure: classification accuracy as the weight λ shifts between content and hyperlinks] 10-802: Guest Lecture

  31. Hyperlink modeling using LDA 10-802: Guest Lecture

  32. Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004] • For each document d = 1,…,M • Generate θd ~ Dir(· | α) • For each position n = 1,…,Nd • generate zn ~ Mult(· | θd) • generate wn ~ Mult(· | βzn) • For each citation j = 1,…,Ld • generate zj ~ Mult(· | θd) • generate cj ~ Mult(· | γzj) • Learning using variational EM [Plate diagram: as in the PLSA hyperlink model, with a Dirichlet prior α on θd] 10-802: Guest Lecture

  33. Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004] 10-802: Guest Lecture

  34. Author-Topic Model for Scientific Literature 10-802: Guest Lecture

  35. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004] • For each author a = 1,…,A • Generate θa ~ Dir(· | α) • For each topic k = 1,…,K • Generate φk ~ Dir(· | β) • For each document d = 1,…,M • For each position n = 1,…,Nd • Generate author x ~ Unif(· | ad) • generate zn ~ Mult(· | θx) • generate wn ~ Mult(· | φzn) [Plate diagram: α → θa in a plate over A authors; β → φk in a plate over K topics; ad → x → zn → wn with n = 1 … Nd, repeated over M documents] 10-802: Guest Lecture
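
Under this process the probability of a word in document d with author set a_d marginalizes uniformly over the authors and over topics; a standard way to write it (the slide's own equation is not in the transcript) is

p(w_n \mid d) = \frac{1}{|a_d|}\sum_{x \in a_d}\sum_{k=1}^{K}\theta_{xk}\,\phi_{k,w_n}.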

  36. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004] • Learning: Gibbs sampling [Plate diagram as on the previous slide] 10-802: Guest Lecture

  37. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004] • Perplexity results 10-802: Guest Lecture

  38. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004] • Topic-Author visualization 10-802: Guest Lecture

  39. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004] • Application 1: Author similarity 10-802: Guest Lecture

  40. Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004] • Application 2: Author entropy 10-802: Guest Lecture

  41. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] 10-802: Guest Lecture

  42. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Gibbs sampling 10-802: Guest Lecture

  43. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Datasets • Enron email data • 23,488 messages between 147 users • McCallum’s personal email • 23,488(?) messages with 128 authors 10-802: Guest Lecture

  44. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Topic Visualization: Enron set 10-802: Guest Lecture

  45. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] • Topic Visualization: McCallum’s data 10-802: Guest Lecture

  46. Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05] 10-802: Guest Lecture

  47. Modeling Citation Influences 10-802: Guest Lecture

  48. Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] • Copycat model 10-802: Guest Lecture

  49. Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] • Citation influence model 10-802: Guest Lecture

  50. Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] • Citation influence graph for LDA paper 10-802: Guest Lecture
