
Modeling Documents

Modeling Documents. Amruta Joshi Department of Computer Science Stanford University. Outline. Topic Models Topic Extraction 2 Author Information Modeling Topics Modeling Authors Author Topic Model Inference Integrating topics and syntax Probabilistic Models Composite Model Inference.


  1. Modeling Documents Amruta Joshi Department of Computer Science Stanford University Research in Algorithms for the InterNet

  2. Outline • Topic Models • Topic Extraction • Author Information • Modeling Topics • Modeling Authors • Author Topic Model • Inference • Integrating topics and syntax • Probabilistic Models • Composite Model • Inference

  3. Motivation • Identifying content of a document • Identifying its latent structure • More specifically • Given a collection of documents we want to create a model to collect information about • Authors • Topics • Syntactic constructs

  4. Topics & Authors • Why model topics? • Observe topic trends • How documents relate to one another • Tagging abstracts • Why model authors’ interests? • Identifying what an author writes about • Identifying authors with similar interests • Authorship attribution • Creating reviewer lists • Finding unusual work by an author

  5. Topic Extraction: Overview • Supervised Learning Techniques • Learn from a labeled document collection • But: unlabeled documents, rapidly changing fields (Yang 1998) • [Figure: example sentence "In floods, the banks of a river overflow", labeled "rivers"]

  6. Topic Extraction: Overview • Dimensionality Reduction • Represent documents in a vector space of terms • Map to a low-dimensional space • Non-linear dim. reduction • WEBSOM (Lagus et al. 1999) • Linear Projection • LSI (Berry, Dumais, O’Brien 1995) • Regions represent topics

  7. Topic Extraction: Overview • Cluster documents on semantic content • Typically, each cluster has just 1 topic • Aspect Model • Topic modeled as distribution over words • Documents generated from multiple topics

  8. Author Information: Overview • Analyzing text using • Stylometry • statistical analysis using literary style, frequency of word usage, etc. • Semantics • Content of document • [Example text: "As doth the lion in the Capitol, A man no mightier than thyself or me …"]

  9. Author Information: Overview • Graph-based models • Build Interactive ReferralWeb using citations • Kautz, Selman, Shah 1997 • Build Co-Author Graphs • White & Smith • PageRank for analysis • [Figure: graph over documents D1, D2, D3, D4]

  10. The Big Idea • Topic Model • Model topics as distribution over words • Author Model • Model author as distribution over words • Author-Topic Model • Probabilistic Model for both • Model topics as distribution over words • Model authors as distribution over topics

  11. Bayesian Networks • nodes = random variables • edges = direct probabilistic influence • Topology captures independence: XRay is conditionally independent of Pneumonia given Infiltrates • [Figure: network over Pneumonia, Tuberculosis, Lung Infiltrates, XRay, Sputum Smear] • Slide Credit: Lise Getoor, UMD College Park

  12. Bayesian Networks • Associated with each node Xi there is a conditional probability distribution P(Xi|Pai): a distribution over Xi for each assignment to its parents • If variables are discrete, P is usually multinomial • P can be linear Gaussian, mixture of Gaussians, … • [Figure: the same network with the table P(I | P, T) attached to Lung Infiltrates, one row per assignment to P and T: 0.7 0.3 · 0.6 0.4 · 0.2 0.8 · 0.01 0.99] • Slide Credit: Lise Getoor, UMD College Park
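
To make the role of the conditional probability table concrete, here is a minimal sketch that marginalizes the table for Lung Infiltrates over its two parents. It rests on two assumptions that are not in the slide: the four rows are read in the order (p, t), (p, not t), (not p, t), (not p, not t), and the prior probabilities of Pneumonia and Tuberculosis are hypothetical values chosen only for illustration.

```python
# P(Infiltrates = true | Pneumonia, Tuberculosis), first column of the table.
# ASSUMPTION: row order is (p, t), (p, not t), (not p, t), (not p, not t).
P_I_GIVEN_PT = {
    (True, True): 0.7,
    (True, False): 0.6,
    (False, True): 0.2,
    (False, False): 0.01,
}

# HYPOTHETICAL priors for the two root nodes; the slide does not give them.
P_PNEUMONIA = 0.05
P_TUBERCULOSIS = 0.01


def prob_infiltrates() -> float:
    """Marginal P(I = true): sum the table over both parent assignments."""
    total = 0.0
    for p in (True, False):
        for t in (True, False):
            prior_p = P_PNEUMONIA if p else 1.0 - P_PNEUMONIA
            prior_t = P_TUBERCULOSIS if t else 1.0 - P_TUBERCULOSIS
            total += prior_p * prior_t * P_I_GIVEN_PT[(p, t)]
    return total
```

Because Pneumonia and Tuberculosis have no parents in the network, their joint prior factorizes into the product used in the loop.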

  13. BN Learning • BN models can be learned from empirical data • parameter estimation via numerical optimization • structure learning via combinatorial search • [Figure: Data fed to an Inducer, which outputs a network over P, T, I, X, S] • Slide Credit: Lise Getoor, UMD College Park

  14. Generative Model • Probabilistic Generative Process • Statistical Inference • Mixture weights • Mixture components • Bayesian approach: use priors • Mixture weights ~ Dirichlet(α) • Mixture components ~ Dirichlet(β)

  15. Bayesian Network for modeling document generation • [Figure: document node Doc 1 linked to topic nodes T1, T2, …, TT (variable Z), which are linked to word nodes w1, w2, …, wv (variable W)]

  16. Topic Model: Plate Notation • θ: document-specific distribution over topics • z: topic • φ: topic distribution over words • w: word • Plates: Nd (words in document d), T (topics), D (documents)
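
The generative story behind the plate diagram can be sketched in a few lines. This is an illustrative sketch rather than the presenter's code: the function names, the default concentration values, and the fixed document length are assumptions made for the example. Dirichlet draws are obtained by normalizing Gamma draws, which is a standard construction.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Symmetric Dirichlet draw, obtained by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=1.0, beta=0.5, seed=0):
    """Generate (topic, word) pairs for each document under the topic model."""
    rng = random.Random(seed)
    # phi[j]: topic j's distribution over words (mixture components).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta: this document's distribution over topics (mixture weights).
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # choose a topic for this token
            w = sample_index(phi[z], rng)  # choose a word from that topic
            doc.append((z, w))
        corpus.append(doc)
    return phi, corpus
```

Inference runs this process in reverse: only the words are observed, and the topic assignments and distributions have to be recovered.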

  17. Topic Model: Geometric Representation

  18. Modeling Authors with words • ad: authors of document d • x: author, drawn from a uniform distribution over the authors of the document • φ: each author's distribution over words • w: word • Plates: Nd (words in document d), A (authors), D (documents)

  19. Author-Topic Model • ad: authors of document d • x: author, drawn from a uniform distribution over the authors of the document • θ: each author's distribution over topics • z: topic • φ: each topic's distribution over words • w: word • Plates: Nd (words in document d), T (topics), A (authors), D (documents)
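
As a sketch of how a single document is generated under the author-topic model: each token first picks one of the document's authors uniformly, then a topic from that author's topic distribution, then a word from that topic's word distribution. The function name and the toy parameter values are assumptions made for illustration only.

```python
import random


def generate_author_topic_doc(authors, theta, phi, doc_len, rng):
    """Generate (author, topic, word) triples for one document.

    authors: ids of the document's authors (a_d)
    theta[a]: author a's distribution over topics
    phi[j]:   topic j's distribution over words
    """
    tokens = []
    for _ in range(doc_len):
        x = rng.choice(authors)  # uniform over the document's authors
        z = rng.choices(range(len(theta[x])), weights=theta[x], k=1)[0]
        w = rng.choices(range(len(phi[z])), weights=phi[z], k=1)[0]
        tokens.append((x, z, w))
    return tokens


# Toy parameters (hypothetical): two authors, two topics, three words.
toy_theta = {0: [0.9, 0.1], 1: [0.2, 0.8]}
toy_phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
toy_tokens = generate_author_topic_doc([0, 1], toy_theta, toy_phi, 8, random.Random(0))
```

Setting every document to have exactly one unique author reduces this to the plain topic model, since each author's topic distribution then plays the role of a document's topic distribution.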

  20. Inference • Expectation Maximization • But poor results (local maxima) • Gibbs Sampling • Parameters: θ, φ • Start with initial random assignment • Update each parameter using the other parameters • Converges after ‘n’ iterations • Burn-in time

  21. Inference and Learning for Documents • Sampling equation: prob. that the ith word token is assigned to topic j, keeping the other topic assignments unchanged • mj count: # of times word m is assigned to topic j • dj count: # of times topic j has occurred in document d
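
The sampling equation itself is not present in the transcript. For reference, the collapsed Gibbs update for the topic model, in the form given by Steyvers & Griffiths (Probabilistic topic models) and Griffiths & Steyvers (2004), is reproduced below. The superscripts WT and DT (word-topic and document-topic count matrices) follow those papers and are an assumption about the slide's notation; V is the vocabulary size and T the number of topics, and all counts exclude the current token i.

```latex
P(z_i = j \mid \mathbf{z}_{-i}, w_i = m, d_i = d)
  \;\propto\;
  \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
  \cdot
  \frac{C^{DT}_{dj} + \alpha}{\sum_{t} C^{DT}_{dt} + T\alpha}
```

The first factor favors topics that already explain word m well; the second favors topics that are already common in document d.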

  22. Matrix Factorization

  23. Topic Model: Inference • Can we recover the original topics and topic mixtures from this data? • [Figure: documents made of the words River, Stream, Bank, Money, Loan] • Slide Credit: Padhraic Smyth, UC Irvine

  24. Example of Gibbs Sampling • Assign word tokens randomly to topics (one marker color per topic: topic 1, topic 2) • [Figure: tokens of River, Stream, Bank, Money, Loan with random topic markers] • Slide Credit: Padhraic Smyth, UC Irvine

  25. After 1 iteration • Apply sampling equation to each word token • [Figure: topic markers over River, Stream, Bank, Money, Loan] • Slide Credit: Padhraic Smyth, UC Irvine

  26. After 4 iterations • [Figure: topic markers over River, Stream, Bank, Money, Loan] • Slide Credit: Padhraic Smyth, UC Irvine

  27. After 32 iterations • [Figure: topic markers over River, Stream, Bank, Money, Loan] • Slide Credit: Padhraic Smyth, UC Irvine
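
The walk-through above can be reproduced with a small collapsed Gibbs sampler. This is a minimal sketch, not the code behind the slides: the function name, the toy documents, and the hyperparameter values are assumptions. The document-side denominator of the sampling equation is dropped because it is the same for every topic and cancels when the weights are normalized.

```python
import random


def gibbs_topic_model(docs, num_topics, vocab_size,
                      alpha=1.0, beta=0.01, iterations=32, seed=0):
    """Collapsed Gibbs sampling over topic assignments of word tokens."""
    rng = random.Random(seed)
    word_topic = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    doc_topic = [[0] * num_topics for _ in docs]                # document-topic counts
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start with a random topic for every word token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            word_topic[w][z] += 1
            doc_topic[d][z] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                old = assignments[d][i]
                word_topic[w][old] -= 1
                doc_topic[d][old] -= 1
                topic_total[old] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (word_topic[w][j] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    * (doc_topic[d][j] + alpha)
                    for j in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = new
                word_topic[w][new] += 1
                doc_topic[d][new] += 1
                topic_total[new] += 1
    return assignments, word_topic, doc_topic


# Toy corpus (hypothetical) over the five words used in the example slides.
RIVER, STREAM, BANK, MONEY, LOAN = range(5)
toy_docs = [
    [RIVER, STREAM, BANK] * 4,
    [MONEY, LOAN, BANK] * 4,
    [RIVER, STREAM, BANK, MONEY, LOAN, BANK] * 2,
]
assignments, word_topic, doc_topic = gibbs_topic_model(toy_docs, 2, 5)
```

Inspecting `word_topic` after the run shows how the tokens of each word are split between the two topics; results vary with the seed, and early iterations (the burn-in period) should not be trusted.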

  28. Results • Tested on Scientific Papers • V = vocabulary size, D = # of documents, K = # of authors • NIPS Dataset • V=13,649 D=1,740 K=2,037 • #Topics = 100 • #tokens = 2,301,375 • CiteSeer Dataset • V=30,799 D=162,489 K=85,465 • #Topics = 300 • #tokens = 11,685,514

  29. Evaluating Predictive Power • Perplexity • Indicates ability to predict words in new, unseen documents • The lower, the better
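
Perplexity is conventionally computed as the exponential of the negative average log-likelihood per held-out word token. A minimal sketch, assuming the model's predictive probability for each held-out token has already been computed (the function name and input format are illustrative):

```python
import math


def perplexity(token_probs):
    """exp(-(1/N) * sum of log p(w_n)) over N held-out word tokens."""
    if not token_probs:
        raise ValueError("need at least one held-out token")
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))
```

A model that spreads its probability uniformly over k words has perplexity k, so the value can be read as the effective number of equally likely word choices the model is left with.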

  30. Results: Perplexity

  31. Recap • First • Author Model • Topic Model • Then • Author-Topic Model • Next… • Integrating Topics & Syntax

  32. Integrating topics & syntax • Probabilistic Models • Short-range dependencies • Syntactic Constraints • Represented as distinct syntactic classes • HMM, Probabilistic CFGs • Long-range dependencies • Semantic Constraints • Represented as probabilistic distribution • Bayes Model, Topic Model • New Idea! Use both

  33. How to integrate these? • Mixture of Models • Each word exhibits either short- or long-range dependencies • Product of Models • Each word exhibits both short- and long-range dependencies • Composite Model • Asymmetric • All words exhibit short-range dependencies • Subset of words exhibit long-range dependencies

  34. The Composite Model 1 • Capturing asymmetry • Replace probability distribution over words with semantic model • Syntactic model chooses when to emit content word • Semantic model chooses which word to emit • Methods • Syntactic component is HMM • Semantic component is Topic model

  35. Generating phrases • Topic word lists: {network, neural, output, networks, ...}, {image, images, object, objects, ...}, {kernel, support, svm, vector, ...} • Syntactic class word lists: {used, trained, obtained, described, ...}, {in, with, for, on, ...} • Probabilities shown in the figure: 0.2, 0.5, 0.4, 0.1, 0.9, 0.7, 0.9 • Generated phrases: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm images"

  36. The Composite Model 2 (Graphical) • θ: doc’s distribution over topics • Topics: z1, z2, z3, z4 • Words: w1, w2, w3, w4 • Classes: c1, c2, c3, c4

  37. The Composite Model 3 • θ(d): document’s distribution over topics • Transitions between classes ci-1 and ci follow distribution π(ci-1) • A document is generated as: • For each word wi in document d • Draw zi from θ(d) • Draw ci from π(ci-1) • If ci=1, then draw wi from φ(zi), • else draw wi from φ(ci)
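
The generation steps above can be sketched directly. This is an illustrative sketch under stated assumptions: distributions are plain dictionaries, class 1 is the semantic class as in the slide, class 0 is used as a hypothetical start state, and the toy transition and emission probabilities are invented for the example (only the words come from the "Generating phrases" slide).

```python
import random

SEMANTIC_CLASS = 1  # class whose words are emitted by the topic model


def generate_composite_doc(theta, pi, phi_topic, phi_class,
                           doc_len, start_class=0, seed=0):
    """Generate words from the composite HMM + topic model.

    theta:        document's distribution over topics {topic: prob}
    pi[c_prev]:   class transition distribution {class: prob}
    phi_topic[z]: topic z's distribution over words
    phi_class[c]: syntactic class c's distribution over words
    """
    rng = random.Random(seed)

    def draw(dist):
        outcomes = list(dist)
        return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

    words = []
    prev_class = start_class
    for _ in range(doc_len):
        z = draw(theta)            # topic for this position
        c = draw(pi[prev_class])   # syntactic class, given the previous class
        if c == SEMANTIC_CLASS:
            w = draw(phi_topic[z])  # content word chosen by the topic model
        else:
            w = draw(phi_class[c])  # function word chosen by the HMM class
        words.append(w)
        prev_class = c
    return words


# Toy parameters (hypothetical values): one topic, a verb class, a preposition class.
toy_theta = {0: 1.0}
toy_pi = {0: {1: 1.0}, 1: {2: 0.9, 1: 0.1}, 2: {3: 1.0}, 3: {1: 1.0}}
toy_phi_topic = {0: {"network": 0.5, "neural": 0.3, "output": 0.2}}
toy_phi_class = {2: {"used": 0.5, "trained": 0.5}, 3: {"with": 0.6, "for": 0.4}}
toy_words = generate_composite_doc(toy_theta, toy_pi, toy_phi_topic, toy_phi_class, 9)
```

Note that a topic is drawn at every position even when the word ends up being emitted by a syntactic class; this matches the steps listed on the slide.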

  38. Results • Tested on • Brown corpus (tagged with word types) • Concatenated Brown & TASA corpus • HMM & Topic Model • 20 Classes • start/end Markers Class + 19 classes • T = 200

  39. Results • Identifying Syntactic classes & semantic topics • Clean separation observed • Identifying function words & content words • “control” : plain verb (syntax) or semantic word • Part-of-Speech Tagging • Identifying syntactic class • Document Classification • Brown corpus: 500 docs => 15 groups • Results similar to plain Topic Model

  40. Extensions to Topic Model • Integrating link information (Cohn, Hofmann 2001) • Learning Topic Hierarchies • Integrating Syntax & Topics • Integrate authorship info with content (author-topic model) • Grade-of-membership Models • Random sentence generation

  41. Conclusion • Identifying a document's latent structure • Document content is modeled for • Semantic associations: topic model • Authorship: author-topic model • Syntactic constructs: HMM

  42. Acknowledgements • Prof. Rajeev Motwani • Advice and guidance regarding topic selection • T. K. Satish Kumar • Help on Probabilistic Models

  43. Thank you!

  44. References • Primary • Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington. • Steyvers, M., & Griffiths, T. Probabilistic topic models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf) • Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada. • Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17. • Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
