Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models

Michael Paul and Roxana Girju. Outline: overview of topic models; Cross-Collection LDA; cross-cultural analysis with ccLDA; other applications of ccLDA; model evaluation; an alternative cross-collection model.


Presentation Transcript


  1. Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models Michael Paul and Roxana Girju

  2. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  3. Outline • Overview of topic models • PLSI and LDA • Some slides borrowed from CS410 – ChengXiang Zhai • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  4. Probabilistic Topic Models • Idea: each document is some mix of topics • Each word in the document belongs to a topic

  5. Document as a Sample of Mixed Topics • Example document: [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. … • Topic 1: government 0.3, response 0.2, ... • Topic 2: city 0.2, new 0.1, orleans 0.05, ... • Topic k: donate 0.1, relief 0.05, help 0.02, ... • Background B: is 0.05, the 0.04, a 0.03, ... • Applications of topic models: • Summarize themes/aspects • Facilitate navigation/browsing • Retrieve documents • Segment documents • Many others • How can we discover these topic word distributions?

  6. Probabilistic Latent Semantic Indexing [Hofmann, 1999] • Each token in a document is associated with 2 variables: • a word w (observable) • a topic z (hidden) • P(w,z|d) = P(z|d) P(w|z)
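As a worked step, summing the joint distribution on this slide over the hidden topic gives the probability of a word in a document, and Bayes' rule gives the posterior over topics for an observed token. The topic index range z = 1, ..., k is notation assumed here; it does not appear on the slide.

```latex
P(w \mid d) = \sum_{z=1}^{k} P(z \mid d)\, P(w \mid z),
\qquad
P(z \mid d, w) = \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'=1}^{k} P(z' \mid d)\, P(w \mid z')}
```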

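  7. PLSA as a Mixture Model • “Generating” word w in doc d in the collection: with probability λ_B the word is drawn from the background model B; with probability 1 - λ_B a topic j is chosen with document-specific weight π_{d,j} and the word is drawn from that topic's word distribution θ_j • Topic 1 (θ_1): warning 0.3, system 0.2, ... • Topic 2 (θ_2): aid 0.1, donation 0.05, support 0.02, ... • Topic k (θ_k): statistics 0.2, loss 0.1, dead 0.05, ... • Background B: is 0.05, the 0.04, a 0.03, ... • Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood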

  8. How to Estimate Multiple Topics? (Expectation Maximization) • Known background p(w | B): the 0.2, a 0.1, we 0.01, to 0.02, … • Unknown topic model p(w|θ_1) = ? (“text mining”): text = ?, mining = ?, association = ?, word = ?, … • Unknown topic model p(w|θ_2) = ? (“information retrieval”): information = ?, retrieval = ?, query = ?, document = ?, … • Observed doc(s) • E-Step: predict topic labels using Bayes' rule • M-Step: maximum likelihood estimator based on “fractional counts”
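The E-step and M-step named on this slide can be sketched in plain Python. This is a minimal illustration and not the presenters' implementation: the function name `plsa_em`, the argument layout, the random initialisation, and the fixed iteration count are all assumptions, and the background weight `lam` plays the role of the manually set noise level. It assumes every document is non-empty and every word has a background probability.

```python
import random

def plsa_em(docs, vocab, background, k=2, lam=0.5, iters=50, seed=0):
    """docs: list of {word: count}; background: {word: p(w|B)} (hypothetical layout)."""
    rng = random.Random(seed)
    # Random starting point for the k topic word distributions.
    theta = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        total = sum(raw.values())
        theta.append({w: v / total for w, v in raw.items()})
    # Uniform starting point for the per-document topic weights.
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        topic_counts = [{w: 0.0 for w in vocab} for _ in range(k)]
        doc_counts = [[0.0] * k for _ in docs]
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                # E-step: Bayes' rule, first background vs. topics, then among topics.
                mix = [pi[d][j] * theta[j][w] for j in range(k)]
                p_topics = sum(mix)
                denom = lam * background[w] + (1 - lam) * p_topics
                not_bg = (1 - lam) * p_topics / denom
                for j in range(k):
                    frac = c * not_bg * mix[j] / p_topics  # fractional count
                    topic_counts[j][w] += frac
                    doc_counts[d][j] += frac
        # M-step: maximum likelihood estimates are normalised fractional counts.
        for j in range(k):
            total = sum(topic_counts[j].values())
            theta[j] = {w: v / total for w, v in topic_counts[j].items()}
        for d in range(len(docs)):
            total = sum(doc_counts[d])
            pi[d] = [v / total for v in doc_counts[d]]
    return theta, pi

# Toy usage with made-up counts and background probabilities.
docs = [{"text": 3, "mining": 2, "the": 4}, {"query": 3, "retrieval": 2, "the": 5}]
vocab = ["text", "mining", "query", "retrieval", "the"]
background = {"the": 0.6, "text": 0.1, "mining": 0.1, "query": 0.1, "retrieval": 0.1}
theta, pi = plsa_em(docs, vocab, background)
```

Keeping the background distribution fixed lets it absorb common words such as "the", so the estimated topics concentrate on content words.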

  9. PLSI - Problems • Each document is represented as a dummy variable d • Number of parameters grows linearly with corpus size • Overfitting • Not fully generative • Not clear how to model previously unseen documents
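The linear growth in parameters can be made concrete with a rough count. The symbols below are assumptions introduced for illustration: k topics, a vocabulary of V words, and D training documents, counting multinomial entries before subtracting normalisation constraints.

```latex
\underbrace{kV}_{\text{topic word distributions } P(w \mid z)}
\;+\;
\underbrace{kD}_{\text{document mixtures } P(z \mid d)}
```

The second term grows with every added document, and a previously unseen document has no P(z|d) row of its own, which is the gap the last bullet points to.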

  10. Latent Dirichlet Allocation [Blei et al., 2003] • Per-document topic mixtures and word multinomials come from Dirichlet priors • Exact solution is intractable • Inference is more complicated • Variational methods • Monte Carlo
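The generative story behind the first bullet can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the symmetric hyperparameters `alpha` and `beta`, and the fixed document length are assumptions, and the sketch covers only generation, not the variational or Monte Carlo inference the slide mentions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_lda_corpus(n_docs, doc_len, vocab, n_topics, alpha=0.5, beta=0.1, seed=0):
    rng = random.Random(seed)
    # One word multinomial per topic, drawn from a symmetric Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic mixture per document, drawn from a symmetric Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # hidden topic of this token
            w = sample_index(phi[z], rng)  # observed word given the topic
            doc.append(vocab[w])
        corpus.append(doc)
    return corpus, phi

corpus, phi = generate_lda_corpus(3, 5, ["storm", "aid", "city", "relief"], n_topics=2)
```

Because the document mixtures are drawn from a prior rather than stored as one parameter vector per training document, the same process also describes documents that were not in the training set.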

  11. Dirichlet Distribution • Conjugate prior of multinomial distribution
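Conjugacy means that a Dirichlet prior combined with multinomial counts yields another Dirichlet whose parameters are the prior parameters plus the observed counts. A small sketch of that update, with helper names and numbers chosen purely for illustration:

```python
def dirichlet_posterior(prior_alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return [a + n for a, n in zip(prior_alpha, counts)]

def dirichlet_mean(alpha):
    """Expected multinomial probabilities under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]    # uniform prior over three outcomes
counts = [5, 2, 0]         # observed counts per outcome
posterior = dirichlet_posterior(prior, counts)
print(posterior)                  # [6.0, 3.0, 1.0]
print(dirichlet_mean(posterior))  # [0.6, 0.3, 0.1]
```

The third outcome was never observed but still receives non-zero probability, which is the smoothing effect of the prior.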

  12. Latent Dirichlet Allocation

  13. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  14. Cross-Collection LDA (ccLDA) • LDA extension for modeling multiple text collections • Each topic has a probability distribution that is shared among all collections as well as word distributions that are unique to each collection • Automatically discovers differences between collections and organizes them by topic

  15. Example • Topic of weather and the outdoors in travel forums

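  16. ccLDA • Graphical representation: [figure: plate diagram with variables α, φ, β, θ, z, w, c, x, γ0, γ1, ψ, σ, δ and plates labeled C, T, D, N] • The generative process: [not preserved in the transcript] • Inference can be done with Gibbs sampling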
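The generative process itself did not survive in this transcript, so the following is a sketch reconstructed from what the other slides state: each topic has a shared word distribution plus one word distribution per collection, a per-token switch x chooses between them with a probability that depends on topic and collection, and the distributions have Dirichlet/Beta priors. All function and variable names, the hyperparameter values, and the convention that a successful draw means "use the shared distribution" are assumptions made for illustration.

```python
import random

def dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_cclda_parameters(n_topics, n_collections, vocab_size, rng,
                            beta=0.1, delta=0.1, gamma0=1.0, gamma1=1.0):
    # Shared word distribution per topic (Dirichlet prior).
    shared = [dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
    # Collection-specific word distribution per (topic, collection) pair.
    specific = [[dirichlet([delta] * vocab_size, rng) for _ in range(n_collections)]
                for _ in range(n_topics)]
    # Switch probability per (topic, collection) pair (Beta prior);
    # assumed here to be the probability of using the shared distribution.
    p_shared = [[rng.betavariate(gamma0, gamma1) for _ in range(n_collections)]
                for _ in range(n_topics)]
    return shared, specific, p_shared

def generate_cclda_document(c, doc_len, alpha_c, shared, specific, p_shared, rng):
    """Generate word indices for one document in collection c."""
    theta = dirichlet(alpha_c, rng)  # topic mixture from a collection-dependent prior
    tokens = []
    for _ in range(doc_len):
        z = draw(theta, rng)                        # topic of this token
        use_shared = rng.random() < p_shared[z][c]  # switch x
        dist = shared[z] if use_shared else specific[z][c]
        tokens.append(draw(dist, rng))
    return tokens

rng = random.Random(0)
params = sample_cclda_parameters(n_topics=3, n_collections=2, vocab_size=6, rng=rng)
doc = generate_cclda_document(0, 10, [0.5, 0.5, 0.5], *params, rng)
```

Inference would run in the opposite direction, recovering z and x for each token from the observed words and collection labels; the slide names Gibbs sampling for that step, which this sketch does not cover.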

  17. Previous Work • Comparative mixture model (ccMix) • ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004. • Improvements in ccLDA: • Does not rely on user-defined parameters • Distributions have Dirichlet/Beta priors • Document-topic distributions have collection-dependent priors • P(x) depends on the topic and collection

  18. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  19. Cross-Cultural Analysis • Documents from or about 3 countries: • United Kingdom • India • Singapore • 3,266 forum discussions • collected from lonelyplanet.com • represents the perspective of tourists • 7,388 English-language blogs • collected through blogcatalog.com • represents the perspective of locals

  20. Cross-Cultural Analysis • Topic of religion from the blogs

  21. Cross-Cultural Analysis • Topic of entertainment from the blogs • Compare against ccMix

  22. Cross-Cultural Analysis • Topic of travel from the blogs • Compare against LDA (on each collection individually)

  23. Cross-Cultural Analysis • Topic of food from both datasets • Compare the view of tourists and locals

  24. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Scientific research/literature analysis • Media analysis and bias detection • Model evaluation • An alternative cross-collection model

  25. Research Analysis • 16,186 abstracts from computational linguistics and linguistics journals • Interdisciplinary research topic discovery • Topic evolution over time

  26. Research Analysis • Topic of communication

  27. Research Analysis • Topic of parsing/grammars across two time intervals

  28. Media Analysis • 623 news articles from msnbc.com and foxnews.com from August 2008 • Discover editorial differences within topics

  29. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  30. Model Evaluation • Greater likelihood of held-out data than alternative models
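The slide does not say how held-out likelihood was computed. As a generic illustration of the quantity being compared, the sketch below scores a held-out document under a fitted topic model, assuming its topic mixture `theta` has already been inferred; the function names and the toy numbers are hypothetical.

```python
import math

def heldout_log_likelihood(doc_tokens, theta, phi):
    """Sum over tokens of log P(w | d), with P(w | d) = sum_z theta[z] * phi[z][w]."""
    total = 0.0
    for w in doc_tokens:
        total += math.log(sum(theta[z] * phi[z][w] for z in range(len(theta))))
    return total

def perplexity(log_likelihood, n_tokens):
    """Per-token perplexity; lower values mean the model predicts the data better."""
    return math.exp(-log_likelihood / n_tokens)

# Toy model with two topics over a two-word vocabulary.
theta = [0.5, 0.5]
phi = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]
tokens = ["a", "b"]
ll = heldout_log_likelihood(tokens, theta, phi)
ppl = perplexity(ll, len(tokens))
```

A model with greater held-out likelihood has lower perplexity on the same tokens, so either number supports the same ranking of models.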

  31. Model Evaluation • Document classification – new vs old • Compare to NB and SVM (linear kernel)

  32. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  33. Alternative Model • Similar to hierarchical Pachinko Allocation [Mimno et al, 2007] • Model as 2-level hierarchy

  34. Alternative Model • Single, global set of “super-topics” • One set of “sub-topics” for each collection • Choose super-topic T from P(T|d) • Choose sub-topic t from P(t|T,c) • Choose hierarchy level l from P(l|t,T) • If l = 0, choose word from P(w|T); else if l = 1, choose word from P(w|t)
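The sampling steps listed on this slide can be sketched for a single token. The nested-list layout of the probability tables and all names are assumptions for illustration; in particular, the level and sub-topic word tables are indexed by collection c here because each collection has its own set of sub-topics.

```python
import random

def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_token(d, c, p_super, p_sub, p_level, w_super, w_sub, rng):
    """Generate one token for document d in collection c.

    p_super[d]       : P(T | d)
    p_sub[c][T]      : P(t | T, c)
    p_level[c][T][t] : P(l | t, T), a two-element list for l = 0 and l = 1
    w_super[T]       : P(w | T)
    w_sub[c][t]      : P(w | t)
    """
    T = draw(p_super[d], rng)        # choose super-topic
    t = draw(p_sub[c][T], rng)       # choose sub-topic within collection c
    level = draw(p_level[c][T][t], rng)  # choose hierarchy level
    if level == 0:
        w = draw(w_super[T], rng)    # word from the global super-topic
    else:
        w = draw(w_sub[c][t], rng)   # word from the collection's sub-topic
    return T, t, level, w

# Deterministic toy tables so the path through the hierarchy is easy to follow.
p_super = [[0.0, 1.0]]
p_sub = [[[0.5, 0.5], [1.0, 0.0]]]
p_level = [[[[0.5, 0.5], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]]
w_super = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_sub = [[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]]
result = generate_token(0, 0, p_super, p_sub, p_level, w_super, w_sub, random.Random(0))
```

With these tables the token always takes super-topic 1, sub-topic 0, and level 1, so the word comes from the collection-specific sub-topic.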

  35. Alternative Model • This is just a generalization of ccLDA! • ccLDA = special case, constrained such that for each super-topic T=j there is exactly one sub-topic such that P(t=j|T=j)=1 and P(t=i|T=j)=0 for all i ≠ j
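To see why the constraint recovers ccLDA, marginalise the sub-topic out of the word probability. With the constraint written as a Kronecker delta, only t = j contributes for super-topic T = j, leaving a two-way mixture of a shared and a collection-specific word distribution. Reading P(w | t = j) as specific to collection c follows from each collection having its own sub-topic set; the notation is a sketch rather than the authors' own.

```latex
P(t = i \mid T = j, c) = \delta_{ij}
\;\;\Longrightarrow\;\;
P(w \mid T = j, c)
  = P(l = 0 \mid t = j, T = j)\, P(w \mid T = j)
  + P(l = 1 \mid t = j, T = j)\, P(w \mid t = j)
```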

  36. Alternative Model • Topic of religion in the blogs [figure; value shown: 0.970483]

  37. Alternative Model • Topic of religion in the blogs [figure; value shown: 0.984414]

  38. Alternative Model • Topic of religion in the blogs [figure; values shown: 0.851749, 0.102534]

  39. ccLDA • Topic of religion from the blogs

  40. Alternative Model • Topic of politics in the blogs [figure; values shown: 0.29108, 0.699227]

  41. Alternative Model • Topic of politics in the blogs [figure; value shown: 0.987059]

  42. Alternative Model • Topic of politics in the blogs [figure; value shown: 0.970675]

  43. ccLDA • Topic of politics from the blogs

  44. Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

  45. Questions?
