Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models

Presentation Transcript

  1. Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models • Ramesh Nallapati • Joint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing • Machine Learning Department, Carnegie Mellon University

  2. Introduction • Statistical topic modeling: an attractive framework for topic discovery • Completely unsupervised • Models text very well • Lower perplexity compared to unigram models • Reveals meaningful semantic patterns • Can help summarize and visualize document collections • e.g.: PLSA, LDA, DPM, DTM, CTM, PA (ICDM’07 HPDM workshop)

  3. Introduction • A common assumption in all the variants: • Exchangeability: “bag of words” assumption • Topics represented as a ranked list of words • Consequences: • Word correlation information is lost • e.g.: “white-house” vs. “white” and “house” • Long-distance correlations

  4. Introduction • Objective: • To capture correlations between words within topics • Motivation: • More interpretable representation of topics as a network of words rather than a list • Helps better visualize and summarize document collections • May reveal unexpected relationships and patterns within topics

  5. Past Work: Topic Models • Bigram topic models [Wallach, ICML 2006] • Requires $KV(V-1)$ parameters • Only captures local dependencies • Does not model sparsity of correlations • Does not capture “within-topic” correlations

  6. Past work: Other approaches • Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., ‘96] • Word pair correlation measured as a weighted count of the number of times the two words occur within a fixed-length window • Weight of an occurrence ∝ 1/(mutual distance)
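
The windowed, distance-weighted count described on this slide can be sketched as follows. This is a minimal illustration, not the original HAL implementation: the function name `hal_weights`, the treatment of word pairs as unordered, and the toy sentence are all assumptions made for the example.

```python
from collections import defaultdict


def hal_weights(tokens, window=5):
    """Accumulate 1/distance for every pair of tokens at most `window` apart.

    Pairs are stored as sorted tuples, i.e. treated as unordered
    (an assumption for this sketch).
    """
    weights = defaultdict(float)
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            pair = tuple(sorted((word, tokens[j])))
            weights[pair] += 1.0 / distance  # closer co-occurrences count more
    return dict(weights)


# Hypothetical toy input.
tokens = "white house press white house".split()
weights = hal_weights(tokens, window=2)
for pair, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair, weight)
```

Because the counts are pooled over the whole corpus, a pair such as “bank” and “river” receives a single global weight regardless of the topic in which it occurs.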

  7. Past work: Other approaches • Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., ‘96] • Pluses: • Sparse solutions, scalability • Minuses: • Only unearths global correlations, not semantic correlations • E.g.: “river – bank”, “bank – check” • Only local dependencies

  8. Past work: Other approaches • Query expansion in IR • Similar in spirit: finds words that co-occur frequently with the query words • However, not a corpus visualization tool: requires a context to operate on • WordNet • Semantic networks • Human labeled: not directly related to our goal

  9. Our approach • L1 norm regularization • Known to enforce sparse solutions • Sparsity permits scalability • Convex optimization problem • Globally optimal solutions • Recent advances in learning the structure of graphical models: • The L1 regularization framework asymptotically recovers the true structure

  10. Background: LASSO • Example: linear regression • Regularization used to improve generalizability • E.g. 1: Ridge regression: L2 norm regularization • E.g. 2: Lasso: L1 norm regularization
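
For reference, the two penalized least-squares problems named on this slide can be written as follows. The symbols ($y$ for the response, $X$ for the design matrix, $\beta$ for the coefficients, $\lambda \ge 0$ for the regularization weight) and the factors of $\tfrac{1}{2}$ are conventions assumed here; the slide's own formulas did not survive in the transcript, and scaling conventions vary between texts.

```latex
\begin{align}
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \tfrac{\lambda}{2}\lVert \beta \rVert_2^2 \\
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
\end{align}
```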

  11. Background: LASSO • Lasso encourages sparse solutions
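
One way to see why the lasso gives sparse solutions is the special case of an orthonormal design ($X^\top X = I$). Under the assumed objectives $\tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$ (lasso) and $\tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \tfrac{\lambda}{2}\lVert\beta\rVert_2^2$ (ridge), both problems decouple per coordinate: ridge rescales each least-squares coefficient, while the lasso soft-thresholds it and sets small coefficients exactly to zero. The sketch below uses hypothetical coefficient values.

```python
def ridge_shrink(b, lam):
    """Ridge solution per coordinate under an orthonormal design: b / (1 + lam)."""
    return [bj / (1.0 + lam) for bj in b]


def lasso_shrink(b, lam):
    """Lasso solution per coordinate under an orthonormal design: soft-thresholding."""
    out = []
    for bj in b:
        if bj > lam:
            out.append(bj - lam)
        elif bj < -lam:
            out.append(bj + lam)
        else:
            out.append(0.0)  # coefficients with |b_j| <= lam are removed entirely
    return out


# Hypothetical least-squares coefficients b = X^T y.
b = [3.0, -0.5, 0.2, -2.0]
print("ridge:", ridge_shrink(b, lam=1.0))
print("lasso:", lasso_shrink(b, lam=1.0))
```

Ridge keeps every coefficient nonzero, whereas the lasso discards the two small ones; this is the property the rest of the talk relies on to obtain sparse graphs.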

  12. Background: Gaussian Random Fields • Multivariate Gaussian distribution • Random field structure: $G = (V, E)$ • $V$: set of all variables $\{X_1, \ldots, X_p\}$ • $(s,t) \in E \iff \Sigma^{-1}_{st} \neq 0$ • $X_s \perp X_u \mid X_{N(s)}$ where $u \notin N(s)$
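
A small worked example may help with the link between zeros of $\Sigma^{-1}$ and missing edges. Assume (for illustration only) a three-variable chain $X_1 \to X_2 \to X_3$ with $X_1 = \varepsilon_1$, $X_2 = aX_1 + \varepsilon_2$, $X_3 = aX_2 + \varepsilon_3$ and independent $\varepsilon_i \sim \mathcal{N}(0,1)$. The covariance is dense, but the inverse covariance has a zero exactly at the missing edge $(1,3)$:

```latex
\begin{align}
\Sigma &=
\begin{pmatrix}
1 & a & a^2 \\
a & 1 + a^2 & a(1 + a^2) \\
a^2 & a(1 + a^2) & 1 + a^2 + a^4
\end{pmatrix},
&
\Sigma^{-1} &=
\begin{pmatrix}
1 + a^2 & -a & 0 \\
-a & 1 + a^2 & -a \\
0 & -a & 1
\end{pmatrix}.
\end{align}
```

Here $\Sigma_{13} = a^2 \neq 0$, so $X_1$ and $X_3$ are marginally correlated, while $\Sigma^{-1}_{13} = 0$ encodes $X_1 \perp X_3 \mid X_2$. The inverse follows from expanding the joint log-density $-\tfrac{1}{2}\left[x_1^2 + (x_2 - a x_1)^2 + (x_3 - a x_2)^2\right]$.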

  13. Background: Gaussian Random Fields • Estimating the graph structure of GRF from data [Meinshausen and Bühlmann, Annals. Stats., 2006] • Regress each variable onto the others, imposing an L1 penalty to encourage sparsity • Estimated neighborhood:
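
The neighborhood-selection idea can be sketched in a few lines: for each variable $s$, solve a lasso regression of $X_s$ on the remaining variables and take the indices with nonzero coefficients as the estimated neighborhood, $\hat{N}(s) = \{t : \hat{\beta}^{(s)}_t \neq 0\}$. The sketch below is an illustration under stated assumptions, not the authors' code: it runs coordinate descent directly on a covariance matrix (the exact covariance of a hypothetical three-variable chain with coefficient 0.5) so that the result is deterministic; with real data one would plug in a sample covariance instead.

```python
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_neighborhood(cov, s, lam, sweeps=200):
    """Lasso regression of variable s on all others, via coordinate descent.

    Minimizes 0.5 * b'S b - c'b + lam * ||b||_1, where S is the covariance
    among the other variables and c their covariance with variable s.
    Returns the set of indices with nonzero coefficients.
    """
    others = [j for j in range(len(cov)) if j != s]
    beta = {j: 0.0 for j in others}
    for _ in range(sweeps):
        for j in others:
            partial = cov[j][s] - sum(cov[j][k] * beta[k] for k in others if k != j)
            beta[j] = soft_threshold(partial, lam) / cov[j][j]
    return {j for j in others if beta[j] != 0.0}


# Covariance of the assumed chain X0 -> X1 -> X2 with coefficient a and unit noise.
a = 0.5
cov = [
    [1.0, a, a * a],
    [a, 1 + a * a, a * (1 + a * a)],
    [a * a, a * (1 + a * a), 1 + a * a + a ** 4],
]
neighborhoods = [lasso_neighborhood(cov, s, lam=0.1) for s in range(3)]
print(neighborhoods)
```

In this toy case the end nodes pick up only the middle node as a neighbor, and the middle node picks up both ends, so the chain structure is recovered even though all three variables are marginally correlated.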

  14. Background: Gaussian Random Fields • [Figure: estimated graph vs. true graph; courtesy: Meinshausen and Bühlmann, Annals. Stats., 2006]

  15. Background: Gaussian Random Fields • Application to topic models: CTM [Blei and Lafferty, NIPS, 2006]

  16. Background: Gaussian Random Fields • Application to CTM: [Blei & Lafferty, Annals. Appl. Stats., ‘07]

  17. Structure learning of an MRF • Ising model • L1 regularized conditional likelihood learns true structure asymptotically [Wainwright, Ravikumar and Lafferty, NIPS’06]
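
As a sketch of what “L1 regularized conditional likelihood” means here: in a pairwise binary MRF, the conditional distribution of one node given the rest is a logistic regression, so each neighborhood can be estimated by an L1-penalized logistic regression. The notation below is assumed for illustration; in particular it uses a $\{0,1\}$ coding of the variables, whereas the cited paper's coding and parameterization may differ.

```latex
\begin{align}
P_\theta\!\left(X_s = 1 \mid x_{\setminus s}\right)
  &= \sigma\Big(\theta_s + \sum_{t \neq s} \theta_{st}\, x_t\Big),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \\
\hat{\theta}^{(s)}
  &= \arg\min_{\theta}\; -\frac{1}{n} \sum_{i=1}^{n}
     \log P_\theta\!\left(x_s^{(i)} \mid x_{\setminus s}^{(i)}\right)
     + \lambda \sum_{t \neq s} \lvert \theta_{st} \rvert, \\
\hat{N}(s) &= \{\, t : \hat{\theta}^{(s)}_{st} \neq 0 \,\}.
\end{align}
```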

  18. Structure learning of an MRF • [Figure; courtesy: Wainwright, Ravikumar and Lafferty, NIPS’06]

  19. Sparse Word Graphs • Algorithm • Run LDA on the document collection and obtain topic assignments • Convert topic assignments for each document into K binary vectors X: • Assume an MRF for each topic with X as underlying data • Apply structure learning for MRF using regularized conditional likelihood
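
The conversion step can be illustrated with a small sketch. The slide's exact definition of X is not preserved in the transcript, so the reading used here is an assumption: for each document and each topic k, entry w of the k-th vector is 1 if some occurrence of word w in that document was assigned to topic k, and 0 otherwise. The function name and the toy data are hypothetical.

```python
def topic_indicator_vectors(doc_assignments, vocab, num_topics):
    """Turn one document's (word, topic) assignments into K binary vectors.

    vectors[k][i] == 1 iff vocab[i] occurs in the document with topic k
    (assumed definition; see the lead-in above).
    """
    index = {word: i for i, word in enumerate(vocab)}
    vectors = [[0] * len(vocab) for _ in range(num_topics)]
    for word, topic in doc_assignments:
        vectors[topic][index[word]] = 1
    return vectors


# Hypothetical document: each token paired with the topic LDA assigned to it.
vocab = ["bank", "river", "check"]
doc = [("bank", 0), ("river", 0), ("bank", 1), ("check", 1)]
for k, vec in enumerate(topic_indicator_vectors(doc, vocab, num_topics=2)):
    print(f"topic {k}: {vec}")
```

Stacking the k-th vector of every document gives a binary data matrix for topic k, which is what the per-topic structure learning would operate on. Under this reading the same word (“bank”) can appear in different topics with different neighbors.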

  20. Sparse Word Graphs

  21. Sparse Word Graphs: Scalability • We still run V logistic regression problems, each of size V, for each topic: $O(KV^2)$! • However, each example is very sparse • L1 penalty results in sparse solutions • Can run each topic in parallel • Efficient interior-point-based L1 regularized logistic regression [Koh, Kim & Boyd, JMLR, ’07]
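
To make a single one of these V regression problems concrete, here is a minimal pure-Python sketch of L1-regularized logistic regression. It deliberately swaps in proximal gradient descent (ISTA) for the interior-point solver cited on the slide, purely because it fits in a few lines; the function name, step size, iteration count, and toy data are assumptions. In the word-graph setting, `y` would be the indicator column of one word within a topic and the columns of `X` the indicators of the other words.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def l1_logistic_regression(X, y, lam, step=1.0, iters=2000):
    """Proximal gradient descent on mean logistic loss + lam * ||w||_1.

    The intercept b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        residuals = [
            sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        # Gradient step followed by soft-thresholding (the L1 proximal operator).
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


# Hypothetical toy data: feature 0 matches the label, feature 1 is unrelated to it.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w_small, _ = l1_logistic_regression(X, y, lam=0.05)
w_large, _ = l1_logistic_regression(X, y, lam=1.0)
print("lam=0.05:", w_small)
print("lam=1.0: ", w_large)
```

A large penalty removes every edge, while a moderate one keeps a weight on the informative feature; the nonzero weights are the candidate edges for that word. Since the per-topic problems share no parameters, each can be handed to a separate worker, which is the parallelism the slide refers to.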

  22. Experiments • Small AP corpus • 2.2K docs, 10.5K unique words • Ran a 10-topic LDA model • Used a regularization parameter of 0.1 in L1 logistic regression • Took just 45 min. per topic • Very sparse solutions • Yields fewer than 0.1% of the total number of possible edges
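
A quick back-of-the-envelope calculation puts these figures in perspective. It assumes that “10.5K unique words” means exactly 10,500 and that “possible edges” counts unordered word pairs within one topic; both are assumptions, since the slide gives only rounded numbers.

```python
V = 10_500  # vocabulary size (assumed exact value for "10.5K")
K = 10      # number of LDA topics, from the slide

possible_edges = V * (V - 1) // 2        # unordered word pairs per topic
edge_budget = possible_edges * 0.001     # 0.1% of the possible edges
regression_coefficients = K * V * (V - 1)  # V regressions of size V-1 for each of K topics

print(f"possible edges per topic: {possible_edges:,}")
print(f"0.1% of possible edges:   {edge_budget:,.0f}")
print(f"coefficients overall:     {regression_coefficients:,}")
```

Under these assumptions each topic has about 55 million candidate edges, so a graph with fewer than 0.1% of them has at most roughly 55 thousand edges.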

  23. Topic “Business”: neighborhood of top LDA terms

  24. Topic “Business”: neighborhood of top edges

  25. Topic “War”: neighborhood of top LDA terms

  26. Topic “War”: neighborhood of top edges

  27. Concluding remarks • Pros • A highly scalable algorithm for capturing within-topic word correlations • Captures both short-distance and long-distance correlations • Makes topics more interpretable • Cons • Not a complete probabilistic model • Significant modeling challenge since the correlations are latent

  28. Concluding remarks • Applications of Sparse Word Graphs • Better document summarization and visualization tool • Word sense disambiguation • Semantic query expansion • Future Work • Evaluation on a “real task” • Build a unified statistical model