Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models

Sparse Word Graphs:A Scalable Algorithm for Capturing Word Correlations in Topic Models Ramesh NallapatiJoint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing Machine Learning Department Carnegie Mellon University

Introduction • Statistical topic modeling: an attractive framework for topic discovery • Completely unsupervised • Models text very well • Lower perplexity compared to unigram models • Reveals meaningful semantic patterns • Can help summarize and visualize document collections • e.g.: PLSA, LDA, DPM, DTM, CTM, PA ICDM’07 HPDM workskop

Introduction • A common assumption in all the variants: • Exchangeability: “bag of words” assumption • Topics represented as a ranked list of words • Consequences: • Word Correlation information is lost • e.g.: “white-house” vs. “white” and “house” • Long distance correlations ICDM’07 HPDM workskop

Introduction • Objective: • To capture correlations between words within topics • Motivation: • More interpretable representation of topics as a network of words rather than a list • Helps better visualize and summarize document collections • May reveal unexpected relationships and patterns within topics ICDM’07 HPDM workskop

Past Work: Topic Models • Bigram topic models[Wallach, ICML 2006] • Requires KV(K-1) parameters • Only captures local dependencies • Does not model sparsity of correlations • Does not capture “within-topic” correlations ICDM’07 HPDM workskop

Past work: Other approaches • Hyperspace Analog to Language (HAL) [Lund and Burges, Cog. Sci., ‘96] • Word pair correlation measured as a weighted count of number of times they occur within a fixed length window • Weight of an occurrence / 1/(mutual distance) ICDM’07 HPDM workskop

Past work: Other approaches • Hyperspace Analog to Language (HAL) [Lund and Burges, Cog. Sci., ‘96] • Plusses: • Sparse solutions, scalability • Minuses: • Only unearths global correlations, not semantic correlations • E.g.: “river – bank”, “bank – check” • Only local dependencies ICDM’07 HPDM workskop

Past work: Other approaches • Query expansion in IR • Similar in spirit: finds words that highly co-occur with the query words • However, not a corpus visualization tool: requires a context to operate on • Wordnet • Semantic networks • Human labeled: not directly related to our goal ICDM’07 HPDM workskop

Our approach • L1 norm regularization • Known to enforce sparse solutions • Sparsity permits scalability • Convex optimization problem • Globally optimal solutions • Recent advances in learning structure of graphical models: • L1 regularization framework asymptotically leads to true structure ICDM’07 HPDM workskop

Background:LASSO • Example: linear regression • Regularization used to improve generalizability • E.g.1: Ridge regression: L2 norm regularization • E.g.2: Lasso: L1 norm regularization ICDM’07 HPDM workskop

Background: LASSO • Lasso encourages sparse solutions ICDM’07 HPDM workskop

Background: Gaussian Random Fields • Multivariate Gaussian distribution • Random field structure: G = (V,E) • V: set of all variables {X1,,Xp} • (s,t) 2 E ,-1st 0 • Xs? Xu | XN(s) where u  N(s) ICDM’07 HPDM workskop

Background: Gaussian Random Fields • Estimating the graph structure of GRF from data [Meinshausen and Buhlmann, Annals. Stats., 2006] • Regress each variable onto others imposing L1 penalty to encourage sparsity • Estimated neighborhood: ICDM’07 HPDM workskop

Background: Gaussian Random Fields Estimated graph True Graph Courtesy: [Meinshausen and Buhlmann, Annals. Stats., 2006] ICDM’07 HPDM workskop

Background: Gaussian Random Fields • Application to topic models: CTM [Blei and Lafferty, NIPS, 2006] ICDM’07 HPDM workskop

Background: Gaussian Random Fields • Application to CTM:[Blei & Lafferty, Annals. Appl. Stats., ‘07] ICDM’07 HPDM workskop

Structure learning of an MRF • Ising model • L1 regularized conditional likelihood learns true structure asymptotically [Wainwright, Ravikumar and Lafferty, NIPS’06] ICDM’07 HPDM workskop

Structure learning of an MRF Courtesy: [Wainwright, Ravikumar and Lafferty, NIPS’06] ICDM’07 HPDM workskop

Sparse Word Graphs • Algorithm • Run LDA on the document collection and obtain topic assignments • Convert topic assignments for each document into K binary vectors X: • Assume an MRF for each topic with X as underlying data • Apply structure learning for MRF using regularized conditional likelihood ICDM’07 HPDM workskop

Sparse Word Graphs ICDM’07 HPDM workskop

Sparse Word Graphs: Scalability • We still run V logistic regression problems, each of size V for each topic: O(KV2) ! • However, each example is very sparse • L1 penalty results in sparse solutions • Can run each topic in parallel • Efficient interior point based L1 regularized logistic regression [Koh, Kim & Boyd, JMLR,’07] ICDM’07 HPDM workskop

Experiments • Small AP corpus • 2.2K Docs, 10.5K unique words • Ran 10 topic LDA model • Used  = 0.1 in L1 logistic regression • Took just 45 min. per topic • Very sparse solutions • Computes only under 0.1% of the total number of possible edges ICDM’07 HPDM workskop

Topic “Business”: neighborhood of top LDA terms ICDM’07 HPDM workskop

Topic “Business”: neighborhood of top edges ICDM’07 HPDM workskop

Topic “War”: neighborhood of top LDA terms ICDM’07 HPDM workskop

Topic “War”: neighborhood of top edges ICDM’07 HPDM workskop

Concluding remarks • Pros • A highly scalable algorithm for capturing within topic word correlations • Captures both short distance and long distance correlations • Makes topics more interpretable • Cons • Not a complete probabilistic model • Significant modeling challenge since the correlations are latent ICDM’07 HPDM workskop

Concluding remarks • Applications of Sparse Word Graphs • Better document summarization and visualization tool • Word sense disambiguation • Semantic query expansion • Future Work • Evaluation on a “real task” • Build a unified statistical model ICDM’07 HPDM workskop

Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models

Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models

Presentation Transcript

Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models

Topic 6: word meaning

Models of word reading

Word Within a Word

Word within a word

Word Within a Word

Word Within A Word

Word within a Word

A Word On The Word

Word Within a Word

Sparse Compact Directed Acyclic Word Graphs

Word Within a Word

A Word

JIGSAW: an Algorithm for Word Sense Disambiguation

Word within a Word

Word Within a Word

Word within a Word

172 200 word for word

Word Segmentation Models: Overview

Topic models for corpora and for graphs