Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models


Ramesh Nallapati
Joint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing
Machine Learning Department, Carnegie Mellon University
ICDM’07 HPDM workshop

**Introduction**
• Statistical topic modeling: an attractive framework for topic discovery
  • Completely unsupervised
  • Models text very well
  • Lower perplexity compared to unigram models
  • Reveals meaningful semantic patterns
  • Can help summarize and visualize document collections
• e.g.: PLSA, LDA, DPM, DTM, CTM, PA

**Introduction**
• A common assumption in all the variants:
  • Exchangeability: “bag of words” assumption
  • Topics represented as a ranked list of words
• Consequences:
  • Word correlation information is lost
  • e.g.: “white-house” vs. “white” and “house”
  • Long distance correlations

**Introduction**
• Objective:
  • To capture correlations between words within topics
• Motivation:
  • More interpretable representation of topics as a network of words rather than a list
  • Helps better visualize and summarize document collections
  • May reveal unexpected relationships and patterns within topics

**Past Work: Topic Models**
• Bigram topic models [Wallach, ICML 2006]
  • Requires $KV(V-1)$ parameters
  • Only captures local dependencies
  • Does not model sparsity of correlations
  • Does not capture “within-topic” correlations

**Past work: Other approaches**
• Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., ‘96]
  • Word pair correlation measured as a weighted count of the number of times the two words occur within a fixed-length window
  • Weight of an occurrence $\propto$ 1/(mutual distance)

**Past work: Other approaches**
• Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., ‘96]
• Pluses:
  • Sparse solutions, scalability
• Minuses:
  • Only unearths global correlations, not semantic correlations
  • E.g.: “river – bank”, “bank – check”
  • Only local dependencies

**Past work: Other approaches**
• Query expansion in IR
  • Similar in spirit: finds words that highly co-occur with the query words
  • However, not a corpus visualization tool: requires a context to operate on
• WordNet
  • Semantic networks
  • Human labeled: not directly related to our goal

**Our approach**
• L1 norm regularization
  • Known to enforce sparse solutions
  • Sparsity permits scalability
• Convex optimization problem
  • Globally optimal solutions
• Recent advances in learning the structure of graphical models:
  • The L1 regularization framework asymptotically recovers the true structure

**Background: LASSO**
• Example: linear regression
• Regularization used to improve generalizability
  • E.g. 1: Ridge regression: L2 norm regularization
  • E.g. 2: Lasso: L1 norm regularization

**Background: LASSO**
• Lasso encourages sparse solutions

**Background: Gaussian Random Fields**
• Multivariate Gaussian distribution
• Random field structure: $G = (V, E)$
  • $V$: set of all variables $\{X_1, \ldots, X_p\}$
  • $(s,t) \in E \Leftrightarrow \Sigma^{-1}_{st} \neq 0$
  • $X_s \perp X_u \mid X_{N(s)}$ where $u \notin N(s)$

**Background: Gaussian Random Fields**
• Estimating the graph structure of a GRF from data [Meinshausen and Bühlmann, Annals of Statistics, 2006]
  • Regress each variable onto the others, imposing an L1 penalty to encourage sparsity
  • Estimated neighborhood: the variables with nonzero regression coefficients

**Background: Gaussian Random Fields**
(figure: estimated graph vs. true graph) Courtesy: [Meinshausen and Bühlmann, Annals of Statistics, 2006]

**Background: Gaussian Random Fields**
• Application to topic models: CTM [Blei and Lafferty, NIPS, 2006]

**Background: Gaussian Random Fields**
• Application to CTM: [Blei & Lafferty, Annals of Applied Statistics, ‘07]

**Structure learning of an MRF**
• Ising model
• L1 regularized conditional likelihood learns the true structure asymptotically [Wainwright, Ravikumar and Lafferty, NIPS’06]

**Structure learning of an MRF**
(figure) Courtesy: [Wainwright, Ravikumar and Lafferty, NIPS’06]

**Sparse Word Graphs**
• Algorithm
  • Run LDA on the document collection and obtain topic assignments
  • Convert the topic assignments for each document into K binary vectors X (one per topic)
  • Assume an MRF for each topic with X as the underlying data
  • Apply structure learning for the MRF using L1 regularized conditional likelihood

**Sparse Word Graphs**
(figure)

**Sparse Word Graphs: Scalability**
• We still run V logistic regression problems, each of size V, for each topic: $O(KV^2)$!
• However, each example is very sparse
• L1 penalty results in sparse solutions
• Can run each topic in parallel
• Efficient interior point based L1 regularized logistic regression [Koh, Kim & Boyd, JMLR, ’07]

**Experiments**
• Small AP corpus
  • 2.2K docs, 10.5K unique words
• Ran a 10-topic LDA model
• Used a regularization weight of 0.1 in L1 logistic regression
• Took just 45 min. per topic
• Very sparse solutions
  • Produces fewer than 0.1% of the total number of possible edges

**Topic “Business”: neighborhood of top LDA terms**
(figure)

**Topic “Business”: neighborhood of top edges**
(figure)

**Topic “War”: neighborhood of top LDA terms**
(figure)

**Topic “War”: neighborhood of top edges**
(figure)

**Concluding remarks**
• Pros
  • A highly scalable algorithm for capturing within-topic word correlations
  • Captures both short distance and long distance correlations
  • Makes topics more interpretable
• Cons
  • Not a complete probabilistic model
  • Significant modeling challenge since the correlations are latent

**Concluding remarks**
• Applications of Sparse Word Graphs
  • Better document summarization and visualization tool
  • Word sense disambiguation
  • Semantic query expansion
• Future Work
  • Evaluation on a “real task”
  • Build a unified statistical model
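
The algorithm on the "Sparse Word Graphs" slide can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The function names (`binary_vectors`, `l1_logistic`, `topic_graph`) and the toy topic assignments are hypothetical. A plain proximal-gradient solver stands in for the interior-point method cited in the talk. Joining two words when either regression selects the other is also an assumption. The topic assignments are written by hand here instead of coming from an LDA run.

```python
import math

def binary_vectors(docs, vocab, num_topics):
    """docs: list of documents, each a list of (word, topic) pairs.
    Returns X with X[k][d][w] = 1 if word w is assigned topic k in document d."""
    index = {w: i for i, w in enumerate(vocab)}
    X = [[[0] * len(vocab) for _ in docs] for _ in range(num_topics)]
    for d, doc in enumerate(docs):
        for word, topic in doc:
            X[topic][d][index[word]] = 1
    return X

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def l1_logistic(rows, labels, lam, step=0.5, iters=500):
    """Proximal gradient descent on mean logistic loss + lam * ||w||_1.
    The bias b is not penalized. Assumes small inputs (no overflow guard)."""
    n, p = len(rows), len(rows[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * p, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            gb += err / n
            for j, xj in enumerate(x):
                if xj:  # rows are sparse 0/1 vectors, so skip zeros
                    gw[j] += err * xj / n
        b -= step * gb
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, gw)]
    return w, b

def topic_graph(Xk, vocab, lam):
    """Regress each word on all other words within one topic; every
    nonzero weight becomes an undirected edge (assumed OR rule)."""
    edges = set()
    for s in range(len(vocab)):
        others = [j for j in range(len(vocab)) if j != s]
        labels = [x[s] for x in Xk]
        if len(set(labels)) < 2:  # word never (or always) present: skip
            continue
        rows = [[x[j] for j in others] for x in Xk]
        w, _ = l1_logistic(rows, labels, lam)
        for j, wj in zip(others, w):
            if wj != 0.0:
                edges.add(frozenset((vocab[s], vocab[j])))
    return edges

# Hand-written toy topic assignments (word, topic); purely illustrative.
vocab = ["white", "house", "bank", "river"]
docs = [
    [("white", 0), ("house", 0)],
    [("white", 0), ("house", 0), ("bank", 1)],
    [("bank", 0), ("river", 1)],
    [("river", 0)],
    [("white", 0), ("house", 0)],
    [("bank", 0), ("river", 0)],
    [("river", 0), ("bank", 1)],
    [("white", 0), ("house", 0), ("river", 0)],
]
X = binary_vectors(docs, vocab, num_topics=2)
edges = topic_graph(X[0], vocab, lam=0.1)
print(sorted(tuple(sorted(e)) for e in edges))
```

In this toy data "white" and "house" always share topic 0, so the L1-penalized regressions keep an edge between them. Each topic's graph depends only on that topic's binary vectors, which is why the talk notes that topics can be processed in parallel.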