
The Text Revolution


Presentation Transcript


  1. Statistical Modeling of Large Text Collections. Padhraic Smyth, Department of Computer Science, University of California, Irvine. MURI Project Kick-off Meeting, November 18th 2008

  2. The Text Revolution Widespread availability of text in digital form is driving many new applications based on automated text analysis • Categorization/classification • Automated summarization • Machine translation • Information extraction • And so on….

  3. The Text Revolution (cont.) • Most of this work is happening in computing, but many of the underlying techniques are statistical

  4. Motivation • Pennsylvania Gazette: 80,000 articles, 1728-1800 • MEDLINE: 16 million articles • New York Times: 1.5 million articles

  5. Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • and so on…..

  6. Problems of Interest (cont.) Key Ideas: • Learn a probabilistic model over words and docs • Treat query-answering as computation of appropriate conditional probabilities

  7. Topic Models for Documents P( word | document ) = Σ_topics P( word | topic ) P( topic | document ) • Topic = probability distribution over words • P( topic | document ) = mixing coefficients for each document • Both are learned automatically from the text corpus
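To make the decomposition on slide 7 concrete, here is a minimal NumPy sketch of computing P(word | document) from topic-word and document-topic distributions. The array names (phi, theta) and the toy dimensions are assumptions for illustration, not from the slides:

```python
import numpy as np

# Toy dimensions (assumed for illustration): 3 topics, 5 words, 2 documents.
n_topics, n_words, n_docs = 3, 5, 2
rng = np.random.default_rng(0)

# phi[t, w] = P(word w | topic t): each topic is a distribution over words.
phi = rng.dirichlet(np.ones(n_words), size=n_topics)

# theta[d, t] = P(topic t | document d): per-document mixing coefficients.
theta = rng.dirichlet(np.ones(n_topics), size=n_docs)

# P(word | document) = sum over topics of P(word | topic) * P(topic | document)
p_word_given_doc = theta @ phi          # shape: (n_docs, n_words)

assert np.allclose(p_word_given_doc.sum(axis=1), 1.0)  # each row is a distribution
```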

  8.-9. Topics = Multinomials over Words [figures: example learned topics shown as multinomial distributions over words]

  10. Basic Concepts • Topics = distributions over words • Unknown a priori, learned from data • Documents represented as mixtures of topics • Learning algorithm • Gibbs sampling (stochastic search) • Linear time per iteration • Provides a full probabilistic model over words, documents, and topics • Query answering = computation of conditional probabilities
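The learning algorithm mentioned on slide 10 is the standard collapsed Gibbs sampler for topic models (Griffiths and Steyvers, 2004). The following Python is a minimal sketch of that sampler, not the authors' actual code; the corpus format and hyperparameter values are assumed for illustration:

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for a topic model. docs: list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    ndk = np.zeros((n_docs, n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, n_words))  # topic-word counts
    nk = np.zeros(n_topics)              # total words per topic
    z = []                               # topic assignment per word token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(n_iters):             # one iteration = one pass over all tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]              # remove current assignment from the counts
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # P(topic | all other assignments)
                #   ∝ (ndk + alpha) * (nkw + beta) / (nk + n_words * beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t              # add the resampled assignment back
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return z, ndk, nkw
```

Each sweep touches every word token exactly once, which is the "linear time per iteration" property noted on the slide.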

  11. Enron email data: 250,000 emails, 28,000 individuals, 1999-2002

  12. Enron email: business topics

  13. Enron: non-work topics…

  14. Examples of Topics from New York Times
  • Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS
  • Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS
  • Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX
  • Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP

  15. Topic trends from New York Times (330,000 articles, 2000-2002)
  • Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
  • Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
  • Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING

  16.-19. What does an author write about? (built up over four slides)
  • Author = Jerry Friedman, Stanford:
  • Topic 1: regression, estimate, variance, data, series,…
  • Topic 2: classification, training, accuracy, decision, data,…
  • Topic 3: distance, metric, similarity, measure, nearest,…
  • Author = Rakesh Agrawal, IBM:
  • Topic 1: index, data, update, join, efficient,…
  • Topic 2: query, database, relational, optimization, answer,…
  • Topic 3: data, mining, association, discovery, attributes,…

  20. Examples of Data Sets Modeled • 1,200 Bible chapters (KJV) • 4,000 Blog entries • 20,000 PNAS abstracts • 80,000 Pennsylvania Gazette articles • 250,000 Enron emails • 300,000 North Carolina vehicle accident police reports • 500,000 New York Times articles • 650,000 CiteSeer abstracts • 8 million MEDLINE abstracts • Books by Austen, Dickens, and Melville • ….. • Exactly the same algorithm used in all cases – and in all cases interpretable topics produced automatically

  21. Related Work • Statistical origins • Latent class models in statistics (late 60’s) • Admixture models in genetics • LDA Model: Blei, Ng, and Jordan (2003) • Variational EM • Topic Model: Griffiths and Steyvers (2004) • Collapsed Gibbs sampler • Alternative approaches • Latent semantic indexing (LSI/LSA) • less interpretable, not appropriate for count data • Document clustering: • simpler but less powerful

  22.-24. Clusters v. Topics [figure build: the same set of documents shown first as one cluster, then as a mixture of multiple topics]
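The cluster-vs-topic contrast can be made concrete in a few lines: a document-clustering mixture model must explain every word in a document with a single component, while a topic model assigns topics per word. A minimal sketch with invented toy numbers:

```python
import numpy as np

# Toy parameters (assumed): 2 components over a 4-word vocabulary.
comp = np.array([[0.7, 0.1, 0.1, 0.1],    # component 0 favors word 0
                 [0.1, 0.1, 0.7, 0.1]])   # component 1 favors word 2
doc = [0, 0, 0, 2]                        # a document mixing both vocabularies

# Cluster view: the whole document is forced into ONE component.
log_lik = np.log(comp[:, doc]).sum(axis=1)          # log P(doc | cluster)
print("best single cluster:", log_lik.argmax())     # -> 0

# Topic view: each word gets its own topic; the document is a mixture.
theta = np.array([0.5, 0.5])                        # P(topic | doc), assumed
word_post = comp[:, doc] * theta[:, None]           # unnormalized P(topic | word)
word_post /= word_post.sum(axis=0)
print("P(topic | word) per token:\n", word_post.round(2))  # word 2 goes to topic 1
```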

  25. Extensions • Author-topic models • Authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004) • Special-words model • Documents = mixtures of topics + idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006) • Entity-topic models • Topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006) • See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc • Probabilistic basis allows for a wide range of generalizations

  26.-29. Combining Models for Networks and Text [four figure-only slides]

  30. Technical Approach and Challenges • Develop flexible probabilistic network models that can incorporate textual information • e.g., ERGMs with text as node or edge covariates • e.g., latent space models with text-based covariates • e.g., dynamic relational models with text as edge covariates • Research challenges • Computational scalability • ERGMs not directly applicable to large text data sets • What text representation to use: • High-dimensional “bag of words”? • Low-dimensional latent topics? • Utility of text • Does the incorporation of textual information produce more accurate models or predictions? How can this be quantified?
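As one hypothetical illustration of the latent-space idea above, documents' topic proportions could enter a logistic model of tie probability as an edge covariate. The model form, coefficient values, and inputs below are invented for the example and are not the project's actual formulation:

```python
import numpy as np

def edge_prob(theta_i, theta_j, alpha=-2.0, beta=4.0):
    """P(edge between nodes i and j) with a text covariate:
    cosine similarity of topic proportions enters a logistic link.
    alpha, beta are assumed coefficients; in practice they would
    be fit to an observed network."""
    sim = np.dot(theta_i, theta_j) / (
        np.linalg.norm(theta_i) * np.linalg.norm(theta_j))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * sim)))

# Two authors with similar topic mixtures vs. dissimilar ones.
print(edge_prob(np.array([0.8, 0.1, 0.1]), np.array([0.7, 0.2, 0.1])))  # high
print(edge_prob(np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8])))  # low
```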

  31.-33. Graphical Model [figure build: a group variable z generates words Word 1 … Word n; the words are then collapsed into a plate over n words, and a second plate over D documents is added]

  34. Mixture Model for Documents [figure: group probabilities α → group variable z → word w, with group-word distributions φ; plates over n words and D documents]

  35. Clustering with a Mixture Model [figure: the same model read as clustering, with cluster probabilities α, cluster variable z, and cluster-word distributions φ]

  36. Graphical Model for Topics [figure: document-topic distributions θ → per-word topic variable z → word w, with topic-word distributions φ; plates over n words and D documents]

  37. Learning via Gibbs sampling [figure: same model; a Gibbs sampler estimates z for each word occurrence, marginalizing over the other parameters θ and φ]

  38. More Details on Learning • Gibbs sampling for word-topic assignments (z) • 1 iteration = full pass through all words in all documents • Typically run a few hundred Gibbs iterations • Estimating θ and φ • use z samples to get point estimates • non-informative Dirichlet priors for θ and φ • Computational Efficiency • Learning is linear in the number of word tokens • Can still take on the order of a day for 100k or more docs
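Given a sample of the assignments z (for instance from the sampler sketched after slide 10), the point estimates of θ and φ are smoothed count ratios. A minimal sketch, reusing the assumed count arrays ndk and nkw from that earlier example:

```python
import numpy as np

def point_estimates(ndk, nkw, alpha=0.1, beta=0.01):
    """Posterior mean estimates of theta and phi from one z sample.
    ndk: doc-topic counts, nkw: topic-word counts (as in the sampler sketch).
    alpha, beta are the (assumed) symmetric Dirichlet hyperparameters."""
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi   # theta[d, t] = P(t | d), phi[t, w] = P(w | t)
```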

  39. Gibbs Sampler Stability
