
Author-Topic Models for Large Text Corpora



Presentation Transcript


  1. Author-Topic Models for Large Text Corpora
  Padhraic Smyth, Department of Computer Science, University of California, Irvine
  In collaboration with: Mark Steyvers (UCI), Michal Rosen-Zvi (UCI), Tom Griffiths (Stanford)

  2. Outline
  • Problem motivation: modeling large sets of documents
  • Probabilistic approaches: topic models -> author-topic models
  • Results: author-topic results from CiteSeer, NIPS, Enron data
  • Applications of the model (demo of author-topic query tool)
  • Future directions

  3. Data Sets of Interest
  • Data = set of documents
  • Large collection of documents: 10k, 100k, etc.
  • Know the authors of the documents
  • Know the years/dates of the documents
  • ……
  • (will typically assume a bag-of-words representation)
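The bag-of-words representation mentioned above can be sketched in a few lines (a minimal illustration; the stop-word list here is an arbitrary placeholder, not the one used in the talk):

```python
from collections import Counter

# Tiny illustrative stop-word list (placeholder)
STOP_WORDS = frozenset({"the", "a", "of", "and"})

def bag_of_words(text):
    """Reduce a document to unordered word counts (word order is discarded)."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc = "The learning of probabilistic learning models"
counts = bag_of_words(doc)
```

Everything downstream in the talk (cluster models, topic models, author-topic models) operates on counts like these.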

  4. Examples of Data Sets
  • CiteSeer: 160k abstracts, 80k authors, 1986-2002
  • NIPS papers: 2k papers, 1k authors, 1987-1999
  • Reuters: 20k newspaper articles, 114 authors

  5. Pennsylvania Gazette: 1728-1800, 80,000 articles, 25 million words (www.accessible.com)

  6. Enron email data: 500,000 emails, 5,000 authors, 1999-2002

  7. Problems of Interest
  • What topics do these documents “span”?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • Who is likely to write about topic Y?
  • Who wrote this specific document?
  • and so on…

  8. A topic is represented as a (multinomial) distribution over words

  9. Cluster Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval

  10. Cluster Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval
  Cluster for DOCUMENT 1: P(probabilistic | topic) = 0.25, P(learning | topic) = 0.50, P(bayesian | topic) = 0.25, P(other words | topic) = 0.00
  Cluster for DOCUMENT 2: P(information | topic) = 0.50, P(retrieval | topic) = 0.50, P(other words | topic) = 0.00
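Under a cluster model, each document is scored against a single cluster's word distribution. A toy sketch using the probabilities on this slide (the small floor probability for unseen words is an added assumption to avoid log 0):

```python
import math

# Cluster-word distributions taken from the slide
CLUSTERS = {
    "topic1": {"probabilistic": 0.25, "learning": 0.50, "bayesian": 0.25},
    "topic2": {"information": 0.50, "retrieval": 0.50},
}

FLOOR = 1e-12  # stand-in probability for words a cluster assigns 0 to

def log_likelihood(words, dist):
    """log P(words | cluster) under an independent (bag-of-words) model."""
    return sum(math.log(dist.get(w, FLOOR)) for w in words)

def best_cluster(words):
    """Assign the whole document to the single most likely cluster."""
    return max(CLUSTERS, key=lambda c: log_likelihood(words, CLUSTERS[c]))
```

The key limitation, motivating topic models on the next slides, is that the whole document gets exactly one cluster.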

  11. Graphical Model: z (cluster variable), w (word); plate over the n words

  12. Graphical Model: z (cluster variable), w (word); plates over the n words and the D documents

  13. Graphical Model: α (cluster weights), z (cluster variable), φ (cluster-word distributions), w (word); plates over the n words and the D documents

  14. Cluster Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval
  DOCUMENT 3: Probabilistic, Learning, Information, Retrieval

  15. Topic Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval

  16. Topic Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval
  DOCUMENT 3: Probabilistic, Learning, Information, Retrieval

  17. History of topic models
  • Latent class models in statistics (late 60s)
  • Hofmann (1999): original application to documents
  • Blei, Ng, and Jordan (2001, 2003): variational methods
  • Griffiths and Steyvers (2003, 2004): Gibbs sampling approach (very efficient)

  18. Word/document counts for 16 artificial documents. Can we recover the original topics and topic mixtures from this data?

  19. Example of Gibbs Sampling
  • Assign word tokens randomly to topics (in the original figure, each token is colored by its topic: topic 1 or topic 2)

  20. After 1 iteration
  • Apply the sampling equation to each word token

  21. After 4 iterations

  22. After 32 iterations
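The toy Gibbs procedure illustrated on slides 19-22 can be sketched for a plain topic model (not yet the author-topic variant); the hyperparameters, iteration count, and tiny corpus below are illustrative assumptions:

```python
import random

def gibbs_lda(docs, n_topics=2, alpha=0.5, beta=0.5, iters=32, seed=0):
    """Collapsed Gibbs sampling for a toy topic model.

    docs: list of documents, each a list of word ids in 0..V-1.
    Returns z: one topic assignment per word token.
    """
    rng = random.Random(seed)
    V = 1 + max(w for d in docs for w in d)
    nwt = [[0] * n_topics for _ in range(V)]   # word-topic counts
    ndt = [[0] * n_topics for _ in docs]       # doc-topic counts
    nt = [0] * n_topics                        # topic totals
    z = []
    for d, doc in enumerate(docs):             # random initial assignments
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            nwt[w][t] += 1; ndt[d][t] += 1; nt[t] += 1
        z.append(zd)
    for _ in range(iters):                     # one iteration = full pass
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                    # remove the token's counts
                nwt[w][t] -= 1; ndt[d][t] -= 1; nt[t] -= 1
                # full conditional: p(t) proportional to
                # (nwt + beta)/(nt + V*beta) * (ndt + alpha)
                p = [(nwt[w][k] + beta) / (nt[k] + V * beta)
                     * (ndt[d][k] + alpha) for k in range(n_topics)]
                r = rng.random() * sum(p)
                t = n_topics - 1
                acc = 0.0
                for k in range(n_topics):      # draw from the conditional
                    acc += p[k]
                    if r < acc:
                        t = k
                        break
                z[d][i] = t                    # add the token back
                nwt[w][t] += 1; ndt[d][t] += 1; nt[t] += 1
    return z
```

After enough sweeps the assignments typically separate documents built from disjoint vocabularies into different topics, mirroring the recovery shown on the slides.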

  23. Topic Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval
  DOCUMENT 3: Probabilistic, Learning, Information, Retrieval

  24. Author-Topic Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval

  25. Author-Topic Models
  DOCUMENT 1: Probabilistic, Learning, Learning, Bayesian
  DOCUMENT 2: Information, Retrieval, Information, Retrieval
  DOCUMENT 3: Probabilistic, Learning, Information, Retrieval

  26. Approach
  • The author-topic model
    • a probabilistic model linking authors and topics: authors -> topics -> words
    • learned from data: completely unsupervised, no labels
    • a generative model
  • Different questions or queries can be answered by appropriate probability calculus
    • e.g., p(author | words in document)
    • e.g., p(topic | author)
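One of the queries listed above, p(author | words in document), can be sketched as probability calculus over the learned parameters. The helper below is hypothetical (not from the talk), assumes a uniform prior over candidate authors, and takes θ and φ as nested dicts:

```python
import math

def author_posterior(words, authors, theta, phi):
    """p(author | words): product over words of sum over topics of
    theta[a][z] * phi[z][w], with a uniform prior over `authors`.

    theta[a][z]: author a's probability of topic z
    phi[z][w]:   topic z's probability of word w
    """
    log_scores = {}
    for a in authors:
        log_p = 0.0
        for w in words:
            p_w = sum(theta[a][z] * phi[z].get(w, 0.0) for z in phi)
            log_p += math.log(max(p_w, 1e-300))  # floor avoids log(0)
        log_scores[a] = log_p
    m = max(log_scores.values())                 # normalize stably
    weights = {a: math.exp(s - m) for a, s in log_scores.items()}
    total = sum(weights.values())
    return {a: v / total for a, v in weights.items()}
```

The query p(topic | author) is even simpler: it is just a row of θ.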

  27. Graphical Model: x (author), z (topic)

  28. Graphical Model: x (author), z (topic), w (word)

  29. Graphical Model: x (author), z (topic), w (word); plate over the n words

  30. Graphical Model: α, x (author), z (topic), w (word); plates over the n words and the D documents

  31. Graphical Model: α, x (author), θ (author-topic distributions), z (topic), φ (topic-word distributions), w (word); plates over the n words and the D documents

  32. Generative Process
  • Assume authors A1 and A2 collaborate and produce a paper
    • A1 has multinomial topic distribution θ1
    • A2 has multinomial topic distribution θ2
  • For each word in the paper:
    • Sample an author x (uniformly) from {A1, A2}
    • Sample a topic z from θx
    • Sample a word w from the multinomial word distribution for topic z
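The generative process above can be written out directly. A sketch: `sample_from` and the degenerate example distributions in the usage are illustrative, not from the talk.

```python
import random

def sample_from(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    r = rng.random()
    acc = 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_document(authors, theta, phi, n_words, seed=1):
    """Author-topic generative process for one paper.

    authors: the paper's author list (e.g. ["A1", "A2"])
    theta[a]: author a's topic distribution
    phi[z]:   topic z's word distribution
    """
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)         # sample an author uniformly
        z = sample_from(theta[x], rng)  # sample a topic from theta_x
        w = sample_from(phi[z], rng)    # sample a word from phi_z
        words.append(w)
    return words
```

Learning inverts this process: given the words and author lists, infer x, z, θ, and φ.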

  33. Graphical Model: α, x (author), θ (author-topic distributions), z (topic), φ (topic-word distributions), w (word); plates over the n words and the D documents

  34. Learning
  • Observed: W = observed words, A = sets of known authors
  • Unknown: x, z (hidden variables); Θ, Φ (unknown parameters)
  • Interested in:
    • p(x, z | W, A)
    • p(Θ, Φ | W, A)
  • But exact inference is not tractable

  35. Step 1: Gibbs sampling of x and z (marginalize over the unknown parameters θ and φ)

  36. Step 2: MAP estimates of θ and φ (condition on particular samples of x and z)

  37. Step 2: MAP estimates of θ and φ (point estimates of the unknown parameters)

  38. More Details on Learning
  • Gibbs sampling for x and z
    • Typically run 2000 Gibbs iterations (1 iteration = a full pass through all documents)
  • Estimating θ and φ
    • x and z samples -> point estimates
    • non-informative Dirichlet priors for θ and φ
  • Computational efficiency
    • Learning is linear in the number of word tokens
  • Predictions on new documents
    • can average over θ and φ (from different samples, different runs)
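Conditioned on a single sample of x and z, the point estimates mentioned above take the usual smoothed-count form. Notation assumed here (not spelled out on the slide): $C^{AT}_{kj}$ = number of times topic $j$ is assigned to author $k$, $C^{WT}_{wj}$ = number of times word $w$ is assigned to topic $j$, $T$ topics, $V$ vocabulary size, Dirichlet hyperparameters $\alpha$, $\beta$:

```latex
\hat{\theta}_{kj} = \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha},
\qquad
\hat{\phi}_{wj} = \frac{C^{WT}_{wj} + \beta}{\sum_{w'} C^{WT}_{w'j} + V\beta}
```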

  39. Gibbs Sampling
  • Need the full conditional distributions for the hidden variables
  • The probability of assigning the current word i to topic j and author k, given everything else, depends on:
    • the number of times word w is assigned to topic j
    • the number of times topic j is assigned to author k
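The sampling equation itself appeared as a figure in the original slides; in the standard author-topic formulation it has the form below, where the two factors correspond to the two counts named on the slide and all counts exclude the current token $i$ (same notation as the point estimates: $T$ topics, $V$ vocabulary words, hyperparameters $\alpha$, $\beta$):

```latex
P(z_i = j, x_i = k \mid w_i = w, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a})
\;\propto\;
\frac{C^{WT}_{wj} + \beta}{\sum_{w'} C^{WT}_{w'j} + V\beta}
\cdot
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```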

  40. Experiments on Real Data
  • Corpora
    • CiteSeer: 160K abstracts, 85K authors
    • NIPS: 1.7K papers, 2K authors
    • Enron: 115K emails, 5K authors (senders)
    • PubMed: 27K abstracts, 50K authors
  • Removed stop words; no stemming
  • Ignore word order, just use word counts
  • Processing time:
    • NIPS: 2000 Gibbs iterations ≈ 8 hours
    • CiteSeer: 2000 Gibbs iterations ≈ 4 days

  41. Four example topics from CiteSeer (T=300)

  42. More CiteSeer Topics

  43. Some topics relate to generic word usage

  44. What can the Model be used for?
  • We can analyze our document set through the “topic lens”
  • Applications
    • Queries
      • Who writes on this topic? (e.g., finding experts or reviewers in a particular area)
      • What topics does this person do research on?
    • Discovering trends over time
    • Detecting unusual papers and authors
    • Interactive browsing of a digital library via topics
    • Parsing documents (and parts of documents) by topic
    • and more…

  45. Some likely topics per author (CiteSeer)
  • Author = Andrew McCallum, U Mass:
    • Topic 1: classification, training, generalization, decision, data, …
    • Topic 2: learning, machine, examples, reinforcement, inductive, …
    • Topic 3: retrieval, text, document, information, content, …
  • Author = Hector Garcia-Molina, Stanford:
    • Topic 1: query, index, data, join, processing, aggregate, …
    • Topic 2: transaction, concurrency, copy, permission, distributed, …
    • Topic 3: source, separation, paper, heterogeneous, merging, …
  • Author = Paul Cohen, USC/ISI:
    • Topic 1: agent, multi, coordination, autonomous, intelligent, …
    • Topic 2: planning, action, goal, world, execution, situation, …
    • Topic 3: human, interaction, people, cognitive, social, natural, …

  46. Temporal patterns in topics: hot and cold topics
  • We have CiteSeer papers from 1986-2002
  • For each year, calculate the fraction of words assigned to each topic -> a time-series per topic
  • Hot topics become more prevalent over time
  • Cold topics become less prevalent over time
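The per-year topic fractions described above reduce to a small counting step over one sample of topic assignments (a sketch; the `(year, topic)` pair format for token assignments is an assumption):

```python
from collections import defaultdict

def topic_trends(assignments):
    """Fraction of word tokens assigned to each topic, per year.

    assignments: iterable of (year, topic) pairs, one per word token
    (e.g. taken from a single Gibbs sample over the corpus).
    Returns {(year, topic): fraction}.
    """
    totals = defaultdict(int)      # tokens per year
    per_topic = defaultdict(int)   # tokens per (year, topic)
    for year, topic in assignments:
        totals[year] += 1
        per_topic[(year, topic)] += 1
    return {(y, t): n / totals[y] for (y, t), n in per_topic.items()}
```

A fraction that rises across the years marks a hot topic; one that falls marks a cold topic.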
