Dragon Star Program Course: Information Retrieval. Topic Models for Text Mining

Dragon Star Program Course: Information Retrieval. Topic Models for Text Mining. ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics, University of Illinois, Urbana-Champaign. http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu

Text Management Applications • Access: select information • Mining: create knowledge • Organization: add structure/annotations

What Is Text Mining? “The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001) “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999) (Slide from Rebecca Hwa’s “Intro to Text Mining”)

Two Different Views of Text Mining • Data Mining View (shallow mining): Explore patterns in textual data • Find latent topics • Find topical trends • Find outliers and other hidden patterns • Natural Language Processing View (deep mining): Make inferences based on partial understanding of natural language text • Information extraction • Question answering

Applications of Text Mining • Direct applications: Go beyond search to find knowledge • Question-driven (Bioinformatics, Business Intelligence, etc): We have specific questions; how can we exploit data mining to answer the questions? • Data-driven (WWW, literature, email, customer reviews, etc): We have a lot of data; what can we do with it? • Indirect applications • Assist information access (e.g., discover latent topics to better summarize search results) • Assist information organization (e.g., discover hidden structures)

Text Mining Methods Topic of this lecture • Data Mining Style: View text as high dimensional data • Frequent pattern finding • Association analysis • Outlier detection • Information Retrieval Style: Fine granularity topical analysis • Topic extraction • Exploit term weighting and text similarity measures • Natural Language Processing Style: Information Extraction • Entity extraction • Relation extraction • Sentiment analysis • Question answering • Machine Learning Style: Unsupervised or semi-supervised learning • Mixture models • Dimension reduction

Outline • The Basic Topic Models: • Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99] • Latent Dirichlet Allocation (LDA) [Blei et al. 02] • Extensions • Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06] • Other extensions

Basic Topic Model: PLSA

PLSA: Motivation What did people say in their blog articles about “Hurricane Katrina”? Query = “Hurricane Katrina” Results:

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99] • Mix k multinomial distributions to generate a document • Each document has a potentially different set of mixing weights, which captures the topic coverage • When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model) • We may add a background distribution to “attract” background words
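
The generative story on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function name `generate_document`, the toy word distributions, the mixing weights `pi_d`, and the background weight `lambda_b` are all assumptions made up for the example.

```python
import random

def generate_document(topics, background, pi_d, lambda_b, length, rng):
    """Sample `length` words from a PLSA-style mixture with a background model.

    topics:     list of {word: probability} dicts, one per topic (hypothetical values)
    background: {word: probability} dict for the background distribution
    pi_d:       mixing weights of this document over the topics
    lambda_b:   probability of drawing a word from the background
    """
    words = []
    for _ in range(length):
        if rng.random() < lambda_b:
            dist = background                      # background word
        else:
            # Each word may come from a DIFFERENT topic.
            j = rng.choices(range(len(topics)), weights=pi_d)[0]
            dist = topics[j]
        vocab = list(dist)
        words.append(rng.choices(vocab, weights=[dist[w] for w in vocab])[0])
    return words

# Toy parameters, invented for illustration only.
topics = [
    {"warning": 0.6, "system": 0.4},
    {"aid": 0.5, "donation": 0.3, "support": 0.2},
]
background = {"the": 0.5, "is": 0.3, "a": 0.2}
doc = generate_document(topics, background, pi_d=[0.7, 0.3],
                        lambda_b=0.4, length=20, rng=random.Random(0))
print(doc)
```

In a document clustering model, by contrast, the topic index `j` would be drawn once per document, outside the word loop.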

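PLSA as a Mixture Model: “generating” word w in doc d in the collection • Topic 1, $\theta_1$ (warning 0.3, system 0.2, ...), chosen with weight $\pi_{d,1}$ • Topic 2, $\theta_2$ (aid 0.1, donation 0.05, support 0.02, ...), chosen with weight $\pi_{d,2}$ • … • Topic k, $\theta_k$ (statistics 0.2, loss 0.1, dead 0.05, ...), chosen with weight $\pi_{d,k}$ • Background, $\theta_B$ (is 0.05, the 0.04, a 0.03, ...), chosen with probability $\lambda_B$; the topics are used with probability $1 - \lambda_B$ • Parameters: $\lambda_B$ = noise level (manually set); the $\theta$’s and $\pi$’s are estimated with Maximum Likelihood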
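
Written as an equation, the mixture on this slide gives the word probability and collection log-likelihood below. This is a sketch, assuming $\lambda_B$ is the background (noise) weight, $\pi_{d,j}$ the mixing weight of topic $j$ in document $d$, $\theta_j$ the topic word distributions, and $c(w,d)$ the count of word $w$ in $d$; the symbols $p_d(w)$, $C$ (the collection), $V$ (the vocabulary) and $\Lambda$ (all free parameters) are introduced here for illustration.

```latex
\begin{align*}
p_d(w) &= \lambda_B\, p(w \mid \theta_B)
          + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j) \\
\log p(C \mid \Lambda) &= \sum_{d \in C} \sum_{w \in V} c(w,d)\, \log p_d(w)
\end{align*}
```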

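Special Case: Model-based Feedback • Simple case: there is only one topic • Each word in the feedback documents F comes from one of two sources: background words from $P(w|\theta_B)$, chosen with probability $\lambda$ = P(source = background), or topic words from $P(w|\theta_F)$, chosen with probability $1 - \lambda$ • The topic model $\theta_F$ is estimated with Maximum Likelihood • What if there are k topics?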
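
For the one-topic case, the Maximum Likelihood estimate can be computed with a short EM loop. The sketch below is an illustration under stated assumptions, not the lecture's code: the function name `estimate_feedback_topic`, the toy counts, and the toy background model are invented, and the background weight `lam` is treated as fixed.

```python
def estimate_feedback_topic(counts, background, lam, iterations=50):
    """EM for a two-component mixture: fixed background + one unknown topic.

    counts:     {word: count} in the feedback documents
    background: {word: probability}, known and fixed
    lam:        probability that a word comes from the background
    """
    vocab = list(counts)
    total = sum(counts.values())
    # Start from the plain relative frequencies.
    p_topic = {w: counts[w] / total for w in vocab}
    for _ in range(iterations):
        # E-step: probability that each word was generated by the topic.
        z = {}
        for w in vocab:
            from_topic = (1 - lam) * p_topic[w]
            from_bg = lam * background[w]
            z[w] = from_topic / (from_topic + from_bg)
        # M-step: renormalize the counts attributed to the topic.
        weighted = {w: counts[w] * z[w] for w in vocab}
        norm = sum(weighted.values())
        p_topic = {w: weighted[w] / norm for w in vocab}
    return p_topic

# Toy data, invented for illustration only.
counts = {"the": 10, "hurricane": 5, "katrina": 5}
background = {"the": 0.9, "hurricane": 0.05, "katrina": 0.05}
topic = estimate_feedback_topic(counts, background, lam=0.5)
print(topic)
```

Because the background already explains most occurrences of "the", the estimated topic model shifts probability mass toward the content words.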

How to Estimate j: EM Algorithm ML Estimator the 0.2 a 0.1 we 0.01 to 0.02 … Known Background p(w | B) Observed Doc(s) Suppose, we know the identity of each word ... … text =? mining =? association =? word =? … Unknown topic model p(w|1)=? “Text mining” … Unknown topic model p(w|2)=? “informationretrieval” … information =? retrieval =? query =? document =? …

How the Algorithm Works • Example word counts c(w, d): d1: aid 7, price 5, oil 6; d2: aid 8, price 7, oil 5 • Parameters: mixing weights $\pi_{d_1,1}$ ($P(\theta_1|d_1)$), $\pi_{d_1,2}$ ($P(\theta_2|d_1)$), $\pi_{d_2,1}$ ($P(\theta_1|d_2)$), $\pi_{d_2,2}$ ($P(\theta_2|d_2)$), and the word distributions $P(w|\theta)$ of Topic 1 and Topic 2 • Initialization: set $\pi_{d,j}$ and $P(w|\theta_j)$ to random initial values • Iteration 1, E-step: split the word counts among the topics and the background (by computing the z’s): $c(w,d)\,p(z_{d,w} = B)$ goes to the background and $c(w,d)(1 - p(z_{d,w} = B))\,p(z_{d,w} = j)$ goes to topic j • Iteration 1, M-step: re-estimate $\pi_{d,j}$ and $P(w|\theta_j)$ by adding and normalizing the split word counts • Iteration 2: E-step and M-step again • Iterations 3, 4, 5, …: repeat until convergence
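
The loop described on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function name `plsa_em`, the uniform background, and the value of `lambda_b` are invented; the word counts for d1 and d2 are the ones shown on the slide.

```python
import random

def plsa_em(docs, k, background, lambda_b, iterations=30, seed=0):
    """PLSA with a fixed background model, fitted by EM.

    docs: list of {word: count}; returns (pi, theta) where pi[d][j] is the
    weight of topic j in doc d and theta[j][w] is p(w | topic j).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})

    def random_dist(keys):
        raw = {key: rng.random() + 1e-3 for key in keys}
        norm = sum(raw.values())
        return {key: value / norm for key, value in raw.items()}

    # Initialize pi and theta with random values.
    theta = [random_dist(vocab) for _ in range(k)]
    pi = [list(random_dist(range(k)).values()) for _ in docs]

    for _ in range(iterations):
        topic_counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_pi = []
        for d, doc in enumerate(docs):
            doc_counts = [0.0] * k
            for w, c in doc.items():
                # E-step: split the count c(w, d) between background and topics.
                scores = [pi[d][j] * theta[j][w] for j in range(k)]
                topic_mass = sum(scores)
                bg = lambda_b * background[w]
                p_bg = bg / (bg + (1 - lambda_b) * topic_mass)
                for j in range(k):
                    frac = c * (1 - p_bg) * scores[j] / topic_mass
                    doc_counts[j] += frac
                    topic_counts[j][w] += frac
            # M-step (mixing weights): normalize the split counts per document.
            total = sum(doc_counts)
            new_pi.append([x / total for x in doc_counts])
        pi = new_pi
        # M-step (topic word distributions): normalize the split counts per topic.
        theta = []
        for j in range(k):
            norm = sum(topic_counts[j].values())
            theta.append({w: v / norm for w, v in topic_counts[j].items()})
    return pi, theta

docs = [{"aid": 7, "price": 5, "oil": 6}, {"aid": 8, "price": 7, "oil": 5}]
background = {"aid": 1 / 3, "price": 1 / 3, "oil": 1 / 3}  # assumed uniform
pi, theta = plsa_em(docs, k=2, background=background, lambda_b=0.2)
print(pi)
print(theta)
```

A fixed number of iterations stands in for a convergence test here; in practice one would stop when the log-likelihood changes by less than a small threshold.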

Parameter Estimation • E-step (application of Bayes rule): compute the probability that word w in doc d is generated • from cluster j • from the background • M-step: re-estimate • the mixing weights • the cluster LM • using the fractional counts contributing to • using cluster j in generating d • generating w from cluster j • The sums run over all docs (in multiple collections); m = 1 if there is one collection
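
The update formulas themselves did not survive in this transcript. The block below is a reconstruction from the labels on this slide and the count expressions $c(w,d)\,p(z_{d,w}=B)$ and $c(w,d)(1-p(z_{d,w}=B))\,p(z_{d,w}=j)$ shown on the previous one; it assumes a single collection (m = 1), with $\lambda_B$ the background weight, $\pi_{d,j}$ the mixing weights, and superscripts $(n)$ marking the iteration.

```latex
\begin{align*}
\text{E-step:}\quad
p(z_{d,w} = j) &= \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}
                       {\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})} \\
p(z_{d,w} = B) &= \frac{\lambda_B\, p(w \mid \theta_B)}
                       {\lambda_B\, p(w \mid \theta_B)
                        + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)} \\[6pt]
\text{M-step:}\quad
\pi_{d,j}^{(n+1)} &= \frac{\sum_{w} c(w,d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)}
                          {\sum_{j'=1}^{k} \sum_{w} c(w,d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j')} \\
p^{(n+1)}(w \mid \theta_j) &= \frac{\sum_{d} c(w,d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)}
                                   {\sum_{w'} \sum_{d} c(w',d)\,(1 - p(z_{d,w'} = B))\, p(z_{d,w'} = j)}
\end{align*}
```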

PLSA with Prior Knowledge • There are different ways of choosing aspects (topics) • Google = Google News + Google Maps + Google Scholar, … • Google = Google US + Google France + Google China, … • Users have some domain knowledge in mind, e.g., • We expect to see “retrieval models” as a topic in IR. • We want to show the aspects of “history” and “statistics” for YouTube • A flexible way to incorporate such knowledge is as priors of the PLSA model • In Bayesian terms, the prior is your “belief” about the topic distributions

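Adding Prior: “generating” word w in doc d in the collection • Topic 1, $\theta_1$ (warning 0.3, system 0.2, ...), chosen with weight $\pi_{d,1}$ • Topic 2, $\theta_2$ (aid 0.1, donation 0.05, support 0.02, ...), chosen with weight $\pi_{d,2}$ • … • Topic k, $\theta_k$ (statistics 0.2, loss 0.1, dead 0.05, ...), chosen with weight $\pi_{d,k}$ • Background, $\theta_B$ (is 0.05, the 0.04, a 0.03, ...), chosen with probability $\lambda_B$; the topics are used with probability $1 - \lambda_B$ • Parameters: $\lambda_B$ = noise level (manually set); the $\theta$’s and $\pi$’s are estimated with Maximum Likelihood • With the prior added, we look for the most likely parameter values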

Adding Prior as Pseudo Counts • Known background $p(w|\theta_B)$: the 0.2, a 0.1, we 0.01, to 0.02, … • Unknown topic model $p(w|\theta_1)$ = ? (“text mining”): text = ?, mining = ?, association = ?, word = ?, … • Unknown topic model $p(w|\theta_2)$ = ? (“information retrieval”): information = ?, retrieval = ?, query = ?, document = ?, … • Observed doc(s) plus a pseudo doc of size $\mu$ built from the prior (e.g., containing text, mining) • Suppose we know the identity of each word; then the MAP estimator can be applied

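Maximum A Posteriori (MAP) Estimation • M-step for the topic word distributions with the prior added: $p^{(n+1)}(w|\theta_j) = \frac{\sum_{d} c(w,d)(1 - p(z_{d,w} = B))\,p(z_{d,w} = j) + \mu\,p(w|\theta'_j)}{\sum_{w'} \sum_{d} c(w',d)(1 - p(z_{d,w'} = B))\,p(z_{d,w'} = j) + \mu}$ • $\mu\,p(w|\theta'_j)$: pseudo counts of w from prior $\theta'$ • $\mu$: sum of all pseudo counts • What if $\mu = 0$? What if $\mu = +\infty$?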
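
The two questions at the end of this slide can be checked numerically. The sketch below is an illustration only: the function name `map_estimate`, the fractional counts, and the prior distribution are invented, and the function covers just the word-distribution update for one topic.

```python
def map_estimate(fractional_counts, prior, mu):
    """Add mu * prior[w] pseudo counts to the fractional counts and normalize.

    fractional_counts: {word: expected count assigned to this topic}
    prior:             {word: probability} for the prior topic model
    mu:                total number of pseudo counts (prior strength)
    """
    vocab = list(prior)
    total = sum(fractional_counts.get(w, 0.0) for w in vocab) + mu
    return {
        w: (fractional_counts.get(w, 0.0) + mu * prior[w]) / total
        for w in vocab
    }

# Toy values, invented for illustration only.
counts = {"text": 3.0, "mining": 1.0}
prior = {"text": 0.5, "mining": 0.3, "retrieval": 0.2}

print(map_estimate(counts, prior, mu=0.0))   # prior ignored: plain ML estimate
print(map_estimate(counts, prior, mu=5.0))   # a compromise between data and prior
print(map_estimate(counts, prior, mu=1e9))   # prior dominates the data
```

With `mu = 0` the estimate reduces to the Maximum Likelihood estimate, and as `mu` grows the estimate approaches the prior distribution itself.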

Basic Topic Model: LDA The following slides about LDA are taken from Michael C. Mozer’s course lecture http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

LDA: Motivation (limitations of pLSI) • “Documents have no generative probabilistic semantics” • i.e., a document is just a symbol • Model has many parameters • linear in the number of documents • need heuristic methods to prevent overfitting • Cannot generalize to new documents

Unigram Model

Mixture of Unigrams

Topic Model / Probabilistic LSI • d is a localist representation of (trained) documents • LDA provides a distributed representation
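
The formulas for the three models named on these slides are not preserved in the transcript. For reference, their standard forms are given below, with $\mathbf{w} = (w_1, \ldots, w_N)$ the words of a document, $z$ a topic, and $d$ a training-document index.

```latex
\begin{align*}
\text{Unigram model:}\quad
  p(\mathbf{w}) &= \prod_{n=1}^{N} p(w_n) \\
\text{Mixture of unigrams:}\quad
  p(\mathbf{w}) &= \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z) \\
\text{Probabilistic LSI:}\quad
  p(d, w_n) &= p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
\end{align*}
```

In the pLSI form, $p(z \mid d)$ is defined only for the documents in the training set, which is what the “localist representation” bullet refers to.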

LDA • Vocabulary of |V| words • Document is a collection of words from the vocabulary. • N words in document • w = (w1, ..., wN) • Latent topics • random variable z, with values 1, ..., k • Like the topic model, a document is generated by sampling a topic from a mixture and then sampling a word from that topic. • But the topic model assumes a fixed mixture of topics (multinomial distribution) for each document. • LDA assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document.
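
The LDA sampling process can be sketched with the standard library alone. This is a minimal illustration, not code from the lecture: the names `sample_dirichlet` and `generate_corpus`, the toy vocabulary, and the values of `alpha` and `beta` are assumptions. The Dirichlet draw uses the usual construction of normalizing independent Gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(alpha, beta, vocab, doc_lengths, rng):
    """Generate documents following the LDA generative process.

    alpha: Dirichlet parameters, one per topic
    beta:  beta[z][i] = probability of vocab[i] under topic z
    """
    corpus = []
    for n_words in doc_lengths:                 # loop over documents
        theta = sample_dirichlet(alpha, rng)    # random topic mixture per document
        doc = []
        for _ in range(n_words):                # loop over words in the document
            z = rng.choices(range(len(alpha)), weights=theta)[0]   # topic
            w = rng.choices(vocab, weights=beta[z])[0]             # word
            doc.append(w)
        corpus.append(doc)
    return corpus

# Toy parameters, invented for illustration only.
vocab = ["gene", "worm", "cell", "market", "stock"]
alpha = [0.5, 0.5]
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.5, 0.4],
]
corpus = generate_corpus(alpha, beta, vocab, [6, 6, 6], random.Random(0))
for doc in corpus:
    print(doc)
```

The only difference from the PLSA-style topic model is the line that draws `theta`: there the mixture would be a fixed per-document parameter instead of a random draw.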

Generative Model • “Plates” indicate looping structure • Outer plate replicated for each document • Inner plate replicated for each word • Same conditional distributions apply for each replicate • Document probability
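
The document probability formula is not preserved in this transcript. In the standard LDA formulation, with $\theta$ the per-document topic mixture, $z_n$ the topic of word $w_n$, $\alpha$ the Dirichlet parameter and $\beta$ the topic-word probabilities, the joint and marginal probabilities of a document are:

```latex
\begin{align*}
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha)
     \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
\end{align*}
```

The product over $n$ corresponds to the inner (word) plate; a corpus probability multiplies the second expression over all documents, which corresponds to the outer plate.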

Fancier Version

Inference

Inference • In general, this formula is intractable • In the expanded version, the exponent $w_n^j$ is 1 if $w_n$ is the j'th vocab word (and 0 otherwise)
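
The formulas referred to here are missing from the transcript. Assuming they follow the standard LDA presentation, the quantity of interest is the posterior over the hidden variables, and its normalizer is the intractable term, shown here in expanded form with $w_n^j = 1$ if $w_n$ is the $j$'th vocabulary word and 0 otherwise:

```latex
\begin{align*}
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  &= \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}
          {p(\mathbf{w} \mid \alpha, \beta)} \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \frac{\Gamma\!\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)}
     \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right)
          \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V}
                 (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
\end{align*}
```

The coupling between $\theta$ and $\beta$ inside the sum over topics is what prevents the integral from factorizing.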

Variational Approximation • Compute the log likelihood and introduce Jensen's inequality: log(E[x]) >= E[log(x)] • Find a variational distribution q such that the resulting bound is computable. • q is parameterized by γ and φn • Maximize the bound with respect to γ and φn to obtain the best approximation to p(w | α, β) • This leads to a variational EM algorithm • Sampling algorithms (e.g., Gibbs sampling) are also common
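
The bound obtained from Jensen's inequality can be written out as follows. This is a sketch of the standard derivation, assuming the usual fully factorized variational family, in which $\gamma$ parameterizes a Dirichlet over $\theta$ and each $\phi_n$ a multinomial over $z_n$:

```latex
\begin{align*}
\log p(\mathbf{w} \mid \alpha, \beta)
  &= \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta
   = \log \mathrm{E}_q\!\left[ \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}
                                    {q(\theta, \mathbf{z} \mid \gamma, \phi)} \right] \\
  &\ge \mathrm{E}_q\!\left[ \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \right]
     - \mathrm{E}_q\!\left[ \log q(\theta, \mathbf{z} \mid \gamma, \phi) \right] \\
q(\theta, \mathbf{z} \mid \gamma, \phi)
  &= q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
\end{align*}
```

The gap between the two sides is the KL-divergence from $q$ to the true posterior, so maximizing the bound over $\gamma$ and $\phi$ also makes $q$ the closest member of the family to that posterior.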

Data Sets • C. Elegans Community abstracts • 5,225 abstracts • 28,414 unique terms • TREC AP corpus (subset) • 16,333 newswire articles • 23,075 unique terms • Held-out data – 10% • Removed terms • 50 stop words, words appearing once

C. Elegans • Note: a “fold-in” hack is used for pLSI to allow it to handle novel documents. It involves refitting the p(z|dnew) parameters, which is sort of a cheat

Summary: PLSA vs. LDA • LDA adds a Dirichlet distribution on top of PLSA to regularize the model • Estimation of LDA is more complicated than PLSA • LDA is a generative model, while PLSA isn’t • PLSA is more likely to over-fit the data than LDA • Which one to use? • If you need generalization capacity, LDA • If you want to mine topics from a collection, PLSA may be better (we want overfitting!)

Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)

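A General Introduction to EM • Data: X (observed) + H (hidden); Parameter: $\theta$ • “Incomplete” likelihood: $L(\theta) = \log p(X|\theta)$ • “Complete” likelihood: $L_c(\theta) = \log p(X,H|\theta)$ • EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess $\theta^{(0)}$: 1. E-step: compute the expectation of the complete likelihood, $Q(\theta; \theta^{(n-1)}) = E_{p(H|X,\theta^{(n-1)})}[L_c(\theta)]$ 2. M-step: compute $\theta^{(n)}$ by maximizing the Q-function, $\theta^{(n)} = \arg\max_{\theta} Q(\theta; \theta^{(n-1)})$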

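Convergence Guarantee • Goal: maximizing the “incomplete” likelihood $L(\theta) = \log p(X|\theta)$, i.e., choosing $\theta^{(n)}$ so that $L(\theta^{(n)}) - L(\theta^{(n-1)}) \ge 0$ • Note that, since $p(X,H|\theta) = p(H|X,\theta)\,p(X|\theta)$, we have $L(\theta) = L_c(\theta) - \log p(H|X,\theta)$ • Hence $L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log [p(H|X,\theta^{(n-1)}) / p(H|X,\theta^{(n)})]$ • Taking expectation w.r.t. $p(H|X,\theta^{(n-1)})$ (the left-hand side doesn’t contain H): $L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)}; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)}) + D(p(H|X,\theta^{(n-1)}) \,\|\, p(H|X,\theta^{(n)}))$ • EM chooses $\theta^{(n)}$ to maximize Q, and the KL-divergence D is always non-negative • Therefore, $L(\theta^{(n)}) \ge L(\theta^{(n-1)})$!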
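
The guarantee can be observed on a very small problem. The sketch below is an illustration under stated assumptions, not part of the lecture: it fits only the mixing weight between two known unigram models, and the function name `em_mixing_weight`, the counts, and both distributions are invented. It records the incomplete log-likelihood before every update so the monotone increase can be checked.

```python
import math

def em_mixing_weight(counts, p1, p2, pi=0.5, iterations=20):
    """EM for the weight pi in the mixture pi * p1 + (1 - pi) * p2.

    Returns the final weight and the log-likelihood recorded at each iteration.
    """
    total = sum(counts.values())
    history = []
    for _ in range(iterations):
        # Incomplete log-likelihood L(theta) under the current guess.
        loglik = sum(c * math.log(pi * p1[w] + (1 - pi) * p2[w])
                     for w, c in counts.items())
        history.append(loglik)
        # E-step: posterior probability that each word came from p1.
        post = {w: pi * p1[w] / (pi * p1[w] + (1 - pi) * p2[w]) for w in counts}
        # M-step: the new weight is the expected fraction of words from p1.
        pi = sum(counts[w] * post[w] for w in counts) / total
    return pi, history

# Toy data, invented for illustration only.
counts = {"text": 6, "mining": 3, "the": 11}
p1 = {"text": 0.5, "mining": 0.4, "the": 0.1}
p2 = {"text": 0.05, "mining": 0.05, "the": 0.9}
pi_hat, history = em_mixing_weight(counts, p1, p2)
print(pi_hat)
print(all(b >= a - 1e-9 for a, b in zip(history, history[1:])))
```

The final line compares successive log-likelihood values, allowing a small tolerance for floating-point rounding.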

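Another Way of Looking at EM • Likelihood $p(X|\theta)$: $L(\theta) = L(\theta^{(n-1)}) + Q(\theta; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)}) + D(p(H|X,\theta^{(n-1)}) \,\|\, p(H|X,\theta))$ • Lower bound (Q function): $L(\theta^{(n-1)}) + Q(\theta; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)})$ • Figure labels: likelihood, lower bound, current guess, next guess • E-step = computing the lower bound • M-step = maximizing the lower bound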

Why Contextual PLSA?

Motivating Example: Comparing Product Reviews • IBM Laptop Reviews • APPLE Laptop Reviews • DELL Laptop Reviews • Unsupervised discovery of common topics and their variations

Motivating Example: Comparing News about Similar Topics • Vietnam War • Afghan War • Iraq War • Unsupervised discovery of common topics and their variations

Motivating Example: Discovering Topical Trends in Literature • Figure: theme strength over time (1980, 1990, 1998, 2003) for the themes TF-IDF Retrieval, Language Model, Text Categorization, and IR Applications • Unsupervised discovery of topics and their temporal variations

Motivating Example: Analyzing Spatial Topic Patterns • How do blog writers in different states respond to topics such as “oil price increase during Hurricane Katrina”? • Unsupervised discovery of topics and their variations in different locations

Motivating Example: Sentiment Summary Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

Research Questions • Can we model all these problems generally? • Can we solve these problems with a unified approach? • How can we bring humans into the loop?

Contextual Text Mining • Given collections of text with contextual information (meta-data) • Discover themes/subtopics/topics (interesting word clusters) • Compute variations of themes over contexts • Applications: • Summarizing search results • Federation of text information • Opinion analysis • Social network analysis • Business intelligence • ..

Context Features of Text (Meta-data) • Example: a weblog article • Context features: communities, author, source, location, time, author’s occupation

Context = Partitioning of Text • Example partitions: papers written in 1998; papers about the Web; papers written by authors in the US • Partitioning by time: 1998, 1999, ……, 2005, 2006 • Partitioning by venue: WWW, SIGIR, ACL, KDD, SIGMOD
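
Treating a context as a partition of the collection is easy to express in code. The sketch below is only an illustration: the function name `partition_by_context`, the record fields, and the three paper records are invented.

```python
from collections import defaultdict

def partition_by_context(docs, feature):
    """Group document ids by the value of one context feature (meta-data field)."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[doc[feature]].append(doc["id"])
    return dict(partitions)

# Hypothetical meta-data records, invented for illustration only.
papers = [
    {"id": "p1", "year": 1998, "venue": "SIGIR", "country": "US"},
    {"id": "p2", "year": 1998, "venue": "WWW", "country": "China"},
    {"id": "p3", "year": 2005, "venue": "SIGIR", "country": "US"},
]

by_year = partition_by_context(papers, "year")
by_venue = partition_by_context(papers, "venue")
print(by_year)
print(by_venue)
```

Each feature induces its own partition, so the same paper belongs to one context per feature (for example, both "written in 1998" and "published at SIGIR").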

Themes/Topics • Uses of themes: • Summarize topics/subtopics • Navigate in a document space • Retrieve documents • Segment documents • … • Example themes (word distributions): Theme 1: government 0.3, response 0.2, ...; Theme 2: donate 0.1, relief 0.05, help 0.02, ...; …; Theme k: city 0.2, new 0.1, orleans 0.05, ...; Background B: is 0.05, the 0.04, a 0.03, ... • Example text, with theme segments in brackets: [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …

View of Themes: Context-Specific Version of Views • Figure: two themes, Theme 1: Retrieval Model and Theme 2: Feedback, each shown in two context-specific views, Context: Before 1998 (Traditional models) and Context: After 1998 (Language models) • Words shown in the figure include: vector space, TF-IDF, Okapi, LSI, Rocchio, retrieve, retrieval, weighting, judge, relevance, feedback, expansion, term, document, pseudo, query, language model, mixture model, estimate, smoothing, EM, generation
