
Dragon Star Program Course: Information Retrieval. Topic Models for Text Mining

ChengXiang Zhai (翟成祥)

Department of Computer Science

Graduate School of Library & Information Science

Institute for Genomic Biology, Statistics

University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai

[Figure: functions of a text information system: Access (select information), Mining (create knowledge), Organization (add structure/annotations)]

“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

Shallow mining

Deep mining

- Data Mining View: Explore patterns in textual data
- Find latent topics
- Find topical trends
- Find outliers and other hidden patterns

- Natural Language Processing View: Make inferences based on partial understanding of natural language text
- Information extraction
- Question answering

- Direct applications: Go beyond search to find knowledge
- Question-driven (Bioinformatics, Business Intelligence, etc): We have specific questions; how can we exploit data mining to answer the questions?
- Data-driven (WWW, literature, email, customer reviews, etc): We have a lot of data; what can we do with it?

- Indirect applications
- Assist information access (e.g., discover latent topics to better summarize search results)
- Assist information organization (e.g., discover hidden structures)

Topic of this lecture

- Data Mining Style: View text as high dimensional data
- Frequent pattern finding
- Association analysis
- Outlier detection

- Information Retrieval Style: Fine granularity topical analysis
- Topic extraction
- Exploit term weighting and text similarity measures

- Natural Language Processing Style: Information Extraction
- Entity extraction
- Relation extraction
- Sentiment analysis
- Question answering

- Machine Learning Style: Unsupervised or semi-supervised learning
- Mixture models
- Dimension reduction

- The Basic Topic Models:
- Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
- Latent Dirichlet Allocation (LDA) [Blei et al. 02]

- Extensions
- Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]
- Other extensions

Basic Topic Model: PLSA

What did people say in their blog articles about “Hurricane Katrina”?

Query = “Hurricane Katrina”

Results:

Mix k multinomial distributions to generate a document

Each document has a potentially different set of mixing weights which captures the topic coverage

When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)

We may add a background distribution to “attract” background words
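The generation process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function name `generate_document`, the toy vocabulary, and all probability values are hypothetical, and `lambda_b` stands for the probability of drawing a word from the background distribution.

```python
import numpy as np

def generate_document(topics, background, pi_d, lambda_b, length, rng):
    """Sample word ids for one document from a k-topic mixture with background."""
    words = []
    for _ in range(length):
        if rng.random() < lambda_b:
            dist = background                     # background "attracts" this word
        else:
            j = rng.choice(len(topics), p=pi_d)   # a topic is chosen per WORD
            dist = topics[j]
        words.append(int(rng.choice(len(dist), p=dist)))
    return words

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = ["warning", "system", "aid", "donation", "the", "is"]
topics = np.array([
    [0.5, 0.4, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.5, 0.4, 0.025, 0.025],
])
background = np.array([0.05, 0.05, 0.05, 0.05, 0.4, 0.4])
pi_d = np.array([0.7, 0.3])                       # topic coverage of this document

rng = np.random.default_rng(0)
doc = generate_document(topics, background, pi_d, lambda_b=0.3, length=20, rng=rng)
print([vocab[w] for w in doc])
```

Because the topic is re-drawn for every word, one document can mix several topics, which is the contrast with document clustering drawn above.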

[Figure: "generating" word w in doc d in the collection]

Topic 1 (θ1): warning 0.3, system 0.2, ..
Topic 2 (θ2): aid 0.1, donation 0.05, support 0.02, ..
…
Topic k (θk): statistics 0.2, loss 0.1, dead 0.05, ..
Background B (θB): is 0.05, the 0.04, a 0.03, ..

With probability λB the word w is drawn from the background θB; with probability 1 - λB a topic θj is chosen with mixing weight πd,j and w is drawn from it.

Parameters:

λB = noise-level (manually set)

θ's and π's are estimated with Maximum Likelihood

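[Figure: two sources of words in a document. Background words are drawn from P(w|θB) with probability P(source) = λ; topic words are drawn from P(w|θF) with probability 1 - λ]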

Maximum Likelihood:

What if there are k topics?

Simple case: there is only one topic
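As a hedged sketch of the quantity being maximized, the document log-likelihood can be written as below, first for the single-topic case and then for k topics. The symbols are assumptions that follow the notation used elsewhere in these slides: θ_F is the single topic, θ_B the background, λ (or λ_B) the background probability, π_{d,j} the mixing weights, and c(w,d) the count of word w in document d.

```latex
% Simple case: one topic plus background
\log p(d \mid \theta_F) = \sum_{w \in V} c(w,d)\,
  \log\bigl[\lambda\, p(w \mid \theta_B) + (1-\lambda)\, p(w \mid \theta_F)\bigr]

% General case: k topics plus background
\log p(d) = \sum_{w \in V} c(w,d)\,
  \log\Bigl[\lambda_B\, p(w \mid \theta_B)
    + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)\Bigr]
```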

[Figure: ML estimator applied to the observed doc(s). Suppose we know the identity of each word ...]

Known background p(w|θB): the 0.2, a 0.1, we 0.01, to 0.02, …

Unknown topic model p(w|θ1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …

Unknown topic model p(w|θ2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …

[Table: EM bookkeeping for two documents and two topics. Each count c(w,d) is split into a background part, c(w,d) p(zd,w = B), and topic parts, c(w,d)(1 - p(zd,w = B)) p(zd,w = j). The mixing weights πd,j = P(θj|d) and the topic word distributions P(w|θ) for Topic 1 and Topic 2 start from initial values.]

d1, with πd1,1 = P(θ1|d1) and πd1,2 = P(θ2|d1): c(aid, d1) = 7, c(price, d1) = 5, c(oil, d1) = 6

d2, with πd2,1 = P(θ1|d2) and πd2,2 = P(θ2|d2): c(aid, d2) = 8, c(price, d2) = 7, c(oil, d2) = 5

Initialize πd,j and P(w|θj) with random values

Iteration 1: E Step: split word counts among the different topics (by computing the z's)

Iteration 1: M Step: re-estimate πd,j and P(w|θj) by adding and normalizing the split word counts

Iteration 2: E Step: split word counts among the different topics (by computing the z's)

Iteration 2: M Step: re-estimate πd,j and P(w|θj) by adding and normalizing the split word counts

Iteration 3, 4, 5, … until convergence

- E-Step:
- Word w in doc d is generated
- from cluster j
- from background

Application of Bayes rule

- M-Step:
- Re-estimate
- mixing weights
- cluster LM

- Fractional counts contributing to
- using cluster j in generating d
- generating w from cluster j

Sum over all docs (in multiple collections); m = 1 if there is one collection
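The E-step and M-step described above can be sketched as follows for a single collection (m = 1). This is a minimal illustration, assuming a dense count matrix and a fixed background model; the function name `plsa_em`, the array layout, the small numerical guard, and the toy numbers are assumptions made for this sketch rather than the lecture's code.

```python
import numpy as np

def plsa_em(counts, background, k, lambda_b=0.9, iters=50, seed=0):
    """counts[d, w] = c(w, d); background[w] = p(w | theta_B)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    pi = rng.dirichlet(np.ones(k), size=n_docs)        # pi[d, j], random init
    theta = rng.dirichlet(np.ones(n_words), size=k)    # theta[j, w] = p(w | theta_j)
    tiny = 1e-300                                      # guards against 0/0
    for _ in range(iters):
        # E-step (Bayes rule): was w in d generated from topic j or background?
        joint = pi[:, :, None] * theta[None, :, :]     # shape (docs, topics, words)
        mix = joint.sum(axis=1)                        # sum_j pi[d,j] * p(w|theta_j)
        p_topic = joint / np.maximum(mix[:, None, :], tiny)      # p(z_dw = j)
        bg = lambda_b * background
        p_bg = bg / np.maximum(bg + (1 - lambda_b) * mix, tiny)  # p(z_dw = B)
        # Fractional counts: c(w,d) * (1 - p(z=B)) * p(z=j)
        frac = (counts * (1 - p_bg))[:, None, :] * p_topic
        # M-step: add up the fractional counts and normalize
        pi = frac.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=0)                       # sum over all docs
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta

# Toy counts for the words (aid, price, oil) in two documents.
counts = np.array([[7.0, 5.0, 6.0], [8.0, 7.0, 5.0]])
background = np.full(3, 1.0 / 3.0)
pi, theta = plsa_em(counts, background, k=2, lambda_b=0.5)
print(pi)
print(theta)
```

With several collections, the sum over documents in the update of `theta` would simply run over the documents of all collections.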


- There are different ways of choosing aspects (topics)
- Google = Google News + Google Map + Google scholar, …
- Google = Google US + Google France + Google China, …

- Users have some domain knowledge in mind, e.g.,
- We expect to see “retrieval models” as a topic in IR.
- We want to show the aspects of “history” and “statistics” for Youtube

- A flexible way to incorporate such knowledge as priors of PLSA model
- In Bayesian terms, the prior is your "belief" about the topic distributions

[Figure: the PLSA generation diagram repeated (Topic 1 … Topic k and Background B "generating" word w in doc d, with mixing weights πd,j and noise level λB), labeled "Most likely"]

Parameters:

λB = noise-level (manually set)

θ's and π's are estimated with Maximum Likelihood

[Figure: MAP estimator applied to the observed doc(s) together with a pseudo doc of size μ that contains prior words such as "text" and "mining". Suppose we know the identity of each word ...]

Known background p(w|θB): the 0.2, a 0.1, we 0.01, to 0.02, …

Unknown topic model p(w|θ1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …

Unknown topic model p(w|θ2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …

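MAP estimate of the topic model in the M-step, with prior θ'j:

$$p^{(n+1)}(w|\theta_j) = \frac{\sum_{d} c(w,d)\,(1 - p(z_{d,w} = B))\,p(z_{d,w} = j) + \mu\, p(w|\theta'_j)}{\sum_{w'} \sum_{d} c(w',d)\,(1 - p(z_{d,w'} = B))\,p(z_{d,w'} = j) + \mu}$$

μ p(w|θ'j): pseudo counts of w from prior θ'j

μ: sum of all pseudo counts

What if μ = 0? What if μ = +∞?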
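The two limiting cases can be checked numerically. The sketch below assumes the fractional counts for one topic have already been computed in the E-step; the function name `map_topic_update` and the toy numbers are hypothetical.

```python
import numpy as np

def map_topic_update(frac_counts, prior, mu):
    """MAP re-estimate of p(w | theta_j).

    frac_counts[w]: fractional count of w assigned to topic j in the E-step.
    prior[w]:       p(w | theta'_j), the prior word distribution.
    mu:             sum of all pseudo counts (size of the pseudo document).
    """
    return (frac_counts + mu * prior) / (frac_counts.sum() + mu)

frac = np.array([6.0, 3.0, 1.0])
prior = np.array([0.1, 0.1, 0.8])

ml_like = map_topic_update(frac, prior, mu=0.0)     # mu = 0: prior is ignored
prior_like = map_topic_update(frac, prior, mu=1e9)  # huge mu: prior dominates
print(ml_like, prior_like)
```

With μ = 0 the update reduces to the maximum likelihood estimate, and as μ grows the estimate approaches the prior distribution itself.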


Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer’s course lecture

http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

- “Documents have no generative probabilistic semantics”
- i.e., document is just a symbol

- Model has many parameters
- linear in number of documents
- need heuristic methods to prevent overfitting

- Cannot generalize to new documents

- d is a localist representation of (trained) documents
- LDA provides a distributed representation

- Vocabulary of |V| words
- Document is a collection of words from vocabulary.
- N words in document
- w = (w1, ..., wN)

- random variable z, with values 1, ..., k

- The basic topic model (PLSA) assumes a fixed mixture of topics (a multinomial distribution) for each document.
- LDA assumes a random mixture of topics for each document, drawn from a Dirichlet distribution.

- “Plates” indicate looping structure
- Outer plate replicated for each document
- Inner plate replicated for each word
- Same conditional distributions apply for each replicate
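The plate structure above maps directly onto two nested loops. The following sketch samples a small corpus under the LDA assumptions; the function name `generate_lda_corpus` and the toy values of `alpha` and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np

def generate_lda_corpus(alpha, beta, doc_lengths, rng):
    """alpha: (k,) Dirichlet parameter; beta[z, w] = p(word w | topic z)."""
    k, vocab_size = beta.shape
    corpus = []
    for n_words in doc_lengths:                  # outer plate: one pass per document
        theta = rng.dirichlet(alpha)             # random topic mixture for this doc
        doc = []
        for _ in range(n_words):                 # inner plate: one pass per word
            z = rng.choice(k, p=theta)           # topic assignment for this word
            w = rng.choice(vocab_size, p=beta[z])  # word drawn from that topic
            doc.append(int(w))
        corpus.append(doc)
    return corpus

alpha = np.array([0.5, 0.5])
beta = np.array([
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])
rng = np.random.default_rng(0)
corpus = generate_lda_corpus(alpha, beta, doc_lengths=[5, 8, 3], rng=rng)
print(corpus)
```

The same `alpha` and `beta` are reused for every replicate; only `theta` and the per-word topic `z` are redrawn.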

- In general, this formula is intractable:
- Expanded version:

1 if wn is the j'th vocab word
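For reference, the marginal likelihood of a document under LDA and its expanded form can be written as below. This is a sketch in the standard LDA notation (Dirichlet parameter α, topic-word matrix β, topic mixture θ, topic assignments z_n), which is assumed here.

```latex
% Marginal likelihood of a document w = (w_1, ..., w_N)
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha)
  \Bigl( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Bigr)\, d\theta

% Expanded version; w_n^j = 1 if w_n is the j'th vocab word, 0 otherwise
p(\mathbf{w} \mid \alpha, \beta) =
  \frac{\Gamma\bigl(\sum_{i} \alpha_i\bigr)}{\prod_{i} \Gamma(\alpha_i)}
  \int \Bigl( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \Bigr)
       \Bigl( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V}
              (\theta_i \beta_{ij})^{w_n^j} \Bigr)\, d\theta
```

The coupling between θ and β inside the sum over topics is what makes the integral intractable.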

- Computing log likelihood and introducing Jensen's inequality: log(E[x]) >= E[log(x)]
- Find variational distribution q such that the above equation is computable.
- q parameterized by γ and φn
- Maximize bound with respect to γ and φn to obtain best approximation to p(w | α, β)
- Leads to a variational EM algorithm

- Sampling algorithms (e.g., Gibbs sampling) are also common
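As one illustration of the sampling alternative, here is a minimal sketch of a collapsed Gibbs sampler. It assumes the smoothed variant of LDA, with a symmetric Dirichlet prior `eta` on the topic word distributions in addition to `alpha`; that prior, the function name `lda_gibbs`, and the toy documents are assumptions of this sketch, not part of the slides.

```python
import numpy as np

def lda_gibbs(docs, k, vocab_size, alpha=0.1, eta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling; returns doc-topic and topic-word count tables."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), k))       # words in doc d assigned to topic t
    n_kw = np.zeros((k, vocab_size))      # times word w is assigned to topic t
    n_k = np.zeros(k)                     # total words assigned to topic t
    z = [rng.integers(k, size=len(doc)) for doc in docs]   # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d, t] += 1
            n_kw[t, w] += 1
            n_k[t] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]               # remove the current assignment
                n_dk[d, t] -= 1
                n_kw[t, w] -= 1
                n_k[t] -= 1
                # conditional probability of each topic given all other assignments
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + vocab_size * eta)
                t = rng.choice(k, p=p / p.sum())
                z[d][i] = t               # record the new assignment
                n_dk[d, t] += 1
                n_kw[t, w] += 1
                n_k[t] += 1
    return n_dk, n_kw

docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 1, 3, 4]]   # word ids, hypothetical
n_dk, n_kw = lda_gibbs(docs, k=2, vocab_size=6)
print(n_dk)
print(n_kw)
```

Normalizing the rows of `n_kw + eta` and `n_dk + alpha` gives point estimates of the topic word distributions and the per-document topic mixtures.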

Data Sets

- C. Elegans Community abstracts
- 5,225 abstracts
- 28,414 unique terms

- TREC AP corpus (subset)
- 16,333 newswire articles
- 23,075 unique terms

- Removed: 50 stop words, words appearing once

C. Elegans

Note: fold in hack for pLSI to allow it to handle novel documents.

Involves refitting p(z|dnew) parameters -> sort of a cheat

AP

- LDA adds a Dirichlet distribution on top of PLSA to regularize the model
- Estimation of LDA is more complicated than PLSA
- LDA is a generative model, while PLSA isn’t
- PLSA is more likely to over-fit the data than LDA
- Which one to use?
- If you need generalization capacity, LDA
- If you want to mine topics from a collection, PLSA may be better (we want overfitting!)

Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)

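Data: $X$ (observed) + $H$ (hidden). Parameter: $\theta$

"Incomplete" likelihood: $L(\theta) = \log p(X|\theta)$

"Complete" likelihood: $L_c(\theta) = \log p(X,H|\theta)$

EM tries to iteratively maximize the incomplete likelihood:

Starting with an initial guess $\theta^{(0)}$,

1. E-step: compute the expectation of the complete likelihood, $Q(\theta; \theta^{(n-1)}) = E_{p(H|X,\theta^{(n-1)})}[L_c(\theta)]$

2. M-step: compute $\theta^{(n)}$ by maximizing the Q-function, $\theta^{(n)} = \arg\max_{\theta} Q(\theta; \theta^{(n-1)})$

Goal: maximizing "incomplete" likelihood: $L(\theta) = \log p(X|\theta)$

I.e., choosing $\theta^{(n)}$ so that $L(\theta^{(n)}) - L(\theta^{(n-1)}) \ge 0$

Note that, since $p(X,H|\theta) = p(H|X,\theta)\,p(X|\theta)$, $L(\theta) = L_c(\theta) - \log p(H|X,\theta)$

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log \frac{p(H|X,\theta^{(n-1)})}{p(H|X,\theta^{(n)})}$$

Taking expectation w.r.t. $p(H|X,\theta^{(n-1)})$,

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)};\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\bigl(p(H|X,\theta^{(n-1)}) \,\|\, p(H|X,\theta^{(n)})\bigr)$$

The left-hand side doesn't contain $H$, so the expectation leaves it unchanged

EM chooses $\theta^{(n)}$ to maximize $Q$, so the difference of the two $Q$ terms is non-negative

$D$ is the KL-divergence, always non-negative

Therefore, $L(\theta^{(n)}) \ge L(\theta^{(n-1)})$!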

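[Figure: likelihood $p(X|\theta)$ as a function of $\theta$, with the current guess $\theta^{(n-1)}$, the next guess $\theta^{(n)}$, and the lower bound touching the likelihood at the current guess]

Likelihood: $L(\theta) = L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\bigl(p(H|X,\theta^{(n-1)}) \,\|\, p(H|X,\theta)\bigr)$

Lower bound (Q function): $L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)})$

E-step = computing the lower bound

M-step = maximizing the lower bound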
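The monotonicity argument can be checked numerically on a tiny problem. The sketch below assumes a two-component mixture whose component distributions are known and whose only unknown parameter is the mixing weight; the function names, the toy counts, and the toy distributions are hypothetical.

```python
import numpy as np

def log_likelihood(lam, counts, p_a, p_b):
    """Incomplete log-likelihood L(lam) of the observed word counts."""
    return float(np.sum(counts * np.log(lam * p_a + (1 - lam) * p_b)))

def em_mixing_weight(counts, p_a, p_b, lam=0.5, iters=20):
    """EM for the weight lam of component A in the mixture lam*p_a + (1-lam)*p_b."""
    history = [log_likelihood(lam, counts, p_a, p_b)]
    for _ in range(iters):
        # E-step: posterior probability that each word came from component A
        post_a = lam * p_a / (lam * p_a + (1 - lam) * p_b)
        # M-step: re-estimate lam from the expected counts
        lam = float(np.sum(counts * post_a) / np.sum(counts))
        history.append(log_likelihood(lam, counts, p_a, p_b))
    return lam, history

counts = np.array([30.0, 10.0, 5.0, 5.0])
p_a = np.array([0.7, 0.1, 0.1, 0.1])
p_b = np.array([0.1, 0.3, 0.3, 0.3])
lam, history = em_mixing_weight(counts, p_a, p_b)
print(lam)
print(history[:3], history[-1])
```

Each entry of `history` should be at least as large as the one before it, which is the guarantee derived above; EM may still stop at a local maximum in models with more parameters.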

Why Contextual PLSA?

IBM laptop reviews, Apple laptop reviews, Dell laptop reviews

Unsupervised discovery of common topics and their variations

Vietnam War

Afghan War

Iraq War

Unsupervised discovery of common topics and their variations

[Figure: theme strength over time (1980, 1990, 1998, 2003) for the themes TF-IDF Retrieval, Language Model, Text Categorization, and IR Applications]

Unsupervised discovery of topics and their temporal variations

- How do blog writers in different states respond to topics such as “oil price increase during Hurricane Katrina”?
- Unsupervised discovery of topics and their variations in different locations

Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

- Can we model all these problems generally?
- Can we solve these problems with a unified approach?
- How can we bring humans into the loop?

- Given collections of text with contextual information (meta-data)
- Discover themes/subtopics/topics (interesting word clusters)
- Compute variations of themes over contexts
- Applications:
- Summarizing search results
- Federation of text information
- Opinion analysis
- Social network analysis
- Business intelligence
- ..

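[Figure: context of a document. A weblog article carries context such as communities, author, source, location, time, and author's occupation. Contexts partition a collection of papers, e.g., papers written in 1998, papers about the Web, papers written by authors in the US; by year (1998, 1999, …, 2005, 2006); by venue (WWW, SIGIR, ACL, KDD, SIGMOD)]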

- Uses of themes:
- Summarize topics/subtopics
- Navigate in a document space
- Retrieve documents
- Segment documents
- …

Theme 1: government 0.3, response 0.2, ..
Theme 2: donate 0.1, relief 0.05, help 0.02, ..
…
Theme k: city 0.2, new 0.1, orleans 0.05, ..
Background B: is 0.05, the 0.04, a 0.03, ..

[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. …

[Figure: views of themes in two contexts. Theme 1: Retrieval Model; Theme 2: Feedback. Context "Before 1998 (Traditional models)" and context "After 1998 (Language models)" each give a different view (word cluster) of the same theme. Words shown include: vector, space, TF-IDF, Okapi, LSI, Rocchio, weighting, retrieve, retrieval, model, feedback, judge, relevance, expansion, term, document, pseudo, query, language, mixture, estimate, smoothing, EM, generation]

[Figure: coverage of the themes Oil Price, Government Response, Aid and donation, and Background in two contexts, Context: Texas and Context: Louisiana]

Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

- Theme coverage can depend on context

- Theme Extraction: Extract the global salient themes
- Common information shared over all contexts

- View Comparison: Compare a theme from different views
- Analyze the content variation of themes over contexts

- Coverage Comparison: Compare the theme coverage of different contexts
- Reveal how closely a theme is associated to a context

- Others:
- Causal analysis
- Correlation analysis

- CPLSA = Contextual Probabilistic Latent Semantic Analysis
- An extension of PLSA model ([Hofmann 99]) by
- Introducing context variables
- Modeling views of topics
- Modeling coverage variations of topics

- Process of contextual text mining
- Instantiation of CPLSA (context, views, coverage)
- Fit the model to text data (EM algorithm)
- Compute probabilistic topic patterns

[Figure: CPLSA generation process. The themes (government 0.3, response 0.2, ..; donate 0.1, relief 0.05, help 0.02, ..; city 0.2, new 0.1, orleans 0.05, ..) appear in several views (View1, View2, View3), with top words such as government, response; donate, donation, help, aid; new, Orleans. The document context (Time = July 2005, Location = Texas, Author = xxx, Occup. = Sociologist, Age Group = 45+, …) determines the candidate views and theme coverages (e.g., those of Texas, July 2005, sociologist, or the document itself). For each word: choose a view, choose a coverage, choose a theme, then draw a word from θi.]

- To generate a document D with context feature set C:
- Choose a view vi according to the view distribution p(vi|D,C)
- Choose a coverage κj according to the coverage distribution p(κj|D,C)
- Choose a theme θil according to the coverage κj
- Generate a word using θil
- The likelihood of the document collection is:
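One way to write the likelihood implied by the four generation steps above is sketched below. The symbols are assumptions of this sketch: n views v_i, m coverages κ_j, k themes per view with word distributions θ_{il}, p(l|κ_j) for the probability of theme l under coverage κ_j, and c(w,D) for the count of word w in document D.

```latex
\log p(\mathcal{C}) = \sum_{(D,C) \in \mathcal{C}} \sum_{w \in V} c(w,D)\,
  \log \Bigl( \sum_{i=1}^{n} p(v_i \mid D, C)
              \sum_{j=1}^{m} p(\kappa_j \mid D, C)
              \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il}) \Bigr)
```

Each sum inside the logarithm marginalizes one of the hidden choices (view, coverage, theme) made when generating a word.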

- Interesting patterns:
- Theme content variation for each view:
- Theme strength variation for each context

- Prior from a user can be incorporated using MAP estimation

- Why?
- Generality comes with high complexity (inefficient estimation, multiple local maxima)
- Real applications have domain constraints/knowledge

- Two useful simplifications:
- Fixed-Coverage: Only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis )
- Fixed-View: Only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)

- In general
- Impose priors on model parameters
- Support the whole spectrum from unsupervised to supervised learning

[Figure: topic labeling pipeline. Statistical (multinomial) topic models, e.g., term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …, are matched against a candidate label pool extracted from the collection (context) with an NLP chunker or n-gram statistics, e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …. A relevance score followed by re-ranking (coverage; discrimination) yields a ranked list of labels, e.g., clustering algorithm; distance measure; …]

- Intuition: prefer phrases covering high probability words

[Figure: a latent topic θ with word distribution p(w|θ) over words such as clustering, dimensional, algorithm, birch, shape, body, …. Good label (l1): "clustering algorithm"; bad label (l2): "body shape"]

C: SIGMOD Proceedings

- Intuition: prefer phrases with similar context (distribution)

[Figure: word distributions P(w|θ) for the topic, P(w|l1) for the good label (l1) "clustering algorithm", and P(w|l2) for the bad label (l2) "hash join", over words such as clustering, dimension, partition, algorithm, join, hash, …. Score(l, θ) compares the label's distribution with the topic's:]

D(θ||l1) < D(θ||l2)
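The comparison of divergences can be turned into a simple ranking. The sketch below scores each candidate label by the negative KL divergence from the topic's word distribution to the label's context distribution. This is a simplification made for illustration and not necessarily the exact scoring function used in the lecture; the function names and all probability values are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for two discrete distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def rank_labels(topic, label_dists):
    """Rank labels so that the one with the smallest D(topic || label) comes first."""
    scores = {label: -kl_divergence(topic, dist) for label, dist in label_dists.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Vocabulary order: clustering, dimension, partition, algorithm, join, hash
topic = [0.30, 0.20, 0.15, 0.25, 0.05, 0.05]
label_dists = {
    "clustering algorithm": [0.35, 0.15, 0.10, 0.30, 0.05, 0.05],
    "hash join": [0.05, 0.05, 0.10, 0.10, 0.35, 0.35],
}
ranked = rank_labels(topic, label_dists)
print(ranked)
```

A label whose context distribution puts mass on the same words as the topic gets a small divergence and therefore ranks first.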

- Comparative text mining
- Spatiotemporal pattern mining
- Sentiment summary
- Event impact analysis
- Temporal author-topic analysis

The common theme indicates that “United Nations” is involved in both wars

Collection-specific themes indicate different roles of “United Nations” in the two wars

Top words serve as “labels” for common themes

(e.g., [sound, speakers], [battery, hours], [cd,drive])

These word distributions can be used to segment text and

add hyperlinks between documents

- Query= “Hurricane Katrina”
- Topics in the results:
- Spatiotemporal patterns

Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …

Week1: The theme is the strongest along the Gulf of Mexico

Week2: The discussion moves towards the north and west

Week3: The theme distributes more uniformly over the states

Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week5: The theme fades out in most states

Global theme life cycles of KDD abstracts. Themes shown:

- gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
- marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
- rules 0.0142, association 0.0064, support 0.0053, …

[Figure: theme evolution graph over time T, 1999 to 2004. Themes in the graph include:]

- web 0.009, classification 0.007, features 0.006, topic 0.005, …
- SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
- mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
- topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
- decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
- classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
- information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …

Facet: the book “The Da Vinci Code” (bursts during the movie, Pos > Neg)

Facet: religious beliefs (bursts during the movie, Neg > Pos)

Theme: retrieval models (SIGIR papers)

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

[Figure: timeline (year) with two events: 1992, starting of the TREC conferences; 1998, publication of the paper “A language modeling approach to information retrieval”. Views of the theme around these events:]

- xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …
- vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
- model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …
- probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …

[Figure: temporal author-topic analysis. Global theme: frequent patterns. Authors compared: Jiawei Han and Rakesh Agrawal (labeled Author A and Author B in the figure); the time axis is split at 2000. Views of the theme by author and period:]

- close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, …
- index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, …
- project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, …
- pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, …
- research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, …

Community 1:

Information Retrieval

Community 2:

Data Mining

Community 3:

Machine Learning


- Many extensions of LDA, mostly done by David Blei, Andrew McCallum and their co-authors
- Some examples:
- Hierarchical topic models [Blei et al. 03]
- Modeling annotated data [Blei & Jordan 03]
- Dynamic topic models [Blei & Lafferty 06]
- Pachinko allocation [Li & McCallum 06]

- Also, some specific context extension of PLSA, e.g., author-topic model [Steyvers et al. 04]

- Topic models for text mining
- Evaluation of topic models
- Improve the efficiency of estimation and inferences
- Incorporate linguistic knowledge
- Applications in new domains and for new tasks

- Text mining in general
- Combination of NLP-style and DM-style mining algorithms
- Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)
- Interactive mining:
- Incorporate user constraints and support iterative mining
- Design and implement mining languages

- How PLSA works
- How EM algorithm works in general
- Contextual PLSA can be used to perform many quite different interesting text mining tasks

- This lecture: Topic models for text mining
- Next lecture: Next generation search engines