
龙星计划课程 (Dragon Star Program Course): Information Retrieval
Topic Models for Text Mining

ChengXiang Zhai (翟成祥)

Department of Computer Science

Graduate School of Library & Information Science

Institute for Genomic Biology, Statistics

University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai, [email protected]


Text Management Applications

[Diagram: three interrelated functions of text management: Access (select information), Mining (create knowledge), and Organization (add structure/annotations).]



What Is Text Mining?

“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)


Two Different Views of Text Mining

  • Data Mining View (shallow mining): Explore patterns in textual data

    • Find latent topics

    • Find topical trends

    • Find outliers and other hidden patterns

  • Natural Language Processing View (deep mining): Make inferences based on partial understanding of natural language text

    • Information extraction

    • Question answering



Applications of Text Mining

  • Direct applications: Go beyond search to find knowledge

    • Question-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer them?

    • Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?

  • Indirect applications

    • Assist information access (e.g., discover latent topics to better summarize search results)

    • Assist information organization (e.g., discover hidden structures)


Text Mining Methods

  • Data Mining Style: View text as high-dimensional data

    • Frequent pattern finding

    • Association analysis

    • Outlier detection

  • Information Retrieval Style: Fine-granularity topical analysis (the topic of this lecture)

    • Topic extraction

    • Exploit term weighting and text similarity measures

  • Natural Language Processing Style: Information Extraction

    • Entity extraction

    • Relation extraction

    • Sentiment analysis

    • Question answering

  • Machine Learning Style: Unsupervised or semi-supervised learning

    • Mixture models

    • Dimension reduction



Outline

  • The Basic Topic Models:

    • Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]

    • Latent Dirichlet Allocation (LDA) [Blei et al. 02]

  • Extensions

    • Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]

    • Other extensions



Basic Topic Model: PLSA



PLSA: Motivation

What did people say in their blog articles about “Hurricane Katrina”?

Query = “Hurricane Katrina”

Results: [screenshot of retrieved blog articles]



Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]

Mix k multinomial distributions to generate a document.

Each document has a potentially different set of mixing weights, which capture its topic coverage.

When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)

We may add a background distribution to “attract” background words
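This mixture defines the word likelihood $p(w \mid d) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$. A minimal sketch in Python; the distributions and weights below are toy values for illustration, not estimates:

```python
# Toy PLSA word likelihood with a background component.
lambda_B = 0.9  # noise level, manually set
background = {"is": 0.05, "the": 0.04, "a": 0.03}              # p(w|theta_B)
topics = [{"warning": 0.3, "system": 0.2},                     # p(w|theta_1)
          {"aid": 0.1, "donation": 0.05, "support": 0.02}]     # p(w|theta_2)
pi_d = [0.7, 0.3]  # mixing weights pi_{d,j}: document d's topic coverage

def p_word_given_doc(w):
    """p(w|d) = lambda_B * p(w|B) + (1 - lambda_B) * sum_j pi_{d,j} * p(w|theta_j)."""
    topical = sum(pi * t.get(w, 0.0) for pi, t in zip(pi_d, topics))
    return lambda_B * background.get(w, 0.0) + (1 - lambda_B) * topical

print(p_word_given_doc("aid"))  # (1 - 0.9) * (0.3 * 0.1) = 0.003
```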


PLSA as a Mixture Model

[Diagram: "generating" word w in doc d in the collection. Document d mixes k topic distributions θ_1 ... θ_k with document-specific weights π_{d,1} ... π_{d,k}; with probability λ_B a word is instead drawn from a background distribution θ_B. Example topics: Topic 1 (warning 0.3, system 0.2, ...), Topic 2 (aid 0.1, donation 0.05, support 0.02, ...), Topic k (statistics 0.2, loss 0.1, dead 0.05, ...); Background (is 0.05, the 0.04, a 0.03, ...).]

Parameters:

  • λ_B = noise level (manually set)

  • The π's and θ's are estimated with Maximum Likelihood



Special Case: Model-based Feedback

[Diagram: a two-component mixture. With probability λ a word is drawn from the background model P(w|θ_B) (background words); with probability 1 − λ it is drawn from the feedback topic model P(w|θ_F) (topic words). P(source) gives these mixing weights.]

Simple case: there is only one topic, and θ_F is estimated by Maximum Likelihood. What if there are k topics?
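For the single-topic case the EM updates can be written in a few lines. A sketch, assuming a fixed background model and noise level λ (function and variable names here are illustrative):

```python
from collections import Counter

def estimate_feedback_model(docs, p_background, lam=0.9, iters=50):
    """EM for the two-component mixture: fixed background + one topic model.
    E-step: p(z=B|w) = lam*p(w|B) / (lam*p(w|B) + (1-lam)*p(w|theta_F))
    M-step: p(w|theta_F) proportional to c(w) * (1 - p(z=B|w))."""
    counts = Counter(w for doc in docs for w in doc)
    p_topic = {w: 1.0 / len(counts) for w in counts}   # uniform initialization
    for _ in range(iters):
        # E-step: posterior that each word occurrence came from the background
        p_zB = {w: lam * p_background.get(w, 1e-9) /
                   (lam * p_background.get(w, 1e-9) + (1 - lam) * p_topic[w])
                for w in counts}
        # M-step: keep only the topical fraction of each count, then normalize
        frac = {w: counts[w] * (1.0 - p_zB[w]) for w in counts}
        total = sum(frac.values())
        p_topic = {w: c / total for w, c in frac.items()}
    return p_topic  # high-probability words are the discriminative topic words
```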


How to Estimate θ_j: EM Algorithm

[Diagram: Maximum Likelihood estimation setup. Known: the background model p(w|θ_B) (the 0.2, a 0.1, to 0.02, we 0.01, ...) and the observed doc(s). Unknown: the topic models to estimate, p(w|θ_1) for "Text mining" (text = ?, mining = ?, association = ?, word = ?, ...) and p(w|θ_2) for "information retrieval" (information = ?, retrieval = ?, query = ?, document = ?, ...).]

Suppose we knew the identity (hidden topic label) of each word: estimation would reduce to counting. EM iteratively fills in these hidden labels.



How the Algorithm Works

[Diagram: EM on a toy collection with two documents and three words, word counts c(w,d): d1 (aid 7, price 5, oil 6) and d2 (aid 8, price 7, oil 5), and two topics θ_1, θ_2.

Initialize π_{d,j} (= P(θ_j|d)) and P(w|θ_j) with random values. Then iterate:

  • E-step: split each word count among topics by computing the hidden-variable posteriors z; topic j receives c(w,d)(1 − p(z_{d,w} = B)) p(z_{d,w} = j) and the background receives c(w,d) p(z_{d,w} = B).

  • M-step: re-estimate π_{d,j} and P(w|θ_j) by adding up and normalizing the split word counts.

Repeat for iterations 2, 3, 4, 5, ... until converging.]
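A runnable sketch of these iterations on the toy counts above, omitting the background component for brevity (so the E-step split is just p(z_{d,w} = j) ∝ π_{d,j} P(w|θ_j)):

```python
import random

# Toy two-topic PLSA EM on the slide's counts c(w,d).
docs = {"d1": {"aid": 7, "price": 5, "oil": 6},
        "d2": {"aid": 8, "price": 7, "oil": 5}}
words, K = ["aid", "price", "oil"], 2

random.seed(0)
pi = {d: [0.5, 0.5] for d in docs}                 # pi_{d,j}, initial value
theta = []
for _ in range(K):                                 # random initial P(w|theta_j)
    r = {w: random.random() for w in words}
    s = sum(r.values())
    theta.append({w: v / s for w, v in r.items()})

for _ in range(100):
    # E-step: split word counts between topics by computing the z posteriors
    z = {}
    for d in docs:
        for w in words:
            p = [pi[d][j] * theta[j][w] for j in range(K)]
            s = sum(p)
            z[d, w] = [x / s for x in p]
    # M-step: re-estimate pi_{d,j} and P(w|theta_j) from the split counts
    for d in docs:
        tot = [sum(docs[d][w] * z[d, w][j] for w in words) for j in range(K)]
        s = sum(tot)
        pi[d] = [t / s for t in tot]
    for j in range(K):
        new = {w: sum(docs[d][w] * z[d, w][j] for d in docs) for w in words}
        s = sum(new.values())
        theta[j] = {w: c / s for w, c in new.items()}

print(pi)     # per-document topic coverage after convergence
print(theta)  # per-topic word distributions
```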



Parameter Estimation

  • E-Step: compute the probability that word w in doc d was generated from cluster j vs. from the background (an application of Bayes' rule)

  • M-Step: re-estimate the mixing weights and the cluster language models using the fractional counts from the E-step: the counts contributed by using cluster j in generating d, and by generating w from cluster j. The sums run over all docs (in multiple collections; m = 1 if there is one collection).
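The update formulas on the original slide are images; written out in the notation of the PLSA-with-background model above, the standard updates are as follows.

E-step (Bayes' rule):

$$p(z_{d,w} = j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}{\sum_{j'} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}, \qquad p(z_{d,w} = B) = \frac{\lambda_B\, p(w \mid \theta_B)}{\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j'} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}$$

M-step (pool and normalize the fractional counts):

$$\pi_{d,j}^{(n+1)} = \frac{\sum_w c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\, p(z_{d,w}=j)}{\sum_{j'} \sum_w c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\, p(z_{d,w}=j')}, \qquad p^{(n+1)}(w \mid \theta_j) = \frac{\sum_d c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\, p(z_{d,w}=j)}{\sum_{w'} \sum_d c(w',d)\,\big(1 - p(z_{d,w'}=B)\big)\, p(z_{d,w'}=j)}$$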



PLSA with Prior Knowledge

  • There are different ways of choosing aspects (topics)

    • Google = Google News + Google Maps + Google Scholar + …

    • Google = Google US + Google France + Google China + …

  • Users have some domain knowledge in mind, e.g.,

    • We expect to see "retrieval models" as a topic in IR.

    • We want to show the aspects "history" and "statistics" for YouTube.

  • We need a flexible way to incorporate such knowledge as priors of the PLSA model

  • In Bayesian terms, the prior encodes your "belief" about the topic distributions



Adding Prior

[Diagram: the same PLSA mixture model as before (topics θ_1 ... θ_k with weights π_{d,1} ... π_{d,k}, plus background θ_B with noise level λ_B), now annotated with a prior: estimation should prefer the most likely θ given both the data and the prior.]

Parameters:

  • λ_B = noise level (manually set)

  • With the prior added, the π's and θ's are estimated with Maximum A Posteriori (MAP) estimation (next slides)



Adding Prior as Pseudo Counts

[Diagram: MAP estimation. Same setup as the ML diagram: a known background model p(w|θ_B) (the 0.2, a 0.1, to 0.02, we 0.01, ...), observed doc(s), and unknown topic models p(w|θ_1) ("Text mining": text = ?, mining = ?, association = ?, word = ?, ...) and p(w|θ_2) ("information retrieval": information = ?, retrieval = ?, query = ?, document = ?, ...). The prior is encoded as a pseudo doc of size μ containing the expected topic words (e.g., "text", "mining"); its pseudo counts are added to the observed counts during estimation.]


Maximum A Posteriori (MAP) Estimation

With a conjugate (Dirichlet) prior, the M-step update for p(w|θ_j) simply gains pseudo counts:

$$p(w \mid \theta_j) = \frac{\sum_d c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\, p(z_{d,w}=j) \;+\; \mu\, p(w \mid \theta'_j)}{\sum_{w'} \sum_d c(w',d)\,\big(1 - p(z_{d,w'}=B)\big)\, p(z_{d,w'}=j) \;+\; \mu}$$

Here $\mu\, p(w \mid \theta'_j)$ is the pseudo count of w from the prior $\theta'$, and $\mu$ is the sum of all pseudo counts.

What if μ = 0? (The prior vanishes and MAP reduces to ML.) What if μ = +∞? (The prior dominates and θ_j = θ'_j.)



Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer’s course lecture

http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/



LDA: Motivation (shortcomings of pLSI)

  • “Documents have no generative probabilistic semantics”

    • i.e., document is just a symbol

  • Model has many parameters

    • linear in number of documents

    • need heuristic methods to prevent overfitting

  • Cannot generalize to new documents



Unigram Model



Mixture of Unigrams



Topic Model / Probabilistic LSI

  • d is a localist representation of (trained) documents

  • LDA provides a distributed representation



LDA

  • Vocabulary of |V| words

  • Document is a collection of words from vocabulary.

    • N words in document

    • w = (w1, ..., wN)

  • Latent topics

    • random variable z, with values 1, ..., k

  • Like topic model, document is generated by sampling a topic from a mixture and then sampling a word from a mixture.

    • But the topic model (pLSI) assumes a fixed mixture of topics (multinomial distribution) for each document.

    • LDA assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document.


Generative Model

    • “Plates” indicate looping structure

      • Outer plate replicated for each document

      • Inner plate replicated for each word

      • Same conditional distributions apply for each replicate

  • Document probability
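The formula referenced here is an image on the slide; it is the standard LDA document probability from Blei et al.:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$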


Fancier Version


Inference

    • In general, exact inference requires the posterior $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$, whose normalizer $p(\mathbf{w} \mid \alpha, \beta)$ is intractable.

    • Expanded version (the coupling of $\theta$ and $\beta$ inside the sum over topics is what makes it intractable):

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$$

      where $w_n^j = 1$ if $w_n$ is the j-th vocabulary word and 0 otherwise.



    Variational Approximation

    • Computing log likelihood and introducing Jensen's inequality: log(E[x]) >= E[log(x)]

    • Find variational distribution q such that the above equation is computable.

      • q parameterized by γ and φn

      • Maximize bound with respect to γ and φn to obtain best approximation to p(w | α, β)

      • This leads to a variational EM algorithm

    • Sampling algorithms (e.g., Gibbs sampling) are also common
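A minimal collapsed Gibbs sampler for LDA is easy to sketch (following the well-known update $p(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{w_i, j} + \beta}{n_{\cdot, j} + V\beta}\,(n_{d_i, j} + \alpha)$ of Griffiths & Steyvers; the hyperparameter values and corpus format below are illustrative):

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = random.Random(seed)
    nwt = [[0] * K for _ in range(V)]          # topic counts per word
    ndt = [[0] * K for _ in range(len(docs))]  # topic counts per document
    nt = [0] * K                               # total words per topic
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            nwt[w][t] += 1; ndt[d][t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the current assignment from the counts
                nwt[w][t] -= 1; ndt[d][t] -= 1; nt[t] -= 1
                # full conditional p(z=k | rest), up to a constant
                p = [(nwt[w][k] + beta) / (nt[k] + V * beta) * (ndt[d][k] + alpha)
                     for k in range(K)]
                r = rng.random() * sum(p)      # sample from the unnormalized p
                t, acc = 0, p[0]
                while acc < r:
                    t += 1
                    acc += p[t]
                z[d][i] = t                    # restore counts with the new topic
                nwt[w][t] += 1; ndt[d][t] += 1; nt[t] += 1
    return nwt, ndt
```

From the returned counts one recovers $\hat{p}(w \mid z = k) \propto n_{w,k} + \beta$ and $\hat{p}(z = k \mid d) \propto n_{d,k} + \alpha$.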


Data Sets

  • C. Elegans community abstracts

    • 5,225 abstracts

    • 28,414 unique terms

  • TREC AP corpus (subset)

    • 16,333 newswire articles

    • 23,075 unique terms

  • Held-out data: 10%

  • Removed terms: 50 stop words, and words appearing once


C. Elegans

    Note: a "fold-in" hack is needed for pLSI to handle novel documents; it involves refitting the p(z|d_new) parameters, which is something of a cheat.


AP



    Summary: PLSA vs. LDA

    • LDA adds a Dirichlet distribution on top of PLSA to regularize the model

    • Estimation of LDA is more complicated than PLSA

    • LDA is a generative model, while PLSA isn’t

    • PLSA is more likely to over-fit the data than LDA

    • Which one to use?

      • If you need generalization capacity, LDA

      • If you want to mine topics from a collection, PLSA may be better (we want overfitting!)



    Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)



    A General Introduction to EM

Data: X (observed) + H (hidden); parameter: θ

"Incomplete" likelihood: $L(\theta) = \log p(X \mid \theta)$

"Complete" likelihood: $L_c(\theta) = \log p(X, H \mid \theta)$

EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess $\theta^{(0)}$:

1. E-step: compute the expectation of the complete likelihood, $Q(\theta; \theta^{(n-1)}) = E_{p(H \mid X, \theta^{(n-1)})}\big[L_c(\theta)\big]$

2. M-step: compute $\theta^{(n)}$ by maximizing the Q-function, $\theta^{(n)} = \arg\max_{\theta} Q(\theta; \theta^{(n-1)})$



    Convergence Guarantee

Goal: maximize the "incomplete" likelihood $L(\theta) = \log p(X \mid \theta)$, i.e., choose $\theta^{(n)}$ so that $L(\theta^{(n)}) - L(\theta^{(n-1)}) \ge 0$.

Note that, since $p(X, H \mid \theta) = p(H \mid X, \theta)\, p(X \mid \theta)$, we have $L(\theta) = L_c(\theta) - \log p(H \mid X, \theta)$, and therefore

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log \frac{p(H \mid X, \theta^{(n-1)})}{p(H \mid X, \theta^{(n)})}$$

Taking the expectation w.r.t. $p(H \mid X, \theta^{(n-1)})$ (the left side doesn't contain H, so it is unchanged):

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)}; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)}) + D\big(p(H \mid X, \theta^{(n-1)}) \,\big\|\, p(H \mid X, \theta^{(n)})\big)$$

EM chooses $\theta^{(n)}$ to maximize Q, and the KL-divergence term is always non-negative. Therefore $L(\theta^{(n)}) \ge L(\theta^{(n-1)})$!


Another Way of Looking at EM

    Likelihood p(X| )

    L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) )+ D(p(H|X,  (n-1) )||p(H|X,  ))

    L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) )

    next guess

    current guess

    Lower bound

    (Q function)

    E-step = computing the lower bound

    M-step = maximizing the lower bound



    Why Contextual PLSA?



    Motivating Example:Comparing Product Reviews

[Diagram: three collections of laptop reviews (IBM, Apple, Dell) feeding into unsupervised discovery of common topics and their variations.]



    Motivating Example:Comparing News about Similar Topics

[Diagram: three collections of news articles (Vietnam War, Afghan War, Iraq War) feeding into unsupervised discovery of common topics and their variations.]



    Motivating Example:Discovering Topical Trends in Literature

[Figure: theme strength over time (1980, 1990, 1998, 2003) for IR applications, with curves for TF-IDF retrieval, language models, and text categorization. Unsupervised discovery of topics and their temporal variations.]



    Motivating Example:Analyzing Spatial Topic Patterns

    • How do blog writers in different states respond to topics such as "oil price increase during Hurricane Katrina"?

    • Unsupervised discovery of topics and their variations in different locations



    Motivating Example: Sentiment Summary

    Unsupervised/semi-supervised discovery of topics and the different sentiments associated with them



    Research Questions

    • Can we model all these problems generally?

    • Can we solve these problems with a unified approach?

    • How can we bring humans into the loop?



    Contextual Text Mining

    • Given collections of text with contextual information (meta-data)

    • Discover themes/subtopics/topics (interesting word clusters)

    • Compute variations of themes over contexts

    • Applications:

      • Summarizing search results

      • Federation of text information

      • Opinion analysis

      • Social network analysis

      • Business intelligence

      • ..



    Context Features of Text (Meta-data)

[Diagram: a weblog article surrounded by its context features: author, author's occupation, time, location, source, communities.]



    Context = Partitioning of Text

[Diagram: a corpus partitioned along context dimensions. Columns: years (1998, 1999, ..., 2005, 2006); rows: venues (WWW, SIGIR, ACL, KDD, SIGMOD). Example partitions: papers written in 1998; papers about the Web; papers written by authors in the US.]



    Themes/Topics

    • Uses of themes:

      • Summarize topics/subtopics

      • Navigate in a document space

      • Retrieve documents

      • Segment documents

    [Example: themes as word distributions. Theme 1: government 0.3, response 0.2, ...; Theme 2: donate 0.1, relief 0.05, help 0.02, ...; Theme k: city 0.2, new 0.1, orleans 0.05, ...; Background: is 0.05, the 0.04, a 0.03, .... A news excerpt segmented by theme: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …"]



    View of Themes: Context-Specific Version of Views

    [Diagram: a theme has context-specific views. Theme 1 "Retrieval Model" and Theme 2 "Feedback" appear as different word clusters under two contexts. Context: before 1998 (traditional models): retrieval-model words such as vector, space, TF-IDF, Okapi, LSI, Rocchio, retrieve, model, retrieval, weighting, term, document; feedback words such as feedback, judge, relevance, expansion, pseudo, query. Context: after 1998 (language models): retrieval-model words such as query, language, mixture, model, estimate, smoothing, EM; feedback words such as feedback, generation, pseudo, query.]



    Coverage of Themes: Distribution over Themes

    [Diagram: a document excerpt (criticism of government response, shut-in oil production in the Gulf of Mexico, pledged donations) with its theme coverage shown as a distribution over themes (Oil Price, Government Response, Aid and Donation, Background) under two contexts: in Texas the coverage emphasizes Oil Price, while in Louisiana it emphasizes Government Response.]

    • Theme coverage can depend on context



    General Tasks of Contextual Text Mining

    • Theme Extraction: extract the globally salient themes

      • Common information shared across all contexts

    • View Comparison: compare a theme from different views

      • Analyze the content variation of themes over contexts

    • Coverage Comparison: compare the theme coverage of different contexts

      • Reveal how closely a theme is associated with a context

    • Others:

      • Causal analysis

      • Correlation analysis



    A General Solution: CPLSA

    • CPLSA = Contextual Probabilistic Latent Semantic Analysis

    • An extension of PLSA model ([Hofmann 99]) by

      • Introducing context variables

      • Modeling views of topics

      • Modeling coverage variations of topics

    • Process of contextual text mining

      • Instantiation of CPLSA (context, views, coverage)

      • Fit the model to text data (EM algorithm)

      • Compute probabilistic topic patterns



    “Generation” Process of CPLSA

    [Diagram: generating a document under CPLSA. Document context: Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+. Themes (government 0.3, response 0.2, ...; donate 0.1, relief 0.05, help 0.02, ...; city 0.2, new 0.1, orleans 0.05, ...) have context-specific views (View1, View2, View3). Theme coverages are attached to contexts (Texas, July 2005, sociologist, the document itself, ...). To generate each word: choose a view, choose a coverage, choose a theme θ_i according to the coverage, then draw a word from θ_i (e.g., government, response, new, Orleans, donate, donation, help, aid).]



    Probabilistic Model

    • To generate a document D with context feature set C:

      • Choose a view v_i according to the view distribution

      • Choose a coverage κ_j according to the coverage distribution

      • Choose a theme θ_l according to the coverage κ_j

      • Generate a word using θ_l (under the chosen view)

      • The likelihood of the document collection is:
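    The likelihood formula itself is an image on the slide. A hedged reconstruction of its general shape, following the generation steps above (the exact notation in [Mei & Zhai 06] differs in details):

$$\log p(\mathcal{D}) = \sum_{(D, C) \in \mathcal{D}} \sum_{w} c(w, D) \log \sum_{i} \sum_{j} \sum_{l} p(v_i \mid D, C)\; p(\kappa_j \mid D, C)\; p(\theta_l \mid \kappa_j)\; p(w \mid \theta_l, v_i)$$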



    Parameter Estimation: EM Algorithm

    • Interesting patterns:

      • Theme content variation for each view

      • Theme strength variation for each context

    • Prior from a user can be incorporated using MAP estimation



    Regularization of the Model

    • Why?

      • Generality high complexity (inefficient, multiple local maxima)

      • Real applications have domain constraints/knowledge

    • Two useful simplifications:

      • Fixed-Coverage: Only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis )

      • Fixed-View: Only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)

    • In general

      • Impose priors on model parameters

      • Support the whole spectrum from unsupervised to supervised learning



    Interpretation of Topics

    [Diagram: automatic labeling of multinomial topic models. A statistical topic model, i.e., a word distribution (term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173), is matched against a candidate label pool (database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, ...) extracted from the collection (context) by an NLP chunker and n-gram statistics. Candidates are scored for relevance (coverage; discrimination) and re-ranked, producing a ranked list of labels (e.g., "clustering algorithm"; "distance measure").]



    Relevance: the Zero-Order Score

    • Intuition: prefer phrases covering high-probability words

    [Diagram: a latent topic θ with word distribution p(w|θ) over clustering, dimensional, algorithm, birch, shape, body, .... The good label (l1) "clustering algorithm" covers the topic's high-probability words; the bad label (l2) "body shape" covers only low-probability ones.]


    Relevance: the First-Order Score

    • Intuition: prefer phrases with a similar context (distribution)

    [Diagram: context C = SIGMOD Proceedings. The topic's word distribution P(w|θ) (clustering, dimension, partition, algorithm, hash, join, ...) is compared with the context distributions of candidate labels, P(w|l1) for the good label "clustering algorithm" and P(w|l2) for the bad label "hash join". The label score Score(l, θ) follows from D(θ || l1) < D(θ || l2): l1's context matches the topic better.]



    Sample Results

    • Comparative text mining

    • Spatiotemporal pattern mining

    • Sentiment summary

    • Event impact analysis

    • Temporal author-topic analysis


    Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)

    The common theme indicates that “United Nations” is involved in both wars

    Collection-specific themes indicate different roles of “United Nations” in the two wars



    Comparing Laptop Reviews

    Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).

    These word distributions can be used to segment text and add hyperlinks between documents.



    Spatiotemporal Patterns in Blog Articles

    • Query= “Hurricane Katrina”

    • Topics in the results:

    • Spatiotemporal patterns



    Theme Life Cycles for Hurricane Katrina

    [Figure: theme life cycles (strength over time). Theme "Oil Price": price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182. Theme "New Orleans": city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177.]



    Theme Snapshots for Hurricane Katrina

    Week 1: The theme is the strongest along the Gulf of Mexico.

    Week 2: The discussion moves towards the north and west.

    Week 3: The theme distributes more uniformly over the states.

    Week 4: The theme is again strong along the east coast and the Gulf of Mexico.

    Week 5: The theme fades out in most states.



    Theme Life Cycles: KDD

    [Figure: global theme life cycles of KDD abstracts. Example themes: (gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, ...); (marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, ...); (rules 0.0142, association 0.0064, support 0.0053, ...).]



    Theme Evolution Graph: KDD

    [Figure: theme evolution graph over KDD years 1999-2004. Example theme nodes: (web 0.009, classification 0.007, features 0.006, topic 0.005, ...); (SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, ...); (mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, ...); (topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, ...); (decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, ...); (classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, ...); (information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, ...).]



    Blog Sentiment Summary (query=“Da Vinci Code”)



    Results: Sentiment Dynamics

    Facet: the book "The Da Vinci Code" (bursts during the movie; Pos > Neg)

    Facet: religious beliefs (bursts during the movie; Neg > Pos)



    Event Impact Analysis: IR Research

    Theme: retrieval models (SIGIR papers)

    [Figure: impact of two events on the "retrieval models" theme in SIGIR papers: the start of the TREC conferences (1992) and the publication of "A language modeling approach to information retrieval" (1998). Theme variants shown as word distributions: (term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173); (xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079); (vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077); (model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059); (probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111).]



    Temporal-Author-Topic Analysis

    [Figure: temporal author-topic analysis for the global theme "frequent patterns", comparing two authors (Author A and Author B, e.g., Jiawei Han and Rakesh Agrawal) over time (around 2000 onward). Topic variants shown as word distributions: (close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102); (index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109); (project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170); (pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138); (research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154).]



    Modeling Topical Communities (Mei et al. 08)

    Community 1:

    Information Retrieval

    Community 2:

    Data Mining

    Community 3:

    Machine Learning




    Other Extensions (LDA Extensions)

    • Many extensions of LDA, mostly done by David Blei, Andrew McCallum and their co-authors

    • Some examples:

      • Hierarchical topic models [Blei et al. 03]

      • Modeling annotated data [Blei & Jordan 03]

      • Dynamic topic models [Blei & Lafferty 06]

      • Pachinko allocation [Li & McCallum 06]

    • Also, some context-specific extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]



    Future Research Directions

    • Topic models for text mining

      • Evaluation of topic models

      • Improve the efficiency of estimation and inferences

      • Incorporate linguistic knowledge

      • Applications in new domains and for new tasks

    • Text mining in general

      • Combination of NLP-style and DM-style mining algorithms

      • Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)

      • Interactive mining:

        • Incorporate user constraints and support iterative mining

        • Design and implement mining languages



    What You Should Know

    • How PLSA works

    • How the EM algorithm works in general

    • How contextual PLSA can be used to perform many different and interesting text mining tasks



    Roadmap

    • This lecture: Topic models for text mining

    • Next lecture: Next generation search engines

