Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai

University of Illinois at Urbana-Champaign


Outline

  • Background: statistical topic models

  • Labeling a topic model

    • Criteria and challenges

  • Our approach: a probabilistic framework

  • Experiments

  • Summary


Statistical Topic Models for Text Mining

[Slide diagram] Text collections → probabilistic topic modeling → topic models (multinomial word distributions), e.g.:

    term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...
    web 0.21, search 0.10, link 0.08, graph 0.05, ...

  • Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], Topic over time [Wang et al. 06]

  • Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison

Topic Models: Hard to Interpret

  • Use top words

    • automatic, but hard to make sense of

  • Use human-generated labels (e.g., "Retrieval Models" for term, relevance, weight, feedback, ...)

    • make sense, but cannot scale up

Example topics:

    term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02  →  ?

    insulin, foraging, foragers, collected, grains, loads, collection, nectar  →  ?

Question: Can we automatically generate understandable labels for topics?


What is a Good Label?

Example: a topic learned from SIGIR abstracts [Mei & Zhai 06]:

    term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173

Candidate labels: "Retrieval models", "iPod Nano", "じょうほうけんさく" (Japanese for "information retrieval"), "Pseudo-feedback", "Information Retrieval", ...

A good label is:

  • Semantically close to the topic (relevance)

  • Understandable – a phrase?

  • Of high coverage inside the topic

  • Discriminative across topics

Our Method

[Pipeline diagram, illustrated on topics estimated from a collection such as SIGIR abstracts]

Topic to label: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...
Other topics in the model: (filtering 0.21, collaborative 0.15, ...), (trec 0.18, evaluation 0.10, ...)

  1. Candidate label pool, from an NLP chunker or ngram statistics (sketched after this list): information retrieval, retrieval model, index structure, relevance feedback, ...

  2. Relevance score, ranking the candidates: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, ...

  3. Discrimination, penalizing labels that are also relevant to the other topics: information retrieval 0.26 → 0.01; retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; ...

  4. Coverage, removing redundant labels: IR models 0.18 → 0.02; retrieval models 0.20; pseudo feedback 0.09; ...
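As a rough illustration of step 1, here is a minimal sketch of building the candidate pool from statistically significant bigrams with NLTK. The frequency cutoff, the choice of the Student's t measure, and the `sigir_tokens` variable are assumptions for the example, not the paper's exact settings.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def candidate_labels(tokens, min_freq=5, top_n=200):
    """Build a candidate label pool from significant bigrams,
    one of the two candidate sources used in the paper."""
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(min_freq)          # drop rare bigrams
    measure = BigramAssocMeasures.student_t     # significance test for collocations
    return [" ".join(bg) for bg in finder.nbest(measure, top_n)]

# e.g., candidate_labels(sigir_tokens)
#   -> ["information retrieval", "retrieval model", "relevance feedback", ...]
```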


Relevance (Task 2): the Zero-Order Score

  • Intuition: prefer phrases that cover the topic's top words well

Latent topic θ, p(w|θ): clustering 0.4, dimensional 0.3, algorithm ..., birch ..., shape 0.01, body 0.001, ...

Good label (l1): "clustering algorithm"  >  bad label (l2): "body shape"

Score the label by the likelihood of its words under the topic, Score(l, θ) = Σ_{w ∈ l} log p(w|θ). Since p("clustering"|θ) = 0.4 and p("dimensional"|θ) = 0.3 are large while p("shape"|θ) = 0.01 and p("body"|θ) = 0.001 are tiny, "clustering algorithm" scores far above "body shape".
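A minimal sketch of the zero-order scorer on this slide's toy topic; the probabilities for "algorithm" and "birch" and the smoothing constant `eps` are made up to complete the example.

```python
import math

def zero_order_score(label, topic, eps=1e-12):
    """Zero-order relevance: sum of log p(w | theta) over the label's
    words, i.e., how well the topic's top words cover the label."""
    return sum(math.log(topic.get(w, eps)) for w in label.split())

theta = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "birch": 0.05, "shape": 0.01, "body": 0.001}
print(zero_order_score("clustering algorithm", theta))  # good label: ~ -3.2
print(zero_order_score("body shape", theta))            # bad label:  ~ -11.5
```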


Relevance (Task 2): the First-Order Score

  • Intuition: prefer phrases with a context (word distribution) similar to the topic's

[Slide diagram: documents containing "clustering algorithm" share context words (clustering, dimension, partition, algorithm, ...) with the topic's p(w|θ), while documents containing "hash join" have a different context distribution p(w|"hash join") (hash, join, key, table, code, search, map, ...)]

Good label (l1): "clustering algorithm"  vs.  l2: "hash join"

Compare the topic distribution p(w|θ) with the label's context distribution p(w|l):

    Score(l, θ) = -D(θ || l)

i.e., the negative KL divergence between p(w|θ) and p(w|l); the closer the label's context is to the topic, the higher the score.
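A minimal sketch of the first-order scorer as a negative KL divergence; estimating p(w|l) is left abstract here. (In the KDD '07 paper the equivalent quantity is computed via expected pointwise mutual information over corpus co-occurrence statistics rather than an explicit distribution.)

```python
import math

def first_order_score(topic, label_context, eps=1e-12):
    """First-order relevance: Score(l, theta) = -D(theta || p(w|l)).
    Both arguments map words to probabilities; the closer the label's
    context distribution is to the topic, the higher the score."""
    return -sum(p * math.log(p / max(label_context.get(w, eps), eps))
                for w, p in topic.items() if p > 0)
```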


Discrimination and Coverage (Tasks 3 & 4)

  • Discriminative across topics:

    • High relevance to the target topic, low relevance to the other topics in the model

  • High coverage inside the topic:

    • Use a maximal marginal relevance (MMR) re-ranking strategy, as sketched below
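A sketch of both steps under assumed interfaces: `rel[(l, t)]` holds a first-order relevance score for label l and topic t, `sim` is any inter-label similarity function, and the weights `mu` and `lam` are illustrative, not the paper's tuned values.

```python
def discriminative_score(label, target, other_topics, rel, mu=1.0):
    """Discrimination: reward relevance to the target topic, penalize the
    label's average relevance to the other topics in the model."""
    penalty = sum(rel[(label, t)] for t in other_topics) / max(len(other_topics), 1)
    return rel[(label, target)] - mu * penalty

def mmr_select(candidates, score, sim, k=3, lam=0.7):
    """Coverage: MMR-style selection that trades off a label's own score
    against its similarity to the labels already chosen."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        best = max(pool, key=lambda l: lam * score[l] -
                   (1 - lam) * max((sim(l, c) for c in chosen), default=0.0))
        chosen.append(best)
        pool.remove(best)
    return chosen
```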


Variations and Applications

  • Labeling document clusters

    • Document cluster → unigram language model

    • Applicable to any task that produces a unigram language model

  • Context-sensitive labels

    • The label of a topic is sensitive to the context

    • An alternative way to approach contextual text mining; see the sketch below

    tree, prune, root, branch  →  "tree algorithms" in computer science
                               →  ? in horticulture
                               →  ? in marketing
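One way to realize context-sensitive labels, sketched under assumptions: estimate a separate label context distribution p(w | l, C) from each context collection and re-score the same topic with the first-order scorer above. The helper below and the `cs_docs` / `gardening_docs` names are hypothetical, not the paper's exact estimator.

```python
from collections import Counter

def label_context_dist(label, docs):
    """Rough estimate of p(w | l, C): pool the words of documents in the
    context collection C that contain the label, then normalize."""
    words = label.split()
    counts = Counter()
    for doc in docs:                      # each doc: a list of tokens
        if all(w in doc for w in words):
            counts.update(doc)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

# Same topic, different context collections -> different best labels:
# first_order_score(topic, label_context_dist("tree algorithms", cs_docs))
# first_order_score(topic, label_context_dist("tree pruning", gardening_docs))
```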


Experiments

  • Datasets:

    • SIGMOD abstracts; SIGIR abstracts; AP news data

    • Candidate labels: significant bigrams; NLP chunks

  • Topic models:

    • PLSA, LDA

  • Evaluation:

    • Human annotators compare labels generated by anonymized systems

    • Order of systems randomly perturbed; scores averaged over all sample topics


Result Summary

  • Automatic phrase labels >> top words

  • First-order relevance >> zero-order relevance

  • Bigrams > NLP chunks

    • Bigrams work better on scientific literature; NLP chunks work better on news

  • System labels << human labels

    • Labeling scientific literature is the easier task


Results: Sample Topic Labels

  • Topic from AP news: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007
    → label: "iran contra"

  • Topic from SIGMOD: the, of, a, and, to, data (each > 0.02), clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005
    → labels: "clustering algorithm", "clustering structure" (weaker candidates: "large data", "data quality", "high data", "data application", ...)

  • Topic from SIGMOD: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
    → labels: "r tree", "b tree", ...; human label: "indexing methods"


Results: Context-Sensitive Labeling

  • Explore the different meanings of a topic under different contexts (context switch)

  • An alternative approach to contextual text mining

Example topic: sampling, estimation, approximation, histogram, selectivity, histograms

  • Context: databases (SIGMOD proceedings) → "selectivity estimation", "random sampling", "approximate answers"

  • Context: IR (SIGIR proceedings) → "distributed retrieval", "parameter estimation", "mixture models"


Summary

  • Labeling: a post-processing step for any multinomial topic model

  • A probabilistic approach to generating good labels

    • understandable, relevant, high-coverage, discriminative

  • Broadly applicable to mining tasks involving multinomial word distributions; labels can be made context-sensitive

  • Future work:

    • Labeling hierarchical topic models

    • Incorporating priors


Thanks!

- Please come to our poster tonight (#40)

