Analyzing unstructured text with topic models

Mark Steyvers

Dep. of Cognitive Sciences & Dep. of Computer Science

University of California, Irvine

collaborators: Padhraic Smyth, UC Irvine; Tom Griffiths UC Berkeley


Analyzing Unstructured Text

  • Pennsylvania Gazette (1728-1800): 80,000 articles

  • Enron: 250,000 emails

  • NYT: 330,000 articles

  • NSF/NIH: 100,000 grants

  • AOL queries: 20,000,000 queries, 650,000 users

  • Medline: 16 million articles


Topic Models and Text Analysis

  • Can answer a number of questions:

    • What is in this corpus?

    • What is in this document, paragraph, or sentence?

    • What does this person/group of people write about?

    • What tags are appropriate for this document?

    • What are the topical trends over time?


Topic Models

  • Automatic and unsupervised extraction of semantic themes from large text collections.

  • Widely used models in machine learning and text mining

    • pLSI Model: Hofmann (1999)

    • LDA Model: Blei, Ng, and Jordan (2001, 2003)

    • LDA with Gibbs sampling : Griffiths and Steyvers (2003, 2004)


Basic Assumptions

  • Each topic is a distribution over words

  • Each document is a mixture of topics

  • Each word in a document originates from a single topic
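
The three assumptions above can be read as a recipe for generating a document. The following is a minimal sketch, assuming NumPy; the vocabulary, the two topic distributions, and the topic weights are made-up illustrative values, not numbers from the slides.

```python
import numpy as np

def generate_document(topic_word, doc_topic, n_words, rng):
    """Sample one document: pick a topic per word, then a word from that topic."""
    n_topics, vocab_size = topic_word.shape
    # Each word originates from a single topic, drawn from the document's mixture.
    topics = rng.choice(n_topics, size=n_words, p=doc_topic)
    # Each topic is a distribution over words.
    words = np.array([rng.choice(vocab_size, p=topic_word[t]) for t in topics])
    return words, topics

# Hypothetical vocabulary and topics (illustrative values only).
vocab = ["money", "bank", "loan", "river", "stream"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # assumed "finance" topic
    [0.0, 0.3, 0.0, 0.4, 0.3],   # assumed "river" topic
])
doc_topic = np.array([0.6, 0.4])  # assumed topic weights for one document

rng = np.random.default_rng(0)
words, topics = generate_document(topic_word, doc_topic, 12, rng)
# Print each word followed by the number of the topic that produced it.
print(" ".join(f"{vocab[w].upper()}{t + 1}" for w, t in zip(words, topics)))
```

The printed string has the same form as the documents in the toy example: each word is tagged with the topic it was drawn from.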


Model

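$P(\text{word} \mid \text{document}) = \sum_{\text{topic}} P(\text{word} \mid \text{topic}) \, P(\text{topic} \mid \text{document})$

  • P(word | topic): each topic is a probability distribution over words

  • P(topic | document): topic weights for each document

  • Automatically learned from the text corpus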
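
As a small numeric illustration of the mixture formula, the sum over topics is a matrix product of the topic weights and the topic distributions. This sketch assumes NumPy and uses made-up probabilities; only the formula itself comes from the slide.

```python
import numpy as np

# P(word | topic): one row per topic, each row sums to 1 (illustrative values).
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.4, 0.3],
])

# P(topic | document): one row per document, each row sums to 1 (illustrative values).
doc_topic = np.array([
    [1.0, 0.0],
    [0.6, 0.4],
    [0.0, 1.0],
])

# P(word | document) = sum over topics of P(word | topic) * P(topic | document).
word_given_doc = doc_topic @ topic_word
print(word_given_doc)
```

Each row of the result is again a probability distribution over the vocabulary, one per document.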


Toy Example

Documents and topic assignments (the number after each word is the topic it was drawn from):

  • Document 1 (topic 1 weight: 1.0): MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ....

  • Document 2 (mixture of both topics, weights .6 and .4): RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ....

  • Document 3 (topic 2 weight: 1.0): RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2....


Statistical Inference

Documents with unknown topic assignments:

  • MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ....

  • RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ....

  • RIVER? BANK? STREAM? BANK? RIVER? BANK?....

Topics: ?

Topic weights: ?


Statistical Inference

  • Exact inference is intractable

  • Markov chain Monte Carlo (MCMC) with Gibbs sampling

    • scalable to large document collections (e.g. all of Wikipedia)

    • parallelizable

  • Form of dimensionality reduction

    • Number of topics T= 50…2000
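
To make the Gibbs sampling step concrete, here is a minimal sketch of a collapsed Gibbs sampler for an LDA-style topic model, assuming NumPy. The function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the iteration count, and the tiny corpus are all assumptions for illustration; this is not the toolbox implementation and makes no attempt at scalability or parallelism.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling; docs is a list of lists of integer word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))     # tokens in doc d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))   # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)                # tokens assigned to topic t overall
    assignments = []
    for d, doc in enumerate(docs):                  # random initial topic assignments
        z = rng.integers(n_topics, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            z = assignments[d]
            for i, w in enumerate(doc):
                t = z[i]
                # Remove the current token from all counts.
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # Conditional probability of each topic for this token.
                p = ((topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                     * (doc_topic[d] + alpha))
                t = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled topic.
                z[i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word

# Tiny made-up corpus; word ids stand for money, bank, loan, river, stream.
docs = [[0, 1, 1, 2, 1, 0], [3, 1, 4, 1, 3, 4], [0, 1, 3, 4, 2, 1]]
assignments, doc_topic, topic_word = gibbs_lda(docs, vocab_size=5, n_topics=2)
print(doc_topic)   # per-document topic counts after sampling
```

Normalizing the rows of `doc_topic` and `topic_word` (after adding the smoothing constants) gives estimates of the topic weights and the topic distributions.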


Example Topics from New York Times

  • Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

  • Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

  • Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

  • Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP


Learning multiple meanings of words

  • Topic 1: PRINTING, PAPER, PRINT, PRINTED, TYPE, PROCESS, INK, PRESS, IMAGE, PRINTER, PRINTS, PRINTERS, COPY, COPIES, FORM, OFFSET, GRAPHIC, SURFACE, PRODUCED, CHARACTERS

  • Topic 2: PLAY, PLAYS, STAGE, AUDIENCE, THEATER, ACTORS, DRAMA, SHAKESPEARE, ACTOR, THEATRE, PLAYWRIGHT, PERFORMANCE, DRAMATIC, COSTUMES, COMEDY, TRAGEDY, CHARACTERS, SCENES, OPERA, PERFORMED

  • Topic 3: TEAM, GAME, BASKETBALL, PLAYERS, PLAYER, PLAY, PLAYING, SOCCER, PLAYED, BALL, TEAMS, BASKET, FOOTBALL, SCORE, COURT, GAMES, TRY, COACH, GYM, SHOT

  • Topic 4: JUDGE, TRIAL, COURT, CASE, JURY, ACCUSED, GUILTY, DEFENDANT, JUSTICE, EVIDENCE, WITNESSES, CRIME, LAWYER, WITNESS, ATTORNEY, HEARING, INNOCENT, DEFENSE, CHARGE, CRIMINAL

  • Topic 5: HYPOTHESIS, EXPERIMENT, SCIENTIFIC, OBSERVATIONS, SCIENTISTS, EXPERIMENTS, SCIENTIST, EXPERIMENTAL, TEST, METHOD, HYPOTHESES, TESTED, EVIDENCE, BASED, OBSERVATION, SCIENCE, FACTS, DATA, RESULTS, EXPLANATION

  • Topic 6: STUDY, TEST, STUDYING, HOMEWORK, NEED, CLASS, MATH, TRY, TEACHER, WRITE, PLAN, ARITHMETIC, ASSIGNMENT, PLACE, STUDIED, CAREFULLY, DECIDE, IMPORTANT, NOTEBOOK, REVIEW


Demographic Analysis of Search Queries


AOL dataset

  • Dataset:

    - 20,000,000+ web queries

    - 650,000+ users

  • Users were given “anonymous” user IDs

    • No demographics in this dataset


Example query log from user #2178

ID | Query | Date/Time | URL clicked
2178 | dog eats uncooked pasta | 2006-05-26 15:31:56 |
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.twodogpress.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.canismajor.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://kitchen.robbiehaf.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.dog-first-aid-101.com
2178 | inducing dog vomiting | 2006-05-26 15:38:36 |
2178 | walmart | 2006-05-12 12:39:52 | http://www.walmart.com
2178 | sears | 2006-05-12 12:44:22 | http://www.sears.com
2178 | target | 2006-05-12 17:05:36 | http://www.target.com
2178 | babycenter.com | 2006-05-12 17:43:59 | http://www.babycenter.com
2178 | google | 2006-05-16 10:54:39 | http://www.google.com
2178 | fit pregnancy | 2006-05-16 15:34:23 |
2178 | baby center | 2006-05-16 15:37:22 |
2178 | yahoo.com | 2006-05-18 17:11:05 | http://www.yahoo.com
2178 | applebee's carside | 2006-05-19 19:21:08 | http://www.applebees.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.babynames.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.babynamesworld.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.thinkbabynames.com
2178 | mortgage calculator | 2006-05-24 14:39:05 | http://www.bankrate.com
2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com
2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com


Another Query Database…

  • Not publicly available

  • Dataset

    • 250,000+ users

    • 411,000+ queries

  • Age and gender of users are known:

    • age brackets: 0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+


Topic modeling of queries

  • Each user searches for a mixture of topics

  • Each topic is a probability distribution over query words
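
Treating each user as a "document" means pooling that user's query words into one bag of words before fitting the topic model. The sketch below shows one possible way to do that with the Python standard library. The function name `user_documents`, the whitespace tokenization, and the rule of counting a query once when it is logged several times at the same timestamp (once per clicked URL) are assumptions, not steps stated in the slides.

```python
from collections import Counter, defaultdict

def user_documents(log_rows):
    """Pool the query words of each user into a single bag of words."""
    bags = defaultdict(Counter)
    seen = set()
    for user_id, query, timestamp in log_rows:
        key = (user_id, query, timestamp)
        if key in seen:          # assumed rule: repeated rows differ only in clicked URL
            continue
        seen.add(key)
        bags[user_id].update(query.lower().split())
    return bags

# A few rows in the style of the example query log (ID, query, date/time).
rows = [
    ("2178", "baby names", "2006-05-20 15:02:38"),
    ("2178", "baby names", "2006-05-20 15:02:38"),
    ("2178", "baby names", "2006-05-20 15:02:38"),
    ("2178", "baby center", "2006-05-16 15:37:22"),
    ("2178", "fit pregnancy", "2006-05-16 15:34:23"),
]
bags = user_documents(rows)
print(bags["2178"].most_common(3))
```

Each resulting bag plays the role that a document plays in the standard topic model.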


Four example topics (out of 200)

  • auto, car, parts, cars, used, ford, honda, truck, toyota

  • party, store, wedding, birthday, jewelry, ideas, cards, cake, gifts

  • webmd, cymbalta, xanax, gout, vicodin, effexor, prednisone, lexapro, ambien

  • hannah, montana, zac, efron, disney, high school musical, mileycyrus, hilary duff

Each topic is a probability distribution over words. The most likely words are listed first.


User = mixture of topics

[Figure: the same four example topics, shown with topic weights (80%, 20%, 100%) for two users, User #7654 and User #246]


Topic Analysis

  • Find likely topics for each demographic bucket

  • Find likely demographics given topics

  • What’s on the mind of people in different age-groups?
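
One simple way to relate topics and demographic buckets is to pool the topic counts of all users in each bucket and normalize in both directions. The sketch below assumes NumPy; the function name `topic_demographics`, the count-pooling approach, and the example numbers are illustrative assumptions, and it assumes every bucket and every topic has at least one count.

```python
import numpy as np

def topic_demographics(user_topic_counts, user_bucket, n_buckets):
    """Return P(topic | bucket) and P(bucket | topic) from pooled user counts."""
    counts = np.zeros((n_buckets, user_topic_counts.shape[1]))
    # Add each user's topic counts to the row of that user's demographic bucket.
    np.add.at(counts, user_bucket, user_topic_counts)
    topic_given_bucket = counts / counts.sum(axis=1, keepdims=True)
    bucket_given_topic = counts / counts.sum(axis=0, keepdims=True)
    return topic_given_bucket, bucket_given_topic

# Made-up data: 4 users, 3 topics, 2 age brackets.
user_topic_counts = np.array([
    [8, 2, 0],
    [6, 4, 0],
    [0, 1, 9],
    [1, 0, 9],
])
user_bucket = np.array([0, 0, 1, 1])

topic_given_bucket, bucket_given_topic = topic_demographics(
    user_topic_counts, user_bucket, n_buckets=2)
print(topic_given_bucket)   # likely topics for each bucket
print(bucket_given_topic)   # likely buckets for each topic
```

Rows of the first matrix answer "which topics are likely for this bucket", and columns of the second answer "which buckets are likely given this topic".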


“poems” topic


“myspace” topic


“sports” topic


“MTV” topic


“Clothing Stores” topic


“Hairstyles” topic


“recipes” topic


Results

  • Topic models give quick summaries of demographic trends in query datasets

  • Other potential applications:

    • e.g. blogs, social networking sites, email, etc

    • clinical data, e.g. therapy discussions


Analyzing Emails: who writes on what topics?


Enron email data

250,000 emails

5000 authors

1999-2002


Author-topic models

  • We can learn the association between authors of documents and topics

  • Assume each author works on a mixture of topics
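
A common way to formalize "each author works on a mixture of topics" is to let every word in a document first pick one of the document's authors, then a topic from that author's mixture, then a word from that topic. The sketch below follows that reading, assuming NumPy; the uniform choice among co-authors and all probability values are assumptions for illustration, since the slides do not spell out the generative details.

```python
import numpy as np

def generate_author_topic_document(doc_authors, author_topic, topic_word, n_words, rng):
    """Sample words for a document with the given list of author ids."""
    n_topics, vocab_size = topic_word.shape
    authors, topics, words = [], [], []
    for _ in range(n_words):
        a = int(rng.choice(doc_authors))                   # assumed: uniform over co-authors
        t = int(rng.choice(n_topics, p=author_topic[a]))   # topic from that author's mixture
        w = int(rng.choice(vocab_size, p=topic_word[t]))   # word from that topic
        authors.append(a)
        topics.append(t)
        words.append(w)
    return authors, topics, words

# Made-up parameters: 3 authors, 2 topics, 4 vocabulary words.
author_topic = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
topic_word = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])

rng = np.random.default_rng(0)
authors, topics, words = generate_author_topic_document(
    [0, 1], author_topic, topic_word, 20, rng)
print(list(zip(authors, topics, words))[:5])
```

Inference then runs in the opposite direction: given only the words and the author lists, it estimates each author's topic mixture.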


ENRON Email: who writes on certain topics?

Each topic is a distribution over words, but also over senders (authors) of email. The most likely authors are listed at the top.


Enron email: two example topics (T=100)


Detecting Papers on Unusual Topics for Authors

  • We can calculate perplexity (unusualness) for words in a document given an author
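
Perplexity is the exponential of the negative average log-probability of the words, so a document whose words are unlikely under an author's topic mixture scores high. The sketch below assumes NumPy and that P(word | author) is the author's topic weights multiplied by the topic distributions; the function name and all numbers are illustrative assumptions.

```python
import numpy as np

def perplexity(word_ids, author_topic_row, topic_word):
    """Perplexity of a document's words under one author's topic mixture."""
    word_probs = author_topic_row @ topic_word            # P(word | author) per vocab word
    log_likelihood = np.log(word_probs[word_ids]).sum()   # assumes all probabilities > 0
    return float(np.exp(-log_likelihood / len(word_ids)))

# Made-up model: 2 topics, 4 vocabulary words, one author who mostly uses topic 0.
topic_word = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
author = np.array([0.9, 0.1])

typical = perplexity(np.array([0, 1, 0, 1]), author, topic_word)
unusual = perplexity(np.array([2, 3, 2, 3]), author, topic_word)
print(typical, unusual)   # the second document is far more surprising for this author
```

Ranking an author's documents by this score puts the most atypical ones first.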

Papers ranked by perplexity for M. Jordan:


Author Separation

Can the model correctly attribute words to authors within a document?


Application: Faculty Browser


Faculty Browser

  • Automatically analyzes computer science papers by UC San Diego and UC Irvine researchers

  • Finds topically related researchers


one topic

most prolific researchers in this topic


one researcher

topics this researcher is interested in

other researchers with similar topical interests


Inferred network of researchers connected through topics


Modeling Extensions


Entity-topic modeling

330,000 articles

2000-2002

Who is mentioned in what context?


Extracted Named Entities

Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …

  • Used standard algorithms to extract named entities:

    • People

    • Places

    • Organizations


Standard Topic Model with Entities


Example of Extracted Entity-Topic Network


Topic Trends

[Figure: proportion of words assigned to topic for that time slice, plotted for three topics: Tour-de-France, Quarterly Earnings, Anthrax]
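
The trend measure named in the figure, the proportion of words assigned to a topic in each time slice, can be computed directly from per-token topic assignments. This is a minimal sketch assuming NumPy; the function name `topic_trend` and the toy arrays are made up, and slices with no tokens are reported as 0.

```python
import numpy as np

def topic_trend(token_topics, token_slices, topic, n_slices):
    """Proportion of tokens assigned to `topic` within each time slice."""
    token_topics = np.asarray(token_topics)
    token_slices = np.asarray(token_slices)
    totals = np.bincount(token_slices, minlength=n_slices)
    hits = np.bincount(token_slices[token_topics == topic], minlength=n_slices)
    return hits / np.maximum(totals, 1)   # avoid division by zero for empty slices

# Made-up assignments: topic id and time-slice id for six tokens.
token_topics = [0, 1, 1, 0, 1, 1]
token_slices = [0, 0, 1, 1, 1, 2]
trend = topic_trend(token_topics, token_slices, topic=1, n_slices=3)
print(trend)
```

Plotting one such vector per topic against time gives curves of the kind described in the figure.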


Learning Topic Hierarchies (example: Psych Review abstracts)

[Figure: topic hierarchy; each line below lists the words of one node]

  • THE, OF, AND, TO, IN, A, IS

  • A, MODEL, MEMORY, FOR, MODELS, TASK, INFORMATION, RESULTS, ACCOUNT

  • SELF, SOCIAL, PSYCHOLOGY, RESEARCH, RISK, STRATEGIES, INTERPERSONAL, PERSONALITY, SAMPLING

  • MOTION, VISUAL, SURFACE, BINOCULAR, RIVALRY, CONTOUR, DIRECTION, CONTOURS, SURFACES

  • DRUG, FOOD, BRAIN, AROUSAL, ACTIVATION, AFFECTIVE, HUNGER, EXTINCTION, PAIN

  • RESPONSE, STIMULUS, REINFORCEMENT, RECOGNITION, STIMULI, RECALL, CHOICE, CONDITIONING

  • SPEECH, READING, WORDS, MOVEMENT, MOTOR, VISUAL, WORD, SEMANTIC

  • ACTION, SOCIAL, SELF, EXPERIENCE, EMOTION, GOALS, EMOTIONAL, THINKING

  • GROUP, IQ, INTELLIGENCE, SOCIAL, RATIONAL, INDIVIDUAL, GROUPS, MEMBERS

  • SEX, EMOTIONS, GENDER, EMOTION, STRESS, WOMEN, HEALTH, HANDEDNESS

  • REASONING, ATTITUDE, CONSISTENCY, SITUATIONAL, INFERENCE, JUDGMENT, PROBABILITIES, STATISTICAL

  • IMAGE, COLOR, MONOCULAR, LIGHTNESS, GIBSON, SUBMOVEMENT, ORIENTATION, HOLOGRAPHIC

  • CONDITIONIN, STRESS, EMOTIONAL, BEHAVIORAL, FEAR, STIMULATION, TOLERANCE, RESPONSES


Hidden Markov Topics Model

  • Syntactic dependencies → short-range dependencies

  • Semantic dependencies → long-range dependencies

[Figure: graphical model with nodes θ; z1, z2, z3, z4; w1, w2, w3, w4; s1, s2, s3, s4]

  • Semantic state: generate words from topic model

  • Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
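
The split between syntactic and semantic states can be sketched as a generative loop: an HMM moves between states, and only in the designated semantic state is the word drawn from the topic model. The code below is a simplified illustration assuming NumPy; the state layout, the start state, the vocabulary, and every probability are invented for the example and are not taken from the cited paper.

```python
import numpy as np

def generate_sentence(n_words, transitions, class_word, topic_word, doc_topic,
                      rng, semantic_state=0, start_state=1):
    """Generate word ids: HMM states pick syntax words, one state defers to topics."""
    n_states = transitions.shape[0]
    n_topics, vocab_size = topic_word.shape
    state, states, words = start_state, [], []
    for _ in range(n_words):
        state = int(rng.choice(n_states, p=transitions[state]))   # short-range dependency
        if state == semantic_state:
            topic = rng.choice(n_topics, p=doc_topic)              # long-range: document topics
            word = rng.choice(vocab_size, p=topic_word[topic])
        else:
            word = rng.choice(vocab_size, p=class_word[state])     # syntactic class word
        states.append(state)
        words.append(int(word))
    return words, states

# Made-up model: ids 0-2 are function words, ids 3-5 are content words.
vocab = ["the", "in", "is", "network", "kernel", "image"]
transitions = np.array([[0.2, 0.4, 0.4], [0.9, 0.1, 0.0], [0.3, 0.7, 0.0]])
class_word = np.array([[0, 0, 0, 0, 0, 0],             # row for the semantic state is unused
                       [0.6, 0.4, 0, 0, 0, 0],
                       [0, 0, 1.0, 0, 0, 0]])
topic_word = np.array([[0, 0, 0, 0.7, 0.3, 0.0], [0, 0, 0, 0.0, 0.2, 0.8]])
doc_topic = np.array([0.5, 0.5])

rng = np.random.default_rng(0)
words, states = generate_sentence(10, transitions, class_word, topic_word, doc_topic, rng)
print(" ".join(vocab[w] for w in words))
```

In this toy setup, content words can only appear when the chain is in the semantic state, which mirrors the separation between semantic topics and syntactic word classes.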


NIPS Semantics

  • KERNEL, SUPPORT, VECTOR, SVM, KERNELS, #, SPACE, FUNCTION, MACHINES, SET

  • NETWORK, NEURAL, NETWORKS, OUTPUT, INPUT, TRAINING, INPUTS, WEIGHTS, #, OUTPUTS

  • IMAGE, IMAGES, OBJECT, OBJECTS, FEATURE, RECOGNITION, VIEWS, #, PIXEL, VISUAL

  • EXPERTS, EXPERT, GATING, HME, ARCHITECTURE, MIXTURE, LEARNING, MIXTURES, FUNCTION, GATE

  • MEMBRANE, SYNAPTIC, CELL, *, CURRENT, DENDRITIC, POTENTIAL, NEURON, CONDUCTANCE, CHANNELS

  • DATA, GAUSSIAN, MIXTURE, LIKELIHOOD, POSTERIOR, PRIOR, DISTRIBUTION, EM, BAYESIAN, PARAMETERS

  • STATE, POLICY, VALUE, FUNCTION, ACTION, REINFORCEMENT, LEARNING, CLASSES, OPTIMAL, *

NIPS Syntax

  • IN, WITH, FOR, ON, FROM, AT, USING, INTO, OVER, WITHIN

  • #, *, I, X, T, N, -, C, F, P

  • IS, WAS, HAS, BECOMES, DENOTES, BEING, REMAINS, REPRESENTS, EXISTS, SEEMS

  • SEE, SHOW, NOTE, CONSIDER, ASSUME, PRESENT, NEED, PROPOSE, DESCRIBE, SUGGEST

  • HOWEVER, ALSO, THEN, THUS, THEREFORE, FIRST, HERE, NOW, HENCE, FINALLY

  • MODEL, ALGORITHM, SYSTEM, CASE, PROBLEM, NETWORK, METHOD, APPROACH, PAPER, PROCESS

  • USED, TRAINED, OBTAINED, DESCRIBED, GIVEN, FOUND, PRESENTED, DEFINED, GENERATED, SHOWN


Random sentence generation

LANGUAGE:

[S] RESEARCHERS GIVE THE SPEECH

[S] THE SOUND FEEL NO LISTENERS

[S] WHICH WAS TO BE MEANING

[S] HER VOCABULARIES STOPPED WORDS

[S] HE EXPRESSLY WANTED THAT BETTER VOWEL


Software

Public-domain MATLAB toolbox for topic modeling on the Web:

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

