Concept and theme discovery through probabilistic models and clustering

Qiaozhu Mei, Oct. 12, 2005


Concepts and Themes

  • Language units in biology literature mining:

    • Terms

    • Phrases

    • Entities

    • Concepts (tight groups of terms/entities representing semantics: e.g. Gene Synonyms)

    • Themes (loose groups of terms representing topic/subtopics)


Theme Discovery

  • What we’ve got now:

    • A Generative Model to extract k themes from a collection

    • Each theme is a language model, represented by its highest-probability words

    • KL Divergence to model the distance/similarity between themes;

    • retrieve most similar themes to a term group
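As a rough illustration of the last two points, the sketch below compares theme language models with KL divergence and returns the theme closest to a small term group. The additive smoothing constant, the function names, and the toy distributions are assumptions made for this example; the slides do not say how the comparison was actually implemented.

```python
import math

def kl_divergence(p, q, vocab, eps=1e-9):
    """KL(p || q) over a shared vocabulary, with additive smoothing (an assumption)."""
    def smooth(dist):
        total = sum(dist.get(w, 0.0) + eps for w in vocab)
        return {w: (dist.get(w, 0.0) + eps) / total for w in vocab}
    ps, qs = smooth(p), smooth(q)
    return sum(ps[w] * math.log(ps[w] / qs[w]) for w in vocab)

def most_similar_themes(query, themes, top_n=1):
    """Rank themes by KL(query || theme); a smaller divergence means more similar."""
    vocab = set(query)
    for dist in themes.values():
        vocab |= set(dist)
    ranked = sorted(themes, key=lambda name: kl_divergence(query, themes[name], vocab))
    return ranked[:top_n]

# Toy theme language models (hypothetical probabilities).
themes = {
    "mites": {"varroa": 0.5, "mite": 0.3, "brood": 0.2},
    "learning": {"memory": 0.4, "olfactory": 0.4, "brain": 0.2},
}
# A term group treated as a uniform distribution over its terms.
query = {"varroa": 0.5, "mite": 0.5}
print(most_similar_themes(query, themes))  # ['mites']
```

KL divergence is asymmetric, so the direction of comparison (term group against theme, or the reverse) is a design choice that the slides leave open.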


Theme Discovery (cont.)

  • What we’ve got now (cont.):

    • Use an HMM to segment the whole collection by the extracted themes

    • Use MMR to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned)

    • Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
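The MMR step can be sketched as a greedy selection that trades a phrase's relevance against its similarity to phrases already chosen. In the sketch below, the relevance scores stand in for n-gram probabilities, the similarity is a normalized edit distance, and the trade-off weight `lam` is a hypothetical parameter; none of these settings are taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def mmr_select(relevance, k, lam=0.7):
    """Greedy MMR: pick k phrases, penalizing similarity to those already selected."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(phrase):
            redundancy = max((similarity(phrase, s) for s in selected), default=0.0)
            return lam * relevance[phrase] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical relevance scores for three candidate phrases.
relevance = {"juvenile hormone": 0.9, "juvenile hormones": 0.85, "vibration signal": 0.6}
print(mmr_select(relevance, 2, lam=0.5))  # ['juvenile hormone', 'vibration signal']
```

With `lam=0.5` the near-duplicate "juvenile hormones" is skipped despite its high relevance, which is the behavior the "least redundant" criterion asks for.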


Some justifications

  • Fly collection:

    • Cluster 0: circadian

    • Cluster 1: adh, evolution

    • Cluster 2: a mixture of two topics, apoptosis and promoters

    • Cluster 6: brain development

    • Cluster 8: cell division

    • Cluster 12: drosophila immunity

    • Cluster 13: nervous systems

    • Cluster 14: hedgehog segment Polarity gene

    • Cluster 16: Histone, Polycomb

    • Cluster 17: visual system


Theme Discovery (cont.)

  • Problems:

    • How to select k? (How many themes do we believe the collection contains? The bee collection should have a smaller k than the fly collection.)

    • Can we find themes in a hierarchical manner?

      • This could solve the previous problem, but when do we cut off?

    • How to represent a theme?

      • Top words: the semantics are sometimes difficult to tell

      • Phrases?

      • Sentences?

    • Other possible approaches to extracting themes? (LDA, clustering methods)


Hierarchical Theme Discovery

  • A straightforward approach (top down splitting):

    • Discover k themes from the initial collection

    • Segment the collection by the k themes

    • For each theme, build a sub-collection with the segments in previous step

    • For each sub-collection, extract k’ themes

    • Repeat these steps recursively

    • Problem: when to stop splitting?

Collection → Theme1, Theme2, Theme3

Theme2 → Theme2.1, Theme2.2, Theme2.3

……
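The top-down splitting procedure can be sketched as a short recursion. The theme extractor and the assignment step are passed in as functions, because the slides use a generative model and HMM segmentation that are not reproduced here. The toy extractor, the whole-document assignment, and the depth and size limits used as a stopping rule are all assumptions for illustration, since the stopping criterion is left as an open problem.

```python
from collections import Counter

def split_themes(docs, extract, assign, k, max_depth, min_docs, depth=0):
    """Recursively build a theme tree.

    extract(docs, k) returns a list of themes; assign(doc, themes) returns the
    index of the theme a document belongs to. Both are supplied by the caller.
    """
    # Assumed stopping rule: a depth limit and a minimum sub-collection size.
    if depth >= max_depth or len(docs) < min_docs:
        return []
    themes = extract(docs, k)
    buckets = [[] for _ in themes]
    for doc in docs:
        buckets[assign(doc, themes)].append(doc)
    return [
        {"theme": theme,
         "children": split_themes(bucket, extract, assign, k,
                                  max_depth, min_docs, depth + 1)}
        for theme, bucket in zip(themes, buckets)
    ]

def toy_extract(docs, k):
    """Stand-in extractor: the k words that occur in the most documents."""
    counts = Counter(w for d in docs for w in set(d.split()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def toy_assign(doc, themes):
    """Stand-in for segmentation: first theme word found in the document."""
    words = doc.split()
    for i, theme in enumerate(themes):
        if theme in words:
            return i
    return 0

docs = ["mite brood", "mite varroa", "pollen flower", "pollen seed"]
tree = split_themes(docs, toy_extract, toy_assign, k=2, max_depth=2, min_docs=2)
print([node["theme"] for node in tree])  # ['mite', 'pollen']
```

Assigning whole documents to one theme is a simplification; the procedure on the slide segments the collection, so a single document could contribute text to several sub-collections.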


Hierarchical Theme Discovery (results)

A bee collection with 929 documents

Level1: 5 themes

Level2: 3 sub-themes for each higher level theme


Hierarchical Theme Discovery (results)

Level 1 themes (top 20 words each):

  • Theme 1: african, jelly, royal, european, venom, population, africanized, sting, kda, feral, m, reward, subspecies, proteins, patients, discrimination, naja, cue, characters, areas

  • Theme 2: queen, workers, worker, signal, jh, vibration, pheromone, gland, eggs, signals, hormone, juvenile, anarchistic, queens, egg, iridaceae, policing, ixia, behavioral, age

  • Theme 3: pollinator, plants, pollination, flowers, plantae, spermatophyta, angiospermae, dicotyledones, pollen, seed, fruit, angiosperms, spermatophytes, vascular, dicots, crop, plant, flower, pollinators, species

  • Theme 4: learning, brain, conditioning, olfactory, neural, neurons, mushroom, memory, sucrose, nervous, coordination, dopamine, extension, antennal, odor, system, proboscis, bodies, lobe, kenyon

  • Theme 5: varroa, mite, mites, jacobsoni, acarina, brood, parasite, colonies, host, control, chelicerata, chelicerates, hygienic, viruses, infestation, destructor, pest, infested, parasitology, mortality


Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level-1 theme headed by "african, jelly, royal, european, venom" (top words each):

  • Sub-theme 1: venom, reward, patients, naja, kda, proteins, wasp, protein, diptera, pla2, vespula, primates, hominidae, chordata, vertebrata, mug, sting, sperm, dose, quality

  • Sub-theme 2: african, european, population, populations, patterns, pattern, genetic, discrimination, mitochondrial, studies, information, are, contrast, green, two, bees, have, derived, africa, subspecies

  • Sub-theme 3: larvae, microorganisms, gram, bacteria, 0, colonies, royal, queen, jelly, eubacteria, non, workers, queens, production, 2, nest, italian, 5, fraction, nestmates


Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level-1 theme headed by "queen, workers, worker, signal, jh" (top words each):

  • Sub-theme 1: food, foragers, dance, transfer, enzyme, biosynthesis, receivers, contrast, nectar, flight, source, flow, water, information, rates, ddt, rj, caucasian, visual, green

  • Sub-theme 2: queen, worker, workers, colonies, pollen, vibration, eggs, foraging, development, brood, signal, queens, bees, anarchistic, behavioral, iridaceae, larvae, egg, pheromone, may

  • Sub-theme 3: mammals, vertebrates, venom, nonhuman, l, ml, models, model, chordates, beeswax, mug, omega, embryo, mammalia, vertebrata, has, chordata, nurse, coloured, vg


Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level-1 theme headed by "pollinator, plants, pollination, flowers, plantae" (top words each):

  • Sub-theme 1: ecology, is, species, environmental, sciences, flowering, floral, terrestrial, pollinator, visiting, reproduction, plants, c, cashew, self, animalia, food, insects, faba, size

  • Sub-theme 2: seed, per, crop, sunflower, number, cruciferae, fruit, hybrid, agriculture, seeds, quality, cultivar, weight, helianthus, oilseed, compositae, annuus, yield, pollination, set

  • Sub-theme 3: pollen, eep, honeybees, mating, bumblebees, sp, hive, bacteria, scent, mimosa, brazil, undertakers, chromatography, marks, recently, gram, eubacteria, caraway, microorganisms, propolis


Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level-1 theme headed by "learning, brain, conditioning, olfactory, neural" (top words each):

  • Sub-theme 1: bees, sucrose, conditioning, response, learning, extension, proboscis, pollen, foragers, performance, between, thresholds, honeybees, solution, discrimination, strain, rate, foraging, concentration, low

  • Sub-theme 2: dopamine, levels, development, age, binding, pupal, brain, octopamine, division, adult, colonies, labor, glass, treated, colony, ryr, pigmentation, chromosomes, arolium, da

  • Sub-theme 3: imidacloprid, current, memory, mushroom, neurons, 1, expressed, 4, cells, antennal, mb, bodies, currents, nervous, brain, mv, kinase, receptors, term, protein


Hierarchical Theme Discovery (results)

Level 2 sub-themes of the level-1 theme headed by "varroa, mite, mites, jacobsoni, acarina" (top words each):

  • Sub-theme 1: mite, varroa, mites, brood, jacobsoni, acarina, colonies, parasite, for, worker, control, a, drone, formic, population, acid, host, 0, cells, treatment

  • Sub-theme 2: viruses, larvae, microorganisms, virus, bacteria, animal, paenibacillus, infection, molecular, pathogen, eubacteria, gram, forming, endospore, positives, p, apv, entomopathogen

  • Sub-theme 3: pollen, bees, foragers, their, or, ta, heat, at, hygienic, foraging, protein, activity, behaviour, increased, response, blood, flight, strips, metabolic, removal


Phrase Representations:

biochemistry and molecular biophysics

endocrine system chemical coordination and homeostasis

molecular genetics biochemistry and molecular biophysics

sense organs sensory reception

animals arthropods chordates insects invertebrates mammals

system chemical coordination and homeostasis

vertebrata chordata animalia

honey bee

behavior terrestrial ecology

mammalia vertebrata chordata animalia

juvenile hormone

queen

rodentia mammalia vertebrata chordata animalia

worker laid eggs

vibration signal

genetics biochemistry and molecular biophysics

dufour s gland

mammals nonhuman mammals

workers

egg laying

queen mandibular gland pheromone

nonhuman vertebrates

iridaceae ixia

arthropoda invertebrata animalia muridae

aves vertebrata chordata animalia

mug ml



Hierarchical Theme Discovery (cont.)

  • A bottom up agglomerative approach:

    • Find many micro-themes

    • Group similar micro-themes into larger ones

    • Borrow strategy from data mining:

      • BIRCH: incrementally form many micro-clusters, organized in a tree structure

      • Macro-clustering based on micro-clusters.

    • Problem: Again, when to stop?
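A minimal sketch of the bottom-up idea follows: start from many micro-themes (word distributions) and repeatedly merge the closest pair. This is plain agglomerative merging, not BIRCH, which builds its micro-clusters incrementally in a tree. The Jensen-Shannon divergence is swapped in here as a symmetric distance, and the fixed threshold used to stop merging is an assumption, since the slide leaves the stopping question open.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (symmetric)."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl_to_mean(a):
        return sum(a[w] * math.log(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)

def merge(p, q, weight_p, weight_q):
    """Weighted average of two distributions (weights are member counts)."""
    vocab = set(p) | set(q)
    total = weight_p + weight_q
    return {w: (weight_p * p.get(w, 0.0) + weight_q * q.get(w, 0.0)) / total
            for w in vocab}

def agglomerate(micro_themes, threshold):
    """Merge the closest pair until the smallest distance exceeds the threshold.

    Returns a list of (distribution, member indices) pairs.
    """
    clusters = [(dist, [i]) for i, dist in enumerate(micro_themes)]
    while len(clusters) > 1:
        pairs = [(js_divergence(clusters[i][0], clusters[j][0]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        distance, i, j = min(pairs)
        if distance > threshold:  # assumed stopping rule
            break
        (p, members_p), (q, members_q) = clusters[i], clusters[j]
        merged = (merge(p, q, len(members_p), len(members_q)), members_p + members_q)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters

# Three hypothetical micro-themes; the first two are near-duplicates.
micro = [
    {"mite": 0.6, "varroa": 0.4},
    {"mite": 0.5, "varroa": 0.5},
    {"pollen": 0.7, "flower": 0.3},
]
result = agglomerate(micro, threshold=0.1)
print(sorted(sorted(members) for _, members in result))  # [[0, 1], [2]]
```

Recomputing all pairwise distances on every pass is quadratic per merge, which is acceptable for a few hundred micro-themes but is exactly the cost that tree-based schemes such as BIRCH are designed to avoid.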


Hierarchical Theme Discovery (cont.)

  • Model-based approach:

    • Hofmann, IJCAI 99.

    • Assume the collection is generated from a known hierarchical structure, and use a generative model to learn the themes (e.g. make use of the Gene Ontology (GO) hierarchies).

    • Problem: in most cases we don’t know the hierarchies.


Other Research Problems

  • Represent a theme:

    • Using top words: where to cut off?

    • Using phrases: the MMR has to be tuned (many possible strategies and parameters)

    • Using sentences, as in summarization?

  • Themes are interesting… but how do we make use of them?

  • How to evaluate themes?


Concept Extraction

  • What we have now:

    • N-gram algorithm (actually 2-gram): iteratively group the pair of terms most likely to be replaceable by each other, judged by the one term of context before and after each.

    • Time complexity: O(N^3); space complexity: currently O(N^2). The Beespace server can currently handle <= 9000 terms (2.4 GB of memory). (Performance not yet evaluated, because only a small data size is feasible.)

    • Problem: the method is based on mutual information, which prefers low-frequency 2-grams, and it doesn't make use of more distant context.

    • Will removing stop words help or hurt performance?
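The low-frequency bias mentioned above can be seen with a small pointwise mutual information (PMI) calculation over 2-grams. This is only a sketch of the scoring issue, not the grouping algorithm itself; the function name and the toy token sequence are made up for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI of each adjacent word pair: log( P(a, b) / (P(a) * P(b)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi

# "honey bee" occurs four times; "dufour gland" occurs once.
tokens = "honey bee honey bee honey bee honey bee dufour gland".split()
scores = bigram_pmi(tokens)
print(scores[("dufour", "gland")] > scores[("honey", "bee")])  # True
```

The pair seen only once gets the higher score because its words never occur apart, so a single observation looks like perfect association. Frequency thresholds or discounting are common ways to dampen this effect, though the slides do not say whether either was tried.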


Some findings:

  • A small dataset: (200+ abstracts containing gene synonyms)

  • Only 600 iterations (merge 600 times)

    • Most of them are reasonable, but not really useful

    • E.g. head-to-head tail-to-tail

    • E.g. within-locus between-locus

  • FBgn0000017: Dsrc Dabl

  • FBgn0000078: amylase-null AMY-null

  • Problem: doc-set too small, n-gram too sparse to find useful concepts.


Concept Extraction (cont.)

  • Other possible strategies:

    • Lin et al, KDD 02: use a feature vector to represent each term, where the weights are the mutual information between the term and each context feature. This is more flexible than n-grams. (If only 2-grams are considered as context features, this is similar to what we have.)

    • Use a committee to represent each cluster, which helps ensure that clusters are tight and robust.

    • Problem: not sure how to select features
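The feature-vector idea can be sketched as follows: each term gets a vector of context features weighted by pointwise mutual information, and terms are compared by cosine similarity. The choice of features (neighboring words within a window), the positive-PMI cutoff, and the toy sentences are assumptions for illustration. The committee step is not shown.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=1):
    """Map each term to {context feature: PMI weight}.

    A feature is an (offset, word) pair, e.g. (-1, "the") means
    "the word immediately to the left is 'the'".
    """
    term_counts, feat_counts, pair_counts = Counter(), Counter(), Counter()
    for sent in sentences:
        for i, term in enumerate(sent):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or j < 0 or j >= len(sent):
                    continue
                feat = (offset, sent[j])
                term_counts[term] += 1
                feat_counts[feat] += 1
                pair_counts[(term, feat)] += 1
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (term, feat), count in pair_counts.items():
        pmi = math.log(count * total / (term_counts[term] * feat_counts[feat]))
        if pmi > 0:  # keep positive associations only (a common simplification)
            vectors[term][feat] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

sentences = [
    ["the", "Dsrc", "gene", "is", "expressed"],
    ["the", "Dabl", "gene", "is", "expressed"],
    ["bees", "collect", "pollen"],
]
vectors = context_vectors(sentences)
print(round(cosine(vectors["Dsrc"], vectors["Dabl"]), 6))  # 1.0
print(cosine(vectors["Dsrc"], vectors["pollen"]))          # 0.0
```

Widening `window` or adding other feature types changes which terms look interchangeable, which is where the feature-selection question above comes in.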


Summary

  • Theme Extraction:

    • Generally performs well, if we can find a good k.

    • Hierarchical clustering can solve this problem, but we still need to find a reasonable stopping criterion.

    • Representation is an interesting problem: MMR phrase extraction should be tuned further

    • Difficult to evaluate other than by expert judgment

  • Concept extraction:

    • N-gram has space constraints: haven’t really tested the performance… Generally, the performance should be better on large data sets

    • Other clustering algorithms can be explored.

