Part 2: What is under the hood: Topic modeling and things you can do with it …


Topic modeling allows useful analyses

  • Turn written words into measurable quantities

    • How much is written on topic X?

    • How close are documents A and B?


What is the topic model?

  • Bayesian model for a collection of text documents

    • Finds patterns of co-occurring words

  • Intuition: documents exhibit mixtures of topics

  • Learn using Gibbs sampling (unsupervised)

    • Blei, Ng, Jordan: Latent Dirichlet Allocation (2003)

    • Griffiths & Steyvers: Finding Scientific Topics (2004)


What is a topic?

topics are distributions over words

documents are a mixture of topics

topic is a latent variable
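The three statements above can be written as a single mixture formula. The notation is an assumption chosen for illustration and does not appear on the slides: \phi_{w|t} is the probability of word w under topic t, \theta_{t|d} is the proportion of topic t in document d, and T is the number of topics.

```latex
p(w \mid d) = \sum_{t=1}^{T} \phi_{w \mid t}\, \theta_{t \mid d},
\qquad \sum_{w} \phi_{w \mid t} = 1,
\qquad \sum_{t=1}^{T} \theta_{t \mid d} = 1
```

The topic t that generated each word is never observed, which is what makes it a latent variable.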

Example learned topics:

[BAYESIAN INFERENCE] sampling bayesian prior distribution sample monte_carlo method model samples posterior markov_chain inference importance gibbs likelihood parameter bayes mixture mcmc gaussian …

[DIGITAL LIBRARIES] digital library libraries access collection information metadata electronic repository repositories catalog archives archive providing content portal resources …


Topic modeling is better than clustering

  • Topic model: a document can exhibit multiple topics

  • Clustering: a document is assigned to one cluster


Learn topic model using Gibbs sampling
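A minimal sketch of how a topic model can be learned with collapsed Gibbs sampling, in the style described by Griffiths & Steyvers. It is an illustration under stated assumptions, not the presenters' implementation: the function name, the integer word-id input format, and the hyperparameter values alpha and beta are all hypothetical choices.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for a topic model (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    alpha, beta: smoothing hyperparameters (assumed values, not from the slides).
    Returns (z, doc_topic, topic_word): per-word topic ids and the count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d with topic t
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w has topic t
    topic_total = [0] * num_topics                              # words with topic t overall
    z = []

    # Start from a random topic id for every word.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(num_topics)
            z[d].append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of each topic given all other assignments:
                # (how much doc d uses topic k) * (how much topic k uses word w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                # Sample a new topic id and put the word back.
                t = rng.choices(topics, weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word
```

No labels are supplied anywhere, which is why the learning is unsupervised: topics emerge only from which words tend to appear in the same documents.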


Topic model is fast and scalable

  • Newman+, NIPS 2007

  • Porteous, Newman+, SIGKDD 2008

  • MEDLINE/PubMed: 8 million abstracts, 700 million words

  • Topic model (one processor, P1): time = 1 year, memory = 400 GB

  • Distributed topic model (processors P1, P2, P3, …, P1024): time = 1 day, memory < 1 GB/proc


Topic modeling to “measure” texts

  • Analyzed state-of-the-field of women’s history

    • Block & Newman, 2008 (under review, J. Women’s History)

    • Text mined 20 years of history publications (800,000 abstracts)

  • Busted some myths …

    • e.g. Sexuality studies is a modern project


Proportion of women’s history publications devoted to sexuality studies

  • Q: How modern is sexuality studies?
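One way a question like this can be turned into a number is to average a topic's share over all publications from each year. The sketch below assumes each publication is available as a (year, topic_mix) pair; the function name, the record layout, and the toy values are hypothetical and are not data from the study.

```python
from collections import defaultdict


def topic_share_by_year(docs, topic):
    """Average proportion of `topic` per year.

    docs: iterable of (year, topic_mix), where topic_mix maps a topic label
    to that document's proportion of the topic (proportions sum to 1).
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for year, mix in docs:
        total[year] += mix.get(topic, 0.0)  # topic absent from a document counts as 0
        count[year] += 1
    return {year: total[year] / count[year] for year in sorted(count)}


# Toy input, made up for illustration only.
toy_docs = [
    (1985, {"sexuality": 0.2, "labor": 0.8}),
    (1985, {"labor": 1.0}),
    (2005, {"sexuality": 0.5, "family": 0.5}),
]
trend = topic_share_by_year(toy_docs, "sexuality")
```

Plotting the resulting per-year values gives a trend line of the kind the slide title describes.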


Use to analyze research portfolio

  • What research does NINDS fund?

  • What other institutes fund research done under NINDS?

  • Measure spending by disease

    • Potentially more accurate, does not rely on classification/keyword/thesaurus terms

    • Find spending overlap (e.g. funding duplicated across institutes), and gaps
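The spending measurements above can be sketched by splitting each grant's dollars across topics in proportion to its topic mix. Everything in this example is an assumption for illustration: the function names, the (institute, dollars, topic_mix) record layout, the toy amounts, and the choice of the smaller amount as the overlap measure.

```python
def spending_by_topic(grants):
    """Split each grant's dollars across topics by its topic mix.

    grants: list of (institute, dollars, topic_mix) tuples (hypothetical layout).
    Returns {institute: {topic: dollars}}.
    """
    out = {}
    for institute, dollars, mix in grants:
        per_topic = out.setdefault(institute, {})
        for topic, share in mix.items():
            per_topic[topic] = per_topic.get(topic, 0.0) + dollars * share
    return out


def funding_overlap(spend_a, spend_b):
    """Topics funded by both institutes, with the smaller of the two amounts.

    Taking the minimum is one possible overlap measure, assumed here.
    """
    return {t: min(spend_a[t], spend_b[t]) for t in spend_a if t in spend_b}


# Toy grants, made up for illustration only.
toy_grants = [
    ("NINDS", 100.0, {"ion channels": 0.6, "stroke": 0.4}),
    ("OTHER", 200.0, {"ion channels": 0.5, "cardiac": 0.5}),
]
spend = spending_by_topic(toy_grants)
shared = funding_overlap(spend["NINDS"], spend["OTHER"])
```

Topics that appear for one institute but not the other are candidates for gaps rather than overlap.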


What topics does NINDS fund?


How is research on ion channels shared across other institutes?


Topic-based document-document distance

[Figure: the topic mix of Doc A and the topic mix of Doc B are compared to give dist (Doc A, Doc B)]
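A minimal sketch of a topic-based distance between two documents. The slides do not say which distance was used, so the Hellinger distance here is an assumption, as are the function names and the representation of a topic mix as a list of proportions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two topic mixes.

    p, q: equal-length lists of topic proportions, each summing to 1.
    Result is 0 for identical mixes and 1 for mixes with no topic in common.
    """
    if len(p) != len(q):
        raise ValueError("topic mixes must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def most_similar(query, mixes, k=3):
    """Return the k (distance, name) pairs closest to the query mix.

    mixes: dict mapping a document name to its topic mix.
    """
    return sorted((hellinger(query, mix), name) for name, mix in mixes.items())[:k]
```

Because the comparison is made on topic mixes rather than raw words, two documents can be close even when they share few exact terms.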


Example: Most similar sections across books?

  • Between:

    • Any Austen book, and

    • Melville’s Moby Dick


Grants close to given grant

5R01NS024471-21

Ion channels of neurons

PI: JONES, STEPHEN W

Similar Grants:

  • (0.8) 5R01NS043259-04 Molecular mechanisms of voltage-gated ion channels (LARSSON, HANS PETER)

  • (0.8) 5R01GM069837-03 Ion Regulation of Kv Channel Gating and Permeation (DEUTSCH, CAROL J.)

  • (0.8) 5R01HL075536-03 Voltage Sensor Movement in the HERG Potassium Channel (TRISTANI-FIROUZI, MARTIN)

  • (0.8) 5R01HL065299-06 Molecular Mechanisms of Pacemaker Channel Function (SANGUINETTI, MICHAEL C.)

  • (0.8) 5R01NS045383-10 Molecular Physiology of K and Ca Channels (YANG, JIAN)

  • (0.8) 5R01DK046950-12 Molecular Cloning of Epithelial K Channels (SACKIN, HENRY)

  • (0.8) 5R01HL050411-13 Cardiac Na+ Channel: Molecular Basis of Permeation (TOMASELLI, GORDON)

  • (0.8) 5R01HL044630-16 Pharmacology of Cardiac Sodium Channel Modifiers (SHEETS, MICHAEL F.)


Hierarchical labeling of topic maps

  • Learn topics for D documents

  • Compute all D² document-document distances

  • Compute 2-dim layout of D documents (using DrL, PCA, MDS, Isomap, LLE, etc)

  • Create labels

    • Lower levels: domain expert interprets topic, creates short label

    • Higher levels: cluster topics into group (use hierarchical agglomerative clustering), then domain expert labels group

  • Place labels

    • Cluster 2-dim points using K-means

      • K-means well suited to clumpiness of DrL layouts

    • Use majority label
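The final "Place labels" step above can be sketched as follows. The example assumes the 2-dim coordinates have already been produced by a layout tool and that each document already carries a label; it implements only a plain K-means and a majority vote. All function names and the input format are hypothetical.

```python
import random
from collections import Counter


def nearest(point, centers):
    """Index of the center closest to a 2-dim point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: (point[0] - centers[c][0]) ** 2 + (point[1] - centers[c][1]) ** 2,
    )


def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on 2-dim points; returns (cluster id per point, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points
    assign = None
    for _ in range(iters):
        new_assign = [nearest(p, centers) for p in points]
        if new_assign == assign:  # assignments stopped changing
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assign, centers


def majority_labels(assign, doc_labels):
    """Most common document label in each cluster."""
    votes = {}
    for cluster, label in zip(assign, doc_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}
```

Each cluster's majority label would then be drawn at the cluster center on the map.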


K-Means clustering of DrL layout using K=130


Interactive visual browse

  • Query → {documents}

  • Iterate:

    • {documents} → topic map

    • topic map → {user subselects documents}


query = “colon cancer”


Thank you


Links

  • http://datalab-1.ics.uci.edu/anthrax2/test.php

  • http://datalab-1.ics.uci.edu/newman/pubmed/

  • http://datalab-1.ics.uci.edu/ninds/

  • http://yarra.ics.uci.edu/pubmedtrends/

  • http://yarra.ics.uci.edu/calit2/

  • http://yarra.ics.uci.edu/topic/enron/

  • http://scimaps.org/maps/ninds/

  • http://scimaps.org/maps/neurovis/


query = “p53”


Topics of topics

  • Topic model learns patterns of co-occurring words

  • Rerun topic model (on ‘documents’ where the topic mixes are the ‘words’)

    → learn patterns of co-occurring topics
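A minimal sketch of how a topic mix might be turned into a 'document' whose 'words' are topic ids, so that a topic model can be rerun on it. The function name, the fixed pseudo-document length, and the rounding scheme are assumptions for illustration; the slides do not describe how this conversion was done, and the proportions in the toy mix are made up.

```python
def mix_to_pseudo_document(mix, length=100):
    """Turn a topic mix into a list of topic-id tokens.

    mix: dict mapping a topic id (for example "t320") to its proportion.
    length: assumed token budget; each topic gets round(share * length) tokens.
    """
    tokens = []
    for topic, share in mix.items():
        tokens.extend([topic] * round(share * length))
    return tokens


# Toy mix, made up for illustration only.
toy_mix = {"t320": 0.5, "t640": 0.3, "t542": 0.2}
pseudo_doc = mix_to_pseudo_document(toy_mix, length=10)
```

A corpus of such pseudo-documents can be fed to any topic-model learner in place of ordinary text.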


Co-occurring topics in PubMed

[super56]

[t320] laparoscopic patient surgery open time complication procedure postoperative

[t640] bladder urethral incontinence urinary urinary_incontinence patient detrusor urodynamic

[t542] biliary gallbladder bile_duct patient duct bile common endoscopic

[t299] patient surgery surgical treatment operation surgical_treatment indication conservative

[t242] resection anastomosis patient anastomotic operation gastrectomy postoperative anastomoses

[super48]

[t1389] patient radiotherapy chemotherapy treatment survival surgery tumor disease

[t392] lymph_node patient lymph_nodes nodes dissection metastases axillary staging

[t1163] patient prognostic survival prognosis factor prognostic_factor tumor stage

[t1300] patient month follow-up year recurrence range underwent treated

[t120] stage patient stages iii disease iv ii stage_i


Austen, Dickens, Melville

  • 8 novels (Austen++), 1429 sections of 100 lines

    • Emma

    • Mansfield Park

    • Northanger Abbey

    • Persuasion

    • Pride and Prejudice

    • Sense and Sensibility

    • Our Mutual Friend

    • Moby Dick


Trend of ‘Sentiment’ topic throughout Austen novels

[SENTIMENT] felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering concern dreadful misery unhappy emotion


Topical similarity across Austen/Dickens/Melville

[Figure: topical similarity between the 6 Austen novels, Our Mutual Friend, and Moby Dick, on a scale from Different to Similar]


Most similar sections: Austen -- Melville


A topic ID is assigned to every word
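Once every word carries a topic ID, a document's topic mix can be read off by counting. The sketch below is a minimal, unsmoothed version; the function name and the integer topic-id format are assumptions for illustration.

```python
from collections import Counter


def topic_mix(word_topics, num_topics):
    """Proportion of a document's words assigned to each topic.

    word_topics: list of integer topic ids, one per word in the document.
    Returns a list of length num_topics that sums to 1 (no smoothing applied).
    """
    if not word_topics:
        raise ValueError("document has no words")
    counts = Counter(word_topics)
    return [counts[t] / len(word_topics) for t in range(num_topics)]
```

For instance, a four-word document with topic ids 0, 0, 1, 2 has half of its words on topic 0 and a quarter each on topics 1 and 2.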


Finding Funding Overlap

The US Office of Science and Technology Policy wanted to analyze NSF and NIH funding to determine areas of overlap.

How much funding overlap is there by topic area?


Sample Topics from Topic Model (22,000 abstracts)


Program Similarity


Visualization of funding programs – nearby programs support similar topics

NSF – BIO

NSF – SBE

NIH

