Part 2: What is under the hood: Topic modeling and things you can do with it …


Topic modeling allows useful analyses

  • Turn written words into measurable quantities

    • Measure how much is written on topic X

    • Measure how close documents A and B are


What is the topic model?

  • Bayesian model for a collection of text documents

    • Finds patterns of co-occurring words

  • Intuition: documents exhibit mixtures of topics

  • Learn using Gibbs sampling (unsupervised)

    • Blei, Ng, Jordan: Latent Dirichlet Allocation (2003)

    • Griffiths & Steyvers: Finding Scientific Topics (2004)


What is a topic?

topics are distributions over words

documents are a mixture of topics

topic is a latent variable

Example learned topics:

[BAYESIAN INFERENCE] sampling bayesian prior distribution sample monte_carlo method model samples posterior markov_chain inference importance gibbs likelihood parameter bayes mixture mcmc gaussian …

[DIGITAL LIBRARIES] digital library libraries access collection information metadata electronic repository repositories catalog archives archive providing content portal resources …


Topic modeling is better than clustering

  • Topic model: a document can exhibit multiple topics

  • Clustering: a document is assigned to one cluster


Learn topic model using Gibbs sampling
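
A minimal sketch of a collapsed Gibbs sampler for the topic model, in the style described by Griffiths & Steyvers (2004). The function names, the toy corpus, and the hyperparameter values alpha and beta are illustrative assumptions, not the presenters' implementation.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """docs is a list of documents, each a list of integer word ids.
    Returns per-word topic ids z, doc-topic counts, and topic-word counts."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # times word w is assigned to topic k
    n_k = [0] * n_topics                                 # total words assigned to topic k
    z = []
    # Start from a random topic ID for every word.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw

def topic_mix(n_dk, alpha=0.1):
    """Smoothed per-document topic proportions estimated from the counts."""
    n_topics = len(n_dk[0])
    return [[(n + alpha) / (sum(row) + n_topics * alpha) for n in row] for row in n_dk]

# Toy corpus (made up): 3 documents over a 6-word vocabulary.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 1, 3, 4, 2]]
z, n_dk, n_kw = gibbs_lda(docs, n_topics=2, vocab_size=6)
theta = topic_mix(n_dk)
```

Here `z` holds one topic ID per word and `theta` holds one topic mix per document; both are outputs of this sketch only.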


Topic model is fast and scalable

  • Newman+, NIPS 2007

  • Porteous, Newman+, SIGKDD 2008

  • Corpus: MEDLINE/PubMed, 8 million abstracts, 700 million words

  • Topic model on a single processor (P1): time = 1 year, memory = 400 GB

  • Distributed topic model on processors P1, P2, P3, …, P1024: time = 1 day, memory < 1 GB/proc
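
A hedged sketch of the count-reconciliation step used in approximate distributed topic modeling as described in Newman+ (NIPS 2007): each processor samples its own partition of the documents against a local copy of the topic-word counts, and the global counts are then updated with the sum of every processor's changes. Only the merge is shown; the function name and the toy numbers are assumptions for illustration.

```python
def merge_counts(global_counts, local_counts_per_proc):
    """global_counts: topic-word count matrix before the sweep.
    local_counts_per_proc: one matrix per processor, after its local sweep.
    Returns the global matrix plus each processor's change to it."""
    merged = [row[:] for row in global_counts]
    for local in local_counts_per_proc:
        for k, row in enumerate(local):
            for w, count in enumerate(row):
                merged[k][w] += count - global_counts[k][w]
    return merged

# Toy example (made up): 2 topics, 2 words, 2 processors.
before = [[2, 1], [0, 3]]
after_p1 = [[1, 1], [1, 3]]   # P1 moved one token of word 0 from topic 0 to topic 1
after_p2 = [[2, 2], [0, 2]]   # P2 moved one token of word 1 from topic 1 to topic 0
merged = merge_counts(before, [after_p1, after_p2])
```

Because every processor only needs its own documents plus one copy of the topic-word counts, per-processor memory stays small, which is consistent with the memory figures on the slide.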


Topic modeling to “measure” texts

  • Analyzed state-of-the-field of women’s history

    • Block & Newman, 2008 (under review, J. Women’s History)

    • Text mined 20 years of history publications (800,000 abstracts)

  • Busted some myths …

    • e.g. Sexuality studies is a modern project


Proportion of women’s history publications devoted to sexuality studies

  • Q: How modern is sexuality studies?


Use to analyze research portfolio

  • What research does NINDS fund?

  • What other institutes fund research done under NINDS?

  • Measure spending by disease

    • Potentially more accurate, does not rely on classification/keyword/thesaurus terms

    • Find spending overlap (e.g. funding duplicated across institutes), and gaps
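
One way to turn topic mixes into a spending measure, sketched under the assumption that every grant comes with a dollar amount and a topic mix whose proportions sum to 1. Dollars are apportioned to topics in proportion to the mix. The function name, topic labels, and amounts are made up for illustration.

```python
def spending_by_topic(grants):
    """grants: list of (dollars, topic_mix), where topic_mix maps a topic
    label to the proportion of the grant text devoted to that topic."""
    totals = {}
    for dollars, mix in grants:
        for topic, proportion in mix.items():
            totals[topic] = totals.get(topic, 0.0) + dollars * proportion
    return totals

# Hypothetical grants, not real NINDS data.
grants = [
    (100.0, {"ion channels": 0.5, "stroke": 0.5}),
    (200.0, {"ion channels": 0.25, "epilepsy": 0.75}),
]
totals = spending_by_topic(grants)
```

Running the same function separately per institute and comparing the per-topic totals is one simple way to look for overlap and gaps.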


What topics does NINDS fund?


How is research on ion channels shared across other institutes?


Topic-based document-document distance

  • Represent each document by its topic mix

  • dist (Doc A, Doc B) = distance between the topic mix of Doc A and the topic mix of Doc B
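
The slides do not say which distance is used. As a sketch, the Hellinger distance is one common choice for comparing two probability distributions such as topic mixes; it is an assumption here, and the document names and mixes are made up.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic mixes (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def most_similar(query_mix, mixes, top_n=3):
    """Rank named documents by distance to the query, closest first."""
    scored = [(name, hellinger(query_mix, mix)) for name, mix in mixes.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_n]

# Hypothetical topic mixes over 3 topics.
doc_a = [0.7, 0.2, 0.1]
others = {"Doc B": [0.6, 0.3, 0.1], "Doc C": [0.1, 0.1, 0.8]}
ranked = most_similar(doc_a, others)
```

Any other distance or similarity on distributions (for example cosine similarity) could be swapped in without changing the ranking code.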


Example: Most similar sections across books?

  • Between any Austen book and Melville’s Moby Dick


Grants close to given grant

5R01NS024471-21

Ion channels of neurons

PI: JONES, STEPHEN W

Similar Grants:

  • (0.8) 5R01NS043259-04 Molecular mechanisms of voltage-gated ion channels (LARSSON, HANS PETER)

  • (0.8) 5R01GM069837-03 Ion Regulation of Kv Channel Gating and Permeation (DEUTSCH, CAROL J.)

  • (0.8) 5R01HL075536-03 Voltage Sensor Movement in the HERG Potassium Channel (TRISTANI-FIROUZI, MARTIN)

  • (0.8) 5R01HL065299-06 Molecular Mechanisms of Pacemaker Channel Function (SANGUINETTI, MICHAEL C.)

  • (0.8) 5R01NS045383-10 Molecular Physiology of K and Ca Channels (YANG, JIAN)

  • (0.8) 5R01DK046950-12 Molecular Cloning of Epithelial K Channels (SACKIN, HENRY)

  • (0.8) 5R01HL050411-13 Cardiac Na+ Channel: Molecular Basis of Permeation (TOMASELLI, GORDON)

  • (0.8) 5R01HL044630-16 Pharmacology of Cardiac Sodium Channel Modifiers (SHEETS, MICHAEL F.)


Hierarchical labeling of topic maps

  • Learn topics for D documents

  • Compute all D² document-document distances

  • Compute 2-dim layout of D documents (using DrL, PCA, MDS, Isomap, LLE, etc.)

  • Create labels

    • Lower levels: domain expert interprets topic, creates short label

    • Higher levels: cluster topics into groups (use hierarchical agglomerative clustering), then domain expert labels each group

  • Place labels

    • Cluster 2-dim points using K-means

      • K-means well suited to clumpiness of DrL layouts

    • Use majority label
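
The label-placement step can be sketched as follows: cluster the 2-dim layout points with K-means, then give each cluster the label held by the majority of its documents. This is a plain-Python illustration with made-up points and labels; the function names and the random initialization are assumptions, and a real layout would use a much larger K.

```python
import random
from collections import Counter

def kmeans_2d(points, k, n_iters=50, seed=0):
    """Basic K-means on 2-D points. Returns (cluster index per point, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # initialize centers from the data
    assign = [0] * len(points)
    for _ in range(n_iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assign, centers

def majority_labels(assign, labels):
    """Map each cluster index to the most frequent document label inside it."""
    votes = {}
    for cluster, label in zip(assign, labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Hypothetical layout: two clumps of documents with per-document topic labels.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["ion channels", "ion channels", "stroke", "cancer", "cancer", "imaging"]
assign, centers = kmeans_2d(points, k=2)
placed = majority_labels(assign, labels)
```

Each majority label would then be drawn at its cluster center on the map.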


K-Means clustering of DrL layout using K=130


Interactive visual browse

  • Query → {documents}

  • Iterate:

    {documents} → topic map

    topic map → {user subselects documents}


query = “colon cancer”


Thank you


Links

  • http://datalab-1.ics.uci.edu/anthrax2/test.php

  • http://datalab-1.ics.uci.edu/newman/pubmed/

  • http://datalab-1.ics.uci.edu/ninds/

  • http://yarra.ics.uci.edu/pubmedtrends/

  • http://yarra.ics.uci.edu/calit2/

  • http://yarra.ics.uci.edu/topic/enron/

  • http://scimaps.org/maps/ninds/

  • http://scimaps.org/maps/neurovis/


query = “p53”


Topics of topics

  • Topic model learns patterns of co-occurring words

  • Rerun topic model (on ‘documents’ where the topic mixes are the ‘words’)

    → learn patterns of co-occurring topics
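
A small sketch of how the second-level input could be built, assuming the first run yields, for every document, a count of words assigned to each topic. Each document becomes a pseudo-document whose ‘words’ are topic ids, repeated according to those counts. The function name and the counts are made up.

```python
def pseudo_documents(doc_topic_counts):
    """doc_topic_counts[d][k] = number of words in document d assigned to topic k.
    Returns one list of topic ids per document, with topic k repeated count times."""
    return [
        [k for k, count in enumerate(counts) for _ in range(count)]
        for counts in doc_topic_counts
    ]

# Hypothetical counts for 2 documents over 3 first-level topics.
counts = [[2, 0, 1], [0, 3, 0]]
topic_docs = pseudo_documents(counts)
```

These pseudo-documents can be passed to any topic-model learner with the vocabulary size set to the number of first-level topics.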


Co-occurring topics in PubMed

[super56]

[t320] laparoscopic patient surgery open time complication procedure postoperative

[t640] bladder urethral incontinence urinary urinary_incontinence patient detrusor urodynamic

[t542] biliary gallbladder bile_duct patient duct bile common endoscopic

[t299] patient surgery surgical treatment operation surgical_treatment indication conservative

[t242] resection anastomosis patient anastomotic operation gastrectomy postoperative anastomoses

[super48]

[t1389] patient radiotherapy chemotherapy treatment survival surgery tumor disease

[t392] lymph_node patient lymph_nodes nodes dissection metastases axillary staging

[t1163] patient prognostic survival prognosis factor prognostic_factor tumor stage

[t1300] patient month follow-up year recurrence range underwent treated

[t120] stage patient stages iii disease iv ii stage_i


Austen, Dickens, Melville

1429 sections of 100 lines

8 novels (Austen++)

Emma

Mansfield Park

Northanger Abbey

Persuasion

Pride and Prejudice

Sense and Sensibility

Our Mutual Friend

Moby Dick



Trend of ‘Sentiment’ topic throughout Austen novels

[SENTIMENT] felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering concern dreadful misery unhappy emotion
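
A trend like this can be sketched by measuring, for each section of a novel in reading order, the fraction of its words assigned to the topic of interest. The function name and the toy assignments are assumptions for illustration.

```python
def topic_trend(section_assignments, topic):
    """section_assignments: one list of per-word topic ids per section, in order.
    Returns the fraction of each section's words assigned to the given topic."""
    return [
        sum(1 for k in topic_ids if k == topic) / len(topic_ids) if topic_ids else 0.0
        for topic_ids in section_assignments
    ]

# Hypothetical assignments for 3 sections; topic 1 stands in for [SENTIMENT].
sections = [[0, 1, 1, 1], [1, 0, 0, 0], []]
trend = topic_trend(sections, topic=1)
```

Plotting `trend` against section number gives the curve for one novel.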


Topical similarity across Austen/Dickens/Melville

[Figure: topical similarity among the 6 Austen novels, Our Mutual Friend, and Moby Dick; scale runs from Different to Similar]


Most similar sections: Austen -- Melville


A topic ID is assigned to every word


Finding Funding Overlap

The US Office of Science and Technology Policy wanted to analyze NSF and NIH funding to determine areas of overlap.

How much funding overlap is there by topic area?


Sample Topics from Topic Model (22,000 abstracts)


Program Similarity


Visualization of funding programs – nearby programs support similar topics

NSF – BIO

NSF – SBE

NIH

