Part 2: What is under the hood: Topic modeling and things you can do with it …



PowerPoint Slideshow about 'Part 2: What is under the hood: Topic modeling and things you can do with it …' - vanida


Presentation Transcript
Topic modeling allows useful analyses
  • Turn written words into measurable quantities
    • Measure how much is written on topic X?
    • Measure how close are documents A and B?
What is the topic model?
  • Bayesian model for a collection of text documents
    • Finds patterns of co-occurring words
  • Intuition: documents exhibit mixtures of topics
  • Learn using Gibbs sampling (unsupervised)
    • Blei, Ng, Jordan: Latent Dirichlet Allocation (2003)
    • Griffiths & Steyvers: Finding Scientific Topics (2004)
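The Gibbs-sampling learning step can be sketched as below. This is a minimal collapsed Gibbs sampler in the spirit of the cited Griffiths & Steyvers paper, not the presenters' code. The function name `lda_gibbs`, the hyperparameter defaults `alpha` and `beta`, and the toy documents are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for a topic model; docs are lists of word ids."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    n_k = [0] * num_topics
    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized probability of each topic given all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Smoothed estimates: topic mix per document, word distribution per topic.
    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi

# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=4, iters=50)
```

Here `theta[d]` is the topic mix of document `d` and `phi[k]` is the word distribution of topic `k`; no labels are supplied at any point, which is what makes the procedure unsupervised.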
What is a topic?

topics are distributions over words

documents are a mixture of topics

topic is a latent variable

Example learned topics:

[BAYESIAN INFERENCE] sampling bayesian prior distribution sample monte_carlo method model samples posterior markov_chain inference importance gibbs likelihood parameter bayes mixture mcmc gaussian …

[DIGITAL LIBRARIES] digital library libraries access collection information metadata electronic repository repositories catalog archives archive providing content portal resources …

Topic model is fast and scalable
  • Newman+, NIPS 2007
  • Porteous, Newman+, SIGKDD 2008

Topic model: time = 1 year, memory = 400 GB

Distributed topic model: time = 1 day, memory < 1 GB/proc

MEDLINE/PubMed: 8 million abstracts, 700 million words

[Figure: a single processor (P1) compared with processors P1, P2, P3, …, P1024]
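The slide does not say how the processors are kept in agreement. One approach, as I understand the cited Newman et al. paper, is to let each processor sample its own share of the documents against a local copy of the word-topic counts and then fold every processor's changes back into the global table. The sketch below shows only that merge step; the function name `merge_topic_counts` and the toy counts are hypothetical.

```python
def merge_topic_counts(global_counts, local_counts):
    """Fold each processor's count changes back into the global word-topic table.

    global_counts: table (word x topic) that every processor started from.
    local_counts:  one table per processor, after its local sampling sweep.
    """
    merged = [row[:] for row in global_counts]
    for local in local_counts:
        for w, row in enumerate(local):
            for k, value in enumerate(row):
                # Add only the change this processor made since the last sync.
                merged[w][k] += value - global_counts[w][k]
    return merged

# Two processors start from the same table and each moves one token.
start = [[2, 1], [0, 3]]
proc_a = [[1, 2], [0, 3]]  # moved one token of word 0 from topic 0 to topic 1
proc_b = [[2, 1], [1, 2]]  # moved one token of word 1 from topic 1 to topic 0
merged = merge_topic_counts(start, [proc_a, proc_b])
```

Because each processor holds only its own documents plus one copy of the word-topic table, the per-processor memory stays small, which is consistent with the "< 1 GB/proc" figure on the slide.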

Topic modeling to “measure” texts
  • Analyzed state-of-the-field of women’s history
    • Block & Newman, 2008 (under review, J. Women’s History)
    • Text mined 20 years of history publications (800,000 abstracts)
  • Busted some myths …
    • e.g. Sexuality studies is a modern project
Proportion of women’s history publications devoted to sexuality studies
  • Q: How modern is sexuality studies?
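A proportion like this can be computed from per-document topic mixes. The sketch below averages the share of one topic over the documents published in each year. The record layout, the function name `topic_share_by_year`, and the numbers are assumptions for illustration and are not taken from the study.

```python
from collections import defaultdict

def topic_share_by_year(records, topic):
    """records: iterable of (year, topic_mix). Returns {year: mean share of `topic`}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, mix in records:
        totals[year] += mix[topic]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Hypothetical abstracts: topic 0 stands in for the topic of interest.
records = [(1990, [0.2, 0.8]), (1990, [0.4, 0.6]), (1991, [0.5, 0.5])]
shares = topic_share_by_year(records, topic=0)
```

Plotting `shares` against the year gives a trend line of the kind the slide asks about.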
Use to analyze research portfolio
  • What research does NINDS fund?
  • What other institutes fund research done under NINDS?
  • Measure spending by disease
    • Potentially more accurate, does not rely on classification/keyword/thesaurus terms
    • Find spending overlap (e.g. funding duplicated across institutes), and gaps
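Spending by topic can be estimated without keyword or thesaurus terms by splitting each award across topics in proportion to the grant's topic mix. The sketch below assumes that simple proportional split; the function name `spending_by_topic` and the dollar amounts are hypothetical.

```python
def spending_by_topic(grants, num_topics):
    """grants: iterable of (dollars, topic_mix). Splits each award across topics."""
    totals = [0.0] * num_topics
    for dollars, mix in grants:
        for k, share in enumerate(mix):
            totals[k] += dollars * share
    return totals

# Hypothetical awards with two-topic mixes.
grants = [(100_000, [0.75, 0.25]), (200_000, [0.5, 0.5])]
totals = spending_by_topic(grants, num_topics=2)
```

Running the same function once per institute and comparing the totals topic by topic is one way to look for overlap and gaps.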
Topic-based document-document distance

[Figure: the topic mixes of Doc A and Doc B are compared to give dist(Doc A, Doc B)]
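The slide does not name the distance function. As one plausible choice for comparing two probability vectors, the sketch below uses the Hellinger distance, which is 0 for identical topic mixes and 1 for mixes with no topic in common. The choice of metric and the example mixes are assumptions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic mixes (each sums to 1)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.0, 0.1, 0.9]
near = hellinger(doc_a, doc_b)  # small: similar mixes
far = hellinger(doc_a, doc_c)   # larger: different mixes
```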

Example: Most similar sections across books?
  • Between:
  • Any Austen book, and
  • Melville’s Moby Dick
Grants close to given grant

5R01NS024471-21

Ion channels of neurons

PI: JONES, STEPHEN W

Similar Grants:

  • (0.8) 5R01NS043259-04 Molecular mechanisms of voltage-gated ion channels (LARSSON, HANS PETER)
  • (0.8) 5R01GM069837-03 Ion Regulation of Kv Channel Gating and Permeation (DEUTSCH, CAROL J.)
  • (0.8) 5R01HL075536-03 Voltage Sensor Movement in the HERG Potassium Channel (TRISTANI-FIROUZI, MARTIN)
  • (0.8) 5R01HL065299-06 Molecular Mechanisms of Pacemaker Channel Function (SANGUINETTI, MICHAEL C.)
  • (0.8) 5R01NS045383-10 Molecular Physiology of K and Ca Channels (YANG, JIAN)
  • (0.8) 5R01DK046950-12 Molecular Cloning of Epithelial K Channels (SACKIN, HENRY)
  • (0.8) 5R01HL050411-13 Cardiac Na+ Channel: Molecular Basis of Permeation (TOMASELLI, GORDON)
  • (0.8) 5R01HL044630-16 Pharmacology of Cardiac Sodium Channel Modifiers (SHEETS, MICHAEL F.)
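A ranked list like this can be produced by scoring every other grant against the query grant's topic mix and keeping the top few. The slide does not say how the 0.8 scores were computed, so the sketch below assumes cosine similarity. The function names, grant ids, and mixes are hypothetical and are not the real NINDS data.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic mixes; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def most_similar(query_id, mixes, k=3):
    """mixes: {doc_id: topic_mix}. Returns the k ids most similar to query_id, with scores."""
    scored = [
        (cosine(mixes[query_id], mix), doc_id)
        for doc_id, mix in mixes.items()
        if doc_id != query_id
    ]
    scored.sort(reverse=True)
    return [(doc_id, round(score, 2)) for score, doc_id in scored[:k]]

# Hypothetical grants with two-topic mixes.
mixes = {"grant_a": [1.0, 0.0], "grant_b": [0.8, 0.6], "grant_c": [0.0, 1.0]}
neighbors = most_similar("grant_a", mixes, k=2)
```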

Hierarchical labeling of topic maps
  • Learn topics for D documents
  • Compute all D² document-document distances
  • Compute 2-dim layout of D documents (using DrL, PCA, MDS, Isomap, LLE, etc.)
  • Create labels
    • Lower levels: domain expert interprets topic, creates short label
    • Higher levels: cluster topics into group (use hierarchical agglomerative clustering), then domain expert labels group
  • Place labels
    • Cluster 2-dim points using K-means
      • K-means well suited to clumpiness of DrL layouts
    • Use majority label
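The label-placement step above can be sketched as follows: cluster the 2-dim points with K-means, then give each cluster the label held by the majority of its members and place it at the cluster center. The small K-means below is a plain stdlib implementation written for illustration. The function names, the points, and the labels are hypothetical, and the layout step (DrL, PCA, etc.) is assumed to have already produced the coordinates.

```python
import random
from collections import Counter

def kmeans_2d(points, k, iters=20, seed=0):
    """Plain K-means on 2-dim points; returns (centers, cluster index per point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assign = [
            min(range(k), key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, assign

def place_labels(points, labels, k):
    """Returns [(center, majority_label)] for each non-empty cluster."""
    centers, assign = kmeans_2d(points, k)
    placed = []
    for c in range(k):
        members = [lab for lab, a in zip(labels, assign) if a == c]
        if members:
            placed.append((centers[c], Counter(members).most_common(1)[0][0]))
    return placed

# Hypothetical layout: two clumps of documents with expert-assigned topic labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["ion channels", "ion channels", "stroke", "epilepsy", "epilepsy", "epilepsy"]
placed = place_labels(points, labels, k=2)
```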
Interactive visual browse
  • Query → {documents}
  • Iterate:
    • {documents} → topic map
    • topic map → {user subselects documents}

Links
  • http://datalab-1.ics.uci.edu/anthrax2/test.php
  • http://datalab-1.ics.uci.edu/newman/pubmed/
  • http://datalab-1.ics.uci.edu/ninds/
  • http://yarra.ics.uci.edu/pubmedtrends/
  • http://yarra.ics.uci.edu/calit2/
  • http://yarra.ics.uci.edu/topic/enron/
  • http://scimaps.org/maps/ninds/
  • http://scimaps.org/maps/neurovis/
Topics of topics
  • Topic model learns patterns of co-occurring words
  • Rerun topic model (on ‘documents’ where the topic mixes are the ‘words’)

→ Learn patterns of co-occurring topics
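To rerun the topic model on topic mixes, each mix first has to be turned into something that looks like a document. The slide does not say how this was done. The sketch below assumes one simple encoding: each topic id is repeated in proportion to its share, and shares below a small threshold are dropped. The function name `mixes_to_pseudo_docs` and the parameters `tokens_per_doc` and `min_share` are hypothetical.

```python
def mixes_to_pseudo_docs(mixes, tokens_per_doc=100, min_share=0.05):
    """Turn each topic mix into a bag of topic ids that a second topic model can read."""
    pseudo_docs = []
    for mix in mixes:
        doc = []
        for k, share in enumerate(mix):
            if share >= min_share:
                # Repeat topic id k in proportion to its share of the document.
                doc.extend([k] * round(share * tokens_per_doc))
        pseudo_docs.append(doc)
    return pseudo_docs

# One hypothetical mix over three topics; the 2% topic is dropped.
pseudo = mixes_to_pseudo_docs([[0.5, 0.48, 0.02]], tokens_per_doc=10)
```

The resulting pseudo-documents use topic ids as their vocabulary, so the topics learned from them group first-level topics that tend to appear together.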

Co-occurring topics in PubMed

[super56]

[t320] laparoscopic patient surgery open time complication procedure postoperative

[t640] bladder urethral incontinence urinary urinary_incontinence patient detrusor urodynamic

[t542] biliary gallbladder bile_duct patient duct bile common endoscopic

[t299] patient surgery surgical treatment operation surgical_treatment indication conservative

[t242] resection anastomosis patient anastomotic operation gastrectomy postoperative anastomoses

[super48]

[t1389] patient radiotherapy chemotherapy treatment survival surgery tumor disease

[t392] lymph_node patient lymph_nodes nodes dissection metastases axillary staging

[t1163] patient prognostic survival prognosis factor prognostic_factor tumor stage

[t1300] patient month follow-up year recurrence range underwent treated

[t120] stage patient stages iii disease iv ii stage_i

Austen, Dickens, Melville

1429 sections of 100 lines

8 novels (Austen++):
  • Emma
  • Mansfield Park
  • Northanger Abbey
  • Persuasion
  • Pride and Prejudice
  • Sense and Sensibility
  • Our Mutual Friend
  • Moby Dick

Trend of ‘Sentiment’ topic throughout Austen novels

[SENTIMENT] felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering concern dreadful misery unhappy emotion

Topical similarity across Austen/Dickens/Melville

[Figure: pairwise topical similarity among the 6 Austen novels, Our Mutual Friend, and Moby Dick, on a scale from Different to Similar]

Finding Funding Overlap

The US Office of Science and Technology Policy wanted to analyze NSF and NIH funding to determine areas of overlap.

How much funding overlap is there by topic area?

Visualization of funding programs: nearby programs support similar topics

NSF – BIO

NSF – SBE

NIH
