Mining the Medical Literature

Chirag Bhatt

October 14th, 2004

Why MINE data!
  • Medical, genomics, proteomics research
  • Find causal links between symptoms or diseases and drugs or chemicals
  • Gene comparison
An example
  • What is causing an uncharacteristic behavior in protein production?
  • Which genes have a role to play in amino acid synthesis?
  • Search through online literature for genes that play a role in amino acid synthesis
Data Retrieval
  • Company Database
    • e.g. Customer records, product inventory
  • Search entity (structured)
    • records
  • Query (goal-driven)
    • What is the address of our client?
    • How many widgets are in stock?
  • SQL, Oracle, DB2, etc
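
A goal-driven query over structured records can be sketched with Python's built-in sqlite3 module. The table name, column names, and stock figures below are hypothetical, chosen only to mirror the "widgets in stock" question.

```python
import sqlite3

# In-memory database standing in for a company inventory (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, in_stock INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("widget", 42), ("gadget", 7)],
)

# Goal-driven query: we know exactly which record and field we want.
(count,) = conn.execute(
    "SELECT in_stock FROM inventory WHERE product = ?", ("widget",)
).fetchone()
print(f"Widgets in stock: {count}")
conn.close()
```

The query names the record and field in advance, which is what separates data retrieval from the opportunistic pattern search of data mining.
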
Information Retrieval
  • Google, A9, AltaVista
  • Query (goal-driven)
  • Search entity (unstructured)
    • documents
    • variable format
      • html, pdf, etc
Data Mining
  • Structured data set
  • Generally a large amount of (historical) data
  • Find relations or patterns or trends in database (opportunistic)
    • E.g. “beer and diapers”
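
The "beer and diapers" pattern is usually expressed as an association rule with a support and a confidence. The following is a minimal sketch restricted to item pairs; the baskets and the 0.5 support threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support >= min_support:
            # Confidence of a -> b is the fraction of baskets with a that also contain b.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules


baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]
rules = pair_rules(baskets)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

No question is posed in advance here: the rules fall out of the historical data, which is the opportunistic character the slide describes.
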
Text Mining
  • Unstructured data set
    • Documents, publications, abstracts, web pages
  • Discover useful and previously unknown “gems” of information in large text collections using patterns, trends and domain knowledge
Need for mining text
  • Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)

Why Text Mining in Medical Literature?
  • Many multi-functional genes
    • Screen functionally interesting ones
  • Complexity of needs increasing
    • Individual genes -> family of genes
  • Manual text mining? Not really!
  • Availability of published literature online
Functionally Coherent Genes
  • Group of genes that exhibit similar experimental features
    • Amino acid metabolism, electron transport, stress response
  • Difficulties faced in finding functionally coherent genes
    • Most genes express multi-functionality
    • Some genes studied extensively and some only just discovered
Semantic neighbor
  • Two articles are semantic neighbors if they have similar word usage
  • Use statistical natural language processing to access and interpret online text
  • Find semantic neighbors in document set
  • If any article about a common function refers to at least one gene in the group, then the group is functionally coherent
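
"Similar word usage" can be made concrete in several ways. The sketch below assumes raw word counts compared by cosine similarity; the slides do not state which weighting or similarity measure was used, and the abstracts are invented.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Bag-of-words count vector for one abstract."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def semantic_neighbors(abstracts, doc_id, k=2):
    """Return the ids of the k abstracts whose word usage is closest to doc_id."""
    target = word_vector(abstracts[doc_id])
    scored = [
        (cosine(target, word_vector(text)), other)
        for other, text in abstracts.items()
        if other != doc_id
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical abstracts, for illustration only.
abstracts = {
    "a1": "arginine biosynthesis genes in yeast amino acid metabolism",
    "a2": "regulation of amino acid metabolism and arginine biosynthesis",
    "a3": "electron transport chain in mitochondria",
}
print(semantic_neighbors(abstracts, "a1", k=1))
```

In practice a term-weighting scheme such as tf-idf would likely replace raw counts, so that common words like "in" contribute less.
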
Neighbor divergence
  • Scoring method
  • Each article's relevance to the gene group is scored by:
    • the number of its semantic neighbors that refer to the group
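
The scoring rule above can be sketched as follows. The data structures (a neighbor list per article and a gene set per article) and the example identifiers are assumptions made for illustration.

```python
def neighbor_scores(neighbors, article_genes, gene_group):
    """Score each article by how many of its semantic neighbors mention a group gene.

    neighbors:     dict mapping article id -> list of neighbor article ids
    article_genes: dict mapping article id -> set of genes the article refers to
    gene_group:    iterable of genes in the group being assessed
    """
    group = set(gene_group)
    scores = {}
    for article, nbrs in neighbors.items():
        scores[article] = sum(
            1 for n in nbrs if group & set(article_genes.get(n, ()))
        )
    return scores


# Illustrative input only.
neighbors = {"a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2"]}
article_genes = {"a1": {"ARG1"}, "a2": {"ARG3"}, "a3": {"COX4"}}
scores = neighbor_scores(neighbors, article_genes, {"ARG1", "ARG3"})
print(scores)
```
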
Neighbor divergence scores

If the score distribution differs from a Poisson distribution, the gene group likely represents a biological function

The log ratio for a Poisson distribution should be flat along the horizontal axis
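
One way to quantify how far the observed scores are from Poisson is sketched below. It assumes the Poisson rate is estimated as the mean score and uses the Kullback-Leibler divergence as the summary statistic; both choices are assumptions of this sketch, not details given on the slide.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing k under a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def log_ratios(scores):
    """Log of observed frequency over Poisson-expected frequency, per score value."""
    n = len(scores)
    lam = sum(scores) / n
    observed = Counter(scores)
    return {
        k: math.log((observed[k] / n) / poisson_pmf(k, lam))
        for k in sorted(observed)
    }


def divergence(scores):
    """KL divergence of the empirical score distribution from the fitted Poisson."""
    n = len(scores)
    observed = Counter(scores)
    ratios = log_ratios(scores)
    return sum((count / n) * ratios[k] for k, count in observed.items())


# Hypothetical scores: most articles score 0, a few score very high.
coherent_like = [0] * 90 + [10] * 10
print(log_ratios(coherent_like))
print(divergence(coherent_like))
```

A group whose scores look Poisson gives log ratios near zero at every score value, which is the flat line the slide refers to; a heavy tail of high-scoring articles pushes the divergence up.
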

Need to filter results
  • Well-studied genes generally tend to have semantic neighbors that refer to the same gene
  • Neighbor may not be relevant to group function, but increases score – false positive
  • So only articles that refer to different genes are considered
  • Report percentile of a functional group of genes
  • Calculate precision and recall at different cutoff levels (next slide)
  • Replace legitimate genes in a group with irrelevant genes
  • Sample Space: 19 known yeast groups and 1900 random groups
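
Precision and recall at a cutoff can be computed as in this minimal sketch. The scores and labels are invented; in the setting above the positives would be the known functional groups and the negatives the random groups.

```python
def precision_recall(scored_groups, cutoff):
    """Precision and recall when groups scoring >= cutoff are called functional.

    scored_groups: list of (score, is_functional) pairs.
    """
    predicted = [is_func for score, is_func in scored_groups if score >= cutoff]
    true_pos = sum(predicted)
    total_pos = sum(is_func for _, is_func in scored_groups)
    # By convention, precision is reported as 1.0 when nothing is predicted positive.
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


# Hypothetical scores for five gene groups.
groups = [(0.9, True), (0.8, True), (0.4, False), (0.3, True), (0.1, False)]
for cutoff in (0.5, 0.2):
    p, r = precision_recall(groups, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the cutoff trades precision for recall, which is why the results are reported at several cutoff levels.
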
Limitations of neighbor divergence
  • Neighbor divergence helps group genes; it does not tell us their function
  • Work based on abstracts only
    • Entire literature search may prove challenging
    • Break into smaller components
Another mining approach

Extracting synonymous gene and protein terms

Why find synonyms?
  • Genes and proteins are often associated with multiple names across articles and sub domains
  • More names keep getting added
    • new functional or structural information is discovered
  • Improve search and analysis
Current work
  • Biological databases such as GenBank and SWISSPROT include synonyms
  • Not up to date
  • Disagreement on some synonyms
  • Laborious manual curation and review
  • Need for automation
Two-step problem
  • Identifying gene and protein names
    • Done by state-of-the-art taggers
  • Determining whether these names are synonymous
    • We’ll discuss more on this…
Current synonym approaches
  • Synonymous gene and protein names represent same biological substance
    • Exhibit identical biological functions
    • Same gene or amino acid sequences
  • Other approaches
    • String matching
    • Matching abbreviations to full forms
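
Matching an abbreviation to its full form can be approximated with a simple initial-letter check. This is a deliberately naive sketch; real matchers tolerate skipped words, internal letters, and digits.

```python
import re


def matches_abbreviation(abbrev, full_form):
    """True if abbrev equals the initials of the words in full_form (case-insensitive)."""
    words = [w for w in re.split(r"[\s\-]+", full_form) if w]
    initials = "".join(word[0] for word in words)
    return abbrev.lower() == initials.lower()


print(matches_abbreviation("TNF", "tumor necrosis factor"))
print(matches_abbreviation("MAPK", "mitogen-activated protein kinase"))
print(matches_abbreviation("TNF", "nerve growth factor"))
```

Hyphens are treated as word breaks so that "mitogen-activated" contributes two initials.
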

Gene and Protein Tagging
  • Identification step
    • Uses BLAST techniques and domain knowledge to pick out gene and protein terms
  • Heuristics
    • Synonyms usually occur within same sentence
    • Synonyms mentioned in first few pages of article
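
The same-sentence heuristic suggests a simple way to generate candidate synonym pairs once names have been tagged. The sketch below assumes the tagger output is already available as one list of names per sentence; the names are illustrative input only.

```python
from itertools import combinations


def candidate_pairs(tagged_sentences):
    """Collect every unordered pair of distinct names that co-occur in a sentence."""
    pairs = set()
    for names in tagged_sentences:
        for a, b in combinations(sorted(set(names)), 2):
            pairs.add((a, b))
    return pairs


# Each inner list holds the gene/protein names tagged in one sentence.
tagged = [
    ["lymphotactin", "SCM1"],
    ["XCR1"],
    ["SCM1", "lymphotactin", "XCR1"],
]
pairs = candidate_pairs(tagged)
print(sorted(pairs))
```

These pairs are only candidates: co-occurrence in a sentence does not make two names synonymous, so a later step has to decide which pairs to keep.
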
Synonym detection approaches
  • Unsupervised - ‘Similarity’
    • based on contextual similarity
  • Semi-supervised - ‘Snowball’
    • extracts structured relations using patterns
  • Supervised - Text Classification/SVM
  • Hand-crafted extraction – GPE
  • Combined system
Combined Approach
  • Combine output of Snowball, SVM, and GPE
    • Each system gives a confidence score for each synonym pair

Here, s = <p1, p2> is a synonym pair and ConfE(s) is the confidence assigned to s by the individual extraction system E
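
One plausible way to combine the per-system scores, assumed here for illustration rather than taken from the slide, is a noisy-OR rule: the pair is rejected only if every system independently fails to support it. The function name and the example confidences are hypothetical.

```python
def combined_confidence(confidences):
    """Noisy-OR combination: 1 minus the product of (1 - ConfE(s)) over all systems E."""
    miss = 1.0
    for conf in confidences:
        miss *= 1.0 - conf
    return 1.0 - miss


# Hypothetical confidences from three extraction systems for one synonym pair.
print(combined_confidence([0.6, 0.5, 0.0]))
```

Under this rule a pair that any single system scores highly ends up with a high combined score, and agreement between systems pushes it higher still. It treats the systems as independent, which is a simplifying assumption.
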

Unsupervised - Similarity
  • Context based
    • All words occurring within an ‘x’-word window
    • False positives are very common
    • Run time – O(|lexicon|³)
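
Contextual similarity can be sketched by collecting the words in a window around each term and comparing the resulting vectors. The window size, the cosine measure, and the toy sentence with placeholder names geneA, geneB, and geneC are all assumptions of this example.

```python
import math
from collections import Counter


def context_vectors(tokens, terms, window=2):
    """Count the words appearing within `window` tokens on either side of each term."""
    vectors = {term: Counter() for term in terms}
    for i, tok in enumerate(tokens):
        if tok in vectors:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vectors[tok].update(left + right)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


text = (
    "geneA activates the kinase pathway while geneB activates "
    "the kinase pathway and geneC binds dna"
)
vectors = context_vectors(text.split(), ["geneA", "geneB", "geneC"])
sim_ab = cosine(vectors["geneA"], vectors["geneB"])
sim_ac = cosine(vectors["geneA"], vectors["geneC"])
print(f"geneA~geneB: {sim_ab:.2f}, geneA~geneC: {sim_ac:.2f}")
```

The example also shows why false positives are common: geneA and geneB share contexts because they do similar things, not necessarily because they are two names for the same gene.
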
Semi-supervised - Snowball
  • Manual feedback mechanism
Supervised – Text Classification
  • Input: known synonym pairs
  • Automatically find contexts and assign weights
  • Train classifier to distinguish between ‘positive’ and ‘negative’ contexts
    • E.g. ‘A also known as B’ (positive) and ‘A regulates B’ (negative)
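
The idea of learning weights for contexts can be illustrated without an SVM. The sketch below swaps in smoothed log-odds weights per context word (closer to naive Bayes than to the SVM named on the slide), and the training contexts are invented.

```python
import math
from collections import Counter


def train_context_weights(examples):
    """Learn a weight per context word from labeled contexts.

    examples: list of (context_words, is_synonym_context) pairs.
    Positive weights indicate words typical of synonym contexts.
    """
    pos, neg = Counter(), Counter()
    for words, label in examples:
        (pos if label else neg).update(set(words))
    n_pos = sum(1 for _, label in examples if label)
    n_neg = len(examples) - n_pos
    vocab = set(pos) | set(neg)
    # Add-one smoothed log-odds of seeing the word in a positive vs. negative context.
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
        for w in vocab
    }


def score(weights, words):
    """Sum the weights of the distinct words in a context; above 0 leans 'synonym'."""
    return sum(weights.get(w, 0.0) for w in set(words))


# Hypothetical training contexts (the words found between two names A and B).
training = [
    ("also known as".split(), True),
    ("also called".split(), True),
    ("known as".split(), True),
    ("regulates".split(), False),
    ("binds to".split(), False),
    ("is regulated by".split(), False),
]
weights = train_context_weights(training)
print(score(weights, "also known as".split()))
print(score(weights, "regulates".split()))
```
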
Why Combined Approach?
  • Snowball and SVM are machine-learning based
    • They capture synonyms that may be missed by GPE
  • GPE is knowledge-based
    • It avoids many of the false positives that Snowball and SVM produce
  • Combine both advantages
Summary
  • Text mining
  • Semantic neighbor
  • Neighbor divergence
  • Precision and Recall
  • Synonym detection Approaches
  • Comments / Questions?