Mining Medical Literature



Vignesh Ganapathy

(CS 374 : Algorithms in Biology)

(FALL 2005)

Outline
  • Introduction and Background
  • Mining Technique 1: Identifying Functionally Coherent Gene Groups
  • Mining Technique 2: Extracting Synonymous Gene and Protein Terms
  • Conclusions
Introduction
  • The medical literature holds vast amounts of knowledge and information
    • PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature
    • Amedeo.com (The Medical Literature Guide)
    • Journals like Science, Nature, Cell, EMBO, Cell Biology, PNAS
    • (and many more)
The Problem
  • The major task is finding ways to extract useful information from these resources.
What is Data Mining?

“Data Mining is the Process of discovering meaningful, new correlation patterns and trends by sifting through large amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”

Example Data!
  • Large amounts of data but no information
    • Daily transactions at a supermarket
    • Daily website visit histories
    • Books/videos rented at a Library
    • Newspaper, Journal archives
Google News
  • Clustering News items (Google News)
More Applications
  • Improving sales strategy
    • Finding items that sell together

(A common example is beer and diapers: a supermarket found that 50% of the time beer was purchased together with diapers.)

  • Anomaly detection and many more…
Information Retrieval (IR)
  • Collecting information from text data (Unstructured Data)
  • Applications
    • Search web documents
    • Natural Language Processing
    • Term also extends to include multimedia or other forms of unstructured data
IR System Evaluation
  • Some measures are
    • Precision
    • Recall
    • F1 measure – combined measure: the harmonic mean of precision and recall (the general F measure is a weighted harmonic mean)
    • Sensitivity
    • Specificity
Precision and Recall

How are Precision and Recall related?
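The formulas on this slide did not survive in the transcript. As a minimal sketch using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant), the two measures and their weighted harmonic mean could be computed as follows. The function name and the document IDs are illustrative only.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for one query.

    retrieved: IDs returned by the system; relevant: IDs judged relevant.
    beta=1.0 gives the F1 measure (plain harmonic mean).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical query: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Retrieving more documents usually raises recall and lowers precision, which is why a combined measure is reported alongside the two.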

Problems with Precision and Recall
  • Deciding which documents are relevant and which are not is not easy
  • For recall, it is difficult to measure the number of relevant documents in the database
    • Creating a pool of relevant records is one solution
  • In practice, these are still good measures
Sensitivity and Specificity
  • Sensitivity – Probability that a positive example is classified as positive
  • Specificity – Probability that a negative example is classified as negative

What is the relation between Sensitivity, Specificity, Precision and Recall?
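One way to answer the question is to write all four measures in terms of the same confusion-matrix counts. The sketch below uses the standard definitions; the counts in the usage line are made up for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the four measures from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # positives correctly identified
        "specificity": tn / (tn + fp),  # negatives correctly identified
        "precision": tp / (tp + fp),    # predicted positives that are correct
        "recall": tp / (tp + fn),       # same formula as sensitivity
    }


# Hypothetical counts for a retrieval run over 1000 documents.
m = confusion_metrics(tp=40, fp=10, fn=20, tn=930)
```

Recall and sensitivity are the same quantity. Precision has no counterpart among the two: it depends on how many negatives exist, while specificity does not use the true positives at all.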

Mining Technique 1: Identifying Functionally Coherent Gene Groups
Introduction
  • Analysis is shifting from single genes to families of genes
  • Sources of such gene groups include:
    • Sequence data
    • Gene expression clustering
    • Deletion phenotypes
    • Yeast two-hybrid screens
HOVERGEN: a Database of Homologous Vertebrate Genes

Useful for comparative sequence analysis, or molecular evolution studies

10 biggest gene families

Why identify functional gene groups?
  • With large sets of gene groups, it is useful to know which groups are functionally relevant
  • Helps to assess the significance of experimentally derived gene sets
  • Gene groups can be refined to find more functionally relevant groups
  • Existing algorithms can make use of this information in finding gene groups
Existing Approaches
  • Use the co-occurrence of gene names in abstracts to create networks of related genes automatically
  • Use an existing vocabulary of gene functions and assigned genes to decide whether a group is functionally relevant

(Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS))
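The co-occurrence idea in the first bullet can be sketched in a few lines: two genes are linked whenever their names appear in the same abstract, and the edge weight counts how often that happens. This is only an illustration under simplifying assumptions (whitespace tokenization, exact name matching, a hypothetical `min_count` threshold), not the method of any particular published system.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(abstracts, gene_names, min_count=1):
    """Count how often each pair of gene names shares an abstract."""
    names = {g.lower(): g for g in gene_names}
    counts = Counter()
    for text in abstracts:
        # Crude tokenization: strip commas and periods, split on whitespace.
        tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
        present = sorted(names[t] for t in tokens if t in names)
        counts.update(combinations(present, 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Made-up abstracts; the gene names come from the synonym examples in this deck.
edges = cooccurrence_edges(
    ["TIF2 binds GRIP-1 in vitro.", "GRIP-1 and TIF2 interact.", "Pdcd6 alone."],
    ["TIF2", "GRIP-1", "Pdcd6"],
)
```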

Statistical NLP approach
  • Used for annotating individual genes
  • Determining gene and protein interactions
  • Assigning keywords to genes or group of genes
Neighbor Divergence Approach
  • Statistical NLP technique
  • Will always be up to date if provided with a current literature base
  • Cannot specify what the actual function is!
Challenges in the Problem
  • Large number of genes
  • Genes have multiple functions
  • Some genes have been extensively studied, others recently discovered

So the literature about genes reflects these differences

Neighbor Divergence Algorithm
  • Representation Of Articles
  • Identifying Semantic Neighbors for Corpus Articles
  • Scoring Articles Relative to Gene Group
  • Calculating a Theoretical distribution of Scores
  • Calculating the Difference between empirical and theoretical distribution
ND- Article Representation

Words in articles are weighted by term frequency and inverse document frequency (to reduce the impact of common words):

$$W_{i,j} = \begin{cases} (1 + \log_2 tf_{i,j}) \, \log_2 (N / df_i) & \text{if } tf_{i,j} > 0 \\ 0 & \text{if } tf_{i,j} = 0 \end{cases}$$

where $W_{i,j}$ : the weighted count of word i in document j,

$tf_{i,j}$ : the number of times word i occurs in document j,

$df_i$ : the number of documents containing word i,

$N$ : the total number of documents
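A minimal sketch of this weighting, assuming each article is already tokenized into a list of words (tokenization and the function name are assumptions, not part of the slides):

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {word: weight} vector per tokenized document."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # df[w] = number of documents that contain word w at least once
    df = Counter(word for c in counts for word in c)
    vectors = []
    for c in counts:
        # Words with tf = 0 are simply absent, i.e. their weight is 0.
        vectors.append({
            w: (1 + math.log2(tf)) * math.log2(n_docs / df[w])
            for w, tf in c.items()
        })
    return vectors


vecs = tfidf_vectors([["gene", "gene", "kinase"], ["gene", "yeast"], ["yeast"]])
```

A word that occurs in every document gets weight 0, since log2(N/df) = 0; this is how the weighting suppresses common words.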

ND – Identifying Semantic Neighbors
  • For each article, the k most similar articles are precomputed (k = 20 was used)
  • The cosine similarity measure is used (the cosine of the angle between two weighted article vectors)
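The neighbor search can be sketched as a brute-force comparison of sparse weight vectors. The sketch assumes that an article is not counted as its own neighbor and breaks ties by article index; the slides specify neither, and a real corpus would need something faster than this quadratic loop.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_neighbors(vectors, k=20):
    """For each article index, list the indices of its k most similar articles."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))  # most similar first
        neighbors.append([j for _, j in scored[:k]])
    return neighbors
```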
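ND – Scoring articles
  • Given a gene group g, ND assigns a score $S_{i,g}$ to each article i
  • The score is a count of the article's semantic neighbors that refer to group genes
  • Fractional reference for each neighbor k: $fr_{k,g} = n_{k,g} / n_k$ ($n_k$: number of genes article k refers to; $n_{k,g}$: number of those genes that are in group g)
  • Score value: $S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{sem(i,j),g}\right)$, where sem(i,j) is the j-th semantic neighbor of article i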
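The scoring step above can be sketched as follows, assuming the neighbor lists are already available and that `article_genes[k]` holds the set of genes article k refers to (both names are illustrative). Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the rounding used in the original work.

```python
def article_scores(neighbors, article_genes, group):
    """Score each article by how strongly its neighbors refer to the gene group.

    neighbors[i]: indices of the semantic neighbors of article i
    article_genes[k]: set of genes that article k refers to
    group: the gene group being evaluated
    """
    group = set(group)

    def fractional_reference(k):
        genes = article_genes[k]
        return len(genes & group) / len(genes) if genes else 0.0

    return [
        round(sum(fractional_reference(k) for k in nbrs))
        for nbrs in neighbors
    ]


# Tiny made-up corpus of four articles, each with up to two neighbors.
scores = article_scores(
    neighbors=[[1, 2], [0, 2], [0, 1], [0]],
    article_genes=[{"A", "B"}, {"A"}, {"C"}, set()],
    group={"A", "B"},
)
```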
ND – Difference in Distributions
  • Calculating a theoretical Distribution of Scores
    • Use of Poisson Distribution to represent the non coherent functional structure

$$P(S = n) = \frac{\lambda^n}{n!} e^{-\lambda}$$

  • KL Divergence
    • If 2 distributions are same, divergence is zero
    • More disparate the distributions, larger the divergence
      • $D(g \| h) = \sum_i g_i \log (g_i / h_i)$
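The comparison of the empirical score distribution g with the Poisson reference h can be sketched as below. How λ is estimated is not stated in the slides, so it is left as a parameter; the natural logarithm is also an assumption (a different base only rescales the result).

```python
import math


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with mean lam."""
    return lam ** n * math.exp(-lam) / math.factorial(n)


def empirical_distribution(scores, max_score):
    """Fraction of articles having each score 0..max_score."""
    total = len(scores)
    return [scores.count(s) / total for s in range(max_score + 1)]


def kl_divergence(g, h):
    """D(g || h) = sum_i g_i * log(g_i / h_i); terms with g_i = 0 contribute 0."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def neighbor_divergence(scores, lam, max_score=20):
    """Divergence of the observed article scores from the Poisson reference."""
    g = empirical_distribution(scores, max_score)
    h = [poisson_pmf(n, lam) for n in range(max_score + 1)]
    return kl_divergence(g, h)
```

The Poisson probabilities are truncated at `max_score` here and therefore sum to slightly less than 1; for a small λ the missing mass is negligible.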
Other methods
  • Word Divergence
  • Best Article Score
    • Highest article score is used as a measure of the gene group’s functional coherence
  • Best p-Value
    • Summed probability, under the theoretical distribution, of an article having at least as many group-referring neighbors as it actually has
  • Neighbor Divergence (no filter)
    • The filter that is dropped here: when calculating semantic neighbors, only articles that refer to different genes are considered.
Mining Technique 2: Extracting Synonymous Gene and Protein Terms
Introduction
  • Genes and proteins are associated with multiple names
    • LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP, WSL-1, WSL-LR, Tnfrsf12
    • PS2, Alg2, MA-3, alg-2, Pdcd6
    • GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2

(http://bioinformatics.org/textknowledge/synonym.php)

Advantage
  • An automated method will keep the database up to date
  • Extracting synonyms will help
    • Information retrieval and extraction
    • Human curators of biological resources
Existing approaches
  • Detecting semantically related words
    • “beer” and “wine” are related terms
  • Use of WORDNET (a large lexical database of English words) to evaluate semantic similarity
  • Most synonym identification methods do not consider the surrounding context of words
Information Extraction and Machine Learning
  • Information extraction requires a large amount of manual labor to construct and tune extraction systems
  • Machine learning techniques help to reduce the manual labor by automatically acquiring rules from labeled and unlabeled data
ML techniques
  • Supervised Learning
    • Labeled Training Data available
  • Semi supervised Learning
    • Small number of labeled training data
  • Unsupervised Learning
    • Data with no labeling
  • Reinforcement Learning
    • Learn a mapping from situations to actions by trial-and-error interaction
Approach Used Here
  • Obtain tagged genes and proteins in text using existing gene taggers
  • Four approaches used
    • Unsupervised Learning
    • Partially Supervised Learning
    • Supervised Learning
    • Hand Crafted System
  • Use of a final COMBINED system
Unsupervised Learning – Contextual Similarity
  • Finds set of words that appear in similar context using mutual information between the words
  • Mutual Information
  • Similarity Measure:
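The formulas for these two bullets are not preserved in the transcript. As a generic illustration only, and not necessarily the exact measure used in the presented system, pointwise mutual information between a term and each word in its context can be computed from co-occurrence counts; terms whose resulting vectors look alike appear in similar contexts. The function name and the sample pairs are hypothetical.

```python
import math
from collections import Counter


def pmi_vectors(pairs):
    """Build {term: {context_word: PMI}} from (term, context_word) observations.

    PMI(t, c) = log2( P(t, c) / (P(t) * P(c)) ), estimated from raw counts.
    """
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    term_counts = Counter(t for t, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    vectors = {}
    for (t, c), n in pair_counts.items():
        pmi = math.log2(n * total / (term_counts[t] * context_counts[c]))
        vectors.setdefault(t, {})[c] = pmi
    return vectors


# Made-up observations of gene terms next to context words.
vectors = pmi_vectors([
    ("TIF2", "coactivator"),
    ("GRIP-1", "coactivator"),
    ("Pdcd6", "apoptosis"),
    ("Pdcd6", "calcium"),
])
```

In this toy input TIF2 and GRIP-1 receive identical vectors, so any vector similarity (cosine, for instance) would rank them as the closest pair.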
Contextual Similarity
  • Computing the similarity for all terms takes $O(|lexicon|^3)$ time, so a heuristic search is used
  • Many false positives are returned, so it is useful to incorporate some domain knowledge
Snowball
  • Confidence of a pattern
  • Calculates confidence of extracted tuples and discards low confidence tuples
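The confidence formulas are missing from the transcript. The sketch below follows the formulation in the original Snowball system as best recalled: a pattern's confidence is the fraction of its matches that agree with known tuples, and a tuple's confidence is one minus the probability that every pattern supporting it is wrong. Treat the exact form, the function names, and the `threshold` value as assumptions.

```python
def pattern_confidence(positive, negative):
    """Fraction of a pattern's matches that agree with already-known tuples."""
    total = positive + negative
    return positive / total if total else 0.0


def tuple_confidence(supports):
    """supports: list of (pattern_confidence, match_degree) pairs for one tuple."""
    all_wrong = 1.0
    for conf, match in supports:
        all_wrong *= 1.0 - conf * match
    return 1.0 - all_wrong


def filter_tuples(candidates, threshold=0.8):
    """Keep only candidate tuples whose confidence reaches the threshold."""
    kept = {}
    for tup, supports in candidates.items():
        conf = tuple_confidence(supports)
        if conf >= threshold:
            kept[tup] = conf
    return kept
```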
Supervised Learning – Text classification
  • The user provides positive and negative example pairs of genes and proteins
  • An SVM is trained on this data (SVMLight with a radial basis kernel function)
  • Classifies pairs of identified genes and proteins using a confidence score Conf(s) (the score assigned by the classifier)
  • Does not combine evidence from multiple occurrences of the same gene or protein pair
Hand Crafted Extraction System – GPE system
  • The most labor-intensive approach, but one with high-quality results
  • Starts with a set of known pairs of synonyms
  • Manual examination to find patterns of occurrences
    • Use of “known as” or “also called”
  • Scans for more synonyms and uses heuristics and filters to ignore non gene/protein terms
  • A confidence value of 1 is assigned to every returned result
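The pattern-based idea can be illustrated with a single regular expression built around the two cue phrases mentioned above. This is a toy sketch, not the GPE system itself: the term pattern is a rough assumption about what a gene or protein name looks like, and no filtering heuristics are included.

```python
import re

# Cue phrases taken from the slide: "known as" (optionally "also known as")
# and "also called".
CUE = r"(?:(?:also\s+)?known\s+as|also\s+called)"
# Rough, assumed shape of a gene/protein term: letters, digits, hyphens.
TERM = r"([A-Za-z0-9][\w-]*)"
SYNONYM_PATTERN = re.compile(TERM + r"[\s,(]+" + CUE + r"\s+" + TERM)


def extract_synonym_pairs(text):
    """Return (term, synonym, confidence) triples; confidence is always 1.0."""
    return [(a, b, 1.0) for a, b in SYNONYM_PATTERN.findall(text)]


pairs = extract_synonym_pairs(
    "GRIP-1 (also known as TIF2) is a coactivator. "
    "Pdcd6, also called Alg2, binds calcium."
)
```

Without a filter, a sentence such as "this protein is known as X" would yield the pair ("is", "X"), which is the kind of non-gene term the heuristics on the slide are meant to discard.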
Combined System
  • Exploits the advantages of both knowledge-based and machine-learning-based systems
  • $Conf_E(s)$ represents the confidence score assigned to s by system E

(The combined confidence is 1 – the probability that all systems extracted s incorrectly)
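Read literally, the parenthetical gives a simple combination rule if each system's confidence is treated as the probability that its extraction is correct and the systems are assumed to err independently. Both are assumptions of this sketch; the slide's own formula is not preserved.

```python
def combined_confidence(confidences):
    """1 - product over systems E of (1 - Conf_E(s)).

    confidences: the scores given to one synonym pair s by the systems that
    extracted it; a system that did not extract s contributes nothing.
    """
    all_wrong = 1.0
    for conf in confidences:
        all_wrong *= 1.0 - conf
    return 1.0 - all_wrong
```

For example, two systems that each assign 0.5 combine to 0.75, and any pair returned by the hand-crafted system (confidence 1) ends up with a combined confidence of 1.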

Conclusion and Future Work
  • There is a lot of interest in using knowledge from medical literature to guide bioinformatics algorithms
  • Functional Gene Groups:
    • Can be used to connect data analysis algorithms to scientific literature
    • ND may be used to define new functional groups, annotate genes and organize genes in a functional hierarchy
    • Future work: use of full-text articles instead of only abstracts
  • Synonym Extraction:
    • Extracted synonyms could be used as a valuable supplement to the SWISSPROT database
    • Techniques could use the existing systems to find other biological relations between genes and proteins, small molecules, drugs and diseases.