
Mining Medical Literature

Vignesh Ganapathy

(CS 374 : Algorithms in Biology)

(FALL 2005)

Outline

  • Introduction and Background

  • Mining Technique 1:

    Identifying Functionally Coherent Gene Groups

  • Mining Technique 2:

    Extracting Synonymous gene and protein terms

  • Conclusions

Introduction

  • Medical Literature has vast amounts of knowledge and information

    • PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature

    • (The Medical Literature Guide)

    • Journals such as Science, Nature, Cell, EMBO, Cell Biology, PNAS

    • (and many more)

The Problem

  • The major task is finding ways to extract useful information from these resources.

What is Data Mining?

“Data Mining is the Process of discovering meaningful, new correlation patterns and trends by sifting through large amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”

Example Data!

  • Large amounts of data but no information

    • Daily transactions at a supermarket

    • Daily website visit histories

    • Books/videos rented at a Library

    • Newspaper, Journal archives

Google News

  • Clustering News items (Google News)

More Applications

  • Improving Sales strategy

    • Finding items that sell together

      (a commonly cited example relates beer and diapers: a supermarket reportedly found that 50% of the time beer was purchased together with diapers)

  • Anomaly Detection and many more…

Information Retrieval (IR)

  • Collecting information from text data (Unstructured Data)

  • Applications

    • Search web documents

    • Natural Language Processing

    • Term also extends to include multimedia or other forms of unstructured data

IR System Evaluation

  • Some measures are

    • Precision

    • Recall

    • F1 measure – combined measure: the harmonic mean of precision and recall (the general F-measure is a weighted harmonic mean)

    • Sensitivity

    • Specificity

Precision and Recall

How are Precision and Recall related?
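
As a minimal sketch of how the two measures relate, the following computes precision, recall and F1 for one query from sets of document ids. The function name and the example ids are hypothetical; the definitions are the standard ones (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant).

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for a single query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant
    """
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the (unweighted) harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 documents retrieved, 6 relevant, 2 in common
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
print(p, r, f1)
```

Retrieving more documents can only keep recall the same or raise it, but usually lowers precision, which is why the two are reported together or combined into F1.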

Problems with Precision and Recall

  • Deciding which documents are relevant and which are not is not easy

  • For recall, it is difficult to measure the number of relevant documents in the database

    • Creating pool of relevant records is one solution

  • In practice, these are still good measures

Sensitivity and Specificity

  • Sensitivity – probability that a positive example is classified as positive

  • Specificity – probability that a negative example is classified as negative

    What is the relation between Sensitivity, Specificity, Precision and Recall?
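
One way to explore the question is to compute all of the measures from the same confusion-matrix counts. The sketch below uses the standard definitions, and the counts are made up for illustration. Sensitivity is the same quantity as recall, while precision also depends on how common positives are.

```python
def confusion_rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion-matrix counts.

    Assumes each denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision


# Two hypothetical test sets with the same sensitivity and specificity (0.9 each)
balanced = confusion_rates(tp=90, fp=10, tn=90, fn=10)  # half the examples are positive
rare = confusion_rates(tp=9, fp=99, tn=891, fn=1)       # 1% of the examples are positive
print(balanced)
print(rare)
```

With identical sensitivity and specificity, precision drops sharply when positives are rare, so sensitivity and specificity alone do not determine precision.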

Outline

  • Introduction and Background

  • Mining Technique 1:

    Identifying Functionally Coherent Gene Groups

  • Mining Technique 2:

    Extracting Synonymous gene and protein terms

  • Conclusion

Introduction

  • Analysis is shifting from single genes to families of genes

  • Examples of data that yield gene groups:

    • Sequence Data

    • Gene Expression Clustering

    • Deletion Phenotypes

    • Yeast-2-Hybrid screens

HOVERGEN: a Database of Homologous Vertebrate Genes

Useful for comparative sequence analysis, or molecular evolution studies

10 biggest gene families

Why identify functional gene groups?

  • Interesting to know functionally relevant groups for large gene group sets

  • Helps to assess the significance of experimentally derived gene sets

  • Refine gene groups to find more functionally relevant groups

  • Existing algorithms can make use of this information in finding gene groups

Existing Approaches

  • Use of co-occurrence of gene names in abstracts to create networks of related genes automatically

  • Use of an existing vocabulary of gene functions and assigned genes to decide whether a group is functionally relevant

    (Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS))

Statistical NLP approach

  • Used for annotating individual genes

  • Determining gene and protein interactions

  • Assigning keywords to genes or group of genes

Neighbor Divergence Approach

  • Statistical NLP technique

  • Will always be up to date if provided with a current literature base

  • Cannot specify what the actual function is!

Challenges in the Problem

  • Large number of genes

  • Genes have multiple functions

  • Some genes have been extensively studied, others recently discovered

    So the literature about genes reflects these differences

Neighbor Divergence Algorithm

  • Representation Of Articles

  • Identifying Semantic Neighbors for Corpus Articles

  • Scoring Articles Relative to Gene Group

  • Calculating a Theoretical distribution of Scores

  • Calculating the Difference between empirical and theoretical distribution

ND- Article Representation

Words in articles are weighted by term frequency and inverse document frequency (to reduce the impact of common words):

$$
W_{i,j} =
\begin{cases}
\left(1 + \log_2 tf_{i,j}\right)\,\log_2\left(N / df_i\right) & \text{if } tf_{i,j} > 0 \\
0 & \text{if } tf_{i,j} = 0
\end{cases}
$$

where $W_{i,j}$ : the weighted count of word i in document j,

$tf_{i,j}$ : the number of times word i occurs in document j

$df_i$ : the number of documents containing word i

$N$ : the total number of documents
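
The weighting can be sketched in a few lines. This assumes the formula is read as (1 + log2 tf) · log2(N / df), the usual log-scaled tf-idf form. The function name and example numbers are hypothetical.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Weight of a word in a document.

    tf:     number of times the word occurs in the document
    df:     number of documents containing the word
    n_docs: total number of documents in the corpus
    """
    if tf == 0:
        return 0.0
    # Log-scaled term frequency times inverse document frequency
    return (1 + math.log2(tf)) * math.log2(n_docs / df)


# A word occurring 4 times, found in 250 of 1000 documents
print(tfidf_weight(4, 250, 1000))    # (1 + 2) * 2
# A word that appears in every document gets weight 0
print(tfidf_weight(4, 1000, 1000))
```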

ND – Identifying Semantic Neighbors

  • For each article, the k most similar articles are precomputed (k = 20 was used)

  • Cosine similarity is used as the measure (the cosine of the angle between two weighted article vectors)
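
A brute-force sketch of this step follows, assuming each article is stored as a sparse `{word: weight}` dictionary. The names `cosine`, `semantic_neighbors` and the toy vectors are hypothetical, and a real corpus would need an index rather than an all-pairs scan.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(article_id, vectors, k=20):
    """Return the ids of the k articles most similar to article_id."""
    target = vectors[article_id]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != article_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


# Toy weighted article vectors (made-up weights)
vectors = {
    "a": {"kinase": 2.0, "yeast": 1.0},
    "b": {"kinase": 1.0, "yeast": 0.5},
    "c": {"membrane": 3.0},
    "d": {"kinase": 1.0, "membrane": 1.0},
}
print(semantic_neighbors("a", vectors, k=2))
```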

ND – Scoring articles

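  • Given a gene group g, ND assigns a score $S_{i,g}$ to each article i

  • The score is a count of the article's semantic neighbors that refer to group genes

  • Fractional reference for each neighbor k: $fr_{k,g} = n_{k,g} / n_k$, where $n_k$ is the number of genes article k refers to and $n_{k,g}$ is the number of those genes that are in group g

  • Score value: $S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{sem_{i,j},\,g}\right)$, where $sem_{i,j}$ is the j-th semantic neighbor of article i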
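
The scoring step can be sketched as follows. It assumes n_k is the number of genes article k refers to and n_k,g the number of those in the group. The function names and toy data are hypothetical. Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding used in the original method.

```python
def fractional_reference(article_genes, group):
    """fr_{k,g}: fraction of the genes referenced by an article that are in the group."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(neighbors, genes_by_article, group):
    """S_{i,g}: rounded sum of fractional references over an article's semantic neighbors."""
    total = sum(
        fractional_reference(genes_by_article.get(k, set()), group)
        for k in neighbors
    )
    return round(total)


# Toy data: three neighbors instead of 20, with made-up gene references
group = {"A", "B", "C"}
genes_by_article = {
    "n1": {"A", "B"},            # fr = 1.0
    "n2": {"A", "X", "Z", "W"},  # fr = 0.25
    "n3": {"Y"},                 # fr = 0.0
}
print(article_score(["n1", "n2", "n3"], genes_by_article, group))
```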

ND – Difference in Distributions

  • Calculating a theoretical distribution of scores

    • Use of a Poisson distribution to represent the non-coherent functional structure

      $P(S = n) = \frac{\lambda^{n}}{n!} e^{-\lambda}$

  • KL divergence

    • If the two distributions are the same, the divergence is zero

    • The more disparate the distributions, the larger the divergence

      • $D_{gh} = \sum_i g_i \log \frac{g_i}{h_i}$
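
A minimal sketch of comparing the empirical score distribution with a Poisson distribution is shown below. Two choices are assumptions, not taken from the slides: λ is estimated as the mean of the observed scores, and the natural logarithm is used. Function names and the example scores are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with mean lam."""
    return lam ** n / math.factorial(n) * math.exp(-lam)


def kl_divergence(g, h):
    """Sum of g_i * log(g_i / h_i); terms with g_i == 0 contribute nothing."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def neighbor_divergence(scores):
    """Divergence between the empirical score distribution and a Poisson fit."""
    counts = Counter(scores)
    n = len(scores)
    lam = sum(scores) / n  # assumption: lambda estimated as the mean score
    support = range(max(scores) + 1)
    empirical = [counts[s] / n for s in support]
    theoretical = [poisson_pmf(s, lam) for s in support]
    return kl_divergence(empirical, theoretical)


# Made-up article scores: mostly 0, with a few high-scoring articles
print(neighbor_divergence([0, 0, 1, 0, 5, 6, 0, 1]))
```

A group whose scores look like random Poisson noise gives a divergence near zero, while a few unexpectedly high-scoring articles push it up.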

Other methods

  • Word Divergence

  • Best Article Score

    • Highest article score is used as a measure of the gene group’s functional coherence

  • Best p-Value

    • Summed probability of an article having at least as many neighbors as it actually has

  • Neighbor Divergence – No Filter

    • The filter otherwise used is: when calculating semantic neighbors, only articles that refer to different genes are considered.

Outline

  • Introduction and Background

  • Mining Technique 1:

    Identifying Functionally Coherent Gene Groups

  • Mining Technique 2:

    Extracting Synonymous gene and protein terms

  • Conclusion

Introduction

  • Genes and proteins are associated with multiple names

    • LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP, WSL-1, WSL-LR, Tnfrsf12

    • PS2, Alg2, MA-3, alg-2, Pdcd6

    • GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2

Advantage

  • An automated method will keep the database up to date

  • Extracting synonyms will help

    • Information retrieval and extraction

    • Human curators of biological resources

Existing approaches

  • Detecting semantically related words

    • “beer” and “wine” are related terms

  • Use of WORDNET (a large lexical database of English words) to evaluate semantic similarity

  • Most synonym identification methods do not consider the surrounding context of words

Information Extraction and Machine Learning

  • Information extraction requires a large amount of manual labor to construct and tune extraction systems

  • Machine learning techniques help to reduce the manual labor by automatically acquiring rules from labeled and unlabeled data

ML techniques

  • Supervised Learning

    • Labeled Training Data available

  • Semi supervised Learning

    • Small number of labeled training data

  • Unsupervised Learning

    • Data with no labeling

  • Reinforcement Learning

    • Learn a mapping from situations to actions by trial-and-error interactions

Approach Used here

  • Obtain tagged genes and proteins in text using existing gene taggers

  • Four approaches used

    • Unsupervised Learning

    • Partially Supervised Learning

    • Supervised Learning

    • Hand-Crafted System

  • Use of a final COMBINED system

Unsupervised Learning – Contextual Similarity

  • Finds set of words that appear in similar context using mutual information between the words

  • Mutual Information

  • Similarity Measure:

Contextual Similarity

  • Computing this for all terms takes $O(|\mathrm{lexicon}|^3)$ time, so a heuristic search is used

  • Lots of false positives returned, so useful to incorporate some domain knowledge
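
As a rough illustration of the mutual-information ingredient only, the sketch below computes pointwise mutual information between terms and the context words they occur with. It uses the standard definition log2(p(t, c) / (p(t) p(c))), which is an assumption about what the method uses. The function name and the co-occurrence data are hypothetical.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Pointwise mutual information for each observed (term, context_word) pair.

    pairs: list of (term, context_word) co-occurrences, one entry per occurrence
    """
    pair_counts = Counter(pairs)
    term_counts = Counter(t for t, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    # log2( p(t, c) / (p(t) * p(c)) ) simplifies to log2( n * total / (n_t * n_c) )
    return {
        (t, c): math.log2(n * total / (term_counts[t] * ctx_counts[c]))
        for (t, c), n in pair_counts.items()
    }


# Made-up co-occurrence data
pairs = [
    ("DR3", "receptor"), ("DR3", "receptor"),
    ("Pdcd6", "calcium"), ("Pdcd6", "calcium"),
]
print(pmi_table(pairs))
```

Terms whose high-PMI context words overlap would then be candidates for similarity, subject to the domain-knowledge filtering mentioned above.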

Snowball

  • Confidence of a pattern

  • Calculates confidence of extracted tuples and discards low confidence tuples

Supervised Learning – Text classification

  • User provided positive and negative example gene and protein pairs

  • Use SVM to train using this data (radial basis kernel function of SVMLight)

  • Classifies pairs of identified genes and proteins using a confidence score Conf(s) (the score assigned by the classifier)

  • Does not combine evidence from multiple occurrences of same gene or protein pair

Hand-Crafted Extraction System – GPE System

  • The most labor-intensive approach, but it gives high-quality results

  • Starts with set of known pairs of synonyms

  • Manual examination to find patterns of occurrences

    • Use of “known as” or “also called”

  • Scans for more synonyms and uses heuristics and filters to ignore non gene/protein terms

  • Confidence value of 1 assigned to every returned result
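
To illustrate the kind of pattern involved, here is a small sketch that scans a sentence for "known as" and "also called" constructions. The regular expression, the function name and the example sentences are hypothetical and are not the GPE system's actual rules. A real system would add the heuristics and filters mentioned above to reject non gene/protein terms.

```python
import re

# A crude stand-in for a gene/protein token: letters, digits and hyphens
TERM = r"([A-Za-z0-9][A-Za-z0-9-]*)"

# Matches e.g. "X (also known as Y)", "X, also called Y", "X known as Y"
PATTERN = re.compile(
    r"\b" + TERM + r"\s*(?:,\s*)?\(?(?:also\s+)?(?:known\s+as|called)\s+" + TERM
)


def extract_synonym_pairs(sentence):
    """Return (term, synonym) pairs matched by the hand-written pattern."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(sentence)]


print(extract_synonym_pairs("DR3 (also known as TRAMP) is a death receptor."))
print(extract_synonym_pairs("Pdcd6, also called Alg2, binds calcium."))
```

Each pair returned this way would be given a confidence value of 1.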

Combined System

  • Exploits advantages of knowledge based and machine learning based systems

  • ConfE(s) is the confidence score assigned to synonym pair s by system E

    (the combined confidence is 1 – the probability that all systems extracted s incorrectly)
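
If each system's confidence is treated as the probability that its extraction is correct, and the systems are assumed to err independently, the combination can be sketched as below. Both assumptions are for illustration only. The function name and example scores are hypothetical.

```python
def combined_confidence(confidences):
    """1 minus the probability that every system extracted the pair incorrectly.

    confidences: the confidence scores of the systems that extracted the pair
    """
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= 1.0 - c  # probability that this system is also wrong
    return 1.0 - p_all_wrong


# Two systems, each only 50% confident, give 75% combined confidence
print(combined_confidence([0.5, 0.5]))
# Any system reporting confidence 1 (like the hand-crafted system) forces the result to 1
print(combined_confidence([1.0, 0.2]))
```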

Outline

  • Introduction and Background

  • Mining Technique 1:

    Identifying Functionally Coherent Gene Groups

  • Mining Technique 2:

    Extracting Synonymous gene and protein terms

  • Conclusion

Conclusion and Future Work

  • There is a lot of interest in using knowledge from medical literature to guide bioinformatics algorithms

  • Functional Gene Groups:

    • Can be used to connect data analysis algorithms to scientific literature

    • ND may be used to define new functional groups, annotate genes, and organize genes in a functional hierarchy

    • Use of full text articles instead of only abstracts

  • Synonym Extraction:

    • Extracted synonyms could be used as a valuable supplement to the SWISSPROT database

    • Techniques could use the existing systems to find other biological relations between genes and proteins, small molecules, drugs and diseases.