Mining Medical Literature

Mining Medical Literature Vignesh Ganapathy (CS 374 : Algorithms in Biology) (FALL 2005)

Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusions

Introduction • Medical literature holds vast amounts of knowledge and information • PubMed Central (PMC) (the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature) • Amedeo.com (The Medical Literature Guide) • Journals like Science, Nature, Cell, EMBO, Cell Biology, PNAS • (and many more...)

The Problem • Major task is finding out ways to extract useful information from these resources.

What is Data Mining? “Data Mining is the Process of discovering meaningful, new correlation patterns and trends by sifting through large amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”

Example Data! • Large amounts of data but no information • Daily transactions at a supermarket • Daily website visit histories • Books/videos rented at a Library • Newspaper, Journal archives

Amazon.com

Google News • Clustering News items (Google News)

More Applications • Improving sales strategy • Finding items that sell together (a common example is beer and diapers: a supermarket found that 50% of the time beer was purchased together with diapers) • Anomaly detection and many more…

Information Retrieval (IR) • Collecting information from text data (Unstructured Data) • Applications • Search web documents • Natural Language Processing • Term also extends to include multimedia or other forms of unstructured data

Simple flow of Retrieval Process

IR System Evaluation • Some measures are • Precision • Recall • F1 measure – Combined measure: the harmonic mean of precision and recall (the general F-measure is a weighted harmonic mean) • Sensitivity • Specificity

Precision and Recall How are Precision and Recall related?

Problems with Precision and Recall • Deciding which documents are relevant and which are not is not easy • For recall, it is difficult to measure the number of relevant documents in the database • Creating a pool of relevant records is one solution • In practice, these are still good measures

Sensitivity and Specificity • Sensitivity – Probability that a positive example is classified as positive • Specificity – Probability that a negative example is classified as negative What is the relation between Sensitivity, Specificity, Precision and Recall?
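The measures on the evaluation slides can all be computed from the four counts of a confusion matrix. The sketch below is only an illustration: the function name `ir_metrics` and the parameter names `tp`, `fp`, `fn`, `tn` (true/false positives and negatives) are assumptions, not part of the slides. It also shows one relation the slide asks about: sensitivity and recall are the same quantity.

```python
def ir_metrics(tp, fp, fn, tn):
    """Compute retrieval measures from confusion-matrix counts.

    tp: relevant documents retrieved, fp: non-relevant documents retrieved,
    fn: relevant documents missed, tn: non-relevant documents not retrieved.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the (unweighted) harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,  # sensitivity is another name for recall
        "specificity": specificity,
        "f1": f1,
    }


# Toy counts: 10 documents retrieved, 8 of them relevant; 16 relevant overall.
m = ir_metrics(tp=8, fp=2, fn=8, tn=82)
```

Precision has no counterpart among sensitivity and specificity, because it depends on how many positives and negatives the collection contains.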

Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusion

Introduction • Analysis shifting from single gene to family of genes • Examples of these are: • Sequence Data • Gene Expression Clustering • Deletion Phenotypes • Yeast-2-Hybrid screens

HOVERGEN: a Database of Homologous Vertebrate Genes Useful for comparative sequence analysis, or molecular evolution studies 10 biggest gene families

Why identify functional gene groups? • Interesting to know functionally relevant groups for large gene group sets • Helps to assess the significance of experimentally derived gene sets • Refine gene groups to find more functionally relevant groups • Existing algorithms can make use of this information in finding gene groups

Existing Approaches • Use of co-occurrence of gene names in abstracts to create networks of related genes automatically • Use of an existing vocabulary of gene functions and assigned genes to decide whether a group is functionally relevant (Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS))

Statistical NLP approach • Used for annotating individual genes • Determining gene and protein interactions • Assigning keywords to genes or group of genes

Neighbor Divergence Approach • Statistical NLP technique • Will always be up to date if provided with a current literature base • Cannot specify what the actual function is!

Challenges in the Problem • Large number of genes • Genes have multiple functions • Some genes have been extensively studied, others only recently discovered, so the literature about genes reflects these differences

Neighbor Divergence Intuition

Neighbor Divergence Algorithm • Representation Of Articles • Identifying Semantic Neighbors for Corpus Articles • Scoring Articles Relative to Gene Group • Calculating a Theoretical distribution of Scores • Calculating the Difference between empirical and theoretical distribution

ND – Article Representation Words in articles are weighted by term frequency and inverse document frequency (to reduce the impact of common words) W_{i,j} = (1 + log2(tf_{i,j})) · log2(N / df_i) if tf_{i,j} > 0 W_{i,j} = 0 if tf_{i,j} = 0 where W_{i,j} : weighted count of word i in document j, tf_{i,j} : the number of times word i occurs in document j, df_i : the number of documents containing word i, N : the total number of documents
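A minimal sketch of this weighting, assuming the weight is read as (1 + log2 tf) · log2(N / df) and that articles arrive already tokenized. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. Words absent from a document are simply
    left out of its dict, which corresponds to a weight of 0.
    """
    n_docs = len(docs)
    # df[w]: number of documents that contain word w at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            w: (1 + math.log2(count)) * math.log2(n_docs / df[w])
            for w, count in tf.items()
        })
    return weighted


docs = [
    ["kinase", "kinase", "gene"],
    ["gene", "protein"],
    ["gene", "yeast"],
]
weights = tfidf_weights(docs)
```

In the toy corpus, "gene" occurs in every document, so log2(N / df) is 0 and its weight vanishes, which is how the weighting suppresses common words.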

ND – Identifying Semantic Neighbors • For each article, the k most similar articles are precomputed (k = 20 was used) • Cosine similarity is the measure used (cosine of the angle between two weighted article vectors)
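The neighbor search can be sketched as a brute-force comparison of sparse article vectors. This is an illustration only: the function names are hypothetical, excluding an article from its own neighbor list is an assumption, and a real corpus would need a faster search than the all-pairs loop used here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {word: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(vectors, k=20):
    """For each article index, list the indices of its k most similar articles."""
    neighbors = []
    for i, u in enumerate(vectors):
        # Assumption: an article is not counted as its own neighbor.
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by lower article index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


vectors = [
    {"kinase": 2.0, "yeast": 1.0},
    {"kinase": 1.0, "yeast": 0.5},
    {"ribosome": 3.0},
    {"yeast": 1.0, "ribosome": 1.0},
]
nbrs = semantic_neighbors(vectors, k=2)
```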

ND – Scoring articles • Given a gene group g, ND assigns a score S_{i,g} to each article i • The score is a count of semantic neighbors that refer to group genes • fr_{k,g} = n_{k,g} / n_k (fractional reference for each neighbor k) • S_{i,g} = round(Σ_{j=1..20} fr_{sem(i,j),g}) (score value, where sem(i,j) is the j-th semantic neighbor of article i)
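A small sketch of the scoring step, under one reading of the slide: n_k is taken to be the number of genes article k refers to, and n_{k,g} the number of those that belong to the group. The function names and the toy data are hypothetical.

```python
def fractional_reference(article_genes, group):
    """fr_{k,g}: share of the genes an article refers to that are in the group."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(i, neighbors, refs, group):
    """S_{i,g}: rounded sum of fractional references over article i's neighbors.

    neighbors maps an article id to the ids of its semantic neighbors;
    refs maps an article id to the set of genes that article refers to.
    Note: Python's round() sends exact .5 ties to the nearest even integer,
    which may differ from the rounding convention of the original method.
    """
    total = sum(fractional_reference(refs[k], group) for k in neighbors[i])
    return round(total)


refs = {
    0: {"A"},
    1: {"A", "B"},       # both genes in the group -> 1.0
    2: {"C"},            # no gene in the group    -> 0.0
    3: {"A", "B", "C"},  # two of three in group   -> 0.667
}
group = {"A", "B"}
neighbors = {0: [1, 2, 3]}
score = article_score(0, neighbors, refs, group)
```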

ND – Difference in Distributions • Calculating a theoretical distribution of scores • A Poisson distribution represents the scores expected from a group with no coherent functional structure: P(S = n) = (λ^n / n!) e^{−λ} • KL divergence • If the two distributions are the same, the divergence is zero • The more disparate the distributions, the larger the divergence • D(g‖h) = Σ_i g_i log(g_i / h_i)
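The comparison of the empirical score distribution with a Poisson distribution can be sketched as follows. Several choices here are assumptions rather than details from the slides: λ is estimated as the mean article score, the logarithm is the natural log, and scores are binned from 0 to 20. All function names are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with rate lam."""
    return lam ** n / math.factorial(n) * math.exp(-lam)


def kl_divergence(g, h):
    """Sum of g_i * log(g_i / h_i); bins with g_i == 0 contribute nothing."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def score_divergence(scores, max_score=20):
    """KL divergence between the empirical article-score distribution and a
    Poisson distribution with the same mean (assumed estimate of lambda)."""
    total = len(scores)
    counts = Counter(scores)
    empirical = [counts[n] / total for n in range(max_score + 1)]
    lam = sum(scores) / total
    theoretical = [poisson_pmf(n, lam) for n in range(max_score + 1)]
    return kl_divergence(empirical, theoretical)


# A group whose scores look Poisson-like versus one where a few articles
# have many neighbors that refer to group genes.
poisson_like = [0] * 37 + [1] * 37 + [2] * 18 + [3] * 6 + [4] * 2
heavy_tail = [0] * 90 + [10] * 10
```

With these toy inputs, `score_divergence(heavy_tail)` comes out larger than `score_divergence(poisson_like)`, matching the intuition that a functionally coherent group produces an excess of high-scoring articles.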

Observed and Expected Distribution of Article Scores

Results

Other methods • Word Divergence

Other methods • Best Article Score • Highest article score is used as a measure of the gene group’s functional coherence • Best p-Value • Summed probability of an article having equal or more neighbors than it has • Neighborhood Divergence –No Filter • Filter used is: When calculating semantic neighbors, only articles that refer to different genes are considered.

Evaluation

Corrupting Functional Groups

Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusion

Introduction • Genes and proteins are associated with multiple names • LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP, WSL-1, WSL-LR, Tnfrsf12 • PS2, Alg2, MA-3, alg-2, Pdcd6 • GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2 (http://bioinformatics.org/textknowledge/synonym.php)

Advantage • An automated method will keep the database updated • Extracting synonyms will help • Information retrieval and extraction • Human curators of biological resources

Existing approaches • Detecting semantically related words • “beer” and “wine” are related terms • Use of WORDNET (a large lexical database of English words) to evaluate semantic similarity • Most synonym identification methods do not consider the surrounding context of words

Information Extraction and Machine Learning • Constructing and tuning extraction systems requires a large amount of manual labor • Machine learning techniques help to reduce the manual labor by automatically acquiring rules from labeled and unlabeled data

ML techniques • Supervised Learning • Labeled training data available • Semi-supervised Learning • Small amount of labeled training data • Unsupervised Learning • Data with no labeling • Reinforcement Learning • Learn a mapping from situations to actions by trial-and-error interactions

Approach Used here • Obtain tagged genes and proteins in text using existing gene taggers • Four approaches used • Unsupervised Learning • Partially Supervised Learning • Supervised Learning • Hand Crafted System • Use of a final COMBINED system

Unsupervised Learning – Contextual Similarity • Finds set of words that appear in similar context using mutual information between the words

Unsupervised Learning – Contextual Similarity • Mutual Information • Similarity Measure:

Contextual Similarity • Computing similarity for all terms takes O(|lexicon|^3) time, so heuristic search is used • Many false positives are returned, so it is useful to incorporate some domain knowledge
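The mutual information and similarity formulas from these slides did not survive in the transcript, so the sketch below is a generic stand-in rather than the measure actually used. It assumes pointwise mutual information (PMI) between a term and the words within a fixed window, and compares terms by the cosine of their PMI vectors. All names and the toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def pmi_vectors(sentences, window=2):
    """Map each term to {context_word: PMI}, keeping positive PMI values only."""
    pair_counts = Counter()
    term_counts = Counter()
    ctx_counts = Counter()
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(term, tokens[j])] += 1
                    term_counts[term] += 1
                    ctx_counts[tokens[j]] += 1
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (term, ctx), c in pair_counts.items():
        # PMI = log2( P(term, ctx) / (P(term) * P(ctx)) )
        pmi = math.log2(c * total / (term_counts[term] * ctx_counts[ctx]))
        if pmi > 0:
            vectors[term][ctx] = pmi
    return vectors


def context_similarity(u, v):
    """Cosine similarity between two sparse PMI vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sentences = [
    ["DR3", "induces", "apoptosis"],
    ["TRAMP", "induces", "apoptosis"],
    ["actin", "forms", "filaments"],
]
vecs = pmi_vectors(sentences)
```

Terms that share contexts, such as DR3 and TRAMP in the toy sentences, receive a high similarity even though they never co-occur, while unrelated terms score zero.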

Partially supervised Learning- Snowball

Snowball • Confidence of a pattern • Calculates confidence of extracted tuples and discards low confidence tuples

Supervised Learning – Text classification • User-provided positive and negative example gene and protein pairs • An SVM is trained on this data (radial basis kernel function of SVMLight) • Classifies pairs of identified genes and proteins using a confidence score Conf(s) (the score assigned by the classifier) • Does not combine evidence from multiple occurrences of the same gene or protein pair

Hand Crafted Extraction System – GPE system • Most labor-intensive approach, but yields high-quality results • Starts with a set of known pairs of synonyms • Manual examination to find patterns of occurrences • Use of “known as” or “also called” • Scans for more synonyms and uses heuristics and filters to ignore non gene/protein terms • Confidence value of 1 assigned to every returned result
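A pattern-based extractor of this kind can be sketched with a regular expression. This is not the GPE system itself: the cue phrases extend the two named on the slide, the term pattern is a crude assumption about what a gene or protein name looks like, and the filtering heuristics the slide mentions are omitted.

```python
import re

# Cue phrases: the slide names "known as" and "also called";
# "also known as" is added here as an assumed variant.
CUE = r"(?:also\s+known\s+as|known\s+as|also\s+called)"
# Crude stand-in for a gene/protein token: letters, digits and hyphens.
TERM = r"[A-Za-z0-9][A-Za-z0-9\-]*"
# Optional comma or opening bracket, or a linking verb, between term and cue.
PATTERN = re.compile(
    rf"\b({TERM})\s*(?:[,(]\s*|(?:is|was)\s+)?{CUE}\s+({TERM})"
)


def extract_synonym_pairs(text):
    """Return (term, synonym, confidence) triples; confidence is always 1.0,
    mirroring the fixed confidence value described on the slide."""
    return [(a, b, 1.0) for a, b in PATTERN.findall(text)]


text = (
    "DR3 (also known as TRAMP) belongs to the TNF receptor family. "
    "Ncoa2, also called TIF2, is a coactivator. "
    "Pdcd6 is also known as Alg2."
)
pairs = extract_synonym_pairs(text)
```

Because every match gets the same confidence, the quality of such a system rests on the hand-written patterns and filters, which is why it is the most labor-intensive of the four approaches.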
