
Mining Medical Literature

Mining Medical Literature. Vignesh Ganapathy (CS 374: Algorithms in Biology, Fall 2005). Outline: Introduction and Background; Mining Technique 1: Identifying Functionally Coherent Gene Groups; Mining Technique 2: Extracting Synonymous Gene and Protein Terms; Conclusions.


Presentation Transcript


  1. Mining Medical Literature Vignesh Ganapathy (CS 374 : Algorithms in Biology) (FALL 2005)

  2. Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusions

  3. Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusions

  4. Introduction • Medical literature holds vast amounts of knowledge and information • PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature • Amedeo.com (The Medical Literature Guide) • Journals such as Science, Nature, Cell, EMBO, Cell Biology, PNAS • (and many more)

  5. The Problem • The major task is finding ways to extract useful information from these resources.

  6. What is Data Mining? “Data Mining is the Process of discovering meaningful, new correlation patterns and trends by sifting through large amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”

  7. Example Data! • Large amounts of data but no information • Daily transactions at a supermarket • Daily website visit histories • Books/videos rented at a Library • Newspaper, Journal archives

  8. Amazon.com

  9. Google News • Clustering News items (Google News)

  10. More Applications • Improving sales strategy • Finding items that sell together (a common example is beer and diapers: a supermarket found that beer was purchased together with diapers 50% of the time) • Anomaly detection and many more…

  11. Information Retrieval (IR) • Collecting information from text data (Unstructured Data) • Applications • Search web documents • Natural Language Processing • Term also extends to include multimedia or other forms of unstructured data

  12. Simple flow of Retrieval Process

  13. IR System Evaluation • Some measures are • Precision • Recall • F1 measure – combined measure, the harmonic mean of precision and recall (the general F-measure is a weighted harmonic mean) • Sensitivity • Specificity

  14. Precision and Recall How are Precision and Recall related?

  15. Problems with Precision and Recall • Deciding which documents are relevant and which are not is not easy • For recall, it is difficult to measure the number of relevant documents in the database • Creating a pool of relevant records is one solution • In practice, these are still good measures

  16. Sensitivity and Specificity • Sensitivity – Probability that a positive example is classified as positive, TP / (TP + FN) • Specificity – Probability that a negative example is classified as negative, TN / (TN + FP) What is the relation between Sensitivity, Specificity, Precision and Recall?
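The evaluation measures on slides 13 to 16 can all be computed from the four cells of a confusion matrix. The sketch below is illustrative only: the function name `ir_metrics` and the example counts are assumptions, not part of the presentation. It also shows that recall and sensitivity are the same quantity.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ir_metrics(tp, fp, fn, tn):
    """Compute common IR measures from confusion-matrix counts.

    tp: relevant documents that were retrieved
    fp: non-relevant documents that were retrieved
    fn: relevant documents that were missed
    tn: non-relevant documents that were correctly left out
    """
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    # F1 is the (unweighted) harmonic mean of precision and recall.
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "sensitivity": recall,  # same definition as recall
        "specificity": _ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    for name, value in ir_metrics(tp=40, fp=10, fn=20, tn=30).items():
        print(f"{name}: {value:.3f}")
```

Precision is the only one of these measures that depends on how many documents the system returns, which is why it typically trades off against recall.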

  17. Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusion

  18. Introduction • Analysis shifting from single gene to family of genes • Examples of these are: • Sequence Data • Gene Expression Clustering • Deletion Phenotypes • Yeast-2-Hybrid screens

  19. HOVERGEN: a Database of Homologous Vertebrate Genes Useful for comparative sequence analysis, or molecular evolution studies 10 biggest gene families

  20. Why identify functional gene groups? • Interesting to know functionally relevant groups for large gene group sets • Helps to assess the significance of experimentally derived gene sets • Refine gene groups to find more functionally relevant groups • Existing algorithms can make use of this information in finding gene groups

  21. Existing Approaches • Use of co-occurrence of gene names in abstracts to create networks of related genes automatically • Use of an existing vocabulary of gene functions and assigned genes to decide on a functionally relevant group (Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS))

  22. Statistical NLP approach • Used for annotating individual genes • Determining gene and protein interactions • Assigning keywords to genes or group of genes

  23. Neighbor Divergence Approach • Statistical NLP technique • Will always be up to date if provided with a current literature base • Cannot specify what the actual function is!

  24. Challenges in the Problem • Large number of genes • Genes have multiple functions • Some genes have been extensively studied, others recently discovered So the literature about genes reflects these differences

  25. Neighbor Divergence Intuition

  26. Neighbor Divergence Algorithm • Representation Of Articles • Identifying Semantic Neighbors for Corpus Articles • Scoring Articles Relative to Gene Group • Calculating a Theoretical distribution of Scores • Calculating the Difference between empirical and theoretical distribution

  27. ND – Article Representation Words in articles are weighted by term frequency and inverse document frequency (to reduce the impact of common words): $W_{i,j} = (1 + \log_2 tf_{i,j}) \log_2(N / df_i)$ if $tf_{i,j} > 0$, and $W_{i,j} = 0$ if $tf_{i,j} = 0$, where $W_{i,j}$ is the weighted count of word i in document j, $tf_{i,j}$ is the number of times word i occurs in document j, $df_i$ is the number of documents containing word i, and N is the total number of documents
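A minimal sketch of this weighting scheme, assuming the formula is read as (1 + log2 tf) · log2(N / df), the usual log-scaled tf-idf form. The function names and the toy documents are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter


def tfidf_weight(tf, df, n_docs):
    """Weight of a word that occurs tf times in a document and appears in
    df of the n_docs documents. Words absent from the document get 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log2(tf)) * math.log2(n_docs / df)


def weight_documents(tokenized_docs):
    """Turn a list of token lists into sparse {word: weight} vectors."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append({w: tfidf_weight(c, df[w], n_docs) for w, c in tf.items()})
    return vectors


if __name__ == "__main__":
    docs = [
        ["kinase", "binds", "the", "receptor"],
        ["the", "receptor", "is", "expressed"],
        ["the", "kinase", "kinase", "pathway"],
    ]
    for vec in weight_documents(docs):
        print(vec)
```

A word that appears in every document (here "the") gets weight 0, which is the intended effect of the inverse document frequency term.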

  28. ND – Identifying Semantic Neighbors • For each article, the k most similar articles are precomputed (k = 20 was used) • The cosine similarity measure is used (cosine of the angle between two weighted article vectors)
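The neighbor search can be sketched as a brute-force comparison of every pair of article vectors. This is an illustration, not the presenter's implementation: the sparse `{word: weight}` representation, the function names, and the choice to exclude an article from its own neighbor list are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(vectors, k=20):
    """For each article index, return the indices of its k most similar
    articles, most similar first (ties broken by lower index)."""
    neighbors = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[i] = [j for _, j in scored[:k]]
    return neighbors


if __name__ == "__main__":
    toy = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}, {"a": 1.0}]
    print(semantic_neighbors(toy, k=2))
```

The pairwise loop is quadratic in the number of articles, which is one reason to precompute the neighbor lists once and reuse them for every gene group.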

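  29. ND – Scoring articles • Given a gene group g, ND assigns a score $S_{i,g}$ to each article i • The score is a count of semantic neighbors that refer to group genes • $fr_{k,g} = n_{k,g} / n_k$ (fractional reference for each neighbor k: of the $n_k$ genes that article k refers to, $n_{k,g}$ are in group g) • $S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{sem(i,j),g}\right)$ (score value, where sem(i,j) is the j-th semantic neighbor of article i)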
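A small sketch of the scoring step, assuming each article comes with the set of genes it refers to and a precomputed list of its semantic neighbors. The function names and the toy data are hypothetical.

```python
def fractional_reference(article_genes, group):
    """Fraction of the genes referenced by one article that belong to the group."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(neighbor_ids, genes_by_article, group):
    """Score of one article: the rounded sum of the fractional references
    of its semantic neighbors.

    Note: Python's round() rounds exact halves to the nearest even integer,
    which may differ from the rounding used in the original work.
    """
    total = sum(
        fractional_reference(genes_by_article.get(k, set()), group)
        for k in neighbor_ids
    )
    return round(total)


if __name__ == "__main__":
    # Toy data: article id -> genes the article refers to.
    genes_by_article = {
        1: {"A", "B"},
        2: {"A"},
        3: {"C"},
        4: {"B", "C", "D", "E"},
    }
    group = {"A", "B"}
    print(article_score([1, 2, 3, 4], genes_by_article, group))
```

With these toy inputs the fractional references are 1, 1, 0 and 0.25, so the article scores 2.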

  30. ND – Difference in Distributions • Calculating a theoretical distribution of scores • A Poisson distribution represents a gene group with no coherent functional structure: $P(S = n) = \frac{\lambda^n}{n!} e^{-\lambda}$ • KL divergence • If the two distributions are the same, the divergence is zero • The more disparate the distributions, the larger the divergence • $D(g \| h) = \sum_i g_i \log(g_i / h_i)$
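The comparison between the empirical and theoretical score distributions can be sketched as follows. Several choices here are assumptions rather than details from the slides: g is taken to be the empirical distribution and h the Poisson one, λ is passed in rather than estimated, the natural logarithm is used (another base only rescales the result), and the Poisson probabilities are simply evaluated on scores 0 to 20 without renormalizing.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with rate lam."""
    return (lam ** n) * math.exp(-lam) / math.factorial(n)


def kl_divergence(g, h):
    """Sum of g_i * log(g_i / h_i). Terms with g_i == 0 contribute nothing;
    every h_i paired with a nonzero g_i must be positive."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def score_divergence(scores, lam, max_score=20):
    """Divergence of the observed article-score distribution from a Poisson
    distribution with rate lam, over scores 0..max_score."""
    counts = Counter(scores)
    total = len(scores)
    empirical = [counts.get(n, 0) / total for n in range(max_score + 1)]
    theoretical = [poisson_pmf(n, lam) for n in range(max_score + 1)]
    return kl_divergence(empirical, theoretical)


if __name__ == "__main__":
    # Hypothetical scores: a few articles score far above what a rate of 1 predicts.
    print(score_divergence([0, 0, 1, 5, 5, 6], lam=1.0))
```

A group with a heavy tail of high-scoring articles, as in the toy input, diverges strongly from the Poisson baseline, which is the signal the method looks for.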

  31. Observed and Expected Distribution of Article Scores

  32. Results

  33. Other methods • Word Divergence

  34. Other methods • Best Article Score • The highest article score is used as a measure of the gene group’s functional coherence • Best p-Value • Summed probability of an article having as many neighbors as it has, or more • Neighbor Divergence – No Filter • The filter in question: when calculating semantic neighbors, only articles that refer to different genes are considered.

  35. Evaluation

  36. Corrupting Functional Groups

  37. Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusion

  38. Introduction • Genes and proteins are associated with multiple names • LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP, WSL-1, WSL-LR, Tnfrsf12 • PS2, Alg2, MA-3, alg-2, Pdcd6 • GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2 (http://bioinformatics.org/textknowledge/synonym.php)

  39. Advantage • An automated method will keep the database updated • Extracting synonyms will help • Information retrieval and extraction • Human curators of biological resources

  40. Existing approaches • Detecting semantically related words • “beer” and “wine” are related terms • Use of WordNet (a large lexical database of English words) to evaluate semantic similarity • Most synonym identification methods do not consider the surrounding context of words

  41. Information Extraction and Machine Learning • Requires a large amount of manual labor to construct and tune extraction systems • Machine learning techniques help to reduce the manual labor by automatically acquiring rules from labeled and unlabeled data

  42. ML techniques • Supervised Learning • Labeled training data available • Semi-supervised Learning • Small amount of labeled training data • Unsupervised Learning • Data with no labeling • Reinforcement Learning • Learn a mapping from situations to actions by trial-and-error interactions

  43. Approach Used here • Obtain tagged genes and proteins in text using existing gene taggers • Four approaches used • Unsupervised Learning • Partially Supervised Learning • Supervised Learning • Hand Crafted System • Use of a final COMBINED system

  44. Unsupervised Learning – Contextual Similarity • Finds set of words that appear in similar context using mutual information between the words

  45. Unsupervised Learning – Contextual Similarity • Mutual Information • Similarity Measure:

  46. Contextual Similarity • Computing it for all terms takes time O(|lexicon|^3), so a heuristic search is used • Lots of false positives are returned, so it is useful to incorporate some domain knowledge

  47. Partially supervised Learning- Snowball

  48. Snowball • Confidence of a pattern • Calculates confidence of extracted tuples and discards low confidence tuples

  49. Supervised Learning – Text classification • User provided positive and negative example gene and protein pairs • Use SVM to train using this data (radial basis kernel function of SVMLight) • Classifies pairs of identified genes and proteins using a confidence score Conf(s)(score assigned by classifier) • Does not combine evidence from multiple occurrences of same gene or protein pair

  50. Hand Crafted Extraction System – GPE system • The most labor-intensive approach, but gives high-quality results • Starts with a set of known pairs of synonyms • Manual examination to find patterns of occurrences • Use of “known as” or “also called” • Scans for more synonyms and uses heuristics and filters to ignore non gene/protein terms • Confidence value of 1 assigned to every returned result
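The pattern-matching idea can be illustrated with a single regular expression built around the cue phrases named on the slide. This is a sketch under stated assumptions, not the GPE system itself: the token pattern `TERM`, the function name, and the example sentences are hypothetical, and the heuristics and filters that discard non gene/protein terms are omitted.

```python
import re

# Hypothetical token shape for a gene or protein name: starts with a letter,
# then letters, digits, or hyphens (e.g. "DR3", "GRIP-1").
TERM = r"[A-Za-z][A-Za-z0-9-]*"

# A term, an optional comma or opening parenthesis, a cue phrase, another term.
SYNONYM_PATTERN = re.compile(
    rf"\b({TERM})\s*(?:[,(]\s*)?"
    rf"(?:also\s+known\s+as|known\s+as|also\s+called)\s+({TERM})"
)


def extract_synonym_pairs(text):
    """Return (term, synonym, confidence) triples found in text.

    Every match gets confidence 1.0, mirroring the slide. Without filtering,
    ordinary words before a cue phrase would also be returned.
    """
    return [(m.group(1), m.group(2), 1.0) for m in SYNONYM_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "LARD, also known as DR3, is a death receptor. "
        "TIF2 (also called GRIP-1) is a coactivator."
    )
    for pair in extract_synonym_pairs(sample):
        print(pair)
```

Because the pattern has no notion of what a gene name is, a production system would restrict matches to terms flagged by a gene tagger, as the filtering step on the slide suggests.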
