Literature Mining and Ontology BMI/IBGP 730 Autumn, 2011


Presentation Transcript


  1. Literature Mining and Ontology • BMI/IBGP 730, Autumn 2011 • Yang Xiang, Ph.D. in Computer Science • yxiang@bmi.osu.edu • Department of Biomedical Informatics, The Ohio State University

  2. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  3. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  4. What is Literature (Text) Mining? • The purposes of Literature Mining • Find relevant documents • Discover knowledge (what is knowledge?) • e.g. opinion mining (sentiment analysis) • e.g. document similarity • The advantages of computer-based Literature Mining • Computers can search far more documents than a human reader can. • Computers can ‘think’ and discover knowledge. • The following slides focus on biomedical literature mining

  5. Why Is Literature Mining So Popular in Biomedical Science? • Biomedical science studies natural subjects: • Species • Genes • Phenotypes • Diseases …

  6. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  7. Popular Tools for Biomedical Literature Mining – Document search • Google • Google Scholar: http://scholar.google.com • ISI Web of Knowledge • www.isiknowledge.com • PubMed • www.ncbi.nlm.nih.gov/pubmed • Scopus • www.scopus.com

  8. Tools for Biomedical Literature Mining – Knowledge discovery • The Gene Ontology • http://www.geneontology.org/ • Gene answer • www.geneanswers.com

  9. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  10. Techniques Behind Literature Mining • Interdisciplinary • Computer Science • Information retrieval • Data mining • Natural Language Processing • Machine learning • Library Science • Biomedical Science • Linguistics • Computational linguistics • Statistics • And more! • Two main research areas (some overlaps) • Information Retrieval • Natural Language Processing

  11. Basic Text Search Algorithm • Example: find the string "world" in the text "… Hello, world …". • Assume the text size is n. • Assume the search string size is m. • How can we design an efficient algorithm to find all matches in the text? • Brute-force algorithm: O(mn). • Boyer-Moore heuristics: O(mn) in the worst case, but fast in most cases for English text. • KMP (Knuth-Morris-Pratt) algorithm: O(m+n).
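
The KMP idea from the slide above can be sketched as follows. This is a minimal illustration, not code from the lecture; the function names `build_failure` and `kmp_search` are chosen here for the example. The failure table lets the search resume after a mismatch without re-reading text characters, which is where the O(m+n) bound comes from.

```python
def build_failure(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every match of pattern in text."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("Hello, world", "world"))  # [7]
```

Each text character is examined once and the fallback steps are bounded by the number of forward steps, so the scan is linear in n after an O(m) preprocessing pass.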

  12. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  13. Information Retrieval (Indexing) • Archiving (preprocessing) documents for fast search • Preprocessing time • Query time • Index size
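
One common way to trade preprocessing time and index size for query time is an inverted index, which maps each word to the set of documents containing it. The sketch below is an assumption for illustration (the slide does not name a specific index structure), and the three sample documents are made up.

```python
from collections import defaultdict


def build_index(docs):
    """Preprocessing step: map each lowercased word to the ids of the
    documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:!?()")
            if word:
                index[word].add(doc_id)
    return index


def search(index, query):
    """Query step: ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical mini-corpus, for illustration only.
docs = {
    1: "Gene1 interacts with Gene2.",
    2: "Coffee and health.",
    3: "Gene2 regulates coffee metabolism",
}
index = build_index(docs)
print(search(index, "gene2"))         # {1, 3}
print(search(index, "coffee gene2"))  # {3}
```

Building the index touches every word once (preprocessing time), the index stores one posting set per distinct word (index size), and a query only intersects a few sets instead of scanning all documents (query time).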

  14. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

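  15. Programming language processing (C++, Java, etc.) • Lexical analysis: split y=x+10; into tokens • Syntax analysis: build a parse tree, e.g. an assignment made of the identifier y, the operator =, and an expression; that expression is expression + expression, with the identifier x on the left and the number 10 on the right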
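
The lexical-analysis step for the slide's statement `y=x+10;` can be sketched with a small regular-expression tokenizer. The token names (`IDENTIFIER`, `NUMBER`, and so on) and the `tokenize` function are assumptions made for this example, not part of any particular compiler.

```python
import re

# (token name, regular expression); order matters when patterns overlap.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[=+\-*/]"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Split source text into (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":  # drop whitespace
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token in tokenize("y=x+10;"):
    print(token)
```

A syntax analyzer would then consume this token list to build the assignment tree described on the slide.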

  16. Natural Language Processing • Lexical level • Stemming (including lemmatizing): find the root of a word, e.g. swimming, swam, swim, swimmer → swim • Stemming rules may vary (a balance between overstemming and understemming) • Typical algorithm: the Porter stemming algorithm • Aliases, synonyms • Grammatical level • Parsing, e.g. “… We find Gene1 interacts with Gene2 …” • Parse tree: Sentence → Noun phrase (Gene1) + Verb phrase (Verb: interact + Noun phrase: Gene2)
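
To make the stemming example concrete, here is a deliberately tiny suffix-stripping sketch. It is not the Porter algorithm; the suffix list, the irregular-form table, and the minimum root length of 3 are all assumptions picked so that the slide's four words reduce to "swim".

```python
# Irregular forms need a lookup table (this is the lemmatizing part).
IRREGULAR = {"swam": "swim", "swum": "swim"}
# Suffixes are tried in this order; a real stemmer has many more rules.
SUFFIXES = ["ing", "er", "s"]


def stem(word):
    """Toy stemmer: irregular lookup, then strip one known suffix."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            root = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "swimm" -> "swim".
            if len(root) >= 2 and root[-1] == root[-2] and root[-1] not in "aeiou":
                root = root[:-1]
            return root
    return word


print([stem(w) for w in ["swimming", "swam", "swim", "swimmer"]])
# ['swim', 'swim', 'swim', 'swim']
```

Rules this crude overstem easily (for instance, `stem("class")` strips the final "s"), which is the balance between overstemming and understemming that the slide mentions.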

  17. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  18. Statistical and Data Mining Processing • Statistical • Count the word frequency • Count the expression frequency • Data Mining • Mining the set of frequent words • Association Rule Mining
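
The two statistical counts on the slide above can be sketched with the standard library. The sample sentence is invented, and "expression" is interpreted here as a two-word phrase (bigram), which is an assumption.

```python
import re
from collections import Counter


def words_of(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_counts(text):
    """Frequency of each single word."""
    return Counter(words_of(text))


def bigram_counts(text):
    """Frequency of each pair of adjacent words.
    Note: pairs may span sentence boundaries in this simple version."""
    words = words_of(text)
    return Counter(zip(words, words[1:]))


text = "Gene1 interacts with Gene2. Gene2 binds Gene1."
print(word_counts(text).most_common(2))
print(bigram_counts(text)[("gene1", "interacts")])  # 1
```

Frequent-word-set mining and association rule mining build on exactly these counts, keeping only the words or word sets whose frequency passes a chosen support threshold.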

  19. Document Classification • E.g., classify all documents related to coffee and health • Various machine learning algorithms can be applied here. • Figure: coffee-and-health documents divided into "documents show benefits" and "documents show risk", with the example topics cardioprotective, laxative, cholesterol, and anxiety

  20. Accuracy vs Relevancy in Pattern Recognition/Machine Learning • Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}| • Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}| • Fall-out = |{nonrelevant docs} ∩ {retrieved docs}| / |{nonrelevant docs}|
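
The three ratios above translate directly into set operations. In this sketch the function name `retrieval_metrics` and the ten-document example are made up for illustration; empty denominators are mapped to 0.0 by choice.

```python
def retrieval_metrics(relevant, retrieved, all_docs):
    """Return (precision, recall, fall-out) for one query."""
    relevant, retrieved, all_docs = set(relevant), set(retrieved), set(all_docs)
    nonrelevant = all_docs - relevant
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fallout = len(nonrelevant & retrieved) / len(nonrelevant) if nonrelevant else 0.0
    return precision, recall, fallout


# Hypothetical collection of 10 documents: 4 are relevant, 3 were retrieved.
precision, recall, fallout = retrieval_metrics(
    relevant={1, 2, 3, 4},
    retrieved={3, 4, 5},
    all_docs=range(1, 11),
)
print(precision, recall, fallout)
```

Here two of the three retrieved documents are relevant (precision 2/3), two of the four relevant documents were found (recall 1/2), and one of the six nonrelevant documents was retrieved (fall-out 1/6).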

  21. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  22. Ontology • In philosophy, ontology is a systematic account of existence • In information science, an ontology is a representation of concepts and their relationships, often as a directed graph

  23. Ontology Example (Informal) • Figure: an informal fish ontology. Node labels: fish; salt water, fresh water; Asian, European, North American, …; native, invasive; Common Carp, Mirror Carp, Crappie

  24. Ontology Example: Scientific classification • Kingdom: Animalia • Phylum: Chordata, …, Hemichordata • Class: …, Actinopterygii, Sarcopterygii, … • Subclass: Neopterygii, Chondrostei • Infraclass: …, Teleostei • Order: …, Cypriniformes • Family: …, Cyprinidae

  25. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  26. Gene Ontology (GO) Consortium • Figure: example GO terms, including DNA metabolism, cell, and molecular function; under molecular function: nucleic acid binding, enzyme, helicase, DNA binding, DNA helicase, ATP-dependent DNA helicase, … • Reference: Gene Ontology: tool for the unification of biology, Nature Genetics, 2000. http://dx.doi.org/10.1038/75556

  27. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  28. Unified Medical Language System (UMLS) • A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains: • Metathesaurus • Semantic Network • SPECIALIST Lexicon • UMLS contains more than just ontologies • Maintained by the US National Library of Medicine • Website: http://www.nlm.nih.gov/research/umls/

  29. UMLS - Metathesaurus • Number of biomedical concepts > 1 million • Drawn from over 100 incorporated controlled source vocabularies, including: • ICD (International Statistical Classification of Diseases and Related Health Problems) • MeSH (Medical Subject Headings) • SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) • LOINC (Logical Observation Identifiers Names and Codes) • Gene Ontology • OMIM (Online Mendelian Inheritance in Man) … • http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

  30. UMLS - Semantic Network • Semantic types (categories) • Entity • Physical Object • Organism … … • Event • Activity • Behavior … … • Semantic relationships (connecting two concepts) • isa • associated_with • physically_related_to • part_of … • spatially_related_to • location_of … … • Example: Drug A treats Disease B (inverse relationship: treated_by); Disease B disease_is_marked_by_gene Gene A • http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html • http://www.clres.com/semrels/umls_relation_list.html

  31. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Use and index ontology • Applications of Literature Mining and Ontology

  32. Use of ontology systems • Statistical • Gene ontology enrichment test • Indexing • Reachability • Distance • Path
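
The slide does not say which statistic the Gene Ontology enrichment test uses. One common formulation is the one-sided hypergeometric test, sketched below as an assumption. The parameter names are chosen for this example: `N` genes in the background, `K` of them annotated with the GO term, `n` genes in the study list, and `k` of those annotated with the term.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the chance of seeing at
    least k annotated genes in a random list of n genes."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical numbers: 10 background genes, 5 annotated, a study list of 5
# genes that are all annotated with the term.
print(enrichment_p_value(N=10, K=5, n=5, k=5))  # 1/252, about 0.004
```

A small p-value suggests the term is over-represented in the study list; in practice one such test is run per GO term, so some multiple-testing correction is normally applied afterwards.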

  33. Represent Ontology by Graphs • Directed Graph • Directed Acyclic Graph (DAG): Most ontologies fall into this type. • Directed Tree • Figure: examples of a directed graph, a DAG, and a tree
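
Whether a directed graph is a DAG can be checked by repeatedly removing vertices that have no incoming edges (Kahn's topological-sort procedure). This is a sketch under the assumption that the graph is stored as an adjacency-list dictionary; the two small graphs are made up.

```python
from collections import deque


def is_dag(graph):
    """graph maps each vertex to a list of its successors.
    Returns True when the graph contains no directed cycle."""
    indegree = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    queue = deque(v for v, d in indegree.items() if d == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for t in graph.get(v, ()):
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    # Vertices on a cycle never reach indegree 0, so they are never removed.
    return removed == len(indegree)


diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cycle = {"a": ["b"], "b": ["a"]}
print(is_dag(diamond), is_dag(cycle))  # True False
```

The diamond shape is a DAG but not a tree, because "d" has two parents; this is the typical situation in ontologies where a term can have several parent terms.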

  34. Reachability • The problem: Given two vertices u and v in a directed graph G, is there a path from u to v? • Examples on the figure's graph (vertices numbered 1 to 15): Query(1, 11) → Yes; Query(3, 9) → No
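
Without any index, a reachability query can be answered by a graph traversal from u. The sketch below uses an iterative depth-first search; the five-vertex graph is a made-up example, not the graph from the slide's figure.

```python
def reachable(graph, u, v):
    """True when some directed path leads from u to v.
    graph maps each vertex to a list of its successors."""
    stack = [u]
    seen = {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in graph.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


# Hypothetical DAG, for illustration only.
graph = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
print(reachable(graph, 1, 4))  # True
print(reachable(graph, 4, 1))  # False
```

Each query costs time proportional to the number of vertices and edges. The other extreme is to precompute the full transitive closure, which answers queries in constant time but can need space quadratic in the number of vertices; reachability indexing schemes aim for a point between these two.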

  35. Distance • The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v? • Example on the figure's graph (vertices numbered 1 to 15): dG(1, 11) = 3

  36. Path • The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v? • Example on the figure's graph (vertices numbered 1 to 15): find a path from 1 to 11
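
For unweighted graphs, the distance and path queries can both be answered by a breadth-first search that records each vertex's parent. This is an index-free sketch; the function names and the five-vertex graph are assumptions for illustration and do not reproduce the slide's figure.

```python
from collections import deque


def shortest_path(graph, u, v):
    """Return one shortest path from u to v as a list of vertices,
    or None when v is not reachable from u."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            path = []
            while x is not None:  # walk the parent links back to u
                path.append(x)
                x = parent[x]
            return path[::-1]
        for y in graph.get(x, ()):
            if y not in parent:
                parent[y] = x
                queue.append(y)
    return None


def distance(graph, u, v):
    """Number of edges on a shortest path from u to v, or None."""
    path = shortest_path(graph, u, v)
    return None if path is None else len(path) - 1


# Hypothetical DAG, for illustration only.
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
print(shortest_path(graph, 1, 5))  # [1, 2, 4, 5]
print(distance(graph, 1, 5))       # 3
```

Breadth-first search visits vertices in order of increasing distance from u, so the first time v is dequeued the recorded parent chain is a shortest path.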

  37. The estimated difficulty of building very efficient graph database indexing schemes (based on current research) • References: • R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608. • R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.

  38. Outline • What is Literature Mining? • Popular Tools for Literature Mining • Basic Techniques • Information Retrieval (Indexing): Expediting searching • Linguistic Processing • Other Processing • What is Ontology? • Simple ontology examples • Gene ontology • Unified Medical Language System • Ontology use and indexing • Applications of Literature Mining and Ontology

  39. Applications of Literature Mining and Ontology - I • Build confirmed gene-phenotype relations • Human Phenotype Ontology (HPO) • Built from the Online Mendelian Inheritance in Man (OMIM) database. • http://human-phenotype-ontology.org/ • Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology. Clinical Genetics 77(6) 2010: 525–534. http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x

  40. Applications of Literature Mining and Ontology - II • MetaMap program and CKC Mining • MetaMap: maps biomedical text to the UMLS Metathesaurus. • A CKC (Conceptual Knowledge Construct) represents a path connecting several concepts in the UMLS. • Knowledge discovery using MetaMap and CKC mining. • Figure: pipeline with the labels Literature, MetaMap, CKCs, bio-molecular, and phenotypes • References: • Aronson, A.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Symposium, p. 17 (2001) • Payne, P., Borlawsky, T., Kwok, A., Greaves, A.: Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. In: AMIA Annual Symposium Proceedings, p. 566 (2008)

  41. Thanks! Questions?