Mining Associations in Annotated Biomedical Web

This presentation describes a methodology for discovering biologically meaningful associations between pairs of controlled vocabulary (CV) terms in the annotated biomedical web. It covers human genes, human diseases, and the human genome map, and uses association metrics and semantic knowledge to identify meaningful associations.

Presentation Transcript


  1. 1 April 2009 Woei-Jyh (Adam) Lee, Ph.D. Center for Bioinformatics and Computational Biology University of Maryland, U.S.A. National Center for Biotechnology Information National Institutes of Health, U.S.A.

  2. Research Experiences • Institutions: New York University; Morgan Stanley; AT&T Labs, AT&T; Bell Laboratories, Lucent Technologies; University of Southern California; University of Maryland; National Institutes of Health, U.S.A. • Areas: Distributed computing, parallel computing; Web performance measurement; Component object model, quality of service; Fault tolerance, Internet protocols, policy based management; Media streaming, error correction, video on demand; Data management and mining, bioinformatics; Protein domain parsing, genomics and genetics W.-J. Lee

  3. Mining associations in the annotated biomedical web Woei-Jyh (Adam) Lee, Ph.D. University of Maryland Department of Computer Science; and Institute for Advanced Computer Studies; and Center for Bioinformatics and Computational Biology

  4. LSLink - Life Science Link • http://www.cbcb.umd.edu/research/lslink/ W.-J. Lee

  5. LSLink - Life Science Link • Background • Many Web-accessible data resources for biologists. • publications, genes, diseases, sequences, structures, … • Data records are linked. • genes linked to publications, diseases linked to genes, … • Data records are annotated with controlled vocabulary (CV) terms. • Entrez Gene with GO, PubMed with MeSH, … • Goal: to discover biologically meaningful and yet unknown associations between pairs of CV terms. W.-J. Lee

  6. Entrez Gene W.-J. Lee

  7. PubMed W.-J. Lee

  8. Links from Entrez Gene to PubMed W.-J. Lee

  9. Links from PubMed to Entrez Gene W.-J. Lee

  10. Gene Ontology (GO) W.-J. Lee

  11. Medical Subject Headings (MeSH) W.-J. Lee

  12. GO Annotations in Entrez Gene W.-J. Lee

  13. MeSH Annotations in PubMed W.-J. Lee

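  14. Web of Entrez Gene, OMIM and PubMed [Figure: the data resources Entrez Gene, PubMed and OMIM, connected by links and annotated with controlled vocabularies. Labels: GO, Gene Nomenclature, MeSH, Lash CV, Clinical Synopsis, SNOMED CT. Legend: data resource, link, annotation, controlled vocabulary.] W.-J. Lee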

  15. Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

  16. Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Methodology to Generate Datasets • Links and Termlinks • Background and User Query Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

  17. Approach • Generate and analyze background and user query datasets. • Apply two classes of metrics (association rule mining and hypergeometric distribution) • Filter CV terms (by Major Topic, Semantic Type, etc.). • Rank association pairs of terms from two CVs. • Perform scientist evaluation. W.-J. Lee

  18. Methodology • Collect data records and links: Entrez Gene (E) records e1, e2 and PubMed (P) records p1, p2. • Extract annotations: GO (G) terms for E records and MeSH (M) terms for P records. • Generate termlink instances (figure labels: e1, p1, g1, m1, g2, m2). W.-J. Lee

  20. Links versus Termlinks • Entrez Gene records (e1, e2, e3) are annotated with GO terms (g1 to g6); PubMed records (p1, p2, p3) are annotated with MeSH terms (m1 to m4); links connect Entrez Gene records to PubMed records. • 1 link: (e1, p2) • 4 termlinks: (g1, m2, e1, p2) (g1, m3, e1, p2) (g6, m2, e1, p2) (g6, m3, e1, p2), because e1 is annotated with g1 and g6 and p2 is annotated with m2 and m3. W.-J. Lee
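
The expansion from links to termlinks can be sketched in Python as a cross product of the annotations on the two linked records. This is a minimal illustration, not the LSLink implementation: the dictionaries, the function name generate_termlinks and the record identifiers are hypothetical, chosen to mirror the figure (e1 annotated with g1 and g6, p2 annotated with m2 and m3).

```python
from itertools import product

# Hypothetical annotations mirroring the slide's figure:
# Entrez Gene record e1 carries GO terms g1 and g6,
# PubMed record p2 carries MeSH terms m2 and m3.
go_annotations = {"e1": ["g1", "g6"]}
mesh_annotations = {"p2": ["m2", "m3"]}
links = [("e1", "p2")]


def generate_termlinks(links, go_annotations, mesh_annotations):
    """Expand each (gene, publication) link into (g, m, e, p) termlinks."""
    termlinks = []
    for e, p in links:
        go_terms = go_annotations.get(e, [])
        mesh_terms = mesh_annotations.get(p, [])
        # One termlink per (GO term, MeSH term) pair on the linked records.
        for g, m in product(go_terms, mesh_terms):
            termlinks.append((g, m, e, p))
    return termlinks


termlinks = generate_termlinks(links, go_annotations, mesh_annotations)
```

A record without annotations on either side contributes no termlinks, which matches the later step of filtering out records without GO annotations or links to PubMed.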

  21. Example Links and Termlinks • Entrez Gene GeneID: 672 (GO: DNA repair; GO: positive regulation of DNA repair), shown with PubMed PMID: 12242698 (MeSH: BRCA1 Protein; MeSH: BRCA2 Protein) • Entrez Gene GeneID: 675 (GO: DNA repair; GO: mitotic checkpoint), shown with PubMed PMID: 10749118 (MeSH: Mitosis; MeSH: Neoplasm Proteins) • Legend: data resource, data record, link, controlled vocabulary term, termlink W.-J. Lee

  23. Human Genes Background Dataset • Retrieve all active human gene records in Entrez Gene. • Filter out records that have been replaced or discontinued. • Filter out records without GO annotations or links to PubMed. • Extract their GO annotations. • Follow all links from these records to PubMed records. • Extract MeSH annotations for the PubMed records reached in the prior step. • Use the most relevant Descriptors/Qualifiers identified as Major Topic. • Filter with selected Semantic Types. W.-J. Lee

  24. Statistics on Human Genes Background Dataset W.-J. Lee

  25. User Query Dataset • We support multiple user scenarios for querying the background dataset. • A user query dataset is a subset of the background dataset that is of interest to scientists. • Individual human gene record: APOE, CFTR, etc. • BRCA1/BRCA2-containing complex: Early Onset Breast Cancer • Human genes and genetic disorders: Breast Cancer, Colorectal Cancer, Prostate Cancer, etc. • (G,M,E′,P′) ⊆ (G,M,E,P) where E′ ⊆ E and P′ ⊆ P W.-J. Lee
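
As a rough sketch, a user query dataset can be obtained by restricting the background termlinks to a chosen set of gene records and, optionally, publication records. The function name user_query_dataset and the sample tuples are hypothetical; only the subset relation itself comes from the slide.

```python
def user_query_dataset(background, genes, publications=None):
    """Return the termlinks (g, m, e, p) restricted to the given records.

    `genes` plays the role of E' and `publications` the role of P';
    passing publications=None keeps every publication.
    """
    return [
        (g, m, e, p)
        for g, m, e, p in background
        if e in genes and (publications is None or p in publications)
    ]


# Hypothetical background dataset of termlinks (g, m, e, p).
background = [
    ("g1", "m1", "e1", "p1"),
    ("g1", "m2", "e1", "p2"),
    ("g2", "m2", "e2", "p2"),
    ("g3", "m3", "e3", "p3"),
]

query = user_query_dataset(background, genes={"e1", "e2"}, publications={"p2"})
```

Because the function only filters, the result is always a subset of the background dataset.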

  26. Example Human Gene User Query Datasets W.-J. Lee

  27. Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Two Classes of Metrics • Datasets and Subsets for Evaluation of Metrics • Distribution of Confidence Scores and P-values • Agreement and Disagreement Analysis • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

  28. Two Classes of Metrics Used to Identify Potentially Meaningful Associations • Association rule mining. • Used in data mining. • Support and confidence scores [Agrawal et al. 1993]. • Hypergeometric distribution. • Used in hypothesis testing. • P-value [Sokal and Rohlf 1969]. W.-J. Lee

  29. Definition of Probabilities • Term probability: • Link probability: • Conditional probability: W.-J. Lee

  30. Association Rule Mining • Support score reflects the probability of an association annotated with some pair of CV terms. • Confidence score reflects the conditional probability of an association annotated with some pair of CV terms, given that associations are annotated with either of the CV terms. W.-J. Lee
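
The formulas on these slides did not survive in the transcript, so the following Python sketch only illustrates the verbal definitions above. It assumes that support is the fraction of termlinks carrying the pair (g, m), and that confidence is the count of the pair divided by the count of termlinks carrying either g or m. Both readings are assumptions, the function name is hypothetical, and the term-frequency correction and log operator of the next slide are not included.

```python
from collections import Counter


def support_confidence(termlinks):
    """Map each (g, m) pair to (support, confidence) over (g, m, e, p) termlinks."""
    n_total = len(termlinks)
    pair_count = Counter((g, m) for g, m, _e, _p in termlinks)
    go_count = Counter(g for g, _m, _e, _p in termlinks)
    mesh_count = Counter(m for _g, m, _e, _p in termlinks)

    scores = {}
    for (g, m), n_gm in pair_count.items():
        # Termlinks annotated with g or with m (inclusion-exclusion).
        n_either = go_count[g] + mesh_count[m] - n_gm
        scores[(g, m)] = (n_gm / n_total, n_gm / n_either)
    return scores


# Hypothetical termlinks for illustration only.
example = [
    ("g1", "m1", "e1", "p1"),
    ("g1", "m1", "e2", "p2"),
    ("g1", "m2", "e1", "p3"),
    ("g2", "m1", "e3", "p1"),
]
scores = support_confidence(example)
```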

  31. Support and Confidence with Correction • We incorporate a term-frequency correction and apply a log operator (both are novel to our research). W.-J. Lee

  32. Hypergeometric Distribution • P-value tests the over-representation of an association in a user query dataset. W.-J. Lee
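
An over-representation test of this kind is commonly computed as the upper tail of a hypergeometric distribution. The sketch below shows that standard calculation in Python (3.8 or later, for math.comb). How LSLink defines the four counts is not stated in this transcript, so the parameter meanings given in the docstring are assumptions.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k) for a hypergeometric variable X.

    Assumed meaning of the counts:
      N: termlinks in the background dataset
      K: background termlinks carrying the association
      n: termlinks in the user query dataset
      k: user query termlinks carrying the association
    """
    upper = min(K, n)
    total = comb(N, n)
    # comb returns 0 when n - i exceeds N - K, so impossible draws drop out.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: the association appears in 5 of 10 background
# termlinks and in all 5 termlinks of the query dataset.
p = hypergeom_pvalue(N=10, K=5, n=5, k=5)
```

A small P-value means the association occurs in the user query dataset more often than random sampling from the background would suggest.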

  34. Example User Query Datasets for Evaluation of Metrics W.-J. Lee

  35. Relationship among Subsets of Associations in a User Query Dataset User query dataset: early onset breast cancer in human W.-J. Lee

  37. Distribution of Confidence Scores for Early Onset Breast Cancer in Human • Confidence scores of most associations in the singleton and local-non-singleton subsets are higher than 3. W.-J. Lee

  38. Distribution of P-values for Early Onset Breast Cancer in Human • P-values in the singleton and local-non-singleton subsets appear as a step function. W.-J. Lee

  40. Overlap between Confidence Score and P-value Ranks • For 50%, we observe that the overlap is significant, ranging from 83.8% (F5) to 92.6% (CTNNB1). W.-J. Lee

  41. Overlap between Top-X Confidence Score and Top-K% P-value Ranks for Early Onset Breast Cancer in Human • The overlap between the Top-X confidence score ranks and the Top-20% P-value ranks is mostly larger than 50% (of X). W.-J. Lee
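
One plausible way to compute such an overlap is sketched below in Python: take the X associations with the highest confidence scores, take the top K% of associations by smallest P-value, and report the shared fraction relative to X. The function name, the tie handling (plain sort order) and the sample scores are all hypothetical.

```python
import math


def top_overlap(confidence, pvalue, top_x, top_k_percent):
    """Fraction of the top-X confidence associations found in the top-K% by P-value."""
    # Higher confidence ranks first; smaller P-value ranks first.
    by_conf = sorted(confidence, key=confidence.get, reverse=True)[:top_x]
    n_top = math.ceil(len(pvalue) * top_k_percent / 100)
    by_p = set(sorted(pvalue, key=pvalue.get)[:n_top])
    if not by_conf:
        return 0.0
    shared = sum(1 for assoc in by_conf if assoc in by_p)
    return shared / len(by_conf)


# Hypothetical scores for four associations.
confidence = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
pvalue = {"a": 0.01, "b": 0.5, "c": 0.02, "d": 0.9}
overlap = top_overlap(confidence, pvalue, top_x=2, top_k_percent=50)
```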

  42. Kendall’s τ between Confidence Score and P-value Ranks • For 50%, we observe that Kendall’s τ is smaller than 0.5, ranging from 0.452 (F5) to 0.485 (FGD4). W.-J. Lee
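
For reference, Kendall's rank correlation counts concordant and discordant pairs between two rankings. The Python sketch below implements the simplest variant (tau-a, which ignores ties). The slide does not say which variant was used, so this choice is an assumption; because the P-values are reported to form a step function, a tie-aware variant such as tau-b may be closer to what was actually computed.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equally long sequences of ranks or scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of three associations under the two metrics.
tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
```

A value of 1 means the two rankings agree on every pair, and a value of -1 means they disagree on every pair.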

  43. Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Discovery Tool • Human Expert Evaluation • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

  44. Select A Human Gene Symbol http://www.cbcb.umd.edu/research/lslink/lodgui/ W.-J. Lee

  45. Select CV Type: GO or MeSH W.-J. Lee

  46. Select A GO or MeSH Term W.-J. Lee

  47. View Associations as A Group W.-J. Lee

  48. Association with the Highest Score W.-J. Lee

  49. Association around Average Score W.-J. Lee

  50. Associations above A Cutoff Score W.-J. Lee
