Mining Associations in Annotated Biomedical Web

1 April 2009 Woei-Jyh (Adam) Lee, Ph.D. Center for Bioinformatics and Computational Biology University of Maryland, U.S.A. National Center for Biotechnology Information National Institutes of Health, U.S.A.

Research Experiences • New York University • Morgan Stanley • AT&T Labs, AT&T • Bell Laboratories, Lucent Technologies • University of Southern California • University of Maryland • National Institutes of Health, U.S.A. Research areas: Distributed computing, parallel computing; Web performance measurement; Component object model, quality of service; Fault tolerance, Internet protocols, policy based management; Media streaming, error correction, video on demand; Data management and mining, bioinformatics; Protein domain parsing, genomics and genetics W.-J. Lee

Mining associations in the annotated biomedical web Woei-Jyh (Adam) Lee, Ph.D. University of Maryland Department of Computer Science; and Institute for Advanced Computer Studies; and Center for Bioinformatics and Computational Biology

LSLink - Life Science Link • http://www.cbcb.umd.edu/research/lslink/ W.-J. Lee

LSLink - Life Science Link • Background • Many Web accessible data resources for biologists. • publications, genes, diseases, sequences, structures, … • Data records are linked. • genes linked to publications, diseases linked to genes, … • Data records are annotated with controlled vocabulary (CV) terms. • Entrez Gene with GO, PubMed with MeSH, … • Goal: to discover biologically meaningful and yet unknown associations between pairs of CV terms. W.-J. Lee

Entrez Gene W.-J. Lee

PubMed W.-J. Lee

Links from Entrez Gene to PubMed W.-J. Lee

Links from PubMed to Entrez Gene W.-J. Lee

Gene Ontology (GO) W.-J. Lee

Medical Subject Headings (MeSH) W.-J. Lee

GO Annotations in Entrez Gene W.-J. Lee

MeSH Annotations in PubMed W.-J. Lee

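Web of Entrez Gene, OMIM and PubMed [Figure: the data resources Entrez Gene, PubMed and OMIM, connected by links and annotated with controlled vocabularies. Labels in the figure: GO, Gene Nomenclature, MeSH, Lash CV, Clinical Synopsis, SNOMED CT. Legend: data resource, link, annotation, controlled vocabulary.] W.-J. Lee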

Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Methodology to Generate Datasets • Links and Termlinks • Background and User Query Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

Approach • Generate and analyze background and user query datasets. • Apply two classes of metrics (association rule mining and hypergeometric distribution) • Filter CV terms (by Major Topic, Semantic Type, etc.). • Rank association pairs of terms from two CVs. • Perform scientist evaluation. W.-J. Lee

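Methodology • Collect data records and links: Entrez Gene (E) records e1, e2 and PubMed (P) records p1, p2. • Extract annotations: GO (G) terms for E, MeSH (M) terms for P. • Generate termlink instances from the links (e1, p1, …) and the annotations (g1, m1, g2, m2, …). W.-J. Lee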

Links versus Termlinks [Figure: GO terms g1 to g6 annotate Entrez Gene records e1 to e3; MeSH terms m1 to m4 annotate PubMed records p1 to p3; links connect Entrez Gene records to PubMed records.] In the highlighted example, e1 is annotated with g1 and g6, and p2 is annotated with m2 and m3. 1 link: (e1, p2) 4 termlinks: (g1, m2, e1, p2) (g1, m3, e1, p2) (g6, m2, e1, p2) (g6, m3, e1, p2) W.-J. Lee
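The relation between one link and its termlinks can be sketched as a Cartesian product of the two annotation sets. This is a minimal illustration, not the LSLink implementation; the function name `generate_termlinks` and the dictionary layout are assumptions made for the example.

```python
from itertools import product

def generate_termlinks(links, go_annotations, mesh_annotations):
    """Return one (g, m, e, p) tuple per GO/MeSH term pair of every link (e, p).

    links: list of (entrez_gene_id, pubmed_id) pairs.
    go_annotations: dict mapping an Entrez Gene id to its GO terms (assumed layout).
    mesh_annotations: dict mapping a PubMed id to its MeSH terms (assumed layout).
    """
    termlinks = []
    for e, p in links:
        go_terms = go_annotations.get(e, [])
        mesh_terms = mesh_annotations.get(p, [])
        for g, m in product(go_terms, mesh_terms):
            termlinks.append((g, m, e, p))
    return termlinks

# The slide's example: e1 carries g1 and g6, p2 carries m2 and m3.
termlinks = generate_termlinks(
    links=[("e1", "p2")],
    go_annotations={"e1": ["g1", "g6"]},
    mesh_annotations={"p2": ["m2", "m3"]},
)
for t in termlinks:
    print(t)
```

A link whose records carry a and b annotations respectively contributes a × b termlinks, which is why the single link (e1, p2) expands to four.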

Example Links and Termlinks [Figure: Entrez Gene record GeneID: 672 (GO: DNA repair; GO: positive regulation of DNA repair) and PubMed record PMID: 12242698 (MeSH: BRCA1 Protein; MeSH: BRCA2 Protein); Entrez Gene record GeneID: 675 (GO: DNA repair; GO: mitotic checkpoint) and PubMed record PMID: 10749118 (MeSH: Mitosis; MeSH: Neoplasm Proteins). Legend: data resource, data record, link, controlled vocabulary term, termlink.] W.-J. Lee

Human Genes Background Dataset • Retrieve all active human gene records in Entrez Gene. • Filter out records that have been replaced or discontinued. • Filter out records without GO annotations or links to PubMed. • Extract their GO annotations. • Follow all links from these records to PubMed records. • Extract MeSH annotations for the PubMed records reached in the prior step. • Use the most relevant Descriptors/Qualifiers identified as Major Topic. • Filter with selected Semantic Types. W.-J. Lee
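The filtering steps above can be sketched over in-memory records. Everything about the record layout here is hypothetical (the keys `status`, `go_terms`, `pmids`, `major_topic`, `semantic_type`, and the semantic type labels); the real Entrez Gene and PubMed formats differ, so treat this only as an outline of the order of the filters.

```python
def build_background(gene_records, pubmed_records, allowed_semantic_types):
    """Apply the slide's filters and return (links, go_by_gene, mesh_by_pmid).

    gene_records: list of dicts with hypothetical keys
        gene_id, status ("live", "replaced", "discontinued"), go_terms, pmids.
    pubmed_records: dict mapping pmid -> list of MeSH annotation dicts with
        hypothetical keys term, major_topic (bool), semantic_type.
    """
    links, go_by_gene, mesh_by_pmid = [], {}, {}
    for rec in gene_records:
        # Drop replaced and discontinued records.
        if rec["status"] != "live":
            continue
        # Drop records without GO annotations or without links to PubMed.
        if not rec["go_terms"] or not rec["pmids"]:
            continue
        go_by_gene[rec["gene_id"]] = list(rec["go_terms"])
        for pmid in rec["pmids"]:
            links.append((rec["gene_id"], pmid))
            if pmid not in mesh_by_pmid:
                # Keep only Major Topic terms of a selected Semantic Type.
                mesh_by_pmid[pmid] = [
                    a["term"]
                    for a in pubmed_records.get(pmid, [])
                    if a["major_topic"] and a["semantic_type"] in allowed_semantic_types
                ]
    return links, go_by_gene, mesh_by_pmid

genes = [
    {"gene_id": "e1", "status": "live", "go_terms": ["g1", "g6"], "pmids": ["p2"]},
    {"gene_id": "e2", "status": "discontinued", "go_terms": ["g2"], "pmids": ["p1"]},
    {"gene_id": "e3", "status": "live", "go_terms": [], "pmids": ["p3"]},
]
pubmed = {
    "p2": [
        {"term": "m2", "major_topic": True, "semantic_type": "T1"},
        {"term": "m3", "major_topic": False, "semantic_type": "T1"},
        {"term": "m4", "major_topic": True, "semantic_type": "T9"},
    ]
}
links, go_by_gene, mesh_by_pmid = build_background(genes, pubmed, {"T1"})
print(links, go_by_gene, mesh_by_pmid)
```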

Statistics on Human Genes Background Dataset W.-J. Lee

User Query Dataset • We support multiple user scenarios for querying the background dataset. • A user query dataset is the subset of the background dataset that is of interest to a scientist. • Individual human gene record: APOE, CFTR, etc. • BRCA1/BRCA2-containing complex: Early Onset Breast Cancer • Human genes and genetic disorders: Breast Cancer, Colorectal Cancer, Prostate Cancer, etc. • (G, M, E′, P′) ⊆ (G, M, E, P) where E′ ⊆ E and P′ ⊆ P W.-J. Lee
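The subset relation on this slide can be read as a filter on termlinks: keep those whose gene record is in E′ and whose publication is in P′. The sketch below assumes termlinks are stored as (g, m, e, p) tuples; the function name `user_query_dataset` is illustrative only.

```python
def user_query_dataset(background, genes, publications=None):
    """Select the termlinks (g, m, e, p) with e in E' and p in P'.

    background: list of (g, m, e, p) termlinks.
    genes: the set E' of Entrez Gene ids chosen by the user.
    publications: the set P' of PubMed ids, or None to keep every
        publication linked from the chosen genes.
    """
    return [
        (g, m, e, p)
        for g, m, e, p in background
        if e in genes and (publications is None or p in publications)
    ]

background = [
    ("g1", "m2", "e1", "p2"),
    ("g6", "m3", "e1", "p2"),
    ("g2", "m1", "e2", "p1"),
    ("g2", "m4", "e2", "p3"),
]
query = user_query_dataset(background, genes={"e2"}, publications={"p1"})
print(query)
```

Because the query is built by filtering, it is a subset of the background by construction, which is what the over-representation test later relies on.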

Example Human Gene User Query Datasets W.-J. Lee

Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Two Classes of Metrics • Datasets and Subsets for Evaluation of Metrics • Distribution of Confidence Scores and P-values • Agreement and Disagreement Analysis • Association between Terms from Two CVs (human genes) • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

Two Classes of Metrics Used to Identify Potential Meaningful Associations • Association rule mining. • Used in data mining. • Support and confidence scores [Agrawal et al. 1993]. • Hypergeometric distribution. • Used in hypothesis testing. • P-value [Sokal and Rohlf 1969]. W.-J. Lee

Definition of Probabilities • Term probability: • Link probability: • Conditional probability: W.-J. Lee

Association Rule Mining • Support score reflects the probability of an association annotated with some pair of CV terms. • Confidence score reflects the conditional probability of an association annotated with some pair of CV terms, given that associations are annotated with either of the CV terms. W.-J. Lee
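One plausible reading of these two definitions, computed over termlink counts, is sketched below. The exact formulas on the original slides are not reproduced in this transcript, so the denominators here are assumptions: support divides by the total number of termlinks, and confidence divides by the number of termlinks carrying either term. The term-frequency correction and log transform mentioned on the next slide are not included.

```python
from collections import Counter

def support_confidence(termlinks):
    """Return {(g, m): (support, confidence)} from (g, m, e, p) termlinks.

    Assumed definitions (not taken from the slides):
      support    = count(g and m) / total termlinks
      confidence = count(g and m) / count(g or m)
    """
    total = len(termlinks)
    pair_count = Counter((g, m) for g, m, _, _ in termlinks)
    go_count = Counter(g for g, _, _, _ in termlinks)
    mesh_count = Counter(m for _, m, _, _ in termlinks)
    scores = {}
    for (g, m), both in pair_count.items():
        # Inclusion-exclusion: termlinks annotated with g or with m.
        either = go_count[g] + mesh_count[m] - both
        scores[(g, m)] = (both / total, both / either)
    return scores

termlinks = [
    ("g1", "m2", "e1", "p2"),
    ("g1", "m3", "e1", "p2"),
    ("g6", "m2", "e1", "p2"),
    ("g6", "m3", "e1", "p2"),
]
scores = support_confidence(termlinks)
print(scores[("g1", "m2")])
```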

Support and Confidence with Correction • We incorporate a term-frequency correction and apply a log operator (both are novel to our research). W.-J. Lee

Hypergeometric Distribution • P-value tests the over-representation of an association in a user query dataset. W.-J. Lee
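A standard one-sided hypergeometric test for over-representation can be sketched with the standard library alone. The parameter names below are assumptions for illustration: N termlinks in the background, K of them carrying the association, n termlinks in the user query dataset, and k of those carrying the association. Whether the original work counts termlinks or links is not stated in this transcript.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: size of the background dataset.
    K: background items carrying the association.
    n: size of the user query dataset (drawn from the background).
    k: user query items carrying the association.
    """
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum on their own.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

p = hypergeom_pvalue(N=10, K=3, n=4, k=2)
print(p)
```

A small P-value means the association appears in the user query dataset more often than random sampling from the background would explain.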

Example User Query Datasets for Evaluation of Metrics W.-J. Lee

Relationship among Subsets of Associations in a User Query Dataset User query dataset: early onset breast cancer in human W.-J. Lee

Distribution of Confidence Scores for Early Onset Breast Cancer in Human Confidence scores of most associations in singleton and local-non-singleton subsets are higher than 3. W.-J. Lee

Distribution of P-values for Early Onset Breast Cancer in Human P-values in singleton and local-non-singleton subsets appear as a step function. W.-J. Lee

Overlap between Confidence Score and P-value Ranks For 50%, we observe that the overlap is significant and ranges from 83.8% (F5) to 92.6% (CTNNB1). W.-J. Lee

Overlap between Top-X Confidence Score and Top-K% P-value Ranks for Early Onset Breast Cancer in Human Overlap between Top-X confidence score and Top-20% P-value ranks is mostly larger than 50% (of X). W.-J. Lee
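The overlap measure on this slide can be sketched as the fraction of the X highest-confidence associations that also fall within the best K% of P-values. How ties are broken and how K% is rounded are not given in the transcript; this sketch rounds up and lets Python's stable sort keep tied associations in input order.

```python
from math import ceil

def top_overlap(confidence, pvalue, x, k_percent):
    """Fraction of the top-x confidence associations also in the top-k% by P-value.

    confidence: dict association -> confidence score (higher is better).
    pvalue: dict association -> P-value (lower is better).
    """
    top_conf = sorted(confidence, key=confidence.get, reverse=True)[:x]
    n_top = ceil(len(pvalue) * k_percent / 100)
    top_p = set(sorted(pvalue, key=pvalue.get)[:n_top])
    if not top_conf:
        return 0.0
    return sum(1 for a in top_conf if a in top_p) / len(top_conf)

confidence = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
pvalue = {"a": 0.01, "b": 0.5, "c": 0.02, "d": 0.9, "e": 0.03}
overlap = top_overlap(confidence, pvalue, x=2, k_percent=40)
print(overlap)
```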

Kendall’s τ between Confidence Score and P-value Ranks For 50%, we observe that Kendall’s τ is smaller than 0.5 and ranges from 0.452 (F5) to 0.485 (FGD4). W.-J. Lee
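For reference, Kendall's τ compares two rankings by counting concordant and discordant pairs. The sketch below computes the simple τ-a variant with no tie correction; the transcript does not say which variant the original analysis used, so this is only an illustration of the idea.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    xs, ys: equal-length sequences of ranks or scores for the same items.
    Tied pairs count as neither concordant nor discordant.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Ranks of four associations under two metrics; one adjacent pair is swapped.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

When comparing confidence scores with P-values, both inputs should first be converted to ranks in which 1 is best, since a high confidence score and a low P-value both indicate a strong association.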

Outline • Background & Motivation • Methodology to Generate LSLink Datasets • Metrics to Identify Meaningful Associations • Association between Terms from Two CVs (human genes) • Discovery Tool • Human Expert Evaluation • Semantic Knowledge in a CV Hierarchy (human diseases) • Semantics of Links (human genome map) • Related Work and Conclusion W.-J. Lee

Select A Human Gene Symbol http://www.cbcb.umd.edu/research/lslink/lodgui/ W.-J. Lee

Select CV Type: GO or MeSH W.-J. Lee

Select A GO or MeSH Term W.-J. Lee

View Associations as A Group W.-J. Lee

Association with the Highest Score W.-J. Lee

Association around Average Score W.-J. Lee

Associations above A Cutoff Score W.-J. Lee
