Explore how ontology-based annotation enhances the querying of Tissue Microarray (TMA) data, enabling better search capabilities and data integration. The process involves mapping text to ontology terms, aligning with NCI Thesaurus and SNOMED-CT, and improving alignment through ontology graph structure.
Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)
Tissue Microarrays www.nature.com/clinicalpractice/onc
Stanford tissue microarray database http://tma.stanford.edu/tma_portal/
Key analysis issue • Tissue microarrays query a large number of samples/patients for one protein. • The key query dimension in TMA data is the tissue sample. • Because TMAD lacks a commonly used ontology to describe the diagnosis [or annotations] for a given TMA sample, it is not easy to perform such a query.
Ontologies considered • The NCI Thesaurus, version 05.09g • The SNOMED-CT, from UMLS 2005 AA
Available annotations for a block • Each donor block in the TMA has semi-structured text associated with it.
Map text to ontology terms • Make all possible permutations • Rules to weed out bad permutations • Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms) • Rules to weed out bad matches
Example: the term-set "Prostate Carcinoma Adeno intraductal" gives 24 permutations (Prostate Carcinoma Adeno intraductal; Carcinoma Prostate intraductal Adeno; Adeno Carcinoma intraductal Prostate; Prostate intraductal Adeno Carcinoma; …), matched to Prostate_Ductal_Adenocarcinoma
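The permutation-and-match steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used for TMAD: the `LEXICON` dictionary is a hypothetical stand-in for the NCI Thesaurus and SNOMED-CT labels and synonyms, the synonym entry in it is invented so that the example has something to match, and the rules for weeding out bad permutations and bad matches are left out because the slides do not specify them.

```python
from itertools import permutations

# Hypothetical toy lexicon: normalized label or synonym -> ontology term.
# A real lexicon would be built from the NCI Thesaurus and SNOMED-CT releases.
LEXICON = {
    "prostate ductal adenocarcinoma": "Prostate_Ductal_Adenocarcinoma",
    # Assumed synonym, invented for this illustration only.
    "prostate intraductal adeno carcinoma": "Prostate_Ductal_Adenocarcinoma",
}


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def candidate_strings(words):
    """Return every ordering of the words as a normalized string."""
    return [normalize(" ".join(p)) for p in permutations(words)]


def map_term_set(words, lexicon=LEXICON):
    """Return the ontology terms whose label or synonym exactly
    matches at least one permutation of the words."""
    return {lexicon[c] for c in candidate_strings(words) if c in lexicon}


words = ["Prostate", "Carcinoma", "Adeno", "intraductal"]
print(len(candidate_strings(words)))  # 4! = 24 orderings
print(map_term_set(words))
```

Because the number of orderings grows factorially with the number of words, a filter on permutations (the second bullet above) matters in practice for longer term-sets.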
Results and validation • Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets. • 577 term-sets (6614 records) matched the NCI Thesaurus. • 365 term-sets (3465 records) matched SNOMED-CT. • In total, 6871 records (80% of the annotated records in TMAD; 641 distinct term-sets) mapped to one or more ontology terms.
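The coverage figures above can be checked with a short calculation. The sketch below assumes that the 8495 records and 783 term-sets are the right denominators and that the totals (6871 records, 641 term-sets) are the union of the NCI Thesaurus and SNOMED-CT matches; under those assumptions, inclusion-exclusion gives the number matched by both ontologies. The record coverage comes out near 80.9%, consistent with the 80% on the slide if that figure was rounded down.

```python
# Counts taken from the slide.
total_records, total_term_sets = 8495, 783
nci_records, nci_term_sets = 6614, 577
snomed_records, snomed_term_sets = 3465, 365
mapped_records, mapped_term_sets = 6871, 641


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


record_coverage = percent(mapped_records, total_records)
term_set_coverage = percent(mapped_term_sets, total_term_sets)

# Inclusion-exclusion, assuming the totals are the union of both matches.
both_term_sets = nci_term_sets + snomed_term_sets - mapped_term_sets
both_records = nci_records + snomed_records - mapped_records

print(f"records mapped: {record_coverage:.1f}%")
print(f"term-sets mapped: {term_set_coverage:.1f}%")
print(f"term-sets matched in both ontologies: {both_term_sets}")
print(f"records matched in both ontologies: {both_records}")
```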
[Figure legend] Parent and sibling nodes with data (burlywood) • Child nodes with no data (grey) • Child nodes with data (yellow)
How does ontology-based annotation help? • Better search: we can retrieve, for example, samples of all retroperitoneal tumors or all malignant uterine neoplasms. • Better integration of data: we can correlate gene expression with protein expression across multiple tumor types. • Tissue microarray data from TMAD • Gene expression data from GEO
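The "better search" point rests on subsumption: a query for a general term should also return samples annotated with any of its more specific descendants. The sketch below shows that idea on a toy hierarchy. The term names, the `IS_A` map, and the sample annotations are all hypothetical and are not taken from the NCI Thesaurus, SNOMED-CT, or TMAD.

```python
from collections import defaultdict

# Hypothetical is-a hierarchy: child term -> list of parent terms.
IS_A = {
    "Uterine_Carcinosarcoma": ["Malignant_Uterine_Neoplasm"],
    "Endometrial_Adenocarcinoma": ["Malignant_Uterine_Neoplasm"],
    "Malignant_Uterine_Neoplasm": ["Uterine_Neoplasm"],
    "Uterine_Leiomyoma": ["Uterine_Neoplasm"],
}

# Hypothetical sample annotations: sample ID -> ontology term.
ANNOTATIONS = {
    "sample_1": "Endometrial_Adenocarcinoma",
    "sample_2": "Uterine_Leiomyoma",
    "sample_3": "Uterine_Carcinosarcoma",
    "sample_4": "Malignant_Uterine_Neoplasm",
}


def descendants(term, is_a):
    """Return every term below `term` in the is-a hierarchy."""
    children = defaultdict(list)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].append(child)
    found, stack = set(), [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def samples_for(term, is_a=IS_A, annotations=ANNOTATIONS):
    """Return samples annotated with `term` or any of its descendants."""
    wanted = descendants(term, is_a) | {term}
    return sorted(s for s, t in annotations.items() if t in wanted)


print(samples_for("Malignant_Uterine_Neoplasm"))
print(samples_for("Uterine_Neoplasm"))
```

With free-text annotations, the first query would need every spelling of every malignant uterine diagnosis listed by hand; with ontology terms, the hierarchy supplies that list.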
Integrating mRNA and protein expression [Figure: two panels, Genes vs. Samples and Proteins vs. Samples]
Steps in Alignment • Anchor identification • Identify similar class labels in the ontologies to be aligned • Usually done by string matching • Ontology structure • Use the “similar” classes as anchors and examine the local [graph] structure around them to inform the “similarity” metric
[Figure: two ontology graphs, one rooted at R with nodes t1 to t7, one rooted at Root with nodes Term-1 to Term-5]
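The two alignment steps above can be illustrated with a small sketch. Anchors are found by exact matching of normalized labels, and a structural score is then computed for a pair of classes from their anchored neighbors. The Jaccard overlap used for that score is an assumption chosen for illustration; the slides do not say which similarity metric is used. The two toy ontologies, their IDs, and their labels are hypothetical.

```python
def normalize(label):
    """Lower-case a label and treat '_' and '-' as spaces."""
    return label.lower().replace("_", " ").replace("-", " ").strip()


def find_anchors(labels_a, labels_b):
    """Map class IDs in A to class IDs in B with identical normalized labels."""
    index_b = {normalize(lbl): cid for cid, lbl in labels_b.items()}
    return {
        cid: index_b[normalize(lbl)]
        for cid, lbl in labels_a.items()
        if normalize(lbl) in index_b
    }


def neighbors(cid, parent_of):
    """Return the parent and children of a class in a child -> parent map."""
    near = {child for child, parent in parent_of.items() if parent == cid}
    if cid in parent_of:
        near.add(parent_of[cid])
    return near


def structural_similarity(a, b, parents_a, parents_b, anchors):
    """Jaccard overlap between a's anchored neighbors (mapped into B)
    and b's neighbors. An assumed metric, for illustration only."""
    mapped = {anchors[n] for n in neighbors(a, parents_a) if n in anchors}
    near_b = neighbors(b, parents_b)
    union = mapped | near_b
    return len(mapped & near_b) / len(union) if union else 0.0


# Hypothetical toy ontologies: class ID -> label, and child -> parent.
labels_a = {"a1": "Neoplasm", "a2": "Carcinoma",
            "a3": "Adenocarcinoma", "a4": "Ductal Carcinoma"}
parents_a = {"a2": "a1", "a3": "a2", "a4": "a2"}
labels_b = {"b1": "neoplasm", "b2": "Malignant epithelial tumor",
            "b3": "adenocarcinoma", "b4": "ductal_carcinoma"}
parents_b = {"b2": "b1", "b3": "b2", "b4": "b2"}

anchors = find_anchors(labels_a, labels_b)
print(anchors)
print(structural_similarity("a2", "b2", parents_a, parents_b, anchors))
print(structural_similarity("a2", "b3", parents_a, parents_b, anchors))
```

In this toy case "Carcinoma" and "Malignant epithelial tumor" share no label text, so string matching alone misses them, but their anchored neighborhoods coincide and the structural score picks up the correspondence.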
We might improve alignment … • Ontology [graph] structure based step • Provide anchors from annotated data
[Figure: ontology graphs R (t1 to t7) and Root (Term-1 to Term-5); labels t5, S2, Term-5]
Summary Ability to map word-groups to ontology terms
Credits and acknowledgements • Pathology: Robert Marinelli, Matt van de Rijn • Medical Informatics: Kaustubh Supekar, Daniel Rubin, Mark Musen • Funding: NIH