An approach to aid automatic annotation of databases: extracting gene function from literature and linking GeneRIFs with GO terms by computing the relevance and similarity of sentences. Results are reported as precision and recall. A second study identifies GO terms containing ChEBI entities and raises the question of how to validate potential relationships, along with challenges such as naming ambiguity. Disease text classification and term disambiguation in integrative genomics are also discussed.
Relevance Detection Approach to Gene Annotation • Aid to automatic annotation of databases • Annotation flow • Extraction of molecular function of a gene from literature • Then annotation of this function with a term in a controlled vocabulary • Premise • If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them
Data • GeneRIF/GO term pairs • Paired if reference same MEDLINE article • Manually filtered for obvious errors • 550 pairs from 335 distinct genes • GO concept = GO term + definition • GeneRIFs and GO concepts too short for simple keyword matching • Treated as an IR problem • Similar to TREC novelty track • Compute relevance and similarity of 2 sentences
Document set - TREC Genomics 2003 docs • Each sentence within GeneRIF/GO concept pair treated as IR query • Similarity between the 2 computed based on top 200 docs retrieved by each query • Best Recall = 78.2% (prec = 22.1%) • Best Precision = 66.2% (rec = 46.9%)
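The slides do not say which similarity measure was applied to the two retrieved document sets. As a minimal sketch, assuming a Jaccard overlap of the top 200 document IDs and a hypothetical decision threshold, the premise could look like this (`jaccard`, `link`, and the `threshold` value are illustrative names and numbers, not from the original work):

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap of two collections of document IDs (assumed measure)."""
    a, b = set(docs_a), set(docs_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link(generif_docs, go_docs, k=200, threshold=0.1):
    """Compare the top-k ranked document IDs retrieved by each query.

    generif_docs / go_docs: ranked lists of document IDs returned by some
    IR engine for the GeneRIF sentence and the GO concept sentence.
    threshold is a hypothetical cut-off; raising it would be expected to
    trade recall for precision.
    """
    score = jaccard(generif_docs[:k], go_docs[:k])
    return score, score >= threshold


# Toy ranked lists standing in for real retrieval results
generif_hits = ["d1", "d2", "d3", "d4"]
go_hits = ["d3", "d4", "d5", "d6"]
score, linked = link(generif_hits, go_hits)
```

Sweeping the threshold is one plausible way to move between a high-recall and a high-precision operating point like the two reported above.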
GO Dependence Relations • Previous work (PSB) • Using substring matching between GO codes • Derived from annotation databases, using vector space models, co-occurrence, association rule-mining. • ChEBI: www.ebi.ac.uk/chebi/ • Chemical Entities of Biological Interest • Preferred names + synonyms • IS_A (poly)hierarchy
Methods • String matching • If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship • First order relationship • ChEBI term must be whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity • Also, in a dependence relationship with the ancestors • Second order relationship
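The whole-word rule for first-order relationships can be sketched with a regular expression. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, matching is case-insensitive by choice, and second-order (ancestor) relationships are not covered.

```python
import re
from itertools import combinations


def contains_entity(go_term, entity):
    """True if `entity` occurs in `go_term` as a whole word, i.e. it is not
    immediately preceded or followed by a letter or digit."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, chebi_entities):
    """Pairs of GO term names that share at least one ChEBI entity name."""
    pairs = set()
    for entity in chebi_entities:
        hits = sorted(t for t in go_terms if contains_entity(t, entity))
        pairs.update(combinations(hits, 2))
    return pairs


# Illustrative term names; "carbon" must not match inside "carbonic"
terms = [
    "carbonic anhydrase activity",
    "carbon-oxygen lyase activity",
    "carbon fixation",
]
pairs = first_order_pairs(terms, ["carbon"])
```

Here the hyphen in "carbon-oxygen" counts as punctuation, so that term matches, while "carbonic" is rejected because the entity is followed by a letter.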
Results • 55% of GO terms contain a ChEBI entity • 56% of dependent pairs with a ChEBI term found in PSB study were identified in this study • Less than 1% of GO term pairs found in this study were identified by the PSB study • Issues • How to validate potential relationships? • Usual naming/synonym ambiguity! • Substrings not used: imidazolonepropionase
Disease Text Classification • Task: Classification of text into one of 26 disease classes • Used full text and weighted sections according to information distribution published by other groups
Data Preparation • HTML full text documents, semi-automatic section division • Tokenisation, stemming, stop word filtering, part-of-speech tagging • Dataset: 21*25 positive full text articles, 33 negative full text articles • 10-fold cross-validation • Nearest centroid classifier
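A nearest centroid classifier assigns a document to the class whose average training vector is closest. The sketch below assumes raw term-count vectors and cosine similarity, neither of which is stated in the slides; the class labels and tokens are toy data.

```python
import math
from collections import Counter


def centroid(vectors):
    """Average a list of term-count vectors into one centroid."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labelled_docs):
    """labelled_docs: iterable of (class_label, list_of_tokens)."""
    by_class = {}
    for label, tokens in labelled_docs:
        by_class.setdefault(label, []).append(Counter(tokens))
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def classify(centroids, tokens):
    """Return the label of the centroid most similar to the document."""
    vec = Counter(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data (already tokenised and stemmed in a real pipeline)
model = train([
    ("asthma", ["airway", "inflammation"]),
    ("asthma", ["airway", "wheeze"]),
    ("diabetes", ["insulin", "glucose"]),
])
predicted = classify(model, ["insulin", "resistance"])
```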
Results • Baseline: 56% F-score • Additional preprocessing: 67% • 10,000 stopword filter • Only nouns • Section weighting: 74% • Abstract and Introduction weighted highest
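Section weighting can be illustrated by scaling each term's count by the weight of the section it appears in before the document vector is built. The weights below are hypothetical placeholders; the slides only state that Abstract and Introduction were weighted highest.

```python
from collections import Counter

# Hypothetical weights: only the ordering (abstract and introduction
# highest) comes from the slides, the numbers are made up for illustration.
SECTION_WEIGHTS = {
    "abstract": 3.0,
    "introduction": 3.0,
    "methods": 1.0,
    "results": 1.0,
    "discussion": 1.0,
}


def weighted_vector(sections, weights=SECTION_WEIGHTS, default=1.0):
    """Build a term-weight vector from {section_name: list_of_tokens}.

    Each token occurrence contributes the weight of its section, so terms
    from highly weighted sections dominate the document vector.
    """
    vec = Counter()
    for name, tokens in sections.items():
        w = weights.get(name, default)
        for tok in tokens:
            vec[tok] += w
    return dict(vec)


doc = {"abstract": ["asthma"], "methods": ["asthma", "assay"]}
vec = weighted_vector(doc)
```

The resulting vector can be fed to whichever classifier is in use in place of plain term counts.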
From Nonsense to Sense in Healthcare Questions • Diagnosis, Prognosis, Therapy, Prevention • Medicine finds disease mechanisms by first finding cures • Currently by trial and error • Try drug then test • Future - test then try drug • Biomarkers • Normality -> dysfunction -> disease • There are prognostic markers before any diagnostic markers
Integrative Genomics • Looking for hidden connections over wide field, e.g. • Immune system works too hard = rheumatoid arthritis • Immune system doesn’t work hard enough = infectious diseases
Term Disambiguation • 40% of genes have homonym problem • For 300 genes = 1 million MEDLINE articles • After disambiguation = 60,000 articles • 93% accuracy in assigning correct ID to ambiguous genes • Use contextual fingerprints: • Experts choose 5 abstracts about a concept • Fingerprint then created for that concept
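The slides do not describe how a fingerprint is built or compared. One possible reading, sketched here under that caveat, is a bag-of-words profile pooled from the expert-chosen abstracts, with an ambiguous mention assigned to the concept whose profile is most similar to the surrounding text. The gene IDs, function names, and the use of cosine similarity are all assumptions.

```python
import math
from collections import Counter


def fingerprint(abstracts):
    """Pool word counts from a few expert-chosen abstracts for one concept."""
    fp = Counter()
    for text in abstracts:
        fp.update(text.lower().split())
    return fp


def cosine(a, b):
    """Cosine similarity between two word-count profiles."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_id(context, fingerprints):
    """Pick the concept ID whose fingerprint best matches the context text."""
    ctx = Counter(context.lower().split())
    return max(fingerprints, key=lambda gid: cosine(ctx, fingerprints[gid]))


# Hypothetical IDs and toy abstracts; the slides use 5 abstracts per concept
fingerprints = {
    "GENE_A": fingerprint(["kinase signalling in tumour cells",
                           "tumour growth kinase"]),
    "GENE_B": fingerprint(["retina photoreceptor development"]),
}
best = assign_id("photoreceptor loss in the retina", fingerprints)
```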