Gene Annotation Relevance Detection: A Novel Approach

Relevance Detection Approach to Gene Annotation • Aid to automatic annotation of databases • Annotation flow • Extraction of molecular function of a gene from literature • Then annotation of this function with a term in a controlled vocabulary • Premise • If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them

Data • GeneRIF/GO term pairs • Paired if they reference the same MEDLINE article • Manually filtered for obvious errors • 550 pairs from 335 distinct genes • GO concept = GO term + definition • GeneRIFs and GO concepts too short for simple keyword matching • Treated as an IR problem • Similar to TREC novelty track • Compute relevance and similarity of two sentences

Document set - TREC Genomics 2003 docs • Each sentence within a GeneRIF/GO concept pair treated as an IR query • Similarity between the two computed from the top 200 docs retrieved by each query • Best Recall = 78.2% (precision = 22.1%) • Best Precision = 66.2% (recall = 46.9%)
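
The slides do not say which measure was used to compare the two retrieved document sets. As a minimal sketch, assuming Jaccard overlap of the top-ranked document IDs and a hypothetical decision threshold (the names `jaccard`, `link_score` and `threshold` are illustrative, not from the talk), the premise could be expressed as:

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of document IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_score(generif_docs, go_docs, top_k=200, threshold=0.1):
    """Compare the top_k documents retrieved by a GeneRIF query and a GO
    concept query. The threshold is an assumed tuning parameter; raising it
    would be expected to trade recall for precision."""
    score = jaccard(generif_docs[:top_k], go_docs[:top_k])
    return score, score >= threshold


# Ranked document IDs as returned by some IR engine (made-up values).
score, linked = link_score(["d1", "d2", "d3"], ["d2", "d3", "d4"])
```

Any set- or rank-based similarity could be swapped in for `jaccard` without changing the overall flow.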

GO Dependence Relations • Previous work (PSB) • Using substring matching between GO codes • Derived from annotation databases, using vector space models, co-occurrence, association rule-mining. • ChEBI: www.ebi.ac.uk/chebi/ • Chemical Entities of Biological Interest • Preferred names + synonyms • IS_A (poly)hierarchy

Methods • String matching • If the same ChEBI entity is used within two GO codes, they are in a dependence relationship • First order relationship • ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity • Also, in a dependence relationship with the ancestors • Second order relationship
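
The whole-word rule for first order relationships can be sketched with a regular expression. This is an illustration under assumptions: the entity list, the GO term names and the function names are chosen for the example, and the second order (ancestor) step is not covered.

```python
import re
from itertools import combinations


def contains_entity(go_name, entity):
    """True if the entity occurs in the GO term name as a whole word,
    i.e. not directly preceded or followed by a letter or digit."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_name, flags=re.IGNORECASE) is not None


def first_order_pairs(go_names, entities):
    """Return (term_a, term_b, entity) for every pair of GO term names
    that mention the same entity as a whole word."""
    pairs = set()
    for entity in entities:
        hits = sorted(g for g in go_names if contains_entity(g, entity))
        for a, b in combinations(hits, 2):
            pairs.add((a, b, entity))
    return pairs


names = [
    "carbonic anhydrase activity",
    "carbon-oxygen lyase activity",
    "carbon utilization",
]
pairs = first_order_pairs(names, ["carbon"])
```

Here "carbonic" does not match "carbon" because the entity is followed by a letter, while "carbon-oxygen" matches because a hyphen counts as punctuation.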

Results • 55% of GO terms contain a ChEBI entity • 56% of dependent pairs with a ChEBI term found in PSB study were identified in this study • Less than 1% of GO term pairs found in this study were identified by the PSB study • Issues • How to validate potential relationships? • Usual naming/synonym ambiguity! • Substrings not used: imidazolonepropionase

Disease Text Classification • Task: Classification of text into one of 26 disease classes • Used full text and weighted sections according to information distribution published by other groups

Data Preparation • HTML full text documents, semi automatic section division • Tokenisation, Stemming, Stop word filtering, Part of speech tagging • Dataset: 21*25 positive full text articles, 33 negative full text articles • 10 fold cross validation • Nearest centroid classifier

Results • Baseline: 56% F-score • Additional preprocessing: 67% • 10,000 stopword filter • Only nouns • Section weighting: 74% • Abstract and Introduction weighted highest
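
A nearest centroid classifier with section weighting can be sketched in a few lines. The weights, section names, class labels and tokens below are assumptions for illustration; the slides only state that the Abstract and Introduction were weighted highest.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: only their ordering reflects the slides.
SECTION_WEIGHTS = {"abstract": 3.0, "introduction": 3.0}


def doc_vector(sections, weights=SECTION_WEIGHTS):
    """Build a weighted term vector from {section name: [tokens]}."""
    vec = Counter()
    for name, tokens in sections.items():
        w = weights.get(name, 1.0)  # unlisted sections get weight 1
        for tok in tokens:
            vec[tok] += w
    return vec


def cosine(u, v):
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train_centroids(labelled_vectors):
    """Average the vectors of each class into one centroid per class."""
    sums, counts = defaultdict(Counter), Counter()
    for label, vec in labelled_vectors:
        counts[label] += 1
        for term, x in vec.items():
            sums[label][term] += x
    return {lab: {t: x / counts[lab] for t, x in s.items()}
            for lab, s in sums.items()}


def classify(vec, centroids):
    """Assign the class whose centroid is most similar to the document."""
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))


train = [
    ("class_a", doc_vector({"abstract": ["insulin", "glucose"], "methods": ["assay"]})),
    ("class_b", doc_vector({"abstract": ["airway", "inflammation"], "methods": ["assay"]})),
]
centroids = train_centroids(train)
label = classify(doc_vector({"abstract": ["glucose"], "results": ["assay"]}), centroids)
```

In this sketch the tokens are assumed to have already been through the preprocessing listed above (stemming, stop word filtering, noun selection).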

From Nonsense to Sense in Healthcare Questions • Diagnosis, Prognosis, Therapy, Prevention • Medicine finds disease mechanisms by first finding cures • Currently by trial and error • Try drug then test • Future - test then try drug • Biomarkers • Normality -> dysfunction -> disease • There are prognostic markers before any diagnostic markers

Integrative Genomics • Looking for hidden connections over wide field, e.g. • Immune system works too hard = rheumatoid arthritis • Immune system doesn’t work hard enough = infectious diseases

Term Disambiguation • 40% of genes have homonym problem • For 300 genes = 1 million MEDLINE articles • After disambiguation = 60,000 articles • 93% accuracy in assigning correct ID to ambiguous genes • Use contextual fingerprints: • Experts choose 5 abstracts about a concept • Fingerprint then created for that concept
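
The slides do not describe how a fingerprint is built from the expert-chosen abstracts. One possible reading, shown purely as an assumption, keeps the terms shared by several of the abstracts and scores new text by how many of those terms it contains. The gene IDs, abstracts and the `min_docs` cutoff are hypothetical.

```python
from collections import Counter


def make_fingerprint(abstracts, min_docs=2):
    """Keep terms that occur in at least min_docs of the chosen abstracts,
    weighted by the fraction of abstracts containing them."""
    df = Counter()
    for text in abstracts:
        df.update(set(text.lower().split()))
    return {t: n / len(abstracts) for t, n in df.items() if n >= min_docs}


def score(text, fingerprint):
    """Sum the weights of fingerprint terms present in the text."""
    tokens = set(text.lower().split())
    return sum(w for term, w in fingerprint.items() if term in tokens)


def disambiguate(text, fingerprints):
    """Pick the gene ID whose fingerprint best matches the text."""
    return max(fingerprints, key=lambda gid: score(text, fingerprints[gid]))


fingerprints = {
    "GENE_A": make_fingerprint(["kinase signalling in muscle",
                                "muscle kinase activity"]),
    "GENE_B": make_fingerprint(["retina pigment cells",
                                "pigment in retina development"]),
}
best = disambiguate("the kinase is expressed in muscle tissue", fingerprints)
```

With five abstracts per concept, as in the slides, the same functions apply unchanged; only the input lists grow.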
