Learn about customizing gene taggers for BeeSpace, focusing on gene, sequence, protein, organism, and behavior recognition. Explore challenges and characteristics of gene recognition, existing tools like LingPipe and ABNER, and comparison of performance on various datasets. Discover lessons learned and strategies for BeeSpace customization through feature selection and maximum entropy models.
Customizing Gene Taggers for BeeSpace Jing Jiang jiang4@uiuc.edu March 9, 2005
Entity Recognition in BeeSpace • Types of entities we are interested in: • Genes • Sequences • Proteins • Organisms • Behaviors • … • Currently, we focus on genes BeeSpace
Input and Output • Input: free text (w/ simple XML tags) • <?xml version="1.0" encoding="UTF-8"?><Document id="1">…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to JH. …</Document> • Output: tagged text (XML format) • <?xml version="1.0" encoding="UTF-8"?><Document id="1">…<Sent><NP>We</NP> have <VP>cloned</VP> and <VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis mellifera ultraspiracle</Gene> (<Gene>AMUSP</Gene>)</NP> and <VP>examined</VP> <NP>its responses to JH</NP>.</Sent>…</Document>
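A minimal sketch of consuming the tagged output format above. The tag names (Document, Sent, NP, VP, Gene) come from the example; the helper name `extract_genes` and the inline sample string are illustrative, not part of the BeeSpace system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the tagged output format shown above.
tagged = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Document id="1"><Sent><NP>We</NP> have <VP>cloned</VP> and '
    '<VP>sequenced</VP> <NP>a cDNA encoding '
    '<Gene>Apis mellifera ultraspiracle</Gene> '
    '(<Gene>AMUSP</Gene>)</NP> and <VP>examined</VP> '
    '<NP>its responses to JH</NP>.</Sent></Document>'
)

def extract_genes(xml_text):
    """Collect the text of every <Gene> element, in document order."""
    root = ET.fromstring(xml_text)
    return [g.text for g in root.iter("Gene")]

genes = extract_genes(tagged)
print(genes)  # ['Apis mellifera ultraspiracle', 'AMUSP']
```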
Challenges • No complete gene dictionary • Many variations: • Acronyms: hyperpolarization-activated ion channel (Amih) • Synonyms: octopamine receptor (oa1, oar, amoa1) • Common English words: at (arctops), by (3R-B) • Different genes, or a gene and its protein, may share the same name/symbol
Automatic Gene Recognition:Characteristics of Gene Names • Capitalization (especially acronyms) • Numbers (gene families) • Punctuation: -, /, :, etc. • Context: • Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc. • Global: same noun phrase occurs several times in the same article
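The name characteristics listed above translate directly into classifier features. A sketch of such a feature extractor, assuming tokenized input; the feature names, cue-word list, and window size are invented for illustration.

```python
import re

# Local-context cue words, taken from the list above.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}

def features(tokens, i):
    """Surface and local-context features for the token at position i."""
    tok = tokens[i]
    return {
        "has_capital": any(c.isupper() for c in tok),
        "all_caps": tok.isupper() and len(tok) > 1,   # acronym-like
        "has_digit": any(c.isdigit() for c in tok),   # gene families
        "has_punct": bool(re.search(r"[-/:]", tok)),
        # local context: a cue word within a +/-2 token window
        "near_cue": any(
            tokens[j].lower() in CONTEXT_CUES
            for j in range(max(0, i - 2), min(len(tokens), i + 3))
            if j != i
        ),
    }

toks = "a cDNA encoding AMUSP was expressed".split()
print(features(toks, 3))
# {'has_capital': True, 'all_caps': True, 'has_digit': False,
#  'has_punct': False, 'near_cue': True}
```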
Existing Tools • KeX (Fukuda) • Based on hand-crafted rules • Recognizes proteins and other entities • Requires human effort; not easy to modify • ABNER & YAGI (Settles) • Based on conditional random fields (CRFs) to learn the “rules” • ABNER identifies and classifies different entities, including proteins, DNAs, RNAs, cells • YAGI recognizes genes and gene products • No retraining (models ship pre-trained)
Existing Tools (cont.) • LingPipe (Alias-i, Inc.) • Uses a generative statistical model based on word trigrams and tag bigrams • Can be trained • Has two trained models • Others • NLProt (SVM) • AbGene (rule-based) • GeneTaggerCRF (CRFs)
Comparison of Existing Tools • Performance on a few manually annotated, public data sets (protein names): • GENIA (2000 abstracts on “human & blood cell & transcription factor”) • Yapex (99 abstracts on “protein binding & interaction & molecular”) • UTexas (750 abstracts on “human”) • Performance on a honeybee sample data set: • Biosis search “apis mellifera gene”
Comparison of Existing Tools (cont.)
Comparison of Existing Tools (cont.) • KeX on honeybee data • False positives: company name, country name, etc. • Does not differentiate between genes, proteins, and other chemicals • YAGI on honeybee data • False negatives: occurrences of the same gene name are not all tagged • Entity types and boundary detection • LingPipe on honeybee data • Similar to YAGI
Lessons Learned • Machine learning methods outperform hand-crafted rule-based systems • Machine learning methods suffer from over-fitting • Existing tools need to be customized for BeeSpace • LingPipe is a good choice • There is still room for better feature selection • E.g., global context
Customization • Train LingPipe on a better training data set • Use fly (Drosophila) genes • F1 increased from 0.2207 to 0.7226 on held-out fly data • Tested on honeybee data: • Some gene names are learned (Record 13) • Some false positives are removed (proteins, RNAs) • Some false positives are introduced • The noisy training data can be further cleaned • E.g., exclude common English words
Customization (cont.) • Exploit more features such as global context • Occurrences of the same word/phrase should be tagged all positive or all negative • Differentiate between domain-independent features and domain-specific features • E.g., prefix “Am” is domain-specific for Apis mellifera • Features can be weighted based on their contribution across domains
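The global-context constraint above (all occurrences of a phrase share one label) can be applied as a post-processing step. A sketch of one simple way to do this, via a case-insensitive majority vote over a document's mentions; the function name and input format are assumptions, not the BeeSpace implementation.

```python
from collections import Counter, defaultdict

def enforce_consistency(mentions):
    """mentions: list of (phrase, label) pairs from one document.
    Relabel every occurrence of a phrase with its majority label."""
    votes = defaultdict(Counter)
    for phrase, label in mentions:
        votes[phrase.lower()][label] += 1
    return [
        (phrase, votes[phrase.lower()].most_common(1)[0][0])
        for phrase, _ in mentions
    ]

doc_mentions = [("AmUSP", "gene"), ("amusp", "gene"), ("AmUSP", "non-gene")]
print(enforce_consistency(doc_mentions))
# [('AmUSP', 'gene'), ('amusp', 'gene'), ('AmUSP', 'gene')]
```

A real tagger might instead add a document-level feature to the model rather than hard-overriding its decisions, since the vote can propagate errors.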
Maximum Entropy Model for Gene Tagging • Given an observation (a token or a noun phrase), together with its context, denoted as x • Predict y ∈ {gene, non-gene} • Maximum entropy model: P(y|x) = K exp(Σi λi fi(x, y)), where K is a normalizing constant • Typical f: • y = gene & candidate phrase starts with a capital letter • y = gene & candidate phrase contains digits • Estimate λi with training data
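A minimal numeric sketch of the model above: P(y|x) = K·exp(Σi λi fi(x, y)), with K chosen so the two labels sum to 1. The two feature functions mirror the "typical f" bullets; the weights here are made up, not trained.

```python
import math

# Binary feature functions, as in the slide: each fires only for y = gene.
def f_capital(x, y):
    return 1.0 if y == "gene" and x[0].isupper() else 0.0

def f_digit(x, y):
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

FEATURES = [f_capital, f_digit]
LAMBDAS = [1.2, 0.8]  # illustrative; would be estimated from training data

def p(y, x):
    """P(y|x) = K exp(sum_i lambda_i f_i(x, y)); K normalizes over y."""
    def score(label):
        return math.exp(sum(l * f(x, label) for l, f in zip(LAMBDAS, FEATURES)))
    z = score("gene") + score("non-gene")  # K = 1/z
    return score(y) / z

print(round(p("gene", "Amih2"), 3))  # 0.881  (both features fire)
```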
Plan: Customization with Feature Adaptation • λi: trained on large set of data in domain A (e.g., human or fly) • μi: trained on small set of data in domain B (e.g., bee) • λi′ = αi·λi + (1 − αi)·μi: used for domain B • αi: based on how useful fi is across different domains • Large αi if fi is domain-independent • Small αi if fi is domain-specific
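The adaptation plan is a per-feature interpolation of the two weight vectors. A sketch with invented numbers (the symbol names follow the slide; all values are illustrative):

```python
def adapt(lam, mu, alpha):
    """lambda'_i = alpha_i * lam_i + (1 - alpha_i) * mu_i, elementwise."""
    return [a * l + (1 - a) * m for l, m, a in zip(lam, mu, alpha)]

lam   = [1.2, 0.8, -0.5]   # lambda_i: trained on large domain-A corpus
mu    = [0.9, 1.5,  0.3]   # mu_i: trained on small domain-B (bee) corpus
alpha = [0.9, 0.5,  0.1]   # high alpha = domain-independent feature

print(adapt(lam, mu, alpha))
```

Note how a domain-specific feature (alpha = 0.1, e.g. the "Am" prefix) ends up weighted almost entirely by the small bee-domain estimate.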
Issues to Discuss • Definition of gene names: • Gene families? (e.g., cb1 gene family) • Entities with a gene name? (e.g., Ks-1 transcripts) • Difference between genes and proteins? • E.g., “CREB (cAMP response element binding protein)” and “AmCREB”? • How to evaluate the performance on honeybee data?
The End • Questions? • Thank You!