Learn about customizing gene taggers for BeeSpace, focusing on gene, sequence, protein, organism, and behavior recognition. Explore challenges and characteristics of gene recognition, existing tools like LingPipe and ABNER, and comparison of performance on various datasets. Discover lessons learned and strategies for BeeSpace customization through feature selection and maximum entropy models.
Customizing Gene Taggers for BeeSpace Jing Jiang jiang4@uiuc.edu March 9, 2005
Entity Recognition in BeeSpace • Types of entities we are interested in: • Genes • Sequences • Proteins • Organisms • Behaviors • … • Currently, we focus on genes BeeSpace
Input and Output • Input: free text (w/ simple XML tags) • <?xml version="1.0" encoding="UTF-8"?><Document id="1">…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to JH. …</Document> • Output: tagged text (XML format) • <?xml version="1.0" encoding="UTF-8"?><Document id="1">…<Sent><NP>We</NP> have <VP>cloned</VP> and <VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis mellifera ultraspiracle</Gene> (<Gene>AMUSP</Gene>)</NP> and <VP>examined</VP> <NP>its responses to JH</NP>.</Sent>…</Document>
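A minimal sketch of consuming the tagged output format above. The tag names (Document, Sent, NP, VP, Gene) come from the example; the helper name `extract_genes` and the inline sample string are illustrative, not part of the BeeSpace system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the tagged output format shown above.
tagged = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Document id="1"><Sent><NP>We</NP> have <VP>cloned</VP> and '
    '<VP>sequenced</VP> <NP>a cDNA encoding '
    '<Gene>Apis mellifera ultraspiracle</Gene> '
    '(<Gene>AMUSP</Gene>)</NP> and <VP>examined</VP> '
    '<NP>its responses to JH</NP>.</Sent></Document>'
)

def extract_genes(xml_text):
    """Collect the text of every <Gene> element, in document order."""
    root = ET.fromstring(xml_text)
    return [g.text for g in root.iter("Gene")]

genes = extract_genes(tagged)
print(genes)  # ['Apis mellifera ultraspiracle', 'AMUSP']
```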
Challenges • No complete gene dictionary • Many variations: • Acronyms: hyperpolarization-activated ion channel (Amih) • Synonyms: octopamine receptor (oa1, oar, amoa1) • Common English words: at (arctops), by (3R-B) • Different genes, or a gene and its protein, may share the same name/symbol
Automatic Gene Recognition:Characteristics of Gene Names • Capitalization (especially acronyms) • Numbers (gene families) • Punctuation: -, /, :, etc. • Context: • Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc. • Global: same noun phrase occurs several times in the same article
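The name characteristics listed above translate directly into classifier features. A sketch of such a feature extractor, assuming tokenized input; the feature names, cue-word list, and window size are invented for illustration.

```python
import re

# Local-context cue words, taken from the list above.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}

def features(tokens, i):
    """Surface and local-context features for the token at position i."""
    tok = tokens[i]
    return {
        "has_capital": any(c.isupper() for c in tok),
        "all_caps": tok.isupper() and len(tok) > 1,   # acronym-like
        "has_digit": any(c.isdigit() for c in tok),   # gene families
        "has_punct": bool(re.search(r"[-/:]", tok)),
        # local context: a cue word within a +/-2 token window
        "near_cue": any(
            tokens[j].lower() in CONTEXT_CUES
            for j in range(max(0, i - 2), min(len(tokens), i + 3))
            if j != i
        ),
    }

toks = "a cDNA encoding AMUSP was expressed".split()
print(features(toks, 3))
# {'has_capital': True, 'all_caps': True, 'has_digit': False,
#  'has_punct': False, 'near_cue': True}
```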
Existing Tools • KeX (Fukuda) • Based on hand-crafted rules • Recognizes proteins and other entities • Requires human effort; not easy to modify • ABNER & YAGI (Settles) • Based on conditional random fields (CRFs) to learn the “rules” • ABNER identifies and classifies different entities, including proteins, DNAs, RNAs, cells • YAGI recognizes genes and gene products • No retraining (models ship pre-trained)
Existing Tools (cont.) • LingPipe (Alias-i, Inc.) • Uses a generative statistical model based on word trigrams and tag bigrams • Can be trained • Has two trained models • Others • NLProt (SVM) • AbGene (rule-based) • GeneTaggerCRF (CRFs)
Comparison of Existing Tools • Performance on a few manually annotated, public data sets (protein names): • GENIA (2000 abstracts on “human & blood cell & transcription factor”) • Yapex (99 abstracts on “protein binding & interaction & molecular”) • UTexas (750 abstracts on “human”) • Performance on a honeybee sample data set: • Biosis search “apis mellifera gene”
Comparison of Existing Tools (cont.)
Comparison of Existing Tools (cont.) • KeX on honeybee data • False positives: company name, country name, etc. • Does not differentiate between genes, proteins, and other chemicals • YAGI on honeybee data • False negatives: occurrences of the same gene name are not all tagged • Entity types and boundary detection • LingPipe on honeybee data • Similar to YAGI
Lessons Learned • Machine learning methods outperform hand-crafted rule-based systems • Machine learning methods suffer from over-fitting • Existing tools need to be customized for BeeSpace • LingPipe is a good choice • There is still room for better feature selection • E.g., global context
Customization • Train LingPipe on a better training data set • Use fly (Drosophila) genes • F1 increased from 0.2207 to 0.7226 on held-out fly data • Tested on honeybee data: • Some gene names are learned (Record 13) • Some false positives are removed (proteins, RNAs) • Some false positives are introduced • The noisy training data can be further cleaned • E.g., exclude common English words
Customization (cont.) • Exploit more features such as global context • Occurrences of the same word/phrase should be tagged all positive or all negative • Differentiate between domain-independent features and domain-specific features • E.g., prefix “Am” is domain-specific for Apis mellifera • Features can be weighted based on their contribution across domains
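The global-context constraint above (all occurrences of a phrase share one label) can be applied as a post-processing step. A sketch of one simple way to do this, via a case-insensitive majority vote over a document's mentions; the function name and input format are assumptions, not the BeeSpace implementation.

```python
from collections import Counter, defaultdict

def enforce_consistency(mentions):
    """mentions: list of (phrase, label) pairs from one document.
    Relabel every occurrence of a phrase with its majority label."""
    votes = defaultdict(Counter)
    for phrase, label in mentions:
        votes[phrase.lower()][label] += 1
    return [
        (phrase, votes[phrase.lower()].most_common(1)[0][0])
        for phrase, _ in mentions
    ]

doc_mentions = [("AmUSP", "gene"), ("amusp", "gene"), ("AmUSP", "non-gene")]
print(enforce_consistency(doc_mentions))
# [('AmUSP', 'gene'), ('amusp', 'gene'), ('AmUSP', 'gene')]
```

A real tagger might instead add a document-level feature to the model rather than hard-overriding its decisions, since the vote can propagate errors.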
Maximum Entropy Model for Gene Tagging • Given an observation (a token or a noun phrase), together with its context, denoted as x • Predict y ∈ {gene, non-gene} • Maximum entropy model: P(y|x) = K exp(Σi λi fi(x, y)), where K is a normalizing constant • Typical f: • y = gene & candidate phrase starts with a capital letter • y = gene & candidate phrase contains digits • Estimate λi with training data
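A minimal numeric sketch of the model above: P(y|x) = K·exp(Σi λi fi(x, y)), with K chosen so the two labels sum to 1. The two feature functions mirror the "typical f" bullets; the weights here are made up, not trained.

```python
import math

# Binary feature functions, as in the slide: each fires only for y = gene.
def f_capital(x, y):
    return 1.0 if y == "gene" and x[0].isupper() else 0.0

def f_digit(x, y):
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

FEATURES = [f_capital, f_digit]
LAMBDAS = [1.2, 0.8]  # illustrative; would be estimated from training data

def p(y, x):
    """P(y|x) = K exp(sum_i lambda_i f_i(x, y)); K normalizes over y."""
    def score(label):
        return math.exp(sum(l * f(x, label) for l, f in zip(LAMBDAS, FEATURES)))
    z = score("gene") + score("non-gene")  # K = 1/z
    return score(y) / z

print(round(p("gene", "Amih2"), 3))  # 0.881  (both features fire)
```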
Plan: Customization with Feature Adaptation • λi: trained on large set of data in domain A (e.g., human or fly) • μi: trained on small set of data in domain B (e.g., bee) • λi′ = αi·λi + (1 − αi)·μi: used for domain B • αi: based on how useful fi is across different domains • Large αi if fi is domain-independent • Small αi if fi is domain-specific
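The adaptation plan is a per-feature interpolation of the two weight vectors. A sketch with invented numbers (the symbol names follow the slide; all values are illustrative):

```python
def adapt(lam, mu, alpha):
    """lambda'_i = alpha_i * lam_i + (1 - alpha_i) * mu_i, elementwise."""
    return [a * l + (1 - a) * m for l, m, a in zip(lam, mu, alpha)]

lam   = [1.2, 0.8, -0.5]   # lambda_i: trained on large domain-A corpus
mu    = [0.9, 1.5,  0.3]   # mu_i: trained on small domain-B (bee) corpus
alpha = [0.9, 0.5,  0.1]   # high alpha = domain-independent feature

print(adapt(lam, mu, alpha))
```

Note how a domain-specific feature (alpha = 0.1, e.g. the "Am" prefix) ends up weighted almost entirely by the small bee-domain estimate.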
Issues to Discuss • Definition of gene names: • Gene families? (e.g., cb1 gene family) • Entities with a gene name? (e.g., Ks-1 transcripts) • Difference between genes and proteins? • E.g., “CREB (cAMP response element binding protein)” and “AmCREB”? • How to evaluate the performance on honeybee data?
The End • Questions? • Thank You!