
Customizing Gene Taggers for BeeSpace

Jing Jiang (jiang4@uiuc.edu), March 9, 2005. Entity recognition in BeeSpace covers several types of entities: genes, sequences, proteins, organisms, behaviors, and others. The current focus is on genes.



  1. Customizing Gene Taggers for BeeSpace
     Jing Jiang, jiang4@uiuc.edu
     March 9, 2005

  2. Entity Recognition in BeeSpace
     • Types of entities we are interested in:
       • Genes
       • Sequences
       • Proteins
       • Organisms
       • Behaviors
       • …
     • Currently, we focus on genes
     BeeSpace

  3. Input and Output
     • Input: free text (w/ simple XML tags)
       <?xml version="1.0" encoding="UTF-8"?>
       <Document id="1">…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to JH. …</Document>
     • Output: tagged text (XML format)
       <?xml version="1.0" encoding="UTF-8"?>
       <Document id="1">…<Sent><NP>We</NP> have <VP>cloned</VP> and <VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis mellifera ultraspiracle</Gene></NP> (<Gene>AMUSP</Gene>) and <VP>examined</VP> <NP>its responses to JH</NP>.</Sent>…</Document>
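
As a rough illustration of how the tagged output could be consumed downstream, the sketch below pulls the gene mentions out of a document in this format. It assumes the tagger emits well-formed XML; the sample string is adapted from the slide (ellipses dropped), and the helper name `extract_genes` is hypothetical, not part of any tagger.

```python
import xml.etree.ElementTree as ET

# Sample adapted from the slide's tagged output (assumed well-formed).
TAGGED = (
    '<Document id="1"><Sent><NP>We</NP> have <VP>cloned</VP> and '
    '<VP>sequenced</VP> <NP>a cDNA encoding '
    '<Gene>Apis mellifera ultraspiracle</Gene></NP> (<Gene>AMUSP</Gene>) and '
    '<VP>examined</VP> <NP>its responses to JH</NP>.</Sent></Document>'
)


def extract_genes(xml_text):
    """Return the text of every <Gene> element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(gene.itertext()) for gene in root.iter("Gene")]


print(extract_genes(TAGGED))
```

Using `itertext()` rather than `.text` keeps the helper working even if a gene mention contains nested tags.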

  4. Challenges
     • No complete gene dictionary
     • Many variations:
       • Acronyms: hyperpolarization-activated ion channel (Amih)
       • Synonyms: octopamine receptor (oa1, oar, amoa1)
       • Common English words: at (arctops), by (3R-B)
     • Different genes, or a gene and a protein, may share the same name/symbol
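
To make the "common English words" problem concrete, here is a minimal sketch of naive dictionary lookup. The symbol list is taken from the examples on this slide; the function name `dictionary_tag` and the test sentence are made up for illustration, and this is not how any of the taggers discussed here works.

```python
# Gene symbols from the slide's examples; a real dictionary would be far larger.
GENE_DICT = {"amih", "oa1", "oar", "amoa1", "at", "by"}


def dictionary_tag(text):
    """Return tokens whose lowercased form appears in the gene dictionary."""
    hits = []
    for token in text.split():
        cleaned = token.strip(".,()")  # drop surrounding punctuation
        if cleaned.lower() in GENE_DICT:
            hits.append(cleaned)
    return hits


# "at" and "by" are ordinary English words here, yet both are tagged.
print(dictionary_tag("Expression of Amih was measured at two stages by PCR."))
```

Pure lookup cannot tell the gene symbol "at" from the preposition, which is one reason context features matter.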

  5. Automatic Gene Recognition: Characteristics of Gene Names
     • Capitalization (especially acronyms)
     • Numbers (gene families)
     • Punctuation: -, /, :, etc.
     • Context:
       • Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
       • Global: same noun phrase occurs several times in the same article
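
The characteristics listed on this slide map naturally onto simple token-level features. The sketch below is one possible encoding, not the feature set of any tool named in the talk; the function name, the two-token window, and the cue-word list (copied from the slide's examples) are assumptions.

```python
import re

# Local-context cue words, taken from the slide's examples.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}


def surface_features(tokens, i):
    """Compute simple surface and context features for tokens[i]."""
    token = tokens[i]
    # Two tokens on each side (window size is an arbitrary choice).
    window = tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]
    return {
        "has_capital": any(c.isupper() for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "has_punct": bool(re.search(r"[-/:]", token)),
        "near_cue": any(w.lower() in CONTEXT_CUES for w in window),
        # Crude stand-in for global context: the token recurs in the text.
        "repeated": sum(t == token for t in tokens) > 1,
    }


tokens = "the Ks-1 gene is expressed".split()
print(surface_features(tokens, 1))
```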

  6. Existing Tools
     • KeX (Fukuda)
       • Based on hand-crafted rules
       • Recognizes proteins and other entities
       • Requires human effort; not easy to modify
     • ABNER & YAGI (Settles)
       • Based on conditional random fields (CRFs) to learn the “rules”
       • ABNER identifies and classifies different entities, including proteins, DNAs, RNAs, and cells
       • YAGI recognizes genes and gene products
       • No training

  7. Existing Tools (cont.)
     • LingPipe (Alias-i, Inc.)
       • Uses a generative statistical model based on word trigrams and tag bigrams
       • Can be trained
       • Has two trained models
     • Others
       • NLProt (SVM)
       • AbGene (rule-based)
       • GeneTaggerCRF (CRFs)

  8. Comparison of Existing Tools
     • Performance on a few manually annotated, public data sets (protein names):
       • GENIA (2000 abstracts on “human & blood cell & transcription factor”)
       • Yapex (99 abstracts on “protein binding & interaction & molecular”)
       • UTexas (750 abstracts on “human”)
     • Performance on a honeybee sample data set:
       • Biosis search “apis mellifera gene”

  9. Comparison of Existing Tools (cont.)

  10. Comparison of Existing Tools (cont.)
      • KeX on honeybee data
        • False positives: company names, country names, etc.
        • Does not differentiate between genes, proteins, and other chemicals
      • YAGI on honeybee data
        • False negatives: occurrences of the same gene name are not all tagged
        • Problems with entity types and boundary detection
      • LingPipe on honeybee data
        • Similar to YAGI

  11. Lessons Learned
      • Machine learning methods outperform hand-crafted rule-based systems
      • Machine learning methods suffer from over-fitting
      • Existing tools need to be customized for BeeSpace
        • LingPipe is a good choice
      • There is still room for better feature selection
        • E.g., global context

  12. Customization
      • Train LingPipe on a better training data set
        • Use fly (Drosophila) genes
        • F1 increased from 0.2207 to 0.7226 on held-out fly data
      • Results when tested on honeybee data:
        • Some gene names are learned (Record 13)
        • Some false positives are removed (proteins, RNAs)
        • Some false positives are introduced
      • The noisy training data can be further cleaned
        • E.g., exclude common English words
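
For reference, F1 is the harmonic mean of precision and recall. The slides do not say how matches were counted, so the sketch below assumes exact matching of (start, end) spans; the function name and the toy spans are hypothetical and unrelated to the reported scores.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact span match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of 4 predictions are correct, 2 of 3 gold spans are found.
gold_spans = [(0, 3), (5, 6), (9, 10)]
predicted_spans = [(0, 3), (5, 6), (7, 8), (12, 13)]
print(precision_recall_f1(gold_spans, predicted_spans))
```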

  13. Customization (cont.)
      • Exploit more features such as global context
        • Occurrences of the same word/phrase should be tagged all positive or all negative
      • Differentiate between domain-independent features and domain-specific features
        • E.g., prefix “Am” is domain-specific for Apis mellifera
      • Features can be weighted based on their contribution across domains
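
One simple way to enforce the "all positive or all negative" constraint is a majority vote over the local decisions for each phrase within an article. The slides do not specify a mechanism, so the voting rule, the tie-breaking toward "gene", and the function name below are all assumptions.

```python
from collections import Counter, defaultdict


def enforce_global_consistency(mentions):
    """Relabel mentions so each phrase gets one label within an article.

    mentions: list of (phrase, is_gene) pairs from a local tagger.
    Ties are resolved in favour of the gene label (an arbitrary choice).
    """
    votes = defaultdict(Counter)
    for phrase, is_gene in mentions:
        votes[phrase][is_gene] += 1
    resolved = {p: c[True] >= c[False] for p, c in votes.items()}
    return [(phrase, resolved[phrase]) for phrase, _ in mentions]


local = [("AMUSP", True), ("AMUSP", False), ("AMUSP", True),
         ("at", False), ("at", True), ("at", False)]
print(enforce_global_consistency(local))
```

Phrases are compared case-sensitively here, since case often distinguishes a gene symbol from an ordinary word.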

  14. Maximum Entropy Model for Gene Tagging
      • Given an observation (a token or a noun phrase), together with its context, denoted as $x$
      • Predict $y \in \{\text{gene}, \text{non-gene}\}$
      • Maximum entropy model: $P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$
      • Typical $f$:
        • $y$ = gene & candidate phrase starts with a capital letter
        • $y$ = gene & candidate phrase contains digits
      • Estimate $\lambda_i$ with training data
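
A minimal sketch of how such a model assigns probabilities, using the two feature functions named on this slide. The weights are hypothetical placeholders rather than trained values, and the normalizing constant is computed by summing the exponentiated scores over both labels.

```python
import math

LABELS = ("gene", "non-gene")


def f_capital(x, y):
    """Fires when y is 'gene' and the phrase starts with a capital letter."""
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0


def f_digit(x, y):
    """Fires when y is 'gene' and the phrase contains a digit."""
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0


FEATURES = [f_capital, f_digit]
WEIGHTS = [1.2, 0.8]  # hypothetical values; in practice estimated from data


def maxent_prob(x, weights=WEIGHTS):
    """Return P(y | x) for each label under the log-linear model."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # normalizer, so probabilities sum to 1
    return {y: s / z for y, s in scores.items()}


print(maxent_prob("Ks-1"))
print(maxent_prob("the"))
```

With no active features (as for "the"), both labels score exp(0) and the model returns 0.5 for each.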

  15. Plan: Customization with Feature Adaptation
      • $\lambda_i^A$: trained on a large set of data in domain A (e.g., human or fly)
      • $\lambda_i^B$: trained on a small set of data in domain B (e.g., bee)
      • $\lambda_i' = \alpha_i \lambda_i^A + (1 - \alpha_i) \lambda_i^B$: used for domain B
      • $\alpha_i$: based on how useful $f_i$ is across different domains
        • Large $\alpha_i$ if $f_i$ is domain-independent
        • Small $\alpha_i$ if $f_i$ is domain-specific
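
The plan on this slide amounts to a per-feature linear interpolation between the weight learned on the large domain and the weight learned on the small target domain. The sketch below shows that arithmetic only; the names `lam_a`, `lam_b`, and `alpha`, and all numeric values, are illustrative assumptions, and how the mixing coefficients would be chosen is left open, as in the slides.

```python
def adapt_weights(lam_a, lam_b, alpha):
    """Interpolate per-feature weights from domain A and domain B.

    lam_a: weights trained on the large domain-A data set
    lam_b: weights trained on the small domain-B data set
    alpha: per-feature mixing coefficients in [0, 1]; close to 1 for
           domain-independent features, close to 0 for domain-specific ones
    """
    if not (len(lam_a) == len(lam_b) == len(alpha)):
        raise ValueError("all three lists must have one entry per feature")
    return [a * la + (1 - a) * lb for la, lb, a in zip(lam_a, lam_b, alpha)]


# Feature 0: "starts with a capital letter" (treated as domain-independent).
# Feature 1: "has prefix Am" (treated as specific to the bee domain).
lam_a = [1.2, 0.1]
lam_b = [0.8, 2.0]
alpha = [0.9, 0.1]
print(adapt_weights(lam_a, lam_b, alpha))
```

The adapted weight for the capitalization feature stays close to its domain-A value, while the "Am" prefix weight stays close to its domain-B value.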

  16. Issues to Discuss
      • Definition of gene names:
        • Gene families? (e.g., cb1 gene family)
        • Entities with a gene name? (e.g., Ks-1 transcripts)
        • Difference between genes and proteins?
          • E.g., “CREB (cAMP response element binding protein)” and “AmCREB”?
      • How to evaluate the performance on honeybee data?

  17. The End
      • Questions?
      • Thank You!
