Discussing progress on gene annotation, common errors in tagging, heuristics for consistency, and plans to evaluate performance and tune entity recognizer. Working on honey bee data judgment and adjusting heuristics. Collaborating with Todd on format.
Entity Recognition: Current Status and Summer Plan
Jing Jiang
May 12, 2006
Update since last meeting
• Met with Nyla (the biologist) to talk about training/evaluation data
  • Most annotated genes in the BioCreative data set are reasonable
  • Plan: manually annotate a sample set of bee literature for evaluation and tuning purposes
• Tagged some other collections (fly-bcb, songbird, Wnt pathway)
• Identified some common errors and came up with some heuristics to fix them
Current performance
• On BIOSIS honey bee: waiting to hear from Nyla for judgment on the honey bee sample
• On Wnt pathway full-text articles (a sample of 100 sentences, judged by Xin)
  • Precision: 92% (207 / 224)
  • Recall: 84% (207 / 245)
• Examples: fly, songbird, Wnt pathway
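The two percentages on this slide follow directly from the counts shown. A minimal sketch of the arithmetic, assuming 207 is the number of correct tags, 224 the number of tags produced, and 245 the number of gene mentions in the judged sample (the function name is hypothetical):

```python
def precision_recall(correct, predicted, gold):
    """Return (precision, recall) from raw counts.

    correct   -- tags produced by the recognizer that the judge accepted
    predicted -- all tags produced by the recognizer
    gold      -- all gene mentions the judge identified in the sample
    """
    return correct / predicted, correct / gold


# Counts taken from the Wnt pathway sample on the slide.
precision, recall = precision_recall(207, 224, 245)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```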
Common errors and heuristics
• Same word/phrase tagged differently within the same article
  • Because of the different contexts
  • Heuristic: force the tagging to be consistent
• Long form and its abbreviation tagged differently
  • E.g.: …a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and…
  • Heuristic: force the tagging to be consistent
• Easily detectable false positives
  • E.g.: Roughly half of Drosophila genes currently…
  • Heuristic: compile a list (of species names, chemical names, etc.) and some heuristic rules
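The slides do not say how consistency is enforced. One possible reading is a majority vote per phrase within an article, with a long form and its abbreviation voting together. The sketch below assumes that reading; the function name, the `aliases` mapping, and the tie-breaking rule (ties go to "gene") are all illustrative assumptions, not the method used in the project.

```python
from collections import Counter, defaultdict


def force_consistent(mentions, aliases=None):
    """Relabel every mention of a phrase with its majority label.

    mentions -- list of (phrase, is_gene) pairs from ONE article
    aliases  -- optional dict mapping a lowercased abbreviation to its
                lowercased long form, so both share one vote (assumption)
    Ties are resolved in favour of the gene label (assumption).
    """
    aliases = aliases or {}

    def key(phrase):
        lowered = phrase.lower()
        return aliases.get(lowered, lowered)

    votes = defaultdict(Counter)
    for phrase, is_gene in mentions:
        votes[key(phrase)][is_gene] += 1

    return [
        (phrase, votes[key(phrase)][True] >= votes[key(phrase)][False])
        for phrase, _ in mentions
    ]


tagged = [
    ("Apis mellifera ultraspiracle", True),
    ("AMUSP", False),  # abbreviation missed by the tagger
    ("Apis mellifera ultraspiracle", True),
]
fixed = force_consistent(tagged, {"amusp": "apis mellifera ultraspiracle"})
print(fixed)
```

A majority vote can also spread an error: if the tagger is wrong more often than right on a phrase, every occurrence ends up wrong.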
Common errors and heuristics (cont.)
• Conjunctive words/phrases tagged differently
  • E.g.: …three cbl genes (c-cbl, cblb, and cblc) which…
  • Heuristic: use some rules to capture such conjunctive words, and tag them consistently
• Tokenization errors
  • E.g.: There is no difference in AmTRP-expressing cells among worker, …
  • Heuristic: compile a list of typical suffixes (such as “-expressing”, “-dependent”, etc.) that should be separated from their prefixes
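Both heuristics on this slide can be sketched with a few lines of string handling. The suffix list below contains only the two suffixes named on the slide; the function names and the regular expression for splitting conjuncts are assumptions made for illustration, and the actual rules may well differ.

```python
import re

# Only the two suffixes named on the slide; a real list would be longer.
SUFFIXES = ("-expressing", "-dependent")

# Splits on commas (optionally followed by "and"/"or") or on a bare "and"/"or".
CONJUNCTION = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def split_suffixes(tokens, suffixes=SUFFIXES):
    """Separate a listed suffix from its prefix, e.g. AmTRP | -expressing."""
    out = []
    for token in tokens:
        for suffix in suffixes:
            if token.lower().endswith(suffix) and len(token) > len(suffix):
                out.extend([token[:-len(suffix)], token[-len(suffix):]])
                break
        else:  # no suffix matched: keep the token unchanged
            out.append(token)
    return out


def conjuncts(phrase):
    """Split a coordinated phrase into its members so they can share a tag."""
    return [part for part in CONJUNCTION.split(phrase.strip()) if part]


print(split_suffixes(["AmTRP-expressing", "cells"]))
print(conjuncts("c-cbl , cblb , and cblc"))
```

Once the conjuncts are isolated, they could be passed to whatever consistency rule is chosen, for example tagging all of them as genes when any one is tagged.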
Common errors and heuristics (cont.)
• Mistakes caused by citations
  • Only in certain collections (the Wnt pathway collection has this problem; the BIOSIS collections do not)
  • E.g.: Among the downstream targets of PI 3-kinase are phospholipase C (6-9), protein kinase C (10, 11), Rac (12-14), and…
  • Heuristic: remove these citations (?)
• Controversial cases: domain, subunit, etc.
  • E.g.: Alternating proline / alanine sequence of beta B1 subunit originates…
  • The BioCreative data set tags these as part of gene names
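If citations are removed before tagging, a regular expression over parenthesized numbers and ranges covers the pattern in the example. This is a sketch under that assumption only; the pattern and function name are illustrative, and the question mark on the slide suggests the heuristic itself was still undecided.

```python
import re

# One number or range, e.g. "6" or "6-9" (hyphen or en dash).
_ITEM = r"\d+(?:\s*[-–]\s*\d+)?"

# A parenthesized, comma-separated list of such items, with leading spaces.
CITATION = re.compile(r"\s*\(\s*" + _ITEM + r"(?:\s*,\s*" + _ITEM + r")*\s*\)")


def strip_numeric_citations(text):
    """Remove numeric citation markers such as (6-9) or (10, 11)."""
    return CITATION.sub("", text)


sentence = ("Among the downstream targets of PI 3-kinase are "
            "phospholipase C (6-9) , protein kinase C (10, 11) , Rac (12-14) , and")
print(strip_numeric_citations(sentence))
```

The pattern would also delete a parenthesized number that is part of the sentence proper, so it is probably best limited to collections known to contain citation markers.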
Summer plan
• Evaluate the performance on honey bee data based on Nyla’s judgments
• Implement and tune the heuristics to capture the common errors, and evaluate their effectiveness
  • Some heuristics may cause new errors
  • Tune on the annotated sample honey bee data
• Based on the needs of BeeSpace, find a good balance between precision and recall
• Work with Todd on the input/output format of the entity recognizer
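The slides do not name a measure for balancing precision and recall. One common option is the F-beta score, where beta above 1 weights recall more heavily and beta below 1 weights precision more heavily. The sketch below assumes that choice; the function name is hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision, beta = 1 is F1.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Counts from the Wnt pathway sample reported earlier in the slides.
p, r = 207 / 224, 207 / 245
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(p, r, beta):.3f}")
```

Tuning the heuristics against a single score of this kind makes it easier to see when a rule that fixes one error class introduces new errors elsewhere.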