
IE by Candidate Classification: Jansche & Abney, Cohen et al

IE by Candidate Classification: Jansche & Abney, Cohen et al. William Cohen 1/19/03. SCAN: Search & Summarization for Audio Collections (AT&T Labs). Why IE from personal voicemail. Unified interface for email, voicemail, fax, … requires uniform headers: Sender, Time, Subject, …


Presentation Transcript


  1. IE by Candidate Classification: Jansche & Abney, Cohen et al. William Cohen, 1/19/03

  2. SCAN: Search & Summarization for Audio Collections (AT&T Labs)

  3. Why IE from personal voicemail • Unified interface for email, voicemail, fax, … requires uniform headers: • Sender, Time, Subject, … • Headers are key for uniform interface • Independently, voicemail access is slow: • useful to have fast access to important parts of message (contact number, caller)

  4. Why else to read this paper • Robust information extraction • Generalizing from manual transcripts (i.e., human-produced written versions of voicemail) to automatic (ASR) transcripts • Place of hand-coding vs. learning in information extraction • How to break up the task • Where and how to use engineering • Diagram: candidate generator → candidate phrase → learned filter → extracted phrase

  5. Voicemail corpus • About 10,000 manually transcribed and annotated voice messages. • 1869 used for evaluation

  6. Observation: caller phrases are short and near the beginning of the message.

  7. Caller-phrase extraction • Propose start positions i_1, …, i_N • Use a learned decision tree to pick the best i • Propose end positions i+j_1, i+j_2, …, i+j_M • Use a learned decision tree to pick the best j
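The two-stage choice on this slide (best start first, then best end given that start) can be sketched as follows. This is a minimal illustration, not the authors' system: the function name `extract_caller_phrase` and the two toy scoring functions are hypothetical stand-ins for the learned decision trees, and the candidate lists are supplied by hand.

```python
def extract_caller_phrase(tokens, starts, lengths, score_start, score_length):
    """Pick the best start position, then the best length given that start."""
    best_start = max(starts, key=lambda i: score_start(tokens, i))
    # Only lengths that keep the phrase inside the message are considered.
    valid = [j for j in lengths if best_start + j <= len(tokens)]
    best_len = max(valid, key=lambda j: score_length(tokens, best_start, j))
    return tokens[best_start:best_start + best_len]


# Hypothetical stand-ins for the two learned decision trees (assumptions).
def toy_score_start(tokens, i):
    # Favor a position right after an introduction word.
    return 1.0 if i > 0 and tokens[i - 1] in {"is", "it's"} else 0.0


def toy_score_length(tokens, start, j):
    # Favor a phrase that ends right before a typical follow-up word.
    end = start + j
    return 1.0 if end < len(tokens) and tokens[end] in {"calling", "from", "here"} else 0.0


message = "hi jim this is pat calling about lunch".split()
phrase = extract_caller_phrase(message, range(0, 6), [1, 2, 3],
                               toy_score_start, toy_score_length)
print(phrase)
```

Splitting the decision this way keeps each classifier's job small: the second stage only has to rank a handful of lengths for one fixed start.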

  8. Baseline (HZP, Col log-linear) • IE as tagging: • Pr(tag_i | word_i, word_{i-1}, …, word_{i+1}, …, tag_{i-1}, …) estimated via MAXENT model • Beam search to find the best tag sequence given the word sequence • Features of the model are words, word pairs, word pair + tag trigrams, …
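The beam-search step of the tagging baseline can be sketched as below. This is only an illustration under stated assumptions: `toy_log_prob` is a hand-written stand-in for the MAXENT estimate of Pr(tag_i | words, previous tag), and the tag names `O` and `CALLER` are hypothetical.

```python
import math


def beam_search_tags(words, tags, log_prob, beam_width=3):
    """Left-to-right beam search over tag sequences."""
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(len(words)):
        expanded = []
        for score, seq in beam:
            prev = seq[-1] if seq else "<s>"
            for tag in tags:
                expanded.append((score + log_prob(tag, words, i, prev), seq + [tag]))
        # Keep only the highest-scoring partial sequences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]


def toy_log_prob(tag, words, i, prev_tag):
    # Hypothetical local model (an assumption, not a trained MAXENT model).
    if i > 0 and words[i - 1] == "is":
        p_caller = 0.9
    elif prev_tag == "CALLER" and words[i] != "calling":
        p_caller = 0.8
    else:
        p_caller = 0.1
    return math.log(p_caller if tag == "CALLER" else 1.0 - p_caller)


words = "this is pat smith calling".split()
best_tags = beam_search_tags(words, ["O", "CALLER"], toy_log_prob)
print(list(zip(words, best_tags)))
```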

  9. Performance

  10. Observation: caller names are really short and near the beginning of the message.

  11. What about ASR transcripts?

  12. Extracting phone numbers • Phase 1: hand-coded grammar proposes candidate phone numbers • Not too hard, due to limited vocabulary • Optimize recall (96%) not precision (30%) • Phase 2: a learned decision tree filters candidates • Use length, position, context, …
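The two phases can be sketched as a high-recall generator followed by a filter. This is a minimal sketch under stated assumptions: the digit-word vocabulary, the "maximal run of spoken digits" rule, and the length-based `keep_candidate` rule are all hypothetical stand-ins; the slides use a hand-coded grammar for phase 1 and a learned decision tree for phase 2.

```python
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def propose_phone_candidates(tokens, min_len=4):
    """Phase 1 (high recall): every maximal run of spoken digits."""
    candidates = []
    i = 0
    while i < len(tokens):
        if tokens[i] in DIGIT_WORDS:
            j = i
            while j < len(tokens) and tokens[j] in DIGIT_WORDS:
                j += 1
            if j - i >= min_len:
                digits = "".join(DIGIT_WORDS[t] for t in tokens[i:j])
                candidates.append((i, j, digits))
            i = j
        else:
            i += 1
    return candidates


def keep_candidate(candidate):
    """Phase 2 stand-in: a fixed length rule instead of a learned tree."""
    _, _, digits = candidate
    return len(digits) in (7, 10)


tokens = ("call me at five five five one two one two "
          "or extension four one two three thanks").split()
candidates = propose_phone_candidates(tokens)
kept = [c for c in candidates if keep_candidate(c)]
print(candidates)
print(kept)
```

The generator deliberately over-proposes (the extension is also returned); precision is left to the filter, which mirrors the recall-first trade-off stated on the slide.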

  13. Results

  14. Their Conclusions

  15. Cohen, Wang, Murphy • Another paper with a similar flavor: • IE for a particular task • IE using a similar propose-and-filter approach • When and how do you engineer, and when and how do you use learning?

  16. Background – subcellular localization The most important tool for studying protein localizations is fluorescence microscopy. New image processing techniques can automatically produce a quantitative description of subcellular localization.

  17. Background – subcellular localization: two Golgi proteins that cannot be distinguished by eye

  18. Background – subcellular localization Entrez: “a new 376kD Golgi complex outher membrane protein” SWISSProt: “INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE” Entrez: “GPP130; type II Golgi membrane protein” SWISSProt: nothing

  19. Overview of SLIF: image analysis of existing images from online publications. Diagram labels (in extraction order): Image; On-line paper; Panel Splitter; Figure finder; Panel Classifier; Fl. Micr. Panel; Scale Finder; Figure; Micr. Scale

  20. Overview of SLIF: image analysis of existing images from online publications End result: collection of on-line fluorescence microscope images, with quantitative description of localization. E.g.: we know this figure section shows a tubulin-like protein… …but not which one!

  21. Background – overview of SLIF2.0. Diagram labels (in extraction order): Image; Caption; Image Pointer Finder; Panel Splitter; Panel Label Matcher; Panel Classifier; Scope Finder; Fl. Micr. Panel; Scale Finder; Name Finder; Protein Name; Micr. Scale; Cell Type

  22. Background – overview of SLIF2.0 • An old issue: entity recognition (e.g., BY-2, U2B″-GFP, p80-coilin, anti-p80 coilin) • A new issue: “caption understanding” - where are the entities in the image? • Example caption: Figure 1. (A) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and autoantibody against p80 coilin (right panel). Three nuclei are shown, and the bright GFP spots colocalize with bright foci of anti-coilin labeling. There is some labeling of the cytoplasm by anti-p80 coilin. (B) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and 4G3 antibody (right panel). Three nuclei are shown. Most coiled bodies are in the nucleoplasm, but occasionally are seen in the nucleolus (arrows). All coiled bodies that contain U2B″ also express the U2B″-GFP fusion. Bars, 5 μm. (Source page footer: Movement of Coiled Bodies, Vol. 10, July 1999, p. 2299)

  23. Why caption understanding? - Location proteomics. - Remove extraneous junk from caption text for “ordinary” IE, NLP, indexing, … - Better text- or content-based image retrieval for scientific images.

  24. Identify image pointers: substrings that refer to parts of the image. Will focus on text issues, not matching.

  25. Identify image pointers: substrings that refer to parts of the image. Classify image pointers as citation-style or bullet-style.

  26. Compute scopes: - The scope of a bullet-style image pointer is all words between it and the next “bullet” (in the Figure 1 example caption, the slide marks the scope of (A) and the scope of (B)). Classify image pointers as citation-style or bullet-style.

  27. Compute scopes: - The scope of a bullet-style image pointer is all words after it, but before the next “bullet” - The scope of a citation-style image pointer is some set of words near it (heuristically determined by separating words and punctuation)
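The bullet-style rule (everything after a pointer, up to the next bullet) can be sketched as follows. This is a minimal sketch: the regular expression `BULLET` only recognizes single capital letters in parentheses, which is an assumption made for illustration, and the heuristic citation-style scoping is not attempted.

```python
import re

# Assumption: a bullet-style pointer looks like "(A)", "(B)", ...
BULLET = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption):
    """Map each bullet label to the text between it and the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = {}
    for k, match in enumerate(matches):
        # The scope ends where the next bullet starts, or at the caption end.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[match.group(1)] = caption[match.end():end].strip()
    return scopes


caption = "Figure 1. (A) Cells expressing GFP. (B) Cells labeled with 4G3 antibody."
scopes = bullet_scopes(caption)
print(scopes)
```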

  28. Image pointers share all entities in their “scope”. Entities are assigned to panels based on matches of image pointers to annotations in panels.

  29. Outline • Details on caption understanding • Baseline hand-coded methods • Learning methods • Experimental results

  30. Task • Identify image pointers in captions. • Classify image pointers: • bullet-style, citation-style, or NP-style • E.g., “Panels A and C show the …” • Won’t talk about scoping • Will focus first on extracting image pointers—i.e., binary classification of substrings “is this an image pointer” • Data: 100 captions from 100 papers—about 600 positive examples.

  31. Baseline methods • Labeled 100 sample figure captions. • HANDCODE-1: patterns like (A), (B-E), (c and d), etc. • HANDCODE-2: all short parenthesized expressions & patterns like “panel A” or “in B-C” Some plausible tricks (like filtering HC-2) don’t help much…
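A pattern matcher in the spirit of HANDCODE-1 can be sketched with one regular expression. The exact patterns used in the paper are not given on the slide, so `HANDCODE1_LIKE` below is a guess that covers only the three forms listed: (A), (B-E), and (c and d).

```python
import re

# Assumption: one letter, optionally followed by "-", ",", or "and" and a second letter.
HANDCODE1_LIKE = re.compile(r"\(([A-Za-z])(?:\s*(?:-|,|and)\s*([A-Za-z]))?\)")


def find_image_pointers(caption):
    """Return the substrings that match the hand-coded pointer pattern."""
    return [m.group(0) for m in HANDCODE1_LIKE.finditer(caption)]


text = "Cells in (A), (B-E), and (c and d) were fixed (see Materials)."
pointers = find_image_pointers(text)
print(pointers)
```

A narrow pattern like this trades recall for precision; the broader HANDCODE-2 idea (all short parenthesized expressions) would also pick up "(see Materials)".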

  32. How hard is the problem? Some citation-style image pointers

  33. How hard is the problem? NP-style non-image pointers The difficulty of the task suggests using a learning approach

  34. Another use of propose-and-filter Note that Hand-Code2 (recall 98%) is a natural candidate generator. We’ll start with “off the shelf” features… Candidate Generator Candidate phrase Learned filter Extracted phrase

  35. Learning methods: boosting • Generalized version of AdaBoost (Schapire & Singer, 99) • Allows “real-valued” predictions for each “base hypothesis”, including a value of zero.

  36. Learning methods: boosting rules • Weak learner: to find weak hypothesis t: • Split data into Growing and Pruning sets • Let R_t be an empty conjunction • Greedily add conditions to R_t guided by the Growing set • Greedily remove conditions from R_t guided by the Pruning set • Convert to a weak hypothesis • Constraint: W+ > W- [formulas not preserved]; the caret denotes smoothing
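The formulas on this slide did not survive transcription. As a hedged illustration only, the sketch below assumes the smoothed rule confidence from the published SLIPPER description, Ĉ = ½ ln((W+ + 1/(2n)) / (W- + 1/(2n))), where W+ and W- are the total weights of the positive and negative examples a rule covers. A rule predicts Ĉ on covered examples and abstains (predicts zero) elsewhere. The function names are hypothetical.

```python
import math


def rule_confidence(w_plus, w_minus, n):
    """Smoothed confidence of a rule (assumed form; 1/(2n) is the smoothing term)."""
    eps = 1.0 / (2 * n)
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))


def update_weights(weights, labels, covered, confidence):
    """Boosting-style reweighting; uncovered examples get prediction zero."""
    new = [w * math.exp(-y * confidence) if c else w
           for w, y, c in zip(weights, labels, covered)]
    total = sum(new)
    return [w / total for w in new]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, 1, -1]
covered = [True, True, False, True]  # which examples the rule fires on

w_plus = sum(w for w, y, c in zip(weights, labels, covered) if c and y > 0)
w_minus = sum(w for w, y, c in zip(weights, labels, covered) if c and y < 0)
conf = rule_confidence(w_plus, w_minus, n=len(weights))
new_weights = update_weights(weights, labels, covered, conf)
print(conf, new_weights)
```

With W+ > W- the confidence is positive, so the covered negative example gains weight and the covered positives lose weight for the next round.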

  37. Learning methods: boosting rules SLIPPER also produces fairly compact rule sets.

  38. Learning methods: BWI • Boosted wrapper induction (BWI) learns to extract substrings from a document. • Learns three concepts: firstToken(x), lastToken(x), substringLength(k) • Conditions are tests on tokens before/after x • E.g., tok_{i-2}=‘from’, isNumber(tok_{i+1}) • SLIPPER weak learner, no pruning. • Greedy search extends window size by at most L in each iteration, uses lookahead L, no fixed limit on window size. • Good results in (Freitag and Kushmerick, 2000)

  39. Learning methods: ABWI • “Almost boosted wrapper induction” (ABWI) learns to extract substrings: • Learns to filter candidate substrings (HandCode2) • Conditions are the same tests on tokens near x: • E.g., tok_{i-2}=‘from’, isNumber(tok_{i+1}) • SLIPPER weak learner, no pruning. • Greedy search extends window size any amount, uses no lookahead, has fixed limit on window size. • Optimal window sizes for this problem seem to be small…

  40. Learning methods • Features: W tokens before/after, all tokens inside. • Learner: 100 rounds boosting conjunctions of feature tests • Inspired by BWI (Freitag & Kushmerick) • Implemented with SLIPPER learner
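The feature set on this slide (W tokens before and after the candidate, plus all tokens inside it) can be sketched as below. The feature-name strings such as `before-1=` are a hypothetical encoding chosen for illustration; the slides do not specify one.

```python
def window_features(tokens, start, end, window=2):
    """Features for the candidate tokens[start:end] with a context window of W tokens."""
    feats = set()
    for k in range(1, window + 1):
        if start - k >= 0:
            feats.add(f"before-{k}={tokens[start - k]}")
        if end - 1 + k < len(tokens):
            feats.add(f"after+{k}={tokens[end - 1 + k]}")
    for tok in tokens[start:end]:
        feats.add(f"inside={tok}")
    return feats


tokens = ["labeled", "with", "GFP", "(", "left", "panel", ")", "and"]
feats = window_features(tokens, 3, 7, window=2)  # candidate: ( left panel )
print(sorted(feats))
```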

  41. Other learning methods All learning methods are competitive with hand-coded methods

  42. Additional features • Check if candidate contains certain “special” substrings: • Matches color name: labeled color • Matches HANDCODE-1 pattern: handcode1 • Matches “mm”, “mg”, etc: measure • Matches 1980,…,2003, “et al”: citation • Matches “top”, “left”, etc: place • Added “sentence boundary” substrings: • Feature is “distance to boundary”.
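The “special substring” checks can be sketched as simple lexicon and pattern tests. The word lists below are small hypothetical samples (the slide gives only a few examples per class), and the handcode1 and sentence-boundary features are left out of this sketch.

```python
import re

# Hypothetical sample lexicons; the slide lists only a few examples of each.
COLOR_WORDS = {"red", "green", "blue", "yellow"}
PLACE_WORDS = {"top", "bottom", "left", "right"}
MEASURE_WORDS = {"mm", "mg"}


def special_features(candidate):
    """Return the set of special-substring features that fire on a candidate."""
    words = re.findall(r"[a-z]+|\d+", candidate.lower())
    feats = set()
    if COLOR_WORDS & set(words):
        feats.add("color")
    if PLACE_WORDS & set(words):
        feats.add("place")
    if MEASURE_WORDS & set(words):
        feats.add("measure")
    has_year = any(w.isdigit() and 1980 <= int(w) <= 2003 for w in words)
    has_et_al = any(a == "et" and b == "al" for a, b in zip(words, words[1:]))
    if has_year or has_et_al:
        feats.add("citation")
    return feats


for text in ["(Smith et al., 1999)", "(left panel)", "(green)", "(5 mm)", "(A)"]:
    print(text, sorted(special_features(text)))
```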

  43. Learning with expanded feature set Many new features are inversely correlated with class (e.g. citation), but ABWI looks only for positively-correlated patterns.

  44. Learning with expanded feature set SABWI is a symmetric version of ABWI: can use rules and/or conditions negatively or positively correlated with the class

  45. Task • Identify image pointers in captions. • Classify image pointers: • bullet-style, citation-style, or NP-style • Combine these to get a four-class problem: • bullet-style, citation-style, NP-style, or other • No hand-coded baseline methods

  46. Four-class extraction results

  47. Further improvement is probable with additional labeled data
