New Event Detection at UMass Amherst



1. New Event Detection at UMass Amherst • Giridhar Kumaran and James Allan

2. Preprocessing • Lemur Toolkit for tokenization, stopping, and k-stemming (http://www-2.cs.cmu.edu/~lemur/) • BBN IdentiFinder™ for extracting named entities
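
The slide names only the tools; as a rough illustration, here is a minimal Python sketch of the tokenize/stop/stem step. The stopword list and the plural-stripping rule are hypothetical stand-ins for what Lemur actually does, and named-entity tagging (IdentiFinder's job) is not attempted here.

```python
import re

# Hypothetical stand-ins: the real system used the Lemur Toolkit for
# tokenization, stopping, and k-stemming, and BBN IdentiFinder for
# named-entity extraction. This tiny stopword list is illustrative.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for"}

def preprocess(text: str) -> list[str]:
    """Tokenize a story, drop stopwords, and crudely normalize terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Crude stand-in for Krovetz (k-)stemming: strip a plural "s".
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
```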

3. Systems fielded • Submitted four systems • Did not include last year's system (classification according to LDC categories plus term pruning), since that approach did not work on an exclusively newswire (NW) story corpus

4. Primary system – UMass1 • Utility of named entities acknowledged • Failure analysis indicates that a large number of old stories receive low confidence scores (false alarms) that conflict with new-story scores • Reasons: stories on multiple topics, diffuse topics, varying document lengths

5. Primary system – UMass1 • Focus: identify old stories better, since this affects the cost measure • Clue: most old stories that get low confidence scores are linked to their topics by only named entities (a large number) or by only non-named entities (a few)

6. Primary system – UMass1 • Approach: look at the set of closest-matching stories; if there is a consistently high named-entity or non-named-entity match, modify the confidence score

7. Primary system – UMass1 • Procedure: double the original confidence score if it is below a threshold; gradually reduce the score back towards the original if the set of closest stories matches on neither named entities nor non-named entities (see the sketch below)
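
A minimal sketch of that procedure, assuming details the slides leave open: the threshold value (0.1, taken from the example slides that follow), the similarity cut-off for "consistently high", and a step-wise reduction towards the original score are all illustrative choices, not the published formula.

```python
def adjust_confidence(score, closest, threshold=0.1, high_sim=0.5, steps=3):
    """Illustrative UMass1-style adjustment. `closest` holds
    (ne_sim, non_ne_sim) pairs for the closest-matching prior stories;
    `threshold`, `high_sim`, and `steps` are assumed values."""
    if score >= threshold:
        return score                      # confident scores are left alone
    boosted = 2.0 * score                 # double a low original score
    ne_high = all(ne >= high_sim for ne, _ in closest)
    non_ne_high = all(nn >= high_sim for _, nn in closest)
    if ne_high or non_ne_high:
        return boosted                    # consistent match: keep the boost
    # Neither kind of match is consistent: move gradually back
    # towards the original score.
    for _ in range(steps):
        boosted = (boosted + score) / 2.0
    return boosted
```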

8.–12. UMass1 – Examples from TDT3 • Russian Financial Crisis - Old Story, worked through across five slides (slides 10–12 with threshold = 0.1) • [example figures not preserved in the transcript]

13. UMass1 – Examples from TDT3 • Thai Airbus Crash - New Story • [example figure not preserved in the transcript]

14.–15. UMass1 on TDT3 • [result figures not preserved in the transcript]

16. UMass2 • Basic vector space model system • Compare each incoming story with all preceding stories • Return the highest cosine match (see the sketch below)
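
A minimal sketch of this baseline, assuming raw term-frequency vectors (the transcript does not say how UMass2 weighted its terms):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def highest_match(story: Counter, history: list[Counter]) -> float:
    """Compare a story with all preceding stories and return the
    highest cosine match; a low maximum suggests a new event."""
    return max((cosine(story, past) for past in history), default=0.0)
```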

17. UMass3 • Same model as UMass2, engineered as a practical system for TDT5's very large collection • Compare with at most the 25,000 prior stories that have the highest coordination match • Faster (see the sketch below)
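
The pre-filtering step could look like the sketch below. A production system would use an inverted index rather than scanning every prior story; this direct scan only shows the idea of ranking by coordination match (shared-term count) before running the more expensive cosine comparison.

```python
import heapq

def coordination_match(story: set[str], past: set[str]) -> int:
    """Coordination match: the number of terms two stories share."""
    return len(story & past)

def candidates(story: set[str], history: list[set[str]], k: int = 25000) -> list[set[str]]:
    """Keep only the k prior stories with the highest term overlap;
    the full cosine comparison then runs on this reduced set."""
    return heapq.nlargest(k, history, key=lambda past: coordination_match(story, past))
```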

18. UMass4 • Similar to UMass1, with the same rationale • Considers the top five matches • Uses a different formula for modifying the confidence score
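
The slide does not give the modified formula, so the following is only a hypothetical placeholder for "a different formula over the top five matches", not the actual UMass4 method:

```python
def adjust_confidence_v4(score: float, top5: list[tuple[float, float]], alpha: float = 0.5) -> float:
    """Hypothetical stand-in (NOT the published UMass4 formula):
    scale the score by the average named-entity similarity of the
    top five matches; `alpha` is an invented parameter."""
    avg_ne = sum(ne for ne, _ in top5) / len(top5)
    return score * (1.0 + alpha * avg_ne)
```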

19. Performance Summary • [results table not preserved in the transcript]

20. Summary • The basic vector space model did the best • Restricting the number of stories compared against improved system speed but did not improve performance • The primary system did extremely well on training data, but failed on TDT5
