
Text Classification and Named Entities for New Event Detection





  1. Text Classification and Named Entities for New Event Detection Giridhar Kumaran and James Allan University of Massachusetts Amherst SIGIR 2004

  2. Introduction • New Event Detection (NED) is one of the tasks in the TDT program. (http://www.nist.gov/speech/tests/tdt/index.htm) • The vector space model has achieved the best results to date. • Goal: better similarity metrics and document representations.

  3. Previous Research • Increasing the number of features. • Weighting event-level features more heavily than more general topic-level features. • Lexical chains (using WordNet). • A combined NED and tracking system. • Named entities re-weighted and a stop list created for each topic. • Incremental TF-IDF.
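Of the prior approaches listed, incremental TF-IDF is the most mechanical: document frequencies are updated as stories arrive, so IDF reflects only the part of the stream seen so far. A minimal sketch of that idea (the class name and smoothing constants are my own, not from the cited work):

```python
import math
from collections import Counter

class IncrementalTfIdf:
    """TF-IDF where document frequencies grow as the stream is processed."""

    def __init__(self):
        self.df = Counter()   # term -> number of documents containing it so far
        self.n_docs = 0

    def add(self, tokens):
        """Register a processed story; each term counts once per document."""
        self.n_docs += 1
        self.df.update(set(tokens))

    def vector(self, tokens):
        """TF-IDF weights for a story, using only the stream seen so far.
        The +1 / +0.5 smoothing keeps unseen terms finite (an assumption)."""
        tf = Counter(tokens)
        return {t: c * math.log((self.n_docs + 1) / (self.df[t] + 0.5))
                for t, c in tf.items()}
```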

  4. NED Evaluation • The NED algorithm assigns each story a confidence score between 0 and 1, either immediately or after a look-ahead period. • 0 → new, 1 → old. • A threshold is chosen to minimize the detection cost. • A Detection Error Tradeoff (DET) curve represents the miss and false-alarm rates.
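The thresholding step can be sketched as a cost minimization over miss and false-alarm rates. The constants below (C_miss = 1, C_fa = 0.1, P_target = 0.02) are the commonly used TDT evaluation parameters, assumed here rather than taken from the slides:

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """TDT-style detection cost; default constants are the usual TDT
    parameters (an assumption here, not stated on the slides)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def best_threshold(scores_labels, thresholds):
    """Pick the threshold with least cost.
    scores_labels: (score, is_old) pairs; score near 1 means 'old'."""
    best = None
    for th in thresholds:
        # Miss: a new story scored >= threshold (wrongly declared old).
        miss = sum(1 for s, old in scores_labels if not old and s >= th)
        # False alarm: an old story scored < threshold (wrongly declared new).
        fa = sum(1 for s, old in scores_labels if old and s < th)
        n_new = sum(1 for _, old in scores_labels if not old) or 1
        n_old = sum(1 for _, old in scores_labels if old) or 1
        cost = detection_cost(miss / n_new, fa / n_old)
        if best is None or cost < best[1]:
            best = (th, cost)
    return best
```

Sweeping the threshold over many values and plotting the two error rates at each point traces out the DET curve the slides refer to.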

  5. Basic Model • Cosine similarity
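A minimal sketch of the basic model: score each incoming story by cosine similarity against all earlier stories, and report high novelty when nothing similar has been seen. Vectors are sparse dicts of term weights; the exact weighting scheme is left to the caller:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty(story_vec, past_vecs):
    """NED score: 1 when the story resembles no earlier story,
    near 0 when some earlier story is almost identical."""
    if not past_vecs:
        return 1.0
    return 1.0 - max(cosine(story_vec, p) for p in past_vecs)
```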

  6. Modified Model • Cosine similarity is good, but makes mistakes. • The level a story occupies in a hierarchy of events is of interest. • Look into other parameters: the category, the overlap of named entities, and the overlap of non-named-entity terms. • Develop simple rules that reflect the questions a human would ask before deciding whether a story is new or old.

  7. Modification to document model • Terms: in health-care stories, drugs, cost, coverage, plan, and prescription recur across topics, vs. the locations and individuals that distinguish them. • Solution: first place stories into broad categories, then compute term weights per category. • Categories are the topic types specified by the LDC; stories are classified accordingly. • Train on TDT2, test on TDT3.
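One way to realize category-conditioned term weighting is to keep separate document-frequency counts per category, so a term that is common within a category (e.g. "drugs" in health care) loses weight there while discriminative terms keep it. A sketch of that idea, with the classifier itself left out (in the paper it is trained on LDC topic types over TDT2); the class name and smoothing are assumptions:

```python
import math
from collections import Counter, defaultdict

class CategoryTfIdf:
    """Per-category document frequencies: terms common inside a category
    get low weight there; terms rare in the category stay discriminative."""

    def __init__(self):
        self.df = defaultdict(Counter)   # category -> term -> doc frequency
        self.n = Counter()               # category -> document count

    def add(self, category, tokens):
        """Register a story already assigned to a broad category."""
        self.n[category] += 1
        self.df[category].update(set(tokens))

    def vector(self, category, tokens):
        """Term weights computed against the story's own category."""
        tf = Counter(tokens)
        n = self.n[category] + 1
        return {t: c * math.log(n / (self.df[category][t] + 0.5))
                for t, c in tf.items()}
```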

  8. Modification to Similarity Metric • Isolate the named entities and treat them preferentially (nothing new). • Named entities are a double-edged sword; deciding when to use them can be tricky.

  9. Multiple document representations • Alpha: all terms. • Beta: only named entities. • Gamma: non-named-entity terms. • Entity types: Event, GPE (Geographical and Political Entities), Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time.
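Building the three representations is a straightforward split on named-entity tags. A sketch, assuming some tagger has already labeled tokens with types such as Person or GPE (the `entity_tags` mapping is hypothetical; the slides do not specify the tagger's interface):

```python
def split_representations(tokens, entity_tags):
    """Build the three views of a story:
    alpha = all terms, beta = named entities only, gamma = the rest.
    entity_tags maps a token to its entity type (Person, GPE, Date, ...)."""
    alpha = list(tokens)
    beta = [t for t in tokens if t in entity_tags]
    gamma = [t for t in tokens if t not in entity_tags]
    return alpha, beta, gamma
```

Each of the three token lists can then be weighted and compared with cosine similarity on its own, giving three similarity scores per story pair.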

  10. Election News • Gamma similarities stay below 0.2, while beta spreads out (2 graphs): use alpha + gamma.

  11. Legal/Criminal Cases • Gamma stays below 0.4, beta above 0.4: use beta + alpha.

  12. Financial News • Neither beta nor gamma is decisive: use alpha only.
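The observations on slides 10-12 amount to a small rule set over the three similarity scores. A sketch of one way such rules might combine them (the simple averaging is my reading of the slides, not the paper's exact formula):

```python
def combined_score(category, sim_alpha, sim_beta, sim_gamma):
    """Choose which representations to trust by story category:
    elections      -> non-entity terms discriminate: use alpha + gamma;
    legal/criminal -> named entities discriminate:   use beta + alpha;
    anything else  -> neither view is decisive:      fall back to alpha."""
    if category == "elections":
        return (sim_alpha + sim_gamma) / 2.0
    if category == "legal":
        return (sim_alpha + sim_beta) / 2.0
    return sim_alpha
```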

  13. Term scores and categories • (Table 4)

  14. Experimental Results • Results are worse on TDT4. • TDT4 may contain topics not conducive to named-entity-based modification.

  15. DET Curve of TDT3 • Focus on the high accuracy area.

  16. DET Curve of TDT4

  17. Conclusion and Future Work • Presented a new multi-stage system for NED. • Showed a way to harness the named entities in documents, and illustrated their utility in different situations. • Future work: improve named-entity rules; different ways to develop stop lists for different categories; temporal information.
