
Text Classification and Named Entities for New Event Detection





  1. Text Classification and Named Entities for New Event Detection Giridhar Kumaran and James Allan University of Massachusetts Amherst SIGIR 2004

  2. Introduction • New Event Detection (NED) is one of the tasks in the TDT program. (http://www.nist.gov/speech/tests/tdt/index.htm) • The vector space model has achieved the best results to date. • Goal: better similarity metrics and document representations.

  3. Previous Research • Increasing the number of features. • Weighting event-level features more heavily than more general topic-level features. • Lexical chains (using WordNet). • A combined NED and tracking system. • Named entities re-weighted and a stop list created for each topic. • Incremental TF-IDF.
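Of the prior approaches listed, incremental TF-IDF is the most mechanical: document frequencies are updated as stories arrive, so IDF reflects only the part of the stream seen so far. A minimal sketch of that idea (the class name and smoothing constants are my own, not from the cited work):

```python
import math
from collections import Counter

class IncrementalTfIdf:
    """TF-IDF where document frequencies grow as the stream is processed."""

    def __init__(self):
        self.df = Counter()   # term -> number of documents containing it so far
        self.n_docs = 0

    def add(self, tokens):
        """Register a processed story; each term counts once per document."""
        self.n_docs += 1
        self.df.update(set(tokens))

    def vector(self, tokens):
        """TF-IDF weights for a story, using only the stream seen so far.
        The +1 / +0.5 smoothing keeps unseen terms finite (an assumption)."""
        tf = Counter(tokens)
        return {t: c * math.log((self.n_docs + 1) / (self.df[t] + 0.5))
                for t, c in tf.items()}
```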

  4. NED Evaluation • The NED algorithm assigns each story a confidence score between 0 and 1, either immediately or after a look-ahead period. • 0 → new, 1 → old. • A threshold is chosen to minimize the detection cost. • A Detection Error Tradeoff (DET) curve represents the miss and false-alarm rates.
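The thresholding step can be sketched as a cost minimization over miss and false-alarm rates. The constants below (C_miss = 1, C_fa = 0.1, P_target = 0.02) are the commonly used TDT evaluation parameters, assumed here rather than taken from the slides:

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """TDT-style detection cost; default constants are the usual TDT
    parameters (an assumption here, not stated on the slides)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def best_threshold(scores_labels, thresholds):
    """Pick the threshold with least cost.
    scores_labels: (score, is_old) pairs; score near 1 means 'old'."""
    best = None
    for th in thresholds:
        # Miss: a new story scored >= threshold (wrongly declared old).
        miss = sum(1 for s, old in scores_labels if not old and s >= th)
        # False alarm: an old story scored < threshold (wrongly declared new).
        fa = sum(1 for s, old in scores_labels if old and s < th)
        n_new = sum(1 for _, old in scores_labels if not old) or 1
        n_old = sum(1 for _, old in scores_labels if old) or 1
        cost = detection_cost(miss / n_new, fa / n_old)
        if best is None or cost < best[1]:
            best = (th, cost)
    return best
```

Sweeping the threshold over many values and plotting the two error rates at each point traces out the DET curve the slides refer to.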

  5. Basic Model • Cosine similarity
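A minimal sketch of the basic model: score each incoming story by cosine similarity against all earlier stories, and report high novelty when nothing similar has been seen. Vectors are sparse dicts of term weights; the exact weighting scheme is left to the caller:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty(story_vec, past_vecs):
    """NED score: 1 when the story resembles no earlier story,
    near 0 when some earlier story is almost identical."""
    if not past_vecs:
        return 1.0
    return 1.0 - max(cosine(story_vec, p) for p in past_vecs)
```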

  6. Modified Model • Cosine similarity is good, but makes mistakes. • The level a story occupies in a hierarchy of events is of interest. • Look into other parameters: the category, the overlap of named entities, and the overlap of non-named-entity terms. • Develop simple rules that reflect the questions a human would ask before deciding whether a story is new or old.

  7. Modification to document model • Terms: in health-care stories, drugs, cost, coverage, plan, and prescription recur across topics, vs. the locations and individuals that distinguish them. • Solution: first place stories into broad categories, then compute term weights per category. • Categories are the topic types specified by the LDC; stories are classified accordingly. • Train on TDT2, test on TDT3.
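One way to realize category-conditioned term weighting is to keep separate document-frequency counts per category, so a term that is common within a category (e.g. "drugs" in health care) loses weight there while discriminative terms keep it. A sketch of that idea, with the classifier itself left out (in the paper it is trained on LDC topic types over TDT2); the class name and smoothing are assumptions:

```python
import math
from collections import Counter, defaultdict

class CategoryTfIdf:
    """Per-category document frequencies: terms common inside a category
    get low weight there; terms rare in the category stay discriminative."""

    def __init__(self):
        self.df = defaultdict(Counter)   # category -> term -> doc frequency
        self.n = Counter()               # category -> document count

    def add(self, category, tokens):
        """Register a story already assigned to a broad category."""
        self.n[category] += 1
        self.df[category].update(set(tokens))

    def vector(self, category, tokens):
        """Term weights computed against the story's own category."""
        tf = Counter(tokens)
        n = self.n[category] + 1
        return {t: c * math.log(n / (self.df[category][t] + 0.5))
                for t, c in tf.items()}
```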

  8. Modification to Similarity Metric • Isolate the named entities and treat them preferentially (nothing new). • Named entities are a double-edged sword; deciding when to use them can be tricky.

  9. Multiple document representations • Alpha: all terms. • Beta: only named entities. • Gamma: non-named-entity terms. • Entity types: Event, GPE (Geographical and Political Entities), Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time.
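Building the three representations is a straightforward split on named-entity tags. A sketch, assuming some tagger has already labeled tokens with types such as Person or GPE (the `entity_tags` mapping is hypothetical; the slides do not specify the tagger's interface):

```python
def split_representations(tokens, entity_tags):
    """Build the three views of a story:
    alpha = all terms, beta = named entities only, gamma = the rest.
    entity_tags maps a token to its entity type (Person, GPE, Date, ...)."""
    alpha = list(tokens)
    beta = [t for t in tokens if t in entity_tags]
    gamma = [t for t in tokens if t not in entity_tags]
    return alpha, beta, gamma
```

Each of the three token lists can then be weighted and compared with cosine similarity on its own, giving three similarity scores per story pair.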

  10. Election News • Gamma similarities stay below 0.2, while beta spreads out (2 graphs): use alpha + gamma.

  11. Legal/Criminal Cases • Gamma stays below 0.4, beta above 0.4: use beta + alpha.

  12. Financial News • Neither beta nor gamma is decisive: use alpha only.
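The observations on slides 10-12 amount to a small rule set over the three similarity scores. A sketch of one way such rules might combine them (the simple averaging is my reading of the slides, not the paper's exact formula):

```python
def combined_score(category, sim_alpha, sim_beta, sim_gamma):
    """Choose which representations to trust by story category:
    elections      -> non-entity terms discriminate: use alpha + gamma;
    legal/criminal -> named entities discriminate:   use beta + alpha;
    anything else  -> neither view is decisive:      fall back to alpha."""
    if category == "elections":
        return (sim_alpha + sim_gamma) / 2.0
    if category == "legal":
        return (sim_alpha + sim_beta) / 2.0
    return sim_alpha
```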

  13. Term scores and categories • (Table 4)

  14. Experimental Results • Results are worse on TDT4. • TDT4 may contain topics not conducive to named-entity-based modification.

  15. DET Curve of TDT3 • Focus on the high accuracy area.

  16. DET Curve of TDT4

  17. Conclusion and Future Work • Presented a new multi-stage system for NED. • Showed a way to harness the named entities in documents, and illustrated their utility in different situations. • Future work: improve named-entity rules; different ways to develop stop lists for different categories; temporal information.
