CMU TDT Report TIDES PI Meeting 2002

The CMU TDT Team: Jaime Carbonell, Yiming Yang, Ralf Brown, Jian Zhang, Nianli Ma, Chun Jin
Language Technologies Institute, CMU

Time Line for TDT Activities
• Restarted TDT: Summer 2001
• Tasks: FSD, SLD, Detection
• New Techniques: Nov 2001 – Present
  • Topic-conditional Novelty (FSD)
  • Situated NEs (all tasks)
  • Source-conditional interpolated training (SLD)
• Evaluations
  • TDT: Oct 2001, July 2002
  • New FSD (internal): July 2002 (KDD Conference)

2002 Dry Run Results: DET
[1] Using our Mandarin-to-English EBMT, with our story boundaries replaced by Systran’s boundaries.
[2] Using our dictionary-based Arabic-to-English translation with our own boundaries, so the evaluation boundaries and our results do not match.
[3] Using our dictionary-based Arabic-to-English translation, with our boundaries replaced by Systran’s boundaries.

Baseline FSD Method
• (Unconditional) Dissimilarity with Past
• Decision threshold on most-similar story
• (Linear) temporal decay
• Length-filter (for teasers)
• Cosine similarity with standard weights
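The baseline components listed on this slide can be sketched as follows. This is a minimal illustration, not the CMU system: the term weighting (log-scaled term frequency), the threshold, the decay window, and the minimum length are all assumed values, since the slide's weighting formula did not survive in this transcript.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words vector with log-scaled term frequency (assumed weighting)."""
    counts = Counter(text.lower().split())
    return {term: 1.0 + math.log(c) for term, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def first_story_flags(stories, threshold=0.3, window=100, min_len=5):
    """Return one bool per story (in arrival order): True = flagged as a first story.

    threshold, window and min_len are hypothetical parameters.
    """
    past = []   # (arrival index, vector) of stories kept for comparison
    flags = []
    for i, text in enumerate(stories):
        # Length filter: very short stories (teasers) are never flagged
        # and are not kept as comparison points.
        if len(text.split()) < min_len:
            flags.append(False)
            continue
        vec = tf_vector(text)
        best = 0.0
        for j, old_vec in past:
            # Linear temporal decay: older stories count for less.
            decay = max(0.0, 1.0 - (i - j) / window)
            best = max(best, decay * cosine(vec, old_vec))
        # Decision threshold on the most similar past story.
        flags.append(best < threshold)
        past.append((i, vec))
    return flags
```

A story is flagged as new when even its most similar predecessor, after decay, falls below the threshold; raising the threshold flags more stories as first stories.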

2002 Dry Run Results: FSD

2002 Dry Run DET: CMU-FSD

FSD Observations
• Baselines are comparable across sites (cost = .7)
• “Events-vs-Topics” issue (e.g. Asia crisis)
• A few mislabeled stories wreak havoc for FSD
• Eager auto-segmentation is a problem (misses)
• Recommendations for TDT labeling
  • FSD on true events, or events within topic(s)
  • Change the auto-segmentation optimality criterion?
• Recommendations for TDT researchers
  • Keep working hard on FSD – not cracked yet
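The slide quotes a cost of about .7 without defining it. Assuming it refers to the normalized detection cost used in TDT evaluations, a weighted combination of miss and false-alarm rates, the computation might look like the sketch below. The default constants (`c_miss`, `c_fa`, `p_target`) are values commonly quoted for TDT and should be treated as assumptions here, since the slides do not state them.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Raw detection cost: weighted sum of miss and false-alarm probabilities.

    The default constants are assumptions, not values taken from the slides.
    """
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Divide by the cost of the better trivial system
    (answer 'no' to everything, or 'yes' to everything)."""
    raw = detection_cost(p_miss, p_fa, c_miss, c_fa, p_target)
    trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return raw / trivial
```

Under this normalization a value of 1.0 matches the better trivial system, so a cost near .7 would indicate only a modest improvement over it.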

New FSD Directions
• Topic-conditional models
  • E.g. “airplane,” “investigation,” “FAA,” “FBI,” “casualties” → topic, not event
  • “TWA 800,” “March 12, 1997” → event
  • First categorize into topic, then use maximally-discriminative terms within topic
• Rely on situated named entities
  • E.g. “Arcan as victim,” “Sharon as peacemaker”
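The topic-conditional idea above (categorize first, then compare only on terms that discriminate within the topic) can be illustrated with a small sketch. Everything concrete in it is hypothetical: the `TOPIC_TERMS` vocabularies, the keyword-count classifier, and the threshold are stand-ins chosen for illustration, not the models used by the CMU team.

```python
import math
from collections import Counter

# Hypothetical topic vocabularies. These terms identify the topic, so they
# are treated as topic-specific stop words when comparing events.
TOPIC_TERMS = {
    "airplane accidents": {"airplane", "plane", "crash", "investigation", "faa", "casualties"},
    "bombings": {"bomb", "bombing", "explosion", "blast", "fbi"},
}


def tokens(text):
    return [w.strip(".,\"'") for w in text.lower().split()]


def assign_topic(words):
    """Toy classifier: pick the topic whose vocabulary matches the most words."""
    scores = {t: sum(w in vocab for w in words) for t, vocab in TOPIC_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def two_level_fsd(stories, threshold=0.3):
    """Return (topic, is_first_story) for each story in arrival order."""
    history = {}   # topic -> vectors of earlier stories in that topic
    results = []
    for text in stories:
        words = tokens(text)
        topic = assign_topic(words)                      # level 1: topic
        stop = TOPIC_TERMS.get(topic, set())
        vec = Counter(w for w in words if w not in stop)  # drop topic-level terms
        past = history.setdefault(topic, [])
        best = max((cosine(vec, p) for p in past), default=0.0)
        results.append((topic, best < threshold))        # level 2: event novelty
        past.append(vec)
    return results
```

With topic-level words such as "plane" and "crash" removed, two different crashes no longer look alike, while follow-up stories still match on event-specific terms such as "TWA" and "800".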

Broad Topics vs Events

Two-level Scheme for FSD

Confusability between Intra-topic Events
[Figure: document similarity matrices for two topics, AIRPLANE ACCIDENTS and BOMBINGS]
• Each data point in the matrix is the similarity between the two corresponding documents.
• Documents are sorted by event (first key) and by time of arrival (second key), so the diagonal sub-matrices show intra-event document similarities, while the off-diagonal sub-matrices show inter-event document similarities.
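The matrix construction described here can be sketched in a few lines. The document representation (a set of terms) and the Jaccard similarity are stand-ins assumed for illustration; the slides do not say which similarity was plotted.

```python
def jaccard(a, b):
    """Stand-in similarity between two term sets (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorted_similarity_matrix(docs, sim=jaccard):
    """docs: list of (event_id, arrival_time, terms).

    Sort by event first and arrival time second, then compute all pairwise
    similarities, so that same-event documents form diagonal blocks.
    """
    ordered = sorted(docs, key=lambda d: (d[0], d[1]))
    matrix = [[sim(a[2], b[2]) for b in ordered] for a in ordered]
    return ordered, matrix


def mean_intra_inter(ordered, matrix):
    """Average similarity inside diagonal blocks vs. outside them."""
    intra, inter = [], []
    for i, a in enumerate(ordered):
        for j, b in enumerate(ordered):
            if i == j:
                continue  # skip self-similarity
            (intra if a[0] == b[0] else inter).append(matrix[i][j])
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra), mean(inter)
```

The closer the inter-event average gets to the intra-event average, the more confusable the events within a topic are.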

Measuring Effectiveness of NEs
[1] f denotes a named entity; S_k denotes the k-th of the seven NE types.
[2] The effectiveness of each NE type measures how well that type differentiates intra-topic events.

Effectiveness of Named Entities

Experimental Design
• Baseline: conventional FSD
• Simple case: two-level FSD with “perfect” topic labels
• Ideal case: two-level FSD with “perfect” topic labels, NE weighting, and removal of topic-specific stop words
• Real case: the same as the Ideal case, except that it uses system-predicted topic labels

Data Description
• Broadcast News, published by Primary Source Media (PSM)
  • 261,209 transcripts of news articles from ABC, CNN, NPR and MSNBC, covering 1992 to 1998
• Document structure: each document (story) is composed of several fields, such as Title, Topic, Keywords, Date, Abstract and Body
• (Training) topic labels provided by PSM (4 topics)
  • Airplane accidents, bombings, tornadoes, hijackings
• CMU students labeled 36 events within the 4 topics (divided into 50% training and 50% test)

Results for Topic-Conditioned FSD

Confusability Reduction (5 events within the topic “airplane accidents” in the test data)
NOTE:
• These graphs contain only test data (5 events for the topic “airplane accidents”)
• The left graph is the Baseline, and the right one is the Ideal Case.

Topic-Conditioned Approach to First Story Detection for TDT
