
Overview of the TDT 2001 Evaluation and Results

Jonathan Fiscus

Gaithersburg Holiday Inn

Gaithersburg, Maryland

November 12-13, 2001



Outline

  • TDT Evaluation Overview

  • 2001 TDT Evaluation Result Summaries

    • First Story Detection (FSD)

    • Topic Detection

    • Topic Tracking

    • Link Detection

  • Other Investigations

www.nist.gov/TDT



TDT 101

“Applications for organizing text”

Terabytes of unorganized data

  • 5 TDT Applications

    • Story Segmentation

    • Topic Tracking

    • Topic Detection

    • First Story Detection

    • Link Detection


TDT’s Research Domain

  • Technology challenge

    • Develop applications that organize and locate relevant stories from a continuous feed of news stories

  • Research driven by evaluation tasks

  • Composite applications built from

    • Automatic Speech Recognition

    • Story Segmentation

    • Document Retrieval


Definitions

A topic is …

a seminal event or activity, along with all directly related events and activities.

A story is …

a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.


Example Topic

Title: Mountain Hikers Lost

  • WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.

  • WHERE: Orres, France

  • WHEN: January 1998

  • RULES OF INTERPRETATION:

    • Rule #5. Accidents


TDT 2001 Evaluation Corpus

  • TDT3 + Supplemental Corpora used for the evaluation*†

    • TDT3 Corpus

      • Third consecutive use for evaluations

      • XXX stories, 4th Qtr. 1998

      • Used for Tracking and Link Detection development test

    • Supplement of 35K stories added to TDT3

      • No annotations

      • Data added from both 3rd and 4th Qtr. 1998

      • Not used for FSD tests

  • LDC Annotations †

    • 120 fully annotated topics: divided into published and withheld sets

    • 120 partially annotated topics

    • FSD used all 240 topics

    • Topic Detection used the 120 fully annotated topics

    • Tracking and Link Detection used the 60 fully annotated withheld topics

* see www.nist.gov/speech/tests/tdt/tdt2001 for details

† see www.ldc.upenn.edu/Projects/TDT3/ for details


TDT3 Topic Division

TDT 2000 Systems

  • Two topic sets:

    • Published topics

    • Withheld topics

  • Selection criteria:

    • 60 topics per set

      • 30 of the 1999 topics

      • 30 of the 2000 topics

    • Balanced by number of on-topic stories


TDT Evaluation Methodology

  • Evaluation tasks are cast as detection tasks:

    • YES there is a target, or NO there is not

  • Performance is measured in terms of detection cost:

    “a weighted sum of missed detection and false alarm probabilities”
    CDet = CMiss • PMiss • Ptarget + CFA • PFA • (1 - Ptarget)

    • CMiss = 1 and CFA = 0.1 are preset costs

    • Ptarget = 0.02 is the a priori probability of a target

  • Detection Cost is normalized to generally lie between 0 and 1: (CDet)Norm = CDet / min{CMiss • Ptarget, CFA • (1 - Ptarget)}

  • When based on the YES/NO decisions, it is referred to as the actual decision cost

  • Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between PMiss and PFA

    • Makes use of likelihood scores attached to the YES/NO decisions

    • Minimum DET point is the best score a system could achieve with proper thresholds
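As a concrete check of the cost formulas above, here is a minimal Python sketch using the preset values from this slide (CMiss = 1, CFA = 0.1, Ptarget = 0.02); the function names are illustrative:

```python
# Sketch of the TDT detection cost, using the preset values from the slide.
C_MISS, C_FA = 1.0, 0.1   # preset miss / false-alarm costs
P_TARGET = 0.02           # a priori probability of a target

def detection_cost(p_miss: float, p_fa: float) -> float:
    """CDet = CMiss * PMiss * Ptarget + CFA * PFA * (1 - Ptarget)."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)

def normalized_detection_cost(p_miss: float, p_fa: float) -> float:
    """Divide by the better trivial system's cost, so it scores exactly 1.0."""
    floor = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))
    return detection_cost(p_miss, p_fa) / floor

# A system that always answers NO misses every target (PMiss=1, PFA=0)
# and gets normalized cost 1.0:
print(normalized_detection_cost(1.0, 0.0))  # 1.0
```

With these constants the always-NO system is the cheaper trivial baseline (0.02 vs. 0.098 for always-YES), which is why it defines the normalization floor.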


    TDT: Experimental Control

    • Good research requires experimental controls

    • Conditions that affect performance in TDT

      • Newswire vs. Broadcast News

      • Manual vs. automatic transcription of Broadcast News

      • Manual vs. automatic story segmentation

      • Monolingual vs. multilingual source material

      • Topic training amounts and languages

      • Default automatic English translations of Mandarin vs. native Mandarin orthography

      • Decision deferral periods


    Outline

    • TDT Evaluation Overview

    • 2001 TDT Evaluation Result Summaries

      • First Story Detection (FSD)

      • Topic Detection

      • Topic Tracking

      • Link Detection

    • Other Investigations


    First Story Detection Results

    [Figure: a story stream marking the first stories on two topics, as distinct from later, not-first stories]

    System Goal:

    • To detect the first story that discusses each topic

      • Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster


    2001 TDT Primary FSD Results: Newswire + BNews ASR, English texts, automatic story boundaries, 10-file deferral


    TDT Topic Detection Task

    System Goal:

    • To detect topics in terms of the (clusters of) stories that discuss them.

      • “Unsupervised” topic training

      • New topics must be detected as the incoming stories are processed.

      • Input stories are then associated with one of the topics.

    [Figure: a story stream being clustered into Topic 1 and Topic 2]
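To make the clustering idea concrete, here is a deliberately simplified online topic detection sketch: each incoming story joins its most similar existing cluster or, failing that, starts a new topic. The bag-of-words cosine similarity and the 0.2 threshold are illustrative assumptions, not the method of any evaluated system.

```python
# Hypothetical online clustering sketch: nearest cluster or new topic.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect_topics(stories, threshold=0.2):
    clusters = []          # one word-count centroid per detected topic
    labels = []            # cluster id assigned to each story, in order
    for text in stories:
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in clusters]
        if sims and max(sims) >= threshold:
            best = sims.index(max(sims))
            clusters[best] += vec           # fold the story into the cluster
        else:
            clusters.append(vec)            # first story on a new topic
            best = len(clusters) - 1
        labels.append(best)
    return labels
```

On three toy stories the two avalanche stories share a cluster and the third opens a new topic: `detect_topics(["avalanche hikers lost france", "hikers lost avalanche", "election results announced"])` returns `[0, 0, 1]`.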



    Primary Topic Detection Systems: Newswire + BNasr, Multilingual, Auto Boundaries, Deferral = 10

    [DET plot: curves for native Mandarin and translated Mandarin]


    Topic Tracking Task

    [Figure: a story stream divided into on-topic training data and test data with unknown labels]

    System Goal:

    • To detect stories that discuss the target topic, in multiple source streams.

      • Supervised Training

        • Given Nt sample stories that discuss a given target topic

      • Testing

        • Find all subsequent stories that discuss the target topic
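A minimal sketch of this supervised setup, again under illustrative assumptions (word-count cosine, a 0.3 threshold): build one centroid from the Nt training stories and emit a YES/NO decision per test story.

```python
# Hypothetical tracking sketch: centroid of training stories + threshold.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def track(training_stories, test_stories, threshold=0.3):
    centroid = Counter()                     # Nt = len(training_stories)
    for text in training_stories:
        centroid += Counter(text.lower().split())
    # YES (True) when a test story is similar enough to the topic centroid
    return [cosine(Counter(t.lower().split()), centroid) >= threshold
            for t in test_stories]
```

With Nt = 1, `track(["hikers lost in avalanche"], ["avalanche hikers found", "stock market rises"])` returns `[True, False]`.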


    Primary Tracking Results: Newswire + BNman, English Training: 1 Positive, 0 Negative


    TDT Link Detection Task

    System Goal:

    • To detect whether a pair of stories discuss the same topic.

      (Can be thought of as a “primitive operator” to build a variety of applications)



    Primary Link Detection Results: Newswire + BNasr, Deferral = 10

    NTU’s thresholding is unusual

    [DET plot: curves for native Mandarin and translated Mandarin]


    Outline

    • TDT Evaluation Overview

    • 2001 TDT Evaluation Result Summaries

      • First Story Detection (FSD)

      • Topic Detection

      • Topic Tracking

      • Link Detection

    • Other Investigations


    Primary Topic Detection Systems: Newswire + BNasr, Multilingual, Auto Boundaries, Deferral = 10


    Topic Detection: False Alarm Visualization

    • Systems behave very differently
    • IMHO, a user would not want to use a system with a high false alarm rate
    • Perhaps false alarms should get more weight in the cost function
    • Outer circle: number of stories in a cluster
      • Light => cluster was mapped to a reference topic
      • Blue => unmapped cluster
    • Inner circle: number of on-topic stories

    [Plots for UMass1 and TNO1-late: system clusters ordered by size vs. topic ID]



    Topic Detection: 2000 vs. 2001 Index Files. Multilingual Text, Newswire + Broadcast News, Auto Boundaries, Deferral = 10

    • The 2000 test corpus covered 3 months

    • The 2001 corpus covered 6 months

      • 35K more stories

    • The larger corpus might affect performance, but it appears not to.


    Topic Detection Evaluation via a Link-Style Metric

    • Motivation:

      • Measured performance is unstable during system tuning

      • Likely to be a direct result of the need to map reference topic clusters to system-defined clusters

      • We would like to avoid the assumption of independent topics


    Topic Detection Evaluation via a Link-Style Metric

    • Evaluation Criterion: “Does this pair of stories discuss the same topic?”

      • If a story pair is on the same topic

        • A missed detection is declared if the system put the stories in different clusters

        • Otherwise, it’s a correct detection

      • If a pair of stories is not on the same topic

        • A false alarm is declared if the system put the stories in the same cluster

        • Otherwise, it’s a correct non-detection
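The four pairwise outcomes above can be counted directly from parallel lists of reference and system cluster labels; a small sketch (names are illustrative):

```python
# Count pairwise link-style outcomes from reference vs. system clusterings.
from itertools import combinations

def link_style_counts(reference, system):
    """reference[i] and system[i] are the cluster ids of story i."""
    miss = fa = correct_det = correct_non = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_sys = system[i] == system[j]
        if same_ref and not same_sys:
            miss += 1            # on-topic pair split across clusters
        elif same_ref:
            correct_det += 1     # on-topic pair kept together
        elif same_sys:
            fa += 1              # off-topic pair merged into one cluster
        else:
            correct_non += 1     # off-topic pair correctly kept apart
    return miss, fa, correct_det, correct_non
```

For three stories where the system splits an on-topic pair and merges an off-topic one, `link_style_counts([0, 0, 1], [0, 1, 1])` returns `(1, 1, 0, 1)`.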


    Link-Based vs. Topic Detection Metrics: Parameter Optimization Sweep

    • System 1: 62K test stories, 98 topics
    • System 2: 27K test stories, 31 topics
    • The link curve is less erratic for System 1
    • The link curve is higher: what does this mean?


    What can be learned?

    • Are all the experimental controls necessary?

      • Tracking performance degrades 50% going from manual to automatic transcription, and an additional 50% going to automatic boundaries

      • Cross-language issues still not solved

      • Most systems used only the required deferral period

    • Progress was modest: did the lack of a new evaluation corpus impede research?


    Summary

    • TDT Evaluation Overview

    • 2001 TDT Evaluation Results

    • Evaluating Topic Detection with the Link-based metric is feasible, but inconclusive

    • The TDT3 corpus annotations are now public!
