Overview of the TDT 2001 Evaluation and Results
Overview of the TDT 2001 Evaluation and Results. Jonathan Fiscus Gaithersburg Holiday Inn Gaithersburg, Maryland November 12-13, 2001. Outline. TDT Evaluation Overview 2001 TDT Evaluation Result Summaries First Story Detection (FSD) Topic Detection Topic Tracking Link Detection

Presentation Transcript

Overview of the TDT 2001 Evaluation and Results

Jonathan Fiscus

Gaithersburg Holiday Inn

Gaithersburg, Maryland

November 12-13, 2001


Outline

  • TDT Evaluation Overview

  • 2001 TDT Evaluation Result Summaries

    • First Story Detection (FSD)

    • Topic Detection

    • Topic Tracking

    • Link Detection

  • Other Investigations

www.nist.gov/TDT


TDT 101

“Applications for organizing text”

Terabytes of Unorganized data

  • 5 TDT Applications

    • Story Segmentation

    • Topic Tracking

    • Topic Detection

    • First Story Detection

    • Link Detection

TDT’s Research Domain

  • Technology challenge

    • Develop applications that organize and locate relevant stories from a continuous feed of news stories

  • Research driven by evaluation tasks

  • Composite applications built from

    • Automatic Speech Recognition

    • Story Segmentation

    • Document Retrieval

Definitions

A topic is …

a seminal event or activity, along with all directly related events and activities.

A story is …

a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

Example Topic

Title: Mountain Hikers Lost

  • WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.

  • WHERE: Orres, France

  • WHEN: January 1998

  • RULES OF INTERPRETATION:

    • Rule #5. Accidents

TDT 2001 Evaluation Corpus

  • TDT3 + Supplemental Corpora used for the evaluation*†

    • TDT3 Corpus

      • Third consecutive use for evaluations

      • XXX stories, 4th Qtr. 1998

      • Used for Tracking and Link Detection development test

    • Supplement of 35K stories added to TDT3

      • No annotations

      • Data added from both 3rd and 4th Qtr. 1998

      • Not used for FSD tests

  • LDC Annotations †

    • 120 fully annotated topics: divided into published and withheld sets

    • 120 partially annotated topics

    • FSD used all 240 topics

    • Topic Detection used the 120 fully annotated topics

    • Tracking and Link Detection used the 60 fully annotated withheld topics

* see www.nist.gov/speech/tests/tdt/tdt2001 for details

† see www.ldc.upenn.edu/Projects/TDT3/ for details

TDT3 Topic Division

TDT 2000 Systems

  • Two topic sets:

    • Published topics

    • Withheld topics

  • Selection criteria:

    • 60 topics per set

      • 30 of the 1999 topics

      • 30 of the 2000 topics

    • Balanced by number of on-topic stories

TDT Evaluation Methodology

  • Evaluation tasks are cast as detection tasks:

    • YES there is a target, or NO there is not

  • Performance is measured in terms of detection cost:

    “a weighted sum of missed detection and false alarm probabilities”

    C_Det = C_Miss · P_Miss · P_target + C_FA · P_FA · (1 - P_target)

    • C_Miss = 1 and C_FA = 0.1 are preset costs

    • P_target = 0.02 is the a priori probability of a target

  • Detection Cost is normalized to generally lie between 0 and 1:

    (C_Det)_Norm = C_Det / min{C_Miss · P_target, C_FA · (1 - P_target)}

  • When based on the YES/NO decisions, it is referred to as the actual decision cost

  • Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between P_Miss and P_FA

    • Makes use of likelihood scores attached to the YES/NO decisions

    • Minimum DET point is the best score a system could achieve with proper thresholds
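
As a rough illustration of the cost computation on this slide, the Python sketch below evaluates the detection cost, its normalized form, and a minimum-cost threshold sweep over likelihood scores. Only the constants (miss cost 1, false alarm cost 0.1, target prior 0.02) come from the slide; the function names and the toy trial list are assumptions made for this example. The sketch pools all trials into one set, which may differ from how the official scoring averages over topics.

```python
# Sketch of the TDT detection cost; constants are the preset values from the slide.
C_MISS = 1.0
C_FA = 0.1
P_TARGET = 0.02


def detection_cost(p_miss, p_fa):
    """Weighted sum of miss and false-alarm probabilities (the detection cost)."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)


def normalized_cost(p_miss, p_fa):
    """Detection cost divided by the cost of the cheaper trivial system
    (one that always answers YES or always answers NO)."""
    floor = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))
    return detection_cost(p_miss, p_fa) / floor


def minimum_normalized_cost(trials):
    """Sweep every candidate threshold and return the lowest normalized cost.

    `trials` is a hypothetical list of (score, is_target) pairs; a trial is
    answered YES when its score is at or above the threshold.
    Assumes at least one target and one non-target trial.
    """
    n_target = sum(1 for _, is_target in trials if is_target)
    n_nontarget = len(trials) - n_target
    thresholds = sorted({score for score, _ in trials}) + [float("inf")]
    best = float("inf")
    for threshold in thresholds:
        misses = sum(1 for s, t in trials if t and s < threshold)
        false_alarms = sum(1 for s, t in trials if not t and s >= threshold)
        cost = normalized_cost(misses / n_target, false_alarms / n_nontarget)
        best = min(best, cost)
    return best


# Made-up trials for illustration only.
toy_trials = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
print(normalized_cost(0.10, 0.01))          # actual decision cost for one operating point
print(minimum_normalized_cost(toy_trials))  # best achievable cost over all thresholds
```

A system that always answers NO has a miss probability of 1 and a false alarm probability of 0, which gives a normalized cost of exactly 1. A system that always answers YES scores 4.9 with these constants, which is why the normalized cost only "generally" lies between 0 and 1.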

    TDT: Experimental Control

    • Good research requires experimental controls

    • Conditions that affect performance in TDT

      • Newswire vs. Broadcast News

      • Manual vs. automatic transcription of Broadcast News

      • Manual vs. automatic story segmentation

      • Mono vs. multilingual language material

      • Topic training amounts and languages

      • Default automatic English translations of Mandarin vs. native Mandarin orthography

      • Decision deferral periods



    First Story Detection Results

    [Figure: story stream distinguishing first stories on two topics (Topic 1, Topic 2) from stories that are not first stories]

    System Goal:

    • To detect the first story that discusses each topic

      • Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster

    2001 TDT Primary FSD Results

    Newswire+BNews ASR, English texts, automatic story boundaries, 10 File Deferral


    TDT Topic Detection Task

    System Goal:

    • To detect topics in terms of the (clusters of) stories that discuss them.

      • “Unsupervised” topic training

      • New topics must be detected as the incoming stories are processed.

      • Input stories are then associated with one of the topics.
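
One simple way to picture this task is single-pass clustering: each incoming story either joins the most similar existing cluster or starts a new one, and starting a new cluster is the same decision that First Story Detection evaluates. The sketch below is only an illustration, not any participant's system. The bag-of-words cosine similarity, the `threshold` value, the function names, and the toy headlines are all assumptions, and the sketch ignores the deferral period.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def detect_topics(stories, threshold=0.3):
    """Process stories in arrival order.

    Returns (cluster_ids, first_story_flags); a flag is True when the story
    started a new cluster. `threshold` is an arbitrary illustrative value.
    """
    centroids = []          # one summed term-count vector per cluster
    cluster_ids = []
    first_story_flags = []
    for text in stories:
        vector = bag_of_words(text)
        scores = [cosine(vector, centroid) for centroid in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            centroids.append(Counter(vector))   # new topic detected
            cluster_ids.append(len(centroids) - 1)
            first_story_flags.append(True)
        else:
            centroids[best].update(vector)      # fold story into existing topic
            cluster_ids.append(best)
            first_story_flags.append(False)
    return cluster_ids, first_story_flags


# Made-up headlines for illustration only.
toy_stream = [
    "avalanche buries hikers in france",
    "hikers lost in avalanche in france alps",
    "election results announced for taiwan",
    "taiwan election results contested",
]
cluster_ids, first_story_flags = detect_topics(toy_stream)
print(cluster_ids)
print(first_story_flags)
```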

    [Figure: story stream, with stories assigned to Topic 1 and Topic 2]


    Primary Topic Detection Sys. Newswire+BNasr, Multilingual, Auto Boundaries, Deferral=10

    [Chart labels: Mandarin Native, Translated Mandarin]


    Topic Tracking Task

    [Figure: story stream divided into training data and test data; story labels: on-topic, unknown, unknown]

    System Goal:

    • To detect stories that discuss the target topic, in multiple source streams.

      • Supervised Training

        • Given Nt sample stories that discuss a given target topic

      • Testing

        • Find all subsequent stories that discuss the target topic
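
The train-then-test flow above can be sketched as follows: the Nt sample stories are merged into a single term-count model, and each later story receives a similarity score and a YES/NO decision. This is a minimal illustration under stated assumptions, not a description of any evaluated system. The cosine scoring, the `threshold` value, the function names, and the toy stories are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def track_topic(training_stories, test_stories, threshold=0.2):
    """Return one (decision, score) pair per test story.

    `training_stories` holds the Nt on-topic samples; `threshold` is an
    arbitrary illustrative value.
    """
    topic_model = Counter()
    for text in training_stories:
        topic_model.update(bag_of_words(text))
    results = []
    for text in test_stories:
        score = cosine(bag_of_words(text), topic_model)
        results.append((score >= threshold, score))
    return results


# Made-up stories for illustration only; Nt = 1 training story here.
training = ["avalanche buries hikers near orres france"]
test_stream = [
    "rescuers search for hikers after avalanche near orres",
    "taiwan election results contested",
]
results = track_topic(training, test_stream)
print(results)
```

Returning the score next to each decision matters for scoring: the YES/NO decisions determine the actual decision cost, while the scores are what a DET curve is built from.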

    Primary Tracking Results

    Newswire+BNman, English Training: 1 Positive, 0 Negative


    TDT Link Detection Task

    System Goal:

    • To detect whether a pair of stories discuss the same topic.

      (Can be thought of as a “primitive operator” to build a variety of applications)

    Primary Link Det. Results

    Newswire+BNasr, Deferral=10

    NTU’s thresholding is unusual

    [Chart labels: Native Mandarin, Translated Mandarin]



    Primary Topic Detection Sys. Newswire+BNasr, Multilingual, Auto Boundaries, Deferral=10


    Topic Detection: False Alarm Visualization

    • Systems behave very differently

    • IMHO a user would not like to use a high FA rate system

    • Perhaps false alarms should get more weight in the cost function

    • Outer Circle: Number of stories in a cluster

      • Light => cluster was mapped to a reference topic

      • Blue => unmapped cluster

    • Inner Circle: Number of on-topic stories

    [Figures: one plot each for UMass1 and TNO1-late; axis labels: "Topic ID" and "System clusters, ordered by size"]


    Topic Detection: 2000 vs. 2001 Index Files

    Multilingual Text, Newswire + Broadcast News, Auto Boundaries, Deferral=10

    • The 2000 test corpus covered 3 months

    • The 2001 corpus covered 6 months

      • 35K more stories

    • Might affect performance, BUT appears not to.

    Topic Detection Evaluation via a Link-Style Metric

    • Motivation:

      • There is instability of measured performance during system tuning

      • Likely to be a direct result of the need to map reference topic clusters to system-defined clusters

      • We would like to avoid the assumption of independent topics

    Topic Detection Evaluation via a Link-Style Metric

    • Evaluation Criterion: “Does this pair of stories discuss the same topic?”

      • If a story pair is on the same topic

        • A missed detection is declared if the system put the stories in different clusters

        • Otherwise, it’s a correct detection

      • If a pair of stories is not on the same topic

        • A false alarm is declared if the system put the stories in the same cluster

        • Otherwise, it’s a correct non-detection
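
The pairwise criterion above can be sketched in a few lines of Python. The function name, the dictionary inputs, and the toy labels are assumptions made for this example, and the sketch assumes every story carries exactly one reference topic and one system cluster. It enumerates all story pairs, which is fine for a toy set but would need a smarter count on a full corpus.

```python
from itertools import combinations


def link_style_rates(reference_topic, system_cluster):
    """Return (miss rate, false alarm rate) over all story pairs.

    Both arguments are hypothetical dicts mapping a story id to a label.
    """
    same_topic_pairs = different_topic_pairs = 0
    misses = false_alarms = 0
    for a, b in combinations(sorted(reference_topic), 2):
        same_topic = reference_topic[a] == reference_topic[b]
        same_cluster = system_cluster[a] == system_cluster[b]
        if same_topic:
            same_topic_pairs += 1
            if not same_cluster:
                misses += 1          # on-topic pair split across clusters
        else:
            different_topic_pairs += 1
            if same_cluster:
                false_alarms += 1    # off-topic pair merged into one cluster
    p_miss = misses / same_topic_pairs if same_topic_pairs else 0.0
    p_fa = false_alarms / different_topic_pairs if different_topic_pairs else 0.0
    return p_miss, p_fa


# Made-up labels for illustration only.
reference = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
system = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}
p_miss, p_fa = link_style_rates(reference, system)
print(p_miss, p_fa)
```

Because only pairwise agreement is counted, no mapping between reference topics and system-defined clusters is needed.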

    Link-Based vs. Topic Detection Metrics: Parameter Optimization Sweep

    • System 1: 62K Test Stories, 98 Topics

    • System 2: 27K Test Stories, 31 Topics

    • The link curve is less erratic for System 1

    • Link curve is higher: What does this mean?


    What can be learned?

    • Are all the experimental controls necessary?

      • Tracking performance degrades 50% going from manual to automatic transcription, and an additional 50% going to automatic boundaries

      • Cross-language issues still not solved

      • Most systems used only the required deferral period

    • Progress was modest: did the lack of a new evaluation corpus impede research?

    Summary

    • TDT Evaluation Overview

    • 2001 TDT Evaluation Results

    • Evaluating Topic Detection with the Link-based metric is feasible, but inconclusive

    • The TDT3 corpus annotations are now public!
