Presentation Transcript

Overview of the TDT 2001 Evaluation and Results

Jonathan Fiscus

Gaithersburg Holiday Inn

Gaithersburg, Maryland

November 12-13, 2001

Outline
  • TDT Evaluation Overview
  • 2001 TDT Evaluation Result Summaries
    • First Story Detection (FSD)
    • Topic Detection
    • Topic Tracking
    • Link Detection
  • Other Investigations

TDT 101

“Applications for organizing text”

Terabytes of unorganized data

  • 5 TDT Applications
    • Story Segmentation
    • Topic Tracking
    • Topic Detection
    • First Story Detection
    • Link Detection

TDT’s Research Domain
  • Technology challenge
    • Develop applications that organize and locate relevant stories from a continuous feed of news stories
  • Research driven by evaluation tasks
  • Composite applications built from
    • Automatic Speech Recognition
    • Story Segmentation
    • Document Retrieval

Definitions

A topic is …

a seminal event or activity, along with all directly related events and activities.

A story is …

a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

Example Topic

Title: Mountain Hikers Lost

  • WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.
  • WHERE: Orres, France
  • WHEN: January 1998
  • RULES OF INTERPRETATION:
    • Rule #5. Accidents

TDT 2001 Evaluation Corpus
  • TDT3 + Supplemental Corpora used for the evaluation*†
    • TDT3 Corpus
      • Third consecutive use for evaluations
      • XXX stories, 4th Qtr. 1998
      • Used for Tracking and Link Detection development test
    • Supplement of 35K stories added to TDT3
      • No annotations
      • Data added from both 3rd and 4th Qtr. 1998
      • Not used for FSD tests
  • LDC Annotations †
    • 120 fully annotated topics: divided into published and withheld sets
    • 120 partially annotated topics
    • FSD used all 240 topics
    • Topic Detection used the 120 fully annotated topics
    • Tracking and Link Detection used the 60 fully annotated withheld topics

* see www.nist.gov/speech/tests/tdt/tdt2001 for details

† see www.ldc.upenn.edu/Projects/TDT3/ for details

TDT3 Topic Division


  • Two topic sets:
    • Published topics
    • Withheld topics
  • Selection criteria:
    • 60 topics per set
      • 30 of the 1999 topics
      • 30 of the 2000 topics
    • Balanced by number of on-topic stories

TDT Evaluation Methodology
  • Evaluation tasks are cast as detection tasks:
    • YES there is a target, or NO there is not
  • Performance is measured in terms of detection cost:

“a weighted sum of missed detection and false alarm probabilities”

C_Det = C_Miss · P_Miss · P_target + C_FA · P_FA · (1 − P_target)

      • C_Miss = 1 and C_FA = 0.1 are preset costs
      • P_target = 0.02 is the a priori probability of a target
    • Detection cost is normalized to generally lie between 0 and 1:

(C_Det)_Norm = C_Det / min{C_Miss · P_target, C_FA · (1 − P_target)}
    • When based on the YES/NO decisions, it is referred to as the actual decision cost
  • Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between PMiss and PFA
    • Makes use of likelihood scores attached to the YES|NO decisions
    • Minimum DET point is the best score a system could achieve with proper thresholds
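To make the cost arithmetic concrete, here is a minimal Python sketch of the normalized detection cost defined above, using the preset costs and prior from this slide; the function names are illustrative, not part of NIST's official scoring software.

```python
# Minimal sketch of the TDT detection cost, with the preset
# values from the slide: C_Miss = 1, C_FA = 0.1, P_target = 0.02.
C_MISS, C_FA, P_TARGET = 1.0, 0.1, 0.02

def detection_cost(p_miss, p_fa):
    """Weighted sum of miss and false alarm probabilities."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1 - P_TARGET)

def normalized_cost(p_miss, p_fa):
    """Scale so that a trivial all-YES or all-NO system scores 1.0."""
    return detection_cost(p_miss, p_fa) / min(C_MISS * P_TARGET,
                                              C_FA * (1 - P_TARGET))

# Example: P_miss = 0.10, P_fa = 0.01 gives (C_Det)_Norm = 0.149
print(normalized_cost(0.10, 0.01))
```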

TDT: Experimental Control
  • Good research requires experimental controls
  • Conditions that affect performance in TDT
    • Newswire vs. Broadcast News
    • Manual vs. automatic transcription of Broadcast News
    • Manual vs. automatic story segmentation
    • Monolingual vs. multilingual source material
    • Topic training amounts and languages
    • Default automatic English translations of Mandarin vs. native Mandarin orthography
    • Decision deferral periods

Outline
  • TDT Evaluation Overview
  • 2001 TDT Evaluation Result Summaries
    • First Story Detection (FSD)
    • Topic Detection
    • Topic Tracking
    • Link Detection
  • Other Investigations


First Story Detection Results

[Diagram: first stories on two topics (Topic 1, Topic 2) flagged in a story stream; all later stories are not first stories]

System Goal:

  • To detect the first story that discusses each topic
    • Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster
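A common baseline for this task is novelty detection: flag a story as a first story when it is not sufficiently similar to anything seen before. The sketch below is illustrative only; the threshold value and the batch TF-IDF fit are assumptions, not a reconstruction of any evaluated system.

```python
# Illustrative first-story detection baseline: flag a story as a
# first story when its best cosine similarity to every earlier
# story falls below a threshold. A true on-line system could not
# vectorize the whole stream up front; this is a sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def first_story_flags(stories, threshold=0.2):
    """Return one YES/NO (True/False) first-story decision per story."""
    tfidf = TfidfVectorizer().fit_transform(stories)
    flags = [True]  # the first story in the stream is always new
    for i in range(1, tfidf.shape[0]):
        sims = cosine_similarity(tfidf[i], tfidf[:i])
        flags.append(bool(sims.max() < threshold))
    return flags
```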

2001 TDT Primary FSD Results: Newswire + BNews ASR, English Texts, Automatic Story Boundaries, 10-File Deferral


TDT Topic Detection Task

System Goal:

  • To detect topics in terms of the (clusters of) stories that discuss them.
    • “Unsupervised” topic training
    • New topics must be detected as the incoming stories are processed.
    • Input stories are then associated with one of the topics.

[Diagram: incoming story stream clustered into Topic 1 and Topic 2]
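The standard baseline here is single-pass incremental clustering: each incoming story either joins its nearest existing cluster or, failing a similarity threshold, starts a new topic. A minimal sketch under that assumption (batch TF-IDF is a simplification for brevity):

```python
# Illustrative single-pass incremental clustering for topic
# detection: each incoming story joins its most similar existing
# cluster, or starts a new cluster (a new topic) when no cluster
# clears the threshold. Not a reconstruction of any evaluated system.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def detect_topics(stories, threshold=0.25):
    """Return one cluster id per story."""
    vectors = TfidfVectorizer().fit_transform(stories).toarray()
    centroids, labels = [], []
    for vec in vectors:
        if centroids:
            sims = cosine_similarity([vec], centroids)[0]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels.append(best)
                # fold the story into the running cluster centroid
                centroids[best] = (centroids[best] + vec) / 2.0
                continue
        centroids.append(vec)  # no match: this story starts a new topic
        labels.append(len(centroids) - 1)
    return labels
```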

Primary Topic Detection Systems: Newswire + BNews ASR, Multilingual, Auto Boundaries, Deferral = 10

[DET curves; native Mandarin and translated Mandarin conditions labeled]

Topic Tracking Task

[Diagram: training data with on-topic stories marked, followed by test data whose stories are unknown]

System Goal:

  • To detect stories that discuss the target topic, in multiple source streams.
    • Supervised Training
      • Given Nt sample stories that discuss a given target topic
    • Testing
      • Find all subsequent stories that discuss the target topic
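In its simplest form, a tracker builds one centroid from the Nt training stories and thresholds each test story's similarity to it. A hedged sketch of that supervised setup; the threshold and the plain TF-IDF centroid are assumptions, not the evaluated systems:

```python
# Illustrative topic tracker: centroid of the Nt on-topic training
# stories, then YES for any test story close enough to that centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def track_topic(train_stories, test_stories, threshold=0.2):
    """Return one YES/NO (True/False) decision per test story."""
    vec = TfidfVectorizer().fit(train_stories + test_stories)
    centroid = np.asarray(vec.transform(train_stories).mean(axis=0))
    sims = cosine_similarity(centroid, vec.transform(test_stories))[0]
    return [s >= threshold for s in sims]
```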


TDT Link Detection Task

System Goal:

  • To detect whether a pair of stories discuss the same topic.

(Can be thought of as a “primitive operator” to build a variety of applications)
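As shown in the sketch below, the simplest instance of this primitive is a thresholded cosine similarity between the two stories' term vectors. Fitting term weights on only the pair itself is a toy simplification; a real system would use a background corpus.

```python
# Illustrative link detector: YES (same topic) when the cosine
# similarity of the two stories clears a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def linked(story_a, story_b, threshold=0.2):
    tfidf = TfidfVectorizer().fit_transform([story_a, story_b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0] >= threshold
```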


Primary Link Detection Results: Newswire + BNews ASR, Deferral = 10

NTU’s thresholding is unusual

[DET curves; native Mandarin and translated Mandarin conditions labeled]

Outline
  • TDT Evaluation Overview
  • 2001 TDT Evaluation Result Summaries
    • First Story Detection (FSD)
    • Topic Detection
    • Topic Tracking
    • Link Detection
  • Other Investigations

Primary Topic Detection Systems: Newswire + BNews ASR, Multilingual, Auto Boundaries, Deferral = 10

Topic Detection: False Alarm Visualization

  • Systems behave very differently
  • IMHO, a user would not want to use a system with a high false alarm rate
  • Perhaps false alarms should get more weight in the cost function
  • Outer circle: number of stories in a cluster
    • Light => cluster was mapped to a reference topic
    • Blue => unmapped cluster
  • Inner circle: number of on-topic stories

[Figure: cluster plots for the UMass1 and TNO1-late systems; x-axis: system clusters ordered by size, y-axis: topic ID]
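A hypothetical matplotlib re-creation of the glyph design the slide describes, with toy data standing in for real clusters; every number, color choice, and cluster here is invented for illustration:

```python
# Toy re-creation of the visualization described above: one point
# per system cluster, outer marker area ~ cluster size, inner
# marker area ~ on-topic stories, outer color showing whether the
# cluster was mapped to a reference topic.
import matplotlib.pyplot as plt

# (cluster_size, on_topic_stories, mapped, topic_id) - invented data
clusters = [(120, 90, True, 3), (80, 5, False, 7), (40, 35, True, 1)]
clusters.sort(key=lambda c: -c[0])  # order clusters by size

fig, ax = plt.subplots()
for x, (size, on_topic, mapped, topic_id) in enumerate(clusters):
    outer_color = "lightgray" if mapped else "blue"
    ax.scatter(x, topic_id, s=size * 10, c=outer_color)   # outer circle
    ax.scatter(x, topic_id, s=on_topic * 10, c="black")   # inner circle
ax.set_xlabel("System clusters, ordered by size")
ax.set_ylabel("Topic ID")
plt.show()
```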

Topic Detection: 2000 vs. 2001 Index Files (Multilingual Text, Newswire + Broadcast News, Auto Boundaries, Deferral = 10)
  • The 2000 test corpus covered 3 months
  • The 2001 corpus covered 6 months
    • 35K more stories
  • Might affect performance, BUT appears not to.

Topic Detection Evaluation via a Link-Style Metric
  • Motivation:
    • Measured performance is unstable during system tuning
    • Likely to be a direct result of the need to map reference topic clusters to system-defined clusters
    • We would like to avoid the assumption of independent topics

Topic Detection Evaluation via a Link-Style Metric
  • Evaluation Criterion: “Does this pair of stories discuss the same topic?”
    • If a story pair is on the same topic
      • A missed detection is declared if the system put the stories in different clusters
      • Otherwise, it’s a correct detection
    • If a pair of stories is not on the same topic
      • A false alarm is declared if the system put the stories in the same cluster
      • Otherwise, it’s a correct non-detection
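A minimal sketch of the pairwise scoring rule just described, assuming each story carries exactly one reference topic label and one system cluster label (NIST's actual scorer must also handle partially annotated data):

```python
# Illustrative link-style scoring of topic detection: walk over all
# story pairs, counting misses (same topic, different clusters) and
# false alarms (different topics, same cluster). A sketch, not the
# official scorer.
from itertools import combinations

def link_style_errors(ref_topics, sys_clusters):
    """ref_topics[i] and sys_clusters[i] are the labels of story i."""
    miss = fa = target_pairs = nontarget_pairs = 0
    for i, j in combinations(range(len(ref_topics)), 2):
        same_topic = ref_topics[i] == ref_topics[j]
        same_cluster = sys_clusters[i] == sys_clusters[j]
        if same_topic:
            target_pairs += 1
            miss += not same_cluster
        else:
            nontarget_pairs += 1
            fa += same_cluster
    return miss / target_pairs, fa / nontarget_pairs

# e.g. p_miss, p_fa = link_style_errors([1, 1, 2, 2], [1, 2, 2, 2])
```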

Link-Based vs. Topic Detection Metrics: Parameter Optimization Sweep

  • System 1: 62K test stories, 98 topics
  • System 2: 27K test stories, 31 topics
  • The link curve is less erratic for System 1
  • The link curve is higher: what does this mean?

[Figure: parameter optimization sweep curves under both metrics, one panel per system]

What can be learned?
  • Are all the experimental controls necessary?
    • Tracking performance degrades 50% going from manual to automatic transcription, and an additional 50% going to automatic boundaries
    • Cross-language issues still not solved
    • Most systems used only the required deferral period
  • Progress was modest: did the lack of a new evaluation corpus impede research?

Summary
  • TDT Evaluation Overview
  • 2001 TDT Evaluation Results
  • Evaluating Topic Detection with the Link-based metric is feasible, but inconclusive
  • The TDT3 corpus annotations are now public!
