
Spoken Term Detection Evaluation Overview


Presentation Transcript


  1. Spoken Term Detection Evaluation Overview http://www.nist.gov/speech/tests/std Jonathan Fiscus, Jérôme Ajot, George Doddington December 14-15, 2006 2006 Spoken Term Detection Workshop

  2. Outline • Spoken Term Detection Task • STD definition of a “Search Term” • Expected STD System Architecture • Evaluation process • Evaluation corpus • Future activities

  3. The Spoken Term Detection Task • System goal: to find all occurrences of a search term as fast as possible in heterogeneous audio sources • Applications • Audio mining – Indexing and searching huge data stores • Goals of the evaluation • Understand speed/accuracy tradeoffs • Understand technology solution tradeoffs: e.g., word vs. phone recognition • Understand technology issues for the three STD languages: Arabic, English, and Mandarin

  4. The Typical Two-Stage STD System Architecture [Diagram: the system input files are an audio excerpt list and a term list. Inside the system developer's domain, an Indexer runs one time over the audio and writes a Metadata Store; a Searcher runs many times, once per term list, against that store and outputs the Detected Terms as pointers into the audio. The indexer never gets access to the terms; the searcher never gets access to the audio.]
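
A minimal Python sketch of this two-stage contract, using hypothetical Indexer/Searcher class names (not from the slides): the indexer sees only audio, the searcher sees only the term list plus the stored index.

```python
class Indexer:
    """Runs one time over the audio; never sees the search terms."""
    def build(self, audio_files):
        metadata_store = {}              # e.g. word/phone posting lists or lattices
        for path in audio_files:
            metadata_store[path] = self._index_one(path)
        return metadata_store

    def _index_one(self, path):
        raise NotImplementedError        # site-specific ASR / indexing


class Searcher:
    """Runs many times, once per term list; never sees the audio."""
    def search(self, metadata_store, term_list):
        detections = []                  # (term, file, begin, duration, score, decision)
        for term in term_list:
            detections.extend(self._lookup(metadata_store, term))
        return detections

    def _lookup(self, metadata_store, term):
        raise NotImplementedError        # site-specific search
```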

  5. STD Definition of a Search Term • A search term is a sequence of one or more words represented orthographically • A narrow, pragmatic definition because: • Limited corpora, time, and funding resources • Specific term attributes • Case is ignored • Homophones are not distinguished • “wind” (air movement) <=> “wind” (to twist) • Distinguishing these would require phonetically annotated search terms • Orthographic variants of the same sense are distinguished • “catalogs” is NOT a target for the search term “catalog” • A meta-level application would perform term expansion

  6. Outline • Spoken Term Detection Task • STD definition of a “Search Term” • Expected STD System Architecture • Evaluation process • STD system output • Defining reference occurrences • Assessing performance • Term occurrence alignment • STD Value • Detection Error Tradeoff Curves • Computation resource profiling • Evaluation corpus • Future enhancements

  7. Evaluation Process: STD System Output • For each hypothesized term occurrence, the system reports (illustrated below): • The location of the term in the audio stream • A “Decision Score” indicating how likely it is that the term occurs there, with more positive values indicating more likely occurrences • An “Actual” decision as to whether the detected term is a true occurrence: • “YES”, the putative term is a likely true occurrence • “NO”, it is not • System operation measurements: • Indexing time • Indexing memory consumption (high-water mark) • Index size (on disk) • Search time for each term • Searching memory consumption (high-water mark) • Computing platform benchmarks
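
For illustration only, a hypothesized term occurrence might be represented as a record like the one below; the field names are assumptions, not the official STD output file format.

```python
from dataclasses import dataclass

@dataclass
class TermOccurrence:
    term: str
    file: str
    channel: int
    begin_time: float     # seconds into the audio stream
    duration: float       # seconds
    score: float          # decision score: more positive = more likely a true occurrence
    decision: str         # "YES" or "NO" actual decision

# Example hypothesized occurrence (values are illustrative only)
hyp = TermOccurrence("spoken term", "file01.sph", 1, 12.34, 0.62, 1.87, "YES")
```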

  8. Evaluation Process: Defining Reference Occurrences • Reference occurrences are found by searching human-generated transcripts • Use Speech-To-Text evaluation transcriptions • Each word has been time-marked • Forced alignments for English transcripts • STT-system-mediated time alignment for Mandarin and Arabic • Rules for locating reference occurrences (sketched below): • The occurrence must be a contiguous sequence of LEXEMEs • Terms cannot contain disfluencies • The utterance “fire m- maker” is not an occurrence of the term “fire maker” • Terms can’t cross speaker or channel boundaries • The gap between adjacent words in a term must be <= 0.5 seconds • A 0.5 sec. limit captures all but 0.04% of the reference occurrences found with a 1.0 sec. limit • NON-LEX events are treated as silence
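
A minimal Python sketch of these rules, assuming the transcript is available as a list of time-marked tokens with illustrative field names (word, begin, end, speaker, channel, kind); the token representation is an assumption, not NIST's file format.

```python
MAX_GAP = 0.5  # maximum gap in seconds between adjacent words of a term

def find_reference_occurrences(tokens, term_words):
    """tokens: dicts with keys word, begin, end, speaker, channel, kind
    (kind in {"LEX", "NON-LEX", "DISFLUENCY"}); term_words: list of words."""
    occurrences = []
    # NON-LEX events are treated as silence, i.e. dropped before matching
    lex = [t for t in tokens if t["kind"] != "NON-LEX"]
    n = len(term_words)
    for i in range(len(lex) - n + 1):
        window = lex[i:i + n]
        if any(t["kind"] == "DISFLUENCY" for t in window):
            continue                                   # terms cannot contain disfluencies
        if any(t["word"].lower() != w.lower() for t, w in zip(window, term_words)):
            continue                                   # case is ignored
        if len({(t["speaker"], t["channel"]) for t in window}) > 1:
            continue                                   # no speaker/channel crossing
        if all(b["begin"] - a["end"] <= MAX_GAP for a, b in zip(window, window[1:])):
            occurrences.append((window[0]["begin"], window[-1]["end"]))
    return occurrences
```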

  9. Evaluation Process:Assessing Performance • Challenges: STD isn’t a typical detection task • STD systems operate on continuous audio so there is no analog to a discrete “trial” • A “trial” is an answer to the question “Is this exemplar an example of object X” • System and reference occurrences need to be ALIGNED • Two methods of assessing performance • Value function • An application model that assigns value to correct output and negative value for incorrect output • A weighted linear combination of Missed Detection and False Alarm Probabilities • Detection Error Tradeoff (DET) curve • A graphical representation of the tradeoff between Missed Detections and False Alarms

  10. Evaluation Process: Reference/System Term Alignments • Solved using the Hungarian Algorithm for bipartite graph matching (see the sketch below) • Each term occurrence is a node in the graph • A minimum-cost one-to-one mapping is found • The optimization function is: • A linear combination of time congruence and detection score • Infinite if the gap between the reference and system occurrences is greater than 0.5 seconds
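
A sketch of the alignment step using scipy.optimize.linear_sum_assignment as a Hungarian-style solver; the exact NIST cost weighting is not reproduced here, so the time-congruence/score combination and the 0.5 s midpoint-gap test below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_TIME_GAP = 0.5          # pairs further apart than this may not be mapped
FORBIDDEN = 1e9             # large finite stand-in for an "infinite" cost

def align(ref_occs, sys_occs, score_weight=0.1):
    """ref_occs: list of (begin, end); sys_occs: list of (begin, end, score).
    Returns a minimum-cost one-to-one mapping as (ref_index, sys_index) pairs."""
    cost = np.full((len(ref_occs), len(sys_occs)), FORBIDDEN)
    for i, (rb, re) in enumerate(ref_occs):
        for j, (sb, se, score) in enumerate(sys_occs):
            mid_gap = abs((rb + re) / 2.0 - (sb + se) / 2.0)
            if mid_gap <= MAX_TIME_GAP:
                # lower cost for better time congruence and higher detection score
                cost[i, j] = mid_gap - score_weight * score
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < FORBIDDEN]
```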

  11. Error Counting Illustration [Diagram: a 10-second excerpt showing reference occurrences above four system occurrences carrying actual decisions (NO 0.5, YES 0.7, YES 0.6, NO 0.4). Counting errors for the actual decisions yields Corr, Miss, FA, Miss; “NO” decisions are ignored.]

  12. Evaluation Process: Estimates for Error Rates • Given the alignments, compute the following for each term (sketched below): • Probability of Missed Detections • PMiss(term, θ) = 1 – Ncorrect(term, θ) / NTrue(term, θ) • Not defined when NTrue(term, θ) = 0 • Probability of False Alarms • PFA(term, θ) = NSpurious(term, θ) / NNT(term, θ) • NNT is the number of Non-Target trials • But this is NOT countable – instead we used an estimate: • NNT(term, θ) = TrialsPerSecond * SpeechDuration – NTrue(term, θ) • TrialsPerSecond was arbitrarily set to 1 for all languages • In hindsight, this is an unsatisfactory solution • For the next STD eval, we’d like to use a rate to measure false alarms (e.g., false alarms per hour) • θ is a decision criterion used to differentiate likely vs. unlikely putative term occurrences
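
A direct Python transcription of these per-term estimates; the counts are assumed to come from the alignment step, and speech duration is taken in seconds to match TrialsPerSecond = 1.

```python
def p_miss(n_correct, n_true):
    """Probability of missed detection for one term; undefined when NTrue = 0."""
    return None if n_true == 0 else 1.0 - n_correct / n_true

def p_fa(n_spurious, n_true, speech_duration, trials_per_second=1.0):
    """Probability of false alarm using the estimated non-target trial count."""
    n_nt = trials_per_second * speech_duration - n_true
    return n_spurious / n_nt

# Example: 9 of 10 true occurrences found, 3 false alarms, 1 hour (3600 s) of speech
print(p_miss(9, 10), p_fa(3, 10, 3600.0))    # 0.1, ~0.000836
```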

  13. Evaluation Process: STD Value Function • Value: an application model that assigns value to correct output and negative value to incorrect output • Value = 1 is perfect output • Value = 0 is the score for no output • Value < 0 is possible • Occurrence-Weighted Value (OWV), with V = 1.0 (the value/benefit of a correct response) and C = 0.1 (the cost of an incorrect response); see the sketch below • Con: frequently occurring terms will dominate • Equal weighting by terms would be better
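
The slide does not show the OWV formula itself; the sketch below uses a form reconstructed from the stated anchor points (1.0 for perfect output, 0.0 for no output, negative values possible) and the constants V = 1.0, C = 0.1, so treat it as an assumption rather than a quote.

```python
V, C = 1.0, 0.1   # value of a correct response, cost of an incorrect response

def occurrence_weighted_value(n_miss, n_fa, n_true):
    # Pooled over all occurrences, so frequently occurring terms dominate the score.
    return 1.0 - (n_miss + (C / V) * n_fa) / n_true

print(occurrence_weighted_value(n_miss=0,  n_fa=0, n_true=50))   # 1.0 (perfect output)
print(occurrence_weighted_value(n_miss=50, n_fa=0, n_true=50))   # 0.0 (no output)
```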

  14. Evaluation Process: STD Value Function (cont.) • Term-Weighted Value (TWV) • Derived from the OWV (won’t go into it here); in the formula, C/V weights the error types and a prior-based factor adjusts for the priors • Restricted to terms with reference occurrences • Choosing θ: • Developers choose a decision threshold for their “Actual Decisions” (all the “YES” system occurrences) to optimize their term-weighted value • Called the “Actual Term-Weighted Value” (ATWV) • The evaluation code also searches for the system’s optimum decision-score threshold • Called the “Maximum Term-Weighted Value” (MTWV) • (a computation sketch follows)
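
A sketch of the term-weighted value computation. The slide omits the formula, so the form and the constants below (C/V = 0.1, term prior 1e-4) follow the published STD-06 evaluation plan and should be treated as an assumption here.

```python
C_OVER_V = 0.1       # cost/value ratio weighting the two error types
P_TERM = 1e-4        # assumed prior probability of a term occurring in a trial
BETA = C_OVER_V * (1.0 / P_TERM - 1.0)   # weights false alarms relative to misses

def term_weighted_value(per_term):
    """per_term: list of (p_miss, p_fa) pairs, one per term with reference occurrences."""
    losses = [pm + BETA * pf for pm, pf in per_term]
    return 1.0 - sum(losses) / len(losses)

# ATWV evaluates this at the system's actual YES/NO decisions; MTWV sweeps the
# decision-score threshold and reports the maximum of this function.
print(term_weighted_value([(0.1, 0.0002), (0.3, 0.0)]))   # ~0.700
```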

  15. Error Counting Illustration (cont.) [Diagram: the same 10-second excerpt with reference occurrences and four system occurrences (NO 0.5, YES 0.7, YES 0.6, NO 0.4). Counting errors for the actual decisions yields Corr, Miss, FA, Miss with “NO” decisions ignored; counting errors at a threshold of θ = 0.5 (as used for DET curves) yields Corr, Corr, FA, Miss with “NO” decisions used.]

  16. Evaluation Process: Detection Error Tradeoff (DET) Curves • Plot of PMiss vs. PFA (a plotting sketch follows) • Axes are warped to the normal deviate scale • Term-Weighted DET Curve • Created by computing term-averaged PMiss and PFA over the full range of a system’s decision-score space • Confidence estimates are easily added as +/- 2 standard-error curves • The curves stop because systems don’t generate complete output • In the example plot, the Maximum Term-Weighted Value (MTWV) occurs at a decision score of –0.595
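
A plotting sketch for a term-weighted DET curve, using scipy's normal quantile function to warp both axes to the normal deviate scale; the p_miss_at/p_fa_at callables stand in for the term-averaging step and are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def plot_det_curve(thresholds, p_miss_at, p_fa_at):
    """p_miss_at / p_fa_at: callables returning the term-averaged rate at a threshold."""
    pm = np.array([p_miss_at(t) for t in thresholds])
    pf = np.array([p_fa_at(t) for t in thresholds])
    # Warp both axes to the normal deviate (probit) scale before plotting
    plt.plot(norm.ppf(pf), norm.ppf(pm))
    plt.xlabel("P(False Alarm) [normal deviate scale]")
    plt.ylabel("P(Miss) [normal deviate scale]")
    plt.title("Term-weighted DET curve")
    plt.show()
```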

  17. Processing Resource Profiling • Since speed is a major concern, participants quantified the processing resources consumed • Five measurements were defined for the evaluation (rate conversions are sketched below): • Indexing time • Reported as a rate: processing hours / hour of speech • Search time • Reported as a rate: processing seconds / hour of speech • Index size (on disk) • Reported as a rate: megabytes / hour of speech • Indexing memory consumption (high-water mark) • Reported in megabytes • NIST provided a tool, ProcGraph.pl, to track memory usage • Searching memory consumption (high-water mark) • Reported in megabytes • These measurements are not comparable without computing-platform calibration • Sites report the output of NBench, a public-domain computing-speed benchmark
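
A small sketch of the rate conversions above; ProcGraph.pl and NBench themselves are not reproduced, and the function names are illustrative.

```python
def indexing_rate(indexing_hours, speech_hours):
    return indexing_hours / speech_hours      # processing hours per hour of speech

def search_rate(search_seconds, speech_hours):
    return search_seconds / speech_hours      # processing seconds per hour of speech

def index_size_rate(index_megabytes, speech_hours):
    return index_megabytes / speech_hours     # megabytes per hour of speech

# Example: 12 processing hours, 45 s of search, 900 MB index over 3 hours of speech
print(indexing_rate(12.0, 3.0), search_rate(45.0, 3.0), index_size_rate(900.0, 3.0))
```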

  18. Term Selection Procedure • The “Candidate Lists” are made from high-frequency n-grams that are composed only of low-frequency constituent words • Selection is uniform over n-gram tokens • Pr(occ) for a high-frequency n-gram > 0.1% > Pr(occ) for a low-frequency word • (a selection sketch follows)
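
A sketch of the candidate-list construction with illustrative cutoffs and data structures; "uniform over n-gram tokens" is interpreted here as sampling candidates weighted by their token counts, which is an assumption of this sketch.

```python
import random

def candidate_terms(ngram_count, word_count, total_ngrams, total_words,
                    k=100, cutoff=0.001):
    """ngram_count/word_count: token counts keyed by n-gram string / word."""
    # Keep high-frequency n-grams whose constituent words are all low-frequency
    cands = [ng for ng, c in ngram_count.items()
             if c / total_ngrams > cutoff
             and all(word_count[w] / total_words < cutoff for w in ng.split())]
    # "Uniform over n-gram tokens": weight each candidate by its token count
    weights = [ngram_count[ng] for ng in cands]
    return random.choices(cands, weights=weights, k=k)   # sketch: samples with replacement
```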

  19. 2006 STD Corpora: Speech Types with Dialects • Why such a small data set? Answer: that’s all we had with high-quality transcripts

  20. Future Activities • Additional results analysis • Transcript accuracy requirements • Minimum number of terms • Text normalization • Methods for finding transcription errors given system output • Modifications to the evaluation infrastructure • Improvements/clarifications to file definitions • Modify the term-weighted value to include terms without reference occurrences • Score systems over all domains (not split by domain) • Add a composite speed/accuracy measure • Develop methods to enable bigger test sets • Statistical analysis
