
Spoken Term Detection Evaluation Overview


Presentation Transcript


  1. Spoken Term Detection Evaluation Overview http://www.nist.gov/speech/tests/std Jonathan Fiscus, Jérôme Ajot, George Doddington December 14-15, 2006 2006 Spoken Term Detection Workshop

  2. Outline • Spoken Term Detection Task • STD definition of a “Search Term” • Expected STD System Architecture • Evaluation process • Evaluation corpus • Future activities

  3. The Spoken Term Detection Task • System goal: to find all occurrences of a search term as fast as possible in heterogeneous audio sources • Applications • Audio mining – Indexing and searching huge data stores • Goals of the evaluation • Understand speed/accuracy tradeoffs • Understand technology solution tradeoffs: e.g., word vs. phone recognition • Understand technology issues for the three STD languages: Arabic, English, and Mandarin

  4. The Typical Two-Stage STD System Architecture [Diagram: the system input files are an audio excerpt list and a term list. Inside the system developer's domain, an Indexer runs one time over the audio and writes a Metadata Store; a Searcher runs many times, once per term list, against that store and outputs the Detected Terms as pointers into the audio. The indexer never gets access to the terms; the searcher never gets access to the audio.]
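
A minimal Python sketch of this two-stage contract, using hypothetical Indexer/Searcher class names (not from the slides): the indexer sees only audio, the searcher sees only the term list plus the stored index.

```python
class Indexer:
    """Runs one time over the audio; never sees the search terms."""
    def build(self, audio_files):
        metadata_store = {}              # e.g. word/phone posting lists or lattices
        for path in audio_files:
            metadata_store[path] = self._index_one(path)
        return metadata_store

    def _index_one(self, path):
        raise NotImplementedError        # site-specific ASR / indexing


class Searcher:
    """Runs many times, once per term list; never sees the audio."""
    def search(self, metadata_store, term_list):
        detections = []                  # (term, file, begin, duration, score, decision)
        for term in term_list:
            detections.extend(self._lookup(metadata_store, term))
        return detections

    def _lookup(self, metadata_store, term):
        raise NotImplementedError        # site-specific search
```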

  5. STD Definition of a Search Term • A search term is a sequence of one or more words represented orthographically • A narrow, pragmatic definition because: • Limited corpora, time, and funding resources • Specific term attributes • Case is ignored • Homophones are not distinguished • “wind” (air movement) <=> “wind” (to twist) • Distinguishing these would require phonetically annotated search terms • Orthographic variants of the same sense are distinguished • “catalogs” is NOT a target for the search term “catalog” • A meta-level application would perform term expansion

  6. Outline • Spoken Term Detection Task • STD definition of a “Search Term” • Expected STD System Architecture • Evaluation process • STD system output • Defining reference occurrences • Assessing performance • Term occurrence alignment • STD Value • Detection Error Tradeoff Curves • Computation resource profiling • Evaluation corpus • Future enhancements

  7. Evaluation Process: STD System Output • For each hypothesized term occurrence, the system reports (illustrated below): • The location of the term in the audio stream • A “Decision Score” indicating how likely it is that the term occurs there, with more positive values indicating more likely occurrences • An “Actual” decision as to whether the detected term is a true occurrence: • “YES”, the putative term is a likely true occurrence • “NO”, it is not • System operation measurements: • Indexing time • Indexing memory consumption (high-water mark) • Index size (on disk) • Search time for each term • Searching memory consumption (high-water mark) • Computing platform benchmarks
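
For illustration only, a hypothesized term occurrence might be represented as a record like the one below; the field names are assumptions, not the official STD output file format.

```python
from dataclasses import dataclass

@dataclass
class TermOccurrence:
    term: str
    file: str
    channel: int
    begin_time: float     # seconds into the audio stream
    duration: float       # seconds
    score: float          # decision score: more positive = more likely a true occurrence
    decision: str         # "YES" or "NO" actual decision

# Example hypothesized occurrence (values are illustrative only)
hyp = TermOccurrence("spoken term", "file01.sph", 1, 12.34, 0.62, 1.87, "YES")
```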

  8. Evaluation Process: Defining Reference Occurrences • Reference occurrences are found by searching human-generated transcripts • Use Speech-To-Text evaluation transcriptions • Each word has been time-marked • Forced alignments for English transcripts • STT-system-mediated time alignment for Mandarin and Arabic • Rules for locating reference occurrences (sketched below): • The occurrence must be a contiguous sequence of LEXEMEs • Terms cannot contain disfluencies • The utterance “fire m- maker” is not an occurrence of the term “fire maker” • Terms can’t cross speaker or channel boundaries • The gap between adjacent words in a term must be <= 0.5 seconds • A 0.5 sec. limit captures all but 0.04% of the reference occurrences found with a 1.0 sec. limit • NON-LEX events are treated as silence
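
A minimal Python sketch of these rules, assuming the transcript is available as a list of time-marked tokens with illustrative field names (word, begin, end, speaker, channel, kind); the token representation is an assumption, not NIST's file format.

```python
MAX_GAP = 0.5  # maximum gap in seconds between adjacent words of a term

def find_reference_occurrences(tokens, term_words):
    """tokens: dicts with keys word, begin, end, speaker, channel, kind
    (kind in {"LEX", "NON-LEX", "DISFLUENCY"}); term_words: list of words."""
    occurrences = []
    # NON-LEX events are treated as silence, i.e. dropped before matching
    lex = [t for t in tokens if t["kind"] != "NON-LEX"]
    n = len(term_words)
    for i in range(len(lex) - n + 1):
        window = lex[i:i + n]
        if any(t["kind"] == "DISFLUENCY" for t in window):
            continue                                   # terms cannot contain disfluencies
        if any(t["word"].lower() != w.lower() for t, w in zip(window, term_words)):
            continue                                   # case is ignored
        if len({(t["speaker"], t["channel"]) for t in window}) > 1:
            continue                                   # no speaker/channel crossing
        if all(b["begin"] - a["end"] <= MAX_GAP for a, b in zip(window, window[1:])):
            occurrences.append((window[0]["begin"], window[-1]["end"]))
    return occurrences
```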

  9. Evaluation Process:Assessing Performance • Challenges: STD isn’t a typical detection task • STD systems operate on continuous audio so there is no analog to a discrete “trial” • A “trial” is an answer to the question “Is this exemplar an example of object X” • System and reference occurrences need to be ALIGNED • Two methods of assessing performance • Value function • An application model that assigns value to correct output and negative value for incorrect output • A weighted linear combination of Missed Detection and False Alarm Probabilities • Detection Error Tradeoff (DET) curve • A graphical representation of the tradeoff between Missed Detections and False Alarms

  10. Evaluation Process: Reference/System Term Alignments • Solved using the Hungarian Algorithm for bipartite graph matching (see the sketch below) • Each term occurrence is a node in the graph • A minimum-cost one-to-one mapping is found • The optimization function is: • A linear combination of time congruence and detection score • Infinite if the gap between the reference and system occurrences is greater than 0.5 seconds
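
A sketch of the alignment step using scipy.optimize.linear_sum_assignment as a Hungarian-style solver; the exact NIST cost weighting is not reproduced here, so the time-congruence/score combination and the 0.5 s midpoint-gap test below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_TIME_GAP = 0.5          # pairs further apart than this may not be mapped
FORBIDDEN = 1e9             # large finite stand-in for an "infinite" cost

def align(ref_occs, sys_occs, score_weight=0.1):
    """ref_occs: list of (begin, end); sys_occs: list of (begin, end, score).
    Returns a minimum-cost one-to-one mapping as (ref_index, sys_index) pairs."""
    cost = np.full((len(ref_occs), len(sys_occs)), FORBIDDEN)
    for i, (rb, re) in enumerate(ref_occs):
        for j, (sb, se, score) in enumerate(sys_occs):
            mid_gap = abs((rb + re) / 2.0 - (sb + se) / 2.0)
            if mid_gap <= MAX_TIME_GAP:
                # lower cost for better time congruence and higher detection score
                cost[i, j] = mid_gap - score_weight * score
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < FORBIDDEN]
```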

  11. Error Counting Illustration [Diagram: a 10-second excerpt showing reference occurrences above four system occurrences carrying actual decisions (NO 0.5, YES 0.7, YES 0.6, NO 0.4). Counting errors for the actual decisions yields Corr, Miss, FA, Miss; “NO” decisions are ignored.]

  12. Evaluation Process: Estimates for Error Rates • Given the alignments, compute the following for each term (sketched below): • Probability of Missed Detections • PMiss(term, θ) = 1 – Ncorrect(term, θ) / NTrue(term, θ) • Not defined when NTrue(term, θ) = 0 • Probability of False Alarms • PFA(term, θ) = NSpurious(term, θ) / NNT(term, θ) • NNT is the number of Non-Target trials • But this is NOT countable – instead we used an estimate: • NNT(term, θ) = TrialsPerSecond * SpeechDuration – NTrue(term, θ) • TrialsPerSecond was arbitrarily set to 1 for all languages • In hindsight, this is an unsatisfactory solution • For the next STD eval, we’d like to use a rate to measure false alarms (e.g., false alarms per hour) • θ is a decision criterion used to differentiate likely vs. unlikely putative term occurrences
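
A direct Python transcription of these per-term estimates; the counts are assumed to come from the alignment step, and speech duration is taken in seconds to match TrialsPerSecond = 1.

```python
def p_miss(n_correct, n_true):
    """Probability of missed detection for one term; undefined when NTrue = 0."""
    return None if n_true == 0 else 1.0 - n_correct / n_true

def p_fa(n_spurious, n_true, speech_duration, trials_per_second=1.0):
    """Probability of false alarm using the estimated non-target trial count."""
    n_nt = trials_per_second * speech_duration - n_true
    return n_spurious / n_nt

# Example: 9 of 10 true occurrences found, 3 false alarms, 1 hour (3600 s) of speech
print(p_miss(9, 10), p_fa(3, 10, 3600.0))    # 0.1, ~0.000836
```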

  13. Evaluation Process: STD Value Function • Value: an application model that assigns value to correct output and negative value to incorrect output • Value = 1 is perfect output • Value = 0 is the score for no output • Value < 0 is possible • Occurrence-Weighted Value (OWV), with V = 1.0 (the value/benefit of a correct response) and C = 0.1 (the cost of an incorrect response); see the sketch below • Con: frequently occurring terms will dominate • Equal weighting by terms would be better
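
The slide does not show the OWV formula itself; the sketch below uses a form reconstructed from the stated anchor points (1.0 for perfect output, 0.0 for no output, negative values possible) and the constants V = 1.0, C = 0.1, so treat it as an assumption rather than a quote.

```python
V, C = 1.0, 0.1   # value of a correct response, cost of an incorrect response

def occurrence_weighted_value(n_miss, n_fa, n_true):
    # Pooled over all occurrences, so frequently occurring terms dominate the score.
    return 1.0 - (n_miss + (C / V) * n_fa) / n_true

print(occurrence_weighted_value(n_miss=0,  n_fa=0, n_true=50))   # 1.0 (perfect output)
print(occurrence_weighted_value(n_miss=50, n_fa=0, n_true=50))   # 0.0 (no output)
```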

  14. Evaluation Process: STD Value Function (cont.) • Term-Weighted Value (TWV) • Derived from the OWV (won’t go into it here); in the formula, C/V weights the error types and a prior-based factor adjusts for the priors • Restricted to terms with reference occurrences • Choosing θ: • Developers choose a decision threshold for their “Actual Decisions” (all the “YES” system occurrences) to optimize their term-weighted value • Called the “Actual Term-Weighted Value” (ATWV) • The evaluation code also searches for the system’s optimum decision-score threshold • Called the “Maximum Term-Weighted Value” (MTWV) • (a computation sketch follows)
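
A sketch of the term-weighted value computation. The slide omits the formula, so the form and the constants below (C/V = 0.1, term prior 1e-4) follow the published STD-06 evaluation plan and should be treated as an assumption here.

```python
C_OVER_V = 0.1       # cost/value ratio weighting the two error types
P_TERM = 1e-4        # assumed prior probability of a term occurring in a trial
BETA = C_OVER_V * (1.0 / P_TERM - 1.0)   # weights false alarms relative to misses

def term_weighted_value(per_term):
    """per_term: list of (p_miss, p_fa) pairs, one per term with reference occurrences."""
    losses = [pm + BETA * pf for pm, pf in per_term]
    return 1.0 - sum(losses) / len(losses)

# ATWV evaluates this at the system's actual YES/NO decisions; MTWV sweeps the
# decision-score threshold and reports the maximum of this function.
print(term_weighted_value([(0.1, 0.0002), (0.3, 0.0)]))   # ~0.700
```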

  15. Error Counting Illustration (cont.) [Diagram: the same 10-second excerpt with reference occurrences and four system occurrences (NO 0.5, YES 0.7, YES 0.6, NO 0.4). Counting errors for the actual decisions yields Corr, Miss, FA, Miss with “NO” decisions ignored; counting errors at a threshold of θ = 0.5 (as used for DET curves) yields Corr, Corr, FA, Miss with “NO” decisions used.]

  16. Evaluation Process: Detection Error Tradeoff (DET) Curves • Plot of PMiss vs. PFA (a plotting sketch follows) • Axes are warped to the normal deviate scale • Term-Weighted DET Curve • Created by computing term-averaged PMiss and PFA over the full range of a system’s decision-score space • Confidence estimates are easily added as +/- 2 standard-error curves • The curves stop because systems don’t generate complete output • In the example plot, the Maximum Term-Weighted Value (MTWV) occurs at a decision score of –0.595
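
A plotting sketch for a term-weighted DET curve, using scipy's normal quantile function to warp both axes to the normal deviate scale; the p_miss_at/p_fa_at callables stand in for the term-averaging step and are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def plot_det_curve(thresholds, p_miss_at, p_fa_at):
    """p_miss_at / p_fa_at: callables returning the term-averaged rate at a threshold."""
    pm = np.array([p_miss_at(t) for t in thresholds])
    pf = np.array([p_fa_at(t) for t in thresholds])
    # Warp both axes to the normal deviate (probit) scale before plotting
    plt.plot(norm.ppf(pf), norm.ppf(pm))
    plt.xlabel("P(False Alarm) [normal deviate scale]")
    plt.ylabel("P(Miss) [normal deviate scale]")
    plt.title("Term-weighted DET curve")
    plt.show()
```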

  17. Processing Resource Profiling • Since speed is a major concern, participants quantified the processing resources consumed • Five measurements were defined for the evaluation (rate conversions are sketched below): • Indexing time • Reported as a rate: processing hours / hour of speech • Search time • Reported as a rate: processing seconds / hour of speech • Index size (on disk) • Reported as a rate: megabytes / hour of speech • Indexing memory consumption (high-water mark) • Reported in megabytes • NIST provided a tool, ProcGraph.pl, to track memory usage • Searching memory consumption (high-water mark) • Reported in megabytes • These measurements are not comparable without computing-platform calibration • Sites report the output of NBench, a public-domain computing-speed benchmark
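
A small sketch of the rate conversions above; ProcGraph.pl and NBench themselves are not reproduced, and the function names are illustrative.

```python
def indexing_rate(indexing_hours, speech_hours):
    return indexing_hours / speech_hours      # processing hours per hour of speech

def search_rate(search_seconds, speech_hours):
    return search_seconds / speech_hours      # processing seconds per hour of speech

def index_size_rate(index_megabytes, speech_hours):
    return index_megabytes / speech_hours     # megabytes per hour of speech

# Example: 12 processing hours, 45 s of search, 900 MB index over 3 hours of speech
print(indexing_rate(12.0, 3.0), search_rate(45.0, 3.0), index_size_rate(900.0, 3.0))
```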

  18. Term Selection Procedure • The “Candidate Lists” are made from high-frequency n-grams that are composed only of low-frequency constituent words • Selection is uniform over n-gram tokens • Pr(occ) for a high-frequency n-gram > 0.1% > Pr(occ) for a low-frequency word • (a selection sketch follows)
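
A sketch of the candidate-list construction with illustrative cutoffs and data structures; "uniform over n-gram tokens" is interpreted here as sampling candidates weighted by their token counts, which is an assumption of this sketch.

```python
import random

def candidate_terms(ngram_count, word_count, total_ngrams, total_words,
                    k=100, cutoff=0.001):
    """ngram_count/word_count: token counts keyed by n-gram string / word."""
    # Keep high-frequency n-grams whose constituent words are all low-frequency
    cands = [ng for ng, c in ngram_count.items()
             if c / total_ngrams > cutoff
             and all(word_count[w] / total_words < cutoff for w in ng.split())]
    # "Uniform over n-gram tokens": weight each candidate by its token count
    weights = [ngram_count[ng] for ng in cands]
    return random.choices(cands, weights=weights, k=k)   # sketch: samples with replacement
```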

  19. 2006 STD Corpora: Speech Types with Dialects • Why such a small data set? Answer: that’s all we had with high-quality transcripts

  20. Future Activities • Additional results analysis • Transcript accuracy requirements • Minimum number of terms • Text normalization • Methods for finding transcription errors given system output • Modifications to the evaluation infrastructure • Improvements/clarifications to file definitions • Modify the term-weighted value to include terms without reference occurrences • Score systems over all domains (not split by domain) • Add a composite speed/accuracy measure • Develop methods to enable bigger test sets • Statistical analysis
