The IBM 2006 Spoken Term Detection System

Jonathan Mamou

IBM Haifa Research Labs

Olivier Siohan

Bhuvana Ramabhadran

IBM T. J. Watson Research Center

Outline

  • System description
    • Indexing
      • Audio processing for each source type: generation of CTM, word confusion networks (WCN) and phone transcripts
      • Index generation and storage
    • Search
  • Experiments/Results
System Overview

[Figure: system overview diagram. Recoverable labels: Term List, Word Transcript, Word Index, Phone Transcript, Phone Index, In Voc, Posting list, STD List]


Conversational Telephone Speech Transcription System (CTS)

D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, G. Zweig, "fMPE: Discriminatively Trained Features for Speech Recognition", in Proceedings International Conference on Acoustics Speech and Signal Processing, Philadelphia, PA, 2005.

Meeting Transcription System (confmtg)

J. Huang et al., "The IBM RT06S Speech-To-Text Evaluation System", NIST RT06S Workshop, May 3-4, 2006.

Phonetic Lattice Generation

O. Siohan, M. Bacchiani, "Fast vocabulary independent audio search using path based graph indexing", Proceedings of Interspeech 2005, Lisbon, pp. 53-56.

Two-step algorithm:

  • Generate sub-word lattices using word fragments as decoding units
  • Convert word-fragment lattices into phonetic lattices

Required resources:

  • A word-fragment inventory
  • A word-fragment lexicon
  • A word-fragment language model

Main Issue: designing a fragment inventory

Fragment based system design
  • Use a word-based system to convert the training material to phone strings
  • Train a phone n-gram with “large n” (say 5)
  • Prune the phone n-gram using entropy-based pruning

A. Stolcke, "Entropy-based pruning of backoff language models", in Proceedings DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA, Feb. 1998.

  • Use the retained n-grams as the selected fragments (n-gram structure ensures coverage of all strings)
  • Phonetic pronunciations for word fragments are trivial
  • Train a fragment-based n-gram model for use in the fragment-based ASR system

Indices are stored using Juru storage

  • Juru is a full-text search library written in Java, developed at IBM
    • D. Carmel, E. Amitay, M. Herscovici, Y. S. Maarek, Y. Petruschka, and A. Soffer, "Juru at TREC 10 - Experiments with Index Pruning", Proceedings of TREC-10, NIST 2001.
  • We have adapted the Juru storage model in order to store speech related data (e.g. begin time, duration)
  • The posting lists are compressed using classical index compression techniques (d-gap)
    • Gerard Salton and Michael J. McGill, Introduction to modern information retrieval, 1983.
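
As an illustration of d-gap compression, the sketch below stores a sorted posting list as differences between consecutive values, so that most stored numbers are small. The variable-byte coding step and the choice of integer positions (for example, begin times in centiseconds) are assumptions made for this example; the slides only name the d-gap technique and do not describe Juru's actual on-disk format.

```python
from typing import List


def dgap_encode(positions: List[int]) -> List[int]:
    """Replace each sorted position by its gap from the previous one."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def dgap_decode(gaps: List[int]) -> List[int]:
    """Rebuild the original positions by accumulating the gaps."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def vbyte_encode(n: int) -> bytes:
    """Variable-byte code (an assumed companion to d-gaps):
    7 data bits per byte, high bit set on the final byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


# Hypothetical begin times in centiseconds for one indexing unit.
begin_times = [1200, 1260, 1275, 9000]
gaps = dgap_encode(begin_times)
encoded = b"".join(vbyte_encode(g) for g in gaps)
```

Small gaps fit in a single byte, which is where the saving comes from when occurrences of a unit cluster in time.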
Indexing Algorithm

Input: a corpus of word/sub-word transcripts


1. Extract units of indexing from the transcript

2. For each unit of indexing (word or sub-word), store in the index its posting:

- transcript/speaker identifier (tid)
- begin time (bt)
- duration
- for WCN:
    - posterior probability
    - rank relative to the other hypotheses

Output: an index on the corpus of transcripts
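
The indexing steps above can be sketched as a small in-memory inverted index. The `Posting` class, the `build_index` function and the tuple layout of the input are hypothetical names chosen for this example; the real system stores its postings in the adapted Juru storage rather than in a Python dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Posting:
    tid: str                           # transcript/speaker identifier
    bt: float                          # begin time, in seconds
    duration: float                    # duration, in seconds
    posterior: Optional[float] = None  # WCN only
    rank: Optional[int] = None         # WCN only


# One transcript entry: (unit, tid, bt, duration, posterior, rank).
Entry = Tuple[str, str, float, float, Optional[float], Optional[int]]


def build_index(entries: Iterable[Entry]) -> Dict[str, List[Posting]]:
    """Map each indexing unit (word or sub-word) to its posting list."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for unit, tid, bt, duration, posterior, rank in entries:
        index[unit].append(Posting(tid, bt, duration, posterior, rank))
    # Keep postings ordered by transcript and begin time for later merging.
    for postings in index.values():
        postings.sort(key=lambda p: (p.tid, p.bt))
    return dict(index)


# Hypothetical WCN entries: two competing hypotheses in the same time slot.
wcn_entries = [
    ("detection", "t1", 2.5, 0.5, 0.75, 1),
    ("detention", "t1", 2.5, 0.5, 0.25, 2),
    ("detection", "t1", 0.5, 0.5, 1.0, 1),
]
index = build_index(wcn_entries)
```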

In-Vocabulary Search
  • Miss probability can be reduced by expanding the 1-best transcript with extra words taken from the alternatives provided by the WCN transcript.
  • Such an expansion will probably reduce miss probability while increasing FA probability!
  • We need an appropriate scoring model in order to decrease the FA probability by penalizing “bad” results

J. Mamou, D. Carmel and R. Hoory, "Spoken Document Retrieval from Call-center conversations", Proceedings of SIGIR, 2006

Improving Retrieval Effectiveness for In Voc search

Our scoring model is based on two pieces of information provided by WCN:

  • the posterior probability of the hypothesis given the signal: it reflects the confidence level of the ASR in the hypothesis.
  • the rank of the hypothesis among the other alternatives: it reflects the relative importance of the occurrence.
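
The slides do not give the scoring formula, so the following is only one plausible way to combine the two pieces of information: multiply the posterior probability by a boost that decreases with rank. The function name `occurrence_score` and the boost values are assumptions for illustration, not the system's actual parameters.

```python
from typing import Sequence

# Hypothetical boosts for ranks 1, 2, 3, ...; the last value is reused
# for any lower-ranked hypothesis.
DEFAULT_RANK_BOOST = (1.0, 0.8, 0.6, 0.4, 0.2)


def occurrence_score(
    posterior: float,
    rank: int,
    rank_boost: Sequence[float] = DEFAULT_RANK_BOOST,
) -> float:
    """Score one WCN hypothesis from its posterior and its rank (1 = best)."""
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be in [0, 1]")
    if rank < 1:
        raise ValueError("rank starts at 1")
    boost = rank_boost[min(rank, len(rank_boost)) - 1]
    return boost * posterior
```

With such a score, a confident top-ranked hypothesis outranks a low-confidence alternative from the same confusion set, which is the behaviour needed to keep the extra WCN hypotheses from inflating false alarms.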
Improving Retrieval Effectiveness with OOV search
  • BN model: 39 OOV queries
  • CTS model: 117 OOV queries
  • CONFMTG model: 89 OOV queries
  • Since phone transcripts are less accurate than word transcripts, we use the phone transcript only for OOV keywords
  • This tends to reduce miss probability without affecting FA probability too much
Grapheme-to-phoneme conversion
  • OOV keywords are converted to phone sequence using a joint Maximum Entropy N-gram model
    • Given a letter sequence L, find the phone sequence P* that maximizes Pr(L,P)


  • Details in
    • Stanley Chen, “Conditional and Joint Models for Grapheme-to-Phoneme Conversion”, in Proc. of Eurospeech 2003.
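
Written out, the decoding criterion stated above is the following. The second equality is a standard identity added here for clarity: Pr(L) does not depend on P, so maximizing the joint probability also maximizes the conditional probability of the phone sequence given the letters.

```latex
P^{*} \;=\; \arg\max_{P} \Pr(L, P) \;=\; \arg\max_{P} \Pr(P \mid L)
```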
Search Algorithm

Input: a query term, word-based index, sub-word-based index


1. Extract the query keywords

2. For In-Voc query keywords, extract the posting lists from the word based index

3. For OOV query keywords, convert the keywords to sub-words and extract the posting list of each sub-word from the sub-word index

4. Merge the different posting lists according to the timestamp of the occurrences in order to create results matching the query

- check that the words and sub-words appear in the right order according to their begin times

- check that the words/sub-words are adjacent (less than 0.5 sec for word-word and word-phoneme, less than 0.2 sec for phoneme-phoneme)

Output: the set of all the matches of the given term
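
Step 4 can be sketched as follows. The 0.5 s and 0.2 s limits come from the slide; everything else is an assumption made for the example: the name `find_matches`, the `(tid, begin time, duration)` posting layout, and in particular the reading of "adjacent" as the gap between the end of one unit and the begin time of the next.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float, float]  # (tid, begin time, duration), in seconds
Unit = Tuple[str, str]              # (label, kind), kind is "word" or "phone"

WORD_GAP = 0.5   # word-word and word-phone limit, from the slide
PHONE_GAP = 0.2  # phone-phone limit, from the slide


def max_gap(prev_kind: str, kind: str) -> float:
    """Adjacency limit for two consecutive query units."""
    return PHONE_GAP if prev_kind == kind == "phone" else WORD_GAP


def find_matches(
    units: List[Unit], indexes: Dict[str, Dict[str, List[Posting]]]
) -> List[Tuple[str, float, float]]:
    """Return (tid, begin, end) for every in-order, adjacent occurrence."""
    if not units:
        return []
    label, kind = units[0]
    # Partial match: (tid, match begin, end of last unit, bt of last unit).
    partial = [(tid, bt, bt + dur, bt)
               for tid, bt, dur in indexes[kind].get(label, [])]
    for (_, prev_kind), (label, kind) in zip(units, units[1:]):
        limit = max_gap(prev_kind, kind)
        extended = []
        for tid, start, end, last_bt in partial:
            for p_tid, bt, dur in indexes[kind].get(label, []):
                # Same transcript, right order, and close enough in time.
                if p_tid == tid and bt > last_bt and bt - end < limit:
                    extended.append((tid, start, bt + dur, bt))
        partial = extended
    return sorted((tid, start, end) for tid, start, end, _ in partial)


# Hypothetical indexes: "new york" in the word index, "k ae" in the phone index.
indexes = {
    "word": {"new": [("t1", 1.0, 0.25), ("t1", 5.0, 0.25)],
             "york": [("t1", 1.5, 0.5), ("t2", 5.5, 0.5)]},
    "phone": {"k": [("t1", 3.0, 0.125)],
              "ae": [("t1", 3.375, 0.125)]},
}
word_hits = find_matches([("new", "word"), ("york", "word")], indexes)
phone_hits = find_matches([("k", "phone"), ("ae", "phone")], indexes)
```

In this example the two words are 0.25 s apart and match, while the two phones are also 0.25 s apart and are rejected by the tighter phone-phone limit.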

Search Algorithm

[Figure: search flow diagram. Recoverable labels: "Extract Posting List from Word …", "Extract Posting List from Phone …", "… in the Query", "Merge based on begin time and …", "Word-Word, Word-Phone: < 0.5s", "Phone-Phone: < 0.2s", "Set of matches for all terms in the query"]


Scoring for hard-decision
  • We have boosted the score of multi-word terms
  • Decision thresholds are set according to the analysis of the DET curve obtained on the development set.
    • We have used different threshold values per source type
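
A minimal sketch of such a hard decision is shown below. The threshold values and the boost factor are placeholders invented for the example (the slides do not report the actual values), and `hard_decision` is a hypothetical name.

```python
# Placeholder thresholds per source type; the real values were tuned on the
# development-set DET curves and are not given in the slides.
THRESHOLDS = {"BN": 0.5, "CTS": 0.6, "CONFMTG": 0.7}

# Placeholder boost applied to terms made of more than one word.
MULTI_WORD_BOOST = 1.2


def hard_decision(score: float, source_type: str, n_words: int) -> str:
    """Return "YES" or "NO" for one candidate occurrence."""
    if n_words > 1:
        score *= MULTI_WORD_BOOST
    return "YES" if score >= THRESHOLDS[source_type] else "NO"
```

For instance, with these placeholder values a score of 0.55 on CTS is rejected for a single-word term but accepted for a two-word term, since the boosted score exceeds the 0.6 threshold.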
Primary and Contrast system differences
  • Primary system (WCN): WCN for all source types; CONFMTG transcripts generated using the BN model. Combined with phonetic 1-best transcripts for BN and CTS.
  • Contrastive 1 (WCN-C): same as the primary system, except that the WCN for CONFMTG was generated using the CONFMTG model
  • Contrastive 2 (CTM): CTM for all source types; CONFMTG transcripts generated using the BN model. Combined with phonetic 1-best transcripts for BN and CTS.
  • Contrastive 3 (1-best-WCN): 1-best path extracted from the WCN; CONFMTG transcripts generated using the BN model. Combined with phonetic 1-best transcripts for BN and CTS.
Data Results
  • Retrieval performance improves
    • when using WCNs rather than the 1-best path
    • when using the 1-best path from the WCN rather than the CTM
  • Our ATWV is close to the MTWV, which suggests that the thresholds we chose for penalizing bad results were appropriate.
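
For readers unfamiliar with the metric, the sketch below computes a term-weighted value in the way we understand the NIST 2006 STD evaluation to define it: one minus the average over terms of P_miss + β·P_FA, with β = 999.9 and one non-target trial per second of speech. The function name and input layout are assumptions for this example. ATWV is this value at the system's actual decisions; MTWV is the maximum obtainable over all decision thresholds.

```python
from typing import Iterable, Tuple

BETA = 999.9  # weight on false alarms, as we understand the NIST 2006 setting


def term_weighted_value(
    counts: Iterable[Tuple[int, int, int]],
    speech_seconds: float,
    beta: float = BETA,
) -> float:
    """counts holds (n_true, n_correct, n_spurious) for each term."""
    total, n_terms = 0.0, 0
    for n_true, n_correct, n_spurious in counts:
        if n_true == 0:
            continue  # terms with no reference occurrence are not averaged
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_spurious / (speech_seconds - n_true)
        total += p_miss + beta * p_fa
        n_terms += 1
    if n_terms == 0:
        raise ValueError("no term has a reference occurrence")
    return 1.0 - total / n_terms
```

A system that finds every occurrence with no false alarms scores 1.0, and a system that returns nothing scores 0.0; because β is large, even a few false alarms cost as much as many misses.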
Condition performance



In general we performed better on long terms.

System characteristics (Eval)
  • Index size: 0.3267 MB/HP
    • Compression of the index storage
  • Indexing time: 7.5627 HP/HS
  • Search speed: 0.0041 sec.P/HS
  • Index Memory Usage: 1653.4297 MB
  • Search Memory Usage: 269.1250 MB
  • Our system combines a word retrieval approach with a phonetic retrieval approach
  • Our work exploits additional information provided by WCNs
    • Extending the 1-best transcript with all the hypotheses of the WCN, considering confidence levels and boosting by term rank.
  • ATWV is increased compared to the 1-best transcript
    • Miss probability is significantly reduced by indexing all the hypotheses provided by the WCN.
    • Decisions are set to NO for “bad” results in order to attenuate the effect of the FAs added by the WCN.