Progress of Sphinx 3.X, From X=4 to X=5


Arthur Chan

Evandro Gouvea

Yitao Sun

David Huggins-Daines

Jahanzeb Sherwani

What is CMU Sphinx?
  • Definition 1 :
    • a large vocabulary speech recognizer with high accuracy and speed performance.
  • Definition 2 :
    • a collection of tools and resources that enables developers/researchers to build successful speech recognizers
Brief History of Sphinx
  • A more detailed version can be found at,


  • Timeline (figure): Sphinx I, Sphinx II, Sphinx III (“S3 slow”), Sphinx III (“S3 fast” or …), Sphinx IV
  • Dates marked: 2004 Jul, 2004 Oct
  • Sphinx becomes open-source
What is Sphinx 3.X?
  • An extension of Sphinx 3’s recognizers
  • “Sphinx 3.X (X=5)” means “Sphinx 3.5”
  • It helps to confuse people more.
  • Provide functionalities such as
    • Real-time speech recognition
    • Speaker adaptation
    • Developer application interfaces (APIs)
  • 3.X (X>3) is motivated by Project CALO
Development History of Sphinx 3.X


  • Sphinx 3 flat-lexicon recognizer (s3 slow)
  • Sphinx 3 tree-lexicon recognizer (s3 fast)
  • S3.3
    • with live-mode demo
  • S3.4
    • fast GMM computation
    • support for class-based LM
    • some support for dynamic LM
  • S3.5
    • some support for speaker adaptation
    • live-mode APIs
    • Sphinx 3 and Sphinx 3.X code merge

This talk
  • A general summary of what’s going on.
  • Less technical than 3.4 talk
    • Folks were so confused by the jargon in speech recognition’s black magic.
  • More for code development, less for acoustic modeling
    • Reason: I do not have much time to do both
    • (Incorrect version): “We need to adopt the latest technology to clown 2 to 3 Arthur Chan(s) for the CALO project.” –Prof. Alex Rudnicky, in one CALO meeting in 2004
    • (“Kindly” corrected by Prof. Alan Black): “We need to adopt the latest technology to clone 2 to 3 Arthur Chan(s) for the CALO project.” –Prof. Alex Rudnicky, in one CALO meeting in 2004
  • More from a project point of view
    • Speech recognition software easily shows phenomena described in “Mythical Man-Month”.
This talk (outline)
  • Sphinx 3.X, The recognizer (From X=4 to X=5) (~10 pages)
    • Accuracy and Speed (5 pages)
    • Speaker Adaptation (1 page)
    • Application Interfaces (APIs) (2 pages)
    • Architecture (2 pages)
  • Sphinx as a collection of resources (~10 pages)
    • Code distribution and management (3 pages)
    • Infrastructure of Training (1 page)
    • SphinxTrain: tools of training acoustic models. (1 page)
    • Documentation (3 pages)
    • Team and Organization (2 pages)
  • Development plan for Sphinx 3.X (X >= 6) (2 pages)
  • Relationship between speech recognition and other speech research. (4 pages)
Accuracy and Speed
  • Why Sphinx 3.X ? Why not Sphinx 2?
  • Due to the computational limitations of the 1990s
    • S2 only supports a restricted version of semi-continuous HMM (SCHMM)
    • S3.X supports fully continuous HMM (FCHMM)
  • Accuracy improvement is around 30% relative
    • You will see benchmarking results two slides later
  • Speed
    • S3.X is still slower than S2
    • But in many tasks, it now seems reasonable to use it.
  • Fast search techniques
    • Lexical tree search (s3.2)
    • Viterbi beam tuning and histogram beam pruning (s3.2)
    • Ravi’s talk
    • Phoneme look-ahead (s3.4 by Jahanzeb)
  • Fast GMM computation techniques (s3.4)
    • Using the measurements in the literature, that means
      • 75%-90% reduction of GMM computation with fast GMM computation + pruning.
      • <10% relative degradation can usually be achieved on clean databases.
    • Further Detail: “Four-Layer Categorization Scheme of Fast GMM Computation Techniques“ A. Chan et al.
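The beam and histogram pruning named in the fast search list can be sketched in a few lines. This is only an illustration of the general idea, not Sphinx 3.X code: the function name `prune`, the parameters `beam` and `max_active`, and the toy scores are all assumptions made for the example.

```python
import heapq


def prune(hyps, beam, max_active):
    """Apply beam pruning, then histogram pruning, to one frame.

    hyps: dict mapping a state id to its log-score (higher is better).
    beam: log-domain width; hypotheses more than `beam` below the best are dropped.
    max_active: hard cap on the number of surviving hypotheses.
    """
    if not hyps:
        return {}
    best = max(hyps.values())
    # Beam pruning: threshold relative to the best score in this frame.
    survivors = {s: v for s, v in hyps.items() if v >= best - beam}
    # Histogram pruning: keep at most `max_active` of the best survivors.
    if len(survivors) > max_active:
        top = heapq.nlargest(max_active, survivors.items(), key=lambda kv: kv[1])
        survivors = dict(top)
    return survivors


# Hypothetical scores for five active states in a single frame.
frame = {"a": -10.0, "b": -12.5, "c": -30.0, "d": -11.0, "e": -14.0}
kept = prune(frame, beam=5.0, max_active=3)
print(sorted(kept))
```

Here the beam removes state `c`, and the cap of three then removes `e`. A wider beam or a larger cap trades speed for accuracy, which is the tuning the bullets refer to.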
Accuracy Benchmarking (Communicator Task)
  • Test platform, 2.2G Pentium IV
    • CMU Communicator task
    • Vocabulary size (3k) , perplexity: ~90
    • All tuning was done without sacrificing 5% of performance.
    • Batch mode decoder is used. (decode)
  • Sphinx 2 (tuned with speed-up techniques)
    • WER: 17.8% (0.34xRT)
  • Baseline results Sphinx 3.X, 32 gaussian-FCGMM
    • WER: 14.053% (2.40xRT)
  • Baseline results Sphinx 3.X, 64 gaussian-FCGMM
    • WER: 11.7% (~3.67xRT)
  • Tuned Sphinx 3.X 64 gaussian-FCGMM
    • WER: 12.851% (0.87 xRT), 12.152% (1.17xRT)
  • Rong can make it better: Boosting training results : 10.5%
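To relate these numbers to the relative improvement quoted earlier, the arithmetic can be written out as a short sketch. The WER and xRT values are copied from this slide; the helper name `relative_reduction` and the dictionary layout are assumptions made for the example.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction of new_wer against baseline_wer, as a fraction."""
    return (baseline_wer - new_wer) / baseline_wer


# WER (%) and real-time factors (xRT) as listed on the slide.
sphinx2 = {"wer": 17.8, "xrt": 0.34}
s3_64g_baseline = {"wer": 11.7, "xrt": 3.67}
s3_64g_tuned = {"wer": 12.851, "xrt": 0.87}

for name, system in [("baseline", s3_64g_baseline), ("tuned", s3_64g_tuned)]:
    gain = 100 * relative_reduction(sphinx2["wer"], system["wer"])
    slowdown = system["xrt"] / sphinx2["xrt"]
    print(f"S3.X 64-gaussian {name}: {gain:.1f}% relative WER reduction, "
          f"{slowdown:.1f}x the run time of Sphinx 2")
```

With these figures the untuned 64-gaussian system is about 34% better than the tuned Sphinx 2 system in relative terms, and the tuned one about 28% better while running below real time.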
Accuracy/Speed Benchmarking (WSJ Task)
  • Test platform, 2.2G Pentium
    • Vocabulary Size (5k)
    • Standard NVP task.
  • Trained on both WSJ0 and WSJ1
  • Sphinx 2, 14.5% (?)
  • Sphinx 3.X, 8 gaussian-FCGMM
    • un-tuned 7.3% 1.6xRT
    • tuned: 8.29% 0.52xRT
Accuracy/Speed Benchmarking (Future Plan)
  • Issue 1 : Large variance in GMM computation.
    • Average performance is good, but the worst case can be disastrous.
  • Issue 2 : Tuning requires a black magician
    • Automatic tuning is necessary.
  • Issue 3 : Still need to work on larger databases (e.g. WSJ 20k, BN)
    • Training setups need to be dug up
  • Issue 4 : Speed-up on noisy corpora is tricky.
    • Results are not satisfactory (20-30% degradation in accuracy)
Speaker Adaptation
  • Starting to support MLLR-based speaker adaptation
    • y=Ax+b , estimate A, b in a maximum likelihood fashion (Leggetter 94)
  • Current functionality of Sphinx 3.X + SphinxTrain
    • Allows estimation of the transformation matrix
    • Transforming means offline
    • Transforming means online
    • The decoder only supports a single regression class.
  • Code gives exactly the same results as Sam Joo’s code.
  • Not fully benchmarked yet, still experimental
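The mean transformation y=Ax+b can be illustrated with a small sketch. It only shows how an already estimated transform is applied to Gaussian means under a single regression class; the maximum likelihood estimation of A and b is not shown. The function name `mllr_transform` and all numbers are made up for the example and are not taken from Sphinx or SphinxTrain.

```python
def mllr_transform(A, b, mean):
    """Return A * mean + b for one Gaussian mean (plain lists, no NumPy)."""
    return [sum(a * x for a, x in zip(row, mean)) + offset
            for row, offset in zip(A, b)]


# Toy 2-dimensional transform; real feature vectors are much larger.
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
means = [[2.0, 3.0], [-1.0, 4.0]]

# Single regression class: the same (A, b) is applied to every mean.
adapted = [mllr_transform(A, b, m) for m in means]
print(adapted)
```

With more than one regression class, each group of Gaussians would get its own (A, b) pair instead of sharing one.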
Live-mode APIs
  • Thanks to Yitao
  • Sets of C APIs that provide recognition functionality
    • Close to Sphinx 2’s style of APIs
  • Speech recognition resource initialization/un-initialization
  • Functions for Utterance level begin/end/process waveforms
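The call order implied by these bullets (initialize resources, begin an utterance, process waveform chunks, end the utterance, un-initialize) can be sketched with a toy class. Everything here is hypothetical: the class `ToyLiveDecoder` and its method names are invented for illustration and are not the Sphinx 3.X C API, and the "decoder" only counts samples.

```python
class ToyLiveDecoder:
    """Hypothetical stand-in that shows call order only; not the Sphinx API."""

    def __init__(self):
        self.initialized = False
        self.in_utterance = False
        self.samples_seen = 0

    def init(self):
        # A real recognizer would load acoustic and language models here.
        self.initialized = True

    def begin_utt(self):
        if not self.initialized:
            raise RuntimeError("init() must be called before begin_utt()")
        self.in_utterance = True
        self.samples_seen = 0

    def process_raw(self, samples):
        if not self.in_utterance:
            raise RuntimeError("process_raw() called outside an utterance")
        # A real recognizer would extract features and advance the search.
        self.samples_seen += len(samples)

    def end_utt(self):
        self.in_utterance = False
        # Placeholder result: the number of samples consumed.
        return self.samples_seen

    def close(self):
        # Un-initialization: release whatever init() acquired.
        self.initialized = False


decoder = ToyLiveDecoder()
decoder.init()
decoder.begin_utt()
for chunk in ([0] * 160, [0] * 160, [0] * 80):
    decoder.process_raw(chunk)
n = decoder.end_utt()
decoder.close()
print(n)
```

Feeding the waveform in chunks between the begin and end calls is what lets an application decode while audio is still arriving.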
Live-mode APIs : What is missing?
  • What we lack
    • Dynamic LM addition and deletion
      • part of the plan for s3.6
    • Finite state machine implementation
      • part of the plan for s3.X where X=8 or 9
    • End-pointer integration and APIs
      • Ziad Al Bawab’s model-based classifier
      • Now as a customized version, s3ep
  • “Code duplication is the root of many evils”
  • Four tools of s3 are now incorporated into S3.5
    • align : an aligner
    • allphone : a phoneme recognizer
    • astar : lattice to N-best generation
    • dag : lattice best-path search
  • Many thanks to Dr. Carl Quillen of MIT Lincoln
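As a rough illustration of what a lattice best-path search does, the sketch below runs dynamic programming over a small word lattice stored as a directed acyclic graph. It is not the implementation of the `dag` tool: the function name `best_path`, the lattice layout, and the scores are assumptions chosen for the example.

```python
def best_path(edges, start, end, order):
    """Find the highest-scoring word sequence from start to end.

    edges: dict mapping a node to a list of (next_node, word, log_score).
    order: the lattice nodes in topological order.
    Returns (total_log_score, words), or None if end is unreachable.
    """
    best = {start: (0.0, [])}
    for node in order:
        if node not in best:
            continue  # node is not reachable from start
        score, words = best[node]
        for nxt, word, log_score in edges.get(node, []):
            candidate = score + log_score
            if nxt not in best or candidate > best[nxt][0]:
                best[nxt] = (candidate, words + [word])
    return best.get(end)


# Hypothetical three-node lattice with competing word hypotheses.
lattice = {
    0: [(1, "hello", -1.0), (1, "yellow", -2.5), (2, "hollow", -3.0)],
    1: [(2, "world", -0.5), (2, "word", -0.7)],
}
score, words = best_path(lattice, start=0, end=2, order=[0, 1, 2])
print(score, words)
```

An N-best generator such as the one `astar` provides would keep several ranked paths instead of only the single best one per node.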
Architecture : Next Step
  • decode_anytopo will be the next
  • Things we may incorporate someday
    • SphinxTrain
    • CMU-Cambridge LM Toolkit
    • lm3g2dmp and cepview
Code Distribution and Management
  • Distribution
    • Internal Release -> RC I -> RC II .. -> RC N
    • If no one yells during the calm-down period of RC N
      • Then, put a tarball on the SourceForge web page
  • At every release,
    • The distribution has to compile on ~10 platforms
  • The first announcement is usually made during the RC period.
  • Web page is maintained by
    • Evandro (<-extremely sane)
Digression: Other versions of Sphinx 3.X
  • Code that does not satisfy the design goals of the software
  • S3 slow w/ GMM Computation
  • S3.5 with end-pointer
  • CMU Researchers’ code and implementation
    • E.g. According to legend, Rita has >10 versions of Sphinx and SphinxTrain.
Code Management
  • Concurrent Versions System (CVS) is used in Sphinx
    • Also used in other projects e.g. CALO and Festival
    • A very effective way to tie resource and knowledge together
  • Problem : There are still a lot of separate versions of the code at CMU that are not in Sphinx’s CVS.
    • Please kindly contact us if you work on something using Sphinx or derived from Sphinx
Infrastructure of Training
  • A need for persistence and version control
    • Baselines were lost after several years.
  • Setups will now be available in CVS for
    • Communicator (11.5%)
    • WSJ 5k NVP (7.3%)
    • ICSI Phase 3 Training
  • Far from the state of the art
    • Need to re-engineer and do archeology
  • Will add more tasks to the archive
  • You are welcome to change the setup if you don’t like it
    • But you need to check in what you have done
  • SphinxTrain has never been officially released
    • Still a work in progress.
    • For sphinx3.X (X>=5), corresponding timestamp of SphinxTrain will also be published.
  • Recent Progress
    • Better on-line help
    • Added support for adaptation
    • Better support in perl scripts for FCHMM (Evandro)
    • Silence deletion in Baum-Welch Training (experimental)
Hieroglyph: Using Sphinx for building speech recognizers
  • Project Hieroglyphs
    • An effort to build a complete set of documentation for using Sphinx, SphinxTrain and the CMU LM Toolkit for building speech applications.
  • Largely based on Evandro, Rita, Ravi, Roni’s docs.
  • “Editor”: Arthur Chan <- does a lot of editing
  • Authors:
    • Arthur, David, Evandro, Rita, Ravi, Roni, Yitao
Hieroglyph: An outline
  • Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
  • Chapter 2: Introduction to Sphinx
  • Chapter 3: Introduction to Speech Recognition
  • Chapter 4: Recipe of Building Speech Application using Sphinx
  • Chapter 5: Different Software Toolkits of Sphinx
  • Chapter 6: Acoustic Model Training
  • Chapter 7: Language Model Training
  • Chapter 8: Search Structure and Speed-up of the Speech recognizer
  • Chapter 9: Speaker Adaptation
  • Chapter 10: Research using Sphinx
  • Chapter 11: Development using Sphinx
  • Appendix A: Command Line Information
  • Appendix B: FAQ
Hieroglyph: Status
  • Still in the drafting stage
    • Chapter I : License and use of Sphinx, SphinxTrain and CMU LM Toolkit (1st draft, 3rd Rev)
    • Chapter II : Introduction to Sphinx, SphinxTrain and CMU LM Toolkit (1st draft, 1st Rev)
    • Chapter VIII : Search Structure and Speed-up of Sphinx's recognizers (1st draft, 1st Rev)
    • Chapter IX: Speaker adaptation using Sphinx (1st draft, 2nd Rev)
    • Chapter XI: Development using Sphinx (1st draft, 1st Rev)
    • Appendix A.2: Full SphinxTrain Command Line Information (1st draft, 2nd Rev)
  • Writing Quality : Low
  • The 1st draft will be completed ½ year later (hopefully)
Team and Organization
  • “Sphinx Developers”:
  • A group of volunteers who maintain and enhance Sphinx and related resources
  • Current Members:
    • Arthur Chan (Project Manager / Coordinator)
    • Evandro Gouvea (Maintainer / Developer)
    • David Huggins-Daines (Developer)
    • Yitao Sun (Developer)
    • Ravi Mosur (Speech Advisor)
    • Alex Rudnicky (Speech Advisor)
    • All of you
      • Application Developers
      • Modeling experts
      • Linguists
      • Users
Team and Organization
  • We need help!
  • Several positions are still available for volunteers:
    • Project Manager : Enable Development of Sphinx
      • Translation: kick/fix miscellaneous people (lightly) every day.
    • Maintainer : Ensure integrity of Sphinx code and resources
      • Translation: a good chance for you to understand life more
    • Tester : Enable test-based development in Sphinx
      • Translation: a good way to increase blood pressure.
    • Developers : Incorporate state-of-the-art technology into Sphinx
      • Translation: deal with legacy code and start to write legacy code yourself
  • For your projects, you can also send us temp people.
  • Regular meetings are scheduled biweekly.
    • Though, if we are too busy, we just skip it.
Next 6 months: Sphinx 3.6
  • More refined speaker adaptation
  • More support for dynamic LM
  • More speed-up of the code
  • Better documentation (Complete 1st Draft of Hieroglyph?)
  • Confidence measure(?)
If we still survive and have a full team……
  • Roadmap of Sphinx 3.X (X>6)
    • X=7,
      • Decoder, Trainer code merge
      • FSG implementation
      • Confidence annotation
    • X=8 :
      • Trainer fixes
      • LM manipulation support
    • X=9 :
      • Better covariance modeling and speaker adaptation
      • Hieroglyph completed
    • X>= 10 : To move on, innovation is necessary.
Speech recognition and other Research
  • The goal of Sphinx
    • Support innovation and development of new speech applications
    • A conscious and correct decision in long-term speech recognition research
  • In Speech Synthesis:
    • An aligner is important for unit selection
  • In Parsing/Dialog Modeling:
    • Sphinx 3.X still has a lot of errors!
    • We still need Phoenix! (Robust Parser)
    • We still need Ravenclaw House! (Dialog Manager)
  • In Speech Applications
    • A good recognizer is the basis
Cost of Research in Speech Recognition
  • 30% WER reduction is usually perceivable to users
    • i.e. roughly translates to 1-2 good algorithmic improvements
  • In a well-educated research group
    • Known techniques usually require ½ year to implement and test.
    • Unknown techniques will take more time. (1 year per innovation)
    • Experienced developers :
      • 1 month to implement known techniques
      • 3 months to innovate
  • It still makes sense to continue supporting
    • speech recognizer development
    • acoustic modeling improvement.
  • To consolidate, what we were lacking
    • 1, code and project management
      • A multi-developer environment is strictly essential.
    • 2, transfer of research to development
    • 3, acoustic modeling research: discriminative training, speaker adaptation
Future of Sphinx 3.X
  • ICSLP 2004
    • “From Decoding Driven to Detection-Based Paradigms for Automatic Speech Recognition” by Prof. Chin-Hui Lee
  • Speech Recognition:
    • Still an open problem in 2004
  • Role of Speech Recognition in Speech Application:
    • Still largely unknown
    • Requires open minds to understand
  • We’ve done something in 2004
    • Our effort starts to make a difference
  • We still need to do more in 2005
    • Making Sphinx 3.X a backbone of speech application development
    • Consolidation of the current research and development in Sphinx
    • Seek ways to sustain development