Progress of Sphinx 3.X From X=5 to X=6


Arthur Chan

Evandro Gouvea

David J. Huggins-Daines

Alex I. Rudnicky

Mosur Ravishankar

Yitao Sun

If you want to leave now……Take home message 1

Sphinx 3.6 Rocks!


Here is another one……Take home message 2

We need Better Acoustic Models.

This talk (~37 pages)
  • Overview (6 pages)
  • Better Software Architecture (9 pages)
  • Speed of Sphinx 3.6 (3 pages)
  • Accuracy Improvement (7 pages)
  • Functionalities Improvement (3 pages)
  • Documentation (4 pages)
  • Sphinx 3.X (X>6) and Conclusion (~5 pages)
  • Discussion (10 mins?)
What is CMU Sphinx?
  • Definition 1 :
    • Large vocabulary speech recognizers with high accuracy and speed performance.
  • Definition 2 :
    • A collection of tools and resources that enables developers/researchers to build successful speech recognition systems
Family of CMU Sphinx
  • Decoders
    • Sphinx {II – IV}
    • PocketSphinx (by Dave, Oct 2005)
  • Acoustic Model Trainer
    • SphinxTrain
  • Documentation
    • Hieroglyphs
    • Robust/SphinxTrain Tutorial
Sphinx Developers
  • Sphinx is maintained by
    • Volunteer programmers/researchers who like speech recognition
      • Funded by different projects
      • Motivated by different reasons
    • All contributions go to the same codebase
    • Goal : Sustainable development of Sphinx
  • Sphinx Developer Meetings are held
    • regularly
    • secretly
    • to decide the way to go in Sphinx
What is Sphinx 3.X?
  • An extension of Sphinx 3’s recognizers
  • “Sphinx 3.X (X=6)” means “Sphinx 3.6”
  • Provides more functionality, such as
    • Real-time speech recognition
    • Speaker adaptation
    • Developers Application Interfaces (APIs)
    • Different search algorithms
  • 3.X (X>3) is motivated by Project CALO and GALE
Development History of Sphinx 3.X
  • S3: Sphinx 3 flat-lexicon recognizer (s3 slow)
  • S3.2: Sphinx 3 tree-lexicon recognizer (s3 fast)
  • S3.3: live-mode demo
  • S3.4: fast GMM, class-based LM, dynamic LM
  • S3.5
    • Some support for speaker adaptation
    • Live-mode APIs
  • S3.6
    • Better search architecture/implementation
    • More support for speaker adaptation
    • Gentle re-factoring of the code base
    • Some support for FSG decoding and confidence
    • Better documentation/tutorial
This talk – Progress of Sphinx 3.6
  • From the perspective of
    • a developer
    • an observer
  • Sphinx 3.6
    • Where are we now?
    • Where will we go?
  • Summary of 5 talks
Motivation of Re-Architecting Sphinx 3.X
  • We are starting to need new search algorithms
    • Developing a new search algorithm carries risk.
    • We don’t want to throw away the old one.
    • Mere replacement could cause backward-compatibility problems.
  • The code has grown to a stage where
    • Some changes could be very hard.
  • Multiple programmers have become active at the same time
    • CVS conflicts could become frequent if everything is controlled by “if-else” structures
Architecture of Sphinx 3.X (X<6)
  • Batch sequential Architecture (Shaw 96)
  • Each executable has customized sub-routines

[Diagram: four executables, each with its own stack of modules: Command Line 1–4, Process Controller 1–4, Initialization 1–4, GMM Computation 1–4, Search 1–4. Initialization 1 uses kb and kbcore; GMM Computation 2, 3 and 4 use gauden & senone Methods 1, 2 and 3.]
Architecture Diagram of Sphinx 3.6

[Diagram; surviving labels: User Defined, Fast Single Stream, Multi Stream, FSG Search, Flat Lexicon Search, Tree Lexicon Search]
Separation of Mechanism and Implementation
  • A class provides Atomic Search Operations (ASOs) in the form of function pointers
  • Configured by just setting function pointers
  • A single interface for applications

[Diagram: one Search Mechanism Module (srch.c) connected to several Search Implementation Modules]
  • Search Implementation Modules
    • Could have many of them
      • A, Decoding with different implementations
      • B, Concepts of search including
        • phoneme recognition
        • keyword spotting
Search Mechanism Module – What does it do?
  • Computation of One Frame

[Diagram of "Search For One Frame"; surviving labels: (CI senone), (CD senone), "At word end using … (e.g. LM)"]

Search Implementations
  • Implemented (-op_mode)
    • Finite State Grammar Search (Mode 2)
    • Flat Lexicon Search (Mode 3)
    • Tree Search (Mode 4)
  • Not in 3.6
    • Aligner (Mode 0)
    • Phoneme recognition (Mode 1)
    • A new tree search (Mode 5)
Different ways to implement search implementations
  • 1, Use default implementation
    • Just specify all atomic search operations (ASOs) provided
  • 2, Override “search_one_frame”
    • Only need to specify GMM computation and how to “search_one_frame”
  • 3, Override the whole mechanism
    • For people who dislike the default so much
    • Override how to “search”
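
The three ways listed above can be pictured with a small table of callables. This is only a sketch in Python, not the srch.c code: `SearchImpl`, `run_search`, `propagate`, `prune` and the toy scoring functions are hypothetical names invented for illustration, and plain callables stand in for C function pointers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SearchImpl:
    """Hypothetical table of operations supplied by one search implementation."""
    compute_gmm: Callable[[float], float]
    # Way 1: atomic search operations used by the default per-frame step.
    propagate: Optional[Callable[[float, float], float]] = None
    prune: Optional[Callable[[float], float]] = None
    # Way 2: replaces the default per-frame composition of ASOs.
    search_one_frame: Optional[Callable[[float, float], float]] = None
    # Way 3: replaces the whole utterance loop.
    search: Optional[Callable[[List[float]], float]] = None


def run_search(impl: SearchImpl, frames: List[float]) -> float:
    """Mechanism: one entry point that drives whatever the implementation set."""
    if impl.search is not None:
        return impl.search(frames)
    state = 0.0
    for frame in frames:
        score = impl.compute_gmm(frame)
        if impl.search_one_frame is not None:
            state = impl.search_one_frame(state, score)
        else:
            state = impl.prune(impl.propagate(state, score))
    return state


def toy_gmm(frame: float) -> float:
    """Stand-in for GMM scoring: a made-up log-likelihood."""
    return -abs(frame)


FRAMES = [1.0, -2.0, 3.0]
default_impl = SearchImpl(
    compute_gmm=toy_gmm,
    propagate=lambda state, score: state + score,
    prune=lambda state: max(state, -10.0),
)
per_frame_impl = SearchImpl(
    compute_gmm=toy_gmm,
    search_one_frame=lambda state, score: min(state, score),
)
custom_impl = SearchImpl(
    compute_gmm=toy_gmm,
    search=lambda frames: float(len(frames)),
)

for impl in (default_impl, per_frame_impl, custom_impl):
    print(run_search(impl, FRAMES))
```

The caller only ever sees `run_search`; which of the three ways is in effect depends purely on which fields were set.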
Consequence of Re-factoring
  • Calling decode
    • Could use flat-lexicon decoding as well
  • decode_anytopo still exists
    • For backward compatibility
    • decode_anytopo = decode
  • allphone, align, decode_anytopo could use fast GMM computation
  • decode could use S3’s SCHMM
  • Command-line is now synchronized
Summary on the Architecture
  • Sphinx 3.6
    • A gentle re-factoring has been carried out.
    • A more flexible architecture
    • A better playground for AM and search people
      • S2 SCHMM computation routine?
      • NN, SVM, ML techniques for AM?
Speed in Sphinx 3.6
  • Further work on Context-Independent Senone-based GMM Selection (CIGMMS)
    • 20-30% Speed Up
  • 3 tricks were proposed
    • Fixed amount of CD senone compute.
    • Use of best Gaussian index
    • Tightening factor of CI-phone beam
  • Published in “On Improvements of CI-based GMM Selection” (Chan 2005)
  • But not very well received
    • Alright, there is some accuracy loss
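
The general idea behind CI-based GMM selection can be sketched as follows, assuming the usual formulation: CI senones are scored first, and a CD senone is fully computed only when its parent CI senone scores within a beam of the best CI score; otherwise it backs off to the CI score. The function name, arguments and numbers are hypothetical, and none of the three tricks above is implemented here.

```python
from typing import Callable, Dict, Tuple


def select_and_score(
    ci_scores: Dict[str, float],
    cd_parent: Dict[str, str],
    score_cd: Callable[[str], float],
    ci_beam: float,
) -> Tuple[Dict[str, float], int]:
    """Return CD senone scores and how many were fully computed.

    ci_scores: log-likelihood of each CI senone for the current frame.
    cd_parent: maps each CD senone to its parent CI senone.
    score_cd:  the expensive full GMM evaluation for one CD senone.
    ci_beam:   non-negative beam width in the log domain.
    """
    best_ci = max(ci_scores.values())
    scores: Dict[str, float] = {}
    computed = 0
    for cd, ci in cd_parent.items():
        if ci_scores[ci] >= best_ci - ci_beam:
            scores[cd] = score_cd(cd)      # parent survived: compute fully
            computed += 1
        else:
            scores[cd] = ci_scores[ci]     # parent pruned: back off to CI score
    return scores, computed


# Made-up numbers for one frame.
ci = {"AH": -10.0, "K": -50.0}
parents = {"AH_1": "AH", "AH_2": "AH", "K_1": "K"}
full = {"AH_1": -8.0, "AH_2": -12.0, "K_1": -45.0}

narrow, n_narrow = select_and_score(ci, parents, full.__getitem__, ci_beam=20.0)
wide, n_wide = select_and_score(ci, parents, full.__getitem__, ci_beam=100.0)
print(n_narrow, n_wide)
```

A tighter beam saves GMM computation but replaces more CD scores with coarser CI scores, which is where the accuracy loss comes from.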
A note on Sphinx 3.6 Speed Performance
  • Sphinx 3.X works under 1xRT in most tasks. E.g.
    • Smartnote/Sphinx Integration
    • Broadcast News, untuned result: 1.5xRT
  • Sphinx 3.X is still slower than Sphinx 2
    • Fast setup of Sphinx 2: use 256 codeword SCHMM
    • Fast setup of Sphinx 3: use 2000-6000 senone FCHMM
      • Historical notes: Comparable SCHMM setup has 4096 codewords
    • Need benchmarking to truly judge
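
For readers unfamiliar with the "xRT" notation used above: it is the ratio of processing time to audio duration, so values below 1 mean faster than real time. A minimal sketch, with a hypothetical function name and made-up timings:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (the "xRT" figure)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Made-up example: 90 s of decoding for 60 s of audio.
print(real_time_factor(90.0, 60.0))
```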
Speed - Conclusion
  • Sphinx 3.X speed is at a reasonable level
    • Sphinx 2 should still be used in speed-critical conditions
  • Further work
    • GALE/CALO will still be around in 3.6/3.7
      • Accuracy becomes a stronger motivation than speed
Our Immediate Problem
  • What helps us more in accuracy?
    • Acoustic modeling ?
    • Speaker Adaptation ?
    • Search Improvement ?
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation
  • Speaker adaptation techniques are shown to be crucial
  • Even in tough tasks (e.g. CALO)
    • 10-15% relative improvement
    • Gain similar to LM/AM modeling work
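
To make "relative improvement" concrete, here is a small worked calculation. The 40% baseline word error rate is a made-up figure chosen only for the arithmetic; the slides do not give absolute error rates, and the function names are hypothetical.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed."""
    return (baseline_wer - new_wer) / baseline_wer


def wer_after(baseline_wer: float, relative_gain: float) -> float:
    """Error rate left after a given relative gain."""
    return baseline_wer * (1.0 - relative_gain)


# Hypothetical 40% baseline WER with 10% and 15% relative gains.
for gain in (0.10, 0.15):
    print(gain, wer_after(40.0, gain))
```

Under that assumed baseline, a 10-15% relative gain corresponds to roughly 4 to 6 points of absolute WER.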
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation (cont.)
  • Dave has done a great job on
    • Multiple-class MLLR
    • MAP adaptation
  • Things to watch
    • Ziad’s VTLN implementation
Conclusion in Speaker Adaptation
  • Observation in 3.6
    • Speaker adaptation is very important.
    • What we still need:
      • Maximum likelihood linear transformation (MLLT)
      • Combination of MLLT, MLLR, MAP and VTLN
        • Proved to be additive
Accuracy Improvement of Sphinx 3.6 - Search
  • Our Attempts in Flat Lexicon Decoder
    • Full triphones
      • 2.5% rel. gain
      • But 100xRT
    • Full trigram
      • Will give another 5-10 times slowdown
  • Diff between Tree vs Flat Lex. Decoder
    • 5% relative
  • Conclusion:
    • Further improvement in search is limited
Accuracy Improvement in Sphinx 3.6 – Modeling
  • Mainly
    • on addition of data (Major contributor)
    • interpolation of LM (very decent gain)
  • Things to watch: Yi’s LDA
  • Yet to explore
    • Speaker Adaptive Training (SAT)
    • Semi-tied Covariance (STC) Matrix
  • Conclusion:
    • Commodity techniques are still not widely used in Sphinx (Bad sign).
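
The slide does not say which interpolation scheme was used for the LMs. Assuming plain linear interpolation, the idea can be sketched on two toy unigram distributions; the function name, vocabularies and weight below are all made up for illustration.

```python
from typing import Dict


def interpolate(p_a: Dict[str, float], p_b: Dict[str, float], lam: float) -> Dict[str, float]:
    """Linear interpolation: lam * p_a(w) + (1 - lam) * p_b(w) for every word."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation weight must be in [0, 1]")
    vocab = set(p_a) | set(p_b)
    return {w: lam * p_a.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0) for w in vocab}


# Toy in-domain and background unigram models.
meeting_lm = {"meeting": 0.5, "the": 0.5}
news_lm = {"news": 0.25, "the": 0.75}

mixed = interpolate(meeting_lm, news_lm, lam=0.5)
print(sorted(mixed.items()))
```

Because each input sums to one and the weights sum to one, the mixture is again a proper distribution. In practice the weight would be tuned on held-out data rather than fixed at 0.5.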
Conclusion of Accuracy Improvement 3.6
  • 3.6 has a healthy development in speaker adaptation
  • Improvement in search is hard
  • Need 10x effort on acoustic modeling
    • Commodity techniques are still not there
    • Three final keywords: MLLT, SAT, STC
  • Priorities:
    • Adaptation > AM, LM > 2 stage Search >> 1st Stage

FSG search
  • 3.6 supports FSG search
    • Adapted from Sphinx 2’s implementation
  • Current Issues
    • No lextree implementation
    • Static allocation of all HMMs; not allocated “on demand”
    • FSG transitions represented by NxN matrix
  • Other wish list
    • No histogram pruning
    • No state-based implementation
  • Need more testing
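
Why an NxN transition matrix counts as an issue can be seen by comparing it with a sparse table on a toy grammar. This sketch is illustrative only: the helper names and the four-state grammar are hypothetical, and the sparse layout is one possible alternative, not something the slide proposes.

```python
from typing import Dict, List, Optional, Tuple

Arc = Tuple[int, int, str]  # (source state, destination state, word)


def dense_transitions(n_states: int, arcs: List[Arc]) -> List[List[Optional[str]]]:
    """NxN matrix: one cell per state pair, whether or not an arc exists."""
    matrix: List[List[Optional[str]]] = [[None] * n_states for _ in range(n_states)]
    for src, dst, word in arcs:
        matrix[src][dst] = word
    return matrix


def sparse_transitions(arcs: List[Arc]) -> Dict[int, List[Tuple[int, str]]]:
    """Adjacency lists: storage grows with the number of arcs only."""
    table: Dict[int, List[Tuple[int, str]]] = {}
    for src, dst, word in arcs:
        table.setdefault(src, []).append((dst, word))
    return table


# Toy grammar: "call home" or "call office".
arcs = [(0, 1, "call"), (1, 2, "home"), (1, 3, "office")]
dense = dense_transitions(4, arcs)
sparse = sparse_transitions(arcs)

dense_cells = sum(len(row) for row in dense)
sparse_cells = sum(len(out) for out in sparse.values())
print(dense_cells, sparse_cells)
```

The dense form grows quadratically with the number of states. It also holds only one arc per state pair, so parallel arcs with different words would need extra structure.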
Confidence Annotation
  • conf
  • Adapted from Rong with permission
    • Compute Word Posterior Probability of a word given lattice
  • Still under work
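
A word posterior on a lattice is commonly obtained with a forward-backward pass. The sketch below shows that textbook computation on a toy lattice and is not a description of the conf program: the function name and scores are made up, scores are kept in the linear domain for readability, and nodes are assumed to be numbered so that every edge goes from a lower to a higher index, with node 0 as start and the last node as end.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str, float]  # (src node, dst node, word, edge score)


def word_posteriors(n_nodes: int, edges: List[Edge]) -> Dict[str, float]:
    """Posterior of each word label, summed over all edges carrying it."""
    # Forward pass: total score of all partial paths from the start to each node.
    alpha = [0.0] * n_nodes
    alpha[0] = 1.0
    for src, dst, _word, score in sorted(edges, key=lambda e: e[0]):
        alpha[dst] += alpha[src] * score

    # Backward pass: total score of all partial paths from each node to the end.
    beta = [0.0] * n_nodes
    beta[n_nodes - 1] = 1.0
    for src, dst, _word, score in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[src] += score * beta[dst]

    total = alpha[n_nodes - 1]  # score of all complete paths
    posteriors: Dict[str, float] = defaultdict(float)
    for src, dst, word, score in edges:
        posteriors[word] += alpha[src] * score * beta[dst] / total
    return dict(posteriors)


# Toy lattice: "call home" competes with "fall home".
lattice = [(0, 1, "call", 3.0), (0, 1, "fall", 1.0), (1, 2, "home", 2.0)]
post = word_posteriors(3, lattice)
print(post)
```

Words that appear on every path, such as "home" here, receive posterior 1, while competing words split the mass in proportion to their path scores.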
Language Model Related
  • Now fully supports
    • Text-based LM reading
    • Inter-conversion of LM in TXT & DMP format
      • lm_convert = lm3g2dmp++
    • LM switching API in live_decode_API
Hieroglyphs
  • A collection of documentation on using Sphinx 3, SphinxTrain and the CMU LM Toolkit
  • 1st Draft is completed
    • All chapters are filled with information.
    • Writing the 2nd Draft
  • “Chief Editor”: Arthur Chan
  • Does it even exist?
Hieroglyph: An outline
  • Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
  • Chapter 2: Introduction to Sphinx
  • Chapter 3: Introduction to Speech Recognition
  • Chapter 4: Recipe of Building Speech Application using Sphinx
  • Chapter 5: Different Software Toolkits of Sphinx
  • Chapter 6: Acoustic Model Training
  • Chapter 7: Language Model Training
  • Chapter 8: Search Structure and Speed-up of the Speech recognizer
  • Chapter 9: Speaker Adaptation
  • Chapter 10: Research using Sphinx
  • Chapter 11: Development using Sphinx
  • Appendix A: Command Line Information
  • Appendix B: FAQ
Book Reviews of Hieroglyphs
  • “You wrote the worst preface I have ever seen in my life.” Dr. Evandro Gouvea
  • “The content is o.k., but the writing is still ……” Prof. Alex I. Rudnicky
  • “Wow, it is thick. And, oh…… there are no blank spaces! You are not supposed to add contents in any CMU open source manuals, don’t you know?” Dr. Alan W. Black
Other Documents
  • Robust Tutorial (Aka Sphinx 101)
    • Thanks to Evandro
    • Now could be used for
      • archive_s3
      • Sphinx 2
      • Sphinx 3
  • Doxygen documentation for Sphinx 3.x is fully available
What is important?
  • Keep the current design priorities:
    • 1, Accuracy
      • We are just OK and we badly need to improve it.
    • 2, Speed
      • We are OK and it doesn’t hurt to improve it
    • 3, Functionalities
      • Still a pain to use Sphinx 3, but it is constantly improving
      • Usability eventually implies distributing models.
  • Accuracy should take priority over Speed
    • No excuse in 3.7
Roadmap: In X=7……
    • Speaker Clustering/SAT
      • Bridging SI and SA
    • VTLN
    • LDA
  • 0.5 x CALO may need further speed improvement
    • BBI
    • More secret ideas in GMM computation
Roadmap (cont.)
  • X=8
    • D.T.
      • MMIE, MCE
    • STC
    • Interface with HTK model
  • X=9
    • D.T. + S.A.
  • X>10
    • Time to fire Arthur Chan and hire an assistant professor
Other Possibilities of Sphinx?

[You fill in this part]

We need your help!
  • Project Manager: Enable Development of Sphinx
    • Translation: Kick/Fix people and Kicked/Fixed by Evandro
  • Developers: Incorporate state-of-art speech technology into Sphinx
    • Translation: Fix 1 bug and Generate 5 more
  • Maintainer: Ensure integrity of Sphinx code and resource
    • Translation: You become the so-called “Grand Janitor of Sphinx”.
  • Tester: Enable test-based development in Sphinx
    • Translation: You will learn a lot of Zen-Buddhism.
Our Current Motto (Subject to Change)

“Don’t ever underestimate yourself…… You never know what kind of a mess you could make.”

-Dr. Evandro Gouvea

Conclusion for Sphinx 3.X
  • We have done something
  • We are making some sense in the system development now
  • We have healthy growth in accuracy
    • But we still need more
Thank you
  • Acknowledgement
    • Rich/Alan: for your constant encouragement
    • Alex: for your understanding of Yin/Yang
    • Rong: for contributing the confidence estimation program
    • Bano: for reminding me I could die at any time when we were in Lake Arthur ->
      • Hieroglyphs 1st draft’s progress sped up.
    • Sphinx developers: without you, I won’t be the “Grand Janitor”.
    • Sphinx users: for your capabilities of giving me nightmares
Postscript, a word from my friend

“Don’t ever underestimate yourself…… You never know what a mess you could make.”

–Dr. Evandro Gouvea

Pros/Cons of Batch Sequential Architecture
  • Pros:
    • Great flexibility for individual programmers
    • No assumptions; data structures are usually optimized for the application.
      • Align and allphone have optimization.
    • Crafting in individual application has high quality
  • Cons:
    • Great difficulty in maintenance
      • Most changes need to be carried out 5-6 times.
    • Spreads the disease of code duplication
      • Code with the same functionality was duplicated multiple times