Progress of Sphinx 3.X From X=5 to X=6

Progress of Sphinx 3.X From X=5 to X=6. Arthur Chan Evandro Gouvea David J. Huggins-Daines Alex I. Rudnicky Mosur Ravishankar Yitao Sun. If you want to leave now…… Take home message 1. Sphinx 3.6 Rocks!. Here is another one…… Take home message 2 . We need Better Acoustic Models .


Progress of Sphinx 3.X From X=5 to X=6

Arthur Chan

Evandro Gouvea

David J. Huggins-Daines

Alex I. Rudnicky

Mosur Ravishankar

Yitao Sun

Here is another one……Take home message 2

We need Better Acoustic Models.

This talk (~37 pages)
  • Overview (6 pages)
  • Better Software Architecture (9 pages)
  • Speed of Sphinx 3.6 (3 pages)
  • Accuracy Improvement (7 pages)
  • Functionalities Improvement (3 pages)
  • Documentation (4 pages)
  • Sphinx 3.X (X>6) and Conclusion (~5 pages)
  • Discussion (10 mins?)
What is CMU Sphinx?
  • Definition 1 :
    • Large vocabulary speech recognizers with high accuracy and speed performance.
  • Definition 2 :
    • A collection of tools and resources that enables developers/researchers to build successful speech recognition systems
Family of CMU Sphinx
  • Decoders
    • Sphinx {II – IV}
    • PocketSphinx (by Dave, October 2005)
  • Acoustic Model Trainer
    • SphinxTrain
  • Documentation
    • Hieroglyphs
    • Robust/SphinxTrain Tutorial
Sphinx Developers
  • Sphinx is maintained by
    • Volunteer programmers/researchers who like speech recognition
      • Funded by different projects
      • Motivated by different reasons
    • All contributions go to the same codebase
    • Goal : Sustainable development of Sphinx
  • Sphinx Developer Meetings are held
    • regularly
    • secretly
    • to decide the way to go in Sphinx
What is Sphinx 3.X?
  • An extension of Sphinx 3’s recognizers
  • “Sphinx 3.X (X=6)” means “Sphinx 3.6”
  • Provide more functionalities such as
    • Real-time speech recognition
    • Speaker adaptation
    • Application programming interfaces (APIs) for developers
    • Different search algorithms
  • 3.X (X>3) is motivated by Project CALO and GALE
Development History of Sphinx 3.X

  • S3: Sphinx 3 flat-lexicon recognizer (s3 slow)
  • S3.2: Sphinx 3 tree-lexicon recognizer (s3 fast)
  • 3.X/3.0: merge
  • S3.3: live-mode demo
  • S3.4: fast GMM, class-based LM, dynamic LM
  • S3.5: some support for speaker adaptation, live-mode APIs
  • 3.6
    • Better search architecture/implementation
    • More support for speaker adaptation
    • Gentle re-factoring of the code base
    • Some support for FSG decoding and confidence
    • Better documentation/tutorial
    • lm_convert (lm3g2dmp)
    • dp

This talk – Progress of Sphinx 3.6
  • From the perspective of
    • a developer
    • an observer
  • Sphinx 3.6
    • Where are we now?
    • Where will we go?
  • Summary of 5 talks
    • http://www.cs.cmu.edu/~archan/sphinxPresentation.html
Motivation of Re-Architecting Sphinx 3.X
  • We are starting to need new search algorithms
    • Developing a new search algorithm carries risk.
    • We don’t want to throw away the old one.
    • Mere replacement could cause backward compatibility problems.
  • Code has grown to a stage where
    • Some changes could be very hard.
  • Multiple programmers become active at the same time
    • CVS conflicts could become frequent if things are controlled by “if-else” structures
Architecture of Sphinx 3.X (X<6)
  • Batch sequential Architecture (Shaw 96)
  • Each executable has customized sub-routines

[Diagram: the executables decode, livepretend, decode_anytopo, align and allphone, built from separate variants of the following stages]
  • Initialization 1 (kb and kbcore), Initialization 2, Initialization 3, Initialization 4
  • GMM Computation 1 (approx_cont_mgau), GMM Computation 2 (using gauden & senone, Method 1), GMM Computation 3 (using gauden & senone, Method 2), GMM Computation 4 (using gauden & senone, Method 3)
  • Search 1, Search 2, Search 3, Search 4
  • Process Controller 1, Process Controller 2, Process Controller 3, Process Controller 4
  • Command Line 1, Command Line 2, Command Line 3, Command Line 4

Architecture Diagram of Sphinx 3.6

[Diagram: four layers]
  • Applications
    • User Defined Applications
    • livedecode API
    • livepretend
    • dag
    • decode (anytopo)
    • decode
    • allphone
    • align
    • astar
  • Controllers/Abstractions
    • Search Controller
    • Process Controller
    • Search Initializer
    • Command Line Processor
  • Implementations
    • Fast Single Stream GMM Computation
    • Multi Stream GMM Computation
    • FSG Search
    • Flat Lexicon Search
    • Tree Lexicon Search
  • Libraries
    • Dictionary Library
    • Search Library
    • LM Library
    • AM Library
    • Utility Library
    • Feature Library
    • Miscellaneous Library

Separation of Mechanism and Implementation

  • Search Mechanism Module (srch.c)
    • A class provides Atomic Search Operations (ASOs) in the form of function pointers
    • Configured by just setting function pointers
    • A single interface for applications
  • Search Implementation Modules (srch_????.c)
    • Could have many of them
    • Possibilities:
      • A. Decoding with different implementations
      • B. Concept of search including
        • alignment
        • phoneme recognition
        • keyword spotting
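
The split between one mechanism and many implementations can be sketched as a table of callables that a single driver runs. This is only an illustration in Python, where callables stand in for C function pointers. `SearchOps`, `run_search` and both toy implementations are hypothetical names, not symbols from srch.c.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]


@dataclass
class SearchOps:
    """Hypothetical table of atomic search operations (ASOs)."""
    init: Callable[[], dict]
    search_frame: Callable[[dict, Frame], None]
    finish: Callable[[dict], str]


def run_search(ops: SearchOps, frames: List[Frame]) -> str:
    """Mechanism: the single entry point that applications call."""
    state = ops.init()
    for frame in frames:
        ops.search_frame(state, frame)
    return ops.finish(state)


# Toy implementation 1: counts the frames it was given.
def _count_frame(state: dict, frame: Frame) -> None:
    state["frames"] += 1


frame_counter = SearchOps(
    init=lambda: {"frames": 0},
    search_frame=_count_frame,
    finish=lambda state: f"{state['frames']} frames",
)


# Toy implementation 2: remembers which frame held the largest value.
def _peak_frame(state: dict, frame: Frame) -> None:
    if max(frame) > state["peak"]:
        state["peak"] = max(frame)
        state["index"] = state["seen"]
    state["seen"] += 1


peak_finder = SearchOps(
    init=lambda: {"seen": 0, "index": -1, "peak": float("-inf")},
    search_frame=_peak_frame,
    finish=lambda state: f"peak at frame {state['index']}",
)

frames = [[0.1, 0.3], [0.9, 0.2], [0.4, 0.4]]
print(run_search(frame_counter, frames))  # 3 frames
print(run_search(peak_finder, frames))    # peak at frame 1
```

Swapping the implementation only means passing a different table; the driver and the calling application stay unchanged.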

Search Mechanism Module – What does it do?
  • Computation of One Frame

    • Select active CD senones
    • Compute detailed GMM score (CD senone)
    • Compute detailed HMM score (CD)
    • Propagate graph (phone level)
    • Rescore at word end using high-level KS (e.g. LM)
    • Propagate graph (word level)
    • Compute approximate GMM score (CI senone)
  • The diagram groups these steps under “GMM Compute” and “Search For One Frame”

Search Implementations
  • Implemented (-op_mode)
    • Finite State Grammar Search (Mode 2)
    • Flat Lexicon Search (Mode 3)
    • Tree Search (Mode 4)
  • Not in 3.6
    • Aligner (Mode 0)
    • Phoneme recognition (Mode 1)
    • A new tree search (Mode 5)
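
As a rough illustration of how a mode number could select a search implementation, the sketch below uses a lookup table. The mode numbers and their status come from the list above. The table layout, the `select_search` helper and the returned labels are assumptions, not the decoder's actual handling of -op_mode.

```python
# Mode number -> (label, implemented in 3.6?), as listed on the slide.
OP_MODES = {
    0: ("aligner", False),
    1: ("phoneme recognition", False),
    2: ("finite state grammar search", True),
    3: ("flat lexicon search", True),
    4: ("tree search", True),
    5: ("new tree search", False),
}


def select_search(op_mode: int) -> str:
    """Return the label of the search for op_mode, or raise if unavailable."""
    if op_mode not in OP_MODES:
        raise ValueError(f"unknown op_mode {op_mode}")
    label, implemented = OP_MODES[op_mode]
    if not implemented:
        raise NotImplementedError(f"mode {op_mode} ({label}) is not in 3.6")
    return label


print(select_search(4))  # tree search
```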
Different ways to implement search implementations
  • 1, Use default implementation
    • Just specify all atomic search operations (ASOs) provided
  • 2, Override “search_one_frame”
    • Only need to specify GMM computation and how to “search_one_frame”
  • 3, Override the whole mechanism
    • For people who dislike the default so much
    • Override how to “search”
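
The three options can be pictured as three levels of override, from most specific to most general. The sketch below is a hypothetical Python rendering: `make_search` and its keyword names are invented for illustration and only mirror the slide's wording.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[float]


def make_search(*,
                asos: Optional[Sequence[Callable]] = None,
                gmm_compute: Optional[Callable] = None,
                search_one_frame: Optional[Callable] = None,
                search: Optional[Callable] = None) -> Callable:
    """Build a search function from whichever pieces the caller supplies."""
    # Way 3: the caller replaces the whole mechanism.
    if search is not None:
        return search

    if search_one_frame is not None:
        # Way 2: caller supplies GMM computation and the per-frame search.
        if gmm_compute is None:
            raise ValueError("way 2 needs gmm_compute as well")

        def per_frame(frame):
            return search_one_frame(gmm_compute(frame))
    elif asos:
        # Way 1: default per-frame search chains the supplied ASOs in order.
        def per_frame(frame):
            value = frame
            for op in asos:
                value = op(value)
            return value
    else:
        raise ValueError("supply asos, search_one_frame or search")

    def default_search(frames):
        return [per_frame(frame) for frame in frames]

    return default_search


frames = [[1.0, 2.0], [3.0, 4.0]]
way1 = make_search(asos=[sum, lambda s: s * 2])
way2 = make_search(gmm_compute=max, search_one_frame=lambda s: s + 1)
way3 = make_search(search=len)
print(way1(frames), way2(frames), way3(frames))  # [6.0, 14.0] [3.0, 5.0] 2
```

The toy callables (`sum`, `max`, `len`) only show which hook runs at each level; they are not meant to resemble real scoring.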
Consequence of Re-factoring
  • Calling decode
    • Could use flat-lexicon decoding as well
  • decode_anytopo still exists
    • For backward compatibility
    • decode_anytopo = decode
  • allphone, align, decode_anytopo could use fast GMM computation
  • decode could use S3’s SCHMM
  • Command-line is now synchronized
Summary on the Architecture
  • Sphinx 3.6
    • A gentle re-factoring has been carried out.
    • A more flexible architecture
    • A better playground for AM and search people
      • S2 SCHMM computation routine?
      • NN, SVM, ML techniques for AM?
Speed in Sphinx 3.6
  • Further work on Context-Independent Senone-based GMM Selection (CIGMMS)
    • 20-30% Speed Up
  • 3 tricks were proposed
    • Fixed amount of CD senone compute.
    • Use of best Gaussian index
    • Tightening factor of CI-phone beam
  • Published in “On Improvements of CI-based GMM Selection” (Chan 2005)
  • But not very well received
    • Admittedly, there is some accuracy loss
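
For readers unfamiliar with CI-based GMM selection, the sketch below shows the general idea under stated assumptions. Cheap context-independent (CI) senone scores decide which context-dependent (CD) senones receive a detailed score, and the rest back off to their CI parent's score. The `max_cd` cap loosely corresponds to the "fixed amount of CD senone compute" trick, and a smaller `beam` corresponds to tightening the CI-phone beam. The best-Gaussian-index trick is not shown. All names are hypothetical, and this is not the Sphinx 3.6 code.

```python
from typing import Callable, Dict, Optional


def cigmms_scores(ci_scores: Dict[str, float],
                  cd_parent: Dict[str, str],
                  detailed_score: Callable[[str], float],
                  beam: float,
                  max_cd: Optional[int] = None) -> Dict[str, float]:
    """Score CD senones, computing in detail only those whose CI parent
    lies within `beam` (log domain) of the best CI score."""
    best = max(ci_scores.values())
    # Visit CD senones from the most promising CI parent downwards.
    order = sorted(cd_parent,
                   key=lambda cd: ci_scores[cd_parent[cd]],
                   reverse=True)
    scores: Dict[str, float] = {}
    computed = 0
    for cd in order:
        parent_score = ci_scores[cd_parent[cd]]
        within_beam = parent_score >= best - beam
        under_cap = max_cd is None or computed < max_cd
        if within_beam and under_cap:
            scores[cd] = detailed_score(cd)   # expensive path
            computed += 1
        else:
            scores[cd] = parent_score         # back off to the CI score
    return scores


ci = {"AA": -1.0, "B": -2.0, "K": -10.0}
parents = {"AA_1": "AA", "AA_2": "AA", "B_1": "B", "K_1": "K"}
calls = []


def toy_detailed(cd: str) -> float:
    calls.append(cd)
    return -0.5


print(cigmms_scores(ci, parents, toy_detailed, beam=3.0))
# {'AA_1': -0.5, 'AA_2': -0.5, 'B_1': -0.5, 'K_1': -10.0}
```

With `max_cd=2`, only the two `AA` senones would be computed in detail and `B_1` would fall back to -2.0. This illustrates the speed/accuracy trade-off the slide mentions.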
A note on Sphinx 3.6 Speed Performance
  • Sphinx 3.X works under 1xRT in most tasks. E.g.
    • Smartnote/Sphinx Integration
    • Broadcast News, untuned result: 1.5xRT
  • Sphinx 3.X is still slower than Sphinx 2
    • Fast setup of Sphinx 2: use 256 codeword SCHMM
    • Fast setup of Sphinx 3: use 2000-6000 senone FCHMM
      • Historical notes: Comparable SCHMM setup has 4096 codewords
    • Need benchmarking to truly judge
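
The "xRT" figures are real-time factors: decoding time divided by audio duration, so values below 1 mean faster than real time. A minimal helper, with a hypothetical function name and made-up timings, shows the arithmetic:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the xRT figure)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Made-up example: 90 s of decoding for 60 s of audio.
print(real_time_factor(90.0, 60.0))  # 1.5
```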
Speed - Conclusion
  • Sphinx 3.X speed is at a reasonable level
    • Sphinx 2 should still be used in speed-critical conditions
  • Further work
    • GALE/CALO will still be around in 3.6/3.7
      • Accuracy becomes a stronger motivation than speed
Our Immediate Problem
  • What helps us more in accuracy?
    • Acoustic modeling ?
    • Speaker Adaptation ?
    • Search Improvement ?
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation
  • Speaker adaptation techniques are shown to be crucial
  • Even in a tough task (e.g. CALO)
    • 10-15% relative improvement
    • Gain similar to LM/AM modeling work
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation (cont.)
  • Dave has done a great job on
    • Multiple-class MLLR
    • MAP adaptation
  • Things to watch
    • Ziad’s VTLN implementation
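
As background for the two techniques named above, the sketch below writes out the textbook mean updates. MLLR applies an affine transform to each Gaussian mean, with one transform per regression class in the multiple-class case. MAP interpolates a prior mean with adaptation data weighted by occupation counts. Function names, the regression-class labels and the prior weight `tau` are illustrative assumptions. This is not the Sphinx 3.6 implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mllr_adapt_mean(mean: Vector, A: Matrix, b: Vector) -> Vector:
    """MLLR mean update: mu' = A * mu + b."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]


def mllr_adapt_multiclass(means: Dict[str, Vector],
                          class_of: Dict[str, str],
                          transforms: Dict[str, Tuple[Matrix, Vector]]
                          ) -> Dict[str, Vector]:
    """Apply to each Gaussian the transform of its regression class."""
    return {g: mllr_adapt_mean(mu, *transforms[class_of[g]])
            for g, mu in means.items()}


def map_adapt_mean(prior_mean: Vector,
                   frames: Sequence[Vector],
                   posteriors: Sequence[float],
                   tau: float) -> Vector:
    """MAP mean update:
    (tau * mu_prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)."""
    occupancy = sum(posteriors)
    return [
        (tau * prior_mean[d]
         + sum(g * x[d] for g, x in zip(posteriors, frames)))
        / (tau + occupancy)
        for d in range(len(prior_mean))
    ]


print(mllr_adapt_mean([1.0, 2.0], [[2.0, 0.0], [0.0, 1.0]], [0.5, -1.0]))
# [2.5, 1.0]
print(map_adapt_mean([0.0], [[2.0], [4.0]], [1.0, 1.0], tau=2.0))
# [1.5]
```

With little adaptation data, the MAP estimate stays close to the prior mean. As the occupation count grows, it moves toward the data average.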
Conclusion in Speaker Adaptation
  • Observation in 3.6
    • Speaker adaptation is very important.
    • What we still need:
      • Maximum likelihood linear transformation (MLLT)
      • Combination of MLLT, MLLR, MAP and VTLN
        • Proved to be additive
Accuracy Improvement of Sphinx 3.6 - Search
  • Our Attempts in Flat Lexicon Decoder
    • Full triphones
      • 2.5% rel. gain
      • But 100xRT
    • Full trigram
      • Will give another 5-10 times slowdown
  • Diff between Tree vs Flat Lex. Decoder
    • 5% relative
  • Conclusion:
    • Further improvement in search is limited
Accuracy Improvement in Sphinx 3.6 - Modeling
  • Mainly
    • on addition of data (Major contributor)
    • interpolation of LM (very decent gain)
  • Things to watch: Yi’s LDA
  • Yet to explore
    • Speaker Adaptive Training (SAT)
    • Semi-tied Covariance (STC) Matrix
  • Conclusion:
    • Commodity techniques are still not widely used in Sphinx (Bad sign).
Conclusion of Accuracy Improvement 3.6
  • 3.6 has a healthy development in speaker adaptation
  • Improvement in search is hard
  • Need 10x effort on acoustic modeling
    • Commodity techniques are still not there
    • Three final keywords: MLLT, SAT, STC
  • Priorities:
    • Adaptation > AM, LM > 2nd-stage search >> 1st-stage search

FSG search
  • 3.6 supports FSG search
    • Adapted from Sphinx 2’s implementation
  • Current Issues
    • No lextree implementation
    • Static allocation of all HMMs; not allocated “on demand”
    • FSG transitions represented by NxN matrix
  • Other wish list
    • No histogram pruning
    • No state-based implementation
  • Need more testing
Confidence Annotation
  • conf
  • Adapted from Rong with permission
    • Compute Word Posterior Probability of a word given lattice
  • Still a work in progress
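
A word posterior on a lattice is commonly obtained with a forward-backward pass: the score mass of all paths through an edge, divided by the mass of all complete paths. The sketch below is a generic version that makes two assumptions. Nodes are numbered in topological order, and edge scores are plain (not log) probabilities. It is not the conf program contributed by Rong, and all names are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Edge = Tuple[int, int, str, float]  # (from_node, to_node, word, score)


def edge_posteriors(edges: List[Edge], start: int, final: int
                    ) -> List[Tuple[str, float]]:
    """Return (word, posterior) for every edge of a lattice whose node
    numbers increase along every edge."""
    # Forward pass: alpha[n] = total score of partial paths start -> n.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for u, v, _, p in sorted(edges, key=lambda e: e[0]):
        alpha[v] += alpha[u] * p

    # Backward pass: beta[n] = total score of partial paths n -> final.
    beta = defaultdict(float)
    beta[final] = 1.0
    for u, v, _, p in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[u] += p * beta[v]

    total = alpha[final]
    if total == 0.0:
        raise ValueError("no path from start to final")
    return [(w, alpha[u] * p * beta[v] / total) for u, v, w, p in edges]


lattice = [
    (0, 1, "a", 0.6),
    (0, 1, "b", 0.2),
    (1, 2, "c", 0.5),
    (0, 2, "d", 0.1),
]
for word, posterior in edge_posteriors(lattice, start=0, final=2):
    print(word, round(posterior, 3))
```

In this toy lattice, the three edges leaving the start node ("a", "b", "d") have posteriors that sum to 1, because every complete path uses exactly one of them.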
Language Model Related
  • Now fully supports
    • Text-based LM reading
    • Inter-conversion of LM in TXT & DMP format
      • lm_convert = lm3g2dmp++
    • LM switching API in live_decode_API
Hieroglyphs
  • A collection of documentation on using Sphinx 3, SphinxTrain and the CMU LM Toolkit
  • 1st draft is completed
    • All chapters are filled with information.
    • Writing the 2nd Draft
  • “Chief Editor”: Arthur Chan
  • Does it even exist?
Hieroglyphs: An Outline
  • Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
  • Chapter 2: Introduction to Sphinx
  • Chapter 3: Introduction to Speech Recognition
  • Chapter 4: Recipe of Building Speech Application using Sphinx
  • Chapter 5: Different Software Toolkits of Sphinx
  • Chapter 6: Acoustic Model Training
  • Chapter 7: Language Model Training
  • Chapter 8: Search Structure and Speed-up of the Speech recognizer
  • Chapter 9: Speaker Adaptation
  • Chapter 10: Research using Sphinx
  • Chapter 11: Development using Sphinx
  • Appendix A: Command Line Information
  • Appendix B: FAQ
Book Reviews of Hieroglyphs
  • “You wrote the worst preface I have ever seen in my life.” Dr. Evandro Gouvea
  • “The content is o.k., but the writing is still ……” Prof. Alex I. Rudnicky
  • “Wow, it is thick. And, oh…… there are no blank spaces! You are not supposed to add contents in any CMU open source manuals, don’t you know?” Dr. Alan W. Black
Other Documents
  • Robust Tutorial (Aka Sphinx 101)
    • Thanks to Evandro
    • Now could be used for
      • archive_s3
      • Sphinx 2
      • Sphinx 3
    • http://www.cs.cmu.edu/~robust/Tutorial/
  • Doxygen documentation for Sphinx 3.x is fully available
    • http://www.speech.cs.cmu.edu/sphinx/sphinx3/doxygen/html/
What is important?
  • Keep the current design priorities:
    • 1, Accuracy
      • We are just OK and we badly need to improve it.
    • 2, Speed
      • We are OK and it doesn’t hurt to improve it
    • 3, Functionalities
      • Still a pain to use Sphinx 3, but it is constantly improving
      • Usability eventually implies distributing models.
  • Accuracy should take priority over speed
    • No excuse in 3.7
Roadmap: In X=7……
  • For GALE/CALO
    • Speaker Clustering/SAT
      • Bridging SI and SA
    • VTLN
    • LDA
  • 0.5 x CALO may need further speed improvement
    • BBI
    • More secret ideas in GMM computation
Roadmap (cont.)
  • X=8
    • D.T.
      • MMIE, MCE
    • STC
    • Interface with HTK model
  • X=9
    • D.T. + S.A.
  • X>10
    • Time to fire Arthur Chan and hire an assistant professor
Other Possibilities of Sphinx?

[You fill in this part]

We need your help!
  • Project Manager: Enable Development of Sphinx
    • Translation: Kick/fix people and be kicked/fixed by Evandro
  • Developers: Incorporate state-of-art speech technology into Sphinx
    • Translation: Fix 1 bug and Generate 5 more
  • Maintainer: Ensure integrity of Sphinx code and resource
    • Translation: You become the so-called “Grand Janitor of Sphinx”.
  • Tester: Enable test-based development in Sphinx
    • Translation: You will learn a lot of Zen-Buddhism.
Our Current Motto (Subject to Change)

“Don’t ever underestimate yourself…… You never know what kind of mess you could make.”

-Dr. Evandro Gouvea

Conclusion for Sphinx 3.X
  • We have done something
  • We are making some sense in the system development now
  • We have healthy growth in accuracy
    • But we still need more
Thank you
  • Acknowledgement
    • Rich/Alan: for your constant encouragement
    • Alex: for your understanding of Yin/Yang
    • Rong: for contributing the confidence estimation program
    • Bano: for reminding me I could die at any time when we were in Lake Arthur ->
      • Hieroglyphs 1st draft’s progress sped up.
    • Sphinx developers: without you, I wouldn’t be the “Grand Janitor”.
    • Sphinx users: for your capabilities of giving me nightmares
Postscript, a word from my friend

“Don’t ever underestimate yourself…… You never know what a mess you could make.”

–Dr. Evandro Gouvea

Pros/Cons of Batch Sequential Architecture
  • Pros:
    • Great flexibility for individual programmers
    • No assumptions; data structures are usually optimized for the application.
      • Align and allphone have their own optimizations.
    • Crafting in each individual application is of high quality
  • Cons:
    • Great difficulty in maintenance
      • Most changes need to be carried out 5-6 times.
    • Spreads the disease of code duplication
      • Code with the same functionality was duplicated multiple times