CMU Sphinx Speech Recognition Engine

Presentation Transcript

  1. CMU Sphinx Speech Recognition Engine • Reporter: Chun-Feng Liao, NCCU Dept. of Computer Science, Intelligent Media Lab

  2. Purposes of this project • Find out how an efficient speech recognition engine can be implemented. • Examine the source code of Sphinx2 to find out the role and function of each component. • Read key chapters of Dr. Mosur K. Ravishankar’s thesis as a reference. • Some demo programs will be given during the oral presentation.

  3. Presentation Agenda • Project Summary / Agenda / Goal. (In English) • Introduction. • Basics of Speech Recognition. • Architecture of CMU Sphinx. • Acoustic Model and HMM. • Language Model. • Java™ Platform Issues. • Demo. • Conclusion.

  4. Voice Technologies • In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR). • The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

  5. Basics of Speech Recognition

  6. Speech Recognition • Capturing speech (analog) signals. • Digitizing the sound waves and converting them to basic language units, or phonemes. • Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as "write" and "right").

  7. Speech Recognition Process Flow (figure) • Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)

  8. Recognition Process Flow Summary • Step 1: User Input • The system captures the user's voice as an analog acoustic signal. • Step 2: Digitization • The analog acoustic signal is digitized. • Step 3: Phonetic Breakdown • The signal is broken into phonemes.
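
The digitization step (Step 2) can be illustrated with a minimal sketch. It is not taken from Sphinx: the function name `digitize`, the 16 kHz sample rate, and the 16-bit sample width are assumptions chosen for illustration. It samples a continuous signal at fixed intervals and quantizes each sample to a signed integer.

```python
import math

def digitize(signal, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a continuous signal(t) and quantize each sample to a signed integer.

    sample_rate_hz and bits are illustrative defaults, not values from the slides.
    """
    max_level = 2 ** (bits - 1) - 1                  # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                       # time of the n-th sample, in seconds
        amplitude = max(-1.0, min(1.0, signal(t)))   # clip to the range [-1, 1]
        samples.append(round(amplitude * max_level)) # quantize to an integer level
    return samples

def tone(t):
    """A 440 Hz sine wave standing in for the analog microphone signal."""
    return math.sin(2 * math.pi * 440 * t)

pcm = digitize(tone, duration_s=0.01)
```

Here `pcm` holds one integer per sample; a higher sample rate or bit depth trades storage for fidelity.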

  9. Recognition Process Flow Summary (2) • Step 4: Statistical Modeling • Phonemes are mapped to their phonetic representation using a statistical model. • Step 5: Matching • Using the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., candidate words, each with a confidence score). • Grammar: the set of words or phrases that constrains the range of input or output in the voice application. • Dictionary: the table that maps phonetic representations to words (e.g., thu, thee → the).
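
The matching step (Step 5) can be sketched as follows. This is an illustrative toy, not Sphinx code: the dictionary entries other than "thu" and "thee", the grammar contents, the scores, and the function name `n_best` are all hypothetical.

```python
# Hypothetical pronunciation dictionary: phonetic representation -> word.
DICTIONARY = {
    "thu": "the",
    "thee": "the",
    "rayt": "right",
}

# Hypothetical grammar: the words the application is willing to accept.
GRAMMAR = {"the", "right"}

def n_best(scored_pronunciations, n=2):
    """Return up to n (word, confidence) pairs, best first.

    Pronunciations missing from the dictionary, or mapping to words outside
    the grammar, are dropped. Only the best score per word is kept.
    """
    best = {}
    for phones, score in scored_pronunciations:
        word = DICTIONARY.get(phones)
        if word is None or word not in GRAMMAR:
            continue
        best[word] = max(score, best.get(word, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up confidence scores standing in for the statistical model's output.
hypotheses = [("thu", 0.6), ("thee", 0.8), ("rayt", 0.3), ("zzz", 0.9)]
result = n_best(hypotheses)
```

With these inputs both pronunciations of "the" collapse into one entry, and the unknown pronunciation "zzz" is discarded despite its high score.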

  10. Architecture of CMU Sphinx.

  11. Introduction to CMU Sphinx • A speech recognition system developed at Carnegie Mellon University. • Consists of a set of libraries • core speech recognition functions • low-level audio capture • Continuous speech decoding • Speaker-independent

  12. Brief History of CMU Sphinx • Sphinx-I (1987) • The world's first speaker-independent, high-performance ASR system. • Written in C by Kai-Fu Lee (Dr. Kai-Fu Lee, currently chief technology advisor / vice president at Microsoft Asia). • Sphinx-II (1992) • Written by Xuedong Huang in C. (Dr. Xuedong Huang, currently leader of the Microsoft Speech.NET team.) • 5-state HMM / N-gram LM. • (We can infer that the core technology of CMU Sphinx strongly influenced the Microsoft Speech SDK.)

  13. Brief History of CMU Sphinx (2) • Sphinx 3 (1996) • Built by Eric Thayer and Mosur Ravishankar. • Slower than Sphinx-II but the design is more flexible. • Sphinx 4 (Originally Sphinx 3j) • Refactored from Sphinx 3. • Fully implemented in Java. • Not finished yet.

  14. Components of CMU Sphinx

  15. Front End • libsphinx2fe.lib / libsphinx2ad.lib • Low-level audio access • Continuous Listening and Silence Filtering • Front End API overview.
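
Silence filtering can be sketched with a simple energy threshold. This is an assumption about one common approach, not the algorithm used by libsphinx2ad; the function names, frame length, and threshold are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def drop_silence(samples, frame_len=4, threshold=100.0):
    """Split samples into fixed-size frames and keep only the loud ones.

    frame_len and threshold are illustrative values; a real front end would
    use much longer frames and an adaptive threshold.
    """
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_energy(frame) >= threshold:
            kept.append(frame)
    return kept

# Quiet frame, loud frame, quiet frame.
audio = [0, 1, -1, 0, 50, -60, 40, -30, 1, 0, 0, -1]
speech = drop_silence(audio)
```

Only the middle frame passes the threshold, so the decoder would never see the near-silent frames on either side.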

  16. Knowledge Base • The data that drives the decoder. • Three sets of data • Acoustic Model. • Language Model. • Lexicon (Dictionary).

  17. Acoustic Model • /model/hmm/6k • Database of statistical models. • Each statistical model represents a phoneme. • Acoustic models are trained by analyzing large amounts of speech data.

  18. HMM in Acoustic Model • HMMs represent each unit of speech in the Acoustic Model. • A typical HMM uses 3-5 states to model a phoneme. • Each state of an HMM is represented by a set of Gaussian mixture density functions. • Sphinx2 default phone set.
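
How a left-to-right HMM aligns frames to states can be sketched with the Viterbi algorithm. This is a generic textbook version, not Sphinx2's implementation: the 3-state topology, the transition values, and the per-frame emission likelihoods are made up, and probabilities are multiplied directly, whereas real decoders typically work with log probabilities to avoid underflow.

```python
def viterbi(emission, transition, start):
    """Find the most likely state sequence.

    emission[t][s] is the likelihood of frame t in state s,
    transition[p][s] is the probability of moving from state p to state s,
    start[s] is the probability of starting in state s.
    Returns (best_probability, best_path).
    """
    n_states = len(start)
    prob = [start[s] * emission[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(emission)):
        new_prob, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            prev = max(range(n_states), key=lambda p: prob[p] * transition[p][s])
            new_prob.append(prob[prev] * transition[prev][s] * emission[t][s])
            pointers.append(prev)
        prob = new_prob
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: prob[s])
    best_prob = prob[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best_prob, path

# Hypothetical 3-state left-to-right phoneme model.
transition = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]]
start = [1.0, 0.0, 0.0]
# Made-up likelihoods for four frames.
emission = [[0.9, 0.1, 0.1],
            [0.8, 0.2, 0.1],
            [0.1, 0.9, 0.1],
            [0.1, 0.2, 0.9]]
best_prob, path = viterbi(emission, transition, start)
```

In this toy example the first two frames align to state 0, the third to state 1, and the fourth to state 2.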

  19. Gaussian Mixtures • Refer to the textbook, p. 33, eq. 38. • Represent each state in an HMM. • Each set of Gaussian mixtures is called a “senone”. • HMMs can share senones.
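
The likelihood that a Gaussian mixture assigns to a feature vector can be sketched as a weighted sum of component densities. This is the standard mixture formula with diagonal covariances, which is an assumption here and is not meant to reproduce the textbook equation cited above; the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-((xi - mi) ** 2) / (2 * vi)) / math.sqrt(2 * math.pi * vi)
    return density

def mixture_likelihood(x, weights, means, variances):
    """Weighted sum of component densities; weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two-component mixture over 1-dimensional features (made-up parameters).
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
likelihood_near = mixture_likelihood([0.0], weights, means, variances)
likelihood_far = mixture_likelihood([10.0], weights, means, variances)
```

A feature vector near one of the component means scores far higher than one that is distant from both, which is what lets the decoder rank states per frame.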

  20. Language Model • Describes what is likely to be spoken in a particular context • Word transitions are defined in terms of transition probabilities • Helps to constrain the search space • See examples of LM.

  21. N-gram Language Model • The probability of word N depends on words N-1, N-2, ... • Bigrams and trigrams are most commonly used. • Used for large-vocabulary applications such as dictation. • Typically trained on a very large corpus (millions of words).
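
A bigram model can be sketched by counting word pairs in a corpus. This is a minimal maximum-likelihood estimate with no smoothing or backoff, which practical language models add; the three-sentence corpus and the function names are made up for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count contexts and word pairs, with <s> and </s> marking sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

corpus = ["write the letter", "right the wrong", "write the code"]
contexts, bigrams = train_bigrams(corpus)
p_the_after_write = bigram_prob("write", "the", contexts, bigrams)
p_letter_after_the = bigram_prob("the", "letter", contexts, bigrams)
```

In this toy corpus "the" always follows "write", while "letter" follows "the" in one of three cases, so the model prefers some word sequences over others and narrows the search.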

  22. Decoder • Selects the next set of likely states. • Scores incoming features against these states. • Drops low-scoring states. • Generates results.
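
The first three decoder steps can be sketched as one iteration of a beam search. This is a generic illustration, not the Sphinx decoder: the state names, log scores, beam width, and the function name `decode_step` are hypothetical.

```python
import math

def decode_step(active, transitions, frame_scores, beam=5.0):
    """Advance the search by one frame.

    active: {state: log score so far}
    transitions: {state: [(next_state, log transition probability), ...]}
    frame_scores: {state: log likelihood of the current frame in that state}
    Returns the next active set, pruned to within `beam` of the best score.
    """
    candidates = {}
    for state, score in active.items():
        for nxt, log_p in transitions.get(state, []):
            total = score + log_p + frame_scores[nxt]  # score the frame against nxt
            if total > candidates.get(nxt, -math.inf):
                candidates[nxt] = total                # keep the best way into nxt
    if not candidates:
        return {}
    best = max(candidates.values())
    # Drop states that fall too far below the best hypothesis.
    return {s: v for s, v in candidates.items() if v >= best - beam}

active = {"a": 0.0}
transitions = {"a": [("a", -1.0), ("b", -1.0), ("c", -1.0)]}
frame_scores = {"a": -2.0, "b": -1.0, "c": -9.0}
next_active = decode_step(active, transitions, frame_scores)
```

State "c" scores more than the beam width below the best candidate and is pruned; a narrower beam runs faster but risks discarding the correct hypothesis.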

  23. Speech in Java™ Platform

  24. Sun Java Speech API • First released on October 26, 1998. • The Java™ Speech API allows Java applications to incorporate speech technology into their user interfaces. • Defines a cross-platform API to support command and control recognizers, dictation systems and speech synthesizers.

  25. Implementations of Java Speech API • Open Source • FreeTTS / CMU Sphinx4. • IBM Speech for Java. • Cloud Garden. • L&H TTS for Java Speech API. • Conversa Web 3.0.

  26. FreeTTS • Fully implemented in Java. • Based upon Flite 1.1: a small run-time speech synthesis engine developed at CMU. • Partial support for JSAPI 1.0. • Speech Recognition functions. • JSML.

  27. Sphinx 4 (Sphinx 3j) • Fully implemented in Java. • Speed is equal to or faster than Sphinx 3. • Acoustic model and language model are under construction. • Source code is available via CVS (but you cannot run any applications without models!). For example, to check out Sphinx 4, you can use the following command: cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4

  28. Java™ Platform Issues • GC makes managing data much easier. • Native engines typically optimize inner loops for the CPU – can't do that on the Java platform. • Native engines arrange data to optimize cache hits – can't really do that either.

  29. DEMO • Sphinx-II batch mode. • Sphinx-II live mode. • Sphinx-II client/server mode. • A simple FreeTTS application. • (Java-based) TTS vs. (C-based) SR. • Motion Planner with FreeTTS, using Java Web Start™. (This is the GRA course final project.)

  30. Summary • Sphinx is an open source speech recognition system developed at CMU. • The Front End (FE), Knowledge Base (KB), and Decoder form the core of the SR system. • The FE receives and processes the speech signal. • The Knowledge Base provides data for the Decoder. • The Decoder searches the states and returns the results. • Speech recognition is a challenging problem for the Java platform.

  31. Reference • Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996. • Mosur K. Ravishankar, Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001. • Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.

  32. Reference (on-line) • On-line documents of Java™ Speech API • http://java.sun.com/products/java-media/speech/ • On-line documents of FreeTTS • http://freetts.sourceforge.net/docs/ • On-line documents of Sphinx-II • http://www.speech.cs.cmu.edu/sphinx/

  33. Q & A