CMU Sphinx Speech Recognition Engine Reporter: Chun-Feng Liao NCCU Dept. of Computer Science Intelligent Media Lab
Purposes of this project • Find out how an efficient speech recognition engine can be implemented. • Examine the source code of Sphinx2 to find out the role and function of each component. • Read key chapters of Dr. Mosur K. Ravishankar’s thesis as a reference. • Give some demo programs during the oral presentation.
Presentation Agenda • Project Summary / Agenda / Goal. (In English) • Introduction. • Basics of Speech Recognition. • Architecture of CMU Sphinx. • Acoustic Model and HMM. • Language Model. • Java™ Platform Issues. • Demo. • Conclusion.
Voice Technologies • In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR). • The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
Speech Recognition • Capturing speech (analog) signals. • Digitizing the sound waves and converting them to basic language units, or phonemes. • Constructing words from phonemes, and contextually analyzing the words to choose the correct spelling among words that sound alike (such as write and right).
Speech Recognition Process Flow (figure) Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Recognition Process Flow Summary • Step 1: User Input • The system captures the user’s voice as an analog acoustic signal. • Step 2: Digitization • The analog acoustic signal is digitized. • Step 3: Phonetic Breakdown • The signal is broken into phonemes.
Recognition Process Flow Summary (2) • Step 4: Statistical Modeling • Phonemes are mapped to their phonetic representation using a statistical model. • Step 5: Matching • Using the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., candidate words, each with a confidence score). • Grammar: the set of words or phrases that constrains the range of input or output in the voice application. • Dictionary: the table that maps phonetic representations to words (e.g., thu, thee → the).
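The matching step (Step 5) can be sketched as follows. This is a minimal illustration, not Sphinx code: the dictionary entries, the scores, and the names `DICTIONARY` and `n_best` are all assumptions made up for the example.

```python
from typing import Dict, List, Set, Tuple

# Toy dictionary: phonetic representation -> words it can spell.
# Entries are illustrative assumptions, not taken from the Sphinx lexicon.
DICTIONARY: Dict[str, List[str]] = {
    "thu": ["the"],
    "thee": ["the"],
    "rayt": ["write", "right"],
}


def n_best(
    scored_phonetics: List[Tuple[str, float]],
    grammar: Set[str],
    n: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to n (word, confidence) pairs allowed by the grammar."""
    best: Dict[str, float] = {}
    for phonetic, score in scored_phonetics:
        for word in DICTIONARY.get(phonetic, []):
            if word in grammar:  # the grammar constrains the output
                best[word] = max(score, best.get(word, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical scores from the statistical modeling step.
hypotheses = [("thu", 0.6), ("thee", 0.3), ("rayt", 0.5)]
grammar = {"the", "write"}
result = n_best(hypotheses, grammar)
print(result)
```

Here "right" is dropped because the grammar does not allow it, which is how the grammar narrows the range of possible outputs.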
Introduction to CMU Sphinx • A speech recognition system developed at Carnegie Mellon University. • Consists of a set of libraries • core speech recognition functions • low-level audio capture • Continuous speech decoding • Speaker-independent
Brief History of CMU Sphinx • Sphinx-I (1987) • The world's first speaker-independent, high-performance ASR system. • Written in C by Dr. Kai-Fu Lee (currently Chief Technical Advisor / Vice President at Microsoft Asia). • Sphinx-II (1992) • Written in C by Dr. Xuedong Huang (currently leader of the Microsoft Speech.NET team). • 5-state HMM / N-gram LM. • (We can infer that the core technology of CMU Sphinx had a strong influence on the Microsoft Speech SDK.)
Brief History of CMU Sphinx (2) • Sphinx 3 (1996) • Built by Eric Thayer and Mosur Ravishankar. • Slower than Sphinx-II but the design is more flexible. • Sphinx 4 (Originally Sphinx 3j) • Refactored from Sphinx 3. • Fully implemented in Java. • Not finished yet.
Front End • libsphinx2fe.lib / libsphinx2ad.lib • Low-level audio access • Continuous Listening and Silence Filtering • Front End API overview.
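Silence filtering can be illustrated with a simple energy threshold. This is only a sketch of the general idea; it is not the algorithm used by the Sphinx2 front end, and the function names, frame size, and threshold are assumptions chosen for the example.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def drop_silence(
    samples: Sequence[int],
    frame_size: int = 4,
    threshold: float = 100.0,
) -> List[List[int]]:
    """Split samples into fixed-size frames and keep only the loud ones."""
    kept: List[List[int]] = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if frame_energy(frame) >= threshold:
            kept.append(list(frame))
    return kept


# Three frames: quiet, loud, quiet (made-up sample values).
audio = [0, 1, -1, 0, 50, -60, 55, -40, 1, 0, 0, -1]
speech_frames = drop_silence(audio)
print(speech_frames)
```

Only the middle frame passes the threshold, so only that frame would be handed on to the decoder.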
Knowledge Base • The data that drives the decoder. • Three sets of data • Acoustic Model. • Language Model. • Lexicon (Dictionary).
Acoustic Model • /model/hmm/6k • Database of statistical models. • Each statistical model represents a phoneme. • Acoustic models are trained by analyzing large amounts of speech data.
HMM in Acoustic Model • HMMs represent each unit of speech in the Acoustic Model. • A typical HMM uses 3-5 states to model a phoneme. • Each HMM state is represented by a set of Gaussian mixture density functions. • Sphinx2 default phone set.
Gaussian Mixtures • Refer to textbook p. 33, eq. 38. • Represent each state in an HMM. • Each set of Gaussian mixtures is called a “senone”. • HMMs can share senones.
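A minimal sketch of how a Gaussian mixture scores one feature value for an HMM state, and how two states can share one senone. It assumes one-dimensional features for readability; the phone labels, parameters, and names (`SENONES`, `STATE_TO_SENONE`, `state_likelihood`) are hypothetical and not taken from the Sphinx2 models.

```python
import math
from typing import Dict, List, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_likelihood(
    x: float, weights: List[float], means: List[float], variances: List[float]
) -> float:
    """Weighted sum of Gaussian densities (the mixture density)."""
    return sum(
        w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)
    )


# Each senone is one set of mixture parameters (illustrative values).
SENONES: Dict[int, Dict[str, List[float]]] = {
    0: {"weights": [0.5, 0.5], "means": [0.0, 4.0], "variances": [1.0, 1.0]},
    1: {"weights": [1.0], "means": [8.0], "variances": [2.0]},
}

# (phone, state index) -> senone id. Two different HMMs share senone 0.
STATE_TO_SENONE: Dict[Tuple[str, int], int] = {
    ("AH", 0): 0,
    ("AE", 0): 0,
    ("AH", 1): 1,
}


def state_likelihood(phone: str, state: int, x: float) -> float:
    """Likelihood of feature x under the senone tied to this HMM state."""
    return mixture_likelihood(x, **SENONES[STATE_TO_SENONE[(phone, state)]])


near = state_likelihood("AH", 0, 0.0)
far = state_likelihood("AH", 0, 10.0)
print(near > far)
```

Because ("AH", 0) and ("AE", 0) point at the same senone id, their parameters are stored and evaluated once, which is the point of sharing.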
Language Model • Describes what is likely to be spoken in a particular context • Word transitions are defined in terms of transition probabilities • Helps to constrain the search space • See examples of LM.
N-gram Language Model • Probability of word N depends on words N-1, N-2, ... • Bigrams and trigrams are most commonly used • Used for large-vocabulary applications such as dictation • Typically trained on a very large corpus (millions of words)
Decoder • Selects next set of likely states • Scores incoming features against these states • Drops low-scoring states • Generates results
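As a small worked example, a bigram model can be estimated from counts as P(word | prev) = count(prev, word) / count(prev). The sketch below uses a three-sentence toy corpus and no smoothing, both assumptions for illustration; the function names are hypothetical and a real model would be trained on far more text.

```python
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[str]) -> Tuple[Counter, Counter]:
    """Count history words and word pairs, with sentence boundary markers."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])               # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams


def bigram_prob(prev: str, word: str, unigrams: Counter, bigrams: Counter) -> float:
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen history."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


corpus = ["write a letter", "write a note", "right a wrong"]
unigrams, bigrams = train_bigram(corpus)
p = bigram_prob("a", "letter", unigrams, bigrams)
print(p)
```

The word "a" is followed by "letter" in one of its three occurrences, so the estimate is 1/3; transitions that never occur get probability 0, which is how the model constrains the search space.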
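The four decoder steps above can be sketched as a beam search over HMM states. This is a generic Viterbi-style illustration under made-up parameters, not the Sphinx decoder; `beam_decode`, `prune`, the two states, and the toy emission score are all assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Active = Dict[str, Tuple[float, List[str]]]  # state -> (log score, path)


def prune(active: Active, beam: float) -> Active:
    """Drop states scoring more than `beam` below the current best."""
    if not active:
        return active
    best = max(score for score, _ in active.values())
    return {s: v for s, v in active.items() if v[0] >= best - beam}


def beam_decode(
    observations: List[float],
    start_logp: Dict[str, float],
    trans_logp: Dict[str, Dict[str, float]],
    emit_logp: Callable[[str, float], float],
    beam: float = 5.0,
) -> List[str]:
    """Return the best-scoring state path that survives pruning."""
    if not observations:
        return []
    active: Active = {
        s: (lp + emit_logp(s, observations[0]), [s]) for s, lp in start_logp.items()
    }
    active = prune(active, beam)
    for obs in observations[1:]:
        candidates: Active = {}
        for prev, (score, path) in active.items():
            # Select the next set of likely states and score the new feature.
            for nxt, lp in trans_logp.get(prev, {}).items():
                new_score = score + lp + emit_logp(nxt, obs)
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        active = prune(candidates, beam)  # drop low-scoring states
    if not active:
        return []
    best_state = max(active, key=lambda s: active[s][0])
    return active[best_state][1]  # generate the result


# Toy model: state A expects values near 0, state B values near 5.
MEANS = {"A": 0.0, "B": 5.0}


def toy_emit(state: str, obs: float) -> float:
    return -((obs - MEANS[state]) ** 2)


path = beam_decode(
    observations=[0.1, 0.2, 4.9, 5.1],
    start_logp={"A": 0.0},
    trans_logp={"A": {"A": math.log(0.5), "B": math.log(0.5)}, "B": {"B": 0.0}},
    emit_logp=toy_emit,
)
print(path)
```

A narrower beam prunes more states and runs faster, at the risk of discarding the path that would have won later.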
Sun Java Speech API • First released on October 26, 1998. • The Java™ Speech API allows Java applications to incorporate speech technology into their user interfaces. • Defines a cross-platform API to support command and control recognizers, dictation systems and speech synthesizers.
Implementations of Java Speech API • Open Source • FreeTTS / CMU Sphinx4. • IBM Speech for Java. • Cloud Garden. • L&H TTS for Java Speech API. • Conversa Web 3.0.
FreeTTS • Fully implemented in Java. • Based upon Flite 1.1: a small run-time speech synthesis engine developed at CMU. • Partial support for JSAPI 1.0. • Speech Recognition functions. • JSML.
Sphinx 4 (Sphinx 3j) • Fully implemented in Java. • Speed is equal to or faster than Sphinx 3. • The acoustic model and language model are under construction. • Source code is available via CVS (but you cannot run any applications without models!). For example, to check out Sphinx 4, you can use the following command: cvs -z3 -d:pserver:email@example.com:/cvsroot/cmusphinx co sphinx4
Java™ Platform Issues • GC makes managing data much easier. • Native engines typically optimize inner loops for the CPU; this can't be done on the Java platform. • Native engines arrange data to optimize cache hits; this can't really be done on the Java platform either.
DEMO • Sphinx-II batch mode. • Sphinx-II live mode. • Sphinx-II client/server mode. • A simple FreeTTS application. • (Java-based) TTS vs. (C-based) SR. • Motion Planner with FreeTTS, using Java Web Start™. (This is the GRA course final project.)
Summary • Sphinx is an open source speech recognition system developed at CMU. • The front end (FE), knowledge base (KB), and decoder form the core of the SR system. • The FE receives and processes the speech signal. • The knowledge base provides data for the decoder. • The decoder searches the states and returns the results. • Speech recognition is a challenging problem for the Java platform.
Reference • Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996. • Mosur K. Ravishankar, Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001. • Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.
Reference (on-line) • On-line documents of Java™ Speech API • http://java.sun.com/products/java-media/speech/ • On-line documents of Free TTS • http://freetts.sourceforge.net/docs/ • On-line documents of Sphinx-II • http://www.speech.cs.cmu.edu/sphinx/