CMU Sphinx Speech Recognition Engine

Reporter : Chun-Feng Liao

NCCU Dept. of Computer Science

Intelligent Media Lab


Purposes of this project

  • Find out how an efficient speech recognition engine can be implemented.

  • Examine the source code of Sphinx2 to find out the role and function of each component.

  • Read key chapters of Dr. Mosur K. Ravishankar's thesis as a reference.

  • Demo programs will be shown during the oral presentation.


Presentation Agenda

  • Project Summary/ Agenda/ Goal. (In English)

  • Introduction.

  • Basics of Speech Recognition.

  • Architecture of CMU Sphinx.

    • Acoustic Model and HMM.

    • Language Model.

  • Java™ Platform Issues.

  • Demo

  • Conclusion.


Voice Technologies

  • In the mid-to-late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR).

  • The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).



Speech Recognition

  • Capturing speech (analog) signals

  • Digitizing the sound waves and converting them to basic language units, or phonemes.

  • Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).


Speech Recognition Process Flow

Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)


Recognition Process Flow Summary

  • Step 1: User Input

    • The system captures the user's voice as an analog acoustic signal.

  • Step 2: Digitization

    • The analog acoustic signal is digitized.

  • Step 3: Phonetic Breakdown

    • The signal is broken into phonemes.
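
Step 2 can be illustrated with a minimal sketch of sampling and quantization. The 16 kHz sample rate, the 16-bit sample width, and the names `digitize` and `tone` are assumptions chosen for illustration; they are not taken from the slides or from the Sphinx source.

```python
import math


def digitize(signal, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a continuous signal and quantize each sample to a signed integer.

    `signal` is any function of time (seconds) returning an amplitude,
    nominally in [-1.0, 1.0]. The rate and bit depth are assumed defaults.
    """
    max_level = 2 ** (bits - 1) - 1               # 32767 for 16-bit samples
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                    # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip out-of-range input
        samples.append(round(amplitude * max_level))
    return samples


def tone(t):
    """Stand-in for the analog input: a 440 Hz sine at half amplitude."""
    return 0.5 * math.sin(2 * math.pi * 440.0 * t)


pcm = digitize(tone, duration_s=0.01)             # 10 ms of audio
```

Ten milliseconds at the assumed rate yields 160 integer samples, which is the kind of data the later steps operate on.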


Recognition Process Flow Summary (2)

  • Step 4: Statistical Modeling

    • Phonemes are mapped to their phonetic representations using a statistical model.

  • Step 5: Matching

    • Using the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., words, each with a confidence score).

    • Grammar: the set of words or phrases that constrains the range of input or output in the voice application.

    • Dictionary: the table that maps phonetic representations to words (e.g., "thu" and "thee" both map to "the").
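
The roles of the dictionary and the grammar in Step 5 can be sketched as a lookup followed by a filter. The phone symbols, the table contents, and the function names below are hypothetical; a real engine would also attach a confidence score to each candidate, which this sketch omits.

```python
# Hypothetical dictionary: each word lists one or more pronunciations.
PRONUNCIATIONS = {
    "the": [("DH", "AH"), ("DH", "IY")],   # "thu" and "thee"
    "right": [("R", "AY", "T")],
    "write": [("R", "AY", "T")],           # sounds the same as "right"
}


def build_lookup(pronunciations):
    """Invert the dictionary: phone sequence -> list of matching words."""
    lookup = {}
    for word, variants in pronunciations.items():
        for phones in variants:
            lookup.setdefault(phones, []).append(word)
    return lookup


def match(phones, lookup, grammar):
    """Return the words for a phone sequence that the grammar allows."""
    return [word for word in lookup.get(tuple(phones), []) if word in grammar]


LOOKUP = build_lookup(PRONUNCIATIONS)
candidates = match(["R", "AY", "T"], LOOKUP, grammar={"the", "write"})
```

Here the dictionary alone cannot separate "right" from "write"; the grammar, which in this example only allows "write", removes the other candidate.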



Introduction to CMU Sphinx

  • A speech recognition system developed at Carnegie Mellon University.

  • Consists of a set of libraries

    • core speech recognition functions

    • low-level audio capture

  • Continuous speech decoding

  • Speaker-independent


Brief History of CMU Sphinx

  • Sphinx-I (1987)

    • The world's first high-performance, speaker-independent ASR system.

    • Written in C by Dr. Kai-Fu Lee (currently chief technical advisor and vice president at Microsoft Asia).

  • Sphinx-II (1992)

    • Written in C by Dr. Xuedong Huang (currently leader of the Microsoft Speech.NET team).

    • 5-state HMM / N-gram LM.

  • (We can speculate that the core technology of CMU Sphinx strongly influenced the Microsoft Speech SDK.)


Brief History of CMU Sphinx (2)

  • Sphinx 3 (1996)

    • Built by Eric Thayer and Mosur Ravishankar.

    • Slower than Sphinx-II but the design is more flexible.

  • Sphinx 4 (Originally Sphinx 3j)

    • Refactored from Sphinx 3.

    • Fully implemented in Java.

    • Not finished yet.



Front End

  • libsphinx2fe.lib / libsphinx2ad.lib

  • Low-level audio access

  • Continuous Listening and Silence Filtering

  • Front End API overview.
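
Silence filtering can be approximated by dropping frames whose average energy falls below a threshold. This is a generic sketch, not the Sphinx2 front-end API: the frame length, the threshold value, and the function names are all assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of integer samples."""
    return sum(s * s for s in frame) / len(frame)


def filter_silence(samples, frame_len=160, threshold=1.0e5):
    """Split samples into fixed-length frames and keep only the loud ones.

    frame_len and threshold are illustrative values; a practical system
    would calibrate the threshold against the background noise level.
    A trailing partial frame is ignored.
    """
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_energy(frame) >= threshold:
            kept.append(frame)
    return kept


# Silence, then a loud segment, then silence again.
audio = [0] * 160 + [1000] * 160 + [0] * 160
speech_frames = filter_silence(audio)
```

Only the middle frame survives, so the decoder never spends time scoring the silent stretches.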


Knowledge Base

  • The data that drives the decoder.

  • Three sets of data

    • Acoustic Model.

    • Language Model.

    • Lexicon (Dictionary).


Acoustic Model

  • /model/hmm/6k

  • A database of statistical models.

  • Each statistical model represents a phoneme.

  • Acoustic models are trained by analyzing large amounts of speech data.


HMM in Acoustic Model

  • An HMM represents each unit of speech in the acoustic model.

  • A typical HMM uses 3-5 states to model a phoneme.

  • Each HMM state is represented by a set of Gaussian mixture density functions.

  • Sphinx2 default phone set.
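
How a small HMM scores a phoneme can be sketched with a Viterbi pass over a 3-state left-to-right model. The topology (self-loop or advance by one state), the shared transition probabilities, and the emission values below are assumptions made up for the example; they are not Sphinx2's actual model parameters.

```python
import math

NEG_INF = float("-inf")


def viterbi_left_to_right(log_emit, log_self, log_next):
    """Best path through a left-to-right HMM.

    log_emit[t][s] is the log-likelihood of frame t in state s. The path
    must start in state 0 and end in the last state; each step either
    stays (log_self) or advances one state (log_next). If there are fewer
    frames than states, the score is -inf and the path is meaningless.
    """
    n_states = len(log_emit[0])
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    back = []                                  # back-pointers per frame
    for t in range(1, len(log_emit)):
        new_score = [NEG_INF] * n_states
        pointers = [0] * n_states
        for s in range(n_states):
            stay = score[s] + log_self
            move = score[s - 1] + log_next if s > 0 else NEG_INF
            if move > stay:
                best, pointers[s] = move, s - 1
            else:
                best, pointers[s] = stay, s
            new_score[s] = best + log_emit[t][s]
        back.append(pointers)
        score = new_score
    path = [n_states - 1]                      # trace back from the last state
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[-1], path


# Made-up emission probabilities for 4 frames and 3 states.
emit = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
log_emit = [[math.log(p) for p in row] for row in emit]
best_score, best_path = viterbi_left_to_right(log_emit, math.log(0.5), math.log(0.5))
```

The returned path assigns each frame to one state, and the score is what the decoder would compare against competing phoneme hypotheses.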


Gaussian Mixtures

  • Refer to the textbook, p. 33, eq. 38.

  • Represent each state in an HMM.

  • Each set of Gaussian mixtures is called a "senone".

  • HMMs can share senones.
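
The score a Gaussian mixture gives to one feature vector can be sketched as follows. Diagonal covariances and the function names are assumptions for illustration; the textbook equation cited above is not reproduced here.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x, weights, means, variances):
    """Log of sum_k weights[k] * N(x; means[k], variances[k]).

    Uses the log-sum-exp trick so that very small densities do not
    underflow to zero before they are added together.
    """
    terms = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# One-dimensional, two-component mixture with illustrative parameters.
score = log_mixture([0.0], weights=[0.5, 0.5], means=[[0.0], [0.0]],
                    variances=[[1.0], [1.0]])
```

Because a senone is just such a set of mixture parameters, two HMM states that share a senone can reuse the same `log_mixture` result for a given frame.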


Language Model

  • Describes what is likely to be spoken in a particular context

  • Word transitions are defined in terms of transition probabilities

  • Helps to constrain the search space

  • See examples of LM.


N-gram Language Model

  • The probability of word N depends on words N-1, N-2, ...

  • Bigrams and trigrams are most commonly used.

  • Used for large-vocabulary applications such as dictation.

  • Typically trained on a very large corpus (millions of words).
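
A bigram model can be sketched as counts of word pairs divided by counts of the preceding word. The toy corpus and the function names are hypothetical, and the estimate is unsmoothed; practical models add smoothing or backoff so that unseen word pairs do not get probability zero.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count histories and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])               # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


corpus = ["go forward", "go back", "go forward"]   # toy training data
uni, bi = train_bigrams(corpus)
p_forward = bigram_prob("go", "forward", uni, bi)  # 2 of the 3 "go" are followed by "forward"
```

A trigram model follows the same pattern with two-word histories, which is why the required corpus size grows so quickly with N.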


Decoder

  • Selects next set of likely states

  • Scores incoming features against these states

  • Drops low-scoring states

  • Generates results
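
Dropping low-scoring states is commonly done with beam pruning: keep only the states whose score is within a fixed margin of the current best. The sketch below assumes log-domain scores and a hypothetical `prune` function; it is not taken from the Sphinx decoder.

```python
def prune(hypotheses, beam_width):
    """Keep states whose log score is within beam_width of the best one.

    `hypotheses` maps a state identifier to its accumulated log score.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {
        state: score
        for state, score in hypotheses.items()
        if score >= best - beam_width
    }


active = prune({"a": -1.0, "b": -3.0, "c": -12.0}, beam_width=5.0)
```

A wider beam keeps more states alive and lowers the risk of discarding the correct path, at the cost of more scoring work per frame.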



Sun Java Speech API

  • First released on October 26, 1998.

  • The Java™ Speech API allows Java applications to incorporate speech technology into their user interfaces.

  • Defines a cross-platform API to support command and control recognizers, dictation systems and speech synthesizers.


Implementations of Java Speech API

  • Open Source

    • FreeTTS / CMU Sphinx4.

  • IBM Speech for Java.

  • Cloud Garden.

  • L&H TTS for Java Speech API.

  • Conversa Web 3.0.


FreeTTS

  • Fully implemented in Java.

  • Based upon Flite 1.1: a small run-time speech synthesis engine developed at CMU.

  • Partial support for JSAPI 1.0.

    • Speech Recognition functions.

    • JSML.


Sphinx 4 (Sphinx 3j)

  • Fully implemented in Java.

  • Speed is equal to or faster than Sphinx 3.

  • The acoustic model and language model are under construction.

  • Source code is available via CVS (but you cannot run any applications without models!).

For example, to check out Sphinx4, you can use the following command:

cvs -z3 -d:pserver:[email protected]:/cvsroot/cmusphinx co sphinx4


Java™ Platform Issues

  • GC makes managing data much easier

  • Native engines typically optimize inner loops for the CPU – can't do that on the Java platform.

  • Native engines arrange data to optimize cache hits – can't really do that either.


DEMO

  • Sphinx-II batch mode.

  • Sphinx-II live mode.

  • Sphinx-II Client / Server mode.

  • A simple FreeTTS application.

  • (Java-based) TTS vs. (C-based) SR.

  • Motion Planner with FreeTTS, using Java Web Start™ (the final project for the GRA course).


Summary

  • Sphinx is an open-source speech recognition system developed at CMU.

  • The front end (FE), knowledge base (KB), and decoder form the core of the SR system.

  • The FE receives and processes the speech signal.

  • The knowledge base provides data for the decoder.

  • The decoder searches the states and returns the results.

  • Speech Recognition is a challenging problem for the Java platform.


Reference

  • Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996.

  • Mosur K. Ravishankar and Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.

  • Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.


Reference (on-line)

  • On-line documents of Java™ Speech API

    • http://java.sun.com/products/java-media/speech/

  • On-line documents of FreeTTS

    • http://freetts.sourceforge.net/docs/

  • On-line documents of Sphinx-II

    • http://www.speech.cs.cmu.edu/sphinx/


