
Part V: Ease-of-use, Chapter 17: Mobile Speech Recognition


Presentation Transcript


  1. Part V: Ease-of-use, Chapter 17: Mobile Speech Recognition. Dirk Schnelle

  2. Introduction • Voice-based interaction with mobile devices • Is NOT simply copying an existing speech recognizer to the device and running it there • The limitations of the device have to be considered • Computational power • Limited memory • … • Different architectures for enabling speech recognition on the device exist

  3. Speech Recognition • A recognizer transcribes spoken language into text

  4. General Architecture • Signal Processor • Generates real-valued feature vectors x_i from the speech signal • At regular intervals, e.g. every 10 ms • Model • Contains a set of prototypes w_i • Decoder • Converts the x_i into an utterance • Finds the prototype w_i closest to x_i for a given distance function d
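A minimal sketch of this decoder loop in Python (the names extract_features-style inputs, the label dictionary, and the Euclidean distance are illustrative assumptions, not the chapter's notation):

```python
import numpy as np

def euclidean(x, w):
    """Distance function d between a feature vector and a prototype."""
    return np.linalg.norm(x - w)

def decode(feature_vectors, prototypes):
    """Map each feature vector x_i to the label of its closest prototype.

    feature_vectors: iterable of 1-D numpy arrays (one per ~10 ms frame)
    prototypes: dict mapping an acoustic-symbol label to its prototype vector
    """
    utterance = []
    for x in feature_vectors:
        # Pick the prototype w_i that minimizes d(x, w_i).
        best = min(prototypes, key=lambda label: euclidean(x, prototypes[label]))
        utterance.append(best)
    return utterance
```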

  5. Recognizer Types
  Word-based Recognizer • Acoustic symbols a_i are words • Example: {a1=one, a2=two, a3=three} • No post-processing required • Inflexible • For smaller vocabularies
  Phoneme-based Recognizer • Acoustic symbols a_i are phonemes • Phonemes are small sound units • Example: {a1=W, a2=AH, a3=N, a4=., a5=T, a6=U, a7=W, …} • Requires post-processing • More accurate • Reduces decoding to small sound units • More flexible • Can handle a larger vocabulary more easily
  Analogy: the first attempts at writing used symbols for each word, then symbols for each syllable, then the letters that we find today

  6. Limitations of Embedded Devices • Memory • Impossible to store large models • Computational Power • Signal processor and decoder are computationally intensive • Power Consumption • Computationally intensive tasks consume too much battery; lifetime is reduced • Floating Point Support • Current processors (StrongARM, XScale) do not support floating point arithmetic • Emulation is computationally intensive and slow
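A common workaround for the missing FPU is fixed-point arithmetic, where real values are scaled to integers; a minimal illustration in Python (the Q15 format and these helper names are assumptions for the sketch, not from the chapter):

```python
# Q15 fixed point: a real value v in [-1, 1) is stored as round(v * 2**15).
SCALE = 1 << 15

def to_q15(v: float) -> int:
    return int(round(v * SCALE))

def q15_mul(a: int, b: int) -> int:
    # Integer multiply, then shift back down; no FPU needed on the device.
    return (a * b) >> 15

a, b = to_q15(0.5), to_q15(0.25)
print(q15_mul(a, b) / SCALE)  # ~0.125, computed with integers only
```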

  7. Main Architectures
  Classification by Schnelle • Service Dependent Speech Recognition • Recognizer as a service in the network • Audio Streaming • Media Resource Control Protocol (MRCP) • Distributed Speech Recognition (DSR) • Device Inherent Speech Recognition • Recognizer on the device • Hardware-based Speech Recognition • Dynamic Time Warping (DTW) • Hidden Markov Models (HMM) • Artificial Neural Networks (ANN)
  Different classification by Zaykovskiy • Client • Signal processor and decoder on the device • Client-Server • Signal processor on the device • Decoder as a service in the network • Server • Signal processor and decoder as a recognition service on the server

  8. Parameters of Speech Recognition in UC
  General Parameters • Speaking Mode • Isolated word recognition vs. continuous speech • Speaking Style • Read speech vs. spontaneous speech • Enrollment • Speaker-dependent vs. speaker-independent • Vocabulary • Size of the vocabulary • Perplexity • Number of words that can follow a single word • SNR • Signal-to-noise ratio • Transducer • Noise-cancelling headset vs. telephone
  UC-specific Parameters • Network dependency • None vs. fully network dependent • Network bandwidth • Amount of data to send over the network • Transmission degradation • Loss of information while transmitting the data • Server load • Scalability of the server, if any • Integration and Maintenance • Ease of access and applying bugfixes • Responsiveness • Real-time capabilities

  9. Audio Streaming • Use of the embedded device as a microphone replacement • Stream audio over the wireless network • Bluetooth • Wi-Fi • Signal processor on the server • Advantages • Full-featured recognizer • Large language models • Disadvantages • Requires a stable wireless network connection • Very large amount of data streamed over the network • Own proprietary protocol • Real-time capabilities?

  10. MRCP • Standard for audio streaming • Adopted by industry • API that enables clients to control media resources over a network • Based on the Real Time Streaming Protocol (RTSP)

  11. DSR • Standard by ETSI (European Telecommunications Standards Institute) • Goals • Reduce network traffic • Use the computational capabilities of the embedded device

  12. DSR Implementation • DSR Front-End (Signal Processor) • DSR Back-End (Decoder) • Sphinx profiling

  13. Front-End Processing • Quantization • Pre-Emphasis • Signal is weaker in higher frequencies • High-pass filter • Framing • Division into overlapping frames of N samples • Windowing • Hamming window
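A minimal numpy sketch of these steps (the frame length, frame shift, and pre-emphasis coefficient 0.97 are common defaults assumed here, not values given in the chapter):

```python
import numpy as np

def front_end_frames(signal, frame_len=400, frame_shift=160, alpha=0.97):
    """Pre-emphasize, split into overlapping frames, apply a Hamming window.

    With 16 kHz audio, 400/160 samples correspond to 25 ms frames every 10 ms.
    """
    # Pre-emphasis: high-pass filter s'[n] = s[n] - alpha * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: division into overlapping frames of N samples
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])

    # Windowing: taper each frame with a Hamming window
    return frames * np.hamming(frame_len)
```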

  14. Front-End Processing II • Power Spectrum • DFT • Mel Spectrum • Power Cepstrum • ceps = DCT(log(melspec)) • The first 13 cepstral parameters are called the features
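A sketch of this second stage, continuing the windowed frames from the previous sketch (the precomputed mel filterbank is an assumed input, and the DCT-of-log-mel-spectrum form follows the standard cepstrum computation rather than a formula from the chapter):

```python
import numpy as np
from scipy.fftpack import dct

def cepstra(windowed_frames, mel_filterbank, n_ceps=13):
    """Compute cepstral features from windowed frames.

    mel_filterbank: matrix of shape (n_filters, n_fft_bins) mapping the
    power spectrum onto the mel scale (assumed to be precomputed).
    """
    # Power spectrum of each frame via the DFT
    spectrum = np.abs(np.fft.rfft(windowed_frames, axis=1)) ** 2
    # Mel spectrum: weight and sum the power spectrum per mel filter
    mel_spec = spectrum @ mel_filterbank.T
    # Power cepstrum: DCT of the log mel spectrum
    ceps = dct(np.log(mel_spec + 1e-10), type=2, norm='ortho', axis=1)
    # The first 13 cepstral parameters are used as the features
    return ceps[:, :n_ceps]
```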

  15. Front-End Profiling • Utterances • Short: 2.05 sec • Medium: 6.05 sec • Long: 30.04 sec • Candidates for improvement • fft • spec magnitude

  16. Hardware-based Speech Recognition • Broad range • Partial (FFT computation in DSR) • Full-featured recognizers • Advantages • Fewer runtime problems • Hardware is designed for this purpose • Disadvantage • Loss of flexibility • General • Advantages and disadvantages are dependent on the implemented technology

  17. Dynamic Time Warping
  Signal processor • Comparable to the front-end in DSR • Output X = (x_1, …, x_n)
  Prototype storage • Templates W_i = (w_i,1, …, w_i,m)
  Comparator • Compares d(X, W_i) against a threshold μ • Distance function d • Time warping function (relationship between the elements of X and W_i)
  Problem • Unlikely that the lengths of input and template are the same • E.g. the length of the 'o' in a word • DTW uses dynamic programming
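A minimal dynamic-programming DTW sketch in Python (the Euclidean frame distance and the standard three-way recurrence are assumed choices; the chapter does not fix them):

```python
import numpy as np

def dtw_distance(X, W):
    """Dynamic time warping distance between input X (n frames) and
    template W (m frames), each a sequence of feature vectors."""
    n, m = len(X), len(W)
    # D[i, j] = cost of the best warping path aligning X[:i] with W[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(X[i - 1]) - np.asarray(W[j - 1]))
            # Extend the cheapest of: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

Recognition then picks the template with the smallest warped distance to the input, so differing lengths (e.g. a drawn-out vowel) no longer force a mismatch.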

  18. Hidden Markov Models

  19. Unit Matching • HMMs are described as λ = (S, A, B, π, V) • States S = (s_1, …, s_n) • Transition probabilities A = {a_i,j} • a_i,j denotes the probability p(s_i, s_j) of moving from state s_i to state s_j • Output probabilities B = (b_1, …, b_n) • b_i(x) denotes the probability q(x | s_i) of observing x in state s_i • Observations O • Domain of the b_i • Probability of an output sequence O • Result: scoring for different recognition hypotheses
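A toy representation of λ = (S, A, B, π) in Python, reused by the sketches after the next two slides (the two-state model and its values are invented for illustration):

```python
import numpy as np

# Two states S = (s_1, s_2); observations are column indices into B.
pi = np.array([0.6, 0.4])            # initial state distribution π
A = np.array([[0.7, 0.3],            # a_ij = p(s_i, s_j)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # b_i(x) = q(x | s_i)
              [0.2, 0.8]])
```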

  20. Rabiner's basic questions • Given the observation sequence O = O_1 O_2 … O_T and a model λ, how do we efficiently compute p(O|λ), the probability of the observation sequence given the model? (Evaluation problem) • For decoding, the question to solve is: given the observation sequence O = O_1 O_2 … O_T and the model λ, how do we choose a corresponding state sequence Q = Q_1 Q_2 … Q_T which is optimal in some meaningful sense (i.e. best "explains" the observations)? (Important for speech recognition: find the correct state sequence) • How do we adjust the model parameters λ to maximize p(O|λ)? (Training)
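The evaluation problem (question 1) is classically solved by the forward algorithm; a sketch continuing the toy λ defined above:

```python
def forward(pi, A, B, observations):
    """p(O | λ) via the forward algorithm.

    alpha[i] holds p(O_1..O_t, state s_i at time t | λ).
    """
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        # Sum over all paths into each state, then emit the next symbol.
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward(pi, A, B, [0, 1, 0]))  # probability of observing 0, 1, 0
```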

  21. Viterbi Algorithm • Solves Rabiner's question 2 • Based on dynamic programming • Tries to find the best score (highest probability) along a single path at time t (trellis) • Computationally intensive • Requires |A_u| multiplications and additions • |A_u| is the number of transitions in the model • Rabiner, Jelinek • Computational optimizations • Try to replace multiplications by additions • Usually increase speed at the cost of accuracy
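A minimal Viterbi sketch over the same toy λ, working in log space so that the multiplications along a path become the additions the slide alludes to:

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Most likely state sequence Q for the observations, given λ.

    Log probabilities turn products along a path into sums.
    """
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, observations[0]]
    backptr = []
    for o in observations[1:]:
        # Best predecessor for each state in the trellis
        scores = delta[:, None] + logA
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + logB[:, o]
    # Trace the best path backwards
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

print(viterbi(pi, A, B, [0, 1, 0]))  # [0, 1, 0] for the toy λ
```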

  22. Lexical Decoding and Semantic Analysis
  Lexical Decoding • Eliminate those words that do not have a valid dictionary entry • Alternative: statistical grammar • Sequences are reduced to a couple of phonemes in a row, e.g. trigrams • Output: list of trigrams ordered by score • Not suitable for isolated word recognition
  Semantic Analysis • Eliminate those parts that do not match an allowed sequence of words in a dictionary
  Both steps are • Not computationally intensive • Fast memory access • Smaller vocabularies are • Faster to handle • Require less memory
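A minimal illustration of the lexical decoding step (the dictionary contents and hypothesis list are invented for the example):

```python
# Hypotheses from unit matching, best score first (invented values)
hypotheses = [("won", 0.81), ("one", 0.78), ("wun", 0.64)]

# Eliminate hypotheses without a valid dictionary entry
dictionary = {"one", "two", "three"}
valid = [(word, score) for word, score in hypotheses if word in dictionary]
print(valid)  # [('one', 0.78)]
```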

  23. Artificial Neural Networks • Derived from the way the human brain works • Goal: create a system that • Is able to learn • Can be used for pattern classification • Alternative to HMMs • Output of a neuron: an activation function applied to the weighted sum of its inputs • Large amount of calculations • Advantage: only additions and multiplications • Disadvantage: too many operations • Not usable on devices with a low CPU frequency • Nearly no optimization possible • Solution: implement in hardware
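A single neuron's output as a sketch (the sigmoid activation is an assumed choice; the chapter only notes that the computation reduces to additions and multiplications):

```python
import numpy as np

def neuron_output(weights, inputs, bias=0.0):
    """y = f(sum_i w_i * x_i + b): a weighted sum passed through an activation f."""
    activation = np.dot(weights, inputs) + bias   # only multiplications and additions
    return 1.0 / (1.0 + np.exp(-activation))      # sigmoid f

print(neuron_output(np.array([0.5, -0.3]), np.array([1.0, 2.0])))
```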

  24. Future Research Directions • Overview of challenges to implement speech recognition on embedded devices • None of the architectures is ideal in all aspects • Hope of researchers • Embedded devices become powerful enough to run off-the-shelf speech recognition • Does not solve the problems we have today • Main approaches • Hardware engineers try to improve the performance of embedded devices • Unable to meet the challenges in the short term • Will address some of them • Software engineers look for tricks to enable speech recognition on embedded devices • Currently gain speed at the cost of precision

  25. Summary
  Service dependent speech recognition • Requires a speech recognition service in the network • Same potential as desktop speech recognition • Speech recognition parameters depend on the used technology • Full network dependency (slightly better for DSR) • High server load • Additional parameters for UC are worse
  Device inherent speech recognition • HMM and ANN offer the highest flexibility • DTW requires fewer resources • Speaker-dependent • Requires enrollment • Isolated word recognition • Hardware • Lowest flexibility • Bad performance • Requires too many resources • Real time may not be achieved • Smaller vocabularies • Smaller models • Lower perplexity
