74.406 Natural Language Processing - Speech Processing
Spoken Language Processing: from speech to text, to syntax and semantics, to speech
Speech Recognition: human speech recognition and production; acoustics; signal analysis; phonetics; recognition methods (HMMs)
Review
Acoustic / sound wave
Filtering, Sampling; Spectral Analysis; FFT
Features (Phonemes; Context)
Signal Processing / Analysis
HMM, Neural Networks
Grammar or Statistics
Phoneme Sequences / Words
Grammar or Statistics for likely word sequences
Word Sequence / Sentence
"She just had a baby."
From Signal Representation derive, e.g.
strong frequency components; characterize particular vowels; gender of speaker
baseline for higher frequency harmonics like formants; gender characteristic
characteristic for e.g. plosives (form of articulation)
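As a rough illustration of how "strong frequency components" can be read off a signal representation, the sketch below computes a magnitude spectrum and reports the strongest frequency. It is a minimal example under stated assumptions: the signal is synthetic (two sine waves, not recorded speech), the sample rate and frame length are arbitrary choices, and the function names dft_magnitudes and strongest_frequency are hypothetical. It uses a direct DFT for clarity; an FFT yields the same values faster.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude spectrum for bins 0..N/2 via a direct O(N^2) DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def strongest_frequency(samples, sample_rate):
    """Frequency in Hz of the largest spectral peak, ignoring the DC bin."""
    mags = dft_magnitudes(samples)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / len(samples)


# Assumed toy input: a 200 Hz component plus a weaker 600 Hz component.
sample_rate = 8000   # samples per second (assumption)
frame_length = 400   # 50 ms frame, so each bin is 20 Hz wide
signal = [math.sin(2 * math.pi * 200 * i / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 600 * i / sample_rate)
          for i in range(frame_length)]

peak_hz = strongest_frequency(signal, sample_rate)
```

With real speech, a frame would normally be windowed before the transform, and several peaks rather than a single maximum would be inspected.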
Video of glottis and speech signal in lingWAVES (from http://www.lingcom.de)
Recognition Process based on
The Viterbi Algorithm finds an optimal sequence of states in continuous Speech Recognition, given an observation sequence of phones and a probabilistic (weighted) finite automaton (state graph). The algorithm returns the path through the automaton which has maximum probability and accepts the observation sequence.
a[s,s'] is the transition probability (in the phonetic word model) from current state s to next state s', and b[s',o_t] is the observation likelihood of the observation o_t at time t in state s'. In this simplified model, b[s',o_t] is 1 if the observation symbol matches the state, and 0 otherwise.
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score ← viterbi[s,t] * a[s,s'] * b[s',o_t]
        if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) then
          viterbi[s',t+1] ← new-score
          back-pointer[s',t+1] ← s
  Backtrace from highest probability state in the final column of viterbi and return path
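The pseudocode can be sketched in Python as follows. This is an illustrative version, not the textbook's code: it stores scores in dictionaries instead of a fixed matrix, and the toy state graph, its probabilities, and the names viterbi, transitions and emits are assumptions made for the example. As in the simplified model above, the observation likelihood is 1 when the observed phone matches the state's phone and 0 otherwise.

```python
def viterbi(observations, transitions, emits, start):
    """Return (best_path, probability) through a weighted automaton.

    transitions: state -> {next_state: probability}, i.e. a[s, s']
    emits:       state -> phone symbol that the state accepts
    """
    T = len(observations)
    # score[t][s]: probability of the best path in state s after t observations
    score = [dict() for _ in range(T + 1)]
    back = [dict() for _ in range(T + 1)]
    score[0][start] = 1.0

    for t, obs in enumerate(observations):
        for s, p in score[t].items():
            for s_next, a in transitions.get(s, {}).items():
                b = 1.0 if emits.get(s_next) == obs else 0.0
                new_score = p * a * b
                # Keep only the best-scoring way of reaching s_next.
                if new_score > score[t + 1].get(s_next, 0.0):
                    score[t + 1][s_next] = new_score
                    back[t + 1][s_next] = s

    if not score[T]:
        return [], 0.0  # no path accepts the observation sequence

    # Backtrace from the highest-probability state in the final column.
    state = max(score[T], key=score[T].get)
    prob = score[T][state]
    path = []
    for t in range(T, 0, -1):
        path.append(state)
        state = back[t][state]
    path.reverse()
    return path, prob


# Hypothetical state graph: two states accept the phone "n", both lead to "iy".
transitions = {
    "start": {"n_a": 0.6, "n_b": 0.4},
    "n_a": {"iy": 0.5, "eh": 0.5},
    "n_b": {"iy": 1.0},
}
emits = {"n_a": "n", "n_b": "n", "iy": "iy", "eh": "eh"}

path, prob = viterbi(["n", "iy"], transitions, emits, "start")
```

In this toy graph the locally more likely first step (n_a, 0.6) is not on the best path: going through n_b gives 0.4 * 1.0 = 0.4, while going through n_a gives 0.6 * 0.5 = 0.3, so the backtrace returns n_b followed by iy.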
single word vs. continuous speech
unlimited vs. large vs. small vocabulary
speaker-dependent vs. speaker-independent
training (or not)
Speech Recognition vs. Speaker Identification
Huang, X. & A. Acero & H. Hon: Spoken Language Processing. A Guide to Theory, Algorithm, and System Development. Prentice-Hall, NJ, 2001
Figures taken from:
Jurafsky, D. & J. H. Martin, Speech and Language Processing, Prentice-Hall, 2000, Chapters 5 and 7.
NL and Speech Resources and Tools:
German Demonstration Center for Speech and Language Technologies: http://www.lt-demo.org/