Automatic speech recognition

Contents

ASR systems

ASR applications

ASR courses

Presented by Kalle Palomäki

Teaching material: Kalle Palomäki & Mikko Kurimo


About Kalle

  • Background: Acoustics and audio, auditory brain measurements, hearing models, noise robust ASR
  • PhD 2005 at TKK
  • Research experience at:
    • Department of Signal Processing and Acoustics
    • Department of Information and Computer Science, Aalto
    • University of Sheffield, Speech and Hearing group
  • Team leader of noise robust ASR team, Academy research fellow
  • Current research themes
    • Hearing inspired missing data approach in noise robust ASR
    • Sound separation and feature extraction
Goals of today

Learn what methods are used for automatic speech recognition (ASR)

Learn about typical ASR applications and things that affect the ASR performance

Definition: Automatic speech recognition, ASR = transformation of speech audio to text

Orientation
  • What are the main challenges faced in automatic speech recognition?
  • Try to think of the three most important ones with a partner
ASR tasks and solutions
  • Speaking environment and microphone
    • Office, headset or close-talking
    • Telephone speech, mobile
    • Noise, outside, microphone far away
  • Style of speaking
  • Speaker modeling
ASR tasks and solutions
  • Speaking environment and microphone
  • Style of speaking
    • Isolated words
    • Connected words, small vocabulary
    • Word spotting in fluent speech
    • Continuous speech, large vocabulary
    • Spontaneous speech, ungrammatical
  • Speaker modeling
ASR tasks and solutions
  • Speaking environment and microphone
  • Style of speaking
  • Speaker modeling
    • Speaker-dependent models
    • Speaker-independent, average speaker models
    • Speaker adaptation

Automatic speech recognition

Large-vocabulary continuous speech recognition (LVCSR)

A complex pattern recognition system that utilizes many probabilistic models at different hierarchical levels

Transforms speech to text

[Block diagram: speech signal → feature extraction → acoustic modeling → decoder → recognized text; language modeling also feeds the decoder]

What is speech recognition?

Find the most likely word sequence given the acoustic signal and statistical models!

Acoustic model defines the sound units independent of speaker and recording conditions

Language model defines words and how likely they occur together

Lexicon (vocabulary) defines the word set and how the words are formed from sound units

What is speech recognition?

Find the most likely word sequence given the acoustic observations and statistical models:

Ŵ = argmax_W P(W | O)

What is speech recognition?

After applying Bayes' rule, find the most likely word sequence given the observations and models:

Ŵ = argmax_W P(O | W) · P(W)

where P(O | W) is the acoustic model and P(W) is the language model.


Preprocessing & Features

Extract the essential information from the signal

Describe the signal by compact feature vectors computed from short time intervals


Feature extraction pipeline (MFCC):

1. Audio signal s_t(n)
2. | DFT{s_t(n)} | → magnitude spectrogram S_{t,f}
3. Mel filterbank (auditory frequency resolution) → mel spectrogram S_{t,j}
4. Compression: log{S_{t,j}}
5. De-correlation: discrete cosine transformation → mel-frequency cepstral coefficients (MFCC)
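The five steps above can be sketched with plain numpy. This is a minimal illustration, not a production front end: frame length, hop, and filter counts are typical but arbitrary choices, and real systems use optimized signal-processing libraries.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC extraction following the pipeline above:
    framing -> |DFT| -> mel filterbank -> log -> DCT."""
    # Split into short overlapping frames and apply a window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # Magnitude spectrum |DFT{s_t(n)}| -> magnitude spectrogram S_{t,f}
    mag = np.abs(np.fft.rfft(frames, axis=1))

    # Triangular mel filterbank (auditory frequency resolution)
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((frame_len + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, mag.shape[1]))
    for j in range(n_mels):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)  # rising edge
        fbank[j, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)  # falling edge
    mel_spec = mag @ fbank.T                     # mel spectrogram S_{t,j}

    # Log compression, then DCT for de-correlation -> cepstral coefficients
    log_mel = np.log(mel_spec + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T                       # shape: (frames, n_ceps)
```

One second of 16 kHz audio with these settings yields 98 frames of 13 coefficients each.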


Acoustic modeling

Find basic speech units and their models in the feature space

Given the features, compute the model probabilities


Phonemes

Basic units of language

Written language: letter

Spoken language: phoneme

Wikipedia: “The smallest contrastive linguistic unit which may bring about a change of meaning”

There are different writing systems, e.g. IPA (International Phonetic Alphabet)

The phoneme sets differ between languages

[Image: IPA symbols for US English]

Training

[Figure: collected speech data with frame-level phoneme labels "_ k k k k ae ae ae ae _ _ _ _ t t t t _ _ _ _ _ _ _" used to train a GMM classifier]



Testing

[Figure: for a new frame, the trained GMM classifier outputs a probability per phoneme class, e.g. 0, 0.05, 0.4, 0.05, 0.5; Sum() = 1]
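The normalization behind "Sum() = 1" can be sketched directly: dividing per-class likelihoods for one frame by their sum gives a posterior distribution over the phoneme classes, assuming equal class priors. The input likelihood values below are hypothetical, chosen so the result reproduces the slide's numbers.

```python
import numpy as np

def phoneme_posteriors(likelihoods):
    """Turn per-class likelihoods for one frame into a posterior
    distribution over phoneme classes, assuming equal class priors."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    return likelihoods / likelihoods.sum()

# Hypothetical GMM likelihoods for classes {_, k, ae, t, other}
posteriors = phoneme_posteriors([0.0, 0.1, 0.8, 0.1, 1.0])
# posteriors -> [0.0, 0.05, 0.4, 0.05, 0.5], summing to 1
```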

Hidden Markov Model (HMM), 3 states

[Figure: a left-to-right HMM with three states, each modeled by its own GMM (GMM1, GMM2, GMM3); arrows show the transitions, and observation probabilities b(o1), b(o2), b(o3) score the acoustic observations o1, o2, o3]

Hidden Markov Model (HMM), 1 state

[Figure: a single-state HMM with self-loop probability a11 and an exit transition]

Observation sequence: O = {o1, o2, o3}

Observation probability sequence: B = {b(o1), b(o2), b(o3)}

P = b(o1) · a11 · b(o2) · a11 · b(o3) · a_out, where a_out is the exit transition probability

Realistic scenario

[Figure: per-frame classifier outputs (e.g. 0, 0.05, 0.4, 0.05, 0.5; Sum() = 1) for the target sequence "_ k ae _ t _", while the noisy frame labels read "_ _ k t k t k ae ow ae ae _ _ _ _ t kk t t _ _ _ _ _ _ _"]

[Figure: 1-state HMMs for "_", "k", "ae", "t" chained for the sequence "_ k ae _ t", with self-loop and exit probabilities (values such as 0.8/0.2, 0.79/0.21, 0.9/0.1 shown on the slide) and a GMM scoring the noisy frame sequence "_ k t k ae ow ae _ _ t t k _ _ _ _"]

Exercise 1

Calculate the likelihood of the phoneme sequence /k/ /ow/, as for the word "cow". Observation probabilities, temporal alignment, and a set of 1-state phoneme HMMs are shown below.

[Figure: alignment over /k/ and /ow/, with 1-state HMMs for "_", "ae", "k", "ow", "t"]

Solution, following the alignment over /k/ and /ow/:

0.4 · 0.2 · 0.5 · 0.92 · 0.4 · 0.92 · 0.5 = 0.006771
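The product in the solution can be checked mechanically. Which factors are observation probabilities and which are transition probabilities comes from the figure, so the list below simply reproduces the product as given.

```python
# Factors of the likelihood product from the worked answer above
factors = [0.4, 0.2, 0.5, 0.92, 0.4, 0.92, 0.5]

likelihood = 1.0
for f in factors:
    likelihood *= f
# likelihood ≈ 0.006771
```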

Context-dependent HMMs

Triphone HMMs for /_/ /k/ /ae/ /t/ /_/ (each phoneme modeled in its left and right context):

_-k+ae,  k-ae+t,  ae-t+_


More on HMMs

Lecture 12-Feb, “Sentence level processing” by Oskar Kohonen

Exercise 6, “Hidden Markov Models”


Language modeling

Gives a prior probability for any word (or phoneme sequence)

Defines basic language units (e.g. words)

Learns statistical models from large text collections



N-gram language model

  • Stochastic model of the relations between words
    • Which words often occur close to each other?
  • The model predicts the probability distribution of the next word given the previous ones
  • Estimated from large text corpora, i.e. millions of words
  • Smoothing and pruning are required to learn compact long-span models from sparse training data
    • More information in the lecture 26-Feb, “Statistical language models” by Mikko Kurimo
N-gram models

  • Trigram = 3-gram: word occurrence depends only on the immediate context
  • A conditional probability of a word given its context, e.g. P(w_i | w_{i-2}, w_{i-1})

Picture by B. Pellom

Estimation of an N-gram model

  • Bigram example:

    P(w_i | w_j) = c(w_j, w_i) / c(w_j),  e.g.  P(“stew” | “eggplant”) = c(“eggplant stew”) / c(“eggplant”)

    • This is the maximum likelihood estimate for the probability of w_i given w_j
    • c(w_j, w_i) is the count of w_j and w_i occurring together
    • c(w_j) is the count of w_j
    • Works well only for frequent bigrams
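The maximum likelihood estimate above can be sketched in a few lines; the toy sentences below are made up for illustration.

```python
from collections import Counter

def bigram_mle(sentences):
    """P(w_i | w_j) = c(w_j, w_i) / c(w_j), estimated from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)                  # c(w_j)
        bigrams.update(zip(words, words[1:]))   # c(w_j, w_i)
    return {(wj, wi): c / unigrams[wj] for (wj, wi), c in bigrams.items()}

probs = bigram_mle(["i want eggplant stew",
                    "i want rice",
                    "i like eggplant curry"])
# probs[("eggplant", "stew")] == c("eggplant stew") / c("eggplant") == 1/2
```

As the slide warns, such raw estimates are only reliable for frequent bigrams; rare and unseen pairs need smoothing.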

Data from the Berkeley restaurant corpus (Jurafsky & Martin, 2000, “Speech and language processing”).

[Table: uni-gram counts for the corpus]

Calculate the missing bi-gram probabilities. Worked examples:

1087 / 3437 = .32
6 / 1215 = .0049
3 / 3256 = .00092

On N-gram sparsity

  • For Shakespeare’s complete works the vocabulary size (word-form types) is 29,066
  • The total number of words is 884,647
  • This makes the number of possible bigrams 29,066² ≈ 845 million
  • Under 300,000 are found in the writings
  • Conclusion: even a learned bigram model would be very sparse
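The sparsity claim is easy to verify with the numbers from the slide:

```python
vocab_size = 29_066          # word-form types in Shakespeare's complete works
observed_bigrams = 300_000   # upper bound on distinct bigrams actually seen

possible_bigrams = vocab_size ** 2        # 844,832,356, i.e. ~845 million
coverage = observed_bigrams / possible_bigrams
# well under 0.04% of the possible bigrams ever occur
```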

Morphemes as language units

  • In many languages words are not suitable as basic units for the language models
  • Inflections, prefixes, suffixes and compound words
    • Finnish has these issues
  • The best units carry meaning (e.g. plain letters or syllables are not good)
    • → morphemes, or “statistical morphs”

tietä + isi + mme + kö + hän

would + we + really + know

http://www.cis.hut.fi/projects/speech/

Lexicon for sub-word units?

Better coverage: few or no out-of-vocabulary (OOV) words, even new words can be covered

Phonemes, syllables, morphemes, or stem+endings?

un + re + late + d + ness

unrelate + d + ness

unrelated + ness

How to split and rebuild words?
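One naive answer to the splitting question is a greedy longest-match against a morph lexicon. The lexicon below is a hypothetical hand-written toy; systems like Morfessor instead learn the "statistical morphs" from data.

```python
def segment(word, morphs):
    """Greedy longest-match split of a word into sub-word units
    from a given morph lexicon. Returns None if no split exists."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            if word[i:j] in morphs:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None                     # no morph matches at position i
    return parts

morphs = {"un", "relate", "d", "ness"}      # toy lexicon, for illustration
# segment("unrelatedness", morphs) -> ["un", "relate", "d", "ness"]
```

Rebuilding a word is then just concatenating the parts; the hard part in practice is choosing the unit inventory in the first place.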


More about language models

Lecture 26-Feb “Statistical language models” by Mikko Kurimo

Exercise 3. N-gram language models


Decoding

Join the acoustic and language probabilities

Find the most likely sentence hypothesis by pruning and choosing the best

Significant effect on recognition speed and accuracy


What is speech recognition?

After applying Bayes' rule, find the most likely word sequence given the observations and models:

Ŵ = argmax_W P(O | W) · P(W)

where P(O | W) is the acoustic model and P(W) is the language model.

Decoding

The task is to find the most probable word sequence, given models and the acoustic observations

Viterbi search: find the most probable state sequence

An exhaustive search made efficient by dynamic programming and recursion

For Large Vocabulary Continuous Speech Recognition (LVCSR) the search space must be pruned and optimized
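The Viterbi recursion can be sketched compactly in the log domain. The HMM dimensions and probabilities used in the usage example are illustrative, not taken from the lecture.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most probable state sequence through an HMM, in the log domain.
    log_A: (S, S) transition log-probs, log_B: (T, S) per-frame
    observation log-likelihoods, log_pi: (S,) initial log-probs."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_A  # every predecessor -> state move
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]         # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy 2-state example: observations favor state 0, then state 1
A = np.log([[0.7, 0.3], [0.1, 0.9]])
B = np.log([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
pi = np.log([0.5, 0.5])
path, score = viterbi(A, B, pi)
# path -> [0, 1, 1]
```

Real LVCSR decoders add beam pruning on top of this exact recursion, since the full state space is far too large to search exhaustively.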

N-best lists

  • Easy to apply long-span LMs for rescoring
  • The differences between hypotheses are often small
  • Not a very compact representation
  • Tokens can be decoded into a lattice or word-graph structure that shows all good options

Picture by B.Pellom

Automatic speech recognition

Content today:

ASR systems today

ASR applications

ASR courses

Typical applications

User interface by speech

Dictation

Speech translation

Audio information retrieval

1. User interface by speech

Give spoken commands to a system

Feedback often visual or synthesized speech

Synthesis can be adapted to the context (by ASR)‏

Typical devices to control

Mobile phones

Car navigation system

Telephone based information systems, e.g. call centers

2. Dictation

Online dictation of documents, emails or SMS

“Speech-To-Text”

Offline processing of audio files, voice mails, interviews, meeting minutes

Goal: Get as accurate transcription of the input speech as possible

Typically the speaker knows this, speaks clearly and slowly, and moves to a quiet environment

Offline demo

http://research.ics.tkk.fi/speech/wwwdemo.shtml

Submit an audio file and receive the transcription by email

The current Finnish version works best on news-style speech (so far the language-model training material is mostly newspapers, journals and books)

It may take a while (about 5 min), because these jobs run at low priority on our server

3. Speech translation

Online translation of input speech to another language

Combination of ASR and MT (Machine Translation)

May also include TTS (Text-To-Speech)

“Speech-To-Speech” translation

Adapting TTS to the speaker's own voice (by ASR)

The task is often limited to a specific domain (e.g. travel) or even to specific phrases only

4. Audio information retrieval

Main goal: ASR should provide raw text output that contains enough correct words; 100% accuracy is not required

Typical tasks: searches based on speech, i.e. Spoken Document Retrieval (SDR), audio indexing, speech summarization

The material is typically spoken for another purpose (for human listeners) and may contain difficult and fast speech and poor recording conditions

Speakers may change quickly, which makes adaptation difficult

It is important to recognize rare words, the “good indexing terms”

Indexing and retrieval

1. All speech is transformed to text

2. The raw text output is indexed as normal text documents

3. Relevant documents are ranked for each query (as in normal search engines)

4. The user may read or play back the results

The raw text output is much easier to browse than audio files
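The indexing and ranking steps above can be sketched with a toy inverted index; the document names and transcript texts below are made up for illustration, and real engines add stemming, stop-word handling and TF-IDF-style weighting.

```python
from collections import defaultdict

def build_index(transcripts):
    """Index raw ASR text output like normal documents:
    term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many query terms they contain."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

transcripts = {"news1": "speech recognition lecture",
               "weather": "weather report today",
               "demo": "speech translation demo"}
idx = build_index(transcripts)
# search(idx, "speech recognition") ranks "news1" above "demo"
```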

Automatic speech recognition

Content today:

ASR systems today

ASR applications

ASR courses

Courses at Aalto

At the Department of Information and Computer Science:

Natural language processing

Seminar courses

At the Department of Signal Processing and Acoustics:

https://noppa.aalto.fi/noppa/kurssit/elec/t4050

Speech recognition

Speech processing

Seminar courses, etc.

Speech Recognition, 5 cr

Taught by Prof. Mikko Kurimo

Goals:

Become familiar with speech recognition methods and applications

Learn the structure of a speech recognition system

Learn to construct one in practice!

Period II (Nov-Dec):

Theory lecture

Computer lecture

Homework assignments and a (group) project work

No exam!


Thanks for listening…

  • Contact: Kalle.Palomaki@tkk.fi
  • Publications, projects, demos etc:
    • http://research.ics.tkk.fi/speech/