Speech in Multimedia. Hao Jiang, Computer Science Department, Boston College, Oct. 9, 2007. Outline: Introduction; Topics in speech processing (speech coding, speech recognition, speech synthesis, speaker verification/recognition); Conclusion.


Speech in Multimedia

Hao Jiang

Computer Science Department

Boston College

Oct. 9, 2007

Outline
  • Introduction
  • Topics in speech processing
    • Speech coding
    • Speech recognition
    • Speech synthesis
    • Speaker verification/recognition
  • Conclusion
Introduction
  • Speech is our basic communication tool.
  • We have been hoping to be able to communicate with machines using speech.

[Figure: C3PO and R2D2]

Speech Production Model

[Figure, two panels: anatomical structure; mechanical model]

Characteristics of Digital Speech

[Figure: a speech signal shown as a waveform and as a spectrogram]

Voiced and Unvoiced Speech

[Figure: speech waveform with segments labeled silence, unvoiced, and voiced]

Short-time Parameters

[Figure: waveform, its envelope, and short-time power]

[Figure: zero-crossing rate and pitch period]
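
The short-time power and zero-crossing rate named above can be computed frame by frame. The following is a minimal sketch, not the method used to draw the slide's figures: the frame length, hop size, and the 100 Hz test tone are illustrative assumptions.

```python
import math

def frames(signal, frame_len, hop):
    """Split a list of samples into (possibly overlapping) frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Assumed toy input: a 100 Hz sine sampled at 8 kHz (a stand-in for voiced speech).
fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
for frame in frames(tone, frame_len=240, hop=80):
    print(round(short_time_power(frame), 3),
          round(zero_crossing_rate(frame), 3))
```

Voiced speech typically shows high power and a low zero-crossing rate, while unvoiced speech shows the opposite, which is why these two parameters are useful for the voiced/unvoiced decision.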

Speech Coding
  • Like images, speech can be compressed to make it smaller and easier to store and transmit.
  • General compression methods such as DPCM can also be used.
  • More compression can be achieved by taking advantage of the speech production model.
  • There are two classes of speech coders:
    • Waveform coder
    • Vocoder
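
As an illustration of the general-purpose approach, here is a minimal DPCM sketch. It assumes the simplest possible predictor (the previously reconstructed sample) and a uniform quantizer with a hypothetical `step` parameter; practical coders use better predictors and adaptive step sizes.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the decoder's prediction."""
    codes, prediction = [], 0
    for s in samples:
        code = round((s - prediction) / step)  # quantized prediction error
        codes.append(code)
        prediction += code * step              # mirror what the decoder will rebuild
    return codes

def dpcm_decode(codes, step):
    """Rebuild samples by accumulating the dequantized differences."""
    out, prediction = [], 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 12, 10, 4]
codes = dpcm_encode(samples, step=2)
print(codes, dpcm_decode(codes, step=2))
```

Because the encoder tracks the decoder's reconstruction rather than the original samples, the quantization error does not accumulate: each decoded sample stays within half a step of the input.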
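LPC Speech Coder

[Block diagram: speech enters a speech buffer (frame n, frame n+1) and goes through speech analysis, which extracts the vocal tract parameters, pitch, voiced/unvoiced decision, and energy parameter; these pass through a quantizer and code generation to form the code stream.]
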

LPC and Vocal Tract
  • Mathematically, speech can be modeled by the following generation model:

$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$

  • $\{a_1, a_2, \ldots, a_k\}$ are called Linear Prediction Coefficients (LPC), which can be used to model the shape of the vocal tract.
  • $e(n)$ is the excitation that generates the speech.
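
One common way to estimate the coefficients a_1..a_k from a frame of speech is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method they assume, so the sketch below is only one possible choice; the function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n + lag] for lag = 0..max_lag."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Return ([a_1..a_k], residual_energy) for x(n) ~ sum_p a_p x(n-p)."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Assumed test signal: the impulse response of x(n) = 0.9 x(n-1) + e(n).
signal = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(signal, order=2)
print(coeffs, residual)
```

For this test signal the recursion recovers a first coefficient close to 0.9 and a second close to 0, matching the model that generated it.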

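Decoding and Speech Synthesis

[Block diagram: for voiced frames, an impulse train generator driven by the pitch period feeds a glottal pulse generator; for unvoiced frames, a random noise generator is used. A U/V switch selects the excitation, which passes through a gain, the vocal tract model, and the radiation model to produce speech.]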
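
The decoder structure can be sketched as an excitation source feeding an all-pole filter built from the LPC coefficients. This is a simplified sketch: it omits the glottal pulse and radiation models shown in the diagram, and the function names and parameters are illustrative assumptions.

```python
import random

def impulse_train(n_samples, pitch_period):
    """Voiced excitation: one unit impulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: uniform random noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a, gain=1.0):
    """Vocal tract model: x(n) = sum_p a_p x(n-p) + gain * e(n)."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a[p - 1] * out[n - p]
                   for p in range(1, len(a) + 1) if n - p >= 0)
        out.append(past + gain * e)
    return out

def synthesize_frame(voiced, pitch_period, a, gain, n_samples):
    """U/V switch: choose the excitation, then run it through the filter."""
    if voiced:
        excitation = impulse_train(n_samples, pitch_period)
    else:
        excitation = white_noise(n_samples)
    return all_pole_filter(excitation, a, gain)

frame = synthesize_frame(voiced=True, pitch_period=80, a=[0.9], gain=1.0,
                         n_samples=180)
print(frame[:5])
```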

An Example for Synthesizing Speech

[Figure: a glottal pulse goes through the vocal tract filter with gain control and then through the radiation filter; a blending region is marked on the waveform.]

LPC10 (FS1015)
  • LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps.
  • LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients.

[Audio samples: original speech; LPC decoded speech]
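
The frame parameters imply a small bit budget per frame. The arithmetic below is a sketch assuming an 8 kHz sampling rate; the comparison against 16-bit PCM is an added assumption for scale, not a figure from the slides.

```python
SAMPLE_RATE_HZ = 8000      # assumed sampling rate
FRAME_MS = 22.5            # frame length from the slide
BIT_RATE_BPS = 2400        # 2.4 kbps
PCM_BITS_PER_SAMPLE = 16   # assumption, used only for the comparison

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS / 1000
bits_per_frame = BIT_RATE_BPS * FRAME_MS / 1000
compression_ratio = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE / BIT_RATE_BPS

print(samples_per_frame, bits_per_frame, round(compression_ratio, 1))
```

Each frame of 180 samples must therefore be described by 54 bits, which covers the 10 coefficients plus pitch, voicing, and energy.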

Mixed Excitation LP
  • For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two.
  • The new 2.4 kbps standard (MELP) addresses this problem.

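[Block diagram: pulses weighted by w and noise weighted by 1-w each pass through a bandpass filter and are summed (+); the mixed excitation, with gain applied, drives the vocal tract model and the radiation model to produce speech.]

[Audio samples: original speech; MELP decoded speech]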
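
The core idea of mixing pulses and noise with weights w and 1-w can be sketched as follows. This is a deliberate simplification: it leaves out the bandpass filters in the diagram and uses a single weight for the whole signal, and the function name is illustrative.

```python
import random

def mixed_excitation(n_samples, pitch_period, w, seed=0):
    """Blend a pulse train (weight w) with uniform noise (weight 1 - w)."""
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.uniform(-1.0, 1.0)
        out.append(w * pulse + (1.0 - w) * noise)
    return out

# w = 1.0 gives a pure pulse train, w = 0.0 gives pure noise.
print(mixed_excitation(8, pitch_period=4, w=0.7))
```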

Hybrid Speech Codecs
  • At higher bit rates, hybrid speech codecs have an advantage over vocoders.
  • FS1016: CELP (Code Excited Linear Prediction)
  • G.723.1: A dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
  • G.729: CELP-based codec at 8 kbps.

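[Block diagram, analysis by synthesis: model parameter generation drives speech synthesis; the synthesized speech goes through a "perceptual" comparison with the input speech, and the best parameters are emitted as the code.]

[Audio samples: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]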
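
The analysis-by-synthesis loop can be sketched as an exhaustive codebook search: synthesize speech from every candidate excitation and keep the one closest to the target. This sketch uses plain squared error in place of a perceptual comparison, and the codebook, gains, and filter coefficients are toy assumptions.

```python
def synthesize(excitation, a, gain):
    """All-pole synthesis: x(n) = sum_p a_p x(n-p) + gain * e(n)."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a[p - 1] * out[n - p]
                   for p in range(1, len(a) + 1) if n - p >= 0)
        out.append(past + gain * e)
    return out

def squared_error(x, y):
    return sum((u - v) ** 2 for u, v in zip(x, y))

def search_codebook(target, codebook, a, gains):
    """Return (index, gain) of the excitation whose synthesis best matches target."""
    best = None
    for index, code in enumerate(codebook):
        for gain in gains:
            err = squared_error(target, synthesize(code, a, gain))
            if best is None or err < best[0]:
                best = (err, index, gain)
    return best[1], best[2]

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 0, 0]]  # toy excitation codes
a = [0.5]                                               # toy vocal tract filter
target = synthesize(codebook[1], a, 2.0)
print(search_codebook(target, codebook, a, gains=[0.5, 1.0, 2.0]))
```

Only the winning index and gain need to be transmitted, since the decoder holds the same codebook.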

Speech Recognition
  • Speech recognition is the foundation of human-computer interaction using speech.
  • Speech recognition in different contexts
    • Speaker-dependent or speaker-independent.
    • Discrete words or continuous speech.
    • Small vocabulary or large vocabulary.
    • Quiet environment or noisy environment.

[Block diagram: speech goes through a parameter analyzer into a comparison and decision algorithm, which uses reference patterns and a language model to output words.]

How Does Speech Recognition Work?

Words: grey whales

Phonemes: g r ey w ey l z

Each phoneme has different characteristics (for example, the power distribution).

Speech Recognition

g g r ey ey ey ey w ey ey l l z

How do we “match” the word when there are time and other variations?

Hidden Markov Model

[Diagram: three states S1, S2, and S3, each emitting symbols from {a, b, c, …}; transitions between states carry probabilities such as P12.]

Dynamic Programming in Decoding

[Figure: trellis with time on one axis and states on the other]

We can find a path that corresponds to the most probable phonemes to generate the observed "feature" sequence (one feature is extracted in each speech frame).
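
The path search described above is usually done with the Viterbi algorithm. Below is a minimal sketch over a toy left-to-right model; the states, the discrete feature symbols "a", "b", "c", and all probabilities are invented for illustration and do not come from the slides.

```python
import math

def lg(p):
    """Log probability, with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for the observations."""
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = (best[t - 1][prev] + log_trans[prev][s]
                          + log_emit[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

STATES = ["g", "r", "ey"]
START = {"g": 1.0, "r": 0.0, "ey": 0.0}
TRANS = {"g": {"g": 0.5, "r": 0.5, "ey": 0.0},
         "r": {"g": 0.0, "r": 0.5, "ey": 0.5},
         "ey": {"g": 0.0, "r": 0.0, "ey": 1.0}}
EMIT = {"g": {"a": 0.8, "b": 0.1, "c": 0.1},
        "r": {"a": 0.1, "b": 0.8, "c": 0.1},
        "ey": {"a": 0.1, "b": 0.1, "c": 0.8}}
LOG_START = {s: lg(p) for s, p in START.items()}
LOG_TRANS = {s: {t: lg(p) for t, p in row.items()} for s, row in TRANS.items()}
LOG_EMIT = {s: {o: lg(p) for o, p in row.items()} for s, row in EMIT.items()}

print(viterbi(list("aabccc"), STATES, LOG_START, LOG_TRANS, LOG_EMIT))
```

Working in log probabilities avoids numerical underflow on long frame sequences, and the self-transitions are what let one phoneme absorb a variable number of frames.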

HMM for a Unigram Language Model

[Diagram: from a start state s0, branches with probabilities p1, p2, and p3 lead to the word models HMM1 (word1), HMM2 (word2), and HMM3 (wordn).]

Speech Synthesis
  • Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).
  • Speech synthesis has been widely used for text-to-speech systems and different telephone services.
  • The simplest and most commonly used speech synthesis method is waveform concatenation.

[Audio sample: increasing the pitch without changing the speed]

Speaker Recognition
  • Identifying or verifying the identity of a speaker is an application where computers outperform humans.
  • Vocal tract parameters can be used as features for speaker recognition.

[Figure: LPC covariance feature for speaker one and speaker two]

Applications

[Diagram relating four areas (speech recognition, speaker recognition, speech coding, and text-to-speech synthesis) to the following applications:]
  • Call routing
  • Document input
  • Operator services
  • Voice commands
  • Directory assistance
  • Voice over Internet
  • Fraud control
  • Wireless telephone
  • Document correction
  • Personalized service
  • Speech interface
