Voice DSP Processing I. Yaakov J. Stein, Chief Scientist, RAD Data Communications. Voice DSP. Part 1: Speech biology and what we can learn from it. Part 2: Speech DSP (AGC, VAD, features, echo cancellation). Part 3: Speech compression techniques. Part 4: Speech recognition.

Voice DSP Processing I

Yaakov J. Stein

Chief Scientist, RAD Data Communications

Voice DSP

Part 1 Speech biology and what we can learn from it

Part 2 Speech DSP (AGC, VAD, features, echo cancellation)

Part 3 Speech compression techniques

Part 4 Speech Recognition

Voice DSP - Part 1a

Speech production mechanisms

  • Biology of the vocal tract
  • Pitch and formants
  • Sonograms
  • The basic LPC model
  • The cepstrum
  • LPC cepstrum
  • Line spectral pairs
Voice DSP - Part 1b

Speech perception mechanisms

  • Biology of the ear
  • Psychophysical phenomena
    • Weber’s law
    • Fechner’s law
    • Changes
    • Masking
Voice DSP - Part 1c

Speech quality measurement

  • Subjective measurement
    • MOS and its variants
  • Objective measurement
    • PSQM, PESQ
Voice DSP - Part 2a

Basic speech processing

  • Simplest processing
    • AGC
    • Simplistic VAD
  • More complex processing
    • pitch tracking
    • formant tracking
    • U/V decision
    • computing LPC and other features
Voice DSP - Part 2b

Echo Cancellation

  • Sources of echo (acoustic vs. line echo)
  • Echo suppression and cancellation
  • Adaptive noise cancellation
  • The LMS algorithm
  • Other adaptive algorithms
  • The standard LEC
Voice DSP - Part 3

Speech compression techniques

  • PCM
  • ADPCM
  • SBC
  • VQ
  • ABS-CELP
  • MBE
  • MELP
  • STC
  • Waveform Interpolation
Voice DSP - Part 4

Speech Recognition tasks

ASR Engine

Phonetic labeling

DTW

HMM

State-of-the-Art

Voice DSP - Part 1a

Speech production mechanisms

Speech Production Organs

[Figure: diagram of the speech production organs, labeled: brain, nasal cavity, hard palate, velum, uvula, teeth, lips, mouth cavity, tongue, pharynx, larynx, esophagus, trachea, lungs]

Speech Production Organs - cont.
  • Air from the lungs is exhaled into the trachea (windpipe)
  • Vocal cords (folds) in the larynx can produce periodic pulses of air by opening and closing the glottis
  • Throat (pharynx), mouth, tongue and nasal cavity modify the air flow
  • Teeth and lips can introduce turbulence
  • Epiglottis separates esophagus (food pipe) from trachea
Voiced vs. Unvoiced Speech
  • When the vocal cords are held open, air flows unimpeded
  • When the laryngeal muscles stretch them, glottal flow comes in bursts
  • When glottal flow is periodic, the result is called voiced speech
  • The basic interval/frequency is called the pitch
  • Pitch period is usually between 2.5 and 20 milliseconds

Pitch frequency between 50 and 400 Hz

You can feel the vibration of the larynx

  • Vowels are always voiced (unless whispered)
  • Consonants come in voiced/unvoiced pairs

for example: B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH
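The pitch range quoted above can be used to bound a pitch search. The following is a minimal sketch, not a method taken from the slides: it builds an idealized pulse train and picks the autocorrelation peak among lags of 2.5 to 20 ms. The 8 kHz sampling rate, the function names and the test signal are assumptions made for illustration.

```python
def make_pulse_train(fs, pitch_hz, n_samples):
    """Idealized glottal excitation: one unit pulse per pitch period."""
    period = round(fs / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def estimate_pitch_hz(signal, fs, min_period_s=0.0025, max_period_s=0.020):
    """Return fs / lag for the lag with the largest autocorrelation,
    searching only lags inside the 2.5-20 ms pitch-period range."""
    min_lag = int(round(fs * min_period_s))
    max_lag = int(round(fs * max_period_s))
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if score > best_score:  # strict ">" keeps the shortest of tied lags
            best_lag, best_score = lag, score
    return fs / best_lag


fs = 8000  # assumed sampling rate in Hz
voiced = make_pulse_train(fs, pitch_hz=100.0, n_samples=800)
pitch_estimate = estimate_pitch_hz(voiced, fs)
print(pitch_estimate)  # estimated pitch in Hz
```

Because the autocorrelation sum is not normalized by the number of overlapping samples, shorter lags are slightly favored, which helps avoid picking a multiple of the true period.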

Excitation spectra
  • Voiced speech

Pulse train is not sinusoidal - harmonic rich

  • Unvoiced speech

Common assumption: white noise

[Figure: excitation spectra versus frequency f for voiced and unvoiced speech]

Effect of vocal tract
  • Mouth and nasal cavities have resonances
  • Resonant frequencies depend on geometry

Effect of vocal tract - cont.
  • Sound energy at these resonant frequencies is amplified
  • Frequencies of peak amplification are called formants

[Figure: frequency response versus frequency for voiced and unvoiced speech, marking the pitch F0 and the formants F1, F2, F3, F4]

Formant frequencies
  • Peterson - Barney data (note the “vowel triangle”)
Cylinder model(s)

Rough model of throat and mouth cavity

[Figure: tube diagram, labels: voice excitation, open]

With nasal cavity

[Figure: tube diagram, labels: voice excitation, open, open/closed]

Phonemes
  • The smallest acoustic unit that can change meaning
  • Different languages have different phoneme sets
  • Types: (notations: phonetic, CVC, ARPABET)
    • Vowels
      • front (heed, hid, head, hat)
      • mid (hot, heard, hut, thought)
      • back (boot, book, boat)
      • diphthongs (buy, boy, down, date)
    • Semivowels
      • liquids (l, r)
      • glides (w, y)
Phonemes - cont.
  • Consonants
    • nasals (murmurs) (n, m, ng)
    • stops (plosives)
      • voiced (b, d, g)
      • unvoiced (p, t, k)
    • fricatives
      • voiced (v, that, z, zh)
      • unvoiced (f, think, s, sh)
    • affricates (j, ch)
    • whispers (h, what)
    • gutturals (ח, ע)
    • clicks, etc.
Basic LPC Model

[Figure: block diagram; a pulse generator and a white noise generator feed a U/V switch, whose output drives the LPC synthesis filter]

Basic LPC Model - cont.
  • Pulse generator produces a harmonic-rich periodic impulse train (with pitch period and gain)
  • White noise generator produces a random signal (with gain)
  • U/V switch chooses between voiced and unvoiced speech
  • LPC filter amplifies formant frequencies (all-pole, i.e. AR, IIR filter)
  • The output will resemble true speech to within residual error
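The model described above can be sketched in a few lines. This is an illustrative sketch only: a single two-pole resonator stands in for the LPC synthesis filter (a real LPC filter has more coefficients, estimated from speech), and the 500 Hz formant, 100 Hz bandwidth, 8 kHz sampling rate and all function names are assumptions.

```python
import math
import random


def pulse_train(n_samples, period, gain=1.0):
    """Voiced excitation: one pulse of height `gain` every `period` samples."""
    return [gain if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, gain=1.0, seed=0):
    """Unvoiced excitation: Gaussian white noise scaled by `gain`."""
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def resonator_coeffs(formant_hz, bandwidth_hz, fs):
    """Coefficients [a1, a2] of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius, < 1 so stable
    theta = 2.0 * math.pi * formant_hz / fs      # pole angle
    return [-2.0 * r * math.cos(theta), r * r]


def all_pole_filter(excitation, a):
    """AR (all-pole IIR) filter: y[n] = x[n] - sum_k a[k] * y[n - k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def lpc_synthesize(voiced, n_samples, a, pitch_period=80, gain=1.0):
    """U/V switch: choose the excitation, then apply the synthesis filter."""
    if voiced:
        excitation = pulse_train(n_samples, pitch_period, gain)
    else:
        excitation = white_noise(n_samples, gain)
    return all_pole_filter(excitation, a)


coeffs = resonator_coeffs(formant_hz=500.0, bandwidth_hz=100.0, fs=8000.0)
voiced_frame = lpc_synthesize(True, 400, coeffs)
unvoiced_frame = lpc_synthesize(False, 400, coeffs)
```

Running a frame of real speech through the inverse (all-zero) filter would give the residual error mentioned in the last bullet.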
Cepstrum

Another way of thinking about the LPC model

The speech spectrum is obtained by multiplication:

spectrum of the (pitch) pulse train times the vocal tract (formant) frequency response

So the log of this spectrum is obtained by addition:

log spectrum of the pitch train plus log of the vocal tract frequency response

Consider this log spectrum to be the spectrum of some new signal, called the cepstrum

The cepstrum is the sum of two components:

excitation plus vocal tract
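The additive property can be checked numerically. The sketch below computes the real cepstrum (inverse DFT of the log magnitude spectrum) with a direct DFT and verifies that the cepstrum of a circularly convolved signal equals the sum of the component cepstra. The signals are assumptions chosen so that no spectral value is zero: a pulse plus one attenuated repeat stands in for the pitch train, and a decaying exponential stands in for the vocal tract.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    w = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(x[t] * w[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of x."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]


def circular_convolve(x, h):
    """Circular convolution, so that the DFTs multiply exactly."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


N, PERIOD = 128, 24                       # frame length and "pitch" lag (assumed)
excitation = [0.0] * N
excitation[0], excitation[PERIOD] = 1.0, 0.5
vocal_tract = [0.9 ** n for n in range(N)]
speech = circular_convolve(excitation, vocal_tract)

c_exc = real_cepstrum(excitation)
c_vt = real_cepstrum(vocal_tract)
c_speech = real_cepstrum(speech)

# Largest deviation from "cepstrum of product = sum of cepstra".
additivity_error = max(abs(s - (e + v)) for s, e, v in zip(c_speech, c_exc, c_vt))
# The excitation shows up as a peak at high quefrency.
peak_quefrency = max(range(10, N // 2), key=lambda q: c_speech[q])
print(additivity_error, peak_quefrency)
```

The vocal tract contribution is concentrated at low quefrencies while the excitation peak sits at the repeat lag, which is why the two can be separated in the cepstral domain.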

Cepstrum - cont.

Cepstral processing has its own language

  • Cepstrum (note that this is really a signal in the time domain)
  • Quefrency (its units are seconds)
  • Liftering (filtering)
  • Alanysis (analysis)
  • Saphe (phase)

Several variants:

  • complex cepstrum
  • power cepstrum
  • LPC cepstrum
Do we know enough?

The standard speech model (LPC), used by most speech processing/compression/recognition systems, is a model of speech production

Unfortunately, the speech production and speech perception systems are not matched

So next we’ll look at the biology of the hearing (auditory) system and some psychophysics (perception)

Voice DSP - Part 1b

Speech hearing & perception mechanisms

Hearing Organs - cont.
  • Sound waves impinge on the outer ear and enter the auditory canal
  • Amplified waves cause eardrum to vibrate
  • Eardrum separates outer ear from middle ear
  • The Eustachian tube equalizes air pressure of middle ear
  • Ossicles (hammer, anvil, stirrup) amplify vibrations
  • Oval window separates middle ear from inner ear
  • Stirrup excites oval window which excites liquid in the cochlea
  • The cochlea is curled up like a snail
  • The basilar membrane runs along middle of cochlea
  • The organ of Corti transduces vibrations to electric pulses
  • Pulses are carried by the auditory nerve to the brain
Function of Cochlea
  • Cochlea has 2 1/2 to 3 turns; were it straightened out it would be 3 cm in length
  • The basilar membrane runs down the center of the cochlea, as does the organ of Corti
  • 15,000 cilia (hairs) contact the vibrating basilar membrane and release neurotransmitter, stimulating 30,000 auditory neurons
  • Cochlea is wide (1/2 cm) near the oval window and tapers towards the apex
  • The basilar membrane is stiff near the oval window and flexible near the apex
  • Hence high frequencies cause the section near the oval window to vibrate, and low frequencies cause the section near the apex to vibrate
  • The result is a frequency decomposition by a bank of overlapping filters
Psychophysics - Weber’s law

Ernst Weber, professor of physiology at Leipzig in the early 1800s

Just Noticeable Difference (JND):

minimal stimulus change that can be detected by the senses

Discovery: $\Delta I = K I$

Example

Tactile sense: place coins in each hand

the subject could discriminate between 10 coins and 11,

but not 20 and 21, yet could discriminate 20 and 22!

Similarly: vision (lengths of lines), taste (saltiness), sound (frequency)
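The coin example can be written as a small check. This is a sketch under an assumed Weber fraction: the slide gives no value for K, and 0.08 is picked only because it is consistent with the three comparisons quoted above.

```python
def just_noticeable_difference(intensity, weber_fraction):
    """Weber's law: the JND is proportional to the stimulus intensity."""
    return weber_fraction * intensity


def can_discriminate(a, b, weber_fraction):
    """True if the two stimuli differ by at least one JND of the smaller."""
    base = min(a, b)
    return abs(a - b) >= just_noticeable_difference(base, weber_fraction)


K = 0.08  # assumed Weber fraction, chosen to match the coin example

print(can_discriminate(10, 11, K))  # 1 coin  vs JND of 0.8 coins
print(can_discriminate(20, 21, K))  # 1 coin  vs JND of 1.6 coins
print(can_discriminate(20, 22, K))  # 2 coins vs JND of 1.6 coins
```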

Weber’s law - cont.

This makes a lot of sense

Bill Gates

Psychophysics - Fechner’s law

Weber’s law is not a true psychophysical law

it relates stimulus threshold to stimulus (both physical entities)

not internal representation (feelings) to physical entity

Gustav Theodor Fechner, student of Weber (medicine, physics, philosophy)

Simplest assumption: the JND is a single internal unit

Using Weber’s law we find:

$Y = A \log I + B$

Fechner Day (October 22, 1850)
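The step from Weber’s law to the logarithm can be spelled out as follows. This is a sketch of the usual argument, treating the JND as a small increment; with the natural logarithm the constant comes out as $A = 1/K$, and $B$ is a constant of integration.

```latex
\begin{align*}
\Delta I &= K I
  && \text{(Weber's law: one JND at intensity } I\text{)} \\
\Delta Y &= 1
  && \text{(assumption: one JND is one internal unit)} \\
\frac{dY}{dI} &\approx \frac{\Delta Y}{\Delta I} = \frac{1}{K I} \\
Y &= \frac{1}{K} \ln I + B
  && \text{(integrating over } I\text{)}
\end{align*}
```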

Fechner’s law - cont.

Log is very compressive

Fechner’s law explains the fantastic ranges of our senses

Sight: single photon to direct sunlight, $10^{15}$

Hearing: eardrum moving by 1 H atom to jet plane, $10^{12}$

Bel: defined to be $\log_{10}$ of a power ratio

decibel (dB): one tenth of a Bel

$d(\mathrm{dB}) = 10 \log_{10}(P_1 / P_2)$
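As a quick illustration of the decibel formula, the two ranges quoted above convert as follows. The function names are chosen here for illustration.

```python
import math


def power_ratio_to_db(p1, p2):
    """d(dB) = 10 * log10(P1 / P2)."""
    return 10.0 * math.log10(p1 / p2)


def db_to_power_ratio(db):
    """Inverse mapping: P1 / P2 = 10 ** (dB / 10)."""
    return 10.0 ** (db / 10.0)


sight_range_db = power_ratio_to_db(1e15, 1.0)    # single photon to sunlight
hearing_range_db = power_ratio_to_db(1e12, 1.0)  # threshold to jet plane
print(sight_range_db, hearing_range_db)
```

A ratio of $10^{12}$ therefore corresponds to about 120 dB, and $10^{15}$ to about 150 dB.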

Fechner’s law - sound amplitudes

Companding

adaptation of the logarithm to positive/negative signals

μ-law and A-law are piecewise linear approximations

Equivalent to linear sampling at 12-14 bits

(8 bit linear sampling is significantly more noisy)
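The effect of companding on quiet signals can be sketched with the continuous μ-law curve. Note the assumption: this uses the smooth logarithmic formula with μ = 255, not the piecewise linear approximation mentioned above, and the uniform 8-bit quantizer is a simplification introduced for this example.

```python
import math

MU = 255.0  # usual mu-law parameter (assumed here)


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with a sign-preserving logarithm."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(v, bits=8):
    """Uniform quantizer on [-1, 1] (simplified, for illustration)."""
    steps = 2 ** (bits - 1) - 1
    return round(v * steps) / steps


def linear_8bit(x):
    return quantize(x)


def companded_8bit(x):
    return mu_law_expand(quantize(mu_law_compress(x)))


# Total quantization error over a set of quiet samples (0.001 ... 0.020).
quiet = [0.001 * k for k in range(1, 21)]
linear_error = sum(abs(x - linear_8bit(x)) for x in quiet)
companded_error = sum(abs(x - companded_8bit(x)) for x in quiet)
print(linear_error, companded_error)
```

For these low-amplitude samples the companded 8-bit path gives a much smaller total error than plain 8-bit linear quantization, at the cost of coarser steps near full scale.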

Fechner’s law - sound frequencies

octaves, well-tempered scale (semitone ratio $\sqrt[12]{2}$)

Critical bands

Frequency warping

Mel (from "melody"): 1 kHz = 1000, JND afterwards; $M \approx 1000 \log_2(1 + f_{\mathrm{kHz}})$

Bark (after Barkhausen): tones can be simultaneously heard if they excite different basilar membrane regions; $B \approx 25 + 75\,(1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$
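The two warping formulas above are easy to evaluate. In this sketch the frequency argument is taken in Hz and converted to kHz internally, and B is interpreted as a critical bandwidth in Hz; both conventions and the function names are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Mel scale approximation: M ~ 1000 * log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)


def critical_bandwidth(f_hz):
    """Critical band approximation: B ~ 25 + 75 * (1 + 1.4 f_kHz^2)^0.69."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


for f in (100.0, 1000.0, 4000.0):
    print(f, hz_to_mel(f), critical_bandwidth(f))
```

By construction 1 kHz maps to 1000 mel, and the critical bandwidth grows from about 100 Hz at low frequencies to several hundred Hz at the top of the telephone band.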

Psychophysics - changes

Our senses respond to changes

[Figure: block diagram, labels: Inverse, E, Filter]

Psychophysics - masking

Masking: strong tones block weaker ones at nearby frequencies

narrowband noise blocks tones (up to critical band)

[Figure: masking illustrated along the frequency axis f]

Voice DSP - Part 1c

Speech quality measurement

Why does it sound the way it sounds?

PSTN

  • BW = 0.2-3.8 kHz, SNR > 30 dB
  • PCM, ADPCM (BER $10^{-3}$)
  • five nines reliability
  • line echo cancellation

Voice over packet network

  • speech compression
  • delay, delay variation, jitter
  • packet loss/corruption/priority
  • echo cancellation
Subjective Voice Quality

Old Measures

  • 5/9
  • DRT
  • DAM

The modern scale

  • MOS
  • DMOS

meet neat seat feet Pete beat heat

MOS according to ITU

P.800 Subjective Determination of Transmission Quality

Annex B: Absolute Category Rating (ACR)

Score   Listening Quality   Listening Effort
5       excellent           relaxed
4       good                attention needed
3       fair                moderate effort
2       poor                considerable effort
1       bad                 no meaning with feasible effort

MOS according to ITU (cont)

Annex D: Degradation Category Rating (DCR)

Annex E: Comparison Category Rating (CCR)

  • ACR is not good at high-quality speech

Score   DCR                 CCR
5       inaudible
4       not annoying
3       slightly annoying   much better
2       annoying            better
1       very annoying       slightly better
0                           the same
-1                          slightly worse
-2                          worse
-3                          much worse

Some MOS numbers

Effect of Speech Compression (from ITU-T Study Group 15):

System                                      MOS
Quiet room, 48 kHz 16 bit linear sampling   5.0
PCM (A-law/μ-law) 64 kb/s                   4.1
G.723.1 @ 6.3 kb/s                          3.9
G.729 @ 8 kb/s                              3.9
ADPCM G.726 32 kb/s                         3.8 (toll quality)
GSM @ 13 kb/s                               3.6
VSELP IS54 @ 8 kb/s                         3.4
The Problem(s) with MOS

Accurate MOS tests are the only reliable benchmark

BUT

  • MOS tests are off-line
  • MOS tests are slow
  • MOS tests are expensive
  • Different labs give consistently different results
  • Most MOS tests only check one aspect of system
The Problem(s) with SNR

Naive question: Isn’t CCR the same as SNR?

SNR does not correlate well with subjective criteria

Squared difference is not an accurate comparator

  • Gain
  • Delay
  • Phase
  • Nonlinear processing
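A small experiment illustrates why squared difference is a poor comparator for gain and delay. In this sketch a 1 kHz tone is compared with a copy delayed by half a millisecond and with a copy at twice the amplitude; both would sound like the same tone, yet the SNR is very low. The tone, the 8 kHz sampling rate and the function name are assumptions made for the example.

```python
import math


def snr_db(reference, test):
    """SNR in dB, treating (reference - test) as the noise."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - t) * (r - t) for r, t in zip(reference, test))
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


fs = 8000  # assumed sampling rate in Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(800)]

delayed = [0.0] * 4 + tone[:-4]   # 4 samples = 0.5 ms = half a period at 1 kHz
louder = [2.0 * x for x in tone]  # pure gain change

snr_delayed = snr_db(tone, delayed)
snr_louder = snr_db(tone, louder)
print(snr_delayed, snr_louder)
```

The half-period delay inverts the waveform, so the "noise" has about four times the signal power (roughly -6 dB), and doubling the gain gives 0 dB, even though neither change would be heard as a distortion.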
Speech distance measures

Many objective measures have been proposed:

  • Segmental SNR
  • Itakura-Saito distance
  • Euclidean distance in Cepstrum space
  • Bark spectral distortion
  • Coherence Function

None correlate well with MOS

ITU target - find a quality-measure that does correlate well
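As an example of the first measure in the list, segmental SNR averages per-frame SNR values in dB instead of computing one global ratio. The sketch below is one common formulation, not a definition taken from the slides: the 160-sample frame length and the clamping of each frame to the range -10 to 35 dB are assumptions.

```python
import math


def segmental_snr_db(reference, test, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs (dB), each clamped to [floor_db, ceil_db]."""
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        tst = test[start:start + frame_len]
        signal_power = sum(r * r for r in ref)
        noise_power = sum((r - t) * (r - t) for r, t in zip(ref, tst))
        if signal_power == 0.0:
            continue  # skip silent frames
        if noise_power == 0.0:
            snr = ceil_db
        else:
            snr = 10.0 * math.log10(signal_power / noise_power)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no non-silent frames to score")
    return sum(frame_snrs) / len(frame_snrs)


# One loud frame and one quiet frame, with the same small error on both.
reference = [1.0] * 160 + [0.01] * 160
degraded = [r + 0.01 for r in reference]
seg_snr = segmental_snr_db(reference, degraded)
print(seg_snr)
```

Here the loud frame scores the 35 dB ceiling and the quiet frame about 0 dB, so the segmental value is about 17.5 dB, whereas a single global SNR would be dominated by the loud frame.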

Some objective methods

Perceptual Speech Quality Measurement (PSQM): ITU-T P.861

Perceptual Analysis Measurement System (PAMS): BT proprietary technique

Perceptual Evaluation of Speech Quality (PESQ): ITU-T P.862

Objective Measurement of Perceived Audio Quality (PEAQ): ITU-R BS.1387

Objective Quality Strategy

[Figure: block diagram, labels: speech, channel, QM, QM to MOS, MOS estimate, MOS]

PSQM philosophy (from P.861)

[Figure: block diagram; two parallel paths, each a perceptual model producing an internal representation, feed an audible difference stage followed by a cognitive model]

PSQM philosophy (cont)

Perceptual Modelling (Internal representation)

  • Short time Fourier transform
  • Frequency warping (telephone-band filtering, Hoth noise)
  • Intensity warping

Cognitive Modelling

  • Loudness scaling
  • Internal cognitive noise
  • Asymmetry
  • Silent interval processing

PSQM Values

  • 0 (no degradation) to 6.5 (maximum degradation)

Conversion to MOS

  • PSQM to MOS calibration using known references
  • Equivalent Q values
Problems with PSQM

Designed for telephony grade speech codecs

Doesn’t take network effects into account:

  • filtering
  • variable time delay
  • localized distortions

Draft standard P.862 adds:

  • transfer function equalization
  • time alignment, delay skipping
  • distortion averaging
PESQ philosophy (from P.862)

[Figure: block diagram, labels: perceptual model and internal representation (one pair per signal path), time alignment, audible difference, cognitive model]