Multimedia Communications (371)
Speech and Image Communications (348)
John Mason
Engineering
Swansea University
EG348_371_09
Features in speech
[Figure: acquisition followed by feature extraction along the time axis, producing feature vectors X1 ... Xi ... (frame: 20/30 ms, sampling frequency: 8 kHz)]
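The frame and sampling figures quoted for feature extraction (20/30 ms frames at 8 kHz) fix the number of samples per frame. The following is a minimal sketch of that arithmetic, assuming non-overlapping frames; the function name `frame_signal` and the all-zero test signal are illustrative, not part of the course material.

```python
def frame_signal(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a list of samples into consecutive, non-overlapping frames."""
    # 8000 samples/s * 20 ms / 1000 = 160 samples per frame
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len  # any incomplete final frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence stands in for acquired speech (an assumption for illustration).
signal = [0.0] * 8000
frames = frame_signal(signal)
print(len(frames), len(frames[0]))
```

Each frame would then be passed to feature extraction to give one vector Xi per frame. Real systems often overlap frames and apply a window, which this sketch leaves out.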
LPC Short and Long
[Figure: the lungs, vocal folds and vocal tract produce speech; in the model, noise drives H1(z) and H2(z) to give synthesised speech]
The spectral envelope reflects morphological characteristics of the vocal tract.
Features: building of statistical model
[Figure: sequences of items labelled T1 and T2]
Person-to-Person
Person-to-Machine: speech/speaker recognition
Machine-to-Person: speech synthesis
Transmission Path
[Figure: acoustic air path, electronic link, acoustic air path]
Speech Comms: Telephony
Microphone -> ADC -> Analysis -> Coding -> Transmitter -> Receiver -> Decoding -> (re)Synthesis -> DAC -> Loudspeaker
Speech Bit Rates
[Figure: the speech chain, from message creation, language coding and acoustic generation, through transmission in acoustic space, to human hearing, extraction, language decoding and message realisation; the stages are annotated with approximate bit rates in bps of tens, hundreds, thousands and tens of thousands]
Criteria in Speech Comms.: Quality versus Bitrate
[Figure: quality (Poor, Fair, Good) against bit rate (4, 8, 16, 32, 64 kbps) for CELP, GSM and ADPCM]
4 Quality Measures: intelligibility, loudness, naturalness, ease of listening
The three main application areas are:
Dynamic range: for flexibility and robustness
Time-varying: to convey information
Speech Analysis/Coding
Speech model
[Figure: excitation e_n, voiced (f0) or unvoiced, drives the vocal tract filter H(z), which has time-varying parameters, to give speech s_n]
LPC Analysis/Synthesis
[Figure: synthesis: e_n, E(z) through H(z), with impulse response h_n, gives s_n, S(z); analysis: s_n, S(z) through 1/H(z) gives e_n, E(z)]
‘Perfect’ Analysis/Synthesis
Input s_n and output s_n are identical (within arithmetic limits)
[Figure: s_n, S(z) through 1/H(z) gives e_n, E(z), which through H(z) gives s_n again]
Practical Analysis/Synthesis
[Figure: sending side: s_n through 1/H(z) gives e_n; transmission; receiving side: e_n through H(z) gives s_n]
Linear Predictive Coding (LPC)
Principle of linear prediction:
$$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$$
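Error is simply the difference between the actual and predicted values:
$$e_n = s_n - \hat{s}_n$$
[Figure: analysis filter; A’(z) forms the prediction from past samples of s_n, which is subtracted from s_n to give e_n]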
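The prediction and error definitions above can be tried on a few samples. This is a minimal sketch, assuming that samples before the start of the frame are zero; the helper names `predict` and `residual` and the toy signal are illustrative only.

```python
def predict(s, a, n):
    """Predicted value for sample n: sum over i of a_i * s[n-i], i = 1..p.

    a[i-1] holds a_i. Samples before n = 0 are assumed to be zero.
    """
    return sum(a[i - 1] * s[n - i] for i in range(1, len(a) + 1) if n - i >= 0)


def residual(s, a):
    """Error signal e_n = s_n - predicted s_n for every sample of the frame."""
    return [s[n] - predict(s, a, n) for n in range(len(s))]


# A signal that doubles each sample is predicted exactly by p = 1, a_1 = 2,
# so only the first sample (which has no history) leaves an error.
s = [1.0, 2.0, 4.0, 8.0]
print(residual(s, [2.0]))
```

The better the coefficients match the signal, the smaller the residual, which is the property the later derivation of the coefficients exploits.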
Synthesis
[Figure: synthesis filter H(z); e_n is added to the output of A’(z), which is fed from past values of s_n, to give s_n]
Parameters updated at frame rate
NB ‘hat’ of approximation omitted for simplicity
Analysis for Synthesis
[Figure: analysis, s_n, S(z) through 1/H(z), in which the output of A’(z) is subtracted from s_n, gives e_n, E(z); synthesis, e_n through H(z), gives s_n]
Recall:
$$e_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$
where a_i are the p prediction coefficients. The principle behind LPC is to find a set of p coefficients, a_1, a_2, a_3, ..., a_p, which in some sense minimizes the error signal e_n over a frame of speech of N samples. This leads to a set of p coefficients for each frame.
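Derivation of A(z) – (2)
Minimisation of E_n is achieved by setting the p partial derivatives to zero:
$$E_n = \sum_{m} e_m^2, \qquad \frac{\partial E_n}{\partial a_k} = 0, \quad k = 1, \dots, p$$
From which:
$$\sum_{i=1}^{p} a_i \, r_{|i-k|} = r_k, \quad k = 1, \dots, p$$
where:
$$r_k = \sum_{m} s_m s_{m-k}$$
In matrix form:
$$[R][a] = [r]$$
or
$$[a] = [R]^{-1}[r]$$
The matrix [R] is symmetric Toeplitz, which allows numerically efficient inversion techniques, Durbin’s recursion algorithm being one of the most popular.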
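The slide names Durbin's recursion as a popular way to exploit the Toeplitz symmetric structure of [R]. The sketch below is one common textbook form of that recursion, assuming the autocorrelation formulation (r_k is the frame autocorrelation at lag k and the equations to solve are sum_i a_i r_|i-k| = r_k). The function names and the example autocorrelation values are illustrative, not taken from the course.

```python
def autocorrelation(frame, p):
    """r[k] = sum over m of frame[m] * frame[m-k], for lags k = 0..p."""
    n = len(frame)
    return [sum(frame[m] * frame[m - k] for m in range(k, n)) for k in range(p + 1)]


def levinson_durbin(r, p):
    """Solve sum_i a_i * r[|i-k|] = r[k] (k = 1..p) without inverting [R].

    Returns (a, err): a[i-1] holds a_i, err is the remaining error energy.
    """
    a = []
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m, from the order m-1 solution.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


# Autocorrelation of a first-order process with correlation 0.5 (made-up values):
# the second coefficient should come out as zero.
coeffs, energy = levinson_durbin([1.0, 0.5, 0.25], 2)
print(coeffs, energy)
```

The recursion needs on the order of p squared operations per frame rather than the p cubed of a general matrix inversion, which is why it suits frame-rate updates.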
Harmonic Structures and Periodicities
Short term prediction
[Figure: voiced or unvoiced excitation e_n drives the short-term vocal tract filter H(z) to give speech s_n; labels Tp and p]
Harmonic Structures and Periodicities
Long term prediction
[Figure: voiced or unvoiced excitation e_n drives the pitch filter Hlt(z) to give ep_n, which drives the vocal tract filter Hst(z) to give speech s_n; labels Tp and P]
Harmonic Structures and Periodicities
Two structures: short-term (formants) and long-term pitch (excitation)
[Figure: e_n with gain through Hlt(z) gives ep_n, which through Hst(z) gives s_n]
e.g. 20 ms frame = 160 samples @ 8 kHz
Hlt(z): a_i, e.g. p = 3
Hst(z): a_i, e.g. p = 10
NB Representations of these parameters are transmitted
‘Perfect’ Analysis/Synthesis (1)
Input s_n and output s_n are identical (within arithmetic limits)
[Figure: s_n, S(z) through 1/H(z) gives e_n, E(z), which through H(z) gives s_n again]
‘Perfect’ Analysis/Synthesis (2)
$$\frac{1}{H(z)} = 1 - A'(z), \qquad H(z) = \frac{1}{1 - A'(z)}$$
[Figure: s_n through 1 – A’(z) gives e_n; e_n through 1/(1 – A’(z)) gives s_n]
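‘Perfect’ Analysis/Synthesis (3)
[Figure: analysis filter 1 – A’(z); the original speech s_n passes along a chain of z^-1 delays giving s_{n-1} ... s_{n-i} ... s_{n-p}, which are weighted by a_1 ... a_i ... a_p, summed and subtracted from s_n to give the residual e_n]
Note the minus sign: in Matlab it is combined with a_i.
What determines p?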
‘Perfect’ Analysis/Synthesis (4)
[Figure: the analysis filter 1 – A’(z) produces the residual e_n from the original speech s_n; the re-synthesis filter 1/(1 – A’(z)) adds the weighted delayed outputs s_{n-1} ... s_{n-i} ... s_{n-p} to e_n to give s_n]
Note: no minus sign in the re-synthesis filter.
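The claim that input and output are identical within arithmetic limits can be checked numerically. This is a minimal sketch, assuming zero initial filter state and using arbitrary illustrative sample values; `analyse` and `synthesise` are hypothetical names for the filters 1 – A’(z) and 1/(1 – A’(z)).

```python
def analyse(s, a):
    """Analysis filter 1 - A'(z): e[n] = s[n] - sum_i a_i * s[n-i]."""
    p = len(a)
    return [
        s[n] - sum(a[i - 1] * s[n - i] for i in range(1, p + 1) if n - i >= 0)
        for n in range(len(s))
    ]


def synthesise(e, a):
    """Synthesis filter 1/(1 - A'(z)): s[n] = e[n] + sum_i a_i * s[n-i] (no minus)."""
    p = len(a)
    s = []
    for n in range(len(e)):
        feedback = sum(a[i - 1] * s[n - i] for i in range(1, p + 1) if n - i >= 0)
        s.append(e[n] + feedback)
    return s


speech = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0]   # arbitrary samples, not real speech
coeffs = [1.5537, -0.8276]                   # a_1, a_2 for a second-order predictor
rebuilt = synthesise(analyse(speech, coeffs), coeffs)
print(max(abs(x - y) for x, y in zip(speech, rebuilt)))
```

The round trip is exact for any coefficient set because the synthesis filter is the algebraic inverse of the analysis filter. A practical system loses this property only because the residual and coefficients are quantised before transmission.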
Practical System
Input s_n and output s_n are “similar”
[Figure: s_n through 1/H(z) gives e_n; a transmitted data frame is sent; e_n through H(z) gives s_n]
What does the Transmitted Data Frame contain?
LPAS Encoder
Integrated encoder & decoder at the encoder
[Figure: the output of the basic decoder is subtracted from s_n and the weighted error drives the adaptive encoder]
In Comms, computation is expensive and parameter vector approximations to D are used
Standard   Application         Coder      Bit rate (kb/s)
GSM        European Cellular   RPE-LTP    13
FS1016     Secure Voice        CELP       4.8
IS54       NA Cellular         VSELP      7.95
IS96       NA Cellular         QCELP      1-8
JDC-FR     Japanese Cellular   VSELP      6.7
JDC-HR     Japanese Cellular   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16
[Figure: decoder; a codebook (CB) index and gain select the excitation e_n, which passes through Hlt(z) (long-term coefficients, pitch) and Hst(z) (short-term coefficients, formants) to give s_n]
Excitation is represented by address, i.e. the CB index.
[Figure: encoder; the same basic decoder (CB index, gain, Hlt(z) with long-term pitch coefficients, Hst(z) with short-term formant coefficients) is embedded in the encoder, its output is subtracted from s_n and the weighted error drives the adaptive encoder]
Conversion of LPC Parameters: LSFs
$$\text{LSF} = \omega_s \cdot \theta / 2\pi$$
[Figure: z-plane with axes x and jy, marking \omega_s]
Consider one pair of complex roots, A_1(z):
$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$$
$$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}\,(1 + a_1 z + a_2 z^{2}) = \left(z^2 + (a_1 + a_2 - 1)z + 1\right)(z + 1)\,z^{-3}$$
$$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}\,(1 + a_1 z + a_2 z^{2}) = \left(z^2 + (a_1 - a_2 + 1)z + 1\right)(z - 1)\,z^{-3}$$
The roots at z = -1 and z = +1 (angles \pi and 0) are discarded.
It follows that the LSFs, \theta_1 and \theta_2, are given by:
$$\cos(\theta_1) = -(a_1 + a_2 - 1)/2$$
and
$$\cos(\theta_2) = -(a_1 - a_2 + 1)/2$$
Show:
$$a_1 = -(\cos(\theta_1) + \cos(\theta_2))$$
and
$$a_2 = \cos(\theta_2) - \cos(\theta_1) + 1$$
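The second-order relations between the LPC coefficients and the two LSFs can be checked numerically in both directions. This is a minimal sketch, assuming A(z) = 1 + a1 z^-1 + a2 z^-2 with roots inside the unit circle so that both cosines lie in [-1, 1]; the function names are illustrative.

```python
import math


def lpc2_to_lsf(a1, a2):
    """LSF angles (radians) of a second-order A(z) = 1 + a1 z^-1 + a2 z^-2."""
    theta1 = math.acos(-(a1 + a2 - 1.0) / 2.0)   # from the quadratic factor of P(z)
    theta2 = math.acos(-(a1 - a2 + 1.0) / 2.0)   # from the quadratic factor of Q(z)
    return theta1, theta2


def lsf_to_lpc2(theta1, theta2):
    """Inverse mapping: recover a1 and a2 from the two LSF angles."""
    a1 = -(math.cos(theta1) + math.cos(theta2))
    a2 = math.cos(theta2) - math.cos(theta1) + 1.0
    return a1, a2


# Coefficients a1 = -1.8, a2 = 0.9 (the values of the worked example in these slides).
t1, t2 = lpc2_to_lsf(-1.8, 0.9)
print(round(t1, 3), round(t2, 3))
print(lsf_to_lpc2(t1, t2))
```

The two angles come out ordered, with the P(z) angle below the Q(z) angle, which is the interlacing property that makes LSFs convenient to quantise.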
$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2)\,z^{-2} = (z^2 + 2\cos(\phi)\,\omega_n z + \omega_n^2)\,z^{-2}$$
where \omega_n is the radius and \phi is the angle from the negative real axis. So: radius = \sqrt{a_2} and \cos(\phi) = a_1 / (2\sqrt{a_2}).
Note: in P and Q all \omega_n^2 terms (of the multiple 2nd orders) are unity.
EG 1: a_2 = 1, then
$$\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$$
The roots are already on the circle and do not move (unstable system, not practical).
EG 2: a_1 = 0, then
$$\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$$
$$\cos(\theta_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$$
so the LSFs are symmetric about \omega_s/4.
LSFs/LSPs are defined as:
$$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$$
and
$$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$$
thus
$$A(z) = \{P(z) + Q(z)\} / 2$$
For a second order A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}:
$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\,z^{-3} = \left(z^2 + (a_1 + a_2 - 1)z + 1\right)(z + 1)\,z^{-3}$$
$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})\,z^{-3} = \left(z^2 + (a_1 - a_2 + 1)z + 1\right)(z - 1)\,z^{-3}$$
cf: $(s^2 + (2\cos(\phi)\,\omega_n)\,s + \omega_n^2)$
LSF Review & Example (3)
[Figure: roots of P(z) and Q(z), marking the angles \theta_1 and \theta_2]
For a second order A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}:
$$P(z) = \left(z^2 + (a_1 + a_2 - 1)z + 1\right)(z + 1)\,z^{-3}$$
$$Q(z) = \left(z^2 + (a_1 - a_2 + 1)z + 1\right)(z - 1)\,z^{-3}$$
cf: $(s^2 + (2\cos(\phi)\,\omega_n)\,s + \omega_n^2)$
Thus:
$$(a_1 + a_2 - 1) = 2\cos(\phi_1) = -2\cos(\theta_1)$$
and
$$(a_1 - a_2 + 1) = -2\cos(\theta_2)$$
So, given:
i) LPC coeffs. a_1 and a_2, the LSFs \theta_1 and \theta_2 can be found
ii) LSFs \theta_1 and \theta_2, the LPC coeffs. a_1 and a_2 can be found
For a second order, with P(z) corresponding to the first root and Q(z) to the second root, and so
$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\,z^{-3} = \left(z^2 + (a_1 + a_2 - 1)z + 1\right)(z + 1)\,z^{-3}$$
for the second pair of \theta_i, 1.37 and 1.77,
$$P(z) = \left(z^2 - 2\cos(1.37)\,z + 1\right)(z + 1)\,z^{-3} = \left(z^3 + (1 - 2\cos(1.37))\,z^2 + (1 - 2\cos(1.37))\,z + 1\right)z^{-3}$$
Likewise
$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})\,z^{-3} = \left(z^2 + (a_1 - a_2 + 1)z + 1\right)(z - 1)\,z^{-3}$$
$$Q(z) = \left(z^2 - 2\cos(1.77)\,z + 1\right)(z - 1)\,z^{-3} = \left(z^3 + (-1 - 2\cos(1.77))\,z^2 + (1 + 2\cos(1.77))\,z - 1\right)z^{-3}$$
Then
$$A(z) = \{P(z) + Q(z)\}/2 = \left(z^3 - (\cos(1.37) + \cos(1.77))\,z^2 + (1 - \cos(1.37) + \cos(1.77))\,z\right)z^{-3}$$
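A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}, with a_1 = -1.8 and a_2 = 0.9:
$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\,z^{-3}$$
$$= \left(z^2 + (a_1 + a_2 - 1)z + 1\right)(z + 1)\,z^{-3}$$
$$= \left(z^2 + (-1.8 + 0.9 - 1)z + 1\right)(z + 1)\,z^{-3}$$
$$= \left(z^2 - 1.9\,z + 1\right)(z + 1)\,z^{-3}$$
cf: $(z^2 + (2\cos(\phi)\,\omega_n)\,z + \omega_n^2)$
thus \cos(\phi) = -1.9/2, or \phi = 2.824, and \theta_1 = \pi - \phi = 0.318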
[Figure: vectors of p parameters over time are replaced by an index i (0 ... N-1) into identical codebooks of N = 2^L entries at each end of the link]
Data reduction: (p x B) to L
Codebook Compression
[Figure: A(z) vectors of M elements are mapped to index i of a codebook with N = 2^k entries; e_n drives H(z) to give s_n]
Codebook Compression: CELP
Codebook of time-domain samples
[Figure: e_n, read from the codebook at a start point in segments of y ms, drives H(z) to give s_n]
e_n are time-domain samples (integers)
R samples per second (e.g. 8000 Hz)
Frame rate governs vector size
P = 2^j
Bit rate = j/y bits/ms
NB e_n also includes gain
[Figure: A[z] at time t, a vector with M elements every x ms, is mapped to index i of a codebook]
Vector with M elements, every x ms
Codebook with N = 2^k vectors
Bit rate = k/x bits per ms (not a function of M)
In practice A[z] is converted to LSFs.
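The codebook scheme above sends only the index of the closest stored vector, so the bit rate depends on the codebook size and the frame interval, not on the vector length M. A minimal sketch follows, assuming a squared Euclidean distance and a tiny made-up codebook; both the distance measure and the values are assumptions for illustration.

```python
def nearest_index(vector, codebook):
    """Index i of the codebook vector closest to `vector` (squared Euclidean distance)."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((v - c) ** 2 for v, c in zip(vector, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i


def bit_rate_bits_per_ms(k, x_ms):
    """Bit rate for a k-bit index sent every x ms; independent of the vector length M."""
    return k / x_ms


# Made-up codebook with N = 2^k = 4 vectors (k = 2) of M = 2 elements each.
codebook = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(nearest_index([0.9, 0.2], codebook))   # closest entry is [1.0, 0.0]
print(bit_rate_bits_per_ms(10, 20))          # a 10-bit index every 20 ms
```

Both ends of the link must hold identical codebooks, since the receiver reconstructs the vector purely from the index.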
1) Initialise:
   form a single centroid of all training data, N = 1
2) Repeat
     Split centroids: N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
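The codebook training steps above can be sketched as follows. This is a simplified illustration under stated assumptions: centroids are split by adding and subtracting a small constant `eps`, convergence is taken as the centroids no longer changing, and an empty cluster keeps its old centroid. These choices and all names are assumptions, not the course's prescribed method.

```python
def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(v, centroids):
    """Index of the centroid closest to v (squared Euclidean distance)."""
    dists = [sum((x - c) ** 2 for x, c in zip(v, cen)) for cen in centroids]
    return dists.index(min(dists))


def train_codebook(data, target_n, eps=0.01, max_iter=50):
    """Binary-splitting training: N = 1, then N -> 2N until N >= target_n."""
    centroids = [centroid(data)]                      # 1) single centroid, N = 1
    while len(centroids) < target_n:                  # 2) until N large enough
        # Split: each centroid becomes two slightly perturbed copies.
        centroids = [[c + s for c in cen] for cen in centroids for s in (eps, -eps)]
        for _ in range(max_iter):                     # repeat until convergence
            clusters = [[] for _ in centroids]
            for v in data:
                clusters[nearest(v, centroids)].append(v)
            updated = [centroid(cl) if cl else cen
                       for cl, cen in zip(clusters, centroids)]
            if updated == centroids:
                break
            centroids = updated
    return centroids


# Two well-separated made-up clusters of 2-element training vectors.
training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(sorted(train_codebook(training, 2)))
```

Because N doubles at each split, the final codebook size is a power of two, which matches an index of a whole number of bits.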
Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995
LPC & FFT Spectra
[Figure: waveform against time (ms); LPC and FFT spectra with LSFs, magnitude (dB) against frequency (kHz), 0 to Fs/2]
LPC Roots:
0.6651 ± 0.6695i
0.0560 ± 0.9709i
0.7228 ± 0.6225i
0.8714 ± 0.3694i
0.5758
0.4200
LPC Spectra & LSFs
[Figure: LPC spectrum with LSFs, magnitude (dB) against frequency (kHz), 0 to Fs/2, for the same LPC roots]
A(z): 1.5537, -0.8276
Roots: 0.7769 ± 0.4733i
$$H(0) = \frac{K}{1 - (1.5537 - 0.8276)} = \frac{K}{0.274}$$
$$H(\omega_s/2) = \frac{K}{1 - (-1.5537 - 0.8276)} = \frac{K}{3.38}$$
$$\frac{H(0)}{H(\omega_s/2)} = \frac{K/0.274}{K/3.38} = 21.8\ \text{dB}$$
[Figure: waveform against time (ms) and magnitude (dB) against frequency (kHz), 0 to Fs/2]
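The gain ratio between zero frequency and half the sampling frequency in this example can be reproduced directly. This is a minimal sketch, assuming H(z) = K / (1 - a1 z^-1 - a2 z^-2) with a1 = 1.5537 and a2 = -0.8276, evaluated at z = +1 and z = -1; `h_gain` is an illustrative name.

```python
import math


def h_gain(a, z, K=1.0):
    """|H(z)| for H(z) = K / (1 - sum_i a_i z^-i), evaluated at real z = +1.0 or -1.0."""
    denom = 1.0 - sum(ai * z ** -(i + 1) for i, ai in enumerate(a))
    return abs(K / denom)


a = [1.5537, -0.8276]
gain_dc = h_gain(a, 1.0)        # z = +1, i.e. zero frequency
gain_half = h_gain(a, -1.0)     # z = -1, i.e. half the sampling frequency
ratio_db = 20.0 * math.log10(gain_dc / gain_half)
print(round(1.0 / gain_dc, 3), round(1.0 / gain_half, 2), round(ratio_db, 1))
```

The constant K cancels in the ratio, so the 21.8 dB difference is set by the coefficients alone.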
Spectral Lifting: H(z) = (1 - a z^-1)
Codebook Training
Spectral Differences between 2 frames
Cepstra
Modeling Speech Space: HMMs
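[Figure: z-plane (imaginary axis jy) showing the zero of the lifting filter at a]
$$G(\omega_s/2) = 1 + a$$
$$G(0) = 1 - a$$
For G(\omega_s/2) > G(0), a must be > 0.
With a = 1, the gain at \omega_s/2 is 1 + a = 2.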
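The two gains of the lifting filter can be confirmed by evaluating G(z) = 1 - a z^-1 on the unit circle. This is a minimal sketch, assuming frequency is expressed in radians per sample so that pi corresponds to half the sampling frequency; the function names and the sample values are illustrative.

```python
import cmath
import math


def lifting_gain(a, omega):
    """|G| for G(z) = 1 - a z^-1 at z = exp(j*omega); omega = pi is half the sampling rate."""
    return abs(1.0 - a * cmath.exp(-1j * omega))


def spectral_lift(s, a):
    """Apply the filter in the time domain: y[n] = s[n] - a * s[n-1] (s[-1] taken as 0)."""
    return [s[n] - a * (s[n - 1] if n > 0 else 0.0) for n in range(len(s))]


print(lifting_gain(0.9, 0.0), lifting_gain(0.9, math.pi))   # close to 1 - a and 1 + a
print(lifting_gain(1.0, math.pi))                           # close to 2 when a = 1
print(spectral_lift([1.0, 1.0, 1.0, 1.0], 1.0))             # a constant is removed after the first sample
```

With a positive a the filter boosts high frequencies relative to low ones, which compensates for the falling spectral tilt of voiced speech.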
[Figure: pole-zero plot (imaginary part against real part) and magnitude response (dB) against frequency (kHz), 0 to Fs/2, marking 1 + a = 2 at \omega_s/2]

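ST & LT Prediction
[Figure: speech s_n passes through the short-term predictor (STP), 1 – A’(z), to give the residual e_n, and then through the long-term predictor (LTP), 1 – A’(z), to give e'_n; each predictor is a chain of z^-1 delays with coefficients a_1 ... a_i ... a_p]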