Multimedia Communications (371)
Speech and Image Communications (348)
John Mason
Engineering
Swansea University
EG348_371_09
Features in speech
[Figure: the speech signal over time passes through Acquisition and Feature extraction to give feature vectors X1 … Xi …]
(frame: 20/30 ms; sampling frequency: 8 kHz)
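The frame and sampling figures above fix the number of samples per feature vector. A minimal sketch, assuming non-overlapping frames (the slides do not say whether frames overlap); the function names are illustrative only:

```python
def frame_length(sample_rate_hz, frame_ms):
    """Number of samples in one analysis frame."""
    return int(round(sample_rate_hz * frame_ms / 1000.0))


def split_into_frames(samples, frame_len):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


if __name__ == "__main__":
    n = frame_length(8000, 20)            # 20 ms at 8 kHz
    one_second = list(range(8000))        # stand-in for one second of speech
    frames = split_into_frames(one_second, n)
    print(n, len(frames))
```

With these settings a 20 ms frame holds 160 samples and a 30 ms frame holds 240.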
[Figure: Air from the lungs → Vocal fold → Vocal tract → Speech; modelled as noise → H1(z) → H2(z) → synthesised speech]
The spectral envelope reflects morphological characteristics of the vocal tract.
Features: building of statistical model
[Figure: items labelled T1 and T2]
Person-to-Person
Person-to-Machine: speech/speaker recognition
Machine-to-Person: speech synthesis
perhaps separated by long distance
(or in time)
[Figure: Acoustic Air Path → Electronic Link (Channel Transmission Path) → Acoustic Air Path]
Channel Transmission Path (Electronic Link):
Microphone → ADC → Analysis → Coding → Transmitter → Receiver → Decoding → (re)Synthesis → DAC → Loudspeaker
[Figure: human speech chain with stages Message Creation, Language Coding, Human Acoustic generation, Transmission (Acoustic Space), Human Hearing, Extraction, Language decoding, Message Realisation. Approx. bit rate in bps marked along the chain: tens, hundreds, thousands, tens of thousands.]
Quality versus Bitrate
[Figure: Quality (Poor, Fair, Good, Excellent) against bit rate (4, 8, 16, 32, 64 kbps) for CELP, GSM and ADPCM]
4 Quality Measures: intelligibility, loudness, naturalness, ease of listening
The three main application areas are:
Dynamic range: for flexibility and robustness
Time-varying: to convey information
[Figure: Excitation (voiced or unvoiced) $e_n$ → $H(z)$ → speech $s_n$]
Speech Model
[Figure: Excitation (Voiced, with f0, or Unvoiced) → Vocal Tract → Speech; the blocks have Time Varying Parameters]
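$H(z)$, with impulse response $h_n$, links excitation and speech:
$$S(z) = H(z)\,E(z) \qquad (e_n \to H(z) \to s_n)$$
$$E(z) = S(z)/H(z) \qquad (s_n \to 1/H(z) \to e_n)$$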
[Figure: $s_n$ → $1/H(z)$ → $e_n$ → $H(z)$ → $s_n$, i.e. $E(z) = S(z)/H(z)$ followed by $S(z) = H(z)\,E(z)$]
Input $s_n$ and output $s_n$ are identical (within arithmetic limits)
Practical Analysis/Synthesis
[Figure: Sending: $s_n$ → $1/H(z)$ → $e_n$; Transmission; Receiving: $e_n$ → $H(z)$ → $s_n$]
Linear Predictive Coding (LPC)
Principle of linear prediction:
$$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$$
Transforming to the z-domain gives:
$$\hat{S}(z) = \sum_{i=1}^{p} a_i z^{-i}\,S(z) = A'(z)\,S(z)$$
The error is simply the difference between the actual and predicted values:
$$e_n = s_n - \hat{s}_n$$
[Figure: $\hat{s}_n$, the output of $A'(z)$, is subtracted from $s_n$ to give $e_n$]
[Figure: synthesis filter $H(z)$: $e_n$ is added to the output of $A'(z)$, which is driven by past values of $s_n$, to give $s_n$]
Parameters updated at frame rate
NB 'hat' of approximation omitted for simplicity
[Figure: Analysis: $s_n$ → $1/H(z)$ → $e_n$, with the output of $A'(z)$ subtracted from $s_n$; Synthesis: $e_n$ → $H(z)$ → $s_n$]
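The analysis and synthesis filters can be tried out directly in the time domain. A minimal sketch, assuming the coefficients are held in a plain list `a` with `a[0]` standing for a1, and that samples before the start of the signal are zero (both are assumptions of this example, not stated in the slides):

```python
def analysis_filter(s, a):
    """Inverse filter 1 - A'(z): e[n] = s[n] - sum_i a_i * s[n-i]."""
    e = []
    for n in range(len(s)):
        # Prediction from past input samples (zero before the signal starts).
        pred = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        e.append(s[n] - pred)
    return e


def synthesis_filter(e, a):
    """Filter 1 / (1 - A'(z)): s[n] = e[n] + sum_i a_i * s[n-i]."""
    s = []
    for n in range(len(e)):
        # Prediction from past output samples, added (no minus sign here).
        pred = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        s.append(e[n] + pred)
    return s


if __name__ == "__main__":
    speech = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]   # made-up samples
    coeffs = [1.5537, -0.8276]                    # illustrative 2nd-order set
    residual = analysis_filter(speech, coeffs)
    rebuilt = synthesis_filter(residual, coeffs)
    print(max(abs(x - y) for x, y in zip(speech, rebuilt)))
```

When the residual is passed on unquantised, the rebuilt signal matches the input to within floating-point rounding.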
Recall:
$$\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$$
where $a_i$ are the $p$ prediction coefficients. The principle behind LPC is to find a set of $p$ coefficients, $a_1, a_2, a_3, \dots, a_p$, which in some sense minimizes the error signal $e_n$ over a frame of $N$ speech samples. This leads to a set of $p$ coefficients for each frame.
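Minimisation of $E_n$, the squared error summed over the frame, is achieved by setting the $p$ partial derivatives to zero:
$$E_n = \sum_{m} e_m^2 = \sum_{m} \Big( s_m - \sum_{k=1}^{p} a_k s_{m-k} \Big)^2, \qquad \frac{\partial E_n}{\partial a_i} = 0 \quad \text{for } i = 1, 2, \dots, p$$
From which:
$$\sum_{k=1}^{p} a_k R(|i-k|) = R(i)$$
where:
$$R(i) = \sum_{m} s_m s_{m-i}$$
In matrix form:
$$[R]\,\mathbf{a} = \mathbf{r}$$
or
$$\mathbf{a} = [R]^{-1}\,\mathbf{r}$$
The matrix $[R]$ is symmetric Toeplitz, which permits numerically efficient inversion techniques, Durbin's recursion algorithm being one of the most popular.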
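Durbin's recursion can be sketched in a few lines. This is an illustrative version, assuming the autocorrelation method (`r[i]` is the frame autocorrelation at lag i) and the sign convention in which the predictor is the sum of `a_i * s[n-i]`; the function names are made up for the example:

```python
def autocorrelation(frame, p):
    """Autocorrelation of one frame at lags 0..p."""
    n = len(frame)
    return [sum(frame[m] * frame[m - i] for m in range(i, n)) for i in range(p + 1)]


def levinson_durbin(r, p):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..p, without a matrix inverse.

    Returns (a, err): a[0] is a_1, err is the final prediction error energy.
    Assumes r[0] > 0 (a silent frame would divide by zero).
    """
    a = [0.0] * p
    err = r[0]
    for i in range(p):
        # Reflection coefficient for predictor order i + 1.
        acc = r[i + 1] - sum(a[k] * r[i - k] for k in range(i))
        k_i = acc / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(i):
            new_a[k] = a[k] - k_i * a[i - 1 - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a, err


if __name__ == "__main__":
    # Autocorrelation of a first-order process: R(k) = 0.5 ** k.
    coeffs, energy = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(coeffs, energy)
```

For the first-order autocorrelation in the usage example, the second coefficient comes out as zero, since one tap already predicts the signal as well as two.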
Harmonic Structures and Periodicities
[Figure: Short term prediction: excitation (voiced or unvoiced) $e_n$ → $H(z)$ (vocal tract, short term) → speech $s_n$; pitch period $T_p$ and predictor order $p$ marked]
[Figure: Long term prediction: excitation (voiced or unvoiced) $e_n$ → $H_{lt}(z)$ (pitch) → $ep_n$ → $H_{st}(z)$ (vocal tract) → speech $s_n$; $T_p$ and $P$ marked]
Two structures: short-term (formants) and long-term (pitch, excitation)
[Figure: $e_n$ → Gain $k$ → $H_{lt}(z)$ → $ep_n$ → $H_{st}(z)$ → $s_n$]
e.g. 20 ms frame: 160 samples @ 8 kHz
$H_{lt}(z)$: $a_i$, e.g. $p = 3$
$H_{st}(z)$: $a_i$, e.g. $p = 10$
NB Representations of these parameters are transmitted
[Figure: analysis filter $1/H(z) = 1 - A'(z)$ takes $s_n$ to $e_n$, so $E(z) = S(z)\,(1 - A'(z))$; synthesis filter $H(z) = 1/(1 - A'(z))$ takes $e_n$ to $s_n$, so $S(z) = E(z)/(1 - A'(z))$]
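[Figure: analysis filter $1 - A'(z)$ as a tapped delay line: Original Speech $s_n$ passes through delays $z^{-1}$ to give $s_{n-1}, \dots, s_{n-i}, \dots, s_{n-p}$, which are weighted by $a_1, \dots, a_i, \dots, a_p$ and subtracted from $s_n$ to give the Residual $e_n$]
Note the minus sign: in Matlab it is combined with $a_i$.
What determines $p$?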
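[Figure: analysis filter $1 - A'(z)$ (Original Speech $s_n$ → Residual $e_n$) followed by synthesis filter $1/(1 - A'(z))$ (Residual $e_n$ → Re-Synth. $s_n$); both are tapped delay lines ($z^{-1}$) with weights $a_1, \dots, a_i, \dots, a_p$]
Note: no minus sign in the synthesis filter.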
[Figure: $s_n$ → $1/H(z)$ → $e_n$ → Transmitted Data Frame → $e_n$ → $H(z)$ → $s_n$]
Input $s_n$ and output $s_n$ are "similar"
What does the Transmitted Data Frame contain?
LPAS Encoder
Integrated encoder & decoder at the encoder
[Figure: an Adaptive encoder drives a Basic decoder; the decoder output is subtracted from $s_n$ and the Weighted error is fed back to the Adaptive encoder]
In Comms, computation is expensive, so parameter-vector approximations to D are used
Standard   Application         Coder      Bit rate (kb/s)
GSM        European Cellular   RPE-LTP    13
FS1016     Secure Voice        CELP       4.8
IS54       NA Cellular         VSELP      7.95
IS96       "                   QCELP      1 to 8
JDC-FR     Japanese Cellular   VSELP      6.7
JDC-HR     "                   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16
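A bit rate translates directly into a bit budget per frame. The sketch below uses the single-rate entries listed above and assumes the 20 ms frame used as an example elsewhere in these slides; the frame lengths actually used by each standard are not given here, so the results are only illustrative:

```python
# Bit rates in kb/s taken from the list above. IS96 is left out of this
# sketch because it does not have a single rate.
BIT_RATES_KBPS = {
    "GSM": 13.0,
    "FS1016": 4.8,
    "IS54": 7.95,
    "JDC-FR": 6.7,
    "JDC-HR": 3.67,
    "G.728": 16.0,
}


def bits_per_frame(rate_kbps, frame_ms=20.0):
    """kb/s times ms gives bits: (1000 bit/s) * (1/1000 s) = 1 bit."""
    return rate_kbps * frame_ms


if __name__ == "__main__":
    for name, rate in BIT_RATES_KBPS.items():
        print(f"{name}: {bits_per_frame(rate):.1f} bits per 20 ms frame")
```

At 13 kb/s, for example, a 20 ms frame has 260 bits to share between the short-term coefficients, the long-term parameters and the excitation.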
[Figure: decoder: CB Index → codebook → Gain → $e_n$ → $H_{lt}(z)$ (long-term coefficients, pitch) → $H_{st}(z)$ (short-term coefficients, formants) → $s_n$]
Excitation is represented by address, i.e. CB Index
[Figure: the same codebook-excited decoder (CB Index, Gain, $H_{lt}(z)$, $H_{st}(z)$) placed inside the encoder: the Basic decoder output is subtracted from $s_n$ and the Weighted error drives the Adaptive encoder]
LSF = $\omega_s \cdot \theta / 2\pi$
[Figure: z-plane (real axis x, imaginary axis jy) showing the LSFs on the unit circle, with $\omega_s$ marked]
Consider one pair of complex roots, $A_1(z)$:
$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$$
$$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^{2}) = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)\,z^{-3}$$
$$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^{2}) = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)\,z^{-3}$$
The roots at $z = -1$ (in $P_1$) and $z = +1$ (in $Q_1$) are discarded.
It follows that the LSFs, $\omega_1$ and $\omega_2$, are given by:
$$\cos(\omega_1) = -(a_1 + a_2 - 1)/2$$
and
$$\cos(\omega_2) = -(a_1 - a_2 + 1)/2$$
Show:
$$a_1 = -(\cos(\omega_1) + \cos(\omega_2))$$
and
$$a_2 = \cos(\omega_2) - \cos(\omega_1) + 1$$
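$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2)\,z^{-2} = (z^2 + 2\cos(\theta)\,w_n z + w_n^2)\,z^{-2}$$
where $w_n$ is the radius and $\theta$ is the angle from $\pi$. So: radius $= \sqrt{a_2}$ and $\omega = \pi - \theta$.
Note: in P & Q all $w_n^2$ terms (of the multiple 2nd orders) are unity.
EG 1: $a_2 = 1$, then $\cos(\omega_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$; the roots are already on the circle and do not move (unstable system, not practical).
EG 2: $a_1 = 0$, then $\cos(\omega_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$ and $\cos(\omega_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$, so $\cos(\omega_2) = -\cos(\omega_1)$ and the LSFs are symmetric about $\pi/2$ (a quarter of the sampling frequency).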
LSF Review & Example (1)
LSFs/LSPs are defined as:
$$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$$
and
$$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$$
thus
$$A(z) = \{P(z) + Q(z)\} / 2$$
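The definitions of P(z) and Q(z) amount to adding and subtracting a reversed, shifted copy of the coefficient list. A minimal sketch, assuming A(z) is stored as `[1, a1, ..., an]` in increasing powers of z^-1 (the function name is made up for the example):

```python
def lsp_polynomials(a):
    """Build P(z) and Q(z) from A(z) = a[0] + a[1] z^-1 + ... + a[n] z^-n.

    Both results have n + 2 coefficients, in powers z^0 ... z^-(n+1).
    """
    padded = list(a) + [0.0]                # A(z), extended to z^-(n+1)
    mirrored = [0.0] + list(reversed(a))    # z^-(n+1) * A(z^-1)
    p = [x + y for x, y in zip(padded, mirrored)]
    q = [x - y for x, y in zip(padded, mirrored)]
    return p, q


if __name__ == "__main__":
    a = [1.0, -1.8, 0.9]                    # illustrative second-order A(z)
    p, q = lsp_polynomials(a)
    print(p)   # symmetric coefficients
    print(q)   # antisymmetric coefficients
    print([(x + y) / 2 for x, y in zip(p, q)])   # A(z) again, plus a zero
```

The symmetric and antisymmetric coefficient patterns are what place the roots of P(z) and Q(z) on the unit circle when A(z) has all its roots inside it.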
LSF Review & Example (2)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\,z^{-3} = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)\,z^{-3}$$
$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})\,z^{-3} = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)\,z^{-3}$$
cf: $(s^2 + (2\cos(\theta)\,w_n)\,s + w_n^2)$
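LSF Review & Example (3)
[Figure: z-plane roots of P(z) and Q(z), with the angles $\omega_1$ and $\omega_2$ marked]
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$$P(z) = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)\,z^{-3}$$
$$Q(z) = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)\,z^{-3}$$
cf: $(s^2 + (2\cos(\theta)\,w_n)\,s + w_n^2)$
Thus, with $\omega_1 = \pi - \theta_1$:
$$(a_1 + a_2 - 1) = 2\cos(\theta_1) = -2\cos(\omega_1)$$
&
$$(a_1 - a_2 + 1) = -2\cos(\omega_2)$$
So, given:
i) LPC coeffs. $a_1$ and $a_2$, the LSFs $\omega_1$ & $\omega_2$ can be found
ii) LSFs $\omega_1$ & $\omega_2$, the LPC coeffs. $a_1$ and $a_2$ can be found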
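Both directions of the second-order conversion follow from the two cosine relations. A small sketch, assuming A(z) = 1 + a1 z^-1 + a2 z^-2 with its roots inside the unit circle (so both cosine arguments lie in [-1, 1]); the function names are illustrative:

```python
import math


def lpc_to_lsf2(a1, a2):
    """LSFs (radians) of a second-order A(z) = 1 + a1 z^-1 + a2 z^-2."""
    w1 = math.acos(-(a1 + a2 - 1.0) / 2.0)   # from the quadratic factor of P(z)
    w2 = math.acos(-(a1 - a2 + 1.0) / 2.0)   # from the quadratic factor of Q(z)
    return w1, w2


def lsf_to_lpc2(w1, w2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(w1) + math.cos(w2))
    a2 = math.cos(w2) - math.cos(w1) + 1.0
    return a1, a2


if __name__ == "__main__":
    w1, w2 = lpc_to_lsf2(-1.8, 0.9)
    print(round(w1, 3), round(w2, 3))
    print(lsf_to_lpc2(w1, w2))               # back to (-1.8, 0.9) within rounding
```

For a1 = -1.8 and a2 = 0.9 the first LSF comes out close to 0.318 rad, and converting back recovers the coefficients.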
LSF Review & Example (4)
For a second order and with P(z) corresponding to the first root, Q(z) to the second root, and so
$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\,z^{-3} = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)\,z^{-3}$$
for the second pair of LSFs, 1.37 and 1.77,
$$P(z) = (z^2 - 2\cos(1.37)\,z + 1)(z + 1)\,z^{-3} = (z^3 + (1 - 2\cos(1.37))\,z^2 + (1 - 2\cos(1.37))\,z + 1)\,z^{-3}$$
Likewise
$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})\,z^{-3} = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)\,z^{-3}$$
$$Q(z) = (z^2 - 2\cos(1.77)\,z + 1)(z - 1)\,z^{-3} = (z^3 + (-1 - 2\cos(1.77))\,z^2 + (1 + 2\cos(1.77))\,z - 1)\,z^{-3}$$
Then
$$A(z) = \{P(z) + Q(z)\}/2 = (z^3 - (\cos(1.37) + \cos(1.77))\,z^2 + (1 - \cos(1.37) + \cos(1.77))\,z)\,z^{-3}$$
LSF Examples
$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$$
$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\,z^{-3}$$
$$= (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)\,z^{-3}$$
$$= (z^2 + (-1.8 + 0.9 - 1)z + 1)(z + 1)\,z^{-3}$$
$$= (z^2 - 1.9z + 1)(z + 1)\,z^{-3}$$
cf: $(z^2 + (2\cos(\theta)\,w_n)\,z + w_n^2)$
thus $\cos(\theta) = -1.9/2$, or $\theta = 2.824$, and $\omega_1 = \pi - \theta = 0.318$
Example Bit Allocation
Codebooks & VQ
[Figure: over time, each vector of $p$ parameters is replaced by a codebook index $i$ ($0 \dots N-1$); encoder and decoder hold an identical book]
$N = 2^L$
Data reduction: $(p \times B)$ to $L$
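[Figure: codebook of $N = 2^k$ vectors, each with $M$ elements; the index $i$ selects the $A(z)$ used by the filter $H(z)$ in $e_n$ → $H(z)$ → $s_n$]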
Codebook of time-domain samples
[Figure: codebook entries of $y$ ms each, selected by start point, supply $e_n$ to $H(z)$ to give $s_n$]
$e_n$ are time-domain samples (integers)
$R$ samples per second (e.g. 8000 Hz)
Frame rate governs vector size
$P = 2^j$
Bit rate = $j/y$ bits/ms
NB $e_n$ also includes gain
[Figure: over time, $A[z]$ at time $t$ is mapped every $x$ ms to a codebook index $i$]
Vector with $M$ elements, every $x$ ms
Codebook with $N = 2^k$ vectors
Bit rate = $k/x$ bits per ms (not a function of $M$)
In practice $A[z]$ is converted to LSFs.
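Encoding a vector means sending the index of its nearest codebook entry, so the bit rate depends only on the index width and the update interval. A minimal sketch, assuming squared Euclidean distance as the match criterion (the slides do not name one) and a tiny made-up codebook:

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((v - c) ** 2 for v, c in zip(vector, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i


def bit_rate_bits_per_ms(k, x_ms):
    """k index bits sent every x ms; the vector length M does not appear."""
    return k / x_ms


if __name__ == "__main__":
    # N = 2**2 = 4 vectors with M = 2 elements each (illustrative values).
    book = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    print(nearest_index([0.9, 0.2], book))
    print(bit_rate_bits_per_ms(10, 20.0))    # 10-bit index every 20 ms
```

The decoder looks the received index up in its identical copy of the book, so only `k` bits cross the channel per vector.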
1) Initialise:
   form a single centroid of all training data, N = 1
2) Repeat
   Split centroids: N → 2N
   Repeat
      Cluster data to nearest centroid
   until convergence
until N large enough
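The splitting procedure above can be sketched as follows. The way a centroid is split (adding and subtracting a small `eps` on every component), the squared Euclidean distance, and the exact-equality convergence test are assumptions of this example; the slides do not specify them:

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: sum((x - c) ** 2 for x, c in zip(point, centroids[i])))


def train_codebook(data, target_size, eps=0.01, max_iter=100):
    """Grow a codebook by repeated splitting; sizes go 1, 2, 4, ..."""
    centroids = [centroid(data)]                      # 1) initialise, N = 1
    while len(centroids) < target_size:               # 2) until N large enough
        # Split every centroid into a perturbed pair: N -> 2N.
        centroids = [[c + sign * eps for c in cen]
                     for cen in centroids for sign in (1, -1)]
        for _ in range(max_iter):                     # cluster until convergence
            clusters = [[] for _ in centroids]
            for p in data:
                clusters[nearest(p, centroids)].append(p)
            # An empty cluster keeps its old centroid.
            updated = [centroid(cl) if cl else cen
                       for cl, cen in zip(clusters, centroids)]
            if updated == centroids:
                break
            centroids = updated
    return centroids


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
    print(sorted(train_codebook(data, 2)))
```

Because each pass doubles the book, the final size is a power of two, which matches an index of a whole number of bits.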
VQ Performance on Unseen Data
Ramachandran & Mammone (eds), 'Modern Methods of Speech Processing', Kluwer Academic, 1995
[Figure: Waveform, amplitude against Time (ms), 0 to 25.6 ms]
[Figure: LPC & FFT Spectra, Magnitude (dB) against Frequency (kHz) (0 to Fs/2), with LSFs marked]
LPC Roots:
0.6651 ± 0.6695i
0.0560 ± 0.9709i
0.7228 ± 0.6225i
0.8714 ± 0.3694i
0.5758
0.4200
[Figure: LPC Spectra & LSFs, Magnitude (dB) against Frequency (kHz) (0 to Fs/2), for the same LPC roots]
LPC & FFT Spectra: 2nd Order
[Figure: waveform against Time (ms), 0 to 25.6 ms, and LPC & FFT spectra, Magnitude (dB) against Frequency (kHz) (0 to Fs/2)]
A(z): 1.5537, -0.8276
Roots: 0.7769 ± 0.4733i
$$H(0) = \frac{K}{1 - (1.5537 - 0.8276)} = \frac{K}{0.274}$$
$$H(\omega_s/2) = \frac{K}{1 - (-1.5537 - 0.8276)} = \frac{K}{3.38}$$
$$\frac{H(0)}{H(\omega_s/2)} = \frac{K/0.274}{K/3.38} = 21.8\ \text{dB}$$
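The two gains in this example can be checked numerically. The sketch assumes H(z) = K / (1 - a1 z^-1 - a2 z^-2) with a1 = 1.5537 and a2 = -0.8276; those signs are an assumption, chosen because they reproduce the roots 0.7769 ± 0.4733i and the denominators 0.274 and 3.38:

```python
import cmath
import math


def magnitude(a, omega, gain=1.0):
    """|H| at z = exp(j*omega) for H(z) = K / (1 - sum_i a_i z^-i)."""
    z = cmath.exp(1j * omega)
    denom = 1.0 - sum(ai * z ** (-(i + 1)) for i, ai in enumerate(a))
    return gain / abs(denom)


if __name__ == "__main__":
    a = [1.5537, -0.8276]                     # assumed signs, see above
    h_dc = magnitude(a, 0.0)                  # omega = 0
    h_nyq = magnitude(a, math.pi)             # omega = omega_s / 2
    print(round(1.0 / h_dc, 3), round(1.0 / h_nyq, 3))
    print(round(20.0 * math.log10(h_dc / h_nyq), 1), "dB")
```

The gain K cancels in the ratio, so the spectral tilt between 0 and half the sampling frequency depends only on the coefficients.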
Other Related Topics
Spectral Lifting: $H(z) = (1 - a z^{-1})$
Codebook Training
Spectral Differences between 2 frames
Cepstra
Modeling Speech Space: HMMs
Pre-Emphasis Example
[Figure Q1: two waveforms, (a) and (b), amplitude -1 to 1 over 30 ms]
[Figure: z-plane (imaginary axis jy) with the zero at $a$; the distance from the zero to the point $\omega_s/2$ is marked $1 + a = 2$]
$G(\omega_s/2) = 1 + a$
$G(0) = 1 - a$
For $G(\omega_s/2) > G(0)$, $a$ must be $> 0$
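The two gains can be seen by filtering a constant and an alternating sequence. A minimal sketch, assuming the pre-emphasis filter G(z) = 1 - a z^-1 and an illustrative value a = 0.95 (the value is not taken from the slides):

```python
def pre_emphasis(x, a):
    """y[n] = x[n] - a * x[n-1], with x[-1] taken as zero."""
    return [x[n] - a * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def gain_dc(a):
    """|G| at omega = 0, where z^-1 = 1."""
    return abs(1.0 - a)


def gain_half_sampling(a):
    """|G| at omega = omega_s / 2, where z^-1 = -1."""
    return abs(1.0 + a)


if __name__ == "__main__":
    a = 0.95                                         # illustrative value
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0], a))     # settles near 1 - a
    print(pre_emphasis([1.0, -1.0, 1.0, -1.0], a))   # swings by 1 + a
    print(gain_dc(a), gain_half_sampling(a))
```

With a positive `a` the constant input is almost cancelled while the alternating input is nearly doubled, which is the high-frequency lift described above.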
Z-plane to Magnitude Spectrum
[Figure: z-plane plot (Real Part against Imaginary Part) and the corresponding spectrum, Magnitude (dB) against Frequency (kHz) (0 to Fs/2), with $1 + a = 2$ marked at $\omega_s/2$]
[Figure: Speech $s_n$ → STP filter $1 - A'(z)$ → Residual $e_n$ → LTP filter $1 - A'(z)$ → $e'_n$; each filter is a tapped delay line ($z^{-1}$) with weights $a_1, \dots, a_i, \dots, a_p$]