
Multimedia Communications (371) / Speech and Image Communications (348)

John Mason

Engineering

Swansea University

EG-348_371_09


Features in speech

[Figure: acquisition of the speech signal over time, followed by feature extraction into a sequence of feature vectors X1 ... Xi ... (frame: 20-30 ms; sampling frequency: 8 kHz)]



Speech production

Air from the lungs -> vocal fold -> vocal tract -> speech


LPC Short and Long

[Figure: speech production (air from the lungs -> vocal fold -> vocal tract -> speech) alongside the synthesis model, in which noise is filtered by H1(z) and H2(z) to give synthesised speech]

The spectral envelope reflects the morphological characteristics of the vocal tract.


Features: building of statistical model

[Figure: sequence of frames labelled T1 and T2]



VT Shape & Some Vowels - Ladefoged ‘62


Speech Processing - Applications

  • Why?

    • Communications

    • Synthesis

    • Recognition

      • Speech & Speaker

  • How?

    • Frame-based

    • Systems approach


Some Books

  • Flanagan - ‘Speech Analysis, Synthesis and Perception’, Springer-Verlag - a classic!

  • Furui - several books on recognition

  • Parsons - ‘Voice and Speech Processing’, McGraw-Hill - one of the first textbooks on computer speech processing

  • O’Shaughnessy - ‘Speech Communication: Human and Machine’, Addison-Wesley

  • Rabiner & Juang - ‘Fundamentals of Speech Recognition’, Prentice Hall, 1993

  • Ramachandran & Mammone (eds) - ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995

Speech Communications

Person-to-Person

Person-to-Machine: speech/speaker recognition

Machine-to-Person: speech synthesis

(Electronic) Speech Communications

Speaker and listener are perhaps separated by a long distance (or in time).

Telephony & Broadcasting

Acoustic air path -> channel transmission path (electronic link) -> acoustic air path

Speech Comms: Telephony

Sending side: Microphone -> ADC -> Analysis -> Coding -> Transmitter

Channel transmission path (electronic link)

Receiving side: Receiver -> Decoding -> (Re-)Synthesis -> DAC -> Loudspeaker

Speech Bit Rates

[Figure: the speech chain. Speaker: message creation -> language coding -> human acoustic generation. Transmission through acoustic space. Listener: human hearing extraction -> language decoding -> message realisation. Approximate bit rates (bps) are marked along the chain: tens, hundreds, thousands, tens of thousands]

Criteria in Speech Comms.

Quality versus Bit-rate

[Figure: speech quality (Poor, Fair, Good, Excellent) plotted against bit rate (4, 8, 16, 32, 64 kbps), with CELP, GSM and ADPCM marked]

4 Quality Measures:

  • intelligibility

  • loudness

  • naturalness

  • ease-of-listening

Low Bit Rate Speech Coding

Compandent: http://www.compandent.com/

Speech Processing

The three main application areas are:

  • Speech Comms. (the ‘electronic link’)

  • Automatic Speech/Speaker recognition

  • Speech Synthesis

Much of the underlying analysis is common, e.g. linear predictive coding.

What does speech look like?

[Figure: speech waveform]

  • Dynamic range - for flexibility and robustness

  • Time-varying - to convey information

Frame-based Analysis

  • To capture time variations:

    • 20-30 ms frames - ‘centi-second’ labeling

    • spectral analysis

      • FFT

      • Filter-bank

      • Linear Predictive Coding
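As a rough illustration of frame-based analysis, the sketch below cuts a sampled signal into overlapping 20 ms frames at 8 kHz and shapes each with a Hamming window before any spectral analysis. The function name `frame_signal`, the 100 frames/second rate and the 200 Hz test tone are assumptions chosen for the example, not values fixed by the slides.

```python
import math

def frame_signal(samples, fs_hz=8000, frame_ms=20, frames_per_second=100):
    """Split samples into overlapping, Hamming-windowed frames."""
    frame_len = int(fs_hz * frame_ms / 1000)   # 20 ms at 8 kHz -> 160 samples
    hop = int(fs_hz / frames_per_second)       # 100 frames/s -> 80-sample step
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames

# One second of a 200 Hz tone as a stand-in for speech (illustrative only).
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
frames = frame_signal(signal)
print(len(frames), len(frames[0]))
```

Each frame would then be passed to an FFT, a filter-bank or an LPC analysis; a higher frame rate tracks transitions better at the cost of more computation.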


Speech Analysis/Coding

[Figure: excitation en (voiced or unvoiced) drives H(z) to give speech sn]

  • Two general cases:

    • Waveform coders

    • Source (voice) coders (vo-coders)

  • Source coders, e.g. linear predictive coding (LPC):

    • Model the source, i.e. the vocal tract (VT)

    • Linear, time-varying model of the VT, plus excitation

Systems Approach

[Figure: excitation (voiced, with pitch f0, or unvoiced) drives a vocal tract model with time-varying parameters to produce speech]

LPC Analysis/Synthesis

  • Synthesis: en -> H(z) -> sn, i.e. S(z) = H(z) E(z), where hn is the impulse response of H(z)

    • Input: Excitation

    • Output: Speech

  • Analysis: sn -> 1/H(z) -> en, i.e. E(z) = S(z) / H(z)

    • Input: Speech

    • Output: Excitation

‘Perfect’ Analysis/Synthesis

[Figure: sn -> 1/H(z) -> en -> H(z) -> sn]

Input sn and output sn are identical (within arithmetic limits).

Practical Analysis/Synthesis

[Figure: sending side: sn -> 1/H(z) -> en; transmission; receiving side: en -> H(z) -> sn]

  • Parameters for transmission:

    • Input / excitation en

    • Source model H(z)

  • Thus Analysis must derive these parameters, and

  • Synthesis must use them to re-generate speech

Linear Predictive Coding - LPC

Principle of linear prediction:

  • The next value (or sample) in a series, i.e. at time n, is predicted or estimated by a weighted sum of previous values, i.e. those at times n-1, n-2, ...

  • Thus for a predictor of order p, we have:

$$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p} = \sum_{i=1}^{p} a_i s_{n-i}$$

Linear Prediction

Transforming to the z-domain gives:

$$\hat{S}(z) = A'(z)\, S(z), \qquad A'(z) = \sum_{i=1}^{p} a_i z^{-i}$$

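LPC Error Terms

The error is simply the difference between the actual and predicted values:

$$e_n = s_n - \hat{s}_n$$

[Figure: sn is passed through the predictor A'(z) to give the estimate of sn, which is subtracted from sn to give en]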

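Synthesis

[Figure: en -> H(z) -> sn, realised as a feedback loop in which the output of the predictor A'(z) is added to en to give sn]

$$s_n = e_n + \sum_{i=1}^{p} a_i s_{n-i}, \qquad H(z) = \frac{1}{1 - A'(z)}$$

Parameters are updated at the frame rate.

NB: the ‘hat’ of approximation is omitted for simplicity.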

Analysis for Synthesis

[Figure: synthesis: en -> H(z) -> sn; analysis: sn -> 1/H(z) -> en, realised by subtracting the output of the predictor A'(z) from sn]

  • The Analysis and Synthesis must match

    • What is needed for the Synthesis?

    • Answer: en, the excitation, and H(z), the system

  • Thus the Analysis must derive these terms (from sn):

  • The speech signal sn is analysed to give en and H(z), i.e. the A'(z) parameters, for transmission.

Derivation of LPC Coefficients - A(z)

Recall:

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$

where ai are the p prediction coefficients. The principle behind LPC is to find a set of p coefficients, a1, a2, a3, ... ap, which in some sense minimises the error signal en over a frame of speech of N samples. This leads to a set of p coefficients for each frame.

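Derivation of A(z) – (2)

Minimisation of En, the squared error summed over the frame, is achieved by setting the p partial derivatives to zero:

$$E_n = \sum_{m} e_m^2, \qquad \frac{\partial E_n}{\partial a_i} = 0 \quad \text{for } i = 1, 2, \dots, p$$

From which:

$$\sum_{k=1}^{p} a_k\, r_{|i-k|} = r_i \quad \text{for } i = 1, 2, \dots, p$$

where:

$$r_i = \sum_{m} s_m s_{m-i}$$

In matrix form:

$$[R]\,\mathbf{a} = \mathbf{r} \quad \text{or} \quad \mathbf{a} = [R]^{-1}\,\mathbf{r}$$

The matrix [R] is symmetric Toeplitz, which allows numerically efficient inversion techniques, Durbin’s recursion algorithm being one of the most popular.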
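To make the solution of the Toeplitz system concrete, here is a minimal sketch of the autocorrelation method with Durbin's recursion, using the predictor convention in which sn is estimated by the sum of ai sn-i. The function names and the test signal (the impulse response of a known second-order all-pole filter) are assumptions made for illustration; windowing is left out to keep the sketch short.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum over n of s[n] * s[n-k], for k = 0 .. max_lag."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Solve [R] a = r for the p predictor coefficients by Durbin's recursion.

    Returns (a, err) where a[i-1] is a_i and err is the residual energy.
    """
    a = [0.0] * (p + 1)          # a[0] unused, a[1..p] are the coefficients
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k                 # new highest-order coefficient
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k       # prediction error energy shrinks each order
    return a[1:], err

# Test signal: impulse response of s[n] = 1.5537 s[n-1] - 0.8276 s[n-2] + e[n].
s = [1.0, 1.5537]
for n in range(2, 600):
    s.append(1.5537 * s[n - 1] - 0.8276 * s[n - 2])

coeffs, residual_energy = levinson_durbin(autocorrelation(s, 2), 2)
print(coeffs, residual_energy)
```

Because the test signal really is the output of a second-order all-pole system, the recursion should recover coefficients very close to the ones used to generate it.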

Derivation of A(z) – (3)

    • When N is very large, the r are the autocorrelation coefficients of s

    • s comes from e convolved with h (excitation & vocal tract)

    • we are interested here in separating e and h

    • the predictor order, p, is small, to reflect the short-term periodicities (formants)

    • with higher predictor orders we also capture the longer-term periodicities (pitch)

    • 2 practical problems with evaluating a:

      • matrix singularities when inverting [R]

      • unstable resultant H(z)

    • in practice both are solved by windowing, i.e. shaping the frame, e.g. with a Hamming window

    Speech Signal Characteristics

    • Duration

    • Dynamic Range

    • Periodicities:

      • vocal tract

      • pitch

    • Frame-based Analysis

      • frame size: quasi-stationary, yet able to capture transitions; typically 20 - 30 ms

      • frame rate: task dependent; more frames means more bandwidth/computation; up to 100 frames/second

    Harmonic Structures and Periodicities

    • Harmonic Structures & Periodicities give potential for data reduction

    • LPC is one way of gaining this compression

    • Speech has two obvious separate structures

      • vocal tract resonances

      • pitch

[Figure: short-term prediction. Excitation en (voiced or unvoiced) drives the short-term vocal tract filter H(z) to give speech sn; labels Tp and p are marked]

[Figure: long-term prediction. The excitation (voiced or unvoiced) passes through the long-term (pitch) filter Hlt(z) and the short-term (vocal tract) filter Hst(z) to give speech sn; the signals en and epn and the labels Tp and P are marked]

Two structures: short-term (formants) & long-term (pitch, excitation)

[Figure: excitation with gain k, filters Hlt(z) and Hst(z), output sn; the signals en and epn are marked]

    • e.g. 20 ms frame = 160 samples @ 8 kHz

    • Hlt(z): ai, e.g. p = 3

    • Hst(z): ai, e.g. p = 10

NB: representations of these parameters are transmitted.

    Practical Coding Systems

    • Waveform & Source Coders (Vocoders)

      • 2 periodicities/redundancies in source

        • short-term (formants)

        • long-term - pitch

      • Excitation en

[Figure: excitation en, filters Hlt(z) and Hst(z), speech sn; intermediate signal epn]

‘Perfect’ Analysis/Synthesis (1)

[Figure: sn -> 1/H(z) -> en -> H(z) -> sn]

Input sn and output sn are identical (within arithmetic limits).

‘Perfect’ Analysis/Synthesis (2)

[Figure: the analysis filter 1/H(z) is realised as 1 - A'(z) and the synthesis filter H(z) as 1/(1 - A'(z)): sn -> 1 - A'(z) -> en -> 1/(1 - A'(z)) -> sn]

‘Perfect’ Analysis/Synthesis (3)

[Figure: the analysis filter 1 - A'(z) as a tapped delay line. The original speech sn passes through a chain of z^-1 delays giving sn-1 ... sn-i ... sn-p; these are weighted by a1 ... ai ... ap, summed, and subtracted from sn to give the residual en]

Note the minus sign: in Matlab it is combined with the ai.

What determines p?

‘Perfect’ Analysis/Synthesis (4)

[Figure: analysis followed by re-synthesis. Analysis as in (3): original speech sn -> residual en. Re-synthesis: the delayed outputs sn-1 ... sn-i ... sn-p, weighted by a1 ... ai ... ap, are added to en (note: no minus sign) to give the re-synthesised speech sn]
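The ‘perfect’ analysis/synthesis loop can be checked numerically. The sketch below implements the analysis filter 1 - A'(z) and the synthesis filter 1/(1 - A'(z)) as tapped delay lines and confirms that the re-synthesised samples match the input to within arithmetic limits. The function names, coefficient values and sample values are illustrative assumptions.

```python
def analyse(s, a):
    """Residual e[n] = s[n] - sum_i a[i] * s[n-i]  (filter 1 - A'(z))."""
    p = len(a)
    return [s[n] - sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
            for n in range(len(s))]

def synthesise(e, a):
    """Speech s[n] = e[n] + sum_i a[i] * s[n-i]  (filter 1 / (1 - A'(z)))."""
    p = len(a)
    s = []
    for n in range(len(e)):
        prediction = sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        s.append(e[n] + prediction)   # note: plus, not minus
    return s

a = [1.5537, -0.8276]                                  # example predictor coefficients
speech = [0.0, 0.5, 1.0, 0.3, -0.7, -1.0, -0.2, 0.6]   # arbitrary samples
residual = analyse(speech, a)
resynth = synthesise(residual, a)
max_diff = max(abs(x - y) for x, y in zip(speech, resynth))
print(max_diff)
```

In a practical coder the residual and coefficients are quantised before transmission, so the output is only “similar” to the input rather than identical.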

Practical System

[Figure: sn -> 1/H(z) -> en -> transmitted data frame -> en -> H(z) -> sn]

Input sn and output sn are “similar”.

What does the Transmitted Data Frame contain?

Analysis-by-Synthesis: LPAS

Integrated encoder & decoder at the encoder

[Figure: LPAS encoder. An adaptive encoder drives a basic decoder; the decoder output is subtracted from the input speech sn to form a weighted error]

Log Spectral Estimates

    • Comparisons between frames are very important in many situations

    • Log spectral estimates are the most common basis for the distance, D, between frames

    • In Comms., computation is expensive, so parameter-vector approximations to D are used

Some Standards

Standard   Application         Coder      Bit rate (kb/s)
GSM        European Cellular   RPE-LTP    13
FS1016     Secure Voice        CELP       4.8
IS54       NA Cellular         VSELP      7.95
IS96       NA Cellular         QCELP      1-8
JDC-FR     Japanese Cellular   VSELP      6.7
JDC-HR     Japanese Cellular   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16



CELP eg

[Figure: codebook (CB) entry selected by the CB index -> gain -> en -> Hlt(z) (long-term coefficients, pitch) -> Hst(z) (short-term coefficients, formants) -> sn]

The excitation en is represented by an address, i.e. the CB index.

CELP – LPAS (Encoder)

[Figure: the CELP decoder (CB index, gain, Hlt(z) with long-term (pitch) coefficients, Hst(z) with short-term (formant) coefficients) used as the basic decoder inside the LPAS loop. Its output is subtracted from the input speech sn, and the weighted error drives the adaptive encoder]

The excitation en is represented by an address, i.e. the CB index.
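The analysis-by-synthesis idea can be sketched as a codebook search: each candidate excitation is run through the decoder's synthesis filter, and the index and gain that best reproduce the input frame are kept. This is a simplified illustration under stated assumptions: only a single short-term filter is used, the error is unweighted (a real CELP coder applies a perceptual weighting and a long-term filter), and the tiny codebook, coefficient and function names are invented for the example.

```python
def synthesise(e, a):
    """Decoder filter: s[n] = e[n] + sum_i a[i] * s[n-i]."""
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a[i] * s[n - 1 - i]
                            for i in range(len(a)) if n - 1 - i >= 0))
    return s

def search_codebook(target, codebook, a):
    """Return (index, gain, squared_error) of the best-matching excitation."""
    best = None
    for index, excitation in enumerate(codebook):
        y = synthesise(excitation, a)              # run the decoder at the encoder
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

a = [0.9]                                          # illustrative short-term filter
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 1, -1], [1, 1, 0, 0]]
target = synthesise([0.5 * v for v in codebook[2]], a)   # frame to be encoded
index, gain, error = search_codebook(target, codebook, a)
print(index, gain, error)
```

Only the index and a quantised gain need to be transmitted for the excitation, which is where the bit-rate saving comes from.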

Conversion of LPC Parameters

    • A(z) = 1 + a1 z^-1 + a2 z^-2 + ... + ap z^-p, and the ai are to be transmitted (Tx’d)

    • Line Spectral Frequencies (LSFs) are a clever way of representing the LPC coefficients, the ai of A(z)

    • The ai are floating-point numbers and their accuracy is important

    • Factorising A(z) tends to give complex roots in the z-domain

    • LSFs map these complex roots on to the unit circle

[Figure: z-plane (imaginary axis jy) with roots marked x; a root on the unit circle at angle θ corresponds to LSF = ws · θ / 2π]

LSFs:

    • Lead to efficient coding

    • Ensure a minimum-phase filter

    • Bit errors are localised in the spectrum, minimising loss of speech quality

Line Spectral Frequencies

  • Consider

$$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$$

  • and

$$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$$

  • where n is the order of A(z); P(z) and Q(z) then lead to what are known as LSFs

  • Clearly, if P(z) and Q(z) are known then A(z) can be found:

$$A(z) = \{P(z) + Q(z)\} / 2$$

  • The roots of P(z) and Q(z) lie on the unit circle in the z-domain. The locations give:

    • the LSFs

    • P(z) and Q(z), and whence A(z)

LSF Evaluation

Consider one pair of complex roots, A1(z):

$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$$

$$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^{2}) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1)\, z^{-3}$$

$$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^{2}) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1)\, z^{-3}$$

The roots at z = -1 (in P1) and z = +1 (in Q1) are discarded.

It follows that the LSFs, ω1 & ω2, are given by:

$$\cos(\omega_1) = -(a_1 + a_2 - 1)/2 \quad \text{and} \quad \cos(\omega_2) = -(a_1 - a_2 + 1)/2$$

Show:

$$a_1 = -(\cos(\omega_1) + \cos(\omega_2)) \quad \text{and} \quad a_2 = \cos(\omega_2) - \cos(\omega_1) + 1$$
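The second-order relations between the LPC coefficients and the LSFs can be checked with a short round-trip calculation. The sketch below uses a1 = -1.8 and a2 = 0.9, the values substituted in the LSF Examples slide; the function names are assumptions, and the `acos` calls are only valid when both arguments lie in [-1, 1], which holds for a minimum-phase A(z).

```python
import math

def lpc_to_lsf2(a1, a2):
    """LSFs (radians) of A(z) = 1 + a1 z^-1 + a2 z^-2."""
    w1 = math.acos(-(a1 + a2 - 1) / 2)   # from the quadratic factor of P(z)
    w2 = math.acos(-(a1 - a2 + 1) / 2)   # from the quadratic factor of Q(z)
    return w1, w2

def lsf_to_lpc2(w1, w2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(w1) + math.cos(w2))
    a2 = math.cos(w2) - math.cos(w1) + 1
    return a1, a2

w1, w2 = lpc_to_lsf2(-1.8, 0.9)
a1, a2 = lsf_to_lpc2(w1, w2)
print(round(w1, 3), round(w2, 3), a1, a2)
```

The two LSFs come out in increasing order, with the root of P(z) below the root of Q(z), and the inverse mapping returns the original coefficients to within rounding.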

LSF Test Example

$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2)\, z^{-2} = (z^2 + 2\cos(\theta)\, w_n z + w_n^2)\, z^{-2}$$

where wn is the radius and θ is the angle from π. So: radius = √a2 & ω = π - θ

Note: in P & Q all wn^2 terms (of the multiple 2nd orders) are unity

EG 1: a2 = 1, then cos(ω1) = -(a1 + a2 - 1)/2 = -a1/2

The roots are already on the circle and do not move (unstable system, not practical)

EG 2: a1 = 0, then cos(ω1) = -(a1 + a2 - 1)/2 = -(a2 - 1)/2

and cos(ω2) = -(a1 - a2 + 1)/2 = -(-a2 + 1)/2 = -cos(ω1)

so the LSFs are symmetric about π/2 (i.e. ws/4)

LSF Review & Example (1)

LSFs/LSPs are defined via:

$$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$$

and

$$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$$

thus

$$A(z) = \{P(z) + Q(z)\} / 2$$

LSF Review & Example (2)

For a second order A(z) = 1 + a1 z^-1 + a2 z^-2:

$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\, z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1)\, z^{-3}$$

$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})\, z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1)\, z^{-3}$$

cf: (s^2 + (2cos(θ) wn) s + wn^2)

LSF Review & Example (3)

For a second order A(z) = 1 + a1 z^-1 + a2 z^-2:

$$P(z) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1)\, z^{-3}$$

$$Q(z) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1)\, z^{-3}$$

cf: (s^2 + (2cos(θ) wn) s + wn^2)

Thus:

$$(a_1 + a_2 - 1) = 2\cos(\theta_1) = -2\cos(\omega_1) \quad \& \quad (a_1 - a_2 + 1) = -2\cos(\omega_2)$$

So, given:

i) LPC coeffs. a1 and a2, then the LSFs ω1 & ω2 can be found

ii) LSFs ω1 & ω2, then the LPC coeffs. a1 and a2 can be found

[Figure: z-plane with the roots of P(z) and Q(z) on the unit circle at angles ω1 and ω2]

LSF Review & Example (4)

For a second order, with P(z) corresponding to the first root and Q(z) to the second root:

$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\, z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1)\, z^{-3}$$

for the second pair of LSFs, 1.37 and 1.77 rad:

$$P(z) = (z^2 - 2\cos(1.37)\, z + 1)(z + 1)\, z^{-3} = (z^3 + (1 - 2\cos(1.37))\, z^2 + (1 - 2\cos(1.37))\, z + 1)\, z^{-3}$$

Likewise

$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})\, z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1)\, z^{-3}$$

$$Q(z) = (z^2 - 2\cos(1.77)\, z + 1)(z - 1)\, z^{-3} = (z^3 + (-1 - 2\cos(1.77))\, z^2 + (1 + 2\cos(1.77))\, z - 1)\, z^{-3}$$

Then

$$A(z) = \{P(z) + Q(z)\} / 2 = (z^3 - (\cos(1.37) + \cos(1.77))\, z^2 + (1 - \cos(1.37) + \cos(1.77))\, z)\, z^{-3}$$

LSF Examples

A(z) = 1 + a1 z^-1 + a2 z^-2, with a1 = -1.8 and a2 = 0.9

$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})\, z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1)\, z^{-3}$$

$$= (z^2 + (-1.8 + 0.9 - 1) z + 1)(z + 1)\, z^{-3} = (z^2 - 1.9 z + 1)(z + 1)\, z^{-3}$$

cf: (z^2 + (2cos(θ) wn) z + wn^2)

thus cos(θ) = -1.9/2, or θ = 2.824, and ω1 = π - θ = 0.318

    Example Bit Allocation


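Codebooks & VQ

[Figure: over time, each vector of p parameters is replaced at the sender by the index i (0 ... N-1) of a codebook entry; the receiver looks the index up in an identical codebook to recover a vector of p parameters]

    • Codebook size: N = 2^L

    • Data reduction: (p x B) to L (bits per vector, taking B as the number of bits per parameter)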

Codebook Compression

[Figure: a codebook of N = 2^k vectors, each with M elements; the index i of the chosen vector is transmitted. The quantised A(z) defines H(z) in en -> H(z) -> sn]

    • Principle

      • representative data sets

      • a data vector is replaced / represented by the “nearest” vector, chosen from a “codebook” - a closed set of vectors

    • Examples

      • LPC parameter sets

      • Excitation as in CELP
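A minimal sketch of the codebook principle follows: the encoder replaces each data vector by the index of its nearest codebook entry, and the decoder, holding an identical codebook, looks the index up again. Squared Euclidean distance is assumed as the “nearest” measure, and the four-entry codebook and test vectors are invented for the example.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((x - c) ** 2 for x, c in zip(vector, codebook[i])))

# N = 4 entries, so each index costs k = 2 bits regardless of the vector length.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vectors = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.9]]

indices = [nearest_index(v, codebook) for v in vectors]   # what is transmitted
decoded = [codebook[i] for i in indices]                  # what the receiver rebuilds
print(indices, decoded)
```

The decoded vectors are approximations of the originals; the codebook size trades reconstruction accuracy against bit rate.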

Codebook Compression - CELP

[Figure: codebook of time-domain samples; the excitation en for each y ms block, located by its start point in the codebook, drives H(z) to give sn]

    • en are time-domain samples (integers)

    • R samples per second (e.g. 8000 Hz)

    • Frame rate governs vector size

    • Codebook size: P = 2^j

    • Bit rate = j/y bits/ms

NB: en also includes gain

Codebook Compression of H(z)

[Figure: every x ms, A[z] at time t, a vector with M elements, is replaced by the index i of an entry in a codebook of N vectors]

    • Vector with M elements, every x ms

    • Codebook with N = 2^k vectors

    • Bit rate = k/x bits per ms (not a function of M)

    • In practice A[z] is converted to LSFs.
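The bit-rate relation for an indexed codebook can be worked through with numbers. The values below (a 1024-entry codebook updated every 20 ms) are illustrative assumptions, not figures from the slides, and the function name is hypothetical.

```python
import math

def codebook_bit_rate_bps(codebook_size, frame_period_ms):
    """Bit rate of sending one codebook index per frame: k / x bits per ms."""
    k = math.log2(codebook_size)           # bits per index, N = 2^k
    return k / frame_period_ms * 1000.0    # bits/ms converted to bits/s

rate = codebook_bit_rate_bps(1024, 20)     # 10 bits every 20 ms
print(rate)
```

The vector length M does not appear anywhere in the calculation, which is the point made on the slide.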

Codebook Generation

1) Initialise:

     form a single centroid of all training data, N = 1

2) Repeat

     Split centroids: N -> 2N

     Repeat

       Cluster data to nearest centroid

       Update each centroid to the mean of its cluster

     until convergence

   until N large enough
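The splitting procedure for codebook generation can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance, an additive perturbation of ±eps for the split, and a fixed cap on the inner iterations; the function names and the six training points are invented for the example.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def dist2(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def generate_codebook(data, n_target, eps=0.01, max_iters=50):
    codebook = [mean(data)]                       # 1) single centroid, N = 1
    while len(codebook) < n_target:               # 2) repeat until N large enough
        codebook = [[c + sign * eps for c in centroid]
                    for centroid in codebook for sign in (+1, -1)]   # N -> 2N
        for _ in range(max_iters):
            clusters = [[] for _ in codebook]
            for v in data:                        # cluster data to nearest centroid
                nearest = min(range(len(codebook)),
                              key=lambda j: dist2(v, codebook[j]))
                clusters[nearest].append(v)
            updated = [mean(cl) if cl else old    # keep a centroid with no members
                       for cl, old in zip(clusters, codebook)]
            if updated == codebook:               # converged: centroids unchanged
                break
            codebook = updated
    return codebook

training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook = sorted(generate_codebook(training, 2))
print(codebook)
```

Because each split doubles N, the final codebook size is a power of two, which matches the N = 2^k indexing used for transmission.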

VQ Performance on Unseen Data

[Figures from Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995]

LPC & FFT Spectra

[Figure, top: waveform, amplitude -1 to 1 against time (ms), 0 to 25.6 ms. Bottom: LPC and FFT spectra with the LSFs marked, magnitude (dB), -40 to 40, against frequency (kHz), 0 to Fs/2]

LPC roots:

  -0.6651 ± 0.6695i
  -0.0560 ± 0.9709i
   0.7228 ± 0.6225i
   0.8714 ± 0.3694i
   0.5758
  -0.4200

LPC Spectra & LSF’s

[Figure: LPC spectrum with the LSFs marked, magnitude (dB), -40 to 40, against frequency (kHz), 0 to Fs/2, for the same LPC roots as in the LPC & FFT Spectra figure]

LPC & FFT Spectra - 2nd Order

A(z) coefficients: 1.5537, -0.8276

Roots: 0.7769 ± 0.4733i

$$H(0) = \frac{K}{1 - (1.5537 - 0.8276)} = \frac{K}{0.274}, \qquad H(w_s/2) = \frac{K}{1 - (-1.5537 - 0.8276)} = \frac{K}{3.38}$$

$$\frac{H(0)}{H(w_s/2)} = \frac{K/0.274}{K/3.38}, \quad \text{i.e. } 21.8\ \text{dB}$$

[Figure, top: waveform, amplitude -1 to 1 against time (ms), 0 to 25.6 ms. Bottom: LPC and FFT spectra, magnitude (dB), -40 to 40, against frequency (kHz), 0 to Fs/2]
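The second-order figures can be reproduced numerically. The sketch below evaluates the synthesis filter gain K / |1 - A'(z)| on the unit circle at zero frequency and at half the sampling frequency, and finds the poles from the quadratic z^2 - a1 z - a2 = 0. The function name and the choice K = 1 are assumptions; the coefficients are those quoted on the slide.

```python
import cmath
import math

def synthesis_gain(a, omega, K=1.0):
    """|H| = K / |1 - sum_i a_i e^{-j omega i}| at normalised frequency omega."""
    denominator = 1 - sum(ai * cmath.exp(-1j * omega * (i + 1))
                          for i, ai in enumerate(a))
    return K / abs(denominator)

a = [1.5537, -0.8276]
gain_dc = synthesis_gain(a, 0.0)            # K / 0.274
gain_half_fs = synthesis_gain(a, math.pi)   # K / 3.38
ratio_db = 20 * math.log10(gain_dc / gain_half_fs)

# Poles of H(z): roots of z^2 - a1 z - a2 = 0.
discriminant = cmath.sqrt(a[0] ** 2 + 4 * a[1])
poles = ((a[0] + discriminant) / 2, (a[0] - discriminant) / 2)
print(round(ratio_db, 1), poles)
```

The ratio of the two gains is independent of K, which is why the slide can quote it in dB without fixing the gain.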

GSM

    • Groupe Spécial Mobile - EU

      • First digital cellular system in the world

      • See Hodge 1990

      • Based on TDMA & FDMA at 900 MHz, and RPE-LTP speech coding (i.e. it is an ‘LPAS’ system)

      • Now also at 1800 MHz

      • Carriers spaced at 200 kHz

      • Supporting 8 TDMA time slots each

      • Time slots: 577 µs - 156.25 bit periods

      • 8 time slots form 1 GSM frame of 4.62 ms

      • Modulation: Gaussian minimum shift keying (GMSK)

      • 26-bit training sequence in every time slot

      • Round-trip delay ~ 80 ms

      • EU: GSM; US: D-AMPS

    Other Related Topics

    Spectral Lifting: H(z) = 1 - a z^-1

    Codebook Training

    Spectral Differences between 2 frames

    Cepstra

    Modeling Speech Space - HMM’s


Pre-Emphasis Example

[Figure Q1: two 30 ms waveforms, (a) and (b), each with amplitude between -1 and 1]

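Pre-Emphasis Example

For the pre-emphasis (spectral lifting) filter G(z) = 1 - a z^-1:

    • G(0) = 1 - a

    • G(ws/2) = 1 + a

    • For G(ws/2) > G(0), a must be > 0

[Figure: z-plane (imaginary axis jy) with the zero at z = a on the real axis; the distance from the zero to the point ws/2 on the unit circle is 1 + a (here 1 + a = 2)]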
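The two end-point gains of the pre-emphasis filter are easy to confirm in code. The sketch below applies y[n] = x[n] - a x[n-1] to a list of samples and evaluates the gain |1 - a e^{-jω}| at zero frequency and at half the sampling frequency. The coefficient a = 0.95 and the function names are illustrative assumptions.

```python
import cmath
import math

def pre_emphasis(samples, a=0.95):
    """Apply G(z) = 1 - a z^-1, i.e. y[n] = x[n] - a * x[n-1]."""
    out = []
    previous = 0.0
    for x in samples:
        out.append(x - a * previous)
        previous = x
    return out

def gain(a, omega):
    """Magnitude of G(z) on the unit circle at normalised frequency omega."""
    return abs(1 - a * cmath.exp(-1j * omega))

a = 0.95
gain_dc = gain(a, 0.0)            # 1 - a
gain_half_fs = gain(a, math.pi)   # 1 + a
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0], a)   # a constant (zero-frequency) input
print(gain_dc, gain_half_fs, flat)
```

A constant input is almost removed after the first sample, while rapidly alternating samples would be boosted, which is the spectral lifting effect.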

Z-plane to Magnitude Spectrum

[Figure, left: z-plane plot, imaginary part against real part, both from -1 to 1. Right: magnitude (dB), -30 to 50, against frequency (kHz), 0 to Fs/2, with the gain 1 + a = 2 marked at ws/2]


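ST & LT Prediction

[Figure: short-term prediction (STP) followed by long-term prediction (LTP). The speech sn passes through the STP analysis filter 1 - A'(z), a tapped delay line with coefficients a1 ... ai ... ap acting on sn-1 ... sn-i ... sn-p, to give the residual en; a second filter of the same form (LTP) then gives e`n]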