Speech Coding Techniques. 潘奕誠 4/7/2003. Introduction. Efficient speech-coding techniques Advantages for VoIP Digital streams of ones and zeros The lower the bandwidth, the lower the quality RTP payload types Processing power


Speech Coding Techniques

潘奕誠

4/7/2003

Introduction
  • Efficient speech-coding techniques
    • Advantages for VoIP
    • Digital streams of ones and zeros
    • The lower the bandwidth, the lower the quality
  • RTP payload types
  • Processing power
    • Better quality at a given bandwidth requires a more complex algorithm
    • A balance between quality and cost
Voice Quality
  • Bandwidth is easily quantified
    • Voice quality is subjective
  • MOS, Mean Opinion Score
    • ITU-T Recommendation P.800
      • Excellent – 5
      • Good – 4
      • Fair – 3
      • Poor – 2
      • Bad – 1
    • A minimum of 30 people
    • They listen to voice samples or take part in conversations
  • P.800 recommendations
    • The selection of participants
    • The test environment
    • Explanations to listeners
    • Analysis of results
  • Toll quality
    • A MOS of 4.0 or higher
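
As a rough illustration of the scoring described above, a MOS is simply the average of the listeners' ratings on the 1 to 5 scale. The sketch below is a minimal example; the function names and the ratings are made up for illustration and are not part of P.800.

```python
from statistics import mean

TOLL_QUALITY_MOS = 4.0  # threshold quoted on the slide


def mean_opinion_score(ratings):
    """Average listener ratings on the 1 (Bad) to 5 (Excellent) scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


def is_toll_quality(ratings):
    """True when the averaged score reaches the toll-quality threshold."""
    return mean_opinion_score(ratings) >= TOLL_QUALITY_MOS


# Hypothetical ratings from 30 listeners (invented data, illustration only).
ratings = [5] * 8 + [4] * 16 + [3] * 6
print(mean_opinion_score(ratings), is_toll_quality(ratings))
```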
About Speech
  • Speech
    • Air pushed from the lungs past the vocal cords and along the vocal tract
    • The basic vibrations – vocal cords
    • The sound is altered by the disposition of the vocal tract (tongue and mouth)
  • Model the vocal tract as a filter
    • The shape changes relatively slowly
  • The vibrations at the vocal cords
    • The excitation signal
Speech sounds
  • Voiced sound
    • The vocal cords vibrate open and closed
    • Quasi-periodic pulses of air
    • The rate of the opening and closing – the pitch
  • Unvoiced sounds
    • Forcing air at high velocities through a constriction
    • Noise-like turbulence
    • Show little long-term periodicity
    • Short-term correlations still present
  • Plosive sounds
    • A complete closure in the vocal tract
    • Air pressure is built up and released suddenly
Voice Sampling
  • Discrete Time LTI Systems: The Convolution Sum
    • [Figure: plots of h[n], x[n], and y[n] against n, illustrating the convolution sum]
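
The convolution sum for a discrete-time LTI system, y[n] = Σ_k x[k] h[n−k], can be sketched directly in code. The sample values for x[n] and h[n] below are assumptions chosen for illustration, and `convolve` is a hypothetical helper name.

```python
def convolve(x, h):
    """Convolution sum y[n] = sum_k x[k] * h[n - k] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            # x[k] contributes a scaled, shifted copy of h to the output.
            y[k + m] += xk * hm
    return y


# Assumed example: a two-sample input and a three-sample impulse response.
x = [0.5, 2.0]
h = [1.0, 1.0, 1.0]
print(convolve(x, h))  # [0.5, 2.5, 2.5, 2.0]
```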

Quantization (Scalar Quantization)

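[Figure: the range from $m_0 = -A$ to $m_L = A$ drawn as a line, with decision boundaries $m_0, m_1, m_2, \ldots, m_k, m_{k+1}, \ldots, m_{L-1}, m_L$, representative values $v_1, v_2, \ldots, v_{k+1}, \ldots, v_L$, and the interval $J_{k+1}$ marked]

  • Assume $|x[n]| \le A$
  • Divide the range $[-A, A]$ into $L$ quantization levels $\{J_1, J_2, \ldots, J_k, \ldots, J_L\}$
    • $J_k : [m_{k-1}, m_k]$
    • $L = 2^R$
  • Each quantization level $J_k$ is represented by a value $v_k$
    • $S = \bigcup_k J_k$, $V = \{v_1, v_2, \ldots, v_k, \ldots, v_L\}$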
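
A minimal sketch of a uniform scalar quantizer following the definitions above. It assumes equally wide intervals and takes each interval's midpoint as its representative value; the slide fixes neither choice. Indices are 0-based here, whereas the slide counts levels from 1, and `uniform_quantize` is a hypothetical name.

```python
def uniform_quantize(x, A, R):
    """Return (k, v) for sample x using L = 2**R uniform levels on [-A, A].

    k is the 0-based index of the interval containing x and v is the
    representative value (assumed here to be the interval midpoint).
    """
    L = 2 ** R
    step = 2 * A / L                      # width of each interval
    x = max(-A, min(A, x))                # clip to the assumed range |x| <= A
    k = min(int((x + A) / step), L - 1)   # x == A falls in the last interval
    v = -A + (k + 0.5) * step
    return k, v


print(uniform_quantize(0.0, 1.0, 2))   # (2, 0.25)
print(uniform_quantize(-1.0, 1.0, 2))  # (0, -0.75)
```

With R bits per sample the reconstruction error of an in-range sample is at most half a step, that is A / 2**R.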

Non-Uniform Quantization

[Figure: non-uniformly spaced decision boundaries from $m_0 = -A$ through 0 to $m_L = A$]

  • Concept: small quantization levels for small x, large quantization levels for large x
  • Goal: constant $SNR_Q$ for all x

Companding

[Figure: block diagram. x[n] → compressor F(x) → uniform quantization → bit stream …1101… → uniform decoder → expander F⁻¹(x) → x̂[n]]

  • Compressor + Expander = Compander
  • F(x) specifies the non-uniform quantization characteristics

Non-Uniform Quantization
  • μ-law
  • A-law
  • Typical values in practice
  • μ = 255, A = 87.6
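
A hedged sketch of μ-law companding with μ = 255: compress, quantize uniformly, then expand. It uses the continuous μ-law curve F(x) = sgn(x) ln(1 + μ|x|) / ln(1 + μ) for |x| ≤ 1, not the segmented approximation used by real G.711 implementations, and all function names are assumptions.

```python
import math

MU = 255.0  # typical value quoted on the slide


def mu_compress(x, mu=MU):
    """Compressor F(x) for |x| <= 1."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Expander, the inverse of mu_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def compand_quantize(x, bits=8, mu=MU):
    """Compress, quantize uniformly to `bits` bits, then expand."""
    levels = 2 ** bits
    step = 2.0 / levels
    y = mu_compress(x, mu)
    k = min(int((y + 1.0) / step), levels - 1)
    y_hat = -1.0 + (k + 0.5) * step       # midpoint of the chosen interval
    return mu_expand(y_hat, mu)


# A small sample is reproduced with a much smaller error than the
# worst-case half step (about 0.0039) of a plain 8-bit uniform quantizer.
print(abs(compand_quantize(0.01) - 0.01))
```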
Types of Speech Codecs
  • Waveform codecs, source codecs (also known as vocoders), and hybrid codecs
Speech Source Model and Source Coding

[Figure: block diagram. A voiced/unvoiced switch (v/u) selects between a periodic pulse train generator (pitch N) and a random sequence generator; the selected signal is scaled by the gain G to form the excitation u[n], which drives the vocal tract model G(z), G(ω), g[n] to produce x[n]]

  • Excitation parameters
    • v/u: voiced/unvoiced
    • N: pitch for voiced
    • G: signal gain
    • These determine the excitation signal u[n]
  • Vocal tract parameters
    • $\{a_k\}$: LPC coefficients
    • Formant structure of speech signals
  • Vocal tract model
    • $G(z) = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
  • A good approximation, though not precise enough
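
The source model can be sketched as an excitation generator feeding an all-pole filter, x[n] = G·u[n] + Σ_k a_k x[n−k]. This is a minimal illustration only: the function names, the unit-amplitude pulses, and the Gaussian noise source are assumptions, not details given on the slide.

```python
import random


def excitation(voiced, n_samples, pitch_period, seed=0):
    """u[n]: a pulse every `pitch_period` samples if voiced, else white noise."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def synthesize(a, gain, u):
    """All-pole filter 1 / (1 - sum_k a[k-1] z^-k) driven by gain * u[n]."""
    x = []
    for n, un in enumerate(u):
        acc = gain * un
        for k, ak in enumerate(a, start=1):
            if n >= k:
                acc += ak * x[n - k]
        x.append(acc)
    return x


# A one-tap filter rings down geometrically after a single pulse.
print(synthesize([0.5], 1.0, [1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]
```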

LPC Vocoder (Voice Coder)

[Figure: block diagram. Encoder side: x[n] → LPC analysis → {a_k}, N, G, v/u → encoder → bit stream …11011…. Receiver side: bit stream …11011… → decoder → {a_k}, N, G, v/u → excitation (Ex) → vocal tract filter G(z), g[n] → x[n]]

  • N by pitch detection
  • v/u by voicing detection
  • {a_k} can be non-uniformly or vector quantized to reduce the bit rate further
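
The slide does not say how the LPC analysis block obtains {a_k}. One common choice, assumed here, is the autocorrelation method solved with the Levinson-Durbin recursion; the function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag] for lag = 0..max_lag."""
    return [
        float(sum(x[n] * x[n - lag] for n in range(lag, len(x))))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Return (a, error) where x[n] is predicted as sum_k a[k-1] * x[n-k]."""
    if r[0] == 0:
        raise ValueError("signal has zero energy")
    a = [0.0] * (order + 1)               # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                   # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k              # remaining prediction error energy
    return a[1:], error


# Autocorrelation of an ideal first-order process with coefficient 0.5.
print(levinson_durbin([1.0, 0.5, 0.25], 2))  # ([0.5, 0.0], 0.75)
```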

G.711
  • The most commonplace codec
    • Used in the circuit-switched telephone network
    • PCM, Pulse-Code Modulation
  • If uniform quantization
    • 12 bits * 8 k/sec = 96 kbps
  • Non-uniform quantization
    • 8 bits * 8 k/sec = 64 kbps, the DS0 rate
    • μ-law
      • North America
    • A-law
      • Other countries; a little friendlier to lower signal levels
    • An MOS of about 4.3
ADPCM (Adaptive Differential PCM)
  • DPCM and ADPCM
    • ADPCM: adaptive prediction in DPCM, plus adaptive quantization
  • Adaptive quantization
    • Quantization step size Δ varies with the local signal level
    • $\Delta[n] = a\,\sigma_x[n]$
    • $\sigma_x[n]$: locally estimated standard deviation of x[n]
  • G.721: ADPCM-coded speech at 32 kbps
  • G.726 (A-law or μ-law)
    • 16, 24, 32, 40 kbps
    • MOS of 4.0 at 32 kbps
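
A toy sketch of adaptive quantization, with the step size following Δ[n] = a·σ_x[n]. The running variance estimate, the constants, and the function name are assumptions for illustration; this is not the adaptation rule of G.721 or G.726, and the prediction stage of ADPCM is left out.

```python
import math


def adaptive_quantize(x, a=0.5, alpha=0.9, bits=4, var0=1.0):
    """Quantize each sample with a step proportional to the local signal level.

    The variance is tracked from reconstructed values only, so a decoder
    could repeat the same adaptation without side information.
    """
    half = 2 ** (bits - 1)
    var = var0
    codes, recon = [], []
    for sample in x:
        step = a * math.sqrt(var)                      # step = a * sigma
        code = max(-half, min(half - 1, round(sample / step)))
        value = code * step
        codes.append(code)
        recon.append(value)
        # Exponentially weighted estimate of the local variance.
        var = max(alpha * var + (1 - alpha) * value * value, 1e-6)
    return codes, recon


# The step shrinks during the quiet segment and grows again for the loud one.
codes, recon = adaptive_quantize([0.1] * 50 + [2.0] * 50)
print(recon[49], recon[-1])
```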
Analysis-by-Synthesis (AbS) Codecs
  • Hybrid codec
    • Fill the gap between waveform and source codecs
    • The most successful and commonly used
      • Time-domain AbS codecs
      • Not a simple two-state, voiced/unvoiced
      • Different excitation signals are attempted
      • Closest to the original waveform is selected
      • MPE, Multi-Pulse Excited
      • RPE, Regular-Pulse Excited
      • CELP, Code-Excited Linear Predictive
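
The analysis-by-synthesis idea can be sketched as a loop: pass each candidate excitation through the synthesis filter and keep the one whose output is closest to the original waveform. This sketch assumes a plain squared error and a tiny made-up codebook; real codecs use a perceptually weighted error and also search over gains.

```python
def synthesize(a, excitation):
    """All-pole synthesis filter: x[n] = e[n] + sum_k a[k-1] * x[n-k]."""
    x = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n >= k:
                acc += ak * x[n - k]
        x.append(acc)
    return x


def abs_search(target, a, candidates):
    """Return (index, squared error) of the best-matching candidate."""
    best_index, best_error = -1, float("inf")
    for index, candidate in enumerate(candidates):
        synthetic = synthesize(a, candidate)
        error = sum((t - s) ** 2 for t, s in zip(target, synthetic))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


# Hypothetical three-entry excitation codebook and a one-tap filter.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
a = [0.5]
target = synthesize(a, codebook[2])
print(abs_search(target, a, codebook))  # (2, 0.0)
```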
G.728 LD-CELP
  • CELP codecs
    • A filter; its characteristics change over time
    • A codebook of acoustic vectors
      • A vector = a set of elements representing various characteristics of the excitation
    • Transmit
      • Filter coefficients, gain, a pointer to the vector chosen
  • Low Delay CELP
    • Backward-adaptive coder
      • Use previous samples to determine filter coefficients
      • Operates on five samples at a time
        • Delay < 1 ms
      • Only the pointer is transmitted
    • 1024 vectors in the codebook
      • 10-bit pointer (index)
      • 16 kbps
  • LD-CELP encoder
    • Minimize a frequency-weighted mean-square error
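
The G.728 figures fit together arithmetically. The short check below assumes the usual 8 kHz telephony sampling rate, which this slide does not state explicitly; the constant names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 8000    # assumed telephony sampling rate
VECTOR_SAMPLES = 5       # samples coded per codebook index
CODEBOOK_SIZE = 1024     # vectors in the codebook

index_bits = int(math.log2(CODEBOOK_SIZE))             # 10-bit pointer
vectors_per_second = SAMPLE_RATE_HZ / VECTOR_SAMPLES   # 1600 indices per second
bit_rate_kbps = index_bits * vectors_per_second / 1000
vector_duration_ms = 1000 * VECTOR_SAMPLES / SAMPLE_RATE_HZ

print(index_bits, bit_rate_kbps, vector_duration_ms)  # 10 16.0 0.625
```

Five samples span 0.625 ms, consistent with the sub-millisecond delay quoted above, and one 10-bit index per vector gives 16 kbps.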
  • LD-CELP decoder
    • An MOS score of about 3.9
    • One-quarter of G.711 bandwidth

G.723.1 ACELP

  • 6.3 or 5.3 kbps
    • Both mandatory
    • Can change from one to another during a conversation
  • The coder
    • A band-limited input speech signal
    • Sampled at 8 kHz, 16-bit uniform PCM quantization
    • Operates on blocks of 240 samples (30 ms) at a time
    • A look-ahead of 7.5 ms
    • A total algorithmic delay of 37.5 ms (30 ms + 7.5 ms), plus other delays
    • A high-pass filter to remove any DC component
  • G.723.1 Annex A
    • Silence Insertion Descriptor (SID) frames of size four octets
  • The two lsbs of the first octet
    • 00: 6.3 kbps, 24 octets/frame
    • 01: 5.3 kbps, 20 octets/frame
    • 10: SID frame, 4 octets/frame
  • An MOS of about 3.8
    • At least 37.5 ms delay
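
A receiver can tell the frame type, and hence the frame length, from the two least significant bits of the first octet. The sketch below encodes only the three values listed above; the table and function names are hypothetical, and the fourth bit pattern is rejected because the slide does not describe it.

```python
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}


def classify_frame(first_octet):
    """Return (description, frame size in octets) for a G.723.1 frame."""
    code = first_octet & 0b11             # keep the two least significant bits
    if code not in FRAME_TYPES:
        raise ValueError("frame type 11 is not listed on the slide")
    return FRAME_TYPES[code]


print(classify_frame(0b10101100))  # ('6.3 kbps', 24)
print(classify_frame(0b00000010))  # ('SID', 4)

# Sanity check: 24 octets every 30 ms is 6.4 kbps, close to the nominal 6.3.
print(24 * 8 / 30)
```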

G.729

  • 8 kbps
  • Input frames of 10 ms, 80 samples for 8 kHz sampling rate
  • 5 ms look-ahead
    • Algorithmic delay of 15 ms
  • An 80-bit frame for 10 ms of speech
  • A complex codec
    • G.729A (Annex A), a number of simplifications
    • Same frame structure
    • A G.729 encoder can be paired with a G.729A decoder, and vice versa
    • Slightly lower quality
  • G.729B (Annex B)
    • VAD, Voice Activity Detection
      • Based on analysis of several parameters of the input
      • The current frame plus the two preceding frames
    • DTX, Discontinuous Transmission
      • Send nothing or send an SID frame
      • SID frame contains information to generate comfort noise
    • CNG, Comfort Noise Generation
  • G.729, an MOS of about 4.0
  • G.729A, an MOS of about 3.7
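
How VAD and DTX interact can be sketched with a toy decision loop. It assumes a plain energy threshold and sends one SID frame at the start of each silent stretch; the real G.729B detector uses several input parameters and its own rules for when to send SID frames, so every name and constant here is illustrative.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def dtx_decisions(frames, threshold=1e-3):
    """Decide per frame whether to send speech, an SID frame, or nothing."""
    decisions = []
    previous_active = True
    for frame in frames:
        active = frame_energy(frame) >= threshold   # toy voice activity test
        if active:
            decisions.append("speech")
        elif previous_active:
            decisions.append("SID")       # describe the background noise once
        else:
            decisions.append("nothing")   # receiver keeps generating comfort noise
        previous_active = active
    return decisions


# Four 10 ms frames of 80 samples: talk, two silent frames, talk again.
frames = [[0.5] * 80, [0.0] * 80, [0.0] * 80, [0.5] * 80]
print(dtx_decisions(frames))  # ['speech', 'SID', 'nothing', 'speech']
```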

Other Codecs

  • CDMA QCELP defined in IS-733
    • Variable-rate coder
    • Two most common rates
      • The high rate, 13.3 kbps
      • A lower rate, 6.2 kbps
    • Silence suppression
    • For use with RTP, RFC 2658
  • GSM Enhanced Full-Rate (EFR)
    • GSM 06.60
    • An enhanced version of GSM Full-Rate
    • ACELP-based codec
    • The same bit rate and the same overall packing structure
      • 12.2 kbps
    • Support discontinuous transmission
    • For use with RTP, RFC 1890
  • GSM Adaptive Multi-Rate (AMR) codec
    • GSM 06.90
    • Eight different modes
    • 4.75 kbps to 12.2 kbps
    • 12.2 kbps, GSM EFR
    • 7.4 kbps, IS-641 (TDMA cellular systems)
    • Change the mode at any time
    • Offer discontinuous transmission
    • The coding choice of many 3G wireless networks
  • The MOS values are for laboratory conditions
    • G.711 does not deal with lost packets
    • G.729 can accommodate a lost frame by interpolating from previous frames
      • But this causes errors in subsequent speech frames
  • Processing Power
    • G.728 or G.729, 40 MIPS
    • G.726, 10 MIPS

Cascaded Codecs

    • E.g., a G.711 stream passed through a G.729 encoder/decoder
    • The resulting quality might not even come close to that of G.729 alone
  • Each coder generates only an approximation of the incoming signal

Tones, Signal, and DTMF Digits

  • The hybrid codecs are optimized for human speech
    • Other data may need to be transmitted
    • Tones: fax tones, dialing tone, busy tone
    • DTMF digits for two-stage dialing or voice-mail
  • G.711 is OK
  • G.723.1 and G.729 can be unintelligible
  • The ingress gateway needs to intercept
    • The tones and DTMF digits
    • Use an external signaling system
  • Easy at the start of a call
    • Difficult in the middle of a call
  • Encode the tones differently from the speech
    • Send them along the same media path
    • An RTP packet provides the name of the tone and the duration
    • Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
    • RFC 2198
      • An RTP payload format for redundant audio data
      • Sending both types of RTP payload

RTP Payload Format for DTMF Digits

    • An Internet Draft
    • Both methods described before
    • A large number of tones and events
      • DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
  • The named events
    • E: the end of the tone, R: reserved
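
A named-event payload can be sketched as a four-octet structure. The layout below is an assumption: it follows the format later published as RFC 2833 (8-bit event, E bit, R bit, 6-bit volume, 16-bit duration), which may differ from the Internet Draft the slide refers to. The function names are hypothetical.

```python
import struct


def pack_named_event(event, end, volume, duration):
    """Pack event code, E bit, volume, and duration into four octets."""
    if not (0 <= event <= 255 and 0 <= volume <= 63 and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    flags = (0x80 if end else 0x00) | volume   # R (reserved) bit stays 0
    return struct.pack("!BBH", event, flags, duration)


def unpack_named_event(payload):
    """Return (event, end, volume, duration) from a four-octet payload."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration


# DTMF digit 5, final packet of the tone, volume 10, duration 800 timestamp units.
payload = pack_named_event(5, True, 10, 800)
print(payload.hex())                 # 058a0320
print(unpack_named_event(payload))   # (5, True, 10, 800)
```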
Discrete Time LTI Systems: The Convolution Sum

[Figure: plots of h[n], x[n], and y[n] against n, illustrating the convolution sum]