Using Speech Recognition to Predict VoIP Quality. Wenyu Jiang, IRT Lab, April 3, 2002.

Using Speech Recognition to Predict VoIP Quality

Wenyu Jiang


April 3, 2002

Introduction to Voice Quality
  • Quality factors in Voice over IP (VoIP)
    • Packet loss, delay, and jitter
    • Choice of voice codec
  • Quality metric: Mean Opinion Score
    • Widely used
    • Human based
      • Time consuming
      • Labor intensive
      • Results not available in real time
  • Features of a speech recognizer:
    • Automatic speech recognition (ASR), no human listeners needed
    • Recognition accuracy appears to be coupled with the quality of the input speech
    • Recognition can be done in real time, allowing online quality monitoring.
    • Recognition performance may be related to speech intelligibility as well as quality.
Related Work
  • ITU-T E-model [G.107/G.108]
    • An analytical model for estimating perceived quality
    • Provides loss-to-MOS mapping for some common codecs (G.729, G.711, G.723.1).
  • Chernick et al. studied speech recognition performance with the DoD-CELP codec
    • Effect of bit error rate instead of packet loss
    • Phoneme (instead of word) recognition ratio
    • Some MOS results, but not accurate enough
Experiment Setup
  • Speech recognition engine
    • IBM ViaVoice on Linux
    • Wrote software for both voice model training and performance testing
  • Training and Testing
    • Two scripts: #1 for training, #2 for testing.
    • Two speakers, A and B; each reads both scripts.
      • Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%)
    • Codec: G.729
    • Training uses G.729-processed audio
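The test layout above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which clips were assigned to which loss condition, nor which loss model was used, so the consecutive assignment and the independent (Bernoulli) packet drops below are hypothetical choices, as are the function names.

```python
import random

LOSS_CONDITIONS = [0.0, 0.02, 0.05, 0.10, 0.15]  # loss conditions listed on the slide
CLIPS_PER_CONDITION = 5                           # 25 clips / 5 conditions


def assign_conditions(num_clips=25):
    """Map clip index -> loss probability.

    Assumption: consecutive clips share a condition; the talk does not
    state the actual assignment.
    """
    return {i: LOSS_CONDITIONS[i // CLIPS_PER_CONDITION] for i in range(num_clips)}


def drop_packets(packets, p, seed=0):
    """Apply independent random loss with probability p to a packet sequence.

    Assumption: losses are independent (Bernoulli). Lost packets are
    replaced by None so that a decoder could apply loss concealment.
    """
    rng = random.Random(seed)
    return [None if rng.random() < p else pkt for pkt in packets]
```

A fixed seed keeps a given loss pattern reproducible across runs, which is convenient when the same degraded clip must be fed to both the recognizer and human listeners.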
Experiment Setup, contd.
  • Performance metric
    • Absolute word recognition ratio
    • Relative word recognition ratio
      • p is packet loss probability
  • MOS listening tests: 22 listeners
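The formulas for the two ratios did not survive in this transcript. A plausible reconstruction, assuming the usual word-level definition and the normalization by the zero-loss baseline Rabs(0%) that the Application Scenarios slide relies on, would be:

```latex
R_{abs}(p) = \frac{\text{number of correctly recognized words at loss } p}
                  {\text{total number of words in the script}},
\qquad
R_{rel}(p) = \frac{R_{abs}(p)}{R_{abs}(0\%)}
```

Under this reading, Rrel(0%) = 1 for every speaker by construction, which is consistent with the later claim that Rrel is speaker-independent.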
Recognition Ratio vs. MOS
  • Both MOS and Rabs decrease as the loss probability p increases
  • Eliminating the intermediate variable p then gives a direct mapping from Rabs to MOS
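Eliminating p amounts to joining the two loss-indexed curves on their common loss values. A minimal sketch, assuming both curves are available as tables keyed by p; the function name and the numbers are placeholders, not the measurements from the talk:

```python
def eliminate_p(mos_by_loss, rabs_by_loss):
    """Join two loss-indexed tables on p; return (Rabs, MOS) pairs sorted by Rabs."""
    common = sorted(set(mos_by_loss) & set(rabs_by_loss))
    return sorted((rabs_by_loss[p], mos_by_loss[p]) for p in common)


# Placeholder numbers for illustration only; NOT the results from the talk.
mos_by_loss = {0.00: 4.0, 0.05: 3.0, 0.15: 2.0}
rabs_by_loss = {0.00: 0.40, 0.05: 0.30, 0.15: 0.10}
pairs = eliminate_p(mos_by_loss, rabs_by_loss)
```

Each resulting pair is one point on the Rabs-to-MOS curve; p no longer appears.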
Properties of ASR Performance
  • When loss probability is low
    • Recognition ratio changes slowly
    • Possibly due to robustness in ViaVoice
    • MOS prediction is less accurate in this case
  • Importance of voice training method
    • Training audio should use the same codec as the test audio
Speaker Dependence in ASR
  • ViaVoice SDK cites a 90% accuracy for
    • Average speaker without a heavy accent
    • Sampling at 22 kHz, 16-bit linear PCM
  • For speaker A, we achieved
    • About 42% accuracy with no packet loss
    • Reasons:
      • 8 kHz sampling + G.729 compression
      • Accent + speaking rate
    • Low absolute accuracy does not interfere with MOS prediction, but speaker dependence needs to be checked
Speaker Dependence Check
  • Absolute recognition ratio depends on the speaker
    • 70% for speaker B, but 42% for speaker A
  • But the relative recognition ratio Rrel is universal and speaker-independent
Rrel as Universal MOS Predictor
  • Mapping from relative recognition ratio Rrel to MOS
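The mapping itself was shown as a figure that is not in this transcript. One way to turn measured (Rrel, MOS) calibration points into a predictor is piecewise-linear interpolation. This is a sketch of that swapped-in technique, not the curve fitted in the talk, and `make_mos_predictor` is a hypothetical name.

```python
from bisect import bisect_left


def make_mos_predictor(calibration):
    """Build a predictor from (r_rel, mos) calibration pairs.

    Assumption: piecewise-linear interpolation between calibration points,
    clamped to the end values outside the calibrated range.
    """
    points = sorted(calibration)
    if len(points) < 2:
        raise ValueError("need at least two calibration points")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def predict(r_rel):
        if r_rel <= xs[0]:
            return ys[0]
        if r_rel >= xs[-1]:
            return ys[-1]
        # xs[i - 1] < r_rel <= xs[i], so the segment has nonzero width
        i = bisect_left(xs, r_rel)
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (r_rel - x0) / (x1 - x0)

    return predict
```

The calibration pairs would come from listening tests such as the 22-listener test described earlier; none are hard-coded here because the measured values are not in the transcript.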
Human Recognition Results
  • Listeners are asked to transcribe what they hear in addition to MOS grading.
  • Human recognition result curves are less “smooth” than MOS curves.
Human Results, contd.
  • Two flat regions in the loss vs. human recognition curve
    • 2-5% loss (some loss but not very high)
    • 10-15% loss (loss is already too high)
  • Mapping between machine and human recognition performance
Application Scenarios
  • Sender transmits a pre-recorded audio clip of a speaker known to receiver.
  • Receiver does the following:
    • Looks up Rabs(0%) for this speaker
    • Performs speech recognition
    • Compares the result to the original text and computes Rrel
  • No need to store the original audio clip
    • Just the text is sufficient, so less storage is needed
  • Need not know the packet loss probability
    • Suitable for end-to-end black-box measurements
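The receiver-side steps above can be sketched as follows. The recognizer output is taken as a plain string, since the ViaVoice API is not shown in the slides. The word-matching rule (an approximate alignment via Python's `difflib`) and the function names are assumptions made for illustration; the talk does not say how recognized words were counted.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def absolute_ratio(reference_text, recognized_text):
    """Fraction of reference words that the recognizer output matches in order."""
    ref = words(reference_text)
    hyp = words(recognized_text)
    if not ref:
        raise ValueError("reference text contains no words")
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def relative_ratio(reference_text, recognized_text, r_abs_zero_loss):
    """Normalize by the speaker's stored zero-loss baseline Rabs(0%)."""
    if r_abs_zero_loss <= 0:
        raise ValueError("zero-loss baseline must be positive")
    return absolute_ratio(reference_text, recognized_text) / r_abs_zero_loss
```

Only the reference text and one baseline number per speaker need to be stored at the receiver, which matches the storage argument on the slide.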
Conclusions
  • Evaluation of speech recognition performance as a MOS predictor
  • Used ViaVoice speech engine
  • Performance metric: word recognition ratio
  • The relative word recognition ratio is a universal, speaker-independent metric
  • Also analyzed human recognition performance
  • Future work: evaluate other codecs, e.g., G.726, GSM.