
Using Speech Recognition to Predict VoIP Quality

Wenyu Jiang, IRT Lab. April 3, 2002.


Presentation Transcript


  1. Using Speech Recognition to Predict VoIP Quality • Wenyu Jiang, IRT Lab • April 3, 2002

  2. Introduction to Voice Quality • Quality factors in Voice over IP (VoIP) • Packet loss, delay, and jitter • Choice of voice codec • Quality metric: Mean Opinion Score (MOS) • Widely used • Human-based • Time-consuming • Labor-intensive • Results not available in real time

  3. Motivation • Features of a speech recognizer: • Automatic speech recognition (ASR), no human listeners needed • Accuracy of recognition is apparently coupled with the quality of input speech • Recognition can be done in real-time, allowing online quality monitoring. • Recognition performance may be related to speech intelligibility as well as quality.

  4. Related Work • ITU-T E-model [G.107/G.108] • An analytical model for estimating perceived quality • Provides loss-to-MOS mapping for some common codecs (G.729, G.711, G.723.1) • Chernick et al. studied speech recognition performance with the DoD-CELP codec • Effect of bit error rate instead of packet loss • Phoneme (instead of word) recognition ratio • Some MOS results, but not accurate enough

  5. Experiment Setup • Speech recognition engine • IBM ViaVoice on Linux • Wrote software for both voice model training and performance testing • Training and testing • 2 scripts: #1 for training, #2 for testing • 2 speakers, A and B; each reads both scripts • Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%) • Codec: G.729 • Training with G.729-processed audio
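
The clip layout on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say how clips were assigned to loss conditions or how loss was generated, so the consecutive grouping and the independent (Bernoulli) loss model below are assumptions, and `assign_clips` and `drop_packets` are hypothetical names.

```python
import random
from typing import Dict, List, Optional, Sequence

# Loss conditions listed on the slide, as probabilities.
LOSS_RATES = [0.0, 0.02, 0.05, 0.10, 0.15]


def assign_clips(num_clips: int = 25) -> Dict[float, List[int]]:
    """Give each loss rate an equal, consecutive share of the clip indices.

    Assumption: the slide only says "5 clips per loss condition"; the
    consecutive grouping is chosen here for simplicity.
    """
    per_rate, remainder = divmod(num_clips, len(LOSS_RATES))
    if remainder:
        raise ValueError("clips must divide evenly across loss conditions")
    return {
        rate: list(range(i * per_rate, (i + 1) * per_rate))
        for i, rate in enumerate(LOSS_RATES)
    }


def drop_packets(packets: Sequence[bytes], loss_rate: float,
                 rng: random.Random) -> List[Optional[bytes]]:
    """Mark each packet as lost (None) independently with probability loss_rate.

    Assumption: independent loss; the presentation may have used another model.
    """
    return [None if rng.random() < loss_rate else pkt for pkt in packets]
```

Passing an explicit `random.Random` instance keeps a loss pattern reproducible across runs, which matters when the same degraded clip is played to both the recognizer and human listeners.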

  6. Experiment Setup, contd. • Performance metric • Absolute word recognition ratio Rabs • Relative word recognition ratio Rrel • p is the packet loss probability • MOS listening tests: 22 listeners
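
The formulas for the two metrics did not survive in this transcript. A minimal sketch, assuming the absolute ratio is the fraction of script words recognized correctly and the relative ratio is Rabs(p) divided by Rabs(0%) (the latter is consistent with the "Looks up Rabs(0%)" step on slide 14). The word alignment via Python's `difflib` is an illustrative choice, not necessarily how the presentation's software scored words.

```python
from difflib import SequenceMatcher


def absolute_recognition_ratio(reference: str, recognized: str) -> float:
    """Fraction of reference words that also appear, in order, in the output.

    Assumption: words are compared case-insensitively after whitespace
    splitting, and matches are found with difflib's sequence alignment.
    """
    ref_words = reference.lower().split()
    hyp_words = recognized.lower().split()
    if not ref_words:
        raise ValueError("reference text is empty")
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


def relative_recognition_ratio(r_abs_p: float, r_abs_zero: float) -> float:
    """Assumed definition: Rrel(p) = Rabs(p) / Rabs(0%)."""
    if r_abs_zero <= 0:
        raise ValueError("zero-loss ratio must be positive")
    return r_abs_p / r_abs_zero
```

Normalizing by the zero-loss ratio is what lets two speakers with different baseline accuracy be compared on the same scale.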

  7. Recognition Ratio vs. MOS • Both MOS and Rabs decrease as the packet loss probability p increases • Eliminating the intermediate variable p then gives a direct mapping from Rabs to MOS
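
Eliminating p amounts to joining the two curves on their shared loss values. A minimal sketch, assuming each curve is stored as a dictionary keyed by loss probability; `pair_by_loss` is a hypothetical helper and no measured values from the study are included.

```python
from typing import Dict, List, Tuple


def pair_by_loss(mos_by_loss: Dict[float, float],
                 ratio_by_loss: Dict[float, float]) -> List[Tuple[float, float]]:
    """Join two loss-indexed curves on p and drop p from the result.

    Returns (recognition ratio, MOS) pairs sorted by recognition ratio,
    using only the loss values present in both inputs.
    """
    common = set(mos_by_loss) & set(ratio_by_loss)
    return sorted((ratio_by_loss[p], mos_by_loss[p]) for p in common)
```

The resulting pairs no longer mention p, so they can be used where the loss probability is unknown.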

  8. Properties of ASR Performance • When loss probability is low • Recognition ratio changes slowly • Possibly due to robustness in ViaVoice • Less accurate MOS prediction in such cases • Importance of voice training method • Training audio should use the same codec as the test audio

  9. Speaker Dependence in ASR • ViaVoice SDK cites 90% accuracy for • An average speaker without a heavy accent • Sampling at 22 kHz, PCM linear-16 • For speaker A, we achieved • About 42% accuracy with no packet loss • Reasons: • 8 kHz sampling + G.729 compression • Accent + speaking rate • The low accuracy does not interfere with MOS prediction, but speaker dependence needs to be checked

  10. Speaker Dependence Check • Absolute recognition ratio Rabs depends on the speaker • 70% for speaker B, but 42% for speaker A • But the relative recognition ratio Rrel is universal and speaker-independent

  11. Rrel as Universal MOS Predictor • Mapping from relative recognition ratio Rrel to MOS
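
The slide's plot is not part of this transcript, so the curve itself cannot be reproduced here. One way to turn measured (Rrel, MOS) points into a predictor is piecewise-linear interpolation; this is an illustrative choice, not necessarily the fit used in the presentation. `predict_mos` and its `points` argument are hypothetical names, and no measurement values are included.

```python
from typing import List, Tuple


def predict_mos(r_rel: float, points: List[Tuple[float, float]]) -> float:
    """Interpolate MOS linearly between measured (Rrel, MOS) points.

    Values outside the measured range are clamped to the nearest endpoint.
    """
    if not points:
        raise ValueError("at least one (Rrel, MOS) point is required")
    pts = sorted(points)
    if r_rel <= pts[0][0]:
        return pts[0][1]
    if r_rel >= pts[-1][0]:
        return pts[-1][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= r_rel <= x1:
            if x1 == x0:
                return y0
            return y0 + (y1 - y0) * (r_rel - x0) / (x1 - x0)
    raise AssertionError("unreachable: r_rel lies within the sorted range")
```

Clamping rather than extrapolating avoids predicting a MOS outside the range that listeners actually reported.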

  12. Human Recognition Results • Listeners are asked to transcribe what they hear in addition to MOS grading. • Human recognition result curves are less “smooth” than MOS curves.

  13. Human Results, contd. • Two flat regions in the loss vs. human recognition curve • 2-5% loss (some loss but not very high) • 10-15% loss (loss is already too high) • Mapping between machine and human recognition performance

  14. Application Scenarios • Sender transmits a pre-recorded audio clip of a speaker known to the receiver • Receiver does the following: • Looks up Rabs(0%) for this speaker • Performs speech recognition • Compares the result with the original text and computes Rrel • No need to store the original audio clip • Just the text is sufficient → less storage • Need not know the packet loss probability • Suitable for end-to-end black-box measurements
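
The receiver-side steps can be sketched as a single function. This is a hedged illustration: `recognize` is a hypothetical callable standing in for whatever speech engine is used (the ViaVoice API is not shown in the slides), `baselines` is an assumed per-speaker table of Rabs(0%), and the `difflib` word alignment and the definition Rrel = Rabs / Rabs(0%) are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Mapping


def word_ratio(reference: str, recognized: str) -> float:
    """Fraction of reference words matched, in order, by the recognizer output."""
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    blocks = SequenceMatcher(a=ref, b=hyp, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / len(ref)


def receiver_rrel(speaker: str,
                  received_audio: bytes,
                  original_text: str,
                  baselines: Mapping[str, float],
                  recognize: Callable[[bytes], str]) -> float:
    """Compute Rrel at the receiver without knowing the packet loss rate."""
    r_abs_zero = baselines[speaker]             # look up Rabs(0%) for this speaker
    recognized = recognize(received_audio)      # run speech recognition
    r_abs = word_ratio(original_text, recognized)  # compare with the stored text
    return r_abs / r_abs_zero
```

Only the script text and one number per speaker are stored at the receiver, which matches the slide's point that the original audio clip is not needed.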

  15. Conclusions • Evaluation of speech recognition performance as a MOS predictor • Used ViaVoice speech engine • Performance metric: word recognition ratio • The relative word recognition ratio is a universal, speaker-independent metric • Also analyzed human recognition performance • Future work: evaluate other codecs, e.g., G.726, GSM.
