
LYU0103 Speech Recognition Techniques for Digital Video Library

This presentation outlines the project objectives, ViaVoice recognition experiments, the speech information processor, and audio information retrieval methods for a digital video library. It reviews the previous work, including audio extraction and segmentation, as well as the experiments conducted with IBM ViaVoice for real-time dictation. The experimental results and conclusions are presented, along with the development of a speech information processor for media playback, real-time dictation, timing information retrieval, and audio scene change detection. The presentation highlights the challenges faced in speech recognition and provides insights into future approaches.



Presentation Transcript


  1. LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor: Prof. Michael R. Lyu Students: Gao Zheng Hong, Lei Mo

  2. Outline of Presentation • Project objectives • ViaVoice recognition experiments • Speech information processor • Audio information retrieval • Summary

  3. Our Project Objectives • Speech recognition • Audio information retrieval

  4. Last Term’s Work • Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz) • Segmented the wave files into sentences by detecting their frame energy • Performed real-time dictation with IBM ViaVoice (a speech recognition engine developed by IBM) • Developed a visual training tool
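The frame-energy segmentation in the second bullet can be sketched as follows. This is a minimal illustration: the frame length and energy threshold are assumptions for the example, not the project's actual parameters.

```python
def frame_energy(samples, frame_len=512):
    """Split samples into non-overlapping frames; return each frame's mean-square energy."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def segment_by_energy(samples, frame_len=512, threshold=0.01):
    """Return (start_frame, end_frame) pairs of contiguous high-energy runs."""
    energies = frame_energy(samples, frame_len)
    segments, start = [], None
    for i, e in enumerate(energies):
        if e >= threshold and start is None:
            start = i                      # a speech run begins
        elif e < threshold and start is not None:
            segments.append((start, i))    # silence ends the run
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments
```

A real segmenter would also smooth the energy curve and merge runs separated by very short pauses so that one sentence is not split in two.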

  5. Visual Training Tool Video Window; Dictation Window; Text Editor

  6. IBM ViaVoice Experiments • Employed 7 student helpers • Produced transcripts of 77 news video clips • Four experiments: • Baseline measurement • Trained model measurement • Slow-down measurement • Indoor news measurement

  7. Baseline Measurement • To measure ViaVoice recognition accuracy on TVB news video • Testing set: 10 video clips • The segmented wave files are dictated • Employ the Hidden Markov Model Toolkit (HTK) to measure the accuracy
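HTK's accuracy figure is Acc = (N − S − D − I)/N, computed over a minimum-edit-distance alignment of the reference and recognized transcripts. A minimal sketch of that computation (a plain Levenshtein alignment, not HTK's HResults tool itself):

```python
def word_accuracy(reference, hypothesis):
    """Acc = (N - S - D - I) / N, where S + D + I is the minimum edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return (len(ref) - dp[len(ref)][len(hyp)]) / len(ref)
```

For Chinese transcripts the same computation is usually run per character rather than per word.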

  8. Trained Model Measurement • To measure the accuracy of ViaVoice after training it with its own correctly recognized words • 10 video clips are segmented and dictated • The correctly dictated words of the training set are used to train ViaVoice via the SMAPI function SmWordCorrection • Repeat the procedures of “baseline measurement” after training to get the recognition performance • Repeat the procedure with 20 video clips

  9. Slow-Down Measurement • Investigate the effect of slowing down the audio channel • Resample the segmented wave files in the testing set at ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6 • Repeat the procedures of “baseline measurement”
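Slowing down by a ratio r amounts to stretching the waveform to r times its original length. A sketch using linear-interpolation resampling (the slides do not say which resampler was actually used, so this is an illustrative stand-in):

```python
def stretch(samples, ratio):
    """Resample so the output is `ratio` times the original duration.
    Played back at the original rate, the audio is slower (and lower-pitched)."""
    out_len = int(len(samples) * ratio)
    out = []
    for j in range(out_len):
        pos = j / ratio                    # fractional position in the original
        i = int(pos)
        frac = pos - i
        nxt = samples[i + 1] if i + 1 < len(samples) else samples[i]
        out.append((1 - frac) * samples[i] + frac * nxt)
    return out
```

Note that plain resampling lowers the pitch as well as the speed; a time-scale modification algorithm would be needed to slow speech down at constant pitch.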

  10. Indoor News Measurement • Eliminate the effect of noise • Select the indoor news reporter sentences • Dictate the test set using the untrained model • Repeat the procedure using the trained model

  11. Experimental Results Overall recognition results (ViaVoice, TVB News)

  12. Experimental Results (Cont.) Result of the trained model with different numbers of training videos; result of using different slow-down ratios

  13. Analysis of Experimental Results • Trained model: about 1% accuracy improvement • Slowing down speeches: about 1% accuracy improvement • Indoor speeches are recognized much better • Mandarin: estimated baseline accuracy is about 70% (much higher than for Cantonese)

  14. Experiment Conclusions • Four reasons for the low accuracy: • Language model mismatch • Voice channel mismatch • The broadcast speech is very fast and some characters are unclear • The audio of the video clips is too loud • The first two reasons are the most critical ones

  15. Speech Recognition Approach • We cannot do much acoustic model training with the ViaVoice API • Training is speaker-dependent • There is a great difference between the news audio and the training speech for ViaVoice • A tool to adapt the acoustic model is not currently available • Manual editing is necessary for producing correct subtitles

  16. Speech Information Processor (SIP) Media player, Text editor, Audio information panel

  17. Main Features • Media playback • Real-time dictation • Word timing information • Dynamic recognition text editing • Audio scene change detection • Audio segment classification • Gender classification

  18. System Chart

  19. Timing Information Retrieval • Use the ViaVoice Speech Manager API (SMAPI) • Asynchronous callbacks • The recognized text is organized into basic units called “firm words” • SIP builds an index storing the position and time of each firm word • The corresponding firm word is highlighted during video playback
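The firm-word index can be pictured as follows. The field names and the lookup used for highlighting are assumptions for illustration, not actual SMAPI structures.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str
    char_pos: int     # offset of the word in the editor text
    start_ms: int     # start time reported by the engine, in milliseconds

class WordIndex:
    def __init__(self):
        self.words = []                    # appended in order, so sorted by start_ms

    def add(self, word):
        self.words.append(word)

    def word_at_time(self, t_ms):
        """Return the firm word being spoken at playback time t_ms (for highlighting)."""
        times = [w.start_ms for w in self.words]
        i = bisect_right(times, t_ms) - 1
        return self.words[i] if i >= 0 else None
```

During playback, SIP would call something like `word_at_time` on each timer tick and highlight the span starting at the returned word's `char_pos`.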

  20. Highlight words during playback

  21. Dynamic Index Alignment • While the recognized result is being edited, the firm word structure might change • The word index needs to be updated accordingly • SIP captures the WM_CHAR event of the text editor • It then searches for the modified words and updates the corresponding entries in the index • In practice, binary search provides good response time
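A sketch of the update step: binary-search the sorted word positions for the edit point, then shift every later entry by the length change. The data layout here is assumed for illustration.

```python
from bisect import bisect_right

def realign(index, edit_pos, length_delta):
    """index: list of [char_pos, word] entries sorted by char_pos.
    An edit at edit_pos inserted (delta > 0) or deleted (delta < 0)
    characters; only entries after the edit point need to move."""
    i = bisect_right([pos for pos, _ in index], edit_pos)   # binary search
    for entry in index[i:]:
        entry[0] += length_delta
    return index
```

The binary search makes locating the affected word O(log n) per keystroke; only the positional shift is linear.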

  22. Time Index Alignment Example Before Editing Editing After Editing

  23. Audio Information Panel • The entire clip is divided into segments separated by audio scene changes • SIP classifies the segments into three categories: male, female, and non-speech • Click a segment to preview it

  24. Audio Information Retrieval

  25. Detection of Audio Scene Changes: Motivations • Segments with different properties can be handled differently • Unsupervised learning can be applied to the different clusters • It serves as an assisting tool for video scene change detection

  26. Bayesian Information Criterion (BIC) • Gaussian distribution: models the input stream • Maximum likelihood: detects turns • BIC: makes the decision

  27. Principle of BIC • The Bayesian information criterion (BIC) is a likelihood criterion • Its main principle is to penalize the likelihood by the model complexity

  28. Detection of a Single Point Change Using BIC • H0 (no change): x1, x2, …, xN ~ N(μ, Σ) • H1 (change at point i): x1, x2, …, xi ~ N(μ1, Σ1) and xi+1, xi+2, …, xN ~ N(μ2, Σ2) • The maximum likelihood ratio is defined as: R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|, where N1 = i and N2 = N − i

  29. Detection of a Single Point Change Using BIC • The difference between the BIC values of the two models can be expressed as: BIC(i) = R(i) − λP, with penalty P = (1/2)(d + d(d+1)/2) log N, where d is the feature dimension • If BIC(i) > 0, a scene change is detected at point i
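A worked one-dimensional sketch of the test above: for d = 1, |Σ| is just the sample variance, and the penalty reduces to P = (1/2)(1 + 1) log N = log N. Scalar data and λ = 1 are illustrative simplifications; real systems apply this to feature vectors such as MFCCs.

```python
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def bic_change(x, i, lam=1.0):
    """BIC(i) for a change between x[:i] and x[i:]; > 0 suggests a scene change."""
    n, n1, n2 = len(x), i, len(x) - i
    r = (n * math.log(variance(x))             # R(i) = N log|S| - N1 log|S1| - N2 log|S2|
         - n1 * math.log(variance(x[:i]))
         - n2 * math.log(variance(x[i:])))
    penalty = lam * 0.5 * (1 + 0.5 * 1 * (1 + 1)) * math.log(n)   # d = 1
    return r - penalty
```

Intuitively, if both halves really follow the same Gaussian, splitting them buys almost no likelihood, R(i) stays near zero, and the penalty keeps BIC(i) negative.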

  30. Detection of Multiple Point Changes by BIC • a. Initialize the interval [a, b] with a = 1, b = 2 • b. Detect whether there is a changing point in the interval [a, b] using BIC • c. If there is no change in [a, b], let b = b + 1; else let t be the changing point detected, and assign a = t + 1, b = a + 1 • d. Go to step (b) if necessary
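Steps a–d above can be sketched end-to-end on one-dimensional data. As in the previous sketch, λ = 1 and the d = 1 penalty log N are simplifications, and the minimum-window and step sizes are assumptions.

```python
import math

def _var(xs):
    m = sum(xs) / len(xs)
    # small floor so log() stays defined for near-constant segments
    return max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-12)

def _change_point(x, a, b):
    """Best changing point in x[a:b] by BIC, or None if BIC(i) <= 0 for every i."""
    n = b - a
    if n < 4:
        return None
    best_i, best_bic = None, 0.0
    for i in range(a + 2, b - 1):              # keep >= 2 samples on each side
        r = (n * math.log(_var(x[a:b]))
             - (i - a) * math.log(_var(x[a:i]))
             - (b - i) * math.log(_var(x[i:b])))
        bic = r - math.log(n)                  # lambda = 1, P = log N for d = 1
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i

def detect_changes(x):
    """Steps a-d: grow [a, b] until a change is found, then restart after it."""
    changes, a, b = [], 0, 2
    while b <= len(x):
        t = _change_point(x, a, b)
        if t is None:
            b += 1                             # step c: no change, widen the window
        else:
            changes.append(t)                  # step c: record the change at t
            a, b = t + 1, t + 2                # restart just after the change
    return changes
```

Because the window is re-anchored after each detected point, the procedure finds the changes one at a time, left to right.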

  31. Advantages of BIC approach • Robustness • Thresholding-free • Optimality

  32. Comparison of different algorithms

  33. Gender Classification: Motivation and Purpose • Allowing different speech analysis algorithms for each gender • Facilitating speech recognition by cutting the search space in half • Helping us build gender-dependent recognition models and train the system better

  34. Gender Classification Male Female

  35. Speech/Non-Speech Classification • Motivation • One method we used: pitch tracking
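One plausible realization of the pitch-tracking idea (the slides do not give the algorithm, so this is a sketch under stated assumptions): a frame counts as voiced speech if its normalized autocorrelation has a strong peak at a lag corresponding to a human pitch, taken here as roughly 50–400 Hz; the 0.6 threshold is likewise an assumption.

```python
def has_pitch(frame, sample_rate=22050, fmin=50.0, fmax=400.0, thresh=0.6):
    """True if the frame shows a strong periodic component in the pitch range."""
    energy = sum(s * s for s in frame)
    if energy == 0:
        return False
    lo = int(sample_rate / fmax)               # shortest candidate pitch period
    hi = min(int(sample_rate / fmin), len(frame) - 1)
    best = 0.0
    for lag in range(lo, hi + 1):
        ac = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        best = max(best, ac / energy)          # normalized autocorrelation
    return best >= thresh
```

A clip segment would then be labelled speech when a sufficient fraction of its frames are voiced, and non-speech (music, street noise) otherwise.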

  36. Speech/Non-Speech classification Speech Non-Speech

  37. Summary • ViaVoice training experiments • Speech recognition editing • Dynamic index alignment • Audio scene change detection • Speech classification • Integrated the above functions into a speech processor

  38. Q & A
