
Dynamic Captioning: Video Accessibility Enhancement for Hearing Impairment



  1. Dynamic Captioning: Video Accessibility Enhancement for Hearing Impairment Richang Hong, Meng Wang, Mengdi Xu†, Shuicheng Yan† and Tat-Seng Chua School of Computing, National University of Singapore, 117417, Singapore †Department of ECE, National University of Singapore

  2. Outline • Introduction • Processing • Face Detection, Tracking, Grouping • Script-Face Mapping • Non-Salient Region Detection • Script-Speech Alignment • Volume Analysis • Experiments • Conclusion

  3. Introduction • For hearing-impaired viewers, simply placing subtitles may lose the following information: • Emotion (volume changes) • Multiple people speaking simultaneously (cluttered subtitles) • Losing track of the subtitle (changes in speaking pace)

  4. Introduction • Dynamic Captioning • Sets up an indicator to represent speaking volume • Draws an arrow from the subtitle to the speaking mouth • Highlights the words being spoken

  5. Flowchart

  6. Script & Subtitle Alignment (“Hello! My name is... Buffy” – Automatic Naming of Characters in TV Video)[22]

  7. Face Detection, Tracking, Grouping • Face detector [17] • Robust foreground correspondence tracker [18]: link detections when the size of the overlap area in adjacent frames > threshold
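The overlap test used to link detections across adjacent frames can be sketched as below; this is a minimal illustration, not the tracker of [18] itself, and the threshold value and helper names are assumptions.

```python
def overlap_area(box_a, box_b):
    """Boxes are (x1, y1, x2, y2); returns intersection area in pixels."""
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0, w) * max(0, h)

def same_track(box_prev, box_cur, threshold=0.5):
    """Link two detections if their overlap exceeds a fraction of the
    smaller box's area (the fraction 0.5 is an assumed value)."""
    smaller = min((box_prev[2] - box_prev[0]) * (box_prev[3] - box_prev[1]),
                  (box_cur[2] - box_cur[0]) * (box_cur[3] - box_cur[1]))
    return overlap_area(box_prev, box_cur) > threshold * smaller
```

Requiring a large overlap works because faces move little between adjacent frames; detections that fail the test start a new track.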

  8. Script-Face Mapping Determine who the speaker is • Lip motion analysis [19] • Haar-feature-based cascade mouth detector (mouth region) • Compute the mean squared distance between mouth-region pixel values in every two consecutive frames • Set two thresholds to separate three states: {speaking, non-speaking, difficult to judge}
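The lip-motion test above can be sketched as follows: the mean squared distance (MSD) between the mouth regions of two consecutive frames is thresholded into the three states. The threshold values here are assumptions for illustration.

```python
import numpy as np

T_LOW, T_HIGH = 50.0, 200.0  # hypothetical thresholds

def mouth_state(prev_mouth, cur_mouth):
    """prev_mouth/cur_mouth: grayscale mouth-region arrays of equal shape.
    Large frame-to-frame change suggests lip motion, i.e. speaking."""
    msd = np.mean((cur_mouth.astype(float) - prev_mouth.astype(float)) ** 2)
    if msd > T_HIGH:
        return "speaking"
    if msd < T_LOW:
        return "non-speaking"
    return "difficult to judge"
```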

  9. Script-Face Mapping

  10. Script-Face Mapping • Extract SIFT descriptors from 9 facial keypoints (9x128 = 1152 dims) to form the facial feature vector • If only one person is speaking, the script and subtitle files confirm who it is, so the feature vector can be treated as high-confidence training data • If two or more persons are speaking, use the training data to identify the unknown faces (sparse representation classification [20])
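The semi-supervised step above can be sketched as below: single-speaker segments supply labeled facial feature vectors, and ambiguous faces are classified against them. The paper uses sparse representation classification [20]; a nearest-neighbor match is substituted here purely as a simpler stand-in.

```python
import numpy as np

def identify_speaker(query, train_feats, train_labels):
    """query: (d,) facial feature vector (1152-dim in the paper);
    train_feats: (n, d) high-confidence vectors from single-speaker
    segments; train_labels: their speaker names from the script."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    return train_labels[int(np.argmin(dists))]
```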

  11. Script-Face Mapping

  12. Non-Salient Region Detection (b): for each pixel, calculate the Gaussian distance between itself and its adjacent pixels. Lighter pixels represent more important regions
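A minimal sketch of this per-pixel energy: each pixel's saliency is the sum of Gaussian-weighted differences to its neighbors, so pixels that stand out from their surroundings score higher. The window radius and sigma are assumptions.

```python
import numpy as np

def pixel_energy(img, radius=2, sigma=1.0):
    """img: 2-D grayscale array; returns energies normalized to [0, 1],
    where higher (lighter) values mark more salient pixels."""
    energy = np.zeros_like(img, dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            weight = np.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
            # compare each pixel with the neighbor at offset (dy, dx)
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            energy += weight * np.abs(img.astype(float) - shifted)
    return energy / energy.max()
```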

  13. Non-Salient Region Detection • Partition the image into 5x5 grids (empirically) • Assign weight values to the blocks around the speaker's face block • Assign weight wi = 1 for the blocks immediately left/right of the talking block • Assign weight wi = 0.8 for the upper-left/upper-right/lower-left/lower-right blocks • For each block b, a saliency energy s (0 < s < 1) is computed by averaging all the normalized energies of the pixels within b • Calculate a score from the block weights and saliency energies • Insert captions in the region with the maximal score
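The block-scoring step above can be sketched as below. The exact score formula was on the original slide and is not reproduced here; the combination w * (1 - s) is an assumption that matches the stated goal of preferring non-salient blocks near the speaker.

```python
import numpy as np

def caption_block(saliency, face_row, face_col):
    """saliency: 5x5 array of per-block energies s in (0, 1);
    (face_row, face_col): the speaker's face block; returns the
    (row, col) of the block chosen for the caption."""
    rows, cols = saliency.shape
    weights = np.zeros_like(saliency, dtype=float)
    # w = 1.0 left/right of the talking block, 0.8 for the diagonals
    for dr, dc, w in [(0, -1, 1.0), (0, 1, 1.0),
                      (-1, -1, 0.8), (-1, 1, 0.8),
                      (1, -1, 0.8), (1, 1, 0.8)]:
        r, c = face_row + dr, face_col + dc
        if 0 <= r < rows and 0 <= c < cols:
            weights[r, c] = w
    scores = weights * (1.0 - saliency)  # assumed combination
    return np.unravel_index(np.argmax(scores), scores.shape)
```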

  14. Script-Speech Alignment

  15. Script-Speech Alignment • Use 39-dim MFCC features to describe each sound segment • Translate each word into a phonetic sequence with the CMU pronouncing dictionary • Run the SPHINX II recognition engine with the pronouncing dictionary • Take matched parts containing more than 3 (empirically) words as anchors • Repeat the matching while unmatched segments remain
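The anchor-finding step can be sketched as below: runs of more than three consecutive words where the recognizer output agrees with the script become reliable anchors, and the gaps between them are realigned in later passes. Python's `difflib.SequenceMatcher` is used here as a stand-in matcher; the function name is an assumption.

```python
from difflib import SequenceMatcher

def find_anchors(recognized, script, min_run=3):
    """recognized/script: lists of words; returns
    (recognized_start, script_start, length) triples for matching
    word runs longer than min_run words."""
    sm = SequenceMatcher(a=recognized, b=script, autojunk=False)
    return [(m.a, m.b, m.size) for m in sm.get_matching_blocks()
            if m.size > min_run]
```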

  16. Volume Analysis Symbolize and illustrate the voice volume: compute the power of the audio signal in a small local window (30 ms)
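The windowed power computation can be sketched as below; the sample rate and the use of non-overlapping windows are assumptions.

```python
import numpy as np

def window_power(samples, sample_rate=16000, window_ms=30):
    """Return one power value (mean squared amplitude) per
    non-overlapping 30 ms window; at 16 kHz that is 480 samples."""
    win = int(sample_rate * window_ms / 1000)
    n = len(samples) // win
    frames = np.asarray(samples[:n * win], dtype=float).reshape(n, win)
    return np.mean(frames ** 2, axis=1)
```

The resulting per-window powers drive the on-screen volume indicator.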

  17. Experiments

  18. Experiments

  19. Experiments

  20. Conclusion Contributions: Helps hearing-impaired audiences enjoy videos more Future Work: 1. Improve script-face mapping accuracy and extend the study to larger datasets 2. Deal with videos without scripts

  21. The End
