
Multimedia Retrieval



Presentation Transcript


  1. Multimedia Retrieval

  2. Outline • Audio Retrieval (spoken information, music) • Document Image Analysis and Retrieval • Video Retrieval

  3. A Taxonomy of Audio • Sound: Music, Speech, Other? • Music: Jazz, Country, Rock, Classical, Disco, Hip Hop • Classical: Choir, Orchestra, String Quartet, Piano • Speech: Sports Announcer, Male, Female

  4. Spoken Document Retrieval

  5. Spoken Document Retrieval

  6. Speech Recognition Knowledge Sources • Acoustic Modeling: describes the sounds that make up speech • Lexicon: describes which sequences of speech sounds make up valid words • Language Model: describes the likelihood of various sequences of words being spoken

  7. Speech Recognition in Brief (block diagram) • Speech → Signal Processing → Phonetic Probability Estimator (Acoustic Model) → Decoder → Words • The Decoder also draws on the Grammar (Language Model) and the Pronunciation Lexicon
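
The way the decoder combines these knowledge sources is commonly written as a Bayes decomposition. The block below is a standard textbook formulation rather than something stated on the slides; the symbols W (a candidate word sequence) and A (the acoustic signal) are notation assumed here for illustration.

```latex
\hat{W} \;=\; \arg\max_{W} \, p(W \mid A)
        \;=\; \arg\max_{W} \, \frac{p(A \mid W)\, p(W)}{p(A)}
        \;=\; \arg\max_{W} \, \underbrace{p(A \mid W)}_{\text{acoustic model}} \;
                              \underbrace{p(W)}_{\text{language model}}
```

Under this reading, the pronunciation lexicon supplies the mapping from each word in W to the speech sounds that the acoustic model scores, and p(A) can be dropped because it does not depend on W.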

  8. Hints For Better Recognition • Goal: improve the estimate of p(word|acoustic_signal) • Main idea: p(word|acoustic_signal) → p(word|acoustic_signal, X) • What could be X? • Topical information • News of the day • Image information • ?

  9. Hints For Better Recognition • Goal: improve the estimate of p(word|acoustic_signal) • Main idea: p(word|acoustic_signal) → p(word|acoustic_signal, X) • What could be X? • Topical information • News of the day • Image information: lip reading, Video Optical Character Recognition (VOCR)

  10. Speech Recognition Accuracy: Word Error Rate
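
Word error rate is usually computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.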

  11. Information Retrieval Precision vs. Speech Accuracy • [Plot: relative precision (% of text IR) against word error rate] • Retrieval degrades only slightly when the word error rate is below 30% • Source: Hauptmann, A., Wactlar, H., Indexing and Search of Multimodal Information, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich, Germany, April 1997.

  12. Spoken Document Retrieval • Segmentation issue • Continuous speech data without story boundaries • Typical segmentation approaches • Overlapping windows (30 sec for each segment) • Automatic detection of speaker changes
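
The overlapping-window approach can be sketched as follows. This assumes the recognizer returns each word with a start time in seconds; the 30-second window comes from the slide, while the 15-second step (50% overlap) and the function name `overlapping_windows` are assumptions chosen for illustration.

```python
def overlapping_windows(words, window_sec=30.0, step_sec=15.0):
    """Cut a time-stamped transcript into overlapping pseudo-documents.

    words: list of (word, start_time_sec) pairs sorted by time.
    Returns a list of (window_start, window_end, [words]) tuples.
    """
    if not words:
        return []
    segments = []
    last_time = words[-1][1]
    start = 0.0
    while start <= last_time:
        end = start + window_sec
        segment = [w for w, t in words if start <= t < end]
        if segment:  # skip windows that fall entirely in silence
            segments.append((start, end, segment))
        start += step_sec
    return segments


if __name__ == "__main__":
    transcript = [("a", 0.0), ("b", 10.0), ("c", 20.0), ("d", 40.0)]
    for seg in overlapping_windows(transcript):
        print(seg)
```

Because consecutive windows overlap, a story that straddles a window boundary still appears whole in at least one segment, at the cost of indexing some words twice.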

  13. Spoken Document Retrieval: Document Expansion • Motivation: recognized documents contain errors • Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents • Similar to query expansion

  14. Spoken Document Retrieval: Document Expansion • Motivation: recognized documents contain errors • Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents • Similar to query expansion • [Diagram: speech-recognized transcript → clean document collection (web docs) → top-ranked docs doc1, doc2, doc3, doc4 → find common words in the top-ranked docs]

  15. Spoken Document Retrieval: Document Expansion • Motivation: recognized documents contain errors • Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents • Similar to query expansion • Treat each speech document as a query • Find clean documents that are relevant to the speech document • Expand each speech document with the common words in the top-ranked clean documents
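
The three steps above can be sketched in a few lines. This is only an illustration of the idea, not the method of any cited paper: the cosine similarity on raw term counts, the parameters `top_k`, `min_docs` and `max_new`, and the function names are all assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_document(transcript, clean_docs, top_k=2, min_docs=2, max_new=5):
    """Return the transcript terms plus words shared by the top-ranked clean docs."""
    # Step 1: treat the (possibly erroneous) transcript as a query.
    query = Counter(transcript.lower().split())
    vectors = [Counter(d.lower().split()) for d in clean_docs]

    # Step 2: rank the clean documents against it and keep the best top_k.
    ranked = sorted(vectors, key=lambda v: cosine(query, v), reverse=True)[:top_k]

    # Step 3: add words that occur in at least min_docs of the top documents.
    doc_freq = Counter()
    for vec in ranked:
        doc_freq.update(set(vec))
    new_terms = sorted(t for t, n in doc_freq.items()
                       if n >= min_docs and t not in query)
    return transcript.lower().split() + new_terms[:max_new]


if __name__ == "__main__":
    clean = [
        "the president visits the white house today",
        "white house says president will travel",
        "stock markets fall sharply",
    ]
    # "hose" stands in for a recognition error on "house".
    print(expand_document("president visits white hose", clean))
```

In the example, the misrecognized word stays in the document, but the correct term is added alongside it, so a query containing that term can still match.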

  16. Document Expansion (Singhal & Pereira, 1999)

  17. A Taxonomy of Audio • Sound: Music, Speech, Other? • Music: Jazz, Country, Rock, Classical, Disco, Hip Hop • Classical: Choir, Orchestra, String Quartet, Piano • Speech: Sports Announcer, Male, Female

  18. Music Information Retrieval

  19. Music Retrieval • A textual retrieval approach • Using meta data: titles, artists, genres, … • Content-based music retrieval • Query by audio • Query by score document/segment

  20. Content-based Music Retrieval (system diagram) • On-line processing stages: Microphone, Signal input, Sampling (11 kHz), Center Clipping, Short-term Autocorrelation, Note Segmentation, Mid-level Representation, Similarity Comparison, Query results (ranked song list) • Off-line processing stages: Songs Database, MIDI message Extraction, Mid-level Representation • Example: the MIDI representation 67 64 65 62 60 gives the interval sequence -3 1 -3 -2
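
The center clipping and short-term autocorrelation stages can be illustrated with a small pitch estimator. This is a sketch under stated assumptions, not the system on the slide: the 11025 Hz sample rate (read from "11 kHz"), the clipping ratio of 0.3, the 80 to 1000 Hz search range, and all function names are chosen here for illustration.

```python
import math


def center_clip(frame, ratio=0.3):
    """Zero out low-amplitude samples and shift the rest toward zero."""
    limit = ratio * max(abs(x) for x in frame)
    clipped = []
    for x in frame:
        if x > limit:
            clipped.append(x - limit)
        elif x < -limit:
            clipped.append(x + limit)
        else:
            clipped.append(0.0)
    return clipped


def estimate_pitch_hz(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Return the pitch whose lag maximises the short-term autocorrelation."""
    clipped = center_clip(frame)
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)

    def autocorr(lag):
        return sum(clipped[n] * clipped[n + lag]
                   for n in range(len(clipped) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag


def hz_to_midi(freq_hz):
    """Map a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


if __name__ == "__main__":
    rate = 11025
    # Synthetic test tone standing in for a hummed note.
    frame = [math.sin(2 * math.pi * 440.0 * n / rate) for n in range(1024)]
    pitch = estimate_pitch_hz(frame, rate)
    print(pitch, hz_to_midi(pitch))
```

The estimate is quantized to integer lags, so at this sample rate the result is only accurate to within a few hertz, which is still enough to recover the MIDI note for the test tone.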

  21. Content-based Music Retrieval • Example interval sequences: 1 1 2 0 -2 0 1 2 0 and -3 1 1 2 • N-gram representation • A vector representation for each music document • A typical information retrieval problem
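
The n-gram idea can be sketched as follows: MIDI notes become pitch intervals, intervals become n-gram counts, and two pieces are compared as vectors. The choice of bigrams, the cosine measure, and the function names are assumptions for illustration; the slides do not say which n or which similarity function was used.

```python
import math
from collections import Counter


def intervals(midi_notes):
    """Differences between consecutive MIDI note numbers."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]


def ngram_vector(sequence, n=2):
    """Count every run of n consecutive intervals."""
    return Counter(tuple(sequence[i:i + n])
                   for i in range(len(sequence) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    print(intervals([67, 64, 65, 62, 60]))
    first = ngram_vector([1, 1, 2, 0, -2, 0, 1, 2, 0])
    second = ngram_vector([-3, 1, 1, 2])
    print(cosine(first, second))
```

Working with intervals rather than absolute notes makes the representation independent of the key in which a query is sung or hummed.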

  22. Document Image Analysis and Retrieval

  23. Document Image Analysis • Recognize text (OCR) • convert page images to Unicode • machine-printed, handwritten • Analyze page layout geometry • a 2-D problem (unlike speech, text) • good ‘language-free’ algorithms • Capture logical structure • output marked-up text (XML, etc) • exploit non-textual clues

  24. Video/Image OCR Block Diagram • Video or Image → Text Area Detection → Text Area Preprocessing → Commercial OCR → UTF-8 Text

  25. Text Detection

  26. Video OCR • Low resolution (characters as low as 10 pixels in height) • Limited by the NTSC (352x248)/PAL/SECAM TV standards • Complex background • Character hue and brightness similar to the background

  27. VOCR Preprocessing Problems

  28. Video Frames (1/2 s intervals) → Filtered Frames → AND-ed Frames
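
One plausible reading of the AND step is a pixel-wise logical AND over binarized frames: caption text that stays in place survives, while a moving background drops out. The sketch below assumes the filtered frames have already been thresholded to 0/1 values; that assumption and the function name `and_frames` are not taken from the slides.

```python
def and_frames(frames):
    """Pixel-wise AND of equally sized binary frames (2-D lists of 0/1)."""
    if not frames:
        raise ValueError("need at least one frame")
    result = [row[:] for row in frames[0]]  # copy so the input is not modified
    for frame in frames[1:]:
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                result[y][x] &= value
    return result


if __name__ == "__main__":
    # Three tiny frames sampled at successive intervals.
    f1 = [[1, 1, 0], [0, 1, 1]]
    f2 = [[1, 0, 0], [0, 1, 1]]
    f3 = [[1, 1, 1], [0, 1, 0]]
    print(and_frames([f1, f2, f3]))
```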

  29. OCR Document Retrieval • Task: find OCR-recognized documents relevant to an information need • Challenge: erroneous documents → retrieval needs to handle word errors

  30. OCR Document Retrieval • Correction-based approaches • Find potential word errors and replace each with the most likely correct one • Partial matching approaches • Word → a set of n-grams • Word matches → n-gram matches
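
Partial matching can be sketched by turning each word into a set of character n-grams and scoring overlap. The trigram size, the `#` boundary padding, the Dice coefficient, and the function names below are assumptions chosen for illustration.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams, with '#' marking the word boundaries."""
    padded = f"#{word.lower()}#"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_match(ocr_word, vocabulary, n=3):
    """Vocabulary word sharing the most n-grams with a noisy OCR token."""
    return max(vocabulary, key=lambda w: ngram_similarity(ocr_word, w, n))


if __name__ == "__main__":
    vocab = ["retrieval", "retrieve", "revival"]
    # "retrieva1" stands in for an OCR confusion between "l" and "1".
    print(best_match("retrieva1", vocab))
```

A single character error only damages the n-grams that contain it, so most of the word still matches. The same scoring can drive a correction-based approach by replacing the OCR token with its best match.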

  31. Video Retrieval

  32. Video Retrieval - Application of Diverse Technologies • Speech understanding for automatically derived transcripts • Image understanding for video “paragraphing”; face, text and other object recognition • Natural language for query expansion, topic detection and content summarization • Human computer interaction for video display, navigation and reuse • Integration overcomes the limitations of each

  33. Introduction to TREC Video Retrieval Track • NIST TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/ • Video Retrieval Track started in 2001 • Investigation of content-based retrieval from digital video • Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip

  34. The TRECVID Collections • 2001: 11 hours, 74 queries, 8000 shots • 2002: 40 hours, 25 queries, 14000 shots; video from the Internet Archive between the ’50s and ’70s; advertising, educational, industrial and amateur films; common shot boundaries • 2003: 56 hours, 25 queries, 32000 shots; 1998 Broadcast News (CNN, ABC, C-SPAN) + common speech recognition + common annotations • 2004: 61 hours, 24 queries, 33000 shots; more 1998 Broadcast News

  35. Sample Query and Target • Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST • Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular … • OCR: H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director

  36. System Architecture (TREC Video Track 2001) • Query → Retrieval Agents (Text, Image, Audio) → Text Score, Image Score, Audio Score → Final Score • Combine video, audio and text retrieval scores
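
One simple way to combine per-agent scores into a final score is a weighted sum of normalized scores. The slides do not say how the scores were actually combined, so the min-max normalization, the weights, and the function names below are assumptions for illustration only.

```python
def normalize(scores):
    """Min-max normalise a {shot_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {shot: 1.0 for shot in scores}
    return {shot: (s - low) / (high - low) for shot, s in scores.items()}


def fuse(agent_scores, weights):
    """Weighted sum of normalised agent scores, returned as a ranked list."""
    final = {}
    for agent, scores in agent_scores.items():
        for shot, s in normalize(scores).items():
            final[shot] = final.get(shot, 0.0) + weights[agent] * s
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores from three retrieval agents for three shots.
    agent_scores = {
        "text": {"shot1": 2.0, "shot2": 1.0, "shot3": 0.0},
        "image": {"shot1": 0.2, "shot2": 0.9, "shot3": 0.1},
        "audio": {"shot1": 0.5, "shot2": 0.5, "shot3": 0.5},
    }
    weights = {"text": 0.5, "image": 0.3, "audio": 0.2}  # hypothetical weights
    for shot, score in fuse(agent_scores, weights):
        print(shot, round(score, 4))
```

Normalizing first keeps an agent with a wide score range from dominating the sum; the weights then express how much each modality is trusted.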

  37. Results for TREC01
