
Maximizing the Synergy between Man and Machine


Presentation Transcript


  1. Exploiting Human Abilities in Video Retrieval Interfaces Maximizing the Synergy between Man and Machine Alex Hauptmann School of Computer Science Carnegie Mellon University

  2. Background
  Automatic video analysis for detection/recognition is still quite poor:
  • Consider the baseline (random guessing)
  • Improvement over it is limited
  • Consider near-duplicates (trivial similarity)
  • Does not generalize well over video sources
  • Better than nothing
  Need humans to make up for this shortcoming!
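The "baseline (random guessing)" point can be made concrete: for a random ranking, expected precision at any cutoff is just the fraction of relevant shots in the collection. A minimal sketch (the counts below are hypothetical, not from the talk):

```python
def random_baseline_precision(num_relevant, num_total):
    """Expected precision of a random ranking: at any cutoff, the
    fraction of relevant shots retrieved by chance is R / N."""
    return num_relevant / num_total

# Hypothetical collection: 50 relevant shots among 10,000.
baseline = random_baseline_precision(50, 10_000)
print(baseline)  # 0.005
```

Any automatic detector must be judged against this floor; a system at 0.05 precision looks weak in absolute terms but is a 10x improvement over chance here.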

  3. Differences from VIRAT
  Most interface work was done on broadcast TV.
  Harder:
  • Unconstrained subject matter
  • Graphics, animations, photos
  • Broadcasters
  • Many different, short shots out of context
  Easier:
  • Better resolution
  • Conventions in editing and structure
  • Audio track
  • Keyframes are the typical unit of analysis

  4. “Classic” Interface Work
  Interactive video queries:
  • Fielded text query matching capabilities
  • Fast image matching with a simplified interface for launching image queries
  Interactive browsing, filtering, and summarizing:
  • Browsing by visual concepts
  • Quick display of contents and context in synchronized views

  5. Informedia Client Interface Example

  6. Interface Options (Christel08)

  7. Suggesting Related Concepts (Zavesky07)

  8. Fork Browser (Snoek09)

  9. “Classic” Video Interface Results
  • Concept browsing and image search are used frequently
  • Novices still have lower performance than experts
  • Some topics reduce “interactivity” to a one-shot query with little browsing/exploration
  • The classic Informedia interface, including concept browsing, is often good enough that the user never proceeds to any additional text or image query
  • “Classic Informedia” scored highest among systems tested with novice users in TRECVID evaluations

  10. Visual Browsing

  11. Augmented Video Retrieval
  The computer observes the user and LEARNS from what is marked as relevant. The system can learn:
  • Which image characteristics are relevant
  • Which text characteristics (words) are relevant
  • What combination weights should be used
  We exploit the human’s ability to quickly mark relevant video and the computer’s ability to learn from given examples.
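This learn-from-marked-examples loop can be sketched with a Rocchio-style feedback update, a standard relevance-feedback technique; the talk does not specify Informedia's actual learner, and the feature space and weights below are illustrative:

```python
import numpy as np

def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style feedback: move the query vector toward the centroid
    of shots the user marked relevant, away from non-relevant ones."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

# Toy 3-dimensional feature space (e.g. text, color, motion scores).
q0 = [1.0, 0.0, 0.0]
rel = np.array([[0.9, 0.8, 0.1], [1.0, 0.6, 0.0]])  # user-marked relevant
non = np.array([[0.1, 0.0, 0.9]])                   # user-marked non-relevant
q1 = rocchio_update(q0, rel, non)
```

After the update, the weight on the second feature rises (both relevant shots score high on it) and the third feature is pushed negative, which is exactly the "learn what characteristics are relevant" behavior described above.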

  12. Combining Concept Detectors for Retrieval
  [Architecture diagram: a multimodal query (text such as “Find Pope John Paul”, audio, motion, image) is matched against the video library through diverse knowledge sources — closed captions, audio features, motion features, color features, … (3k), face detection, building detection, and other knowledge sources/APIs. Their outputs (Output1, Output2, …, Outputn) are combined into a final ranked list for the interface.]
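The combination step in the diagram can be sketched as weighted late fusion, a common way to merge per-source scores into one ranked list; the shot names, sources, and weights below are hypothetical, and the actual Informedia combination is not specified on this slide:

```python
def late_fusion(detector_scores, weights):
    """Weighted late fusion: each knowledge source scores every shot,
    and the final ranking sorts shots by the weighted sum of scores."""
    fused = {}
    for source, scores in detector_scores.items():
        w = weights.get(source, 0.0)
        for shot, s in scores.items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs from three sources for a "Pope John Paul" query.
scores = {
    "text":  {"shot1": 0.9, "shot2": 0.2, "shot3": 0.4},
    "face":  {"shot1": 0.7, "shot2": 0.8, "shot3": 0.1},
    "color": {"shot1": 0.1, "shot2": 0.3, "shot3": 0.9},
}
ranking = late_fusion(scores, {"text": 0.5, "face": 0.4, "color": 0.1})
print(ranking)  # ['shot1', 'shot2', 'shot3']
```

With these weights the text and face sources dominate, so the shot that both agree on ranks first even though the color detector prefers another shot.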

  13. Why Relevance Feedback?
  • Limited training data
  • Untrained sources are useful for some specific searches
  Example query: finding boats/ships (Q-type: general object). Weights learned from the training set: Txt: 0.5, Img: 0.3, Face: -0.5. Weights for Outdoor and Ocean: unknown (unable to be learned).

  14. Probabilistic Local Context Analysis (pLCA) (Yan07)
  • Goal: refine the results of the current query
  • Method: treat the combination parameters υ of the “un-learned” sources as latent variables and compute P(yj|aj,Dj)
  • Discover useful search knowledge based on the initial results Ai
  [Diagram: latent weights υ1, υ2, …, υN (all unknown) connect the query and the initial search results A1…AM to the relevance variables Y1…YM of videos 1…M.]

  15. Undirected Model and Parameter Estimation
  Compute the posterior probability of document relevance Y given the initial results A, based on an undirected graphical model.
  • Variational inference, i.e., iterate until convergence and approximate P(yj|aj) by qyj
  • Maximize w.r.t. the variational parameters of Y
  • Maximize w.r.t. the variational parameters of υ
  [Diagram: an undirected model linking υ to the triples (A1, Y1, D1), (A2, Y2, D2), …, (Am, Ym, Dm).]
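The alternating-maximization idea can be illustrated with a drastically simplified sketch: treat the unknown combination weights as latent, then alternate between (1) estimating soft per-document relevance from the current fused scores and (2) re-fitting the weights so that sources agreeing with the current relevance estimates gain weight. This is my own toy approximation of the scheme, not Yan's actual variational algorithm, and all names and numbers are illustrative:

```python
import numpy as np

def refine_with_latent_weights(A, init_scores, iters=10):
    """Toy pLCA-style refinement (NOT the exact variational algorithm).
    A is a (docs x sources) score matrix from un-learned sources;
    init_scores comes from the trained combination. Alternate:
    (1) soft relevance q from the current fused scores,
    (2) weights v re-fit toward sources that correlate with q."""
    n_docs, n_src = A.shape
    v = np.full(n_src, 1.0 / n_src)          # latent weights, start uniform
    init = np.asarray(init_scores, dtype=float)
    for _ in range(iters):
        fused = 0.5 * init + 0.5 * (A @ v)
        q = 1.0 / (1.0 + np.exp(-fused))     # soft relevance per document
        agreement = A.T @ (q - q.mean())     # sources agreeing with q
        v = np.maximum(agreement, 0.0)
        v = v / v.sum() if v.sum() > 0 else np.full(n_src, 1.0 / n_src)
    return q, v

# Hypothetical "finding boats" refinement: source 0 is an un-learned
# Ocean detector that agrees with the initial ranking, source 1 an
# Indoor detector that contradicts it.
A = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 1.0]])
q, v = refine_with_latent_weights(A, init_scores=[2.0, 1.0, -1.0, -2.0])
```

The iteration shifts the latent weight onto the Ocean source because its scores line up with the initial results, which is the intuition behind discovering useful search knowledge from the initial result list.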

  16. Automatic vs Interactive Search

  17. Extreme Video Retrieval
  • Automatic retrieval baseline determines the ranking order
  • Two methods of presentation:
  • System-controlled presentation – Rapid Serial Visual Presentation (RSVP)
  • User-controlled presentation – Manual Browsing with Resizing of Pages

  18. System-controlled Presentation
  Rapid Serial Visual Presentation (RSVP):
  • Minimizes eye movements: all images in the same location
  • Maximizes information transfer: system → human
  • Up to 10 key images/second, 1 or 2 images per page
  • Presentation intervals are dynamically adjustable
  • Click when a relevant shot is seen; the previous page is also marked as relevant
  • A final verification step is necessary
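The marking logic above — a click flags both the current page and the one before it, to absorb human reaction time at these presentation rates — can be sketched as follows (function and parameter names are my own):

```python
def rsvp_marks(pages, clicks):
    """RSVP marking logic: pages flash by at a fixed screen location;
    a click while page i is on screen marks page i AND the previous
    page i-1 as candidate-relevant (reaction-time slack). `clicks` is
    the set of page indices during which the user clicked. A final
    verification pass then prunes the false positives."""
    marked = set()
    for i in clicks:
        marked.add(i)
        if i > 0:
            marked.add(i - 1)
    return sorted(m for m in marked if m < pages)

print(rsvp_marks(pages=100, clicks={0, 7, 42}))  # [0, 6, 7, 41, 42]
```

Because each click widens the candidate set, the brief final verification step is what keeps precision acceptable.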

  19. User-controlled Presentation
  Manual Browsing with Resizing of Pages:
  • Manually page through images; the user decides when to view the next page
  • Vary the number of images on a page
  • Allow chording on the keypad to mark shots
  • A very brief final verification step
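The resizable-page idea reduces to splitting the ranked shot list into user-chosen grids; a minimal sketch (names and grid sizes are illustrative):

```python
def page_layout(total_shots, rows, cols):
    """MBRP-style paging sketch: the user picks a grid size
    (rows x cols); the ranked shots are split into pages that the
    user steps through manually at their own pace."""
    per_page = rows * cols
    return [list(range(i, min(i + per_page, total_shots)))
            for i in range(0, total_shots, per_page)]

pages = page_layout(20, rows=3, cols=3)  # 9 shots per page -> 3 pages
```

A larger grid means fewer page turns but smaller, harder-to-judge images, which is exactly the trade-off the user controls here.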

  20. MBRP - Manual Browsing with Resizable Pages

  21. Extreme QA with RSVP
  3x3 display, 1 page/second, numpad chording to select shots

  22. Mindreading – an EEG Interface
  • Learn Relevant/Non-Relevant from 5 EEG probes with simple features
  • Too slow: 250 ms/image, with significant recovery time after a hit
  • Relevance feedback

  23. Summarizing Video: Beyond Keyframes
  BBC Rushes: unedited video for TV series production
  • Summarize as a video in 1/50th of the total duration (note the non-scalable target factor)
  • Lots of smart analysis: clustering, salience, redundancy, importance
  • Best performance for retrieval was to play every frame
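The "play every frame" finding amounts to uniform temporal subsampling at the target compression factor rather than clever clip selection; a sketch of that baseline (the frame count is hypothetical):

```python
def uniform_summary(frame_count, target_fraction=1 / 50):
    """Uniform-subsampling baseline for rushes summarization: keep
    every k-th frame so the summary plays in target_fraction of the
    original duration. Note the fixed 1/50 target does not scale
    with source length, as the slide points out."""
    k = max(1, round(1 / target_fraction))
    return list(range(0, frame_count, k))

kept = uniform_summary(1000)  # 20 frames retained out of 1000
```

Every part of the source gets some coverage, which plausibly explains why this simple speed-up beat more elaborate clustering/salience pipelines for retrieval.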

  24. Speed-up summarization results

  25. Surveillance Event Detection
  • Interesting events are rare
  • Detection accuracy is limited
  • Many streams must be monitored

  26. Surveillance Event Detection
  • Need the action, not a key frame
  • Difficult for humans
  • Combine sped-up playback with automatic analysis
  • Slow down when interesting stuff happens
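The "slow down when interesting stuff happens" policy can be sketched as mapping an automatic detector's event score to a playback rate; the threshold and speed values below are illustrative assumptions, not from the talk:

```python
def playback_speed(event_score, base_speedup=8.0, min_speedup=1.0,
                   threshold=0.5):
    """Adaptive review-speed sketch: fast-forward by default, drop
    toward real time when the automatic detector's event score for
    the current segment crosses a threshold. All parameter values
    are illustrative."""
    return min_speedup if event_score >= threshold else base_speedup

# Per-segment detector scores for a hypothetical stream.
speeds = [playback_speed(s) for s in (0.1, 0.2, 0.9, 0.6, 0.05)]
```

This lets one operator skim many mostly-empty streams while still viewing candidate events at a speed where the action is judgeable, combining limited detection accuracy with human verification.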

  27. Summary
  Interfaces have much to contribute in retrieval
  • We don’t know what is best: it is task-specific, user-specific, and system-dependent
  • Collaborative search
  • Combining the “best of current systems”
  • Simpler is usually better (Occam’s razor)
  General principles are difficult to find

  28. Questions?
