
Object detection in videos – Attention based cues

Mentor: Prof. Amitabha Mukherjee · Presenter: Shubham Tulsiani (Y9574)


Presentation Transcript


  1. Mentor: Prof. Amitabha Mukherjee Object detection in videos – Attention based cues - Shubham Tulsiani (Y9574)

  2. The Importance of Attention • Object detection algorithms are computationally expensive • Modeling Attention is a biologically motivated way of preselecting regions for further costly computations

  3. Attention Based Approaches (Images) Previous Works • Attention-based approaches have often been used for static images. Some examples are: • Itti-Koch Saliency Model • Contextual cues combined with saliency for search tasks • Feature, context and saliency based attention models

  4. Itti-Koch Saliency This model provides a measure of the 'saliency' of each location in the image across various low-level features (contrast, color, orientation, texture, motion). It is a primitive model of attention used for object detection.
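The core idea behind the Itti-Koch model is a center-surround comparison: a location is salient if its feature value differs from the average of its neighbourhood. A minimal single-channel, single-scale sketch (the real model uses difference-of-Gaussians pyramids over several feature channels; the box-filter surround here is an illustrative stand-in):

```python
import numpy as np

def center_surround_saliency(img, k=3):
    """Toy center-surround saliency for one feature channel (e.g. intensity).

    A box-filter mean over a (2k+1)x(2k+1) window plays the role of the
    coarse 'surround' scale; the pixel itself is the 'center'.
    """
    img = img.astype(float)
    h, w = img.shape
    pad = np.pad(img, k, mode="edge")
    # Surround: mean over the local window around each pixel.
    surround = np.zeros((h, w))
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            surround += pad[k + dy:k + dy + h, k + dx:k + dx + w]
    surround /= (2 * k + 1) ** 2
    # Center minus surround: large where a pixel stands out locally.
    sal = np.abs(img - surround)
    return sal / (sal.max() + 1e-8)
```

On an image that is black except for one white dot, the map peaks at the dot, matching the intuition that an isolated dot "pops out".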

  5. Context and Saliency based Attention - Eye Movements and Attention - Torralba • The human visual system makes extensive use of contextual information to facilitate object search in natural scenes • This, combined with saliency, was used to model attention for object detection

  6. Features, Context and Saliency A combination of Saliency, Context and Feature based cues has been used to obtain more evolved attention models

  7. Performance of Various Attention Models on Images It has been observed that the combined model of visual attention performs better than isolated ones. Thus, human visual attention is driven by various factors, which combined models capture more effectively.

  8. Object Detection in Videos: Challenges • A lot of data! • Applying an object detector for static images on every frame is very costly

  9. Object Detection in Videos: Advantages • We can exploit the information across frames to build a more effective detector; this helps eliminate false positives that often occur in single images • Various attention based cues do not have to be recomputed every frame

  10. Some Common Approaches • Feature based object detection • Motion based object detection • There is no notable visual-attention based approach for object detection in videos

  11. A Proposed Methodology • We compute maps for the various cues that drive our attention: saliency, motion in the video, context and feature resemblance • We combine these cues to obtain a model of visual attention, which gives us the regions of interest for object detection

  12. Saliency Based Cues A black dot on a white board is salient and draws our attention • Saliency is a bottom-up cue, i.e., saliency maps are independent of the object being searched for or the semantic content of the video • High saliency indicates that a region stands out from its surroundings

  13. Motion Detection • We would like to focus our attention on regions where motion is detected, because there is a higher probability that the object of interest is present there • This cue is also bottom-up and corresponds to saliency in a temporal sense

  14. Contextual Cues A person is more likely to be present on the road than in the sky. The context map for pedestrian detection will show higher values for regions near the ground • Gives an indication of where the object is more likely to be present • Does not have to be computed very frequently in a video (especially for a static camera)

  15. Feature Based Cues While searching for a snake, we are likely to focus on long, thin objects • Indicate resemblance to the object being searched for • Instead of features from a single static frame, we should take into account the features from a set of frames • We can learn how the object looks across a sequence of frames

  16. Feature Based Cues • This can be achieved by modifying our base static detection approach to represent dynamic information by extending the static representation into the time domain • More complex approaches can be used but since the aim is to get a computationally inexpensive feature map, the above will suffice

  17. Combining Cues An object detection model for a person should give more weight to the motion cue than a model for trees • We can combine the cues to obtain a model of visual attention in videos for the object to be detected • The combined map determines the regions of interest in the video • For a general model applicable across all objects, we should be able to learn the weights associated with each of the cues
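The combination step can be sketched as a per-pixel weighted sum of normalised cue maps. This is an illustrative sketch, not the talk's actual implementation; the cue names and the fixed weights shown in the usage note are assumptions (the slides propose learning them per object class):

```python
import numpy as np

def attention_map(cues, weights):
    """Combine named cue maps (saliency, motion, context, features) into a
    single attention map via a weighted sum.

    Each map is min-max normalised to [0, 1] first so that cues measured on
    different scales are comparable; weights encode the per-class importance
    of each cue.
    """
    total = np.zeros_like(next(iter(cues.values())), dtype=float)
    for name, w in weights.items():
        m = cues[name].astype(float)
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalise to [0, 1]
        total += w * m
    return total / sum(weights.values())
```

For a pedestrian model, one might use something like `weights = {"saliency": 0.2, "motion": 0.4, "context": 0.2, "features": 0.2}`, emphasising motion as the slide suggests.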

  18. An Implementation: Overview • We learn a detector for humans in videos based on the proposed methodology • We test our model on videos from an annotated video database, 'LabelMe Video' • We use the maps for saliency, motion, context and features to detect regions of interest

  19. A Sample Video

  20. Saliency • We have used the Itti-Koch model to compute the saliency maps • Since computing saliency is computationally inexpensive, we have computed it for every frame, but this could be made more efficient

  21. Saliency

  22. Motion Detection • We highlight those regions where pixel values differ from the corresponding pixels in the previous frames (this approach does not work for moving cameras) • We have taken into account the slight instability of hand-held cameras in the computation of these motion maps
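A frame-differencing motion cue of this kind can be sketched as follows. The jitter handling here (taking the minimum difference over small shifts of the previous frame) is an assumed stand-in for whatever stabilisation the talk actually used:

```python
import numpy as np

def motion_map(prev, curr, jitter=1, thresh=0.1):
    """Frame-difference motion cue for a static (or nearly static) camera.

    To tolerate slight hand-held camera jitter, each pixel's difference is
    taken as the minimum over small integer shifts of the previous frame,
    so a 1-pixel global shake does not light up the whole map.
    """
    curr = curr.astype(float)
    best = np.full(curr.shape, np.inf)
    for dy in range(-jitter, jitter + 1):
        for dx in range(-jitter, jitter + 1):
            shifted = np.roll(np.roll(prev.astype(float), dy, axis=0), dx, axis=1)
            best = np.minimum(best, np.abs(curr - shifted))
    # Binarise: 1 where the change exceeds the noise threshold.
    return (best > thresh).astype(float)
```

As the slide notes, this breaks down for freely moving cameras, where the background itself changes between frames.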

  23. Motion Detection

  24. Contextual Cues • Used over 600 images from the LabelMe database to train a context model • Since the context does not change rapidly in a video, we recompute it every 10 frames

  25. Contextual Cues Context Original

  26. Feature Based Cues • We have trained a Viola-Jones based detector using AdaBoost over 100,000 base features • To take into account the temporal aspects of features, we normalise the map over a set of frames

  27. Feature Based Cues

  28. The Dynamic Attention Model • We combine the various cues to obtain the model of visual attention in videos for pedestrian detection • We can then select the regions above a certain threshold (top 20%) for object detection via a costly algorithm
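The final selection step, keeping the top 20% of the attention map for the expensive detector, can be sketched with a quantile threshold (reading "20%" as the top fraction of pixels is an interpretation; the slide could also mean a fixed value threshold):

```python
import numpy as np

def regions_of_interest(attn, keep=0.2):
    """Mask of the top `keep` fraction of pixels in the combined attention
    map; only these locations are handed to the costly object detector."""
    cutoff = np.quantile(attn, 1.0 - keep)
    return attn >= cutoff
```

The expensive static detector then runs only inside the returned mask, which is where the proposed pipeline recovers its computational savings.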

  29. A Demonstration

  30. A Demonstration

  31. Future Scope of Work • We can interpolate the various maps over time for more effective detectors • Alternate models for the various cues can be used • The proposed model can be extended to be implemented in real time

  32. References • A Trainable System for Object Detection in Images and Video Sequences - Constantine P. Papageorgiou • Modeling Search for People in 900 Scenes: A Combined Source Model of Eye Guidance - Torralba et al. • Object Detection and Tracking in Video - Zhong Guo • Various databases and code sources

  33. Thank YOU
