CVPR 2019 Poster

Presentation Transcript


  1. CVPR 2019 Poster

  2. Task Grounding referring expressions is typically formulated as the task of identifying, from a set of proposals in an image, the proposal that the expression refers to. Cues that are summarized:
  • visual features of single objects
  • global visual contexts (CNN)
  • pairwise visual differences (see the sketch after this list)
  • object-pair contexts
  • global language contexts (LSTM)
  • language features of the decomposed phrases
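As a minimal sketch of the "pairwise visual differences" cue (an illustration only, not necessarily the exact formulation used in the paper), one common choice is to pool the normalized feature differences between each proposal and the other proposals in the image:

```python
import torch

def pairwise_visual_differences(feats: torch.Tensor) -> torch.Tensor:
    """Pool pairwise visual differences (illustrative sketch).

    feats: (N, D) CNN features of the N proposals in one image.
    Returns (N, D): row i is the mean of the L2-normalized differences
    between proposal i and every other proposal.
    """
    n = feats.size(0)
    diffs = feats.unsqueeze(1) - feats.unsqueeze(0)            # (N, N, D) feature differences
    diffs = diffs / (diffs.norm(dim=-1, keepdim=True) + 1e-6)  # normalize each difference vector
    mask = 1.0 - torch.eye(n, device=feats.device)             # drop self-differences
    return (diffs * mask.unsqueeze(-1)).sum(dim=1) / max(n - 1, 1)
```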

  3. Problem
  • Existing work on global language context modeling and global visual context modeling introduces noisy information and makes it hard to match these two types of contexts.
  • The pairwise visual differences computed in existing work can only represent instance-level visual differences among objects of the same category.
  • Existing work on context modeling for object pairs considers only first-order relationships, not multi-order relationships.
  • Multi-order relationships are actually structured information, and the context encoders adopted by existing work on grounding referring expressions are simply incapable of modeling them.

  4. Pipeline

  5. Spatial Relation Graph
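A hedged sketch of how a spatial relation graph over proposals might be built from bounding boxes alone; the edge rule (connect proposals whose normalized centers are close) and the edge features below are assumptions for illustration, not the paper's exact definition:

```python
import torch

def spatial_relation_graph(boxes: torch.Tensor, dist_thresh: float = 0.5):
    """Build a directed spatial graph over proposals (illustrative only).

    boxes: (N, 4) tensor of [x1, y1, x2, y2], normalized to [0, 1].
    Returns an (N, N) adjacency matrix and (N, N, 5) edge features
    holding relative center offsets, log size ratios, and center distance.
    """
    ctr = (boxes[:, :2] + boxes[:, 2:]) / 2                    # (N, 2) box centers
    wh = (boxes[:, 2:] - boxes[:, :2]).clamp(min=1e-6)         # (N, 2) widths / heights
    off = ctr.unsqueeze(0) - ctr.unsqueeze(1)                  # (N, N, 2) center offsets
    dist = off.norm(dim=-1)                                    # (N, N) center distances
    size = (wh.unsqueeze(0) / wh.unsqueeze(1)).log()           # (N, N, 2) log size ratios
    adj = (dist < dist_thresh).float()                         # connect nearby proposals
    adj.fill_diagonal_(0)                                      # no self-loops
    edge_feats = torch.cat([off, size, dist.unsqueeze(-1)], dim=-1)
    return adj, edge_feats
```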

  6. Language Context: word type, word-to-vertex reference, vertex language context
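A hedged sketch of the "vertex language context" idea: each graph vertex attends over the word features, so the expression is summarized differently for each vertex. The bilinear attention form and the names `vertex_feats`, `word_feats`, and `W` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def vertex_language_context(vertex_feats, word_feats, W):
    """Per-vertex attention over words (illustrative sketch).

    vertex_feats: (N, Dv) visual features of the graph vertices.
    word_feats:   (T, Dw) contextual word features (e.g. from a bi-LSTM).
    W:            (Dv, Dw) learned bilinear compatibility matrix.
    Returns (N, Dw) language context per vertex and (N, T) attention weights.
    """
    scores = vertex_feats @ W @ word_feats.t()   # (N, T) word-vertex compatibility
    attn = F.softmax(scores, dim=-1)             # each vertex distributes attention over words
    return attn @ word_feats, attn               # weighted sum of word features per vertex
```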

  7. Language-Guided Visual Relation Graph: edge and vertex representations
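One plausible reading of updating the visual relation graph under language guidance is a round of message passing in which a language-conditioned gate scales each edge before aggregation; the following is a sketch under that assumption, not the paper's exact update:

```python
import torch
import torch.nn as nn

class LanguageGuidedGraphStep(nn.Module):
    """One round of language-gated message passing (illustrative sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.edge_gate = nn.Linear(3 * dim, 1)   # gate from (sender, receiver, language context)
        self.update = nn.Linear(2 * dim, dim)    # fuse each vertex with its aggregated messages

    def forward(self, vertex, lang_ctx, adj):
        # vertex: (N, D) vertex features, lang_ctx: (N, D) per-vertex language context,
        # adj: (N, N) adjacency from the spatial relation graph.
        n, d = vertex.shape
        send = vertex.unsqueeze(0).expand(n, n, d)     # send[i, j] = features of neighbor j
        recv = vertex.unsqueeze(1).expand(n, n, d)     # recv[i, j] = features of vertex i
        lang = lang_ctx.unsqueeze(1).expand(n, n, d)   # language context of the receiving vertex
        gate = torch.sigmoid(self.edge_gate(torch.cat([send, recv, lang], -1))).squeeze(-1)
        msg = (adj * gate).unsqueeze(-1) * send        # gated messages along existing edges
        agg = msg.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.update(torch.cat([vertex, agg], -1)))
```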

  8. Language-Vision Feature, Loss Function, Semantic Context Modeling
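For the loss function, a minimal sketch is to score each proposal's language-vision feature against the expression embedding and apply cross-entropy over the proposals; the paper's actual loss (e.g. a ranking variant) may differ:

```python
import torch
import torch.nn.functional as F

def grounding_loss(fused_feats, expr_emb, gt_index):
    """Illustrative matching loss over proposals.

    fused_feats: (N, D) language-vision features of the N proposals.
    expr_emb:    (D,)  embedding of the whole referring expression.
    gt_index:    index of the ground-truth proposal.
    """
    scores = fused_feats @ expr_emb                              # (N,) matching scores
    target = torch.tensor([gt_index], device=fused_feats.device)
    return F.cross_entropy(scores.unsqueeze(0), target)          # softmax over proposals
```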

  9. ICCV 2019

  10. Problem
  • Almost all existing approaches for referring expression comprehension do not perform reasoning, or support only single-step reasoning.
  • The models trained with those approaches have poor interpretability.

  11. Pipeline

  12. Language-Guided Visual Reasoning Process: q is the concatenation of the last hidden states of the forward and backward LSTMs.
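The construction of q can be written directly: run a bidirectional LSTM over the word embeddings and concatenate the final forward and backward hidden states. The layer sizes below are placeholders:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab = 300, 512, 10000                # placeholder sizes
embedding = nn.Embedding(vocab, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab, (1, 8))                      # one expression of 8 word ids
outputs, (h_n, _) = bilstm(embedding(tokens))                 # h_n: (2, 1, hidden_dim)
q = torch.cat([h_n[0], h_n[1]], dim=-1)                       # (1, 2*hidden_dim): [forward; backward]
```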

  13. Static Attention

  14. Dynamic Graph Attention
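A hedged sketch of a single dynamic graph attention reasoning step: attention over objects is propagated along relation edges and re-weighted by a step-specific language cue. The update rule and the names `adj`, `step_lang`, and `W` are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def graph_attention_step(obj_feats, attn, adj, step_lang, W):
    """One reasoning step over the object graph (illustrative sketch).

    obj_feats: (N, D) object features, adj: (N, N) relation adjacency,
    attn:      (N,)  current attention over objects (sums to 1),
    step_lang: (D,)  language guidance for this step, W: (D, D) learned matrix.
    """
    relevance = torch.sigmoid(obj_feats @ W @ step_lang)   # (N,) per-object relevance at this step
    transferred = adj.t() @ attn                           # move attention along graph edges
    new_attn = relevance * transferred
    return new_attn / new_attn.sum().clamp(min=1e-6)       # renormalize to a distribution
```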

  15. CVPR 2019 Oral

  16. Motivation
  • When we feed an unseen image scene into the framework, we usually get a simple and trivial caption about the salient objects, such as "there is a dog on the floor", which is no better than a bare list of object detections.
  • Once we abstract the scene into symbols, the generation becomes almost disentangled from the visual perception.

  17. Inductive Bias
  • Everyday practice makes us perform better than machines in high-level reasoning.
  • Template/rule-based caption models are well known to be ineffective compared to encoder-decoder ones, due to the large gap between visual perception and language composition.
  • Scene graph --> bridge the gap between the two worlds.
  • We can embed the graph structure into vector representations; these vector representations are expected to transfer the inductive bias from the pure language domain to the vision-language domain.

  18. Encoder-Decoder Revisited

  19. Auto-Encoding Scene Graphs: Dictionary
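The dictionary can be read as a learned memory that re-encodes scene-graph node embeddings through its entries; below is a minimal sketch of such a re-encoder using soft attention over dictionary entries (the exact re-encoding function in the paper may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryReencoder(nn.Module):
    """Re-encode a feature as a weighted sum of learned dictionary entries (sketch)."""

    def __init__(self, num_entries=1000, dim=512):
        super().__init__()
        self.D = nn.Parameter(torch.randn(num_entries, dim) * 0.01)  # learned dictionary

    def forward(self, x):
        # x: (B, dim) scene-graph node embeddings.
        weights = F.softmax(x @ self.D.t(), dim=-1)   # attention over dictionary entries
        return weights @ self.D                       # re-encoded features, same shape as x
```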

  20. Overall Model: SGAE-based Encoder-Decoder
  • object detector + relation detector + attribute classifier
  • multi-modal graph convolution network
  • pre-train the dictionary D; cross-entropy loss, then RL-based loss (sketched below)
  • two decoders
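The "cross-entropy loss, then RL-based loss" schedule is commonly realized as word-level cross-entropy pre-training followed by self-critical training on a sentence-level reward (e.g. CIDEr); a hedged sketch of the two loss terms, with caption sampling and reward computation left out:

```python
import torch
import torch.nn.functional as F

def xe_loss(logits, targets):
    # logits: (B, T, V) decoder outputs, targets: (B, T) ground-truth word ids.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def rl_loss(sample_logprobs, sample_reward, greedy_reward):
    # Self-critical style policy gradient: the reward of a sampled caption minus the
    # reward of the greedily decoded caption serves as a baseline-corrected advantage.
    # sample_logprobs: (B, T) log-probs of the sampled words; rewards: (B,) sentence scores.
    advantage = sample_reward - greedy_reward
    return -(advantage.detach() * sample_logprobs.sum(dim=1)).mean()
```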

  21. ICCV 2019

  22. Motivation
  • Unlike a visual concept in ImageNet, which has 650 training images on average, a specific sentence in MS-COCO has only one single image, which is extremely scarce in the conventional view of supervised training.
  • Given a sentence pattern as in Figure 1b, your descriptions for the three images in Figure 1a should be much more constrained.
  • Studies in cognitive science show that we humans do not speak an entire sentence word by word from scratch; instead, we compose a pattern first, then fill in the pattern with concepts, and we repeat this process until the whole sentence is finished.

  23. Tackling the dataset bias

  24. Relation Module, Object Module, Function Module, Attribute Module

  25. Controller
  • Multi-step Reasoning: repeat the soft fusion and language decoding M times (soft fusion sketched below).
  • Linguistic Loss
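A minimal sketch of the soft fusion performed by the controller at each of the M reasoning steps: weights over the four modules are predicted from the current decoder state and used to mix the module outputs. The module ordering and the names below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftFusionController(nn.Module):
    """Mix the outputs of the four neural modules with predicted weights (sketch)."""

    def __init__(self, dim, num_modules=4):
        super().__init__()
        self.weight_head = nn.Linear(dim, num_modules)   # module weights from the decoder state

    def forward(self, decoder_state, module_outputs):
        # decoder_state:  (B, dim) current language-decoder hidden state.
        # module_outputs: (B, 4, dim) outputs of the object/attribute/relation/function modules.
        w = F.softmax(self.weight_head(decoder_state), dim=-1)   # (B, 4) soft module weights
        return (w.unsqueeze(-1) * module_outputs).sum(dim=1)     # fused feature fed to the decoder
```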
