
Jointly Generating Captions to Aid Visual Question Answering

This paper explores how captions can be used to enhance Visual Question Answering (VQA) systems. It proposes a joint VQA and captioning model that generates question-relevant captions to improve VQA performance. Experimental results show that the generated relevant captions significantly improve VQA compared to question-agnostic captions.


Presentation Transcript


  1. Jointly Generating Captions to Aid Visual Question Answering • Raymond Mooney, Department of Computer Science, University of Texas at Austin • with Jialin Wu

  2. VQA • Image credits to the VQA website.

  3. VQA Architectures • Most systems are DNNs using both CNNs and RNNs.

  4. VQA with BUTD • We use a recent state-of-the-art VQA system, BUTD (Bottom-Up Top-Down; Anderson et al., 2018). • BUTD first detects a wide range of objects and attributes using detectors trained on Visual Genome data, and attends to them when computing an answer.

  5. Using Visual Segmentations • We use recent methods that exploit detailed image segmentations for VQA (VQS; Gan et al., 2017). • Segmentations provide more precise visual information than BUTD’s bounding boxes.

  6. High-Level VQA Architecture

  7. How can captions help VQA? • Captions + detections as inputs • Captions can provide useful information for the VQA model.

  8. Multitask VQA and Image Captioning • There are lots of datasets with image captions. • COCO data used in VQA comes with captions • Captioning and VQA both need knowledge of image content and language. • Should benefit from multitask learning (Caruana, 1997).

  9. Question-relevant captions • For a particular question, some of the captions are relevant and some are not.

  10. How to generate question-relevant captions • Input feature side • We need to bias the features to encode the information necessary for the question. • We use the VQA joint representation for simplicity. • Supervision side • We need relevant captions as supervision to train the model to generate relevant captions.

  11. How to obtain relevant training captions • Directly collect captions for each question? • Over 1.1 million questions in the dataset (not scalable). • The caption has to be in line with the VQA reasoning process. • Choose the most relevant caption from an existing dataset? • How to measure relevance? • What if there is no relevant caption for an image-question pair?

  12. Quantifying the relevance • Intuition • Generating relevant captions should share its optimization goal with answering the visual question. • The two objectives should share some descent directions. • Relevance is measured as the inner product of the gradients of the caption-generation loss and the VQA answer-prediction loss. • A positive inner product means the two objective functions share a descent direction during optimization, and therefore indicates that the corresponding caption helps the VQA process.
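A minimal PyTorch-style sketch of this relevance score (illustrative only; the model interface, shared_parameters(), and the two loss helpers are assumptions, not the authors' released code):

import torch

def caption_relevance(model, image_feats, question, answer, caption):
    # Gradient of the VQA answer-prediction loss w.r.t. the shared parameters.
    vqa_loss = model.vqa_loss(image_feats, question, answer)
    g_vqa = torch.autograd.grad(vqa_loss, model.shared_parameters(), retain_graph=True)

    # Gradient of the caption-generation loss w.r.t. the same parameters.
    cap_loss = model.caption_loss(image_feats, question, caption)
    g_cap = torch.autograd.grad(cap_loss, model.shared_parameters())

    # Inner product of the two gradients: a positive value means training on this
    # caption shares a descent direction with answering the question.
    return sum((gv * gc).sum() for gv, gc in zip(g_vqa, g_cap)).item()

The human caption with the largest positive score can then be selected as the training caption for that image-question pair (next slide).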

  13. Quantifying the relevance • Selecting the most relevant human caption

  14. How to use the captions • A Word GRU to identify important words for the question and image • A Caption GRU to encode the sequential information from the attended words.
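A rough sketch of such a two-stage caption encoder, assuming a simple learned word attention against the question/image representation (layer sizes and the exact attention form are assumptions, not the paper's specification):

import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    # Word GRU scores each caption word against the question/image representation;
    # the Caption GRU then encodes the attended word sequence.
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.word_score = nn.Linear(2 * hid_dim, 1)
        self.caption_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, caption_ids, joint_repr):
        # caption_ids: (B, T) word indices; joint_repr: (B, hid_dim) question/image features
        emb = self.embed(caption_ids)                        # (B, T, emb_dim)
        word_states, _ = self.word_gru(emb)                  # (B, T, hid_dim)
        q = joint_repr.unsqueeze(1).expand_as(word_states)   # broadcast question features
        scores = self.word_score(torch.cat([word_states, q], dim=-1)).squeeze(-1)
        attn = torch.softmax(scores, dim=1)                  # importance of each word
        attended = emb * attn.unsqueeze(-1)                  # down-weight irrelevant words
        _, caption_state = self.caption_gru(attended)        # (1, B, hid_dim)
        return caption_state.squeeze(0)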

  15. Joint VQA/Captioning model

  16. Examples

  17. VQA 2.0 Data • Training • 443,757 questions • 82,783 images • Validation • 214,354 questions • 40,504 images • Test • 447,793 questions • 81,434 images • All images come with 5 human-generated captions.

  18. Experimental Results • Compare with the state-of-the-art

  19. Experimental Results • Comparing different types of captions • Generated relevant captions help VQA more than the question-agnostic captions from BUTD.

  20. Improving Image Captioning Using an Image-Conditioned Auto-Encoder

  21. Aiding Training by Using an Easier Task • Using an easier task that first encodes the human captions and the image, and then generates the caption back. • Example captions: C1: several doughnuts are in a cardboard box. C2: a box holds four pairs of mini doughnuts. C3: a variety of doughnuts sit in a box. C4: several different donuts are placed in the box. C5: a fresh box of twelve assorted glazed pastries. (Diagram: ENC encodes the captions and image; DEC reconstructs the caption.)
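A minimal sketch of such an image-conditioned caption auto-encoder (the GRU choice, layer sizes, and conditioning via the initial hidden state are assumptions for illustration):

import torch
import torch.nn as nn

class CaptionAutoEncoder(nn.Module):
    # Encodes a ground-truth caption conditioned on the image, then decodes the same
    # caption back. The decoder's hidden states can later serve as "oracle" targets
    # for a normal captioning model.
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_to_h0 = nn.Linear(img_dim, hid_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.vocab_out = nn.Linear(hid_dim, vocab_size)

    def forward(self, caption_ids, image_feats):
        emb = self.embed(caption_ids)                   # (B, T, emb_dim)
        h0 = self.img_to_h0(image_feats).unsqueeze(0)   # condition the encoder on the image
        _, enc_h = self.encoder(emb, h0)                # summary of caption + image
        dec_states, _ = self.decoder(emb, enc_h)        # reconstruct with teacher forcing
        logits = self.vocab_out(dec_states)             # per-step word distributions
        return logits, dec_states                       # dec_states = oracle hidden states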

  22. Model Overview

  23. Training for Image Captioning • Maximum likelihood principle • REINFORCE algorithms

  24. Hidden State Supervision • Both of these training approaches supervise only the output word probabilities; therefore the hidden states receive no direct supervision. • Supervising the hidden states requires oracle hidden states that contain richer information. • An easier task that first encodes the human captions and the image, and then generates the caption back, can provide them. • Hidden state loss for time (t)
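The slide's formula is not reproduced in this transcript; one plausible form, assuming an L2 penalty between the captioner's hidden state $h_t$ and the auto-encoder's oracle hidden state $\hat{h}_t$ at the same step, is:

$\mathcal{L}_{hidden}(t) = \lVert h_t - \hat{h}_t \rVert_2^2$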

  25. Training with Maximum Likelihood • Jointly optimizes the log-likelihood and the hidden-state loss at each time step (t)
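The exact objective is likewise not in the transcript; a plausible joint maximum-likelihood objective, with $\lambda$ an assumed weighting hyperparameter, is:

$\mathcal{L}_{MLE} = \sum_t \big( -\log p_\theta(w_t^* \mid w_{1:t-1}^*, I) + \lambda \, \mathcal{L}_{hidden}(t) \big)$

where $w_t^*$ is the ground-truth word at step $t$ and $I$ is the image.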

  26. Training with REINFORCE • Objectives • Gradients • Problem • Every word receives the same amount of reward, no matter how appropriate it is.
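The standard REINFORCE objective and gradient estimate for captioning (written here from the general self-critical form, not copied from the slide) are:

$\mathcal{L}_{RL} = -\mathbb{E}_{w^s \sim p_\theta}\big[ r(w^s) \big], \qquad \nabla_\theta \mathcal{L}_{RL} \approx -\big( r(w^s) - b \big) \, \nabla_\theta \log p_\theta(w^s)$

where $r$ is a sentence-level reward (e.g. CIDEr) and $b$ is a baseline, typically the reward of the greedy-decoded caption. Because $r(w^s)$ is a single sentence-level scalar, every sampled word receives the same reward, which is the problem noted above.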

  27. Hidden State Loss as a Reward Bias • Motivation • A word should receive more reward when its hidden state matches a high-performance oracle encoder. • Reward bias
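One plausible instantiation of the reward bias (an assumption; the slide's exact formula is not in the transcript) adds a per-word term that grows when the captioner's hidden state agrees with the oracle auto-encoder's:

$r_t = r(w^s) + \beta \cdot \mathrm{sim}(h_t, \hat{h}_t)$

so words whose hidden states align with the high-performing oracle receive more reward than words that do not.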

  28. Experimental Data • COCO (Chen et al., 2015) • Each image with 5 human captions • “Karpathy split” • 110,000 training images • 5,000 validation images • 5,000 test images

  29. Baseline Systems • FC (Rennie et al., 2017) • With and without “self critical sequence training” • Up-Down (aka BUTD) (Anderson et al., 2018) • With and without “self critical sequence training”

  30. Evaluation Metrics • BLEU-4 (B-4) • METEOR (M) • ROUGE-L (R-L) • CIDEr (C) • SPICE (S)

  31. Experimental Results for Max Likelihood

  32. Experimental Results for REINFORCE • Training with different reward metrics

  33. Conclusions • Jointly generating “question relevant” captions can improve Visual Question Answering. • First training an image-conditioned caption auto-encoder can help supervise a captioner to create better hidden state representations that improve final captioning performance.
