deepid net deformable deep convolutional neural network for generic object detection n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
DeepID-Net: deformable deep convolutional neural network for generic object detection PowerPoint Presentation
Download Presentation
DeepID-Net: deformable deep convolutional neural network for generic object detection

Loading in 2 Seconds...

play fullscreen
1 / 44

DeepID-Net: deformable deep convolutional neural network for generic object detection - PowerPoint PPT Presentation


  • 246 Views
  • Uploaded on

DeepID-Net: deformable deep convolutional neural network for generic object detection.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'DeepID-Net: deformable deep convolutional neural network for generic object detection' - sharon-dennis


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
deepid net deformable deep convolutional neural network for generic object detection

DeepID-Net: deformable deep convolutional neural network for generic object detection

WanliOuyang, Ping Luo, XingyuZeng, Shi Qiu, YonglongTian, Hongsheng Li, Shuo Yang, Zhe Wang, YuanjunXiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang 

WanliOuyang et al. DeepID-Net: multi-stage and deformable deep convolutional neural network for generic object detection, arXiv:1409.3505 [cs.CV]

deepid net deformable deep convolutional neural network for gener i c object d etection

DeepID-Net: deformable deepconvolutional neural network for generic object detection

WanliOuyang, Ping Luo, XingyuZeng, Shi Qiu, YonglongTian, Hongsheng Li, Shuo Yang, Zhe Wang, YuanjunXiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang 

WanliOuyang et al. DeepID-Net: multi-stage and deformable deep convolutional neural network for generic object detection, arXiv:1409.3505 [cs.CV]

slide3

http://mmlab.ie.cuhk.edu.hk/index.html

  • Our most recent deep model on face verification reached 99.15%, the best performance on LFW.
slide4
RCNN

person

Selective search

AlexNet+SVM

Bounding box regression

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014

slide5
RCNN

mAP 31

to 40.9 (45) on val2

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Model averaging

person

person

person

horse

horse

horse

slide6
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

bounding box rejection
Bounding box rejection

Box rejection

  • Motivation
    • Speed up feature extraction by ~10 times
    • Improve mean AP by 1%
  • RCNN
    • Selective search: ~ 2400 bounding boxes per image
    • ILSVRC val: ~20,000 images, ~2.4 days
    • ILSVRC test: ~40,000 images, ~4.7days
  • Bounding box rejection by RCNN:
    • For each box, RCNN has 200 scores S1…200 for 200 classes
    • If max(S1…200) < -1.1, reject. 6% remaining bounding boxes

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014

slide8
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

slide10
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

pretraining the deep model
Pretraining the deep model
  • RCNN (Cls+Det)
    • AlexNet
    • Pretrain on image-level annotation data with 1000 classes
    • Finetune on object-level annotation data with 200+1 classes
  • Our investigation
    • Classification vs. detection(image vs. tight bounding box)?
    • 1000 classes vs. 200 classes
    • AlexNet or Clarifai or other choices, e.g. GoogleLenet?
    • Complementary
deep model training pretrain
Deep model training – pretrain
  • RCNN (Image Cls+Det)
    • Pretrain on image-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes
    • Gap: classification vs. detection, 1000 vs. 200

Image classification

Object detection

deep model training pretrain1
Deep model training – pretrain
  • RCNN (ImageNetCls+Det)
    • Pretrain on image-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes
    • Gap: classification vs. detection, 1000 vs. 200
  • Our approach (ImageNetCls+Loc+Det)
    • Pretrain on image-level annotation with 1000 classes
    • Finetuneon object-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes
deep model training pretrain2
Deep model training – pretrain
  • RCNN (Cls+Det)
    • Pretrain on image-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes
    • Gap: classification vs. detection, 1000 vs. 200
  • Our approach (Loc+Det)
    • Pretrain on object-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes
deep model design
Deep model design
  • AlexNet or Clarifai
result and discussion
Result and discussion
  • RCNN (Cls+Det),
  • Our investigation
      • Better pretraining on 1000 classes
result and discussion1
Result and discussion
  • RCNN (Cls+Det),
  • Our investigation
      • Better pretraining on 1000 classes
      • Object-level annotation is more suitable for pretraining

23% AP increase for rugby ball

17.4% AP increase for hammer

result and discussion2
Result and discussion
  • RCNN (ImageNetCls+Det),
  • Our investigation
      • Better pretraining on 1000 classes
      • Object-level annotation is more suitable for pretraining
      • Clarifaiis better. But Alex and Clarifai are complementary on different classes.

scorpion

AP diff

class

hamster

slide19
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

deep model training def pooling layer
Deep model training – def-pooling layer
  • RCNN (ImageNetCls+Det)
    • Pretrain on image-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes
    • Gap: classification vs. detection, 1000 vs. 200
  • Our approach (ImageNetLoc+Det)
    • Pretrain on object-level annotation with 1000 classes
    • Finetune on object-level annotation with 200 classes with def-pooling layers
deformation
Deformation
  • Learning deformation [a] is effective in computer vision society.
  • Missing in deep model.
  • We propose a new deformation constrained pooling layer.

[a] P. Felzenszwalb, R. B. Grishick, D.McAllister, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. PAMI, 32:1627–1645, 2010. 

modeling part detectors
Modeling Part Detectors
  • Different parts have different sizes
  • Design the filters with variable sizes

Part models learned from HOG

Part models

Learned filtered at the second convolutional layer

deformation layer b
Deformation Layer [b]

[b] WanliOuyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection ", ICCV 2013. 

deformation constrained pooling layer
Deformation constrained pooling layer

Can capture multiple patterns simultaneously

our deep model with deformation layer
Our deep model with deformation layer

Patterns shared across different classes

slide28
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

sub box features
Sub-box features
  • Take the per-channel max/average features of the last fully connected layer from 4 subboxes of the root window.
  • Concatenate subbox features and the features in the root window. Learn an SVM for combining these features.
  • Subboxes are proposed regions that has >0.5 overlap with the four quarter regions. Need not compute features.
  • 0.5 mAP improvement.
  • So far not combined with deformation layer. Used as one of the models in model averaging
slide30
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

deep model training svm net
Deep model training – SVM-net
  • RCNN
    • Fine-tune using soft-max loss (Softmax-Net)
    • Train SVM based on the fc7 features of the fine-tuned net.
deep model training svm net1
Deep model training – SVM-net
  • RCNN
    • Fine-tune using soft-max loss (Softmax-Net)
    • Train SVM based on the fc7 features of the fine-tuned net.
  • Replace Soft-max loss by Hinge loss when fine-tuning (SVM-Net)
    • Merge the two steps of RCNN into one
    • Require no feature extraction from training data (~60 hours)
slide33
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Modelaveraging

person

person

person

horse

horse

horse

context modeling
Context modeling
  • Use the 1000 class Image classification score.
  • ~1% mAP improvement.
context modeling1
Context modeling
  • Use the 1000-class Image classification score.
    • ~1% mAP improvement.
    • Volleyball: improve ap by 8.4% on val2.

Volleyball

Golf ball

Bathing cap

slide36
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Model averaging

person

person

person

horse

horse

horse

model averaging
Model averaging
  • Not only change parameters
    • Net structure: AlexNet(A), Clarifai (C), Deep-ID Net (D), DeepID Net2 (D2)
    • Pretrain: Classification (C), Localization (L)
    • Region rejection or not
    • Loss of net, softmax (S), Hinge loss (H)
    • Choosedifferentsetsofmodelsfordifferentobjectclass
slide38
RCNN

person

person

Selective search

AlexNet+SVM

Bounding box regression

horse

horse

Proposed bounding boxes

Detection results

Refined bounding boxes

Image

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Model averaging

person

person

person

horse

horse

horse

component analysis
Component analysis

person

horse

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Model averaging

person

person

person

horse

horse

horse

component analysis1
Component analysis

person

horse

Our approach

Selective search

Box rejection

DeepID-Net

Pretrain, def-pooling layer, sub-box, hinge-loss

Proposed bounding boxes

Remaining bounding boxes

Context modeling

Image

Bounding box regression

Model averaging

person

person

person

horse

horse

horse

component analysis2
Component analysis
  • New results (training time, time limit (context))
take home message
Take home message
  • 1. Bounding rejection. Save feature extraction by about 10 times, slightly improve mAP (~1%).
  • 2. Pre-training with object-level annotation, more classes. 4.2% mAP
  • 3. Def-pooling layer. 2.5% mAP
  • 4. Hinge loss. Save feature computation time(~60h).
  • 5. Model averaging. Different modeldesignsandtraining schemesleadtohigh diversity
  • 6. Multi-stage Deep-id Net (poster)
acknowledgement
Acknowledgement
  • We gratefully acknowledge the support of NVIDIA for this research.
slide44

Code available soon on http://mmlab.ie.cuhk.edu.hk/projects/ImageNet2014.html