
Training Discriminative Computer Vision Models with Weak Supervision



  1. Training Discriminative Computer Vision Models with Weak Supervision Boris Babenko PhD Defense University of California, San Diego

  2. Outline • Overview • Supervised Learning • Weakly Supervised Learning • Weakly Labeled Location • Object Localization and Recognition • Object Detection with Parts • Object Tracking • Weakly Labeled Categories • Object Detection with Sub-categories • Object Recognition with Super-categories • Theoretical Analysis of Multiple Instance Learning • Conclusions & Future Work

  3. Outline • Overview • Supervised Learning • Weakly Supervised Learning • Weakly Labeled Location • Object Localization and Recognition • Object Detection with Parts • Object Tracking • Weakly Labeled Categories • Object Detection with Sub-categories • Object Recognition with Super-categories • Theoretical Analysis of Multiple Instance Learning • Conclusions & Future Work

  4. Computer Vision Problems • Want to detect, recognize/classify, track objects in images and videos • Examples: • Face detection for point-and-shoot cameras • Pedestrian detection for cars • Animal tracking for behavioral science • Landmark/place recognition for search-by-image

  5. Old School • Hand tuned models per application • Example: face detection [Yang et al. ‘94]

  6. New School • Adopt methods from machine learning • Train a generic* system by providing labeled examples (supervised learning) • Labeling examples is intuitive • Adapts to new domains/applications • Learns subtle cues that would be impossible to model by hand * Hand tuning/design still often required :-/

  7. Supervised Learning • Training data: pairs of inputs and labels • Train classifier to predict label for novel input • [Figure: labeled training examples (face / non-face) at training time; classifier assigns labels at run time]

  8. Supervised Learning • Training data: pairs $\{(x_i, y_i)\}_{i=1}^{n}$ • Most common case: inputs/instances $x_i \in \mathbb{R}^d$, labels $y_i \in \{-1, +1\}$ • Want to train a classifier: $h : \mathbb{R}^d \rightarrow \{-1, +1\}$ • Typically a classifier also outputs a confidence score, in addition to a label

  9. Discriminative vs Generative • Generative: model the distribution of the data • Discriminative: directly minimize classification error, model the boundary • E.g. SVM, AdaBoost, Perceptron • Tends to outperform generative models

  10. Training Discriminative Model • Objective (minimize training error): $\min_f \sum_i \ell(y_i, f(x_i)) + \lambda R(f)$ • Loss function $\ell$ is typically a convex upper bound on the 0/1 loss • Regularization term $R(f)$ can help avoid over-fitting
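The objective above can be made concrete with a small sketch (an illustration, not the talk's actual implementation): logistic loss as the convex upper bound on 0/1 loss, L2 regularization, and plain gradient descent. The names `logistic_loss_grad` and `train` are hypothetical.

```python
import numpy as np

def logistic_loss_grad(w, X, y, lam):
    """Logistic loss (a convex upper bound on 0/1 loss) plus L2 regularization.

    X: (n, d) inputs, y: (n,) labels in {-1, +1}, lam: regularization weight.
    Returns (loss, gradient) for the linear classifier f(x) = w . x.
    """
    margins = y * (X @ w)  # y_i * f(x_i)
    loss = np.log1p(np.exp(-margins)).mean() + lam * (w @ w)
    # d/dw of mean log(1 + exp(-y f(x))) is the mean of -y x / (1 + exp(margin))
    grad = (X * (-y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
    return loss, grad

def train(X, y, lam=1e-3, lr=0.5, steps=200):
    """Minimize the regularized surrogate loss by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        _, g = logistic_loss_grad(w, X, y, lam)
        w -= lr * g
    return w
```

On linearly separable toy data, the learned weight vector classifies all training points correctly; the margin `X @ w` doubles as the confidence score mentioned on the previous slide.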

  11. Weak Supervision • Slightly overloaded term… • Any form of learning where the training data is missing some labels (i.e. latent variables)

  12. Object Detection w/ Weak Supervision • Goal: train object detector • Strong supervision: object location is labeled in each image • Weak supervision: only presence of object is known, not location

  13. Object Detection w/ Weak Supervision • Goal: train object detector • Strong supervision: object location is labeled in each image • Weak supervision: only presence of object is known, not location (location is a latent variable)

  14. Weak Supervision: Advantages • Reduce labor cost • Deal with inherent ambiguity & human error • Automatically discover latent information

  15. Training w/ Latent Variables • Classifier now takes in input AND latent input: $f(x, z)$ • To predict label: $\hat{y} = \operatorname{sign}\big(\max_z f(x, z)\big)$ • Objective: $\min_f \sum_i \ell\big(y_i, \max_z f(x_i, z)\big) + \lambda R(f)$

  16. Training w/ Latent Variables • Classifier now takes in input AND latent input: $f(x, z)$ • To predict label: $\hat{y} = \operatorname{sign}\big(\max_z f(x, z)\big)$ • Objective: $\min_f \sum_i \ell\big(y_i, \max_z f(x_i, z)\big) + \lambda R(f)$ • Not convex!

  17. Training w/ Latent Variables • Two ways of solving: • Method 1: Alternate between inferring latent variables and training the classifier • Inferring latent variables given a fixed classifier may require domain knowledge • E.g. EM (Dempster et al.), Latent Structural SVM (Yu & Joachims) – based on CCCP (Yuille & Rangarajan), Latent SVM (Felzenszwalb et al.), MI-SVM (Andrews et al.)
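Method 1 can be sketched as a tiny alternating loop. This is an illustration under simplifying assumptions (a linear classifier over feature vectors, where the latent variable is which instance of each positive bag is the true positive); `fit_linear` and `alternating_train` are hypothetical names, not code from the talk.

```python
import numpy as np

def fit_linear(X, y, lr=0.5, steps=100):
    """Tiny logistic-regression fitter used as the inner supervised step.
    X: (n, d) inputs, y: (n,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = (X * (-y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def alternating_train(pos_bags, neg_instances, n_iters=5):
    """Alternate between (a) selecting, for each positive bag, the instance
    the current model scores highest (the latent choice) and (b) retraining
    on those selections plus all negative instances."""
    choices = [0] * len(pos_bags)  # arbitrary initial latent choices
    w = None
    for _ in range(n_iters):
        X = np.vstack([bag[c] for bag, c in zip(pos_bags, choices)] + [neg_instances])
        y = np.array([1.0] * len(pos_bags) + [-1.0] * len(neg_instances))
        w = fit_linear(X, y)
        choices = [int(np.argmax(bag @ w)) for bag in pos_bags]
    return w, choices
```

Each half-step cannot increase the (non-convex) objective, so the loop converges to a local optimum; which one depends on the initial latent choices.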

  18. Training w/ Latent Variables • Method 2: Replace the hard max with “soft” approximation, and then do gradient descent • E.g. MILBoost (Viola et al.), MIL-Logistic Regression (Ray et al.)
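Method 2's "soft" max can be illustrated with a log-sum-exp approximation (one common choice; the specific form and the name `soft_max` are assumptions for illustration, not necessarily the variant used by the cited algorithms).

```python
import numpy as np

def soft_max(scores, alpha=10.0):
    """Differentiable approximation of max over instance scores.

    Log-sum-exp with sharpness parameter alpha: as alpha grows, the value
    approaches the hard max, but the function stays smooth everywhere,
    so gradient descent can be applied directly.
    """
    scores = np.asarray(scores, dtype=float)
    return np.log(np.sum(np.exp(alpha * scores))) / alpha
```

The approximation always upper-bounds the hard max, and the gap shrinks as `alpha` increases; in practice `alpha` trades off smoothness against fidelity to the true objective.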

  19. Outline • Overview • Supervised Learning • Weakly Supervised Learning • Weakly Labeled Location • Object Detection, Localization and Recognition • Object Detection with Parts • Object Tracking • Weakly Labeled Categories • Object Detection with Sub-categories • Object Recognition with Super-categories • Theoretical Analysis of Multiple Instance Learning • Conclusions & Future Work

  20. Object Detection w/ Weak Supervision • Goal: train object detector • Only presence of object is known, not location • Can’t “just throw these into a learning alg.” – very difficult to design invariant features

  21. Multiple Instance Learning (MIL) • (set of inputs, label) pairs provided • MIL lingo: set of inputs = bag of instances • Learner does not see instance labels • Bag labeled positive if at least one instance in bag is positive [Keeler et al. ‘90, Dietterich et al. ‘97]

  22. Object Detection w/ MIL • Instance: image patch • Instance label: is face? • Bag: whole image • Bag label: contains face? [Andrews et al. ’02, Viola et al. ’05, Dollar et al. ’08, Galleguillos et al. ’08]

  23. MIL Notation • Training input: bags $X_i = \{x_{i1}, \dots, x_{im}\}$ with bag labels $y_i \in \{-1, +1\}$ • Instance labels $y_{ij}$ (unknown during training)

  24. MIL • Positive bag contains at least one positive instance: $y_i = \max_j y_{ij}$ • Goal: learn instance classifier $h(x)$ • Corresponding bag classifier: $H(X_i) = \max_j h(x_{ij})$
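The bag classifier on this slide is a one-liner: score a bag by its best instance. A minimal sketch (the function names are hypothetical; `h` stands in for any instance scorer):

```python
def bag_score(instances, h):
    """Bag classifier H(X) = max_j h(x_j): a bag is scored by its
    highest-scoring instance, matching the MIL assumption that a
    positive bag contains at least one positive instance."""
    return max(h(x) for x in instances)

def bag_predict(instances, h, threshold=0.0):
    """Threshold the bag score to get a bag label (1 = positive)."""
    return 1 if bag_score(instances, h) > threshold else 0
```

Note the asymmetry this encodes: a single confidently positive instance makes the whole bag positive, while a negative bag must have every instance score below threshold.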

  25. MIL Algorithms • Many “standard” learning algorithms have been adapted to the MIL scenario: • SVM (Andrews et al. ‘02), Boosting (Viola et al. ‘05), Logistic Regression (Ray et al. ‘05) • Some specialized algorithms also exist • DD (Maron et al. ’98), EM-DD (Zhang et al. ‘02)

  26. MIL Algorithms • Objective: minimize bag error on training data • MILBoost (Viola et al. ‘05) • Replace max with a differentiable approximation, e.g. the Noisy-OR model $p_i = 1 - \prod_j (1 - p_{ij})$, the bag label probability according to the current classifier • Use functional gradient descent (Mason et al. ’00, Friedman ’01)
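The Noisy-OR bag probability can be sketched directly from the formula above (an illustrative fragment, not the MILBoost implementation; `noisy_or_bag_prob` is a hypothetical name):

```python
import numpy as np

def sigmoid(s):
    """Map an instance score to an instance probability p_ij."""
    return 1.0 / (1.0 + np.exp(-np.asarray(s, dtype=float)))

def noisy_or_bag_prob(instance_scores):
    """Noisy-OR bag probability: p_i = 1 - prod_j (1 - p_ij).

    The bag is positive unless every instance is negative, which is a
    smooth, differentiable stand-in for the hard max over instances,
    so it can be plugged into (functional) gradient descent.
    """
    p = sigmoid(instance_scores)
    return 1.0 - np.prod(1.0 - p)
```

A bag of uniformly low-scoring instances gets probability near 0, while one high-scoring instance is enough to push the bag probability near 1, mirroring the MIL assumption.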

  27. Object Detection • Have a learning framework (MIL), and an algorithm to train classifier (MILBoost) • Question: how exactly do we form a bag? • Option 1: segmentation (bag = set of segments) • Option 2: sliding window (bag = set of windows)

  28. Forming a bag via segmentation • Pro: get more precise localization • Con: segmentation algorithms often fail; require prior knowledge (e.g. number of segments) • If segmentation fails, we might not see “the” positive instance in a positive bag • Only way to prevent this is to use ALL possible segments… not practical

  29. Multiple Stable Segmentations (MSS) • Solution: Multiple Stable Segmentations (Rabinovich et al. ‘06) • A heuristic for picking out a few “good” segments from the huge set of all possible segments • End up with more segments, but higher chance of getting the “right” segment

  30. Multiple Instance Learning with Stable Segmentation (MILSS) • Localization and Recognition • Bag: multiple stable segmentations of the image • Features: bag-of-features (BOF) on SIFT, one descriptor per segment • Classifier: MILBoost, one-vs-all (for multiclass) [Work with Carolina Galleguillos, Andrew Rabinovich & Serge Belongie – ECCV ‘08]

  31. Results: Landmarks

  32. Results: Landmarks

  33. More segments = better results • [Figure: our system vs. NCuts w/ k=6 vs. NCuts w/ k=4]

  34. Outline • Overview • Supervised Learning • Weakly Supervised Learning • Weakly Labeled Location • Object Localization and Recognition • Object Detection with Parts • Object Tracking • Weakly Labeled Categories • Object Detection with Sub-categories • Object Recognition with Super-categories • Theoretical Analysis of Multiple Instance Learning • Conclusions & Future Work

  35. Object Detection with Parts • Pedestrians are non-rigid • Difficult to design features that are invariant • Decision boundary very complex • Object parts are rigid

  36. Object Detection with Parts • Naïve sol’n: label parts and train detectors • Labor intensive • Sub-optimal (e.g. “space between the legs”) • Better sol’n: • Use rough location of objects • Treat part locations as latent variables [Mohan et al. ’01, Mikolajczyk et al. ‘04]

  37. Multiple Component Learning (MCL) • How to train a part detector from weakly labeled data? • How to train many, diverse part detectors? • How to combine part detectors and incorporate spatial information? [Work with Piotr Dollar, Pietro Perona, Zhuowen Tu & Serge Belongie – ECCV ‘08]

  38. MCL: One Part Detector • Fits perfectly into MIL • Which part does it learn? [Figure: positive bags of image patches]

  39. MCL: Diverse Parts • Pedestrian images are “roughly aligned” • Choose random sections of the images to feed into MIL
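Choosing random sections of roughly aligned images can be sketched as sampling random rectangular windows (an illustration with assumed conventions: `(x, y, w, h)` boxes in pixel coordinates; `sample_random_windows` is a hypothetical name):

```python
import random

def sample_random_windows(img_h, img_w, n, min_size=8, seed=0):
    """Sample n random rectangles (x, y, w, h) inside an img_h x img_w image.

    Because the training images are roughly aligned, each sampled region
    tends to cover the same body part across images; training one MIL
    learner per region yields a diverse set of part detectors.
    """
    rng = random.Random(seed)
    windows = []
    for _ in range(n):
        w = rng.randint(min_size, img_w)   # randint bounds are inclusive
        h = rng.randint(min_size, img_h)
        x = rng.randint(0, img_w - w)
        y = rng.randint(0, img_h - h)
        windows.append((x, y, w, h))
    return windows
```

Fixing the seed makes the region set reproducible, so the same parts are learned on every training run.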

  40. MCL: Top 5 Learned Detectors

  41. MCL: Combining Part Detectors • Run part detectors, get response maps • Compute Haar features on top, plug into Boosting • [Figure: confidence maps from each part detector]

  42. MCL: Results • INRIA Pedestrian dataset

  43. MCL: Results

  44. MCL: Related Work • P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. "Object Detection with Discriminatively Trained Part-Based Models" IEEE PAMI. Sept 2009. • Very similar model, uses SVM instead of Boosting, and an explicit shape model • L. Bourdev, S. Maji, T. Brox, J. Malik. “Detecting people using mutually consistent poselet activations” ECCV 2010.

  45. Outline • Overview • Supervised Learning • Weakly Supervised Learning • Weakly Labeled Location • Object Localization and Recognition • Object Detection with Parts • Object Tracking • Weakly Labeled Categories • Object Detection with Sub-categories • Object Recognition with Super-categories • Theoretical Analysis of Multiple Instance Learning • Conclusions & Future Work

  46. Object Tracking • Problem: given location of object in first frame, track object through video • Tracking by Detection: alternate training detector and running it on each frame

  47. Tracking by Detection • First frame is labeled

  48. Tracking by Detection • First frame is labeled • Train an online classifier (e.g. Online AdaBoost)

  49. Tracking by Detection • Grab one positive patch and some negative patches, and train/update the model.
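The update-then-detect loop can be sketched as a minimal online learner (an illustration, not the talk's tracker: feature extraction is left abstract, and the perceptron-style update and the class name `OnlineTracker` are assumptions):

```python
import numpy as np

class OnlineTracker:
    """Minimal tracking-by-detection sketch: a linear classifier over
    patch features, updated online with one positive (the tracked patch)
    and several negatives (nearby patches), then used to pick the
    best-scoring candidate location in the next frame."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def update(self, pos_feat, neg_feats):
        # Perceptron-style step: push the positive patch's score up
        # and the negative patches' scores down.
        self.w += self.lr * pos_feat
        for f in neg_feats:
            self.w -= self.lr * f / max(len(neg_feats), 1)

    def best_candidate(self, candidate_feats):
        # Run the detector on candidate patches; return the argmax index.
        scores = [self.w @ f for f in candidate_feats]
        return int(np.argmax(scores))
```

Each frame alternates `update` (training) and `best_candidate` (detection), which is exactly the loop described on the previous slides; drift arises when a mislocalized "positive" patch is fed back into `update`.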
