
Loss-based Learning with Weak Supervision


Presentation Transcript


  1. Loss-based Learning with Weak Supervision M. Pawan Kumar

  2. Computer Vision Data [plot: Information (x-axis) vs. Log (Size) (y-axis)] Segmentation: ~ 2000

  3. Computer Vision Data [same plot] Bounding Box: ~ 1 M; Segmentation: ~ 2000

  4. Computer Vision Data [same plot] Image-Level (“Chair”, “Car”): > 14 M; Bounding Box: ~ 1 M; Segmentation: ~ 2000

  5. Computer Vision Data [same plot] Noisy Label: > 6 B; Image-Level: > 14 M; Bounding Box: ~ 1 M; Segmentation: ~ 2000

  6. Computer Vision Data • Detailed annotation is expensive • Sometimes annotation is impossible • Desired annotation keeps changing • Learn with missing information (latent variables)

  7. Outline • Two Types of Problems • Part I – Annotation Mismatch • Part II – Output Mismatch

  8. Annotation Mismatch Action Classification Input x Annotation y Latent h y = “jumping” Desired output during test time is y Mismatch between desired and available annotations Exact value of latent variable is not “important”

  9. Output Mismatch Action Classification Input x Annotation y Latent h y = “jumping”

  10. Output Mismatch Action Detection Input x Annotation y Latent h y = “jumping” Desired output during test time is (y,h) Mismatch between output and available annotations Exact value of latent variable is important

  11. Part I

  12. Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Extensions Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

  13. Weakly Supervised Data Input x Output y ∈ {-1,+1} Hidden h y = +1

  14. Weakly Supervised Classification x Feature Φ(x,h) h Joint Feature Vector Ψ(x,y,h) y = +1

  15. Weakly Supervised Classification Feature Φ(x,h) Joint Feature Vector Ψ(x,+1,h) = [Φ(x,h); 0] y = +1

  16. Weakly Supervised Classification Feature Φ(x,h) Joint Feature Vector Ψ(x,-1,h) = [0; Φ(x,h)] y = +1

  17. Weakly Supervised Classification Feature Φ(x,h) Joint Feature Vector Ψ(x,y,h) y = +1 Score f : Ψ(x,y,h) → (-∞, +∞) Optimize score over all possible y and h
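The block structure of slides 15–16 can be written down directly. The sketch below is a minimal NumPy illustration (not the tutorial's code), assuming a hypothetical feature function phi(x, h) that returns a 1-D array:

```python
# Minimal sketch of the joint feature map on slides 15-17, assuming a
# hypothetical feature function phi(x, h) that returns a 1-D NumPy array.
import numpy as np

def joint_feature(phi, x, y, h):
    """Psi(x, y, h): place phi(x, h) in the block selected by y in {-1, +1}."""
    f = phi(x, h)
    d = f.shape[0]
    psi = np.zeros(2 * d)
    if y == +1:
        psi[:d] = f      # Psi(x, +1, h) = [phi(x, h); 0]
    else:
        psi[d:] = f      # Psi(x, -1, h) = [0; phi(x, h)]
    return psi
```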

  18. Latent SVM Scoring function w^T Ψ(x,y,h) with parameters w Prediction y(w),h(w) = argmax_{y,h} w^T Ψ(x,y,h)
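Prediction on slide 18 is a joint maximization over y and h. A brute-force sketch, assuming a small enumerable latent space H and reusing the hypothetical joint_feature helper above:

```python
# Sketch of latent SVM prediction: (y(w), h(w)) = argmax_{y,h} w^T Psi(x, y, h).
# Assumes an enumerable latent space H and the joint_feature sketch above.
import numpy as np

def predict(w, x, H, phi, labels=(-1, +1)):
    best = (-np.inf, None, None)
    for y in labels:
        for h in H:
            score = float(w @ joint_feature(phi, x, y, h))
            if score > best[0]:
                best = (score, y, h)
    return best[1], best[2]   # y(w), h(w)
```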

  19. Learning Latent SVM Training data {(x_i,y_i), i = 1,2,…,n} Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w)) No restriction on the loss function Annotation mismatch

  20. Learning Latent SVM Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w)) Non-convex Parameters cannot be regularized Find a regularization-sensitive upper bound

  21. Learning Latent SVM Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))

  22. Learning Latent SVM Δ(y_i, y_i(w)) ≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), since y(w),h(w) = argmax_{y,h} w^T Ψ(x,y,h)

  23. Learning Latent SVM min_w ||w||^2 + C Σ_i ξ_i s.t. max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i Parameters can be regularized Is this also convex?

  24. Learning Latent SVM min_w ||w||^2 + C Σ_i ξ_i s.t. max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i Convex - Convex Difference of convex (DC) program
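Written in unconstrained form, the learning problem of slides 23–24 makes the difference-of-convex structure explicit; the LaTeX below is simply a clean restatement of the slide:

```latex
\min_{w} \; \|w\|^2 + C \sum_i \Big[
  \underbrace{\max_{y,h}\big( w^\top \Psi(x_i,y,h) + \Delta(y_i,y) \big)}_{\text{convex in } w}
  \;-\;
  \underbrace{\max_{h_i} w^\top \Psi(x_i,y_i,h_i)}_{\text{convex in } w}
\Big]
```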

  25. Recap Scoring function w^T Ψ(x,y,h) Prediction y(w),h(w) = argmax_{y,h} w^T Ψ(x,y,h) Learning min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i,y,h) + Δ(y_i,y) - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i for all y, h
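As a sanity check of the recap, the slack ξ_i of a sample can be evaluated by enumeration. The sketch below makes the same assumptions as the earlier sketches (small H, hypothetical joint_feature) and takes a user-supplied loss delta(y_i, y):

```python
# Sketch: per-sample upper bound (slack) from slides 23-25, by enumeration.
# Assumes the joint_feature sketch above and a user-supplied loss delta(y_i, y).
def sample_slack(w, x, y_true, H, phi, delta, labels=(-1, +1)):
    loss_augmented = max(float(w @ joint_feature(phi, x, y, h)) + delta(y_true, y)
                         for y in labels for h in H)
    best_annotated = max(float(w @ joint_feature(phi, x, y_true, h)) for h in H)
    return loss_augmented - best_annotated
```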

  26. Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Extensions

  27. Learning Latent SVM min_w ||w||^2 + C Σ_i ξ_i s.t. max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i Difference of convex (DC) program

  28. Concave-Convex Procedure max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) Linear upper-bound of concave part

  29. Concave-Convex Procedure max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) Optimize the convex upper bound

  30. Concave-Convex Procedure max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) Linear upper-bound of concave part

  31. Concave-Convex Procedure max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) Until Convergence

  32. Concave-Convex Procedure max_{y,h} [w^T Ψ(x_i,y,h) + Δ(y_i,y)] - max_{h_i} w^T Ψ(x_i,y_i,h_i) Linear upper bound?

  33. Linear Upper Bound -max_{h_i} w^T Ψ(x_i,y_i,h_i) Current estimate = w_t h_i* = argmax_{h_i} w_t^T Ψ(x_i,y_i,h_i) -w^T Ψ(x_i,y_i,h_i*) ≥ -max_{h_i} w^T Ψ(x_i,y_i,h_i)

  34. CCCP for Latent SVM Start with an initial estimate w_0 Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i) Update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i Repeat until convergence
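A compact, deliberately naive sketch of the outer loop on slide 34 is below. It reuses the hypothetical helpers from the earlier sketches and replaces the ε-optimal structural SVM solver with plain subgradient descent, which is enough to show the two alternating steps: impute h_i*, then minimize the convex upper bound.

```python
# Sketch of CCCP for latent SVM (slide 34). NOT the tutorial's solver: the inner
# epsilon-optimal problem is approximated here by subgradient descent.
# dim must equal 2 * len(phi(x, h)), matching the joint_feature sketch above.
import numpy as np

def cccp_latent_svm(data, H, phi, delta, dim, C=1.0,
                    labels=(-1, +1), outer_iters=20, inner_iters=200, lr=1e-3):
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Step 1: impute the latent variables with the current parameters w_t.
        h_star = [max(H, key=lambda h: float(w @ joint_feature(phi, x, y, h)))
                  for x, y in data]
        # Step 2: approximately minimize ||w||^2 + C sum_i xi_i with h_i* fixed.
        for _ in range(inner_iters):
            grad = 2.0 * w                      # gradient of ||w||^2
            for (x, y), h_i in zip(data, h_star):
                # Loss-augmented prediction: argmax_{y,h} w^T Psi + Delta.
                y_hat, h_hat = max(((yy, hh) for yy in labels for hh in H),
                                   key=lambda p: float(w @ joint_feature(phi, x, *p))
                                                 + delta(y, p[0]))
                margin = float(w @ (joint_feature(phi, x, y, h_i)
                                    - joint_feature(phi, x, y_hat, h_hat)))
                if delta(y, y_hat) - margin > 0:   # constraint violated
                    grad = grad + C * (joint_feature(phi, x, y_hat, h_hat)
                                       - joint_feature(phi, x, y, h_i))
            w = w - lr * grad
    return w
```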

  35. Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Extensions

  36. Action Classification Train: input x_i, output y_i Test: input x, output y (e.g. y = “Using Computer”) PASCAL VOC 2011 classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking 80/20 Train/Test Split 5 Folds

  37. Setup • 0-1 loss function • Poselet-based feature vector • 4 seeds for random initialization • Code + Data • Train/Test scripts with hyperparameter settings http://www.centrale-ponts.fr/tutorials/cvpr2013/
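For reference, the 0-1 loss named in this setup is the simplest choice for the delta argument assumed in the earlier sketches:

```python
# 0-1 loss on the annotation y, as used in the action classification setup.
def zero_one_loss(y_true, y):
    return 0.0 if y == y_true else 1.0
```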

  38. Objective

  39. Train Error

  40. Test Error

  41. Time

  42. Outline – Annotation Mismatch • Latent SVM • Optimization • Practice • Annealing the Tolerance • Annealing the Regularization • Self-Paced Learning • Choice of Loss Function • Extensions

  43. Start with an initial estimate w_0 Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i) Update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i Repeat until convergence Overfitting in initial iterations

  44. Start with an initial estimate w_0 Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i) Update w_{t+1} as the ε'-optimal solution of min ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i Anneal the tolerance: ε' ← ε'/K until ε' = ε Repeat until convergence
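One way to read slide 44 is as a schedule for the inner solver's tolerance: start loose and tighten by a factor K each CCCP iteration until the target ε is reached. A hypothetical sketch (eps0 and K are illustrative values, not taken from the slides):

```python
# Sketch of annealing the tolerance (slide 44): the inner tolerance eps_prime
# starts loose and is divided by K each CCCP iteration until it reaches the
# final tolerance eps. eps0 and K are illustrative values.
def annealed_tolerances(eps0=1.0, eps=1e-3, K=10.0):
    eps_prime = eps0
    while eps_prime > eps:
        yield eps_prime
        eps_prime /= K
    while True:
        yield eps
```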

  45. Objective

  46. Objective

  47. Train Error

  48. Train Error

  49. Test Error

  50. Test Error
