


Presentation Transcript


  1. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. Helping models play nice since 2008... Geremy Heitz, Steve Gould, Ashutosh Saxena, Daphne Koller. August 11, 2008, DAGS

  2. Outline • Understanding Scene Understanding • Related Work • Model Desiderata • CCM Framework • Results • Extensions

  3. Computer View of a “Scene” [Image labeled: SKY, GRASS, SEASIDE PASTURE]

  4. Human View of a “Scene” [Annotations: “She’s walking.” “A cow.” “Some grass…”] “The cow is walking through the grass on a pasture by the sea.”

  5. Scene Understanding • Requires combining many tasks • Object Detection • Scene Categorization • Region Labeling • Depth Reconstruction • Requires the “right” representation • Matches the questions we might ask • Operates at multiple granularities • The whole is greater than the sum… • What information can they share?

  6. Visual Context • Context (from http://www.thefreedictionary.com): • “The words before and after a word or passage in a piece of writing that contribute to its meaning.” • Visual Context: • “The visual objects ‘near’ a particular visual object that contribute to its meaning” • Visual Context Cues: • “Signals obtained from nearby visual objects that may help a classifier classify a query object”

  7. Context Example

  8. Outline • Understanding Scene Understanding • Related Work • Model Desiderata • CCM Framework • Results • Extensions

  9. 3D from Line Drawings • David Waltz – “Understanding Line Drawings of Scenes with Shadows” - 1975

  10. Intrinsic Images • Barrow and Tenenbaum – “Recovering intrinsic scene characteristics from images” – 1978 • Tappen et al. – “Recovering intrinsic images from a single image” – 2005 [Figure: original image decomposed into a reflectance image and a shading image]

  11. Scene Understanding • Derek Hoiem – “Closing the Loop in Scene Interpretation” – CVPR 2008 • Uses “Intrinsic Image” idea • But… • Tailored specifically to his previous models • Fewer classes • Regions get generic properties • Hard to pronounce his name

  12. Context Model Desiderata • Allow state-of-the-art subcomponents • Generic method of combining them • Limited interface into “black boxes” [Components: REGION LABELING (Gould et al., 2007), DEPTH RECONSTRUCTION (Saxena et al., 2007), DETECTION (Dalal & Triggs, 2005)]

  13. Context Model Desiderata • Learn from datasets with arbitrary sets of labels • Different components improve each other [Datasets: MSRC Multiclass Segmentation, PASCAL Visual Object Classes, LabelMe, Stanford Range Image Data]

  14. Cascaded Classification Models • Component modules must have 3 properties (see the interface sketch below) • Learning: the classifier should be able to learn from a set of training instances. • Classification: we should be able to obtain a classification of the output variables. • Connectivity: the classifier should provide a mechanism for including features from other modules.
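To make the “black box” requirement concrete, here is a minimal interface sketch in Python; the class and method names are illustrative assumptions, not from the paper:

```python
# Hypothetical interface for a CCM component module. Each black-box
# classifier must support learning, classification, and accepting
# context features produced by the other modules.
from abc import ABC, abstractmethod

class CCMModule(ABC):
    @abstractmethod
    def learn(self, features, labels, context_features=None):
        """Train on instances; context_features (if any) come from the
        outputs of the other modules in the previous tier."""

    @abstractmethod
    def classify(self, features, context_features=None):
        """Return output labels (or scores) for the given instances."""
```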

  15. CCMs [Diagram: image I → features ΦD, ΦS, ΦZ → tier-0 labels ŶD, ŶS, ŶZ → tier-1 labels → … → tier-L labels] • I: image • Φ: image features • Ŷ: output labels • Features for tier ℓ+1 are computed from Φ and the labels output at tier ℓ
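Under the hypothetical interface sketched above, the cascade wiring is a short loop; this assumes one dict of trained modules per tier, since each tier carries its own parameters:

```python
def ccm_inference(features, tier_modules):
    """Run the cascade. Tier 0 sees only image features Phi; each later
    tier also sees the labels output by the previous tier as context."""
    context, outputs = None, {}
    for modules in tier_modules:   # one {task_name: module} dict per tier
        outputs = {name: m.classify(features, context_features=context)
                   for name, m in modules.items()}
        context = outputs          # feed labels forward to tier l+1
    return outputs
```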

  16. How to use black boxes? [Diagram: BLACK BOX “WAHOO” CLASSIFIER → output labels Y_WAHOO; BLACK BOX “SHAZAM” CLASSIFIER → output labels Y_SHAZAM]

  17. CCMs for Scene Understanding • Scene Categorization • Object Detection • Region Labeling • Depth Reconstruction

  18. Scene Categorization • C = { ‘urban’, ‘rural’, ‘ocean’, ‘other’ } • RGB mean/stddev • YCbCr mean/stddev • From detection: # of detections of each object • From region labeling: fraction of each region type
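A sketch of how this categorization feature vector might be assembled; the helper and argument names are hypothetical placeholders for the statistics listed above:

```python
import numpy as np

def scene_features(rgb_stats, ycbcr_stats, detection_counts, region_fractions):
    """Concatenate appearance statistics with context cues coming from
    the detection and region-labeling modules (tiers > 0 only)."""
    return np.concatenate([
        rgb_stats,          # mean/stddev of the R, G, B channels
        ycbcr_stats,        # mean/stddev of the Y, Cb, Cr channels
        detection_counts,   # number of detections of each object class
        region_fractions,   # fraction of pixels given each region label
    ])
```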

  19. Object Detection – HOG Features • Dalal & Triggs, 2005 [Figure: HOG feature pipeline feeding an SVM]

  20. Object Detection – Sliding Window • Consider every bounding box: all shifts, all scales, possibly all rotations • Each box gets a score D(x, y, s, Θ) • Detections: local peaks in D(·) [Example boxes scored D = 1.5 and D = −0.3]
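A minimal sliding-window sketch, assuming an externally supplied scoring function for D(x, y, s); the base box size, scales, stride, and threshold are illustrative, and taking local peaks (non-maximum suppression) would follow:

```python
def sliding_window_detect(score_fn, img_w, img_h, base_box=(64, 128),
                          scales=(1.0, 1.5, 2.0), stride=8, threshold=0.0):
    """Score every shifted and scaled box; keep the high-scoring ones."""
    detections = []
    for s in scales:
        bw, bh = int(base_box[0] * s), int(base_box[1] * s)
        for y in range(0, img_h - bh + 1, stride):
            for x in range(0, img_w - bw + 1, stride):
                d = score_fn(x, y, s)           # D(x, y, s)
                if d > threshold:
                    detections.append((d, x, y, s))
    return detections
```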

  21. Object Detection • Φ = [1, D(x,y,s), X, Y, X², Y², XY, W, W²] • P(Y) = LogReg(Φ, w) • Y = 1{is a car} • Example features: F2 = detector score of window; F10 = amount of “building” above window; F50 = variance of depths in window
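A sketch of this slide’s classifier: the window feature vector Φ concatenated with context features, scored by logistic regression (function names are illustrative):

```python
import numpy as np

def detection_phi(d_score, x, y, w, context_feats):
    """Phi = [1, D(x,y,s), X, Y, X^2, Y^2, XY, W, W^2], concatenated with
    context features from the other modules (e.g. amount of 'building'
    above the window, variance of depths in the window)."""
    base = np.array([1.0, d_score, x, y, x**2, y**2, x*y, w, w**2])
    return np.concatenate([base, np.asarray(context_feats, dtype=float)])

def p_is_object(phi, weights):
    """P(Y = 1 | Phi) = LogReg(Phi, w): a sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-np.dot(weights, phi)))
```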

  22. Region Labeling CRF • Y = { ‘grass’, ‘road’, ‘tree’, ‘sky’, ‘water’, ‘building’, ‘foreground’ } • Per-region features: mean R,G,B; mean H,U,V; texture responses; area; aspect ratio; … • Pairwise features between adjacent regions: delta R,G,B; offset vector; … [Figure: image regions labeled SKY, GRASS]
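As a rough illustration of the per-region vs. pairwise split, assuming a hypothetical dict-based region representation holding numpy arrays:

```python
import numpy as np

def node_features(region):
    """Per-region (node) features: appearance and shape statistics."""
    return np.concatenate([region["mean_rgb"], region["mean_huv"],
                           region["texture"],
                           [region["area"], region["aspect_ratio"]]])

def edge_features(region_a, region_b):
    """Pairwise (edge) features between adjacent regions: color
    difference and the offset vector between region centroids."""
    return np.concatenate([region_a["mean_rgb"] - region_b["mean_rgb"],
                           region_a["centroid"] - region_b["centroid"]])
```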

  23. Region Labeling Context • Predict “grass” • Relative location map

  24. Depth Reconstruction

  25. Depth Reconstruction with Context [Figure: SKY and GRASS regions, with surface normals pointing out / pointing up] • Find d* with the black-box depth module • Reoptimize depths with the new constraints: dCCM = argmin_d γ‖d − d*‖ + β‖n − nCONTEXT‖ + …
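A toy version of this reoptimization: it folds the normal constraints into per-pixel context depth targets so that a squared-penalty version of the objective has a closed form. The real objective constrains the normals n directly; this is only a sketch:

```python
import numpy as np

def reoptimize_depths(d_star, d_context, gamma=1.0, beta=1.0):
    """Minimize  gamma*||d - d*||^2 + beta*||d - d_context||^2  per pixel,
    where d_context encodes constraints such as 'grass is flat' or 'sky
    is far'. The closed-form minimizer is a weighted average."""
    return (gamma * np.asarray(d_star) + beta * np.asarray(d_context)) / (gamma + beta)
```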

  26. SU-CCM [Example image: SKY, GRASS, SEASIDE PASTURE] • Grass = flat, Sky = far, FG = vertical • 40% grass, 30% sky, … • 1 cow, 2 boats, …

  27. Results • Experiments on 2 datasets • SU-1 • 362 images, fully labeled • Scene categorization, object detection, region labeling • Gathered by us • SU-2 • 1746 images, disjoint labels • Object detection, region labeling, depth reconstruction • Combination of PASCAL data, MSRC data, Stanford Range Image Data, other…

  28. Methods [Diagram: the cascade of slide 15, with tier indices 0…L] • Independent: level-0 models only • Groundtruth: each tier is trained using the groundtruth outputs from the previous tier • 2-CCM: parameters from tier 1 are copied to all later tiers • 5-CCM
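A sketch of this tiered training scheme under the hypothetical CCMModule interface from slide 14. Each tier is trained with context computed from the previous tier: groundtruth labels for the “Groundtruth” baseline, predicted labels otherwise:

```python
import copy

def train_ccm(modules, features, labels, num_tiers, use_groundtruth=False):
    """Train one copy of each black-box module per tier; tier 0 gets no
    context, tier l+1 gets context features from tier l's outputs."""
    tiers, context = [], None
    for _ in range(num_tiers):
        trained = {name: copy.deepcopy(m) for name, m in modules.items()}
        for name, m in trained.items():
            m.learn(features, labels[name], context_features=context)
        prev_context = context
        # Context handed to the next tier: groundtruth or predictions.
        context = labels if use_groundtruth else {
            name: m.classify(features, context_features=prev_context)
            for name, m in trained.items()}
        tiers.append(trained)
    return tiers
```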

  29. SU-1 Segment Labeling [Plot: pixel accuracy (0.65–0.75) vs. number of classification tiers (1–6) for Independent, Groundtruth, 2-CCM, and 5-CCM]

  30. SU-1 Object Detection [Plot: detection AP (0.33–0.38) vs. number of classification tiers (1–6)] • Detection AP = robust area under the precision-recall curve
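The slide defines detection AP as a robust area under the precision-recall curve; one common instantiation (shown only for illustration, not necessarily the exact variant used here) is PASCAL-style 11-point interpolated AP:

```python
import numpy as np

def average_precision_11pt(precision, recall):
    """At each of 11 recall levels, take the best precision achieved at
    that recall or higher, then average the interpolated values."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap
```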

  31. SU-1 Scene Categorization [Plot: scene category accuracy (0.60–0.80) vs. number of classification tiers (1–6)]

  32. Some Examples: SU-2

  33. Some Examples: SU-2

  34. SU-2 Results

  35. Scene Understanding • Requires combining many tasks • Object Detection • Scene Categorization • Region Labeling • Depth Reconstruction • Requires the “right” representation • Matches the questions we might ask • Operates at multiple granularities • The whole is greater than the sum… • What information can they share?

  36. Descriptive Classification [Figure: localized test with outlines; “Up”/“Down” labels] • Descriptive classification: city walking during rush hour? OR long walk on the beach? • Object level → scene level?
