Geometric Context from a Single Image

Geometric Context from a Single Image Derek Hoiem Alexei A. Efros Martial HebertCarnegie Mellon University June 20,2017 Presented by Hao Yang

Predefinition in our Algorithm image pixel classification • Geometric Class: Ground Vertical Sky • Vertical Subclass: planar surfaces: Left (←) Center(↑) Right (→) non-planar surfaces: Porous(○) (leafy vegetation ) Solid (×) (a person or tree trunk)

Overview of the Algorithm • Observation: • in a sampling of 300outdoor images that we collected using Google’s image search. • we find that over 97% of image pixels belong to one of three main geometric classes and vertical subclass.

Motivation Most existing computer vision systems attempt to recognize objects using local information alone. Contextual information! Goal Model geometric classes that depend on the orientation of a physical object with relation to the scene. We propose a technique to estimate the coarse orientations of large surfaces in outdoor images.

Related work Learning a Restricted Bayesian Network for Object Detection Henry Schneiderman Robust Real-Time Face Detection PAUL VIOLA MICHAEL J. JONES Detecting Pedestrians Using Patterns of Motion and Appearance PAUL VIOLA MICHAEL J. JONES Fig.3 Walking Person Detection Fig.2 Face Detection Fig.1 Eye Detection

Related work look outside the box the level of human performance Global information Local information

Related work Contextual Models for Object Detection Using Boosted Random Fields Antonio Torralba, Kevin P. Murphy Global Scene Context Fig.4 Cars , buildings and road detection Limitation : Encode contextual relationships between objects in the image plane and not in the 3D world where these objects actually reside .

Outdoor image • Focus on outdoor images! interesting challenging lack of human-imposed manhattan structure

Key idea Most of the information, like material,location,texture,gradients,shading,vanishingpoints,can be extracted only when something is know about the structure of the scene. Our solution is to slowly build our structural knowledge of the image: from pixels to superpixels to related groups of superpixels.

Pipeline 3 Multiple Hypotheses 1 Input 2 Superpixels 4 Geometric Labels

First step：over-segmentation Apply the over-segmentation method of Felzenszwalbet al. to obtain a set of “superpixels”. Each superpixel is assumed to correspond to a single label Superpixels provide the spatial support that allows us to compute some basic firstorder statistics

Training Data 300 outdoor images images are often highly cluttered and span a wide variety of natural, suburban, and urban scenes. In all, about 150,000 superpixelsare labeled 50 of these images to train segmentation algorithm 250 images are used to train and evaluate the overall system using 5-fold cross-validation

Generating Multiple Hypothesis Goal:Obtain multiple segmentations of an image into geometrically homogeneous regions Estimate the likelihood that two superpixels belong in the same region. Ideally, for a given number of regions, we would maximize the joint likelihood that all regions are homogeneous.

Generating Multiple Hypothesis Greedy algorithm based on pairwise affinities between superpixels: randomly order the superpixels; assign the first superpixels to different regions; iteratively assign each remaining superpixel based on a learned pairwise affinity function (see below); repeat step 3 several times

Training the Pairwise Affinity Function Sample pairs of same-label and different label superpixels(2,500 each) from our training set. We estimate the likelihood that two superpixels have the same label: : the absolute differences of their feature values; y means the label.

Features Features computed on superpixels (C1-C2,T1-T4,L1) and on regions (all). The “Num” column gives the number of features in each set.

Geometric Labeling We determine the superpixel label confidences by averaging the label likelihoods of the regions that contain it, weighted by the homogeneity likelihoods: the superpixel label the image data a possible label value the region that contains the ithsuperpixel for the jthhypothesis the label confidence the region label

Training the Label and Homogeneity Likelihood Functions

Applications • Object Detection Improvement in Murphy et al.’s detector [20] with our geometric context. By adding a small set of context features, we reduce false positives while achieving the same detection rate. [20] K. Murphy, A. Torralba, and W. T. Freeman, “Graphical model for recognizing scenes and objects,” in Proc. NIPS, 2003

Applications • Automatic Single-View Reconstruction two novel views from the scaled 3D model generated by our system.

Results 250 images 5-fold cross validation 86% and 52% for the main geometric classes and vertical subclasses The difficulty of classifying vertical subclasses is mostly due to ambiguity of ground truth labeling

Importance of Structure Estimation Accuracy increases with the complexity of the intermediate structure estimation Our results show that each increase in the complexity of the algorithm offers a significant gain in classification accuracy

Conclusion Analyze objects in the image within the context of the 3D world. Our results show that such context can be estimated and usefully applied, even in outdoor images that lack human-imposed structure.

Failure Three columns of {original, ground truth, test result}. Failures can be caused by reflections or shadows

Failure In difficult cases, every cue is important. When any set of features (d-g) is removed, more errors are made than when all features are used (b).

Thank you!

Question 1 • How to generate different hypothesis? Answer: we generate different hypothesis by specifying the number of regions.

Question 2 • Can you explain the algorithm in page 15 again? Answer: In short , we randomly selected some superpixels , then propagate these superpixels , group them with the nearby superpixels. Finally we can get the hypothesis.

Question 3 • What is 5-fold cross validation? Answer:

Question 4 • Can you explain the method to get the superpixel? Answer: Please look at the next four slides.

Graph-Based Image Segmentation • 本文中，初始化时每一个像素点都是一个顶点，然后逐渐合并得到一个区域，确切地说是连接这个区域中的像素点的一个MST。如图，棕色圆圈为顶点，线段为边，合并棕色顶点所生成的MST，对应的就是一个分割区域。分割后的结果其实就是森林。对于孤立的两个像素点，所不同的是颜色，自然就用颜色的距离来衡量两点的相似性，本文中是使用RGB的距离，即当然也可以用perceptually uniform的Luv或者Lab色彩空间，对于灰度图像就只能使用亮度值了，此外，还可以先使用纹理特征滤波，再计算距离，比如，先做Census Transform再计算Hamming distance距离。

全局阈值à自适应阈值 对于两个区域（子图）或者一个区域和一个像素点的相似性，最简单的方法即只考虑连接二者的边的不相似度。如图，已经形成了棕色和绿色两个区域，现在通过紫色边来判断这两个区域是否合并。那么我们就可以设定一个阈值，当两个像素之间的差异（即不相似度）小于该值时，合二为一。迭代合并，最终就会合并成一个个区域，这就是区域生长的基本思想

自适应阈值 • 但全局阈值并不合适，阈值过大会导致分割结果太粗，过小会导致分割结果太细。那么自然就得用自适应阈值。 • 对于两个区域（原文中叫Component,实质上是一个MST,单独的一个像素点也可以看成一个区域）,本文使用了非常直观，但抗干扰性并不强的方法。先来两个定义，原文依据这两个附加信息来得到自适应阈值。 the internal difference of a component the difference between two components 即连接两个区域所有边中，不相似度最小的边的不相似度，也就是两个区域最相似的地方的不相似度。那么直观的判断是否合并的标准：

算法步骤 Step 1: 计算每一个像素点与其8邻域或4邻域的不相似度。 Step 2: 将边按照不相似度non-decreasing排列（从小到大）排序得到 Step 3: 选择 Step 4: 对当前选择的边进行合并判断。设其所连接的顶点为。如果满足合并条件： • （1）不属于同一个区域；（2）不相似度不大于二者内部的不相似度。则执行选择下一条边。否则执行Step 5 Step 5: 更新阈值以及类标号。 Step 6: 选择下一条边执行Step 4。

Geometric Context from a Single Image

Geometric Context from a Single Image

Presentation Transcript

Estimating Human Shape and Pose from a Single Image

Recovering Intrinsic Images from a Single Image

IM2GPS: estimating geographic information from a single image

Image Geometry and Geometric Transformation

Single Image Super Resolution

Single Image Vignetting Correction

Planar Orientation from Blur Gradients in a Single Image

Single System Image

Visibility in Bad Weather from a Single Image

Extracting Smooth and Transparent Layers from a Single Image

Visual Detection of Lintel-Occluded Doors from a Single Image

Single Image Blind Deconvolution

Automatic Estimation and Removal of Noise from a Single Image

Image-Based Rendering of Diffuse, Specular and Glossy Surfaces from a Single Image

Image-Based Rendering from a Single Image

Noise Estimation from a Single Image

3-D Depth Reconstruction from a Single Still Image

Single System Image Clustering

Image-Based Rendering from a Single Image

Removing motion blur from a single image

Blind motion deblurring from a single image using sparse approximation

Visibility in Bad Weather from a Single Image