Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, Vincent Lepetit

Department of Computer Science, CAMP, Technische Universität München (TUM), Germany; Willow Garage, Menlo Park, CA, USA

IEEE International Conference on Computer Vision (ICCV) 2011

Multimodal Templates for Real-Time Detection of Texture-less Objects in Heavily Cluttered Scenes

- Goal & Challenges
- Related Work
- Modality Extraction
- Image Cue
- Depth Cue

- Similarity Measure
- Efficient Computation
- Experiments

- Objects under different poses over heavily cluttered background
- Online learning
- Real-time object learning and detection

- Approaches to multi-view 3D object detection fall into two main categories:
- Learning Based Methods
- Template Matching

- Learning Based Methods:
- Require a large amount of training data
- Require a long offline training phase
- Learning a new object is expensive

- Template Matching:
- Better adapted to low textured objects than feature point approaches
- Templates are easily updated for new objects
- Direct matching is too slow for real-time use

- Others:
- Matching in range data:
- Construct a full 3D CAD model of the object

- Image Cue:
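- Image gradients have proved to be discriminant and robust to illumination change and noise.
- Using normalized gradients rather than their magnitudes makes the measure robust to contrast changes.
- We compute the normalized gradients on each color channel of the input RGB image.
- For an input image I, the gradient map at location x is $I_g(x) = \frac{\partial \hat{C}}{\partial x}$, where $\hat{C}(x) = \arg\max_{C \in \{R,G,B\}} \left\| \frac{\partial C}{\partial x} \right\|$ and R, G, B are the color channels of I.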

- Keep only the gradients whose norms are larger than a threshold.
- Assign to each location the gradient whose quantized orientation occurs most often in its 3 × 3 neighborhood.
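
The image-cue steps above might be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `quantized_orientations`, the bin count `n_bins=8`, the threshold value, the finite-difference gradient from `np.gradient`, and taking orientations modulo π are all assumptions made for the example.

```python
import numpy as np

def quantized_orientations(rgb, n_bins=8, mag_threshold=10.0):
    """Quantized gradient orientation per pixel; -1 marks 'no feature'."""
    img = rgb.astype(np.float64)
    # Gradients of every color channel along rows (gy) and columns (gx).
    gy, gx = np.gradient(img, axis=(0, 1))
    mag = np.hypot(gx, gy)                      # shape (H, W, 3)
    # Per pixel, keep the gradient of the channel with the largest norm.
    best = mag.argmax(axis=2)
    rows, cols = np.indices(best.shape)
    gx, gy, mag = gx[rows, cols, best], gy[rows, cols, best], mag[rows, cols, best]
    # Assumption: orientation is taken modulo pi (direction sign ignored).
    theta = np.mod(np.arctan2(gy, gx), np.pi)
    q = np.minimum((theta / np.pi * n_bins).astype(int), n_bins - 1)
    q[mag <= mag_threshold] = -1                # keep only strong gradients
    # 3 x 3 majority vote: count each bin in the neighborhood of every pixel.
    h, w = q.shape
    padded = np.pad(q, 1, constant_values=-1)
    counts = np.zeros((n_bins, h, w), dtype=int)
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            for b in range(n_bins):
                counts[b] += window == b
    out = counts.argmax(axis=0)
    out[q < 0] = -1                             # weak-gradient pixels stay empty
    return out
```

Only the orientation index survives, so a uniform change in contrast leaves the output unchanged as long as the gradients stay above the threshold.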

Og(r): the normalized gradient map of the reference image at location r

Ig(t): the normalized gradient map of the input image at location t

Quantizing the gradient orientations

[Figure: input color image; gradient image computed on the gray image; gradient image computed with our approach]

- Depth Cue
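- We use a standard camera and an aligned depth sensor to obtain the depth map.
- Use quantized surface normals computed on a dense depth field for our template representation.
- Consider the first-order Taylor expansion of the depth function D(x): $D(x + dx) - D(x) = dx^\top \nabla D + \text{h.o.t.}$
- Within a patch defined around x, each pixel offset dx yields an equation.

- Estimate the optimal gradient $\hat{\nabla} D$ in the least-squares sense.
- This depth gradient corresponds to a tangent plane going through three points X, X1 and X2: $X = \vec{v}(x)\, D(x)$, $X_1 = \vec{v}(x + [1,0]^\top)\,(D(x) + [1,0]\,\hat{\nabla} D)$, $X_2 = \vec{v}(x + [0,1]^\top)\,(D(x) + [0,1]\,\hat{\nabla} D)$
$\vec{v}(x)$: the vector along the line of sight that goes through pixel x (obtained from the parameters of the depth sensor)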

- The normal to the surface can be estimated as the normalized cross-product of X1 − X and X2 − X.
- Estimated over a whole patch around x, this would not be robust near occluding contours.
- Inspired by bilateral filtering, we ignore the pixels whose depth difference with the central pixel (X) is above a threshold.
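
The depth-cue steps above might be sketched as follows: a least-squares fit of the depth gradient over a patch that skips pixels whose depth differs too much from the center, followed by the cross product of the two tangent directions. This is a hedged illustration, not the authors' code. The pinhole intrinsics `fx, fy, cx, cy`, the patch `radius`, the `max_diff` value, and all function names are assumptions.

```python
import numpy as np

def depth_gradient(depth, row, col, radius=2, max_diff=0.05):
    """Least-squares depth gradient [dD/dcol, dD/drow] at one pixel."""
    d0 = depth[row, col]
    offsets, diffs = [], []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            if dr == 0 and dc == 0:
                continue
            if not (0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]):
                continue
            diff = depth[r, c] - d0
            if abs(diff) > max_diff:        # likely across an occluding contour
                continue
            offsets.append((dc, dr))        # one equation: dx^T grad = D(x+dx) - D(x)
            diffs.append(diff)
    if len(offsets) < 2:
        return np.zeros(2)
    grad, *_ = np.linalg.lstsq(np.asarray(offsets, dtype=float),
                               np.asarray(diffs, dtype=float), rcond=None)
    return grad

def line_of_sight(col, row, fx, fy, cx, cy):
    """Assumed pinhole model: ray through pixel (col, row) with unit z."""
    return np.array([(col - cx) / fx, (row - cy) / fy, 1.0])

def surface_normal(depth, row, col, fx, fy, cx, cy, **kwargs):
    g = depth_gradient(depth, row, col, **kwargs)
    d = depth[row, col]
    X = line_of_sight(col, row, fx, fy, cx, cy) * d
    X1 = line_of_sight(col + 1, row, fx, fy, cx, cy) * (d + g[0])
    X2 = line_of_sight(col, row + 1, fx, fy, cx, cy) * (d + g[1])
    n = np.cross(X1 - X, X2 - X)
    return n / np.linalg.norm(n)
```

For a fronto-parallel plane the fitted gradient is zero and the normal comes out along the optical axis. Next to a depth step, the pixels on the far side are dropped from the fit, so the estimate is not pulled toward the discontinuity.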

[Figure: depth sensor, +Z axis, depth D(x), point X with its tangent plane and surface normal]

- Quantize the normal directions into n0 bins.
- Assign to each location the quantized value that occurs most often in a 5 × 5 neighborhood.
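
A rough sketch of these two steps is given below. The slides do not say how the n0 bins are laid out, so binning by the azimuth of the normal in the image plane is purely an assumption for illustration, as are the names `quantize_normals` and `majority_filter` and the default `n_bins=8`.

```python
import numpy as np

def quantize_normals(normals, n_bins=8):
    """Map unit normals of shape (..., 3) to bin indices 0..n_bins-1.

    Assumption: bins partition the azimuth angle atan2(ny, nx) uniformly.
    """
    azimuth = np.arctan2(normals[..., 1], normals[..., 0])
    frac = np.mod(azimuth, 2 * np.pi) / (2 * np.pi)
    return np.minimum((frac * n_bins).astype(int), n_bins - 1)

def majority_filter(labels, n_bins=8, size=5):
    """Replace each label by the most frequent label in a size x size window."""
    h, w = labels.shape
    pad = size // 2
    padded = np.pad(labels, pad, mode="edge")
    counts = np.zeros((n_bins, h, w), dtype=int)
    for dy in range(size):
        for dx in range(size):
            window = padded[dy:dy + h, dx:dx + w]
            for b in range(n_bins):
                counts[b] += window == b
    return counts.argmax(axis=0)
```

The vote removes isolated outliers in the quantized map, which is the role the 5 × 5 neighborhood plays in the slide.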

OD(r): the normalized surface normal of the reference image at location r

ID(t): the normalized surface normal of the input image at location t

Quantizing the surface normals

[Figure: input image; the corresponding depth image; surface normals computed with our approach. Details are clearly visible and depth discontinuities are well handled.]

- We define a template as $T = (\{O_m\}_{m \in M}, P)$.
P: a list of pairs (r, m), each giving the location r of a discriminant feature in modality m.

- Each template is created by extracting, for each modality m, a set of its most discriminant features (P).

[Figure: a template with object center C and features (rk, surface normals) and (ri, gradients) in P; each r records the feature location with respect to the object center C]

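- The object measurement energy function: $E(I, T, c) = \sum_{(r,m) \in P} \max_{t \in R(c+r)} f_m(O_m(r), I_m(t))$
T: the template $(\{O_m\}_{m \in M}, P)$

c: the detected location (could be the object center)

R(c+r): $[c + r - \frac{N}{2},\, c + r + \frac{N}{2}] \times [c + r - \frac{N}{2},\, c + r + \frac{N}{2}]$, the neighborhood of constant size N centered on c + r in $I_m$

$f_m(O_m(r), I_m(t))$: computes the similarity score for modality m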
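
A direct (slow) evaluation of this energy might look like the sketch below: for every template feature, take the best similarity inside the N × N window around c + r and sum the results. The data layout (per-modality maps of unit vectors, features as `(offset, modality, reference)` triples) and the two similarity functions in `SIMILARITY` are assumptions chosen for the example.

```python
import numpy as np

# Assumed similarity functions on unit vectors: absolute cosine for
# gradients (direction sign ignored), plain dot product for normals.
SIMILARITY = {
    "gradient": lambda o, t: abs(float(o @ t)),
    "normal": lambda o, t: float(o @ t),
}

def energy(features, maps, similarity, c, N=5):
    """Sum over features (r, m) of the max of f_m(O_m(r), I_m(t)), t in R(c + r).

    features: list of ((dy, dx), modality, reference_vector)
    maps:     dict modality -> array of shape (H, W, k) holding I_m
    c:        (row, col) candidate location
    """
    half = N // 2
    total = 0.0
    for (dy, dx), m, ref in features:
        image = maps[m]
        h, w = image.shape[:2]
        y, x = c[0] + dy, c[1] + dx
        y0, y1 = max(y - half, 0), min(y + half + 1, h)
        x0, x1 = max(x - half, 0), min(x + half + 1, w)
        if y0 >= y1 or x0 >= x1:        # window entirely outside the image
            continue
        window = image[y0:y1, x0:x1].reshape(-1, image.shape[2])
        total += max(similarity[m](ref, t) for t in window)
    return total
```

The inner maximum is what gives tolerance to small misalignments. It is also the expensive part of a brute-force evaluation, since every feature of every template scans N × N pixels at every location.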

- We first quantize the input data for each modality into a small number n0 of values.
- Use a lookup table $\tau_{i,m}$ for the energy response: $\tau_{i,m}[L_m] = \max_{l \in L_m} |f_m(i, l)|$
i: the index of the quantized value of modality m (i is also used to denote the corresponding value)

Lm: the list of values of modality m appearing in a local neighborhood of a given location in the input I.

[Figure: two locations C and C' with their local neighborhoods Lm and Lm']

- “Spread” [11] the data over a local neighborhood to obtain a robust representation $J_m$ in place of $L_m$.
- For each quantized value of one modality m with index i, we can now compute the response at each location c: $S_{i,m}(c) = \tau_{i,m}[J_m(c)]$
$\tau_{i,m}$: the precomputed lookup table, indexed by $J_m$
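
One way to realize spreading and the lookup table for a single modality is sketched below. It encodes the set of quantized values seen in a neighborhood as a bitmask, so the mask itself can index the table. The bitmask encoding, the absolute-cosine similarity between bin centers, the spreading size `T`, and the function names are assumptions for illustration rather than details given in the slides.

```python
import numpy as np

N_BINS = 8  # assumed n0; one bit per quantized value fits in a uint8

def spread(quantized, T=3):
    """J: per-pixel bitmask of the quantized values present in a T x T window.

    quantized holds bin indices 0..N_BINS-1, or -1 where there is no feature.
    """
    bits = np.where(quantized >= 0, 1 << np.maximum(quantized, 0), 0).astype(np.uint8)
    h, w = bits.shape
    padded = np.pad(bits, T // 2)
    J = np.zeros_like(bits)
    for dy in range(T):
        for dx in range(T):
            J |= padded[dy:dy + h, dx:dx + w]   # OR in the shifted copy
    return J

def build_lookup(n_bins=N_BINS):
    """tau[i, mask] = max over values l in mask of |cos(angle_i - angle_l)|."""
    angles = (np.arange(n_bins) + 0.5) * np.pi / n_bins   # assumed bin centers
    tau = np.zeros((n_bins, 1 << n_bins))
    for i in range(n_bins):
        for mask in range(1, 1 << n_bins):
            present = [l for l in range(n_bins) if (mask >> l) & 1]
            tau[i, mask] = max(abs(np.cos(angles[i] - angles[l])) for l in present)
    return tau

def response_maps(quantized, tau, T=3):
    """S[i, y, x] = tau[i, J[y, x]] for every quantized value i."""
    return tau[:, spread(quantized, T)]
```

The table is built once, and computing all response maps for an image then reduces to a few shifted ORs and one table lookup per pixel and value.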

[11] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, P. Fua, N. Navab, and V. Lepetit. Gradient response maps for real-time detection of texture-less objects. Under revision, PAMI.

- Finally, the similarity measure can be computed as: $E(I, T, c) = \sum_{(r,m) \in P} S_{O_m(r),m}(c + r)$
- Since the maps Si,m are shared between the templates, matching several templates against the input image can be done very fast once they are computed.
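
Given precomputed response maps, scoring a template at every location reduces to adding shifted copies of those maps, which is why the maps can be shared across templates. The sketch below assumes a hypothetical layout: `responses[m]` is an array of shape `(n_bins, H, W)` holding S_{i,m}, and each feature is `((dy, dx), m, i)` with i the quantized value O_m(r). Features that fall outside the image contribute nothing.

```python
import numpy as np

def match_template(responses, features):
    """Score map E(c) = sum over features of S_{i,m}(c + r), for all c at once."""
    first = next(iter(responses.values()))
    h, w = first.shape[1:]
    score = np.zeros((h, w))
    for (dy, dx), m, i in features:
        S = responses[m][i]
        # Source rows/cols are c + r; keep only those inside the image.
        ys0, ys1 = max(dy, 0), min(h + dy, h)
        xs0, xs1 = max(dx, 0), min(w + dx, w)
        if ys0 >= ys1 or xs0 >= xs1:
            continue
        score[ys0 - dy:ys1 - dy, xs0 - dx:xs1 - dx] += S[ys0:ys1, xs0:xs1]
    return score
```

Each feature costs one array addition regardless of the template's spatial extent, consistent with the runtime depending on the number of features rather than on the object size.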

- LINE-MOD: our approach (intensity & depth)
- LINE-2D: introduced in [11] (use only intensity)
- LINE-3D: use only the depth map
- Hardware:
- Run on one processor of a standard notebook with a 2.4 GHz Intel Centrino Core 2 Duo processor and 3 GB of RAM.

- Test data:
- Six object sequences made of 2000 real images each.
- Each sequence presents illumination and large viewpoint changes over a heavily cluttered background.

- Robustness:
- For LINE-MOD, a single threshold (about 80) separates almost all true positives from almost all false positives.

- Speed:
- Learning new templates only requires extracting and storing features, which is almost instantaneous.
- Templates cover 360 degrees of tilt rotation, 90 degrees of inclination rotation, in-plane rotations of ± 80 degrees, and scale changes from 1.0 to 2.0.
- Parse a 640×480 image with over 3000 templates of 126 features each at about 10 fps (real-time).
- The runtime of LINE-MOD is only dependent on the number of features and independent of the object/template size.

- Occlusion:
- Right: Average recognition score for the six objects with respect to occlusion.
- With over 30% occlusion our method is still able to recognize objects.

Objects: Cup, Toy-Car, Hole punch, Toy-Monkey, Toy-Duck, Camera

True positive rates =

False positive rates =