Multimodal Templates for Real-Time Detection of Texture-less Objects in Heavily Cluttered Scenes

Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, Vincent Lepetit

Department of Computer Science, CAMP, Technische Universität München (TUM), Germany; Willow Garage, Menlo Park, CA, USA



IEEE International Conference on Computer Vision (ICCV) 2011



  • Goal & Challenges

  • Related Work

  • Modality Extraction

    • Image Cue

    • Depth Cue

  • Similarity Measure

  • Efficient Computation

  • Experiments



Goal & Challenges

  • Detect objects under different poses against heavily cluttered backgrounds

  • Online learning

  • Real-time object learning and detection

Related Work

  • Approaches to multi-view 3D object detection fall into two main categories:

    • Learning-based methods

    • Template matching

  • Learning-based methods:

    • Require a large amount of training data

    • Require a long offline training phase

    • Make learning a new object expensive

Related Work

  • Template matching:

    • Better suited to low-textured objects than feature-point approaches

    • Templates are easy to add or update for a new object

    • Direct matching is too slow for real-time use

  • Others:

    • Matching in range data:

      • Requires constructing a full 3D CAD model of the object



Modality Extraction-Image Cue

  • Image Cue:

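  • Image gradients have proved to be discriminant and robust to illumination change and noise.

  • Using normalized gradients rather than their magnitudes makes the measure robust to contrast changes.

  • We compute the normalized gradients on each color channel of the input RGB image and, at each location, keep the gradient of the channel with the largest norm.

  • For an input image $I$, the gradient map at location $x$ is $I_g(x) = \frac{\partial \hat{C}}{\partial x}$, where $\hat{C}(x) = \arg\max_{C \in \{R, G, B\}} \left\| \frac{\partial C}{\partial x} \right\|$.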

Modality Extraction-Image Cue

  • Keep only the gradients whose norms are larger than a threshold.

  • Assign to each location the gradient whose quantized orientation occurs most often in a 3 × 3 neighborhood.

  • The similarity measurement function $f_g$ is the absolute value of the cosine between the two normalized gradients: $f_g(O_g(r), I_g(t)) = |O_g(r)^\top I_g(t)|$

    $O_g(r)$: the normalized gradient map of the reference image at location $r$

    $I_g(t)$: the normalized gradient map of the input image at location $t$
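The image cue described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the central-difference operator, the 8 orientation bins and the function names `dominant_gradient`, `quantize_orientation` and `f_g` are choices made for this sketch, not details given on the slides.

```python
import math

def dominant_gradient(r, g, b, x, y):
    """Unit gradient at interior pixel (x, y), taken from the color channel
    whose gradient norm is largest. r, g, b are 2D lists of floats.

    Central differences are an illustrative choice (assumption).
    Returns (unit_vector, norm); unit_vector is None where all channels are flat.
    """
    best, best_norm = (0.0, 0.0), 0.0
    for ch in (r, g, b):
        gx = (ch[y][x + 1] - ch[y][x - 1]) / 2.0
        gy = (ch[y + 1][x] - ch[y - 1][x]) / 2.0
        norm = math.hypot(gx, gy)
        if norm > best_norm:
            best, best_norm = (gx, gy), norm
    if best_norm == 0.0:
        return None, 0.0
    return (best[0] / best_norm, best[1] / best_norm), best_norm

def quantize_orientation(unit, n_bins=8):
    """Map a unit gradient to an orientation bin over [0, pi).

    Opposite directions share a bin; n_bins = 8 is an assumed value.
    """
    angle = math.atan2(unit[1], unit[0]) % math.pi
    return int(angle / math.pi * n_bins) % n_bins

def f_g(o, i):
    """Similarity of two unit gradients: |o . i|, the absolute cosine."""
    return abs(o[0] * i[0] + o[1] * i[1])
```

Because only the direction of the strongest channel gradient is kept, doubling the contrast of the image leaves the result unchanged, and the returned norm can be compared against the threshold mentioned above.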

Modality Extraction-Image Cue

[Figure: quantizing the gradient orientations. Panels: input color image; gradient image computed on the gray image; gradient image computed with our approach.]

Modality Extraction-Depth Cue

  • Depth cue:

  • We use a standard camera and an aligned depth sensor to obtain the depth map.

  • We use quantized surface normals computed on a dense depth field for our template representation.

  • Consider the first-order Taylor expansion of the depth function $D(x)$: $D(x + dx) - D(x) = dx^\top \nabla D + \text{h.o.t.}$

  • Within a patch defined around $x$, each pixel offset $dx$ yields one such equation in $\nabla D$.

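Modality Extraction-Depth Cue

  • Estimate an optimal gradient $\hat{\nabla} D$ in the least-squares sense.

  • The depth gradient corresponds to a tangent plane going through three points $X$, $X_1$ and $X_2$:

    $X = \vec{v}(x)\,D(x)$

    $X_1 = \vec{v}(x + [1, 0]^\top)\,(D(x) + [1, 0]\,\hat{\nabla} D)$

    $X_2 = \vec{v}(x + [0, 1]^\top)\,(D(x) + [0, 1]\,\hat{\nabla} D)$

    $\vec{v}(x)$: the vector along the line of sight that goes through pixel $x$ (obtained from the parameters of the depth sensor)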
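The least-squares step can be illustrated as follows. Each offset in the patch gives one linearized equation from the Taylor expansion, and the resulting 2 × 2 normal equations are solved in closed form. The function name `depth_gradient`, the square patch and the `radius` parameter are assumptions of this sketch.

```python
def depth_gradient(depth, x, y, radius=2):
    """Least-squares depth gradient (gx, gy) at interior pixel (x, y).

    Every offset (dx, dy) in the patch contributes the equation
    depth[y+dy][x+dx] - depth[y][x] ~= dx*gx + dy*gy.
    depth is a 2D list of floats; returns None if the system is degenerate.
    """
    d0 = depth[y][x]
    sxx = sxy = syy = bx = by = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            diff = depth[y + dy][x + dx] - d0
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
            bx += dx * diff
            by += dy * diff
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        return None
    gx = (syy * bx - sxy * by) / det
    gy = (sxx * by - sxy * bx) / det
    return gx, gy
```

On an exactly planar depth patch the estimate reproduces the slope of the plane, which is a quick sanity check for an implementation.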

Modality Extraction-Depth Cue

  • The normal to the surface can be estimated as the normalized cross-product of $X_1 - X$ and $X_2 - X$.

  • Estimated over the whole patch around $x$, this would not be robust near occluding contours.

  • Inspired by bilateral filtering, we ignore the pixels whose depth difference with the central pixel ($X$) is above a threshold.

[Figure: the depth sensor and the normal of X.]
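A minimal sketch of the normal estimation and of the depth-difference test follows. It assumes a pinhole model for the depth sensor with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`; the function names and the sign convention of the normal are also assumptions made here.

```python
import math

def line_of_sight(x, y, fx, fy, cx, cy):
    """Ray through pixel (x, y) for an assumed pinhole model, scaled so z = 1."""
    return ((x - cx) / fx, (y - cy) / fy, 1.0)

def surface_normal(x, y, depth, grad, fx, fy, cx, cy):
    """Unit normal from the three points X, X1, X2.

    depth is D(x) (assumed > 0) and grad the estimated depth gradient (gx, gy).
    """
    gx, gy = grad

    def point(px, py, d):
        v = line_of_sight(px, py, fx, fy, cx, cy)
        return (v[0] * d, v[1] * d, v[2] * d)

    p0 = point(x, y, depth)
    p1 = point(x + 1, y, depth + gx)
    p2 = point(x, y + 1, depth + gy)
    a = [p1[k] - p0[k] for k in range(3)]
    b = [p2[k] - p0[k] for k in range(3)]
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

def inlier_offsets(depth_map, x, y, radius, max_diff):
    """Patch offsets whose depth differs from the center by at most max_diff."""
    d0 = depth_map[y][x]
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if abs(depth_map[y + dy][x + dx] - d0) <= max_diff]
```

Only the offsets returned by `inlier_offsets` would then enter the least-squares fit, so pixels on the far side of an occluding contour do not bias the gradient.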

Modality Extraction-Depth Cue

  • Quantize the normal directions into $n_0$ bins.

  • Assign to each location the quantized value that occurs most often in a 5 × 5 neighborhood.

  • The similarity measurement function $f_D$ is the dot product of the two normalized surface normals: $f_D(O_D(r), I_D(t)) = O_D(r)^\top I_D(t)$

    $O_D(r)$: the normalized surface normal of the reference image at location $r$

    $I_D(t)$: the normalized surface normal of the input image at location $t$
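The quantization and neighborhood vote can be sketched as below. The slides do not say how normal directions are binned, so binning by azimuth alone in `quantize_normal` is purely an assumption for illustration, as are `n_bins = 8` and the function names. `majority_filter` is the generic "most frequent value in a window" step, which applies equally to the 3 × 3 vote used for gradient orientations.

```python
import math
from collections import Counter

def quantize_normal(n, n_bins=8):
    """Bin a unit normal by the azimuth of its (nx, ny) component.

    Assumption of this sketch: normals parallel to the optical axis
    (nx = ny = 0) fall into bin 0.
    """
    azimuth = math.atan2(n[1], n[0]) % (2.0 * math.pi)
    return int(azimuth / (2.0 * math.pi) * n_bins) % n_bins

def majority_filter(labels, size=5):
    """Replace each label by the most frequent label in a size x size window.

    labels is a 2D list; None marks pixels without a valid feature and is
    ignored when voting. Windows are clipped at the image border.
    """
    h, w, k = len(labels), len(labels[0]), size // 2
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            votes = Counter(
                labels[j][i]
                for j in range(max(0, y - k), min(h, y + k + 1))
                for i in range(max(0, x - k), min(w, x + k + 1))
                if labels[j][i] is not None
            )
            if votes:
                out[y][x] = votes.most_common(1)[0][0]
    return out

def f_d(o, i):
    """Similarity of two unit normals: their dot product."""
    return sum(a * b for a, b in zip(o, i))
```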

Modality Extraction-Depth Cue

[Figure: quantizing the surface normals. Panels: the corresponding depth image; surface normals computed with our approach. Details are clearly visible and depth discontinuities are well handled.]



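Similarity Measure

  • We define a template as $T = (\{O_m\}_{m \in M}, P)$.

    $O_m$: the reference image of the object in modality $m$, with $M$ the set of modalities

    $P$: a list of pairs $(r, m)$, each made of the location $r$ of a discriminant feature in modality $m$

  • Each template is created by extracting, for each modality $m$, a set of its most discriminant features ($P$).

[Figure: template features, for example $(r_i, \text{gradients})$ and $(r_k, \text{surface normals})$; each $r$ records the feature location with respect to the object center $C$.]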

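Similarity Measure

  • The object measurement energy function: $E(\{I_m\}_{m \in M}, T, c) = \sum_{(r, m) \in P} \max_{t \in R(c + r)} f_m(O_m(r), I_m(t))$

    $T$: the template $(\{O_m\}_{m \in M}, P)$

    $c$: the detected location (could be the object center)

    $R(c + r) = [c + r - \frac{N}{2}, c + r + \frac{N}{2}] \times [c + r - \frac{N}{2}, c + r + \frac{N}{2}]$: the neighborhood of constant size $N$ centered on $c + r$ in $I_m$

    $f_m(O_m(r), I_m(t))$: the similarity score for modality $m$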
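Evaluated directly, the energy is a sum over template features of the best similarity found in a small window of the input. The brute-force sketch below makes that structure explicit; the data layout (tuples for features, dictionaries keyed by modality name) and the name `energy` are assumptions chosen for readability, not the authors' data structures.

```python
def energy(template, inputs, sims, c, n):
    """Brute-force E(I, T, c).

    template: list of (r, m, o) with offset r = (rx, ry), modality key m and
              reference feature o, i.e. O_m(r).
    inputs:   dict m -> 2D list of input features I_m (None = no feature).
    sims:     dict m -> similarity function f_m(o, i).
    c:        candidate location (cx, cy).
    n:        neighborhood size N; the window spans n // 2 pixels each way.
    A feature with no valid input in its window contributes 0.
    """
    half = n // 2
    total = 0.0
    for (rx, ry), m, o in template:
        img = inputs[m]
        h, w = len(img), len(img[0])
        x0, y0 = c[0] + rx, c[1] + ry
        scores = [
            sims[m](o, img[ty][tx])
            for ty in range(max(0, y0 - half), min(h, y0 + half + 1))
            for tx in range(max(0, x0 - half), min(w, x0 + half + 1))
            if img[ty][tx] is not None
        ]
        total += max(scores, default=0.0)
    return total
```

The cost of this direct form grows with the number of features times the window area at every location, which is what the lookup tables and response maps of the next section avoid.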

Efficient Computation

  • We first quantize the input data for each modality into a small number $n_0$ of values.

  • Use a lookup table $\tau_{i,m}$ for the energy response: $\tau_{i,m}[L_m] = \max_{l \in L_m} |f_m(i, l)|$

    $i$: the index of a quantized value of modality $m$ (we also use $i$ to denote the corresponding value)

    $L_m$: the list of values of modality $m$ appearing in a local neighborhood of a location in the input $I$
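One way to picture the lookup table: if the set of quantized values present around a location is encoded as a bitmask, the table can be filled once for every possible mask. The bitmask encoding, the function names and the bin-center similarity `orientation_sim` are assumptions of this sketch.

```python
import math

def build_lookup(n_bins, sim):
    """tau[i][mask] = max over the set bits l of mask of |sim(i, l)|.

    mask encodes the set L_m of quantized values seen near a location,
    one bit per value (an encoding assumed for this sketch).
    The empty set (mask 0) maps to 0.
    """
    tau = [[0.0] * (1 << n_bins) for _ in range(n_bins)]
    for i in range(n_bins):
        for mask in range(1, 1 << n_bins):
            tau[i][mask] = max(
                abs(sim(i, l)) for l in range(n_bins) if mask >> l & 1
            )
    return tau

def orientation_sim(i, l, n_bins=8):
    """|cos| of the angle between the centers of orientation bins i and l."""
    return abs(math.cos((i - l) * math.pi / n_bins))
```

With 8 bins the table has 8 × 256 entries per modality, so it is cheap to precompute and small enough to stay in cache.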





Efficient Computation

  • “Spread” [11] the data around its neighborhood to obtain a robust representation $J_m$ that replaces $L_m$.

  • For each quantized value of modality $m$ with index $i$, we can now compute the response at each location $c$: $S_{i,m}(c) = \tau_{i,m}[J_m(c)]$

    $\tau_{i,m}$: the precomputed lookup table, indexed by $J_m$
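Spreading and the response maps can be sketched as follows, again assuming a bitmask encoding in which bit $l$ of $J_m(c)$ is set when quantized value $l$ occurs near $c$. The names `spread` and `response_map`, and passing one row of the lookup table as a plain list `tau_i`, are choices made for this illustration.

```python
def spread(labels, size):
    """J(c): bitwise OR of the quantized values within a size x size window.

    labels is a 2D list of bin indices, with None where no feature exists.
    Windows are clipped at the image border.
    """
    h, w, k = len(labels), len(labels[0]), size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if labels[y][x] is None:
                continue
            bit = 1 << labels[y][x]
            for j in range(max(0, y - k), min(h, y + k + 1)):
                for i in range(max(0, x - k), min(w, x + k + 1)):
                    out[j][i] |= bit
    return out

def response_map(spread_img, tau_i):
    """S_i(c) = tau_i[J(c)] for every location c.

    tau_i is the lookup-table row for quantized value i, indexed by bitmask.
    """
    return [[tau_i[mask] for mask in row] for row in spread_img]
```

After spreading, the maximum over the neighborhood in the energy function is absorbed into a single table lookup per location.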

[11] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, P. Fua, N. Navab, and V. Lepetit. Gradient response maps for real-time detection of texture-less objects. Under revision, PAMI.

Efficient Computation

  • Finally, the similarity measure can be computed as: $E(\{I_m\}_{m \in M}, T, c) = \sum_{(r, m) \in P} S_{O_m(r), m}(c + r)$

  • Since the maps $S_{i,m}$ are shared between the templates, matching several templates against the input image is very fast once the maps are computed.
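With the response maps in hand, scoring a template at a location reduces to one memory read per feature. The sketch below assumes the maps are stored in a dictionary keyed by (quantized value, modality); that layout and the name `match` are hypothetical.

```python
def match(template, response_maps, c):
    """E(I, T, c) as a sum of precomputed responses S_{O_m(r), m}(c + r).

    template:      list of (r, m, i) with offset r = (rx, ry), modality key m
                   and quantized reference value i = O_m(r).
    response_maps: dict (i, m) -> 2D list S_{i,m}.
    Offsets that fall outside the image contribute 0.
    """
    total = 0.0
    for (rx, ry), m, i in template:
        s = response_maps[(i, m)]
        x, y = c[0] + rx, c[1] + ry
        if 0 <= y < len(s) and 0 <= x < len(s[0]):
            total += s[y][x]
    return total
```

Every template reads from the same `response_maps`, so the per-image cost of building the maps is paid once and each additional template only adds its own feature lookups.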


Experiments

  • LINE-MOD: our approach (intensity and depth)

  • LINE-2D: introduced in [11] (uses only intensity)

  • LINE-3D: uses only the depth map

  • Hardware:

    • Run on one processor of a standard notebook with a 2.4 GHz Intel Centrino Core 2 Duo processor and 3 GB of RAM.

  • Test data:

    • Six object sequences of 2000 real images each.

    • Each sequence presents illumination changes and large viewpoint changes over heavily cluttered background.


  • Robustness:

    • A threshold of about 80 separates almost all true positives from false positives for LINE-MOD.


  • Speed:

    • Learning new templates only requires extracting and storing features, which is almost instantaneous.

    • Templates cover 360 degrees of tilt rotation, 90 degrees of inclination rotation, in-plane rotations of ± 80 degrees, and scale changes from 1.0 to 2.0.

    • A 640 × 480 image is parsed with over 3000 templates of 126 features each at about 10 fps (real time).

    • The runtime of LINE-MOD depends only on the number of features and is independent of the object/template size.




  • Occlusion:

    • Right: average recognition score for the six objects with respect to occlusion.

    • With over 30% occlusion, our method is still able to recognize objects.




[Figure: results for the hole punch object, reported as true positive rates and false positive rates.]

