
Practical Modeling and Recognition using RGB-D Cameras

Practical Modeling and Recognition using RGB-D Cameras. Xiaofeng Ren, Dieter Fox, Intel Labs, University of Washington. Joint work with Liefeng Bo, Kevin Lai, Peter Henry, Evan Herbst, Mike Krainin, Hao Du and others @ University of Washington. June 27, 2011.


Presentation Transcript


  1. Practical Modeling and Recognition using RGB-D Cameras Xiaofeng Ren, Dieter Fox Intel Labs, University of Washington Joint work with Liefeng Bo, Kevin Lai, Peter Henry, Evan Herbst, Mike Krainin, Hao Du and others @ University of Washington June 27, 2011

  2. RGB-D Camera: Color+Depth 640x480, 30Hz, color + dense depth
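
Because the camera delivers a dense depth value registered to each color pixel, a frame can be back-projected into a colored 3D point cloud with the standard pinhole model. A minimal sketch follows; the intrinsics are illustrative defaults, not any particular camera's calibration.

```python
import numpy as np

# Illustrative intrinsics for a 640x480 RGB-D camera; real values come from calibration.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5

def depth_to_point_cloud(depth_m, rgb):
    """Back-project an HxW depth image (meters) and a registered HxWx3 color image
    into an Nx3 point cloud with per-point colors."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth_m > 0                      # missing depth is reported as 0
    z = depth_m[valid]
    x = (u[valid] - CX) * z / FX             # pinhole back-projection
    y = (v[valid] - CY) * z / FY
    points = np.stack([x, y, z], axis=1)
    colors = rgb[valid]
    return points, colors
```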

  3. At RGB-D 2010 Workshop:
  • 3D modeling of indoor environments: RGBD-ICP matching + loop closure; flythrough visualization
  • 3D modeling of everyday objects: robot in-hand modeling through real-time registration and modeling
  • Robust recognition of everyday objects: preliminary object dataset captured with RGB-D; preliminary results on sparse distance learning

  4. RGB-D Perception @ UW and Intel
  • 3D modeling of objects & environments
    Indoor Modeling: [Henry, Krainin, Herbst, Ren, Fox; ISER ’10]
    Interactive Modeling: [Hao, Henry, Ren, Fox, Seitz; Ubicomp ’11]
    Dynamic Scene Modeling: [Herbst, Ren, Fox; ICRA ’11, IROS ’11]
    Object Manipulation: [Krainin, Henry, Ren, Fox; IJRR ’10]
    Interactive 3D Visualization: [Cheng, Ren; ’11]
  • Robust recognition of everyday objects
    Egocentric recognition: [Ren, Gu; CVPR ’10]
    Joint object-pose recognition: [Gu, Ren; ECCV ’10]
    Kernel Descriptors: [Bo, Ren, Fox; NIPS ’10, IROS ’11]
    Hierarchical Kernel Descriptors: [Bo, Lai, Ren, Fox; CVPR ’11]
    RGB-D Benchmark: [Lai, Bo, Ren, Fox; ICRA ’11]
    Sparse distance learning: [Lai, Bo, Ren, Fox; ICRA ’11] (best vision paper)
    Scalable and hierarchical recognition: [Lai, Bo, Ren, Fox; AAAI ’11]

  5. RGB-D Perception @ UW and Intel
  • 3D modeling of objects & environments
    Indoor Modeling: [Henry, Krainin, Herbst, Ren, Fox; ISER ’10]
    Interactive Modeling: [Hao, Henry, Ren, Fox, Seitz; Ubicomp ’11]
    Dynamic Scene Modeling: [Herbst, Ren, Fox; ICRA ’11, IROS ’11]
    Object Manipulation: [Krainin, Henry, Ren, Fox; IJRR ’10]
    Interactive 3D Visualization: [Cheng, Ren; ’11]
  • Robust recognition of everyday objects
    Egocentric recognition: [Ren, Gu; CVPR ’10]
    Joint object-pose recognition: [Gu, Ren; ECCV ’10]
    Kernel Descriptors: [Bo, Ren, Fox; NIPS ’10]
    Hierarchical Kernel Descriptors: [Bo, Lai, Ren, Fox; CVPR ’11]
    RGB-D Benchmark: [Lai, Bo, Ren, Fox; ICRA ’11]
    Sparse distance learning: [Lai, Bo, Ren, Fox; ICRA ’11] (best vision paper)
    Scalable and hierarchical recognition: [Lai, Bo, Ren, Fox; AAAI ’11]

  6. RGB-D Mapping: Pipeline

  7. [Henry-Krainin-Herbst-Ren-Fox]
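
Frame-to-frame alignment in the mapping pipeline combines sparse visual-feature matching with dense ICP (RGBD-ICP), followed by loop closure and global optimization. Below is a minimal point-to-point ICP sketch illustrating only the dense alignment step; the actual system also uses visual features and a point-to-plane error, and the function name and iteration count here are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(source, target, iters=20):
    """Align an Nx3 'source' cloud to an Mx3 'target' cloud; returns a 4x4 transform
    mapping source coordinates into the target frame."""
    tree = cKDTree(target)
    T = np.eye(4)
    src = source.copy()
    for _ in range(iters):
        _, idx = tree.query(src)                 # nearest-neighbor correspondences
        tgt = target[idx]
        mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
        H = (src - mu_s).T @ (tgt - mu_t)        # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T                           # optimal rotation (Kabsch)
        if np.linalg.det(R) < 0:                 # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                      # apply incremental transform
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t
        T = step @ T
    return T
```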

  8. Comparing to Laser-based Mapping

  9. From RGB-D to Interactive Modeling [Du-Henry-Ren-Fox-Goldman-Seitz; Ubicomp 11]

  10. Discovering and Learning Objects [Herbst-Henry-Ren-Fox; ICRA 2011]

  11. Discovering and Learning Objects
  • (Robot) capturing scenes in RGB-D over an extended period of time
  • 3D scene reconstruction for efficient representation
  • Proper sensor models for both color and depth
  • Pairwise scene differencing with sensor models and MRF clean-up (see the sketch below)
  [Herbst-Henry-Ren-Fox; ICRA 2011]
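
A minimal sketch of the differencing step under a Gaussian depth-noise assumption. The actual work uses calibrated per-pixel sensor models for both color and depth and an MRF solved with graph cuts; a median filter stands in for that clean-up here, and the function name and thresholds are illustrative.

```python
import numpy as np
from scipy.ndimage import median_filter

def depth_change_mask(expected_depth, observed_depth, sigma=0.02, tau=3.0):
    """Flag pixels whose observed depth (HxW, meters) disagrees with the depth
    expected from the previous scene reconstruction, then spatially clean up."""
    valid = (expected_depth > 0) & (observed_depth > 0)
    z = np.zeros_like(expected_depth)
    z[valid] = np.abs(observed_depth[valid] - expected_depth[valid]) / sigma
    changed = (z > tau) & valid                                 # per-pixel change evidence
    cleaned = median_filter(changed.astype(np.uint8), size=5)   # crude MRF stand-in
    return cleaned.astype(bool)
```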

  12. Discovering and Learning Objects
  • Handling change detections across multiple visits with a multi-label MRF
  • Matching potential objects by movements and appearance
  • ICP for shape matching
  • Color image recognition with kernel descriptors
  • Spectral clustering for object discovery (see the sketch below)
  [Herbst-Ren-Fox; IROS 2011]
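
The discovery step can be pictured as spectral clustering over a segment-to-segment affinity that blends shape similarity (e.g. ICP fit quality) with appearance similarity (e.g. kernel-descriptor scores). A hedged sketch, with the blending weight and input scores as illustrative placeholders:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def discover_objects(shape_sim, appearance_sim, n_objects, w=0.5):
    """shape_sim, appearance_sim: NxN segment-to-segment similarities in [0, 1]."""
    affinity = w * shape_sim + (1.0 - w) * appearance_sim
    labels = SpectralClustering(
        n_clusters=n_objects, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return labels  # segments sharing a label are hypothesized to be the same object
```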

  13. Discovering and Learning Objects [Herbst-Ren-Fox; IROS 2011]

  14. Object Learning through Manipulation [Krainin-Henry-Ren-Fox IJRR 2011]

  15. Next-Best-View Planning [Krainin-Curless-Fox ICRA 2011]

  16. RGB-D Perception @ UW and Intel
  • 3D modeling of objects & environments
    Indoor Modeling: [Henry, Krainin, Herbst, Ren, Fox; ISER ’10]
    Interactive Modeling: [Hao, Henry, Ren, Fox, Seitz; Ubicomp ’11]
    Dynamic Scene Modeling: [Herbst, Ren, Fox; ICRA ’11, IROS ’11]
    Object Manipulation: [Krainin, Henry, Ren, Fox; IJRR ’10]
    Interactive 3D Visualization: [Cheng, Ren; ’11]
  • Robust recognition of everyday objects
    Egocentric recognition: [Ren, Gu; CVPR ’10]
    Joint object-pose recognition: [Gu, Ren; ECCV ’10]
    Kernel Descriptors: [Bo, Ren, Fox; NIPS ’10]
    Hierarchical Kernel Descriptors: [Bo, Lai, Ren, Fox; CVPR ’11]
    RGB-D Benchmark: [Lai, Bo, Ren, Fox; ICRA ’11]
    Sparse distance learning: [Lai, Bo, Ren, Fox; ICRA ’11] (best vision paper)
    Scalable and hierarchical recognition: [Lai, Bo, Ren, Fox; AAAI ’11]

  17. RGB-D Object Dataset: 300 objects from 51 categories, 250,000 RGB-D views, plus cluttered scenes. http://www.cs.washington.edu/rgbd-dataset/ (search “rgbd” + “dataset”) [Lai-Bo-Ren-Fox; ICRA 2011]

  18. Benchmarking RGB-D Recognition Category-Level Recognition (51 categories) Instance-Level Recognition (303 instances) [Lai-Bo-Ren-Fox; ICRA 2011]

  19. RGB-D Object Recognition: the standard pipeline runs image → patch features (SIFT or HOG) → image features (Bag-of-Words, Sparse Coding (LLC, LCC), Spatial Pyramid Matching (SPM), Efficient Match Kernel (EMK), feed-forward networks, or your favorite model) → recognition.
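
To make the baseline concrete, here is a minimal sketch of the bag-of-words branch of that pipeline. Local descriptor extraction (e.g. SIFT or HOG on a dense grid), spatial pyramid pooling, and match kernels are omitted; the function names, codebook size, and classifier settings are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def encode(descs, codebook):
    """Quantize local descriptors against the codebook; return an L1-normalized histogram."""
    words = codebook.predict(descs)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bow_classifier(train_descs, train_labels, codebook_size=1000):
    """train_descs: list of (n_i x d) local-descriptor arrays, one per training image."""
    codebook = KMeans(n_clusters=codebook_size, n_init=4).fit(np.vstack(train_descs))
    X = np.array([encode(d, codebook) for d in train_descs])
    clf = LinearSVC(C=1.0).fit(X, train_labels)
    return codebook, clf
```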

  20. Kernel Descriptors: Generalizing SIFT. A linear kernel on SIFT descriptors is a product of two histograms, which can be rewritten as a sum over all pairs of pixels of products of normalized gradient magnitudes and kernels on gradient orientation and pixel coordinates: a gradient match kernel between image patches.
  • Includes SIFT as a special case
  • Avoids any “binning” issues in histogram features
  [Bo-Ren-Fox; NIPS 2010]
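
In the notation of the NIPS ’10 paper, the gradient match kernel between two patches P and Q has roughly the following form, with \tilde{m} the normalized gradient magnitude, \tilde{\theta} the gradient orientation, and Gaussian kernels k_o and k_p over orientation and pixel position:

```latex
% Gradient match kernel between patches P and Q (sketch of the Bo-Ren-Fox formulation)
K_{\mathrm{grad}}(P, Q)
  = \sum_{z \in P} \sum_{z' \in Q}
      \tilde{m}(z)\, \tilde{m}(z')\,
      k_o\!\big(\tilde{\theta}(z), \tilde{\theta}(z')\big)\,
      k_p(z, z')
```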

  21. Kernel Descriptors: Image Recognition
  • Low-dimensional approximations of match kernels (sketched below)
  • Explicitly compute descriptors/features from patches
  • Easily generalize gradient features to color, binary shape, etc.
  • Outperform SIFT and sophisticated feature-learning techniques
  Scene-15: KDES 86.7%, SIFT 82.2%
  Caltech-101: KDES 76.4%, CDBN[2] 65.5%, SPM[1] 64.4%, LCC[4] 73.4%
  CIFAR10: KDES 76.0%, LCC[4] 74.5%, mcRBM-DBN[3] 71.0%, TCNN[5] 73.1%
  [1] Lazebnik, Schmid, Ponce, CVPR ’06. [2] Lee, Grosse, Ranganath, Ng, ICML ’09. [3] Ranzato & Hinton, CVPR ’10. [4] Yu & Zhang, ICML ’10. [5] Le, Ngiam, Chen, Chia, Koh & Ng, NIPS ’10.
  [Bo-Ren-Fox; NIPS 2010]
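
A hedged sketch of the approximation idea: each pixel contributes a weighted kernel evaluation against a fixed set of sampled basis vectors, so the double sum in the match kernel collapses into an explicit finite-dimensional descriptor. The real method samples joint orientation-position bases and compresses them with kernel PCA; the names, the uniform basis, and the kernel width below are illustrative.

```python
import numpy as np

def kernel_descriptor(attributes, weights, basis, gamma=5.0):
    """attributes: N x d per-pixel attributes (e.g. [sin, cos] of gradient orientation);
    weights: length-N per-pixel weights (e.g. normalized gradient magnitude);
    basis: B x d sampled basis vectors. Returns a B-dimensional patch descriptor."""
    # Gaussian kernel between every pixel attribute and every basis vector (N x B)
    diff = attributes[:, None, :] - basis[None, :, :]
    k = np.exp(-gamma * (diff ** 2).sum(axis=-1))
    return weights @ k   # weighted sum over pixels
```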

  22. Kernel Descriptors: RGB-D Recognition Category-Level Recognition (51 categories) Instance-Level Recognition (303 instances) [Bo-Lai-Ren-Fox; CVPR 2011; IROS 2011]

  23. Toward Practical Recognition • A mug? • Kevin’s mug? • A mug facing right? • A mug with orientation (90,15,0) • … …

  24. Scalable and Hierarchical Recognition: 8 discrete views, continuous angles [Lai-Bo-Ren-Fox; AAAI 2011]

  25. Joint Recognition with Object-Pose Tree
  • Tree structure enables efficient joint recognition (see the sketch below)
  • Object-Pose tree outperforms nearest-neighbor and one-vs-all baselines
  • Joint tree-based learning outperforms separate learning
  • Promising pose estimation results on generic objects
  • Natural tree structure of category-instance-pose works really well
  RGB-D Dataset: 300 objects, 51 categories, 250,000 color-depth pairs
  [Lai-Bo-Ren-Fox; AAAI 2011]
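
A hedged sketch of the top-down evaluation such a tree enables: at each level (category, then instance, then discrete view), only the children of the chosen node are scored, so classification cost grows with tree depth rather than with the total number of leaves. The node structure and linear scores below are illustrative; in the actual system the classifiers are learned jointly over the tree.

```python
import numpy as np

class Node:
    """A tree node holding a label and a linear classifier (weights w, bias b)."""
    def __init__(self, label, w, b, children=()):
        self.label, self.w, self.b, self.children = label, w, b, list(children)

def classify(x, root):
    """Descend the object-pose tree greedily on feature vector x; returns the labels
    along the chosen path, e.g. ['mug', 'kevins_mug', 'view_90']."""
    path, node = [], root
    while node.children:
        scores = [child.w @ x + child.b for child in node.children]
        node = node.children[int(np.argmax(scores))]
        path.append(node.label)
    return path
```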

  26. Application: Interactive LEGO RGB-D used for object recognition and hand tracking [Ziola-Harrison-Powledge-Lai-Bo-Ren-Fox]

  27. Application: Chess Playing Robot [Matuszek-Mayton-Aimi-Bo-Deisenroth-Chu-Kung-LeGrand-Smith-Fox]

  28. RGB-D Perception: Summary
  • RGB-D cameras provide synchronized color and depth, making visual perception both robust and efficient.
  • RGB-D mapping generates detailed 3D maps at near real-time and enables on-the-fly user interaction and feedback.
  • Kernel descriptors provide a principled way to extract rich features from pixel attributes, outperforming SIFT and leading to robust RGB-D recognition.
  • Robust RGB-D recognition and modeling enable interesting scenarios for object-aware interactions and applications.

  29. RGB-D Perception: The Future?
  • Will RGB-D have a deep impact on vision applications? YES! It’s already happening, faster than we can track.
  • Will RGB-D start a revolution in vision applications? NO. We still need to solve recognition, segmentation, tracking, scene understanding, etc. YES! RGB-D helps address two BIG issues in computer vision: loss of 3D from projection, and lighting conditions. RGB-D helps “abstract away” many low-level problems.
  • Is RGB-D the future for smart vision-based systems? Why not? At $50 today and $10 tomorrow.

  30. THANK YOU
