ICCV & CVPR paper reading

Chi Chen (池晨) @ jdl.ac.cn

2009.11.27

slide2

CVPR09, #2128:

Recognizing Indoor Scenes

slide3

Recognizing Indoor Scenes

Ariadna Quattoni & Antonio Torralba

  • A. Quattoni, X. Carreras, M. Collins, T. Darrell, An Efficient Projection for L1,∞ Regularization, ICML 2009.
  • A. Quattoni, A. Torralba, Recognizing Indoor Scenes, CVPR 2009.
  • A. Quattoni, M. Collins, T. Darrell, Transfer Learning for Image Classification with Sparse Prototype Representations, CVPR 2008.
  • A. Quattoni, M. Collins, T. Darrell, Learning Visual Representations using Images with Captions, CVPR 2007.
  • A. Quattoni, S. Wang, L.P. Morency, M. Collins, and T. Darrell, Hidden-state Conditional Random Fields, IEEE PAMI, 2007.

Ariadna Quattoni

Ph.D. student

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide4

Recognizing Indoor Scenes

Ariadna Quattoni & Antonio Torralba

  • L.P. Morency, A. Quattoni, T. Darrell, Latent-Dynamic Discriminative Models for Continuous Gesture Recognition, CVPR 2007.
  • S. Wang, A. Quattoni, L.P. Morency, D. Demirdjian, T. Darrell, Hidden Conditional Random Fields for Gesture Recognition, CVPR 2006.
  • A. Quattoni, M. Collins, T. Darrell, Incorporating Semantic Constraints into a Discriminative Categorization and Labeling Model, Workshop on Semantic Knowledge in Vision, ICCV, 2005.
  • A. Quattoni, M. Collins and T. Darrell, Conditional Random Fields for Object Recognition, In Proceedings of NIPS, 2004.

Ariadna Quattoni

Ph.D. student

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide5

Recognizing Indoor Scenes

Ariadna Quattoni & Antonio Torralba

  • Research Interests
  • Computer vision,
  • Machine learning,
  • Human visual perception,
  • Scene and object recognition.

Antonio Torralba

Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide6

Recognizing Indoor Scenes

Ariadna Quattoni & Antonio Torralba

  • LabelMe: online image annotation and applications. A. Torralba, B. C. Russell, and J. Yuen, MIT CSAIL Technical Report, 2009.
  • How many pixels make an image? A. Torralba, Visual Neuroscience, volume 26, issue 01, pp. 123-131, 2009.
  • Small codes and large databases for recognition. A. Torralba, R. Fergus, Y. Weiss, CVPR, 2008.
  • 80 million tiny images: a large dataset for non-parametric object and scene recognition. A. Torralba, R. Fergus, W. T. Freeman, IEEE Transactions on PAMI, vol. 30(11), pp. 1958-1970, 2008.
  • Sharing visual features for multiclass and multiview object detection. A. Torralba, K. P. Murphy and W. T. Freeman, PAMI, 2007.

Antonio Torralba

Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide7

?

Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain.

Fig. 1. Comparison of spatial SIFT and Gist features for a scene recognition task. Both sets of features show a strong correlation in performance across the 15 scene categories. Average performance for the different features: Gist: 73.0%, pyramid matching: 73.4%, bag of words: 64.1%, and color pixels (SSD): 30.6%. In all cases an SVM is used.

slide8

Abstract

  • Indoor scene recognition is a challenging open problem.
  • Should scenes be recognized by their global spatial properties or by the objects they contain?
  • A prototype-based model that can successfully combine both sources of information.
  • A dataset of 67 indoor scene categories.
  • Good results.
slide10

Prototype Image

Prototype image?

slide12

A Prototype Based Model

[Diagram: for each scene category and each prototype Tp, a prototype image T is represented by its global spatial properties together with a set of contained-object regions of interest (ROI 1, ROI 2, …, ROI mk).]

slide14

Image Descriptor

How to represent global spatial properties? — Using the Gist descriptor.

How to represent each ROI? — Using a spatial pyramid of visual words.

[Diagram: prototype image T with its global spatial properties and contained-object ROIs (ROI 1 … ROI mk).]

slide15

Gist (1/2)

The original image is decomposed by a bank of multiscale oriented filters (several scales and orientations). The magnitude of each filter output is taken and the local average response is computed over 4×4 windows. PCA is then applied to the sampled filter outputs to obtain the Gist feature.
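Below is a minimal sketch (not the authors' code) of a Gist-like descriptor built from Gabor filters in scikit-image; the resize to 128×128, the filter frequencies and orientations, and leaving PCA as an optional final step are illustrative assumptions.

```python
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_like_descriptor(image, frequencies=(0.1, 0.2, 0.3), n_orientations=4):
    """Crude Gist-like feature: multiscale oriented filter energies
    averaged over a 4x4 grid of spatial windows (illustrative settings)."""
    image = resize(image, (128, 128), anti_aliasing=True)
    blocks = []
    for freq in frequencies:                  # "scales" of the oriented filters
        for k in range(n_orientations):       # filter orientations
            theta = np.pi * k / n_orientations
            real, imag = gabor(image, frequency=freq, theta=theta)
            magnitude = np.hypot(real, imag)  # magnitude of the filter output
            # local average response over a 4x4 grid of windows
            h, w = magnitude.shape
            grid = magnitude[: h - h % 4, : w - w % 4]
            grid = grid.reshape(4, h // 4, 4, w // 4).mean(axis=(1, 3))
            blocks.append(grid.ravel())
    return np.concatenate(blocks)             # PCA could be applied afterwards
```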

slide16

Gist (2/2)

The Gist feature coarsely encodes the edge and texture information of the original image.

Top row: original images.

Bottom row: noise images coerced to have the same global features (N=64) as the target image.

slide17

Image Descriptor

How to represent global spatial properties? — Using the Gist descriptor.

How to represent each ROI? — Using a spatial pyramid of visual words.

[Diagram: prototype image T with its global spatial properties and contained-object ROIs (ROI 1 … ROI mk).]

slide18

ROI Descriptor

The visual words are obtained by vector-quantizing SIFT descriptors with K-means, applied to a random subset of images.

A spatial pyramid of visual words

The color of each pixel represents the visual word to which it was assigned.
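A rough sketch of how such an ROI descriptor could be built with OpenCV SIFT and scikit-learn K-means; the vocabulary size (200 words) and the two-level pyramid are illustrative choices, not the paper's exact settings.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def build_vocabulary(gray_images, n_words=200):
    """Vector-quantize SIFT descriptors from a random subset of images."""
    sift = cv2.SIFT_create()
    descs = []
    for img in gray_images:
        _, d = sift.detectAndCompute(img, None)
        if d is not None:
            descs.append(d)
    return KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(descs))

def spatial_pyramid_histogram(gray_roi, vocab, levels=2):
    """Histograms of visual words at several spatial resolutions of an ROI."""
    sift = cv2.SIFT_create()
    kps, d = sift.detectAndCompute(gray_roi, None)
    if d is None:
        return np.zeros(sum((2 ** l) ** 2 for l in range(levels)) * vocab.n_clusters)
    words = vocab.predict(d.astype(np.float32))
    pts = np.array([kp.pt for kp in kps])        # (x, y) keypoint locations
    h, w = gray_roi.shape
    hists = []
    for level in range(levels):                  # 1x1 grid, then 2x2 grid, ...
        cells = 2 ** level
        for gy in range(cells):
            for gx in range(cells):
                in_cell = ((pts[:, 0] // (w / cells)).astype(int) == gx) & \
                          ((pts[:, 1] // (h / cells)).astype(int) == gy)
                hist = np.bincount(words[in_cell], minlength=vocab.n_clusters)
                hists.append(hist / max(hist.sum(), 1))   # L1-normalize each cell
    return np.concatenate(hists)
```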

slide19

Image Descriptor

How to represent global spatial properties? — Using the Gist descriptor.

How to represent each ROI? — Using a spatial pyramid of visual words.

[Diagram: prototype image T with its global spatial properties and contained-object ROIs (ROI 1 … ROI mk).]

slide20

Model Formulation

  • Given:
  • A training set D of n pairs of labeled images,
  • A set S of p segmented images, which we call prototypes.
  • Goal:
  • To use D and S to learn a mapping h : X → R.
slide21

Model Formulation

Contained object information

The mapping should capture the fact that images containing similar objects must have similar scene labels, and that some objects are more important than others in defining a scene's identity.

Distances between two regions are computed using histogram intersection.

Here t_kj represents the j-th ROI of the k-th prototype image, and x_s represents the segment of image x most similar to t_kj.
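A small sketch of the histogram-intersection similarity used here, assuming L1-normalized visual-word histograms for the prototype ROI and the candidate segments.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity between two (normalized) visual-word histograms:
    the sum over bins of the minimum of the two values."""
    return np.minimum(h1, h2).sum()

def best_matching_segment(roi_hist, segment_hists):
    """Pick the segment of image x most similar to a prototype ROI t_kj."""
    scores = [histogram_intersection(roi_hist, s) for s in segment_hists]
    return int(np.argmax(scores)), max(scores)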

slide22

Searching Strategy

Given a new image, how do we find the ROIs that are similar to the ROIs of a given prototype image T?

Histogram intersection function:

Search within a small window around the ROI's original location in the prototype image T.
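A hedged sketch of this search: candidate windows around the ROI's original location are scored by histogram intersection over a precomputed map of per-pixel visual-word assignments; the offsets and the fixed window size are illustrative assumptions.

```python
import numpy as np

def search_similar_roi(image_word_map, roi_hist, roi_box,
                       offsets=(-20, 0, 20), n_words=200):
    """Scan windows near the prototype ROI location (x, y, w, h) and return
    the window whose visual-word histogram best matches roi_hist."""
    x, y, w, h = roi_box
    H, W = image_word_map.shape
    best_score, best_box = -1.0, roi_box
    for dy in offsets:
        for dx in offsets:
            x0 = int(np.clip(x + dx, 0, W - w))
            y0 = int(np.clip(y + dy, 0, H - h))
            window = image_word_map[y0:y0 + h, x0:x0 + w]
            hist = np.bincount(window.ravel(), minlength=n_words).astype(float)
            hist /= max(hist.sum(), 1)
            score = np.minimum(roi_hist, hist).sum()   # histogram intersection
            if score > best_score:
                best_score, best_box = score, (x0, y0, w, h)
    return best_box, best_score
```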

slide23

Searching Strategy

Figure 5. Example of detection of similar image patches.

The top three images correspond to the query patterns. For each image, the algorithm tries to detect the selected region on the query image.

The next three rows show the top three matches for each region.

The last row shows the three worst matching regions.

slide24

Model Formulation

Global spatial information

For some scene categories, global image information can be very important.

Global information is computed as the L2 norm between the Gist representation of image x and the Gist representation of prototype k.

slide25

Model Formulation

Parameters

  • Global spatial information: the importance of global features when considering the k-th prototype.
  • Prototype relevance: how relevant the similarity to prototype k is for predicting the scene label.
  • Contained object information: the importance of a particular ROI inside a given prototype.
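The schematic below illustrates how such a score could combine these terms; the parameter names lambda_k, beta_k, and w_k are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def prototype_score(gist_x, gist_k, roi_sims, lambda_k, w_k):
    """Score of image x against prototype k: a weighted global Gist term
    plus weighted contained-object (ROI) similarities.
    Parameter names are illustrative, not the paper's notation."""
    global_term = -np.linalg.norm(gist_x - gist_k)   # closer Gist -> higher score
    local_term = float(np.dot(w_k, roi_sims))        # per-ROI importances
    return lambda_k * global_term + local_term

def scene_score(gist_x, prototypes, betas):
    """Combine per-prototype scores with relevance weights beta_k."""
    return sum(beta * prototype_score(gist_x, p["gist"], p["roi_sims"],
                                      p["lambda"], p["w"])
               for beta, p in zip(betas, prototypes))
```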

slide26

Model Formulation

Learning

How to estimate the model parameters from a training set D?

The regularization terms, together with the constants Cb and Cl, dictate the amount of regularization in the model.

The loss function measures the error that the classifier incurs on the training set D.

slide27

Model Formulation

Learning

Using training set D and a gradient-based method to estimate the model parameters:

Δ is the set of indices of examples in D that attain non-zero loss.
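As a hedged illustration of the gradient-based learning step, the sketch below trains a simple binary linear scorer with a hinge loss plus L2 regularization; only examples with non-zero loss (the set Δ) contribute to the gradient. The actual model's loss and regularizers (Cb, Cl) are richer than this.

```python
import numpy as np

def train_linear_scorer(Phi, y, C=1.0, lr=0.01, epochs=100):
    """Sketch of gradient-based learning: a binary linear scorer
    h(x) = w . phi(x) fit with hinge loss plus L2 regularization.
    Phi: (n, d) feature matrix; y: labels in {-1, +1}."""
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (Phi @ w)
        delta = margins < 1.0                 # examples with non-zero loss (the set Δ)
        grad = C * w - (y[delta][:, None] * Phi[delta]).sum(axis=0)
        w -= lr * grad                        # gradient step
    return w
```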

slide28

Model Formulation

The number in parentheses is the classification confidence.

slide30

Indoor Database

Figure 2. Summary of the 67 indoor scene categories used in our study. To facilitate seeing the variety of scene categories considered, they are organized into 5 big scene groups. The database contains 15620 images. All images have a minimum resolution of 200 pixels on the smallest axis.

  • Compared with the state of the art:
  • The largest one available: 67 categories, 15620 images.
  • More difficult: in-class variability.
slide31

Results (1/3)

Four different variations of the model, combining:

  • automatically segmented ROIs vs. manually annotated ROIs,
  • local features only vs. both local and global features.

slide32

Results (1/3)

Four different variations of the model.

  • Both local and global information are useful for the indoor scene recognition task.
  • Using automatic segmentations instead of manual segmentations causes only a small drop in performance.
slide33

Results (2/3)

Figure 7. The 67 indoor categories sorted by multiclass average precision (training with 80 images per class; testing on 20 images per class).

slide34

Results (3/3)

How is the performance of the proposed model affected by the number of prototypes used?

We observed a logarithmic growth of the average precision as a function of the number of prototypes.

Exploiting more prototypes might further improve the performance.

slide35

Conclusion (1/3)

Combination of global spatial information and contained-object information (prototype image T with its ROIs, as in the model overview diagram).

slide36

Conclusion (2/3)

Global spatial information

Contained object information

slide38

ICCV09:

Learning to Predict Where Humans Look

slide39

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Education Background
  • Massachusetts Institute of Technology, Cambridge, MA
  • Ph.D. candidate in Computer Science (Graphics) Expected graduation June 2010,
  • Masters of Science, Computer Science, Jan 2007
  • Bachelors of Science in Mathematics, June 2003.
  • École Polytechnique, Palaiseau, France
  • International Program, Computer Science Major, Sept 2003 to April 2004
  • Cambridge University, Cambridge, England
  • Junior Year Abroad, Read Part IB Mathematics Tripos, Sept 2001 to June 2002

Tilke Judd

Ph.D. student

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide40

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Research Interests
  • Computer Graphics
  • Computational Photography
  • Image Processing
  • Perception
  • Non-Photorealistic Rendering

Tilke Judd

Ph.D. student

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide41

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Judd, T, Ehinger, K, Durand, F, Torralba, A. Learning to Predict Where People Look, ICCV 2009.
  • Judd, T., Durand, F., Adelson, T. Apparent Ridges for Line Drawing. Proceedings of ACM Siggraph 2007
  • Judd, Tilke. Apparent Ridges for Line Drawing. Masters Thesis, Computer Science, MIT, Jan 2007
  • Ju, W., R. Hurwitz, T. Judd, B. Lee. CounterActive: An Interactive Cookbook for the Kitchen Counter. Proceedings of SIGCHI 2001, Short Papers and Abstracts, Seattle WA, April 2001. p 269
  • Ju, W., L. Bonanni, R. Fletcher, R. Hurwitz, T. Judd, J. Yoon, E.R. Post, M. Reynolds. Origami Desk. Exhibited SIGGRAPH 2001, Los Angeles CA. SIGGRAPH Conference Abstracts and Applications, August 2001, p. 280
  • Judd, Tilke. The JPEG Compression Algorithm. The MIT Undergraduate Mathematics Journal. Vol 5, p.119

Tilke Judd

Ph.D. student

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

slide42

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Education Background
  • University of Edinburgh, Edinburgh, UK2007 B.Sc. Psychology
  • California Institute of Technology, Pasadena, CA, USA2003 B.S. Engineering & Applied Science


Krista Ehinger

Graduate Student

Department of Brain & Cognitive Sciences at MIT

slide43

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Education Background
  • He received his PhD from Grenoble University, France, in 1999.
  • From 1999 to 2002, he was a post-doc in the MIT Computer Graphics Group.

Frédo Durand

Associate Professor, Computer Graphics Group, CSAIL, MIT.

slide44

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Research Interests
  • Synthetic image generation
  • Computational photography

Frédo Durand

Associate Professor, Computer Graphics Group, CSAIL, MIT.

slide45

Learning to Predict

Where Humans Look

Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba

  • Co-organized the first Symposium on Computational Photography and Video in 2005,
  • Co-organized the first International Conference on Computational Photography in 2009,
  • Was on the advisory board of the Image and Meaning 2 conference.
  • Received an inaugural Eurographics Young Researcher Award in 2004,
  • Received an NSF CAREER award in 2005,
  • Received an inaugural Microsoft Research New Faculty Fellowship in 2005,
  • Received a Sloan fellowship in 2006,
  • Received a Spira award for distinguished teaching in 2007.

Frédo Durand

Associate Professor, Computer Graphics Group, CSAIL, MIT.

slide46

?

How can we understand where humans look in a scene without an eye tracker?

Figure 2. Current saliency models do not accurately predict human fixations. In row one, the low-level model selects bright spots of light as salient while viewers look at the human. In row two, the low-level model selects the building's strong edges and windows as salient while viewers fixate on the text.

slide47

Abstract

  • For many applications in graphics, design, and human-computer interaction, it is essential to understand where humans look in a scene.
  • Models of saliency can be used to predict fixation locations.
  • A saliency model based on both top-down and bottom-up information.
  • A large eye-tracking database.
slide48

Database of Eye Tracking Data

  • 15 viewers
  • 1003 random images
  • Free viewing
  • 3 seconds per image
  • Recording the gaze path

slide49

Database of Eye Tracking Data

Collect each viewer's fixations.

Convolve a Gaussian filter across each viewer's fixations, then average all viewers' data to obtain a continuous saliency map.

Select the top n percent most salient locations to generate a binary map.

slide50

Analysis of Dataset

  • For some images, all viewers fixate on the same locations, while in other images viewers' fixations are dispersed all over the image.
  • The fixations in the database have a strong bias towards the center.
  • Fixations from the database are often on animals, cars, and human body parts like eyes and hands.
  • There is a certain size for a region of interest (ROI) that a person fixates on.
slide52

Features Used for Machine Learning

  • Low-level features:
  • Local energy of the steerable pyramid filters [3].
  • Features used in a simple saliency model described by Torralba [1] and Rosenholtz [2].
  • Orientation and color contrast.
  • Values of the red, green, and blue channels, as well as the probabilities of each of these channels as features [4].
  • The probability of each color as computed from 3D color histograms of the image filtered with a median filter at 6 different scales.
  • Mid-level features:
  • The location of the horizon.
slide53

Features Used for Machine Learning

  • High-level features:
  • Running the Viola-Jones face detector [5] and the Felzenszwalb person detector [6].
  • Center prior:
  • The distance from each pixel to the center (see the sketch after this list).
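The sketch below computes the center-prior feature as the distance of every pixel from the image center, normalized to [0, 1] (the normalization is an illustrative choice).

```python
import numpy as np

def center_prior(shape):
    """Distance of every pixel from the image center, scaled to [0, 1]."""
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    dist = np.hypot(rows - (h - 1) / 2.0, cols - (w - 1) / 2.0)
    return dist / dist.max()
```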
slide54

Features Used for Machine Learning

Fig. 8. Features. A sample image (bottom right) and 33 of the features that we use to train the model.

slide56

Features Used for Machine Learning

Using the binary map to generate positive and negative labels.

slide57

Features Used for Machine Learning

Binary saliency map

Positively labeled pixels

Negatively labeled pixels

slide59

Training

From the binary saliency map, 10 positively labeled pixels and 10 negatively labeled pixels are sampled per image.

903 training images (9030 positive and 9030 negative training samples) and 100 testing images, with a liblinear SVM.
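A minimal sketch of this training setup using scikit-learn's LinearSVC (which wraps liblinear); per-pixel feature extraction is assumed to be done already and stacked into arrays, and the sampling helper is hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sample_pixels(binary_map, rng, n_per_class=10):
    """Return indices of n positively and n negatively labeled pixels."""
    pos = np.flatnonzero(binary_map.ravel())
    neg = np.flatnonzero(~binary_map.ravel())
    return (rng.choice(pos, n_per_class, replace=False),
            rng.choice(neg, n_per_class, replace=False))

def train_saliency_svm(feature_stacks, binary_maps, seed=0):
    """feature_stacks[i]: (n_pixels, n_features) per-pixel features of image i;
    binary_maps[i]: boolean top-salient map of image i."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for feats, bmap in zip(feature_stacks, binary_maps):
        pos_idx, neg_idx = sample_pixels(bmap, rng)
        X.append(feats[pos_idx])
        y.extend([1] * len(pos_idx))
        X.append(feats[neg_idx])
        y.extend([0] * len(neg_idx))
    clf = LinearSVC(C=1.0)                    # liblinear-backed linear SVM
    clf.fit(np.vstack(X), np.array(y))
    return clf
```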

slide60

Testing

Figure 9. Comparison of saliency maps. Each row of images compares the predictions of our SVM saliency model, the Itti saliency map, the center prior, and the human ground truth, all thresholded to show the top 10 percent most salient locations.

slide61

Performance On Testing Images

Figure 10. The ROC curves of performance for SVMs trained on each set of features individually and combined together. We also plot human performance and chance for comparison.

slide62

Application

Rendering more detail at the locations users fixate on and less detail in the rest of the image.

slide63

Conclusion (1/4)

Created a database containing real eye-tracking data.

slide64

Conclusion (2/4)

  • Low-level features
  • Mid-level features
  • High-level features
  • Center prior
slide65

Conclusion (3/4)

Compared the effect of each subset of features on the saliency map.

slide66

Conclusion (4/4)

Gave an example of the model's application.
