CS564 – Lecture 7. Object Recognition


CS564 – Lecture 7. Object Recognition and Scene Analysis

Reading Assignments:
TMB2: Sections 2.2 and 5.2
“Handout”: Extracts from HBTNN 2e drafts:

  • Shimon Edelman and Nathan Intrator: Visual Processing of Object Structure

  • Guy Wallis and Heinrich Bülthoff: Object Recognition, Neurophysiology

  • Simon Thorpe and Michèle Fabre-Thorpe: Fast Visual Processing

(My thanks to Laurent Itti and Bosco Tjan for permission to use the slides they prepared for lectures on this topic.)

Bottom-Up Segmentation or Top-Down Control?

Object Recognition

  • What is Object Recognition?

    • Segmentation/Figure-Ground Separation: prerequisite or consequence?

    • Labeling an object [The focus of most studies]

    • Extracting a parametric description as well

  • Object Recognition versus Scene Analysis

    • An object may be part of a scene or

    • Itself be recognized as a “scene”

  • What is Object Recognition for?

    • As a context for recognizing something else (locating a house by the tree in the garden)

    • As a target for action (climb that tree)

“What” versus “How” in Humans

(Diagram: the dorsal “How” stream supports reach and grasp programming; the ventral “What” stream supports object identification. Monkey data: Mishkin and Ungerleider on “What” versus “Where”.)

  • AT: Jeannerod et al. Dorsal lesion: inability to preshape the hand while grasping (except for objects with size “in the semantics”).

  • DF: Goodale and Milner. Ventral lesion: inability to verbalize or pantomime size or orientation.

Clinical Studies

  • Studies of patients with visual deficits strongly argue that tight interaction between the where and what/how visual streams is necessary for scene interpretation.

  • Visual agnosia: can see objects, copy drawings of them, etc., but cannot recognize or name them!

  • Dorsal agnosia: cannot recognize objects if more than two are presented simultaneously; a problem with localization.

  • Ventral agnosia: cannot identify objects.

These studies suggest…

  • We bind features of objects into objects (feature binding)

  • We bind objects in space into some arrangement (space binding)

  • We perceive the scene.

  • Feature binding = what/how stream

  • Space binding = where stream

  • Double role of spatial relationships:

    • To relate different portions of an object or scene as a guide to recognition

    • Augmented by other “how” parameters, to guide our behavior with respect to the observed scene.

Inferotemporal Pathways

Later stages of IT (AIT/CIT) connect to the frontal lobe, whereas earlier ones (CIT/PIT) connect to the parietal lobe. This functional distinction may well be important in forming a complete picture of inter-lobe interaction.

Shape perception and scene analysis

  • Shape-selective neurons in cortex

  • Coding: one neuron per object, or population codes?

  • Biologically-inspired algorithms for shape perception

  • The "gist" of a scene: how can we get it in 100 ms or less?

  • Visual memory: how much do we remember of what we have seen?

  • The world as an outside memory and our eyes as a lookup tool
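The coding question above (one neuron per object, or population codes?) can be made concrete with a toy decoder: the object's identity is read out not from any single neuron but from the pattern of activity across many units. This is a minimal sketch, with invented tuning patterns and a nearest-prototype readout; it is not a model of any specific cortical area.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, n_objects = 50, 5
# Each object evokes a characteristic activity pattern across the population.
prototypes = rng.random((n_objects, n_neurons))

def decode(response, prototypes):
    """Nearest-prototype readout: the decoded object is the one whose
    stored population pattern best matches the observed response."""
    dists = np.linalg.norm(prototypes - response, axis=1)
    return int(np.argmin(dists))

# A noisy response to object 3 is still decoded correctly, even though
# no single neuron is diagnostic on its own.
noisy = prototypes[3] + 0.1 * rng.standard_normal(n_neurons)
decoded = decode(noisy, prototypes)
```

The robustness to noise comes from averaging over many units, which is one standard argument for population codes over "grandmother cells."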

Face Cells in Monkey

Object recognition

  • The basic issues

  • Translation and rotation invariance

  • Neural models that do it

  • 3D viewpoint invariance (data and models)

  • Classical computer vision approaches: template matching and matched filters; wavelet transforms; correlation; etc.

  • Examples: face recognition.

  • More examples of biologically-inspired object recognition systems which work remarkably well

Extended Scene Perception

  • Attention-based analysis: Scan scene with attention, accumulate evidence from detailed local analysis at each attended location.

  • Main issues:

  • what is the internal representation?

  • how detailed is memory?

  • do we really have a detailed internal representation at all!!?

  • Gist: Can very quickly (120ms) classify entire scenes or do simple recognition tasks; can only shift attention twice in that much time!

Thorpe: Recognizing Whether a Scene Contains an Animal

Claim: This is so quick that only feedforward processing can be involved

Eye Movements: Beyond Feedforward Processing

  • 1) Examine the scene freely

  • 2) Estimate the material circumstances of the family

  • 3) Give the ages of the people

  • 4) Surmise what the family had been doing before the arrival of the “unexpected visitor”

  • 5) Remember the clothes worn by the people

  • 6) Remember the positions of the people and objects

  • 7) Estimate how long the “unexpected visitor” has been away from the family

The World as an Outside Memory

  • Kevin O’Regan, early 90s:

  • why build a detailed internal representation of the world?

  • too complex…

  • not enough memory…

  • … and useless?

  • The world is the memory. Attention and the eyes are a look-up tool!

The “Attention Hypothesis”

  • Rensink, 2000

  • No “integrative buffer”

  • Early processing extracts information up to “proto-object” complexity in massively parallel manner

  • Attention is necessary to bind the different proto-objects into complete objects, as well as to bind object and location

  • Once attention leaves an object, the binding “dissolves.” Not a problem, it can be formed again whenever needed, by shifting attention back to the object.

  • Only a rather sketchy “virtual representation” is kept in memory, and attention/eye movements are used to gather details as needed

Challenges of Object Recognition

  • The binding problem: binding different features (color, orientation, etc.) to yield a unitary percept. (see next slide)

  • Bottom-up vs. top-down processing: how much is assumed top-down vs. extracted from the image?

  • Perception vs. recognition vs. categorization: seeing an object vs. seeing it as something. Matching views of known objects to memory vs. matching a novel object to object categories in memory.

  • Viewpoint invariance: a major issue is to recognize objects irrespective of the viewpoint from which we see them.

Four stages of representation (Marr, 1982)

  • 1) pixel-based (light intensity)

  • 2) primal sketch (discontinuities in intensity)

  • 3) 2 ½ D sketch (oriented surfaces, relative depth between surfaces)

  • 4) 3D model (shapes, spatial relationships, volumes)

  • TMB2 view: This may work in ideal cases, but in general “cooperative computation” of multiple visual cues and perceptual schemas will be required.

  • problem: computationally intractable!


  • A computer vision system from 1987, developed by Allen Hanson and Edward Riseman on the basis of the HEARSAY system for speech understanding (TMB2 Sec. 4.2) and Arbib’s Schema Theory (TMB2 Sec. 2.2 and Chap. 5).

  • This is schema-based and can be “mapped” onto hypotheses about cooperative computation in the brain.

  • Key idea: Bringing context and scene knowledge into play so that recognition of objects proceeds via islands of reliability to yield a consensus interpretation of the scene.

  • See TMB2 Sec. 5.2 for the figures.

Biederman: Recognition by Components

“geons”: units of 3D geometric structure

JIM 3 (Hummel)

Collection of Fragments (Edelman and Intrator)

Collection of Fragments 2

Viewpoint Invariance

  • Major problem for recognition.

  • Biederman & Gerhardstein, 1994:

  • We can recognize two views of an unfamiliar object as being the same object.

  • Thus, viewpoint invariance cannot rely only on matching views to memory.

Models of Object Recognition

  • See Hummel, 1995, The Handbook of Brain Theory & Neural Networks

  • Direct Template Matching:

  • Processing hierarchy yields activation of view-tuned units.

  • A collection of view-tuned units is associated with one object.

  • View-tuned units are built from V4-like units, using sets of weights which differ for each object.

  • e.g., Poggio & Edelman, 1990; Riesenhuber & Poggio, 1999
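The direct template matching scheme can be sketched as a toy view-based network in the spirit of Poggio & Edelman's RBF approach. Everything concrete here is invented for illustration (the 10-D feature vectors, the Gaussian width, the object names): a view-tuned unit fires maximally for its stored view, and an object unit pools over its view-tuned units.

```python
import numpy as np

rng = np.random.default_rng(1)

def view_unit(features, center, sigma=1.0):
    """A view-tuned unit: Gaussian RBF activation, maximal when the
    input feature vector matches the stored view exactly."""
    return np.exp(-np.sum((features - center) ** 2) / (2 * sigma**2))

# Two hypothetical objects, each stored as three views
# (random 10-D feature vectors standing in for V4-like features).
objects = {name: rng.standard_normal((3, 10)) for name in ("cup", "shoe")}

def recognize(features):
    """Object unit = max over its view-tuned units; report the winner."""
    scores = {name: max(view_unit(features, v) for v in views)
              for name, views in objects.items()}
    return max(scores, key=scores.get)

# An input close to a stored view of "cup" activates that view unit
# strongly, so the "cup" object unit wins.
probe = objects["cup"][1] + 0.05 * rng.standard_normal(10)
label = recognize(probe)
```

Pooling with a max over view units is what gives a limited form of viewpoint tolerance: any sufficiently similar stored view suffices.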

Computational Model of Object Recognition(Riesenhuber and Poggio, 1999)

  • The model neurons are tuned for the size and 3D orientation of the object.

Models of Object Recognition

  • Hierarchical Template Matching:

  • Image passed through layers of units with progressively more complex features at progressively less specific locations.

  • Hierarchical in that features at one stage are built from features at earlier stages.

  • e.g., Fukushima & Miyake (1982)’s Neocognitron:

  • Several processing layers, comprising simple (S) and complex (C) cells.

  • S-cells in one layer respond to conjunctions of C-cells in the previous layer.

  • C-cells in one layer are excited by small neighborhoods of S-cells.
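The S/C alternation can be sketched with a single stage in numpy. The filter, image, threshold, and pooling size below are invented for illustration: an S-layer correlates the image with a feature template, and a C-layer max-pools over small neighborhoods, so the feature is still detected when its position shifts slightly.

```python
import numpy as np

def s_layer(image, template):
    """Simple cells: valid cross-correlation of the image with a
    feature template, followed by a rectifying threshold."""
    th, tw = template.shape
    H = image.shape[0] - th + 1
    W = image.shape[1] - tw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(image[i:i+th, j:j+tw] * template)
    return np.maximum(out - 1.5, 0.0)  # threshold (chosen for this toy)

def c_layer(smap, pool=2):
    """Complex cells: max-pool over small neighborhoods, giving
    tolerance to the exact feature position."""
    H, W = smap.shape[0] // pool, smap.shape[1] // pool
    return np.array([[smap[i*pool:(i+1)*pool, j*pool:(j+1)*pool].max()
                      for j in range(W)] for i in range(H)])

# A 2x2 diagonal feature embedded in an otherwise blank image.
image = np.zeros((6, 6))
image[1, 1] = image[2, 2] = 1.0
template = np.eye(2)              # S-cells detect a 2x2 diagonal
c = c_layer(s_layer(image, template))
```

Stacking such S/C pairs yields progressively more complex features at progressively less specific locations, which is the core of the Neocognitron design.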

Models of Object Recognition

  • Transform & Match:

  • First take care of rotation, translation, scale, etc. invariances.

  • Then recognize based on standardized pixel representation of objects.

  • e.g., Olshausen et al., 1993: dynamic routing model.

  • Template match: e.g., with an associative memory based on a Hopfield network.
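The associative-memory step can be sketched as a tiny Hopfield network (pattern count, network size, and corruption level are invented for illustration): ±1 patterns are stored with Hebbian outer-product weights, and a corrupted probe is cleaned up by iterating the sign-of-weighted-input update.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 64
patterns = np.sign(rng.standard_normal((2, N)))  # two stored +-1 patterns

# Hebbian outer-product weights, zero diagonal.
W = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(W, 0.0)

def recall(probe, steps=5):
    """Synchronous Hopfield updates: each unit takes the sign of its
    weighted input; a stored pattern is a fixed point of this map."""
    s = probe.copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1  # break ties toward +1
    return s

# Corrupt pattern 0 by flipping 8 of its 64 bits, then let the
# dynamics settle; the state typically returns to the stored pattern.
probe = patterns[0].copy()
flip = rng.choice(N, size=8, replace=False)
probe[flip] *= -1
restored = recall(probe)
```

In the transform-and-match scheme, such a memory would operate on the standardized (routed) representation, so the stored patterns need not cover all positions and scales.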

Recognition by Components

  • Structural approach to object recognition:

  • Biederman, 1987:

  • Complex objects are composed of simpler pieces.

  • We can recognize a novel/unfamiliar object by parsing it in terms of its component pieces, then comparing the assemblage of pieces to those of known objects.

Recognition by components (Biederman, 1987)

  • GEONS: geometric elements of which all objects are composed (cylinders, cones, etc). On the order of 30 different shapes.

  • Skips 2 ½ D sketch: Geons are directly recognized from edges, based on their nonaccidental properties (i.e., 3D features that are usually preserved by the projective imaging process).

Basic Properties of GEONs

  • They are sufficiently different from each other to be easily discriminated

  • They are view-invariant (look identical from most viewpoints)

  • They are robust to noise (can be identified even with parts of image missing)

Support for RBC: We can recognize partially occluded objects easily if the occlusions do not obscure the set of geons which constitute the object.
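The occlusion argument can be illustrated with a toy structural matcher. The geon labels, relations, and object models below are invented for illustration: each object is stored as a set of (part, part, relation) triples, occlusion simply removes some triples from the visible set, and the best match is the stored model whose triples are most represented among the visible ones.

```python
# Toy structural descriptions: (part, part, spatial relation) triples.
# These models are hypothetical, not Biederman's actual geon inventory.
models = {
    "mug":  {("cylinder", "handle", "side-attached")},
    "lamp": {("cylinder", "cone", "above"),
             ("cylinder", "base", "below")},
}

def recognize(visible):
    """Score each model by the fraction of its triples that are visible;
    occlusion removes triples but rarely creates spurious ones."""
    def score(model):
        return len(model & visible) / len(model)
    return max(models, key=lambda name: score(models[name]))

# An occluded lamp: the base relation is hidden, but the visible
# cylinder-above-cone triple still picks out the right model.
label = recognize({("cylinder", "cone", "above")})
```

The hard part in practice is the step this sketch takes for granted: extracting geons and their relations reliably from real images, as the next slide notes.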

Potential difficulties

  • Structural description not enough; metric information is also needed

  • Difficult to extract geons from real images

  • Ambiguity in the structural description: most often we have several candidates

  • For some objects, deriving a structural representation can be difficult

Edelman, 1997

Geon Neurons in IT?

  • These are preferred stimuli for some IT neurons.

Fusiform Face Area in Humans

Standard View on Visual Processing

(Diagram: early/primary visual processing feeds multiple memory/decision sites, e.g. for faces and common objects; Kanwisher et al.; Ishai et al.; Tjan, 1999.)

  • Image-specific representations: support fine discrimination; noise tolerant

  • Image-invariant representations: support generalization; noise sensitive

Tjan’s “Recognition by Anarchy”

(Diagram: primary visual processing feeds several recognition sites in parallel; the system’s decision is the first arriving response.)

A toy visual system

Task: identify letters from arbitrary positions and orientations.

(Diagram: three recognition sites operate in parallel: Site 1 = raw image, Site 2 = normalized position, Site 3 = normalized orientation.)

Study stimuli: 5 orientations × 20 positions at high SNR.

Test stimuli: 1) familiar (studied) views, 2) new positions, 3) new positions and orientations.

Signal-to-Noise Ratio {RMS Contrast}: 1800 {30%}, 1500 {25%}, 800 {20%}, 450 {15%}, 210 {10%}

(Diagram: Site 1 = raw image, Site 2 = normalized position, Site 3 = normalized orientation.)

Processing speed for each recognition module depends on the recognition difficulty faced by that module.

(Results plot: proportion correct vs. contrast (%) at each site, for familiar views, novel positions, and novel positions & orientations.)


Black curve: full model, in which recognition is based on the fastest of the responses from the three stages.
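The "fastest response wins" arbitration can be sketched as a simple race. The latencies below are invented, not Tjan's fitted values: each site reports after a delay that grows with its difficulty on the current stimulus, and the system's answer is whichever report arrives first, with no central controller choosing a strategy.

```python
# Toy "recognition by anarchy": three sites race; the first finisher wins.
def run_race(sites, condition):
    """Each site reports a latency for the given condition; the system
    adopts the report of the site with the smallest latency."""
    reports = [(latency[condition], name)
               for name, latency in sites.items()]
    t, winner = min(reports)
    return winner, t

# Hypothetical latencies (ms): a site is slow when its representation
# is poorly suited to the stimulus (e.g., raw matching of novel views).
sites = {
    "site1_raw":      {"familiar": 120, "novel_position": 400},
    "site2_norm_pos": {"familiar": 180, "novel_position": 190},
    "site3_norm_ori": {"familiar": 250, "novel_position": 260},
}

# Familiar views: the raw-image site answers first.
w1, _ = run_race(sites, "familiar")
# Novel positions: raw matching is slow, so the normalizing site wins.
w2, _ = run_race(sites, "novel_position")
```

This reproduces the qualitative shape of the black curve: the full model tracks whichever single site is best in each condition, because that site finishes first.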

Experimental techniques in visual neuroscience

  • Recording from neurons: electrophysiology

  • Multi-unit recording using electrode arrays

  • Stimulating while recording

  • Anesthetized vs. awake animals

  • Single-neuron recording in awake humans

  • Probing the limits of vision: visual psychophysics

  • Functional neuroimaging: Techniques

  • Experimental design issues

  • Optical imaging

  • Transcranial magnetic stimulation
