CS564 – Lecture 7. Object Recognition

CS564 – Lecture 7. Object Recognition and Scene Analysis

Reading Assignments:

  • TMB2: Sections 2.2 and 5.2

  • “Handout”: Extracts from HBTNN 2e Drafts:

    • Shimon Edelman and Nathan Intrator: Visual Processing of Object Structure

    • Guy Wallis and Heinrich Bülthoff: Object Recognition, Neurophysiology

    • Simon Thorpe and Michèle Fabre-Thorpe: Fast Visual Processing

(My thanks to Laurent Itti and Bosco Tjan for permission to use the slides they prepared for lectures on this topic.)



Bottom-Up Segmentation or Top-Down Control?



Object Recognition

  • What is Object Recognition?

    • Segmentation/Figure-Ground Separation: prerequisite or consequence?

    • Labeling an object [The focus of most studies]

    • Extracting a parametric description as well

  • Object Recognition versus Scene Analysis

    • An object may be part of a scene or

    • Itself be recognized as a “scene”

  • What is Object Recognition for?

    • As a context for recognizing something else (locating a house by the tree in the garden)

    • As a target for action (climb that tree)


“What” versus “How” in Humans

[Figure: Two streams leave visual cortex. The dorsal “how” stream runs through parietal cortex and serves reach programming and grasp programming; the ventral “what” stream runs through inferotemporal cortex. Monkey data: Mishkin and Ungerleider on “what” versus “where”. Patient AT (Jeannerod et al.): a parietal lesion brings an inability to preshape the hand (except for objects with size “in the semantics”). Patient DF (Goodale and Milner): a ventral lesion brings an inability to verbalize or pantomime size or orientation.]


Clinical Studies

  • Studies of patients with visual deficits strongly argue that tight interaction between the where and what/how visual streams is necessary for scene interpretation.

  • Visual agnosia: can see objects, copy drawings of them, etc., but cannot recognize or name them!

  • Dorsal agnosia: cannot recognize objects if more than two are presented simultaneously: a problem with localization.

  • Ventral agnosia: cannot identify objects.



These studies suggest…

  • We bind features of objects into objects (feature binding)

  • We bind objects in space into some arrangement (space binding)

  • We perceive the scene.

  • Feature binding = what/how stream

  • Space binding = where stream

  • Double role of spatial relationships:

    • To relate different portions of an object or scene as a guide to recognition

    • Augmented by other “how” parameters, to guide our behavior with respect to the observed scene.



Inferotemporal Pathways

Later stages of IT (AIT/CIT) connect to the frontal lobe, whereas earlier ones (CIT/PIT) connect to the parietal lobe. This functional distinction may well be important in forming a complete picture of inter-lobe interaction.



Shape perception and scene analysis

  • Shape-selective neurons in cortex

  • Coding: one neuron per object, or population codes?

  • Biologically-inspired algorithms for shape perception

  • The “gist” of a scene: how can we get it in 100 ms or less?

  • Visual memory: how much do we remember of what we have seen?

  • The world as an outside memory and our eyes as a lookup tool
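The contrast between one-neuron-per-object coding and a population code can be made concrete with a toy decoder: no single unit identifies the stimulus, but the whole activity pattern does. A minimal sketch, with invented tuning widths, unit count, and readout (not taken from the lecture):

```python
import numpy as np

# A population of units, each broadly tuned to a preferred orientation.
# A stimulus activates many units at once; it is recovered from the whole
# activity pattern (population code), not from any one "grandmother" unit.

prefs = np.linspace(0, np.pi, 16, endpoint=False)   # preferred orientations

def responses(theta, width=0.5):
    """Tuning curves: Gaussian on the wrapped orientation difference."""
    d = np.angle(np.exp(2j * (theta - prefs))) / 2   # orientation is mod pi
    return np.exp(-d**2 / (2 * width**2))

def decode(r):
    """Population-vector readout on the doubled angle."""
    z = np.sum(r * np.exp(2j * prefs))
    return (np.angle(z) / 2) % np.pi

theta = 1.2
print(decode(responses(theta)))   # close to 1.2
```

Deleting any single unit barely changes the decoded value, which is one argument for population codes over dedicated object neurons.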



Face Cells in Monkey



Object recognition

  • The basic issues

  • Translation and rotation invariance

  • Neural models that do it

  • 3D viewpoint invariance (data and models)

  • Classical computer vision approaches: template matching and matched filters; wavelet transforms; correlation; etc.

  • Examples: face recognition.

  • More examples of biologically-inspired object recognition systems which work remarkably well



Extended Scene Perception

  • Attention-based analysis: Scan scene with attention, accumulate evidence from detailed local analysis at each attended location.

  • Main issues:

  • what is the internal representation?

  • how detailed is memory?

  • do we really have a detailed internal representation at all!!?

  • Gist: Can very quickly (120ms) classify entire scenes or do simple recognition tasks; can only shift attention twice in that much time!



Thorpe: Recognizing Whether a Scene Contains an Animal

Claim: This is so quick that only feedforward processing can be involved



Eye Movements: Beyond Feedforward Processing

  • 1) Examine scene freely

  • 2) Estimate material circumstances of family

  • 3) Give ages of the people

  • 4) Surmise what family has been doing before arrival of “unexpected visitor”

  • 5) Remember clothes worn by the people

  • 6) Remember position of people and objects

  • 7) Estimate how long the “unexpected visitor” has been away from family



The World as an Outside Memory

  • Kevin O’Regan, early 90s:

  • why build a detailed internal representation of the world?

  • too complex…

  • not enough memory…

  • … and useless?

  • The world is the memory. Attention and the eyes are a look-up tool!



The “Attention Hypothesis”

  • Rensink, 2000

  • No “integrative buffer”

  • Early processing extracts information up to “proto-object” complexity in massively parallel manner

  • Attention is necessary to bind the different proto-objects into complete objects, as well as to bind object and location

  • Once attention leaves an object, the binding “dissolves.” Not a problem, it can be formed again whenever needed, by shifting attention back to the object.

  • Only a rather sketchy “virtual representation” is kept in memory, and attention/eye movements are used to gather details as needed



Challenges of Object Recognition

  • The binding problem: binding different features (color, orientation, etc) to yield a unitary percept. (see next slide)

  • Bottom-up vs. top-down processing: how much is assumed top-down vs. extracted from the image?

  • Perception vs. recognition vs. categorization: seeing an object vs. seeing it as something. Matching views of known objects to memory vs. matching a novel object to object categories in memory.

  • Viewpoint invariance: a major issue is to recognize objects irrespective of the viewpoint from which we see them.



Four stages of representation (Marr, 1982)

  • 1) pixel-based (light intensity)

  • 2) primal sketch (discontinuities in intensity)

  • 3) 2 ½ D sketch (oriented surfaces, relative depth between surfaces)

  • 4) 3D model (shapes, spatial relationships, volumes)

  • TMB2 view: This may work in ideal cases, but in general “cooperative computation” of multiple visual cues and perceptual schemas will be required.

  • problem: computationally intractable!
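The step from stage 1 to stage 2 is the easiest to make concrete: mark intensity discontinuities by thresholding the gradient magnitude of the pixel array. A minimal sketch of that idea only (the test image and threshold are invented; the 2½D sketch and 3D model stages are not attempted):

```python
import numpy as np

# Marr's stages 1 -> 2 in miniature: from raw intensities (stage 1) to a
# crude "primal sketch" (stage 2) that marks intensity discontinuities.

def primal_sketch(image, threshold=0.25):
    gy, gx = np.gradient(image.astype(float))   # finite-difference gradients
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold                # boolean edge map

# A dark square on a bright background: edges appear only at its border.
img = np.ones((8, 8))
img[2:6, 2:6] = 0.0
edges = primal_sketch(img)
print(edges.astype(int))
```

The interior of the square and of the background are uniform, so only the boundary survives — exactly the discontinuity structure the primal sketch is meant to capture.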



VISIONS

  • A computer vision system from 1987, developed by Allen Hanson and Edward Riseman on the basis of the HEARSAY system for speech understanding (TMB2 Sec. 4.2) and Arbib’s Schema Theory (TMB2 Sec. 2.2 and Chap. 5).

  • It is schema-based and can be “mapped” onto hypotheses about cooperative computation in the brain.

  • Key idea: bringing context and scene knowledge into play so that recognition of objects proceeds via islands of reliability to yield a consensus interpretation of the scene.

  • See TMB2 Sec. 5.2 for the figures.
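The “islands of reliability” idea can be caricatured in a few lines: confidently recognized objects lend support to the hypotheses they contextually predict, until an interpretation settles. This is only a toy sketch of the cooperative-computation style, not the VISIONS system itself; the objects, relations, and numbers are invented:

```python
# Toy blackboard: schema instances post hypotheses with confidences, and
# high-confidence "islands of reliability" raise the confidence of the
# hypotheses they support. The scene knowledge (sky supports roof, etc.)
# is an invented example.

supports = {              # confident A lends contextual support to B
    "sky": ["roof"],
    "roof": ["house"],
    "grass": ["tree"],
}

confidence = {"sky": 0.9, "roof": 0.4, "house": 0.2,
              "grass": 0.8, "tree": 0.5, "shutter": 0.3}

def settle(confidence, supports, rounds=5, boost=0.3, threshold=0.7):
    c = dict(confidence)
    for _ in range(rounds):
        for src, targets in supports.items():
            if c[src] >= threshold:            # an island of reliability
                for t in targets:
                    c[t] = min(1.0, c[t] + boost * c[src])
    return c

final = settle(confidence, supports)
accepted = sorted(k for k, v in final.items() if v >= 0.7)
print(accepted)
```

Note how “house” starts weak but is pulled above threshold by the roof hypothesis, while the unsupported “shutter” hypothesis never is — a consensus interpretation grown outward from reliable islands.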



Biederman: Recognition by Components

“geons”: units of 3D geometric structure



JIM 3 (Hummel)



Collection of Fragments (Edelman and Intrator)



Collection of Fragments 2



Viewpoint Invariance

  • Major problem for recognition.

  • Biederman & Gerhardstein, 1994:

  • We can recognize two views of an unfamiliar object as being the same object.

  • Thus, viewpoint invariance cannot rely solely on matching stored views to memory.



Models of Object Recognition

  • See Hummel, 1995, The Handbook of Brain Theory & Neural Networks

  • Direct Template Matching:

    • Processing hierarchy yields activation of view-tuned units.

    • A collection of view-tuned units is associated with one object.

    • View-tuned units are built from V4-like units, using sets of weights which differ for each object.

    • e.g., Poggio & Edelman, 1990; Riesenhuber & Poggio, 1999
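A minimal sketch in the spirit of Poggio & Edelman (1990): each stored view is the center of a Gaussian view-tuned unit, and an object unit pools (here, max) over its view-tuned units, so intermediate views still activate the right object. The 2D “feature vectors” standing in for V4-like inputs are invented:

```python
import numpy as np

def view_tuned(x, center, sigma=1.0):
    """Gaussian RBF unit tuned to one stored view (its center)."""
    return np.exp(-np.sum((x - center) ** 2) / (2 * sigma ** 2))

# Two objects, each known from a handful of stored views (feature vectors).
views = {
    "mug": [np.array([0.0, 0.0]), np.array([1.0, 0.0])],
    "cup": [np.array([4.0, 4.0]), np.array([5.0, 4.0])],
}

def recognize(x):
    # Object unit = max over that object's view-tuned units.
    scores = {obj: max(view_tuned(x, v) for v in vs)
              for obj, vs in views.items()}
    return max(scores, key=scores.get)

print(recognize(np.array([0.5, 0.1])))   # an intermediate view -> mug
```

The max-pooling over view-tuned units is what buys a degree of viewpoint tolerance without an explicit 3D model.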



Computational Model of Object Recognition(Riesenhuber and Poggio, 1999)


The model neurons are tuned for the size and 3D orientation of the object.



Models of Object Recognition

  • Hierarchical Template Matching:

    • Image passed through layers of units with progressively more complex features at progressively less specific locations.

    • Hierarchical in that features at one stage are built from features at earlier stages.

    • e.g., Fukushima & Miyake (1982)’s Neocognitron: several processing layers, comprising simple (S) and complex (C) cells. S-cells in one layer respond to conjunctions of C-cells in the previous layer; C-cells in one layer are excited by small neighborhoods of S-cells.
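One S/C pair can be sketched directly: the S-layer correlates the image with a feature template and rectifies, and the C-layer max-pools so the feature is detected anywhere in a neighborhood. A toy sketch in the Neocognitron spirit (the edge template and test image are invented):

```python
import numpy as np

def s_layer(image, template):
    """S-cells: rectified correlation of the image with a feature template."""
    h, w = template.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * template)
    return np.maximum(out, 0)

def c_layer(s, pool=2):
    """C-cells: max over small neighborhoods of S-cells (shift tolerance)."""
    H, W = s.shape
    return np.array([[s[i:i+pool, j:j+pool].max()
                      for j in range(0, W - pool + 1, pool)]
                     for i in range(0, H - pool + 1, pool)])

edge = np.array([[1.0, -1.0], [1.0, -1.0]])   # vertical-edge template
img = np.zeros((6, 6)); img[:, :3] = 1.0       # bright left half
c = c_layer(s_layer(img, edge))
print(c)   # responds where the edge is, tolerant to small shifts
```

Stacking such S/C pairs gives exactly the trade the slide describes: more complex features, less positional precision.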



Models of Object Recognition

  • Transform & Match:

    • First take care of rotation, translation, scale, etc. invariances; then recognize based on a standardized pixel representation of objects.

    • e.g., Olshausen et al., 1993: dynamic routing model.

    • Template match: e.g., with an associative memory based on a Hopfield network.
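The template-match step can be illustrated with a textbook Hopfield associative memory: store binary patterns with the Hebbian outer-product rule, then let a corrupted (but already transform-normalized) probe settle to the nearest stored pattern. A minimal sketch with invented patterns:

```python
import numpy as np

def train(patterns):
    """Hebbian outer-product rule over +1/-1 patterns."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0)          # no self-connections
    return W / n

def recall(W, x, steps=10):
    """Synchronous updates until the state settles."""
    x = x.copy()
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]])
W = train(patterns)
probe = np.array([1, 1, 1, -1, -1, -1, -1, -1])  # pattern 0, one bit flipped
print(np.array_equal(recall(W, probe), patterns[0]))   # True
```

The network completes the corrupted probe to the stored template — content-addressable matching, which is why it pairs naturally with a normalizing front end.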



Recognition by Components

  • Structural approach to object recognition:

  • Biederman, 1987:

  • Complex objects are composed of simpler pieces

  • We can recognize a novel/unfamiliar object by parsing it in terms of its component pieces, then comparing the assemblage of pieces to those of known objects.



Recognition by components (Biederman, 1987)

  • GEONS: geometric elements of which all objects are composed (cylinders, cones, etc). On the order of 30 different shapes.

  • Skips 2 ½ D sketch: Geons are directly recognized from edges, based on their nonaccidental properties (i.e., 3D features that are usually preserved by the projective imaging process).
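A caricature of structural matching in the RBC spirit: describe each object as a set of (geon, geon, relation) triples and score a parsed view by its overlap with stored descriptions. The geon labels and relations below are invented, and extracting geons from real images — the hard part — is skipped entirely:

```python
# Toy structural descriptions: which geons an object contains and how
# they are related. All entries are invented examples.
memory = {
    "mug":  {("cylinder", "handle", "side-attached")},
    "pail": {("cylinder", "handle", "top-attached")},
    "lamp": {("cylinder", "cone", "above")},
}

def recognize(parsed):
    # Score = number of shared (geon, geon, relation) triples.
    scores = {name: len(parsed & desc) for name, desc in memory.items()}
    return max(scores, key=scores.get)

print(recognize({("cylinder", "handle", "side-attached")}))   # -> mug
```

Because the description is relational rather than view-based, a novel viewpoint that preserves the parse yields the same match — the viewpoint-invariance claim of RBC in miniature.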



Basic Properties of GEONs

  • They are sufficiently different from each other to be easily discriminated

  • They are view-invariant (look identical from most viewpoints)

  • They are robust to noise (can be identified even with parts of image missing)


Support for RBC: we can easily recognize partially occluded objects if the occlusions do not obscure the set of geons which constitute the object.



Potential difficulties

  • Structural description not enough; also need metric info

  • Difficult to extract geons from real images

  • Ambiguity in the structural description: most often we have several candidates

  • For some objects, deriving a structural representation can be difficult

Edelman, 1997



Geon Neurons in IT?

  • These are preferred stimuli for some IT neurons.



Fusiform Face Area in Humans


Standard View on Visual Processing

[Figure: As visual processing proceeds, the representation changes character (Tjan, 1999):]

  • Early representation: image specific; supports fine discrimination; noise tolerant.

  • Late representation: image invariant; supports generalization; noise sensitive.


[Figure: Early/primary visual processing feeds multiple memory/decision sites — face, place, common objects, and possibly others (“?”) — rather than a single recognition site (e.g., Kanwisher et al.; Ishai et al.) (Tjan, 1999).]


Tjan’s “Recognition by Anarchy”

[Figure: Primary visual processing feeds several sensory memory sites in parallel; each site makes an independent decision (“R1” … “Ri” … “Rn”) with its own delay (t1 … ti … tn), and the homunculus’ response is simply the first arriving response.]
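The race itself is trivial to state in code: whichever site answers first determines the response. A toy sketch (the site delays and answers are invented for illustration):

```python
# Sketch of the "anarchy" race: several memory/decision sites respond
# with different delays; the homunculus takes the first arrival.

def race(sites):
    """Each site is a (delay, response) pair; the earliest response wins."""
    delay, response = min(sites, key=lambda s: s[0])
    return response

# A familiar view: the fast raw-image site answers first.
familiar = [(120, "A"),   # site 1: raw image, fast on studied views
            (180, "A"),   # site 2: position-normalized
            (240, "A")]   # site 3: position- and orientation-normalized

# A novel view: only the slow, invariant site can answer at all.
novel = [(float("inf"), None), (float("inf"), None), (260, "A")]

print(race(familiar), race(novel))
```

The appeal of the scheme is that no central arbiter is needed: speed itself selects whichever representation is adequate for the current stimulus.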


A Toy Visual System

Task: identify letters (e.g., an “e”) presented at arbitrary positions and orientations.


[Figure: Pipeline — image → down-sampling → normalize position → normalize orientation → memory.]


[Figure: The same pipeline with a memory site at each stage — Site 1 on the down-sampled raw image, Site 2 after position normalization, Site 3 after orientation normalization.]
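A cut-down version of the toy system can be run end to end: normalize position by shifting the pattern’s centroid to the image center, down-sample by block averaging, and match against stored templates (roughly Site 2 of the model). Orientation normalization (Site 3) is omitted for brevity, and the letter patterns are invented:

```python
import numpy as np

def normalize_position(img):
    """Shift the pattern so its centroid sits at the image center."""
    ys, xs = np.nonzero(img)
    h, w = img.shape
    dy = h // 2 - int(round(ys.mean()))
    dx = w // 2 - int(round(xs.mean()))
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def downsample(img, k=2):
    """Block-average the image by a factor of k."""
    h, w = img.shape
    return img[:h//k*k, :w//k*k].reshape(h//k, k, w//k, k).mean(axis=(1, 3))

def recognize(img, templates):
    """Nearest stored template after normalization and down-sampling."""
    probe = downsample(normalize_position(img))
    dists = {name: np.abs(probe - t).sum() for name, t in templates.items()}
    return min(dists, key=dists.get)

# Store an "L" and a "T", then present a shifted "L".
L = np.zeros((8, 8)); L[2:6, 2] = 1; L[5, 2:5] = 1
T = np.zeros((8, 8)); T[2, 2:6] = 1; T[2:6, 3] = 1
templates = {"L": downsample(normalize_position(L)),
             "T": downsample(normalize_position(T))}
shifted = np.roll(L, (1, 2), axis=(0, 1))
print(recognize(shifted, templates))   # -> L
```

Because the probe is re-centered before matching, a translated letter lands on the same template — position invariance by normalization rather than by invariant features. The same probe would fail on a rotated letter, which is exactly why the full model adds an orientation-normalizing stage.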


Study stimuli: 5 orientations × 20 positions at high SNR.

Test stimuli: 1) familiar (studied) views, 2) new positions, 3) new positions & orientations.

Signal-to-Noise Ratio {RMS Contrast}: 1800 {30%}, 1500 {25%}, 800 {20%}, 450 {15%}, 210 {10%}.


Processing speed for each recognition module (Site 1: raw image, Site 2: normalized position, Site 3: normalized orientation) depends on the recognition difficulty faced by that module.

[Figure: Proportion correct as a function of contrast (%) for Sites 1–3 (raw image, normalized position, normalized orientation), plotted for familiar views, novel positions, and novel positions & orientations.]


[Figure: Proportion-correct versus contrast (%) curves for the three sites and the three stimulus conditions.] Black curve: the full model, in which recognition is based on the fastest of the responses from the three stages.


Experimental Techniques in Visual Neuroscience

  • Recording from neurons: electrophysiology

  • Multi-unit recording using electrode arrays

  • Stimulating while recording

  • Anesthetized vs. awake animals

  • Single-neuron recording in awake humans

  • Probing the limits of vision: visual psychophysics

  • Functional neuroimaging: Techniques

  • Experimental design issues

  • Optical imaging

  • Transcranial magnetic stimulation

