Beyond attributes describing images
Beyond Attributes -> Describing Images. Tamara L. Berg UNC Chapel Hill. Descriptive Text.

Beyond attributes describing images

Beyond Attributes -> Describing Images

Tamara L. Berg

UNC Chapel Hill


Descriptive text

Descriptive Text

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns”

Scarlett O’Hara described in Gone with the Wind.

Berg, Attributes Tutorial CVPR13


More nuance than traditional recognition

More Nuance than Traditional Recognition…

person

car

shoe

Berg, Attributes Tutorial CVPR13


Toward complex structured outputs

Toward Complex Structured Outputs

car

Berg, Attributes Tutorial CVPR13


Toward complex structured outputs1

Toward Complex Structured Outputs

pink car

Attributes of objects

Berg, Attributes Tutorial CVPR13


Toward complex structured outputs2

Toward Complex Structured Outputs

car on road

Relationships between objects

Berg, Attributes Tutorial CVPR13


Toward complex structured outputs3

Toward Complex Structured Outputs

Little pink smart car parked on the side of a road in a London shopping district.

… Complex structured recognition outputs

Telling the “story of an image”

Berg, Attributes Tutorial CVPR13


Learning from descriptive text

Learning from Descriptive Text

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns”

Scarlett O’Hara described in Gone with the Wind.

Visually descriptive language provides:

  • Information about the world, especially the visual world. (How does the world work?)

  • Information about how people construct natural language for imagery. (How do people describe the world?)

  • Guidance for visual recognition. (What should we recognize?)

Berg, Attributes Tutorial CVPR13


Methodology

Methodology

A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge.

Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture.

Pink Car

Sign

Door

Motorcycle

Tree

Brick building

Dirty Road

Sidewalk

London

Shopping district

Natural language description

Generation Methods:

Compose descriptions directly from recognized content

Retrieve relevant existing text given recognized content

Berg, Attributes Tutorial CVPR13


Related work

Related Work

  • Compose descriptions given recognized content

    Yao et al. (2010), Yang et al. (2011), Li et al. (2011), Kulkarni et al. (2011)

  • Generation as retrieval

    Farhadi et al. (2010), Ordonez et al. (2011), Gupta et al. (2012), Kuznetsova et al. (2012)

  • Generation using pre-associated relevant text

    Leong et al. (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a)

  • Other (image annotation, video description, etc.)

    Barnard et al. (2003), Pastra et al. (2003), Gupta et al. (2008), Gupta et al. (2009), Feng and Lapata (2010b), del Pero et al. (2011), Krishnamoorthy et al. (2012), Barbu et al. (2012), Das et al. (2013)

Berg, Attributes Tutorial CVPR13


Method 1 recognize generate

Method 1: Recognize & Generate

Berg, Attributes Tutorial CVPR13


Baby talk understanding and generating simple image descriptions

Baby Talk: Understanding and Generating Simple Image Descriptions

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg

CVPR 2011


Beyond attributes describing images

“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”

Kulkarni et al, CVPR11


Methodology1

Methodology

  • Vision -- detection and classification

  • Text inputs - statistics from parsing lots of descriptive text

  • Graphical model (CRF) to predict best image labeling given vision and text inputs

  • Generation algorithms to generate natural language

Kulkarni et al, CVPR11


Vision is hard

Vision is hard!

World knowledge (from descriptive text) can be used to smooth noisy vision predictions!

Green sheep

Kulkarni et al, CVPR11
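
The smoothing idea can be pictured with a toy re-scoring of attribute predictions. The sketch below is not the CRF from Kulkarni et al.; it simply multiplies a vision score by a text-derived prior for each (attribute, object) pair and renormalizes. The function name `smooth_attributes` and every number are hypothetical, chosen only to mirror the "green sheep" example.

```python
def smooth_attributes(vision_scores, text_prior, obj):
    """Combine noisy vision scores with text co-occurrence priors for one object."""
    combined = {
        attr: score * text_prior.get((attr, obj), 0.0)
        for attr, score in vision_scores.items()
    }
    total = sum(combined.values())
    if total == 0:
        return dict(vision_scores)  # no text evidence: fall back to vision alone
    return {attr: value / total for attr, value in combined.items()}


# Hypothetical classifier outputs for a detected sheep (vision alone prefers "green").
vision = {"green": 0.5, "white": 0.4, "furry": 0.1}
# Hypothetical priors, standing in for statistics parsed from descriptive text.
prior = {("green", "sheep"): 0.01, ("white", "sheep"): 0.6, ("furry", "sheep"): 0.39}

smoothed = smooth_attributes(vision, prior, "sheep")
best = max(smoothed, key=smoothed.get)
print(best)
```

With these made-up numbers the text prior outweighs the mistaken "green" prediction. A real system would learn how much to trust each source rather than taking a plain product.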


Methodology2

Methodology

  • Vision -- detection and classification

  • Text -- statistics from parsing lots of descriptive text

  • Graphical model (CRF) to predict best image labeling given vision and text inputs

  • Generation algorithms to generate natural language

Kulkarni et al, CVPR11


Learning from descriptive text1

Learning from Descriptive Text

Attributes

a very shiny car in the car museum in my hometown of upstate NY.

green green grass by the lake

Relationships

very little person in a big rocking chair

Our cat Tusik sleeping on the sofa near a hot radiator.

Kulkarni et al, CVPR11


Methodology3

Methodology

  • Vision -- detection and classification

  • Text -- statistics from parsing lots of descriptive text

  • Model (CRF) to predict best image labeling given vision and text based potentials

  • Generation algorithms to compose natural language

Kulkarni et al, CVPR11


System flow

System Flow

Pipeline:

  • Input Image

  • Extract Objects/stuff

  • Predict attributes

  • Predict prepositions

  • Predict labeling – vision potentials smoothed with text potentials

  • Generate natural language description

Extracted objects/stuff: a) dog, b) person, c) sofa

Predicted attributes:

  • a) dog: brown 0.01, striped 0.16, furry 0.26, wooden 0.2, feathered 0.06, ...

  • b) person: brown 0.32, striped 0.09, furry 0.04, wooden 0.2, feathered 0.04, ...

  • c) sofa: brown 0.94, striped 0.10, furry 0.06, wooden 0.8, feathered 0.08, ...

Predicted prepositions:

  • a, b: near(a,b) 1, near(b,a) 1, against(a,b) 0.11, against(b,a) 0.04, beside(a,b) 0.24, beside(b,a) 0.17, ...

  • a, c: near(a,c) 1, near(c,a) 1, against(a,c) 0.3, against(c,a) 0.05, beside(a,c) 0.5, beside(c,a) 0.45, ...

  • b, c: near(b,c) 1, near(c,b) 1, against(b,c) 0.67, against(c,b) 0.33, beside(b,c) 0.0, beside(c,b) 0.19, ...

Predicted labeling:

<<null,person_b>,against,<brown,sofa_c>>

<<null,dog_a>,near,<null,person_b>>

<<null,dog_a>,beside,<brown,sofa_c>>

Generated description:

This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.

Kulkarni et al, CVPR11
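
The last step of this flow, turning labeling triples of the form <<attribute, object>, preposition, <attribute, object>> into text, can be sketched with a simple template. This is an illustrative assumption about how such a generator might look, not the generation algorithm from the paper; the helper names `noun_phrase` and `describe` are hypothetical, and the object names drop the `_a`/`_b`/`_c` suffixes for brevity.

```python
def noun_phrase(attr, obj):
    """Render an <attribute, object> pair, skipping the 'null' attribute."""
    return obj if attr in (None, "null") else f"{attr} {obj}"


def describe(triples):
    """Turn <<attr, obj>, preposition, <attr, obj>> triples into simple sentences."""
    # Record each object once (first-seen order), preferring a non-null attribute.
    objects = {}
    for (a1, o1), _, (a2, o2) in triples:
        for attr, obj in ((a1, o1), (a2, o2)):
            if objects.get(obj) in (None, "null"):
                objects[obj] = attr
    listing = ", ".join(f"one {noun_phrase(a, o)}" for o, a in objects.items())
    sentences = [f"This is a photograph of {listing}."]
    # One sentence per predicted relationship.
    for (_, o1), prep, (_, o2) in triples:
        left = noun_phrase(objects[o1], o1)
        right = noun_phrase(objects[o2], o2)
        sentences.append(f"The {left} is {prep} the {right}.")
    return " ".join(sentences)


triples = [
    (("null", "person"), "against", ("brown", "sofa")),
    (("null", "dog"), "near", ("null", "person")),
    (("null", "dog"), "beside", ("brown", "sofa")),
]
text = describe(triples)
print(text)
```

The output is close to, but not identical with, the description on the slide, which merges the two dog relationships into one sentence.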


Beyond attributes describing images

Some good results

This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.

This is a picture of two dogs. The first dog is near the second furry dog.

Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

Kulkarni et al, CVPR11


Some bad results

Missed detections:

False detections:

Incorrect attributes:

This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass.

There are one road and one cat. The furry road is in the furry cat.

Here we see one potted plant.

This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road.

This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass.

This is a picture of one dog.

Kulkarni et al, CVPR11


Algorithm vs humans

Algorithm vs Humans

“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”

H1: A Lemonaide stand is manned by a blonde child with a cookie.

H2: A small child at a lemonade and cookie stand on a city corner.

H3: Young child behind lemonade stand eating a cookie.

Sounds unnatural!

Kulkarni et al, CVPR11


Method 2 retrieval based generation

Method 2: Retrieval based generation

Berg, Attributes Tutorial CVPR13


Every picture tells a story describing images with meaningful sentences

Every picture tells a story, describing images with meaningful sentences

Ali Farhadi, Mohsen Hejrati, Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth

ECCV 2010

Slides provided by Ali Farhadi


A simplified problem

A Simplified Problem

Represent image/text content as subject-verb-scene triple

Good triples:

  • (ship, sail, sea)

  • (boat, sail, river)

  • (ship, float, water)

Bad triples:

  • (boat, smiling, sea) – bad relations

  • (train, moving, rail) – bad words

  • (dog, speaking, office) - both
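
A toy version of the triple representation, and of retrieving a sentence by comparing triples, might look like the sketch below. It assumes exact string matching per slot, which is a deliberate simplification: Farhadi et al. learn the mappings into meaning space and use a softer similarity. The `Triple` type, the function names, and the example sentences are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    verb: str
    scene: str


def match_score(a, b):
    """Count how many of the three slots agree exactly."""
    return sum(x == y for x, y in zip(a, b))


def retrieve(image_triple, sentence_triples):
    """Return the sentence whose triple best matches the image triple."""
    return max(sentence_triples, key=lambda item: match_score(image_triple, item[1]))[0]


# Hypothetical sentences paired with triples taken from the "good triples" list.
sentences = [
    ("A ship sails on the sea.", Triple("ship", "sail", "sea")),
    ("A boat sails down the river.", Triple("boat", "sail", "river")),
    ("A ship floats on the water.", Triple("ship", "float", "water")),
]
predicted = Triple("boat", "sail", "river")  # assumed output of the image classifiers
best = retrieve(predicted, sentences)
print(best)
```

Exact matching treats (ship, sail, sea) and (boat, sail, river) as sharing only the verb, even though the subjects and scenes are closely related. A graded similarity over related words would avoid that.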

Farhadi et al, ECCV10


The expanded model

The Expanded Model

  • Map from Image Space to Meaning Space

  • Map from Sentence Space to Meaning Space

  • Retrieve Sentences for Images via Meaning Space

Farhadi et al, ECCV10


Retrieval through meaning space

Retrieval through meaning space

  • Map from Image Space to Meaning Space

  • Map from Sentence Space to Meaning Space

  • Retrieve Sentences for Images via Meaning Space

Farhadi et al, ECCV10


Image space meaning space

Image Space -> Meaning Space

Predict Image Content using trained classifiers

Farhadi et al, ECCV10



Sentence space meaning space

Sentence Space -> Meaning Space

  • Extract subject, verb and scene from sentences in the training data

Example sentences:

  • black cat over pink chair

  • A black color cat sitting on chair in a room.

  • cat sitting on a chair looking in a mirror.

Extracted triple:

Subject: Cat

Verb: Sitting

Scene: room

  • Use taxonomy trees

Object

    Animal: Cat, Dog, Horse

    Human

    Vehicle: Car, Bike, Train
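
The slide does not say how the taxonomy trees enter the model. One plausible use, assumed here purely for illustration, is to let near-misses count as partly similar by measuring distance in the tree. The sketch encodes the small Object tree from the slide; `PARENT`, `path_to_root` and `tree_distance` are hypothetical names.

```python
# Child -> parent links for the taxonomy shown on the slide.
PARENT = {
    "Animal": "Object", "Human": "Object", "Vehicle": "Object",
    "Cat": "Animal", "Dog": "Animal", "Horse": "Animal",
    "Car": "Vehicle", "Bike": "Vehicle", "Train": "Vehicle",
}


def path_to_root(node):
    """List the node followed by each of its ancestors up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges between two nodes via their lowest common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    for steps_a, node in enumerate(path_a):
        if node in ancestors_b:
            return steps_a + path_b.index(node)
    raise ValueError("nodes are not in the same tree")


print(tree_distance("Cat", "Dog"), tree_distance("Cat", "Car"))
```

Under this measure a cat is closer to a dog than to a car, so a sentence about a dog would be a better fallback for a cat image than a sentence about a car.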

Farhadi et al, ECCV10


Beyond attributes describing images

Data

  • 1,000 images (Rashtchian et al 2010, Farhadi et al 2010): 5 descriptions per image, 20 object categories

  • 20,000 images (Image-Clef challenge): 2 descriptions per image, select image categories

More data needed?

Large amounts of paired data can help us study the image-language relationship

Berg, Attributes Tutorial CVPR13


Beyond attributes describing images

Data exists, but buried in junk!

Through the smoke

Duna Portrait #5

Mirror and gold

the cat lounging in the sink

Berg, Attributes Tutorial CVPR13


Sbu captioned photo dataset http tamaraberg com sbucaptions

SBU Captioned Photo Dataset

http://tamaraberg.com/sbucaptions

1 million captioned photos!

Little girl and her dog in northern Thailand. They both seemed interested in what we were doing

Interior design of modern white and brown living room furniture against white wall with a lamp hanging.

The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon

Man sits in a rusted car buried in the sand on Waitarere beach

Our dog Zoe in her bed

Emma in her hat looking super cute

Berg, Attributes Tutorial CVPR13


Im2text describing images using 1 million captioned photographs

“Im2Text: Describing Images Using 1 Million Captioned Photographs”

Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

NIPS 2011


Big data driven generation

Big Data Driven Generation

An old bridge over dirty green water.

One of the many stone bridges in town that carry the gravel carriage roads.

A stone bridge over a peaceful river.

Generate natural sounding descriptions using existing captions

Ordonez et al, NIPS11


Beyond attributes describing images

Harness the Web!

Global Matching (GIST + Color)

SBU Captioned Photo Dataset

1 million captioned images!

The daintree river by boat.

Hangzhou bridge in West lake.

A walk around the lake near our house with Abby.

The water is clear enough to see fish swimming around in it.

Bridge to temple in Hoan Kiem lake.

Transfer Caption(s)

e.g. “The water is clear enough to see fish swimming around in it.”

Smallest house in paris between red (on right) and beige (on left).

Ordonez et al, NIPS11
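
Global matching followed by caption transfer amounts to a nearest-neighbor lookup over image descriptors. The sketch below assumes tiny three-dimensional vectors in place of real GIST and color features, and plain Euclidean distance as the similarity; both are placeholders rather than the paper's actual settings. `transfer_captions` is a hypothetical helper name.

```python
import math


def transfer_captions(query_descriptor, database, k=1):
    """Return captions of the k database images closest to the query descriptor."""
    ranked = sorted(database, key=lambda item: math.dist(query_descriptor, item[0]))
    return [caption for _, caption in ranked[:k]]


# Hypothetical 3-D descriptors standing in for GIST + color features.
database = [
    ((0.9, 0.1, 0.2), "The water is clear enough to see fish swimming around in it."),
    ((0.1, 0.8, 0.7), "Smallest house in paris between red (on right) and beige (on left)."),
    ((0.7, 0.3, 0.1), "Hangzhou bridge in West lake."),
]
query = (0.85, 0.15, 0.2)
captions = transfer_captions(query, database, k=2)
print(captions[0])
```

Sorting the whole database is fine for a toy example. At the scale of a million images an approximate nearest-neighbor index would be the more realistic choice.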


Beyond attributes describing images

Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)

The bridge over the lake on Suzhou Street.

Iron bridge over the Duck river.

Transfer Caption(s)

e.g. “The bridge over the lake on Suzhou Street.”

The Daintree river by boat.

Bridge over Cacapon river.

. . .

Ordonez et al, NIPS11


Beyond attributes describing images

Results

Bad

Good

A female Mallard duck in the lake at Luukki Espoo.

Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.

The cat in the window.

The boat ended up a kilometre from the water in the middle of the airstrip.

Fresh fruit and vegetables at the market in Port Louis Mauritius.

Cat in sink.

Ordonez et al, NIPS11


Beyond attributes describing images

Next….

Composing novel captions from pieces of existing ones

Berg, Attributes Tutorial CVPR13


Composing captions guessing game

Composing captions: guessing game

a) monkey playing in the tree canopy, Monte Verde in the rain forest

b) capuchin monkey in front of my window

c) monkey spotted in Apenheul Netherlands under the tree

d) a white-faced or capuchin in the tree in the garden

e) the monkey sitting in a tree, posing for his picture

Berg, Attributes Tutorial CVPR13



Collective generation of natural image descriptions

“Collective Generation of Natural Image Descriptions”

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg and Yejin Choi

ACL 2012


Composing descriptions

Composing Descriptions

Object appearance

NP: the dirty sheep

Object pose

VP: meandered along a desolate road

Scene appearance

PP: in the highlands of Scotland

Region appearance & relationship

PP: through frozen grass

Example Composed Description:

the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass

Kuznetsova et al, ACL12



Data processing

Data Processing

1,000,000 images:

  • Run object detectors

  • Run region based stuff detectors (grass, sky, etc.)

  • Run global scene classifiers

  • Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene).

Kuznetsova et al, ACL12


Image description generation

Image Description Generation

Computer Vision (Objects, Actions, Stuff, Scenes) -> Phrase Retrieval -> Generation -> Description

Kuznetsova et al, ACL12



Beyond attributes describing images

Retrieving VPs

Contented dog just laying on the edge of the road in front of a house..

Peruvian dog sleeping on city street in the city of Cusco, (Peru)

Detect: dog

Find matching detections by pose similarity

this dog was laying in the middle of the road on a back street in jaco

Closeup of my dog sleeping under my desk.

Kuznetsova et al, ACL12


Retrieving nps

Retrieving NPs

Tray of glace fruit in the market at Nice, France

Fresh fruit in the market

Detect: fruit

The street market in Santanyi, Mallorca is a must for the oranges and local crafts.

A box of oranges was just catching the sun, bringing out detail in the skin.

Find matching detections by appearance similarity

An orange tree in the backyard of the house.

Kuznetsova et al, ACL12

mandarin oranges in glass bowl


Beyond attributes describing images

Retrieving PPstuff

Find matching regions by appearance + arrangement similarity

Cordoba - lonely elephant under an orange tree...

I positioned the chairs around the lemon tree -- it's like a shrine

Mini Nike soccer ball all alone in the grass

Detect: stuff

Comfy chair under a tree.

Kuznetsova et al, ACL12


Retrieving ppscene

Retrieving PPscene

I'm about to blow the building across the street over with my massive lung power.

Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere

Extract scene descriptor

Find matching images by global scene similarity

Only in Paris will you find a bottle of wine on a table outside a bookstore

View from our B&B in this photo

Kuznetsova et al, ACL12



Beyond attributes describing images

Object NPs: birds; the bird

Actions VPs: are standing; looking for food

Stuff PPs: in water; over water

Scene PPs: in the ocean; near Salt Pond

Kuznetsova et al, ACL12


Possible assignments

Possible Assignments

[Figure: grid assigning the candidate phrases “birds”, “the bird”, “are standing” and “in the ocean” to Position1 through Position4]

Kuznetsova et al, ACL12



Phrases of the same type

Phrases of the Same Type

[Figure: grid assigning the candidate phrases “birds”, “the bird”, “are standing” and “in the ocean” to Position1 through Position4]

Kuznetsova et al, ACL12


Singular plural relationships

Singular/Plural Relationships

[Figure: grid assigning the candidate phrases “birds”, “the bird”, “are standing” and “in the ocean” to Position1 through Position4]

Kuznetsova et al, ACL12


Ilp optimization

ILP Optimization

Optimize for:

Vision scores

  • Visual detection/classification scores

Phrase cohesion

  • n-gram statistics between phrases

  • Co-occurrence statistics between phrase head words

Subject to:

Linguistic constraints

  • Allow at most one phrase of each type

  • Enforce plural/singular agreement between NP and VP

Discourse constraints

  • Prevent inclusion of repeated phrasing

Kuznetsova et al, ACL12
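
The shape of this optimization can be illustrated on a handful of candidate phrases. The sketch below swaps the ILP solver for exhaustive search, fixes the phrase order to NP, VP, stuff PP, scene PP, and leaves out the discourse constraint, so it is a simplified stand-in rather than the formulation from Kuznetsova et al. All scores, the cohesion table, and the function names are hypothetical.

```python
from itertools import product

# (phrase, hypothetical vision score, grammatical number or None)
CANDIDATES = {
    "NP": [("birds", 0.6, "plural"), ("the bird", 0.7, "singular")],
    "VP": [("are standing", 0.5, "plural")],
    "PPstuff": [("in water", 0.4, None), ("over water", 0.3, None)],
    "PPscene": [("in the ocean", 0.5, None), ("near Salt Pond", 0.2, None)],
}
# Hypothetical cohesion bonuses between adjacent phrases.
COHESION = {("birds", "are standing"): 0.3, ("are standing", "in water"): 0.2}


def feasible(choice):
    """Linguistic constraint: a chosen NP and VP must agree in number."""
    np_, vp = choice["NP"], choice["VP"]
    return np_ is None or vp is None or np_[2] == vp[2]


def score(choice):
    """Objective: vision scores plus cohesion between neighboring phrases."""
    chosen = [c for c in choice.values() if c is not None]
    total = sum(c[1] for c in chosen)
    total += sum(COHESION.get((a[0], b[0]), 0.0) for a, b in zip(chosen, chosen[1:]))
    return total


def best_description():
    """Search every assignment with at most one phrase per type."""
    types = list(CANDIDATES)
    # Each type picks one candidate or None (no phrase of that type).
    options = [CANDIDATES[t] + [None] for t in types]
    assignments = (dict(zip(types, combo)) for combo in product(*options))
    best = max((a for a in assignments if feasible(a)), key=score)
    return " ".join(c[0] for c in best.values() if c is not None)


description = best_description()
print(description)
```

Although "the bird" has the higher vision score here, it cannot combine with the plural verb phrase, so the agreement constraint steers the search toward "birds". Exhaustive search only works because the candidate set is tiny; a solver is needed once positions and many more phrases are in play.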


Good examples

Good Examples

This is a sporty little red convertible made for a great day in Key West FL.

This car was in the 4th parade of the apartment buildings.

The clock made in Korea.

This is a brass viking boat moored on beach in Tobago by the ocean.

Kuznetsova et al, ACL12


Visual turing test

Visual Turing Test

Us vs Original Human Written Caption

In some cases (16%), ILP generated captions were preferred over human written ones!

Kuznetsova et al, ACL12


Bad results

Bad Results

Cognitive Absurdity

Not Relevant

Grammatically Incorrect

Computer Vision Error

This is a shoulder bag with a blended rainbow effect.

One of the most shirt in the wall of the house.

Here you can see a cross by the frog in the sky.

Kuznetsova et al, ACL12


Questions

Questions?

