Vision-Language Integration in AI: a reality check

Katerina Pastra and Yorick Wilks

Department of Computer Science, Natural Language Processing Group,

University of Sheffield, U.K.

Setting the context

Artificial Intelligence:

From technical integration of modalities → multimodal meaning integration

From Multimedia → Intellimedia + Intelligent Interfaces

Purpose: intelligent, natural, coherent communication

  • We focus on:

vision and language integration

  • Visual modalities = images (visual perception and/or visualisation representations, physically realised as e.g. 2D/3D graphics, photos…)

  • Linguistic modalities = text and/or speech

lack of an AI study of V-L integration,

lack of a reality check

The problem

Multimodal Integration: an old AI aspiration (cf. Kirsch 1964)

→ A wide variety of V-L integration prototypes in AI

  • What is computational V-L integration? (definition)
  • How is it achieved computationally?

(state of the art, practices, tendencies, needs)

  • How far can we go?

(implementation suggestions, the VLEMA prototype)

criteria for such a review???

In search of a definition

  • Defining computational V-L Integration: could a review of related applied AI research hold the answer?

Related work:

  • Srihari 1994: review of V-L integration prototypes

      • limited number of prototypes reviewed
      • suggestions and implementations are mixed
      • no clear focus on how integration is achieved
      • system classification according to input type
      • includes cases of quasi-integration

[Diagram: video summarisation example. Key phrases are identified in the text ("The basketball player...", "Our champion came first...", "...and the soccer player."); key frames for the video summary are then identified from the frames that correspond to the key sentence(s) extracted.]

The notion of quasi-integration

  • Quasi-integration: fusion of results obtained by modality-dependent processes (= intersection or combination of results, or even the results of one process constraining the search space for another)
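The three fusion patterns named in this definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the frame-index data are hypothetical and do not come from any of the reviewed prototypes. Each modality is assumed to have been processed separately, producing a list of candidate video frame indices.

```python
# Minimal sketch of quasi-integration: each modality is processed on its own,
# and only the *results* are fused. All names and data here are hypothetical.

def fuse_by_intersection(text_frames, vision_frames):
    """Keep only the frames that both modality-dependent processes selected."""
    return sorted(set(text_frames) & set(vision_frames))


def fuse_by_combination(text_frames, vision_frames):
    """Keep every frame that either process selected."""
    return sorted(set(text_frames) | set(vision_frames))


def constrain_search(candidate_frames, allowed_frames):
    """Let one process's results restrict the search space of the other.

    The order of candidate_frames is preserved.
    """
    allowed = set(allowed_frames)
    return [frame for frame in candidate_frames if frame in allowed]


# Frames aligned with key sentences (language side) and frames judged
# visually salient (vision side); both lists are made-up example data.
frames_from_text = [3, 10, 42]
frames_from_vision = [10, 42, 77]

both = fuse_by_intersection(frames_from_text, frames_from_vision)
either = fuse_by_combination(frames_from_text, frames_from_vision)
restricted = constrain_search([77, 42, 5, 10], frames_from_text)
```

In none of the three cases is visual information associated with linguistic information; the processes merely exchange or merge their outputs, which is why the slides treat this as quasi-integration.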

Defining integration through classification
  • Main criterion for considering a prototype for review: V-L integration must be essential for the task the prototype is built for.

Specifics of the review:

  • It is diachronic: from SHRDLU (Winograd '72) to conversational robots of the new millennium (e.g. Shapiro and Ismail 2003, Roy et al. 2003)
  • It crosses over into diverse AI areas and applications: more than 60 prototypes reviewed, from IR to Robotics
  • System classification criterion: the integration purpose served

Beyond differences

  • different visual and linguistic modalities involved
  • different tasks performed
  • different integration purposes served

but similar integration resources are used (though represented and instantiated differently)

Integration resources = associations between visual and corresponding linguistic information, e.g. words/concepts and visual features or image models

Form: lists, integrated KB, scene/event models in KR

Integration mechanisms = KR instantiation, translation rules, media selection, coordination…
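As a rough illustration of an integration resource in its simplest form (a list of associations), the Python sketch below pairs concepts with visual feature constraints and uses them to name an object. The feature names, thresholds and the `name_object` function are all assumptions made for this example; they are not taken from any reviewed system.

```python
# Hypothetical association list: each concept (word) is linked to the visual
# feature constraints an object must satisfy to receive that name.
ASSOCIATIONS = {
    "sofa": {"min_length": 1.2, "max_height": 1.0, "on_floor": True},
    "heater": {"min_length": 0.3, "max_height": 0.8, "on_floor": False},
}


def name_object(features):
    """Return every concept whose constraints the visual features satisfy."""
    matches = []
    for concept, constraint in ASSOCIATIONS.items():
        if (features["length"] >= constraint["min_length"]
                and features["height"] <= constraint["max_height"]
                and features["on_floor"] == constraint["on_floor"]):
            matches.append(concept)
    return matches


# Made-up measurements (in metres) for two segmented objects.
long_low_object = {"length": 2.1, "height": 0.9, "on_floor": True}
wall_mounted_object = {"length": 0.8, "height": 0.6, "on_floor": False}

print(name_object(long_low_object))      # concepts matching the first object
print(name_object(wall_mounted_object))  # concepts matching the second object
```

A table like `ASSOCIATIONS` is exactly the kind of resource that is usually written by hand, which is the practice the slides go on to question.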

A descriptive definition

Descriptive Definition =

a) Intensional Definition (what the term is, e.g. its genus et differentia)

b) Extensional Definition (what the term applies to)

a) Computational Vision-Language Integration is a process of associating visual and corresponding linguistic pieces of information

(indirect back-up from Cognitive Science: cf. the notion of learned associations in Minsky's "Society of Mind", 1986, and Jackendoff's theory of associating concepts and 3D models, 1987)

b) Computational Vision-Language Integration may take the form of one of 4 integration processes, according to the integration purpose to be served

The AI quest for V-L Integration

Argument :

In relying on human-created data, state-of-the-art V-L integration systems avoid core integration challenges and therefore fail to perform real integration.

  • Simulated or manually abstracted visual input is used
    → to avoid difficulties in image analysis
  • Applications are restricted to blocksworlds/miniworlds
    → scaling issues
  • Manually constructed integration resources are used
    → to avoid difficulties in associating V-L

Difficulties in integration: correspondence problem etc.

but the difficulties lie exactly where developers intervene...

How far can we go?

Challenging current practices in V-L integration system development requires that an ambitious system specification is formulated

A prototype should:

  • work with real visual scenes
  • analyse its visual data automatically
  • associate images and language automatically

Is it feasible to develop such a prototype?

An optimistic answer

VLEMA: A Vision-Language intEgration MechAnism

  • Input: automatically re-constructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)
  • Integration task: Medium Translation from images (3D sitting rooms) to text (what and where in EN)
  • Domain: estates surveillance
  • Horizontal prototype
  • Implemented in shell programming and Prolog

“…a heater … and a sofa with 3 seats…”

System Architecture


[Diagram labels: KB, Object Segmentation, Object Naming, Data Transformations]

The Output

Wed Jul 7 13:22:22 GMTDT 2004


Katerina Pastra@University of Sheffield

Description of the automatically constructed VRML file


This is a general view of a room.

We can see the front wall, the left-side wall, the floor,

A heater on the lower part of the front-wall and a sofa with 3 seats.

The heater is shorter in length than the sofa.

It is on the right of the sofa.
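The last two sentences of this output compare object lengths and relative positions. The Python sketch below shows one way such sentences could be derived from object extents. It is not VLEMA's implementation (the slides state that VLEMA was written in shell and Prolog); the `SceneObject` class, the coordinates and the sentence templates are hypothetical, and the object names are assumed to come from an earlier naming step.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str      # label assumed to come from an earlier object-naming step
    x_min: float   # left edge along the front wall, in metres (made-up data)
    x_max: float   # right edge along the front wall

    @property
    def length(self) -> float:
        return self.x_max - self.x_min

    @property
    def centre(self) -> float:
        return (self.x_min + self.x_max) / 2


def compare_length(a: SceneObject, b: SceneObject) -> str:
    """Describe how the length of a relates to the length of b."""
    if a.length < b.length:
        relation = "shorter in length than"
    elif a.length > b.length:
        relation = "longer in length than"
    else:
        relation = "the same length as"
    return f"The {a.name} is {relation} the {b.name}."


def relative_position(a: SceneObject, b: SceneObject) -> str:
    """Describe on which side of b the centre of a lies (viewer-facing wall)."""
    side = "right" if a.centre > b.centre else "left"
    return f"It is on the {side} of the {b.name}."


heater = SceneObject("heater", x_min=3.2, x_max=4.0)
sofa = SceneObject("sofa", x_min=0.5, x_max=2.6)

description = [compare_length(heater, sofa), relative_position(heater, sofa)]
print(" ".join(description))
```

With these made-up coordinates the sketch reproduces the two comparison sentences shown in the output above; real scenes would also need a step that decides which object pairs are worth describing.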


*** Could occasional reality checks re-direct (part of) AI research ? ***

  • Descriptive definition of V-L integration in AI

→ a theoretical, explanatory one in:

K. Pastra (2004), “Viewing Vision-Language Integration as a Double-Grounding Case”, Proceedings of the AAAI Fall Symposium Series, Washington DC.

  • Review and critique of the state of the art in AI
  • The VLEMA prototype – a baseline for future research that will challenge current practices