Slide 1

Vision-Language Integration in AI: a reality check

Katerina Pastra and Yorick Wilks

Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.

Slide 2

Setting the context

Artificial Intelligence:

From technical integration of modalities → multimodal meaning integration

From Multimedia → Intellimedia + Intelligent Interfaces

Purpose: intelligent, natural, coherent communication

  • We focus on:

    vision and language integration

  • Visual modalities = images

    (visual perception and/or visualisation representations

    physically realised as e.g. 2D/3D graphics, photos…)

  • Linguistic modalities = text and/or speech

Slide 3

The problem

Multimodal Integration: an old AI aspiration (cf. Kirsch 1964)

→ A wide variety of V-L integration prototypes in AI

Lack of an AI study of V-L integration, lack of a reality check

  • What is computational V-L integration? (definition)

  • How is it achieved computationally?

    (state of the art, practices, tendencies, needs)

  • How far can we go?

    (implementation suggestions, the VLEMA prototype)

Slide 4

In search of a definition

Defining computational V-L Integration: could a review of related applied AI research hold the answer?

Related work:

  • Srihari 1994: review of V-L integration prototypes

      • limited number of prototypes reviewed

      • suggestions and implementations are mixed

      • no clear focus on how integration is achieved

      • system classification according to input type

      • includes cases of quasi-integration

Criteria for such a review???

Slide 5

The notion of quasi-integration

Quasi-integration: fusion of results obtained by modality-dependent processes (= intersection or combination of results, or even the results of one process constraining the search space for another)

Example: video summary

  • Text: key phrase identification ("The basketball player...", "Our champion came first...", "...and the soccer player.")

  • Video: key frame identification from frames that correspond to the key sentence(s) extracted
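A minimal sketch of the quasi-integration pattern in the video-summary example, using made-up data: each modality is processed on its own, and the text result only restricts the search space for key-frame selection. The `Frame` type, the keyword-count ranking, the sentence-to-frame alignment, and the `sharpness` score are all illustrative assumptions, not taken from any reviewed system.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    sentence_id: int   # sentence aligned with this frame (assumed to be given)
    sharpness: float   # stand-in for any purely visual quality score

def key_sentences(sentences, keywords, top_n=1):
    """Language-only step: rank sentences by the number of keywords they contain."""
    scored = [(sum(w.lower().strip(".,") in keywords for w in s.split()), i)
              for i, s in enumerate(sentences)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_n]]

def key_frames(frames, allowed_sentence_ids):
    """Vision-only step, run only on the frames the text step allows."""
    best = {}
    for f in frames:
        if f.sentence_id not in allowed_sentence_ids:
            continue  # search space constrained by the other modality
        if f.sentence_id not in best or f.sharpness > best[f.sentence_id].sharpness:
            best[f.sentence_id] = f
    return [best[sid].index for sid in sorted(best)]

# Toy data (hypothetical transcript and frame scores).
sentences = ["The basketball player scored.",
             "Our champion came first.",
             "The weather was fine."]
frames = [Frame(0, 0, 0.4), Frame(1, 1, 0.9), Frame(2, 1, 0.5), Frame(3, 2, 0.8)]

selected = key_sentences(sentences, {"champion", "first"})
summary = key_frames(frames, set(selected))
```

Neither step here consults an association between words and visual content; the results are simply fused, which is what separates quasi-integration from integration proper.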

Slide 6

Defining integration through classification

  • Main criterion for considering a prototype for review: V-L integration must be essential for the task the prototype is built for.

Specifics of the review:

  • It is diachronic: from SHRDLU (Winograd '72) to conversational robots of the new millennium (e.g. Shapiro and Ismail 2003, Roy et al. 2003)

  • It crosses over into diverse AI areas and applications: more than 60 prototypes reviewed, from IR to Robotics

  • System classification criterion: the integration purpose served

Slide 7

Classification of V-L integration prototypes

Slide 8


Slide 9

Beyond differences

  • different visual and linguistic modalities involved

  • different tasks performed

  • different integration purposes served,

but similar integration resources are used (though represented and instantiated differently)

Integration resources = associations between visual and corresponding linguistic information, e.g. words/concepts and visual features or image models

Form: lists, integrated KB, scene/event models in KR

Integration mechanisms = KR instantiation, translation rules, media selection, coordination…
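As a toy illustration of an integration resource in list form, the sketch below associates words with constraints on visual features and uses the association in both directions. All names, features, and thresholds are invented for illustration; actual prototypes represent and instantiate such associations in many different ways.

```python
# Hypothetical association list: word -> constraints on visual features.
ASSOCIATIONS = {
    "sofa":   {"min_width_m": 1.2, "max_height_m": 1.1, "on_floor": True},
    "heater": {"min_width_m": 0.4, "max_height_m": 0.9, "on_floor": False},
}

def matches(features, constraints):
    """Check one object's measured features against one word's constraints."""
    return (features["width_m"] >= constraints["min_width_m"]
            and features["height_m"] <= constraints["max_height_m"]
            and features["on_floor"] == constraints["on_floor"])

def name_object(features):
    """Vision -> language: return the first word whose constraints fit, else None."""
    for word, constraints in ASSOCIATIONS.items():
        if matches(features, constraints):
            return word
    return None

def expected_features(word):
    """Language -> vision: look up what a word predicts about an object."""
    return ASSOCIATIONS.get(word)

wide_low_object = {"width_m": 2.0, "height_m": 0.8, "on_floor": True}
wall_mounted_object = {"width_m": 0.6, "height_m": 0.5, "on_floor": False}
names = [name_object(wide_low_object), name_object(wall_mounted_object)]
```

In this sketch the table is written by hand, which is precisely the kind of manually constructed resource the review goes on to question.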

Slide 10

A descriptive definition

Descriptive Definition =

a) Intensional Definition (what the term is, e.g. its genus et differentia)

b) Extensional Definition (what the term applies to)

a) Computational Vision-Language Integration is a process of associating visual and corresponding linguistic pieces of information

(indirect back-up from Cognitive Science: cf. the notion of learned associations in Minsky's "Society of Mind", 1986, and Jackendoff's theory of associating concepts and 3D models, 1987)

b) Computational Vision-Language Integration may take the form of one of 4 integration processes, according to the integration purpose to be served

Slide 11

The AI quest for V-L Integration

Argument:

In relying on human-created data, state-of-the-art V-L integration systems avoid core integration challenges and therefore fail to perform real integration

  • Simulated or manually abstracted visual input is used → to avoid difficulties in image analysis

  • Applications are restricted to blocksworlds/miniworlds → scaling issues

  • Manually constructed integration resources are used → to avoid difficulties in associating V-L

Difficulties in integration: correspondence problem etc.

But the difficulties lie exactly where developers intervene...

Slide 12

How far can we go?

Challenging current practices in V-L integration system development requires formulating an ambitious system specification

A prototype should:

  • work with real visual scenes

  • analyse its visual data automatically

  • associate images and language automatically

Is it feasible to develop such a prototype???

Slide 13

An optimistic answer

VLEMA: A Vision-Language intEgration MechAnism

  • Input: automatically reconstructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)

  • Integration task: Medium Translation from images (3D sitting rooms) to text (what and where, in English)

  • Domain: estates surveillance

  • Horizontal prototype

  • Implemented in shell programming and Prolog

Slide 14

The Input

Slide 15

System Architecture

[Architecture diagram; recoverable labels: Object Segmentation, Object Naming, Data Transformations, + KB]

Sample output: “…a heater … and a sofa with 3 seats…”

Slide 16

The Output

Wed Jul 7 13:22:22 GMTDT 2004


Katerina Pastra@University of Sheffield

Description of the automatically constructed VRML file


This is a general view of a room.

We can see the front wall, the left-side wall, the floor,

A heater on the lower part of the front-wall and a sofa with 3 seats.

The heater is shorter in length than the sofa.

It is on the right of the sofa.
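The comparative and spatial sentences in this output could be approximated from object extents along one axis. The following is a rough sketch only, not VLEMA's implementation (which, according to the slides, was written in shell and Prolog); the `Box` type, the axis convention (x grows towards the viewer's right), the coordinates, and the sentence templates are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    x_min: float   # left edge along the viewing axis (assumed convention)
    x_max: float   # right edge

    @property
    def length(self):
        return self.x_max - self.x_min

    @property
    def centre(self):
        return (self.x_min + self.x_max) / 2

def compare_length(a, b):
    """Template for a comparative sentence about two named objects."""
    if a.length < b.length:
        return f"The {a.name} is shorter in length than the {b.name}."
    if a.length > b.length:
        return f"The {a.name} is longer in length than the {b.name}."
    return f"The {a.name} is as long as the {b.name}."

def relative_position(a, b):
    """Template for a left/right relation, refering back to `a` with a pronoun."""
    side = "right" if a.centre > b.centre else "left"
    return f"It is on the {side} of the {b.name}."

# Hypothetical coordinates in metres.
heater = Box("heater", 3.0, 3.8)
sofa = Box("sofa", 0.5, 2.6)
description = [compare_length(heater, sofa), relative_position(heater, sofa)]
```

The harder part, which this sketch skips entirely, is obtaining named objects from the reconstructed 3D scene in the first place.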

Slide 17

  • Descriptive definition of V-L integration in AI

    → a theoretical explanatory one in: K. Pastra (2004), “Viewing Vision-Language Integration as a Double-Grounding Case”, Proceedings of the AAAI Fall Symposium Series, Washington DC.

  • Review and critique of the state of the art in AI

  • The VLEMA prototype – a baseline for future research that will challenge current practices

*** Could occasional reality checks re-direct (part of) AI research? ***