Vision-Language Integration in AI:
a reality check
Katerina Pastra and Yorick Wilks
Department of Computer Science, Natural Language Processing Group,
University of Sheffield, U.K.
Setting the context
From technical integration of modalities
to multimodal meaning integration
From Multimedia to Intellimedia + Intelligent Interfaces
Purpose: intelligent, natural, coherent communication
vision and language integration
(visual perception and/or visualisation representations
physically realised as e.g. 2D/3D graphics, photos…)
lack of an AI study of V-L integration,
lack of a reality check
Multimodal Integration: an old AI aspiration
(cf. Kirsch 1964)
A wide variety of V-L integration prototypes in AI
(state of the art, practices, tendencies, needs)
(implementation suggestions, the VLEMA prototype)
criteria for such a review???
In search of a definition
Defining computational V-L Integration:
could a review of related applied AI research hold
the answer?
limited number of prototypes reviewed
suggestions and implementations are mixed
no clear focus on how integration is achieved
system classification according to input type
includes cases of quasi-integration
The basketball player...
(key phrase identification)
Our champion came first...
and the soccer player.
(key frame identification from frames that correspond to the key sentence(s) extracted)
The notion of quasi-integration
fusion of results obtained by modality-dependent
processes (= intersection or combination of results, or even the
results of one process constraining the search space of another)
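The two flavours of quasi-integration named above can be sketched in a few lines. This is only an illustration, not code from any reviewed system: the function names, the clip identifiers and the time spans are all hypothetical. The first function fuses the outputs of two independent, modality-dependent processes by intersection; the second lets the result of a language process (time spans of key sentences) narrow the frames a vision process would have to examine.

```python
def fuse_by_intersection(text_hits, image_hits):
    """Keep only the items returned by both modality-dependent processes."""
    return sorted(set(text_hits) & set(image_hits))


def constrain_search(key_sentence_spans, frames):
    """Keep only frames whose timestamp falls inside a key-sentence span,
    so that the language result limits the search space of the vision step."""
    return [
        frame for frame in frames
        if any(start <= frame["time"] <= end for start, end in key_sentence_spans)
    ]


# Hypothetical outputs of a text-based and an image-based process.
text_hits = ["clip_03", "clip_07", "clip_09"]
image_hits = ["clip_07", "clip_09", "clip_12"]
fused = fuse_by_intersection(text_hits, image_hits)

# Hypothetical video frames with timestamps in seconds.
frames = [{"id": i, "time": t} for i, t in enumerate([0.5, 2.0, 4.5, 7.0, 9.5])]
candidates = constrain_search([(1.0, 5.0)], frames)
```

In both cases each modality is analysed on its own and only the results meet, which is why the slides treat this as quasi-integration.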
Defining integration through classification
Criterion for inclusion in the review: V-L integration must be
essential for the task the prototype is built for.
Specifics of the review:
It is diachronic: from SHRDLU (Winograd '72)
to conversational robots of the new millennium
(e.g. Shapiro and Ismail 2003, Roy et al. 2003)
It crosses over into diverse AI areas and applications:
more than 60 prototypes reviewed from IR to Robotics
System classification criterion:
the integration purpose served
Classification of V-L integration prototypes
different visual and linguistic modalities involved
different tasks performed
different integration purposes served, but
similar integration resources are used
(though represented and instantiated differently)
Integration resources = Associations between :
Visual and corresponding linguistic information e.g. words/concepts and visual features or image models
Form: lists, integrated KB, scene/event models in KR
Integration mechanisms = KR instantiation, translation rules, media selection, coordination…
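As a rough illustration of an integration resource in its simplest "list" form, the sketch below pairs concepts with visual features and uses a plain lookup to name an observed object. Everything here is an assumption made for the example: the `Association` class, the feature names and the threshold values do not come from any of the reviewed prototypes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One entry of a hypothetical integration resource."""
    concept: str           # linguistic side: a word or concept
    visual_features: dict  # visual side: features of an image model


# Hypothetical association list (feature names and values are illustrative).
ASSOCIATIONS = [
    Association("sofa", {"shape": "box", "min_width_m": 1.2}),
    Association("heater", {"shape": "panel", "min_width_m": 0.4}),
]


def name_object(observed, associations):
    """Return the concepts whose visual features fit the observed object."""
    return [
        a.concept for a in associations
        if a.visual_features["shape"] == observed["shape"]
        and observed["width_m"] >= a.visual_features["min_width_m"]
    ]


labels = name_object({"shape": "box", "width_m": 2.0}, ASSOCIATIONS)
```

An integrated KB or a scene/event model would hold the same kind of pairing in a richer representation; the lookup would then be replaced by the integration mechanism in use.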
A descriptive definition
Descriptive Definition =
a) Intensional Definition (what the term is e.g. its genus et differentia)
b) Extensional Definition (what the term applies to)
a) Computational Vision-Language Integration is a
process of associating visual and corresponding
linguistic pieces of information
(indirect back-up from Cognitive Science: cf. notion of learned associations in
Minsky's "Society of Mind" 1986, and Jackendoff's theory of associating
concepts and 3D models, 1987)
b) Computational Vision-Language Integration may
take the form of one of 4 integration processes
according to the integration purpose to be served
The AI quest for V-L Integration
By relying on human-created data, state-of-the-art V-L integration systems avoid the core integration challenges and therefore fail to perform real integration
to avoid difficulties in image analysis
to avoid difficulties in associating V-L
Difficulties in integration: correspondence problem etc.
but the difficulties lie exactly where developers intervene...
How far can we go?
Challenging current practices in V-L integration system development requires that an ambitious system specification is formulated
A prototype should:
Is it feasible to develop such a prototype???
An optimistic answer
VLEMA: A Vision-Language intEgration MechAnism
3D (VRML format) from RESOLV (robot-surveyor)
from images (3D sitting rooms) to text (what and where in EN)
“…a heater … and a sofa with 3 seats…”
Wed Jul 7 13:22:22 GMTDT 2004
Katerina Pastra, University of Sheffield
Description of the automatically constructed VRML file
This is a general view of a room.
We can see the front wall, the left-side wall, the floor,
A heater on the lower part of the front-wall and a sofa with 3 seats.
The heater is shorter in length than the sofa.
It is on the right of the sofa.
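The comparative sentences in the sample output above can be mimicked with a toy generator. This is not VLEMA's actual mechanism, which the slides do not describe in detail; it is a minimal sketch assuming each recognised object is available as a record with a name, a length and a horizontal position `x` that increases towards the viewer's right. All field names and numbers are hypothetical.

```python
def describe_pair(a, b):
    """Produce simple size and position sentences comparing object a to object b."""
    sentences = []
    if a["length"] < b["length"]:
        sentences.append(
            f"The {a['name']} is shorter in length than the {b['name']}."
        )
    elif a["length"] > b["length"]:
        sentences.append(
            f"The {a['name']} is longer than the {b['name']}."
        )
    # Assumption: larger x means further to the viewer's right.
    side = "right" if a["x"] > b["x"] else "left"
    sentences.append(f"It is on the {side} of the {b['name']}.")
    return sentences


# Hypothetical object records (metres); not taken from the RESOLV data.
heater = {"name": "heater", "length": 0.8, "x": 3.1}
sofa = {"name": "sofa", "length": 2.0, "x": 1.4}
description = describe_pair(heater, sofa)
```

With these records, `description` holds the two sentences "The heater is shorter in length than the sofa." and "It is on the right of the sofa."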
*** Could occasional reality checks re-direct (part of) AI research? ***
a theoretical explanatory one in:
K. Pastra (2004), "Viewing Vision-Language Integration as a Double-Grounding Case", Proceedings of the AAAI Fall Symposium Series, Washington DC.
research that will challenge current practices