Vision-Language Integration in AI: a reality check
Katerina Pastra and Yorick Wilks
Department of Computer Science, Natural Language Processing Group,
University of Sheffield, U.K.
Setting the context
Artificial Intelligence:
From technical integration of modalities
to multimodal meaning integration
From Multimedia to Intellimedia + Intelligent Interfaces
Purpose: intelligent, natural, coherent communication
vision and language integration
(visual perception and/or visualisation representations
physically realised as e.g. 2D/3D graphics, photos…)
lack of an AI study of V-L integration,
lack of a reality check
Multimodal Integration: an old AI aspiration
(cf. Kirsch 1964)
A wide variety of V-L integration prototypes in AI
(state of the art, practices, tendencies, needs)
(implementation suggestions, the VLEMA prototype)
criteria for such a review???
In search of a definition
Defining computational V-L Integration:
could a review of related applied AI research hold
the answer?
limited number of prototypes reviewed
suggestions and implementations are mixed
no clear focus on how integration is achieved
system classification according to input type
includes cases of quasi-integration
(key phrase identification)
Our champion came first...
and the soccer player.
(key frame identification from frames that correspond to the key sentence(s) extracted)
The notion of quasi-integration
fusion of results obtained by modality-dependent
processes (= intersection or combination of results, or even the
results of one process constraining the search space for the other)
Review: V-L integration has to be essential for the task
the prototype is built for.
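The fusion described under quasi-integration can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy transcript, and the frame scores are hypothetical and are not taken from any reviewed prototype. It only shows two modality-dependent processes running independently and their results being intersected afterwards.

```python
# Hypothetical sketch of quasi-integration: each process works on one
# modality only, and the results are merely intersected at the end.

def key_sentence_times(transcript, keywords):
    """Language-only process: timestamps of sentences containing a keyword."""
    return {t for t, sentence in transcript
            if any(k in sentence.lower() for k in keywords)}

def key_frame_times(frames, min_score):
    """Vision-only process: timestamps of frames above a salience threshold."""
    return {t for t, score in frames if score >= min_score}

def fuse(language_hits, vision_hits):
    """Fusion step: keep only timestamps selected by both processes."""
    return sorted(language_hits & vision_hits)

# Toy data (assumed): (timestamp, sentence) and (timestamp, salience score).
transcript = [(0, "Our champion came first"),
              (5, "Weather update"),
              (10, "and the soccer player scored")]
frames = [(0, 0.9), (5, 0.8), (10, 0.2)]

selected = fuse(key_sentence_times(transcript, ["champion", "soccer"]),
                key_frame_times(frames, 0.5))
print(selected)
```

No association between visual and linguistic information is ever built here, which is why such fusion counts only as quasi-integration.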
Specifics of the review:
It is diachronic: from SHRDLU (Winograd '72)
to conversational robots of the new millennium
(e.g. Shapiro and Ismail 2003, Roy et al. 2003)
It crosses over into diverse AI areas and applications:
more than 60 prototypes reviewed from IR to Robotics
System classification criterion:
the integration purpose served
different visual and linguistic modalities involved
different tasks performed
different integration purposes served, but
similar integration resources are used
(though represented and instantiated differently)
Integration resources = Associations between:
Visual and corresponding linguistic information e.g. words/concepts and visual features or image models
Form: lists, integrated KB, scene/event models in KR
Integration mechanisms = KR instantiation, translation rules, media selection, coordination…
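As a rough illustration of an integration resource in list form, the Python sketch below stores associations between words and visual features and uses a simple lookup as a stand-in for an integration mechanism. The feature names, the `Association` class, and the overlap-based matching are all assumptions made for this example; the reviewed systems represent and instantiate such resources in many different ways.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Association:
    """One entry of a hypothetical association list: a word and its visual features."""
    word: str
    visual_features: frozenset

# Assumed toy resource; real systems may use an integrated KB or scene models instead.
ASSOCIATIONS = [
    Association("sofa", frozenset({"has_seats", "has_backrest"})),
    Association("heater", frozenset({"flat_panel", "wall_mounted"})),
]

def name_object(observed):
    """Return the word whose features overlap most with the observed ones, or None."""
    best = max(ASSOCIATIONS, key=lambda a: len(a.visual_features & observed))
    return best.word if best.visual_features & observed else None

print(name_object({"wall_mounted", "flat_panel", "white"}))
print(name_object({"round"}))
```

The lookup goes from vision to language; the same list could be read in the other direction to select visual features for a given word.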
Descriptive Definition =
a) Intensional Definition (what the term is e.g. its genus et differentia)
b) Extensional Definition (what the term applies to)
a) Computational Vision-Language Integration is a
process of associating visual and corresponding
linguistic pieces of information
(indirect back-up from Cognitive Science: cf. notion of learned associations in
Minsky's "Society of Mind" 1986, and Jackendoff's theory of associating
concepts and 3D models, 1987)
b) Computational Vision-Language Integration may
take the form of one of 4 integration processes
according to the integration purpose to be served
By relying on human-created data, state-of-the-art V-L integration systems avoid core integration challenges and therefore fail to perform real integration
to avoid difficulties in image analysis
to avoid difficulties in associating V-L
Difficulties in integration: correspondence problem etc.
but the difficulties lie exactly where developers intervene...
Challenging current practices in V-L integration system development requires that an ambitious system specification is formulated
A prototype should:
Is it feasible to develop such a prototype ???
VLEMA: A Vision-Language intEgration MechAnism
3D (VRML format) from RESOLV (robot-surveyor)
from images (3D sitting rooms) to text (what and where in EN)
“…a heater … and a sofa with 3 seats…”
Wed Jul 7 13:22:22 GMTDT 2004
Katerina Pastra, University of Sheffield
Description of the automatically constructed VRML file
This is a general view of a room.
We can see the front wall, the left-side wall, the floor,
A heater on the lower part of the front-wall and a sofa with 3 seats.
The heater is shorter in length than the sofa.
It is on the right of the sofa.
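The comparative and spatial sentences in the description above can be mimicked with a small Python sketch. This is not VLEMA's actual mechanism, which the slides do not detail; the `Box` class, the coordinates, and the assumption that x grows towards the viewer's right are all hypothetical. The sketch only shows how object extents could be turned into "what and where" sentences.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Hypothetical object extent along the wall (x grows to the viewer's right)."""
    name: str
    x_min: float
    x_max: float

    @property
    def length(self):
        return self.x_max - self.x_min

def compare(a, b):
    """Describe a relative to b: length comparison first, then left/right position."""
    sentences = []
    if a.length < b.length:
        sentences.append(f"The {a.name} is shorter in length than the {b.name}.")
    elif a.length > b.length:
        sentences.append(f"The {a.name} is longer than the {b.name}.")
    if a.x_min >= b.x_max:
        sentences.append(f"It is on the right of the {b.name}.")
    elif a.x_max <= b.x_min:
        sentences.append(f"It is on the left of the {b.name}.")
    return " ".join(sentences)

# Assumed coordinates in metres, chosen only to reproduce the sample sentences.
sofa = Box("sofa", 0.0, 2.0)
heater = Box("heater", 2.5, 3.5)
print(compare(heater, sofa))
```

With these assumed coordinates the sketch prints the last two sentences of the sample description; overlapping extents simply produce no positional sentence.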
*** Could occasional reality checks re-direct (part of) AI research ? ***
a theoretical explanatory one in:
K. Pastra (2004), “Viewing Vision-Language Integration as a Double-Grounding Case”, Proceedings of the AAAI Fall Symposium Series, Washington DC.
research that will challenge current practices