Vision-Language Integration in AI: a reality check

Katerina Pastra and Yorick Wilks
Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.

Setting the context
Artificial Intelligence:
• From technical integration of modalities to multimodal meaning integration
• From Multimedia to Intellimedia + Intelligent Interfaces
• Purpose: intelligent, natural, coherent communication
We focus on vision and language integration:
• Visual modalities = images (visual perception and/or visualisation representations, physically realised as e.g. 2D/3D graphics, photos…)
• Linguistic modalities = text and/or speech

The problem
Multimodal integration is an old AI aspiration (cf. Kirsch 1964), and AI offers a wide variety of V-L integration prototypes, but there is no AI study of V-L integration and no reality check.
• What is computational V-L integration? (definition)
• How is it achieved computationally? (state of the art, practices, tendencies, needs)
• How far can we go? (implementation suggestions, the VLEMA prototype)

In search of a definition
Defining computational V-L integration: could a review of related applied AI research hold the answer? What would the criteria for such a review be?
Related work:
• Srihari 1994: review of V-L integration prototypes
  • limited number of prototypes reviewed
  • suggestions and implementations are mixed
  • no clear focus on how integration is achieved
  • system classification according to input type
  • includes cases of quasi-integration

The notion of quasi-integration
Quasi-integration: fusion of results obtained by modality-dependent processes (= intersection or combination of results, or even the results of one process constraining the search space for another)
[Figure: video summary example. Text fragments shown: "The basketball player...", "Our champion came first...", "...and the soccer player." NLP performs key phrase identification; IP performs key frame identification from the frames that correspond to the key sentence(s) extracted; the result is a video summary.]
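The fusion described on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the sentence-to-frame alignment, the set of "salient" frames and the toy sentences are hypothetical and do not come from any reviewed system. Each modality is processed on its own; only the results are combined, with the language result constraining the search space for the image side.

```python
from typing import Dict, List, Set


def key_sentences(sentences: List[str], keywords: Set[str]) -> List[int]:
    """Language-only step: indices of sentences that mention any keyword."""
    hits = []
    for i, sentence in enumerate(sentences):
        words = {w.strip(".,!?").lower() for w in sentence.split()}
        if words & keywords:
            hits.append(i)
    return hits


def quasi_integrate(
    sentence_ids: List[int],
    alignment: Dict[int, List[int]],
    salient_frames: Set[int],
) -> List[int]:
    """Fuse the two results: only frames aligned with a key sentence are
    considered, and of those only the visually salient ones are kept."""
    summary: Set[int] = set()
    for sid in sentence_ids:
        candidates = alignment.get(sid, [])
        summary.update(f for f in candidates if f in salient_frames)
    return sorted(summary)


# Hypothetical toy data (not taken from the slides).
sentences = [
    "The basketball player scored.",
    "The weather was mild.",
    "Our champion came first.",
]
alignment = {0: [10, 11, 12], 1: [20, 21], 2: [30, 31]}  # sentence -> frames
salient = {11, 21, 31}  # output of an image-only key frame detector

ids = key_sentences(sentences, {"player", "champion"})
frames = quasi_integrate(ids, alignment, salient)
print(frames)
```

Note that no association between visual and linguistic content is ever computed here: frame 21 is dropped only because its sentence was not selected, which is why such fusion counts as quasi-integration rather than integration.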

Defining integration through classification
• Main criterion for considering a prototype for review: V-L integration must be essential for the task the prototype is built for.
Specifics of the review:
• It is diachronic: from SHRDLU (Winograd ´72) to conversational robots of the new millennium (e.g. Shapiro and Ismail 2003, Roy et al. 2003)
• It crosses over into diverse AI areas and applications: more than 60 prototypes reviewed, from IR to Robotics
• System classification criterion: the integration purpose served

Classification of V-L integration prototypes

Examples

Beyond differences
Prototypes differ in:
• the visual and linguistic modalities involved
• the tasks performed
• the integration purposes served
but similar integration resources are used (though represented and instantiated differently).
• Integration resources = associations between visual and corresponding linguistic information, e.g. words/concepts and visual features or image models. Form: lists, integrated KB, scene/event models in KR
• Integration mechanisms = KR instantiation, translation rules, media selection, coordination…
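As a rough illustration of an integration resource in its simplest form (a list of associations between concepts and visual features), consider the following Python sketch. The `Association` class, the feature names and the matching rule are all hypothetical; the reviewed prototypes represent and instantiate such resources in many different ways.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Association:
    """One entry of a hypothetical integration resource."""
    concept: str               # linguistic side: a word or concept
    features: Dict[str, str]   # visual side: feature name -> expected value


# A tiny hand-written resource (illustrative values only).
RESOURCE: List[Association] = [
    Association("sofa", {"shape": "box", "size": "large"}),
    Association("heater", {"shape": "box", "size": "small"}),
]


def name_object(
    observed: Dict[str, str], resource: List[Association]
) -> Optional[str]:
    """Return the first concept whose visual features all match the
    observed features, or None when nothing in the resource matches."""
    for assoc in resource:
        if all(observed.get(k) == v for k, v in assoc.features.items()):
            return assoc.concept
    return None


print(name_object({"shape": "box", "size": "small", "colour": "white"}, RESOURCE))
print(name_object({"shape": "sphere"}, RESOURCE))
```

A list like this is written by hand, which is the practice the talk goes on to question: the association itself is supplied by the developer, not computed by the system.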

A descriptive definition
Descriptive definition = a) intensional definition (what the term is, e.g. its genus et differentia) and b) extensional definition (what the term applies to)
a) Computational Vision-Language Integration is a process of associating visual and corresponding linguistic pieces of information (indirect back-up from Cognitive Science: cf. the notion of learned associations in Minsky´s "Society of Mind" 1986, and Jackendoff´s theory of associating concepts and 3D models, 1987)
b) Computational Vision-Language Integration may take the form of one of 4 integration processes, according to the integration purpose to be served

The AI quest for V-L Integration
Argument: in relying on human-created data, state-of-the-art V-L integration systems avoid core integration challenges and therefore fail to perform real integration.
• Simulated or manually abstracted visual input is used, to avoid difficulties in image analysis
• Applications are restricted to blocksworlds/miniworlds (scaling issues)
• Manually constructed integration resources are used, to avoid difficulties in associating V-L
Difficulties in integration: correspondence problem etc. But the difficulties lie exactly where developers intervene...

How far can we go?
Challenging current practices in V-L integration system development requires formulating an ambitious system specification. A prototype should:
• work with real visual scenes
• analyse its visual data automatically
• associate images and language automatically
Is it feasible to develop such a prototype?

An optimistic answer
VLEMA: A Vision-Language intEgration MechAnism
• Input: automatically re-constructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)
• Integration task: Medium Translation from images (3D sitting rooms) to text (what and where in EN)
• Domain: estates surveillance
• Horizontal prototype
• Implemented in shell programming and Prolog

The Input

System Architecture
[Figure: system architecture diagram. Labels shown: Data Transformations, Object Segmentation, Object Naming, OntoVis + KB, Description (“…a heater … and a sofa with 3 seats…”).]

The Output
Wed Jul 7 13:22:22 GMTDT 2004
VLEMA V1.0
Katerina Pastra@University of Sheffield
Description of the automatically constructed VRML file “development-scene.wrl”
This is a general view of a room. We can see the front wall, the left-side wall, the floor, A heater on the lower part of the front-wall and a sofa with 3 seats. The heater is shorter in length than the sofa. It is on the right of the sofa.
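The last two sentences of this output express a size comparison and a spatial relation between two named objects. The Python sketch below shows one way such sentences could be derived from object extents. It is not VLEMA's implementation (which the slides describe as written in shell programming and Prolog); the `SceneObject` class, the coordinates, and the assumption that the x-axis increases towards the viewer's right are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """A named object with its extent along one horizontal axis.
    Assumption: x increases towards the viewer's right."""
    name: str
    x_min: float
    x_max: float

    @property
    def length(self) -> float:
        return self.x_max - self.x_min

    @property
    def centre(self) -> float:
        return (self.x_min + self.x_max) / 2


def compare(a: SceneObject, b: SceneObject) -> str:
    """Describe a relative to b: first length, then left/right position."""
    parts = []
    if a.length < b.length:
        parts.append(f"The {a.name} is shorter in length than the {b.name}.")
    elif a.length > b.length:
        parts.append(f"The {a.name} is longer in length than the {b.name}.")
    if a.centre > b.centre:
        parts.append(f"It is on the right of the {b.name}.")
    elif a.centre < b.centre:
        parts.append(f"It is on the left of the {b.name}.")
    return " ".join(parts)


# Invented coordinates, chosen only so the relations match the output above.
sofa = SceneObject("sofa", 0.0, 2.0)
heater = SceneObject("heater", 2.5, 3.3)

text = compare(heater, sofa)
print(text)
```

Comparing centres is a deliberate simplification; overlapping objects or relations in depth and height would need a richer scene model.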

Conclusion
*** Could occasional reality checks re-direct (part of) AI research? ***
• Descriptive definition of V-L integration in AI; for a theoretical, explanatory one see: K. Pastra (2004), “Viewing Vision-Language Integration as a Double-Grounding Case”, Proceedings of the AAAI Fall Symposium Series, Washington DC.
• Review and critique of the state of the art in AI
• The VLEMA prototype: a baseline for future research that will challenge current practices
