CONFUCIUS: An Intelligent MultiMedia Storytelling Interpretation and Presentation System

Minhua Eunice Ma
Supervisor: Prof. Paul Mc Kevitt
School of Computing and Intelligent Systems, Faculty of Engineering, University of Ulster, Magee

Outline
• Related research
• Overview of CONFUCIUS
• Automatic generation of 3D animation
• Semantic representation
• Natural language processing
• Current state of implementation
• Relation to other work
• Conclusion & Future work
Faculty Research Student Conference, Jordanstown, 15 Jan 2004

Related research
• 3D visualisation
  • Virtual humans & embodied agents: Jack, Improv, BEAT
  • MultiModal interactive storytelling: AesopWorld, KidsRoom, Larsen & Petersen’s Interactive Storytelling, computer games
  • Automatic Text-to-Graphics Systems: WordsEye, CD-based language animation
• Related research in NLP
  • Lexical semantics
  • Levin’s verb classes
  • Jackendoff’s Lexical Conceptual Structure
  • Schank’s scripts

Objectives of CONFUCIUS
• To interpret natural language sentences/stories and to extract conceptual semantics from the natural language
• To generate 3D animation and virtual worlds automatically from natural language
• To integrate 3D animation with speech and non-speech audio, to form an intelligent multimedia storytelling system
[Figure: CONFUCIUS input and output. Labels: storywriter/playwright; story in natural language; movie/drama script; tailored menu for script input; CONFUCIUS; 3D animation; speech (dialogue); non-speech audio; user/story listener.]

Architecture of CONFUCIUS
[Figure: architecture diagram. Labels: natural language stories; script writer; script parser; Natural Language Processing; language knowledge (LCS lexicon, grammar); semantic representations; knowledge base (prefabricated objects); mapping; visual knowledge (3D graphic library); 3D authoring tools, existing 3D models & character models; animation generation; Text To Speech; sound effects; synchronizing & fusion; 3D world with audio in VRML.]

Software & Standards
• Java
  • parsing semantic representation
  • changing VRML code to add/modify animation
  • integrating modules
• Natural language processing tools
  • Connexor Machinese FDG parser (morphological and syntactic parsing)
  • WordNet (lexicon, semantic inference)
• 3D graphic modelling
  • Existing 3D models (virtual human/object) on Internet
  • Authoring tools
    • Humanoid characters: Character Studio
    • Props & stage: 3D Studio Max
    • Narrator: Microsoft Agent
• Modelling language & standard
  • VRML 97 for modelling geometry of objects, props, environment
  • H-Anim specifications for humanoid modelling

Agents and Avatars: How much autonomy?
• Autonomous agents have higher requirements for sensing, memory, reasoning, planning, behaviour control & emotion (sense-emotion-control-action structure)
• “User-controlled” avatars require fewer autonomous actions; basic naïve physics such as collision detection and reaction is still required
• A virtual character in non-interactive storytelling sits between agents and avatars: its behaviours, emotions, and responses to the changing environment are described in the story input
[Figure: virtual humans placed on a low-to-high scale of autonomy & intelligence. Labels: avatars; interface agents; characters in non-interactive storytelling; autonomous agents.]

Graphics library
[Figure: structure of the graphics library. Labels: objects/props; characters; motions; simple geometry files; geometry & joint hierarchy files (H-Anim); animation library (key frames); instantiation.]

Level of Articulation (LOA) of H-Anim
• CONFUCIUS adopts LOA1 in human animation
• The animation engine adds ROUTEs dynamically based on H-Anim’s joints & animation keyframes
• CONFUCIUS’ human animation can be adapted to other LOAs
[Figure: joints and segments of LOA1.]
[Figure: example site nodes on hands, for pushing objects and holding objects.]
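
The dynamic ROUTE generation described on this slide could look roughly like the Java sketch below. It is not the CONFUCIUS animation engine: the class `RouteSketch`, its methods, and the DEF-naming scheme (`hanim_` plus the joint name, and a `_Interp` suffix for interpolators) are assumptions made for illustration. Only the VRML 97 names (`TimeSensor.fraction_changed`, `OrientationInterpolator` with `key`, `keyValue`, `set_fraction`, `value_changed`) and the H-Anim Joint `rotation` field come from the standards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: emits VRML 97 text that animates one H-Anim joint.
class RouteSketch {

    // One keyframe: a time fraction in [0,1] and an axis-angle rotation.
    static final class KeyFrame {
        final double fraction;
        final double[] rotation; // x, y, z, angle (radians)

        KeyFrame(double fraction, double x, double y, double z, double angle) {
            this.fraction = fraction;
            this.rotation = new double[] {x, y, z, angle};
        }
    }

    // Builds an OrientationInterpolator node holding the joint's keyframes.
    static String interpolator(String joint, List<KeyFrame> frames) {
        StringBuilder keys = new StringBuilder();
        StringBuilder values = new StringBuilder();
        for (int i = 0; i < frames.size(); i++) {
            KeyFrame f = frames.get(i);
            if (i > 0) {
                keys.append(", ");
                values.append(", ");
            }
            keys.append(fmt(f.fraction));
            for (int j = 0; j < f.rotation.length; j++) {
                if (j > 0) {
                    values.append(' ');
                }
                values.append(fmt(f.rotation[j]));
            }
        }
        return "DEF " + joint + "_Interp OrientationInterpolator {\n"
             + "  key [ " + keys + " ]\n"
             + "  keyValue [ " + values + " ]\n"
             + "}";
    }

    // Builds the two ROUTEs: timer -> interpolator -> joint rotation.
    // Assumes the joint was declared as "DEF hanim_<joint> Joint { ... }".
    static List<String> routes(String timer, String joint) {
        List<String> out = new ArrayList<>();
        out.add("ROUTE " + timer + ".fraction_changed TO "
                + joint + "_Interp.set_fraction");
        out.add("ROUTE " + joint + "_Interp.value_changed TO hanim_"
                + joint + ".set_rotation");
        return out;
    }

    private static String fmt(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }
}
```

Calling `RouteSketch.routes("Timer", "l_shoulder")` yields the two ROUTE lines for the left shoulder. Appending them to the world file alongside the interpolator is one way to add an animation without touching the character model itself.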

Semantic representations

Lexical Visual Semantic Representation
• Lexical Visual Semantic Representation (LVSR): semantic representation between language syntax and 3D models
• LVSR is based on Jackendoff’s LCS, adapted to the task of language visualization (enhanced with Schank’s scripts)
• Ontological categories: OBJ, HUMAN, EVENT, STATE, PLACE, PATH, PROPERTY
  • OBJ: props/places (e.g. buildings)
  • HUMAN: human beings and other articulated animated characters (e.g. animals), as long as their skeleton hierarchy is defined
  • EVENT: actions, movements and manners
  • STATE: static existence
  • PROPERTY: attributes of OBJ/HUMAN
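
To make the ontological categories concrete, here is a minimal Java sketch of a typed tree that could hold an LVSR-like structure. The seven category names are taken from the slide. Everything else is an assumption: the class `LvsrSketch`, the `Node` shape, the bracketed print format (loosely modelled on LCS notation), and the particular analysis of the sentence "John put a cup on the table" are illustrative only and are not the system's actual representation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical data structure for an LVSR-like semantic tree.
class LvsrSketch {

    // The seven ontological categories listed on the slide.
    enum Category { OBJ, HUMAN, EVENT, STATE, PLACE, PATH, PROPERTY }

    // A typed node: a category, a head word or predicate, and arguments.
    static final class Node {
        final Category category;
        final String head;
        final List<Node> args;

        Node(Category category, String head, Node... args) {
            this.category = category;
            this.head = head;
            this.args = Arrays.asList(args);
        }

        // Prints the tree in a bracketed form (assumed notation).
        @Override
        public String toString() {
            if (args.isEmpty()) {
                return "[" + category + " " + head + "]";
            }
            String inner = args.stream()
                    .map(Node::toString)
                    .collect(Collectors.joining(", "));
            return "[" + category + " " + head + " (" + inner + ")]";
        }
    }

    // Illustrative analysis of "John put a cup on the table".
    static Node putExample() {
        Node table = new Node(Category.OBJ, "table");
        Node onTable = new Node(Category.PLACE, "on", table);
        Node toOnTable = new Node(Category.PATH, "to", onTable);
        return new Node(Category.EVENT, "put",
                new Node(Category.HUMAN, "john"),
                new Node(Category.OBJ, "cup"),
                toOnTable);
    }
}
```

Keeping the category on every node lets a later stage dispatch on it, for example sending HUMAN nodes to character animation and OBJ nodes to prop instantiation.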

PATH & PLACE predicates
• interpret spatial movement of OBJ/HUMANs
• 62 common English prepositions
• 7 PATH predicates & 11 PLACE predicates
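
Since 62 prepositions reduce to 18 predicates, several prepositions must share a predicate. The Java sketch below shows one way such a many-to-one lookup could be organised. The slide does not list the predicate inventory, so the entries here (`TO`, `FROM`, `TOWARD`, `ON`, `IN`, `UNDER`, in the style of Jackendoff's path and place functions) are assumptions for illustration, as are the class and method names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical preposition-to-predicate table (illustrative entries only).
class PrepositionSketch {

    enum Kind { PATH, PLACE }

    static final class Predicate {
        final Kind kind;
        final String name;

        Predicate(Kind kind, String name) {
            this.kind = kind;
            this.name = name;
        }

        @Override
        public String toString() {
            return kind + ":" + name;
        }
    }

    private static final Map<String, Predicate> TABLE = new LinkedHashMap<>();

    static {
        // Assumed entries; not the actual CONFUCIUS inventory.
        TABLE.put("to", new Predicate(Kind.PATH, "TO"));
        TABLE.put("from", new Predicate(Kind.PATH, "FROM"));
        TABLE.put("toward", new Predicate(Kind.PATH, "TOWARD"));
        TABLE.put("towards", new Predicate(Kind.PATH, "TOWARD")); // variant, same predicate
        TABLE.put("on", new Predicate(Kind.PLACE, "ON"));
        TABLE.put("in", new Predicate(Kind.PLACE, "IN"));
        TABLE.put("under", new Predicate(Kind.PLACE, "UNDER"));
    }

    // Returns the predicate for a preposition, or empty if it is not covered.
    static Optional<Predicate> lookup(String preposition) {
        return Optional.ofNullable(TABLE.get(preposition.toLowerCase(Locale.ROOT)));
    }
}
```

Returning `Optional` makes the uncovered case explicit, so a caller has to decide what to do with a preposition outside the table rather than silently animating nothing.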

NLP in CONFUCIUS
[Figure: NLP pipeline diagram. Processing stages: pre-processing; part-of-speech tagger; morphological parser; syntactic parser; semantic inference; coreference resolution; temporal reasoning. Tools and resources: Connexor FDG parser; WordNet; LCS database; FEATURES. Other labels: disambiguation; lexical temporal relations; post-lexical temporal relations.]

Visual valency & verb ontology
2.2.1. Human action verbs
  2.2.1.1. One visual valency (the role is a human, (partial) movement)
    2.2.1.1.1. Biped kinematics: arm actions (wave, scratch), leg actions (walk, jump, kick), torso actions (bow), combined actions (climb)
    2.2.1.1.2. Facial expressions & lip movement, e.g. laugh, fear, say, sing, order
  2.2.1.2. Two visual valency (at least one role is human)
    2.2.1.2.1. One human and one object (vt. or vi. + instrument), e.g. throw, push, kick, open, eat, drink, bake, trolley
    2.2.1.2.2. Two humans, e.g. fight, chase, guide
  2.2.1.3. Visual valency ≥ 3 (at least one role is human)
    2.2.1.3.1. Two humans and one object (inc. ditransitive verbs), e.g. give, show
    2.2.1.3.2. One human and 2+ objects (vt. + object + implicit instr./goal/theme), e.g. cut, write, butter, pocket, dig, cook
  2.2.1.4. Verbs without distinct visualisation when out of context: verbs of trying, helping, letting, creating/destroying
  2.2.1.5. High level behaviours (routine events), political and social activities, e.g. interview, eat out (go to restaurant), go shopping
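
The valency-based branches of this ontology (2.2.1.1 to 2.2.1.3) can be read as a decision on how many roles must be shown and how many of them are human. The Java sketch below encodes that reading. It assumes the role list for each verb is already known, for example from a lexicon; the class `ValencySketch`, the `RoleType` enum, and the returned labels are illustrative names, not part of CONFUCIUS.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical classifier over the visual-valency branches of the ontology.
class ValencySketch {

    enum RoleType { HUMAN, OBJECT }

    // A verb together with the roles that must appear on screen.
    static final class VerbEntry {
        final String verb;
        final List<RoleType> roles;

        VerbEntry(String verb, RoleType... roles) {
            this.verb = verb;
            this.roles = Arrays.asList(roles);
        }

        int visualValency() {
            return roles.size();
        }

        long humans() {
            return roles.stream().filter(r -> r == RoleType.HUMAN).count();
        }
    }

    // Maps an entry to the ontology section it falls under.
    static String classify(VerbEntry e) {
        int valency = e.visualValency();
        long humans = e.humans();
        long objects = valency - humans;
        if (humans == 0) {
            return "outside 2.2.1 (no human role)";
        }
        if (valency == 1) {
            return "2.2.1.1";
        }
        if (valency == 2) {
            return humans == 2 ? "2.2.1.2.2" : "2.2.1.2.1";
        }
        if (humans == 2 && objects == 1) {
            return "2.2.1.3.1";
        }
        if (humans == 1 && objects >= 2) {
            return "2.2.1.3.2";
        }
        return "2.2.1.3";
    }
}
```

For instance, "give" with roles HUMAN, HUMAN, OBJECT lands in 2.2.1.3.1, while "cut" with a human, the thing cut, and an implicit instrument lands in 2.2.1.3.2. Sections 2.2.1.4 and 2.2.1.5 depend on context and scripts rather than role counts, so they are deliberately left out.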

Level-of-Detail (LOD): basic-level verbs & troponyms
[Figure: verb hierarchy rooted at EVENT, in three levels. Event level verbs: …, go, cause. Manner level verbs: …, walk, climb, run, jump. Troponym level verbs: limp, stride, trot, swagger, jog, romp, skip, bounce, hop.]
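
One plausible use of such a hierarchy is to fall back to a more general verb when no animation exists for a specific one. The slide does not say that CONFUCIUS does this, so the Java sketch below should be read as an assumption. The parent links shown (limp under walk, jog under run, hop under jump, and the manner verbs under go) follow ordinary troponymy but are also assumed, because the extracted slide does not preserve the tree's edges.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical level-of-detail fallback over a small verb hierarchy.
class VerbLodSketch {

    // child verb -> more general parent verb (assumed edges).
    private static final Map<String, String> PARENT = new HashMap<>();

    static {
        PARENT.put("walk", "go");
        PARENT.put("run", "go");
        PARENT.put("jump", "go");
        PARENT.put("limp", "walk");
        PARENT.put("jog", "run");
        PARENT.put("hop", "jump");
    }

    // Returns the most specific verb at or above 'verb' that has an
    // animation, or null if none of its ancestors is animated.
    static String resolve(String verb, Set<String> animated) {
        String current = verb;
        while (current != null) {
            if (animated.contains(current)) {
                return current;
            }
            current = PARENT.get(current);
        }
        return null;
    }
}
```

With animations available only for "walk" and "run", `resolve("limp", ...)` returns "walk": the character still moves, just without the manner detail carried by the troponym.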

Current status of implementation
• Collision detection example (contact verbs: hit, collide, scratch, touch)
  • The car collided with a wall.
  • using ParallelGraphics’ VRML extension: object-to-object collision
  • non-speech sound effects
• H-Anim examples:
  • 3 visual valency verbs
    • John put a cup of coffee on the table.
    • H-Anim Site node
    • locative tags of object (on_table tag for table object)
  • 2 visual valency verbs
    • John pushed the door.
    • John ate the bread.
    • Nancy sat on the chair.
  • 1 visual valency verbs
    • The waiter came to me: “Can I help you, Sir?”
    • speech modality & lip synchronization
    • camera direction (avatar’s point-of-view)

Relation to other work
• Domain-independent general purpose humanoid character animation
• CONFUCIUS’ character animation focuses on the language-to-humanoid animation process rather than considering human modelling & motion solely
• Implementable semantic representation LVSR connecting linguistic semantics to visual semantics & suitable for action execution (animation)
• Categorization and visualisation of eventive verbs based on visual valency
• Reusable common sense knowledge base to elicit implied actions, instruments, goals, themes underspecified in language input

Conclusion & Future work
• Humanoid animation explores problems in language visualization & automatic animation production
• Formalizes meaning of action verbs and spatial prepositions
• Maps language primitives to visual primitives
• Reusable common sense knowledge base for other systems
• Further work
  • Discourse level interpretation
  • Action composition for simultaneous activities
  • Verbs concerning multiple characters’ synchronization & coordination (e.g. introduce)
Prospective applications
• Children’s education
• Multimedia presentation
• Movie/drama production
• Computer games
• Virtual Reality
