
Extensible Language Interface for Robot Manipulation


Presentation Transcript


  1. Extensible Language Interface for Robot Manipulation
     Jonathan Connell (Exploratory Computer Vision Group), Etienne Marcheret (Speech Algorithms & Engines Group), Sharath Pankanti (IBM Yorktown), Michiharu Kudoh (IBM Tokyo), Risa Nishiyama (IBM Tokyo)
     [Title slide]

  2. Much of “Intelligence” Based on Two Illusions
     • Animal part = mobility, perception, and reaction
       • People flock around robots and readily anthropomorphize them
       • Real-world action seems to convey a feeling of “aliveness”
       • Responsiveness to changes in environment conveys sense of “mind”
       • Key point in the embodied / situated agents viewpoint
     • Human part = learning by being told
       • Bulk of human knowledge contained in culture, largely passed verbally
       • No one discovers how to cook macaroni and cheese – someone explains
       • Lack of communication makes even people (e.g. foreigners) seem less “human”
     • Goal is to “fuse” these two parts into a harmonious whole
       • Analogy to a Turing machine
       • Core is a simple finite state machine controller (= language interpreter)
       • Addition of tape vastly increases computational power (= learning from language)

  3. Required Innate Mechanisms
     • Segmentation
       • Division of the world into spatial regions (partial segmentation okay)
       • Positive space regions are objects, people, and surfaces
       • Negative space regions are places and passages
     • Comparison
       • Objects have properties like color and size that differ
       • Objects have relations to other objects, such as position
     • Actions
       • Operators can be indexed to operate on certain objects
       • Most have expected continuation and / or end conditions
     • Time
       • Physical motions have expected durations
       • Actions can be sequenced based on completion
       • More complex actions can be built from simpler ones
     • Language interpretation ties into all these pre-existing (animal) abilities
       • Nouns, adjectives, prepositions, verbs, adverbs, conjunctions
     (A sketch of these mechanisms as simple data types follows below.)
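The slide lists these mechanisms conceptually. Purely as a minimal sketch, and assuming hypothetical Python types that are not part of the original system, segmented regions with comparable properties and actions with end conditions might be represented like this:

    # Sketch of the innate scaffolding as simple data types: segmented regions
    # with comparable properties, plus actions with end conditions and expected
    # durations that language can later bind to. All names are illustrative.
    from dataclasses import dataclass, field
    from typing import Callable, Dict

    @dataclass
    class Region:
        kind: str                    # "object", "person", "surface", "place", "passage"
        color: str = ""              # comparable property
        size: float = 0.0            # comparable property
        relations: Dict[str, "Region"] = field(default_factory=dict)  # e.g. {"left_of": other}

    @dataclass
    class Action:
        name: str                        # verb the action will be bound to
        run: Callable[[Region], None]    # operator indexed to a target region
        done: Callable[[Region], bool]   # expected end condition
        expected_duration: float = 1.0   # seconds, used for sequencing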

  4. ELI: A Fetch-and-Carry Robot
     Example dialog (illustrating command following, verb learning, noun learning, and advice taking):
       “Round up my mug.”
       “I don’t know how to ‘round up’ your mug.”
       “Walk around the house and look for it. When you find it bring it back to me.”
       “I don’t know what your ‘mug’ looks like.”
       “It is like this <shows another mug> but sort of orange-ish.”
       “OK … I could not find your mug.”
       “Try looking on the table in the living room.”
       “OK … Here it is!”
     • Use speech, language, and vision to learn objects & actions
       • But not from the lowest level, like “what is a word” or “what visual properties signal an object”
       • Build in as much as is practical
       • Save learning for terms not knowable a priori
         • Names for particular items or rooms in a house
         • How to perform special tasks like “clean up”
     • Potential use in an eldercare scenario – a service dog with less slobber
     (A minimal sketch of the unknown-verb / unknown-noun dispatch follows below.)
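In the dialog, unknown verbs and nouns become learning opportunities rather than errors. The following is a minimal sketch of that dispatch, assuming toy vocabularies and a hypothetical respond() function; the real ELI grammar and responses are far richer than this.

    # Minimal sketch of the unknown-verb / unknown-noun dispatch in the dialog:
    # gaps in the vocabulary trigger clarification requests rather than failure.
    # The vocabularies and responses here are toy placeholders.
    known_verbs = {"walk", "look", "bring", "grab"}
    known_nouns = {"table", "living room"}

    def respond(verb, noun):
        if verb not in known_verbs:
            return f'I don\'t know how to "{verb}" your {noun}.'    # verb learning opportunity
        if noun not in known_nouns:
            return f'I don\'t know what your "{noun}" looks like.'  # noun learning opportunity
        return "OK ..."

    print(respond("round up", "mug"))
    print(respond("bring", "mug"))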

  5. Capabilities Illustrated Through 4 Part Video
     [Photo of the tabletop setup, with labels: camera, arm, OTC medications (Advil & Gaviscon)]
     • Arm and camera removed from robot and mounted on table
     • Simplifies problem by reducing the degrees of freedom

  6. Multi-Modal Interaction (video part 1) • Features: • Automatically finds objects • Selects by position, size, color • Grabs selected object • Understands pronoun reference • Can ask clarifying questions • Handles user pointing • Robot points for emphasis
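One feature on this slide is selecting objects by position, size, and color. The sketch below uses an assumed Obj record and selection predicates (not from the source) to show how such attribute filters could narrow a scene to a single referent, with leftover ambiguity signalling that a clarifying question is needed.

    # Sketch of attribute-based object selection ("the big red one on the left");
    # the Obj fields and predicates are assumptions, not the robot's actual API.
    from dataclasses import dataclass

    @dataclass
    class Obj:
        x: float      # normalized image position, 0 = left, 1 = right
        area: float   # normalized size
        color: str

    def select(objs, color=None, biggest=False, leftmost=False):
        cand = [o for o in objs if color is None or o.color == color]
        if biggest and cand:
            cand = [max(cand, key=lambda o: o.area)]
        if leftmost and cand:
            cand = [min(cand, key=lambda o: o.x)]
        return cand[0] if len(cand) == 1 else None   # ambiguity -> ask a clarifying question

    scene = [Obj(0.2, 0.10, "red"), Obj(0.7, 0.25, "red"), Obj(0.5, 0.15, "blue")]
    print(select(scene, color="red", biggest=True))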

  7. Noun Learning Scenario (video part 2)
     • Visual model = size + shape + colors
     • Matching = nearest neighbor, dist = Σ w[i] * | v[i] – m[i] |  (a worked sketch follows below)
     • Features:
       • Builds visual models
       • Adds new nouns to grammar
       • Identifies objects from models
       • Passes object to/from user
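Only the weighted L1 nearest-neighbor formula comes from the slide; the feature layout and weight values below are illustrative assumptions. This is a minimal sketch of matching a seen object against stored visual models.

    # Weighted L1 nearest-neighbor matching: dist = sum(w[i] * |v[i] - m[i]|).
    # Feature layout and weights are illustrative, not from the source.
    import numpy as np

    def weighted_l1(v, m, w):
        """Weighted L1 distance between feature vector v and stored model m."""
        return float(np.sum(w * np.abs(v - m)))

    def match_object(v, models, w):
        """Return the model name with the smallest weighted L1 distance to v."""
        return min(models, key=lambda name: weighted_l1(v, models[name], w))

    # Hypothetical feature order: [size, shape, color1, color2, color3]
    w = np.array([1.0, 2.0, 0.5, 0.5, 0.5])
    models = {
        "mug":    np.array([0.30, 0.80, 0.9, 0.4, 0.1]),
        "bottle": np.array([0.45, 0.20, 0.1, 0.8, 0.2]),
    }
    seen = np.array([0.32, 0.75, 0.85, 0.45, 0.15])
    print(match_object(seen, models, w))   # -> "mug"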

  8. Once objects have names, more properties are available
     [Architecture diagram: the Eli Robot at Watson and the Brainy Response System at Tokyo, connected over a network; components include Vision, Objects, Archive, Visual models, ASR, Parser, Vocabulary, Reasoning, Semantic memory, Lifelog, Action models, Talk, Kinematics, Sequencer; links carry context updates, retrievals, vetoes, and recommendations]
     • Oversee operation of physical robot to provide more intelligent action
     • Could envision a similar extension using the RoboEarth online resource
     (A stub of this pipeline appears below.)
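A stub of the processing chain suggested by the architecture diagram, assuming each component reduces to a single placeholder function; the actual ASR, parser, reasoning service, sequencer, and kinematics modules are of course much more involved.

    # Stub pipeline: ASR -> parser -> (cloud) reasoning -> sequencer -> kinematics.
    # Every stage is a placeholder; the names mirror boxes on the slide.
    def asr(audio):        return "grab the orange mug"                     # speech to text
    def parse(text):       return {"verb": "grab", "object": "orange mug"}  # text to intent
    def reason(intent):    return intent                                    # hook for vetoes / recommendations
    def sequence(intent):  return ["reach", "close_gripper", "lift"]        # intent to motion steps
    def kinematics(steps): print("executing:", steps)                       # steps to arm motion

    kinematics(sequence(reason(parse(asr(b"...")))))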

  9. Manipulation with Intelligent Backend (video part 3)
     [Diagram: ontology links antacid to Tums (present) and Rolaids (requested); lifelog history for “Alice” (7:14 AM xxxxx, 8:39 AM zzzzz, 9:01 AM took Tylenol); DB answers NO to an aspirin request]
     • Features:
       • Vetoes actions based on DB
       • Picks alternates using ontology
       • Checks for valid dose interval
       • Real-time cloud connection
     • Gavagai problem
     (A sketch of these checks follows below.)
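The three checks listed above (DB veto, ontology-based alternates, dose interval) might look roughly like the sketch below. The medication names echo the slide, but the timestamps, interval, and fetch() interface are assumptions for illustration only.

    # Sketch of the intelligent-backend checks: veto via the lifelog, offer an
    # in-ontology alternative, and enforce a minimum dose interval.
    # Data values and the 4-hour interval are illustrative, not from the source.
    from datetime import datetime, timedelta

    ontology = {"antacid": ["Tums", "Rolaids"]}           # category -> members
    present  = {"Tums", "Tylenol"}                        # what is actually on the table
    lifelog  = {"Tylenol": datetime(2013, 1, 1, 9, 1)}    # last-taken times
    MIN_INTERVAL = timedelta(hours=4)

    def fetch(requested, now):
        if requested in lifelog and now - lifelog[requested] < MIN_INTERVAL:
            return f"Veto: {requested} was taken too recently."
        if requested not in present:
            # pick an alternate from the same ontology category, if one is visible
            for category, members in ontology.items():
                if requested in members:
                    for alt in members:
                        if alt in present:
                            return f"No {requested} here; offering {alt} ({category})."
            return f"Cannot find {requested}."
        return f"Fetching {requested}."

    print(fetch("Rolaids", datetime(2013, 1, 1, 10, 0)))
    print(fetch("Tylenol", datetime(2013, 1, 1, 10, 0)))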

  10. Verb Learning Scenario (video part 4)
      [Diagram: learned “poke” sequence = point 1.0, out 1.0, out -1.0]
      • Features:
        • Learns action sequences
        • Handles relative motion commands
        • Responds to incremental positioning
        • Applies new actions to other objects
      (A sketch of storing such a sequence follows below.)
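A minimal sketch of verb learning as a stored sequence of relative-motion primitives, mirroring the “poke” = point 1.0, out 1.0, out -1.0 example on the slide. The primitive names and the teach()/perform() interface are assumptions, not the robot's actual API.

    # Verb learning as a named list of (primitive, amount) motion steps.
    learned_verbs = {}

    def teach(name, steps):
        """Record a new verb as a sequence of (primitive, amount) pairs."""
        learned_verbs[name] = steps

    def perform(name, target):
        """Replay a learned verb, indexed to whatever object is the target."""
        for primitive, amount in learned_verbs[name]:
            print(f"{primitive}({amount}) toward {target}")

    # "poke" = point at the object, move out, then pull back
    teach("poke", [("point", 1.0), ("out", 1.0), ("out", -1.0)])
    perform("poke", "the red block")   # the new verb now applies to any seen object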

  11. ELI Arm Demos Video
      Also available on YouTube: http://www.youtube.com/watch?v=M2RXDI3QYNU

  12. Summary of Abilities
      • Perception
        • Automatically detects and counts visual objects
        • Understands colors, sizes, and overall positions
      • Action
        • Can successfully reach for seen objects
        • Can grasp and deposit objects in the real world
      • Language
        • Parses and responds appropriately to speech commands
        • Understands pointing and uses pointing itself
        • Properly interprets object-passing interactions
      • Reasoning
        • Knows its limitations about what it can see, reach, and grab
        • Asks clarifying questions when there are ambiguities
        • Can alter actions based on known facts, histories, and ontologies
      • Learning
        • Acquires new visual object models and corresponding words
        • Can be verbally taught a named sequence of indexical actions
      • Differences from some AGI work
        • Complete approach attacking the core problem (language as tape)
        • Concrete, physical, and implemented system (all integrated)

  13. Extensions
      • What is still missing?
        • Acquiring new data by observation & interaction
        • Filling in holes in learned representations & procedures
        • Fixing inaccuracies in taught knowledge
      • Free the robot from top-down imperatives!
        • Initiative – a smart assistant will look for answers itself
        • Improvisation – if something does not match perfectly, try a variation
        • Experiential learning – better to pick up a cup by the rim instead of the base
