Cobot: A Social Reinforcement Learning Agent

Charles Lee Isbell, Jr. • Christian R. Shelton • Michael Kearns • Satinder Singh • Peter Stone • Presented by Josh Waxman

Applications of RL • Control • Game playing • Optimization Recently • Human-computer interaction • Previous systems encounter humans one at a time • E.g. spoken dialog systems • Challenges • Data sparsity • Inevitable violations of the Markov property • Irreproducibility of experiments (happening in a MOO) • Variability in users’ understanding of how Cobot works • Drift of users’ desires, inconsistency of reward • Choosing an appropriate state space

LambdaMOO (λμ) • MUD – Multi-User Dungeon • A class of online worlds with roots in text-based multiplayer role-playing games. • Virtual world, oft created by participants • Users choose characters to represent them • Mechanisms of social interaction reinforce illusion that user is present in the virtual space • MOO –Multi-user Object Oriented - MUD that uses an object-oriented programming language to manipulate objects in the virtual world • Complex, open ended, multiuser chat environment, populated by a community of human users with rich and often enduring social relationships.

LambdaMOO (λμ) (2) • Interconnected rooms • Rooms contain users and objects that can move between them • Each room has chat channel (people in a room can talk to each other) • Room (and objects) has text description that gives it a “look and feel”

Verbs and Speech in λμ Users can talk and also have a series of verbs, allowing rich set of actions and expression of emotional states • Buster is overwhelmed by all these deadlines. • Buster begins to slowly tear his hair out, one strand at a time. • HFh comforts Buster. [standard verb comfort] • HFh [to Buster]: Remember, the mighty oak was once a nut like you. • Buster [to HFh]: Right, but his personal growth was assured. Thanks anyway, though. • Buster feels better now. verbs speech

LambdaMOO (λμ) (3) • Rooms created by users • Descriptions • Control access by other users • Can create objects • 4836 Active User Accounts • 118,154 objects • Oldest continuously operated MUD • Founded in 1990 • Good environment for AI experiments, including learning

Cobot • Cobot is an RL-based agent for LambdaMOO • Long-term goal: to build an agent that can learn to perform useful, interesting and entertaining actions in LambdaMOO on the basis of user feedback.

Cobot (2) • Originally a Social Statistics Agent • Tracked how frequently, and in what ways, users interact • Provided these statistics as a service • Rudimentary chatting capabilities • Reactive: did not initiate interaction • Very popular with LambdaMOO users

Cobot (3) • Modifications • Not just reactive. + Proactive • Take actions under own initiative: • Propose conversation topics • Introduce users • Word play • Hope: will eventually take unprompted actions that are meaningful, useful or amusing to users.

Reinforcement Learning • In RL, decision making by agents in an uncertain environment is often modeled as a Markov Decision Process (MDP). • An MDP assumes the environment has the Markov property: the agent need only look at the current state to make a decision. • At time t, the agent senses the environment's state s and chooses an action a from A, the set of actions available in state s. • The action causes a change in the environment, and the agent receives a scalar reward from the environment.

Reinforcement Learning (2) • Goal: maximize expected rewards over some time horizon • A policy π maps a state s and an action a to the probability of taking action a from state s: π(s, a) → p(s, a) • π*: the optimal policy • A value function is a function of states (V) or state-action pairs (Q) that tells how good it is to be in a specific state, where goodness is defined in terms of expected future return. • Qπ(s, a), the action-value function for policy π, is the expected return when taking action a from state s and afterwards following policy π.

Reinforcement Learning (3) • π* denotes the optimal policy, whose value function Q is greater than or equal to that of any other policy for all states s and actions a. • Q*: the optimal action-value function • Most RL algorithms use the agent's experience in its environment to approximate π*, by learning Q* • The learned value function is used to choose actions stochastically, so that in each state, actions with higher value are chosen with higher probability. • Many RL algorithms use function approximators (parametric representations of complex value functions) both to map state-action features to their values and to map states to distributions over actions (i.e., the policy).
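The definitions on these three slides can be written compactly as follows. This is a notational sketch only: the symbol R_t for the return following time t is an assumption introduced here, and the slides do not say whether the return is discounted.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s,\ a_t = a \right],
\qquad
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)
\quad \text{for all states } s \text{ and actions } a .
```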

Linear Function Approximator • Used a linear function approximator: for each state feature, maintain a vector of real-valued weights indexed by the possible actions • Positive weight: the feature increases the probability of taking that action • Negative weight: the feature decreases it
[Figure: weight vectors for State feature 1 and State feature 2, each with one entry per action, Action 1 through Action 9]
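A minimal sketch of such a linear approximator, assuming each action's value is the sum of the active features' weights for that action, scaled by the feature values. The feature names, weights and three-action setup below are hypothetical illustrations, not values from Cobot.

```python
from typing import Dict, List


def action_values(weights: Dict[str, List[float]],
                  features: Dict[str, float]) -> List[float]:
    """Return one value per action: sum over features of value * weight."""
    n_actions = len(next(iter(weights.values())))
    values = [0.0] * n_actions
    for name, x in features.items():
        # Each feature owns a weight vector indexed by action.
        for a, w in enumerate(weights[name]):
            values[a] += x * w
    return values


# Hypothetical example: two features, three actions.
weights = {
    "bias": [0.0, 0.5, -0.5],
    "quiet_room": [1.0, 0.0, -1.0],
}
features = {"bias": 1.0, "quiet_room": 0.5}
values = action_values(weights, features)
```

A feature whose value is zero contributes nothing, so only the features that are active in the current state affect the result.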

Reactive Actions (Social Statistics) Tells the questioner various facts about herself, including the verbs she likes to use most, and the verbs most often directed at her. • who are your playmates: Lists those who interact with Cobot the most. • who loves me: Tells the questioner those with whom she interacts the most. • who acts like me: Tells the questioner users who perform actions similar to hers. • who does not act like me: Tells the questioner users who do not perform actions similar to hers. • relate me to user: Indicates how the questioner interacts with user, providing a ranking on each other’s list of playmates, how many playmates they have in common, and how similarly they act. • tell me my verb-buddies: Tells the questioner the users she uses verb on and who use verb on her. • who is spammy: Lists those users who generate and are the targets of the most verbs.
http://cobot.research.att.com/papers/cobot.pdf

Example HFh [to cobot]: relate me to Buster cobot whispers, “Here are your relationships with Buster. You like to use: - (62%), poke (7%), hug (3%), eye (3%), nod (2%), hi5, h5, zap, comfort, and grin on each other. Buster is ranked #14 on your list of playmates. You are ranked #1 on Buster’s list. Your socializing overlap is 75.4% and your playmate overlap is 33.7%. Your actions have a similarity of 95.9% but ignoring common speech verbs it’s 58.3%. Others act on you with a similarity of 96.6% but ignoring common speech verbs it’s 81.9%.”

Technicolor Guest [to cobot]: What’s your religion? cobot [to Technicolor Guest]: technology unless you REALLY believe in that religion yourself and find that it arouses a deep, strong, genuine response in many other people. … cobot [to DragonBoi]: The line of conflict should be drawn between the mass of the people and the power-holding elite of industrial society. DragonBoi eyes cobot warily. cobot [to DragonBoi]: You have something cute in your eye. DragonBoi [to HFh]: you’re bot is turning into a communist. DragonBoi [to cobot]: so what are you gonna do about it? DragonBoi [to HFh]: did you see what it told me? you are turning it into a communist! Wikked [to cobot]: are you a communist? DragonBoi [to cobot]: are you now, or ever been a member of the communist party? cobot mumbles something in binary. DragonBoi [to cobot]: so you’re taking the fifth are you? well this is a special commision, that won’t work here!

9 Proactive Actions (back to chart) • Null Action Choose to remain silent for this time period. • Topic Starters (4) Introduce a conversational topic. Cobot declares that he wants to discuss sports or politics, or he utters a sentence from either the sports section or political section of the Boston Globe. • Roll Call (2) Initiate a “roll call,” a common word play routine in LambdaMOO. For example, someone may declare that she is tired of Monica Lewinsky by announcing “TIRED OF LEWINSKY ROLL CALL.” Each user feeling the same will agree with the roll call. Cobot initiates a roll call by taking a recent utterance, and extracting either a single noun, or a verb phrase. These are treated as two separate RL actions. • Social Commentary Make a comment describing the current social state of the Living Room, such as “It sure is quiet” or “Everyone here is friendly.” These statements are based on Cobot’s statistics from recent activity. Several different utterances possible, but they are treated as a single action for RL purposes. • Introductions Introduce two users who have not yet interacted with one another in front of Cobot.

Actions (2) • These actions were chosen to fit in with what goes on in LambdaMOO. So as not to irritate. • Most common routines • Conversation • Wordplay • Emoting • Infinite range of actions since based on utterance from recent conversation (ROLL CALL) or from Boston Globe online

Reinforcement Learning • At set time intervals, Cobot chooses an action according to a distribution based on Q values in current state. • Rewards and punishments between time t and t+1 apply to action at time t. • Possible erroneous reward/punishment – if user rewarded a reactive rather than proactive action = noise in training process
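One common way to turn Q values into a distribution where higher-valued actions are chosen more often is a softmax. The sketch below assumes a softmax with a temperature parameter; the slides do not state which distribution Cobot actually uses, and the function names are hypothetical.

```python
import math
import random
from typing import List, Optional


def softmax(q_values: List[float], temperature: float = 1.0) -> List[float]:
    """Map Q values to probabilities; higher Q gives higher probability."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(q_values: List[float],
                  temperature: float = 1.0,
                  rng: Optional[random.Random] = None) -> int:
    """Sample an action index according to the softmax distribution."""
    rng = rng or random.Random()
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With equal Q values every action is equally likely, which matches the behaviour one would expect before any feedback has been received.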

Feedback Actions • Explicit • reward and punish verbs • give a numeric training signal to Cobot • immediate feedback applies to the current state and action • also backed up to previous states and actions • Implicit • standard LambdaMOO verbs • e.g. hug, kiss, spank, spit • numerically weaker than explicit feedback

Train for individual user or community? • Design Choice • Train for the entire community • Or for each individual user • Combine value functions for those present • Thus, like several RL processes in parallel, each with a different state space • Why? • If which users are present were just stored as another state feature, Cobot would have to learn the primacy of this feature on its own • Learning should be fast and significant. If users don’t see that they influenced Cobot’s behavior, they will be discouraged • Curse of dimensionality: the size of the state space increases exponentially with the number of state features. Don’t want to represent the presence/absence of ~250 users; maintain a small state space to speed up learning • Certain users interact much more often with Cobot than others. Don’t want their input to dwarf the impact of others.

State space for generic user • Social Summary Vector (4) • rate at which the user produces events • rate at which events produced by others are directed at the user • % of other users who are among the user’s “playmates” • % of other users for whom the user is a playmate • Playmate = among the top 10 users one interacts with • Mood Vector: recent use of eight groups of common words • e.g. grin and smile in a single group • Rates Vector: rate at which events are produced by the users present, including Cobot • Current Room: which room Cobot is currently in • Roll Call Vector • Has the saved roll call text been used by Cobot before • Has someone done a roll call since the last time Cobot did a roll call • Has there been a roll call since the last time Cobot grabbed text • Bias Vector: always on; means the user is present

State space for a single user is too complex to model with a table representation • Linear function approximator used for each user • Mix the policies of the users present
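A minimal sketch of mixing per-user policies, assuming the mixture is a plain average in which every user present gets equal weight. The slide does not specify the mixing rule, so equal weighting is an assumption made here (it is one way to keep a frequent trainer from dwarfing others); the user names and probabilities are hypothetical.

```python
from typing import Dict, List


def mix_policies(policies: Dict[str, List[float]]) -> List[float]:
    """Average the action distributions of all users present, equally weighted."""
    if not policies:
        raise ValueError("no users present")
    n_users = len(policies)
    n_actions = len(next(iter(policies.values())))
    mixed = [0.0] * n_actions
    for dist in policies.values():
        for a, p in enumerate(dist):
            mixed[a] += p / n_users
    return mixed


# Hypothetical example: two users present, two actions.
present = {"user_a": [1.0, 0.0], "user_b": [0.5, 0.5]}
mixed = mix_policies(present)
```

Because each input is a probability distribution and the weights sum to one, the mixture is itself a valid distribution over actions.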

Experimental Procedure • Cobot in LambdaMOO since Sept 1999 • RL Cobot in May 2000 • Cobot is real working system with real human users, conducted experiment in this context • Launched RL functionality in Living Room • Cobot logged RL-related data from May 10 – October 10, 2000 • States visited, actions taken, rewards from each user, params of value function, etc. • 63123 RL actions taken (+ reactive actions) • 3171 reward, punishment events • From 254 users

Findings • Inappropriateness of average reward • Successful RL would ordinarily show an increase in average reward over time

Not because users are more dissatisfied as Cobot learns • Humans are fickle; preferences change over time (indeed, novelty is highly valued in LambdaMOO) • popular, exciting → irritating • Trying to hit (learn) a moving target • So perhaps average reward shouldn’t be the primary measure of performance • Users with fixed preferences • Tend to give less reward/punishment feedback once Cobot has learned their preferences accurately (well enough) • Didn’t mention that users get bored • Typical RL assumes reward and punishment are given consistently • M and S: dedicated users. Explore other measures later

Users M and S

Findings • Small set of dedicated parents • 254 users • 218 gave 20 or fewer • 15 gave 50 or more • Many had a passing interest; a few were willing to invest significant time to teach their preferences to Cobot • M: 594, S: 69

Findings • Some parents have strong opinions • For the majority of users, the policy learned was close to a uniform distribution • Policies are dependent on state, but for most users this dependence was weak, hence the near-uniform distribution • Most users did not provide enough feedback, and maybe were not consistent and strong in the feedback they did provide • For a small group, Cobot did learn a non-uniform policy • M, S: policies relatively independent of state; other users: not as dramatic, but non-uniform • Makes sense: if a user does not like sports, it does not matter what room they are in or what other users are doing • M: likes roll call; Cobot selects it with prob 0.99. S: likes social commentary; Cobot selects it with prob 0.38 (S interacted less, at 69)

Findings Cobot learns matching policies Policy for user M reflects empirical pattern of rewards over time

Action 6: roll call (see earlier chart; recall M likes roll call) • Blue bars: average reward given by User M for each action {note: relative, see 8} • Yellow bars: policy learned for User M • Red bars: empirical frequency at which the action was taken

Findings • Cobot responds to dedicated parents • Users who train him have a strong impact. Cobot shifts towards M’s preferences when M is present. [Of course! No one else trained him, so here is where reward/punishment will have the most impact. Need only say this because so few actually trained him.] • Some preferences depend on state • Deduce which features are relevant to a given user • By construction, the bias feature is independent of state (always on) • (All weights are initialized to 0, so only features with nonzero weights contribute. A feature is relevant if its weight vector is far from both the bias feature’s weight vector and the all-0 vector)
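The relevance test in the last bullet can be sketched as below. The slide only says "far from" the bias weight vector and the all-0 vector; using Euclidean distance and a fixed threshold are both assumptions made for illustration, and the function names are hypothetical.

```python
import math
from typing import List


def distance(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_relevant(feature_w: List[float],
                bias_w: List[float],
                threshold: float) -> bool:
    """A feature counts as relevant if it is far from both the bias
    weight vector and the all-zero vector (threshold is an assumption)."""
    zero = [0.0] * len(feature_w)
    return (distance(feature_w, bias_w) > threshold
            and distance(feature_w, zero) > threshold)
```

A feature still at its initial all-zero weights has never been trained, and one that merely mirrors the bias vector adds no state-specific information, so both are excluded.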

Findings – some do in fact rely on state

Conclusions Reported on efforts to apply RL in a complex human online social environment (a MOO) where many of the standard assumptions (stationary rewards, Markovian behavior, appropriateness of average reward) are clearly violated. We feel that the results obtained with Cobot so far are compelling, and offer promise for the application of RL in such open-ended social settings. Cobot continues to take RL actions and receive rewards and punishments from LambdaMOO users, and we plan to continue and embellish this work as part of our overall efforts on Cobot.
