
Multiagent social learning in large repeated games


Presentation Transcript


  1. Multiagent social learning in large repeated games Jean Oh

  2. Outline: Motivation, Approach, Theoretical results, Empirical results, Conclusion. The story so far: if agents are short-sighted, selfish solutions can be suboptimal.

  3. Multiagent resource selection problem. A set of agents N = {agent_1, agent_2, ..., agent_n} share a set of resources A = {resource_1, resource_2, ..., resource_m}. In each state, agent i plays a strategy s_i; the agents' choices together form a strategy profile, and agent i incurs cost c_i(s_i, s_-i), which depends on its own strategy s_i and the strategies s_-i of everyone else. Individual objective: to minimize cost.
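
To make the cost model concrete, here is a minimal sketch in Python (the function name, example resources, and cost functions are illustrative, not from the talk): an agent's cost depends only on how many agents picked the same resource.

```python
from collections import Counter

# Minimal sketch of the cost model above (illustrative names): an agent's
# cost depends only on how many agents share its chosen resource.

def congestion_costs(profile, cost_fns):
    """profile[i] = resource chosen by agent i;
    cost_fns: dict mapping resource -> (load -> cost)."""
    load = Counter(profile)                        # agents per resource
    return [cost_fns[r](load[r]) for r in profile]

# Example: 4 agents, a load-dependent resource r1 and a constant-cost r2.
costs = congestion_costs(
    ["r1", "r1", "r2", "r1"],
    {"r1": lambda k: k / 4, "r2": lambda k: 1.0},
)
print(costs)              # [0.75, 0.75, 1.0, 0.75]
print(sum(costs) / 4)     # social welfare: the average cost of all agents
```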

  4. Congestion game: the congestion cost of a resource depends on the number of agents that have chosen that same resource.
  • Individual objective: to minimize congestion cost.
  • Social welfare: the average cost of all agents.
  • Selfish solution: the cost of every path becomes more or less indifferent, so no one wants to deviate from their current path.
  • "Selfish solutions" can be arbitrarily suboptimal [Roughgarden 2007].
  • An important subject in transportation science, computer networks, and algorithmic game theory.

  5. Example: inefficiency of the selfish solution. Metro vs. Driving [Pigou 1920, Roughgarden 2007]: n agents each choose between the metro, with constant cost 1, and driving, whose cost depends on the number of drivers. Objective: minimize the average cost.
  • Selfish solution: everyone drives, and the average cost = 1. Stationary algorithms (e.g., no-regret, fictitious play) converge here. But what about nonlinear cost functions?
  • Central administrator: route half the agents each way; optimal average cost = [n/2 × 1 + n/2 × ½]/n = ¾.
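
A quick numerical check of that arithmetic, assuming the driving cost is d/n for d drivers (which is what the ¾ computation above implies):

```python
# Numerical check of the Pigou example: driving costs d/n for d drivers,
# the metro a constant 1.

def average_cost(n, drivers):
    return (drivers * (drivers / n) + (n - drivers) * 1.0) / n

n = 100
print(average_cost(n, n))        # selfish solution: everyone drives -> 1.0
print(average_cost(n, n // 2))   # half metro, half driving -> 0.75
print(min(average_cost(n, d) for d in range(n + 1)))   # optimum: 0.75
```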

  6. If a few agents take an alternative route, everyone else is better off. We just need a few altruistic agents to make the sacrifice; any volunteers? "Excellent! As long as it's not me."

  7. Related work: coping with the inefficiency of selfish solutions (cf. Braess' paradox)
  • Increase resource capacity [Korilis 1999]
  • Redesign the network structure [Roughgarden 2001a]
  • Algorithmic mechanism design [Ronen 2000, Calliese & Gordon 2008]
  • Centralization [Shenker 1995, Chakrabarty 2005, Blumrosen 2006]
  • Periodic policies under the "homo-egualis" principle [Nowé et al. 2003]: take the worst-performing agent into consideration (to avoid inequality)
  • Collective Intelligence (COIN) with the Wonderful Life Utility (WLU) [Wolpert & Tumer 1999]
  • Altruistic Stackelberg strategies [Roughgarden 2001b]: (market) leaders make the first moves, hoping to induce desired actions from the followers; LLF mixes centralized and selfish agents
  • "Explicit coordination is necessary to achieve the system-optimal solution in congestion games" [Milchtaich 2004]
  Open question: can self-interested agents support a mutually beneficial solution without external intervention?

  8. Related work: non-stationary strategies. The classic device is an explicit threat, the grim-trigger strategy: "We'll be mutually beneficial as long as you stay; if you deviate, I'll punish you with your minimax value forever, whatever you do from then on." An agent's minimax value is the best agent i can get when the rest of the world turns against i. Problems with this approach:
  • Computational intractability, even with complete monitoring: NP-hard [Meyers 2006], NP-complete [Borgs et al. 2008]
  • "Significant coordination overhead"
  • Existing algorithms are limited to 2-player games [Stimpson 2001, Littman & Stone 2003, Sen et al. 2003, Crandall 2005]
  Question: can "self-interested" agents learn to support mutually beneficial solutions efficiently?
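
For intuition, a minimal sketch of the grim-trigger strategy described above (a generic repeated-game illustration, not code from the talk; the cooperative and punishing actions are supplied by the caller, and all agents are assumed to share the same cooperative action):

```python
# Minimal grim-trigger sketch: cooperate until any deviation is observed,
# then play the punishing (minimax) action forever.

class GrimTrigger:
    def __init__(self, cooperate, punish):
        self.cooperate = cooperate    # the mutually beneficial action
        self.punish = punish          # the minimax (punishment) action
        self.triggered = False

    def act(self):
        return self.punish if self.triggered else self.cooperate

    def observe(self, others_actions):
        # Requires complete monitoring: every agent's action must be visible.
        if any(a != self.cooperate for a in others_actions):
            self.triggered = True     # punish forever after any deviation
```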

  9. Approach: IMPRES (Implicit Reciprocal Strategy Learning)

  10. IMPRES assumptions. The other agents may be viewed as opponents, as sources of uncertainty, or as sources of knowledge; IMPRES treats them as "sources of knowledge". Agents may be symmetric or asymmetric in their abilities; IMPRES assumes they are "asymmetric".

  11. IMPRES intuition: social learning. Learn to act more rationally by giving a strategy to others, and learn to act more rationally by using a strategy given by others.

  12. IMPRES overview: 2-layered decision making. At the meta-layer, each agent decides whose strategy to follow: act as a strategist, a subscriber, or a solitary. At the inner layer, it decides which path to take; a strategist issues directions to its subscribers (e.g., agent k tells agents i and j "Take route 2"). Each agent takes its path in the environment, receives a congestion cost, and learns its strategies from that cost.

  13. IMPRES meta-learning: how to select an action from the set of meta-actions A. Agent i starts with A = {strategist, solitary}; subscriber actions are added as strategists are discovered. It keeps a Q value per meta-action (initialized to 0) and a meta-strategy s (initialized uniformly, e.g., 0.5/0.5) that puts more probability mass on low-cost actions. L is the strategist lookup table; a is the current meta-action, and the inner layer selects the path from P = {p1, ...}.
  LOOP:
  • p ← selectPath(a); take path p; receive congestion cost c
  • Update the Q value of action a using cost c: Q(a) ← (1 − α)Q(a) + α(MaxCost − c)
  • Pick a strategist at random from the lookup table L and add the corresponding subscriber action to A
  • Update the meta-strategy s
  • a ← select an action according to meta-strategy s; if a = strategist, L ← L ∪ {i}
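
A runnable Python sketch of one pass through that loop (the environment interface, the Q-proportional action choice, and the EPS smoothing constant are my assumptions; the Q update rule itself is from the slide):

```python
import random

ALPHA, MAX_COST, EPS = 0.1, 10.0, 1e-3

def meta_step(agent, lookup_table, env):
    a = agent["current"]                          # current meta-action
    path = agent["select_path"](a)                # inner layer picks a path
    c = env.cost(agent["id"], path)               # receive congestion cost
    Q = agent["Q"]
    Q[a] = (1 - ALPHA) * Q[a] + ALPHA * (MAX_COST - c)   # slide's Q update
    if lookup_table:                              # discover a strategist
        sigma = random.choice(lookup_table)
        Q.setdefault(("subscriber", sigma), 0.0)  # add a subscriber action
    # Meta-strategy s: more probability mass on low-cost (high-Q) actions.
    actions = list(Q)
    agent["current"] = random.choices(actions, [Q[x] + EPS for x in actions])[0]
    if agent["current"] == "strategist" and agent["id"] not in lookup_table:
        lookup_table.append(agent["id"])          # advertise self in L

# Minimal usage with a stub environment.
class StubEnv:
    def cost(self, agent_id, path):
        return random.uniform(0.0, MAX_COST)      # placeholder congestion cost

agent = {"id": 0, "current": "solitary",
         "Q": {"solitary": 0.0, "strategist": 0.0},
         "select_path": lambda a: "p1"}
meta_step(agent, [], StubEnv())
```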

  14. IMPRES inner-learning, for symmetric network congestion games. Let f be the number of subscribers to this strategy; when f = 0, there is no inner-learning. The strategist maintains a joint strategy for its f agents over the network edges (e.g., e1, e2, e3, e4):
  • Map the joint strategy to a path p for each subscriber; take path p; observe the number of agents on the edges of p
  • Predict the traffic each edge will receive from the other agents
  • Select the best joint strategy for the f agents (explore with small probability)
  • Shuffle the joint strategy among the f agents
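
A hedged sketch of one inner-learning step (the greedy assignment and the edge-cost interface are illustrative assumptions of mine; the slide specifies only the high-level steps above):

```python
import random

def inner_step(joint_strategy, paths, observed_edge_load, edge_cost, f,
               explore_prob=0.05):
    """joint_strategy: list of f paths (each a list of edges) taken last round.
    paths: candidate paths; observed_edge_load: dict edge -> agents observed.
    edge_cost(edge, load) -> congestion cost of the edge at that load."""
    # 1. Predict the traffic generated by the *other* agents on each edge.
    own = {}
    for p in joint_strategy:
        for e in p:
            own[e] = own.get(e, 0) + 1
    others = {e: load - own.get(e, 0) for e, load in observed_edge_load.items()}

    # 2. Select a new joint strategy greedily, exploring occasionally.
    def path_cost(p, planned):
        return sum(edge_cost(e, others.get(e, 0) + planned.get(e, 0) + 1)
                   for e in p)

    new_joint, planned = [], {}
    for _ in range(f):
        if random.random() < explore_prob:
            best = random.choice(paths)
        else:
            best = min(paths, key=lambda p: path_cost(p, planned))
        new_joint.append(best)
        for e in best:
            planned[e] = planned.get(e, 0) + 1

    # 3. Shuffle the joint strategy among the f subscribers.
    random.shuffle(new_joint)
    return new_joint
```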

  15. An IMPRES strategy is a non-stationary strategy built from two components:
  • A correlated strategy (exploit): the mutually beneficial strategies for a strategist and its subscribers
  • An independent strategy (explore): the solitary strategy
  An agent plays the subscriber (correlated) strategy while Cost(C) ≤ Cost(I), and falls back to the solitary (independent) strategy when Cost(C) ≥ Cost(I). Implicit reciprocity: the threat is simply a break from correlation.

  16. Non-stationary strategies: grim-trigger vs. IMPRES.
  • A grim-trigger strategy: play the correlated strategy while the other player obeys; after observing a deviator, play the minimax strategy whatever happens. Deterministic; requires perfect monitoring; intractable; significant coordination overhead.
  • An IMPRES strategy: play the correlated strategy (exploit) while Cost(C) ≤ Cost(I); switch to the independent strategy (explore) when Cost(C) ≥ Cost(I). Stochastic; works under imperfect monitoring; tractable; efficient coordination.

  17. Main result. The general belief: rational agents can support a mutually beneficial outcome (a correlated strategy) backed by an explicit threat (the minimax strategy). IMPRES: agents can support a mutually beneficial outcome without the explicit threat, backed instead by an implicit threat (the independent strategy).

  18. Empirical evaluation: quantifying "mutually beneficial" and "efficient". Two extremes frame the tradeoff:
  • Selfish solutions: congestion cost can be arbitrarily suboptimal; coordination overhead: none.
  • Centralized (1-to-n) solutions: congestion cost optimal; coordination overhead: high.
  IMPRES is evaluated on both axes, congestion cost and coordination overhead.

  19. Evaluation criteria
  • Individual rationality: minimax-safety
  • Social welfare: the average congestion cost of all agents, normalized for problem p as Cost(solution_p) / Cost(optimum_p)
  • Coordination overhead (size of subgroups) relative to a 1-to-n centrally administered system: overhead(solution_p) / overhead(max_p)
  • Agent demographics (based on meta-strategy), e.g., the percentage of solitaries, strategists, and subscribers
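
In code, the two normalized criteria are simple ratios (a sketch; measuring overhead as total subgroup membership relative to the 1-to-n maximum is my assumption):

```python
def cost_ratio(solution_cost, optimum_cost):
    return solution_cost / optimum_cost        # 1.0 means optimal

def overhead_ratio(subgroup_sizes, n):
    return sum(subgroup_sizes) / n             # 1.0 means fully centralized

print(cost_ratio(1.2, 1.0))          # e.g., a solution 20% above optimum
print(overhead_ratio([10, 5], 100))  # e.g., 15% of the 1-to-n overhead
```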

  20. Experimental setup
  • Number of agents: n = 100 (varied from 2 to 1000)
  • All agents use IMPRES (self-play)
  • Number of iterations: 20,000 to 50,000
  • Results averaged over 10 to 30 trials
  • Learning parameters: (table not preserved in this transcript)

  21. Metro vs. Driving (n = 100). [Charts: average congestion cost over time (the lower, the better) and agent demographics for metro vs. driving.] Free riders: agents that always drive.

  22. Problems with polynomial cost functions, average number of paths = 5 (data based on average cost after 20,000 iterations). Each problem is plotted with x = C(selfish solution)/C(optimum) and y = C(s)/C(optimum), where C(s) is the congestion cost of solution s. The line y = x is the selfish baseline [Fabrikant 2004]; y = 1 is the optimal baseline [Meyers 2006]. For the example problem shown: selfish solution 3, IMPRES 1.2, optimum 1.

  23. Coordination overhead (polynomial cost functions, average number of paths = 5). Each solution is plotted with x = congestion cost C(s)/C(optimum) and y = coordination overhead O(solution)/O(1-to-n solution), where o(s), the coordination overhead of solution s, is measured as average communication bandwidth. Lower is better on both axes: the optimum anchors the cost axis and the 1-to-n solution the overhead axis.

  24. On a dynamic population: 40 problems with mixed convex cost functions, average number of paths = 5 (data based on average cost after 50,000 iterations). In every i-th round, one randomly selected agent is replaced with a new one. Results are plotted with x = C(selfish solution)/C(optimum) and y = C(s)/C(optimum), against the selfish and optimal baselines as before.

  25. Summary of experiments
  • Symmetric network congestion games: well-known examples; linear, polynomial, exponential, and discrete cost functions
  • Scalability: number of alternative paths (|S| = 2 to 15); population size (n = 2 to 1000)
  • Robustness under a dynamic-population assumption
  • 2-player matrix games
  • Inefficiency of solutions, over 121 problems: selfish solutions 120% above optimum; IMPRES solutions 30% above optimum, at 25% of the coordination overhead of the 1-to-n model

  26. Contributions
  • Discovered social norms (strategies) that can support mutually beneficial solutions; investigated "social learning" in a multiagent context
  • Proposed IMPRES, a 2-layered learning algorithm: a significant extension of classical reinforcement learning models, and the first algorithm that learns non-stationary strategies for more than 2 players under imperfect monitoring
  • Demonstrated that IMPRES agents self-organize: every agent is individually rational (minimax-safety); social welfare improves roughly 4-fold over selfish solutions; coordination is efficient (overhead within 25% of the 1-to-n model)

  27. Future work
  • Short-term goals (more asymmetry): give strategists more incentive; individual thresholds (sightseers vs. commuters); weighing tradeoffs between multiple criteria; the free-rider problem
  • Long-term goals: establish the notion of social learning in the context of artificial agent learning; learning by copying the actions of others; learning by observing the consequences of other agents' actions

  28. Conclusion Rationally bounded agents adopting social learning can support mutually beneficial outcomes without the explicit notion of threat.

  29. Thank you.

  30. Selfish solutions
  • Nash equilibrium
  • Wardrop's first principle (a.k.a. user equilibrium): the travel times on all routes in use are equal, and less than the travel time a single user would experience on any route not in current use
  • Wardrop's second principle: the system-optimal solution, based on the average cost of all agents
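
As a worked illustration of the first principle, a small checker (my own formalization of the definition above, not from the talk), applied to the Metro vs. Driving example of slide 5:

```python
def is_wardrop_equilibrium(flows, travel_time, tol=1e-9):
    """flows: dict route -> number of users; travel_time(route, flow) -> time.
    Wardrop's first principle: all used routes have equal travel time, and no
    unused route would be faster for a single extra user."""
    used = {r: travel_time(r, f) for r, f in flows.items() if f > 0}
    if not used:
        return True
    t = next(iter(used.values()))
    if any(abs(v - t) > tol for v in used.values()):
        return False
    return all(travel_time(r, 1) >= t - tol
               for r, f in flows.items() if f == 0)

# Metro vs. Driving with n = 100: metro cost 1, driving cost d/100.
time = lambda route, d: 1.0 if route == "metro" else d / 100
print(is_wardrop_equilibrium({"metro": 0, "drive": 100}, time))   # True
print(is_wardrop_equilibrium({"metro": 50, "drive": 50}, time))   # False
```

Note that the selfish solution (everyone drives) passes the check while the socially optimal half/half split fails it, which is exactly the inefficiency the talk addresses.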
