A reinforcement learning scheme for a multi-agent card game: learning a POMDP

Hajime Fujita, Yoichiro Matsuno, and Shin Ishii. Affiliations: 1. Nara Institute of Science and Technology; 2. Ricoh Co. Ltd.; 3. CREST, Japan Science and Technology Corporation

Presentation Transcript


  1. A reinforcement learning scheme for a multi-agent card game: learning a POMDP. Hajime Fujita, Yoichiro Matsuno, and Shin Ishii. 1. Nara Institute of Science and Technology 2. Ricoh Co. Ltd. 3. CREST, Japan Science and Technology Corporation. With modifications by L. Schomaker for KI2

  2. Contents • Introduction • Preparation • Card game “Hearts” • Outline of our RL scheme • Proposed method • State transition on the observation state • Mean-field approximation • Action control • Action predictor • Computer simulation results • Summary 2003 IEEE International Conference on SMC

  3. Background: completely observable problems • Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments • Black Jack (A. Perez-Uribe and A. Sanchez, 1998) • Othello (T. Yoshioka, S. Ishii and M. Ito, 1999) • Backgammon (G. Tesauro, 1994) • also: the game of Go, graduation project of Reindert-Jan Ekker 2003 IEEE International Conference on SMC

  4. Background: completely observable problems • Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments • Black Jack (A. Perez-Uribe and A. Sanchez, 1998) • Othello (T. Yoshioka, S. Ishii and M. Ito, 1999) • Backgammon (G. Tesauro, 1994) • What about partially observable problems? • estimate missing information? • predict environmental behaviors? 2003 IEEE International Conference on SMC

  5. Research field: Reinforcement Learning (a challenging study) • RL scheme applicable to a multi-agent environment which is partially observable • The card game “Hearts” (Hartenjagen) • Multi-agent (four players) environment • Objective is well-defined • Partially Observable Markov Decision Process (POMDP) • Cards in opponents’ hands are unobservable • Realistic problem • Huge state space • The number of unobservable variables is large • Competitive game with four agents 2003 IEEE International Conference on SMC

  6. Card game “Hearts” • Hearts is a 4-player game (multi-agent environment). • Each player has 13 cards at the beginning of the game (partially observable) • Each player plays a card clockwise • Particular cards carry penalty points (the queen of spades: 13 penalty points; each heart: 1 penalty point) • Objective: to score as few points as possible. • Players must contrive strategies to avoid these penalty cards (competitive situation) 2003 IEEE International Conference on SMC

  7. Outline of learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • “The next player will probably not discard a spade. So my best action is …” 2003 IEEE International Conference on SMC

  8. Outline of learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • “The next player will probably not discard a spade. So my best action is …” • Computable by brute force? 2003 IEEE International Conference on SMC

  9. Outline of learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • “The next player will probably not discard a spade. So my best action is …” • Computable by brute force? No! • size of the search space • unknown utility of actions • unknown opponent strategies 2003 IEEE International Conference on SMC

  10. Outline of Reinf. Learning scheme • Agent (player) predicts opponents’ actions using acquired environmental model • “The next player will probably not discard a spade. So my best action is …” • Predicted using acquired environmental model 2003 IEEE International Conference on SMC

  11. Outline of our RL scheme • Agent (player) predicts opponents’ actions using acquired environmental model • “The next player will probably not discard a spade. So my best action is …” • Predicted using acquired environmental model … (how?) … • estimate unobservable part, reinforcement learning, simulated game training 2003 IEEE International Conference on SMC

  12. Proposed method • State transition on the observation state • Mean-field approximation • Action control • Action predictor

  13. State transition on the observation state • State transition on the observation state in the game can be calculated by: 2003 IEEE International Conference on SMC

  14. State transition on the observation state • State transition on the observation state in the game can be calculated by a formula over the following quantities: • x: observation (cards in hand + cards on the table) • a: action (card to be played) • s: state (all observable and unobservable cards) • Φ: strategies of each of the opponents • H_t: history of all x and a until time t • K: knowledge of the game 2003 IEEE International Conference on SMC

  15. Examples • a: “play the two of hearts” • s: • [unobservable part] • East has cards u,v,w,…,z • West has cards a,b,… • North has cards r,s,… • [observable part = x] • I have cards f,g,… • cards k,l,… are lying on the table • H_t: {{s0,a0}west, {s1,a1}north, …, {st,at}east} 2003 IEEE International Conference on SMC

  16. State transition on the observation state • State transition on the observation state in the game can be calculated by: The probability of a particular hand and of the cards played at t+1 is: the product of {the sum of the probabilities of all possible card distributions, given the history at time t and game knowledge K} with {the sum of the products of the probabilities of all possible actions of opponents 1-3, given each opponent’s strategy and the history} 2003 IEEE International Conference on SMC
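The transition formula itself is an image on the original slides and is missing from this transcript. A hedged reconstruction from the legend on slide 14 and the verbal description above (the exact notation and conditioning in the paper may differ; a_t^1, a_t^2, a_t^3 denote the three opponents' actions and φ^1..φ^3 their strategies):

```latex
P\!\left(x_{t+1} \mid x_t, a_t, H_t, K\right)
  \;\approx\;
  \sum_{s_t} P\!\left(s_t \mid H_t, K\right)
  \sum_{a_t^1, a_t^2, a_t^3} \prod_{i=1}^{3} P\!\left(a_t^i \mid s_t, \phi^i, H_t\right)
```

Here the inner summation would be restricted to those opponent-action combinations that are consistent with the next observation x_{t+1}.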

  18. State transition on the observation state • State transition on the observation state can be calculated by the formula above • but the calculation is intractable: summation over all states … (?) … needs approximation • Hearts has a very large state space • an enormous number of states! 2003 IEEE International Conference on SMC

  19. State transition on the observation state • State transition on the observation state in the game of Hearts can be calculated by the formula above • but the calculation is intractable: summation over all states … (?) … needs approximation • Hearts has a very large state space • about 5.4 × 10^28 states: the number of ways to distribute 52 cards over 4 players so that each has 13 cards 2003 IEEE International Conference on SMC
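As a quick sanity check of the card-distribution count above, a small Python snippet (illustrative only, not from the slides):

```python
from math import comb

# Ways to deal 52 distinct cards into four hands of 13 cards each:
# choose 13 for the first player, 13 of the remaining 39 for the second,
# 13 of the remaining 26 for the third; the last 13 cards are forced.
deals = comb(52, 13) * comb(39, 13) * comb(26, 13)
print(f"{deals:.2e}")  # ~5.36e+28 possible initial card distributions
```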

  20. Mean-field approximation • Calculate the mean estimated observation state for each opponent agent. • An estimated observation state for an opponent i is a weighted sum over the probabilities of the observations x_t, given an action, a history (and game knowledge K) • the (partial) probabilities become known during the game 2003 IEEE International Conference on SMC

  21. Mean-field approximation • Calculate the mean estimated observation state for the opponent agent. • The transition probability is approximated using the mean observation state 2003 IEEE International Conference on SMC

  22. Mean-field approximation • Calculate the mean estimated observation state for the opponent agent. • The transition probability is approximated using the mean observation state • so that the conditional probability distribution over the actions of opponent i can be determined, i.e., given that opponent’s estimated “unobservable state” 2003 IEEE International Conference on SMC
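The approximation formulas are shown only as images on the original slides. A hedged sketch based on the description above, writing x̄_t^i for the mean observation state estimated for opponent i (this notation is an assumption, not taken from the paper):

```latex
\bar{x}_t^{\,i} \;=\; \sum_{x_t^i} x_t^i \, P\!\left(x_t^i \mid a_t, H_t, K\right),
\qquad
P\!\left(a_t^i \mid s_t, \phi^i, H_t\right) \;\approx\; P\!\left(a_t^i \mid \bar{x}_t^{\,i}, \phi^i, H_t\right)
```

With this approximation, the transition probability from the previous slides can be evaluated using a single mean observation state per opponent, instead of a summation over all possible unobservable states.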

  23. Action control: TD Reinforcement Learning • An action is selected based on the expected TD error • Using the expected TD error, the action selection probability is given as sketched below 2003 IEEE International Conference on SMC
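The expected-TD-error and action-selection formulas are likewise missing from the transcript. A hedged reconstruction, assuming a Boltzmann (softmax) selection rule with temperature T over the expected TD error, a state-value function V, discount factor γ and reward r_t (these symbols are assumptions, not read off the slides):

```latex
\delta(x_t, a_t) \;=\;
  \mathbb{E}\!\left[\, r_t + \gamma\, V(x_{t+1}) - V(x_t) \;\middle|\; x_t, a_t \,\right],
\qquad
P(a_t \mid x_t) \;=\;
  \frac{\exp\!\big(\delta(x_t, a_t)/T\big)}
       {\sum_{a'} \exp\!\big(\delta(x_t, a')/T\big)}
```

The expectation over x_{t+1} would be taken under the approximated transition model described on the previous slides.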

  24. Action prediction • We use a function approximator (NGnet) for the utility function, which is likely to be non-linear • Function approximators can be trained by using past games 2003 IEEE International Conference on SMC
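The slide does not show how the NGnet (Normalized Gaussian network) is structured. Below is a minimal, illustrative Python/NumPy sketch of such a function approximator; the class name, fixed centers and widths, and randomly initialized linear weights are assumptions for illustration, whereas the actual NGnet in the paper would be fitted (for instance with an EM-type procedure) to data from past games.

```python
import numpy as np


class NGnet:
    """Minimal Normalized Gaussian network:
    y(x) = sum_i N_i(x) * (W_i @ x + b_i),
    where the N_i(x) are Gaussian basis activations normalized to sum to one,
    so each unit contributes a local linear model, softly gated by its basis."""

    def __init__(self, centers, widths, out_dim, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.centers = np.asarray(centers, dtype=float)   # (M, D) basis centers
        self.widths = np.asarray(widths, dtype=float)     # (M,) isotropic variances
        M, D = self.centers.shape
        self.W = rng.normal(scale=0.1, size=(M, out_dim, D))  # per-unit linear maps
        self.b = np.zeros((M, out_dim))                       # per-unit biases

    def _activations(self, x):
        d2 = ((x - self.centers) ** 2).sum(axis=1)  # squared distances to all centers
        g = np.exp(-0.5 * d2 / self.widths)         # unnormalized Gaussian responses
        return g / g.sum()                          # normalized (softmax-like) gates

    def predict(self, x):
        n = self._activations(x)                            # (M,)
        local = np.einsum("mod,d->mo", self.W, x) + self.b  # per-unit linear outputs
        return n @ local                                    # gated combination, (out_dim,)


# Example: approximate a scalar utility over a 4-dimensional feature vector.
net = NGnet(centers=np.random.randn(20, 4), widths=np.ones(20), out_dim=1)
print(net.predict(np.zeros(4)))
```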

  25. Summary of proposed method • RL scheme based on • Estimation of unobservable state variables • Prediction of opponent agents’ actions • Estimation of unobservable state variables by mean-field approximation • Learning agent determines its action based on the prediction of environmental behaviors 2003 IEEE International Conference on SMC

  26. Computer simulations • Rule-based agent • Single agent learning in a stationary environment • Learning by multiple agents in a multi-agent environment

  27. Computer simulations • Three experiments to evaluate the learning agent, using a rule-based agent • Single agent learning in a stationary environment • (A) learning agent, rule-based agent x3 • Learning by multiple agents in a multi-agent environment • (B) learning agent, actor-critic agent, rule-based agent x2 • (C) learning agent x2, rule-based agent x2 • A rule-based agent has more than 50 rules and plays Hearts at an “experienced” level. 2003 IEEE International Conference on SMC

  28. [Result plot: average penalty ratio vs. number of games, for the proposed RL agent against three rule-based agents; a lower ratio means a better player] 2003 IEEE International Conference on SMC

  29. [Result plot: average penalty ratio vs. number of games, for the proposed RL agent against an actor-critic agent and two rule-based agents; a lower ratio means a better player] 2003 IEEE International Conference on SMC

  30. [Result plot: average penalty ratio vs. number of games, for two proposed RL agents against two rule-based agents; a lower ratio means a better player] 2003 IEEE International Conference on SMC

  31. Summary • We proposed an RL scheme for making an autonomous learning agent that plays the multi-player card game “Hearts”. • Our RL agent estimates unobservable state variables using a mean-field approximation, and learns to predict environmental behaviors. • Computer simulations showed that our method is applicable to a realistic multi-agent problem. 2003 IEEE International Conference on SMC

  32. Nara Institute of Science and Technology (NAIST) Hajime FUJITA hajime-f@is.aist-nara.ac.jp http://hawaii.aist-nara.ac.jp/~hajime-f/
