Presentation Transcript

A reinforcement learning scheme for a multi-agent card game: learning a POMDP

Hajime Fujita, Yoichiro Matsuno, and Shin Ishii

1. Nara Institute of Science and Technology

2. Ricoh Co. Ltd.

3. CREST, Japan Science and Technology Corporation

With adaptations by L. Schomaker for the KI2 course


Contents

  • Introduction

  • Preparation

    • Card game “Hearts”

    • Outline of our RL scheme

  • Proposed method

    • State transition on the observation state

    • Mean-field approximation

    • Action control

    • Action predictor

  • Computer simulation results

  • Summary

2003 IEEE International Conference on SMC


Background

  • Games are well-defined test-beds for studying reinforcement learning (RL) schemes in various multi-agent environments; these are completely observable problems:

    • Black Jack (A. Perez-Uribe and A. Sanchez, 1998)

    • Othello (T. Yoshioka, S. Ishii and M. Ito, 1999)

    • Backgammon (G. Tesauro, 1994)

    • also: the game of Go, graduation project of Reindert-Jan Ekker

  • What about partially observable problems?

    • estimate missing information?

    • predict environmental behaviors?



Research field: Reinf. Learning

A challenging study:

  • An RL scheme applicable to a multi-agent environment that is partially observable

  • The card game “Hearts” (Hartenjagen)

    • Multi-agent (four players) environment

      • Objective is well-defined

    • Partially Observable Markov Decision Process (POMDP)

      • Cards in opponents’ hands are unobservable

    • Realistic problem

      • Huge state space

      • Number of unobservable variables is large.

      • Competitive game with four agents




[Card images: the queen of spades carries 13 penalty points; each heart carries 1 penalty point]

Card game “Hearts”

  • Hearts is a 4-player game (multi-agent environment).

  • Each player has 13 cards at the beginning of the game (partially observable)

  • Each player plays a card clockwise

  • Particular cards have penalty points (a scoring sketch follows this list)

    • Objective: to score as few penalty points as possible.

  • Players must contrive strategies to avoid these penalty cards (competitive situation)
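A small illustration of the standard Hearts scoring the slide refers to (each heart costs 1 penalty point, the queen of spades costs 13); the function name and card encoding are just for this sketch:

def penalty_points(cards_taken):
    """cards_taken: iterable of (rank, suit) pairs, e.g. ('Q', 'spades')."""
    points = 0
    for rank, suit in cards_taken:
        if suit == 'hearts':
            points += 1            # each heart: 1 penalty point
        elif (rank, suit) == ('Q', 'spades'):
            points += 13           # queen of spades: 13 penalty points
    return points

# Example: a haul of three hearts plus the queen of spades costs 16 points
print(penalty_points([('2', 'hearts'), ('K', 'hearts'),
                      ('A', 'hearts'), ('Q', 'spades')]))   # -> 16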



Outline of our RL scheme

  • Agent (player) predicts opponents' actions using an acquired environmental model

"The next player will probably not discard a spade. So my best action is …"

  • Computable by brute force? No!

    • size of the search space

    • unknown utility of actions

    • unknown opponent strategies

  • Instead, opponents' actions are predicted using the acquired environmental model

    • how? by estimating the unobservable part, reinforcement learning, and training on simulated games




Proposed method

  • State transition on the observation state

  • Mean-field approximation

  • Action control

  • Action predictor


State transition on the observation state

  • The state transition on the observation state in the game can be calculated from the following quantities (a reconstructed formula follows this list):

    x  observation (cards in hand + cards on the table)

    a  action (card to be played)

    s  state (all observable and unobservable cards)

    Φ  strategies of each of the opponents

    Ht  history of all x and a until time t

    K  knowledge of the game
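The equation itself was an image on the slide and did not survive extraction. A plausible reconstruction, based on the definitions above and the verbal description two slides below (a sketch, not necessarily the authors' exact notation):

P(x_{t+1} \mid x_t, a_t, H_t, K) \;=\; \sum_{s_t} P(s_t \mid H_t, K) \sum_{a_t^{1}, a_t^{2}, a_t^{3}} \Big[ \prod_{i=1}^{3} P(a_t^{i} \mid s_t, \Phi^{i}, H_t) \Big]\, P(x_{t+1} \mid s_t, a_t, a_t^{1}, a_t^{2}, a_t^{3})

where the last factor is 1 for the card configuration that actually results from the joint actions and 0 otherwise.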



Examples

  • a: "play the two of hearts"

  • s:

    • [unobservable part]

      • East holds cards u, v, w, …, z

      • West holds cards a, b, …

      • North holds cards r, s, …

    • [observable part = x]

      • I hold cards f, g, …

      • the cards k, l, … lie on the table

  • Ht: {{s0,a0}West, {s1,a1}North, …, {st,at}East}



State transition on the observation state

  • The state transition on the observation state in the game can be calculated by the formula above. In words: the probability of a particular hand and of the cards played at t+1 is the product of

    {the sum of the probabilities of all possible card distributions, given the history at t and game knowledge K}

    with

    {the sum of the products of the probabilities of all possible actions of opponents 1-3, given each opponent's strategy and the history}



State transition on the observation state

Summation over all states … (?) … We need an approximation.

  • The state transition on the observation state for the game of Hearts can in principle be calculated this way, but:

  • The calculation is intractable

    • Hearts has a very large state space:

    • the number of ways to deal 52 cards over 4 players so that each holds 13 cards (worked out below)
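The number on the slide did not survive extraction, but the count it describes is the standard multinomial:

\frac{52!}{13!\,13!\,13!\,13!} \;=\; \binom{52}{13}\binom{39}{13}\binom{26}{13} \;\approx\; 5.4 \times 10^{28}

so an explicit sum over all card distributions is out of the question.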



Mean-field approximation

  • Calculate the mean estimated observation state for the opponent agent.

    • An estimated observation state for an opponent i is a weighted sum over the observations x_t, weighted by their probability given an action, the history (and game knowledge K)

    • these (partial) probabilities become known during the game

  • The transition probability is then approximated by substituting this mean observation state,

    • so that the conditional probability distribution over the actions of opponent i can be determined, i.e., given that opponent's estimated "unobservable state" (see the sketch after this list)
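The approximation equations were images on the slides; the following is a sketch of what the description above suggests (notation assumed, not the authors' exact formulation):

\bar{x}_t^{\,i} \;=\; \sum_{x_t^{i}} x_t^{i}\; P(x_t^{i} \mid x_t, a_t, H_t, K) \qquad \text{(mean observation state of opponent } i\text{)}

P(a_t^{i} \mid x_t^{i}, \Phi^{i}, H_t) \;\approx\; P(a_t^{i} \mid \bar{x}_t^{\,i}, \Phi^{i}, H_t)

so the intractable sum over all consistent card distributions is replaced by a single evaluation at the mean.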



Action control: TD Reinforcement Learning

  • An action is selected based on the expected TD error (the defining equation was shown on the slide; a reconstruction follows this list)

  • Using the expected TD error, the action selection probability is then computed (also sketched below)
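Both equations on this slide were images. A hedged reconstruction of one standard way to realize the two bullets, assuming a Boltzmann (softmax) selection rule with temperature T (an assumption, not stated in the transcript):

\delta(x_t, a_t) \;=\; r(x_t, a_t) \;+\; \gamma \sum_{x_{t+1}} \tilde{P}(x_{t+1} \mid x_t, a_t)\, V(x_{t+1}) \;-\; V(x_t)

P(a_t \mid x_t) \;=\; \frac{\exp\big(\delta(x_t, a_t)/T\big)}{\sum_{a'} \exp\big(\delta(x_t, a')/T\big)}

where \tilde{P} is the mean-field-approximated transition probability from the previous slides and V is the learned value of an observation state.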



Action prediction

  • We use a function approximator (NGnet, a normalized Gaussian network) for the utility function, which is likely to be non-linear (a minimal sketch follows this list)

  • Function approximators can be trained on past games
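A minimal sketch of a normalized Gaussian network forward pass, just to illustrate the kind of approximator the slide names; the unit count, diagonal covariances, and local linear models are illustrative assumptions, not the authors' implementation.

import numpy as np

class NGnet:
    def __init__(self, centers, widths, weights, biases):
        self.centers = centers   # (M, D) Gaussian centers
        self.widths = widths     # (M, D) per-dimension standard deviations
        self.weights = weights   # (M, D) weights of the local linear models
        self.biases = biases     # (M,)  offsets of the local linear models

    def __call__(self, x):
        # Unnormalized Gaussian activations G_i(x)
        z = (x - self.centers) / self.widths
        g = np.exp(-0.5 * np.sum(z * z, axis=1))
        p = g / np.sum(g)                        # normalized gating weights
        local = self.weights @ x + self.biases   # local linear predictions
        return float(p @ local)                  # softly gated combination

# Tiny usage example: 2-dimensional input, 3 Gaussian units, random parameters
rng = np.random.default_rng(0)
net = NGnet(rng.normal(size=(3, 2)), np.ones((3, 2)),
            rng.normal(size=(3, 2)), np.zeros(3))
print(net(np.array([0.5, -0.2])))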




Summary of proposed method

  • RL scheme based on

    • Estimation of unobservable state variables

    • Prediction of opponent agents’ actions

  • Estimation of unobservable state variables by mean-field approximation

  • Learning agent determines its action based on the predicted behavior of the environment




Computer simulations

  • Rule-based agent

  • Single agent learning in a stationary environment

  • Learning by multiple agents in a multi-agent environment


Computer simulations

  • Three experiments to evaluate the learning agent, using a rule-based agent as opponent

    • Single agent learning in a stationary environment

      • (A) learning agent, rule-based agent x3

    • Learning by multiple agents in a multi-agent environment

      • (B) learning agent, actor-critic agent, rule-based agent x2

      • (C) learning agent x2, rule-based agent x2

  • The rule-based agent has more than 50 rules and plays Hearts at an "experienced" level.



[Results chart (experiment A): average penalty ratio as a function of the number of games, for the proposed RL agent playing against three rule-based agents; a lower penalty ratio means a better player.]


[Results chart (experiment B): average penalty ratio as a function of the number of games, for the proposed RL agent, an actor-critic agent, and two rule-based agents; a lower penalty ratio means a better player.]


[Results chart (experiment C): average penalty ratio as a function of the number of games, for two proposed RL agents playing against two rule-based agents; a lower penalty ratio means a better player.]


Summary

  • We proposed an RL scheme for making an autonomous learning agent that plays the multi-player card game "Hearts".

  • Our RL agent estimates unobservable state variables using a mean-field approximation, and learns to predict environmental behaviors.

  • Computer simulations showed that our method is applicable to a realistic multi-agent problem.




Nara Institute of Science and Technology (NAIST)

Hajime FUJITA

hajime-f@is.aist-nara.ac.jp

http://hawaii.aist-nara.ac.jp/~hajime-f/