Temporal Difference Learning with Expectimax Search for the Threes-bot


National Chiao Tung University

Department of Computer Science

Computer Games and Intelligence (CGI) Lab

Advisor: I-Chen Wu

Author: Han Chiang


Reference

“Threes!”, http://asherv.com/threes/

“Taiwan 2048 Bot”, http://2048-botcontest.twbbs.org/

“CGI-2048”, http://2048.aigames.nctu.edu.tw/replay.php

“Threesus!”, http://blog.waltdestler.com/2014/04/threesus.html

A. L. Zobrist, “A New Hashing Method with Application for Game Playing,” Technical Report #88, April 1970.

B. W. Ballard, “The *-Minimax Search Procedure for Trees Containing Chance Nodes.”

M. Szubert and W. Jaśkowski, Institute of Computing Science, Poznan University of Technology, Poznan, Poland, “Temporal Difference Learning of N-Tuple Networks for the Game 2048,” CIG 2014.

J. Baxter, A. Tridgell, and L. Weaver, “Learning to Play Chess Using Temporal Differences,” Machine Learning, vol. 40, no. 3, pp. 243–263, 2000.

“Temporal-Difference Learning,” Section II-6, “An Introduction to Reinforcement Learning.”


Outline

  • Background knowledge

    • Expectimax

    • TD-Learning

      • Formula

      • Tuple network

  • Our algorithm

    • Features

    • Apply to expectimax

    • Result


Expectimax
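
As background, the standard expectimax recursion alternates max nodes, where the player chooses a move, with chance nodes, where the game places a random tile. The notation below is assumed for illustration and is not taken from the slides: $A(s)$ is the set of legal moves, $s'_a$ the board after move $a$, $P(s'' \mid s')$ the probability of each tile placement, $d$ the remaining search depth, and $h$ the leaf evaluation.

```latex
\begin{aligned}
V_{\max}(s, d) &= \max_{a \in A(s)} V_{\text{chance}}(s'_a, d) \\
V_{\text{chance}}(s', d) &= \sum_{s''} P(s'' \mid s')\, V_{\max}(s'', d - 1) \\
V_{\max}(s, 0) &= h(s)
\end{aligned}
```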


TD-Learning in Game Threes

  • TD learning has been applied successfully to the game 2048 [Szubert & Jaśkowski 2014].

  • We designed our Threes program along the same lines, with these differences:

    • A different definition of the game board (Threes! vs. 2048).

    • Our own features.

    • Expectimax search.


TD-Learning in Game Threes

  • Use the TD(0) learning method, where:

    • V(s): the expected cumulative reward for a board s, implemented using N-tuple networks

    • α: the learning rate

    • the other variables are defined on the next slide

  • The goal is to minimize the difference between the current prediction of cumulative future reward and the one-step-ahead prediction.
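
A standard TD(0) update consistent with the description above can be written as follows. The symbols are assumptions about the notation rather than a copy of the slide: $V$ is the value function, $\alpha$ the learning rate, $r$ the reward obtained by the move, $s$ the current board, and $s''$ the board after the move and the new random tile.

```latex
V(s) \leftarrow V(s) + \alpha \,\bigl( r + V(s'') - V(s) \bigr)
```

The bracketed term is the TD error: the one-step-ahead prediction $r + V(s'')$ minus the current prediction $V(s)$. Repeated updates shrink this difference.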


TD-Learning

s → (move right) → s' → (add a new random tile) → s''

Learning the expected cumulative reward for the board
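
One learning step over this transition can be sketched in Python. This is a minimal illustration, assuming a plain dictionary as the value table and a hypothetical `td0_update` helper; the real program stores V in N-tuple networks, and the reward definition used by the authors is not given in the slides.

```python
from collections import defaultdict


def td0_update(values, s, reward, s_next, alpha=0.1, terminal=False):
    """Apply one TD(0) update to values[s] and return the TD error.

    values   : mapping from a board key to its estimated value V
    s        : key of the current board
    reward   : reward obtained by the move (assumed definition)
    s_next   : key of the board after the move and the new random tile
    terminal : True if the game ended, so there is no future value
    """
    target = reward + (0.0 if terminal else values[s_next])
    td_error = target - values[s]
    values[s] += alpha * td_error
    return td_error


# Usage: boards are stood in for by string keys, purely for illustration.
values = defaultdict(float)
values["s''"] = 10.0
error = td0_update(values, "s", reward=3.0, s_next="s''", alpha=0.5)
print(error, values["s"])
```

Returning the TD error makes it easy to monitor learning progress, since its average magnitude should fall as the predictions become consistent.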


Tuple Networks

An N-tuple network implements the value function V(s) mentioned before.

V(s) is the sum of the weights looked up for each tuple of the board.
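
A tuple network of this kind can be sketched as a set of lookup tables, one per tuple, whose entries are summed to give V(s). The sketch below makes several assumptions that are not from the slides: the board is a flat list of 16 tile codes (0 for empty), there are at most 16 distinct codes, the class name `NTupleNetwork` is hypothetical, and the example tuples are simply the first two rows.

```python
class NTupleNetwork:
    """Value function V(s) as a sum of lookup-table weights, one table per tuple."""

    def __init__(self, tuples, num_tile_codes=16):
        self.tuples = tuples                      # each tuple lists board cell indices
        self.base = num_tile_codes                # assumed number of distinct tile codes
        self.tables = [dict() for _ in tuples]    # sparse weight tables, default 0.0

    def _index(self, board, cells):
        """Encode the tile codes found at `cells` as one table index."""
        idx = 0
        for cell in cells:
            idx = idx * self.base + board[cell]
        return idx

    def value(self, board):
        """Return V(board): the sum of one weight from every table."""
        return sum(
            table.get(self._index(board, cells), 0.0)
            for table, cells in zip(self.tables, self.tuples)
        )

    def update(self, board, delta):
        """Add `delta` to every weight that contributed to V(board)."""
        for table, cells in zip(self.tables, self.tuples):
            i = self._index(board, cells)
            table[i] = table.get(i, 0.0) + delta


# Usage with two hypothetical 4-cell tuples (the top two rows).
net = NTupleNetwork([(0, 1, 2, 3), (4, 5, 6, 7)])
board = [1, 2, 3, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0]
net.update(board, 0.5)
print(net.value(board))
```

In a TD learner, `delta` would typically be the learning rate times the TD error, so every active weight moves in the same direction.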


Features

  • Features:

    • Max tile value and position

    • Possible new tile

    • Three different parts of the board, together with their rotations and reflections
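
Applying one board part under rotation and reflection is commonly done by enumerating the eight symmetries of the square board. The following is a minimal sketch, assuming a 4x4 board with cells numbered 0 to 15 row by row; the function names are hypothetical, and the three board parts actually used are not given in the slides, so the top row serves as a stand-in.

```python
SIZE = 4  # assumed board width and height


def rotate(cell):
    """Map a cell index to its position after a 90-degree clockwise rotation."""
    row, col = divmod(cell, SIZE)
    return col * SIZE + (SIZE - 1 - row)


def mirror(cell):
    """Map a cell index to its position after a left-right reflection."""
    row, col = divmod(cell, SIZE)
    return row * SIZE + (SIZE - 1 - col)


def symmetries(cells):
    """Return the distinct images of a tuple of cells under the 8 board symmetries.

    Cell order is preserved within each image, so all images can share
    one weight table.
    """
    images = []
    current = tuple(cells)
    for _ in range(4):
        for variant in (current, tuple(mirror(c) for c in current)):
            if variant not in images:
                images.append(variant)
        current = tuple(rotate(c) for c in current)
    return images


# Usage: the eight images of the top row.
for image in symmetries((0, 1, 2, 3)):
    print(image)
```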


TD-Learning with Expectimax

At the leaf nodes of the expectimax search tree, we return a heuristic value for the board.

We replace this heuristic with the value V(s) learned by TD learning.
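
The substitution can be sketched as an expectimax search whose leaf evaluator is passed in as a function, so a hand-written heuristic and a learned V(s) are interchangeable. Everything below is an illustrative assumption rather than the authors' code: the function signature, the toy integer "game" that stands in for Threes, and the pretend learned values.

```python
def expectimax(state, depth, moves, outcomes, evaluate):
    """Return the expectimax value of `state` for the player to move.

    moves(state)    -> list of afterstates reachable by one legal move
    outcomes(after) -> list of (probability, next_state) for the random tile
    evaluate(state) -> leaf value: a heuristic or a learned V(s)
    """
    afterstates = moves(state)
    if depth == 0 or not afterstates:
        return evaluate(state)          # leaf node or no legal move
    best = float("-inf")
    for after in afterstates:           # max node: pick the best move
        expected = sum(                 # chance node: average over random tiles
            prob * expectimax(nxt, depth - 1, moves, outcomes, evaluate)
            for prob, nxt in outcomes(after)
        )
        best = max(best, expected)
    return best


# Toy stand-in for a real board game (hypothetical, for illustration only).
def toy_moves(state):
    return [state + 1, state * 2] if state < 10 else []


def toy_outcomes(after):
    return [(0.5, after), (0.5, after + 1)]


def heuristic(state):
    return float(state)


learned_v = {4: 10.0, 5: 10.0}          # pretend these came from TD learning


def learned_value(state):
    return learned_v.get(state, 0.0)


print(expectimax(3, 1, toy_moves, toy_outcomes, heuristic))
print(expectimax(3, 1, toy_moves, toy_outcomes, learned_value))
```

Keeping the evaluator as a parameter means the search code does not change when the heuristic is swapped for the learned value.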


Result (in our environment)

Highest Score: 255531

Average Score: 107833

Max Tile: 3072

192 Rate: 100%

384 Rate: 100%

768 Rate: 97%

1536 Rate: 86%

3072 Rate: 29%

6144 Rate: 0%

Move Count: 81097

Time: 199.35


Result (in contest server)

Max Score: 246297

Avg. Score: 110931

192 Rate: 100%

384 Rate: 100%

768 Rate: 99%

1536 Rate: 86%

3072 Rate: 31%


Thank you

