Randomized strategies and temporal difference learning in poker

Randomized Strategies and Temporal Difference Learning in Poker

Michael Oder

April 4, 2002

Advisor: Dr. David Mutchler


Overview

  • Perfect vs. Imperfect Information Games

  • Poker as Imperfect Information Game

  • Randomization

  • Neural Nets and Temporal Difference

  • Experiments

  • Conclusions

  • Ideas for Further Study


Perfect vs. Imperfect Information

  • World-class AI agents exist for many popular games

    • Checkers

    • Chess

    • Othello

  • These are games of perfect information

  • All relevant information is available to each player

  • Good understanding of imperfect information games would be a breakthrough


Poker as an Imperfect Information Game

  • Other players’ hands affect how much will be won or lost. However, each player is not aware of this vital information.

  • Non-deterministic aspects as well


Enter Loki

  • One of the most successful computer poker players yet created

  • Produced at the University of Alberta by Jonathan Schaeffer et al.

  • Employs randomized strategy

    • Makes player less predictable

    • Allows for bluffing


Probability Triples

  • At any point in a poker game, a player has three choices

    • Bet/Raise

    • Check/Call

    • Fold

  • Assign a probability to each possible move

  • A single move is now a probability triple

  • Problem: associate a payoff with the hand, betting history, and triple (the move selected)
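The bullets above can be sketched in a few lines. This is a minimal illustration of a probability triple as a move; the action names and function are illustrative assumptions, not taken from Loki or the talk's program:

```python
import random

# Illustrative names for the three possible poker actions.
ACTIONS = ("bet_raise", "check_call", "fold")

def sample_action(triple, rng=random):
    """Pick one action according to the triple's probabilities."""
    assert abs(sum(triple) - 1.0) < 1e-9, "probabilities must sum to 1"
    return rng.choices(ACTIONS, weights=triple, k=1)[0]

# Example: bet 50%, call 30%, fold 20%
move = sample_action((0.5, 0.3, 0.2))
```

Sampling from the triple, rather than always taking the highest-probability action, is what makes the player unpredictable and allows occasional bluffs.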


Neural Nets

  • One promising way to learn such functions is with a neural network

  • Neural networks consist of connected neurons

  • Each connection has a weight

  • Input game state, output a prediction of payoff

  • Train by modifying weights

  • Weights are modified by an amount proportional to the learning rate


Neural Net Example

[Diagram: inputs hand, history, and triple feed the net; outputs are P(2), P(1), P(-1), and P(-2)]
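A tiny feedforward sketch of the net in the diagram. The layer sizes, input encoding, and weight ranges here are illustrative assumptions, not the actual program's architecture:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One-hidden-layer net: inputs encode the hand, betting history,
    and chosen probability triple; the four outputs estimate
    P(payoff = 2), P(1), P(-1), and P(-2). Sizes are illustrative."""
    def __init__(self, n_in=5, n_hidden=4, n_out=4):
        self.w1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sigmoid(sum(w * v for w, v in zip(row, h))) for row in self.w2]

net = TinyNet()
outputs = net.forward([1.0, 0.0, 0.0, 0.5, 0.3])  # hypothetical encoding
```

Training would adjust `w1` and `w2` by gradient steps scaled by the learning rate, as described on the previous slide.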


Temporal Difference

  • The most common way to train a multi-layer neural net is with backpropagation

  • Relies on simple input-output pairs

  • Problem: the correct answer must be known right away in order to train the net

  • Solution: Temporal Difference (TD) learning.

  • The TD(λ) algorithm was developed by Richard Sutton


Temporal Difference (cont’d)

  • Trains predictions over the course of a game, across many time steps

  • Tries to make each prediction closer to the prediction in the next time step

[Diagram: sequence of predictions P1 → P2 → P3 → P4 → P5, each nudged toward its successor]
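The idea above can be sketched as a single update step: nudge each prediction toward the one that follows it, and the last prediction toward the final payoff. This is a plain TD(0)-style sketch, not Sutton's full TD(λ) with eligibility traces:

```python
def td_update(predictions, final_payoff, alpha=0.1):
    """Move each prediction a fraction alpha toward its successor;
    the last prediction moves toward the actual payoff."""
    targets = predictions[1:] + [final_payoff]
    return [p + alpha * (t - p) for p, t in zip(predictions, targets)]

preds = [0.5, 0.6, 0.4, 0.7, 0.9]   # P1 .. P5, illustrative values
updated = td_update(preds, final_payoff=1.0)
```

Repeating this over many games pulls the whole chain of predictions toward the eventual outcome, which is why no immediate "correct answer" is needed.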


University of Mauritius Group

  • TD poker program produced by a group supervised by Dr. Mutchler

  • Provides environment for playing poker variants and testing agents


Simple Poker Game

  • Experiments were conducted on an extremely simple variant of poker

  • The deck consists of the 2, 3, and 4 of Hearts

  • Each player gets one card

  • One round of betting

  • Player with highest card wins the pot

  • Goal: Get the net to produce accurate payoff values as outputs
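The rules above fit in a few lines. A minimal sketch of the variant, with function names of my own choosing (not from the Mauritius program):

```python
import random

def deal(rng):
    """One card each from a three-card deck: the 2, 3, and 4 of Hearts."""
    deck = [2, 3, 4]
    rng.shuffle(deck)
    return deck[0], deck[1]

def winner(card1, card2):
    """Highest card wins the pot; returns the winning player number."""
    return 1 if card1 > card2 else 2
```

With only three cards and one betting round, the true expected payoff of every strategy can be computed by hand, which is what makes the variant useful for checking the net's outputs.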


Early Results

  • Started by pitting a neural net player against a random one

  • Results were inconsistent

  • Problem: inappropriate value for the learning rate

  • Too low: Outputs never approach true payoffs

  • Too high: Outputs fluctuate between too high and too low


Experiment Set I

  • Conjecture: Learning should occur with very small learning rate over many games

  • Learning Rate = 0.01

  • Train for 50,000 games

  • Set to train only when the card is a 4

  • First player always bets, second player tested

  • Two Choices

    • call 80%, fold 20% -> avg. payoff = 1.4

    • call 20%, fold 80% -> avg. payoff = -0.4

  • Want payoffs to settle in on average values
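The two average payoffs above follow from a per-hand payoff of +2 for calling (and winning with the 4) and -1 for folding, which is what the slide's numbers imply; those two constants are inferred, not stated in the talk:

```python
# Payoffs inferred from the slide: holding the 4 against a bet,
# calling wins +2 and folding loses -1.
CALL_PAYOFF, FOLD_PAYOFF = 2.0, -1.0

def expected_payoff(p_call):
    """Average payoff of a mixed call/fold strategy."""
    return p_call * CALL_PAYOFF + (1.0 - p_call) * FOLD_PAYOFF

avg_a = expected_payoff(0.8)   # call 80%, fold 20%  -> 1.4
avg_b = expected_payoff(0.2)   # call 20%, fold 80%  -> -0.4
```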


Results

  • 3 out of 10 trials came within 0.1 of the correct result for the highest payoff

  • 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff

  • None of the trials came within 0.1 of the correct result for both

  • The results were in the correct order in only half of the trials


More Distributions

  • Repeated experiment with six choices instead of two

    • call 100% -> avg. payoff = 2.0

    • call 80%, fold 20% -> avg. payoff = 1.4

    • call 60%, fold 40% -> avg. payoff = 0.8

    • call 40%, fold 60% -> avg. payoff = 0.2

    • call 20%, fold 80% -> avg. payoff = -0.4

    • fold 100% -> avg. payoff = -1.0

  • Using more distributions did help the program learn to order the values of the distributions correctly

  • All six distributions were ranked correctly in 7 out of 10 trials (a 0.14% chance for any one trial under random ordering)
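Both the six expected payoffs and the 0.14% figure check out, again assuming the inferred per-hand payoffs of +2 for a call and -1 for a fold:

```python
import math

def expected_payoff(p_call, call=2.0, fold=-1.0):
    """Average payoff of a mixed call/fold strategy (payoffs inferred)."""
    return p_call * call + (1.0 - p_call) * fold

# The six distributions from the slide, call probabilities 100% down to 0%.
payoffs = [expected_payoff(p) for p in (1.0, 0.8, 0.6, 0.4, 0.2, 0.0)]

# A random ordering of six distinct values is correct with
# probability 1/6! = 1/720, roughly 0.14%.
chance = 1.0 / math.factorial(6)
```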


Output Encoding

  • Distributions are ranked correctly, but many output values are still inaccurate.

  • Seems to be largely caused by the encoding of outputs

  • Network has four outputs, each representing probability of a specific payoff

  • This encoding is not expandable, and all four outputs must be correct for a good payoff prediction


Relative Payoff Encoding

  • Replace four outputs with single number

  • The number represents the payoff relative to the highest payoff possible:

    P = 0.5 + (winnings / total possible)

  • Total possible winnings determined at beginning of game (sum of other players’ holdings)

  • Repeated previous experiments using this encoding
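The encoding above is a one-liner; the function name here is my own, and the formula is exactly the one from the slide:

```python
def relative_payoff(winnings, total_possible):
    """Single-output encoding: 0.5 + winnings relative to the total
    possible (the sum of the other players' holdings, fixed at the
    start of the game)."""
    return 0.5 + winnings / total_possible
```

Breaking even maps to 0.5, with wins above and losses below it, so a single output carries the same information the four probability outputs did.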


Results (Experiment Set 2)

  • Payoff predictions were generally more accurate using this encoding

  • 5 out of 10 trials got the exact payoff (0.502) for the best distribution choice with six choices available

  • Most trials had very close value for payoff associated with one of the distributions

  • However, no trial was significantly close on multiple probability distributions


Observations/Conclusions

  • Neural Net player can learn strategies based on probability

  • Payoff is successfully learned as a function of betting action

  • Consistency is still a problem

  • Trouble learning correct payoffs for more than one distribution


Further Study

  • Issues of expandability

    • Coding for multiple-round history

    • Can previous learning be extended?

  • Variable learning rate

  • Study distribution choices

  • Sample some bad distribution choices

  • Test against a variety of other players


Questions?

