
Randomized Strategies and Temporal Difference Learning in Poker. Michael Oder, April 4, 2002. Advisor: Dr. David Mutchler.


### Randomized Strategies and Temporal Difference Learning in Poker

Michael Oder

April 4, 2002

Overview

• Perfect vs. Imperfect Information Games

• Poker as Imperfect Information Game

• Randomization

• Neural Nets and Temporal Difference

• Experiments

• Conclusions

• Ideas for Further Study

• World-class AI agents exist for many popular games

• Checkers

• Chess

• Othello

• These are games of perfect information

• All relevant information is available to each player

• Good understanding of imperfect information games would be a breakthrough

• Other players’ hands affect how much will be won or lost. However, no player can see this vital information.

• Non-deterministic aspects as well

Enter Loki

• One of the most successful computer poker players created

• Produced at University of Alberta by Jonathan Schaeffer et al.

• Employs randomized strategy

• Makes player less predictable

• Allows for bluffing

Probability Triples

• At any point in a poker game, player has 3 choices

• Bet/Raise

• Check/Call

• Fold

• Assign a probability to each possible move

• Single move is now a probability triple

• Problem: Associate payoff with hand, betting history, and triple (move selected)
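A probability triple can be turned into a concrete move by sampling. The following is a minimal sketch, not the presentation's code: the names `ACTIONS` and `choose_action` are hypothetical, and the triple is assumed to be ordered (bet/raise, check/call, fold).

```python
import random

# Assumed ordering of the three moves within a triple.
ACTIONS = ("bet/raise", "check/call", "fold")

def choose_action(triple, rng=random):
    """Sample one move from a (bet/raise, check/call, fold) probability triple."""
    if len(triple) != 3 or any(p < 0 for p in triple):
        raise ValueError("triple must hold three non-negative probabilities")
    if abs(sum(triple) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return rng.choices(ACTIONS, weights=triple, k=1)[0]

# Example: never raise, call 80% of the time, fold 20% of the time.
rng = random.Random(0)
triple = (0.0, 0.8, 0.2)
counts = {action: 0 for action in ACTIONS}
for _ in range(10_000):
    counts[choose_action(triple, rng)] += 1
print(counts)
```

Because the move is drawn at random, an opponent who sees the same situation twice cannot be sure of seeing the same response, which is the unpredictability the randomized strategy is after.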

Neural Nets

• One promising way to learn such functions is with a neural network

• Neural Networks consist of connected neurons

• Each connection has a weight

• Input game state, output a prediction of payoff

• Train by modifying weights

• Weights are modified by an amount proportional to learning rate
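The idea that weights change by an amount proportional to the learning rate can be illustrated with a single linear neuron. This is only a sketch under that simplifying assumption (the network in the presentation is not specified at this level of detail), and `train_step` is a hypothetical name.

```python
def train_step(weights, inputs, target, learning_rate):
    """One delta-rule update for a single linear neuron.

    Each weight moves by learning_rate * error * input, so doubling the
    learning rate doubles the size of the change.
    """
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = target - prediction
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

start = [0.0, 0.0]
small_step = train_step(start, inputs=[1.0, 2.0], target=1.0, learning_rate=0.1)
large_step = train_step(start, inputs=[1.0, 2.0], target=1.0, learning_rate=0.2)
print(small_step, large_step)
```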

Neural Net Example

[Figure: neural net with inputs hand, history, and triple, and outputs P(2), P(1), P(-1), P(-2)]

Temporal Difference

• The most common way to train a multilayer neural net is backpropagation

• Relies on simple input-output pairs.

• Problem: need to know correct answer right away in order to train nets

• Solution: Temporal Difference (TD) learning.

• TD(λ) algorithm developed by Richard Sutton

• Trains predictions over the many time steps of a game

• Tries to make each prediction closer to the prediction in the next time step

[Figure: successive predictions P1, P2, P3, P4, P5]
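The idea of pulling each prediction toward the next one can be sketched with a tabular TD(λ) update. This is an illustration only: it uses a lookup table rather than a neural net, assumes no discounting and a single reward at the end of the game, and the name `td_lambda_episode` and the state labels are hypothetical.

```python
def td_lambda_episode(values, states, final_reward, alpha=0.1, lam=0.7):
    """Update a state-value table in place after one episode.

    values: dict mapping state -> current prediction
    states: states visited in order during the game
    final_reward: actual outcome, known only at the end
    """
    traces = {s: 0.0 for s in values}  # eligibility traces
    for t, state in enumerate(states):
        traces[state] += 1.0
        if t + 1 < len(states):
            target = values[states[t + 1]]  # the next prediction
        else:
            target = final_reward           # the true outcome at game end
        delta = target - values[state]      # temporal-difference error
        for s in values:
            values[s] += alpha * delta * traces[s]
            traces[s] *= lam                # older states get less credit
    return values

values = {"s1": 0.0, "s2": 0.0, "s3": 0.0}
for _ in range(500):
    td_lambda_episode(values, ["s1", "s2", "s3"], final_reward=1.0)
print(values)
```

In this toy chain every game ends with the same outcome, so all three predictions drift toward that outcome even though only the last step ever sees it directly.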

• TD Poker program produced by group supervised by Dr. Mutchler

• Provides environment for playing poker variants and testing agents

Simple Poker Game

• Experiments were conducted on extremely simple variant of Poker

• Deck consists of 2, 3, and 4 of Hearts

• Each player gets one card

• One round of betting

• Player with highest card wins the pot

• Goal: Get the net to produce accurate payoff values as outputs

Early Results

• Started by pitting a neural net player against a random one

• Results were inconsistent

• Problem: Inappropriate value for learning rate

• Too low: Outputs never approach true payoffs

• Too high: Outputs fluctuate between too high and too low
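The effect of the learning rate can be reproduced with a much simpler stand-in for the network: a single running estimate nudged toward each observed payoff. This sketch is not the presentation's experiment; the payoffs of +2 for a call and -1 for a fold, the 80% call rate, and the names `run_estimate` and `spread` are assumptions chosen for illustration.

```python
import random

def run_estimate(learning_rate, games, seed=0, p_call=0.8):
    """Track one payoff estimate that moves toward each observed payoff."""
    rng = random.Random(seed)
    estimate = 0.0
    history = []
    for _ in range(games):
        # Assumed payoffs: +2 when the player calls, -1 when it folds.
        payoff = 2.0 if rng.random() < p_call else -1.0
        estimate += learning_rate * (payoff - estimate)
        history.append(estimate)
    return history

def spread(history, last=100):
    """Range of the estimate over the final `last` games."""
    tail = history[-last:]
    return max(tail) - min(tail)

too_low = run_estimate(0.0001, 1000)   # barely moves from its starting value
moderate = run_estimate(0.01, 5000)    # settles near the average payoff
too_high = run_estimate(0.9, 1000)     # jumps with every single game

print(too_low[-1], moderate[-1], spread(too_high))
```

With these assumed payoffs the long-run average is 0.8 * 2 + 0.2 * (-1) = 1.4, which gives a reference point for judging each run.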

Experiment Set I

• Conjecture: Learning should occur with very small learning rate over many games

• Learning Rate = 0.01

• Train for 50,000 games

• Network is set to train only when the player's card is a 4

• First player always bets, second player tested

• Two Choices

• call 80%, fold 20% -> avg. payoff = 1.4

• call 20%, fold 80% -> avg. payoff = -0.4

• Want predicted payoffs to settle on these average values
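The average payoffs quoted for the two choices are consistent with a call winning 2 and a fold losing 1 when the tested player holds the 4. Those two payoff values are an inference from the quoted averages, not something the slides state outright, and `expected_payoff` is a hypothetical helper.

```python
def expected_payoff(p_call, call_payoff=2.0, fold_payoff=-1.0):
    """Average payoff of a call/fold mix (payoff values are assumed)."""
    return p_call * call_payoff + (1.0 - p_call) * fold_payoff

for p_call in (0.8, 0.2):
    print(f"call {p_call:.0%}, fold {1 - p_call:.0%}: "
          f"avg. payoff = {expected_payoff(p_call):.1f}")
```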

Results

• 3 out of 10 trials came within 0.1 of the correct result for the highest payoff

• 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff

• None of the trials came within 0.1 of the correct result for both

• The results were in the correct order in only half of the trials

More Distributions

• Repeated experiment with six choices instead of two

• call 100% -> avg. payoff = 2.0

• call 80%, fold 20% -> avg. payoff = 1.4

• call 60%, fold 40% -> avg. payoff = 0.8

• call 40%, fold 60% -> avg. payoff = 0.2

• call 20%, fold 80% -> avg. payoff = -0.4

• fold 100% -> avg. payoff = -1.0

• Using more distributions did help the program learn to order the values of the distributions correctly

• All six distributions were ranked correctly 7 out of 10 times (a random ordering would be correct with only a 1 in 6! = 720 chance, about 0.14%, in any one trial)
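The 0.14% figure is the chance that a uniformly random ordering of six items happens to be the correct one. The sketch below checks that number and, assuming the ten trials are independent and that an untrained net would order the distributions uniformly at random, estimates how unlikely seven or more correct rankings would be by luck alone. Both assumptions are illustrative and are not claims made in the slides.

```python
import math

# Chance that a random ordering of six distributions is the correct one.
p_correct = 1 / math.factorial(6)
print(f"chance per trial: {p_correct:.4%}")

# Chance of at least 7 correct orderings in 10 independent random trials.
trials = 10
p_at_least_7 = sum(
    math.comb(trials, k) * p_correct**k * (1 - p_correct) ** (trials - k)
    for k in range(7, trials + 1)
)
print(f"chance of 7 or more by luck: {p_at_least_7:.3e}")
```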

Output Encoding

• Distributions are ranked correctly, but many output values are still inaccurate.

• Seems to be largely caused by the encoding of outputs

• Network has four outputs, each representing probability of a specific payoff

• This encoding is not expandable, and four outputs must all be correct for good payoff prediction.

• Replace four outputs with single number

• The number represents the payoff relative to the highest payoff possible: P = 0.5 + (winnings / total possible)

• Total possible winnings determined at beginning of game (sum of other players’ holdings)
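The single-number encoding can be written as a pair of small helpers. This is a sketch of the formula as stated on the slide; the function names and the example holdings are hypothetical, and the decoder is simply the algebraic inverse.

```python
def encode_payoff(winnings, total_possible):
    """Map winnings to the network target P = 0.5 + winnings / total_possible."""
    if total_possible <= 0:
        raise ValueError("total_possible must be positive")
    return 0.5 + winnings / total_possible

def decode_payoff(p, total_possible):
    """Invert the encoding to recover winnings from a network output."""
    return (p - 0.5) * total_possible

# Hypothetical example: the other players hold 40 chips in total.
total = 40.0
for winnings in (-10.0, 0.0, 10.0):
    p = encode_payoff(winnings, total)
    print(winnings, "->", p, "->", decode_payoff(p, total))
```

Breaking even maps to 0.5, so outputs above 0.5 indicate a predicted gain and outputs below 0.5 a predicted loss.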

• Repeated previous experiments using this encoding

• Payoff predictions were generally more accurate using this encoding

• 5 out of 10 trials got exact payoff (0.502) for best distribution choice with six choices available

• Most trials had very close value for payoff associated with one of the distributions

• However, no trial came close on more than one probability distribution

• Neural Net player can learn strategies based on probability

• Payoff is successfully learned as a function of betting action

• Consistency is still a problem

• Trouble learning correct payoffs for more than one distribution

Further Study

• Issues of expandability

• Coding for multiple-round history

• Can previous learning be extended?

• Variable learning rate

• Study distribution choices

• Sample some bad distribution choices

• Test against a variety of other players