Randomized Strategies and Temporal Difference Learning in Poker

**Randomized Strategies and Temporal Difference Learning in Poker**
Michael Oder
April 4, 2002
Advisor: Dr. David Mutchler

**Overview**
• Perfect vs. Imperfect Information Games
• Poker as Imperfect Information Game
• Randomization
• Neural Nets and Temporal Difference
• Experiments
• Conclusions
• Ideas for Further Study

**Perfect vs. Imperfect Information**
• World-class AI agents exist for many popular games
  • Checkers
  • Chess
  • Othello
• These are games of perfect information
  • All relevant information is available to each player
• Good understanding of imperfect information games would be a breakthrough

**Poker as an Imperfect Information Game**
• Other players’ hands affect how much will be won or lost. However, a player cannot see this vital information.
• Non-deterministic aspects as well

**Enter Loki**
• One of the most successful computer poker players created
• Produced at University of Alberta by Jonathan Schaeffer et al.
• Employs randomized strategy
  • Makes player less predictable
  • Allows for bluffing

**Probability Triples**
• At any point in a poker game, player has 3 choices
  • Bet/Raise
  • Check/Call
  • Fold
• Assign a probability to each possible move
• Single move is now a probability triple
• Problem: Associate payoff with hand, betting history, and triple (move selected)

**Neural Nets**
• One promising way to learn such functions is with a neural network
• Neural networks consist of connected neurons
• Each connection has a weight
• Input game state, output a prediction of payoff
• Train by modifying weights
• Weights are modified by an amount proportional to learning rate

**Neural Net Example**
[Figure: neural network with inputs hand, history, and triple, and outputs P(2), P(1), P(-1), P(-2)]

**Temporal Difference**
• Most common way to train multiple layer neural net is with backpropagation
  • Relies on simple input-output pairs
• Problem: need to know correct answer right away in order to train nets
• Solution: Temporal Difference (TD) learning
• TD(λ) algorithm developed by Richard Sutton

**Temporal Difference (cont’d)**
• Trains responses over the course of a game over many time steps
• Tries to make each prediction closer to the prediction in the next time step
[Figure: sequence of successive predictions P1, P2, P3, P4, P5]

**University of Mauritius Group**
• TD Poker program produced by group supervised by Dr. Mutchler
• Provides environment for playing poker variants and testing agents

**Simple Poker Game**
• Experiments were conducted on extremely simple variant of Poker
• Deck consists of 2, 3, and 4 of Hearts
• Each player gets one card
• One round of betting
• Player with highest card wins the pot
• Goal: Get the net to produce accurate payoff values as outputs

**Early Results**
• Started by pitting a neural net player against a random one
• Results were inconsistent
• Problem: Inappropriate value for learning rate
  • Too low: Outputs never approach true payoffs
  • Too high: Outputs fluctuate between too high and too low

**Experiment Set I**
• Conjecture: Learning should occur with very small learning rate over many games
• Learning Rate = 0.01
• Train for 50,000 games
• Only set to train when card is a 4
• First player always bets, second player tested
• Two Choices
  • call 80%, fold 20% -> avg. payoff = 1.4
  • call 20%, fold 80% -> avg. payoff = -0.4
• Want payoffs to settle in on average values

**Results**
• 3 out of 10 trials came within 0.1 of the correct result for the highest payoff
• 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff
• None of the trials came within 0.1 of the correct result for both
• The results were in the correct order in only half of the trials

**More Distributions**
• Repeated experiment with six choices instead of two
  • call 100% -> avg. payoff = 2.0
  • call 80%, fold 20% -> avg. payoff = 1.4
  • call 60%, fold 40% -> avg. payoff = 0.8
  • call 40%, fold 60% -> avg. payoff = 0.2
  • call 20%, fold 80% -> avg. payoff = -0.4
  • fold 100% -> avg. payoff = -1.0
• Using more distributions did help the program learn to order the values of the distributions correctly
• All six distributions were ranked correctly 7 out of 10 times (a random ordering has only a 0.14% chance of being correct in any one trial)

**Output Encoding**
• Distributions are ranked correctly, but many output values are still inaccurate
• Seems to be largely caused by the encoding of outputs
• Network has four outputs, each representing probability of a specific payoff
• This encoding is not expandable, and four outputs must all be correct for good payoff prediction

**Relative Payoff Encoding**
• Replace four outputs with single number
• The number represents the payoff relative to highest payoff possible
  P = 0.5 + (winnings/total possible)
• Total possible winnings determined at beginning of game (sum of other players’ holdings)
• Repeated previous experiments using this encoding

**Results (Experiment Set II)**
• Payoff predictions were generally more accurate using this encoding
• 5 out of 10 trials got exact payoff (0.502) for best distribution choice with six choices available
• Most trials had very close value for payoff associated with one of the distributions
• However, no trial came close on more than one probability distribution

**Observations/Conclusions**
• Neural Net player can learn strategies based on probability
• Payoff is successfully learned as a function of betting action
• Consistency is still a problem
• Trouble learning correct payoffs for more than one distribution

**Further Study**
• Issues of expandability
• Coding for multiple-round history
• Can previous learning be extended?
• Variable learning rate
• Study distribution choices
• Sample some bad distribution choices
• Test against a variety of other players
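
The average payoffs quoted on the Experiment Set I and More Distributions slides can be reproduced with a short calculation. The sketch below assumes, consistently with the listed numbers, that when holding the 4 a call always wins 2 and a fold always loses 1. The function names are hypothetical and not part of the TD Poker program. It also evaluates the relative payoff encoding P = 0.5 + (winnings/total possible); the value `total_possible = 1000` is an assumption chosen only because it reproduces the 0.502 reported for the best distribution, since the slides do not state the actual holdings.

```python
import math

# Assumed payoffs when holding the 4 (the highest card) after the first
# player bets: calling wins 2, folding loses 1. These values are inferred
# from the averages on the slides, not taken from the TD Poker source.
CALL_PAYOFF = 2.0
FOLD_PAYOFF = -1.0


def average_payoff(p_call):
    """Expected payoff of a (call, fold) distribution with P(call) = p_call."""
    return p_call * CALL_PAYOFF + (1.0 - p_call) * FOLD_PAYOFF


def relative_payoff(winnings, total_possible):
    """Relative payoff encoding: P = 0.5 + winnings / total possible."""
    return 0.5 + winnings / total_possible


def chance_of_correct_ranking(n_choices):
    """Probability that a uniformly random ordering of n choices is correct."""
    return 1.0 / math.factorial(n_choices)


if __name__ == "__main__":
    for p_call in (1.0, 0.8, 0.6, 0.4, 0.2, 0.0):
        print(f"call {p_call:.0%}: avg. payoff = {average_payoff(p_call):+.1f}")
    print(f"random ranking of six correct: {chance_of_correct_ranking(6):.2%}")
    # total_possible = 1000 is a hypothetical value used for illustration only.
    print(f"relative payoff of winning 2: {relative_payoff(2.0, 1000.0):.3f}")
```

Under these assumptions the expected payoff is linear in the calling probability (3 × P(call) − 1), which is why the six distributions are evenly spaced 0.6 apart.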
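
The idea on the Temporal Difference slides, moving each prediction toward the prediction at the next time step and the final prediction toward the actual outcome, can be sketched with a tabular TD(λ) update. This is a minimal illustration and not the neural-network trainer used in the experiments: a lookup table stands in for the network, there is no discounting, and the names `td_lambda_episode`, `alpha`, and `lam`, along with the state labels, are hypothetical.

```python
def td_lambda_episode(values, states, final_payoff, alpha=0.1, lam=0.7):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    values       -- dict mapping state -> current payoff prediction (updated in place)
    states       -- sequence of states visited during one game
    final_payoff -- actual payoff observed at the end of the game
    alpha        -- learning rate
    lam          -- trace decay (0 = one-step TD, 1 = credit all earlier states fully)
    """
    traces = {}
    for t, state in enumerate(states):
        prediction = values.get(state, 0.0)
        # The target is the next prediction, or the real payoff at the end.
        if t + 1 < len(states):
            target = values.get(states[t + 1], 0.0)
        else:
            target = final_payoff
        td_error = target - prediction

        # Decay the eligibility of earlier states, then mark the current one.
        for visited in traces:
            traces[visited] *= lam
        traces[state] = traces.get(state, 0.0) + 1.0

        # Every eligible state moves toward the target in proportion to its trace.
        for visited, trace in traces.items():
            values[visited] = values.get(visited, 0.0) + alpha * td_error * trace
    return values


if __name__ == "__main__":
    predictions = {}
    for _ in range(500):
        td_lambda_episode(predictions, ["P1", "P2", "P3", "P4", "P5"], final_payoff=2.0)
    print({state: round(value, 3) for state, value in predictions.items()})
```

In this toy setting every game ends with the same payoff, so all five predictions settle on that value. With a neural network, the table update would be replaced by a gradient step on the weights, and the learning rate plays the role described on the Early Results slide.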