Presentation Transcript

  1. Randomized Strategies and Temporal Difference Learning in Poker • Michael Oder • April 4, 2002 • Advisor: Dr. David Mutchler

  2. Overview • Perfect vs. Imperfect Information Games • Poker as Imperfect Information Game • Randomization • Neural Nets and Temporal Difference • Experiments • Conclusions • Ideas for Further Study

  3. Perfect vs. Imperfect Information • World-class AI agents exist for many popular games • Checkers • Chess • Othello • These are games of perfect information • All relevant information is available to each player • A good understanding of imperfect information games would be a breakthrough

  4. Poker as an Imperfect Information Game • Other players’ hands affect how much will be won or lost. However, no player is aware of this vital information about the others. • Non-deterministic aspects as well

  5. Enter Loki • One of the most successful computer poker players yet created • Produced at the University of Alberta by Jonathan Schaeffer et al. • Employs a randomized strategy • Makes the player less predictable • Allows for bluffing

  6. Probability Triples • At any point in a poker game, a player has 3 choices • Bet/Raise • Check/Call • Fold • Assign a probability to each possible move • A single move is now a probability triple • Problem: Associate a payoff with the hand, the betting history, and the triple (move selected)
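As a rough sketch of the idea (not Loki's actual representation), a probability triple can be stored and sampled like this; the action names and the particular numbers are assumptions for illustration:

```python
import random

# A probability triple assigns a probability to each of the three actions.
# These particular values are illustrative only.
triple = {"bet_raise": 0.5, "check_call": 0.3, "fold": 0.2}

def sample_action(triple):
    """Choose one action at random according to the triple's probabilities."""
    r = random.random()
    cumulative = 0.0
    for action, p in triple.items():
        cumulative += p
        if r < cumulative:
            return action
    return action  # guard against floating-point rounding leaving a tiny gap

print(sample_action(triple))  # e.g. "check_call"
```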

  7. Neural Nets • One promising way to learn such functions is with a neural network • Neural networks consist of connected neurons • Each connection has a weight • Input the game state, output a prediction of the payoff • Train by modifying the weights • Weights are modified by an amount proportional to the learning rate
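A minimal sketch of the training idea, assuming a single linear neuron rather than the full multi-layer network used in the talk; the encoded state and target payoff below are made up:

```python
def train_step(weights, inputs, target, learning_rate=0.01):
    """Delta-rule update: each weight changes by an amount proportional
    to the learning rate and to the prediction error."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = target - prediction
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

# Hypothetical encoding of (hand, betting history, probability triple)
# and an observed payoff used as the training target.
weights = [0.0, 0.0, 0.0]
state = [1.0, 0.0, 0.8]
observed_payoff = 1.4
weights = train_step(weights, state, observed_payoff)
```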

  8. Neural Net Example • [Diagram: network with inputs hand, history, and triple, and outputs P(2), P(1), P(-1), P(-2)]

  9. Temporal Difference • The most common way to train a multi-layer neural net is backpropagation • Relies on simple input-output pairs • Problem: the correct answer must be known right away in order to train the net • Solution: Temporal Difference (TD) learning • TD(λ) algorithm developed by Richard Sutton

  10. Temporal Difference (cont’d) • Trains predictions over the course of a game, across many time steps • Tries to make each prediction closer to the prediction at the next time step • [Diagram: chain of successive predictions P1 → P2 → P3 → P4 → P5]
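A minimal TD(λ) sketch for a linear predictor with eligibility traces, in the spirit of Sutton's algorithm: each prediction is pushed toward the next one, and the final prediction toward the actual payoff. The constants and state encoding are assumptions, not details of the Mauritius program:

```python
def td_lambda_episode(weights, states, final_payoff, alpha=0.01, lam=0.7):
    """Run TD(lambda) updates over one game's sequence of encoded states."""
    traces = [0.0] * len(weights)
    predict = lambda s: sum(w * x for w, x in zip(weights, s))
    for t in range(len(states)):
        current = predict(states[t])
        # Target is the next prediction, or the real payoff at the end of the game.
        target = final_payoff if t == len(states) - 1 else predict(states[t + 1])
        delta = target - current
        # Decay old eligibility and add the gradient of the current prediction.
        traces = [lam * e + x for e, x in zip(traces, states[t])]
        weights = [w + alpha * delta * e for w, e in zip(weights, traces)]
    return weights
```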

  11. University of Mauritius Group • TD Poker program produced by a group supervised by Dr. Mutchler • Provides an environment for playing poker variants and testing agents

  12. Simple Poker Game • Experiments were conducted on an extremely simple variant of poker • Deck consists of the 2, 3, and 4 of Hearts • Each player gets one card • One round of betting • Player with the highest card wins the pot • Goal: Get the net to produce accurate payoff values as outputs
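A rough simulation of this variant, assuming an ante of 1 chip and a single fixed bet of 1 (the slides do not state the stakes); it returns the second player's net winnings for one hand:

```python
import random

def play_hand(call_probability):
    """One hand: the first player bets, the second player calls or folds at random."""
    deck = [2, 3, 4]                    # the 2, 3, and 4 of Hearts
    random.shuffle(deck)
    first_card, second_card = deck[0], deck[1]
    ante, bet = 1, 1                    # assumed stakes
    if random.random() < call_probability:
        if second_card > first_card:
            return ante + bet           # win the opponent's ante and bet
        return -(ante + bet)            # lose the ante and the call
    return -ante                        # fold and forfeit the ante
```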

  13. Early Results • Started by pitting a neural net player against a random one • Results were inconsistent • Problem: Inappropriate value for the learning rate • Too low: Outputs never approach the true payoffs • Too high: Outputs fluctuate between too high and too low

  14. Experiment Set I • Conjecture: Learning should occur with a very small learning rate over many games • Learning Rate = 0.01 • Train for 50,000 games • Only set to train when the card is a 4 • First player always bets; second player is tested • Two Choices • call 80%, fold 20% -> avg. payoff = 1.4 • call 20%, fold 80% -> avg. payoff = -0.4 • Want the net's outputs to settle in on these average values
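Under the stakes assumed in the sketch above (holding the 4, a successful call wins 2 and a fold loses 1), the two target values follow directly; a quick check:

```python
def expected_payoff(call_prob, win_amount=2, fold_loss=-1):
    """Average payoff when holding the winning card against a bet."""
    return call_prob * win_amount + (1 - call_prob) * fold_loss

print(expected_payoff(0.8))  # 0.8 * 2 + 0.2 * (-1) = 1.4
print(expected_payoff(0.2))  # 0.2 * 2 + 0.8 * (-1) = -0.4
```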

  15. Results • 3 out of 10 trials came within 0.1 of the correct result for the highest payoff • 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff • None of the trials came within 0.1 of the correct result for both • The results were in the correct order in only half of the trials

  16. More Distributions • Repeated the experiment with six choices instead of two • call 100% -> avg. payoff = 2.0 • call 80%, fold 20% -> avg. payoff = 1.4 • call 60%, fold 40% -> avg. payoff = 0.8 • call 40%, fold 60% -> avg. payoff = 0.2 • call 20%, fold 80% -> avg. payoff = -0.4 • fold 100% -> avg. payoff = -1.0 • Using more distributions did help the program learn to order the values of the distributions correctly • All six distributions were ranked correctly 7 out of 10 times (a random ordering of six items is correct with probability 1/6! = 1/720 ≈ 0.14% for any one trial)

  17. Output Encoding • Distributions are ranked correctly, but many output values are still inaccurate • This seems to be largely caused by the encoding of the outputs • The network has four outputs, each representing the probability of a specific payoff • This encoding is not expandable, and all four outputs must be correct for a good payoff prediction
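A sketch of why every output matters under this encoding: the predicted payoff is a probability-weighted sum over all four payoff values, so an error in any one output skews the prediction. The payoff values match the earlier diagram; the probabilities here are made up:

```python
# Four-output encoding: one probability per specific payoff value.
payoffs = [2, 1, -1, -2]
outputs = [0.8, 0.0, 0.2, 0.0]   # hypothetical network outputs

# Predicted payoff is the probability-weighted sum of the possible payoffs.
predicted = sum(p * v for p, v in zip(outputs, payoffs))
print(predicted)  # 0.8 * 2 + 0.2 * (-1) = 1.4
```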

  18. Relative Payoff Encoding • Replace the four outputs with a single number • The number represents the payoff relative to the highest payoff possible: P = 0.5 + (winnings / total possible) • Total possible winnings is determined at the beginning of the game (sum of the other players’ holdings) • Repeated the previous experiments using this encoding
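A sketch of the single-output encoding using the formula from the slide; the chip counts are assumptions for illustration:

```python
def encode_payoff(winnings, total_possible):
    """Slide's formula: P = 0.5 + winnings / total possible winnings."""
    return 0.5 + winnings / total_possible

def decode_payoff(output, total_possible):
    """Invert the encoding to turn a network output back into a payoff."""
    return (output - 0.5) * total_possible

# Total possible winnings is fixed at the start of the game
# (the sum of the other players' holdings), here assumed to be 2.
print(encode_payoff(-1, 2))   # 0.0
print(decode_payoff(0.0, 2))  # -1.0
```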

  19. Results (Experiment Set 2) • Payoff predictions were generally more accurate using this encoding • 5 out of 10 trials got the exact payoff (0.502) for the best distribution choice with six choices available • Most trials had a very close value for the payoff associated with one of the distributions • However, no trial was significantly close on multiple probability distributions

  20. Observations/Conclusions • Neural Net player can learn strategies based on probability • Payoff is successfully learned as a function of betting action • Consistency is still a problem • Trouble learning correct payoffs for more than one distribution

  21. Further Study • Issues of expandability • Coding for multiple-round history • Can previous learning be extended? • Variable learning rate • Study distribution choices • Sample some bad distribution choices • Test against a variety of other players

  22. Questions?