RL Successes and Challenges in High-Dimensional Games. Gerry Tesauro IBM T.J.Watson Research Center. Outline. Overview/Definition of “Games” Why Study Games? Commonalities of RL successes RL in Classic Board Games TD-Gammon, KnightCap, TD-Chinook, RLGO RL in Robotics Games

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'RL Successes and Challenges in High-Dimensional Games' - Sophia

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
rl successes and challenges in high dimensional games

RL Successes and Challenges in High-Dimensional Games

Gerry Tesauro

IBM T.J.Watson Research Center

  • Overview/Definition of “Games”
    • Why Study Games?
    • Commonalities of RL successes
  • RL in Classic Board Games
    • TD-Gammon, KnightCap, TD-Chinook, RLGO
  • RL in Robotics Games
    • Attacker/Defender Robots
    • Robocup Soccer
  • RL in Video/Online Games
    • AI Fighters
  • Open Discussion / Lessons Learned
What Do We Mean by “Games” ??
  • Some Definitions of “Game”
    • A structured activity, usually undertaken for enjoyment (Wikipedia)
    • Activity among decision-makers in seeking to achieve objectives in a limiting context (Clark Abt)
    • A form of play with goals and structure (Kevin Maroney)
  • Single-Player Game = “Puzzle”
  • “Competition” if players can’t interfere with other players’ performance
    • Olympic Hockey vs. Olympic Figure Skating
  • Common Ingredients: Players, Rules, Objective
    • But: Games with modifiable rules, no clear object (MOOs)
Why Use Games for RL/AI ??
  • Clean, Idealized Models of Reality
    • Rules are clear and known (Samuel: not true in economically important problems)
    • Can build very good simulators
  • Clear Metric to Measure Progress
    • Tournament results, Elo ratings, etc.
    • Danger: Metric takes on a life of its own
  • Competition spurs progress
    • DARPA Grand Challenge, Netflix competition
  • Man vs. Machine Competition
    • “adds spice to the study” (Samuel)
    • “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)
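Elo ratings, mentioned above as a progress metric, map a rating gap to an expected score. The following is a minimal sketch of the standard logistic Elo formula; the function names and the K-factor of 32 are illustrative assumptions, not taken from the talk.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against B (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after one game; k is an assumed K-factor."""
    return rating_a + k * (score_a - elo_expected_score(rating_a, rating_b))


if __name__ == "__main__":
    # Expected score of a 2150-rated player against a 1650-rated player.
    print(round(elo_expected_score(2150, 1650), 3))
```

Under this formula, a 500-point gap such as the 1650→2150 KnightCap improvement reported later corresponds to an expected score of roughly 0.95 for the stronger player.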
How Games Extend “Classic RL”
  • Fourth dimension: non-stationarity
  • [Figure: diagram of game domains; surviving labels: “Motivated” RL, game strategy, AI Fighters, chess, etc.]
Ingredients for RL success
  • Several commonalities:
    • Problems are more-or-less MDPs (near full observability, little history dependence)
    • |S| is enormous → can’t do DP
    • State-space representation critical: use of “features” based on domain knowledge
    • Train in a simulator! Need lots of experience, but still << |S|
    • Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation
    • Only visit plausible states; only generalize to plausible states
RL + Gradient Parameter Training
  • Recall incremental Bellman updates (TD(0))
  • If instead V(s) = V_θ(s), adjust θ to reduce MSE (R − V_θ(s))² by gradient descent:
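For reference, the standard textbook forms of these two updates can be written as follows. The step size α, the discount factor γ, and the parameter symbol θ are assumptions chosen for illustration, so the slide's own notation may differ. In the gradient form, R is the training target (for TD(0), R = r_{t+1} + γ V_θ(s_{t+1})), treated as a constant when differentiating.

```latex
% Incremental (tabular) TD(0) update
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

% Gradient descent on the squared error (R - V_\theta(s_t))^2
\theta \leftarrow \theta - \tfrac{\alpha}{2} \nabla_\theta \left( R - V_\theta(s_t) \right)^2
       = \theta + \alpha \left( R - V_\theta(s_t) \right) \nabla_\theta V_\theta(s_t)
```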
Learning backgammon using TD(λ)
  • Neural net observes a sequence of input patterns x_1, x_2, x_3, …, x_f: the sequence of board positions occurring during a game
  • Representation: Raw board description (# of White or Black checkers at each location) using simple truncated unary encoding (“hand-crafted features” added in later versions)
    • 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features
  • Train neural net using gradient version of TD(λ)
  • Trained NN output V_t = V(x_t, w) should estimate Pr(White wins | x_t)
  • TD-Gammon can teach itself by playing games against itself and learning from the outcome
    • Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play
    • add hand-crafted features: advanced level of play (1991)
    • 2-ply search: strong master play (1993)
    • 3-ply search: superhuman play (1998)
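The gradient version of TD(λ) can be illustrated on a much smaller problem than backgammon. The sketch below is not TD-Gammon: it assumes a 5-state random walk with one-hot features (so the "network" is a linear value function with one weight per state), no discounting, and an outcome of 1 when the walk exits on the right, which plays the role of "White wins". The function name and all hyperparameters are hypothetical.

```python
import random


def td_lambda_random_walk(n_states=5, episodes=5000, alpha=0.01, lam=0.7, seed=0):
    """Estimate Pr(exit right | state) with TD(lambda) and accumulating traces."""
    rng = random.Random(seed)
    w = [0.5] * n_states               # one weight per state (one-hot features)
    for _ in range(episodes):
        trace = [0.0] * n_states       # eligibility trace, reset every episode
        s = n_states // 2              # every episode starts in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:             # exited on the left: outcome 0
                target, done = 0.0, True
            elif s_next >= n_states:   # exited on the right: outcome 1
                target, done = 1.0, True
            else:                      # bootstrap from the next state's estimate
                target, done = w[s_next], False
            # Decay old traces, then add the gradient of V(s) w.r.t. w,
            # which for one-hot features is 1 at index s.
            trace = [lam * e for e in trace]
            trace[s] += 1.0
            delta = target - w[s]      # TD error
            w = [wi + alpha * delta * ei for wi, ei in zip(w, trace)]
            if done:
                break
            s = s_next
    return w


if __name__ == "__main__":
    print([round(v, 2) for v in td_lambda_random_walk()])
```

For this toy chain the true values are (i + 1) / 6 for state i, so the learned weights should increase from left to right. TD-Gammon replaces the one-hot table with a neural net and the random walk with self-play games.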
Extending TD(λ) to TDLeaf
  • Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation
  • Samuel had the basic idea: train value of current state to match minimax backed-up value
  • Proper mathematical formulation proposed by Beal & Smith; Baxter et al.
  • Baxter’s Chess program KnightCap showed rapid learning in play vs. humans: 1650→2150 Elo in only 300 games!
  • Schaeffer et al. retrained weights of Checkers program Chinook using TDLeaf + self-play; the result was as strong as the manually tuned weights (a 5-year effort)
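The core TDLeaf idea, training on the leaf of the principal variation rather than on the root position, can be sketched on a toy tree. This is a simplified one-step (λ = 0) illustration under stated assumptions, not the formulation from the cited papers: positions are nested lists, leaves are tuples of feature values, the evaluation is linear, and both roots are positions with the learner (the maximizer) to move. All names are hypothetical.

```python
def evaluate(features, w):
    """Linear evaluation of a leaf's feature tuple."""
    return sum(f * wi for f, wi in zip(features, w))


def pv_leaf(node, w, maximizing=True):
    """Return (minimax value, feature tuple of the principal-variation leaf)."""
    if isinstance(node, tuple):        # leaf: a tuple of feature values
        return evaluate(node, w), node
    results = [pv_leaf(child, w, not maximizing) for child in node]
    pick = max if maximizing else min
    return pick(results, key=lambda r: r[0])


def tdleaf_step(root_t, root_next, w, alpha=0.1):
    """Move the value of the PV leaf at time t toward the backed-up value at t+1."""
    v_t, leaf_t = pv_leaf(root_t, w)
    v_next, _ = pv_leaf(root_next, w)
    delta = v_next - v_t
    # For a linear evaluation, the gradient at the leaf is its feature vector.
    return [wi + alpha * delta * f for wi, f in zip(w, leaf_t)]


if __name__ == "__main__":
    weights = [1.0, 0.0]
    position_t = [[(1.0, 0.0), (3.0, 1.0)], [(2.0, 1.0), (5.0, 0.0)]]
    position_next = [[(4.0, 0.0)], [(0.0, 0.0)]]
    print(tdleaf_step(position_t, position_next, weights))
```

The design point is that the gradient is taken at the leaf position reached by search, since that is the position whose static evaluation actually produced the backed-up value.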
RL in Computer Go
  • Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation
  • NeuroGo (M. Enzenberger, 1996; 2003)
    • Multiple reward signals: single-point eyes, connections and live points
    • Rating ~1880 in 9x9 Go using 3-ply α-β search
  • RLGO (D. Silver, 2008) uses only primitive local features and a linear value function. Can do live on-the-fly training for each new position encountered in a Go game!
    • Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search
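Both Go programs above pair a learned value function with α-β search. A minimal sketch of depth-limited α-β pruning on a toy tree follows; the tree encoding (nested lists with numeric leaves), the `visited` list, and the placeholder value returned at the depth cut-off are assumptions for illustration, standing in for a real move generator and learned evaluation.

```python
import math


def alphabeta(node, depth, alpha, beta, maximizing, visited):
    """Minimax value of `node` with alpha-beta pruning.

    A node is either a number (leaf score from the maximizer's view) or a
    list of child nodes. `visited` records the leaves actually evaluated.
    """
    if not isinstance(node, list):
        visited.append(node)
        return node
    if depth == 0:
        return 0.0  # placeholder static evaluation for cut-off interior nodes
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, visited))
            alpha = max(alpha, value)
            if alpha >= beta:          # the minimizer will never allow this line
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True, visited))
        beta = min(beta, value)
        if alpha >= beta:              # the maximizer already has a better option
            break
    return value


if __name__ == "__main__":
    seen = []
    tree = [[3, 5], [2, 9], [0, 1]]
    print(alphabeta(tree, 2, -math.inf, math.inf, True, seen), seen)
```

On this tree the search returns 3 after evaluating only four of the six leaves, which is the saving that lets a program search deeper for the same number of evaluations.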
Robot Air Hockey
  • video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi
  • D. Bentivegna & C. Atkeson, ICRA 2001
    • 2-D spatial problem
    • 30 degree-of-freedom arm, 420 decisions/sec
    • hand-built primitives, supervised learning + RL
WoLF in Adversarial Robot Learning
  • Gra-WoLF (Bowling & Veloso): Combines WoLF (“Win or Learn Fast”) principle with policy gradient RL (Sutton et al., 2000)
    • again 2-D spatial geometry, 7 input features, 16 CMAC tiles
    • video at: http://webdocs.cs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4
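The "Win or Learn Fast" principle amounts to a variable step size: learn cautiously when doing better than one's own average policy, and quickly when doing worse. The sketch below follows the rule used in the tabular WoLF policy hill-climbing variant for a single state, not the GraWoLF policy-gradient method itself; the function names and the two step sizes are hypothetical.

```python
def wolf_step_size(policy, avg_policy, q_values, delta_win=0.01, delta_lose=0.04):
    """WoLF rule: small step when 'winning', larger step when 'losing'."""
    current = sum(p * q for p, q in zip(policy, q_values))
    average = sum(p * q for p, q in zip(avg_policy, q_values))
    return delta_win if current > average else delta_lose


def hill_climb(policy, q_values, step):
    """Shift probability mass toward the greedy action, staying a valid distribution."""
    n = len(policy)
    best = max(range(n), key=lambda a: q_values[a])
    new_policy = list(policy)
    for a in range(n):
        if a != best:
            moved = min(new_policy[a], step / (n - 1))  # never go below zero
            new_policy[a] -= moved
            new_policy[best] += moved
    return new_policy


if __name__ == "__main__":
    pi, pi_avg, q = [0.5, 0.5], [0.8, 0.2], [1.0, 0.0]
    step = wolf_step_size(pi, pi_avg, q)
    print(step, hill_climb(pi, q, step))
```

Here the current policy earns less than the average policy would, so the agent counts as losing and takes the larger step toward the greedy action.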
RL in Robocup Soccer
  • Once again, 2-D spatial geometry
  • Much good work by Peter Stone et al.
    • TPOT-RL: Learned advanced team strategies given limited observability – key to CMUnited victories in late 90s
    • Fast Gait for Sony Aibo dogs
    • Ball Acquisition for Sony Aibo dogs
    • Keepaway in Robocup simulation league
Robocup “Keepaway” Game (Stone et al.)
  • Uses Robocup simulator, not real robots
  • Task: one team (“keepers”) tries to maintain possession of the ball as long as possible; the other team (“takers”) tries to take it away
  • Keepers are trained using continuous-time, semi-Markov version of Sarsa algorithm
  • Represent Q(s,a) using CMAC (coarse tile coding) function approximation
  • State representation: small # of distances and angles between teammates, opponents, and ball
  • Reward = time of possession
  • Results: learned policies do much better than either random or hand-coded policies, e.g. on 25x25 field:
    • time of possession (TOP): learned 15.0 sec, hand-coded 8.0 sec, random 6.4 sec
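The combination of Sarsa with tile-coded Q(s,a) can be sketched on a toy task. This is not the Keepaway learner: it assumes ordinary discrete-time one-step Sarsa rather than the continuous-time semi-Markov version, a one-dimensional state x in [0, 1), two actions, and a reward of −1 per step until x reaches 1. All names and constants are hypothetical.

```python
import random

N_TILINGS, N_TILES = 4, 8
ACTIONS = (-0.1, +0.1)                 # move left or right


def active_tiles(x):
    """Index of the single active tile in each tiling for x in [0, 1)."""
    indices = []
    for t in range(N_TILINGS):
        offset = t / (N_TILINGS * N_TILES)        # each tiling is shifted slightly
        tile = int((x + offset) * N_TILES)        # 0 .. N_TILES inclusive
        indices.append(t * (N_TILES + 1) + tile)
    return indices


def q_value(w, x, a):
    """Q(x, a) is the sum of the weights of the active tiles for action a."""
    return sum(w[a][i] for i in active_tiles(x))


def choose(w, x, eps, rng):
    """Epsilon-greedy action selection."""
    if rng.random() < eps:
        return rng.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_value(w, x, a))


def train(episodes=500, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [[0.0] * (N_TILINGS * (N_TILES + 1)) for _ in ACTIONS]
    for _ in range(episodes):
        x = 0.5
        a = choose(w, x, eps, rng)
        for _ in range(200):                      # cap on episode length
            x_next = min(max(x + ACTIONS[a], 0.0), 1.0)
            done = x_next >= 1.0
            if done:
                target = -1.0
            else:
                a_next = choose(w, x_next, eps, rng)   # on-policy next action
                target = -1.0 + gamma * q_value(w, x_next, a_next)
            delta = target - q_value(w, x, a)
            for i in active_tiles(x):             # spread the update over tilings
                w[a][i] += alpha / N_TILINGS * delta
            if done:
                break
            x, a = x_next, a_next
    return w


if __name__ == "__main__":
    weights = train()
    print(q_value(weights, 0.5, 0), q_value(weights, 0.5, 1))
```

Because neighbouring states share tiles, each update generalizes locally, which is the role the CMAC plays for the distance and angle features in Keepaway.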
AI Fighters
  • Graepel, Herbrich & Gold, 2004 – used commercial game platform Tao Feng (runs on Xbox): real time simulator (3D!)
    • basic feature set + SARSA + linear value function
    • multiple challenges of environment (real time, concurrency,…):
      • opponent state not known exactly
      • agent state and reward not known exactly
      • due to game animation, legal moves are not known
Links to AI Fighters videos
  • before training
  • after training
Discussion / Lessons Learned ??
  • Winning formula:
    • hand-designed features (fairly small number)
    • aggressive smooth function approx.
    • Researchers should try raw-input comparisons and try nonlinear function approx.
  • Many/most state variables in real problems seem pretty irrelevant
    • Opportunity to try recent linear and/or nonlinear Dimensionality Reduction algorithms
    • Sparsity constraints (L1 regularization etc.) also promising
  • Brain/retina architecture impressively suited for 2-D spatial problems
    • More studies using Convolutional Neural Nets etc.
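The remark on sparsity constraints can be made concrete with a small L1-regularized regression. The sketch below assumes a synthetic dataset in which only the first of three features matters, and uses proximal gradient descent (ISTA) with soft-thresholding; the data, function names, and hyperparameters are illustrative assumptions, not results from the talk.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink toward zero, clip small values to 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_ista(X, y, lam=0.1, step=0.1, iters=500):
    """Minimize (1 / 2n) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        residual = [sum(xi * wi for xi, wi in zip(row, w)) - yi
                    for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(residual, X)) / n
                for j in range(d)]
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad)]
    return w


def make_data(n=200, seed=0):
    """Synthetic data: the target depends only on feature 0."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
    y = [2.0 * row[0] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


if __name__ == "__main__":
    features, targets = make_data()
    print(lasso_ista(features, targets))
```

The L1 penalty drives the weights of the two irrelevant features to exactly zero while slightly shrinking the relevant one, which is the behaviour that makes it attractive when most state variables are irrelevant.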