RL Successes and Challenges in High-Dimensional Games

Gerry Tesauro

IBM T.J.Watson Research Center


Outline

  • Overview/Definition of “Games”

    • Why Study Games?

    • Commonalities of RL successes

  • RL in Classic Board Games

    • TD-Gammon, KnightCap, TD-Chinook, RLGO

  • RL in Robotics Games

    • Attacker/Defender Robots

    • Robocup Soccer

  • RL in Video/Online Games

    • AI Fighters

  • Open Discussion / Lessons Learned


What Do We Mean by “Games” ??

  • Some Definitions of “Game”

    • A structured activity, usually undertaken for enjoyment (Wikipedia)

    • Activity among decision-makers in seeking to achieve objectives in a limiting context (Clark Abt)

    • A form of play with goals and structure (Kevin Maroney)

  • Single-Player Game = “Puzzle”

  • “Competition” if players can’t interfere with other players’ performance

    • Olympic Hockey vs. Olympic Figure Skating

  • Common Ingredients: Players, Rules, Objective

    • But: some games have modifiable rules and no clear objective (e.g., MOOs)


Why Use Games for RL/AI ??

  • Clean, Idealized Models of Reality

    • Rules are clear and known (Samuel: not true in economically important problems)

    • Can build very good simulators

  • Clear Metric to Measure Progress

    • Tournament results, Elo ratings, etc.

    • Danger: Metric takes on a life of its own

  • Competition spurs progress

    • DARPA Grand Challenge, Netflix competition

  • Man vs. Machine Competition

    • “adds spice to the study” (Samuel)

    • “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)


How Games Extend “Classic RL”

[Slide diagram: games extend “classic RL” (backgammon, chess, etc.) along three axes: complex motivation (“motivated” RL), multi-agent game strategy, and lifelike environments, with examples including Poker, Chicken, Robocup Soccer, and AI Fighters; a fourth dimension is non-stationarity.]


Ingredients for RL success

  • Several commonalities:

    • Problems are more-or-less MDPs (near full observability, little history dependence)

    • |S| is enormous → can’t do DP

    • State-space representation critical: use of “features” based on domain knowledge

    • Train in a simulator! Need lots of experience, but still << |S|

    • Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation (see the sketch after this list)

    • Only visit plausible states; only generalize to plausible states
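To make the last few bullets concrete, here is a minimal sketch of a hand-crafted feature map plus a linear value function. The feature names (material, mobility, threats) and the weights are purely illustrative assumptions, not anything from the talk; the point is that an enormous state space is compressed into a short feature vector and a small weight vector, so value estimates generalize aggressively across states that share features.

```python
import numpy as np

# Hypothetical hand-crafted features for a board position s; real systems
# (TD-Gammon, Chinook, RLGO) use domain-specific features of this kind.
def features(s):
    """Compress an enormous raw state into a short vector of domain features."""
    return np.array([s["material"], s["mobility"], s["threats"]], dtype=float)

# A small weight vector (a few hundred parameters in practice, vs. |S| states).
w = np.array([0.05, 0.02, -0.08])

def value(s):
    # Linear (or shallow-NN) approximation: positions with similar features
    # get similar values, so experience generalizes far beyond visited states.
    return float(w @ features(s))

print(value({"material": 2.0, "mobility": 14.0, "threats": 1.0}))
```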


RL + Gradient Parameter Training

  • Recall incremental Bellman updates (TD(0)): V(s) ← V(s) + α [r + γ V(s′) − V(s)]

  • If instead V(s) = V(s; θ), adjust θ to reduce the MSE (R − V(s; θ))² by gradient descent: Δθ = α (R − V(s; θ)) ∇θ V(s; θ)


TD() training of neural networks (episodic; =1 and intermediate r = 0):




Learning backgammon using TD(λ)

  • Neural net observes a sequence of input patterns x1, x2, x3, …, xf : sequence of board positions occurring during a game

  • Representation: Raw board description (# of White or Black checkers at each location) using simple truncated unary encoding (“hand-crafted features” added in later versions)

    • 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features (see the encoding sketch below)

  • Train neural net using gradient version of TD()

  • Trained NN output Vt = V(xt, w) should estimate Prob(White wins | xt)
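A rough sketch of a truncated unary (thermometer) encoding of checker counts, in the spirit of the raw input described above. The unit counts and thresholds here are illustrative assumptions, not the exact TD-Gammon scheme; the idea is that each board location, for each player, gets a few binary/graded units rather than a single integer.

```python
def encode_point(n_checkers, n_units=4):
    """Truncated unary encoding of one player's checker count on one location:
    unit i is on if at least i+1 checkers are present, and the last unit grows
    linearly with any excess (illustrative; the real scheme differs in details)."""
    units = [1.0 if n_checkers > i else 0.0 for i in range(n_units - 1)]
    excess = max(0, n_checkers - (n_units - 1))
    units.append(excess / 2.0)            # graded unit for large stacks
    return units

def encode_board(white_counts, black_counts):
    """Concatenate per-location encodings for both players into one raw input vector."""
    vec = []
    for counts in (white_counts, black_counts):
        for n in counts:
            vec.extend(encode_point(n))
    return vec

# Example: 28 locations per player, a few checkers scattered around.
white = [0] * 28
black = [0] * 28
white[5], white[7], black[12] = 5, 3, 2
x = encode_board(white, black)
print(len(x))   # 28 locations * 2 players * 4 units = 224 raw inputs in this sketch
```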


  • TD-Gammon can teach itself by playing games against itself and learning from the outcome (a self-play loop of this kind is sketched after this list)

    • Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play

    • add hand-crafted features: advanced level of play (1991)

    • 2-ply search: strong master play (1993)

    • 3-ply search: superhuman play (1998)
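Below is a minimal sketch of one self-play game of the kind described above. The helpers (legal_afterstates, is_terminal, outcome, value_fn) are hypothetical placeholders for the game engine and the trained network; in backgammon the legal afterstates also depend on the dice roll, which is folded into legal_afterstates here. The returned trajectory and outcome can then be fed to an episodic TD(λ) update such as the one sketched earlier.

```python
def play_self_play_game(initial_state, legal_afterstates, is_terminal,
                        outcome, value_fn):
    """Play one self-play game with greedy 1-ply move selection.

    Hypothetical helpers supplied by the caller:
      legal_afterstates(s) -> list of positions reachable by the side to move
      is_terminal(s)       -> True if the game is over in position s
      outcome(s)           -> 1.0 if White won, else 0.0
      value_fn(s)          -> current network estimate of P(White wins | s)
    """
    s, white_to_move = initial_state, True
    trajectory = [initial_state]
    while not is_terminal(s):
        candidates = legal_afterstates(s)
        # Each side picks the afterstate its own value estimate likes best:
        # White maximizes P(White wins), Black minimizes it.
        s = max(candidates, key=value_fn) if white_to_move else min(candidates, key=value_fn)
        trajectory.append(s)
        white_to_move = not white_to_move
    return trajectory, outcome(s)
```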


New TD-Gammon Results!

(Tesauro, 1992)


Extending TD(λ) to TDLeaf

  • Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation

  • Samuel had the basic idea: train value of current state to match minimax backed-up value

  • Proper mathematical formulation proposed by Beal & Smith and by Baxter et al. (the weight update is sketched after this list)

  • Baxter’s Chess program KnightCap showed rapid learning in play vs. humans: 1650→2150 Elo in only 300 games!

  • Schaeffer et al. retrained weights of Checkers program Chinook using TDLeaf + self-play; as strong as manually tuned weights (5 year effort)
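As a sketch of the idea in the bullets above, here is a TDLeaf(λ)-style weight update for a linear evaluation function, roughly following Baxter et al. The search that produces the principal-variation leaves is assumed to happen elsewhere, and the step size and λ are illustrative; in practice the value of the final position is replaced by the actual game outcome.

```python
import numpy as np

def tdleaf_update(leaf_features, w, alpha=0.01, lam=0.7):
    """TDLeaf(lambda) sketch: leaf_features[t] is the feature vector phi(l_t) of
    the principal-variation leaf found by minimax/alpha-beta search from the
    position at move t.  A linear evaluation V(l) = w . phi(l) is assumed, so
    grad_w V(l_t) is simply phi(l_t)."""
    phi = [np.asarray(x, dtype=float) for x in leaf_features]
    values = [float(w @ x) for x in phi]
    n = len(values)
    deltas = [values[t + 1] - values[t] for t in range(n - 1)]   # temporal differences
    w_new = np.array(w, dtype=float)
    for t in range(n - 1):
        # Each leaf value is nudged toward the lambda-discounted sum of the
        # temporal differences that follow it in the game.
        future = sum(lam ** (j - t) * deltas[j] for j in range(t, n - 1))
        w_new += alpha * phi[t] * future
    return w_new
```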


RL in Computer Go

  • Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation

  • NeuroGo (M. Enzenberger, 1996; 2003)

    • Multiple reward signals: single-point eyes, connections and live points

    • Rating ~1880 in 9x9 Go using 3-ply α-β search

  • RLGO (D. Silver, 2008) uses only primitive local features and a linear value function. Can do live on-the-fly training for each new position encountered in a Go game!

    • Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search


RL in Robotics Games


Robot Air Hockey

  • video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi

  • D. Bentivegna & C. Atkeson, ICRA 2001

    • 2-D spatial problem

    • 30 degree-of-freedom arm, 420 decisions/sec

    • hand-built primitives, supervised learning + RL


WoLF in Adversarial Robot Learning

  • Gra-WoLF (Bowling & Veloso): combines the WoLF (“Win or Learn Fast”) principle with policy-gradient RL (Sutton et al., 2000); the variable step-size idea is sketched after this list

    • again 2-D spatial geometry, 7 input features, 16 CMAC tiles

  • video at: http://webdocs.cs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4
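Since GraWoLF itself is more involved, here is a sketch of the underlying WoLF principle in its simplest published form, WoLF policy hill-climbing for a repeated matrix game (Bowling & Veloso): learn cautiously while winning (small step d_win) and quickly while losing (larger step d_lose), judging "winning" by comparing the current policy against the long-run average policy under the current Q-values. The learning rates and game size are illustrative assumptions.

```python
import numpy as np

class WoLFPHCAgent:
    """WoLF policy hill-climbing for a stateless (matrix) game."""

    def __init__(self, n_actions, alpha=0.1, d_win=0.01, d_lose=0.04, seed=0):
        self.q = np.zeros(n_actions)                   # action values
        self.pi = np.full(n_actions, 1.0 / n_actions)  # current mixed policy
        self.avg_pi = self.pi.copy()                   # long-run average policy
        self.count = 0
        self.alpha, self.d_win, self.d_lose = alpha, d_win, d_lose
        self.rng = np.random.default_rng(seed)

    def act(self):
        return int(self.rng.choice(len(self.pi), p=self.pi))

    def update(self, action, reward):
        # Q-learning update for the chosen action (no next state in a matrix game).
        self.q[action] += self.alpha * (reward - self.q[action])
        # Running average of the policies played so far.
        self.count += 1
        self.avg_pi += (self.pi - self.avg_pi) / self.count
        # WoLF: small step while "winning" (current policy beats the average
        # policy under the current Q-values), large step while "losing".
        delta = self.d_win if self.pi @ self.q > self.avg_pi @ self.q else self.d_lose
        # Hill-climb toward the greedy action, keeping pi a valid distribution.
        greedy = int(np.argmax(self.q))
        for a in range(len(self.pi)):
            if a != greedy:
                step = min(self.pi[a], delta / (len(self.pi) - 1))
                self.pi[a] -= step
                self.pi[greedy] += step
```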


RL in Robocup Soccer

  • Once again, 2-D spatial geometry

  • Much good work by Peter Stone et al.

    • TPOT-RL: learned advanced team strategies under limited observability – key to CMUnited victories in the late 1990s

    • Fast Gait for Sony Aibo dogs

    • Ball Acquisition for Sony Aibo dogs

    • Keepaway in Robocup simulation league


Robocup “Keepaway” Game (Stone et al.)

  • Uses Robocup simulator, not real robots

  • Task: one team (the “keepers”) tries to maintain possession of the ball as long as possible, while the other team (the “takers”) tries to take it away

  • Keepers are trained using a continuous-time, semi-Markov version of the Sarsa algorithm

  • Represent Q(s,a) using CMAC (coarse tile coding) function approximation (sketched after this list)

  • State representation: small # of distances and angles between teammates, opponents, and ball

  • Reward = time of possession

  • Results: learned policies do much better than either random or hand-coded policies, e.g. on 25x25 field:

    • learned policy: 15.0 sec time of possession; hand-coded: 8.0 sec; random: 6.4 sec
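A minimal sketch of the Q(s, a) representation and Sarsa update in the spirit of this setup: each continuous state variable (a distance or an angle) is tiled independently by several offset tilings, and Q is the sum of one weight per active tile. The tile widths, ranges, number of tilings, and step size are illustrative assumptions rather than the published keepaway parameters, and eligibility traces and the semi-Markov (variable-duration) aspect are omitted.

```python
import numpy as np

class TileCodedQ:
    """Q(s, a) with CMAC-style tile coding over a small vector of distances/angles."""

    def __init__(self, n_actions, n_vars, n_tilings=8, n_tiles=10,
                 lo=0.0, hi=25.0, alpha=0.1):
        self.n_actions, self.n_vars = n_actions, n_vars
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.lo, self.width = lo, (hi - lo) / n_tiles
        # one weight per (action, state variable, tiling, tile)
        self.w = np.zeros((n_actions, n_vars, n_tilings, n_tiles + 1))
        self.alpha = alpha / n_tilings          # step size per active tile

    def _active_tiles(self, s):
        tiles = np.empty((self.n_vars, self.n_tilings), dtype=int)
        for i, x in enumerate(s):
            for t in range(self.n_tilings):
                offset = t * self.width / self.n_tilings   # shifted tilings
                tiles[i, t] = int(np.clip((x - self.lo + offset) // self.width,
                                          0, self.n_tiles))
        return tiles

    def q(self, s, a):
        tiles = self._active_tiles(s)
        return sum(self.w[a, i, t, tiles[i, t]]
                   for i in range(self.n_vars) for t in range(self.n_tilings))

    def sarsa_update(self, s, a, r, s_next, a_next, gamma=1.0, terminal=False):
        # Standard Sarsa target; weights of active tiles share the TD error.
        target = r if terminal else r + gamma * self.q(s_next, a_next)
        delta = target - self.q(s, a)
        tiles = self._active_tiles(s)
        for i in range(self.n_vars):
            for t in range(self.n_tilings):
                self.w[a, i, t, tiles[i, t]] += self.alpha * delta
```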


RL in Video Games


AI Fighters

  • Graepel, Herbrich & Gold (2004) used the commercial game platform Tao Feng (runs on Xbox): a real-time 3-D simulator

    • basic feature set + SARSA + linear value function

    • multiple challenges of environment (real time, concurrency,…):

      • opponent state not known exactly

      • agent state and reward not known exactly

      • due to game animation, legal moves are not known


Links to AI Fighters videos:

before training:

http://research.microsoft.com/en-us/projects/mlgames2008/taofengearlyaggressive.wmv

after training:

http://research.microsoft.com/en-us/projects/mlgames2008/taofenglateaggressive.wmv


Discussion / Lessons Learned ??

  • Winning formula:

    • hand-designed features (fairly small number)

    • aggressive smooth function approximation

    • Researchers should try raw-input comparisons and nonlinear function approximation

  • Many/most state variables in real problems seem pretty irrelevant

    • Opportunity to try recent linear and/or nonlinear Dimensionality Reduction algorithms

    • Sparsity constraints (L1 regularization etc.) also promising

  • Brain/retina architecture impressively suited for 2-D spatial problems

    • More studies using Convolutional Neural Nets etc.

