
Odds & Ends


Presentation Transcript


  1. Odds & Ends

  2. Administrivia • Reminder: Q3 Nov 10 • CS outreach: • UNM SOE holding open house for HS seniors • Want CS dept participation • We want to show off the coolest things in CS • Come demo your P1 and P2 code! • Contact me or Lynne Jacobson

  3. The bird of time... • Last time: • Eligibility traces • The SARSA(λ) algorithm • Design exercise • This time: • Tip o’ the day • Notes on exploration • Design exercise, cont’d.

  4. Tip o’ the day • Micro-experiments • Often, often, often when hacking: • “How the heck does that function work?” • “The docs don’t say what happens when you hand null to the constructor...” • “Uhhh... Will this work if I do it this way?” • “WTF does that mean?” • Could spend a bunch of time in the docs • Or... • Could just go and try it

  5. Tip o’ the day • Answer: micro-experiments • Write a very small (<50 line) test program to make sure you understand what the thing does • Think: homework assignment from CS152 • Quick to write • Answers question better than docs can • Builds your intuition about what the machine is doing • Using the debugger to watch is also good
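For instance, a throwaway micro-experiment for a made-up version of one of those questions (“does HashMap accept a null key?”) might look like this; the question, class name, and values are only illustrative:

     import java.util.HashMap;
     import java.util.Map;

     // Micro-experiment: does HashMap accept a null key, and what comes back?
     public class NullKeyTest {
       public static void main(String[] args) {
         Map<String, Integer> m = new HashMap<>();
         m.put(null, 42);                          // throws, or works silently?
         System.out.println(m.get(null));          // 42 if null keys are allowed
         System.out.println(m.containsKey(null));  // true?
       }
     }

Ten lines, one question answered, and you trust the result more than a half-remembered doc page.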

  6. Action selection in RL

  7. Q learning in code...

     public class MyAgent implements Agent {
       public void updateModel(SARSTuple s) {
         State2d start = s.getInitState();
         State2d end = s.getNextState();
         Action act = s.getAction();
         double r = s.getReward();
         // off-policy: back up from the greedy action in the next state
         Action nextAct = _policy.argmaxAct(end);
         double Qnow = _policy.get(start, act);
         double Qnext = _policy.get(end, nextAct);
         // one-step Q-learning update
         double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
         _policy.set(start, act, Qrevised);
       }
     }

  8. The SARSA(λ) code

     public class SARSAlAgent implements Agent {
       public void updateModel(SARSTuple s) {
         State2d start = s.getInitState();
         State2d end = s.getNextState();
         Action act = s.getAction();
         double r = s.getReward();
         // on-policy: back up from the action the agent will actually take
         Action nextAct = pickAction(end);
         double Qnow = _policy.get(start, act);
         double Qnext = _policy.get(end, nextAct);
         double delta = r + _gamma * Qnext - Qnow;
         // bump the accumulating trace for the pair just visited
         setElig(start, act, getElig(start, act) + 1.0);
         for (SAPair p : getEligiblePairs()) {
           double currQ = _policy.get(p.getS(), p.getA());
           // update every eligible pair in proportion to its trace
           _policy.set(p.getS(), p.getA(),
             currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
           // decay the trace by gamma * lambda
           setElig(p.getS(), p.getA(),
             getElig(p.getS(), p.getA()) * _gamma * _lambda);
         }
       }
     }

  9. Q & SARSA(λ): Key diffs • Use of eligibility traces • Q-learning updates only a single step of history • SARSA(λ) keeps a record of visited state/action pairs: e(s,a) • Updates each Q(s,a) value in proportion to e(s,a) • Decays e(s,a) by γλ each step
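In equation form, the update loop in the slide 8 code (accumulating traces) works out to roughly:

     δ = r + γ·Q(s',a') − Q(s,a)
     e(s,a) ← e(s,a) + 1                     (for the pair just visited)
     for every eligible pair (s,a):
         Q(s,a) ← Q(s,a) + α·δ·e(s,a)
         e(s,a) ← γλ·e(s,a)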

  10. Q & SARSA(λ): Key diffs • How the “next state” action is picked • Q: nextAct=_policy.argmaxAct(end) • Picks the “best” (greedy) next action • SARSA: nextAct=RLAgent.pickAction(end) • Picks the next action the agent would actually take • Huh? What’s the difference?

  11. Exploration vs. exploitation • Sometimes, the agent wants to do something other than the “best currently known action” • Why? • If the agent never tries anything new, it may never discover that there’s a better answer out there... • This is called the “exploration vs. exploitation” tradeoff • Is it better to “explore” to find new stuff, or to “exploit” what you already know?

  12. ε-Greedy exploration • Answer: • “Most of the time” do the best known thing • act = argmax_a Q(s,a) • “Rarely” try something random • act = pickAtRandom(allActionSet) • ε-greedy exploration policies: • “rarely” == prob ε • “most of the time” == prob 1-ε

  13. ε-Greedy in code

     public class eGreedyAgent implements RLAgent {
       // implements the ε-greedy exploration policy
       public Action pickAction(State2d s) {
         final double rVal = _rand.nextDouble();
         if (rVal < _epsilon) {
           // with probability ε: explore, picking a random action
           return randPick(_ASet);
         }
         // with probability 1-ε: exploit the best currently known action
         return _policy.argmaxAct(s);
       }

       private final Set<Action> _ASet;
       private final double _epsilon;
       private final Random _rand = new Random();
     }

  14. Design Exercise: Experimental Rig

  15. Design exercise • For M4/Rollout, need to be able to: • Train agent for many trials/steps per trial • Generate learning curves for agent’s learning • Run some trials w/ learning turned on • Freeze learning • Run some trials w/ learning turned off • Average steps-to-goal over those trials • Save average as one point in curve • Design: objects/methods to support this learning framework • Support: diff learning algs, diff environments, diff params, variable # of trials/steps, etc.
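A minimal sketch of one way to shape that rig, assuming a hypothetical Environment type and made-up names (ExperimentRunner, curvePoint, runTrial); it is one possible decomposition, not the required design:

     // Sketch only: Environment and all method names here are placeholders.
     public class ExperimentRunner {
       private final Agent _agent;        // the course's Agent interface
       private final Environment _env;    // hypothetical environment type

       public ExperimentRunner(Agent agent, Environment env) {
         _agent = agent;
         _env = env;
       }

       // One learning-curve point: train with learning on, then freeze
       // learning and average steps-to-goal over the evaluation trials.
       public double curvePoint(int trainTrials, int evalTrials, int maxSteps) {
         for (int t = 0; t < trainTrials; t++) {
           runTrial(true, maxSteps);
         }
         double totalSteps = 0.0;
         for (int t = 0; t < evalTrials; t++) {
           totalSteps += runTrial(false, maxSteps);
         }
         return totalSteps / evalTrials;
       }

       // Runs a single trial; calls _agent.updateModel(...) only when learn is true.
       private int runTrial(boolean learn, int maxSteps) {
         // ... step through _env until the goal or maxSteps, returning the step count ...
         return maxSteps; // placeholder
       }
     }

The point of the split is that different learning algorithms, environments, and parameter settings can be swapped in behind the Agent and Environment references without touching the curve-building loop.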
