
Odds & Ends


Presentation Transcript


  1. Odds & Ends

  2. Administrivia • Reminder: Q3 Nov 10 • CS outreach: • UNM SOE holding open house for HS seniors • Want CS dept participation • We want to show off the coolest things in CS • Come demo your P1 and P2 code! • Contact me or Lynne Jacobson

  3. The bird of time... • Last time: • Eligibility traces • The SARSA(λ) algorithm • Design exercise • This time: • Tip o’ the day • Notes on exploration • Design exercise, cont’d.

  4. Tip o’ the day • Micro-experiments • Often, often, often when hacking: • “How the heck does that function work?” • “The docs don’t say what happens when you hand null to the constructor...” • “Uhhh... Will this work if I do it this way?” • “WTF does that mean?” • Could spend a bunch of time in the docs • Or... • Could just go and try it

  5. Tip o’ the day • Answer: micro-experiments • Write a very small (<50 line) test program to make sure you understand what the thing does • Think: homework assignment from CS152 • Quick to write • Answers question better than docs can • Builds your intuition about what the machine is doing • Using the debugger to watch is also good
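For instance, a throwaway micro-experiment for a made-up version of one of those questions (“does HashMap accept a null key?”) might look like this; the question, class name, and values are only illustrative:

     import java.util.HashMap;
     import java.util.Map;

     // Micro-experiment: does HashMap accept a null key, and what comes back?
     public class NullKeyTest {
       public static void main(String[] args) {
         Map<String, Integer> m = new HashMap<>();
         m.put(null, 42);                          // throws, or works silently?
         System.out.println(m.get(null));          // 42 if null keys are allowed
         System.out.println(m.containsKey(null));  // true?
       }
     }

Ten lines, one question answered, and you trust the result more than a half-remembered doc page.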

  6. Action selection in RL

  7. Q learning in code...

     public class MyAgent implements Agent {
       public void updateModel(SARSTuple s) {
         State2d start = s.getInitState();
         State2d end = s.getNextState();
         Action act = s.getAction();
         double r = s.getReward();
         // off-policy: back up from the greedy action in the next state
         Action nextAct = _policy.argmaxAct(end);
         double Qnow = _policy.get(start, act);
         double Qnext = _policy.get(end, nextAct);
         // one-step Q-learning update
         double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
         _policy.set(start, act, Qrevised);
       }
     }

  8. The SARSA(λ) code

     public class SARSAlAgent implements Agent {
       public void updateModel(SARSTuple s) {
         State2d start = s.getInitState();
         State2d end = s.getNextState();
         Action act = s.getAction();
         double r = s.getReward();
         // on-policy: back up from the action the agent will actually take
         Action nextAct = pickAction(end);
         double Qnow = _policy.get(start, act);
         double Qnext = _policy.get(end, nextAct);
         double delta = r + _gamma * Qnext - Qnow;
         // bump the accumulating trace for the pair just visited
         setElig(start, act, getElig(start, act) + 1.0);
         for (SAPair p : getEligiblePairs()) {
           double currQ = _policy.get(p.getS(), p.getA());
           // update every eligible pair in proportion to its trace
           _policy.set(p.getS(), p.getA(),
             currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
           // decay the trace by gamma * lambda
           setElig(p.getS(), p.getA(),
             getElig(p.getS(), p.getA()) * _gamma * _lambda);
         }
       }
     }

  9. Q & SARSA(λ): Key diffs • Use of eligibility traces • Q-learning updates only a single step of history • SARSA(λ) keeps a record of visited state/action pairs: e(s,a) • Updates each Q(s,a) value in proportion to e(s,a) • Decays e(s,a) by γλ each step
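In equation form, the update loop in the slide 8 code (accumulating traces) works out to roughly:

     δ = r + γ·Q(s',a') − Q(s,a)
     e(s,a) ← e(s,a) + 1                     (for the pair just visited)
     for every eligible pair (s,a):
         Q(s,a) ← Q(s,a) + α·δ·e(s,a)
         e(s,a) ← γλ·e(s,a)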

  10. Q & SARSA(λ): Key diffs • How the “next state” action is picked • Q: nextAct=_policy.argmaxAct(end) • Picks the “best” (greedy) next action • SARSA: nextAct=RLAgent.pickAction(end) • Picks the next action the agent would actually take • Huh? What’s the difference?

  11. Exploration vs. exploitation • Sometimes, the agent wants to do something other than the “best currently known action” • Why? • If the agent never tries anything new, it may never discover that there’s a better answer out there... • This is called the “exploration vs. exploitation” tradeoff • Is it better to “explore” to find new stuff, or to “exploit” what you already know?

  12. ε-Greedy exploration • Answer: • “Most of the time” do the best known thing • act = argmax_a Q(s,a) • “Rarely” try something random • act = pickAtRandom(allActionSet) • ε-greedy exploration policies: • “rarely” == prob ε • “most of the time” == prob 1-ε

  13. ε-Greedy in code

     public class eGreedyAgent implements RLAgent {
       // implements the ε-greedy exploration policy
       public Action pickAction(State2d s) {
         final double rVal = _rand.nextDouble();
         if (rVal < _epsilon) {
           // with probability ε: explore, picking a random action
           return randPick(_ASet);
         }
         // with probability 1-ε: exploit the best currently known action
         return _policy.argmaxAct(s);
       }

       private final Set<Action> _ASet;
       private final double _epsilon;
       private final Random _rand = new Random();
     }

  14. Design Exercise: Experimental Rig

  15. Design exercise • For M4/Rollout, need to be able to: • Train agent for many trials/steps per trial • Generate learning curves for agent’s learning • Run some trials w/ learning turned on • Freeze learning • Run some trials w/ learning turned off • Average steps-to-goal over those trials • Save average as one point in curve • Design: objects/methods to support this learning framework • Support: diff learning algs, diff environments, diff params, variable # of trials/steps, etc.
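A minimal sketch of one way to shape that rig, assuming a hypothetical Environment type and made-up names (ExperimentRunner, curvePoint, runTrial); it is one possible decomposition, not the required design:

     // Sketch only: Environment and all method names here are placeholders.
     public class ExperimentRunner {
       private final Agent _agent;        // the course's Agent interface
       private final Environment _env;    // hypothetical environment type

       public ExperimentRunner(Agent agent, Environment env) {
         _agent = agent;
         _env = env;
       }

       // One learning-curve point: train with learning on, then freeze
       // learning and average steps-to-goal over the evaluation trials.
       public double curvePoint(int trainTrials, int evalTrials, int maxSteps) {
         for (int t = 0; t < trainTrials; t++) {
           runTrial(true, maxSteps);
         }
         double totalSteps = 0.0;
         for (int t = 0; t < evalTrials; t++) {
           totalSteps += runTrial(false, maxSteps);
         }
         return totalSteps / evalTrials;
       }

       // Runs a single trial; calls _agent.updateModel(...) only when learn is true.
       private int runTrial(boolean learn, int maxSteps) {
         // ... step through _env until the goal or maxSteps, returning the step count ...
         return maxSteps; // placeholder
       }
     }

The point of the split is that different learning algorithms, environments, and parameter settings can be swapped in behind the Agent and Environment references without touching the curve-building loop.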
