
Adaptive Sequential Decision Making with Self-Interested Agents



  1. Adaptive Sequential Decision Making with Self-Interested Agents David C. Parkes Division of Engineering and Applied Sciences Harvard University http://www.eecs.harvard.edu/econcs Wayne State University October 17, 2006

  2. Context • Multiple agents • Self-interest • Private information about preferences, capabilities • Coordinated decision problem • social planner • auctioneer

  3. Social Planner: LaGuardia Airport

  4. Social Planner: WiFi @ Starbucks

  5. Self-interested Auctioneer: Sponsored Search

  6. This talk: Sequential Decision Making • Multiple time periods • Agent arrival and departure • Values for sequences of decisions • Learning by agents and the “center” • Example scenarios: • allocating computational/network resources • sponsored search • last-minute ticket auctions • bidding for shared cars, air-taxis,… • …

  7. Markov Decision Process • States st, actions at • Transition model Pr(st+1 | at, st) • Rewards r(at, st) • + Self-interest
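As a minimal illustration of the MDP model on slide 7 (not from the talk), the sketch below solves a made-up two-state, two-action MDP by value iteration; the transition matrix, rewards, and discount factor are placeholders.

```python
import numpy as np

# Minimal value-iteration sketch for the MDP on this slide: states s,
# actions a, transitions Pr(s' | a, s), rewards r(a, s).  The two-state,
# two-action numbers below are illustrative, not from the talk.

P = np.array([                   # P[a, s, s'] = Pr(s' | a, s)
    [[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],    # transitions under action 1
])
R = np.array([                   # R[a, s] = r(a, s)
    [1.0, 0.0],
    [0.0, 2.0],
])
beta = 0.95                      # discount factor

def value_iteration(P, R, beta, tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + beta * P @ V                # Q[a, s] = r(a, s) + beta * E[V(s')]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # optimal value and greedy policy
        V = V_new

V, policy = value_iteration(P, R, beta)
print("V*:", V, "policy:", policy)
```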

  8. Online Mechanisms • Mechanism M = (π, p): decision policy πt: S → A, payment policy pt: S → Rn • Each period: • agents report state/rewards • center picks action, payments • Main question: • what policies can be implemented in a game-theoretic equilibrium?
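A rough skeleton of the online-mechanism loop on slide 8, assuming a generic interface: each period the agents submit reports, the center applies a decision policy πt: S → A and a payment rule pt: S → Rn. The function and type names are illustrative, not from the talk.

```python
from typing import Callable, Dict, List

# Rough skeleton of the online mechanism M = (pi, p) on this slide.
# Each period t: agents report their (claimed) state/rewards, the
# center applies a decision policy pi^t : S -> A and a payment rule
# p^t : S -> R^n.  The policy, payment rule, and report source are
# supplied as callbacks; all names here are illustrative.

Action = int
State = Dict[str, object]        # the center's state: the report history so far

def run_online_mechanism(
    policy: Callable[[State], Action],                     # pi^t : S -> A
    payment_rule: Callable[[State, Action], List[float]],  # p^t : S -> R^n
    get_reports: Callable[[int], List[dict]],              # agents' (possibly untruthful) reports
    horizon: int,
) -> None:
    state: State = {"reports": []}
    for t in range(horizon):
        reports = get_reports(t)                 # agents report state/rewards
        state["reports"].append(reports)
        action = policy(state)                   # center picks an action...
        payments = payment_rule(state, action)   # ...and a payment for each agent
        print(f"t={t}: action={action}, payments={payments}")
```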

  9. Outline • Multi-armed Bandits Problem [agent learning] • canonical, stylized learning problem from AI • introduce a multi-agent variation • provide a mechanism to bring optimal coordinated learning into an equilibrium • Dynamic auction problem [center learning] • resource allocation (e.g. WiFi) • dynamic arrival & departure of agents • provide a truthful, adaptive mechanism

  10. Multi-Armed Bandit Problem • Multi-armed bandit (MAB) problem • n arms • Each arm has stationary uncertain reward process • Goal: implement a (Bayesian) optimal learning policy

  11. Learning as Planning

  12. Optimal Learning as Planning

  13. Tractability: Gittins’ result • Theorem [Gittins & Jones 1974]: The complexity of computing an optimal joint policy for a collection of n Markov Chains is linear in n. • There exist independent index functions such that the MC with the highest “Gittins index” at any given time should be activated. • The index can be computed as the optimal value of a “restart-in-i” MDP, solved using an LP (Katehakis & Veinott ’87)
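To make the restart-in-i characterization on slide 13 concrete, here is a hedged sketch that computes a Gittins index for a finite-state arm. It follows the Katehakis–Veinott restart formulation but uses value iteration rather than the LP mentioned on the slide; the two-state example at the bottom is made up.

```python
import numpy as np

# Hedged sketch of the restart-in-i characterization (Katehakis &
# Veinott '87) mentioned on the slide: the Gittins index of state i is
# (1 - beta) times the optimal value at i of an MDP in which every
# state may either "continue" (reward r(j), move by P[j]) or "restart"
# (reward r(i), move by P[i]).  The slide solves this by LP; value
# iteration is used here instead, purely for brevity.

def gittins_index(P: np.ndarray, r: np.ndarray, i: int,
                  beta: float = 0.9, tol: float = 1e-10) -> float:
    n = len(r)
    V = np.zeros(n)
    while True:
        continue_val = r + beta * P @ V          # keep playing from each state
        restart_val = r[i] + beta * P[i] @ V     # jump back to state i
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            return (1.0 - beta) * V_new[i]
        V = V_new

# Illustrative two-state arm (numbers not from the talk):
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 0.0])
print([round(gittins_index(P, r, i), 4) for i in range(2)])
```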

  14. Self-Interest + MABP • Multi-armed bandit (MAB) problem • n arms • Each arm has stationary uncertain reward process • Goal: implement a (Bayesian) optimal learning policy

  15. Self-Interest + MABP • Multi-armed bandit (MAB) problem • n arms (arm == agent) • Each arm has stationary uncertain reward process, (privately observed) • Goal: implement a (Bayesian) optimal learning policy Mechanism

  16. [Figure: the mechanism activating arms/agents A1, A2, A3 and observing their rewards]

  17. Review: The Vickrey Auction • Rules: “sell to highest bidder at second-highest price” • How should you bid? Truthfully! • Alice wins for $8 • Bids: Alice: $10, Bob: $8, Carol: $6

  18. Review: The Vickrey Auction • Rules: “sell to highest bidder at second-highest price” • How should you bid? Truthfully! (dominant-strategy equilibrium) • Alice wins for $8 • Bids: Alice: $10, Bob: $8, Carol: $6
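For completeness, a tiny sketch of the second-price rule reviewed on slides 17–18, with the bids from the slide:

```python
# Tiny sketch of the Vickrey rule on this slide: sell to the highest
# bidder at the second-highest price.  Bids are the ones on the slide.

def vickrey(bids: dict):
    ranked = sorted(bids, key=bids.get, reverse=True)
    winner, price = ranked[0], bids[ranked[1]]
    return winner, price

print(vickrey({"Alice": 10, "Bob": 8, "Carol": 6}))   # ('Alice', 8)
```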

  19. First Idea: Vickrey Auction • Conjecture: Agents will bid the Gittins index for their arm in each round. • Intuition?

  20. Not truthful! • Agent 1 may have knowledge that the mean reward for arm 2 is smaller than agent 2’s current Gittins index. • Learning by 2 would decrease the price paid by 1 in the future ⇒ 1 should under-bid

  21. Second Idea • At every time-step: • Each agent reports a claim about its Gittins index • Suppose b1 ≥ b2 ≥ … ≥ bn • Mechanism activates agent 1 • Agent 1 reports reward, r1 • Mechanism pays r1 to each agent i ≠ 1 • Theorem: Truthful reporting is a Markov-Perfect equilibrium, and the mechanism implements optimal Bayesian learning.

  22. Learning-Gittins VCG • At every time-step: • Activate the agent with the highest bid. • Pay the reward received by the activated agent to all others • Collect from every agent i the expected value agents ≠ i would receive without i in the system • Sample hypothetical execution path(s), using no reported state information. • Theorem: Mechanism is truthful, system-optimal, ex ante IR, and ex ante strong budget-balanced in MPE.

  23. Learning-Gittins VCG (CPS’06) • At every time-step: • Activate the agent with the highest bid. • Pay the reward received by the activated agent to all others • Collect from every agent i the expected value agents ≠ i would receive without i in the system • Sample hypothetical execution path(s), using no reported state information. • Theorem: Mechanism is truthful, system-optimal, ex ante IR, and ex ante strong budget-balanced in MPE.
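A schematic sketch of one period of the Learning-Gittins VCG rule on slides 22–23. The `pull_arm` and `expected_value_without` callbacks are assumptions standing in for, respectively, the activated agent's reported reward and the sampled estimate of the value the other agents would receive without agent i (which the slide computes from hypothetical execution paths using no reported state information).

```python
from typing import Callable, Dict

# Schematic sketch of one period of the Learning-Gittins VCG rule.
# index_bids: each agent's reported Gittins index; pull_arm(winner)
# returns the activated agent's (reported) realized reward;
# expected_value_without(i) stands in for the sampled estimate of the
# value agents != i would receive this period without i in the system.
# All names are illustrative.

def learning_gittins_vcg_period(
    index_bids: Dict[str, float],
    pull_arm: Callable[[str], float],
    expected_value_without: Callable[[str], float],
) -> Dict[str, float]:
    winner = max(index_bids, key=index_bids.get)   # activate the agent with the highest bid
    reward = pull_arm(winner)                      # activated agent reports its reward
    transfers = {}
    for i in index_bids:
        pay = reward if i != winner else 0.0       # pay the realized reward to all others
        charge = expected_value_without(i)         # collect the expected value others would get without i
        transfers[i] = pay - charge                # net transfer to agent i this period
    return transfers
```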

  24. where X-i is the total expected value agents other than i would have received in this period if i weren’t there.

  25. Outline • Multi-armed Bandits Problem [agent learning] • canonical, stylized learning problem from AI • introduce a multi-agent variation • provide a mechanism to bring optimal coordinated learning into an equilibrium • Dynamic auction problem [center learning] • resource allocation (e.g. WiFi) • dynamic arrival & departure of agents • provide a truthful, adaptive mechanism, that converges towards an optimal decision policy

  26. [Figure: agents A1, A2, A3, A4 arriving and departing across periods st, st+1, st+2, st+3] First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?

  27. Illustrative Example • Selling a single right to access WiFi in each period • Agent: (ai, di, wi) ⇒ value wi for an allocation in some t ∈ [ai, di] • Scenario: 9am: A1 (9,11,$3), A2 (9,11,$2); 10am: A3 (10,11,$1) • Second-price: Sell to A1 for $2, then A2 for $1 • Manipulation?

  28. Illustrative Example • Selling a single right to access WiFi in each period • Agent: (ai, di, wi) ⇒ value wi for an allocation in some t ∈ [ai, di] • Scenario: 9am: A1 (9,11,$3), A2 (9,11,$2); 10am: A3 (10,11,$1) • Second-price: Sell to A1 for $2, then A2 for $1

  29. Illustrative Example • Selling a single right to access WiFi in each period • Agent: (ai, di, wi) ⇒ value wi for an allocation in some t ∈ [ai, di] • Scenario: 9am: A1 (9,11,$3), A2 (9,11,$2); 10am: A3 (10,11,$1) • Second-price: Sell to A1 for $2, then A2 for $1 • Naïve Vickrey approach fails!

  30. Mechanism (NPS’02) • Scenario: 9am: A1 (9,11,$3), A2 (9,11,$2); 10am: A3 (10,11,$1) • Rule: Greedy policy; collect the “critical value payment”, i.e. the smallest value an agent can bid and still be allocated. ⇒ Sell to A1, collect $1. Sell to A2, collect $1. • Theorem: Truthful, and implements a 2-approximation allocation, given no early arrivals and no late departures.
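To make the greedy/critical-value rule on slide 30 concrete, here is a simplified sketch on the WiFi example. The modeling choices (discrete periods taken as [ai, di), ties broken toward the agent being priced, and the critical value searched over the other agents' reported values) are my own simplifications, not the exact NPS'02 formulation.

```python
# Hedged sketch of the greedy-plus-critical-value rule on this slide:
# each period the single WiFi slot goes to the highest-value present,
# not-yet-served agent; a winner pays the smallest value it could have
# reported and still been served somewhere in its window.

def greedy_allocate(agents, favor=None):
    """agents: {name: (arrival, departure, value)}; returns {period: winner}."""
    periods = sorted({t for a, d, _ in agents.values() for t in range(a, d)})
    served, schedule = set(), {}
    for t in periods:
        present = [i for i, (a, d, _) in agents.items()
                   if a <= t < d and i not in served]
        if present:
            # break ties toward `favor` so the scan below finds the threshold price
            winner = max(present, key=lambda i: (agents[i][2], i == favor))
            served.add(winner)
            schedule[t] = winner
    return schedule

def critical_payment(agents, i):
    """Smallest reported value at which agent i is still allocated."""
    a, d, _ = agents[i]
    candidates = sorted({0.0} | {v for j, (_, _, v) in agents.items() if j != i})
    for c in candidates:
        trial = dict(agents)
        trial[i] = (a, d, c)
        if i in greedy_allocate(trial, favor=i).values():
            return c
    return None  # i cannot win at any price

agents = {"A1": (9, 11, 3.0), "A2": (9, 11, 2.0), "A3": (10, 11, 1.0)}
print(greedy_allocate(agents))                                  # {9: 'A1', 10: 'A2'}
print({i: critical_payment(agents, i) for i in ("A1", "A2")})   # both pay 1.0
```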

  31. Key Intuition: Monotonicity (HKMP’05) • Monotonic: πi(vi, v-i) = 1 ⇒ πi(v'i, v-i) = 1 for a higher bid w'i ≥ wi and a more relaxed window [a'i, d'i] ⊇ [ai, di] • [Figure: win/lose regions over price and time]

  32. Single-Valued Domains • Type θi = (ai, di, [ri, Li]) • Value ri for a decision kt ∈ Li, or kt ∈ Lj ≻ Li • Examples: • “single-minded” online combinatorial auctions • WiFi allocation with fixed lengths of service • Monotonic: higher r, smaller L, earlier a, later d • Theorem: monotonicity is necessary and sufficient for truthfulness in SV domains.

  33. [Figure: agents A1, A2, A3, A4 arriving and departing across periods st, st+1, st+2, st+3] Second question: how to compute monotonic policies in stochastic, SV domains? How to allow learning (by the center)?

  34. Basic Idea • [Figure: epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3] • Model-Based Reinforcement Learning • Update model in each epoch • Planning: compute new policies π0, π1, … • Collect critical value payments • Key Components: 1. Ensure policies are monotonic 2. Method to compute critical-value payments 3. Careful updates to the model.

  35. 1. Planning: Sparse-Sampling • [Figure: sparse-sampling tree rooted at h0, width w, depth L] • Depth-L sampled tree: each node is a state; each node’s children are obtained by sampling each action w times; back up estimates to the root. • Monotonic? Not quite.
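A hedged sketch of the sparse-sampling planner described on slide 35: build a depth-L tree from the current state by sampling each action w times through a generative model, and back averaged estimates up to the root. The `simulate(state, action) -> (next_state, reward)` interface and the toy model at the end are assumptions for illustration.

```python
import random

# Hedged sketch of sparse sampling: a depth-L lookahead tree built by
# sampling each action w times through a generative model, with
# averaged estimates backed up to the root.

def sparse_sampling_value(state, simulate, actions, w, L, beta=0.95):
    if L == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(w):                 # w sampled children per action
            next_state, reward = simulate(state, a)
            total += reward + beta * sparse_sampling_value(
                next_state, simulate, actions, w, L - 1, beta)
        best = max(best, total / w)        # back up the averaged estimate
    return best

def sparse_sampling_action(state, simulate, actions, w, L, beta=0.95):
    def q(a):
        total = 0.0
        for _ in range(w):
            s2, r = simulate(state, a)
            total += r + beta * sparse_sampling_value(s2, simulate, actions, w, L - 1, beta)
        return total / w
    return max(actions, key=q)             # greedy at the root

# Toy two-state generative model (illustrative only):
def toy_simulate(s, a):
    return random.choice([0, 1]), (1.0 if (s, a) == (0, 0) else 0.0)

print(sparse_sampling_action(0, toy_simulate, actions=[0, 1], w=3, L=3))
```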

  36. Achieving Monotonicity: Ironing • Assume a maximal patience, Δ • Ironing: if πss allocates to (ai, di, ri, Li) in period t, then check whether πss would allocate to (ai, di+Δ, ri, Li) • NO: block the (ai, di, ri, Li) allocation • YES: allow the allocation • Also use “cross-state sampling” to be aware of ironing when planning.
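The ironing test on slide 36 can be phrased as a small guard around the planner; the `would_allocate` hook below is an assumed interface into the sparse-sampling policy, not an API from the talk.

```python
# Hedged sketch of the ironing test: before letting the sparse-sampling
# policy allocate to reported type (a, d, r, L) in period t, check that
# it would also allocate to the more patient type (a, d + max_patience,
# r, L); if not, block the allocation so the policy stays monotonic.
# would_allocate(policy, type_, t) is an assumed hook into the planner.

def ironed_allocation_ok(policy, would_allocate, type_, t, max_patience):
    a, d, r, L = type_
    if not would_allocate(policy, (a, d, r, L), t):
        return False                       # policy does not allocate anyway
    # allow the allocation only if the later-departing version is also served
    return would_allocate(policy, (a, d + max_patience, r, L), t)
```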

  37. 2. Computing payments: Virtual Worlds • [Figure: timeline t0, t1, t2, t3; virtual world VW1 in which A1 wins, with modified type θ'1: value set just below vc(t0); virtual world VW2 in which A2 wins, with modified type θ'2: value set just below vc(t1)] • Plus a method to compute the critical value vc(t) in any state st

  38. 3. Delayed Updates • [Figure: epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3] • Consider the critical payment for an agent with ai < T1 < di • Delayed updates: only include departed agents in the revised π1 • Ensures the policy is agent-independent

  39. Complete procedure • In each period: • maintain the main world • maintain a virtual world without each active and allocated agent • For planning: • use ironing to cancel an action • cross-state sparse-sampling to improve the policy • For pricing: • charge the minimal critical value across virtual worlds • Periodically: move to a new model (and policy) • only use departed types • Theorem: truthful (DSE), adaptive policy for single-valued domains.
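Finally, a very high-level sketch of how the pieces on slide 39 might fit together in one loop. Every name here (the planner, the virtual-world bookkeeping object, the critical-value oracle, the model update) is a placeholder for a component the talk describes only abstractly.

```python
# Very high-level, hedged sketch of the complete procedure; every name
# (planner, virtual_worlds object, critical_value oracle, update_model)
# is a placeholder, not an API from the talk.

def run_adaptive_auction(periods, get_reports, planner, virtual_worlds,
                         critical_value, update_model, epoch_length):
    model, payments = None, {}
    for t in range(periods):
        reports = get_reports(t)                   # possibly strategic reports
        action = planner(model, reports, t)        # ironed, cross-state sparse-sampling policy
        virtual_worlds.step(reports, action, t)    # one world per active/allocated agent
        for agent in virtual_worlds.allocated():
            # charge the minimal critical value across that agent's virtual worlds
            payments[agent] = min(critical_value(vw, agent, t)
                                  for vw in virtual_worlds.for_agent(agent))
        if t > 0 and t % epoch_length == 0:
            # delayed update: re-estimate the model from departed types only
            model = update_model(virtual_worlds.departed_types(t))
    return payments
```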

  40. Future: Online CAs • Combinatorial auctions (CAs) well studied and used in practice (e.g. procurement) • Challenge problem: Online CAs • Two pronged approach: • computational (e.g. leveraging recent work in stochastic online combinatorial optimization by Pascal Van Hentenryck, Brown) • incentive considerations (e.g. finding appropriate relaxations of dominant strategy truthfulness to the online domain)

  41. Summary • Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous) • Opportunity for learning: • by agents. Multi-agent MABP • demonstrated use of payments to bring optimal learning into an equilibrium • by the center. Adaptive online auctions • demonstrated use of payments to bring expected-value-maximizing policies into an equilibrium • Exciting area. Lots of work still to do!

  42. Thanks • Satinder Singh, Jonathan Bredin, Quang Duong, Mohammad Hajiaghayi, Adam Juda, Robert Kleinberg, Mohammad Mahdian, Chaki Ng, Dimah Yanovsky. • More information: www.eecs.harvard.edu/econcs
