This talk explores the complexities of adaptive sequential decision-making among self-interested agents who possess private information about their preferences and capabilities. Focusing on multiple time periods and the unique challenges posed by coordinated decision problems, the presentation outlines key scenarios such as resource allocation (e.g., WiFi access) and auction bidding (e.g., last-minute ticket sales). Utilizing concepts like the Multi-Armed Bandit Problem and Vickrey auctions, the session emphasizes learning, optimal policies, and the equilibrium dynamics between participating agents and a central decision-maker.
Adaptive Sequential Decision Making with Self-Interested Agents David C. Parkes Division of Engineering and Applied Sciences Harvard University http://www.eecs.harvard.edu/econcs Wayne State University October 17, 2006
Context • Multiple agents • Self-interest • Private information about preferences, capabilities • Coordinated decision problem • social planner • auctioneer
This talk: Sequential Decision Making • Multiple time periods • Agent arrival and departure • Values for sequences of decisions • Learning by agents and the “center” • Example scenarios: • allocating computational/network resources • sponsored search • last-minute ticket auctions • bidding for shared cars, air-taxis,… • …
Markov Decision Process • [diagram: states st, st+1, st+2; action at taken in st; transition probability Pr(st+1 | at, st); reward r(at, st)] • + Self-interest (a generic value-iteration sketch for this model follows below)
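To make the model concrete, here is a minimal, textbook value-iteration sketch for an MDP with transition tensor P[a, s, s'] = Pr(st+1 | at, st) and reward matrix R[a, s] = r(at, st); this is generic illustration code, not code from the talk.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """Generic MDP value iteration.
    P[a, s, s2] = Pr(s_{t+1} = s2 | a_t = a, s_t = s); R[a, s] = r(a, s)."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[a, s] = R[a, s] + gamma * E[V(s')]
        V = Q.max(axis=0)         # greedy backup over actions
    return V, Q.argmax(axis=0)    # optimal values and a greedy policy
```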
Online Mechanisms • Mechanism M = (π, p): decision policy πt: S → A, payment policy pt: S → Rn • Each period: agents report states/rewards; center picks action, payments • Main question: what policies can be implemented in a game-theoretic equilibrium? (interaction skeleton below)
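A skeleton of the per-period interaction in an online mechanism; the report, policy, and payment interfaces here are assumptions for illustration, not the talk's notation.

```python
def run_online_mechanism(policy, payment, agents, T):
    """Skeleton of an online mechanism M = (pi, p). Each period the agents
    report, then the center picks an action via pi_t and payments via p_t.
    `a.report(t)`, `policy`, and `payment` are hypothetical interfaces."""
    history = []
    for t in range(T):
        reports = [a.report(t) for a in agents]   # claimed states/rewards
        action = policy(t, reports, history)      # pi_t : S -> A
        pays = payment(t, reports, history)       # p_t : S -> R^n
        history.append((reports, action, pays))
    return history
```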
Outline • Multi-armed Bandits Problem [agent learning] • canonical, stylized learning problem from AI • introduce a multi-agent variation • provide a mechanism to bring optimal coordinated learning into an equilibrium • Dynamic auction problem [center learning] • resource allocation (e.g. WiFi) • dynamic arrival & departure of agents • provide a truthful, adaptive mechanism
Multi-Armed Bandit Problem • Multi-armed bandit (MAB) problem • n arms • Each arm has stationary uncertain reward process • Goal: implement a (Bayesian) optimal learning policy
Tractability: Gittins’ result • Theorem [Gittins & Jones 1974]: the complexity of computing an optimal joint policy for a collection of n Markov chains is linear in n. • There exist independent index functions such that the chain with the highest “Gittins index” at any given time should be activated. • The index can be computed as the optimal value of the “restart-in-i” MDP, solvable via LP (Katehakis & Veinott ’87); a sketch follows below.
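A minimal sketch of the restart-in-i construction, using value iteration rather than the LP formulation; the function interface is an assumption for illustration.

```python
import numpy as np

def gittins_index(P, r, i, beta=0.9, iters=2000):
    """Gittins index of state i of a Markov chain (transition matrix P,
    rewards r) via the 'restart-in-i' MDP: in every state either continue
    the chain or restart from state i; the index is (1 - beta) * V(i)."""
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + beta * (P @ V)           # keep following the chain
        restart = r[i] + beta * (P[i] @ V)  # jump back to state i
        V = np.maximum(cont, restart)
    return (1 - beta) * V[i]
```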
Self-Interest + MABP • n arms (arm == agent) • Each arm has a stationary uncertain reward process (privately observed) • Goal: implement a (Bayesian) optimal learning policy via a mechanism
[diagram: arms/agents A1, A2, A3, each generating a privately observed reward for the mechanism]
Review: The Vickrey Auction • Rule: “sell to the highest bidder at the second-highest price” • How should you bid? Truthfully! (dominant-strategy equilibrium) • Alice: $10, Bob: $8, Carol: $6 ⇒ Alice wins for $8 (two-line sketch below)
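The rule in two lines, run on the slide's example bids (an illustration, not code from the talk):

```python
def vickrey(bids):
    """Sell to the highest bidder at the second-highest price."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    return ranked[0], (bids[ranked[1]] if len(ranked) > 1 else 0.0)

print(vickrey({"Alice": 10, "Bob": 8, "Carol": 6}))  # ('Alice', 8)
```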
First Idea: Vickrey auction • Conjecture: agents will bid the Gittins index for their arm in each round. • Intuition?
Not truthful! • Agent 1 may know that the mean reward for arm 2 is smaller than agent 2’s current Gittins index. • Learning by agent 2 would decrease the price paid by agent 1 in the future ⇒ agent 1 should under-bid.
Second Idea • At every time-step: • Each agent reports a claim about its Gittins index • Suppose b1 ≥ b2 ≥ … ≥ bn • Mechanism activates agent 1 • Agent 1 reports its reward, r1 • Mechanism pays r1 to each agent i ≠ 1 (sketch below) • Theorem: Truthful reporting is a Markov-perfect equilibrium, and the mechanism implements optimal Bayesian learning.
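One round of this mechanism as a sketch; the agent interface (a claimed index, a privately observed pull) is a hypothetical assumption, not notation from the talk.

```python
def second_idea_round(agents):
    """One round: each agent reports a claimed Gittins index, the highest
    claim is activated, and its realized (reported) reward is paid to
    every other agent. `agents[j].claim_index()` and `.pull()` are
    assumed interfaces for illustration."""
    bids = [a.claim_index() for a in agents]
    winner = max(range(len(agents)), key=lambda j: bids[j])
    reward = agents[winner].pull()  # privately observed, then reported
    payments = [reward if j != winner else 0.0 for j in range(len(agents))]
    return winner, reward, payments
```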
Learning-Gittins VCG (CPS’06) • At every time-step: • Activate the agent with the highest bid • Pay the reward received by the activated agent to all others • Collect from every agent i the expected value X−i that agents other than i would receive without i in the system • Sample hypothetical execution path(s), using no reported state information • Theorem: The mechanism is truthful, system-optimal, ex ante IR, and ex ante strongly budget-balanced in MPE.
where X−i is the total expected value agents other than i would have received in this period if i weren’t there (sampling sketch below).
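The charge X−i can be estimated by Monte Carlo over hypothetical execution paths sampled from prior beliefs only; `simulate_period` is an assumed simulator, not part of the talk's machinery.

```python
def estimate_x_minus_i(simulate_period, i, samples=1000):
    """Estimate X_{-i}: the expected total value agents other than i would
    have received this period if i were absent. `simulate_period(absent)`
    is a hypothetical simulator that samples one period of the mechanism
    run without agent `absent`, using prior beliefs and no reported state
    information, and returns {agent: realized value}."""
    total = 0.0
    for _ in range(samples):
        values = simulate_period(absent=i)
        total += sum(values.values())
    return total / samples
```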
Outline • Multi-armed Bandits Problem [agent learning] • canonical, stylized learning problem from AI • introduce a multi-agent variation • provide a mechanism to bring optimal coordinated learning into an equilibrium • Dynamic auction problem [center learning] • resource allocation (e.g. WiFi) • dynamic arrival & departure of agents • provide a truthful, adaptive mechanism that converges toward an optimal decision policy
[diagram: agents A1–A4 arriving and departing, with overlapping windows across states st, st+1, st+2, st+3] First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?
Illustrative Example • Selling a single right to access WiFi in each period • Agent type (ai, di, wi) ⇒ value wi for an allocation in some t ∈ [ai, di] • Scenario: 9am: A1 (9,11,$3), A2 (9,11,$2); 10am: A3 (10,11,$1) • Second-price: sell to A1 for $2, then to A2 for $1 • Manipulation? Naïve Vickrey approach fails! (e.g., by reporting a 10am arrival, A1 lets A2 win at 9am and then wins at 10am against A3 for $1 instead of $2)
(NPS’02) Mechanism Rule: greedy policy, collect the “critical value” payment, i.e. the smallest value an agent could have bid and still been allocated ⇒ sell to A1, collect $1; sell to A2, collect $1 (runnable sketch below). Theorem: Truthful, and implements a 2-approximation allocation, under no early arrivals and no late departures.
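A runnable sketch of the greedy policy with critical-value payments on the slide's scenario; the type representation and the two-period horizon (the 9am and 10am sales) are modeling assumptions for illustration.

```python
def run_greedy(types, horizon):
    """One unit per period to the highest-value live, unallocated agent.
    types: {name: (arrival, departure, value)}; horizon: (first, last)."""
    winners, alloc = set(), {}
    for t in range(horizon[0], horizon[1] + 1):
        live = [i for i, (a, d, w) in types.items()
                if i not in winners and a <= t <= d]
        if live:
            win = max(live, key=lambda i: types[i][2])
            winners.add(win)
            alloc[t] = win
    return alloc, winners

def critical_value(i, types, horizon, eps=0.01):
    """Smallest bid at which agent i is still allocated (binary search;
    valid because the greedy rule is monotonic in the reported value)."""
    lo, hi = 0.0, types[i][2]
    a, d, _ = types[i]
    while hi - lo > eps:
        mid = (lo + hi) / 2
        trial = dict(types)
        trial[i] = (a, d, mid)
        if i in run_greedy(trial, horizon)[1]:
            hi = mid
        else:
            lo = mid
    return hi

types = {"A1": (9, 11, 3.0), "A2": (9, 11, 2.0), "A3": (10, 11, 1.0)}
alloc, winners = run_greedy(types, (9, 10))   # the two sale periods
# A1 wins at 9am, A2 at 10am; critical values for A1 and A2 are both ~ $1
```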
Key Intuition: Monotonicity (HKMP’05) • Monotonic: πi(vi, v−i) = 1 ⇒ πi(v′i, v−i) = 1 for any higher bid w′i ≥ wi and more relaxed interval [a′i, d′i] ⊇ [ai, di] (spot-check sketch below) • [diagram: win/lose regions against price over time, between arrival a (or earlier a′) and departure d (or later d′)]
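A monotonicity spot-check one might run against any candidate allocation rule; `wins` is an assumed callable returning the winner set (e.g. wrapping the greedy rule above).

```python
def spot_check_monotonic(wins, types, i):
    """HKMP'05 monotonicity, spot-checked for agent i: if i wins with type
    (a, d, w), it must still win with a higher bid and a more relaxed
    arrival-departure window. `wins(types)` is a hypothetical callable."""
    a, d, w = types[i]
    if i not in wins(types):
        return True                        # nothing to check
    relaxed = dict(types)
    relaxed[i] = (a - 1, d + 1, w + 1.0)   # strictly more relaxed type
    return i in wins(relaxed)              # must still be a winner
```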
Single-Valued Domains • Type θi = (ai, di, (ri, Li)) • Value ri for a decision kt ∈ Li (or kt ∈ Lj ⊃ Li) • Examples: • “single-minded” online combinatorial auctions • WiFi allocation with fixed lengths of service • Monotonic: higher r, smaller L, earlier a, later d • Theorem: monotonicity is necessary and sufficient for truthfulness in SV domains.
[diagram: agents A1–A4 arriving and departing, with overlapping windows across states st, st+1, st+2, st+3] Second question: how to compute monotonic policies in stochastic, SV domains? How to allow learning (by the center)?
Basic Idea • Model-based reinforcement learning over epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3, … • Update the model in each epoch • Planning: compute a new policy π0, π1, … • Collect critical-value payments • Key components: 1. ensure policies are monotonic 2. a method to compute critical-value payments 3. careful updates to the model.
1. Planning: Sparse Sampling • Sparse-sampling(s): build a depth-L sampled tree; each node is a state; each node’s children are obtained by sampling each action w times; back up estimates to the root (sketch below). • Monotonic? Not quite.
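A minimal recursive sketch of sparse sampling, with the generative model `sim(state, action) -> (next_state, reward)` as an assumed interface.

```python
def sparse_sample(state, depth, width, actions, sim, gamma=0.95):
    """Sparse-sampling value estimate: sample each action `width` times
    with the generative model, recurse to depth `depth`, and back up
    the averaged estimates to the root."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            nxt, r = sim(state, a)
            total += r + gamma * sparse_sample(nxt, depth - 1, width,
                                               actions, sim, gamma)
        best = max(best, total / width)  # back up best action estimate
    return best
```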
Achieving Monotonicity: Ironing • Assume a maximal patience Δ • Ironing: if ss allocates to (ai, di, ri, Li) in period t, then check that ss would also allocate to (ai, di+Δ, ri, Li) • NO: block the (ai, di, ri, Li) allocation • YES: allow the allocation (sketch below) • Also use “cross-state sampling” to be aware of ironing when planning.
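A sketch of the ironing check; `ss_allocates(agent_type, t)` is an assumed predicate wrapping the sparse-sampling policy, and this version checks every departure up to the maximal patience, one conservative reading of the slide's rule.

```python
def ironed_allocate(ss_allocates, agent_type, t, max_patience):
    """Only allow an allocation to (a, d, r, L) if the sparse-sampling
    policy would also allocate to the more patient versions of the type;
    otherwise block it, so the implemented policy stays monotonic."""
    a, d, r, L = agent_type
    if not ss_allocates(agent_type, t):
        return False
    for d2 in range(d + 1, a + max_patience + 1):
        if not ss_allocates((a, d2, r, L), t):
            return False   # would violate monotonicity: block
    return True
```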
2. Computing Payments: Virtual Worlds • [diagram: timeline t0, t1, t2, t3; A1 wins at t0, A2 wins at t1] • VW1: θ′1 with value → vc(t0) − ε; VW2: θ′2 with value → vc(t1) − ε • Plus a method to compute vc(t) in any state st
3. Delayed Updates • Epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3, … • Consider the critical payment for an agent with ai < T1 < di • Delayed updates: only include departed agents in the revised π1 • Ensures the policy is agent-independent
Complete procedure • In each period: • maintain the main world • maintain a virtual world with each active, allocated agent removed • For planning: • use ironing to cancel an action • cross-state sparse-sampling to improve the policy • For pricing: • charge the minimal critical value across virtual worlds • Periodically: move to a new model (and policy) • only use departed types • Theorem: truthful (DSE), adaptive policy for single-valued domains.
Future: Online CAs • Combinatorial auctions (CAs) are well studied and used in practice (e.g. procurement) • Challenge problem: Online CAs • Two-pronged approach: • computational (e.g. leveraging recent work in stochastic online combinatorial optimization by Pascal Van Hentenryck, Brown) • incentive considerations (e.g. finding appropriate relaxations of dominant-strategy truthfulness for the online domain)
Summary • Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous) • Opportunity for learning: • by agents: multi-agent MABP; demonstrated the use of payments to bring optimal learning into an equilibrium • by the center: adaptive online auctions; demonstrated the use of payments to bring expected-value-maximizing policies into an equilibrium • Exciting area. Lots of work still to do!
Thanks • Satinder Singh, Jonathan Bredin, Quang Duong, Mohammad Hajiaghayi, Adam Juda, Robert Kleinberg, Mohammad Mahdian, Chaki Ng, Dimah Yanovsky. • More information: www.eecs.harvard.edu/econcs