Presentation Transcript

Adaptive Sequential Decision Making with Self-Interested Agents

David C. Parkes

Division of Engineering and Applied Sciences

Harvard University

http://www.eecs.harvard.edu/econcs

Wayne State University October 17, 2006

Context
  • Multiple agents
  • Self-interest
  • Private information about preferences, capabilities
  • Coordinated decision problem
    • social planner
    • auctioneer
This talk: Sequential Decision Making
  • Multiple time periods
  • Agent arrival and departure
  • Values for sequences of decisions
  • Learning by agents and the “center”
  • Example scenarios:
    • allocating computational/network resources
    • sponsored search
    • last-minute ticket auctions
    • bidding for shared cars, air-taxis,…
Markov Decision Process

Pr(st+1|at,st)

at

st

st+2

st+1

r(at,st)

+ Self-interest
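To make the MDP objects on this slide concrete, here is a minimal value-iteration sketch; the states, actions, transition probabilities Pr(st+1|at,st) and rewards r(at,st) below are made-up numbers for illustration, not anything from the talk.

```python
import numpy as np

# Illustrative MDP: 3 states, 2 actions (all numbers made up).
# P[a][s][s'] = Pr(s' | a, s),  R[a][s] = r(a, s),  discount factor gamma.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]],   # action 1
])
R = np.array([
    [1.0, 0.0, 2.0],   # r(a=0, s)
    [0.5, 1.5, 0.0],   # r(a=1, s)
])
gamma = 0.95

V = np.zeros(3)
for _ in range(10000):                    # value iteration to a fixed point
    Q = R + gamma * (P @ V)               # Q[a, s] = r(a, s) + gamma * E[V(s')]
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-9:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=0)                 # best action in each state
print("V* =", V.round(3), "policy =", policy)
```

The talk's twist is the "+ Self-interest" above: the rewards and state information are privately observed by the agents rather than known to the planner.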

Online Mechanisms

M = (π, p): decision policy πt : S → A, payment policy pt : S → R^n

[Figure: per-period loop: agents send reports to the mechanism; the mechanism returns actions and payments]

  • Each period:
    • agents report state/rewards
    • center picks action, payments
  • Main question:
    • what policies can be implemented in a game-theoretic equilibrium?
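A minimal sketch of the M = (π, p) abstraction, with hypothetical class and function names: each period the center collects reports, applies a decision policy, and charges payments. The toy policy and payment rule are placeholders, not the mechanisms developed in the talk.

```python
from typing import Any, Callable, Dict

class OnlineMechanism:
    """Sketch of M = (pi, p): a per-period decision policy and payment rule."""

    def __init__(self,
                 policy: Callable[[Dict[str, Any]], Any],
                 payment_rule: Callable[[Dict[str, Any], Any], Dict[str, float]]):
        self.policy = policy              # pi_t : reported states -> action
        self.payment_rule = payment_rule  # p_t  : (reported states, action) -> payments

    def run_period(self, reports: Dict[str, Any]):
        """One period: agents report, the center picks an action and payments."""
        action = self.policy(reports)
        payments = self.payment_rule(reports, action)
        return action, payments

# Toy placeholder rules: allocate to the highest reported value and charge the
# winner the second-highest report (Vickrey-like).
def highest_bidder(reports):
    return max(reports, key=lambda i: reports[i]["value"])

def second_price(reports, winner):
    others = [reports[i]["value"] for i in reports if i != winner]
    return {winner: max(others, default=0.0)}

mech = OnlineMechanism(highest_bidder, second_price)
print(mech.run_period({"A1": {"value": 3.0}, "A2": {"value": 2.0}}))
```

The rest of the talk is about which choices of (π, p) make truthful reporting an equilibrium.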


Outline
  • Multi-armed Bandits Problem [agent learning]
    • canonical, stylized learning problem from AI
    • introduce a multi-agent variation
    • provide a mechanism to bring optimal coordinated learning into an equilibrium
  • Dynamic auction problem [center learning]
    • resource allocation (e.g. WiFi)
    • dynamic arrival & departure of agents
    • provide a truthful, adaptive mechanism
Multi-Armed Bandit Problem
  • Multi-armed bandit (MAB) problem
  • n arms
  • Each arm has stationary uncertain reward process
  • Goal: implement a (Bayesian) optimal learning policy
Tractability: Gittins’ result
  • Theorem [Gittins & Jones 1974]: The complexity of computing an optimal joint policy for a collection of n Markov Chains is linear in n.
    • There exist independent index functions such that the MC with highest “Gittins index” at any given time should be activated.
    • Can compute the index as the optimal value of a “restart-in-i” MDP, solved using an LP (Katehakis & Veinott ’87); a short sketch follows below
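A sketch of the restart-in-i idea for a single arm given as a finite Markov chain with known transition matrix P and rewards r: the Gittins index of state i equals (1 - β) times the optimal value at i of an MDP in which, each period, you either continue from the current state or restart from i. The LP of Katehakis & Veinott is replaced here by plain value iteration, and the chain is made up for illustration.

```python
import numpy as np

def gittins_indices(P, r, beta, iters=5000, tol=1e-10):
    """Gittins index of each state of one arm, via the restart-in-i MDP
    solved by value iteration (an LP would work equally well)."""
    n = len(r)
    idx = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(iters):
            cont = r + beta * (P @ V)            # continue from the current state
            restart = r[i] + beta * (P[i] @ V)   # restart the arm from state i
            V_new = np.maximum(cont, restart)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        idx[i] = (1.0 - beta) * V[i]             # Gittins index of state i
    return idx

# Illustrative 3-state reward process (numbers made up).
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.4, 0.6]])
r = np.array([1.0, 0.5, 2.0])
print(gittins_indices(P, r, beta=0.9).round(3))
```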
Self-Interest + MABP
  • Multi-armed bandit (MAB) problem
  • n arms (arm == agent)
  • Each arm has a stationary, uncertain reward process (privately observed)
  • Goal: implement a (Bayesian) optimal learning policy

[Figure: the mechanism activates one of the arms/agents A1, A2, A3 each period, and a reward is realized]

Review: The Vickrey Auction
  • Rules: “sell to highest bidder at second-highest price”
  • How should you bid? Truthfully!
  • Alice wins for $8

Bids: Alice $10, Bob $8, Carol $6

(dominant-strategy equilibrium)
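A one-function sketch of the rule on this slide, run on the bids shown (Alice $10, Bob $8, Carol $6).

```python
def vickrey(bids):
    """Sell to the highest bidder at the second-highest price."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

print(vickrey({"Alice": 10, "Bob": 8, "Carol": 6}))   # ('Alice', 8): Alice wins for $8
```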

First Idea: Vickrey auction

Conjecture: Agents will bid the Gittins index for their arm in each round.

Intuition?

Not truthful!
  • Agent 1 may have knowledge that the mean reward for arm 2 is smaller than agent 2’s current Gittins index.
  • Learning by 2 would decrease the price paid by 1 in the future ⇒ 1 should under-bid
Second Idea
  • At every time-step:
    • Each agent reports a claim about its Gittins index
    • Suppose b1 ≥ b2 ≥ … ≥ bn
    • Mechanism activates agent 1
    • Agent 1 reports its reward, r1
    • Mechanism pays r1 to each agent ≠ 1
  • Theorem: Truthful reporting is a Markov-Perfect equilibrium, and the mechanism implements optimal Bayesian learning.
Learning-Gittins VCG

(CPS’06)

  • At every time-step:
    • Activate the agent with the highest bid.
    • Pay the reward received by the activated agent to all others.
    • Collect from every agent i the expected value agents ≠ i would receive without i in the system.
      • Sample hypothetical execution path(s), using no reported state information.
  • Theorem: Mechanism is truthful, system-optimal, ex ante IR, and ex ante strong budget-balanced in MPE.
Each agent i is charged X-i, where X-i is the total expected value agents other than i would have received in this period if i weren't there.
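A sketch of one period of the mechanism as described on these slides, with hypothetical function names: activate the agent with the highest reported index, pay the realized reward to every other agent, and charge each agent i an estimate of X-i. The estimate is passed in as a stub; the talk obtains it by sampling hypothetical execution paths that use no reported state information.

```python
import random

def learning_gittins_vcg_round(reported_indices, observe_reward, estimate_X_minus_i):
    """One period of the Learning-Gittins VCG mechanism (sketch).

    reported_indices   : dict agent -> reported Gittins index
    observe_reward     : fn(agent) -> realized reward of the activated arm
    estimate_X_minus_i : fn(i) -> expected value the other agents would get this
                         period if i were absent (sampled hypothetically)
    """
    active = max(reported_indices, key=reported_indices.get)   # activate highest bid
    reward = observe_reward(active)

    transfers = {}
    for i in reported_indices:
        pay_in = reward if i != active else 0.0    # realized reward shared with the others
        charge = estimate_X_minus_i(i)             # X_{-i} collected from every agent
        transfers[i] = pay_in - charge
    return active, transfers

# Toy usage with made-up numbers.
indices = {"A1": 0.9, "A2": 0.7, "A3": 0.4}
active, transfers = learning_gittins_vcg_round(
    indices,
    observe_reward=lambda arm: random.random(),
    estimate_X_minus_i=lambda i: 0.5,   # placeholder for the sampled estimate
)
print(active, transfers)
```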
Outline
  • Multi-armed Bandits Problem [agent learning]
    • canonical, stylized learning problem from AI
    • introduce a multi-agent variation
    • provide a mechanism to bring optimal coordinated learning into an equilibrium
  • Dynamic auction problem [center learning]
    • resource allocation (e.g. WiFi)
    • dynamic arrival & departure of agents
    • provide a truthful, adaptive mechanism that converges toward an optimal decision policy
[Figure: agents A1, A2, A3, A4 arriving and departing across periods st, st+1, st+2, st+3]

First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?

Illustrative Example

Selling a single right to access WiFi in each period

Agent: (ai, di, wi) ⇒ value wi for an allocation in some period t ∈ [ai, di]

Scenario:

9am: A1 (9, 11, $3), A2 (9, 11, $2)

10am: A3 (10, 11, $1)

Second-price: Sell to A1 for $2, then A2 for $1

Manipulation? A1 can report a 10am arrival: A2 then wins the 9am slot, and A1 wins the 10am slot against A3, paying only $1. The naïve Vickrey approach fails!

(NPS'02)

9am: A1 (9, 11, $3), A2 (9, 11, $2)

10am: A3 (10, 11, $1)

Mechanism Rule: Greedy policy; collect the “critical value payment”, i.e. the smallest value an agent could bid and still be allocated.

⇒ Sell to A1, collect $1. Sell to A2, collect $1.

Theorem. Truthful, and implements a 2-approximation allocation, assuming no early arrivals and no late departures.
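A sketch of the greedy rule with critical-value payments on the slide's scenario. One modeling assumption is mine: "depart at 11" is read as "must be served in the 9am or 10am slot", which reproduces the $1, $1 payments stated above. The brute-force critical-value search is for illustration only.

```python
def greedy_allocate(agents, periods):
    """Greedy policy: in each period, allocate the single slot to the
    highest-value unserved agent whose reported window covers that period."""
    served = {}
    for t in periods:
        present = [a for a in agents
                   if a["arrive"] <= t < a["depart"] and a["name"] not in served]
        if present:
            winner = max(present, key=lambda a: a["value"])
            served[winner["name"]] = t
    return served

def critical_value(agent, others, periods):
    """Smallest value the agent could have reported and still be allocated,
    holding the other reports fixed (brute force over the others' values)."""
    eps = 1e-9
    for c in sorted({0.0} | {b["value"] for b in others}):
        trial = dict(agent, value=c + eps)
        if trial["name"] in greedy_allocate(others + [trial], periods):
            return c
    return agent["value"]

# Scenario from the slide; "depart 11" read as: served in the 9am or 10am slot.
agents = [{"name": "A1", "arrive": 9, "depart": 11, "value": 3.0},
          {"name": "A2", "arrive": 9, "depart": 11, "value": 2.0},
          {"name": "A3", "arrive": 10, "depart": 11, "value": 1.0}]
periods = [9, 10]

alloc = greedy_allocate(agents, periods)
for a in agents:
    if a["name"] in alloc:
        others = [b for b in agents if b["name"] != a["name"]]
        print(a["name"], "served at", alloc[a["name"]], "am, pays",
              critical_value(a, others, periods))
# -> A1 served at 9 am, pays 1.0 ; A2 served at 10 am, pays 1.0
```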

Key Intuition: Monotonicity

(HKMP’05)

Monotonic: πi(vi, v-i) = 1 ⇒ πi(v'i, v-i) = 1 for a higher bid w'i ≥ wi and a more relaxed interval [a'i, d'i] ⊇ [ai, di]

[Figure: win/lose outcome over time for reported intervals [a, d] and [a', d'], with associated critical payments p and p']

Single-Valued Domains
  • Type θi = (ai, di, [ri, Li])
  • Value ri for a decision kt ∈ Li, or kt ∈ Lj ⊃ Li
  • Examples:
    • “single-minded” online combinatorial auctions
    • WiFi allocation with fixed lengths of service
  • Monotonic: higher r, smaller L, earlier a, later d
  • Theorem: monotonicity is necessary and sufficient for truthfulness in SV domains.
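To fix notation for what follows, a minimal (hypothetical) representation of a single-valued type θi = (ai, di, ri, Li), together with the "more relaxed than" relation the monotonicity condition refers to (higher r, smaller L, earlier a, later d).

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class SVType:
    """Single-valued type theta_i = (a_i, d_i, r_i, L_i)."""
    arrive: int
    depart: int
    reward: float
    interest: FrozenSet[str]   # L_i: the decisions the agent cares about

def more_relaxed(t1: SVType, t2: SVType) -> bool:
    """t1 dominates t2 per the slide: higher r, smaller L, earlier a, later d.
    Monotonicity asks that whenever the policy allocates to t2, it would also
    allocate to any such t1."""
    return (t1.reward >= t2.reward and t1.interest <= t2.interest
            and t1.arrive <= t2.arrive and t1.depart >= t2.depart)

theta = SVType(arrive=9, depart=11, reward=2.0, interest=frozenset({"10am", "11am"}))
stronger = SVType(arrive=9, depart=12, reward=2.5, interest=frozenset({"10am"}))
print(more_relaxed(stronger, theta))   # True
```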
[Figure: agents A1, A2, A3, A4 arriving and departing across periods st, st+1, st+2, st+3]

Second question: how to compute monotonic policies in stochastic, SV domains? How to allow learning (by the center)?

Basic Idea

0

1

2

3

T0

T1

T2

T3

  • Model-Based Reinforcement Learning
    • Update model in each epoch
  • Planning: compute a new policy π0, π1, …
  • Collect critical value payments
  • Key Components:

1. Ensure policies are monotonic

2. Method to compute critical-value payments

3. Careful updates to model.

1. Planning: Sparse-Sampling

[Figure: sparse-sampling tree rooted at the current state h0, with branching width w and depth L]

Depth-L sampled tree: each node is a state, each node's children are obtained by sampling each action w times; back up estimates to the root.

Monotonic? Not Quite.
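A sketch of the sparse-sampling estimate just described: build a depth-L tree from the current state, sample each action w times at every node, and back up discounted averages to the root. The generative model `simulate` and its dynamics are made up for illustration.

```python
import random

def sparse_sampling(state, simulate, actions, width, depth, gamma=0.95):
    """Estimate the best action at `state` by sparse sampling: a depth-`depth`
    tree in which each action is sampled `width` times at every node, with
    discounted averages backed up to the root."""
    def q_value(s, a, d):
        total = 0.0
        for _ in range(width):                    # w samples of (s', r) per action
            s_next, r = simulate(s, a)
            total += r + gamma * value(s_next, d - 1)
        return total / width

    def value(s, d):
        if d == 0:
            return 0.0
        return max(q_value(s, a, d) for a in actions)

    q = {a: q_value(state, a, depth) for a in actions}
    best = max(q, key=q.get)
    return best, q[best]

# Made-up generative model: two actions nudge a 1-D state; reward = new state.
def simulate(s, a):
    s_next = s + (1 if a == "right" else -1) + random.choice([-1, 0, 1])
    return s_next, float(s_next)

print(sparse_sampling(0, simulate, ["left", "right"], width=3, depth=3))
```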

Achieving Monotonicity: Ironing
  • Assume a maximal patience, Δ
  • Ironing: if πss allocates to (ai, di, ri, Li) in period t, then check whether πss would allocate to (ai, di+Δ, ri, Li)
    • NO: block the (ai, di, ri, Li) allocation
    • YES: allow allocation
  • Also use “cross-state sampling” to be aware of ironing when planning.
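A sketch of the ironing check, assuming the maximal patience is called Δ and that `policy_allocates` is a hypothetical hook into the sparse-sampling planner: an allocation is allowed only if the same type with its departure extended by Δ would also be allocated.

```python
def ironed_allocation(policy_allocates, state, agent_type, t, max_patience):
    """Ironing (sketch): allow an allocation to `agent_type` in period t only if
    the policy would also allocate to the same type with its departure extended
    by the maximal patience; otherwise block it, to preserve monotonicity."""
    a, d, r, L = agent_type
    if not policy_allocates(state, (a, d, r, L), t):
        return False                              # policy does not pick this agent anyway
    relaxed = (a, d + max_patience, r, L)
    return policy_allocates(state, relaxed, t)    # YES: allow;  NO: block

# Toy stub policy: allocates to anyone still present in the next period.
stub = lambda state, typ, t: typ[1] > t + 1
print(ironed_allocation(stub, None, (9, 11, 3.0, frozenset({"slot"})), t=9, max_patience=2))
```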
2. Computing payments: Virtual Worlds

[Figure: timeline t0, t1, t2, t3: A1 wins at t0 and A2 wins at t1; virtual world VW1 resets A1's value to just below vc(t0), and VW2 resets A2's value to just below vc(t1)]

+ method to compute vc(t) in any state st
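A rough sketch of the virtual-worlds bookkeeping as I read this slide (all names are hypothetical): whenever an agent wins in some world, record the critical value vc(t) there, spawn a copy of that world with the agent's value reset to just below vc(t), and keep simulating every world; the agent is ultimately charged the minimal critical value recorded across its worlds.

```python
class VirtualWorldPricing:
    """Sketch: each observation records the critical value for an agent in one
    world and returns a 'virtual' copy of that world in which the agent's value
    sits just below it, so the policy keeps running as if that win never happened."""

    def __init__(self, critical_value_fn, delta=1e-6):
        self.critical_value_fn = critical_value_fn   # fn(world, agent, t) -> v_c(t)
        self.delta = delta
        self.observed = {}                           # agent -> recorded critical values

    def observe(self, world, agent, t):
        vc = self.critical_value_fn(world, agent, t)
        self.observed.setdefault(agent, []).append(vc)
        virtual = dict(world)                        # copy of the world ...
        virtual[agent] = vc - self.delta             # ... with the agent's value lowered
        return virtual                               # caller keeps simulating this one too

    def payment(self, agent):
        """Minimal critical value across the agent's (main + virtual) worlds."""
        return min(self.observed[agent])

# Toy usage: the critical value is simply the highest competing value in a world.
vc = lambda world, agent, t: max((v for a, v in world.items() if a != agent), default=0.0)
pricing = VirtualWorldPricing(vc)
vw1 = pricing.observe({"A1": 3.0, "A2": 2.0}, "A1", t=0)   # A1 wins at t0: v_c = 2.0
vw1.pop("A2")                                              # later, A2 departs ...
vw1["A3"] = 1.0                                            # ... and A3 arrives
pricing.observe(vw1, "A1", t=1)                            # in VW1, v_c is now 1.0
print(pricing.payment("A1"))   # 1.0: charge the minimal critical value
```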

3. Delayed Updates

0

1

2

3

T0

T1

T2

T3

  • Consider the critical payment for an agent ai
  • Delayed updates: only include departed agents in the revised policy π1
  • Ensures policy is agent-independent
Complete procedure
  • In each period:
    • maintain main world
    • maintain a virtual world without each active + allocated agent
  • For planning:
    • use ironing to cancel an action
    • cross-state sparse-sampling to improve policy
  • For pricing:
    • charge minimal critical value across virtual worlds
  • Periodically: move to a new model (and policy)
    • only use departed types
  • Theorem: truthful (DSE), adaptive policy for single-valued domains.
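Tying the pieces together, a skeleton of the per-period loop in this procedure. Every hook is a trivial stub standing in for the components sketched earlier (ironed sparse-sampling planner, virtual-world pricing, model re-fit from departed types).

```python
def run_adaptive_mechanism(periods, epoch_ends, get_reports, plan,
                           critical_price, departing, refit_model, model):
    """Skeleton of the complete procedure (every hook is a placeholder stub):
    each period, collect reports, take an ironed planning step in the main and
    virtual worlds, charge departing winners their minimal critical value, and
    periodically refit the model using only departed agents' types."""
    departed, payments = [], {}
    for t in periods:
        reports = get_reports(t)                       # agents report types
        plan(model, reports, t)                        # ironed sparse-sampling step
        for name in departing(reports, t):
            payments[name] = critical_price(name, t)   # min over virtual worlds
            departed.append(reports[name])
        if t in epoch_ends:
            model = refit_model(departed)              # delayed update: departed types only
    return payments

# Trivial stubs, just to show the control flow.
print(run_adaptive_mechanism(
    periods=[1, 2, 3], epoch_ends={2},
    get_reports=lambda t: {f"A{t}": {"value": float(t)}},
    plan=lambda model, reports, t: None,
    critical_price=lambda name, t: 0.0,
    departing=lambda reports, t: list(reports),
    refit_model=lambda departed: {"num_departed_types": len(departed)},
    model={},
))
```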
Future: Online CAs
  • Combinatorial auctions (CAs) well studied and used in practice (e.g. procurement)
  • Challenge problem: Online CAs
  • Two pronged approach:
    • computational (e.g. leveraging recent work in stochastic online combinatorial optimization by Pascal Van Hentenryck, Brown)
    • incentive considerations (e.g. finding appropriate relaxations of dominant strategy truthfulness to the online domain)
Summary
  • Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous)
  • Opportunity for learning:
    • by agents. Multi-agent MABP
    • demonstrated use of payments to bring optimal learning into an equilibrium
    • by center. Adaptive online auctions
    • demonstrated use of payments to bring expected-value maximizing policies into an equilibrium
  • Exciting area. Lots of work still to do!
Thanks
  • Satinder Singh, Jonathan Bredin, Quang Duong, Mohammad Hajiaghayi, Adam Juda, Robert Kleinberg, Mohammad Mahdian, Chaki Ng, Dimah Yanovsky.
  • More information

www.eecs.harvard.edu/econcs
