
Adaptive Sequential Decision Making with Self-Interested Agents

David C. Parkes

Division of Engineering and Applied Sciences

Harvard University

http://www.eecs.harvard.edu/econcs

Wayne State University October 17, 2006


Context

  • Multiple agents

  • Self-interest

  • Private information about preferences, capabilities

  • Coordinated decision problem

    • social planner

    • auctioneer


Social Planner: LaGuardia Airport


Social Planner: WiFi @ Starbucks



This Talk: Sequential Decision Making

  • Multiple time periods

  • Agent arrival and departure

  • Values for sequences of decisions

  • Learning by agents and the “center”

  • Example scenarios:

    • allocating computational/network resources

    • sponsored search

    • last-minute ticket auctions

    • bidding for shared cars, air-taxis,…


Markov Decision Process

[Figure: Markov decision process, with states st, st+1, st+2, actions at, transition probabilities Pr(st+1 | at, st), and rewards r(at, st)]

+ Self-interest
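To make the MDP notation above concrete, here is a minimal value-iteration sketch for a finite MDP (illustrative only, not part of the original slides); the transition and reward structures below are toy placeholders.

```python
# Minimal value iteration for a finite MDP (illustrative sketch, not from the talk).
# P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward r(a_t, s_t).

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy two-state example: action 0 stays put (reward 0), action 1 switches state (reward 1).
P = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)], [(0, 1.0)]]]
R = [[0.0, 1.0], [0.0, 1.0]]
print(value_iteration(P, R))
```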


Online Mechanisms

[Figure: each period, agents send reports to the center; the center returns actions and payments]

M = (π, p)

πt : S → A    (decision policy)

pt : S → Rn   (payment policy)

  • Each period:

    • agents report state/rewards

    • center picks action, payments

  • Main question:

    • what policies can be implemented in a game-theoretic equilibrium?

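The per-period loop can be sketched as follows. This is a minimal illustration of the report → action → payment cycle; the Agent/report interface and the toy decision and payment rules are my assumptions, not the mechanism analyzed in the talk.

```python
# Skeleton of one period of an online mechanism M = (pi, p) (illustrative sketch).
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    value: float
    def report(self, state):
        # A truthful agent simply reports its private information.
        return self.value

def run_period(state, agents, decision_policy, payment_rule):
    """One period: collect reports, apply pi^t to pick an action, apply p^t to charge payments."""
    reports = {a.name: a.report(state) for a in agents}   # agents report state/rewards
    action = decision_policy(state, reports)              # pi^t : S -> A
    payments = payment_rule(state, reports, action)       # p^t  : S -> R^n
    return action, payments

# Toy rules: allocate to the highest reported value, charge nothing.
agents = [Agent("A1", 3.0), Agent("A2", 2.0)]
pick_max = lambda s, rep: max(rep, key=rep.get)
no_pay = lambda s, rep, act: {name: 0.0 for name in rep}
print(run_period(state=0, agents=agents, decision_policy=pick_max, payment_rule=no_pay))
```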


Outline

  • Multi-armed Bandits Problem [agent learning]

    • canonical, stylized learning problem from AI

    • introduce a multi-agent variation

    • provide a mechanism to bring optimal coordinated learning into an equilibrium

  • Dynamic auction problem [center learning]

    • resource allocation (e.g. WiFi)

    • dynamic arrival & departure of agents

    • provide a truthful, adaptive mechanism


Multi-Armed Bandit Problem

  • Multi-armed bandit (MAB) problem

  • n arms

  • Each arm has stationary uncertain reward process

  • Goal: implement a (Bayesian) optimal learning policy




Tractability: Gittins' Result

  • Theorem [Gittins & Jones 1974]: The complexity of computing an optimal joint policy for a collection of n Markov Chains is linear in n.

    • There exist independent index functions such that the MC with highest “Gittins index” at any given time should be activated.

    • The index can be computed as the optimal value of a "restart-in-i" MDP, solvable by LP (Katehakis & Veinott '87)
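As a rough illustration of the restart-in-i construction, the sketch below estimates a Gittins index by value iteration rather than the LP formulation named on the slide; the (1 − β) normalization convention and the toy arm are my assumptions.

```python
# Gittins index of state i for one arm, via value iteration on the "restart-in-i" MDP
# (sketch; uses the convention index = (1 - beta) * V(i), one common normalization).
import numpy as np

def gittins_index(P, r, i, beta=0.9, tol=1e-8):
    """P: transition matrix of the arm's Markov chain, r: per-state rewards."""
    V = np.zeros(len(r))
    while True:
        continue_val = r + beta * (P @ V)        # keep playing from the current state
        restart_val = r[i] + beta * (P[i] @ V)   # or restart the arm from state i
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            return (1.0 - beta) * V_new[i]
        V = V_new

# Toy arm: state 0 pays 1 and may decay into state 1, which pays 0 forever.
P = np.array([[0.5, 0.5], [0.0, 1.0]])
r = np.array([1.0, 0.0])
print(gittins_index(P, r, i=0))
```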



Self-Interest + MABP

  • Multi-armed bandit (MAB) problem

  • n arms (arm == agent)

  • Each arm has a stationary, uncertain reward process (privately observed)

  • Goal: implement a (Bayesian) optimal learning policy

[Figure: a central mechanism interacting with agents A1, A2, A3, each privately observing its own rewards]


Review: The Vickrey Auction

  • Rules: "sell to highest bidder at the second-highest price"

  • How should you bid? Truthfully! (a dominant-strategy equilibrium)

  • Alice wins for $8

Alice: $10

Bob: $8

Carol: $6



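A minimal sketch of the second-price rule from the Vickrey review above (illustrative; the tie-breaking and the zero reserve price are my assumptions):

```python
def vickrey_auction(bids):
    """Second-price sealed-bid auction: highest bidder wins, pays the second-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

# Slide example: Alice wins and pays Bob's bid of $8.
print(vickrey_auction({"Alice": 10, "Bob": 8, "Carol": 6}))   # ('Alice', 8)
```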


First Idea: Vickrey Auction

  • Conjecture: Agents will bid the Gittins index for their arm in each round.

  • Intuition?


Not Truthful!

  • Agent 1 may have knowledge that the mean reward for arm 2 is smaller than agent 2’s current Gittins index.

  • Learning by agent 2 would decrease the price paid by agent 1 in the future ⇒ agent 1 should under-bid


Second Idea

  • At every time-step:

    • Each agent reports claim about Gittins index

    • Suppose b1 ≥ b2 ≥ … ≥ bn

    • Mechanism activates agent 1

    • Agent 1 reports reward, r1

    • Mechanism pays r1 to each agent ≠ 1

    • Theorem: Truthful reporting is a Markov-Perfect equilibrium, and mechanism implements optimal Bayesian learning.
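A sketch of one round of this idea appears below. It is illustrative only: the bid()/pull() interface and the posterior-mean bid (a crude stand-in for the true Gittins index) are my assumptions, not the paper's construction.

```python
# One round of the "second idea" mechanism (sketch): activate the highest bidder
# and pay its realized reward to every other agent.
import random

class BernoulliArm:
    """Toy arm/agent: Bernoulli rewards; bids its posterior mean (a stand-in for the Gittins index)."""
    def __init__(self, name, p):
        self.name, self.p = name, p
        self.successes, self.pulls = 1, 2        # crude Beta(1,1)-style prior counts
    def bid(self):
        return self.successes / self.pulls
    def pull(self):
        reward = 1.0 if random.random() < self.p else 0.0
        self.successes += reward
        self.pulls += 1
        return reward

def learning_round(agents):
    bids = {a: a.bid() for a in agents}                      # reported indices
    winner = max(bids, key=bids.get)                         # activate the highest bid
    reward = winner.pull()                                   # winner reports its reward r1
    payments = {a.name: (reward if a is not winner else 0.0) for a in agents}
    return winner.name, reward, payments

arms = [BernoulliArm("A1", 0.7), BernoulliArm("A2", 0.4), BernoulliArm("A3", 0.5)]
for _ in range(3):
    print(learning_round(arms))
```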


Learning-Gittins VCG (CPS'06)

  • At every time-step:

    • Activate Agent with highest bid.

    • Pay the reward received by activated agent to all others

    • Collect from every agent i the expected value that agents ≠ i would receive without i in the system

      • Sample hypothetical execution path(s), using no reported state information.

  • Theorem: Mechanism is truthful, system-optimal, ex ante IR, and ex ante strong budget-balanced in MPE.




  • where X-i is the total expected value agents other than i would have received in this period if i weren't there.


Outline

  • Multi-armed Bandits Problem [agent learning]

    • canonical, stylized learning problem from AI

    • introduce a multi-agent variation

    • provide a mechanism to bring optimal coordinated learning into an equilibrium

  • Dynamic auction problem [center learning]

    • resource allocation (e.g. WiFi)

    • dynamic arrival & departure of agents

    • provide a truthful, adaptive mechanism, that converges towards an optimal decision policy


[Figure: a timeline of states st, st+1, st+2, st+3, with agents A1–A4 arriving and departing over different spans of periods]

First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?


Illustrative Example

Selling a single right to access WiFi in each period

Agent i: (ai, di, wi) ⇒ value wi for an allocation in some period t ∈ [ai, di]

Scenario:

9am: A1 (9,11,$3), A2 (9,11,$2)

10am: A3 (10,11,$1)

Second-price: Sell to A1 for $2, then to A2 for $1

Manipulation? A1 can report a 10am arrival: A2 then wins the 9am slot, and A1 wins at 10am paying only $1.

The naïve Vickrey approach fails!


Scenario (as before): 9am: A1 (9,11,$3), A2 (9,11,$2); 10am: A3 (10,11,$1)

Mechanism rule (NPS'02): Greedy policy; collect the "critical value" payment, i.e. the smallest value an agent could bid and still be allocated.

⇒ Sell to A1, collect $1. Sell to A2, collect $1.

Theorem: Truthful, and implements a 2-approximation allocation, assuming no early arrivals and no late departures.
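A small sketch of the greedy-plus-critical-value rule on this scenario. The two-slot supply horizon and the name-based tie-breaking are my assumptions, chosen so the toy example reproduces the payments on the slide; this is not the paper's implementation.

```python
def greedy_allocate(agents, periods):
    """Greedy online policy: each period, serve the highest-value present, unserved agent.
    agents: name -> (arrival, departure, value).  Ties broken by name (an assumption)."""
    served = {}
    for t in periods:
        present = [(v, n) for n, (a, d, v) in agents.items() if a <= t <= d and n not in served]
        if present:
            present.sort(key=lambda vn: (-vn[0], vn[1]))
            served[present[0][1]] = t
    return served

def critical_payment(agents, periods, name):
    """Smallest reported value with which `name` would still have been served."""
    if name not in greedy_allocate(agents, periods):
        return None                                        # losers pay nothing
    a, d, _ = agents[name]
    candidates = sorted({0.0} | {v for n, (_, _, v) in agents.items() if n != name})
    for c in candidates:
        trial = dict(agents)
        trial[name] = (a, d, c)
        if name in greedy_allocate(trial, periods):
            return c
    return agents[name][2]

# Slide scenario, assuming supply only in the 9am and 10am slots shown.
agents = {"A1": (9, 11, 3.0), "A2": (9, 11, 2.0), "A3": (10, 11, 1.0)}
periods = [9, 10]
print(greedy_allocate(agents, periods))                               # A1 served at 9, A2 at 10
print({n: critical_payment(agents, periods, n) for n in agents})      # A1 and A2 each pay $1
```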


Key Intuition: Monotonicity

(HKMP’05)

Monotonic: πi(vi, v-i) = 1 ⇒ πi(v'i, v-i) = 1 for a higher bid w'i ≥ wi and a more relaxed interval [a'i, d'i] ⊇ [ai, di]

[Figure: win/lose regions over time for an agent with interval [a, d] and for the relaxed interval [a', d'], with corresponding critical prices p and p']


Single-Valued Domains

  • Type θi = (ai, di, [ri, Li])

  • Value ri for a decision kt ∈ Li, or kt ∈ Lj ≻ Li

  • Examples:

    • “single-minded” online combinatorial auctions

    • WiFi allocation with fixed lengths of service

  • Monotonic: higher r, smaller L, earlier a, later d

  • Theorem: monotonicity is necessary and sufficient for truthfulness in SV domains.


[Figure: the same timeline of states st, …, st+3 and agent arrivals/departures as before]

Second question: how to compute monotonic policies in stochastic, SV domains? How to allow learning (by the center)?


Basic Idea

[Figure: time divided into epochs by boundaries T0, T1, T2, T3, with a policy π0, π1, π2, π3 used within each successive epoch]

  • Model-Based Reinforcement Learning

    • Update model in each epoch

  • Planning: compute new policy π0, π1, …

  • Collect critical value payments

  • Key Components:

    1. Ensure policies are monotonic

    2. Method to compute critical-value payments

    3. Careful updates to model.
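The overall epoch structure might be sketched as below; the callables (estimate_model, plan_policy, run_period) are placeholder interfaces of my own, standing in for the three components listed above.

```python
def run_epochs(epoch_periods, estimate_model, plan_policy, run_period):
    """Skeleton of the epoch loop (illustrative API, not the paper's):
    at each epoch boundary T_k, re-estimate the model and plan a new policy pi_k;
    within the epoch, run that fixed policy period by period."""
    data = []                                   # observations (from departed agents only)
    for periods in epoch_periods:
        model = estimate_model(data)            # 3. careful / delayed model update
        policy = plan_policy(model)             # 1. compute a monotonic policy pi_k
        for t in periods:
            data.append(run_period(policy, t))  # allocate; 2. charge critical-value payments
    return data
```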


1. Planning: Sparse-Sampling

[Figure: sparse-sampling tree rooted at the current state, with width w and depth L]

Build a depth-L sampled tree: each node is a state, each node's children are obtained by sampling each action w times; back up the estimates to the root.

Monotonic? Not Quite.
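A minimal sketch of the sparse-sampling estimate (in the spirit of Kearns, Mansour & Ng); the generative-model interface simulate(state, action) and the toy example are my assumptions.

```python
import random

def sparse_sample_value(state, depth, width, actions, simulate, gamma=0.95):
    """Sparse-sampling value estimate: expand each action `width` times per node, down to
    `depth` levels, using a generative model simulate(state, action) -> (reward, next_state),
    and back the averaged estimates up to the root."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            r, s_next = simulate(state, a)
            total += r + gamma * sparse_sample_value(s_next, depth - 1, width, actions, simulate, gamma)
        best = max(best, total / width)
    return best

# Toy generative model: action 1 pays a noisy +1, action 0 pays 0; the state never changes.
sim = lambda s, a: (a + random.uniform(-0.1, 0.1), s)
print(sparse_sample_value(state=0, depth=3, width=4, actions=[0, 1], simulate=sim))
```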


Achieving Monotonicity: Ironing

  • Assume a maximal patience, Δ

  • Ironing: if πss allocates to (ai, di, ri, Li) in period t, then check whether πss would also allocate to (ai, di+Δ, ri, Li)

    • NO: block the allocation to (ai, di, ri, Li)

    • YES: allow allocation

  • Also use “cross-state sampling” to be aware of ironing when planning.
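The ironing check itself is simple to sketch; the allocates(agent_type, period) wrapper around the sampled policy and the four-tuple type encoding are assumed interfaces, not the paper's API.

```python
def ironed_allocation(allocates, agent_type, t, max_patience):
    """Ironing check (sketch): allow an allocation to type (a, d, r, L) in period t only if
    the sampled policy would also allocate to the more patient type (a, d + max_patience, r, L).
    `allocates(agent_type, period) -> bool` wraps the (sparse-sampling) policy."""
    a, d, r, L = agent_type
    if not allocates(agent_type, t):
        return False
    # Blocking allocations that would not survive a later reported departure
    # restores monotonicity in the departure time.
    return allocates((a, d + max_patience, r, L), t)
```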


2. Computing Payments: Virtual Worlds

[Figure: a timeline t0, t1, t2, t3 on which A1 and then A2 win; in virtual world VW1, agent 1's reported value is replaced by a value just below its critical value vc(t0), and in VW2, agent 2's value is replaced by a value just below vc(t1)]

+ a method to compute the critical value vc(t) in any state st
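For a monotonic policy, one simple way (an assumption of mine, not necessarily the paper's method) to compute a critical value is bisection over the reported value, re-running the policy in a virtual world at each probe:

```python
def critical_value(wins, low=0.0, high=1e6, tol=1e-3):
    """Critical value by bisection (sketch): under a monotonic policy, wins(v) is a step
    function of the reported value v, so the smallest winning value can be bracketed.
    `wins(value) -> bool` is assumed to re-run the policy with the agent's value replaced."""
    if not wins(high):
        return None                       # the agent never wins: no critical value
    while high - low > tol:
        mid = (low + high) / 2.0
        if wins(mid):
            high = mid                    # mid suffices to win: critical value <= mid
        else:
            low = mid                     # mid loses: critical value > mid
    return high

# Example: if winning requires reporting at least 4.2, bisection recovers ~4.2.
print(critical_value(lambda v: v >= 4.2))
```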


3. Delayed Updates

[Figure: the epoch timeline with boundaries T0, T1, T2, T3 and policies π0–π3, as before]

  • Consider the critical payment for an agent with ai < T1 < di

  • Delayed updates: only include departed agents in the revised policy π1

  • Ensures policy is agent-independent


Complete Procedure

  • In each period:

    • maintain the main world

    • maintain a virtual world without each active + allocated agent

  • For planning:

    • use ironing to cancel an action

    • cross-state sparse-sampling to improve policy

  • For pricing:

    • charge minimal critical value across virtual worlds

  • Periodically: move to a new model (and policy)

    • only use departed types

  • Theorem: truthful (DSE), adaptive policy for single-valued domains.


Future: Online CAs

  • Combinatorial auctions (CAs) well studied and used in practice (e.g. procurement)

  • Challenge problem: Online CAs

  • Two pronged approach:

    • computational (e.g. leveraging recent work in stochastic online combinatorial optimization by Pascal Van Hentenryck at Brown)

    • incentive considerations (e.g. finding appropriate relaxations of dominant strategy truthfulness to the online domain)


Summary

  • Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous)

  • Opportunity for learning:

    • by agents. Multi-agent MABP

    • demonstrated use of payments to bring optimal learning into an equilibrium

    • by center. Adaptive online auctions

    • demonstrated use of payments to bring expected-value maximizing policies into an equilibrium

  • Exciting area. Lots of work still to do!


Thanks

  • Satinder Singh, Jonathan Bredin, Quang Duong, Mohammad Hajiaghayi, Adam Juda, Robert Kleinberg, Mohammad Mahdian, Chaki Ng, Dimah Yanovsky.

  • More information

    www.eecs.harvard.edu/econcs

