
Learning in Approximate Dynamic Programming for Managing a Multi-Attribute Driver


Presentation Transcript


  1. Learning in Approximate Dynamic Programming for Managing a Multi-Attribute Driver • Martijn Mes • Department of Operational Methods for Production and Logistics, University of Twente, The Netherlands • Sunday, November 7, 2010, INFORMS Annual Meeting Austin

  2. OUTLINE • Illustration: a transportation application • Stylized illustration: the Nomadic Trucker Problem • Approximate Dynamic Programming (ADP) • Challenges with ADP • Optimal Learning • Optimal Learning in ADP • Challenges with Optimal Learning in ADP • Sketch of our solution concept

  3. TRANSPORTATION APPLICATION • Heisterkamp • Trailer trucking: • Providing trucks and drivers • Planning department: • Accept orders • Assign orders to trucks • Assign drivers to trucks • Types of orders: • Direct order: move a trailer from A to B; the client pays depending on the distance between A and B, but the trailer might go through hubs to change the truck and/or driver • Customer guidance order: rent a truck and driver to a client for some time period

  4. REAL APPLICATION • Heisterkamp

  5. CHARACTERISTICS • The drivers are bounded by EU drivers’ hours regulations • However, given a sufficient supply of orders and drivers, trucks can in principle be utilized 24/7 by switching drivers • Even though we can replace a driver (to increase the utilization of trucks), we still might face costs for the old driver • Objective: increase profits by ‘clever’ order acceptance and by minimizing the costs of drivers, trucks, and moving empty (i.e., without a trailer) • We solve a dynamic assignment problem, given the state of all trucks and the (probabilistically) known orders, at specific time instances over a fixed horizon • This problem is known as the Dynamic Fleet Management Problem (DFMP). For illustrative purposes we now focus on the single-vehicle version of the DFMP.

  6. THE NOMADIC TRUCKER PROBLEM • A single trucker moves from city to city, either with a load or empty • Rewards are earned when moving loads; otherwise costs are involved • A vector of attributes a describes the single resource, with A the set of possible attribute vectors; the attributes cover the truck, dynamic attributes (such as location and arrival time), and the driver

  7. MODELING THE DYNAMICS • State St = (Rt, Dt), where • Rt = (Rta), with Rta = 1 when the truck has attribute a (in the DFMP, Rta gives the number of resources at time t with attribute a) • Dt = (Dtl), with Dtl the number of loads of type l • Decision xt: make a loaded move, wait at the current location, or move empty to another location; xt follows from a decision function Xtπ(St), where π ∈ Π is a family of policies • Exogenous information Wt+1: information arriving between t and t+1, such as new loads, wear of the truck, occurrence of breakdowns, etc. • Choosing decision xt in the current state St, together with the exogenous information Wt+1, results in a transition to St+1 with a contribution (payment or costs)
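
In symbols, these dynamics can be written as follows (a reconstruction in the style of standard ADP notation; the slide's own formulas did not survive transcription, so the names S^M for the transition function and C_t for the contribution are assumptions):

```latex
% Reconstructed notation for the state, decision, and transition:
\[
S_t = (R_t, D_t), \qquad
R_t = (R_{ta})_{a \in \mathcal{A}}, \qquad
D_t = (D_{tl})_{l \in \mathcal{L}}
\]
\[
x_t = X_t^{\pi}(S_t), \quad \pi \in \Pi,
\qquad
S_{t+1} = S^M\!\left(S_t, x_t, W_{t+1}\right)
\quad \text{with contribution } C_t(S_t, x_t).
\]
```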

  8. OBJECTIVE • The objective is to find the policy π that maximizes the expected sum of discounted contributions over all time periods
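
Written out (a reconstruction, since the slide's formula did not survive transcription; the discount factor γ and the horizon T are the usual assumptions):

```latex
% Discounted-contribution objective over policies:
\[
\max_{\pi \in \Pi} \;
\mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \, C_t\!\left(S_t, X_t^{\pi}(S_t)\right) \right]
\]
```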

  9. SOLVING THE PROBLEM • Optimality equation (expectation form of Bellman’s equation) • Enumerating by backward induction? • Suppose a = (location, arrival time, domicile) and we discretize to 500 locations and 50 possible arrival times → |A| = 500 · 50 · 500 = 12,500,000 • In the backward loop we not only have to visit all states, but we also have to evaluate all actions and, to compute the expectation, probably all possible outcomes as well • Backward dynamic programming might become intractable → Approximate Dynamic Programming
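
The expectation form of Bellman's equation referred to here, in standard notation (reconstructed, not copied from the slide):

```latex
% Expectation form of Bellman's optimality equation:
\[
V_t(S_t) = \max_{x_t} \left(
  C_t(S_t, x_t)
  + \gamma \, \mathbb{E}\!\left[ V_{t+1}(S_{t+1}) \,\middle|\, S_t, x_t \right]
\right),
\qquad
S_{t+1} = S^M(S_t, x_t, W_{t+1}).
\]
```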

  10. APPROXIMATE DYNAMIC PROGRAMMING • We replace the original optimality equation • With the following

  11. APPROXIMATE DYNAMIC PROGRAMMING • We replace the original optimality equation • With the following • (1) Using a value function approximation: this allows us to step forward in time

  12. APPROXIMATE DYNAMIC PROGRAMMING • We replace the original optimality equation • With the following • (2) Using the post-decision state variable (a deterministic function of the current state and decision)

  13. APPROXIMATE DYNAMIC PROGRAMMING • We replace the original optimality equation • With the following • (3) Generating sample paths

  14. APPROXIMATE DYNAMIC PROGRAMMING • We replace the original optimality equation • With the following • (4) Learning through iterations
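
Putting the four ingredients together, the approximated recursion built up on slides 10-14 takes the following standard ADP form (a reconstruction; the step size α and the iteration superscript n follow common ADP notation rather than the slides themselves):

```latex
% Forward pass at iteration n, time t, along a sampled path:
\[
\hat{v}_t^{\,n} = \max_{x_t} \left(
  C_t(S_t^n, x_t) + \gamma \, \bar{V}_t^{\,n-1}\!\left(S^{M,x}(S_t^n, x_t)\right)
\right)
\]
% Post-decision state (a deterministic function of state and decision):
\[
S_t^{x,n} = S^{M,x}\!\left(S_t^n, x_t^n\right)
\]
% Iterative update of the approximation around the previous post-decision state:
\[
\bar{V}_{t-1}^{\,n}\!\left(S_{t-1}^{x,n}\right) =
(1-\alpha_{n-1}) \, \bar{V}_{t-1}^{\,n-1}\!\left(S_{t-1}^{x,n}\right)
+ \alpha_{n-1} \, \hat{v}_t^{\,n}
\]
```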

  15. OUTLINE OF THE ADP ALGORITHM • The algorithm interleaves deterministic optimization (the maximization over decisions), statistics (updating the value function estimates), and simulation (sampling the exogenous information); a sketch follows below
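
A minimal Python sketch of such an ADP loop on a toy instance with four locations. Everything concrete here (the load probabilities, the contribution values, the epsilon-greedy exploration, the 1/n step size) is illustrative and not taken from the talk; the sketch only shows how the deterministic optimization, statistics, and simulation steps interlock.

```python
import random
from collections import defaultdict

# Toy setup (illustrative only): locations and random loads between them.
LOCATIONS = ["A", "B", "C", "D"]
GAMMA, N_ITER, T_HORIZON, EPSILON = 0.9, 200, 10, 0.1

def sample_loads():
    """Simulation step: sample the exogenous information (which loads appear)."""
    return {(i, j): random.random() < 0.3 for i in LOCATIONS for j in LOCATIONS if i != j}

def contribution(loc, dest, loads):
    """Illustrative contribution: reward for a loaded move, cost for an empty move."""
    if loc == dest:
        return 0.0                      # wait at the current location
    return 10.0 if loads.get((loc, dest), False) else -4.0

V = defaultdict(float)                  # value estimates around post-decision states

for n in range(1, N_ITER + 1):
    loc = random.choice(LOCATIONS)      # start of the sample path
    alpha = 1.0 / n                     # simple declining step size (illustrative)
    prev_post = None
    for t in range(T_HORIZON):
        loads = sample_loads()          # exogenous information W_{t+1}
        # Deterministic optimization: value of each decision under the current approximation.
        values = {d: contribution(loc, d, loads) + GAMMA * V[(t, d)] for d in LOCATIONS}
        if random.random() < EPSILON:   # epsilon-greedy exploration
            decision = random.choice(LOCATIONS)
        else:
            decision = max(values, key=values.get)
        v_hat = values[decision]
        # Statistics: smooth the observed value into the previous post-decision state.
        if prev_post is not None:
            V[prev_post] = (1 - alpha) * V[prev_post] + alpha * v_hat
        prev_post = (t, decision)       # post-decision state: (time, location moved to)
        loc = decision

print({k: round(v, 2) for k, v in sorted(V.items())[:8]})
```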

  16. CHALLENGES WITH ADP • Exploration vs. exploitation: • Exploitation: we do what we currently think is best • Exploration: we choose to try something and learn more (information collection) • To avoid getting stuck in a local optimum, we have to explore. But what do we want to explore, and for how long? Do we need to explore the whole state space? • Do we update the value functions using the results of the exploration steps, or do we want to perform off-policy control? • Techniques from Optimal Learning might help here

  17. OPTIMAL LEARNING • To cope with the exploration vs. exploitation dilemma • Undirected exploration: • Try to randomly explore the whole state space • Examples: pure exploration and epsilon-greedy (explore with probability εn and exploit with probability 1 − εn) • Directed exploration: • Utilize past experience to execute efficient exploration (costs are gradually avoided by making more expensive actions less likely) • Examples of directed exploration: • Boltzmann exploration and interval estimation (standard forms sketched below) • The knowledge gradient policy (see next slides)
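
The selection rules referred to above, in their textbook forms (the slide's own formulas were lost in transcription; these are the standard versions, in which the temperature parameter T and the quantile z_alpha are assumptions here):

```latex
% Boltzmann (softmax) exploration: sample alternative x with probability
\[
P^n(x) = \frac{\exp\!\left(\bar{\theta}_x^{\,n} / T\right)}
              {\sum_{x'} \exp\!\left(\bar{\theta}_{x'}^{\,n} / T\right)}
\]
% Interval estimation: choose the x that maximizes
\[
\nu_x^{\mathrm{IE},n} = \bar{\theta}_x^{\,n} + z_{\alpha} \, \bar{\sigma}_x^{\,n}
\]
```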

  18. THE KNOWLEDGE GRADIENT POLICY [1/2] • Basic principle: • Assume you can make only one measurement, after which you have to make a final choice (the implementation decision) • What choice would you make now to maximize the expected value of the implementation decision? • [Figure: five options (1–5); a measurement of option 5 yields an updated estimate of its value, and this change in the estimated value of option 5 may produce a change in the implementation decision]

  19. THE KNOWLEDGE GRADIENT POLICY [2/2] • The knowledge gradient is the expected marginal value of a single measurement x (defined below) • The knowledge gradient policy measures the alternative with the highest knowledge gradient • There are many problems where making one measurement tells us something about what we might observe from other measurements (e.g., in our transportation application nearby locations have similar properties) • Correlations are particularly important when the number of possible measurements is extremely large relative to the measurement budget (or when the function being measured is continuous) • There are various extensions of the Knowledge Gradient policy that take similarities between alternatives into account → the Hierarchical Knowledge Gradient policy
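
In formulas, using standard knowledge-gradient notation (reconstructed rather than copied from the slide), with S^n denoting the knowledge state after n measurements:

```latex
% Knowledge gradient of alternative x: expected improvement in the best estimate
\[
\nu_x^{\mathrm{KG},n} = \mathbb{E}\!\left[
  \max_{x'} \bar{\theta}_{x'}^{\,n+1} - \max_{x'} \bar{\theta}_{x'}^{\,n}
  \,\middle|\, S^n, \; x^n = x
\right]
\]
% Knowledge gradient policy: measure the alternative with the largest knowledge gradient
\[
X^{\mathrm{KG}}(S^n) = \arg\max_{x} \; \nu_x^{\mathrm{KG},n}
\]
```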

  20. HIERARCHICAL KNOWLEDGE GRADIENT (HKG) • Idea: instead of having a belief on the true value θx of each alternative x (a Bayesian prior with a mean and a precision), we have a belief on the value of each alternative at various levels of aggregation (each with its own mean and precision) • Using aggregation, we express our estimate of θx as a weighted combination of the estimates at the different aggregation levels (sketched below) • Intuition: give the highest weight to the levels with the lowest sum of variance and bias; see [1] and [2] for details • [1] M.R.K. Mes, W.B. Powell, and P.I. Frazier (2010). Hierarchical Knowledge Gradient for Sequential Sampling. • [2] A. George, W.B. Powell, and S.R. Kulkarni (2008). Value Function Approximation Using Multiple Aggregation for Multiattribute Resource Management.
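
A sketch of the weighting idea in the spirit of [1] and [2]; the notation is assumed here: the estimate of alternative x at aggregation level g after n measurements, its variance, and an estimate of its aggregation bias, with weights summing to one:

```latex
\[
\bar{\mu}_x^{\,n} = \sum_{g \in \mathcal{G}} w_x^{g,n} \, \bar{\mu}_x^{g,n},
\qquad
w_x^{g,n} \;\propto\;
\left( \left(\bar{\sigma}_x^{g,n}\right)^2 + \left(\bar{\delta}_x^{g,n}\right)^2 \right)^{-1},
\qquad
\sum_{g \in \mathcal{G}} w_x^{g,n} = 1
\]
```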

  21. STATISTICAL AGGREGATION • Example of an aggregation structure for the Nomadic Trucker Problem: each aggregation level includes some attributes and excludes others (shown on the original slide as a table, with * marking attributes included at a level and - marking attributes excluded) • With HKG we would have 38,911 beliefs, and our belief about a single alternative can be expressed as a function of 6 beliefs (one for each aggregation level). We need this for each time unit.
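
To make the bookkeeping concrete, here is a small Python illustration of mapping one attribute vector to one belief key per aggregation level. The attribute fields follow slide 9 (location, arrival time, domicile), but the six levels and the example values are hypothetical and do not reproduce the aggregation structure from the original table.

```python
from typing import NamedTuple

class Attribute(NamedTuple):
    location: str
    arrival_time: int   # e.g., discretized into 50 time slots
    domicile: str

# Hypothetical aggregation structure: which fields are kept ("*") at each level.
# Level 0 is the most detailed; higher levels drop more attributes.
AGGREGATION_LEVELS = [
    ("location", "arrival_time", "domicile"),  # level 0: full attribute vector
    ("location", "arrival_time"),              # level 1: ignore domicile
    ("location", "domicile"),                  # level 2: ignore arrival time
    ("location",),                             # level 3: location only
    ("domicile",),                             # level 4: domicile only
    (),                                        # level 5: a single global belief
]

def aggregate_key(a: Attribute, level: int) -> tuple:
    """Map an attribute vector to its belief key at the given aggregation level."""
    kept = AGGREGATION_LEVELS[level]
    return (level,) + tuple(getattr(a, field) for field in kept)

a = Attribute(location="Enschede", arrival_time=17, domicile="Utrecht")
for g in range(len(AGGREGATION_LEVELS)):
    print(g, aggregate_key(a, g))   # one belief per level contributes to the estimate for a
```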

  22. ILLUSTRATION OF HKG • The knowledge gradient policy prefers to measure alternatives with a high mean and/or a low precision: • Equal means → measure the alternative with the lowest precision • Equal precisions → measure the alternative with the highest mean • Demo HKG…

  23. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • Illustration of learning in ADP • State St = (Rt, Dt), where Rt represents a location, Rt ∈ {A, B, C, D}, and Dt the available loads going out from Rt • Decision xt is a location to move to, xt ∈ {A, B, C, D} • Exogenous information Wt are the new loads Dt • [Diagram: locations A–D on the vertical axis against time t−1, t, t+1, t+2 on the horizontal axis]

  24. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • We were in the post-decision state in which we had decided to move to location C. After observing the new loads, we are in the pre-decision state.

  25. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • [Diagram: the value estimate of the previous post-decision state is updated; the diagram is now also indexed by iteration (n, n+1) next to location (A–D) and time (t−1 to t+2)]

  26. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • So the decision does not necessarily influence the observed value; however, it determines the state we update next

  27. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • Using Optimal Learning, we estimate the knowledge gain

  28. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • We decide to move to location B, resulting in a post-decision state

  29. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • After observing the new loads, we are in the pre-decision state

  30. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • [Diagram: again the value estimate of the previous post-decision state is updated; axes as before]

  31. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • Again we have to make a sampling decision

  32. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • Again, we estimate the knowledge gain

  33. COMBINING OPTIMAL LEARNING AND ADP DECISIONS • We decide to move to location B, resulting in a post-decision state

  34. CHALLENGES WITH OPTIMAL LEARNING IN ADP • The impact on the next iteration is hard to compute → so we assume a similar resource and demand state in the next iteration and evaluate the impact of an updated knowledge state • Bias: • Decisions have an impact on the value of states in the downstream path (we learn what we measure) • Decisions have an impact on the value of states in the upstream path (with on-policy control) • The decision to measure a state will change its estimated value, which in turn might influence our decisions in the next iteration: simply measuring states more often might increase their estimated values, which in turn makes them more attractive next time

  35. SKETCH OF OUR SOLUTION APPROACH • To cope with the bias, we propose using so-called projected value functions • Assumption: exponential increase (decrease if we started with optimistic estimates) in the estimated values as a function of the number of iterations • Value iteration is known to converge geometrically, see [1] • [Plot: estimated (and hopefully also the weighted) value estimates versus iteration n; for n > n0 an exponential curve is fitted, characterized by the output after n0, a limiting value, and a rate] • [1] M.L. Puterman (1994). Markov Decision Processes. New York: John Wiley & Sons.
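
One parametric form that is consistent with the labels on the plot (output after n0, limiting value, rate); this exact formula is an assumption for illustration, not taken from the talk:

```latex
% Projected value of state s at iteration n > n_0, where V^\infty(s) is the
% limiting value, \lambda > 0 the rate, and \bar{V}^{n_0}(s) the output after n_0:
\[
\tilde{V}^{\,n}(s) \;=\;
V^{\infty}(s) - \left( V^{\infty}(s) - \bar{V}^{\,n_0}(s) \right) e^{-\lambda (n - n_0)}
\]
```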

  36. SKETCH OF OUR SOLUTION APPROACH • Illustration of projected value functions:

  37. NEW ADP ALGORITHM • Step 2b: • Update the value function estimates at all levels of aggregation • Update the weights and compute the weighted value function estimates, possibly for many states at once • Step 2c: • Combine the updated value function estimates with the prior distributions on the projected value functions to obtain posterior distributions, see [1] for details • The new state follows from running HKG using our beliefs on the projected value functions as input • So we completely separate the updating step (step 2a/b) from the exploration step (step 2c) • [1] P.I. Frazier, W.B. Powell, and H.P. Simão (2009). Simulation Model Calibration with Correlated Knowledge-Gradients.
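
A highly simplified Python sketch of the exploration step in isolation: given mean and standard-deviation beliefs about the (projected) value of each candidate next state, pick the state to visit with a knowledge-gradient criterion. The Gaussian formula below is the standard knowledge gradient for independent normal beliefs; the belief values, the noise level, and the interface are made up for illustration and do not reproduce the hierarchical algorithm from the talk.

```python
import math

def kg_factor(z: float) -> float:
    """f(z) = z*Phi(z) + phi(z), the standard normal knowledge-gradient factor."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z * Phi + phi

def knowledge_gradient(mu: dict, sigma: dict, noise_sd: float) -> dict:
    """KG value of measuring each alternative once, assuming independent normal beliefs."""
    kg = {}
    for x in mu:
        # Predictive change in the posterior mean if we measure x once.
        sigma_tilde = sigma[x] ** 2 / math.sqrt(sigma[x] ** 2 + noise_sd ** 2)
        best_other = max(m for y, m in mu.items() if y != x)
        z = -abs(mu[x] - best_other) / sigma_tilde
        kg[x] = sigma_tilde * kg_factor(z)
    return kg

# Illustrative beliefs about the projected value of moving to each location (made up).
mu = {"A": 12.0, "B": 15.0, "C": 14.5, "D": 8.0}       # means
sigma = {"A": 4.0, "B": 1.0, "C": 3.0, "D": 6.0}       # standard deviations

kg = knowledge_gradient(mu, sigma, noise_sd=2.0)
next_state = max(kg, key=kg.get)                        # exploration step: state to visit next
print({x: round(v, 3) for x, v in kg.items()}, "->", next_state)
```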

  38. PERFORMANCE IMPRESSION • Experiment on an instance of the Nomadic Trucker Problem

  39. SHORTCOMINGS • Fitting: • It is not always possible to find a good fit • For example, if the observed values increase slightly faster in the beginning and more slowly after that (compared to the fitted exponential), we still have the bias where sampled states look more attractive than others; after a sufficient number of measurements this is corrected • Computation time: • We have to spend quite some computation time on the sampling decision; we could also have used this time to simply sample more states instead of deliberating about which state to sample • Application area: a large state space (pure exploration doesn’t make sense) but a small action space

  40. CONCLUSIONS • We illustrated the challenges of ADP using the Nomadic Trucker example • We illustrated how optimal learning can be helpful here • We illustrated the difficulty of learning in ADP due to the bias: • our estimated values are influenced by the measurement policy, which in turn is influenced by our estimated values • To cope with this bias we introduced the notion of projected value functions • This enables us to use the HKG policy to • cope with the exploration vs. exploitation dilemma • allow generalization across states • We briefly illustrated the potential of this approach but also mentioned several shortcomings

  41. QUESTIONS? • Martijn Mes • Assistant Professor, University of Twente, School of Management and Governance, Operational Methods for Production and Logistics, The Netherlands • Contact: • Phone: +31-534894062 • Email: m.r.k.mes@utwente.nl • Web: http://mb.utwente.nl/ompl/staff/Mes/
