Learning to Trade via Direct Reinforcement
John Moody
International Computer Science Institute, Berkeley & J E Moody & Company LLC, Portland
Moody@ICSI.Berkeley.Edu | John@JEMoody.Com
Global Derivatives Trading & Risk Management, Paris, May 2008
What is Reinforcement Learning?
RL Considers:
• A Goal-Directed “Learning” Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility
RL is an Active Paradigm:
• The Agent “Learns” by “Trial & Error” Discovery
• Actions result in Reinforcement
RL Paradigms:
• Value Function Learning (Dynamic Programming)
• Direct Reinforcement (Adaptive Control)
I. Why Direct Reinforcement?
Direct Reinforcement Learning:
• Finds predictive structure in financial data
• Integrates Forecasting w/ Decision Making
• Balances Risk vs. Reward
• Incorporates Transaction Costs
• Discovers Trading Strategies!
Optimizing Trades based on Forecasts
Indirect Approach:
• Two sets of parameters
• Forecast error is not Utility
• Forecaster ignores transaction costs
• Information bottleneck
Learning to Trade via Direct Reinforcement
Trader Properties:
• One set of parameters
• A single utility function
• U includes transaction costs
• Direct mapping from inputs to actions
Direct RL Trader (USD/GBP): Return_A = 15%, SR_A = 2.3, DDR_A = 3.3
II. Direct Reinforcement: Algorithms & Illustrations
Algorithms:
• Recurrent Reinforcement Learning (RRL)
• Stochastic Direct Reinforcement (SDR)
Illustrations:
• Sensitivity to Transaction Costs
• Risk-Averse Reinforcement
Learning to Trade via Direct Reinforcement
DR Trader:
• Recurrent policy (Trading signals, Portfolio weights)
• Takes an action, Receives a reward (Trading Return w/ Transaction Costs)
• Causal performance function (Generally path-dependent)
• Learns the policy by varying its parameters
GOAL: Maximize performance or marginal performance (see the sketch below)
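The structural equations were stripped with the slide graphics; the following is a hedged reconstruction of the trader's structure. The symbols θ (policy parameters), F_t (position), I_t (information set), R_t (trading return), and U_T (utility) follow the notation of the cited Moody & Saffell paper [4] rather than the slide itself.

\[
F_t \;=\; F_\theta\!\left(F_{t-1},\, I_t\right), \qquad
I_t \;=\; \{\, z_t, z_{t-1}, \dots ;\; y_t, y_{t-1}, \dots \,\},
\]
\[
\max_{\theta}\; U_T\!\left(R_1, \dots, R_T\right)
\quad \text{(batch)} \qquad \text{or} \qquad
\max_{\theta}\; D_t \;\; \text{(marginal performance, on-line)}.
\]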
Recurrent Reinforcement Learning (RRL) (Moody & Wu 1997)
• Deterministic gradient (batch), with recursion
• Stochastic gradient (on-line), with stochastic recursion
• Stochastic parameter update (on-line)
• Constant learning rate: adaptive learning. Declining learning rate: stochastic approximation.
(A reconstruction of these gradients follows below.)
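The equations behind these labels did not survive extraction; below is a reconstruction of the RRL gradients following the formulation in the cited Moody & Saffell (2001) paper [4], with ρ denoting the learning rate.

Batch (deterministic) gradient, with recursion:
\[
\frac{dU_T(\theta)}{d\theta} \;=\; \sum_{t=1}^{T} \frac{dU_T}{dR_t}
\left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta} \;+\; \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta} \right\},
\qquad
\frac{dF_t}{d\theta} \;=\; \frac{\partial F_t}{\partial \theta} \;+\; \frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{d\theta}.
\]
On-line (stochastic) gradient and parameter update:
\[
\frac{dU_t(\theta_t)}{d\theta_t} \;\approx\; \frac{dU_t}{dR_t}
\left\{ \frac{dR_t}{dF_t}\frac{\partial F_t}{\partial \theta_t} \;+\; \frac{dR_t}{dF_{t-1}}\frac{\partial F_{t-1}}{\partial \theta_{t-1}} \right\},
\qquad
\Delta\theta_t \;=\; \rho\, \frac{dU_t(\theta_t)}{d\theta_t}.
\]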
Structure of Traders
• Single Asset: Price series, Return series
• Traders: Discrete position size, Recurrent policy
• Observations: Full system State is not known
• Simple Trading Returns and Profit (see below)
• Transaction Costs: represented by a proportional cost rate
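The return and profit definitions implied here were lost with the slide graphics; this is a reconstruction in the notation of [2, 4], where F_t is the position, r_t the asset return, δ the proportional transaction-cost rate, and μ the trading size. These symbols are assumptions rather than copies from the slide.

\[
F_t \in \{-1,\, 0,\, 1\}, \qquad
R_t \;=\; \mu\left( F_{t-1}\, r_t \;-\; \delta\, \lvert F_t - F_{t-1} \rvert \right), \qquad
P_T \;=\; \sum_{t=1}^{T} R_t .
\]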
Risk-Averse Reinforcement: Financial Performance Measures
Performance Functions:
• Path independent (Standard Utility Functions)
• Path dependent
Performance Ratios (see below):
• Sharpe Ratio
• Downside Deviation Ratio
For Learning:
• Per-Period Returns
• Marginal Performance, e.g. the Differential Sharpe Ratio
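The ratio definitions were stripped with the slide graphics; these are the standard forms used in [2, 4], reproduced here as a hedged reconstruction over the per-period trading returns R_t.

\[
SR_T \;=\; \frac{\operatorname{Average}(R_t)}{\operatorname{StdDev}(R_t)},
\qquad
DDR_T \;=\; \frac{\operatorname{Average}(R_t)}{DD_T},
\qquad
DD_T \;=\; \left( \frac{1}{T}\sum_{t=1}^{T} \min(R_t,\, 0)^2 \right)^{1/2}.
\]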
Long / Short Trader Simulation: Sensitivity to Transaction Costs
• Learns from scratch and on-line
• Moving-average Sharpe Ratio with adaptation rate η = 0.01
Trader Simulation: Transaction Costs vs. Performance
100 Runs; Costs = 0.2%, 0.5%, and 1.0% (figure panels: Sharpe Ratio, Trading Frequency)
Minimizing Downside Risk: Artificial Price Series w/ Heavy Tails
Comparison of Risk-Averse Traders: Underwater Curves
Comparison of Risk-Averse Traders: Draw-Downs
III. Direct Reinforcement vs. Dynamic Programming
Algorithms:
• Value Function Method (Q-Learning)
• Direct Reinforcement Learning (RRL)
Illustration:
• Asset Allocation: S&P 500 & T-Bills
• RRL vs. Q-Learning
RL Paradigms Compared
Value Function Learning:
• Origins: Dynamic Programming
• Learn an “optimal” Q-Function; Q: (state, action) → value
• Solve Bellman’s Equation
• Action: “Indirect”
Direct Reinforcement:
• Origins: Adaptive Control
• Learn a “good” Policy P; P: observations → p(action)
• Optimize the “Policy Gradient”
• Action: “Direct”
(An illustrative comparison of the two update rules follows.)
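To make the contrast concrete, here are textbook forms of the two updates; these are standard expressions rather than formulas taken from the slides, with α, γ, and ρ as assumed symbols for the learning rate, discount factor, and policy step size.

\[
\text{Q-Learning (value function):}\quad
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) \;+\; \alpha \left[\, r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,\right],
\]
\[
\text{Direct Reinforcement (policy gradient):}\quad
\theta \;\leftarrow\; \theta \;+\; \rho\, \frac{d U_T(\theta)}{d\theta}.
\]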
S&P-500 / T-Bill Asset Allocation: Maximizing the Differential Sharpe Ratio
S&P-500: Opening Up the Black Box
85 series: learned relationships are nonstationary over time
Closing Remarks
Direct Reinforcement Learning:
• Discovers Trading Opportunities in Markets
• Integrates Forecasting w/ Trading
• Maximizes Risk-Adjusted Returns
• Optimizes Trading w/ Transaction Costs
Direct Reinforcement Offers Advantages Over:
• Trading based on Forecasts (Supervised Learning)
• Dynamic Programming RL (Value Function Methods)
Illustrations:
• Controlled Simulations
• FX Currency Trader
• Asset Allocation: S&P 500 vs. Cash
Moody@ICSI.Berkeley.Edu | John@JEMoody.Com
Selected References:
[1] John Moody and Lizhong Wu. Optimization of trading systems and portfolios. In Decision Technologies for Financial Engineering, 1997.
[2] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17:441–470, 1998.
[3] Jonathan Baxter and Peter L. Bartlett. Direct gradient-based reinforcement learning: Gradient estimation algorithms. 2001.
[4] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, July 2001.
[5] Carl Gold. FX trading via recurrent reinforcement learning. In Proceedings of the IEEE CIFEr Conference, Hong Kong, 2003.
[6] John Moody, Y. Liu, M. Saffell and K.J. Youn. Stochastic direct reinforcement: Application to simple games with recurrence. In Artificial Multiagent Learning, Sean Luke et al., eds., AAAI Press, 2004.
Supplemental Slides
• Differential Sharpe Ratio
• Portfolio Optimization
• Stochastic Direct Reinforcement (SDR)
Maximizing the Sharpe Ratio
• Sharpe Ratio
• Exponential Moving Average (EMA) Sharpe Ratio, with an adaptation rate setting the time scale and incrementally updated moment estimates (see the reconstruction below)
Motivation:
• The EMA Sharpe ratio emphasizes recent patterns;
• it can be updated incrementally.
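The definitions themselves were lost in extraction; the following is a hedged reconstruction consistent with [2, 4], where η is the EMA adaptation rate (time scale 1/η) and A_t, B_t are exponential moving estimates of the first and second moments of the trading returns.

\[
SR_T \;=\; \frac{\operatorname{Average}(R_t)}{\operatorname{StdDev}(R_t)}
\;=\; \frac{A_T}{\sqrt{B_T - A_T^2}},
\qquad
A_T = \frac{1}{T}\sum_{t=1}^{T} R_t,\quad
B_T = \frac{1}{T}\sum_{t=1}^{T} R_t^2 .
\]
EMA version (time scale \(1/\eta\)), up to a normalization that depends on η:
\[
A_t = A_{t-1} + \eta\,(R_t - A_{t-1}), \qquad
B_t = B_{t-1} + \eta\,(R_t^2 - B_{t-1}), \qquad
S_t \;\propto\; \frac{A_t}{\sqrt{B_t - A_t^2}} .
\]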
Differential Sharpe Ratio for Adaptive Optimization
• Expand the EMA Sharpe ratio to first order in the adaptation rate η
• Define the Differential Sharpe Ratio as the first-order term (see the reconstruction below)
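A reconstruction of the expansion and definition, following the form given in [2, 4]; it should be checked against the original slide graphics.

\[
S_t \;\approx\; S_{t-1} \;+\; \eta\, \frac{dS_t}{d\eta}\bigg|_{\eta=0} \;+\; O(\eta^2),
\qquad
D_t \;\equiv\; \frac{dS_t}{d\eta}
\;=\; \frac{B_{t-1}\,\Delta A_t \;-\; \tfrac{1}{2}\, A_{t-1}\,\Delta B_t}
{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}},
\]
where \(\Delta A_t = R_t - A_{t-1}\) and \(\Delta B_t = R_t^2 - B_{t-1}\).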
Learning with the Differential SR
Evaluate the “Marginal Utility” Gradient (a worked sketch follows below).
Motivation for the DSR:
• isolates the contribution of the current return R_t to the Sharpe ratio (“marginal utility”);
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
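A minimal Python sketch of on-line RRL learning driven by the Differential Sharpe Ratio, assembled from the formulas reconstructed above. The tanh position function, the choice of inputs, and the names eta, rho, and delta are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def online_rrl_dsr(r, eta=0.01, rho=0.1, delta=0.002):
    """Sketch of an on-line RRL trader maximizing the Differential Sharpe Ratio.

    r     -- sequence of per-period asset returns r_t
    eta   -- EMA adaptation rate for the moment estimates A_t, B_t
    rho   -- learning rate for the trader parameters
    delta -- proportional transaction-cost rate
    """
    w = np.zeros(3)            # weights on [previous asset return, previous position, bias]
    A = B = 0.0                # EMA estimates of E[R] and E[R^2]
    F_prev = 0.0               # previous position, in [-1, 1]
    dF_prev = np.zeros(3)      # recurrent sensitivity dF_{t-1}/dw
    positions = []

    for t in range(1, len(r)):
        x = np.array([r[t - 1], F_prev, 1.0])
        F = np.tanh(w @ x)                      # continuous trading position

        # trading return with proportional transaction costs
        R = F_prev * r[t] - delta * abs(F - F_prev)

        # recurrent gradient: dF_t/dw = (1 - F^2) * (x + w_F * dF_{t-1}/dw)
        dF = (1.0 - F ** 2) * (x + w[1] * dF_prev)

        var = B - A * A
        if var > 1e-12:
            # derivative of the Differential Sharpe Ratio w.r.t. the current return
            dD_dR = (B - A * R) / var ** 1.5
            # dR_t/dF_t and dR_t/dF_{t-1}
            s = np.sign(F - F_prev)
            dR_dF, dR_dFprev = -delta * s, r[t] + delta * s
            # stochastic gradient ascent on the marginal performance
            w += rho * dD_dR * (dR_dF * dF + dR_dFprev * dF_prev)

        # update moving moments and recurrent state
        A += eta * (R - A)
        B += eta * (R * R - B)
        F_prev, dF_prev = F, dF
        positions.append(F)

    return w, positions
```

In practice one would normalize the inputs and possibly decay the learning rate, per the constant vs. declining learning-rate note in the RRL slide above.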
Trader Simulation: Transaction Costs vs. Performance
100 runs; Costs = 0.2%, 0.5%, and 1.0% (figure panels: Trading Frequency, Cumulative Profit, Sharpe Ratio)
Portfolio Optimization (3 Securities)
Stochastic Direct Reinforcement: Probabilistic Policies
Learning to Trade
• Single Asset: Price series, Return series
• Trader: Discrete position size, Recurrent policy
• Observations: Full system State is not known
• Simple Trading Returns and Profit
• Transaction cost rate (proportional)
Why does Reinforcement need Recurrence?
Consider a learning agent with a stochastic policy function whose inputs include recent observations o and actions a.
Why should past actions (recurrence) be included? Examples:
• Games (the observations o are the opponent’s actions)
• Trading financial markets
In general:
• Model the opponent’s responses o to previous actions a
• Minimize transaction costs, market impact
Recurrence enables discovery of better policies that capture an agent’s impact on the world!
Stochastic Direct Reinforcement (SDR): Maximize Performance
The goal of SDR is to maximize the expected total performance of a sequence of T actions via direct gradient ascent. This requires evaluating the total policy gradient for a policy represented by a parameterized action probability (a reconstruction of the objective follows below).
Notation: the complete history of observations and actions up to time t; a partial history of length (n, m) contains the n most recent observations and m most recent actions.
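The objective and update were equations on the original slide; the following is a hedged reconstruction in generic notation (θ for policy parameters, ρ for the step size), consistent with the description in [6].

\[
U_T(\theta) \;=\; E_\theta\!\left[\, U(a_1, \dots, a_T) \,\right]
\;=\; \sum_{a_1, \dots, a_T} p_\theta(a_1, \dots, a_T \mid \text{observations})\; U(a_1, \dots, a_T),
\qquad
\Delta\theta \;=\; \rho\, \frac{d\, U_T(\theta)}{d\theta}.
\]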
Stochastic Direct Reinforcement: First-Order Recurrent Policy Gradient
• For first-order recurrence (m = 1), the conditional action probability is given by the policy
• The probabilities of current actions depend upon the probabilities of prior actions
• The total (recurrent) policy gradient is computed from the partial (naïve) policy gradient
(A plausible reconstruction of this recursion follows below.)
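A plausible reconstruction of the first-order recursion described above; the exact expressions should be checked against [6]. The tilde marks the partial (naïve) gradient taken with the prior action probabilities held fixed.

\[
\text{Policy:}\quad p_\theta(a_t \mid o_t, a_{t-1}),
\qquad
\text{Marginal:}\quad p_\theta(a_t) \;=\; \sum_{a_{t-1}} p_\theta(a_t \mid o_t, a_{t-1})\, p_\theta(a_{t-1}),
\]
\[
\nabla_\theta\, p_\theta(a_t) \;=\; \sum_{a_{t-1}} \Big[\, \tilde{\nabla}_\theta\, p_\theta(a_t \mid o_t, a_{t-1})\; p_\theta(a_{t-1})
\;+\; p_\theta(a_t \mid o_t, a_{t-1})\; \nabla_\theta\, p_\theta(a_{t-1}) \,\Big].
\]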
SDR Trader Simulation w/ Transaction Costs
Trading Frequency vs. Transaction Costs (figure panels: Recurrent SDR, Non-Recurrent)
Sharpe Ratio vs. Transaction Costs (figure panels: Recurrent SDR, Non-Recurrent)