Learning to Trade via Direct Reinforcement. John Moody and Matthew Saffell, IEEE Trans. Neural Networks 12(4), pp. 875-889, 2001. Summarized by Jangmin O, BioIntelligence Lab.
Author • J. Moody • Director of the Computational Finance Program and a Professor of CSEE at Oregon Graduate Institute of Science and Technology • Founder & President of Nonlinear Prediction Systems • Program Co-Chair for Computational Finance 2000 • A past General Chair and Program Chair of NIPS • A member of the editorial board of Quantitative Finance
I. Introduction
Optimizing Investment Performance • Characteristic • Path-dependent • Method: Direct Reinforcement learning (DR) • Recurrent Reinforcement Learning [1, 2] • No need for a forecasting model • Single security or asset allocation • Recurrent Reinforcement Learning (RRL) • Adaptive policy search • Learns an investment strategy on-line • No need to learn a value function • Immediate rewards are available in financial markets
Difference between RRL & Q or TD • The financial decision-making problem is well suited to RRL • Immediate feedback is available • Performance criteria: risk-adjusted investment returns • Sharpe ratio • Downside risk minimization • Differential form
Experimental Data • U.S. Dollar/British Pound foreign exchange market • S&P 500 Stock Index and Treasury Bills • RRL vs. Q-learning • Bellman's curse of dimensionality
II. Trading Systems and Performance Criteria
Structure of Trading Systems (1) • An agent: assumptions • Trades a fixed position size in a single market • Trader at time t: Ft ∈ {+1, 0, -1} • Long: buy, Neutral: out of the market, Short: short sale • Profit Rt • Realized at the end of (t-1, t]: the profit/loss from holding position Ft-1, plus the transaction cost of moving from position Ft-1 to Ft • The structure must be recurrent! • To make decisions that account for transaction costs, market impact, taxes, etc.
Structure of Trading Systems (2) • A single-asset trading system • θt: system parameters at time t • It: information at time t • zt: price series, yt: other external variable series • Simple example
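The "simple example" above (its equation image was lost in extraction) can be sketched as a recurrent tanh trader; the weight layout and the three-return window below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def trade_decision(theta, prev_position, returns_window):
    """Recurrent trader: the previous position F_{t-1} is fed back as an
    input so the system can weigh transaction costs when switching."""
    w, u, b = theta                      # w: return weights, u: recurrence weight, b: bias
    activation = w @ returns_window + u * prev_position + b
    return np.tanh(activation)          # F_t in (-1, 1); sign() would give discrete {-1, +1}

# usage: decide a position from the last 3 price returns (illustrative numbers)
theta = (np.array([0.5, 0.3, 0.1]), 0.8, 0.0)
F_t = trade_decision(theta, prev_position=1.0,
                     returns_window=np.array([0.01, -0.02, 0.005]))
```

Feeding F_{t-1} back is exactly what makes the system recurrent, as the previous slide argues.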
Profit and Wealth for Trading Systems (1) • Performance function U() for a risk-insensitive trader = Profit • Additive profits • Trading a fixed number of shares (or contracts) of the security • rt = zt - zt-1: return of the risky asset • rtf: return of the risk-free asset (e.g., T-bills) • δ: transaction cost rate • Trader's wealth: WT = W0 + PT
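A rough sketch of the additive accounting, with the risk-free leg omitted for brevity (μ is the fixed number of shares, δ a per-trade cost rate; a simplified reconstruction, not the paper's exact formula):

```python
import numpy as np

def additive_profit(positions, price_changes, mu=1.0, delta=0.001):
    """Cumulative additive profit P_T for a fixed-size (mu shares) trader.
    positions[t] = F_t in {-1, 0, +1}; price_changes[t] = r_t = z_t - z_{t-1}.
    Cost delta is charged on each change of position; risk-free leg omitted."""
    F = np.asarray(positions, dtype=float)
    r = np.asarray(price_changes, dtype=float)
    # R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|), summed over t = 1..T
    R = mu * (F[:-1] * r[1:] - delta * np.abs(np.diff(F)))
    return R.sum()
```

For example, holding long through a unit price rise and paying costs on entry and exit yields 1.0 minus two cost charges.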
Profit and Wealth for Trading Systems (2) • Multiplicative profits • A fixed fraction ν > 0 of accumulated wealth is invested • rt = (zt/zt-1 - 1) • In case of no short sales, when ν = 1
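A corresponding sketch for the multiplicative case with ν = 1 (full reinvestment, no margin effects; a simplified reconstruction, not the paper's exact formula):

```python
import numpy as np

def multiplicative_wealth(positions, returns, w0=1.0, delta=0.001):
    """Wealth W_T = W_0 * prod_t (1 + R_t) when the whole stake (nu = 1)
    is reinvested each period. returns[t] = r_t = z_t / z_{t-1} - 1."""
    F = np.asarray(positions, dtype=float)
    r = np.asarray(returns, dtype=float)
    # (1 + R_t) = (1 + F_{t-1} r_t) * (1 - delta * |F_t - F_{t-1}|)
    growth = (1.0 + F[:-1] * r[1:]) * (1.0 - delta * np.abs(np.diff(F)))
    return w0 * np.prod(growth)
```

Note the structural difference from the additive case: costs scale the wealth multiplicatively instead of being subtracted.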
Performance Criteria • UT in the general form U(RT, …, Rt, …, R2, R1; W0) • Simple form U(WT): standard economic utility • Path-dependent performance functions: Sharpe ratio, etc. • Moody's focus • The marginal increase of Ut caused by Rt at each time step • Differential performance criteria
Differential Sharpe Ratio (1) • Sharpe ratio: risk-adjusted return • Differential Sharpe ratio • For on-line learning, the influence of Rt at time t must be computed • Use exponential moving averages • First-order Taylor expansion in the adaptation rate η • If η = 0, then St = St-1
Differential Sharpe Ratio (2) • Exponential moving averages with adaptation rate η • Sharpe ratio • From the Taylor expansion: Rt > At-1 means increased reward; Rt² > Bt-1 means increased risk
Differential Sharpe Ratio (3) • Derivative with respect to η • Dt is maximal at Rt = Bt-1/At-1 • Meaning of the differential Sharpe ratio • Makes on-line learning possible: easily computed from At-1 and Bt-1 • Allows recursive updating • Weights recent returns more strongly • Interpretability: reveals the contribution of each Rt
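The three slides above can be condensed into one on-line update step. This is a minimal sketch assuming the standard EMA recursions A_t = A_{t-1} + η·ΔA_t and B_t = B_{t-1} + η·ΔB_t; the initial A_0 and B_0 must be seeded, e.g. from a warm-up sample:

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01):
    """One on-line step of the differential Sharpe ratio D_t.
    A, B are exponential moving estimates of the first and second
    moments of the returns; eta is the adaptation rate."""
    dA = R_t - A_prev                      # Delta A_t
    dB = R_t ** 2 - B_prev                 # Delta B_t
    denom = (B_prev - A_prev ** 2) ** 1.5  # variance-like term to the 3/2 power
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / denom
    A_t = A_prev + eta * dA                # EMA updates
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
```

Only A_{t-1} and B_{t-1} need to be stored, which is what makes the recursion cheap enough for on-line learning.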
III. Learning to Trade
Reinforcement Framework • RL • Maximizing the expected reward • Trial-and-error exploration of the environment • Comparison with supervised learning [1, 2] • Problematic with transaction costs • Structural credit assignment vs. temporal credit assignment • Types of RL • DR: policy search • Q-learning: value function • Actor-critic methods
Recurrent Reinforcement Learning (1) • Goal • For a trading system Ft(θ), find the parameters θ that maximize UT • Example • Trading system • Trading return • The derivative formula after T periods
Recurrent Reinforcement Learning (2) • Learning techniques • Back-propagation through time (BPTT) • Temporal dependencies • Stochastic (on-line) version • Focus only on the terms involving Rt: the differential performance criterion Dt
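A minimal sketch of one stochastic RRL step for a single tanh trader. The parameter layout and the externally supplied reward-side derivatives (from the chosen criterion D_t) are illustrative assumptions:

```python
import numpy as np

def rrl_step(theta, dF_prev, x_t, F_prev, dD_dR, dR_dF, dR_dFprev, rho=0.01):
    """One on-line RRL update with a one-step recurrent trace.
    theta = (w, u, b) parametrizes F_t = tanh(w @ x_t + u * F_prev + b).
    dD_dR, dR_dF, dR_dFprev are derivatives from the differential
    criterion and the return function, assumed given here."""
    w, u, b = theta
    pre = w @ x_t + u * F_prev + b
    F_t = np.tanh(pre)
    sech2 = 1.0 - F_t ** 2                                     # d tanh / d pre
    grad_theta = sech2 * np.concatenate([x_t, [F_prev, 1.0]])  # partial F_t / partial theta
    dF_t = grad_theta + sech2 * u * dF_prev                    # recurrent total derivative
    # gradient ascent on D_t: dD/dtheta = dD/dR * (dR/dF * dF_t + dR/dF_{t-1} * dF_prev)
    delta = rho * dD_dR * (dR_dF * dF_t + dR_dFprev * dF_prev)
    flat = np.concatenate([w, [u, b]]) + delta
    return (flat[:-2], flat[-2], flat[-1]), F_t, dF_t
```

The carried trace dF_prev is what distinguishes this from plain gradient ascent: it propagates the influence of θ through the fed-back position.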
Recurrent Reinforcement Learning (3) • Reminder • Moody focuses on optimizing Dt, an immediate measure of each action's effect • [1, 2] • Portfolio optimization, etc.
Value Function (1) • Implicitly learning correct actions through value iteration • Value function • Discounted future rewards received from state x when following the policy π • Annotations: the probability of taking action a in state x; the probability of the transition x → y under action a; the immediate reward for the transition x → y under action a; γ, the discount factor trading off future against immediate rewards
Value Function (2) • Optimal value function & Bellman's optimality equation • Value iteration update: converges to the optimal solution • Optimal policy
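Value iteration can be made concrete on a toy MDP; the two-state example in the test is mine, not from the paper:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a][x][y]: transition probabilities; R[a][x][y]: immediate rewards.
    Iterates V <- max_a sum_y P(x,y|a) * (R(x,y|a) + gamma * V(y)) to a
    fixed point, returning the optimal values and the greedy policy."""
    n = P.shape[1]
    V = np.zeros(n)
    while True:
        # Q[a, x] = expected immediate reward + discounted expected next value
        Q = np.einsum('axy,axy->ax', P, R) + gamma * np.einsum('axy,y->ax', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```

Because the update is a γ-contraction, the loop converges geometrically regardless of the initial V.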
Q-Learning • Q-function: computes the future reward of the current state and current action • Value iteration update: converges to the optimal Q-function • Calculating the best action • No need to know pxy(a) • Error function of the function approximator (e.g., a neural network)
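The tabular form of the update makes the model-free property concrete; with a neural-network approximator the same target simply becomes the regression label for the error function mentioned above. A tabular sketch:

```python
def q_update(Q, x, a, r, y, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_a' Q(y,a') - Q(x,a)).
    Only the sampled transition (x, a, r, y) is needed; the transition
    probabilities p_xy(a) never appear."""
    target = r + gamma * max(Q[y])
    Q[x][a] += alpha * (target - Q[x][a])
    return Q

# usage: 2 states x 2 actions, all-zero table
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, x=0, a=1, r=1.0, y=1)      # Q[0][1] becomes 0.1
```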
IV. Empirical Results • Artificial price series • U.S. Dollar/British Pound exchange rate • Monthly S&P 500 stock index
A trading system based on DR
Artificial price series • Data: autoregressive trend processes • 10,000 samples • Questions examined • Is RRL suitable as a learning tool for trading strategies? • How does the number of trades change as transaction costs increase?
(Figure: artificial price series experiment; 10,000 samples; {long, short} positions only; performance degrades during the first ~2,000 periods)
(Figure: magnified view from period 9,000 onward; = 0.01)
(Figure: results over 100 runs; 100 epochs of training plus on-line adaptation; transaction costs of 0.2%, 0.5%, and 1%; panels show number of trades, Sharpe ratio, and cumulative profit)
U.S. Dollar/British Pound Foreign Exchange Trading • {long, neutral, short} trading system • 30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data • 24-hour trading, five days a week: January-August 1996 • Strategy • Train on 2,000 data points • Trade over the next 480 data points (two weeks) • Slide the window forward and retrain • Results • Annualized 15% return with an annualized Sharpe ratio of 2.3 • One trade every five hours on average • Not considered • Trading concentrated at peaks • Market illiquidity
S&P 500/T-Bill Asset Allocation (1) • Overview • Long position: invested in the S&P 500, earning no T-Bill interest • Short position: earns twice the T-Bill rate • (Figure annotations: dividends reinvested; T-Bill interest; S&P 500 dividends)
S&P 500/T-Bill Asset Allocation (2) • Simulation • Data (1950-1994): initial training (through 1969) + test (1970 onward) • Training window: 10 years of training + 10 years of validation • Input features: 84 series (financial + macroeconomic) • RRL-trader • A single tanh unit, weight decay • Q-trader • Bootstrap samples • Two-layer feedforward NN (30 tanh units) • Bias/variance tradeoff: selected among models with 10, 20, 30, and 40 units
(Figure: voting methods; 30 RRL runs, 10 Q runs; transaction cost 0.5%; profits reinvested • Multiplicative profit ratios: Buy and Hold 1,348%, Q-Trader 3,359%, RRL-Trader 5,860%)
(Figure annotations: market correction, monetary tightening, oil shock, Gulf War, market crash) • Underlying premise: over the 25 years from 1970 to 1994, the U.S. equity/Treasury markets were predictable • Statistically significant
Sensitivity Analysis • (Figure: sensitivity to inflation expectations)
V. Learn the Policy or Learn the Value?
Immediate vs. Future Rewards • Reinforcement signal • Immediate (RRL) or delayed (Q-learning, dynamic programming, or TD) • RRL • The policy is represented directly • Learning a value function is bypassed • Q-learning • The policy is represented indirectly
Policies vs. Values • Some limitations of the value function approach • The original formulation of Q-learning: discrete action and state spaces • Curse of dimensionality • Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy • Large-scale noise and non-stationarity may lead to severe problems • RRL's advantages • The policy is represented directly: a simpler functional form suffices • Can produce real-valued actions • More robust in noisy environments / quick adaptation to non-stationarity
An Example • Simple trading system • {buy, sell} a single asset • Assumption: rt+1 is known in advance • No need for future rewards: γ = 0 • The policy function is trivial: at = rt+1 • One tanh unit is sufficient • Value function: must represent an XOR-like structure • Two tanh units needed
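The XOR point can be made concrete with a sign table (a toy illustration, not the paper's construction):

```python
# The optimal action depends only on the sign of r_{t+1} (linearly
# separable), while the value a * r is positive exactly when the two
# signs agree -- the XOR pattern a single tanh unit cannot represent.
cases = [(a, r) for a in (-1, 1) for r in (-1, 1)]
policy = {r: (1 if r > 0 else -1) for _, r in cases}   # a* depends on r alone
value = {(a, r): a * r for a, r in cases}              # Q(a, r) up to scale
```

The policy table has one decision boundary; the value table needs two, which is why the value learner requires the larger network.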
Conclusion • How to train trading systems via DR • The RRL algorithm • Differential Sharpe ratio & differential downside deviation ratio • RRL is more efficient than Q-learning in the financial domain.