Learning to Trade via Direct Reinforcement. John Moody and Matthew Saffell, IEEE Trans. Neural Networks 12(4), pp. 875-889, 2001. Summarized by Jangmin O, BioIntelligence Lab.
Author • J. Moody • Director of the Computational Finance Program and a Professor of CSEE at Oregon Graduate Institute of Science and Technology • Founder & President of Nonlinear Prediction Systems • Program Co-Chair for Computational Finance 2000 • A past General Chair and Program Chair of NIPS • A member of the editorial board of Quantitative Finance
I. Introduction
Optimizing Investment Performance • Characteristic • Path-dependent • Method: Direct Reinforcement learning (DR) • Recurrent Reinforcement Learning [1, 2] • No need for a forecasting model • Single security or asset allocation • Recurrent Reinforcement Learning (RRL) • Adaptive policy search • Learns an investment strategy on-line • No need to learn a value function • Immediate rewards are available in financial markets
Difference between RRL & Q or TD • The financial decision-making problem is well suited to RRL • Immediate feedback is available • Performance criteria: risk-adjusted investment returns • Sharpe ratio • Downside risk minimization • Differential form
Experimental Data • U.S. Dollar/British Pound foreign exchange market • S&P 500 Stock Index and Treasury Bills • RRL vs. Q-learning • Bellman's curse of dimensionality
II. Trading Systems and Performance Criteria
Structure of Trading Systems (1) • An agent: assumptions • Trades a fixed position size in a single market • Trader at time t: Ft ∈ {+1, 0, -1} • Long: buy, Neutral: out of the market, Short: short sale • Profit Rt • Realized at the end of (t-1, t]: the profit/loss from holding position Ft-1, plus the transaction cost of moving from position Ft-1 to Ft • The structure must be recurrent! • To make decisions that account for transaction costs, market impact, taxes, etc.
Structure of Trading Systems (2) • A single-asset trading system • θt: system parameters at time t • It: information at time t • zt: price series, yt: other external variable series • Simple example
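The "simple example" above (its equation image was lost in extraction) can be sketched as a recurrent tanh trader; the weight layout and the three-return window below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def trade_decision(theta, prev_position, returns_window):
    """Recurrent trader: the previous position F_{t-1} is fed back as an
    input so the system can weigh transaction costs when switching."""
    w, u, b = theta                      # w: return weights, u: recurrence weight, b: bias
    activation = w @ returns_window + u * prev_position + b
    return np.tanh(activation)          # F_t in (-1, 1); sign() would give discrete {-1, +1}

# usage: decide a position from the last 3 price returns (illustrative numbers)
theta = (np.array([0.5, 0.3, 0.1]), 0.8, 0.0)
F_t = trade_decision(theta, prev_position=1.0,
                     returns_window=np.array([0.01, -0.02, 0.005]))
```

Feeding F_{t-1} back is exactly what makes the system recurrent, as the previous slide argues.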
Profit and Wealth for Trading Systems (1) • Performance function U() for a risk-insensitive trader = Profit • Additive profits • Trading a fixed number of shares (or contracts) of the security • rt = zt - zt-1: return of the risky asset • rtf: return of the risk-free asset (e.g., T-bills) • δ: transaction cost rate • Trader's wealth: WT = W0 + PT
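A rough sketch of the additive accounting, with the risk-free leg omitted for brevity (μ is the fixed number of shares, δ a per-trade cost rate; a simplified reconstruction, not the paper's exact formula):

```python
import numpy as np

def additive_profit(positions, price_changes, mu=1.0, delta=0.001):
    """Cumulative additive profit P_T for a fixed-size (mu shares) trader.
    positions[t] = F_t in {-1, 0, +1}; price_changes[t] = r_t = z_t - z_{t-1}.
    Cost delta is charged on each change of position; risk-free leg omitted."""
    F = np.asarray(positions, dtype=float)
    r = np.asarray(price_changes, dtype=float)
    # R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|), summed over t = 1..T
    R = mu * (F[:-1] * r[1:] - delta * np.abs(np.diff(F)))
    return R.sum()
```

For example, holding long through a unit price rise and paying costs on entry and exit yields 1.0 minus two cost charges.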
Profit and Wealth for Trading Systems (2) • Multiplicative profits • A fixed fraction ν > 0 of accumulated wealth is invested • rt = (zt/zt-1 - 1) • In case of no short sales, when ν = 1
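A corresponding sketch for the multiplicative case with ν = 1 (full reinvestment, no margin effects; a simplified reconstruction, not the paper's exact formula):

```python
import numpy as np

def multiplicative_wealth(positions, returns, w0=1.0, delta=0.001):
    """Wealth W_T = W_0 * prod_t (1 + R_t) when the whole stake (nu = 1)
    is reinvested each period. returns[t] = r_t = z_t / z_{t-1} - 1."""
    F = np.asarray(positions, dtype=float)
    r = np.asarray(returns, dtype=float)
    # (1 + R_t) = (1 + F_{t-1} r_t) * (1 - delta * |F_t - F_{t-1}|)
    growth = (1.0 + F[:-1] * r[1:]) * (1.0 - delta * np.abs(np.diff(F)))
    return w0 * np.prod(growth)
```

Note the structural difference from the additive case: costs scale the wealth multiplicatively instead of being subtracted.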
Performance Criteria • UT in the general form U(RT, …, Rt, …, R2, R1; W0) • Simple form U(WT): standard economic utility • Path-dependent performance functions: Sharpe ratio, etc. • Moody's focus • The marginal increase of Ut caused by Rt at each time step • Differential performance criteria
Differential Sharpe Ratio (1) • Sharpe ratio: risk-adjusted return • Differential Sharpe ratio • For on-line learning, the influence of Rt at time t must be computed • Use exponential moving averages • First-order Taylor expansion in the adaptation rate η • If η = 0, then St = St-1
Differential Sharpe Ratio (2) • Exponential moving averages with adaptation rate η • Sharpe ratio • From the Taylor expansion: Rt > At-1 means increased reward; Rt² > Bt-1 means increased risk
Differential Sharpe Ratio (3) • Derivative with respect to η • Dt is maximal at Rt = Bt-1/At-1 • Meaning of the differential Sharpe ratio • Makes on-line learning possible: easily computed from At-1 and Bt-1 • Allows recursive updating • Weights recent returns more strongly • Interpretability: reveals the contribution of each Rt
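The three slides above can be condensed into one on-line update step. This is a minimal sketch assuming the standard EMA recursions A_t = A_{t-1} + η·ΔA_t and B_t = B_{t-1} + η·ΔB_t; the initial A_0 and B_0 must be seeded, e.g. from a warm-up sample:

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01):
    """One on-line step of the differential Sharpe ratio D_t.
    A, B are exponential moving estimates of the first and second
    moments of the returns; eta is the adaptation rate."""
    dA = R_t - A_prev                      # Delta A_t
    dB = R_t ** 2 - B_prev                 # Delta B_t
    denom = (B_prev - A_prev ** 2) ** 1.5  # variance-like term to the 3/2 power
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / denom
    A_t = A_prev + eta * dA                # EMA updates
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
```

Only A_{t-1} and B_{t-1} need to be stored, which is what makes the recursion cheap enough for on-line learning.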
III. Learning to Trade
Reinforcement Framework • RL • Maximizing the expected reward • Trial-and-error exploration of the environment • Comparison with supervised learning [1, 2] • Problematic with transaction costs • Structural credit assignment vs. temporal credit assignment • Types of RL • DR: policy search • Q-learning: value function • Actor-critic methods
Recurrent Reinforcement Learning (1) • Goal • For a trading system Ft(θ), find the parameters θ that maximize UT • Example • Trading system • Trading return • The derivative formula after T periods
Recurrent Reinforcement Learning (2) • Learning techniques • Back-propagation through time (BPTT) • Temporal dependencies • Stochastic (on-line) version • Focus only on the terms involving Rt: the differential performance criterion Dt
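A minimal sketch of one stochastic RRL step for a single tanh trader. The parameter layout and the externally supplied reward-side derivatives (from the chosen criterion D_t) are illustrative assumptions:

```python
import numpy as np

def rrl_step(theta, dF_prev, x_t, F_prev, dD_dR, dR_dF, dR_dFprev, rho=0.01):
    """One on-line RRL update with a one-step recurrent trace.
    theta = (w, u, b) parametrizes F_t = tanh(w @ x_t + u * F_prev + b).
    dD_dR, dR_dF, dR_dFprev are derivatives from the differential
    criterion and the return function, assumed given here."""
    w, u, b = theta
    pre = w @ x_t + u * F_prev + b
    F_t = np.tanh(pre)
    sech2 = 1.0 - F_t ** 2                                     # d tanh / d pre
    grad_theta = sech2 * np.concatenate([x_t, [F_prev, 1.0]])  # partial F_t / partial theta
    dF_t = grad_theta + sech2 * u * dF_prev                    # recurrent total derivative
    # gradient ascent on D_t: dD/dtheta = dD/dR * (dR/dF * dF_t + dR/dF_{t-1} * dF_prev)
    delta = rho * dD_dR * (dR_dF * dF_t + dR_dFprev * dF_prev)
    flat = np.concatenate([w, [u, b]]) + delta
    return (flat[:-2], flat[-2], flat[-1]), F_t, dF_t
```

The carried trace dF_prev is what distinguishes this from plain gradient ascent: it propagates the influence of θ through the fed-back position.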
Recurrent Reinforcement Learning (3) • Reminder • Moody focuses on optimizing Dt, an immediate measure of each action's effect • [1, 2] • Portfolio optimization, etc.
Value Function (1) • Implicitly learning correct actions through value iteration • Value function • Discounted future rewards received from state x when following the policy π • Annotations: the probability of taking action a in state x; the probability of the transition x → y under action a; the immediate reward for the transition x → y under action a; γ, the discount factor trading off future against immediate rewards
Value Function (2) • Optimal value function & Bellman's optimality equation • Value iteration update: converges to the optimal solution • Optimal policy
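Value iteration can be made concrete on a toy MDP; the two-state example in the test is mine, not from the paper:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a][x][y]: transition probabilities; R[a][x][y]: immediate rewards.
    Iterates V <- max_a sum_y P(x,y|a) * (R(x,y|a) + gamma * V(y)) to a
    fixed point, returning the optimal values and the greedy policy."""
    n = P.shape[1]
    V = np.zeros(n)
    while True:
        # Q[a, x] = expected immediate reward + discounted expected next value
        Q = np.einsum('axy,axy->ax', P, R) + gamma * np.einsum('axy,y->ax', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```

Because the update is a γ-contraction, the loop converges geometrically regardless of the initial V.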
Q-Learning • Q-function: computes the future reward of the current state and current action • Value iteration update: converges to the optimal Q-function • Calculating the best action • No need to know pxy(a) • Error function of the function approximator (e.g., a neural network)
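The tabular form of the update makes the model-free property concrete; with a neural-network approximator the same target simply becomes the regression label for the error function mentioned above. A tabular sketch:

```python
def q_update(Q, x, a, r, y, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_a' Q(y,a') - Q(x,a)).
    Only the sampled transition (x, a, r, y) is needed; the transition
    probabilities p_xy(a) never appear."""
    target = r + gamma * max(Q[y])
    Q[x][a] += alpha * (target - Q[x][a])
    return Q

# usage: 2 states x 2 actions, all-zero table
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, x=0, a=1, r=1.0, y=1)      # Q[0][1] becomes 0.1
```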
IV. Empirical Results • Artificial price series • U.S. Dollar/British Pound exchange rate • Monthly S&P 500 stock index
A trading system based on DR
Artificial price series • Data: autoregressive trend processes • 10,000 samples • Questions examined • Is RRL suitable as a learning tool for trading strategies? • How does the number of trades change as transaction costs increase?
(Figure: artificial price series experiment; 10,000 samples; {long, short} positions only; performance degrades during the first ~2,000 periods)
(Figure: magnified view from period 9,000 onward; = 0.01)
(Figure: results over 100 runs; 100 epochs of training plus on-line adaptation; transaction costs of 0.2%, 0.5%, and 1%; panels show number of trades, Sharpe ratio, and cumulative profit)
U.S. Dollar/British Pound Foreign Exchange Trading • {long, neutral, short} trading system • 30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data • 24-hour trading, five days a week: January-August 1996 • Strategy • Train on 2,000 data points • Trade over the next 480 data points (two weeks) • Slide the window forward and retrain • Results • Annualized 15% return with an annualized Sharpe ratio of 2.3 • One trade every five hours on average • Not considered • Trading concentrated at peaks • Market illiquidity
S&P 500/T-Bill Asset Allocation (1) • Overview • Long position: invested in the S&P 500, earning no T-Bill interest • Short position: earns twice the T-Bill rate • (Figure annotations: dividends reinvested; T-Bill interest; S&P 500 dividends)
S&P 500/T-Bill Asset Allocation (2) • Simulation • Data (1950-1994): initial training (through 1969) + test (1970 onward) • Training window: 10 years of training + 10 years of validation • Input features: 84 series (financial + macroeconomic) • RRL-trader • A single tanh unit, weight decay • Q-trader • Bootstrap samples • Two-layer feedforward NN (30 tanh units) • Bias/variance tradeoff: selected among models with 10, 20, 30, and 40 units
(Figure: voting methods; 30 RRL runs, 10 Q runs; transaction cost 0.5%; profits reinvested • Multiplicative profit ratios: Buy and Hold 1,348%, Q-Trader 3,359%, RRL-Trader 5,860%)
(Figure annotations: market correction, monetary tightening, oil shock, Gulf War, market crash) • Underlying premise: over the 25 years from 1970 to 1994, the U.S. equity/Treasury markets were predictable • Statistically significant
Sensitivity Analysis • (Figure: sensitivity to inflation expectations)
V. Learn the Policy or Learn the Value?
Immediate vs. Future Rewards • Reinforcement signal • Immediate (RRL) or delayed (Q-learning, dynamic programming, or TD) • RRL • The policy is represented directly • Learning a value function is bypassed • Q-learning • The policy is represented indirectly
Policies vs. Values • Some limitations of the value function approach • The original formulation of Q-learning: discrete action and state spaces • Curse of dimensionality • Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy • Large-scale noise and non-stationarity may lead to severe problems • RRL's advantages • The policy is represented directly: a simpler functional form suffices • Can produce real-valued actions • More robust in noisy environments / quick adaptation to non-stationarity
An Example • Simple trading system • {buy, sell} a single asset • Assumption: rt+1 is known in advance • No need for future rewards: γ = 0 • The policy function is trivial: at = rt+1 • One tanh unit is sufficient • Value function: must represent an XOR-like structure • Two tanh units needed
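The XOR point can be made concrete with a sign table (a toy illustration, not the paper's construction):

```python
# The optimal action depends only on the sign of r_{t+1} (linearly
# separable), while the value a * r is positive exactly when the two
# signs agree -- the XOR pattern a single tanh unit cannot represent.
cases = [(a, r) for a in (-1, 1) for r in (-1, 1)]
policy = {r: (1 if r > 0 else -1) for _, r in cases}   # a* depends on r alone
value = {(a, r): a * r for a, r in cases}              # Q(a, r) up to scale
```

The policy table has one decision boundary; the value table needs two, which is why the value learner requires the larger network.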
Conclusion • How to train trading systems via DR • The RRL algorithm • Differential Sharpe ratio & differential downside deviation ratio • RRL is more efficient than Q-learning in the financial domain.