Financial news sources, product delivery, doc2vec 15-8-2017 David Ling
Contents • Comparing different news sources • Similar existing products for Chinese news • Expected milestone • Sentiment using word2vec and doc2vec • Relation to stock price
News • Available Chinese News sources: 3 types • Official news and announcements – HKEx • Traditional newspaper – Wisenews • Real time online news – 華富財經, 經濟通, 阿思達克財經網, 財華網
HKEx Disadvantages • Not frequent • Only official documents • Many announcements are numerical data and forms whose format varies between companies, which makes data cleaning difficult • Mostly numbers and tables with many unrelated words, leaving few relevant sentences of textual data to analyze • [Figures: Sample news and announcement 1; Sample news and announcement 2]
Wisenews • Collection of news from traditional newspapers Advantages: • Large database (going back to the year 2000) • More frequent than HKEx • News content is richer than HKEx documents, with more descriptive terms like “急挫” (plunge), “落後” (lag behind), “增持” (increase holdings) Disadvantages: • Not as frequent as online news • Needs segmentation and tagging by related company • Not free • News is not categorized by financial topic
Online news 1: 華富財經 Advantages: • Updated frequently • Already tagged with the related company • Easy to crawl (I have written a crawler) Disadvantages: • Less reliable than newspapers • Less, but still sufficient, historical data (back to ~2010)
Online news 2: 財華網 Advantages: • Updated very frequently • Used by Li Xiaodong (2014) • Doctoral thesis, City University • “News Impact Analysis in Algorithmic Trading” • English news articles from 2003-2008 were used • Dictionary-based Disadvantages: • Harder to crawl
Online news 3: 阿思達克財經網, 經濟通 Advantages: • Similar to the previous sources Disadvantages: • Less historical data • Search results are limited to 15 pages (阿思達克) • Search results are limited to 200 items (經濟通)
News sources comparison • Given the data formats, we may start with online news first • Online news analysis is in high demand for algorithmic and low-latency trading, as there is too much online financial news for traders to read • The final product should also be able to include newspapers
Similar works on sentiment • Last time: • DICTION (dictionary-based) • Thomson Reuters news sentiment (neural network, supervised learning) • This time: • 台股新聞情緒 (Chinese, dictionary-based) • 優礦 (Chinese) • Wordstat (dictionary-based)
台股新聞情緒 • Web page format • Provides 2 kinds of index for a company: • SR (optimistic) • ITDC (risk) • Also lists the companies with the top SR index each week • News sentiment by counting keywords: “指標試算” (index trial calculation) • They have developed their own Chinese dictionary: 聯合知識庫 + 銘傳大學 (Ming Chuan University)
News sentiment by counting keywords: • One sample result is incorrect: the tone should be negative • Another sample result is correct: the tone is negative • The dictionary does not seem very sophisticated; for example, “蝕” (loss) is not detected
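The keyword-counting idea can be illustrated with a minimal sketch. The word lists below are hypothetical and far smaller than any real dictionary, and the scoring rule (positive count minus negative count) is an assumption, not the method used by 台股新聞情緒.

```python
# Hypothetical word lists for illustration only; a real dictionary is much larger.
POSITIVE = {"增持", "升", "盈利"}
NEGATIVE = {"急挫", "落後", "跌", "蝕"}


def keyword_tone(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Return (score, label), where score = positive hits minus negative hits."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    score = pos - neg
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


# A made-up, already segmented headline.
tokens = ["公司", "股價", "急挫", "大股東", "增持", "仍", "蝕"]
print(keyword_tone(tokens))

# If the dictionary lacks "蝕", the same headline is scored as neutral.
print(keyword_tone(tokens, negative=NEGATIVE - {"蝕"}))
```

The second call shows why a missing keyword matters: one undetected term is enough to change the label.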
優礦 • A mainland big data company • Provides news and a news sentiment API • 40k news threads per day • Sentiment score: [-1, 1] • Provides strategy trading simulation • A user demonstrated a trading strategy on the company forum
By user cheng.li • Funny simulation result: both strategies are making profits
Wordstat • Non-free article sentiment analysis software • Dictionary-based • Trial version lasts for 30 days • Provides only descriptive statistics, using the Loughran and McDonald Sentiment Word Lists
Milestone delivery: • Similar to 聯合知識庫 and 優礦 • A website, an API, and a financial news dictionary for Hong Kong can be built • Collect online news from multiple online sources • Provide news headlines, links, and sentiment scores • Calculate a score index for each company for the day • List companies with high/low scores for recommendation • May also attract mainland traders
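A rough sketch of the daily score index and the high/low listing follows. The record layout, the stock codes, the scores, and the use of a plain average are all assumptions made for illustration; the slides do not specify how the index is computed.

```python
from collections import defaultdict


def daily_company_index(articles):
    """Average the article sentiment scores per (company, date)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of scores, count]
    for a in articles:
        key = (a["company"], a["date"])
        totals[key][0] += a["score"]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


def rank_companies(index, date, top_n=1):
    """Return (highest, lowest) scoring companies for one date."""
    day = [(company, score) for (company, d), score in index.items() if d == date]
    day.sort(key=lambda cs: cs[1], reverse=True)
    return day[:top_n], day[-top_n:]


# Hypothetical scored articles (scores in [-1, 1]).
articles = [
    {"company": "0001.HK", "date": "2017-08-15", "score": 0.5},
    {"company": "0001.HK", "date": "2017-08-15", "score": 0.25},
    {"company": "0005.HK", "date": "2017-08-15", "score": -0.5},
    {"company": "0700.HK", "date": "2017-08-15", "score": 0.25},
]

index = daily_company_index(articles)
highest, lowest = rank_companies(index, "2017-08-15")
print(highest, lowest)
```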
Sentiment methods • Classify texts into groups (e.g. optimistic or pessimistic) • Last time: • n-gram + 1-hot vector + SVM • Other possible ways: • word2vec • doc2vec • N-gram + 1-hot vector • Document 1: “cat sat mat” • Document 2: “cat hate cat”
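For the two sample documents, the n-gram + 1-hot representation can be sketched as below for the simplest case, n = 1. Summing the 1-hot vectors of a document's words gives a count vector; the vocabulary order (first appearance) is an arbitrary choice made here.

```python
def build_vocab(docs):
    """Collect the distinct words in order of first appearance."""
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def count_vector(doc, vocab):
    """Sum of the 1-hot vectors of the words in the document."""
    words = doc.split()
    return [words.count(w) for w in vocab]


docs = ["cat sat mat", "cat hate cat"]
vocab = build_vocab(docs)
vectors = [count_vector(d, vocab) for d in docs]

print(vocab)    # ['cat', 'sat', 'mat', 'hate']
print(vectors)  # [[1, 1, 1, 0], [2, 0, 0, 1]]
```

These fixed-length vectors are what a classifier such as an SVM would take as input.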
word2vec • A model that turns a word into a vector • Method • Teach the machine to guess the context from the target word (skip-gram) • A mapping between the context and the target word • Example: • Doc1: The cat sat on the mat • Doc2: The dog sat on the mat • Teaching: • Given “cat” (target word), guess “the” and “sat” (context) • Given “dog” (target word), guess “the” and “sat” (context) • Outcome: • Similar words usually have very similar contexts • So their mapping parameters are similar • And thus so are their word vectors (which are the mapping parameters)
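The "teaching" step amounts to generating (target, context) training pairs. A minimal sketch, assuming a context window of one word on each side (the window size is not stated on the slide):

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the mat".split()

cat_context = [c for t, c in skipgram_pairs(doc1) if t == "cat"]
dog_context = [c for t, c in skipgram_pairs(doc2) if t == "dog"]

print(cat_context)  # ['the', 'sat']
print(dog_context)  # ['the', 'sat']
```

Because “cat” and “dog” are trained to predict the same context words, their learned parameters, and hence their vectors, end up close together.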
Word2vec-practical • Jieba word segmentation (結巴分詞) • Tagging numerical terms (regexp) • Before: • 長和今天放榜,早前大摩預測長和中期比只升5%,主要受英英鎊貶值等外匯因素影響,預測長和上半年經營溢利同比升5%至309億元 • After: • 長和 今天 放榜 早前 大摩 預測 長和 中期 比 只升 xpercent 主要 受英 英鎊 貶值 等 外匯 因素 影響 預測 長和 上半年 經營 溢利 同比 升 xpercent 至 xmoney
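The numerical tagging step might look like the sketch below. The two regular expressions are assumptions that cover only the percentage and dollar-amount forms in the sample sentence; the actual patterns used are not shown in the slides. Word segmentation with Jieba is left out so that the example runs with the standard library alone.

```python
import re

# Assumed patterns: "5%" style percentages and amounts such as "309億元".
PERCENT = re.compile(r"\d+(?:\.\d+)?%")
MONEY = re.compile(r"\d+(?:\.\d+)?(?:億|萬)?元")


def tag_numbers(text):
    """Replace numerical terms with placeholder tokens."""
    text = PERCENT.sub(" xpercent ", text)
    text = MONEY.sub(" xmoney ", text)
    # Collapse the extra spaces introduced by the substitutions.
    return " ".join(text.split())


print(tag_numbers("同比升5%至309億元"))
```

Replacing each number with a shared token lets the model learn one vector for "a percentage" instead of a separate, rarely seen vector for every distinct figure.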
Word2vec-practical • Trained on 30,000 crawled Quamnet news articles, with 300 embedding features • Results (word2vec by TensorFlow):
Nearest to 跌: 升, 挫, 倒跌, 微跌, 股亦收升, 現跌, 無升, 微升
Nearest to 同比: 按年, 去年同期, 之後高見, 僅減, 連特別息, 遠洋報, 此負
Nearest to 日: xdate, 日期, 日起, 日向, 昨日, 日止, 日終, 郭樹清
Nearest to 對: 認為, 家會員, 讓, 運費, 他們將, 將對, 令電能, 與
Nearest to 涉及: 共, 成交, xhand, 涉資約, 光啟, bcm_energy_partners, 對換
Nearest to 公布: 公佈, 宣布, 公告, 放榜, 發布, 止, 公在, arthur_h_del_prado
Nearest to 中: 指中, 內, 中解釋, 神華及, 耐, 港鐵學院, 其後再展, 遴選及
Nearest to 在: 於, 或, , 預期, 將在, 將於未來, 資源予, 與
Nearest to 而: 但, 另外, 表示, 或, 九鐵, 至於, 認為, 他稱
Nearest to 投資: 投資及, 金遠, 資產, 融資, 基金, 投資項, 阿拉斯加, 發展
Nearest to 公佈: 公布, 宣布, tank, 表示, 宣, 矽谷, 姜元, 刊發
Nearest to 虧損: 盈利, 溢利, 純利, 增長, 收益, , 收入, 錄純利
Nearest to 止: 止六個, 止全, 止首, 日止, 止九個, 月止, 止三個, 止將
Nearest to 會: 將會, 洪建, 只會, 會會, 可以, 起累, 匿名信, 希望
Nearest to 由: 則由, 為, 因為, 至, 因倫敦, 從, 自, 調莎莎
Nearest to 後: 簡俊傑, xindex, 前, 向國纜, 擴大後之, 港股, 已, 建市場
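A "nearest to" list like the one above is typically produced by ranking words by cosine similarity between their vectors. The sketch below assumes that measure and uses made-up 3-dimensional vectors in place of the trained 300-dimensional ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to the given word."""
    query = embeddings[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy vectors, hypothetical values chosen for illustration only.
embeddings = {
    "跌": [0.9, 0.1, 0.0],
    "升": [0.8, 0.2, 0.1],
    "挫": [0.9, 0.0, 0.1],
    "公布": [0.0, 0.9, 0.3],
}

print(nearest("跌", embeddings))
```

Note that in the results above, antonyms such as 跌/升 and 虧損/盈利 appear as close neighbours: the vectors capture shared context, not sentiment polarity.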
Another method in word2vec is continuous bag of words (CBOW) • The opposite of skip-gram: given the context, guess the target word • The mapping parameters form the word vector • But for classification we need a vector for a whole document, not for a single word • Solutions: • Solution 1: Average all the word vectors, e.g. Doc1 vector = [0.42, 0.38, 0.22] • Bad, as averaging loses a lot of information • Solution 2: Extend word2vec to doc2vec • [Figure labels: target word, context]
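Solution 1 and its weakness can be shown with a small sketch. The word vectors are made-up values; the point is only that averaging discards word order, so two documents with opposite meanings can receive the same vector.

```python
def average_vector(tokens, embeddings):
    """Document vector as the element-wise mean of its word vectors."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for t in tokens:
        for i, x in enumerate(embeddings[t]):
            total[i] += x
    return [round(x / len(tokens), 6) for x in total]


# Toy word vectors, hypothetical values for illustration only.
embeddings = {
    "dog": [0.5, 0.25, 0.0],
    "bites": [0.25, 0.5, 0.25],
    "man": [0.0, 0.25, 0.75],
}

doc_a = "dog bites man".split()
doc_b = "man bites dog".split()

print(average_vector(doc_a, embeddings))
print(average_vector(doc_b, embeddings))  # identical to doc_a
```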
doc2vec • Adds a 1-hot paragraph / document ID vector • It is a constant input vector across the different input contexts within a particular paragraph • The weighting parameters for the paragraph ID and for the words are updated at the same time during training • The weighting parameters for the paragraph ID • represent the information missing from the current context • act as a memory of the topic of the paragraph • form the paragraph vector • Source: Quoc Le, Tomas Mikolov 2014, https://arxiv.org/pdf/1405.4053v2.pdf
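To make the role of the paragraph ID concrete, the sketch below builds the training examples only, not the network. Each example pairs a document ID and a context with the word to predict. The symmetric one-word window is an assumption carried over from the CBOW description; the paper cited above describes combining the paragraph vector with context word vectors by averaging or concatenation.

```python
def pvdm_examples(docs, window=1):
    """Return ((doc_id, context_words), target_word) training examples."""
    examples = []
    for doc_id, tokens in enumerate(docs):
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context = tuple(tokens[j] for j in range(lo, hi) if j != i)
            # The same doc_id accompanies every context taken from this document.
            examples.append(((doc_id, context), target))
    return examples


docs = ["the cat sat".split(), "the dog sat".split()]
examples = pvdm_examples(docs)

for inputs, target in examples:
    print(inputs, "->", target)
```

The contexts `('the', 'sat')` in the two documents are identical, yet the targets differ ("cat" versus "dog"). Only the document ID input can account for the difference, which is why its weights come to act as a memory of the document's content.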
doc2vec • Comparing sentiment accuracy (movie reviews): https://recurrentnull.wordpress.com/tag/sentiment-analysis/ • Doc2vec with logistic regression has the highest accuracy • But it is only slightly higher, by 1%, than bag of words + 1-hot
Sentiment method and evaluation: • Proposed approach: supervised learning for optimistic and pessimistic classes: • Manually classify perhaps ~5k news articles • Very positive, positive, neutral, negative, very negative • Additional scores can be added at a later time • Split into training data and testing data, then evaluate accuracy and F1 score • Use SVM, FFNN, RNN, logistic regression, and even naive Bayes for performance comparison • Features will be 1-hot bag of words, doc2vec, and part of speech • Comparison with Loughran and McDonald's financial dictionary may not be meaningful • Their word list comes from 10-K reports, while our data is news • Their word list is in English
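The two evaluation measures can be computed as in the sketch below. The labels are made-up, and F1 is shown for a single class treated as "positive"; how F1 would be aggregated over five classes is not specified in the slides.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test-set labels and classifier outputs.
y_true = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

print(accuracy(y_true, y_pred))
print(f1_for_class(y_true, y_pred, "pos"))
```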
Relation to stock price • Calculate the correlation • 2 time series, corr(X,Y) • Very small correlation, as stock prices also vary with factors other than news • Event study (The Econometrics of Financial Markets, Ch. 4) • Impact of an event on the stock price • Abnormal return = return - normal return • Return: data in the event window (difference in daily closing prices) • Normal return: estimated from data in the estimation window (e.g. 60 days) • Null hypothesis: AR ~ N(0, var) • Stock buying simulation using simple strategies
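The event-study calculation can be sketched as follows. Two assumptions are made here: the normal return is taken as the mean return over the estimation window (a constant-mean model, one of several possible choices), and the variance under the null hypothesis is the sample variance of the estimation-window returns. The prices are made-up and the window is shortened to 5 days to keep the example readable.

```python
import statistics


def daily_returns(closes):
    """Return as the difference between consecutive daily closing prices."""
    return [b - a for a, b in zip(closes, closes[1:])]


def abnormal_returns(closes, estimation_days, event_days):
    """Return (abnormal return, z-score) for each day of the event window."""
    returns = daily_returns(closes)
    estimation = returns[:estimation_days]
    event = returns[estimation_days:estimation_days + event_days]
    normal = statistics.mean(estimation)   # assumed constant-mean model
    sigma = statistics.stdev(estimation)   # spread of returns under the null
    return [(r - normal, (r - normal) / sigma) for r in event]


# Hypothetical closing prices: 5 estimation-window returns, then 1 event day.
closes = [100, 101, 100, 101, 100, 101, 97]
results = abnormal_returns(closes, estimation_days=5, event_days=1)

for ar, z in results:
    print(f"abnormal return = {ar:.2f}, z = {z:.2f}")
```

Under AR ~ N(0, var), a z-score with absolute value above 1.96 would reject the null hypothesis at the 5% level, suggesting the event moved the price by more than ordinary variation.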
The End and Thank you • References • 技术分析【3】—— 众星拱月,众口铄金? https://uqer.io/community/share/55498c0af9f06c1c3d68806e • Sentiment Analysis of Movie Reviews (3): doc2vec https://recurrentnull.wordpress.com/tag/sentiment-analysis/ • Distributed Representations of Sentences and Documents https://arxiv.org/pdf/1405.4053v2.pdf