Assigning an entrepreneurship score for companies
12-9-2017, David Ling
Direction of project
• Choices:
  • remains unchanged: financial news sentiment for stock return prediction
  • back to the original: entrepreneurship of a company via annual reports
  • knowledge graph (still reading)
  • David Webb (no idea yet)
• Regarding Choice 2:
  • compare annual reports between good and bad companies
  • assign an entrepreneurship score for companies
Assigning an entrepreneurship score for companies
Definition from Wikipedia:
• Target: US companies, based on their annual reports
• Difficulties:
  • definitions and features of entrepreneurship are abstract and subjective
  • no existing word list for "entrepreneurship"
  • can the machine learn the word relations itself?
• Solutions:
  • A: manually develop a word list related to entrepreneurship
  • B: use the idea from the ANZ research group (Australian cash rate study) together with a co-occurrence probability language model
  • C: use word2vec or GloVe word vectors
Solution B
• Recall the research done by the ANZ research group:
  • decide whether a central-bank statement is more "hawkish" or "dovish"
  • use Google's search engine to see whether the words in it are associated more with the word "hawkish" or with "dovish"
• Similarly, we can use the number of Google search results to approximate how strongly each word is related to "entrepreneurship"
Solution B
• To learn the definitions and features of entrepreneurship, one may search Google
Solution B
(Figure: example search-result web pages 1-3)
• Text on the resulting pages usually consists of examples, definitions, or descriptions of "entrepreneurship"
• Words that frequently appear in the search results are thus highly related to "entrepreneurship"
• For example, "business", "develop", "venture", "Steve Jobs", "Apple" are highly related
Solution B
• Therefore, we may define a word's relation to entrepreneurship by its probability of occurrence on a web page that mentions entrepreneurship:
  • relation between word k and entrepreneurship: P(word k | entrepreneurship)
• Examples:
  • relationship between "entrepreneurship" and "business"
  • relationship between "entrepreneurship" and "door"
  • similarly for "flower" and "investment"
• "Door" and "flower" get lower scores as they are not closely related
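A minimal sketch of this conditional occurrence score. Since real Google hit counts are not reproducible here, a tiny hypothetical list of pages stands in for the search results; all page texts and counts are illustrative only.

```python
# Sketch of the Solution B relation score, using a tiny hypothetical
# "corpus" of pages in place of real Google search results.
pages = [
    "entrepreneurship means starting a business venture",
    "a business plan helps develop a new venture",
    "entrepreneurship courses teach how to develop a business",
    "the door of the shop was painted red",
]

def relation_score(word, anchor="entrepreneurship"):
    """P(word | anchor): fraction of pages containing `anchor`
    that also contain `word`."""
    anchor_pages = [p for p in pages if anchor in p.split()]
    if not anchor_pages:
        return 0.0
    return sum(word in p.split() for p in anchor_pages) / len(anchor_pages)

print(relation_score("business"))  # 1.0 -- co-occurs with the anchor
print(relation_score("door"))      # 0.0 -- unrelated word
```

With real search results the counts would come from the number of result pages, but the ratio plays the same role.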
Solution B
• Assumption: a report with more related words means the company is acting more like a successful entrepreneur
• By scoring each word in an annual report, weighting inversely by word frequency, and taking the average, we obtain a score for the report
• We may then:
  • average over time (e.g. 5 years) to give each company a score and produce a ranked list
  • find the correlation between a report's score and the stock's return over the next year
  • compare with other solutions (e.g. Solution C)
(Figure: schematic table for the rank)
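The report-scoring step above can be sketched as follows. The relation scores and word frequencies here are hypothetical placeholders for values that would come from the search-based step and a corpus count.

```python
# Sketch of the report score: average of per-word relation scores,
# each weighted inversely by how common the word is overall.
# Both dictionaries below hold hypothetical illustrative values.
relation = {"venture": 0.8, "develop": 0.6, "the": 0.1, "office": 0.2}
frequency = {"venture": 2, "develop": 5, "the": 100, "office": 10}

def report_score(words):
    weighted = [relation.get(w, 0.0) / frequency.get(w, 1) for w in words]
    return sum(weighted) / len(weighted) if words else 0.0

report_a = ["venture", "develop", "the"]  # entrepreneurial wording
report_b = ["the", "office", "the"]       # neutral wording
print(report_score(report_a) > report_score(report_b))  # True
```

The inverse-frequency weight keeps very common words like "the" from dominating the average.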
Solution C
• Using co-occurrence probabilities of words is essentially what GloVe and word2vec do
• GloVe uses matrix factorization to reproduce the log co-occurrence counts (word2vec uses a neural network); its objective fits
  w_i^T w_j + b_i + b_j ≈ log X_ij,
  where X_ij is the number of co-occurrences of words i and j, and w_i, b_i are the word vector and bias of word i
• The cosine similarity of word vectors is therefore related to the log co-occurrence probability in some form
• Thus, we can define the word relation score as the cosine similarity between word k and "entrepreneurship"
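A minimal sketch of the cosine-similarity score. The toy 3-dimensional vectors below stand in for real pre-trained GloVe vectors, which typically have 50-300 dimensions; the numbers are illustrative only.

```python
import math

# Cosine similarity between word vectors (toy 3-d vectors standing in
# for real pre-trained GloVe vectors).
vectors = {
    "entrepreneurship": [0.9, 0.8, 0.1],
    "innovation":       [0.8, 0.9, 0.2],
    "door":             [0.1, 0.0, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# A related word scores much higher than an unrelated one.
print(cosine(vectors["entrepreneurship"], vectors["innovation"]))
print(cosine(vectors["entrepreneurship"], vectors["door"]))
```

With the real GloVe vectors, ranking all vocabulary words by this score against "entrepreneurship" produces lists like the one on the next slide.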
Solution C
Top 20 closest words to 'entrepreneurship' in GloVe (cosine similarity):
entrepreneurship (1.000), innovation (0.796), entrepreneurial (0.776), promotes (0.774), fosters (0.747), fostering (0.741), interdisciplinary (0.738), educational (0.737), advancement (0.736), outreach (0.721), sustainability (0.716), experiential (0.715), encourages (0.711), philanthropy (0.703), grassroots (0.699), strives (0.693), humanities (0.692), mentorship (0.688), endeavors (0.688), promoting (0.684)
Discussions
• Comparing Solution B and Solution C:
  • Solution B:
    • uses the whole webpage as the context window and Google-crawled internet pages as the corpus
    • was demonstrated by the ANZ group (my interpretation)
  • Solution C:
    • is new (I haven't seen it before)
    • pre-trained word vectors are immediately available (Wikipedia, Twitter, and Common Crawl)
• Although Solutions B and C may not be regarded as deep learning methods, they use one of the latest natural language models from deep learning
Discussions
• Problems:
  • cannot resolve negation scope, e.g. "Flower is not related to entrepreneurship" (assuming such cases are a small portion)
  • as the score is new, it is difficult to find a baseline from other methods
  • Google may block search requests due to excessive traffic (for Solution B)
Discussions
(Figure: table of contents of 3M's 10-K filing)
• Downloaded 10-K filings are in HTML format
• Filings can be downloaded via https://www.sec.gov/edgar/searchedgar/companysearch.html
• Questions:
  • any suggested companies to start with?
  • full 10-K report or particular sections of the filing?
  • should we remove company names, like "Bill Gates", "Microsoft" (to remove the dependency on the names)?
Thank you
• End of the entrepreneurship score part
• References:
  • https://nlp.stanford.edu/projects/glove/
  • https://www.bloomberg.com/news/articles/2017-08-27/this-algorithm-tracks-what-australia-s-central-bank-is-really-thinking
Google knowledge graph
• Google search engine
  • inverted list
  • PageRank
• Knowledge base
  • Google knowledge base / graph
• Information extraction
  • bootstrapping
  • distant supervision
• Link prediction
Google search
• References:
  • https://www.google.com/search/howsearchworks/
  • The Anatomy of a Large-Scale Hypertextual Web Search Engine, Stanford, 1998
• Crawling
• Indexing:
  • inverted list; record font size, date, keywords, hyperlinks
• Search algorithm:
  • analyzing your search (misspellings, synonyms)
  • matching your search (e.g. keywords, language)
  • ranking pages
Google search
• Inverted list: build a list mapping each keyword of the web pages to (web ID, word position) pairs
• Example:
  • web 1 - "I love you"
  • web 2 - "god is love"
  • web 3 - "love is blind"
  • web 4 - "blind justice"
• If you search "is" + "love", you will get webs 2 and 3
• The difference of word positions tells you how far apart the words are in the page
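The four-page example above can be sketched as a small inverted index; the data structure and query below are a simplified illustration, not Google's actual implementation.

```python
# Sketch of an inverted index over the four example pages:
# each keyword maps to a list of (web_id, word_position) pairs.
pages = {
    1: "i love you",
    2: "god is love",
    3: "love is blind",
    4: "blind justice",
}

index = {}
for web_id, text in pages.items():
    for pos, word in enumerate(text.split()):
        index.setdefault(word, []).append((web_id, pos))

def search(*words):
    """Return the ids of pages containing all the query words."""
    hits = [set(w for w, _ in index.get(word, [])) for word in words]
    return sorted(set.intersection(*hits)) if hits else []

print(search("is", "love"))  # -> [2, 3], as in the slide example
```

The stored positions could then be compared to estimate how close the query words are within each matching page.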
Google search engine
• PageRank: a score used to sort the search results
• If website A is linked to by many websites => higher PR score
• If a website contains many links to other webs => it contributes less to each target's PR score
• Suppose website A is linked to by B, C, D, ...; the PageRank (PR) score of A is
  PR(A) = (1 - d) + d * (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...),
  where L(B) is the number of links from B to other webs and d is a damping factor (typically 0.85)
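The recursion above can be solved by simple power iteration; a minimal sketch on a hypothetical three-page link graph:

```python
# Minimal power-iteration PageRank sketch for a tiny link graph.
links = {            # page -> pages it links to
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["A"],
}

def pagerank(links, d=0.85, iters=50):
    pr = {p: 1.0 for p in links}          # initial scores
    for _ in range(iters):
        new = {}
        for page in links:
            # sum of PR(q)/L(q) over all pages q linking to `page`
            incoming = sum(pr[q] / len(links[q])
                           for q in links if page in links[q])
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr

pr = pagerank(links)
# A is linked to by both B and C, so it ends up with the highest score.
print(max(pr, key=pr.get))  # prints "A"
```

Real implementations normalize over the whole web graph and handle dangling pages, but the fixed-point iteration is the same idea.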
Google knowledge graph
• Is a knowledge base
• Provides semantic-search information to enhance search results
• Results are shown on the "Knowledge Graph card"
• A knowledge base is basically a set of triples: (subject, predicate, object)
  • (Harry Potter, author, J. K. Rowling)
  • (Harry Potter, is-a, book)
  • (J. K. Rowling, is-a, author)
• Sources: Wikidata and Freebase (structured datasets of facts)
  • human-curated and collaborative
• Next generation: Google Knowledge Vault (automatic)
Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion (Google, KDD 2014)
• Extraction:
  • named entity recognition
  • part-of-speech tagging
  • dependency parsing
  • co-reference resolution (e.g. he, she, it)
  • entity linkage (maps mentioned nouns to the corresponding entities in the knowledge base)
• Relation extraction:
  • hand-built patterns
  • supervised methods
  • bootstrapping methods
  • distant supervision (used by KV)
Bootstrap method
• Use a few seeds to generate patterns by searching
• Target relation: burial place
• Seed tuple: [Mark Twain, burial place, Elmira]
• Google or search a corpus for "Mark Twain" and "Elmira":
  • "Mark Twain is buried in Elmira, NY." → X is buried in Y
  • "The grave of Mark Twain is in Elmira" → The grave of X is in Y
  • "Elmira is Mark Twain's final resting place" → Y is X's final resting place
• Use those patterns to search for new tuples
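One extraction step of this loop can be sketched with a regular expression: the pattern "X is buried in Y", learned from the seed, is applied to new text to harvest further tuples. The second sentence is a hypothetical example, not from the slide.

```python
import re

# Sketch of one bootstrapping step: the surface pattern learned from
# the seed tuple (Mark Twain, burial place, Elmira) is applied to new
# sentences to extract further (person, burial place, place) tuples.
pattern = re.compile(r"(?P<X>[A-Z][\w ]+?) is buried in (?P<Y>[A-Z]\w+)")

sentences = [
    "Mark Twain is buried in Elmira, NY.",
    "Emily Dickinson is buried in Amherst.",  # hypothetical new sentence
]

tuples = []
for s in sentences:
    m = pattern.search(s)
    if m:
        tuples.append((m.group("X"), "burial place", m.group("Y")))

print(tuples)
```

In a real system the patterns themselves are induced automatically from every sentence that mentions both seed entities, and the newly extracted tuples seed the next round.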
Distant supervision
• Similar to the bootstrap method, but label the search results according to an existing KB to form a large training set
• For example, for the predicate "married_to":
  • suppose the known pairs in the KB are (BarackObama, MichelleObama) and (BillClinton, HillaryClinton)
  • search for sentences in which a pair is mentioned and extract feature patterns; use the features to find more pairs (bootstrap phase)
  • if a new pair is in the known knowledge base, label it as correct; otherwise as incorrect (local closed-world assumption)
    • correct pair: (BarackObama, MichelleObama) -- feature A: 10 times, feature B: 2 times
    • incorrect pair: (Trump, Zi) -- feature A: 2 times, feature B: 5 times
  • train a logistic regression on these features
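The local closed-world labeling step can be sketched as a simple membership test against the KB; the candidate pairs are the slide's own examples.

```python
# Sketch of the local closed-world labeling step in distant
# supervision: extracted candidate pairs are labeled positive only
# if they appear in the existing KB for the "married_to" predicate.
kb_married = {("BarackObama", "MichelleObama"),
              ("BillClinton", "HillaryClinton")}

candidates = [
    ("BarackObama", "MichelleObama"),   # in KB  -> labeled correct
    ("Trump", "Zi"),                    # not in KB -> labeled incorrect
]

labeled = [(pair, pair in kb_married) for pair in candidates]
for pair, label in labeled:
    print(pair, "correct" if label else "incorrect")
```

The resulting labels, together with the feature counts (feature A, feature B, ...), form the training set for the logistic regression classifier.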
Distant supervision
• Example features:
  • word sequence between the named entities
  • dependency parse path
(Figure: dependency parse between named entity 1 (person) and named entity 2 (location))
• S: surface subject
• Pred: predicate of a clause
• Mod: relationship between a word and its adjunct modifier
• Pcomp-n: nominal complement of a preposition
Link predictions
• Assign a probability to any possible triple, even when there is no corresponding evidence for the fact on the Web
• Example:
  • extracted triples: (Peter, parent of, Sam), (Susan, parent of, Sam)
  • predict / assign a high probability to the unseen triple (Peter, married to, Susan)
• Tensor factorization (collective learning) / neural networks
Link predictions
• Regard all possible triples as an E x P x E tensor Y:
  • E = number of possible entities
  • P = number of possible predicates
• Collective learning:
  • model the tensor Y as in matrix factorization, turning each entity into a vector
  • the score of a triple, e.g. (Peter, parent of, Sam), is y_ijk ≈ e_i^T W_k e_j,
    where e_i is the embedding of the i-th entity as a vector of dimension d (the embedding size) and W_k is a d x d matrix for the k-th relation
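A minimal sketch of this bilinear score with 2-dimensional toy embeddings; all vectors and the relation matrix below are illustrative values, not learned ones.

```python
# Sketch of the bilinear score y_ijk ≈ e_i^T W_k e_j with toy
# 2-d embeddings; in practice e and W are learned by factorizing
# the observed slices of the tensor Y.
e = {
    "Peter": [1.0, 0.0],
    "Susan": [0.9, 0.1],   # similar entities get similar vectors
    "Sam":   [0.0, 1.0],
}
W_parent_of = [[0.0, 1.0],   # 2x2 relation matrix for "parent of"
               [0.0, 0.0]]

def score(subj, W, obj):
    # e_subj^T @ W @ e_obj
    u, v = e[subj], e[obj]
    uW = [sum(u[i] * W[i][j] for i in range(2)) for j in range(2)]
    return sum(uW[j] * v[j] for j in range(2))

print(score("Peter", W_parent_of, "Sam"))   # 1.0: observed triple
print(score("Susan", W_parent_of, "Sam"))   # 0.9: similar entity, also high
print(score("Sam", W_parent_of, "Peter"))   # 0.0: reversed triple, low
```

Because Susan's vector is close to Peter's, the model also scores (Susan, parent of, Sam) highly, which is exactly the collective-learning effect the slide describes.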
Link predictions
• Since the entity vectors e remain unchanged across the different relation matrices W:
  • similar entities get similar embedding vectors
  • unknown relations can be predicted after the matrix multiplication
Thank you
• References used:
  • A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE, vol. 104, no. 1, Jan. 2016.
  • Reasoning with Neural Tensor Networks for Knowledge Base Completion. NIPS 2013.
  • Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. KDD 2014.
  • A Three-Way Model for Collective Learning on Multi-Relational Data. ICML 2011.