
Assigning an entrepreneurship score for companies



Presentation Transcript


  1. Assigning an entrepreneurship score for companies 12-9-2017 David Ling

  2. Direction of project • Choices: • remains unchanged: financial news sentiment for stock return prediction • back to the original: entrepreneurship of a company via annual reports • knowledge graph (still reading) • David Webb (no idea yet) • Regarding Choice 2: • compare annual reports between good and bad companies • assign an entrepreneurship score to companies

  3. Assigning an entrepreneurship score for companies • Definition from Wikipedia: [shown on the slide] • Target: US companies, based on their annual reports • Difficulties: • definitions and features of entrepreneurship are abstract and subjective • no existing word list for “entrepreneurship” • can the machine learn the word relations itself? • Solutions: • A: manually develop a word list related to entrepreneurship • B: use the idea from the ANZ research group (Australian cash rate) together with a co-occurrence-probability language model • C: use word2vec or GloVe word vectors

  4. Solution B • Recalling the research done by the ANZ research group • Decide whether a central bank statement is more “hawkish” or “dovish” • Use Google’s search engine to see whether words are associated more with the word ‘hawkish’ or ‘dovish’ • Similarly, we can use the number of Google search results to approximate how strongly each word is related to “entrepreneurship”

  5. Solution B • To learn the definitions and features of entrepreneurship, one may search Google

  6. Solution B • Text on the resulting pages is usually examples, definitions, or descriptions of “entrepreneurship” • Words which frequently appear in the search results are thus highly related to “entrepreneurship” • For example, “business”, “develop”, “venture”, “Steve Jobs”, “Apple” are highly related • [Figure: screenshots of web pages 1–3 from the search results]

  7. Solution B • Therefore, we may define a word’s relation to entrepreneurship by its probability of occurrence on a web page: score(word_k) = P(word_k | entrepreneurship) ≈ N(“entrepreneurship” AND word_k) / N(“entrepreneurship”), where N(·) is the number of Google search results (a sketch follows below) • Examples: • relation between “entrepreneurship” and “business” • relation between “entrepreneurship” and “door” • likewise for “flower” and “investment” • “Door” and “flower” get lower scores as they are not so related
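A minimal sketch of this score. The helper search_result_count is hypothetical: Google exposes no official hit-count API, so a real implementation would scrape the results page or use a paid search API.

```python
# Sketch of Solution B's word score, P(word_k | "entrepreneurship"),
# approximated by a ratio of search-result counts.
# search_result_count is a hypothetical helper (assumption): replace it
# with a real search API call or scraper before use.

def search_result_count(query: str) -> int:
    raise NotImplementedError("plug in a search API or a scraper here")

def relation_score(word: str, anchor: str = "entrepreneurship") -> float:
    """Approximate P(word | anchor) by N(anchor AND word) / N(anchor)."""
    joint = search_result_count(f'"{anchor}" "{word}"')
    base = search_result_count(f'"{anchor}"')
    return joint / base if base else 0.0
```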

  8. Solution B • ASSUMPTION: a report with more related words means the company is acting more like a successful entrepreneur • By scoring each word in an annual report, weighting inversely by word frequency, and taking an average, we obtain a score for the report (sketched below) • We may: • average over time (e.g. 5 years) to give each company a score and produce a ranked list • find the correlation between a report’s score and the stock’s yearly return (next year) • compare with other solutions (e.g. Solution C) • [Figure: schematic table of the company ranking]
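A sketch of the report score under one reading of “inversely weighted by frequency”: each word’s relation score is weighted by its count in the report divided by its frequency in a background corpus (an IDF-like weight). word_scores and corpus_freq are hypothetical precomputed inputs.

```python
from collections import Counter

def report_score(report_tokens: list[str],
                 word_scores: dict[str, float],
                 corpus_freq: dict[str, float]) -> float:
    """Average the per-word entrepreneurship scores over one report,
    down-weighting words that are common in a background corpus."""
    counts = Counter(report_tokens)
    weighted_sum, weight_sum = 0.0, 0.0
    for word, count in counts.items():
        if word not in word_scores:
            continue  # skip words we have no relation score for
        weight = count / corpus_freq.get(word, 1.0)  # inverse-frequency weight
        weighted_sum += weight * word_scores[word]
        weight_sum += weight
    return weighted_sum / weight_sum if weight_sum else 0.0
```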

  9. Solution C • Using the co-occurrence probability of words is essentially what GloVe and word2vec do • GloVe uses matrix factorization to reproduce the log co-occurrence probability (word2vec uses a neural network): it fits w_i·w̃_j + b_i + b̃_j ≈ log X_ij, where X_ij is the number of co-occurrences of words i and j, and w_k, b_k are the word and bias vectors of word k • The cosine similarity of word vectors is therefore related to the log co-occurrence probability in some form (?) • Thus, we can define the word-relation score as the cosine similarity between word k and “entrepreneurship”
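A minimal sketch of Solution C: load pre-trained GloVe vectors (the plain-text format from nlp.stanford.edu/projects/glove, one “word v1 v2 …” line per word) and score words by cosine similarity to the “entrepreneurship” vector. The file name is one of the standard downloads from that page.

```python
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    """Parse a GloVe text file into a word -> vector dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = load_glove("glove.6B.300d.txt")   # from the GloVe project page
anchor = vectors["entrepreneurship"]
print(cosine(vectors["business"], anchor))  # related word: higher similarity
print(cosine(vectors["door"], anchor))      # unrelated word: lower similarity
```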

  10. Solution C Top 20 closest words to ‘entrepreneurship’ in GloVe • ('entrepreneurship', 0.99999999999999967), ('innovation', 0.79568864817468621), ('entrepreneurial', 0.7764247592066883), ('promotes', 0.77394689203429401), ('fosters', 0.74679234987383847), ('fostering', 0.74079458934702958), ('interdisciplinary', 0.73785926780240063), ('educational', 0.7366730052474193), ('advancement', 0.73600707209495098), ('outreach', 0.7209472352913362), ('sustainability', 0.71649423278308733), ('experiential', 0.71450324156613265), ('encourages', 0.71082716517947486), ('philanthropy', 0.70262490307912706), ('grassroots', 0.69859303363697234), ('strives', 0.69324547345355292), ('humanities', 0.69242081208373807), ('mentorship', 0.68831284264530612), ('endeavors', 0.68830383131309092), ('promoting', 0.68374839092623585)

  11. Discussions • Comparing Sol B and Sol C: • Sol B • uses the whole webpage as the context window, and Google-crawled internet webpages as the corpus • was demonstrated by the ANZ group (my interpretation) • Sol C • is new (I haven’t seen it before) • pre-trained word vectors are immediately available (Wikipedia, Twitter, and Common Crawl) • Although Sol B and C may not be regarded as deep learning methods, they use one of the latest natural language models from deep learning

  12. Discussions • Problems: • Cannot resolve negation scope, e.g. “Flower is not related to entrepreneurship” (assuming this portion is small) • As the score is new, it is difficult to find a baseline from other methods • Google may block search requests due to excessive traffic (for Sol B)

  13. Discussions • Downloaded 10-K filings are in HTML format • Filings can be downloaded via https://www.sec.gov/edgar/searchedgar/companysearch.html (a text-extraction sketch follows below) • Questions: • Any suggested companies to start with? • Full 10-K report or particular sections of the filing? • Should we remove company and personal names, like “Bill Gates” and “Microsoft” (in order to remove the dependency on the names)? • [Figure: table of contents of 3M’s 10-K filing]
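A minimal sketch of turning a downloaded 10-K HTML filing into plain text with BeautifulSoup. The filing URL below is elided and hypothetical; real ones can be found through the EDGAR company search page above. The SEC asks crawlers to send a descriptive User-Agent.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "research-script contact@example.com"}

def filing_text(url: str) -> str:
    """Fetch a filing's HTML page and strip the markup to plain text."""
    html = requests.get(url, headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)

# Hypothetical example (fill in a real filing URL found via EDGAR search):
# text = filing_text("https://www.sec.gov/Archives/edgar/data/66740/...")
```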

  14. Thank you • End of the entrepreneurship-score part • References: • https://nlp.stanford.edu/projects/glove/ • https://www.bloomberg.com/news/articles/2017-08-27/this-algorithm-tracks-what-australia-s-central-bank-is-really-thinking

  15. Google knowledge graph • Google search engine • Inverted list • PageRank • Knowledge base • Google knowledge base/ graph • Information extraction • Bootstrapping • Distant supervision • Link prediction

  16. Google search • References: • https://www.google.com/search/howsearchworks/ • The Anatomy of a Large-Scale Hypertextual Web Search Engine, Stanford, 1998 • Crawling • Indexing • inverted list; records font size, date, keywords, hyperlinks • Search algorithm • analyzing your search (misspellings, synonyms) • matching your search (e.g. keywords, language) • ranking pages

  17. Google search • Inverted list • Build a list mapping the keywords of web pages to (webID, word position) pairs • Example: • web1 - "i love you" • web2 - "god is love" • web3 - "love is blind" • web4 - "blind justice" • Resulting inverted list:

  keyword -> (webID, word position)
  i -> (1, 1)
  love -> (1, 2), (2, 3), (3, 1)
  you -> (1, 3)
  god -> (2, 1)
  is -> (2, 2), (3, 2)
  blind -> (3, 3), (4, 1)
  justice -> (4, 2)

  If you search “is” + “love”, you get webs 2 and 3. The difference of the word positions tells you how far apart the words are on the page.
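A minimal sketch of building this positional inverted index, using the four example pages from the slide.

```python
from collections import defaultdict

pages = {
    1: "i love you",
    2: "god is love",
    3: "love is blind",
    4: "blind justice",
}

# keyword -> list of (webID, word position) pairs, positions starting at 1
index: dict[str, list[tuple[int, int]]] = defaultdict(list)
for web_id, text in pages.items():
    for position, word in enumerate(text.split(), start=1):
        index[word].append((web_id, position))

# Pages containing both "is" and "love" -> {2, 3}
hits = {wid for wid, _ in index["is"]} & {wid for wid, _ in index["love"]}
print(sorted(hits))
```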

  18. Google search engine • PageRank • a score used to sort the search results • If a website A is linked to by many websites => higher PR score • If a website contains many links to other sites => each link contributes less to the PR score of the pages it points to • Suppose website A is linked to by B, C, D, …; the PageRank (PR) score of A is PR(A) = (1 - d) + d · (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + …), where L(B) is the number of links from B to other webs and d is a damping factor (typically 0.85)
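A minimal power-iteration sketch of this damped formula on a tiny made-up link graph (the three-page graph is illustrative only).

```python
links = {          # page -> pages it links to (illustrative graph)
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "B"],
}
d = 0.85           # damping factor
pr = {page: 1.0 for page in links}

for _ in range(50):  # iterate until approximately converged
    new_pr = {}
    for page in links:
        # Sum PR(B)/L(B) over all pages B that link to this page
        inbound = sum(pr[src] / len(outs)
                      for src, outs in links.items() if page in outs)
        new_pr[page] = (1 - d) + d * inbound
    pr = new_pr

print(pr)  # "A" ends up highest: it receives the most inbound links
```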

  19. Google knowledge graph • Is a knowledge base • Provides semantic-search information to enhance search results • Results are shown on the “Knowledge Graph card” • A knowledge base is basically a set of triples: • (subject, predicate, object) • (harry potter, author, J. K. Rowling) • (harry potter, is-a, book) • (J. K. Rowling, is-a, author) • Sources: • Wikidata and Freebase (structured data sets of facts) • human curated and collaborative • Next generation: Google Knowledge Vault (automatic)
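A minimal sketch of a knowledge base as a set of (subject, predicate, object) triples, using the slide’s examples, with a tiny pattern query where None acts as a wildcard.

```python
triples = {
    ("harry potter", "author", "J. K. Rowling"),
    ("harry potter", "is-a", "book"),
    ("J. K. Rowling", "is-a", "author"),
}

def query(subj=None, pred=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (subj is None or t[0] == subj)
            and (pred is None or t[1] == pred)
            and (obj is None or t[2] == obj)]

print(query(subj="harry potter"))        # all facts about harry potter
print(query(pred="is-a", obj="author"))  # who is an author?
```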

  20. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion (Google, KDD 2014) • Extraction: • named entity recognition • part-of-speech tagging • dependency parsing • co-reference resolution (e.g. he, she, it) • entity linkage (maps mentioned nouns to the corresponding entities in the knowledge base) • Relation extraction: • hand-built patterns • supervised methods • bootstrapping methods • distant supervision (used by KV)

  21. Bootstrap method • Use a few seeds to generate patterns by searching • Target relation: burial place • Seed tuple: [Mark Twain, burial place, Elmira] • Google, or search a corpus, for “Mark Twain” and “Elmira” • “Mark Twain is buried in Elmira, NY.” → X is buried in Y • “The grave of Mark Twain is in Elmira” → The grave of X is in Y • “Elmira is Mark Twain’s final resting place” → Y is X’s final resting place • Use those patterns to search for new tuples (see the sketch below)
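A minimal sketch of this bootstrap loop on the slide’s examples: harvest the text between the seed pair as a pattern, then reuse the patterns to extract new candidate tuples. The crude capitalized-phrase regex is a stand-in for real named-entity recognition, and this simple version only handles patterns where X precedes Y.

```python
import re

seed = ("Mark Twain", "Elmira")
corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira.",
    "Walt Whitman is buried in Camden, NJ.",   # illustrative extra sentence
]

# Step 1: harvest the text between the seed pair, e.g. " is buried in ".
patterns = set()
for sentence in corpus:
    m = re.search(re.escape(seed[0]) + r"(.+?)" + re.escape(seed[1]), sentence)
    if m:
        patterns.add(m.group(1))

# Step 2: reuse each pattern to extract new candidate tuples.
entity = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
for middle in patterns:
    for sentence in corpus:
        for x, y in re.findall(entity + re.escape(middle) + entity, sentence):
            # The seed tuple reappears; new tuples like (Walt Whitman,
            # Camden) are the payoff of the bootstrap step.
            print((x, "burial place", y))
```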

  22. Distant supervision • Similar to the bootstrap method, but label the search results according to an existing KB to form a large training set • For example, the predicate is “married_to” • Suppose the known pairs in the KB are (BarackObama, MichelleObama) and (BillClinton, HillaryClinton) • Search for sentences in which a pair is mentioned and extract feature patterns; use the features to find more pairs (bootstrap phase) • If a new pair is in the known knowledge base -> label it correct, else label it incorrect (local closed-world assumption) • correct pair: (BarackObama, MichelleObama) -- feature A: 10 times, feature B: 2 times • incorrect pair: (Trump, Zi) -- feature A: 2 times, feature B: 5 times • Fit a logistic regression on these feature counts (sketched below)
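A minimal sketch of the labelling and classification step: pairs found in the KB are labelled positive (local closed-world assumption) and a logistic regression is fit on the feature counts. The first and third rows use the counts from the slide; the other two rows are made-up filler so the fit has more than two samples.

```python
from sklearn.linear_model import LogisticRegression

kb_married = {("BarackObama", "MichelleObama"),
              ("BillClinton", "HillaryClinton")}

# (pair, [count of feature A, count of feature B]) aggregated over sentences
candidates = [
    (("BarackObama", "MichelleObama"), [10, 2]),   # from the slide
    (("BillClinton", "HillaryClinton"), [8, 1]),   # illustrative counts
    (("Trump", "Zi"), [2, 5]),                     # from the slide
    (("Peter", "Sam"), [1, 6]),                    # illustrative counts
]

X = [features for _, features in candidates]
y = [1 if pair in kb_married else 0 for pair, _ in candidates]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[9, 1]])[:, 1])  # P(married_to) for a new pair
```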

  23. Distant supervision • Example features: • word sequence between the named entities • dependency parse path between the entities • [Figure: dependency parse of a sentence connecting named entity 1 (a person) to named entity 2 (a location); legend: S = surface subject, Pred = predicate of a clause, Mod = relationship between a word and its adjunct modifier, Pcomp-n = nominal complement of a preposition]

  24. Link predictions • Assign a probability to any possible triple, even if there is no corresponding evidence for the fact on the Web • Example: • extracted triples: (Peter, parent of, Sam), (Susan, parent of, Sam) • predict / assign a high probability to the unseen triple (Peter, married to, Susan) • Tensor factorization (collective learning) / neural networks

  25. Link predictions • Regard all the possible triples as an E x P x E tensor Y • E = no. of possible entities • P = no. of possible predicates • Collective learning • Model the tensor Y as in matrix factorization, turning each entity into a vector: the score of a triple such as (Peter, parent of, Sam) is modeled as e_i^T W_k e_j, where e_i is the embedding of the i-th entity (a vector of embedding size d) and W_k is a d x d matrix representing the k-th relation
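A minimal numpy sketch of this factorization’s scoring function, e_i^T W_k e_j, using the slide’s entities. The random embeddings are stand-ins; in practice e and W are learned by fitting the observed entries of the tensor Y.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # embedding size (assumption)
entities = ["Peter", "Susan", "Sam"]
relations = ["parent_of", "married_to"]

E = {name: rng.normal(size=d) for name in entities}        # e_i in R^d
W = {name: rng.normal(size=(d, d)) for name in relations}  # W_k in R^(d x d)

def score(subj: str, rel: str, obj: str) -> float:
    """Score of the triple (subj, rel, obj) as e_i^T W_k e_j."""
    return float(E[subj] @ W[rel] @ E[obj])

# With learned parameters, score("Peter", "married_to", "Susan") would be
# high even though that triple never appears in the extracted data.
print(score("Peter", "married_to", "Susan"))
```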

  26. Link predictions • Since the entity vectors e remain unchanged across the different relation matrices W: • similar entities get similar embedding vectors • unknown relations can be predicted by the matrix multiplication e_i^T W_k e_j

  27. Thank you • References used: • A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE, Volume 104, Issue 1, Jan. 2016. • Reasoning with Neural Tensor Networks for Knowledge Base Completion. NIPS 2013. • Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. KDD 2014. • A Three-Way Model for Collective Learning on Multi-Relational Data. ICML 2011.
