
Natural language processing

Natural Language Processing (NLP) is a way for computers to analyze language in a useful way. NLP can perform tasks such as making appointments, buying things, language translation, question answering, and information extraction. This article explores basic and advanced NLP techniques, common libraries/tools for NLP, and the difference between machine learning and deep learning approaches in NLP.


Presentation Transcript


  1. Natural language processing Rohit Apte (Rohit@rohitapte.com) www.rohitapte.com

  2. Natural Language Processing • A way for computers to analyze language in a useful way • Perform tasks (make appointments, buy things) • Language Translation • Question Answering • Information Extraction (Analyze documents, read emails, chatbots, etc.) NLP

  3. Basic NLP techniques • Sentence segmentation – split a paragraph into sentences • Word tokenization – split a sentence into token words • Stemming and lemmatization (text normalization) - reduce inflection in words to their root forms. Stemming tends to be more crude (usually truncating end of words), while Lemmatization uses vocabulary and morphological analysis of words.
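A minimal sketch of these basics with NLTK (assuming the relevant NLTK data packages have been downloaded); the sample text and words are only illustrative:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)      # newer NLTK releases may also need "punkt_tab"
nltk.download("wordnet", quiet=True)

text = "The cats were running quickly. They ran into the garden."
sentences = sent_tokenize(text)          # sentence segmentation
tokens = word_tokenize(sentences[0])     # word tokenization
print(sentences)
print(tokens)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running"), stemmer.stem("studies"))            # "run", "studi" (crude truncation)
print(lemmatizer.lemmatize("running", pos="v"),
      lemmatizer.lemmatize("studies", pos="n"))                    # "run", "study" (vocabulary-aware)
```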

  4. Basic NLP techniques (cont.) • Part of Speech (POS) tagging – tag words as nouns, adjectives, participles, pronouns, etc. • Named Entity Recognition – identify Organization, Person, Location, Date, Time, etc. • Dependency Parsing – analyze the grammatical structure of a sentence, establishing relationships between “head” words and the words that modify those heads. • Coreference resolution – find all expressions that refer to the same entity in a text. • Relation extraction – extract semantic relationships from a text.
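A short spaCy sketch of POS tagging, dependency parsing and NER (assuming the en_core_web_sm model has been installed with `python -m spacy download en_core_web_sm`); the example sentence is arbitrary:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion in 2019.")

# POS tag, fine-grained (Penn Treebank) tag, dependency label and head word for each token
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)

# Named entities (ORG, GPE, MONEY, DATE, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
```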

  5. Penn treebank POS tags

  6. Advanced NLP techniques • Question Answering • Sentiment Analysis • Dialog System (chatbots, reasoning over knowledge base) • Document Summarization • Text Generation • Machine Translation

  7. Common libraries/tools for NLP • NLTK • SpaCy • Gensim • Stanford CoreNLP (Java) • Apache OpenNLP (Java) • Word embeddings – word2vec, GloVe, FastText

  8. Machine learning vs deep learning • Traditional approach to NLP was Statistical (Machine Learning) based • Required lots of feature engineering (and domain knowledge) • Task of optimizing weights was fairly trivial relative to feature engineering • Deep learning attempts to transform raw data into representations of increasing complexity. • Requires little or no feature engineering • Provides a very flexible learning framework • In 2010 deep learning techniques started outperforming ML techniques.

  9. Finding collocations using ML techniques • A collocation is a pair or group of words that are habitually juxtaposed • Examples – crystal clear, middle management, nuclear option, etc. • The traditional method is to use a formula based on statistical quantities of those words to calculate a score for every word pair. • *https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
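A sketch of that statistical scoring with NLTK's collocation finder, using pointwise mutual information (PMI) as the score; `corpus.txt` is a hypothetical plain-text corpus:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)

words = nltk.word_tokenize(open("corpus.txt").read().lower())   # hypothetical corpus file

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                  # ignore pairs seen fewer than 5 times
print(finder.nbest(measures.pmi, 20))        # 20 highest-scoring candidate collocations
```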

  10. NLP is hard! • Natural languages are complex and ambiguous (unlike programming languages), and the rules are often broken. • Most tech giants (Google, Alibaba, Microsoft, Amazon) are tackling “BIG” NLP problems like machine comprehension, inference, etc. These take years of research before any commercial gains are realized. • There is value in addressing “SMALLER” problems – summarizing resumes, analyzing central bank statements, etc.

  11. Resume summarization • Resumes are often excessively detailed, and most of that detail is irrelevant. • We can simplify the process by scanning the text from the PDF document. • Extract named entities for each line of text. • This provides a quick summary of education, work experience, etc. • We can build a basic prototype in under 10 minutes (and less than 20 lines of code!) – see the sketch below.
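A hedged sketch of that prototype: pull the text out of a PDF and run spaCy NER line by line. `resume.pdf` is a placeholder file name, and pdfminer.six plus the en_core_web_sm model are assumed to be installed:

```python
import spacy
from pdfminer.high_level import extract_text

nlp = spacy.load("en_core_web_sm")
text = extract_text("resume.pdf")            # hypothetical input file

for line in text.splitlines():
    doc = nlp(line.strip())
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    if entities:                             # keep only lines that mention ORG, PERSON, DATE, ...
        print(entities)
```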

  12. [Example output: named entities extracted from a sample resume – Harvard Business School, Harvard Medical School, Vanderbilt University, Harvard College, M.D. Anderson Cancer Center, McKinsey & Company, Barclays Equity Research, SciClone Pharmaceuticals, NASDAQ, CFO, etc.]

  13. Parsing FOMC statements • The Federal Open Market Committee manages US monetary policy and sets interest rates that have a major impact on the world economy. • The Fed releases statements along with interest rate targets at 2pm NY time on selected dates (per a fixed calendar schedule). • These statements tend to move financial markets, especially if the Fed acts contrary to market expectations. • Traders position around these statements, and the market is often in a race to digest the information in the statement.

  14. Can we tackle this using NLP? • Crawl the Fed website for the latest (and historical) statements. • Extract common things the Fed speaks about – POS tagging and collocation extraction. • Analyze changes in counts for each phrase between statements. • This uses no domain knowledge of interest rates or the Fed!
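A rough sketch of the crawl-and-count step, using spaCy noun chunks as a stand-in for the POS/collocation extraction. The URL only illustrates the federalreserve.gov press-release format and is not guaranteed to exist:

```python
from collections import Counter
import requests
from bs4 import BeautifulSoup
import spacy

nlp = spacy.load("en_core_web_sm")

url = "https://www.federalreserve.gov/newsevents/pressreleases/monetary20240131a.htm"  # illustrative
html = requests.get(url, timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

phrase_counts = Counter(chunk.text.lower() for chunk in nlp(text).noun_chunks)
print(phrase_counts.most_common(25))
# Diffing phrase_counts across two statement dates highlights what the Fed changed.
```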

  15. Can we tackle this using NLP (cont.)? • See which members voted for the resolution (certain members are known hawks or doves). • Extract the Fed rate target (the most important part of the statement). • React to changes in the Fed statement versus market expectations.
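For the rate-target bullet, a hedged regex sketch; the wording pattern ("target range for the federal funds rate ... to ... percent") follows recent statements but is an assumption, not a guaranteed format:

```python
import re

statement = (
    "The Committee decided to maintain the target range for the federal "
    "funds rate at 5-1/4 to 5-1/2 percent."
)

match = re.search(
    r"target range for the federal funds rate (?:at|to) ([\d\-/ ]+?) to ([\d\-/ ]+?) percent",
    statement,
)
if match:
    print("lower:", match.group(1), "upper:", match.group(2))   # lower: 5-1/4  upper: 5-1/2
```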

  16. Can we do more? • What about sentiment analysis on sentences containing keywords (inflation, unemployment, etc.) to give an overall confidence score? • Popular platforms (Microsoft, Google, Amazon) don’t give very good results because they are trained on different datasets – they work well for movie/product reviews but not for our task. • We could train our own sentiment classifier, but the challenge is getting enough labelled data.
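One way to prototype the keyword-sentence idea is NLTK's VADER analyzer; as the slide notes, it is tuned for social-media text, so treat it as a rough stand-in until a domain-specific classifier is available. `fomc_statement.txt` is a hypothetical file:

```python
import nltk
from nltk.tokenize import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
nltk.download("punkt", quiet=True)

keywords = ("inflation", "unemployment", "growth")
analyzer = SentimentIntensityAnalyzer()

statement = open("fomc_statement.txt").read()      # hypothetical statement text
scores = [
    analyzer.polarity_scores(sentence)["compound"]
    for sentence in sent_tokenize(statement)
    if any(k in sentence.lower() for k in keywords)
]
if scores:
    print("average sentiment on keyword sentences:", sum(scores) / len(scores))
```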

  17. Working with text in machine learning • We need to convert text to numbers that can be fed into ML models. • Traditionally there were a few approaches: • Bag of Words – a sentence is represented as a bag (multiset) of its words. • One-hot encoding – each word is encoded as a vector of zeros with a 1 indicating the word. • TF-IDF – Term Frequency (summarizes how often a word occurs within a document), Inverse Document Frequency (downscales words that appear across a lot of documents). • Hashing – convert the sentence to a hash and use that as input to your models (hashing is a one-way function).
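A minimal comparison of three of these schemes with scikit-learn (one-hot encoding is just the binary, per-word special case of the bag-of-words matrix); the two documents are toy examples:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer,
)

docs = ["the fed raised rates", "the fed left rates unchanged"]

bow = CountVectorizer().fit_transform(docs)                     # bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(docs)                   # TF-IDF weights
hashed = HashingVectorizer(n_features=16).fit_transform(docs)   # fixed-size hashed features

print(bow.toarray())
print(tfidf.toarray().round(2))
print(hashed.shape)        # (2, 16) regardless of vocabulary size
```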

  18. But there is a major drawback with this • As our vocabulary grows, the size of our vectors gets very large. For a general problem like Machine Comprehension or Chatbots, vocabulary size of 2 million words is quite common. • Storing large vectors with mostly zeros is memory intensive. • Sparse Vectors can help with this, but we are still dealing with large vectors that can slow our ML models. • Stemming and lemmatization can help, but again we may lose critical information by using these transformations. • One hot representations are orthogonal vectors. So there is no natural similarity for one-hot vectors.

  19. Is there another way? • YES! Word embeddings. • 3 popular types – word2vec, GloVe and FastText. • Core idea: a word’s meaning is given by the words that frequently appear close by (co-occurrence). • Build a (dense) vector representation for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.

  20. Word embeddings • word2vec – developed by Tomas Mikolov’s team at Google in 2013 • Captures co-occurrence information using a predictive model. • Skip-gram and CBOW models (usually take the average of the two vectors). • GloVe – developed by Pennington, Socher and Manning at Stanford in 2014 • Also captures co-occurrence information, but uses a count-based model (with dimensionality reduction). • FastText – developed by Mikolov’s team at Facebook in 2016 • Extension of word2vec that improves embeddings for rare words. • Can construct a vector for out-of-vocabulary words from their character n-grams (subword information).
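A quick way to explore pretrained embeddings is Gensim's downloader; "glove-wiki-gigaword-100" below is one of the standard gensim-data vector bundles (the first call downloads it):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")    # 100-dimensional GloVe vectors

print(vectors.most_similar("bank", topn=5))      # nearest neighbours in embedding space
print(vectors.similarity("king", "queen"))       # cosine similarity between two words
```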

  21. Word embedding created using Word2vec | Source: https://www.adityathakker.com/introduction-to-word2vec-how-it-works/

  22. Word embeddings (cont.) • Word embeddings are usually a good starting point for deep learning (and vanilla ML) models: convert words to vectors using embeddings, then apply the models to this data. • Note that the published vectors are trained on Wikipedia (GloVe also has vectors trained on Twitter data). • If we are working in a different domain (crypto, medical files, etc.) we need to train our own embeddings. • Gensim provides a framework to train word2vec on custom datasets – see the sketch below.
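A minimal sketch of training domain-specific vectors with Gensim 4.x (older 3.x releases use `size=` instead of `vector_size=`); the two-sentence corpus is obviously a toy placeholder for your own tokenized documents:

```python
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens from your own corpus (crypto, medical notes, ...)
corpus = [
    ["bitcoin", "price", "rallied", "after", "the", "halving"],
    ["ethereum", "gas", "fees", "dropped", "this", "week"],
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)   # sg=1 -> skip-gram
print(model.wv["bitcoin"][:5])              # first 5 dimensions of the learned vector
print(model.wv.most_similar("bitcoin"))     # neighbours (meaningful only with a real corpus)
```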
