
Natural language processing

Natural Language Processing (NLP) is a way for computers to analyze language in a useful way. NLP can perform tasks such as making appointments, buying things, language translation, question answering, and information extraction. This article explores basic and advanced NLP techniques, common libraries/tools for NLP, and the difference between machine learning and deep learning approaches in NLP.


Presentation Transcript


  1. Natural language processing Rohit Apte (Rohit@rohitapte.com) www.rohitapte.com

  2. Natural Language Processing • A way for computers to analyze language in a useful way • Perform tasks (make appointments, buy things) • Language Translation • Question Answering • Information Extraction (Analyze documents, read emails, chatbots, etc.) NLP

  3. Basic NLP techniques • Sentence segmentation – split a paragraph into sentences • Word tokenization – split a sentence into token words • Stemming and lemmatization (text normalization) - reduce inflection in words to their root forms. Stemming tends to be more crude (usually truncating end of words), while Lemmatization uses vocabulary and morphological analysis of words.
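A minimal sketch of these basics with NLTK (assuming the relevant NLTK data packages have been downloaded); the sample text and words are only illustrative:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)      # newer NLTK releases may also need "punkt_tab"
nltk.download("wordnet", quiet=True)

text = "The cats were running quickly. They ran into the garden."
sentences = sent_tokenize(text)          # sentence segmentation
tokens = word_tokenize(sentences[0])     # word tokenization
print(sentences)
print(tokens)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running"), stemmer.stem("studies"))            # "run", "studi" (crude truncation)
print(lemmatizer.lemmatize("running", pos="v"),
      lemmatizer.lemmatize("studies", pos="n"))                    # "run", "study" (vocabulary-aware)
```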

  4. Basic NLP techniques (cont.) • Part of Speech (POS) tagging – tag words as nouns, adjectives, participles, pronouns, etc. • Named Entity Recognition – identify Organization, Person, Location, Date, Time, etc. • Dependency Parsing – analyze the grammatical structure of a sentence, establishing relationships between “head” words and the words that modify those heads. • Coreference resolution – find all expressions that refer to the same entity in a text. • Relation extraction – extract semantic relationships from a text.
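A short spaCy sketch of POS tagging, dependency parsing and NER (assuming the en_core_web_sm model has been installed with `python -m spacy download en_core_web_sm`); the example sentence is arbitrary:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion in 2019.")

# POS tag, fine-grained (Penn Treebank) tag, dependency label and head word for each token
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)

# Named entities (ORG, GPE, MONEY, DATE, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
```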

  5. Penn treebank POS tags

  6. Advanced NLP techniques • Question Answering • Sentiment Analysis • Dialog System (chatbots, reasoning over knowledge base) • Document Summarization • Text Generation • Machine Translation

  7. Common libraries/tools for NLP • NLTK • SpaCy • Gensim • Stanford CoreNLP (Java) • Apache OpenNLP (Java) • Word embeddings – word2vec, GloVe, FastText

  8. Machine learning vs deep learning • Traditional approach to NLP was Statistical (Machine Learning) based • Required lots of feature engineering (and domain knowledge) • Task of optimizing weights was fairly trivial relative to feature engineering • Deep learning attempts to transform raw data into representations of increasing complexity. • Requires little or no feature engineering • Provides a very flexible learning framework • In 2010 deep learning techniques started outperforming ML techniques.

  9. Finding collocations using ML techniques • A collocation is a pair or group of words that are habitually juxtaposed • Examples – crystal clear, middle management, nuclear option, etc. • The traditional method is to use a formula based on statistical quantities of those words to calculate a score for every word pair. • *https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
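A sketch of that statistical scoring with NLTK's collocation finder, using pointwise mutual information (PMI) as the score; `corpus.txt` is a hypothetical plain-text corpus:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)

words = nltk.word_tokenize(open("corpus.txt").read().lower())   # hypothetical corpus file

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                  # ignore pairs seen fewer than 5 times
print(finder.nbest(measures.pmi, 20))        # 20 highest-scoring candidate collocations
```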

  10. NLP is hard! • Natural languages are complex and ambiguous (unlike programming languages), and the rules are often broken. • Most tech giants (Google, Alibaba, Microsoft, Amazon) are tackling “BIG” NLP problems like machine comprehension, inference, etc. These take years of research before any commercial gains are realized. • There is value in addressing “SMALLER” problems – summarizing resumes, analyzing central bank statements, etc.

  11. Resume summarization • Resumes are often excessively detailed, and most of that detail is irrelevant. • We can simplify the process by scanning the text from the PDF document. • Extract named entities for each line of text. • This provides a quick summary of education, work experience, etc. • We can build a basic prototype in under 10 minutes (and less than 20 lines of code!) – see the sketch below.
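A hedged sketch of that prototype: pull the text out of a PDF and run spaCy NER line by line. `resume.pdf` is a placeholder file name, and pdfminer.six plus the en_core_web_sm model are assumed to be installed:

```python
import spacy
from pdfminer.high_level import extract_text

nlp = spacy.load("en_core_web_sm")
text = extract_text("resume.pdf")            # hypothetical input file

for line in text.splitlines():
    doc = nlp(line.strip())
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    if entities:                             # keep only lines that mention ORG, PERSON, DATE, ...
        print(entities)
```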

  12. [Example output: named entities extracted from a sample resume – Harvard Business School, Harvard Medical School, Vanderbilt University, Harvard College, M.D. Anderson Cancer Center, McKinsey & Company, Barclays Equity Research, SciClone Pharmaceuticals, NASDAQ, CFO, etc.]

  13. Parsing FOMC statements • The Federal Open Market Committee manages US monetary policy and sets interest rates that have a major impact on the world economy. • The Fed releases statements along with interest rate targets at 2pm NY time on selected dates (per a fixed calendar schedule). • These statements tend to move financial markets, especially if the Fed acts contrary to market expectations. • Traders position around these statements, and the market is often in a race to digest the information in the statement.

  14. Can we tackle this using NLP? • Crawl the Fed website for the latest (and historical) statements. • Extract common things the Fed speaks about – POS tagging and collocation extraction. • Analyze changes in counts for each phrase between statements. • This uses no domain knowledge of interest rates or the Fed!
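A rough sketch of the crawl-and-count step, using spaCy noun chunks as a stand-in for the POS/collocation extraction. The URL only illustrates the federalreserve.gov press-release format and is not guaranteed to exist:

```python
from collections import Counter
import requests
from bs4 import BeautifulSoup
import spacy

nlp = spacy.load("en_core_web_sm")

url = "https://www.federalreserve.gov/newsevents/pressreleases/monetary20240131a.htm"  # illustrative
html = requests.get(url, timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

phrase_counts = Counter(chunk.text.lower() for chunk in nlp(text).noun_chunks)
print(phrase_counts.most_common(25))
# Diffing phrase_counts across two statement dates highlights what the Fed changed.
```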

  15. Can we tackle this using NLP (cont.)? • See which members voted for the resolution (certain members are known hawks or doves). • Extract the Fed rate target (the most important part of the statement). • React to changes in the Fed statement versus market expectations.
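For the rate-target bullet, a hedged regex sketch; the wording pattern ("target range for the federal funds rate ... to ... percent") follows recent statements but is an assumption, not a guaranteed format:

```python
import re

statement = (
    "The Committee decided to maintain the target range for the federal "
    "funds rate at 5-1/4 to 5-1/2 percent."
)

match = re.search(
    r"target range for the federal funds rate (?:at|to) ([\d\-/ ]+?) to ([\d\-/ ]+?) percent",
    statement,
)
if match:
    print("lower:", match.group(1), "upper:", match.group(2))   # lower: 5-1/4  upper: 5-1/2
```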

  16. Can we do more? • What about sentiment analysis on sentences containing keywords (inflation, unemployment, etc.) to give an overall confidence score? • Popular platforms (Microsoft, Google, Amazon) don’t give very good results because they are trained on different datasets – they work well for movie/product reviews but not for our task. • We could train our own sentiment classifier, but the challenge is getting enough labelled data.
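One way to prototype the keyword-sentence idea is NLTK's VADER analyzer; as the slide notes, it is tuned for social-media text, so treat it as a rough stand-in until a domain-specific classifier is available. `fomc_statement.txt` is a hypothetical file:

```python
import nltk
from nltk.tokenize import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
nltk.download("punkt", quiet=True)

keywords = ("inflation", "unemployment", "growth")
analyzer = SentimentIntensityAnalyzer()

statement = open("fomc_statement.txt").read()      # hypothetical statement text
scores = [
    analyzer.polarity_scores(sentence)["compound"]
    for sentence in sent_tokenize(statement)
    if any(k in sentence.lower() for k in keywords)
]
if scores:
    print("average sentiment on keyword sentences:", sum(scores) / len(scores))
```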

  17. Working with text in machine learning • We need to convert text to numbers that can be fed into ML models. • Traditionally there were a few approaches: • Bag of Words – a sentence is represented as a bag (multiset) of its words. • One-hot encoding – each word is encoded as a vector of zeros with a 1 indicating the word. • TF-IDF – Term Frequency (summarizes how often a word occurs within a document), Inverse Document Frequency (downscales words that appear across a lot of documents). • Hashing – convert the sentence to a hash and use that as input to your models (hashing is a one-way function).
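A minimal comparison of three of these schemes with scikit-learn (one-hot encoding is just the binary, per-word special case of the bag-of-words matrix); the two documents are toy examples:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer,
)

docs = ["the fed raised rates", "the fed left rates unchanged"]

bow = CountVectorizer().fit_transform(docs)                     # bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(docs)                   # TF-IDF weights
hashed = HashingVectorizer(n_features=16).fit_transform(docs)   # fixed-size hashed features

print(bow.toarray())
print(tfidf.toarray().round(2))
print(hashed.shape)        # (2, 16) regardless of vocabulary size
```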

  18. But there is a major drawback with this • As our vocabulary grows, the size of our vectors gets very large. For a general problem like Machine Comprehension or Chatbots, vocabulary size of 2 million words is quite common. • Storing large vectors with mostly zeros is memory intensive. • Sparse Vectors can help with this, but we are still dealing with large vectors that can slow our ML models. • Stemming and lemmatization can help, but again we may lose critical information by using these transformations. • One hot representations are orthogonal vectors. So there is no natural similarity for one-hot vectors.

  19. Is there another way? • YES! Word embeddings. • 3 popular types – word2vec, GloVe and FastText. • Core idea: a word’s meaning is given by the words that frequently appear close by (co-occurrence). • Build a (dense) vector representation for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.

  20. Word embeddings • word2vec – developed by Tomas Mikolov’s team at Google in 2013 • Captures co-occurrence information using a predictive model. • Skip-gram and CBOW models (usually take the average of the two vectors). • GloVe – developed by Pennington, Socher and Manning at Stanford in 2014 • Also captures co-occurrence information, but uses a count-based model (with dimensionality reduction). • FastText – developed by Mikolov’s team at Facebook in 2016 • Extension of word2vec that improves embeddings for rare words. • Can construct a vector for out-of-vocabulary words from their character n-grams (subword information).
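A quick way to explore pretrained embeddings is Gensim's downloader; "glove-wiki-gigaword-100" below is one of the standard gensim-data vector bundles (the first call downloads it):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")    # 100-dimensional GloVe vectors

print(vectors.most_similar("bank", topn=5))      # nearest neighbours in embedding space
print(vectors.similarity("king", "queen"))       # cosine similarity between two words
```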

  21. Word embedding created using Word2vec | Source: https://www.adityathakker.com/introduction-to-word2vec-how-it-works/

  22. Word embeddings (cont.) • Word embeddings are usually a good starting point for deep learning (and vanilla ML) models: convert words to vectors using embeddings, then apply the models to this data. • Note that the published vectors are trained on Wikipedia (GloVe also has vectors trained on Twitter data). • If we are working in a different domain (crypto, medical files, etc.) we need to train our own embeddings. • Gensim provides a framework to train word2vec on custom datasets – see the sketch below.
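A minimal sketch of training domain-specific vectors with Gensim 4.x (older 3.x releases use `size=` instead of `vector_size=`); the two-sentence corpus is obviously a toy placeholder for your own tokenized documents:

```python
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens from your own corpus (crypto, medical notes, ...)
corpus = [
    ["bitcoin", "price", "rallied", "after", "the", "halving"],
    ["ethereum", "gas", "fees", "dropped", "this", "week"],
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)   # sg=1 -> skip-gram
print(model.wv["bitcoin"][:5])              # first 5 dimensions of the learned vector
print(model.wv.most_similar("bitcoin"))     # neighbours (meaningful only with a real corpus)
```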
