
Natural language processing



  1. Natural language processing

  2. Applications • Classification (spam) • Clustering (news stories, Twitter) • Input correction (spell checking) • Sentiment analysis (product reviews) • Information retrieval (web search) • Question answering (web search, IBM’s Watson) • Machine translation (English to Spanish) • Speech recognition (Siri)

  3. Language Models • Two ways to think about modeling language: • Sequences of letters/words: probabilistic, word-based, learned • Tree-based grammar models: logical, Boolean, often hand-coded

  4. Bag-of-words Model Transform documents into sparse numeric vectors and then process them with linear-algebra operations

  5. Bag-of-words Model • One of the most common ways to deal with documents • Forgets everything about the linguistic structure within the text • Useful for classification, clustering, visualization etc.
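A minimal sketch of the transformation in Python (not from the slides; the whitespace tokenizer and example document are illustrative assumptions):

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to a sparse word-count vector (a dict),
    discarding word order and all other linguistic structure."""
    tokens = text.lower().split()  # naive whitespace tokenizer
    return Counter(tokens)         # word -> number of occurrences

print(bag_of_words("to be or not to be"))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```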

  6. Similarity between document vectors • Each document is represented as a vector of weights • Cosine similarity (a normalized dot product) is the most widely used similarity measure between two document vectors • …calculates the cosine of the angle between document vectors • …efficient to calculate (sum of products over intersecting words) • …similarity value between 0 (completely different) and 1 (identical)
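A sketch of the measure over sparse dict vectors (the document weights below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors (dicts).
    Only words present in both vectors contribute to the dot product,
    which is what makes the computation efficient."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = {"resorts": 0.62, "trump": 0.37, "shares": 0.12}
d2 = {"resorts": 0.55, "casino": 0.20, "shares": 0.09}
print(cosine_similarity(d1, d2))  # in [0, 1] for non-negative weights
```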

  7. Bag of Words with Word Weighting • Each word is represented as a separate variable having a numeric weight (importance) • The most popular weighting scheme is normalized word frequency, TF-IDF:

  tfidf(w, d) = tf(w, d) × log(N / df(w))

  • tf(w, d) – term frequency (number of occurrences of word w in document d) • df(w) – document frequency (number of documents containing word w) • N – number of all documents • tfidf(w, d) – relative importance of word w in document d • The word is more important if it appears several times in the target document • The word is more important if it appears in fewer documents
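One way the weighting might be computed, using the plain tf(w, d) × log(N / df(w)) form above (real systems use many IDF variants; the function name and toy corpus are illustrative assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: a list of token lists.
    Returns one sparse TF-IDF vector (dict) per document."""
    n = len(documents)                      # N: total number of documents
    df = Counter()                          # df(w): documents containing w
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)                   # tf(w, d): occurrences of w in d
        # A word occurring in every document gets weight 0 (log 1 = 0).
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [s.split() for s in ["trump makes bid",
                            "resorts casino shares",
                            "casino shares outstanding"]]
print(tfidf_vectors(docs))
```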

  8. Example document and its vector representation

  Original text:

  TRUMP MAKES BID FOR CONTROL OF RESORTS Casino owner and real estate developer Donald Trump has offered to acquire all Class B common shares of Resorts International Inc, a spokesman for Trump said. The estate of late Resorts chairman James M. Crosby owns 340,783 of the 752,297 Class B shares. Resorts also has about 6,432,000 Class A common shares outstanding. Each Class B share has 100 times the voting power of a Class A share, giving the Class B stock about 93 pct of Resorts' voting power.

  Bag-of-Words representation (high-dimensional sparse vector):

  [RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171] [ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119] [DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097] [COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075] [SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065] [ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062] [OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049] [INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022] [MARCH:0.011]

  9. What happens if some words do not appear in the training corpus? • Smoothing: assigning very low, but greater than 0, probabilities to previously unseen words
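The slide doesn't name a particular scheme; add-one (Laplace) smoothing is one common choice, sketched here over unigram counts with an illustrative corpus:

```python
from collections import Counter

def laplace_unigram_prob(word, counts, vocab_size):
    """Add-one (Laplace) smoothing: every word, seen or unseen,
    receives a small but nonzero probability."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

counts = Counter("to be or not to be".split())
vocab_size = len(counts) + 1  # assume one unseen word in the vocabulary
print(laplace_unigram_prob("to", counts, vocab_size))      # seen word
print(laplace_unigram_prob("hamlet", counts, vocab_size))  # unseen, yet > 0
```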

  10. Wouldn’t it be helpful to reason about word order?

  11. n-gram models • A probabilistic language model based on contiguous sequences of n items • n=1 is a unigram model (“bag of words”) • n=2 is a bigram model • n=3 is a trigram model • … etc.

  12. n-gram example • Source text: to be or not to be • Unigrams: to, be, or, not, to, be • Bigrams: to be, be or, or not, not to, to be • Trigrams: to be or, be or not, or not to, not to be
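A sketch of the extraction, reproducing the slide's example:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: to, be, or, not, to, be
print(ngrams(tokens, 2))  # bigrams: to be, be or, or not, not to, to be
print(ngrams(tokens, 3))  # trigrams: to be or, be or not, or not to, not to be
```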

  13. Google n-gram corpus • In September 2006 Google announced the availability of its n-gram corpus: • http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#links • Some statistics of the corpus: • File size: approx. 24 GB of compressed (gzip'ed) text files • Number of tokens: 1,024,908,267,229 • Number of sentences: 95,119,665,584 • Number of unigrams: 13,588,391 • Number of bigrams: 314,843,401 • Number of trigrams: 977,069,902 • Number of fourgrams: 1,313,818,354 • Number of fivegrams: 1,176,470,663

  14. Example: Google n-grams (3-gram counts)

  ceramics collectables collectibles 55
  ceramics collectables fine 130
  ceramics collected by 52
  ceramics collectible pottery 50
  ceramics collectibles cooking 45
  ceramics collection , 144
  ceramics collection . 247
  ceramics collection </S> 120
  ceramics collection and 43
  ceramics collection at 52
  ceramics collection is 68
  ceramics collection of 76
  ceramics collection | 59
  ceramics collections , 66
  ceramics collections . 60
  ceramics combined with 46
  ceramics come from 69
  ceramics comes from 660
  ceramics community , 109
  ceramics community . 212
  ceramics community for 61
  ceramics companies . 53
  ceramics companies consultants 173
  ceramics company ! 4432
  ceramics company , 133
  ceramics company . 92
  ceramics company </S> 41
  ceramics company facing 145
  ceramics company in 181
  ceramics company started 137
  ceramics company that 87
  ceramics component ( 76
  ceramics composed of 85

  serve as the incoming 92
  serve as the incubator 99
  serve as the independent 794
  serve as the index 223
  serve as the indication 72
  serve as the indicator 120
  serve as the indicators 45
  serve as the indispensable 111
  serve as the indispensible 40
  serve as the individual 234
  serve as the industrial 52
  serve as the industry 607
  serve as the info 42
  serve as the informal 102
  serve as the information 838
  serve as the informational 41
  serve as the infrastructure 500
  serve as the initial 5331
  serve as the initiating 125
  serve as the initiation 63
  serve as the initiator 81
  serve as the injector 56
  serve as the inlet 41
  serve as the inner 87
  serve as the input 1323
  serve as the inputs 189
  serve as the insertion 49
  serve as the insourced 67
  serve as the inspection 43
  serve as the inspector 66
  serve as the inspiration 1390
  serve as the installation 136
  serve as the institute 187

  15. Using n-grams to generate text (Shakespeare) • Unigrams: • Every enter now severally so, let • Hill he late speaks; or! a more to leg less first you enter • Bigrams: • What means, sir. I confess she? then all sorts, he is trim, captain. • Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. • Trigrams: • Sweet prince, Falstaff shall die. • This shall forbid it should be branded, if renown made it empty.

  16. Using n-grams to generate text (Shakespeare) • Quadrigrams: • What! I will go seek the traitor Gloucester. • Will you not tell me who I am? • Note: As we increase the value of n, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained
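A minimal bigram sampler in the spirit of these examples (the tiny corpus and function names are illustrative assumptions; real generators train on far more text and model sentence boundaries):

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each word to the list of words that follow it in the corpus;
    duplicates in the list encode the empirical bigram probabilities."""
    model = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, length=10):
    """Random walk: repeatedly sample a successor of the current word."""
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:  # dead end: word never seen with a successor
            break
        out.append(random.choice(successors))
    return " ".join(out)

tokens = "to be or not to be that is the question".split()
print(generate(build_bigram_model(tokens), "to"))
```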

  17. Using n-grams to generate text (Wall Street Journal)
