Modeling Natural Text



  1. Modeling Natural Text David Kauchak CS458 Fall 2012

  2. Admin
• Final project
  • Paper draft: due next Friday by midnight
  • Saturday, I’ll e-mail out 1-2 paper drafts for you to read
  • Send me your reviews by Sunday at midnight
  • Monday morning, I’ll forward these so you can integrate comments
• Initial code submission: due next Friday
  • Make sure to start integrating your code sooner rather than later

  3. Admin
• Final project continued
  • At the beginning of class on Tuesday and Thursday we’ll spend 15 min. discussing where things are at
  • Any support from me? Let me know sooner rather than later…

  4. Watson paper discussion
• First application attempts
• How did the discussion go?
• One more paper discussion next Tuesday…

  5. Modeling natural text
Questions:
• what are the key topics in the text?
• what is the sentiment of the text?
• who/what does the article refer to?
• what are the key phrases?
• …
Phenomena:
• synonymy
• sarcasm/hyperbole
• variety of language (slang), misspellings
• coreference (e.g. pronouns like he/she)
• …

  6. Applications
• search engines
• search advertising
• corporate databases
• language generation (e.g. “I think, therefore I am”)
• speech recognition
• machine translation
• text simplification
• text classification and clustering (e.g. SPAM filtering, sentiment analysis, document hierarchies)

  7. Document modeling: learn a probabilistic model of documents Predict the likelihood that an unseen document belongs to a set of documents. The model should capture the characteristics of the text.

  8. Training a document model [Diagram: training documents → parameter estimation → document model]
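
As a rough sketch of what the “parameter estimation” step might look like for the multinomial model introduced later, here is maximum-likelihood training with add-one smoothing so unseen words don’t get zero probability. The function name, vocabulary, and smoothing choice are illustrative assumptions, not the slides’ exact estimator:

```python
from collections import Counter

def train_multinomial(documents, vocab):
    """Estimate multinomial word probabilities from training documents
    by maximum likelihood, with add-one smoothing."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    # +1 for every vocabulary word (add-one smoothing)
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

vocab = ["clinton", "said", "banana", "tv"]
model = train_multinomial(["Clinton said banana banana on tv"], vocab)
print(model)  # {'clinton': 0.222..., 'said': 0.222..., 'banana': 0.333..., 'tv': 0.222...}
```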

  9. Applying a document model [Diagram: new document + document model → probability] Document model: what is the probability that the new document is in the same “set” as the training documents? Applications?

  10. Application: text classification Category: sports, politics, entertainment, business, … Spam: spam vs. not-spam. Sentiment: positive vs. negative.

  11. Text classification: Training [Diagram: SPAM training documents → parameter estimation → SPAM model; non-SPAM training documents → parameter estimation → non-SPAM model]

  12. Text classification: Applying Compute the probability of the document under the SPAM model and under the non-SPAM model; whichever is larger determines the label: SPAM or non-SPAM.
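
A minimal sketch of that comparison, assuming per-class multinomial models with made-up word probabilities (and ignoring the class prior for simplicity). The comparison is done in log space, since products of many small probabilities underflow:

```python
import math

def log_prob(doc_counts, word_probs):
    """Log-probability of a bag of words under a per-class multinomial
    (the count-vector coefficient is the same for both classes, so it
    cancels in the comparison and is dropped here)."""
    return sum(c * math.log(word_probs[w]) for w, c in doc_counts.items())

spam_probs     = {"free": 0.10, "meeting": 0.01}  # made-up numbers
non_spam_probs = {"free": 0.02, "meeting": 0.08}
doc = {"free": 3, "meeting": 1}

scores = {"SPAM": log_prob(doc, spam_probs),
          "non-SPAM": log_prob(doc, non_spam_probs)}
print(max(scores, key=scores.get))  # SPAM: three "free"s outweigh one "meeting"
```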

  13. Representation and Notation
• Standard representation: bag of words
• Fixed vocabulary, ~50K words
• Documents represented by a count vector, where each dimension represents the frequency of a word
Example: “Clinton said banana repeatedly last week on tv, ‘banana, banana, banana’” becomes (4, 1, 1, 0, 0, 1, 0, 0, …) over the dimensions banana, clinton, said, california, across, tv, wrong, capital, …
The representation allows us to generalize across documents. Downside?

  14. Representation and Notation The same bag-of-words representation as above. Downside: we lose word ordering information.
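
A minimal sketch of the bag-of-words mapping for the slide’s example (the tokenizer here is deliberately crude; real systems normalize more carefully):

```python
from collections import Counter

vocab = ["banana", "clinton", "said", "california",
         "across", "tv", "wrong", "capital"]

def bag_of_words(text, vocab):
    """Map a document to a count vector over a fixed vocabulary;
    word order is discarded, which is the downside noted above."""
    counts = Counter(tok.strip('",.') for tok in text.lower().split())
    return [counts[w] for w in vocab]

doc = 'Clinton said banana repeatedly last week on tv, "banana, banana, banana"'
print(bag_of_words(doc, vocab))  # [4, 1, 1, 0, 0, 1, 0, 0]
```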

  15. Word burstiness What is the probability that a political document contains the word “Clinton” exactly once? Example: “The Stacy Koon-Lawrence Powell defense! The decisions of Janet Reno and Bill Clinton in this affair are essentially the moral equivalents of Stacy Koon's. …” p(“Clinton” = 1 | political) = 0.12

  16. Word burstiness What is the probability that a political document contains the word “Clinton” exactly twice? Example: “The Stacy Koon-Lawrence Powell defense! The decisions of Janet Reno and Bill Clinton in this affair are essentially the moral equivalents of Stacy Koon's. Reno and Clinton have the advantage in that they investigate themselves.” p(“Clinton” = 2 | political) = 0.05

  17. Word burstiness in models p(“Clinton” = 2 | political) = 0.05, but many models incorrectly predict p(“Clinton” = 2 | political) ≈ p(“Clinton” = 1 | political)^2, i.e. 0.0144 (0.12^2) rather than the observed 0.05. And in general, they predict p(“Clinton” = i | political) ≈ p(“Clinton” = 1 | political)^i.
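
The gap is easy to quantify with the slide’s numbers:

```python
p1 = 0.12           # observed p("Clinton" = 1 | political), from the slide
p2_observed = 0.05  # observed p("Clinton" = 2 | political)

p2_predicted = p1 ** 2  # what an independence-style model roughly predicts
print(p2_predicted)               # 0.0144
print(p2_observed / p2_predicted) # ~3.5x: a second "Clinton" is far likelier than predicted
```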

  18. [Plot: p(“Clinton” = x | political), the probability that “Clinton” occurs exactly x times in a document, as a function of x]

  19. Word count probabilities
• common words: 71% of word occurrences and 1% of the vocabulary
• average words: 21% of word occurrences and 10% of the vocabulary
• rare words: 8% of word occurrences and 89% of the vocabulary
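
These shares can be computed from any corpus. A sketch, assuming a hypothetical whitespace-tokenized file corpus.txt (the 71%/21%/8% figures above are specific to the slides’ corpus):

```python
from collections import Counter

def occurrence_share(tokens, vocab_fraction):
    """Fraction of all word occurrences covered by the most frequent
    `vocab_fraction` of the vocabulary."""
    ranked = [c for _, c in Counter(tokens).most_common()]
    top_k = max(1, int(len(ranked) * vocab_fraction))
    return sum(ranked[:top_k]) / sum(ranked)

tokens = open("corpus.txt").read().lower().split()  # hypothetical corpus file
print(occurrence_share(tokens, 0.01))  # share of the top 1% ("common") words
print(occurrence_share(tokens, 0.11))  # top 11% = common + average words
```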

  20. The models…

  21. Multinomial model 20 rolls of a fair, 6-sided die; each number is equally probable. Two outcome-count vectors: (1, 10, 5, 1, 2, 1) vs. (3, 3, 3, 3, 4, 4), giving the counts of ones, twos, threes, fours, fives, and sixes. Which is more probable?

  22. Multinomial model Same two count vectors: (1, 10, 5, 1, 2, 1) vs. (3, 3, 3, 3, 4, 4). How much more probable?

  23. Multinomial model p(1, 10, 5, 1, 2, 1) ≈ 0.000000764, while p(3, 3, 3, 3, 4, 4) ≈ 0.000891: about 1000 times more likely.
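
The two probabilities follow directly from the multinomial pmf, n!/(x1! … xk!) · p1^x1 · … · pk^xk. A standard-library check of the slide’s numbers:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """p(count vector) = n!/(x1!...xk!) * p1^x1 * ... * pk^xk."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coef * prob

fair_die = [1 / 6] * 6
print(multinomial_pmf([1, 10, 5, 1, 2, 1], fair_die))  # ~7.64e-07
print(multinomial_pmf([3, 3, 3, 3, 4, 4], fair_die))   # ~8.92e-04, ~1000x larger
```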

  24. Multinomial model for text Many more “sides” on the die than 6, but the same concept: a multinomial document model assigns a probability to a word count vector such as (4, 1, 1, 0, 0, 1, 0, 0, …) over banana, clinton, said, california, across, tv, wrong, capital, …
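
The same pmf applies to the word count vector. A sketch using scipy; the word probabilities below are made up for illustration:

```python
from scipy.stats import multinomial

# Hypothetical word probabilities over the eight vocabulary words
# (banana, clinton, said, california, across, tv, wrong, capital)
word_probs = [0.30, 0.20, 0.15, 0.05, 0.05, 0.15, 0.05, 0.05]
doc_counts = [4, 1, 1, 0, 0, 1, 0, 0]  # the count vector from slide 13

# The document is treated as sum(doc_counts) independent draws
# from a word-level "die"
print(multinomial.pmf(doc_counts, n=sum(doc_counts), p=word_probs))  # ~0.0077
```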

  25. Generative Story To apply a model, we’re given a document and we obtain the probability. We can also ask how a given model would generate a document; this is the “generative story” for a model.

  26. Drawing words from a multinomial Urn: {w1 ×4, w2 ×1, w3 ×2}. Selected: (nothing yet)

  27. Drawing words from a multinomial Draw a word: w1. Selected: w1

  28. Drawing words from a multinomial Put a copy of w1 back (sampling with replacement), so the urn is unchanged. Selected: w1

  29. Drawing words from a multinomial Draw again: w1. Selected: w1, w1

  30. Drawing words from a multinomial Put a copy of w1 back (sampling with replacement). Selected: w1, w1

  31. Drawing words from a multinomial Draw again: w2. Selected: w1, w1, w2

  32. Drawing words from a multinomial Put a copy of w2 back (sampling with replacement). Selected: w1, w1, w2

  33. Drawing words from a multinomial … and so on, until the document length is reached.
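
The animation above, as a short simulation; because the urn never changes, p(word) is the same on every draw:

```python
import random

def draw_multinomial(urn, num_draws):
    """Sampling with replacement: note each drawn word and put a copy
    back, so the urn (and every p(word)) stays constant."""
    return [random.choice(urn) for _ in range(num_draws)]

urn = ["w1"] * 4 + ["w2"] + ["w3"] * 2  # the urn from the slides
print(draw_multinomial(urn, 5))  # e.g. ['w1', 'w1', 'w2', 'w1', 'w3']
```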

  34. Drawing words from a multinomial Does the multinomial model capture burstiness?

  35. Drawing words from a multinomial p(word) remains constant, independent of which words have already been drawn (in particular, of how many copies of this particular word have been drawn), so the multinomial cannot capture burstiness.

  36. Multinomial probability simplex Generate documents containing 100 words from a multinomial with just 3 possible words. [Plot: the probability simplex over word 1, word 2, and word 3, with the multinomial {0.31, 0.44, 0.25} marked]

  37. Multinomial word count probabilities

  38. Multinomial does not model burstiness of average and rare words

  39. Better model of burstiness: DCM Dirichlet Compound Multinomial, via the Polya urn process
• KEY: the urn distribution changes based on previously drawn words
• Generative story: repeat until the document length is hit:
  • Randomly draw a word from the urn; call it wi
  • Put 2 copies of wi back in the urn

  40. Drawing words from a Polya urn Urn: {w1 ×4, w2 ×1, w3 ×2}. Selected: (nothing yet)

  41. Drawing words from a Polya urn Draw a word: w1. Selected: w1

  42. Drawing words from a Polya urn Put 2 copies of w1 back and adjust the parameters; the urn grows: {w1 ×5, w2 ×1, w3 ×2}. Selected: w1

  43. Drawing words from a Polya urn Draw again: w1. Selected: w1, w1

  44. Drawing words from a Polya urn Put 2 copies of w1 back and adjust the parameters: {w1 ×6, w2 ×1, w3 ×2}. Selected: w1, w1

  45. Drawing words from a Polya urn Draw again: w2. Selected: w1, w1, w2

  46. Drawing words from a Polya urn Put 2 copies of w2 back and adjust the parameters: {w1 ×6, w2 ×2, w3 ×2}. Selected: w1, w1, w2

  47. Drawing words from a Polya urn … and so on; every draw makes the drawn word more likely next time.
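
The Polya urn animation as a simulation sketch; the only change from the multinomial version is that the urn grows by one net copy of each drawn word:

```python
import random

def draw_polya(urn, num_draws):
    """Polya urn: put 2 copies of each drawn word back, so words
    already seen become more likely, producing burstiness."""
    urn = list(urn)  # copy, so the caller's urn isn't mutated
    selected = []
    for _ in range(num_draws):
        word = random.choice(urn)
        selected.append(word)
        urn.append(word)  # remove one, put two back = one net extra copy
    return selected

urn = ["w1"] * 4 + ["w2"] + ["w3"] * 2
print(draw_polya(urn, 10))  # repeats cluster more than with plain replacement
```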

  48. Polya urn
• Words already drawn are more likely to be seen again; this results in the DCM distribution
• We can modulate burstiness by increasing/decreasing the number of words in the urn while keeping the distribution the same

  49. Controlling burstiness Same distribution of words in each urn; which is more bursty? [Figure: a small urn vs. a large urn with the same word proportions; the small urn is more bursty, the large urn less bursty]

  50. Burstiness with DCM [Plots: word count probabilities for the multinomial vs. the DCM at three parameter scales: down-scaled {.31, .44, .25}, medium-scaled {.93, 1.32, .75}, and up-scaled {2.81, 3.94, 2.25}]
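
A small simulation, using the scaled parameter sets above as fractional initial urn contents, shows the effect: the smaller the total, the faster a few words dominate (burstier); the larger the total, the closer the counts stay to the multinomial’s. A minimal sketch:

```python
import random
from collections import Counter

def polya_document(alphas, length):
    """Generate one document from a Polya urn whose initial
    (fractional) contents are `alphas`; this simulates a draw
    from the DCM with those parameters."""
    weights = list(alphas)
    doc = []
    for _ in range(length):
        i = random.choices(range(len(weights)), weights=weights)[0]
        doc.append(i)
        weights[i] += 1  # two copies back = one net extra
    return doc

for alphas in [[0.31, 0.44, 0.25],   # down-scaled: very bursty
               [0.93, 1.32, 0.75],   # medium
               [2.81, 3.94, 2.25]]:  # up-scaled: closest to the multinomial
    counts = Counter(polya_document(alphas, 100))
    print(alphas, sorted(counts.values(), reverse=True))
```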
