
NLP Document and Sequence Models



Presentation Transcript


  1. NLP Document and Sequence Models

  2. Computational models of how natural languages work. These are sometimes called Language Models, or sometimes Grammars. Three main types (among many others): • Document models, or “topic” models • Sequence models: Markov models, HMMs, others • Context-free grammar models

  3. Computational models of how natural languages work. Most of the models I will show you are • Probabilistic models • Graphical models • Generative models In other words, they are essentially Bayes Nets. In addition, many (but not all) are • Latent variable models This means that some variables in the model are not observed in data, and must be inferred. (Like the hidden states in an HMM.)

  4. Topic Models. Topic models are models that • can predict the likelihood of a document • using latent topic variables. Let x1, …, xN be the words in a document. P(x1, …, xN) is the likelihood of the document. Topic models compute this (conceptually) by computing P(x1, …, xN) = Σz1,…,zN P(z1, …, zN) Πi P(xi | zi), where each zi is a latent variable that represents the “topic” of word xi.
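A concrete toy sketch of that computation: assuming a fixed document–topic distribution θ and per-topic word distributions φ (all numbers invented for illustration), the likelihood marginalizes the latent topic of each word independently:

```python
import numpy as np

# Hypothetical toy model (numbers invented for illustration):
# 2 topics, vocabulary of 4 word types.
theta = np.array([0.7, 0.3])        # P(topic | document)
phi = np.array([                    # P(word | topic), one row per topic
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

def doc_likelihood(word_ids):
    """P(x1..xN) = prod_i sum_k P(z_i = k) P(x_i | z_i = k),
    marginalizing out the latent topic of each word."""
    per_word = phi[:, word_ids]     # shape (topics, N)
    return float(np.prod(theta @ per_word))

print(doc_likelihood([0, 2]))      # likelihood of the two-word document (0, 2)
```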

  5. Three documents with the word “play” (numbers & colors → topic assignments)

  6. Example: topics from an educational corpus (TASA) • 37K docs, 26K words • 1700 topics, e.g.:
  • PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
  • PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
  • TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
  • JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
  • HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
  • STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

  7. Polysemy. The same topics as on the previous slide, with the polysemous words highlighted: PLAY appears both in the theater topic (PLAY PLAYS STAGE AUDIENCE THEATER …) and in the sports topic (TEAM GAME BASKETBALL PLAYERS PLAYER PLAY …); CHARACTERS appears both in the printing topic and in the theater topic; COURT appears both in the sports topic and in the legal topic (JUDGE TRIAL COURT CASE JURY …).

  8. Topic Modeling Techniques. The two most common techniques are: • Probabilistic Latent Semantic Analysis (pLSA) • Latent Dirichlet Allocation (LDA) Commonly-used software packages: • Mallet, a Java toolkit for various NLP-related tasks like document classification, which includes a widely-used implementation of LDA • Stanford Topic Modeling Toolbox • A list of implementations for various topic modeling techniques

  9. The LDA Model. (Plate diagram: for each document, a topic distribution θ with latent topic variables z1 … zN and observed words w1 … wN; the topic–word distributions β are shared across documents.) For each document, • Choose θ ~ Dirichlet(α) • For each of the N words wn: • Choose a topic zn ~ Multinomial(θ) • Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.
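The generative process above can be sketched directly. The sizes, priors, and random seed below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the slide): 3 topics, 5-word vocabulary.
alpha = np.ones(3)                          # Dirichlet prior over topics
beta = rng.dirichlet(np.ones(5), size=3)    # per-topic word distributions

def generate_document(n_words):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)            # 1. theta ~ Dirichlet(alpha)
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(3, p=theta)          # 2a. z_n ~ Multinomial(theta)
        w = rng.choice(5, p=beta[z])        # 2b. w_n ~ p(w_n | z_n, beta)
        topics.append(int(z))
        words.append(int(w))
    return words, topics

words, topics = generate_document(4)
print(words, topics)
```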

  10. Applications • Visualization and exploration of a corpus • Track news stories as they evolve • Pre-processing of a corpus for document classification tasks

  11. (slide from tutorial by David Blei, KDD 2011)

  12. Microsoft’s Twahpic System http://twahpic.cloudapp.net/S4.aspx

  13. LDA/pLSA for Text Classification. Topic models are easy to incorporate into text classification: • Train a topic model on a big corpus • Decode the topic model (find the best topic/cluster for each word) on a training set • Train a classifier using the topic/cluster as a feature • On a test document, first decode the topic model, then make a prediction with the classifier
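A minimal sketch of the “topic as feature” idea, using a hand-built word→topic lookup in place of a trained, decoded topic model (all words, topic names, and labels here are invented for illustration):

```python
from collections import Counter

# Hypothetical word -> topic lookup, standing in for the per-word topic
# assignments produced by decoding a trained topic model.
word_topic = {
    "play": "theater", "stage": "theater", "drama": "theater",
    "team": "sports", "score": "sports", "coach": "sports",
}

def topic_features(doc):
    """Replace each known word by its decoded topic, then count topics."""
    return Counter(word_topic[w] for w in doc.split() if w in word_topic)

# Trivial classifier: one summed topic-count profile per label.
train = [("play stage drama", "arts"), ("team score coach play", "sport")]
profiles = {}
for doc, label in train:
    profiles.setdefault(label, Counter()).update(topic_features(doc))

def classify(doc):
    feats = topic_features(doc)
    # Score = overlap between the document's topic counts and each profile.
    return max(profiles, key=lambda lb: sum((profiles[lb] & feats).values()))

print(classify("drama stage"))
```

The point of the sketch: the classifier never sees raw words, only topic counts, which is what lets it generalize across synonyms within a topic.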

  14. Why use a topic model for classification? • Topic models help handle polysemy and synonymy • The count for a topic in a document can be much more informative than the count of individual words belonging to that topic. • Topic models help combat data sparsity • You can control the number of topics • At a reasonable choice for this number, you’ll observe the topics many times in training data (unlike individual words, which may be very sparse)

  15. Synonymy and Polysemy (example from Lillian Lee). Training documents: • auto engine bonnet tyres lorry boot • car emissions hood make model trunk • make hidden Markov model emissions normalize Synonymy: the first two documents appear dissimilar if you compare words, but are related (British vs. American car vocabulary). Polysemy: the last two documents appear similar if you compare words (make, model, emissions), but are not truly related.

  16. Sequence Models. Sequence models are models that • can predict the likelihood of a sequence of text (e.g., a sentence) • sometimes using latent state variables. Let x1, …, xN be the words in a sentence. P(x1, …, xN) is the likelihood of the sentence. We’ll look at two types of generative sequence models: N-gram models, which are slightly fancy versions of Markov models, and Hidden Markov Models, which you’ve seen before.

  17. What’s a sequence model for? • Speech Recognition: • often the acoustic model will be confused between several possible words for a given sound • Speech recognition systems choose between these possibilities by selecting the one with the highest probability, according to a sequence model • Machine translation • Often, the translation model will be confused between several possible translations of a given phrase • The system chooses between these possibilities by selecting the one with the highest probability, according to a sequence model Many other applications: • Handwriting recognition • Spelling correction • Optical character recognition • …

  18. Example Language Model Application. Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Straightforward model: estimate P(text | sound) directly. But this can be hard to train effectively.

  19. Example Language Model Application. Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Traditional solution: Bayes’ Rule — P(text | sound) = P(sound | text) × P(text) / P(sound), where P(sound | text) is the Acoustic Model (easier to train), P(text) is the Language Model, and P(sound) can be ignored: it doesn’t matter for picking a good text.
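A toy illustration of that argmax (all probabilities invented): since P(sound) is the same for every candidate transcript, it suffices to compare P(sound | text) × P(text):

```python
# Two hypothetical candidate transcripts for the same acoustic signal.
acoustic = {"recognize speech": 0.40, "wreck a nice beach": 0.55}  # P(sound | text)
language = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}  # P(text)

# Pick argmax_text P(sound | text) * P(text); P(sound) cancels out.
best = max(acoustic, key=lambda t: acoustic[t] * language[t])
print(best)
```

Note that the acoustic model slightly prefers the wrong transcript here; the language model overrules it, which is exactly the role the slide describes.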

  20. Example Language Model Application. Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Traditional solution: Bayes’ Rule — P(text | sound) = P(sound | text) × P(text) / P(sound), where P(sound | text) is the Acoustic Model (easier to train), P(text) is the Sequence Model, and P(sound) can be ignored: it doesn’t matter for picking a good text.

  21. N-gram Models. By the chain rule, P(w1, …, wT) = P(w1) × P(w2 | w1) × P(w3 | w2, w1) × P(w4 | w3, w2, w1) × … × P(wT | wT-1, …, w1). N-gram models make the assumption that the next word depends only on the previous N-1 words. For example, a 2-gram model (usually called a bigram model): P(w1, …, wT) = P(w1) × P(w2 | w1) × P(w3 | w2) × P(w4 | w3) × … × P(wT | wT-1). Notice: this is a Markov model, where the states are words.
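A bigram model can be estimated by maximum likelihood from counts. A toy sketch (the two-sentence corpus is invented, with `<s>` as a sentence-start symbol):

```python
from collections import Counter

# Tiny invented corpus for maximum-likelihood bigram estimates.
corpus = [["<s>", "the", "dog", "barks"],
          ["<s>", "the", "cat", "sleeps"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(w_prev, w):
    """MLE estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def sentence_prob(sent):
    """P(w1..wT) under the bigram (Markov) assumption."""
    p = 1.0
    for prev, w in zip(sent, sent[1:]):
        p *= p_bigram(prev, w)
    return p

print(sentence_prob(["<s>", "the", "dog", "barks"]))
```

In practice these raw MLE counts are always smoothed, since any unseen bigram would otherwise zero out the whole sentence probability.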

  22. N-gram Tools • IRSTLM, a C++ n-gram toolkit often used for speech recognition • Berkeleylm, another n-gram toolkit, in Java • Moses, a machine translation toolkit • CMU Sphinx, open-source speech recognition toolkit

  23. Some HMM Tools • jHMM, a Java implementation that’s relatively easy to use • MPI implementation of HMMs, for training large HMMs in a distributed environment
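What these toolkits do under the hood for decoding (finding the most likely hidden state sequence) is typically the Viterbi algorithm. A minimal sketch with invented parameters, 2 hidden states and 3 observation symbols:

```python
import numpy as np

# Toy HMM (all numbers invented): 2 hidden states, 3 observation symbols.
start = np.array([0.6, 0.4])            # initial state distribution
trans = np.array([[0.7, 0.3],           # trans[i, j] = P(next = j | cur = i)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],       # emit[i, o] = P(obs = o | state = i)
                 [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Most likely hidden state sequence for an observation sequence."""
    n, k = len(obs), len(start)
    delta = np.zeros((n, k))            # best path probability ending in state j
    back = np.zeros((n, k), dtype=int)  # backpointers
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] * trans   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):       # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 2]))
```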

  24. (HMM Demo)

  25. Conditional Random Fields. CRFs, like HMMs, are also latent-variable sequence models. However, CRFs are discriminative instead of generative: they cannot tell you P(x1, …, xN, z1, …, zN), and they cannot tell you P(x1, …, xN). But they are often better than HMMs at predicting P(z1, …, zN | x1, …, xN). For these reasons, they are often used as sequence labelers.

  26. Example Sequence Labeling Task Slide from Jenny Rose Finkel (ACL 2005)

  27. Other common sequence-labeling tasks • Tokenization/segmentation: SSNNNNNNNNNNNSSNSNNSSNNNNNSSNNNSS “Tokenization isn’t always easy.” (one label per character: S = a token starts here, N = the character continues the current token) • Chinese word segmentation: Pi-Chuan Chang, Michel Galley, and Chris Manning, 2008
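Given per-character S/N labels of that kind (S = a token starts at this character), recovering the tokens is a simple scan. A toy sketch assuming that label scheme:

```python
def segment(chars, labels):
    """Rebuild tokens from per-character S/N segmentation labels."""
    tokens = []
    for ch, lab in zip(chars, labels):
        if lab == "S" or not tokens:
            tokens.append(ch)       # S: start a new token
        else:
            tokens[-1] += ch        # N: extend the current token
    return tokens

# "isn't" tokenized as "is" + "n't", matching the slide's example.
print(segment("isn't", "SNSNN"))
```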

  28. Other common sequence-labeling tasks • Part-of-speech tagging (token/TAG):
  “/“ Tokenization/N is/V n’t/Adv always/Adv easy/Adj ./. ”/”
  • Relation extraction (token/TAG, BIO tags over arguments and the relation):
  Drug/O giant/O Pfizer/B-arg Inc./I-arg has/O reached/B-rel an/I-rel agreement/I-rel to/I-rel buy/I-rel the/O private/O biotechnology/O firm/O Rinat/B-arg Neuroscience/I-arg Corp./I-arg ,/O the/O companies/O announced/O Thursday/O
  → buy(Pfizer Inc., Rinat Neuroscience Corp.)

  29. ReVerb demo http://reverb.cs.washington.edu/

  30. Some CRF implementations, tools • http://teamcohen.github.io/MinorThird/, a package of a variety of NLP tools, including CRFs • http://www.chokkan.org/software/crfsuite/, a CRF toolkit • http://nlp.stanford.edu/software/, a variety of NLP tools, including several CRF models
