LING/C SC 581: Advanced Computational Linguistics

Presentation Transcript


  1. LING/C SC 581: Advanced Computational Linguistics

    Lecture Notes Jan 16th
  2. Course Webpage
    for lecture slides and Panopto recordings: http://dingo.sbs.arizona.edu/~sandiway/ling581-14/
    Meeting information
  3. Course Objectives
    Follow-on course to LING/C SC/PSYC 438/538 Computational Linguistics:
    continue with J&M: 25 chapters, a lot of material not covered in 438/538
    and gain more extensive experience dealing with natural language software packages:
    installation, input data formatting, operation, project exercises
    useful "real-world" computational experience
    abilities gained will be of value to employers
  4. Computational Facilities
    Use your own laptop/desktop
    we can also make use of the computers in this lab (Shantz 338),
    but you don't have installation rights on these computers;
    plus the alarm goes off after hours and campus police will arrive…
    Platforms
    Windows is maybe possible, but you really should run some variant of Unix (for your task #1 for this week)
    Linux (separate bootable partition or via virtualization software): de facto standard for advanced/research software
    https://www.virtualbox.org/ (free!)
    Cygwin on Windows: http://www.cygwin.com/
    a Linux-like environment for Windows, making it possible to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows
    MacOS X: not quite Linux, some porting issues, especially with C programs; can use VirtualBox
  5. Grading
    Completion of all homework tasks will result in a satisfactory grade (A).
    Tasks should be completed before the next class.
    Email me your work (sandiway@email.arizona.edu).
    Also be prepared to come up and present your work (if called upon).
  6. Homework Task 1: Install Tregex
    Computer language: Java
  7. Topic 1: Treebanks
    (538: context-free grammars, pronoun binding)
    regex search over parse trees
    (538: Perl regex on strings)
  8. Homework Task 1: Install Tregex
    We'll use the program tregex from Stanford University to explore the Penn Treebank
    current version:
  9. Penn Treebank Availability
    Source: Linguistic Data Consortium (LDC)
    U. of Arizona is a (fee-paying) member of this consortium
    Resources are made available to the community through the main library
    URL: http://sabio.library.arizona.edu/search/X
  10. Penn Treebank (V3) call record
    I have it on a USB drive here …
  11. Penn Treebank (V3) Raw data:
  12. Penn Treebank Tagging Guide Arpa94 paper Parse Guide
  13. Penn Treebank
  14. Penn Treebank sections 00-24
  15. Penn Treebank
  16. tregex
    Tregex is a Tgrep2-style utility for matching patterns in trees
    written in Java
    run-tregex-gui.command shell script
    -mx flag: the 300m default memory size may need to be increased depending on the platform
  17. tregex
    Select the PTB directory TREEBANK_3/parsed/mrg/wsj/
    Browse
    Deselect any unwanted files
  18. tregex
    Search: NP-SBJ << NNP vs. NP-SBJ < NNP
    << (dominates) vs. < (immediately dominates)
  19. tregex README-tregex.txt
  20. Leftover from last semester 438/538: Language models
  21. Language Models and N-grams
    given a word sequence w1 w2 w3 ... wn
    chain rule: how to compute the probability of a sequence of words
    p(w1 w2) = p(w1) p(w2|w1)
    p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
    ...
    p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
    note: it's not easy to collect (meaningful) statistics on p(wn|w1 ... wn-2 wn-1) for all possible word sequences
  22. Language Models and N-grams
    Given a word sequence w1 w2 w3 ... wn
    Bigram approximation: just look at the previous word only (not all the preceding words)
    Markov Assumption: finite length history
    1st order Markov Model
    p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
    p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
    note: p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
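A minimal Python sketch of the bigram approximation above: the sentence-start symbol <s> and all of the probability values in bigram_p are invented purely for illustration.

```python
# Minimal sketch: scoring a sentence under a bigram model,
# p(w1 ... wn) ~ p(w1|<s>) p(w2|w1) ... p(wn|wn-1).
# The <s> start symbol and all probability values are invented for illustration.

bigram_p = {
    ("<s>", "colorless"): 0.01,
    ("colorless", "green"): 0.20,
    ("green", "ideas"): 0.05,
    ("ideas", "sleep"): 0.02,
    ("sleep", "furiously"): 0.01,
}

def sentence_prob(words, bigram_p):
    """First-order Markov approximation of p(w1 ... wn)."""
    prob = 1.0
    for prev, curr in zip(["<s>"] + words, words):
        prob *= bigram_p.get((prev, curr), 0.0)  # unseen bigram -> 0 (hence smoothing, below)
    return prob

print(sentence_prob("colorless green ideas sleep furiously".split(), bigram_p))
```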
  23. Language Models and N-grams
    Trigram approximation: 2nd order Markov Model
    just look at the preceding two words only
    p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-2 wn-1)
    p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
    note: p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
  24. Language Models and N-grams
    estimating from corpora: how to compute bigram probabilities
    p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w)     (w is any word)
    Since Σw f(wn-1 w) = f(wn-1), the unigram frequency for wn-1:
    p(wn|wn-1) = f(wn-1 wn) / f(wn-1)     (relative frequency)
    Note: the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
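A minimal Python sketch of MLE bigram estimation by relative frequency; the toy corpus is invented for illustration.

```python
# Minimal sketch: maximum likelihood estimation of bigram probabilities
# by relative frequency, p(wn|wn-1) = f(wn-1 wn) / f(wn-1).
# The toy corpus is invented for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

unigram_f = Counter(corpus)                     # f(wn-1)
bigram_f = Counter(zip(corpus, corpus[1:]))     # f(wn-1 wn)

def p_mle(w_prev, w):
    return bigram_f[(w_prev, w)] / unigram_f[w_prev]

print(p_mle("the", "cat"))   # f(the cat)/f(the) = 2/3
```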
  25. Motivation for smoothing
    Smoothing: avoid zero probability estimates
    Consider what happens when any individual probability component is zero:
    by the arithmetic multiplication law, 0 × X = 0
    very brittle!
    even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency, particularly so for larger n-grams
    p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
  26. Language Models and N-grams
    Example: bigram frequencies f(wn-1 wn) and unigram frequencies f(wn-1)
    [bigram and unigram frequency tables omitted]
    sparse matrix: zeros render probabilities unusable (we'll need to add fudge factors, i.e. do smoothing)
  27. Smoothing and N-grams
    a sparse dataset means zeros are a problem
    Zero probabilities are a problem:
    p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)     (bigram model)
    one zero and the whole product is zero
    Zero frequencies are a problem:
    p(wn|wn-1) = f(wn-1 wn) / f(wn-1)     (relative frequency)
    bigram f(wn-1 wn) doesn't exist in the dataset
    smoothing refers to ways of assigning zero-probability n-grams a non-zero value
  28. Smoothing and N-grams
    Add-One Smoothing (4.5.1 Laplace Smoothing): add 1 to all frequency counts
    simple and no more zeros (but there are better methods)
    unigram:
    p(w) = f(w)/N     (before Add-One; N = size of corpus)
    p(w) = (f(w)+1)/(N+V)     (with Add-One; V = number of distinct words in corpus)
    f*(w) = (f(w)+1)·N/(N+V)     (with Add-One)
    N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
    bigram:
    p(wn|wn-1) = f(wn-1 wn)/f(wn-1)     (before Add-One)
    p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)     (after Add-One)
    f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)     (after Add-One)
    must rescale so that total probability mass stays at 1
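A minimal Python sketch of the Add-One formulas above, reusing the same invented toy corpus; note how a previously unseen bigram now gets a small non-zero probability.

```python
# Minimal sketch: Add-One (Laplace) smoothed bigram estimates,
# p(wn|wn-1)   = (f(wn-1 wn) + 1) / (f(wn-1) + V),
# f*(wn-1 wn)  = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V).
# Same invented toy corpus as before.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
V = len(set(corpus))                            # number of distinct words

unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))

def p_addone(w_prev, w):
    return (bigram_f[(w_prev, w)] + 1) / (unigram_f[w_prev] + V)

def f_star(w_prev, w):
    # reconstituted (discounted) count after Add-One
    return (bigram_f[(w_prev, w)] + 1) * unigram_f[w_prev] / (unigram_f[w_prev] + V)

print(p_addone("the", "cat"))    # seen bigram: discounted below its MLE value
print(p_addone("cat", "mat"))    # unseen bigram: no longer zero
print(f_star("the", "cat"))      # 3 * 3/9 = 1.0, down from the raw count of 2
```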
  29. Smoothing and N-grams
    Add-One Smoothing: add 1 to all frequency counts
    bigram: p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
    reconstituted counts: f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)
    [frequency tables omitted; see figures 6.4 and 6.8]
    Remarks: perturbation problem
    add-one causes large changes in some frequencies due to the relative size of V (1616)
    e.g. the bigram "want to": 786 → 338
  30. Smoothing and N-grams
    Add-One Smoothing: add 1 to all frequency counts
    bigram: p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
    reconstituted counts: f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)
    [probability tables omitted; see figures 6.5 and 6.7]
    Remarks: perturbation problem; similar changes in the probabilities
  31. Smoothing and N-grams
    let's illustrate the problem: take the bigram case wn-1 wn
    p(wn|wn-1) = f(wn-1 wn)/f(wn-1)
    suppose there are bigrams wn-1 wzero1 ... wn-1 wzerom that don't occur in the corpus:
    f(wn-1 wzero1) = 0, ..., f(wn-1 wzerom) = 0
    [diagram: the probability mass f(wn-1 wn)/f(wn-1) is divided entirely among the observed bigrams; the zero-frequency bigrams receive none]
  32. Smoothing and N-grams
    add-one: "give everyone 1"
    [diagram: every count is incremented to f(wn-1 wn)+1, so the previously unseen bigrams now have f(wn-1 w01) = 1, ..., f(wn-1 w0m) = 1]
  33. Smoothing and N-grams
    add-one: "give everyone 1", i.e. a redistribution of probability mass
    p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V), where V = |{wi}| is the vocabulary size
    [diagram: the incremented counts f(wn-1 wn)+1 are renormalized by f(wn-1)+V]
  34. Smoothing and N-grams
    Good-Turing Discounting (4.5.2)
    Nc = number of things (= n-grams) that occur c times in the corpus
    N = total number of things seen
    Formula: the smoothed count c* for Nc is given by c* = (c+1)·Nc+1/Nc
    Idea: use the frequency of things seen once to estimate the frequency of things we haven't seen yet
    estimate N0 in terms of N1, and so on
    but if Nc = 0, smooth that first using something like log(Nc) = a + b·log(c)
    Formula: P*(things with zero frequency) = N1/N
    smaller impact than Add-One
    Textbook Example: fishing in a lake with 8 species (bass, carp, catfish, eel, perch, salmon, trout, whitefish)
    Sample data (6 out of the 8 species, N = 18 fish): 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
    P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17
    P(next fish = trout) = 1/18 (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula…)
    revised count for trout: c*(trout) = 2·N2/N1 = 2·(1/3) = 0.67 (discounted from 1)
    revised P(next fish = trout) = 0.67/18 = 0.037
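A minimal Python sketch of the Good-Turing formulas applied to the fishing example above.

```python
# Minimal sketch: Good-Turing discounting on the fishing example above
# (N = 18 fish: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel).
from collections import Counter

counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                  # 18 fish seen
Nc = Counter(counts.values())             # Nc[c] = number of species seen c times

def c_star(c):
    """Smoothed count c* = (c+1) * Nc+1 / Nc (assumes both counts are non-zero)."""
    return (c + 1) * Nc[c + 1] / Nc[c]

p_unseen = Nc[1] / N                      # P(next fish is an unseen species) = 3/18 = 0.17
p_trout = c_star(1) / N                   # revised P(next fish = trout)

print(p_unseen, c_star(1), p_trout)       # 0.166..., 0.666..., 0.037...
```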
  35. Language Models and N-grams
    N-gram models:
    data is easy to obtain (any unlabeled corpus will do)
    they're technically easy to compute: count frequencies and apply the smoothing formula
    but just how good are these n-gram language models?
    and what can they show us about language?
  36. Language Models and N-grams
    approximating Shakespeare: generate random sentences using n-grams
    Corpus: Complete Works of Shakespeare
    Unigram (pick random, unconnected words)
    Bigram
  37. Language Models and N-grams
    Approximating Shakespeare: generate random sentences using n-grams
    Corpus: Complete Works of Shakespeare
    Trigram
    Remarks: dataset size problem
    the training set is small: 884,647 words, 29,066 different words
    29,066² = 844,832,356 possible bigrams
    for the random sentence generator, this means very limited choices for possible continuations, which means the program can't be very innovative for higher n
    Quadrigram
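A minimal Python sketch of a bigram random-sentence generator in the spirit of the Shakespeare experiment; the ten-word toy corpus is invented for illustration and shows how few continuations a small training set offers.

```python
# Minimal sketch: generating random sentences from bigram counts,
# in the spirit of the Shakespeare experiment. The ten-word toy corpus
# is invented for illustration.
import random
from collections import defaultdict, Counter

def train_bigrams(tokens):
    follow = defaultdict(Counter)                 # follow[prev][w] = f(prev w)
    for prev, curr in zip(tokens, tokens[1:]):
        follow[prev][curr] += 1
    return follow

def generate(follow, start, length=10):
    word, out = start, [start]
    for _ in range(length - 1):
        if not follow[word]:                      # no observed continuation: dead end
            break
        words, freqs = zip(*follow[word].items())
        word = random.choices(words, weights=freqs)[0]   # sample proportionally to counts
        out.append(word)
    return " ".join(out)

tokens = "to be or not to be that is the question".split()
print(generate(train_bigrams(tokens), "to"))
```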
  38. Language Models and N-grams
    A limitation: n-gram models produce ungrammatical sequences
    Treebank: potential to be a better language model
    Structural information: contains frequency information about syntactic rules
    we should be able to generate sequences that are closer to "English"…
  39. Language Models and N-grams Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
  40. Language Models and N-grams
    N-gram models + smoothing: one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
  41. Colorless green ideas
    examples: (1) colorless green ideas sleep furiously (2) furiously sleep ideas green colorless
    Chomsky (1957): ". . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
    idea: (1) is syntactically valid, (2) is word salad
    Statistical Experiment (Pereira 2002)
  42. Colorless green ideas
    examples: (1) colorless green ideas sleep furiously (2) furiously sleep ideas green colorless
    Statistical Experiment (Pereira 2002): score both sentences with a bigram language model over word pairs wi-1 wi
  43. Interesting things to Google
    example: colorless green ideas sleep furiously
    Second hit
  44. Interesting things to Google
    example: colorless green ideas sleep furiously
    first hit: compositional semantics
    a green idea is, according to well established usage of the word "green", one that is an idea that is new and untried. again, a colorless idea is one without vividness, dull and unexciting. so it follows that a colorless green idea is a new, untried idea that is without vividness, dull and unexciting. to sleep is, among other things, to be in a state of dormancy or inactivity, or in a state of unconsciousness. to sleep furiously may seem a puzzling turn of phrase but one reflects that the mind in sleep often indeed moves furiously with ideas and images flickering in and out.
  45. Interesting things to Google example colorless green ideas sleep furiously another hit: (a story) "So this is our ranking system," said Chomsky. "As you can see, the highest rank is yellow." "And the new ideas?" "The green ones? Oh, the green ones don't get a color until they've had some seasoning. These ones, anyway, are still too angry. Even when they're asleep, they're furious. We've had to kick them out of the dormitories - they're just unmanageable." "So where are they?" "Look," said Chomsky, and pointed out of the window. There below, on the lawn, the colorless green ideas slept, furiously.