Smoothing / Other Methods for Tagging

  1. Smoothing / Other Methods for Tagging

  2. Zeroes

  3. Zeroes
  • When working with n-gram models (and their variants, such as HMMs), zero probabilities can be real show-stoppers
  • Examples:
  • Zero probabilities are a problem
    P(w_1 w_2 w_3 ... w_n) ≈ P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})   (bigram model)
    one zero factor and the whole product is zero
  • Zero frequencies are a problem
    P(w_n|w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})   (relative frequency)
    if the word w_{n-1} doesn't occur in the dataset, we're dividing by zero
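To make the show-stopper concrete, here is a minimal sketch of an unsmoothed bigram model; the toy corpus and helper names are hypothetical. A single unseen bigram drives the whole sentence probability to zero.

```python
from collections import Counter

# Toy corpus (illustrative only) and its counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Unsmoothed relative frequency P(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen history: C(prev) = 0, the ratio is undefined
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    """P(w_1 ... w_n) ~ P(w_1) * P(w_2|w_1) * ... * P(w_n|w_{n-1})."""
    p = unigrams[words[0]] / N
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)  # one zero factor and the whole product is zero
    return p

print(sentence_prob("the cat sat on the sofa .".split()))  # 0.0: "the sofa" never occurred
```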

  4. Smoothing
  • Add-One Smoothing: add 1 to all frequency counts
  • Unigram
    P(w) = C(w) / N   (before Add-One)
    N = size of the corpus (number of tokens)
    P(w) = (C(w) + 1) / (N + V)   (with Add-One)
    adjusted count: C*(w) = (C(w) + 1) * N / (N + V)
    V = number of distinct words in the corpus
  • N / (N + V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
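A small sketch of Add-One for unigrams, using a hypothetical toy corpus; it shows the smoothed probability for an unseen word and the adjusted count C*(w) that Add-One effectively assigns to a seen word.

```python
from collections import Counter

corpus = "the cat sat on the mat".split()  # hypothetical toy corpus
counts = Counter(corpus)
N = len(corpus)      # corpus size in tokens
V = len(counts)      # number of distinct words

def p_mle(w):
    """P(w) = C(w) / N -- zero for any unseen word."""
    return counts[w] / N

def p_add_one(w):
    """P(w) = (C(w) + 1) / (N + V) -- never zero."""
    return (counts[w] + 1) / (N + V)

def adjusted_count(w):
    """C*(w) = (C(w) + 1) * N / (N + V), the count Add-One effectively assigns."""
    return (counts[w] + 1) * N / (N + V)

print(p_mle("dog"), p_add_one("dog"))        # 0.0 vs. a small positive probability
print(counts["the"], adjusted_count("the"))  # seen words are discounted: 2 vs. ~1.64
```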

  5. Smoothing
  • Bigram
    P(w_n|w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})   (before Add-One)
    P(w_n|w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)   (after Add-One)
    adjusted count: C*(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)
  • N-gram
    P(w_n|w_{n-k} ... w_{n-1}) = (C(w_{n-k} ... w_n) + 1) / (C(w_{n-k} ... w_{n-1}) + V)
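The same idea for bigrams, again on a hypothetical toy corpus; p_add_one and adjusted_bigram_count are illustrative names, not a standard API.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()  # toy data
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_add_one(prev, word):
    """P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def adjusted_bigram_count(prev, word):
    """C*(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)."""
    return (bigrams[(prev, word)] + 1) * unigrams[prev] / (unigrams[prev] + V)

print(p_add_one("the", "sofa"))  # unseen bigram now gets a small nonzero probability
print(p_add_one("the", "cat"))   # seen bigram is discounted relative to its MLE
```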

  6. Smoothing
  • Add-One Smoothing adjusted counts: C*(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)
  • Remarks: the perturbation problem
    Add-One causes large changes in some frequencies because of the relative size of V (1616)
    "want to": 786 → 338
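The 786 → 338 figure can be reproduced with the adjusted-count formula. The unigram count of 1215 for "want" below is an assumption chosen to be consistent with the slide's numbers (V = 1616, 786 → roughly 338); it is not stated on the slide.

```python
# Reproducing the slide's perturbation example with assumed counts.
V = 1616          # vocabulary size, from the slide
c_want_to = 786   # C("want to") before smoothing, from the slide
c_want = 1215     # C("want") -- an assumption consistent with the 786 -> 338 figure

adjusted = (c_want_to + 1) * c_want / (c_want + V)
print(round(adjusted))  # ~338: Add-One cuts the effective count by more than half
```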

  7. Smoothing

  8. Smoothing
  • Other smoothing techniques:
  • Add-delta smoothing
    P(w_n|w_{n-1}) = (C(w_{n-1} w_n) + δ) / (C(w_{n-1}) + δV)
    similar perturbations to Add-One
  • Witten-Bell Discounting
    equate zero-frequency items with frequency-1 items
    use the frequency of things seen once to estimate the frequency of things we haven't seen yet
    smaller impact than Add-One
  • Good-Turing Discounting
    N_c = number of N-grams with frequency c
    re-estimate c as c* = (c + 1) * N_{c+1} / N_c
  • Will talk about these and other methods later (n-grams)
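A minimal sketch of the Good-Turing re-estimate c* = (c + 1) * N_{c+1} / N_c on hypothetical bigram counts. Real implementations also smooth the N_c values themselves (they become noisy for large c), which is glossed over here.

```python
from collections import Counter

# Hypothetical bigram counts; in practice these come from a corpus.
bigram_counts = Counter({("want", "to"): 3, ("to", "eat"): 2, ("a", "nice"): 1,
                         ("nice", "day"): 1, ("the", "end"): 1})

# N_c = number of distinct n-grams that occur exactly c times.
freq_of_freq = Counter(bigram_counts.values())

def good_turing(c):
    """Re-estimated count c* = (c + 1) * N_{c+1} / N_c."""
    if freq_of_freq[c] == 0 or freq_of_freq[c + 1] == 0:
        return None  # the simple formula breaks down; real systems smooth the N_c first
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(good_turing(1))  # mass of once-seen items re-estimated from twice-seen items
```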

  9. Other Methods for Tagging: Transformation-Based Tagging

  10. Transformation-Based Tagging
  • Explained in Brill 1995
  • Basic method:
    Assign each word its most likely tag (the "stupid tagger")
    For example, race would be tagged NN rather than VB because
    P(NN|race) = 0.98
    P(VB|race) = 0.02
    Then alter the assigned tags using transformations
    Transformations are based on context
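A sketch of the first step (the "stupid tagger"): every word gets its most frequent tag from a tagged corpus. The tiny corpus, tag counts, and the NN default for unknown words are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus; words, tags, and counts are illustrative only.
tagged_corpus = [("the", "DT"), ("race", "NN"), ("is", "VBZ"),
                 ("to", "TO"), ("race", "NN"), ("the", "DT"), ("horse", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def most_likely_tag(word, default="NN"):
    """Step 1 of TBL: tag every word with its most frequent tag."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default  # unknown words fall back to a default tag

print([(w, most_likely_tag(w)) for w in "to race the horse".split()])
# 'race' comes out NN even after 'to'; a later transformation can correct this
```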

  11. Transformation-Based Tagging
  • Transformations based on context
  • Contextual "triggers" can be just about anything:
    Preceding tag: NN → VB when the previous tag is TO
    One of the preceding n tags: VBP → VB when one of the previous three tags is MD (modal, as in "you may read")
    Next tag: JJR → RBR when the next tag is JJ ("a more valuable player")
    One of the preceding n words: VBP → VB when a preceding word is n't ("should n't read")
    And others (morphological triggers), or combinations
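A sketch of applying one such transformation (NN → VB when the previous tag is TO). The rule representation is an assumption for illustration, not Brill's actual rule format.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag_trigger):
    """Rewrite from_tag -> to_tag wherever the preceding tag matches the trigger."""
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag_trigger:
            new_tags[i] = to_tag
    return new_tags

tags = ["TO", "NN", "DT", "NN"]            # initial tags for "to race the horse"
print(apply_rule(tags, "NN", "VB", "TO"))  # ['TO', 'VB', 'DT', 'NN']
```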

  12. Transformation-Based Tagging
  • Still a learning-based method (like HMMs)
  • Input: a correctly tagged corpus
  • Output: transformation rules
  • Rules are applied iteratively to the corpus until some threshold is reached
  • Transformation rules are drawn from a set of hand-written "metarules", such as:
    Tag A → Tag B when the preceding word is z
  • The transformation rules output are those that reduce the error to some prespecified threshold
  • Then apply the most likely tags and the learned transformations to some raw corpus
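A sketch of the greedy learning loop, under the simplifying assumption that each candidate rule is a function from a tag sequence to a new tag sequence: score every candidate by how many errors it fixes against the gold tags, keep the best one, and stop when the gain falls below a threshold.

```python
def learn_rules(initial_tags, gold_tags, candidate_rules, min_gain=1):
    """Greedy TBL loop (sketch): repeatedly keep the rule that fixes the most errors."""
    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold_tags))

    tags, learned = list(initial_tags), []
    while True:
        best_rule, best_tags, best_gain = None, None, 0
        for rule in candidate_rules:          # rules instantiated from the metarules
            new_tags = rule(tags)             # here a rule is a tags -> tags function
            gain = errors(tags) - errors(new_tags)
            if gain > best_gain:
                best_rule, best_tags, best_gain = rule, new_tags, gain
        if best_rule is None or best_gain < min_gain:
            return learned                    # improvement below threshold: stop
        tags, learned = best_tags, learned + [best_rule]
```

In a full implementation the candidate rules would be generated by instantiating each metarule template at every position of the training corpus; that enumeration is omitted here.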

  13. Transformation-Based Tagging
  • Benefits:
    Can be used for unsupervised learning
    Brill 1995 describes a tagger that achieves 95.6% accuracy, which is quite high for unsupervised learning
    Doesn't overtrain, which can happen with HMM taggers
  • Tagger available for download from:
    http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
  • Paper from:
    http://citeseer.ist.psu.edu/brill95transformationbased.html
