Part-of-Speech (POS) tagging


Eric Brill “Part-of-speech tagging”. Chapter 17 of R Dale, H Moisl & H Somers (eds) Handbook of Natural Language Processing, New York (2000): Marcel Dekker

D Jurafsky & JH Martin: Speech and Language Processing, Upper Saddle River NJ (2000): Prentice Hall, Chapter 8

CD Manning & H Schütze: Foundations of Statistical Natural Language Processing, Cambridge, Mass (1999): MIT Press, Chapter 10. [skip the maths bits if too daunting]

Word categories

  • A.k.a. parts of speech (POSs)

  • Important and useful to identify words by their POS

    • To distinguish homonyms

    • To enable more general word searches

  • POS familiar (?) from school and/or language learning (noun, verb, adjective, etc.)

Word categories

  • Recall that we distinguished

    • open-class categories (noun, verb, adjective, adverb)

    • closed-class categories (preposition, determiner, pronoun, conjunction, …)

  • While the big four are fairly clear-cut, it is less obvious exactly what and how many closed-class categories there may be

POS tagging

  • Labelling words for POS can be done by

    • dictionary lookup

    • morphological analysis

    • “tagging”

  • Identifying POS can be seen as a prerequisite to parsing, and/or a process in its own right

  • However, there are some differences:

    • Parsers often work with the simplest set of word categories, subcategorized by feature (or attribute-value) schemes

    • Indeed the parsing procedure may contribute to the disambiguation of homonyms

POS tagging

  • POS tagging, per se, aims to identify word-category information somewhat independently of sentence structure …

  • … and typically uses rather different means

  • POS tags are generally shown as labels on words:

    John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC
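The word/TAG notation above is easy to read back into a program. The following is a minimal sketch, assuming tokens are separated by whitespace and that the tag follows the last slash in each token; the function name `parse_tagged` is made up for illustration.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps its slashes and only the tag is peeled off.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = "John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC"
tagged = parse_tagged(sentence)
for word, tag in tagged:
    print(f"{word:8s} {tag}")
```

Each pair keeps the word and its label together, which is the form most of the later sketches work with.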

What is a tagger?

  • Lack of distinction between …

    • Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger”

    • The result of running such software, e.g. a tagger for English (based on the such-and-such corpus)

  • Taggers (even rule-based ones) are almost invariably trained on a given corpus

  • “Tagging” usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)

Tagging vs. parsing

  • Once tagger is “trained”, process consists of straightforward look-up, plus local context (and sometimes morphology)

  • Tagger will attempt to assign a tag to unknown words, and to disambiguate homographs

  • “Tagset” (list of categories) usually larger with more distinctions than categories used in parsing


  • Parsing usually has basic word-categories, whereas tagging makes more subtle distinctions

  • E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations

  • Parser uses maybe 12-20 categories, tagger may use 60-100

Simple taggers

  • Default tagger has one tag per word, and assigns it on the basis of dictionary lookup

    • Tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb

  • Words may be assigned different tags with associated probabilities

    • Tagger will assign most probable tag unless there is some way to identify when a less probable tag is in fact correct

  • Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences – negative rules)
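As a rough illustration of the "assign the most probable tag" idea, here is a minimal sketch of a tagger that records the most frequent tag for each word in a hand-tagged corpus. The three-sentence corpus, the function names and the choice of NP (proper noun) as the fallback for unknown words are all assumptions made for the example, not part of any particular tagger.

```python
from collections import Counter, defaultdict


def train_most_frequent(tagged_sentences):
    """Build a lexicon mapping each word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_words(words, lexicon, unknown_tag="NP"):
    """Tag each word by lexicon lookup; unknown words get `unknown_tag`."""
    return [(w, lexicon.get(w.lower(), unknown_tag)) for w in words]


# Toy training corpus (invented): 'run' is a verb twice and a noun once.
corpus = [
    [("the", "AT"), ("run", "NN")],
    [("they", "PPS"), ("run", "VB")],
    [("we", "PPS"), ("run", "VB")],
]
lexicon = train_most_frequent(corpus)
print(tag_words(["They", "run", "Manchester"], lexicon))
```

A tagger like this never looks at context, so it tags every occurrence of "run" as a verb; the tag-sequence information discussed next is what lets a tagger pick the less probable tag when it is in fact correct.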

Rule-based taggers

  • Earliest type of tagging: two stages

  • Stage 1: look up word in lexicon to give list of potential POSs

  • Stage 2: Apply rules which certify or disallow tag sequences

  • Rules originally handwritten; more recently Machine Learning methods can be used

  • cf transformation-based tagging, below

How do they work?

  • Tagger must be “trained”

  • Many different techniques, but typically …

  • Small “training corpus” hand-tagged

  • Tagging rules learned automatically

  • Rules define most likely sequence of tags

  • Rules based on

    • Internal evidence (morphology)

    • External evidence (context)

    • Probabilities

What probabilities do we have to learn?

  • (a) Individual word probabilities:

    Probability that a given tag t is appropriate for a given word w

    • Easy (in principle): learn from training corpus:

      P(t|w) = count(w tagged as t) / count(w)

      e.g. run occurs 4800 times in the training corpus: 3600 times as a verb, 1200 times as a noun: P(verb|run) = 3600/4800 = 0.75

    • Problem of “sparse data”:

      • Add a small amount to each count, so we get no zeros
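The word-probability estimate and the "add a small amount" fix for sparse data can be sketched as follows. This assumes the common add-k form of smoothing, in which k is added to every count and the total is enlarged to match; the three-tag tagset is hypothetical and far smaller than a real one.

```python
def tag_probability(tag, tag_counts, tagset, k=0.0):
    """Estimate P(tag | word) from counts, adding k to every tag's count."""
    total = sum(tag_counts.values())
    return (tag_counts.get(tag, 0) + k) / (total + k * len(tagset))


# Counts for 'run' from the example: 3600 as a verb, 1200 as a noun.
run_counts = {"verb": 3600, "noun": 1200}
tagset = ["verb", "noun", "adjective"]  # hypothetical three-tag tagset

unsmoothed = tag_probability("verb", run_counts, tagset)
smoothed_adjective = tag_probability("adjective", run_counts, tagset, k=1.0)

print(unsmoothed)          # 3600 / 4800
print(smoothed_adjective)  # 1 / 4803: small, but no longer zero
```

With k = 0 an unseen word/tag pairing gets probability zero, which would rule out any tag sequence containing it; with k = 1 it gets a small non-zero value while the probabilities for the word still sum to one.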

  • (b) Tag sequence probability:

    Probability that a given tag sequence t1,t2,…,tn is appropriate for a given word sequence w1,w2,…,wn

  • P(t1,t2,…,tn | w1,w2,…,wn ) = ???

    • Too hard to calculate for entire sequence, since each tag depends on all the tags before it:

      P(t1,t2,t3,t4,…) = P(t1) P(t2|t1) P(t3|t1,t2) P(t4|t1,t2,t3) …

    • Subsequence is more tractable

    • Sequence of 2 or 3 should be enough:

      Bigram model: P(t1,t2,…,tn) ≈ P(t1) P(t2|t1) P(t3|t2) … P(tn|t(n-1))

      Trigram model: P(t1,t2,…,tn) ≈ P(t1) P(t2|t1) P(t3|t1,t2) P(t4|t2,t3) … P(tn|t(n-2),t(n-1))

      N-gram model: P(t1,t2,…,tn) ≈ product over i of P(ti | t(i-N+1),…,t(i-1))
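A bigram tag model of this kind can be estimated by counting adjacent tag pairs. The sketch below is a minimal version under some assumptions: a hypothetical start marker `<s>` stands in for "no previous tag", so that the probability of the first tag is handled like any other transition; no smoothing is applied; and the four tag sequences are invented.

```python
from collections import Counter

START = "<s>"  # hypothetical marker for the position before the first tag


def train_bigram(tag_sequences):
    """Estimate P(tag | previous tag) by relative frequency."""
    context = Counter()   # how often each tag (or START) precedes another tag
    pairs = Counter()     # how often each (previous, current) pair occurs
    for tags in tag_sequences:
        prev = START
        for t in tags:
            context[prev] += 1
            pairs[(prev, t)] += 1
            prev = t
    return {pair: n / context[pair[0]] for pair, n in pairs.items()}


def sequence_probability(tags, transitions):
    """Multiply the bigram probabilities along a tag sequence."""
    p = 1.0
    prev = START
    for t in tags:
        p *= transitions.get((prev, t), 0.0)  # unseen pair: probability 0
        prev = t
    return p


# Toy tag sequences (invented for illustration).
corpus = [["AT", "NN", "VB"], ["AT", "NN", "VB"], ["AT", "JJ", "NN"], ["NP", "VB"]]
transitions = train_bigram(corpus)
print(sequence_probability(["AT", "NN", "VB"], transitions))
print(sequence_probability(["VB", "AT"], transitions))
```

The second sequence never starts a sentence in the toy data, so it receives probability zero, which is the kind of "illegal sequence" a smoothed model would soften.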

More complex taggers

  • Bigram taggers assign tags on the basis of sequences of two words (usually assigning a tag to word n on the basis of word n-1)

  • An nth-order tagger assigns tags on the basis of sequences of n words

  • As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations

Stochastic taggers

  • Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 1960s and 70s)

  • Most common is based on Hidden Markov Models (also found in speech processing, etc.)

(Hidden) Markov Models

  • Probability calculations imply Markov models: we assume that P(t|w) depends only on the previous word (or on a short sequence of previous words)

  • (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past

  • Markov chains can be modelled by finite state automata: the next state in a Markov chain is always dependent on some finite history of previous states

  • Model is “hidden” when the states themselves (here, the tags) cannot be observed directly: we see only the outputs they emit (the words) and must infer the state sequence behind them
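To make the HMM idea concrete, here is a minimal sketch of how a tagger might pick the most probable tag sequence once the probabilities are known. The slides do not name a decoding procedure; this uses the Viterbi algorithm, the standard dynamic-programming method for finding the best state sequence in an HMM. Tags are treated as states and words as outputs, and all names (`start_p`, `trans_p`, `emit_p`) and probability values are invented toy assumptions.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # best[i][t]: probability of the best tag path for words[:i+1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]  # back[i][t]: tag preceding t on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][t]) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag to the start.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


tags = ["AT", "NN", "VB"]
start_p = {"AT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "AT": {"AT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"AT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"AT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {  # P(word | tag); words not listed have probability 0
    "AT": {"the": 1.0},
    "NN": {"run": 0.3, "dogs": 0.7},
    "VB": {"run": 0.8, "dogs": 0.2},
}

print(viterbi(["the", "run"], tags, start_p, trans_p, emit_p))
print(viterbi(["dogs", "run"], tags, start_p, trans_p, emit_p))
```

In this toy model "run" comes out as a noun after "the" and as a verb after "dogs", so the same word receives different tags purely because of the preceding tag.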

Supervised vs unsupervised training

  • Learning tagging rules from a marked-up corpus (supervised learning) gives very good results (98% accuracy)

    • Though assigning most probable tag, and “proper noun” to unknowns will give 90%

  • But it depends on having a corpus already marked up to a high quality

  • If this is not available, we have to try something else:

    • “forward-backward” algorithm

    • A kind of “bootstrapping” approach

Forward-backward (Baum-Welch) algorithm

  • Start with initial probabilities

    • If nothing known, assume all Ps equal

  • Adjust the individual probabilities so as to increase the overall probability.

  • Re-estimate the probabilities on the basis of the last iteration

  • Continue until convergence

    • i.e. there is no improvement, or improvement is below a threshold

  • All this can be done automatically
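The re-estimation loop can be sketched for a tiny two-state HMM as below. This is a simplified illustration under several assumptions: a single short untagged word sequence, no scaling of the probabilities (so it only suits short inputs), anonymous states `S0` and `S1` rather than real tags, and invented starting values. The starting emission probabilities are deliberately slightly unequal, because if every state begins with exactly identical values the states stay interchangeable and re-estimation cannot tell them apart.

```python
def forward(obs, states, start, trans, emit):
    """alpha[t][s] = P(obs[0..t], state at time t is s)."""
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit[s][o] * sum(prev[r] * trans[r][s] for r in states)
                      for s in states})
    return alpha


def backward(obs, states, trans, emit):
    """beta[t][s] = P(obs[t+1..] | state at time t is s)."""
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][r] * emit[r][o] * nxt[r] for r in states)
                        for s in states})
    return beta


def baum_welch_step(obs, states, vocab, start, trans, emit):
    """One re-estimation pass; also returns P(obs) under the OLD parameters."""
    alpha = forward(obs, states, start, trans, emit)
    beta = backward(obs, states, trans, emit)
    likelihood = sum(alpha[-1][s] for s in states)
    n = len(obs)
    # gamma[t][s]: probability of being in state s at time t, given obs
    gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
             for t in range(n)]
    # xi[t][(r, s)]: probability of moving from r at time t to s at time t+1
    xi = [{(r, s): alpha[t][r] * trans[r][s] * emit[s][obs[t + 1]]
                   * beta[t + 1][s] / likelihood
           for r in states for s in states} for t in range(n - 1)]
    new_start = {s: gamma[0][s] for s in states}
    new_trans = {r: {s: sum(x[(r, s)] for x in xi) / sum(g[r] for g in gamma[:-1])
                     for s in states} for r in states}
    new_emit = {s: {w: sum(g[s] for g, o in zip(gamma, obs) if o == w)
                       / sum(g[s] for g in gamma)
                    for w in vocab} for s in states}
    return new_start, new_trans, new_emit, likelihood


obs = ["the", "dog", "the", "cat", "the", "dog", "the", "cat"]  # untagged text
states = ["S0", "S1"]
vocab = ["the", "dog", "cat"]
start = {"S0": 0.6, "S1": 0.4}
trans = {"S0": {"S0": 0.5, "S1": 0.5}, "S1": {"S0": 0.5, "S1": 0.5}}
emit = {"S0": {"the": 0.4, "dog": 0.3, "cat": 0.3},
        "S1": {"the": 0.3, "dog": 0.35, "cat": 0.35}}

likelihoods = []
for _ in range(10):  # a fixed number of passes stands in for a convergence test
    start, trans, emit, likelihood = baum_welch_step(
        obs, states, vocab, start, trans, emit)
    likelihoods.append(likelihood)
print(likelihoods[0], likelihoods[-1])
```

Each pass returns the probability of the text under the parameters it started from, so the recorded values should never decrease from one pass to the next; a real implementation would stop once the gain falls below a threshold.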

Transformation-based tagging

  • Eric Brill (1993)

  • Start from an initial tagging, and apply a series of transformations

  • Transformations are learned as well, from the training data

  • Captures the tagging data in far fewer parameters than stochastic models

  • The transformations learned (often) have linguistic “reality”

Transformation-based tagging

  • Three stages:

    • Lexical look-up

    • Lexical rule application for unknown words

    • Contextual rule application to correct mis-tags

  • Painting analogy

Transformation-based learning

  • Change tag a to b when:

    • Internal evidence (morphology)

    • Contextual evidence

      • One or more of the preceding/following words has a specific tag

      • One or more of the preceding/following words is a specific word

      • One or more of the preceding/following words has a certain form

  • Order of rules is important

    • Rules can change a correct tag into an incorrect tag, so another rule might correct that “mistake”

Transformation-based tagging: examples

  • if a word is currently tagged NN, and has a suffix of length 1 which consists of the letter 's', change its tag to NNS

  • if a word has a suffix of length 2 consisting of the letter sequence 'ly', change its tag to RB (regardless of the initial tag)

  • change VBN to VBD if previous word is tagged as NN

  • change VBD to VBN if next word is ‘by’
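The two suffix rules at the top of this list can be written directly as code. The following is a minimal sketch, with an invented function name and invented test words, applying the rules in the order listed.

```python
def apply_lexical_rules(tagged):
    """Apply the two suffix rules, in order, to a list of (word, tag) pairs."""
    result = []
    for word, tag in tagged:
        # Rule 1: a word tagged NN that ends in 's' becomes NNS.
        if tag == "NN" and word.endswith("s"):
            tag = "NNS"
        # Rule 2: a word ending in 'ly' becomes RB, whatever its current tag.
        if word.endswith("ly"):
            tag = "RB"
        result.append((word, tag))
    return result


print(apply_lexical_rules([("books", "NN"), ("quickly", "NN"), ("table", "NN")]))
print(apply_lexical_rules([("family", "NN")]))
```

The second call shows how blunt such rules are: "family" ends in "ly" and is retagged RB, which is exactly the kind of mistake a later rule in the sequence would need to correct.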

Transformation-based tagging: example

Example after lexical lookup

Booth/NP killed/VBN Abraham/NP Lincoln/NP

Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP

He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP

Example after application of contextual rule ‘vbn vbd PREVTAG np’

Booth/NP killed/VBD Abraham/NP Lincoln/NP

Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP

He/PPS witnessed/VBD Lincoln/NP killed/VBD by/BY Booth/NP

Example after application of contextual rule ‘vbd vbn NEXTWORD by’

Booth/NP killed/VBD Abraham/NP Lincoln/NP

Abraham/NP Lincoln/NP was/BEDZ shot/VBN by/BY Booth/NP

He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP
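The worked example can be reproduced with a few lines of code. This is a minimal sketch that handles only the two trigger types used here (PREVTAG and NEXTWORD) and assumes the four-field rule format shown in the example; the function names are invented, and a real transformation-based tagger supports many more rule templates.

```python
def parse(line):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


def render(tagged):
    """Turn (word, tag) pairs back into 'word/TAG ...' text."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


def apply_rule(tagged, rule):
    """Apply one rule of the form 'from to TRIGGER value' to a sentence."""
    from_tag, to_tag, trigger, value = rule.split()
    from_tag, to_tag = from_tag.upper(), to_tag.upper()
    out = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        if tag != from_tag:
            continue
        if trigger == "PREVTAG":
            fires = i > 0 and tagged[i - 1][1] == value.upper()
        elif trigger == "NEXTWORD":
            fires = i + 1 < len(tagged) and tagged[i + 1][0].lower() == value.lower()
        else:
            raise ValueError(f"unsupported trigger: {trigger}")
        if fires:
            out[i] = (word, to_tag)
    return out


# Sentences as tagged by lexical lookup in the example.
sentences = [
    "Booth/NP killed/VBN Abraham/NP Lincoln/NP",
    "Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP",
    "He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP",
]
rules = ["vbn vbd PREVTAG np", "vbd vbn NEXTWORD by"]  # order matters

results = []
for line in sentences:
    tagged = parse(line)
    for rule in rules:
        tagged = apply_rule(tagged, rule)
    results.append(render(tagged))
    print(results[-1])
```

Applying the rules in the opposite order gives a different result for the third sentence, which illustrates why the order of the learned rules is part of the tagger.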

Tagging – final word

  • Many taggers now available for download

  • Sometimes not clear whether “tagger” means

    • Software enabling you to build a tagger given a corpus

    • An already built tagger for a given language

  • Because a given tagger (2nd sense) will have been trained on some corpus, it will be biased towards that (kind of) corpus

    • Question of goodness of match between original training corpus and material you want to use the tagger on