PoS-Tagging theory and terminology

  1. School of Computing FACULTY OF ENGINEERING PoS-Tagging theory and terminology COMP3310 Natural Language Processing Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

  2. Reminder: PoS-tagging programs • Models behind some example PoS-tagging methods in NLTK: • Hand-coded • Statistical taggers • Brill (transformation-based) tagger • NB you don’t have to use NLTK – it is just used here to illustrate the methods
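
To make this concrete, here is a minimal NLTK sketch of the first two kinds of tagger (the corpus, regular expressions and tag values are illustrative assumptions; the Brill tagger is sketched under slides 17-18 below):

    import nltk
    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents(categories='news')

    # Hand-coded: a small regular-expression tagger with a catch-all default
    hand_coded = nltk.RegexpTagger([
        (r'.*ing$', 'VBG'),   # gerunds / progressive forms
        (r'.*ed$',  'VBD'),   # simple past
        (r'.*s$',   'NNS'),   # plural nouns
        (r'.*',     'NN'),    # everything else: common noun
    ])

    # Statistical: a unigram tagger trained on tagged sentences
    statistical = nltk.UnigramTagger(tagged_sents)

    print(hand_coded.tag(['The', 'bear', 'is', 'moving']))
    print(statistical.tag(['The', 'bear', 'is', 'moving']))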

  3. Training and Testing of Machine Learning Algorithms • Algorithms that “learn” from data see a set of examples and try to generalize from them. • Training set: • The examples the algorithm is trained on • Test set: • Also called held-out data and unseen data • Use this for evaluating your algorithm • Must be separate from the training set • Otherwise, you cheated! • “Gold standard” evaluation set • A test set that a community has agreed on and uses as a common benchmark. DO NOT USE IN TRAINING OR TESTING
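
A minimal sketch of the training/test split in NLTK, assuming the Brown corpus as the tagged data and a 90/10 split (both assumptions for illustration); the gold standard would be a further, community-agreed set kept aside entirely:

    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents(categories='news')
    cut = int(0.9 * len(tagged_sents))
    train_sents = tagged_sents[:cut]   # examples the tagger is trained on
    test_sents  = tagged_sents[cut:]   # held-out / unseen data, used only for evaluation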

  4. PoS word classes in English • Word classes are also called syntactic categories, grammatical categories or Parts of Speech • closed class type: classes with few, fixed members, function words e.g. prepositions • open class type: large classes with many members and frequent new additions, content words e.g. nouns • 8 major word classes: nouns, verbs, adjectives, adverbs, prepositions, determiners, conjunctions, pronouns • These hold in English, and in most (perhaps all) natural languages

  5. What properties define “noun”? • Semantic properties: refer to people, places and things • Distributional properties: ability to occur next to determiners, possessives, adjectives (specific locations) • Morphological properties: most occur in singular and plural • These are properties of a word TYPE, • eg “man” is a noun (usually) • Sometimes a given TOKEN may not meet all these criteria … • The men are happy … the man is happy … • They man the lifeboat (?)

  6. Subcategories • Noun • Proper Noun v Common Noun • (Mass Noun v Count Noun) • singular v plural • Count v mass (often not covered in PoS-tagsets) • Some tag-sets may have other subcategories, • e.g. NNP = common noun with Word Initial Capital • (e.g. Englishman) • A PoS-tagset often encodes morphological categories like person, number, gender, tense, case . . .

  7. Verb: action or process • VB present/infinitive teach, eat • VBZ 3rd-person-singular present (s-form) teaches, eats • VBG progressive (ing-form) teaching, eating • VBD/VBN past taught, ate/eaten • Intransitive he died, transitive she killed him, … • (transitivity usually not marked in PoS-tags) • Auxiliaries: Modal verbs e.g. can, must, may • Have, be, do can be auxiliary or main verbs • e.g. I have a present v I have given you a present

  8. Adjective: quality or property (of a thing: noun phrase) • English is simple: • JJ big, JJR comparative bigger, JJT superlative biggest • More features in other languages, eg • Agreement (number, gender) with noun • Before a noun v after “be”

  9. Adverb: quality or property of verb or adjective (or other functions…) • A hodge-podge (!) • General adverb often ends –ly slowly, happily (but NOT early) • Place adverb home, downhill • Time adverb now, tomorrow • Degree adverbs very, extremely, somewhat

  10. Function words • Preposition e.g. in of on for over with (to) • Determiner e.g. this that, article the a • Conjunction e.g. and or but because that • Pronoun e.g. personal pronouns • I we (1st person), • you (2nd person), • he she it they (3rd person) • Possessive pronouns my, your, our, their • WH-pronouns what who whoever • Others: negatives (not), interjections (oh), existential there, …

  11. Parts of “multi word expressions” • Particle – like preposition but “part of” a phrasal verb • I looked up her address v I looked up her skirt • I looked her address up v *I looked her skirt up • Big problem for PoS-tagging: common, and ambiguous • Other multi-word idioms: ditto tags

  12. Bigram Markov Model tagger • Naive Method • 1. Get all possible tag sequences of the sentence • 2. Compute the probability of each tag sequence given the sentence, using word-tag and tag-bigram probabilities • 3. Take the maximum probability • Problem: This method has exponential complexity! • Solution: Viterbi Algorithm (not discussed in this module)
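
A toy sketch of the naive method, with invented word-tag and tag-bigram probabilities (the numbers are illustrative only, not estimated from a corpus). Enumerating every tag sequence is what makes the method exponential in sentence length:

    from itertools import product

    TAGS = ['AT', 'NN', 'VB', 'BEZ', 'IN']
    # Invented toy parameters, for illustration only
    p_word_tag = {('the', 'AT'): 0.6, ('bear', 'NN'): 1e-3, ('bear', 'VB'): 5e-4,
                  ('is', 'BEZ'): 0.9, ('on', 'IN'): 0.3,
                  ('move', 'NN'): 2e-4, ('move', 'VB'): 3e-4}
    p_tag_tag = {('PERIOD', 'AT'): 0.2, ('AT', 'NN'): 0.5, ('AT', 'VB'): 0.01,
                 ('NN', 'BEZ'): 0.1, ('BEZ', 'IN'): 0.2, ('IN', 'AT'): 0.4}

    def sequence_prob(words, tags):
        """Bigram model: product of p(word|tag) * p(tag|previous tag)."""
        prob, prev = 1.0, 'PERIOD'               # sentences start after a full stop
        for w, t in zip(words, tags):
            prob *= p_word_tag.get((w, t), 0.0) * p_tag_tag.get((prev, t), 0.0)
            prev = t
        return prob

    words = ['the', 'bear', 'is', 'on', 'the', 'move']
    # Naive method: try all |TAGS|**len(words) tag sequences and keep the best
    best = max(product(TAGS, repeat=len(words)),
               key=lambda tags: sequence_prob(words, tags))
    print(best, sequence_prob(words, best))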

  13. N-gram tagger • Uses the preceding N-1 predicted tags • Also uses the unigram estimate for the current word
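
In NLTK this looks roughly as follows (a sketch, assuming the train_sents split from slide 3; contexts unseen in training are tagged None, which motivates the backoff combination on slide 23):

    import nltk

    # n = 2: condition on one preceding predicted tag plus the current word
    bigram_tagger = nltk.NgramTagger(2, train_sents)
    print(bigram_tagger.tag('The bear is on the move'.split()))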

  14. Example • p(AT NN BEZ IN AT NN | The bear is on the move) = p(the|AT) p(AT|PERIOD) × p(bear|NN) p(NN|AT) × . . . × p(move|NN) p(NN|AT) • p(AT NN BEZ IN AT VB | The bear is on the move) = p(the|AT) p(AT|PERIOD) × p(bear|NN) p(NN|AT) × . . . × p(move|VB) p(VB|AT)

  15. Bigram tagger: problems • Unknown words in new input • Parameter estimation: need a tagged training text, what if this is different genre/dialect/language-type from new input? • Tokenization of training text and new input: contractions (isn’t), multi-word tokens (New York) • crude assumptions • very short distance dependencies • tags are not conditioned on previous words • Unintuitive

  16. Transformation-based tagging • Markov model tagging captures only a small range of regularities • TB tagging was first used by Brill, 1995 • Encodes more complex interdependencies between words and tags • by learning intuitive rules from a training corpus • exploits linguistic knowledge; rules can be tuned manually

  17. Transformation Templates • Templates specify general, admissible transformations: • Change Tag1 to Tag2 if • The preceding (following) word is tagged Tag3 • The word two before (after) is tagged Tag3 • One of the two preceding (following) words is tagged Tag3 • One of the three preceding (following) words is tagged Tag3 • The preceding word is tagged Tag3 • and the following word is tagged Tag4 • The preceding (following) word is tagged Tag3 • and the word two before (after) is tagged Tag4
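
In NLTK's transformation-based learning module these template shapes can be written roughly as follows (a sketch; the selection of templates is an assumption):

    from nltk.tbl.template import Template
    from nltk.tag.brill import Pos

    templates = [
        Template(Pos([-1])),            # the preceding word is tagged Tag3
        Template(Pos([1])),             # the following word is tagged Tag3
        Template(Pos([-2])),            # the word two before is tagged Tag3
        Template(Pos([-2, -1])),        # one of the two preceding words is tagged Tag3
        Template(Pos([-3, -2, -1])),    # one of the three preceding words is tagged Tag3
        Template(Pos([-1]), Pos([1])),  # preceding word tagged Tag3 AND following word tagged Tag4
    ]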

  18. Machine Learning Algorithm • Learns rules from a tagged training corpus by specialising the templates • 1. Assume you do not know the precise tagging sequence in your training corpus • 2. Tag each word in the training corpus with its most frequent tag, e.g. move => VB • 3. Consider all possible transformations and apply the one that improves tagging most (greedy search), • e.g. Change VB to NN if the preceding word is tagged AT • 4. Retag the whole corpus applying that rule • 5. Go back to 3 and repeat until no significant improvements are reached • 6. Output all the rules you learnt, in order!
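
A sketch of this greedy loop using NLTK's BrillTaggerTrainer, with a unigram tagger as the "most frequent tag" initialisation and the templates sketched above (the training data and max_rules value are assumptions):

    import nltk
    from nltk.tag.brill_trainer import BrillTaggerTrainer

    # Step 2: initial state = most frequent tag for each word (default NN for unknowns)
    initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

    # Steps 3-5: greedily pick the template instantiation that fixes the most errors,
    # retag the corpus, and repeat until no rule gives enough improvement
    trainer = BrillTaggerTrainer(initial, templates, trace=2)
    brill_tagger = trainer.train(train_sents, max_rules=10)

    # Step 6: the ordered list of learnt rules
    for rule in brill_tagger.rules():
        print(rule)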

  19. Example: 1st cycle • First approximation: Initialise with most frequent tag (lexical information) • The/AT • bear/VB • is/BEZ • on/IN • the/AT • move/VB • to/TO • race/NN • there/RN

  20. Change VB to NN if previous tag is AT • Try all possible transformations, choose the most useful one and apply it: • The/AT • bear/NN • is/BEZ • on/IN • the/AT • move/NN • to/TO • race/NN • there/RN

  21. Change NN to VB if previous tag is TO • Try all possible transformations, choose the most useful one and apply it: • The/AT • bear/NN • is/BEZ • on/IN • the/AT • move/NN • to/TO • race/VB • there/RN

  22. Final set of learnt rules • Brill rules corresponding to syntagmatic patterns • 1. Change VB to NN if previous tag is AT • 2. Change NN to VB if previous tag is TO • Can now be applied to an untagged corpus! • uses pre-encoded linguistic knowledge explicitly • uses wider context + following context • can be expanded to word-driven templates • can be expanded to morphology-driven templates (for unknown words) • learnt rules are intuitive, easy to understand

  23. Combining taggers • Can be combined via backoff: if first tagger finds no tag (None) then try another tagger • This really only makes sense with N-gram taggers: • If trigram tagger finds no tag, backoff to bigram tagger, • if bigram tagger fails then backoff to unigram tagger • Better: combine tagger results by a voting system • Combinatory Hybrid Elementary Analysis of Text • (combines results of morphological analysers / taggers)
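
A sketch of the backoff chain in NLTK (training data as before; the NN default tag is an assumption):

    import nltk

    default = nltk.DefaultTagger('NN')
    unigram = nltk.UnigramTagger(train_sents, backoff=default)
    bigram  = nltk.BigramTagger(train_sents, backoff=unigram)
    trigram = nltk.TrigramTagger(train_sents, backoff=bigram)

    # trigram is tried first; if it finds no tag (None) it backs off to bigram,
    # then unigram, then the default tag
    print(trigram.tag('The bear is on the move'.split()))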
