Hindi Parts-of-Speech Tagging & Chunking

Baskaran S, MSRI

What's in?
• Why POS tagging & chunking?
• Approach
• Challenges
  • Unseen tag sequences
  • Unknown words
• Results
• Future work
• Conclusion
NWAI

Intro & Motivation

POS
• Parts-of-Speech
• Dionysius Thrax (ca 100 BC)
  • 8 types – noun, verb, pronoun, preposition, adverb, conjunction, participle and article
I get my thing in action. (Verb, that's what's happenin')
To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)
- Schoolhouse Rock

Tagging
Assigning the appropriate POS or lexical class marker to words in a given text
• Symbols, punctuation markers etc. are also assigned specific tag(s)

Why POS tagging?
• Gives significant information about a word and its neighbours
  • Adjective near noun
  • Adverb near verb
• Gives a clue to how a word is pronounced
  • OBject as noun
  • obJECT as verb
• Useful for speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.

Chunking
• Identifying simple phrases
  • Noun phrase, verb phrase, adjectival phrase…
• Useful as a first step to parsing
• Named entity recognition

POS tagging & Chunking

Stochastic approaches
• Availability of tagged corpora in large quantity
• Most are based on HMM
  • Weischedel ’93
  • DeRose ’88
  • Skut and Brants ’98 – extending HMM to chunking
  • Zhou and Su ’00
  • and lots more…

HMM
• Estimated from an annotated corpus
  • Tag-sequence probability
  • Word-emit probability
• Assumptions
  • Probability of a word is dependent only on its tag
  • Approximate the tag history to the most recent two tags
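
Under the two assumptions on this slide, the tagging objective can be written as below. This is the standard trigram HMM formulation, given here as a sketch; the symbols $w_i$ (the i-th word), $t_i$ (its tag) and $n$ (sentence length) are notation introduced for illustration and do not appear on the slides.

```latex
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \prod_{i=1}^{n}
    \underbrace{P(w_i \mid t_i)}_{\text{word-emit probability}}\;
    \underbrace{P(t_i \mid t_{i-2},\, t_{i-1})}_{\text{tag-sequence probability}}
```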

Structural tags
• A triple – POS tag, structural relation & chunk tag
• Originally proposed by Skut & Brants ’98
• Seven relations
• Enables embedded and overlapping chunks

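Structural relations
• Example: परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ । (roughly: "First class was also obtained in the examination, and the special prize from the Vice-Chancellor at the school also went to the same person.")
[Figure: chunk structure over the example sentence, with nodes SSF, NP and VG, Beg/End markers, and structural relation codes 00, 09, 90, 99]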

Decoding
• Viterbi is mostly used (A* or stack decoding are alternatives)
• Aims at finding the best path (tag sequence) given the observation sequence
• Possible tags are identified for each transition, with associated probabilities
• The best path is the one that maximizes the product of these transition probabilities
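
The decoding step above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: it uses first-order (bigram) transitions for brevity, whereas the slides describe a two-tag history, and the function name `viterbi`, the `floor` parameter and all toy probabilities are assumptions made up for the example.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence under a first-order (bigram) HMM.

    Probabilities are combined in log space; `floor` is a crude stand-in
    for smoothing so that unseen events do not produce log(0).
    """
    if not words:
        return []

    def logp(p):
        return math.log(max(p, floor))

    def emit(tag, word):
        return logp(emit_p.get(tag, {}).get(word, 0.0))

    def trans(prev, tag):
        return logp(trans_p.get(prev, {}).get(tag, 0.0))

    # best[i][t]: log-probability of the best path ending in tag t at word i
    best = [{t: logp(start_p.get(t, 0.0)) + emit(t, words[0]) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + trans(p, t))
            best[i][t] = best[i - 1][prev] + trans(prev, t) + emit(t, words[i])
            back[i][t] = prev

    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with made-up probabilities (illustration only)
tags = ["NN", "VFM"]
start_p = {"NN": 0.8, "VFM": 0.2}
trans_p = {"NN": {"NN": 0.3, "VFM": 0.7}, "VFM": {"NN": 0.6, "VFM": 0.4}}
emit_p = {"NN": {"dog": 0.9, "barks": 0.1}, "VFM": {"dog": 0.1, "barks": 0.9}}
best_path = viterbi(["dog", "barks"], tags, start_p, trans_p, emit_p)
```

Working in log space turns the product of probabilities into a sum, which avoids numeric underflow on long sentences.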

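• Example: अब जीवन का एक अन्य रूप उनके सामने आया । ("Now another form of life appeared before them.")
[Figure: tag lattice for the example sentence, with candidate tags JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM, SYM]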

Issues

1. Unseen tag sequences
• Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation)
• Idea is to redistribute some fraction of the probability mass of seen events to unseen ones
• Good-Turing
  • Re-estimates the probability mass of lower-count N-grams from that of higher counts
  • $c^{*} = (c + 1)\,N_{c+1} / N_c$, where $N_c$ is the number of N-grams occurring $c$ times
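
As a rough illustration of the Good-Turing idea on this slide, the sketch below re-estimates counts from frequency-of-frequency statistics. It is a simplified version under stated assumptions: the function names and the toy tag-bigram counts are invented for the example, and counts with no N-grams at the next higher count simply keep their raw value, where a production estimator would smooth the frequency-of-frequency curve instead.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count c to c* = (c + 1) * N_{c+1} / N_c."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c for every observed c
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        # Assumption: keep the raw count when N_{c+1} is zero
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    return adjusted

def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen N-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    n1 = sum(1 for c in ngram_counts.values() if c == 1)
    return n1 / total if total else 0.0


# Toy tag-bigram counts (made up for illustration)
counts = {
    ("NN", "PREP"): 3,
    ("PREP", "NN"): 2, ("NN", "VFM"): 2,
    ("JJ", "NN"): 1, ("RB", "VFM"): 1, ("QFN", "NN"): 1,
}
adjusted = good_turing_counts(counts)
reserved = unseen_mass(counts)
```

In the toy data three bigrams occur once out of ten observations in total, so three tenths of the probability mass is set aside for tag sequences never seen in training.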

2. Unseen words
• Insufficient corpus (even after 10 million words)
• Not all of them are proper names
• Treat them as rare words that occur once in the corpus – Baayen and Sproat ’96, Dermatas and Kokkinakis ’95
• Known Hindi corpus of 25 K words and unseen corpus of 6 K words
• All words vs. hapax vs. unknown
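
One way to act on the "treat them as rare words" idea is to estimate the tag distribution of unknown words from the hapax legomena (words seen exactly once) in the training data. The sketch below shows that estimate only; the function name and the toy corpus (transliterated words with tags from the tag set used in the slides) are assumptions for illustration, not the talk's data.

```python
from collections import Counter

def hapax_tag_distribution(tagged_words):
    """Estimate P(tag | unknown word) from words seen exactly once."""
    word_freq = Counter(word for word, _ in tagged_words)
    hapax_tags = Counter(tag for word, tag in tagged_words if word_freq[word] == 1)
    total = sum(hapax_tags.values())
    return {tag: n / total for tag, n in hapax_tags.items()} if total else {}


# Toy tagged corpus (made up for illustration)
corpus = [
    ("ghar", "NN"), ("ghar", "NN"),
    ("gaya", "VFM"), ("gaya", "VFM"),
    ("dilli", "NNP"), ("sundar", "JJ"), ("kitab", "NN"),
]
unknown_tag_probs = hapax_tag_distribution(corpus)
```

Comparing this distribution with the tag distribution over all words and over genuinely unknown words is the kind of "all words vs. hapax vs. unknown" analysis the slide refers to.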

Tag distribution analysis

3. Features
• Can we use other features?
  • Capitalization
  • Word endings and hyphenations
• Weischedel ’93 reports about 66% reduction in error rate with word endings and hyphenations
• Capitalizations, though useful for proper nouns, are not very effective

Contd…
• String length
• Prefix & suffix – fixed character width
• Character encoding range
• Complete analysis remains to be done
• Expected to be very effective for morphologically rich languages
• To be experimented with Tamil
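
The candidate features listed above could be extracted roughly as follows. This is a hedged sketch: the function name, the feature names and the default width of three characters are assumptions, and the Devanagari check relies on the Unicode block U+0900 to U+097F as one possible reading of "character encoding range".

```python
def affix_features(word, max_len=3):
    """Surface features of a word: length, fixed-width affixes, script range."""
    feats = {"length": len(word)}
    for k in range(1, max_len + 1):
        if len(word) > k:  # skip affixes that would cover the whole word
            feats[f"prefix{k}"] = word[:k]
            feats[f"suffix{k}"] = word[-k:]
    feats["has_hyphen"] = "-" in word
    # True only if every character lies in the Devanagari Unicode block
    feats["is_devanagari"] = bool(word) and all(
        0x0900 <= ord(ch) <= 0x097F for ch in word
    )
    return feats


english = affix_features("re-tagging")
hindi = affix_features("परीक्षा")
```

Python slices strings by code point, so a fixed-width suffix in Devanagari may cut through a consonant cluster or separate a vowel sign from its consonant; a real feature extractor would probably want to segment by grapheme cluster first.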

4. Multi-part words
• Examples
  • In/ terms/ of/
  • United/ States/ of/ America/
• More problematic in Hindi
  • United/NNPC States/NNPC of/NNPC America/NNP
  • Central/NNC government/NN
  • NNPC – compound proper noun, NNP – proper noun, NNC – compound noun, NN – noun
• How does the system identify the last word in a multi-part word?
• 10% of errors are due to this in Hindi (6 K words tested)

Results

Evaluation metrics
• Tag precision
• Unseen word accuracy
  • % of unseen words that are correctly tagged
  • Estimates how well unseen words are handled
• % reduction in error
  • Reduction in error after the application of a particular feature
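
The three metrics can be computed along the lines below. The slides do not define them formally, so this sketch assumes that tag precision means the share of tokens whose predicted tag matches the gold tag, and that error reduction is measured relative to the baseline error; all function names and the toy data are invented for the example.

```python
def tag_precision(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def unseen_word_accuracy(words, gold, predicted, training_vocab):
    """Tagging accuracy restricted to words absent from the training data."""
    unseen = [(g, p) for w, g, p in zip(words, gold, predicted)
              if w not in training_vocab]
    if not unseen:
        return 0.0
    return sum(g == p for g, p in unseen) / len(unseen)

def error_reduction(baseline_accuracy, new_accuracy):
    """Percentage reduction in error relative to the baseline."""
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error == 0.0:
        return 0.0
    return 100.0 * (baseline_error - (1.0 - new_accuracy)) / baseline_error


# Toy data (made up for illustration)
words = ["a", "b", "c", "d"]
gold = ["NN", "VFM", "NN", "JJ"]
pred = ["NN", "VFM", "NNP", "JJ"]
overall = tag_precision(gold, pred)
unseen = unseen_word_accuracy(words, gold, pred, training_vocab={"a", "b"})
gain = error_reduction(0.75, 0.80)
```

Reporting unseen-word accuracy separately matters because overall precision is dominated by frequent, well-attested words.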

Results – Tagger
• No structural tags → better smoothing
• Unseen data – significantly more unknowns

Results – Chunk tagger
• Training ≈ 22 K, development data ≈ 8 K
• 4-fold cross-validation
• Test data ≈ 5 K
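
For readers unfamiliar with the setup, 4-fold cross-validation can be sketched as below. The slides do not say how the folds were formed, so the round-robin assignment and the function name `k_fold_splits` are assumptions for illustration only.

```python
def k_fold_splits(items, k=4):
    """Yield (train, held_out) pairs; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# Toy "sentences" represented by integers (illustration only)
splits = list(k_fold_splits(list(range(8)), k=4))
```

Each model is trained on three folds and scored on the fourth, and the four scores are averaged.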

Results – Tagging error analysis
• Significant issues with nouns/multi-part words
  • NNP → NN
  • NNC → NN
• Also,
  • VAUX → VFM; VFM → VAUX and
  • NVB → NN; NN → NVB

HMM performance (English)
• > 96% reported accuracies
• About 85% for unknown words
• Advantage
  • Simple, and well suited when annotated data is available

Conclusion

Future work
• Handling unseen words
• Smoothing
• Can we exploit other features?
  • Especially morphological ones
• Multi-part words

Summary
• Statistical approaches now include linguistic features for higher accuracies
• Improvement required
  • Tagging
    • Precision – 79.22%
    • Unknown words – 41.6%
  • Chunking
    • Precision – 60%
    • Recall – 62%
