Part-of-Speech Tagging and Chunking with Maximum Entropy Model
Sandipan Dandapat
Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur
Goal • Lexical Analysis • Part-Of-Speech (POS) Tagging: assigning a part of speech (e.g. noun, verb) to each word • Syntactic Analysis • Chunking: identifying and labelling phrases, e.g. noun phrases and verb phrases
Machine Learning to Resolve POS Tagging and Chunking • HMM • Supervised (DeRose, 1988; Mcteer, 1991; Brants, 2000; etc.) • Semi-supervised (Cutting, 1992; Merialdo, 1994; Kupiec, 1992; etc.) • Maximum Entropy (Ratnaparkhi, 1996; etc.) • TB(ED)L (Brill, 1992, 1994, 1995; etc.) • Decision Tree (Black, 1992; Marquez, 1997; etc.)
Our Approach • Maximum Entropy based • Strengths: handles diverse and overlapping features; language independent; reasonably good accuracy • Limitations: data intensive; the basic model uses no sequence information
POS Tagging Schema Raw text → Possible POS Class Restriction → Disambiguation Algorithm (uses a Language Model) → Tagged text
POS Tagging: Our Approach The language model is an ME model: the current state depends on the history (features). Raw text → Possible POS Class Restriction → Disambiguation Algorithm → Tagged text
POS Tagging: Our Approach {T}: set of all tags; TMA(wi): set of tags computed by the Morphological Analyzer for word wi. The candidate tags are restricted to ti ∈ {T} or ti ∈ TMA(wi). Raw text → Disambiguation Algorithm → Tagged text
POS Tagging: Our Approach {T}: set of all tags; TMA(wi): set of tags computed by the Morphological Analyzer. Candidate tags: ti ∈ {T} or ti ∈ TMA(wi). Raw text → Beam Search → Tagged text
Disambiguation Algorithm Text: w1 … wn Tags: t1 … tn, where ti ∈ {T} for each word wi, and {T} = set of all tags
Disambiguation Algorithm Text: w1 … wn Tags: t1 … tn, where ti ∈ TMA(wi) for each word wi, and {T} = set of all tags
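The beam-search disambiguation with morphological restriction can be sketched as follows; the tagset, scoring table, and TMA lookup here are illustrative stand-ins, not the trained ME model from the paper:

```python
# Minimal beam-search decoder over tag sequences. score() is a toy
# stand-in for the conditional model p(t | h); in the real system this
# probability comes from the trained maximum entropy model.

def score(word, tag, prev_tag):
    # Hypothetical transition-flavoured scores; favours NOUN after DET.
    table = {("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.2,
             (None, "DET"): 0.9, (None, "NOUN"): 0.05, (None, "VERB"): 0.05,
             ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3}
    return table.get((prev_tag, tag), 0.1)

def allowed_tags(word, tma=None):
    # Restrict candidates to TMA(w) when the morphological analyzer
    # covers the word; otherwise fall back to the full tagset {T}.
    full = ["DET", "NOUN", "VERB"]
    return tma.get(word, full) if tma else full

def beam_search(words, beam_width=2, tma=None):
    beams = [([], 1.0)]  # (partial tag sequence, cumulative probability)
    for w in words:
        expanded = []
        for tags, p in beams:
            prev = tags[-1] if tags else None
            for t in allowed_tags(w, tma):
                expanded.append((tags + [t], p * score(w, t, prev)))
        # Keep only the beam_width best partial sequences.
        beams = sorted(expanded, key=lambda x: -x[1])[:beam_width]
    return beams[0][0]

tags = beam_search(["the", "dog", "runs"], tma={"the": ["DET"]})  # ['DET', 'NOUN', 'VERB']
```

The morphological restriction shrinks the search space per word, which is what makes the restricted model both faster and more accurate on small training sets.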
What are Features? • Feature function: a binary function of the history and the target tag • Example: f(h, t) = 1 if the current word ends in “-ing” and t = VBG; 0 otherwise
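A minimal sketch of how such binary features combine into the conditional ME distribution p(t | h) = exp(Σ λ·f(h, t)) / Z; the feature names and weights below are hypothetical (in practice the weights are learned from labelled data):

```python
import math

# Hypothetical learned weights (lambda values) for (feature, tag) pairs.
WEIGHTS = {
    ("suffix=-ing", "VERB"): 1.2,
    ("suffix=-ing", "NOUN"): 0.3,
    ("prev_tag=DET", "NOUN"): 0.5,
}

def active_features(history, tag):
    """Return the (feature, tag) pairs whose binary function fires."""
    feats = []
    if history["word"].endswith("ing"):
        feats.append(("suffix=-ing", tag))
    if history.get("prev_tag") == "DET":
        feats.append(("prev_tag=DET", tag))
    return feats

def p_tag(history, tags):
    """ME distribution: exponentiate summed weights, normalize by Z."""
    scores = {t: math.exp(sum(WEIGHTS.get(f, 0.0)
                              for f in active_features(history, t)))
              for t in tags}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

dist = p_tag({"word": "running", "prev_tag": "DET"}, ["NOUN", "VERB"])
```

Because features are simple indicator functions, overlapping and interdependent cues (suffixes, neighbouring tags, context words) can all be thrown into the same model without independence assumptions.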
POS Tagging Features [Context-window diagram] The feature set is drawn from a window around position i: the words wi-3 … wi+3 and the POS tags ti-3 … ti-1 of the already-tagged preceding words; from these the tag at position i is estimated. • 40 different experiments were conducted taking several combinations from the feature set F
Chunking Features [Context-window diagram] The feature set for chunking draws on the same window: the words wi-3 … wi+3, their POS tags, and the chunk tags ci-3 … ci-1 of the preceding words; from these the chunk tag at position i is estimated.
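One way the context-window features above might be extracted; the feature names and padding convention are illustrative assumptions, not the paper's exact encoding:

```python
# Extract word, POS-tag, and chunk-tag features from a [i-3, i+3]
# window. Only already-decided positions (i-3 .. i-1) contribute tag
# features, since tags to the right are unknown at decode time.

def window_features(words, pos_tags, chunk_tags, i):
    feats = {}
    for off in range(-3, 4):
        j = i + off
        feats[f"w[{off}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    for off in range(-3, 0):  # preceding positions only
        j = i + off
        feats[f"t[{off}]"] = pos_tags[j] if j >= 0 else "<PAD>"
        feats[f"c[{off}]"] = chunk_tags[j] if j >= 0 else "<PAD>"
    return feats

words = ["the", "big", "dog", "barked"]
pos = ["DET", "ADJ", "NN"]        # tags decided so far (positions 0..i-1)
chunks = ["B-NP", "I-NP", "I-NP"]
f = window_features(words, pos, chunks, 3)
```

Each key/value pair (e.g. `w[0]=barked`, `t[-1]=NN`) corresponds to one binary indicator feature in the ME model.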
Experiments: POS Tagging • Baseline model • Maximum Entropy models: ME (Bengali, Hindi and Telugu); ME + IMA (Bengali); ME + CMA (Bengali) • Data used
Tagset and Corpus Ambiguity • Tagset consists of 27 grammatical classes • Corpus ambiguity: the mean number of possible tags per word, measured on the tagged training data (Dermatas et al., 1995)
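Corpus ambiguity as defined above can be computed as follows; the toy corpus and the per-token averaging are illustrative assumptions:

```python
from collections import defaultdict

# Mean number of distinct tags seen (in training data) for each token's
# word type, averaged over all tokens in the corpus.

def corpus_ambiguity(tagged_corpus):
    tags_seen = defaultdict(set)
    tokens = 0
    for sentence in tagged_corpus:
        for word, tag in sentence:
            tags_seen[word].add(tag)
            tokens += 1
    total = sum(len(tags_seen[w]) for s in tagged_corpus for w, _ in s)
    return total / tokens

corpus = [[("time", "NN"), ("flies", "VBZ")],
          [("fruit", "NN"), ("flies", "NNS")]]
amb = corpus_ambiguity(corpus)  # "flies" takes 2 tags, the rest 1 -> 1.5
```

A higher value means the disambiguation algorithm has more candidate tags to choose between, so the measure gives a rough difficulty estimate for each language's corpus.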
POS Tagging Results on Development Set [Table: accuracy reported separately for known words, unknown words, and overall]
Chunking Results • Two different measures • Per-word basis • Per-chunk basis: a chunk counts as correct only when its boundaries are correctly identified and it is correctly labelled
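The two measures can be sketched as follows, assuming IOB-style chunk labels; note that this per-chunk score only counts how many gold chunks are exactly recovered (a recall-like simplification of the full precision/recall evaluation):

```python
# Per-word accuracy vs per-chunk accuracy for IOB chunk tag sequences.
# A chunk is correct only if boundaries AND label both match.

def iob_spans(tags):
    """Extract (start, end, label) spans from a well-formed IOB sequence."""
    spans, start, label = [], None, None
    for i, t in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if t.startswith("B-") or t == "O":
            if start is not None:
                spans.append((start, i, label))
                start, label = None, None
            if t.startswith("B-"):
                start, label = i, t[2:]
    return spans

def per_word_accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def per_chunk_accuracy(gold, pred):
    gold_spans = set(iob_spans(gold))
    return len(gold_spans & set(iob_spans(pred))) / len(gold_spans)

gold = ["B-NP", "I-NP", "O", "B-VP"]
pred = ["B-NP", "I-NP", "B-VP", "B-VP"]
```

On this toy pair the per-word score penalizes the spurious `B-VP`, while every gold chunk is still recovered exactly, so the two measures diverge.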
Assessment of Error Types [Table: error-type breakdown for Bengali, Hindi and Telugu]
Results on Test Set • Bengali data was tagged using the ME + IMA model • Hindi and Telugu data were tagged with the simple ME model • Chunk accuracy was measured on a per-word basis
Conclusion and Future Scope • Restricting tags with morphological information yields an efficient tagging model even when only a small amount of labelled text is available • Performance on Hindi and Telugu can be improved using morphological analyzers for those languages • Linguistic prefix and suffix information can be incorporated • More features can be explored for chunking