
Shallow Parsing for South Asian Languages



Presentation Transcript


  1. Shallow Parsing for South Asian Languages – Himanshu Agrawal

  2. Shallow Parsing • Parts of Speech Tagging: assigning grammatical classes to words in a natural language sentence. • Text Chunking: dividing the text into syntactically correlated groups of words. Example: [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in [NP September]].
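The bracketed chunks above can be recovered from per-token chunk tags. A minimal sketch, assuming the standard BIO tagging scheme (B-NP, I-NP, …) common in chunking work, which the slide does not itself specify:

```python
def bio_to_chunks(words, tags):
    """Group (word, BIO-tag) pairs into labelled chunks like [NP He]."""
    chunks, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):                      # a new chunk starts here
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)                   # continue the open chunk
        else:                                         # "O" or inconsistent tag
            current = None
    return [f"[{label} {' '.join(ws)}]" for label, ws in chunks]

words = ["He", "reckons", "the", "current", "account", "deficit"]
tags  = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
print(" ".join(bio_to_chunks(words, tags)))
# [NP He] [VP reckons] [NP the current account deficit]
```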

  3. Applications • Direct Applications • Automatic spell-checking software • Grammar suggestions (MS Word pop-ups) • Full parsing • Indirect Applications • Machine translation systems • Web search

  4. Nature of the Problem of Shallow Parsing • A classic problem of classifying input tokens into a given set of classes. • The sequence aspect: • The sequence of best classes (each token's best class in isolation). • The best sequence of classes (the jointly best tag sequence). Typically, the classifying information is the linguistic context of the word under consideration.
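The distinction between the two sequence views can be made concrete with a toy scorer. All numbers and tag scores below are invented for illustration; a real tagger would learn them:

```python
import itertools

tags = ["NN", "VB"]
# emission[i][t]: how well tag t fits token i (hypothetical scores)
emission = [{"NN": 0.6, "VB": 0.4},
            {"NN": 0.55, "VB": 0.45}]
# transition[(a, b)]: compatibility of tag b following tag a
transition = {("NN", "NN"): 0.1, ("NN", "VB"): 0.9,
              ("VB", "NN"): 0.5, ("VB", "VB"): 0.5}

# 1) sequence of best classes: independent per-token argmax
greedy = [max(tags, key=lambda t: e[t]) for e in emission]

# 2) best sequence of classes: maximize the joint score
#    (Viterbi does this efficiently; brute force suffices for 2 tokens)
def joint_score(seq):
    score = emission[0][seq[0]]
    for i in range(1, len(seq)):
        score *= transition[(seq[i - 1], seq[i])] * emission[i][seq[i]]
    return score

best = max(itertools.product(tags, repeat=2), key=joint_score)
print(greedy, list(best))   # greedy: ['NN', 'NN']; joint: ['NN', 'VB']
```

Here the per-token argmax picks NN twice, but the transition scores make NN→VB the jointly best sequence, which is exactly why sequence models beat independent classification.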

  5. Shallow Parsing for English • The problem has been studied extensively for English. • Very efficient systems exist, for example: • Brill's tagger ('95): transformation-based learning. • Adwait Ratnaparkhi ('99): parsing with maximum entropy. • Significant effect on the development of MT systems for European languages.

  6. Shallow Parsing for South Asian Languages • Portability of shallow parsing systems across languages? Not good! • Inflectional richness of the languages. * Training on 22,000 words and testing on 5,000 words.

  7. Challenges with Indian Languages • Poor disambiguation between certain POS categories, for example: • NNP vs. NNC (Error Type 1) • JJ vs. NN (Error Type 2) • Inflectional richness of the language. • Absence of markers such as the capitalization of proper nouns (is that "Raj" the name, or "raj" the common noun?).

  8. On Improving the Performance for Hindi and Other South Asian Languages There are two ways: • Improving the classifying information, by using better features, language-specific information, or both. • Improving the learning, by better training and better inference.

  9. A. POS Tagging: For Better Training and Inference • Approach 1: training on a hierarchical structure of tags. • Approach 2: building a knowledge database from raw/un-annotated text to use as a look-up.

  10. Approach 1: Training on a Hierarchical Tagset • Training in steps, on a hierarchical structure of classes (training level 1, then level 2).

  11. Approach 1: Training on a Hierarchical Tagset • The approach was devised to minimize the number of errors made within a family of classes. • Results: 73.33% • Reasons: • No mechanism to correct errors made in level 1 of training. • Jittered language constructs while training in level 2.
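The two-level scheme can be sketched as a coarse classifier followed by a fine classifier restricted to one family. The family map and function names below are hypothetical illustrations, not the tagset actually used in the experiments:

```python
# Hypothetical coarse families for a few fine tags (illustrative only)
FAMILY = {"NN": "NOUN", "NNP": "NOUN", "NNC": "NOUN",
          "VM": "VERB", "VAUX": "VERB",
          "JJ": "MOD", "RB": "MOD"}

def level1(token, coarse_model):
    """Level 1: classify a token into a coarse family, e.g. NOUN."""
    return coarse_model(token)

def level2(token, family, fine_models):
    """Level 2: classify within one family; candidates are restricted
    to the fine tags belonging to that family."""
    candidates = [t for t, f in FAMILY.items() if f == family]
    return fine_models[family](token, candidates)

# The slide's caveat is visible in the structure: if level 1 picks the
# wrong family, level 2 never even sees the correct fine tag.
```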

  12. Approach 2: Building a Knowledge Database for Look-up • The knowledge database consists of words and the POS tags each is known to have occurred with. • Why is it important? Inflectional richness vs. per-class ambiguity.

  13. Building the Knowledge Database • Adding words and their POS tags from the training data. • Training on 22,000 words with gold-standard POS tags, creating a training model `A`. • Using model `A` to annotate raw text of 2 lakh (200,000) words. • Extracting the word/POS-tag pairs tagged with a very high confidence measure, and adding them to the database.

  14. Using the Knowledge Database • For the final tagging: • We use model `A` to get the probability of each tag being associated with a word, i.e. P(tag_i | word), for every tag and every word in the test data. • If a word is found in the database, we choose the tag in its entry that has the highest probability. • If it is not found, we let the tag predicted in the first run remain unchanged.
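The look-up scheme of slides 12–14 can be sketched in a few lines. The words, tags, and probabilities below are hypothetical; `retag` stands in for the second pass over model `A`'s per-tag probabilities:

```python
from collections import defaultdict

def build_database(annotated_corpus):
    """annotated_corpus: iterable of (word, tag) pairs, drawn from the
    gold data plus high-confidence automatic annotations of raw text."""
    db = defaultdict(set)
    for word, tag in annotated_corpus:
        db[word].add(tag)
    return db

def retag(word, tag_probs, db, first_run_tag):
    """tag_probs: dict tag -> P(tag | word) from model A."""
    if word in db:
        # choose the highest-probability tag among those in the entry
        return max(db[word], key=lambda t: tag_probs.get(t, 0.0))
    return first_run_tag        # unseen word: keep the first-run tag

db = build_database([("raja", "NN"), ("raja", "NNP"), ("ne", "PSP")])
probs = {"NN": 0.3, "NNP": 0.5, "JJ": 0.2}
print(retag("raja", probs, db, "JJ"))   # NNP: best tag within the entry
```

Restricting the choice to the tags a word has actually been seen with is what trades inflectional richness against per-class ambiguity: the database prunes tags the word never takes.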

  15. Approach 2 • Results: 84.90 %

  16. Training Model `A` • We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al., 2005). • We use simple, language-independent features: • Word window [-2, 2]. • Suffix information (last 2, 3, and 4 characters). • Presence of special characters. • Word length.
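The feature set above can be sketched as an extraction function for one token position. The feature names and dict format are illustrative, not the actual CRF feature template used in the experiments:

```python
import re

def features(words, i):
    """Language-independent features for the token at position i."""
    w = words[i]
    feats = {
        "len": len(w),                                # word length
        "has_special": bool(re.search(r"[^\w]", w)),  # special characters
    }
    # suffix information: last 2, 3, and 4 characters
    for n in (2, 3, 4):
        feats[f"suf{n}"] = w[-n:]
    # word window [-2, +2], padded at sentence boundaries
    for off in (-2, -1, 0, 1, 2):
        j = i + off
        feats[f"w{off:+d}"] = words[j] if 0 <= j < len(words) else "<PAD>"
    return feats

print(features(["raja", "ne", "khana", "khaya"], 2))
```

Note that none of these features consult a lexicon or morphological analyzer, which is what makes the set language-independent.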

  17. B. Chunking • We follow the approach of Anirudh and Himanshu ('06, NWAI). • Two-step training: • Training on a boundary-label scheme for extracting chunk labels. • Training on boundaries with the added information of chunk labels.

  18. Chunking (cont.) • Training for identifying chunk tags is also done using a linear-chain implementation of CRFs. • Features: • Word window of [-2, 2]. • POS-tag window of [-2, 2]. • Chunk labels, for chunk boundary identification, in window [-2, 0].
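The two-step pipeline of slides 17–18 can be sketched as follows. The stand-in lambdas play the role of the two trained CRF models, and the tag strings are illustrative:

```python
def chunk_two_step(words, predict_step1, predict_step2):
    """Two-step chunking sketch.
    predict_step1: words -> boundary-label tags (e.g. "B-NP", "I-NP")
    predict_step2: (words, labels) -> boundary tags ("B"/"I"/"O"),
    with the step-1 chunk labels available as features."""
    step1 = predict_step1(words)
    labels = [t.split("-", 1)[1] if "-" in t else "O" for t in step1]
    boundaries = predict_step2(words, labels)
    return [f"{b}-{l}" if b != "O" else "O"
            for b, l in zip(boundaries, labels)]

# Stand-in predictors for illustration (a real system would use CRFs):
s1 = lambda ws: ["B-NP", "I-NP", "B-VP"]
s2 = lambda ws, ls: ["B", "I", "B"]
print(chunk_two_step(["the", "deficit", "narrowed"], s1, s2))
# ['B-NP', 'I-NP', 'B-VP']
```

The point of the second pass is that boundary decisions can condition on the labels predicted in the first pass (the [-2, 0] chunk-label window above), which a single joint pass would not see.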

  19. Chunking • Results 92.69 %

  20. Consolidated Results • **The results below are calculated on the development data.

  21. Conclusions • Train on a tagset optimal for capturing the language patterns. • If training is done in more than one step, especially such that tags in a subsequent step depend directly on tags in the present step, then it is important that there exists a way to re-tag the mis-tagged tokens.

  22. References: • Charles Sutton, An Introduction to Conditional Random Fields for Relational Learning. • Adwait Ratnaparkhi, 1998, Maximum Entropy Models for Natural Language Ambiguity Resolution, Dissertation in Computer and Information Science, University of Pennsylvania. • Akshay Singh, Sushma Bendre, Rajeev Sangal, 2005, HMM Based Chunker for Hindi, IIIT Hyderabad. • Thorsten Brants, 2000, TnT - A Statistical Part-of-Speech Tagger, Proceedings of the Sixth Conference on Applied Natural Language Processing, 224-231. • Himanshu Agrawal, Anirudh Mani, 2006, Part Of Speech Tagging and Chunking Using Conditional Random Fields, Proceedings of the NLPAI ML Contest Workshop, National Workshop on Artificial Intelligence.
