Automatic Language Identification – A Syntactic Approach

Mahesh Soundalgekar
CFILT, IIT Bombay

The Road Map
• Introduction
• System Architecture
• Classification Approaches
• Experimental Results
• Summary and Future Work

Introduction
• Goal: efficiently crawl Web pages in a given language
  • Marathi in our case
• Different languages use the same Devanagari script
  • E.g. Marathi, Sanskrit and Hindi
• It is therefore necessary to accurately distinguish one language from the others
• We take a syntactic approach to this problem, which has given excellent results with 2 MB of training data and 10 MB of test data

System Architecture
• HTML documents in different encodings, such as Xdvng and DV-TTYogesh
• → HTML to ASCII
• → Plain text + font information
• → Appropriate encoding converter
• → Plain text in ISCII encoding
• → Classifier
• → Classification results

Classification Approaches
• Most frequently occurring common words
  • E.g. English: the, an, is, at, a
• N-grams (most frequent character sequences)
  • Bi-grams: th, ’s, re, en
  • Tri-grams: the, ing, ion
  • Quad-grams: tion, as in classification, association, gratification

Important Factors
• Size of the training data: important for capturing the syntactic essence of a language
• Domains of the training data: usage varies from domain to domain and from author to author
• Size of the test data: a small test document may not contain enough information for classification
• The common words approach requires linguistic knowledge

Classifier Architecture
• Training samples → Generate profiles → Category profiles
• Test document → Generate profile → Document profile
• Category profiles + document profile → Measure profile distances → Find minimum distance → Identify category
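The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the talk: the function names, the `top_k` cut-off, the n-gram sizes, and the choice of `max_value = len(category_profile)` are all hypothetical, and the toy training samples are made up.

```python
from collections import Counter

def build_profile(text, sizes=(2, 3, 4), top_k=300):
    """Rank the character n-grams of a text by descending frequency."""
    counts = Counter()
    for word in text.split():
        for n in sizes:
            # Assumed padding: one leading "_" and n-1 trailing "_".
            padded = "_" + word + "_" * (n - 1)
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Ties are broken alphabetically so that profiles are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]

def profile_distance(category_profile, document_profile):
    """Sum of rank differences; n-grams missing from the category cost max_value."""
    max_value = len(category_profile)  # assumption: penalty for a missing n-gram
    rank = {gram: i for i, gram in enumerate(category_profile)}
    return sum(abs(rank[gram] - i) if gram in rank else max_value
               for i, gram in enumerate(document_profile))

def classify(document, category_profiles):
    """Return the category whose profile is closest to the document profile."""
    document_profile = build_profile(document)
    return min(category_profiles,
               key=lambda name: profile_distance(category_profiles[name],
                                                 document_profile))

# Toy training samples (illustrative only).
samples = {
    "english": "the cat is on the mat and the dog is at the door",
    "german": "der hund und die katze sind in dem haus und das ist gut",
}
profiles = {name: build_profile(text) for name, text in samples.items()}
print(classify("the bird is at the window", profiles))
```

Category profiles are built once from the training samples; only the document profile and the distance computation are needed per test document.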

Common Words Approach
• A list of selected common words is kept for each language
• The lists are matched against the test document
• The closest match gives the language of the document
• Advantages:
  • Intuitive
  • Computationally efficient
  • Space efficient
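A minimal sketch of the common words idea follows. The scoring rule (fraction of tokens found in a language's list) is an assumption made for illustration, since the slides do not say how the "closest match" is computed. The English list is the one given earlier in the talk; the German list and all function names are hypothetical.

```python
def common_word_score(text, common_words):
    """Fraction of whitespace-separated tokens that appear in the word list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in common_words)
    return hits / len(tokens)

def identify_language(text, word_lists):
    """Pick the language whose common-word list matches the text best."""
    return max(word_lists,
               key=lambda lang: common_word_score(text, word_lists[lang]))

word_lists = {
    "english": {"the", "an", "is", "at", "a"},        # from the slides
    "german": {"der", "die", "und", "ist", "das"},    # illustrative assumption
}

print(identify_language("the cat is at the door", word_lists))
```

Each document is scanned once and each language needs only a handful of words, which matches the efficiency advantages listed above.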

Top 5 Marathi Common Words
• व
• आणि
• आहे
• या
• ते

N-Grams Approach
• JAVA
  • Bi-grams: _J, JA, AV, VA, A_
  • Tri-grams: _JA, JAV, AVA, VA_, A__
  • Quad-grams: _JAV, JAVA, AVA_, VA__, A___
• मदत
  • Bi-grams: _म, मद, दत, त_
  • Tri-grams: _मद, मदत, दत_, त__
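The padding scheme implied by the JAVA example (one leading underscore, n-1 trailing underscores) can be reproduced with a short function. The name `char_ngrams` and its parameters are hypothetical, chosen only for this sketch.

```python
def char_ngrams(word, n, pad="_"):
    """Return the padded character n-grams of a word, in order."""
    # One pad character in front, n-1 behind, as in the JAVA example.
    padded = pad + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (2, 3, 4):
    print(n, char_ngrams("JAVA", n))
```

With this padding a word of length k always yields k + 1 n-grams, whatever the value of n.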

Measuring Distances: Out_of_Place()
• Category profile, sorted in descending order: A, ER, ING, AND, ON
• Test profile, sorted in descending order: AR, AND, ER, ED, ON
• Out-of-place value of each test n-gram: AR = max_value, AND = 2, ER = 1, ED = max_value, ON = 0
• Distance = 3 + 2 * max_value
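The worked example on this slide can be checked with a small function. This is a sketch of the out-of-place measure as the slide presents it; the function signature and the concrete `max_value` of 10 are assumptions for illustration.

```python
def out_of_place(category_profile, test_profile, max_value):
    """Sum, over test n-grams, of how far each is from its category rank."""
    category_rank = {gram: rank for rank, gram in enumerate(category_profile)}
    total = 0
    for test_rank, gram in enumerate(test_profile):
        if gram in category_rank:
            # N-gram found: cost is the difference between the two ranks.
            total += abs(category_rank[gram] - test_rank)
        else:
            # N-gram absent from the category profile: fixed penalty.
            total += max_value
    return total

category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]

# AR and ED are missing (2 * max_value); AND is 2 places off, ER 1, ON 0.
print(out_of_place(category, test, max_value=10))  # 3 + 2 * 10 = 23
```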

Extensions to N-Grams Method
• Lowest granularity
  • आदित्य = अ + ा + ि + द + त् + य
• Letter granularity
  • आदित्य = आ + दि + त् + य
• Conjunct granularity
  • आदित्य = आ + दि + त्य
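The letter and conjunct granularities can be approximated on Unicode text as sketched below. Several assumptions apply: the talk works on ISCII or font-encoded text, not Unicode; the example word is read here as आदित्य; and the function names are hypothetical. The code point split is only a rough analogue of the slide's lowest granularity, which follows the font's glyph sequence.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (halant)

def code_point_units(word):
    """Finest split available in Unicode: one unit per code point."""
    return list(word)

def letter_units(word):
    """Attach dependent signs (matras, virama) to the preceding character."""
    units = []
    for ch in word:
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def conjunct_units(word):
    """Merge a letter ending in virama with the letter that follows it."""
    units = []
    for unit in letter_units(word):
        if units and units[-1].endswith(VIRAMA):
            units[-1] += unit
        else:
            units.append(unit)
    return units

word = "\u0906\u0926\u093F\u0924\u094D\u092F"  # assumed example word
print(" + ".join(letter_units(word)))
print(" + ".join(conjunct_units(word)))
```

N-grams can then be formed over these units instead of over raw characters, so that a bi-gram at conjunct granularity spans two full syllable-like units.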

Experimental Training Setup

Category Profiles Generated through Training

Classification Results

Summary and Future Work
• Good results have been obtained through syntactic classification
• The common words technique is computationally the most efficient, but less accurate
• Our extensions to the n-grams technique give the desired accuracy
• The n-grams technique is robust to syntax errors
• The n-grams technique does not require linguistic knowledge
• We will use language identification to select a good starting set of pages for crawling by a general-purpose search engine
