
Automatic Language Identification – A Syntactic Approach



  1. Automatic Language Identification – A Syntactic Approach Mahesh Soundalgekar CFILT, IIT Bombay

  2. The Road Map
     • Introduction
     • System Architecture
     • Classification Approaches
     • Experimental Results
     • Summary and Future Work

  3. Introduction
     • Goal: efficiently crawl Web pages in a given language (Marathi in our case)
     • Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi
     • It is therefore necessary to accurately distinguish one language from the others
     • We take a syntactic approach to this problem, which has given us excellent results with 2 MB of training data and 10 MB of test data

  4. System Architecture
     HTML documents in different encodings (such as Xdvng, DV-TTYogesh)
       → HTML to ASCII
       → Plain text + font information
       → Appropriate encoding converter
       → Plain text in ISCII encoding
       → Classifier
       → Classification results

  5. Classification Approaches
     • Most frequently occurring common words
       • e.g. English: the, an, is, at, a, etc.
     • N-grams (most frequent character sequences)
       • Bi-grams: th, ’s, re, en
       • Tri-grams: the, ing, ion
       • Quad-grams: tion, as in classification, association, gratification, etc.

  6. Important Factors
     • Size of the training data: important to capture the syntactic essence of a language
     • Domains of the training data: usage varies from domain to domain and from author to author
     • Size of the test data: small test data may not contain enough information for classification
     • The common words approach requires linguistic knowledge

  7. Classifier Architecture
     Training samples → Generate profiles → Category profiles
     Test document → Generate profile → Document profile
     Category profiles + document profile → Measure profile distances → Find minimum distance → Identify category
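The classifier flow on this slide can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_profile`, `profile_distance`, `classify`), the profile size of 300, the use of 2- to 3-grams, and the tiny training strings are all assumptions made for the example.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; also used as the penalty for unseen n-grams


def build_profile(text, max_n=3, top_k=PROFILE_SIZE):
    """Return the top_k most frequent padded n-grams, most frequent first."""
    counts = Counter()
    for word in text.split():
        for n in range(2, max_n + 1):
            padded = "_" + word + "_" * (n - 1)
            counts.update(padded[i:i + n] for i in range(len(word) + 1))
    return [gram for gram, _ in counts.most_common(top_k)]


def profile_distance(doc_profile, category_profile, max_value=PROFILE_SIZE):
    """Sum of rank differences; n-grams missing from the category cost max_value."""
    ranks = {gram: rank for rank, gram in enumerate(category_profile)}
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else max_value
        for rank, gram in enumerate(doc_profile)
    )


def classify(document, category_profiles):
    """Return the category whose profile is closest to the document profile."""
    doc_profile = build_profile(document)
    return min(
        category_profiles,
        key=lambda name: profile_distance(doc_profile, category_profiles[name]),
    )


# Toy training samples (assumed, far too small for real use).
training_samples = {
    "english": "the cat and the dog are in the house with the other animals",
    "marathi": "ते आणि या आहे व ते आहे",
}
category_profiles = {
    name: build_profile(text) for name, text in training_samples.items()
}

print(classify("the dog is in the house", category_profiles))
print(classify("आणि ते आहे", category_profiles))
```

A fixed `max_value` is used for every category so that distances to profiles of different lengths stay comparable.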

  8. Common Words Approach
     • A list of selected common words is matched against the test documents
     • The closest match gives the language of the document
     • Advantages:
       • Intuitive
       • Computationally efficient
       • Space efficient
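One possible reading of the common words approach is sketched below. The scoring rule (fraction of tokens found in each list) and the names `common_word_score` and `identify_language` are assumptions; the slide does not say how "closest match" is computed. The English list comes from slide 5, and the Marathi list is an assumed Unicode rendering of the five words on slide 9.

```python
def common_word_score(text, common_words):
    """Fraction of whitespace-separated tokens that appear in common_words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in common_words)
    return hits / len(tokens)


def identify_language(text, word_lists):
    """Return the language whose common-word list matches the text best.

    On a tie (including no hits at all), the first language in word_lists wins.
    """
    return max(word_lists, key=lambda lang: common_word_score(text, word_lists[lang]))


WORD_LISTS = {
    "english": {"the", "an", "is", "at", "a"},
    "marathi": {"व", "आणि", "आहे", "या", "ते"},  # assumed Unicode forms
}

print(identify_language("the cat is at home", WORD_LISTS))
print(identify_language("ते घर आहे", WORD_LISTS))
```

Set lookups keep the per-token cost constant, which fits the slide's point about computational and space efficiency.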

  9. Top 5 Marathi Common Words
     • व
     • आणि
     • आहे
     • या
     • ते

  10. N-Grams Approach
      • JAVA
        • Bi-grams: _J, JA, AV, VA, A_
        • Tri-grams: _JA, JAV, AVA, VA_, A__
        • Quad-grams: _JAV, JAVA, AVA_, VA__, A___
      • मदत
        • Bi-grams: _म, मद, दत, त_
        • Tri-grams: _मद, मदत, दत_, त__
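The padding scheme in the JAVA example (one leading underscore, n − 1 trailing underscores) can be reproduced with a few lines of Python. The function name `padded_ngrams` is an assumption for illustration.

```python
def padded_ngrams(word, n, pad="_"):
    """Return the n-grams of word with one leading and n - 1 trailing pads."""
    padded = pad + word + pad * (n - 1)
    # A word of length L yields L + 1 n-grams under this padding scheme.
    return [padded[i:i + n] for i in range(len(word) + 1)]


for n in (2, 3, 4):
    print(n, padded_ngrams("JAVA", n))
# 2 ['_J', 'JA', 'AV', 'VA', 'A_']
# 3 ['_JA', 'JAV', 'AVA', 'VA_', 'A__']
# 4 ['_JAV', 'JAVA', 'AVA_', 'VA__', 'A___']
```

Python slices strings by Unicode code point, so this works directly for words made only of base letters; words with vowel signs or conjuncts would be split inside a letter unless they are first tokenized into larger units.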

  11. Measuring Distances: Out_of_Place()
      Both profiles are sorted in descending order of frequency.

      Category profile   Test profile   Out-of-place value (of the test n-gram)
      A                  AR             max_value
      ER                 AND            2
      ING                ER             1
      AND                ED             max_value
      ON                 ON             0

      Distance = 3 + 2 * max_value
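The out-of-place distance from this slide can be sketched as below, assuming the first profile listed is the category profile and the second is the test profile. The function name and the choice of `max_value=5` are assumptions; the slide leaves `max_value` symbolic.

```python
def out_of_place(test_profile, category_profile, max_value):
    """Sum, over test n-grams, of how far each is from its rank in the category.

    N-grams that do not occur in the category profile cost max_value.
    """
    rank_in_category = {gram: rank for rank, gram in enumerate(category_profile)}
    distance = 0
    for test_rank, gram in enumerate(test_profile):
        if gram in rank_in_category:
            distance += abs(test_rank - rank_in_category[gram])
        else:
            distance += max_value
    return distance


category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]

# AR and ED are missing (2 * max_value); AND is 2 places off, ER 1, ON 0.
print(out_of_place(test, category, max_value=5))  # 3 + 2 * 5 = 13
```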

  12. Extensions to N-Grams Method
      • Lowest granularity
        • आदित्य = अ + ा + ि + द + त + य
      • Letter granularity
        • आदित्य = आ + दि + त + य
      • Conjunct granularity
        • आदित्य = आ + दि + त्य
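The three granularities can be approximated on Unicode text as sketched below. This is an assumption-laden illustration: the slides work on ISCII text, whereas this sketch uses Unicode code points, so the lowest-granularity split will not match the slide unit for unit. The function names and the example word आदित्य (written with escapes) are chosen for the example.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama, which joins consonants into conjuncts


def lowest_granularity(word):
    """Split into individual Unicode code points."""
    return list(word)


def letter_granularity(word):
    """Attach combining marks (vowel signs, virama) to the preceding letter."""
    units = []
    for ch in word:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def conjunct_granularity(word):
    """Additionally merge a letter into the previous unit if that ends in virama."""
    units = []
    for unit in letter_granularity(word):
        if units and units[-1].endswith(VIRAMA):
            units[-1] += unit
        else:
            units.append(unit)
    return units


word = "\u0906\u0926\u093f\u0924\u094d\u092f"  # आदित्य
print(lowest_granularity(word))
print(letter_granularity(word))
print(conjunct_granularity(word))
```

N-grams would then be built over these units instead of over raw characters, so that an n-gram never cuts through the middle of a letter or conjunct.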

  13. Experimental Training Setup

  14. Category Profiles Generated through Training

  15. Classification Results

  16. Summary and Future Work
      • Good results have been obtained through syntactic classification
      • The common words technique is computationally the most efficient, but less accurate
      • Our extensions to N-grams give the desired accuracy
      • The N-grams technique is robust to syntax errors
      • The N-grams technique does not require linguistic knowledge
      • We will use language identification techniques to identify a good starting set of pages for crawling for the general-purpose search engine
