Automatic Language Identification – A Syntactic Approach

Presentation Transcript

  1. Automatic Language Identification – A Syntactic Approach
     Mahesh Soundalgekar, CFILT, IIT Bombay

  2. The Road Map
     • Introduction
     • System Architecture
     • Classification Approaches
     • Experimental Results
     • Summary and Future Work

  3. Introduction
     • Goal: efficiently crawl Web pages in a given language
       • Marathi in our case
     • Different languages use the same Devanagari script
       • e.g. Marathi, Sanskrit and Hindi
     • It is therefore necessary to distinguish one language from the others accurately
     • We take a syntactic approach to this problem, which has given us excellent results with 2 MB of training data and 10 MB of test data

  4. System Architecture
     HTML documents in different encodings (such as Xdvng, DV-TTYogesh)
       → HTML to ASCII
       → Plain text + font information
       → Appropriate encoding converter
       → Plain text in ISCII encoding
       → Classifier
       → Classification results

  5. Classification Approaches
     • Most Frequently Occurring Common Words
       • e.g. English: the, an, is, at, a, etc.
     • N-Grams (Most Frequent Character Sequences)
       • Bi-grams: th, ’s, re, en
       • Tri-grams: the, ing, ion
       • Quad-grams: tion, as in classification, association, gratification, etc.

  6. Important Factors
     • Size of the training data: important for capturing the syntactic essence of a language
     • Domains of the training data: usage varies from domain to domain and from author to author
     • Size of the test data: small test data may not contain enough information for classification
     • The common words approach requires linguistic knowledge

  7. Classifier Architecture
     • Training samples → Generate profiles → Category profiles
     • Test document → Generate profile → Document profile
     • Category profiles + document profile → Measure profile distances → Find minimum distance → Identify category
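The flow on this slide can be sketched end to end in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy training strings, the use of plain character bi-grams, and the rank-based distance with a fixed penalty for missing n-grams are all choices made for the example.

```python
from collections import Counter

def make_profile(text, n=2, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def profile_distance(doc_profile, cat_profile, max_value):
    """Sum of rank differences; n-grams absent from the category cost max_value (assumed penalty)."""
    return sum(
        abs(rank - cat_profile[gram]) if gram in cat_profile else max_value
        for gram, rank in doc_profile.items()
    )

def classify(document, category_profiles, top_k=300):
    """Profile the document, measure its distance to every category, return the closest."""
    doc_profile = make_profile(document, top_k=top_k)
    distances = {
        name: profile_distance(doc_profile, profile, top_k)
        for name, profile in category_profiles.items()
    }
    return min(distances, key=distances.get), distances

# Toy training samples (hypothetical; real profiles need far more text).
training = {
    "english": "the cat sat on the mat and the dog ate the other thing",
    "german": "der hund und die katze sind in dem haus und der garten ist klein",
}
profiles = {name: make_profile(text) for name, text in training.items()}
label, distances = classify("the thing on the mat", profiles)
print(label, distances)
```

Keeping profile generation identical for training samples and test documents is what makes the two kinds of profile directly comparable in the distance step.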

  8. Common Words Approach
     • A list of selected common words is kept for each language
     • The lists are matched against the test document
     • The closest match gives the language of the document
     • Advantages:
       • Intuitive
       • Computationally efficient
       • Space efficient
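A minimal sketch of the matching step, assuming "closest match" means the language whose word list covers the largest fraction of the document's tokens. The scoring rule, the function names, and the German word list are assumptions for illustration; the English list is the one given on the Classification Approaches slide.

```python
def common_word_score(text, common_words):
    """Fraction of whitespace-separated tokens that appear in the common-word list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in common_words for token in tokens) / len(tokens)

def identify_by_common_words(text, word_lists):
    """Return the language whose common words match the text best, plus all scores."""
    scores = {lang: common_word_score(text, words) for lang, words in word_lists.items()}
    return max(scores, key=scores.get), scores

word_lists = {
    "english": {"the", "an", "is", "at", "a"},       # list from the slides
    "german": {"der", "die", "und", "ist", "das"},   # illustrative assumption
}

language, scores = identify_by_common_words("the cat is at the door", word_lists)
print(language, scores)
```

Because each list is a small set and each token needs one lookup per language, the approach is cheap in both time and space, which matches the advantages listed on the slide.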

  9. Top 5 Marathi Common Words
     • व
     • आणि
     • आहे
     • या
     • ते

  10. N-Grams Approach
     • JAVA
       • Bi-grams: _J, JA, AV, VA, A_
       • Tri-grams: _JA, JAV, AVA, VA_, A__
       • Quad-grams: _JAV, JAVA, AVA_, VA__, A___
     • मदत
       • Bi-grams: _म, मद, दत, त_
       • Tri-grams: _मद, मदत, दत_, t__
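The padding pattern in the JAVA example (one leading underscore, n-1 trailing underscores) can be reproduced with a short helper. The function name and the choice of "_" as a parameter are assumptions for illustration.

```python
def word_ngrams(word, n, pad="_"):
    """Character n-grams of a word padded with one leading and n-1 trailing pad symbols."""
    padded = pad + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

bigrams = word_ngrams("JAVA", 2)    # ['_J', 'JA', 'AV', 'VA', 'A_']
trigrams = word_ngrams("JAVA", 3)   # ['_JA', 'JAV', 'AVA', 'VA_', 'A__']
quadgrams = word_ngrams("JAVA", 4)  # ['_JAV', 'JAVA', 'AVA_', 'VA__', 'A___']
print(bigrams, trigrams, quadgrams)
```

With this padding a word of length L always yields L + 1 n-grams, whatever the value of n.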

  11. Measuring Distances: Out_of_Place()
     • Category profile, sorted in descending order: A, ER, ING, AND, ON
     • Test profile, sorted in descending order: AR, AND, ER, ED, ON
     • Out-of-place value of each test n-gram: AR = max_value, AND = 2, ER = 1, ED = max_value, ON = 0
     • Distance = 3 + 2 * max_value
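The out-of-place computation can be checked with a small sketch. It reads the slide's example as a category profile of A, ER, ING, AND, ON and a test profile of AR, AND, ER, ED, ON. The slide leaves max_value symbolic, so the value 5 used here (the profile length) is purely an assumption, as is the function name.

```python
def out_of_place(test_profile, category_profile, max_value):
    """Sum over test n-grams of |test rank - category rank|, or max_value if not in the category."""
    category_rank = {gram: rank for rank, gram in enumerate(category_profile)}
    return sum(
        abs(rank - category_rank[gram]) if gram in category_rank else max_value
        for rank, gram in enumerate(test_profile)
    )

category = ["A", "ER", "ING", "AND", "ON"]   # most frequent first
test = ["AR", "AND", "ER", "ED", "ON"]       # most frequent first
MAX_VALUE = 5                                # assumed penalty for unseen n-grams

# AR and ED are absent from the category (2 * max_value); AND is 2 places off,
# ER is 1 place off, ON is in place: 3 + 2 * max_value.
distance = out_of_place(test, category, MAX_VALUE)
print(distance)  # 13
```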

  12. Extensions to N-Grams Method
     • Lowest Granularity
       • आदित्य = अ + ा + ि + द + त + य
     • Letter Granularity
       • आदित्य = आ + दि + त + य
     • Conjunct Granularity
       • आदित्य = आ + दि + त्य
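A rough sketch of the three granularities follows. It assumes Unicode Devanagari input rather than the ISCII text the system actually classifies, so the units differ slightly from the slide: the lowest level here is the Unicode code point, and at letter granularity a half consonant keeps its virama. The function name and the granularity labels are hypothetical. The sample word is आदित्य, written with escapes so the code points are explicit.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant), which joins consonants into conjuncts

def split_units(word, granularity):
    """Split a Devanagari word into 'codepoint', 'letter', or 'conjunct' units."""
    if granularity == "codepoint":
        return list(word)
    if granularity not in ("letter", "conjunct"):
        raise ValueError("unknown granularity: " + granularity)
    units = []
    for ch in word:
        # Vowel signs and the virama are combining marks (categories Mc / Mn).
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # At conjunct granularity, a character after a virama stays in the same unit.
        joins = granularity == "conjunct" and bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or joins):
            units[-1] += ch
        else:
            units.append(ch)
    return units

word = "\u0906\u0926\u093f\u0924\u094d\u092f"  # आदित्य
codepoints = split_units(word, "codepoint")     # 6 code points
letters = split_units(word, "letter")           # आ, दि, त्, य
conjuncts = split_units(word, "conjunct")       # आ, दि, त्य
print(codepoints, letters, conjuncts)
```

N-grams would then be built over these units instead of over raw characters, so a bi-gram at conjunct granularity spans two full syllable-like units.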

  13. Experimental Training Setup

  14. Category Profiles Generated through Training

  15. Classification Results

  16. Summary and Future Work
     • Good results have been obtained through syntactic classification
     • The common words technique is computationally the most efficient, but less accurate
     • Our extensions to N-grams give the desired accuracy
     • The N-grams technique is robust to syntax errors
     • The N-grams technique does not require linguistic knowledge
     • We will use language identification techniques to identify a good starting set of pages for the crawling activities of the general-purpose search engine