
Language Identification in Web Pages



Presentation Transcript


  1. Language Identification in Web Pages Bruno Martins, Mário J. Silva Faculdade de Ciências da Universidade de Lisboa ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)

  2. Motivation • Goal: Efficiently crawl web pages in a given language, Portuguese in our case. • We need to accurately distinguish one language from others. • We take an n-gram based approach, which has been reported to give excellent results.

  3. Problems • Web texts differ considerably from conventional text: • Multilingual documents. • Spelling errors. • Lack of coherent sentences. • Often small amounts of textual data. These differences motivate revisiting the language identification problem.

  4. Outline • Introduction. • Context and Related Work. • Language identification. • Text categorization with n-grams. • Our Language Identification Algorithm. • Experimental Results. • Future Work. • Conclusions.

  5. Language Identification • Sibun and Reynar provided a good survey. • A variety of features has been tried: characters, words, POS tags, n-grams, ... • N-gram based methods seem the most promising: Dunning, Damashek, Cavnar & Trenkle, ...

  6. N-grams in text categorization N-grams = n-character slices of a longer string. • “tumba!” is composed of the following n-grams: • Unigrams: _, t, u, m, b, a, !, _ • Bigrams: _t, tu, um, mb, ba, a!, !_ • Trigrams: _tu, tum, umb, mba, ba!, a!_, !__ • Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___ • Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____ • Advantages: • Robust to spelling and grammatical errors. • No need for tokenization, stemming, ... • Time- and space-efficient.
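The slicing and padding scheme implied by the “tumba!” example (one leading underscore, and trailing underscores so that every character starts an n-gram) can be sketched in a few lines of Python. The function name `char_ngrams` is ours, not from the paper:

```python
def char_ngrams(text, n):
    """Return the n-character slices of `text`, padded with underscores
    as in the slide's "tumba!" example: one leading pad and
    max(1, n - 1) trailing pads."""
    padded = "_" + text + "_" * max(1, n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, `char_ngrams("tumba!", 3)` reproduces the slide's trigram list, including the underscore-padded edge n-grams.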

  7. Outline • Introduction. • Context and Related Work. • Our Language Identification Algorithm. • N-gram categorization approach. • Measuring similarity with n-gram profiles. • Heuristics for Web documents. • Experimental Results. • Future Work. • Conclusions.

  8. N-gram categorization approach • Measure similarity among documents through n-gram statistics. • N-grams of multiple lengths simultaneously (1-5)

  9. N-gram similarity - Cavnar & Trenkle
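The figure on this slide is not reproduced in the transcript. Cavnar & Trenkle's method ranks the most frequent n-grams of the document and of each language's training text, then sums, over the document's n-grams, the rank difference between the two profiles (the "out-of-place" distance), with a maximum penalty for n-grams absent from the language profile. A minimal sketch, assuming the padding scheme from the earlier slide; the 300-rank cutoff follows Cavnar & Trenkle's paper, and the function names are ours:

```python
from collections import Counter

def profile(text, max_rank=300):
    """Rank the n-grams (n = 1..5) of `text` by frequency, keeping only
    the top `max_rank` ranks, as in Cavnar & Trenkle's language profiles."""
    counts = Counter()
    for n in range(1, 6):
        padded = "_" + text + "_" * max(1, n - 1)
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common()][:max_rank]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_rank=300):
    """Sum, over the document's n-grams, of the rank difference relative
    to the language profile; missing n-grams get the maximum penalty."""
    return sum(abs(rank - lang_profile.get(g, max_rank))
               for g, rank in doc_profile.items())
```

The document is assigned to the language profile with the smallest distance; an identical profile has distance zero.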

  10. More efficient similarity measures • Lin's information-theoretic similarity measure. • Jiang and Conrath's distance formula.
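The two formulas shown on this slide are not reproduced in the transcript. In their general information-theoretic form (the paper adapts them to n-gram profiles), Lin's similarity divides the shared information content by the total information content, and Jiang & Conrath's distance is the total minus twice the shared. A sketch of those general forms, taking the information-content values as given:

```python
def lin_similarity(ic_common, ic_a, ic_b):
    """Lin (1998): shared information over total information,
    sim = 2 * IC(common) / (IC(A) + IC(B))."""
    return 2.0 * ic_common / (ic_a + ic_b)

def jiang_conrath_distance(ic_common, ic_a, ic_b):
    """Jiang & Conrath (1997): total information minus shared,
    dist = IC(A) + IC(B) - 2 * IC(common)."""
    return ic_a + ic_b - 2.0 * ic_common
```

Identical descriptions give a Lin similarity of 1 and a Jiang-Conrath distance of 0.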

  11. Heuristics for the Web • Use meta-data information, if available and valid. • Matching strings on the language meta tag. • Filter common or automatically generated strings. • “optimized for Internet Explorer” • Weight n-grams according to HTML markup. • Title, bold typeface, subject and description metatags. • Handle insufficient data. • Ignore pages with fewer than 40 characters. • Handle multilingualism and hard-to-decide cases. • Give more weight to the longest sentences.
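Two of these heuristics (trusting a language meta tag when present, and skipping pages with too little text) can be sketched as follows. The function name, the regexes, and the "unknown" fall-through are hypothetical illustrations, not the paper's implementation:

```python
import re

def language_hint(html):
    """Hypothetical sketch of two heuristics from the slide: use the
    Content-Language meta tag when present; report insufficient data
    (None) when the visible text has fewer than 40 characters."""
    match = re.search(
        r'<meta[^>]+http-equiv=["\']content-language["\']'
        r'[^>]+content=["\']([a-zA-Z-]+)["\']',
        html, re.I)
    if match:
        return match.group(1).lower()      # declared language is used directly
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) < 40:                     # insufficient-data heuristic
        return None
    return "unknown"                       # defer to the n-gram classifier
```

A page declaring `<meta http-equiv="Content-Language" content="pt">` is labeled directly; a page with under 40 characters of text is skipped.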

  12. Outline • Introduction. • Context and Related Work. • Our Language Identification Algorithm. • Experimental Results. • Future Work. • Conclusions.

  13. Evaluation Experiments • Language profiles for 23 different languages. • Test collection: 500 documents for each of 12 different languages. • HTML documents crawled from portals and online newspapers. • Tested the classification algorithm in different settings. • Lin's measure was the most accurate. • Heuristics improve performance.

  14. Evaluation Results

  15. Evaluation Results

  16. Application to the Portuguese Web • About 3.5 million pages. • Multiple file types. • A significant portion of the Portuguese Web is written in foreign languages, especially English.

  17. Limitations • Unable to distinguish dialects of the same language: • Portuguese from Portugal and from Brazil. • British English and American English. • Possible directions: • Web linkage information. • “Discriminative” n-grams instead of the most frequent ones.

  18. Future Work • Carefully choose better training data. • Smoothing (e.g. Good-Turing). • Use the n-gram approach for other text classification tasks.

  19. Conclusions • N-grams are effective in language guessing. • Text from the Web presents problems. • Lin's similarity measure seems effective.

  20. Thanks for your attention! bmartins@xldb.di.fc.ul.pt http://www.tumba.pt http://tcatng.sourceforge.net
