



Language Identification in Web Pages

Bruno Martins, Mário J. Silva

Faculdade de Ciências da Universidade Lisboa

ACM SAC 2005 Document Engineering Track (DE-ACM-SAC-2005)


Motivation

  • Goal: Efficiently crawl web pages in a given language, Portuguese in our case.

  • This requires accurately distinguishing one language from others.

    We take an n-gram-based approach to this problem, which has been reported to give excellent results.


Problems

  • Web texts differ considerably from conventional text:

    • Multilingual documents.

    • Spelling errors.

    • Lack of coherent sentences.

    • Often small amounts of textual data.

      These differences motivate revisiting the problem.


Outline

  • Introduction.

  • Context and Related Work.

    • Language identification.

    • Text categorization with n-grams.

  • Our Language Identification Algorithm.

  • Experimental Results.

  • Future Work.

  • Conclusions.


Language Identification

  • Sibun and Reynar provided a good survey.

  • A variety of features have been tried:

    • Characters, words, POS tags, n-grams, ...

      N-gram-based methods seem to be the most promising.

  • Dunning, Damashek, Cavnar & Trenkle, ...


N-grams in text categorization

N-grams = n-character slices of a longer string. In the example below, "_" marks a padding blank at the string boundaries.

  • “tumba!” is composed of the following n-grams:

    • Unigrams: _, t, u, m, b, a, !, _

    • Bigrams: _t, tu, um, mb, ba, a!, !_

    • Trigrams: _tu, tum, umb, mba, ba!, a!_, !__

    • Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___

    • Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____

  • Advantages:

    • Efficiently handle spelling and grammatical errors.

    • No need for tokenization, stemming, ...

    • Computationally and space efficient.
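
The slicing in the “tumba!” example can be sketched in Python. This is an illustrative sketch, not the authors' code: the function name `char_ngrams` and the padding rule (one leading blank and max(1, n - 1) trailing blanks) are assumptions inferred from the listing above.

```python
def char_ngrams(word, n, pad="_"):
    """Return the n-character slices of `word`, padded with `pad`.

    Assumed padding (inferred from the slide's example): one leading
    blank and max(1, n - 1) trailing blanks.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padded = pad + word + pad * max(1, n - 1)
    # Slide a window of width n over the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Print the unigrams through quintgrams of the example word.
for n in range(1, 6):
    print(n, char_ngrams("tumba!", n))
```

Because a misspelled word still shares most of its n-grams with the correct form, counts built from these slices degrade gracefully on noisy text, which is the first advantage listed above.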


Outline

  • Introduction.

  • Context and Related Work.

  • Our Language Identification Algorithm.

    • N-gram categorization approach.

    • Measuring similarity with n-gram profiles.

    • Heuristics for Web documents.

  • Experimental Results.

  • Future Work.

  • Conclusions.


N-gram categorization approach

  • Measure similarity among documents through n-gram statistics.

  • Use n-grams of multiple lengths simultaneously (n = 1 to 5).


N-gram similarity - Cavnar & Trenkle
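
The content of this slide is not reproduced in the transcript. As a rough sketch of the rank-order ("out-of-place") comparison generally attributed to Cavnar & Trenkle, a document profile and a language profile can be compared as below. The function names, the whitespace tokenization, the cutoff `top_k=300`, and the toy training sentences are all assumptions made for illustration, not details taken from the presentation.

```python
from collections import Counter


def profile(text, max_n=5, top_k=300):
    """Map the top_k most frequent n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for token in text.lower().split():
        for n in range(1, max_n + 1):
            padded = "_" + token + "_" * (n - 1)
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty (here, the language profile size)."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, lang_profiles):
    """Return the language whose profile is closest to the document's."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training data (hypothetical); real profiles need far more text.
profiles = {
    "pt": profile("o gato está na casa e o cão está no jardim"),
    "en": profile("the cat is in the house and the dog is in the garden"),
}
print(classify("the dog is in the house", profiles))
```

A lower score means the two profiles rank their n-grams more similarly, so the language with the smallest score is chosen.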


More efficient similarity measures

  • Lin's information-theoretic similarity measure.

  • Jiang and Conrath's distance formula.
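
The formulas for these two measures are not reproduced in this transcript. As a hedged sketch, the feature-set forms commonly given in the literature can be written over sets of n-grams, with the information content of a set S defined as I(S) = -Σ log P(g) for g in S. Lin's similarity is then 2·I(A ∩ B) / (I(A) + I(B)), and a Jiang-Conrath style distance is I(A) + I(B) - 2·I(A ∩ B). Whether the authors used exactly these forms is an assumption, and the `prob` table of n-gram probabilities is hypothetical.

```python
import math


def information(ngrams, prob):
    """Information content of a set of n-grams: -sum(log P(g))."""
    return -sum(math.log(prob[g]) for g in ngrams)


def lin_similarity(a, b, prob):
    """Lin-style similarity between two n-gram sets, in [0, 1]."""
    denom = information(a, prob) + information(b, prob)
    if denom == 0:
        return 0.0  # no information in either set; treat as dissimilar
    return 2 * information(a & b, prob) / denom


def jiang_conrath_distance(a, b, prob):
    """Jiang-Conrath style distance: 0 for identical sets, larger when
    the sets share less information."""
    return information(a, prob) + information(b, prob) - 2 * information(a & b, prob)


# Hypothetical probabilities for three n-grams.
prob = {"_t": 0.5, "tu": 0.25, "um": 0.25}
doc = {"_t", "tu"}
lang = {"tu", "um"}
print(lin_similarity(doc, lang, prob))
print(jiang_conrath_distance(doc, lang, prob))
```

Both measures weight rare n-grams more heavily than frequent ones, since a low-probability n-gram contributes more information when it is shared.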


Heuristics for the Web

  • Use meta-data information, if available and valid.

    • Matching strings on the language meta tag.

  • Filter common or automatically generated strings.

    • “optimized for Internet Explorer”

  • Weight n-grams according to HTML markup.

    • Title, bold typeface, subject and description metatags

  • Handle insufficient data.

    • Ignore pages with fewer than 40 characters.

  • Handle multilingualism and hard-to-decide cases.

    • Give more weight to the longest sentences.
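
Some of the heuristics above might be combined along the following lines. This is only a sketch under stated assumptions: the 40-character threshold and the example boilerplate string come from the slide, while the function names, the `META_LANGUAGES` table, and the numeric values in `TAG_WEIGHTS` are hypothetical. The input is assumed to be (tag, text) pairs already extracted from the HTML.

```python
import re

# Example string from the slide; a real filter list would be longer.
BOILERPLATE = ["optimized for Internet Explorer"]
MIN_CHARS = 40  # pages with fewer characters are ignored (from the slide)
# Hypothetical weights for text found inside specific markup.
TAG_WEIGHTS = {"title": 3.0, "b": 2.0, "meta-description": 2.0}
# Hypothetical table of strings accepted in the language meta tag.
META_LANGUAGES = {"pt": "portuguese", "portuguese": "portuguese",
                  "en": "english", "english": "english"}


def language_from_meta(meta_value):
    """Return a language name if the meta tag value matches a known string."""
    if not meta_value:
        return None
    key = meta_value.strip().lower().split("-")[0]  # "pt-BR" -> "pt"
    return META_LANGUAGES.get(key)


def strip_boilerplate(text):
    """Remove known auto-generated phrases and collapse whitespace."""
    for phrase in BOILERPLATE:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


def weighted_segments(segments):
    """Return [(text, weight)] for a page, or None if it has too little text."""
    cleaned = [(tag, strip_boilerplate(text)) for tag, text in segments]
    if sum(len(text) for _, text in cleaned) < MIN_CHARS:
        return None
    return [(text, TAG_WEIGHTS.get(tag, 1.0)) for tag, text in cleaned if text]
```

The weights returned by `weighted_segments` could then multiply the n-gram counts taken from each segment before the document profile is compared with the language profiles.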


Outline

  • Introduction.

  • Context and Related Work.

  • Our Language Identification Algorithm.

  • Experimental Results.

  • Future Work.

  • Conclusions.


Evaluation Experiments

  • Language profiles for 23 different languages.

  • Test collection: 500 documents for each of 12 different languages.

    • HTML documents crawled from portals and online newspapers.

  • Tested the classification algorithm in different settings.

    • Lin's measure was the most accurate.

    • Heuristics improve performance.


Evaluation Results


Application to the Portuguese Web

  • About 3.5 million pages.

  • Multiple file types.

  • A significant portion of the Portuguese Web is written in foreign languages, especially English.


Limitations

  • Unable to distinguish dialects of the same language?

    • Portuguese from Portugal and from Brazil.

    • British and American English?

  • Possible directions:

    • Web linkage information.

    • “Discriminative” n-grams instead of the most frequent ones.


Future Work

  • Carefully choose better training data.

  • Smoothing (Good-Turing).

  • Use the n-gram approach for other classification tasks.


Conclusions

  • N-grams are effective in language guessing.

  • Text from the Web presents problems.

  • Lin's similarity measure seems effective.


Thanks for your attention!

[email protected]

http://www.tumba.pt

http://tcatng.sourceforge.net

