Language identification in web pages

Language Identification in Web Pages. Bruno Martins, Mário J. Silva, Faculdade de Ciências da Universidade de Lisboa. ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005). Motivation. Goal: Efficiently crawl web pages in a given language, Portuguese in our case.


Language Identification in Web Pages

Bruno Martins, Mário J. Silva

Faculdade de Ciências da Universidade de Lisboa

ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)


Motivation

  • Goal: Efficiently crawl web pages in a given language, Portuguese in our case.

  • Necessity to accurately distinguish one language from others.

    We take an n-gram-based approach to this problem, which has been reported to give excellent results.


Problems

  • Web texts are considerably different:

    • Multilingual documents.

    • Spelling errors.

    • Lack of coherent sentences.

    • Often small amounts of textual data.

      These differences motivate revisiting the problem.


Outline

  • Introduction.

  • Context and Related Work.

    • Language identification.

    • Text categorization with n-grams.

  • Our Language Identification Algorithm.

  • Experimental Results.

  • Future Work.

  • Conclusions.


Language Identification

  • Sibun and Reynar provided a good survey.

  • A variety of features have been tried:

    • Characters, words, POS tags, n-grams, ...

      N-gram based methods seem to be the most promising.

  • Dunning, Damashek, Cavnar & Trenkle, ...


N-grams in text categorization

N-grams = n-character slices of a longer string.

  • “tumba!” is composed of the following n-grams:

    • Unigrams: _, t, u, m, b, a, !, _

    • Bigrams: _t, tu, um, mb, ba, a!, !_

    • Trigrams: _tu, tum, umb, mba, ba!, a!_, !__

    • Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___

    • Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____

  • Advantages:

    • Efficiently handle spelling and grammatical errors.

    • No need for tokenization, stemming, ...

    • Computationally and space efficient.
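
The slicing shown for “tumba!” can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `char_ngrams` and the padding rule (one leading underscore, and n-1 trailing underscores with a minimum of one) are assumptions inferred from the example lists on this slide.

```python
def char_ngrams(token, n, pad="_"):
    """Return the padded character n-grams of a single token.

    Padding rule (assumed from the slide's example): one leading pad
    character and max(1, n - 1) trailing pad characters.
    """
    padded = pad + token + pad * max(1, n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("tumba!", 2))
    # ['_t', 'tu', 'um', 'mb', 'ba', 'a!', '!_']
    print(char_ngrams("tumba!", 3))
    # ['_tu', 'tum', 'umb', 'mba', 'ba!', 'a!_', '!__']
```

Because every slice is taken at the character level, a misspelled word still shares most of its n-grams with the correct spelling, which is why the approach tolerates spelling errors.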


Outline

  • Introduction.

  • Context and Related Work.

  • Our Language Identification Algorithm.

    • N-gram categorization approach.

    • Measuring similarity with n-gram profiles.

    • Heuristics for Web documents.

  • Experimental Results.

  • Future Work.

  • Conclusions.


N-gram categorization approach

  • Measure similarity among documents through n-gram statistics.

  • Use n-grams of multiple lengths simultaneously (1 to 5 characters).
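
One way to collect such statistics is to build a frequency-ranked n-gram profile per document or language. The sketch below assumes whitespace tokenization, lowercasing, and a cut-off of 400 entries; the names `build_profile`, `max_n` and `top_k` and all of these choices are illustrative assumptions, not details given on the slides.

```python
from collections import Counter


def char_ngrams(token, n, pad="_"):
    """Padded character n-grams of one token (padding rule is an assumption)."""
    padded = pad + token + pad * max(1, n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(text, max_n=5, top_k=400):
    """Return the top_k most frequent n-grams (lengths 1..max_n), most frequent first."""
    counts = Counter()
    for token in text.lower().split():          # assumed tokenization
        for n in range(1, max_n + 1):
            counts.update(char_ngrams(token, n))
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


if __name__ == "__main__":
    print(build_profile("o rato roeu a rolha", top_k=10))
```

A language profile would be built the same way from a larger training text in that language, so that document and language profiles are directly comparable.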


N-gram similarity - Cavnar & Trenkle
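
The measure usually associated with Cavnar & Trenkle is the "out-of-place" rank distance between two frequency-ranked profiles. The sketch below assumes profiles are plain ranked lists and that an n-gram missing from the language profile receives a fixed maximum penalty equal to the profile length; the function names and the penalty value are assumptions for illustration.

```python
def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document profile and a language profile."""
    max_penalty = len(lang_profile)              # assumed penalty for unseen n-grams
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    distance = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_rank:
            distance += abs(rank - lang_rank[gram])
        else:
            distance += max_penalty
    return distance


def classify(doc_profile, lang_profiles):
    """Pick the language whose profile is closest to the document profile."""
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


if __name__ == "__main__":
    profiles = {"pt": ["_", "a", "o", "de"], "en": ["_", "e", "th", "he"]}
    print(classify(["_", "o", "a"], profiles))
```

Smaller distances mean more similar profiles, so classification reduces to taking the minimum over all language profiles.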


More efficient similarity measures

  • Lin's information theoretic similarity measure:

  • Jiang and Conrath's distance formula:
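
In their general form, Lin's measure is sim(A, B) = 2 log P(common(A, B)) / (log P(A) + log P(B)), and Jiang and Conrath's distance is dist(A, B) = 2 log P(common(A, B)) - (log P(A) + log P(B)). The sketch below applies both to n-gram profiles under the assumption that a profile is a set of independent n-grams, each with a background probability P(t), so that log P of a profile is the sum of log P(t) over its n-grams. This instantiation, the function names, and the `prob` table are assumptions and are not confirmed by the slides.

```python
import math


def _log_mass(grams, prob):
    """Sum of log-probabilities of a set of n-grams (assumes independence)."""
    return sum(math.log(prob[g]) for g in grams)


def lin_similarity(doc, cat, prob):
    """Lin's similarity: 1.0 for identical sets, 0.0 for disjoint sets."""
    total = _log_mass(doc, prob) + _log_mass(cat, prob)
    if total == 0:
        return 0.0
    return 2.0 * _log_mass(doc & cat, prob) / total


def jiang_conrath_distance(doc, cat, prob):
    """Jiang and Conrath's distance: 0.0 for identical sets, larger when less is shared."""
    return 2.0 * _log_mass(doc & cat, prob) - (_log_mass(doc, prob) + _log_mass(cat, prob))


if __name__ == "__main__":
    # Hypothetical background probabilities, each strictly between 0 and 1.
    prob = {"_a": 0.5, "ão": 0.25, "th": 0.25}
    print(lin_similarity({"_a", "ão"}, {"_a", "th"}, prob))
    print(jiang_conrath_distance({"_a", "ão"}, {"_a", "th"}, prob))
```

Unlike the rank-based distance, both measures weight rare shared n-grams more heavily, since a low P(t) contributes a large log term.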


Heuristics for the Web

  • Use meta-data information, if available and valid.

    • Matching strings on the language meta tag.

  • Filter common or automatically generated strings.

    • “optimized for Internet Explorer”

  • Weight n-grams according to HTML markup.

    • Title, bold typeface, subject and description metatags

  • Handle insufficient data.

    • Ignore pages with fewer than 40 characters.

  • Handle multilingualism and hard to decide cases.

    • Give more weight to the longest sentences.
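
Several of these heuristics can be sketched with Python's standard `html.parser` module. The 40-character threshold and the "optimized for Internet Explorer" string come from the slide; the weight values, the accepted meta tag names, and the names `PageText` and `prepare` are assumptions made for illustration only.

```python
from html.parser import HTMLParser

MIN_CHARS = 40                                     # threshold from the slide
GENERATED = ("optimized for internet explorer",)   # example string from the slide
TAG_WEIGHTS = {"title": 3, "b": 2, "strong": 2}    # assumed weights
META_WEIGHT = 2                                    # assumed weight for metatag text


class PageText(HTMLParser):
    """Collect (text, weight) chunks and any declared language from a page."""

    def __init__(self):
        super().__init__()
        self.declared_lang = None
        self.chunks = []
        self._open = []          # currently open tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("http-equiv") or attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if key in ("content-language", "language") and content:
                self.declared_lang = content.lower()
            elif key in ("subject", "description") and content:
                self.chunks.append((content, META_WEIGHT))
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop unclosed tags (such as <br>) together with the matching one.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or "script" in self._open or "style" in self._open:
            return
        weight = max(TAG_WEIGHTS.get(t, 1) for t in self._open) if self._open else 1
        self.chunks.append((text, weight))


def prepare(html):
    """Return (declared_lang, chunks); chunks is None when the page has too little text."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    kept = [(t, w) for t, w in parser.chunks
            if not any(s in t.lower() for s in GENERATED)]
    if sum(len(t) for t, _ in kept) < MIN_CHARS:
        return parser.declared_lang, None
    return parser.declared_lang, kept
```

A classifier could then count the n-grams of each chunk `weight` times, and fall back to the declared language only when it matches a known language name or code.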


Outline

  • Introduction.

  • Context and Related Work.

  • Our Language Identification Algorithm.

  • Experimental Results.

  • Future Work.

  • Conclusions.


Evaluation Experiments

  • Language profiles for 23 different languages.

  • Test collection: 500 documents for each of 12 different languages.

    • HTML documents crawled from portals and online newspapers.

  • Tested the classification algorithm in different settings.

    • Lin's measure was the most accurate.

    • Heuristics improve performance.


Evaluation Results


Application to the Portuguese Web

  • About 3.5 million pages.

  • Multiple file types.

  • A significant portion of the Portuguese Web is written in foreign languages, especially English.


Limitations

  • Unable to distinguish dialects of the same language?

    • Portuguese from Portugal and from Brazil.

    • British and American English?

  • Possible directions:

    • Web linkage information.

    • “Discriminative” n-grams instead of the most frequent ones.


Future Work

  • Carefully choose better training data.

  • Smoothing (Good-Turing).

  • Use the n-gram approach for other classification tasks.
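
For reference, basic Good-Turing smoothing replaces an observed count r with r* = (r + 1) N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times, and reserves a probability mass of N(1) / N for unseen n-grams. The sketch below is the textbook estimator, not something taken from the slides; the function name and the fallback to the raw count when N(r+1) is zero are assumptions.

```python
from collections import Counter


def good_turing(counts):
    """Return (adjusted_counts, unseen_mass) for a dict of n-gram counts."""
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())      # N(r): how many n-grams occur r times
    unseen_mass = freq_of_freq[1] / total if total else 0.0
    adjusted = {}
    for gram, r in counts.items():
        n_r, n_next = freq_of_freq[r], freq_of_freq[r + 1]
        # Fall back to the raw count when no n-gram occurs r + 1 times (assumed choice).
        adjusted[gram] = (r + 1) * n_next / n_r if n_next else float(r)
    return adjusted, unseen_mass


if __name__ == "__main__":
    print(good_turing({"_t": 1, "tu": 1, "um": 2, "ba": 3}))
```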


Conclusions

  • N-grams are effective in language guessing.

  • Text from the Web presents problems.

  • Lin's similarity measure seems effective.


Thanks for your attention!

[email protected]

http://www.tumba.pt

http://tcatng.sourceforge.net

