Language Identification in Web Pages
Bruno Martins, Mário J. Silva
Faculdade de Ciências da Universidade de Lisboa
ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)
Motivation
Goal: Efficiently crawl web pages in a given language, Portuguese in our case.
We take an n-gram-based approach to this problem, which has been reported to give excellent results.
These considerable differences motivate revisiting the problem.
N-gram-based methods seem to be the most promising.
N-grams in text categorization
N-grams = n-character slices of a longer string.
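As an illustration of the definition above, here is a minimal sketch of character n-gram extraction. The function name `char_ngrams` is hypothetical and not taken from the presentation, and the sketch makes no assumption about padding or normalization, which an actual language identifier would likely add.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous n-character slice of text, in order.

    Hypothetical helper for illustration only: no padding, case folding,
    or tokenization is applied before slicing.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A string of length L has L - n + 1 slices of length n (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("TEXT", 2))  # ['TE', 'EX', 'XT']
    print(char_ngrams("TEXT", 3))  # ['TEX', 'EXT']
```

A string shorter than n yields an empty list, so short inputs need no special handling by the caller.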
Heuristics for the Web
About 3.5 million pages.
Multiple file types.
A significant portion of the Portuguese Web is written in foreign languages, especially English.
Thanks for your attention!