Web Page Language Identification Based on URLs

Presentation Transcript

  1. Web Page Language Identification Based on URLs • Reporter: 鄭志欣 • Advisor: Hsing-Kuo Pao

  2. Reference • E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

  3. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

  4. Introduction • Given only the URL of a web page, can we identify its language? • Web crawlers • Personalized web browsers • We consider the problem of determining the language of a web page using only its URL. • English, French, German, Spanish, and Italian • .com (60%), .org (10%) • www.wasserbett-test.com

  5. Introduction • Applying machine learning techniques • Features • Word features • N-gram features • Custom-made features • Machine learning algorithms • Naïve Bayes • Decision Tree • Relative Entropy • Maximum Entropy

  6. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

  7. Extracting Feature Vectors • Words as features • Remove "www", "index", "html", etc. • For example, http://www.internetwordstats.com/africa2.htm • Split into: internetwordstats, com, africa • "cnn" and "gov" are indicative of English • "produits" and "recherche" are indicative of French
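
The word-splitting step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `url_to_words` and the stop list are assumptions (the slide names only "www", "index" and "html" and then says "etc."), and splitting on every non-letter character is one plausible reading of how "africa2.htm" becomes "africa".

```python
import re

# Assumed stop list: the slide names "www", "index" and "html" and ends with
# "etc."; the remaining entries are guesses added for this sketch.
STOP_TOKENS = {"http", "https", "www", "index", "html", "htm"}


def url_to_words(url: str) -> list[str]:
    """Split a URL into lowercase alphabetic tokens and drop uninformative ones."""
    # Any run of letters is a token; digits and punctuation act as separators.
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if t not in STOP_TOKENS]


print(url_to_words("http://www.internetwordstats.com/africa2.htm"))
# ['internetwordstats', 'com', 'africa']
```

With these assumptions the sketch reproduces the split shown on the slide for the example URL.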

  8. Trigrams as features • Start with the same tokens as the method above (words as features) • E.g., weather • "_we", "wea", "eat", "ath", "the", "her", "er_" • "_th" and "ing" are very common in English
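
A short sketch of the trigram extraction described on this slide, assuming each token is padded with one underscore on either side, as the "weather" example suggests. The function name `trigrams` is chosen for illustration only.

```python
def trigrams(token: str) -> list[str]:
    """Return the character trigrams of a token padded with '_' on both sides."""
    padded = f"_{token}_"
    # Slide a window of width 3 over the padded token.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(trigrams("weather"))
# ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```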

  9. Custom-made features • Top-level domain country code • OpenOffice dictionaries • Dictionary with city names • Number of hyphens
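
The custom-made features listed on this slide could be collected roughly as below. Everything concrete here is an assumption made for illustration: the tiny word and city sets stand in for the OpenOffice and city-name dictionaries, and the feature names and the function `custom_features` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical miniature dictionaries; real ones would be loaded from
# OpenOffice word lists and a gazetteer of city names.
WORD_DICTIONARIES = {"German": {"wasserbett"}, "French": {"produits", "recherche"}}
CITY_DICTIONARIES = {"German": {"berlin"}, "French": {"paris"}}


def custom_features(url: str, tokens: list[str]) -> dict:
    """Build a small feature dictionary from a URL and its word tokens."""
    host = urlparse(url).hostname or ""
    features = {
        "tld": host.rsplit(".", 1)[-1],   # top-level domain, e.g. "de" or "com"
        "num_hyphens": url.count("-"),    # number of hyphens in the URL
    }
    for lang, words in WORD_DICTIONARIES.items():
        features[f"dict_{lang}"] = sum(t in words for t in tokens)
    for lang, cities in CITY_DICTIONARIES.items():
        features[f"city_{lang}"] = sum(t in cities for t in tokens)
    return features


print(custom_features("http://www.wasserbett-test.com", ["wasserbett", "test", "com"]))
```

For the example URL from the introduction, the top-level domain is the uninformative "com", while the dictionary count for German is nonzero under the toy word list.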

  10. Classification Algorithms • Country code top-level domain only (ccTLD) • Country code top-level domain plus (ccTLD+) • Naïve Bayes (NB) • Decision Trees (DT) • Relative Entropy (RE) • Maximum Entropy (ME)
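
The two ccTLD baselines can be sketched as a simple lookup. The slide does not give the mapping or define the "plus" variant, so both are assumptions here: the country-code table is a small illustrative subset, and `plus=True` is assumed to additionally treat .com and .org as English.

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative subset of a country-code-to-language table (an assumption).
CCTLD_LANGUAGE = {
    "de": "German", "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English", "us": "English",
}
# Assumed meaning of the "plus" variant: generic domains count as English.
GENERIC_AS_ENGLISH = {"com", "org"}


def cctld_baseline(url: str, plus: bool = False) -> Optional[str]:
    """Guess the language from the top-level domain alone; None if unknown."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld in CCTLD_LANGUAGE:
        return CCTLD_LANGUAGE[tld]
    if plus and tld in GENERIC_AS_ENGLISH:
        return "English"
    return None


print(cctld_baseline("http://www.example.de/"))                    # German
print(cctld_baseline("http://www.wasserbett-test.com"))            # None
print(cctld_baseline("http://www.wasserbett-test.com", plus=True)) # English
```

Under these assumptions the last call shows why a top-level-domain rule alone is weak: the German-looking example URL from the introduction is either left unlabeled or labeled English.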

  11. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

  12. Data Sets • The algorithms were evaluated on three different data sets • Open Directory Project • Microsoft's Live Search • 1,260 pages from a large web crawl, labeled by hand

  13. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

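  14. Precision: $P = \dfrac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))}$, which reduces to $P = p(+|+) = p(-|-)$ when $n_+ = n_-$ and $p(+|+) = p(-|-)$ • F-measure: $F = \dfrac{2}{1/R + 1/P}$, the harmonic mean of recall $R$ and precision $P$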
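
A small numeric sketch of the two formulas on this slide. The function and parameter names are chosen for illustration, and the inputs are made-up numbers rather than results from the paper. It shows how the same per-class accuracies give different precision depending on the ratio of positive to negative test examples.

```python
def precision(n_pos: float, n_neg: float, p_pos_pos: float, p_neg_neg: float) -> float:
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_pos = n_pos * p_pos_pos            # positives classified as positive
    false_pos = n_neg * (1.0 - p_neg_neg)   # negatives classified as positive
    return true_pos / (true_pos + false_pos)


def f_measure(recall: float, prec: float) -> float:
    """F = 2 / (1/R + 1/P), the harmonic mean of recall and precision."""
    return 2.0 / (1.0 / recall + 1.0 / prec)


# Balanced test set with equal per-class accuracy: P equals p(+|+).
balanced = precision(1000, 1000, 0.9, 0.9)
# Nine times as many negatives: P drops although the accuracies are unchanged.
skewed = precision(100, 900, 0.9, 0.9)
print(round(balanced, 3), round(skewed, 3))
print(round(f_measure(0.9, skewed), 3))
```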

  15. Human Performance

  16. Baseline: ccTLD

  17. Conclusions • This paper shows that high-quality language identifiers for web pages can be built based on URLs alone. • The largest challenge is to identify English-looking URLs of non-English web pages.