Finding More Bilingual Webpages with High Credibility via Link Analysis

BUCC 2013, Sofia, Bulgaria, 8 August 2013. Chengzhi Zhang (Nanjing University of Science and Technology), Xuchen Yao (Johns Hopkins University), Chunyu Kit (City University of Hong Kong).

Presentation Transcript


  1. BUCC 2013, Sofia, Bulgaria. Finding More Bilingual Webpages with High Credibility via Link Analysis. Chengzhi Zhang (Nanjing University of Science and Technology), Xuchen Yao (Johns Hopkins University), Chunyu Kit (City University of Hong Kong). 8 August 2013

  2. Three ideas
     • Bilingual URL Pattern Detection
     • Deep Webpage Recovery
     • Incremental Bilingual Website Exploration

  3. Bilingual URL Pattern Detection
     • A URL pattern: <en, zh> (Kit and Ng, 2007)
     • www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm
     • www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm
     • Improvements:
     • Pairing cost drops from O(|U|^2) to O(|U|), where U is the set of all URLs within a website
     • Approach: an inverted index over URLs
     • Token-based pairs -> character-based pairs
     • Weak token-based pairs: <1e, 1c>, <2e, 2c>, ...
     • http://.../1e/i.html <-> http://.../1c/i.html
     • Enhanced character-based pair: <e, c>
     • Supports multiple languages
     • Better mining of multilingual websites such as those of the EU and UN
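
The inverted-index idea on this slide can be sketched roughly as follows. This is a minimal token-based illustration, not the authors' implementation: the delimiter set, the function names `tokenize` and `find_candidate_pairs`, and the masking scheme are all assumptions, and the character-based refinement mentioned on the slide is not covered. Each URL is indexed once per token under a key that omits that token, so URLs differing in exactly one token fall into the same bucket without comparing every URL against every other.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(url):
    """Split a URL into tokens, keeping delimiters (assumed set) at odd positions."""
    return re.split(r"([/._\-?=&]+)", url)


def find_candidate_pairs(urls):
    """Count token pairs <a, b> that are the only difference between two URLs."""
    index = defaultdict(list)  # (position, prefix, suffix) -> tokens seen there
    for url in urls:
        parts = tokenize(url)
        for i in range(0, len(parts), 2):  # even positions hold tokens
            if parts[i]:
                key = (i, tuple(parts[:i]), tuple(parts[i + 1:]))
                index[key].append(parts[i])

    pattern_counts = defaultdict(int)
    for tokens in index.values():
        # Buckets are expected to be small (about one entry per language version).
        for tok_a, tok_b in combinations(tokens, 2):
            if tok_a != tok_b:
                pattern_counts[tuple(sorted((tok_a, tok_b)))] += 1
    return dict(pattern_counts)


urls = [
    "www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm",
    "www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm",
]
print(find_candidate_pairs(urls))  # {('en', 'zh'): 1}
```

On a real site this would also surface non-language pairs (for example, two file names in the same directory), so the counts are only candidates that still need to be ranked or filtered.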

  7. Top 20 Keys

  8. Number of matched URLs (top 10)

  9. Keys in domain “gov.hk” (rescue locally weak keys if they are globally strong)

  11. Deep Webpage Recovery
     • Deep webpage: a page that no static page links to, so it is not searchable until it is created dynamically
     • Mostly triggered by JavaScript or Flash actions
     • http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm
     • Given the discovered patterns <tc_chi, english>, <tc_chi, en>, <tc_chi, eng>, ..., try:
     • wget http://www.fehd.gov.hk/english/cagenda 20070904.htm
     • wget http://www.fehd.gov.hk/en/cagenda 20070904.htm
     • wget http://www.fehd.gov.hk/eng/cagenda 20070904.htm
     • ...
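
The substitution step above could look roughly like the sketch below. It only generates candidate URLs; fetching them (the slide uses wget) is left out. The function name `candidate_urls`, the rule of matching the language key as a whole path segment, and the file name `page.htm` are illustrative assumptions, not details from the talk.

```python
def candidate_urls(url, patterns):
    """Return URLs obtained by swapping a known language key for its counterpart.

    patterns: list of (source_key, target_key) pairs, e.g. ("tc_chi", "en").
    Only the first occurrence of the key as a full path segment is replaced
    (an assumption made here to avoid rewriting unrelated substrings).
    """
    candidates = []
    for source_key, target_key in patterns:
        segment = f"/{source_key}/"
        if segment in url:
            candidates.append(url.replace(segment, f"/{target_key}/", 1))
    return candidates


patterns = [("tc_chi", "english"), ("tc_chi", "en"), ("tc_chi", "eng")]
# "page.htm" is a hypothetical file name used only for illustration.
for candidate in candidate_urls("http://www.fehd.gov.hk/tc_chi/page.htm", patterns):
    print(candidate)
```

Each candidate that actually resolves would then be paired with the original page; candidates that do not exist are simply discarded.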

  12. various structures of websites with deep pages

  17. Incremental Bilingual Website Exploration
     • Intuition: bilingual websites tend to link to other bilingual websites.
     • Measures:
     • Linkout(w): total number of outgoing links from website w
     • PageRank(w): (Brin and Page, 1998)
     • WeightedPageRank(w): weighted by "how bilingual" w is (the more bilingual URLs w has, the more "bilingual" it is)
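
The three measures can be illustrated with a small power-iteration sketch over a website-level link graph. The slide does not say how the bilingual weight enters PageRank, so the choice made here (biasing the random-jump distribution toward sites with more bilingual URL pairs) is an assumption, as are the function names, the damping factor of 0.85, and the toy graph.

```python
def linkout(links, site):
    """Linkout(w): number of outgoing links from a website."""
    return len(links.get(site, []))


def pagerank(links, jump_weights=None, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each website to the list of websites it links to.
    jump_weights: optional positive weights (e.g. bilingual URL-pair counts)
        used for the random jump; uniform if omitted. Using them this way
        is an assumption of this sketch, not the authors' stated formula.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if jump_weights is None:
        jump_weights = {n: 1.0 for n in nodes}
    total = sum(jump_weights.get(n, 0.0) for n in nodes)
    jump = {n: jump_weights.get(n, 0.0) / total for n in nodes}

    rank = dict(jump)
    for _ in range(iterations):
        # Rank held by sites without outgoing links is redistributed via `jump`.
        dangling = sum(rank[n] for n in nodes if not links.get(n))
        new_rank = {n: (1 - damping + damping * dangling) * jump[n] for n in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph with hypothetical site names and bilingual URL-pair counts.
links = {"a": ["b"], "b": ["a"], "c": ["a"]}
bilingual_pairs = {"a": 1, "b": 1, "c": 10}
plain = pagerank(links)
weighted = pagerank(links, jump_weights=bilingual_pairs)
top_k = sorted(weighted, key=weighted.get, reverse=True)[:2]
```

In this toy graph, site "c" has no incoming links, so its score comes only from the random jump and rises once the jump is biased toward sites with many bilingual URL pairs.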

  21. Discovering related websites from seed websites (select the top K most related websites) [Linkout, PageRank, WeightedPageRank]

  22. Evaluation on number of URL pairs found and precision (total websites: 12,800)

  23. Conclusion
     • Unsupervised bilingual pair detection (no heuristics)
     • http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/Data/Pattern_Credibility_LargeThan100.txt
     • A large collection of English-Chinese webpages
