BUCC2013, Sofia, Bulgaria
Finding More Bilingual Webpages with High Credibility via Link Analysis
Chengzhi Zhang, Nanjing University of Science and Technology
Xuchen Yao, Johns Hopkins University
Chunyu Kit, City University of Hong Kong
8 August 2013
3 ideas
• Bilingual URL Pattern Detection
• Deep Webpage Recovery
• Incremental Bilingual Website Exploration
Bilingual URL Pattern Detection
• a URL pattern: <en, zh> (Kit and Ng, 2007)
  • www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm
  • www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm
• Improvement:
  • pairing cost drops from O(|U|^2) to O(|U|)
    • U is the set of all URLs within a website
    • approach: inverted index for URLs
  • token-based pair -> char-based pair
    • weak pairs: <1e, 1c>, <2e, 2c>, ...
    • http://.../1e/i.html <-> http://.../1c/i.html
    • enhanced: <e, c>
  • supports multiple languages
    • better mining of multilingual websites such as EU and UN
Keys in domain “gov.hk” (rescue locally weak keys if they are globally strong)
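The inverted-index pairing can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the delimiter set, the masking scheme, and the function name `candidate_patterns` are assumptions, and it covers only the token-based stage (not the char-based refinement to <e, c>). Each URL is indexed once per token under a key with that token masked out, so URLs that differ in exactly one token meet in the same bucket without comparing every pair of URLs.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumed delimiter set; the capturing group keeps delimiters in the split result.
TOKEN_SPLIT = re.compile(r"([/._\-?=&]+)")


def candidate_patterns(urls):
    """Count token pairs <a, b> that turn one URL into another."""
    index = defaultdict(set)  # masked URL -> tokens seen at the masked slot
    for url in urls:
        parts = TOKEN_SPLIT.split(url)  # tokens at even positions, delimiters at odd
        for i in range(0, len(parts), 2):
            if not parts[i]:
                continue
            key = "".join(parts[:i] + ["\0"] + parts[i + 1:])
            index[key].add(parts[i])

    counts = Counter()
    for tokens in index.values():
        # Every two tokens sharing a masked key form one candidate pair.
        for a, b in combinations(sorted(tokens), 2):
            counts[(a, b)] += 1
    return counts


urls = [
    "www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm",
    "www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm",
]
print(candidate_patterns(urls).most_common())  # [(('en', 'zh'), 1)]
```

Building the index takes one pass over the URLs (times the number of tokens per URL). On a real site, pairs that are not language markers would also show up, so the counts would still need a credibility filter before being trusted as patterns.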
Deep Webpage Recovery
• deep webpage: a page that is not linked from any static page (and so is not searchable) until it is created dynamically
  • mostly triggered by JavaScript or Flash actions
  • http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm
• we have discovered patterns <tc_chi, english>, <tc_chi, en>, <tc_chi, eng>, ..., then try:
  • wget http://www.fehd.gov.hk/english/cagenda 20070904.htm
  • wget http://www.fehd.gov.hk/en/cagenda 20070904.htm
  • wget http://www.fehd.gov.hk/eng/cagenda 20070904.htm
  • ...
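The candidate-generation step can be sketched as below. The function name `candidate_urls` and the choice to substitute only whole path segments are assumptions made for illustration; the slides do not say how the substitution is matched. The sketch only builds the URLs to probe and leaves the actual fetch (the wget calls above) out.

```python
def candidate_urls(url, patterns):
    """Return URLs obtained by swapping one side of a pattern for the other."""
    candidates = []
    for a, b in patterns:
        # Try the pattern in both directions: a -> b and b -> a.
        for src, dst in ((a, b), (b, a)):
            needle = f"/{src}/"  # assumption: match whole path segments only
            if needle not in url:
                continue
            candidate = url.replace(needle, f"/{dst}/", 1)
            if candidate != url and candidate not in candidates:
                candidates.append(candidate)
    return candidates


patterns = [("tc_chi", "english"), ("tc_chi", "en"), ("tc_chi", "eng")]
for candidate in candidate_urls(
    "http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm", patterns
):
    print(candidate)
```

Each candidate would then be requested, and only those that actually resolve to a page are kept as the recovered counterpart.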
Incremental Bilingual Website Exploration
• Intuition: bilingual websites tend to link to other bilingual websites.
• Measures:
  • Linkout(w): total number of outgoing links from website w
  • PageRank(w): (Brin and Page, 1998)
  • WeightedPageRank(w): weighted by "how bilingual" w is (the more bilingual URLs, the more "bilingual" w is)
Discovering related websites from seed websites (select the top K most related websites) [Linkout, PageRank, WeightedPageRank]
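The slides do not give the exact WeightedPageRank formula, so the following is only one plausible reading: a personalized PageRank whose teleport distribution is proportional to each website's number of bilingual URL pairs. The function names, the damping value, the iteration count, and the site names in the example are all hypothetical.

```python
def weighted_pagerank(links, weight, damping=0.85, iters=50):
    """Personalized PageRank; teleport mass is proportional to `weight`.

    links:  dict mapping a site to the list of sites it links to
    weight: dict mapping a site to its bilingual-URL-pair count (assumed measure)
    """
    sites = sorted(set(links) | {t for ts in links.values() for t in ts} | set(weight))
    total = sum(weight.get(s, 0.0) for s in sites)
    if total > 0:
        teleport = {s: weight.get(s, 0.0) / total for s in sites}
    else:
        teleport = {s: 1.0 / len(sites) for s in sites}  # fall back to uniform

    rank = dict(teleport)
    for _ in range(iters):
        # Rank held by sites without out-links is redistributed via teleport.
        dangling = sum(rank[s] for s in sites if not links.get(s))
        new = {s: (1 - damping + damping * dangling) * teleport[s] for s in sites}
        for s in sites:
            outs = links.get(s, [])
            for t in outs:
                new[t] += damping * rank[s] / len(outs)
        rank = new
    return rank


def top_k(rank, k, exclude=()):
    """Select the K highest-ranked sites, skipping e.g. the seed sites."""
    ordered = sorted(rank, key=lambda s: (-rank[s], s))
    return [s for s in ordered if s not in exclude][:k]


links = {"seed": ["a", "b"], "a": ["seed"], "b": ["seed"]}
weight = {"seed": 5, "a": 10, "b": 1}  # hypothetical bilingual pair counts
print(top_k(weighted_pagerank(links, weight), 1, exclude={"seed"}))  # ['a']
```

In this toy graph "a" and "b" receive identical link mass from the seed, so the ordering between them comes entirely from the bilingual weight, which is the effect the WeightedPageRank measure is meant to capture.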
Evaluation on number of URL pairs found and precision (total websites: 12,800)
Conclusion
• Unsupervised bilingual pair detection (no heuristics)
  • http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/Data/Pattern_Credibility_LargeThan100.txt
• A large collection of English-Chinese webpages