
Large Scale Crawling the Web for Parallel Texts

Large Scale Crawling the Web for Parallel Texts. Chikayama Taura lab. M1 Dai Saito. Parallel texts are translated pairs of multilingual texts (here, English and Japanese); a parallel corpus is a set of parallel texts.


Presentation Transcript


  1. Large Scale Crawling the Web for Parallel Texts Chikayama Taura lab. M1 Dai Saito

  2. Parallel Texts • Parallel texts: translated pairs of multilingual texts • Parallel corpus: a set of parallel texts • Example (English): "One thing was certain, that the WHITE kitten had had nothing to do with it. --it was the black kitten's fault entirely." • Example (日本語): 「一つ確実なのは、白い子ネコはなんの関係もなかったということ。――もうなにもかも、黒い子ネコのせいだったのです。」

  3. Parallel Texts • Useful resource for • Statistical machine translation • Dictionary construction • But existing corpora are limited • Number • Not enough texts • Building them needs human labor • Language • Mostly English-French • Genre • Public documents • Software manuals

  4. Parallel Texts from the Web • Crawling parallel texts from the Web • A very large number of texts exist • Various languages are used • Little human effort is needed • Problems • How to detect parallel texts automatically • Calculation cost

  5. Parallel Texts from the Web [Figure: diagram labeled "Web", "Not parallel", "Maybe parallel" and "Parallel texts", with stage markers ① and ②]

  6. Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion

  7. STRAND [Resnik et al. 03] • URL Matching • Removing language-specific substrings (LSSs) (Japanese: ja, jp, jpn, euc, sjis, …) • Matching LSS-removed URLs • Making a detailed comparison • Example pair: http://www.hostname.com/index.html.en ⇔ http://www.hostname.com/index.html.ja
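The URL matching step on this slide can be sketched roughly as follows. This is a minimal illustration, not STRAND's actual implementation: the Japanese LSSs are taken from the slide, while the English ones (`en`, `eng`, `english`), the extra `japanese` entry, and the function names `strip_lss` and `candidate_pairs` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Language-specific substrings (LSSs). ja, jp, jpn, euc, sjis come from the
# slide; "japanese", "english", "eng", "en" are assumed additions.
LSS = ["japanese", "english", "sjis", "jpn", "euc", "eng", "ja", "jp", "en"]

# An LSS only counts when it is delimited by non-alphanumeric characters,
# so the "en" inside "content" is left alone.
LSS_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(?:" + "|".join(LSS) + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def strip_lss(url: str) -> str:
    """Return the URL with every delimited LSS removed."""
    return LSS_PATTERN.sub("", url)


def candidate_pairs(urls):
    """Group URLs whose LSS-stripped forms are identical."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_lss(url)].append(url)
    return [sorted(group) for group in groups.values() if len(group) > 1]


urls = [
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/content.html",
]
pairs = candidate_pairs(urls)
```

Grouping by the stripped URL keeps the cost linear in the number of URLs, which is what makes this approach fast; the groups are only candidates and still need the detailed comparison.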

  8. URL Matching Experiment • URL matching for URLs of crawled pages • 90,000,000 URLs • English ⇔ Japanese • Looking only at the URL • 90,000,000 → 4,000 • Too strict? • Useless pages are included • URL examples: japanese.php / english.php, index.html.ja / index.html.en

  9. DOM Tree Alignment [Lei et al. 06] • Searching linked pages • "alt" attribute • Link name ("English version", "In English", etc.) • HTML → DOM Tree • Parallel link: a pair of the same hyperlinks in parallel texts

  10. Pros and Cons • URL Matching • ○ High speed and easy to implement • × Small number of pages found • DOM Tree • ○ High accuracy and small storage • × Slow execution speed

  11. Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion

  12. Detecting Parallel Texts • [Fukushima 06] • Reducing comparison cost • without HTML Information • word(noun)→semantic ID→comparison

  13. Semantic ID Conversion • Constructing a graph from dictionaries • Treating Japanese and English texts on the same level • Number of semantic IDs: about 10,000 • [Figure: example groups. 1: Sense, 感覚, 意味 · 2: Movie, 映画, Film · 3: Hobby, 趣味, Taste, 味]

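  14. Texts to Vector • Dictionary entries: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173 • Example sentence: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.") • IDs in order of appearance: 1704, 955, 3173 • After sorting: (955, 1704, 3173), plus position information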
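A minimal sketch of this conversion, assuming nouns have already been extracted from the text (the slides do not say which morphological analyzer is used). The three Japanese entries and their IDs come from the slide; the English entries and the name `to_vector` are hypothetical.

```python
# Toy dictionary: noun -> semantic ID. The Japanese entries and IDs are from
# the slide; the English entries are assumed, to show that both languages
# land on the same IDs.
SEMANTIC_ID = {
    "テキスト": 955, "text": 955,
    "辞書": 1704, "dictionary": 1704,
    "数列": 3173, "sequence": 3173,
}


def to_vector(nouns):
    """Map nouns to semantic IDs; return (id, position) pairs sorted by ID."""
    pairs = [
        (SEMANTIC_ID[noun], position)
        for position, noun in enumerate(nouns)
        if noun in SEMANTIC_ID  # words missing from the dictionary are dropped
    ]
    return sorted(pairs)


ja_vector = to_vector(["辞書", "テキスト", "数列"])
en_vector = to_vector(["dictionary", "text", "sequence"])
```

Both calls yield the same ID sequence, which is the point of the semantic ID step: once texts are reduced to sorted integers, comparing them no longer depends on the language.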

  15. Comparison • tscore (translation score) • T1: (106, 335, 455, 567, 1704, 3173, 7421) • T2: (335, 567, 567, 1704, 4014, 5449, 7421) • score = 0, 1, 2 (values shown on the slide)
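The transcript does not give the tscore formula, so the following is only an assumed illustration of why sorted ID lists are cheap to compare: a single linear merge scan counts the IDs that T1 and T2 share. The function name `common_id_count` is hypothetical, and any normalization that the real tscore applies is not reproduced here.

```python
def common_id_count(t1, t2):
    """Count IDs shared by two sorted ID lists with one linear merge scan."""
    i = j = count = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            # Match: consume one element from each list.
            count += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return count


T1 = (106, 335, 455, 567, 1704, 3173, 7421)
T2 = (335, 567, 567, 1704, 4014, 5449, 7421)
shared = common_id_count(T1, T2)  # 335, 567, 1704 and 7421 match
```

Because each step advances at least one index, the scan is O(len(T1) + len(T2)), which fits the goal of reducing comparison cost.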

  16. tscore threshold • Fry Corpus [Fry 05] • F-measure • tscore threshold: 0.102 • Speed: 250,000 pairs/sec

  17. Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion

  18. Large Scale Crawling • Calculation cost of each comparison • Calculation cost of the entire crawl • Number of comparisons • URL matching is too strict • "alt" text and link names are not available for all parallel pages

  19. HTML on the Web to Natural Language • Guess the language and encoding • English, SJIS, EUC-JP, UTF-8 • Convert the character code • Remove HTML tags • For crawling, <a> and <link> tags are used • <title> and <Hn> tags may be useful
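These preprocessing steps can be sketched with the Python standard library. This is an assumed approach, not the one used in the presentation: the encoding is guessed by trial decoding in a fixed order (which can misidentify some pages), and the names `decode_page`, `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

# Trial order is an assumption; a real crawler would also consult HTTP
# headers and <meta> declarations.
ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def decode_page(raw: bytes) -> str:
    """Decode with the first encoding that accepts every byte."""
    for encoding in ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Collect visible text and the href values of <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def html_to_text(raw: bytes):
    """Return (plain text, links) for one raw HTML page."""
    parser = TextExtractor()
    parser.feed(decode_page(raw))
    parser.close()
    return " ".join(parser.text), parser.links


page = '<title>日本語</title><style>p{}</style><p>日本語 text</p><a href="index.html.en">English</a>'
text, links = html_to_text(page.encode("euc_jp"))
```

The extracted links feed the crawler, while the plain text goes on to the semantic ID conversion.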

  20. Calculation Cost Reduction • Distance score of vectors • Compare only nearby vectors • Distance score: tscore • Label every text with its nearest sample text • If two texts are far apart by distance score, they are not parallel texts

  21. Calculation Cost Reduction • Flow • Select sample texts (<< n) • When crawling, calculate the distance score against the sample texts • Classify each text by its top m scores • Compare only texts in the same group
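The flow above can be sketched as follows. This is a rough illustration under stated assumptions: `overlap` is a placeholder similarity standing in for tscore, samples are picked at random, and the sample count of about √n follows the figure quoted later in the deck. The names `group_by_samples` and `pairs_within_groups` are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def overlap(a, b):
    """Placeholder similarity (not tscore): number of distinct shared IDs."""
    return len(set(a) & set(b))


def group_by_samples(texts, m=1, seed=0):
    """Label each text with its m highest-scoring sample texts."""
    rng = random.Random(seed)
    k = max(1, math.isqrt(len(texts)))  # about sqrt(n) samples
    sample_ids = rng.sample(range(len(texts)), k)
    groups = defaultdict(list)
    for index, vector in enumerate(texts):
        ranked = sorted(
            sample_ids,
            key=lambda s: overlap(vector, texts[s]),
            reverse=True,
        )
        for sample in ranked[:m]:
            groups[sample].append(index)
    return groups


def pairs_within_groups(groups):
    """Only texts that share a label are compared in detail."""
    pairs = set()
    for members in groups.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


texts = [[i, i + 1, i + 2] for i in range(16)]
groups = group_by_samples(texts, m=1)
pairs = pairs_within_groups(groups)
```

With n texts and about √n samples, labeling costs roughly n·√n score computations, and the detailed comparison then runs only inside each group instead of over all n² pairs.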

  22. Sampling • Number of samples • Accuracy (risk of mislabeling) • Calculation cost • Size of the groups • Should be equal • Large groups are recursively divided into smaller ones

  23. Crawling linked pages • Pages reached by the same links from parallel texts are likely to be parallel texts too • Evaluation of same links • DOM Tree [Lei et al. 06] • Evaluation function • Position of the <A> tag • Pages on the same host • Diff of URLs • hoge.html.en -> fuga.html.en : hoge - fuga • hoge.html.ja -> fuga.html.ja : hoge - fuga
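The "diff of URLs" idea can be illustrated like this: if following a link changes the URL in the same way on both language versions, the two links are treated as the same link. The token-level diff below is an assumed formulation using `difflib`; the slide only gives the hoge/fuga example, and the names `url_diff` and `same_link` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def url_diff(src: str, dst: str):
    """Return the token-level replacements that turn src into dst."""
    a = re.split(r"[/.]", src)
    b = re.split(r"[/.]", dst)
    opcodes = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [
        (tuple(a[i1:i2]), tuple(b[j1:j2]))
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    ]


def same_link(en_src, en_dst, ja_src, ja_dst):
    """True when both language versions change the URL identically."""
    return url_diff(en_src, en_dst) == url_diff(ja_src, ja_dst)


diff = url_diff("hoge.html.en", "fuga.html.en")  # hoge replaced by fuga
match = same_link("hoge.html.en", "fuga.html.en", "hoge.html.ja", "fuga.html.ja")
```

Because the language suffix is unchanged by the link on each side, it drops out of the diff, so the English and Japanese diffs can be compared directly.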

  24. Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion

  25. Evaluation of tscore • Fry Corpus [Fry 05] • 200 (Japanese) × 200 (English) • Flow • Convert all texts to vectors • Calculate the distance score for all pairs (40,000) • Check that the scores of real parallel texts are high • The score of a parallel pair should be the top score
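A small sketch of this evaluation loop, assuming the corpus is aligned so that the i-th Japanese text and the i-th English text form a true pair. The scoring function is passed in as a parameter; the `overlap` placeholder below is not tscore, and `count_misses` is a hypothetical name. Only the Japanese-to-English direction is checked here.

```python
def overlap(a, b):
    """Placeholder score (not tscore): number of distinct shared IDs."""
    return len(set(a) & set(b))


def count_misses(ja_vectors, en_vectors, score):
    """Count Japanese texts whose true partner is not the top-scoring one.

    ja_vectors[i] and en_vectors[i] are assumed to be a real parallel pair.
    """
    misses = 0
    for i, ja in enumerate(ja_vectors):
        scores = [score(ja, en) for en in en_vectors]
        best = max(range(len(en_vectors)), key=scores.__getitem__)
        if best != i:
            misses += 1
    return misses


ja = [[1, 2, 3], [4, 5, 6]]
en = [[1, 2, 9], [4, 5, 6]]
aligned_misses = count_misses(ja, en, overlap)
shuffled_misses = count_misses(ja, en[::-1], overlap)
```

For 200 × 200 texts this computes all 40,000 scores, matching the exhaustive check described on the slide.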

  26. Evaluation of tscore • Other distance scores compared: NOT XOR, AND, EUCLID, COS, AND - XOR • [Figure: a sorted ID sequence (1,1,1,2,4,4,…) shown with its sparse count vector (3,1,0,2,…); example vectors (3,1,0,2,0) and (3,0,0,1,2) with the values 3, 2 and 0]
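To make the count-vector representation concrete, the sketch below converts an ID sequence into per-ID counts and applies the textbook cosine and Euclidean definitions to the two example vectors from the slide. The slide names COS and EUCLID without defining them, so the standard formulas are an assumption, as are the function names; the AND, XOR and NOT XOR variants are not reproduced because their definitions are not recoverable from the transcript.

```python
import math
from collections import Counter


def count_vector(ids, size):
    """Turn an ID sequence into counts for IDs 1..size."""
    counts = Counter(ids)
    return [counts.get(i, 0) for i in range(1, size + 1)]


def cosine(u, v):
    """Standard cosine similarity (assumed to be what COS refers to)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclid(u, v):
    """Standard Euclidean distance (assumed to be what EUCLID refers to)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vector = count_vector([1, 1, 1, 2, 4, 4], size=4)  # counts for IDs 1..4
u, v = (3, 1, 0, 2, 0), (3, 0, 0, 1, 2)
cos_uv = cosine(u, v)       # dot product 11, both squared norms 14
dist_uv = euclid(u, v)      # squared differences sum to 6
```

Note that cosine is a similarity (higher means closer) while the Euclidean value is a distance (lower means closer), so the two need opposite ranking directions when used to pick the best candidate.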

  27. Evaluation of tscore • Number of misses for each score ([200+200] texts)

  28. Calculation Time • Fry Corpus • 200, 400, 800, 1600, 3200 • NORMAL tscore (Top 3) • Number of samples: √(number of all texts) • Mislabeling: 11 (in 200 pairs)

  29. Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion

  30. Conclusion and Future Work • Parallel texts from the Web • Detecting parallel texts • Large scale crawling • Future work • Crawling many texts from the Web • Crawling with parallel link structure • Detecting parallel texts in real HTML • Proper sampling

  31. Thank you for your attention!
