Extracting Parallel Texts from Massive Web Documents

Extracting Parallel Texts from Massive Web Documents. Chikayama Taura lab. M2 Dai Saito. Construct Parallel Corpora from the Web.

Presentation Transcript


  1. Extracting Parallel Texts from Massive Web Documents • Chikayama Taura lab. • M2 Dai Saito

  2. Purpose • Construct parallel corpora from the Web • Parallel corpus: a set of parallel texts • Parallel texts: translated pairs of texts • Example (English): --it was the black kitten's fault entirely. One thing was certain, that the WHITE kitten had had nothing to do with it. • Example (日本語): ――もうなにもかも、黒い子ネコのせいだったのです。一つ確実なのは、白い子ネコはなんの関係もなかったということ。

  3. Parallel Texts • Useful resource for: statistical machine translation, dictionary construction • But existing corpora are not enough • Amount: small, and building them takes a lot of human labor • Genre: public documents, software manuals • Language: limited (e.g. English-French)

  4. Parallel Texts from the Web • Extracting parallel texts from massive Web documents • Very large amount of text • Varied languages • Little human labor required

  5. Problems • How to detect parallel texts automatically • How to reduce calculation cost • To construct a parallel corpus from the Web: • Extract candidate pairs • Judge whether they really are parallel texts

  6. Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion

  7. STRAND [Resnik et al. 03] • URL matching • Remove language-specific substrings (LSSs); for Japanese: ja, jp, jpn, euc, sjis, … • Match the LSS-removed URLs • Make a detailed comparison • Example: http://www.hostname.com/index.html.en and http://www.hostname.com/index.html.ja
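
The URL-matching step above can be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the English LSS list (`eng`, `en`), the token-boundary rule, and the function names `strip_lss` and `match_urls` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Japanese LSSs are the ones named on the slide; the English list is an assumption.
LSS = {"ja": ["jpn", "sjis", "euc", "ja", "jp"], "en": ["eng", "en"]}

def strip_lss(url, lang):
    """Remove language-specific substrings that appear as whole URL tokens."""
    alternatives = "|".join(LSS[lang])
    # Require non-alphanumeric neighbours so that e.g. "japan" is left untouched.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9])(?:{alternatives})(?![A-Za-z0-9])", re.IGNORECASE
    )
    return pattern.sub("", url)

def match_urls(urls_by_lang):
    """Pair English and Japanese URLs that become identical once LSSs are removed."""
    buckets = defaultdict(dict)
    for lang, urls in urls_by_lang.items():
        for url in urls:
            buckets[strip_lss(url, lang)].setdefault(lang, []).append(url)
    return [
        (en, ja)
        for bucket in buckets.values()
        for en in bucket.get("en", [])
        for ja in bucket.get("ja", [])
    ]

pairs = match_urls({
    "en": ["http://www.hostname.com/index.html.en"],
    "ja": ["http://www.hostname.com/index.html.ja"],
})
print(pairs)
```

Pages whose URLs differ by anything other than an LSS are never paired, which is one way to read the later remark that URL matching is too strict.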

  8. DOM Tree Alignment [Lei et al. 06] • HTML → DOM tree • Searching linked pages • “alt” attribute • Link name: “English version”, “In English”, etc. • Parallel link: a pair of the same hyperlinks in parallel texts

  9. Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion

  10. Outline • Web crawler → Extract candidate pairs → Detect parallel texts

  11. Detecting parallel texts • Low comparison cost • Without HTML information • Word (noun) → semantic ID → comparison [Fukushima et al. 06]

  12. Semantic ID Conversion • Constructing a graph from dictionaries • Treating Japanese and English texts at the same level • # of semantic IDs: about 10,000 • Example: ID 1: Sense, 感覚, 意味 • ID 2: Movie, 映画, Film • ID 3: Hobby, 趣味, Taste, 味
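
One plausible reading of "constructing a graph from dictionaries" is that each dictionary entry is an edge between two words and each connected component receives one semantic ID. The slide does not confirm this, so the sketch below is an assumption, and the entry list is a hypothetical reconstruction of the figure's example words.

```python
def build_semantic_ids(entries):
    """entries: iterable of (word_a, word_b) dictionary pairs. Returns {word: ID}."""
    parent = {}

    def find(word):
        # Union-find lookup with path halving.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in entries:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    ids, result = {}, {}
    for word in list(parent):
        root = find(word)
        ids.setdefault(root, len(ids) + 1)  # number components in order of first appearance
        result[word] = ids[root]
    return result

# Hypothetical dictionary entries based on the words shown on the slide.
entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
semantic_ids = build_semantic_ids(entries)
print(semantic_ids["film"], semantic_ids["映画"])
```

With this grouping, a Japanese word and its English translations share one ID, so texts in both languages can be compared as plain integer sequences.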

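  13. Texts to Vector • Dictionary: テキスト → 955, 辞書 → 1704, 数列 → 3173 • Input text: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.") • IDs in order of appearance: 1704, 955, 3173 • Sort: (955, 1704, 3173) + position information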
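
The conversion on this slide can be sketched as below, assuming the nouns have already been extracted by a morphological analyzer. The function name `text_to_vector` and the `(ID, position)` tuple layout are assumptions; the slide only says that position information is kept.

```python
def text_to_vector(nouns, semantic_ids):
    """Map nouns to semantic IDs, remember each noun's position, and sort by ID."""
    pairs = [
        (semantic_ids[noun], position)
        for position, noun in enumerate(nouns)
        if noun in semantic_ids  # words missing from the dictionary are skipped
    ]
    return sorted(pairs)

# IDs taken from the slide; the noun list is the slide's sentence after noun extraction.
SEMANTIC_IDS = {"テキスト": 955, "辞書": 1704, "数列": 3173}
nouns = ["辞書", "テキスト", "数列"]

vector = text_to_vector(nouns, SEMANTIC_IDS)
print([semantic_id for semantic_id, _ in vector])  # [955, 1704, 3173]
```

Sorting once per text is what makes the later pairwise comparison a cheap linear merge.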

  14. Comparison • tscore (translation score) • T1: (106, 335, 455, 567, 1704, 3173, 7421) • T2: (335, 567, 567, 1704, 4014, 5449, 7421) • score counts the matching IDs (335, 567, 1704, 7421): 0 → 1 → 2 → 3 → 4 • tscore = 4/(7+7)
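
The slide's example can be reproduced with a single merge pass over the two sorted ID lists. This is a sketch consistent with the numbers shown (4 matches, lengths 7 and 7); treating a repeated ID as matching at most once per occurrence is an assumption, and the function name `tscore` is simply borrowed from the slide.

```python
def tscore(ids_a, ids_b):
    """Count IDs shared by two sorted lists and divide by the sum of their lengths."""
    i = j = matches = 0
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            matches += 1
            i += 1
            j += 1
        elif ids_a[i] < ids_b[j]:
            i += 1
        else:
            j += 1
    total = len(ids_a) + len(ids_b)
    return matches / total if total else 0.0

T1 = [106, 335, 455, 567, 1704, 3173, 7421]
T2 = [335, 567, 567, 1704, 4014, 5449, 7421]
print(round(tscore(T1, T2), 3))  # 0.286, i.e. 4/(7+7)
```

Because both lists are already sorted, each comparison is linear in the text lengths and needs no HTML information.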

  15. tscore threshold • Fry Corpus [Fry 05]: 400 pairs • F-measure • Speed: 200,000 pairs/sec • tscore threshold: 0.102
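
The slide lists F-measure and a threshold of 0.102 but does not say how the threshold was chosen. One common approach, shown here purely as an assumption, is to sweep candidate thresholds over labeled pairs and keep the one with the best F-measure. The scores and labels below are made-up toy values, not results from the Fry Corpus.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when nothing is correctly accepted."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored):
    """scored: list of (tscore, is_parallel). Returns (threshold, f_measure)."""
    best = (None, -1.0)
    for threshold in sorted({score for score, _ in scored}):
        tp = sum(1 for s, y in scored if s >= threshold and y)
        fp = sum(1 for s, y in scored if s >= threshold and not y)
        fn = sum(1 for s, y in scored if s < threshold and y)
        f = f_measure(tp, fp, fn)
        if f > best[1]:
            best = (threshold, f)
    return best

# Toy data for illustration only.
toy = [(0.30, True), (0.20, True), (0.12, True),
       (0.15, False), (0.08, False), (0.02, False)]
threshold, f = best_threshold(toy)
print(threshold)
```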

  16. Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion

  17. Extract candidate pairs • Calculation cost of each comparison • Calculation cost of extracting parallel texts • Number of comparisons: n^2 • URL matching is too strict • Japanese and English: 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs
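
To see why n^2 comparisons are a problem, here is a rough back-of-the-envelope estimate. It assumes, hypothetically, that all 90,000,000 URLs were compared all-to-all at the 200,000 pairs/sec reported on slide 15; the slides themselves do not state this figure.

```latex
\[
n^2 = (9 \times 10^{7})^2 = 8.1 \times 10^{15} \text{ comparisons},
\qquad
\frac{8.1 \times 10^{15}}{2 \times 10^{5}\ \text{pairs/s}}
  = 4.05 \times 10^{10}\ \text{s}
  \approx 1.3 \times 10^{3}\ \text{years}.
\]
```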

  18. Calculation Cost Reduction • Reduce the number of comparisons by using sample texts • Distance score: tscore • Compare only texts that are close to each other • Both texts of a parallel pair (English and 日本語) should be at the same distance from a sample text

  19. Calculation Cost Reduction • Flow • Select sample texts (<< n) • Calculate the distance score between each text and the sample texts • Classify each text by its top m scores • Compare only texts in the same group
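
The flow above can be sketched as follows. The interpretation of "top m" (each text joins the groups of its m best-scoring samples) is an assumption, as are the names `candidate_pairs` and `groups_of`; `tscore` is redefined here in a compact form so the block stands alone.

```python
from collections import Counter, defaultdict

def tscore(ids_a, ids_b):
    """Shared IDs (multiset intersection) divided by the sum of the lengths."""
    total = len(ids_a) + len(ids_b)
    common = sum((Counter(ids_a) & Counter(ids_b)).values())
    return common / total if total else 0.0

def candidate_pairs(ja_texts, en_texts, samples, m=1):
    """Group texts by their top-m samples and pair only texts sharing a group."""
    def groups_of(vec):
        ranked = sorted(range(len(samples)),
                        key=lambda i: tscore(vec, samples[i]), reverse=True)
        return ranked[:m]

    buckets = defaultdict(lambda: ([], []))
    for index, vec in enumerate(ja_texts):
        for group in groups_of(vec):
            buckets[group][0].append(index)
    for index, vec in enumerate(en_texts):
        for group in groups_of(vec):
            buckets[group][1].append(index)

    pairs = set()
    for ja_ids, en_ids in buckets.values():
        pairs.update((j, e) for j in ja_ids for e in en_ids)
    return pairs

# Toy semantic-ID vectors, for illustration only.
samples = [[1, 2, 3], [10, 11, 12]]
ja_texts = [[1, 2, 4], [10, 11, 13]]
en_texts = [[1, 3, 4], [11, 12, 13]]
print(sorted(candidate_pairs(ja_texts, en_texts, samples, m=1)))  # [(0, 0), (1, 1)]
```

Only the surviving pairs would then go through the full tscore comparison. A larger m lowers the risk of splitting a true pair across groups at the price of more comparisons.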

  20. Sampling • Number of samples • Calculation cost • Accuracy (low risk of mislabeling) • Methods to select samples • Random • k-means

  21. k-means (k=2) • Select k samples • Classify all texts • Calculate centers • Re-classify

  22. Calculation of tscore in k-means • Normal (text vs. text): Text1: (106, 335, 455, 567, 1704, 3173, 7421) • Text2: (335, 567, 567, 1704, 4014, 5449, 7421) • tscore = 4/(7+7) • k-means (text vs. cluster average): Text1: (106, 335, 455, 567, 1704, 3173, 7421) • Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …) • tscore = (0.2+0.5)
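
The k-means steps and the score against a cluster average can be sketched together. The scoring rule (sum the average's weights over the IDs present in the text) follows the slide's 0.2 + 0.5 example. How the average weights are computed is not stated, so `center_of` below (the fraction of member texts containing each ID) is an assumption, as are the function names and the fixed seed indices.

```python
from collections import Counter

def center_of(texts):
    """Assumed center: for each ID, the fraction of member texts that contain it."""
    counts = Counter(i for text in texts for i in set(text))
    return {i: count / len(texts) for i, count in counts.items()}

def center_score(text, center):
    """Sum of the center's weights over the distinct IDs found in the text."""
    return sum(center.get(i, 0.0) for i in set(text))

def kmeans(texts, seeds, rounds=10):
    """seeds: indices of the k texts used as initial samples."""
    centers = [center_of([texts[s]]) for s in seeds]
    labels = None
    for _ in range(rounds):
        new_labels = [
            max(range(len(centers)), key=lambda c: center_score(text, centers[c]))
            for text in texts
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [t for t, label in zip(texts, labels) if label == c]
            if members:
                centers[c] = center_of(members)
    return labels, centers

# The slide's example: only the visible entries of Average1 are used.
text1 = [106, 335, 455, 567, 1704, 3173, 7421]
average1 = {567: 0.2, 4014: 0.14, 7421: 0.5}
print(round(center_score(text1, average1), 2))  # 0.7

labels, _ = kmeans([[1, 2, 3], [1, 2, 4], [10, 11, 12], [10, 11, 13]], seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```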

  23. Converting HTML on the Web • Guess language and encoding • English, SJIS, EUC-JP, UTF-8 • Convert character code • Remove HTML tags • Morphological analysis → pick up nouns
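
The first three steps can be sketched with the Python standard library. Trying a fixed list of encodings in order is a naive assumption made for illustration, not the detection method used in the talk, and the names `decode_html`, `TextExtractor`, and `html_to_text` are hypothetical. Noun extraction is left out because it needs an external morphological analyzer.

```python
from html.parser import HTMLParser

def decode_html(raw, encodings=("utf-8", "euc_jp", "shift_jis")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and the contents of script/style."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(raw):
    text, encoding = decode_html(raw)
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts), encoding

page = "<html><head><style>p{}</style></head><body><p>日本語の<b>テキスト</b></p></body></html>"
print(html_to_text(page.encode("euc_jp")))
```

Pure ASCII pages decode as UTF-8 here, so this sketch does not separate English from Japanese pages by itself. A language guess would need an extra check, for example on the share of non-ASCII characters.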

  24. Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion

  25. Experiment • Calculation cost • Accuracy vs. calculation time • Clustering • k-means

  26. Environment • Dataset: Fry Corpus [Fry 05] • Corpus of Japanese-English news pages • HTML converted to semantic IDs in advance • Machine • CPU: Xeon 2.4GHz Dual • Memory: 2GB • OS: Linux (Debian)

  27. Calculation Cost • Fry Corpus, 200 - 6400 pairs • Compared: normal all-to-all vs. random sampling (top 3) • As the # of texts grows, the gap becomes wider • Low cost with n^2 samples

  28. Accuracy vs. Calculation time • Fry Corpus • 400 pairs • Random sampling • As the # of samples grows: • Misclassification ratio → high • Execution time → low • Trade-off between misclassification ratio and execution time

  29. Sample selection with k-means • Accuracy and execution time with k-means • Flow • Random sampling • Number of samples: √n • Calculating the centers and re-sampling • Measuring misclassification ratio and execution time

  30. Evaluation of k-means • Low misclassification ratio → highly biased

  31. Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion

  32. Conclusion and Future work • Parallel texts from the Web • Detecting parallel texts • Extracting candidate pairs • Random sampling • k-means

  33. Future work • Better clustering methods • Hierarchical • Dimension reduction • About 10,000 dimensions is too many • Processing real HTML texts from the Web

  34. Thank you for your attention!
