1 / 33

The Web as a Parallel Corpus

The Web as a Parallel Corpus. Philip Resnik, Noah A. Smith, Computational Linguistics, 29, 3, pp. 349 – 380, MIT Press,2004. University of Maryland, Johns Hopkins University. Abstract. STRAND system for mining parallel text on the World Wide Web reviewing the original algorithm and results

zyta
Download Presentation

The Web as a Parallel Corpus

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Web as a Parallel Corpus Philip Resnik, Noah A. Smith, Computational Linguistics, 29, 3, pp. 349 – 380, MIT Press,2004. University of Maryland, Johns Hopkins University

  2. Abstract • STRAND system for mining parallel text on the World Wide Web • reviewing the original algorithm and results • presenting a set of significant enhancements • use of supervised learning based on structural features of documents to improve classification performance • new content-based measure of translational equivalence • adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale • construction of a significant parallel corpus for a low-density language pair

  3. Introduction • Parallel corpora, bitexts; • for automatic lexical acquisition (Gale and Church 1991; Melamed 1997) • provide indispensable training data for statistical translation models (Brown et al. 1990; Melamed 2000; Och and Ney 2002) • provide the connection between vocabularies in cross-language information retrieval (Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997)

  4. Recent works at UM&JHU • exploit parallel corpora in order to develop monolingual resources and tools, using a process of annotation, projection, and training • given a parallel corpus in English and a less resource-rich language • project English annotations across the parallel corpus to the second language • using word-level alignments as the bridge, and then use robust statistical techniques in learning from the resulting noisy annotations (Cabezas, Dorr, and Resnik 2001; Diab and Resnik 2002;Hwa et al. 2002; Lopez et al. 2002; Yarowsky, Ngai, and Wicentowski 2001; Yarowsky and Ngai 2001; Riloff, Schafer, and Yarowsky 2002).

  5. parallel corpora as a critical resource • not readily available in the necessary quantities • heavily on French-English • because the Canadian parliamentary proceedings (Hansards) in English and French were the only large bitext available • United Nations proceedings (LDC) • religious texts (Resnik, Olsen, and Diab 1999) • software manuals (Resnik and Melamed 1997; Menezes and Richardson 2001) • tend to be unbalanced, representing primarily governmental or newswire-style texts

  6. World Wide Web • People tend to see the Web as a reflection of their own way of viewing the world • a huge semantic network • an enormous historical archive • a grand social experiment • a great big body of text waiting to be mined • a huge fabric of linguistic data often interwoven with parallel threads

  7. STRAND • (Resnik 1998, 1999) • structural translation recognition acquiring natural data • identify pairs of Web pages that are mutual translations • Incorporating new work on content-based detection of translations (Smith 2001, 2002) • efficient exploitation of the Internet Archive

  8. Finding parallel text on the Web • Location of pages that might have parallel translations • Generation of candidate pairs that might be translations • Structural filtering out of nontranslation candidate pairs

  9. Locating Pages • AltaVista search engine’s advanced search to search for two types of Web pages: parents and siblings. • Ask AV with regular expressions • (anchor:"english" OR anchor:"anglais") • (anchor:"french" OR anchor:"fran¸cais"). • spider component • The results reported here do not make use of the spider.

  10. Generating Candidate Pairs • with URL • http://mysite.com/english/home en.html, on which one combination of substitutions might produce the URLhttp://mysite.com/big5/home ch.html. • Another possible criterion for matching is the use of document lengths. • text E in language 1 and text F in language 2, length(E) = C * length(F), where C is a constant tuned for the language pair

  11. Structural Filtering • linearize the HTML structure and ignore the actual linguistic content of the documents • align the linearized sequences using a standard dynamic programming technique (Hunt and McIlroy 1975)

  12. STRAND Results (1) • Recall in this setting is measured relative to the set of candidate pairs that was generated • Precision • “Was this pair of pages intended to provide the same content in the two different languages?” • Asking the question in this way leads to high rates of interjudge agreement, as measured using Cohen’s µ measure

  13. STRAND Results (2) • Using the manually set thresholds for dp and n,we have obtained 100% precision and 68.6% recall in an experiment using STRAND to find English-French Web pages (Resnik 1999) • to obtain English-Chinese pairs and in a similar formal evaluation, we found that the resulting set had 98% precision and 61% recall

  14. Assessing the STRAND Data • a translation lexicon automatically extracted from the French-English STRAND data could be combined productively with a bilingual French-English dictionary in order to improve retrieval results using a standard cross-language IR test collection (English queries against the CLEF-2000 French collection) • backing off from the dictionary to the STRAND translation lexicon accounted for over 8% of the lexicon matches (by token) • reducing the number of untranslatable terms by a third and producing a statistically significant 12% relative improvement in mean average precision as compared to using the dictionary alone

  15. 30 human-translated sentence pairs from the FBIS (Release 1) English-Chinese parallel corpus, sampled at random. • 30 Chinese sentences from the FBIS corpus, sampled at random, paired with their English machine translation output from AltaVista’s Babelfishhttp://babel.sh.altavista.com • 30 paired items from Chinese-English Web data, sampled at random from “sentence-like” aligned chunks as identified using the HTML-based chunk alignment process of STRAND

  16. Optimizing Parameters Using Machine Learning • Using the English-French data, we constructed a ninefold cross-validation experiment using decision tree induction to predict the class assigned by the human judges • The decision tree software was the widely used C5.0

  17. Content-Based Matching • Ma and Liberman (1999) point out, not all translators create translated pages that look like the original page • structure-based matching is applicable only in corpora that include markup • other applications for translation detection exist • a generic score of translational similarity that is based upon any word-to-word translation lexicon • Define a CL sim score, tsim

  18. Quantifying Translational Similarity 原來的方法只用Doc的structure 現在要用字字間的sim • 所以要有lexicon • (Melamed’s [2000] Method A) • link be a pair (x, y) • maximum-weighted bipartitie matching problem O(max(|X|,|Y|)3)

  19. Exploiting the Internet Archive • The Internet Archive 120TB (+8TB per month) http://www.archive.org/web/researcher/ is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public. • Wayback Machine • Temporal database, but not stored in temporal order • Need decompression • Almost data are stored in compressed plain-text files

  20. STRAND on the Archive • Extracting URLs from index files using simple pattern matching • Combining the results from step 1 into a single huge list • Grouping URLs into buckets by handles • Generating candidate pairs from buckets

  21. URLs similarity • We arrived at an algorithmically simple solution that avoids this problem but is still based on the idea of language-specific substrings (LSSs). • The idea is to identify a set of language-specific URL substrings that pertain to the two languages of interest (e.g., based on language names, countries, character codeset labels, abbreviations, etc.)

  22. Building an English-Arabic Corpus • Finding English-Arabic Candidate Pairs on the Internet Archive • 24 top-level national domains • Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw) … • 21 .com • emiratesbank.com, checkpoint.com …

  23. Future Work • Weight in the dictionary can be exploited • IR IDF scores for discerning • Bootstrapping • Seed to form high-precision initial classifiers

  24. Search Engine Estimate • According to (Dec. 31, 2002) www.searchengineshowdown.com • Google 3,033 millions • Altavista 1,689 millions 1:1.795estimates are based on exact counts obtained from AlltheWeb multiplied by the percentage of Relative Size Showdown as compared to the number found by AlltheWeb • Our estimate for Chinese page count • ASBC去掉 標點符號 ABC ㄅㄆㄇ 未辨識 等符號共使用5915個term[單字詞] (共出現7,844,439次)用這5919個term的中間2000名跟google和altavista的relative count 5021倍 2181倍google是altavista的2.3倍

More Related