The web as a parallel corpus
The Web as a Parallel Corpus

A paper by Philip Resnik and Noah A. Smith

(2003, Computational Linguistics)

My interpretation of their research.

Contents

  • Introduction to parallel corpora

  • The STRAND Web-mining architecture (established 1999)

  • Content-Based Matching

  • Exploiting the Internet Archive

  • Conclusions and Further Work

Introduction to parallel corpora

  • The Rosetta Stone dates to around 196 BC. Its three texts carry the same content in Egyptian hieroglyphic, Demotic, and ancient Greek (two languages written in three scripts).

  • The Canadian Hansard and the Hong Kong Hansard are two other famous parallel corpora, especially because they are available electronically and are of high quality.

  • Motivation: bitexts provide indispensable training data for statistical translation models.

  • The Web can be mined for suitable bilingual and multilingual texts.

STRAND: Web-Mining Architecture(1)

  • Structural Translation Recognition, Acquiring Natural Data (STRAND) is the authors’ software for finding pairs of Web pages that are translations of each other.

  • Using more parallel texts is always to the advantage of machine translation research and implementation.

  • How does STRAND work?

  • 1) Locating pages that might have parallel translations: look for “parent pages” and “sibling pages”. The page author has most probably embedded a language link such as “Chinese” or “Arabic” in the page.

  • 2) Generating candidate pairs that might be translations: check whether the pairs have the same HTML structure.

  • 3) Structurally filtering out the non-translation candidate pairs: examine the content of the pairs.

STRAND: Web-Mining Architecture(2)

  • 1) Locating pairs: candidate pairs typically come from a single Web site. STRAND looks for “sibling” pages, which are often linked to each other by links offering the user Français, Español, or other language options.

  • 2) Generating pairs: for many Web sites, the URLs are compared: …

  • 3) Structural filtering: first look at the HTML structure; Web-page authors often use the same or a very similar template. Next, a markup analyzer uses three token types to produce a linear representation of each of the two candidate pages.
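The URL comparison in step 2 can be sketched as follows. This is a minimal sketch: the language markers and the substitution rule below are illustrative, not the authors’ exact patterns.

```python
import re

# Hypothetical language markers; STRAND's actual substitution rules differ.
LANG_MARKERS = [("en", "fr"), ("english", "french"), ("e", "f")]

def candidate_pairs(urls):
    """Pair URLs that differ only by a language marker, e.g. a path
    segment 'en' whose 'fr' counterpart also exists on the site."""
    url_set = set(urls)
    pairs = set()
    for url in urls:
        for src, tgt in LANG_MARKERS:
            # Replace the marker only when it is delimited by non-letters,
            # so 'en' inside other words is left alone.
            guess = re.sub(rf"(?<![a-z]){src}(?![a-z])", tgt, url)
            if guess != url and guess in url_set:
                pairs.add((url, guess))
    return sorted(pairs)

urls = ["http://site.ca/en/index.html", "http://site.ca/fr/index.html",
        "http://site.ca/en/about.html"]
print(candidate_pairs(urls))
# [('http://site.ca/en/index.html', 'http://site.ca/fr/index.html')]
```

Note that the unmatched `en/about.html` page generates a guess with no existing counterpart and is discarded, which is the point of the exact-match check against the site’s URL set.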

STRAND: Web-Mining Architecture(3)

  • Candidate pairs (English page on the left, French on the right):

    <HTML>                        <HTML>
    <TITLE>City Hall</TITLE>      <TITLE>Hotel de Ville</TITLE>
    <BODY>                        <BODY>
    <H1>Regional Government</H1>  …
    The business…                 Les affaires…

  • Candidate pairs, now formed into two linear alignments:

    …                             …
    [Chunk: 8]                    [Chunk: 12]
    …                             …
    [Chunk: 18]                   [Chunk: 138]
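The linearization step can be sketched with Python’s standard HTML parser. The three token types follow the paper’s description ([START:tag], [END:tag], [Chunk:length]); the specific chunk values in the slide are illustrative.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Flatten HTML into the three token types used by STRAND's
    markup analyzer: [START:tag], [END:tag], [Chunk:length]."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # non-markup text becomes a length-coded chunk
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    lin = Linearizer()
    lin.feed(html)
    return lin.tokens

print(linearize("<HTML><TITLE>City Hall</TITLE></HTML>"))
# ['[START:HTML]', '[START:TITLE]', '[Chunk:9]', '[END:TITLE]', '[END:HTML]']
```

Because markup tokens from a shared template match exactly while translated text differs only in chunk length, two linearized pages from the same template align almost token for token.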

Using these 2 linear alignments

  • We use four scalar values to characterize the quality of the alignment:

  • dp (difference percentage) = the percentage of alignment tokens that do not match

  • n = the number of aligned non-markup text chunks

  • r = the correlation of the lengths of the aligned non-markup chunks

  • p = the significance level of the correlation r

  • The analysts can then manually set thresholds on these parameters and check the results. 100% precision and 68.6% recall have been obtained using STRAND to find English-French Web pages.

Optimizing Parameters Using Machine Learning

  • A nine-fold cross-validation experiment using decision-tree induction was used to predict the class assigned by the human judges. The learned classifiers differed substantially from the manually set (heuristic) thresholds.

  • Manually set: 31% of good document pairs were discarded.

  • ML-set: 16% of good pairs discarded (4% false positives).
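The benefit of learning thresholds from labeled data rather than setting them by hand can be illustrated with a much-simplified stand-in for decision-tree induction: a single learned cutoff on dp. The labeled pairs below are hypothetical.

```python
def learn_dp_threshold(examples):
    """Pick the dp cutoff that best separates good from bad pairs.
    A one-feature stand-in for decision-tree induction: a pair is
    classified 'good' when its dp falls below the threshold."""
    candidates = sorted({dp for dp, _ in examples})
    best_t, best_acc = 0.0, 0.0
    for t in candidates + [max(candidates) + 1]:
        # Accuracy of the rule "good iff dp < t" on the labeled data.
        acc = sum((dp < t) == is_good for dp, is_good in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical labeled pairs: (dp, judged good by a human?)
data = [(5.0, True), (12.0, True), (20.0, False), (35.0, False)]
print(learn_dp_threshold(data))  # (20.0, 1.0)
```

A real decision tree would consider dp, n, r, and p jointly and split on several of them; the point here is only that a cutoff fit to human judgments can beat a cutoff guessed in advance.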



  • Other Related Work

  • Some analysts use the Parallel Text Miner (PTMiner), which uses existing search engines to locate pages likely to be in the other language of interest. A final filtering stage then cleans the corpus.

  • Bilingual Internet Text Search (BITS) is used by other researchers and utilizes different matching techniques.

  • STRAND, PTMiner, and BITS are all largely independent of linguistic knowledge about particular languages, and therefore very easily ported to new language pairs.

  • Resnik has looked into English-Arabic, English-Chinese (Big5), and English-Basque.

Mining the Web

  • Researchers can and do mine the internet every day. The physicist Albert-László Barabási has had his team study the size, shape, and structure of the internet, as well as the hit frequencies of numerous Web pages.

  • Spiders, or crawlers, are used in this research.

  • The Internet Archive is also instrumental in obtaining useful information.

The Internet Archive

  • The Internet Archive is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public.

  • (120 terabytes of information in 2002)

  • Over 10 billion Web pages.

  • Properties of the Archive:

  • 1) The Archive is a temporal database, but it is not stored in temporal order.

  • 2) Extracting a document is an expensive operation (text extraction).

  • 3) Computational complexity must be kept low when mining this database.

  • 4) Data relevant for linguistic purposes are clearly available.

  • 5) A suite of tools exists for linguistic processing of the Archive.

Building an English-Arabic Corpus

  • Step 1: search for English-Arabic pairs. Look at 24 top-level national domains for countries where Arabic is spoken, e.g. Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), plus other .com domains believed to be useful to Arabic-speaking people.

  • Step 2: Resnik et al. mined two crawls of the Internet Archive, comprising 8 TB and 12 TB. The relevant domains contained 19,917,923 pages.

  • Step 3: only 8,294 pairs of English-Arabic bitexts were found.
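Step 1’s domain restriction can be sketched as a simple URL filter. Only the three country domains named above are included here; the paper’s full list covers 24 national domains plus selected .com sites.

```python
from urllib.parse import urlparse

# Three of the 24 targeted national top-level domains (illustrative subset).
ARABIC_TLDS = {"eg", "sa", "kw"}

def in_arabic_domain(url):
    """True if the URL's hostname ends in a targeted national TLD."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] in ARABIC_TLDS

urls = ["http://www.gov.eg/page.html",
        "http://example.com/ar/index.html",
        "http://news.kw/home.html"]
print([u for u in urls if in_arabic_domain(u)])
# ['http://www.gov.eg/page.html', 'http://news.kw/home.html']
```

A TLD filter like this only narrows the crawl; pages on .com sites serving Arabic speakers (the slide’s other criterion) would need a separate whitelist or content-based language check.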


Conclusions and Further Work

  • Initial Web searches for parallel texts were undertaken in 1998; Resnik’s report is from 2002. The author laments the lack of different languages available on the internet, as well as the lack of data made available by some countries.

  • The growth of both the internet and the internet archive will considerably add to the expansion of parallel corpora.

  • Chen and Nie (2000), for example, have found around 15,000 English-Chinese document pairs.

  • One of the early STRAND projects for English-Chinese parallel texts found over 70,000 pairs.

  • Because STRAND expects pages to be very similar in structural terms, the resulting document collections are particularly amenable to sentence- or segment-level alignment.