The use of machine translation tools for cross-lingual text-mining

Blaz Fortuna

Jozef Stefan Institute, Ljubljana

John Shawe-Taylor

Southampton University

Outline
  • Cross-lingual text mining
  • Kernel CCA
  • Machine translation
  • Information retrieval experiment
  • Classification experiment
  • Conclusions
Cross-lingual text mining

When text mining is applied to a multilingual text corpus, language-specific issues appear:

Information retrieval: retrieved documents should depend only on the meaning of the query and not its language.

Classification: a single classifier should be learned, not a separate classifier for each language.

Clustering: documents should be grouped into clusters based on their content, not on the language they are written in.

KCCA (Kernel Canonical Correlation Analysis)

[Figure: paired English and German word lists illustrating semantic dimensions: "loss, income, company, quarter" / "verlust, einkommen, firma, viertel"; "wage, payment, negotiations, union" / "zahlung, volle, gewerkschaft, verhandlungsrunde".]

KCCA learns a semantic representation of the text from a corpus of unlabeled paired documents.

  • As input we have a set of paired documents (each document is available in every language).
  • As output we get a set of mappings from each native-language space into a “language-independent space”, a subspace with semantic dimensions.

[Vinokourov et al., 2002]
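
The input/output description above can be made concrete with a small sketch. This is not the authors' implementation: it assumes linear kernels on term vectors, one common regularized form of KCCA, no kernel centring, and NumPy. The names `fit_kcca`, `project`, `kappa` and the toy data are hypothetical.

```python
import numpy as np

def fit_kcca(X, Y, n_dims=2, kappa=0.1):
    """Regularized KCCA with linear kernels (illustrative sketch).

    X, Y: (n_docs, n_terms) matrices for the two languages; row i of X
    and row i of Y describe the same document.
    """
    Kx, Ky = X @ X.T, Y @ Y.T              # linear kernel matrices
    n = Kx.shape[0]
    Rx = Kx + kappa * np.eye(n)            # regularized kernels
    Ry = Ky + kappa * np.eye(n)
    # Maximize a^T Kx Ky b subject to |Rx a| = |Ry b| = 1.
    # With u = Rx a, v = Ry b this is an SVD of Rx^-1 Kx Ky Ry^-1.
    M = np.linalg.solve(Rx, Kx) @ np.linalg.solve(Ry, Ky).T
    U, s, Vt = np.linalg.svd(M)
    alpha = np.linalg.solve(Rx, U[:, :n_dims])
    beta = np.linalg.solve(Ry, Vt[:n_dims].T)
    return alpha, beta, s[:n_dims]

def project(X_train, coeffs, X_new):
    """Map new documents into the shared semantic space."""
    return (X_new @ X_train.T) @ coeffs

# Toy paired data (assumption): both "languages" are linear views
# of the same 2 latent topics.
rng = np.random.default_rng(0)
topics = rng.normal(size=(30, 2))
X_en = topics @ rng.normal(size=(2, 5))
X_fr = topics @ rng.normal(size=(2, 6))

alpha, beta, corr = fit_kcca(X_en, X_fr)
en_proj = project(X_en, alpha, X_en)   # English documents in semantic space
fr_proj = project(X_fr, beta, X_fr)    # French documents in semantic space
```

Here `alpha` and `beta` play the role of the two mappings: a document in either language is compared against the training documents of its own language and then mapped into the same low-dimensional space.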


Paired training set and machine translation

KCCA needs a paired dataset for training. When no paired dataset is available, we have two options:

  • Use a human-made dataset from some other domain.
    • This can be unreliable because of the large semantic and vocabulary gap between domains.
  • Use machine translation tools to generate a paired dataset.
    • In our experiments we used Google Language Tools to translate the documents.
Experiments
  • We investigated how the quality of a training set generated by machine translation compares with a true human-generated paired corpus.
  • Two major issues are addressed. How much do we win or lose by using machine translation when a human-generated corpus is available
    • for the target domain?
    • only for a different domain?
Experiment #1 – Information retrieval

We compared two paired corpora:

  • Hansard corpus: aligned pairs of text chunks from the official records of the 36th Canadian Parliament Proceedings. [Germann, 2001]
  • Artificial corpus: half of the English and half of the French translations from the Hansard corpus were replaced by machine translation.

Queries were generated from each test document by extracting the 5 words with the highest TF-IDF weights.

The goal was to retrieve the paired document.
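
The query construction can be sketched as follows. The slides do not say which TF-IDF variant was used, so this assumes raw term frequency multiplied by log(N / document frequency); the function name `top_tfidf_terms` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def top_tfidf_terms(doc_tokens, corpus_tokens, k=5):
    """Return the k terms of one document with the highest TF-IDF weight."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    # Terms never seen in the corpus are skipped (assumption).
    scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if df[t] > 0}
    # Sort by descending weight; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy corpus (assumption): three tokenized documents.
corpus = [
    "the parliament debated the budget bill".split(),
    "the minister answered the question".split(),
    "the committee reviewed the report".split(),
]
query_terms = top_tfidf_terms(corpus[0], corpus, k=2)
```

A word such as "the", which occurs in every document, gets weight zero and is never chosen ahead of document-specific words.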

Experimental procedure (for each corpus):

(1) KCCA was trained on 1500 paired documents.

(2) All 896 test documents (in both languages) were projected into the KCCA semantic space.

(3) Each query was projected into the KCCA semantic space, and documents were retrieved by nearest-neighbour search based on cosine distance to the query.
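
Step (3) amounts to ranking documents by cosine similarity in the semantic space. A minimal sketch, assuming the query and the documents have already been projected and that no vector is all zeros; the function name `retrieve` and the toy vectors are hypothetical.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=10):
    """Return indices of the top_k documents closest to the query.

    Ranking by highest cosine similarity is the same as ranking by
    lowest cosine distance.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                              # cosine similarity per document
    return np.argsort(-sims, kind="stable")[:top_k]

# Toy projected documents and query (assumption).
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.1])
ranking = retrieve(query, docs)
```

Because both sides are normalized, the length of a document or query does not affect the ranking, only its direction in the semantic space.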

Results

For 65% of queries the correct document appeared in first place.

For 95% of queries the correct document appeared among the first 10 results.

There is no difference when the query and the document are in the same language.

When the query and the document are in different languages, there is a drop of around 5-10% in retrieval accuracy.
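
The first-place and top-10 figures are hit rates at rank 1 and rank 10. A minimal sketch of how such a figure could be computed, assuming one ranked result list and one correct paired document per query; `hit_rate_at_k` and the toy rankings are hypothetical and do not reproduce the reported numbers.

```python
def hit_rate_at_k(ranked_lists, correct_ids, k):
    """Fraction of queries whose paired document is among the first k results."""
    hits = sum(
        1 for ranked, correct in zip(ranked_lists, correct_ids)
        if correct in ranked[:k]
    )
    return hits / len(correct_ids)

# Toy rankings for three queries (assumption).
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
paired = [3, 1, 0]
first_place = hit_rate_at_k(rankings, paired, k=1)
top_two = hit_rate_at_k(rankings, paired, k=2)
```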

Experiment #2 – Classification

The Reuters multilingual corpus (English and French) was used as the dataset. [Reuters, 2004]

  • The first paired training set, Hansard, was taken from the previous experiment; it comes from a different domain than news articles.
  • The second paired training set was generated from the Reuters dataset using machine translation (Google).

Experimental procedure (for each corpus):

(1) KCCA was trained on 1500 paired documents.

(2) The whole Reuters corpus was projected into the KCCA semantic space.

(3) A linear SVM classifier was learned in the KCCA semantic space on a subset of 3000 documents and tested on a subset of 50,000 (results are averaged over 5 random splits).

Results

Number of KCCA dimensions: 800

FE … French training set, English testing set.

The artificial paired training set generates a significantly better semantic space than a training set taken from a different domain!

Conclusions

We have shown that machine translation can be used to generate a training set for Kernel CCA that gives almost as good performance as a training set made by human translators.

When no hand-made translations are available, this can significantly decrease the cost of multilingual text mining.

We would also like to thank Miha Grcar for making an automated interface to Google Language Tools!