Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval


Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval. Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2004/11/24. References:


Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval

Wen-Hsiang Lu (盧文祥)

Department of Computer Science and Information Engineering,

National Cheng Kung University

2004/11/24

  • References:

    • Wen-Hsiang Lu (Advisors: Lee-Feng Chien and Hsi-Jian Lee.) (2003) Term Translation Extraction Using Web Mining Techniques, PhD thesis, Department of Computer Science and Information Engineering, National Chiao Tung University.


Outline

  • Background & Research Problems

  • Anchor Text Mining for Term Translation Extraction

  • Transitive Translation for Multilingual Translation

  • Web Mining for Cross-Language Information Retrieval and Web Search Applications


Part I Background & Research Problems


Motivation

  • Demands on multilingual translation lexicons

    • Machine translation (MT)

    • Cross-language information retrieval (CLIR)

    • Information exchange in electronic commerce (EC)

  • Web mining

    • Explore multilingual and wide-scoped hypertext resources on the Web


Research Problems

  • Difficulties in automatic construction of multilingual translation lexicons

    • Techniques: Parallel/comparable corpora

    • Bottlenecks: Lack of diverse/multilingual resources

  • Difficulties in query translation for cross-language information retrieval (CLIR) [Fig1]

    • Techniques: Bilingual dictionary/machine translation/parallel corpora

    • Bottlenecks: Multiple-sense/short/diverse/unknown queries [Fig2]


Cross-Language Information Retrieval

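  • Query in a source language and retrieve relevant documents in target languages

[Figure: CLIR flow. The source query "Hussein" goes through query translation to target translations, 海珊/侯賽因/哈珊/胡笙 (TC, traditional Chinese) and 侯赛因/海珊/哈珊 (SC, simplified Chinese), which are then used for information retrieval over the target documents.]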


Difficulties in Query Translation using Machine Translation Systems

English source query: National Palace Museum

Chinese translation: 全國宮殿博物館 (a literal rendering; the museum's actual Chinese name is 國立故宮博物院)


Research Paradigm

[Figure: Research paradigm (new approach). Components shown: Internet; Web mining (anchor-text mining, search-result mining); term-translation extraction; live translation lexicon; applications (cross-language information retrieval, cross-language Web search).]


Multilingual Anchor Texts & Hyperlink Structure


Language-Mixed Texts in Search Result Pages


Research Results

  • Anchor text mining for term translation extraction

    • ACM SIGIR’01(poster), IEEE ICDM’01, ACM Trans. on Asian Language Information Processing 2002

    • Reviewers’ encouraging comments

      • “… the approach seems to be quite novel. To my knowledge, there has not been a proposal of uses of anchor texts like this one.”

  • Transitive translation for multilingual translation

    • COLING’02, ACM Trans. on Information Systems (first paper from Taiwan since 1986), ACL’04

    • Reviewers’ encouraging comments

      • “This is a nicely written, technically sound paper that pursues a clever and original idea …”

      • “… the idea of using anchor texts from the Web to learn cross-lingual information retrieval algorithms is very good …”

      • “I enjoyed the paper and thought the underlying work was interesting and valuable …”


Research Results (cont.)

  • Web mining for cross-language Web search

    • ROCLING’03, ACM SIGIR’04

    • Improve precision rate from 0.207 (dictionary-based) to 0.241 on NTCIR-2 Chinese-English CLIR evaluation task

    • Reviewers’ encouraging comments

      • “It gives us insight into the value of the Web as a dynamic information source. Although the experiments are restricted to Chinese-English documents, also developers for other languages may find this work stimulating.”

      • “The idea is interesting, and is relatively new. It may give inspiration to other researchers working in the same area.”

  • LiveTrans: Experimental CLWS system [LiveTrans]


LiveTrans: Cross-Language Web Search System

  • http://livetrans.iis.sinica.edu.tw/lt.html [LiveTrans]

    • Mirror: http://wmmks.csie.ncku.edu.tw/lt.html [LiveTrans]

  • System functions

    • Query-translation suggestion

    • Retrieval of Web pages and images.

    • Multilingual search: English, traditional Chinese, simplified Chinese, Japanese or Korean

    • Gloss translation for retrieved page titles

    • Fusion of retrieval results


Research Results (cont.)

  • Summary of contributions

    • Present an innovative approach

      • Significantly reduce the difficulty of unknown-term translation.

      • CLIR can be improved especially for short queries.

    • Develop a practical cross-language Web search engine

      • Without relying on translation dictionary

      • A live dictionary with a significant number of multilingual term translations obtained.

    • Present a new problem for further investigation in Web Mining


Related Research

  • Automatic extraction of multilingual translations

    • Statistical translation model (Brown 1993)

    • Parallel corpus (Melamed 2000; Wu & Chang 2003)

    • Non-parallel/comparable corpus (Fung 1998; Rapp 1999)

    • Web mining

      • Parallel corpus collection (Nie 1999; Resnik 1999)

      • Comparable corpus collection: Anchor texts and search-result pages (Lu et al. 2002, 2003)

      • Strength: Huge amounts of Web data with link structure


Related Research (cont.)

  • Query translation for cross-language information retrieval

    • Dictionary-/MT-based approach (Ballesteros & Croft 1997; Hull & Grefenstette 1996)

    • Corpus-based approach (Dumais 1997; Nie 1999)

    • Combined approach (Chen & Bian 1999; Kwok 2001)

    • Improving techniques

      • Query expansion and phrase translation (Ballesteros & Croft 1997)

      • Translation disambiguation (Ballesteros & Croft 1998; Chen & Bian 1999)

      • Proper name transliteration (Chen et al. 1998; Lin & Chang 2003)

      • Probabilistic retrieval/language models (Hiemstra & de Jong 1999; Lavrenko 2002)

      • Unknown query translation (Lu et al. 2002, 2003)


Related Research (cont.)

  • Cross-language Web search (CLWS)

    • Practical CLWS services have not lived up to expectations

      • Keizai (Ogden et al. 1999): English query/Japanese, Korean Web news

      • MTIR (Bian & Chen 1999): Chinese query/English pages/translation

      • MuST: Multilingual Summarization and Translation (Hovy & Lin 1998)

        • English/Indonesian/Spanish/Arabic/Japanese, Web news summarization or translation

      • TITAN (Hayashi et al. 1997): English-Japanese retrieval/translated page titles

  • Challenges

    • Web queries are often

      • Short: 2-3 words (Silverstein et al. 1998)

      • Diverse: wide-scoped topic

      • Unknown (out of vocabulary): 74% are not found in the CEDICT Chinese-English electronic dictionary, which contains 23,948 entries.

    • E.g.

      • Proper name: 愛因斯坦 (Einstein), 海珊 (Hussein)

      • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (Nosocomial infections)


Part II Anchor Text Mining for Term Translation Extraction


Anchor-Text Set

  • Anchor text (link text)

    • The descriptive text of a link on a Web page

  • Anchor-text set

    • A set of anchor texts pointing to the same page (URL)

    • Multilingual translations

      • Yahoo/雅虎/야후

      • America/美国/アメリカ

  • Anchor-text-set corpus

    • A collection of anchor-text sets

[Figure: Anchor texts from pages in Korea, Japan, Taiwan and China pointing to the same page, http://www.yahoo.com. Anchor texts shown: 야후-USA, Yahoo Search Engine, Yahoo! America, アメリカのYahoo!, 美国雅虎, 雅虎搜尋引擎.]
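The definitions on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the crawler used in the lecture: the input format (pairs of target URL and anchor text), the function name `build_anchor_text_sets`, and the sample "Yahoo Taiwan" link are assumptions made for the example.

```python
from collections import defaultdict

def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to.

    `links` is an iterable of (target_url, anchor_text) pairs, for
    example as emitted by a crawler (hypothetical input format).
    Each value of the result is one anchor-text set.
    """
    sets = defaultdict(list)
    for url, text in links:
        sets[url].append(text)
    return dict(sets)

# Illustrative links, loosely based on the Yahoo example in the slides.
links = [
    ("http://www.yahoo.com", "Yahoo Search Engine"),
    ("http://www.yahoo.com", "美国雅虎"),
    ("http://www.yahoo.com", "雅虎搜尋引擎"),
    ("http://www.yahoo.com", "アメリカのYahoo!"),
    ("http://www.yahoo.com.tw", "Yahoo Taiwan"),  # hypothetical anchor text
]

# The anchor-text-set corpus is the collection of all such sets.
corpus = build_anchor_text_sets(links)
```

The number of anchor texts collected for a URL can also serve as a rough in-link count for that page.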


Processing of Term Translation Extraction

[Figure: Term translation extraction pipeline. Components shown: Internet, Web spider, Web pages, anchor-text extraction, anchor-text-set corpus, term extraction, term similarity estimation, translation lexicon. Input: source query term. Output: target translation.]

  • Collect Web pages and build up the anchor-text-set corpus.

  • Extract key terms as translation candidates.

  • Compute similarity using the probabilistic inference model.


Example for Term Translation Extraction

[Figure: Example from the Chinese-English anchor-text-set corpus, with source query term s = "Yahoo" and target translation t = "雅虎". Set u1 holds the anchor texts of www.yahoo.com (#in-link = 187), with terms such as "Yahoo", "雅虎" and "搜尋引擎". Set u2 holds the anchor texts of www.yahoo.com.tw (#in-link = 21), with terms such as "Yahoo", "台灣" and "雅虎". The figure labels two cues: co-occurrence and page authority.]


Probabilistic Inference Model

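  • Conventional translation model

  • Asymmetric translation models: [equation not preserved in this transcript]

  • Symmetric model with link information: [equation not preserved in this transcript]

    • Combines two factors: co-occurrence and page authority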
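The formulas on this slide did not survive in the transcript, so the following Python sketch only illustrates how co-occurrence and page authority might be combined. It assumes that the symmetric score is P(s∩t)/P(s∪t) summed over anchor-text sets, that s and t are independent within a set, and that the authority of a page is its share of in-links. The in-link counts 187 and 21 come from the example slide; the per-set term probabilities are invented for illustration.

```python
def page_authority(inlinks):
    """Assumed authority estimate: each page's share of all in-links."""
    total = sum(inlinks.values())
    return {u: n / total for u, n in inlinks.items()}

def symmetric_similarity(s, t, term_probs, inlinks):
    """Assumed symmetric score P(s and t) / P(s or t), weighted by authority.

    term_probs[u][x] is an estimate of P(x | u) for anchor-text set u.
    """
    authority = page_authority(inlinks)
    joint = 0.0
    union = 0.0
    for u, p_u in authority.items():
        p_s = term_probs[u].get(s, 0.0)
        p_t = term_probs[u].get(t, 0.0)
        both = p_s * p_t                      # independence assumption
        joint += both * p_u
        union += (p_s + p_t - both) * p_u
    return joint / union if union > 0 else 0.0

# In-link counts from the slide; term probabilities are illustrative only.
inlinks = {"u1": 187, "u2": 21}
term_probs = {
    "u1": {"Yahoo": 0.5, "雅虎": 0.3, "搜尋引擎": 0.2},
    "u2": {"Yahoo": 0.4, "雅虎": 0.4, "台灣": 0.2},
}
score_yahoo = symmetric_similarity("Yahoo", "雅虎", term_probs, inlinks)
score_taiwan = symmetric_similarity("Yahoo", "台灣", term_probs, inlinks)
```

Under these assumptions "雅虎" scores higher than "台灣" for the query "Yahoo", because it co-occurs with the query in both sets, including the higher-authority one.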


Experimental Environment

  • Anchor-text-set corpora

    • 109,416 traditional-Chinese-English sets (from 1,980,816 pages)

    • 157,786 simplified-Chinese-English sets (from 2,179,171 pages)

  • Test query set

    • Query logs:

      • Dreamer log: 228,566 unique query terms

      • GAIS log: 114,182 unique query terms

    • Core terms: 9,709 most popular query terms, frequencies >10 in two logs

    • Test set: 622 English terms selected from core terms

  • Average top-n inclusion rate (ATIR)
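The ATIR formula is not preserved in this transcript. The sketch below assumes the usual reading of a top-n inclusion rate: the fraction of test terms for which at least one correct translation appears among the first n extracted candidates. The function name and the tiny gold-standard sets are hypothetical.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Fraction of test terms with a correct translation in the top n.

    extracted[term] is a ranked list of candidate translations;
    gold[term] is the set of translations judged correct (assumed input).
    """
    if not extracted:
        return 0.0
    hits = 0
    for term, candidates in extracted.items():
        correct = gold.get(term, set())
        if any(c in correct for c in candidates[:n]):
            hits += 1
    return hits / len(extracted)

# Illustrative data; the gold labels are assumptions made for this example.
extracted = {
    "sakura": ["台灣櫻花", "櫻花", "蜘蛛網"],
    "yahoo": ["雅虎"],
}
gold = {"sakura": {"櫻花"}, "yahoo": {"雅虎"}}
top1 = top_n_inclusion_rate(extracted, gold, 1)
top2 = top_n_inclusion_rate(extracted, gold, 2)
```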


Performance with Different Estimation Models

  • Using different models

    • MA: Asymmetric model

    • MAL: Asymmetric model with link information

    • MS: Symmetric model

    • MSL: Symmetric model with link information

  • The symmetric inference model with link information (MSL) helped improve translation accuracy.


Performance with Different Term Extraction Methods and Query-Log-Set Sizes

  • The query-log-based method achieved better performance.

  • The medium-sized query-log set achieved the best performance


Performance Comparison

  • Example: Test term "sakura"

    • Query-log set (9,709 terms)

      • Top 5 extracted translations: 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護

    • Query-log set (228,566 terms)

      • Top 10 extracted translations: 庫洛魔法使, 櫻花建設, 模仿, 櫻花大戰, 美夕, 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護

  • Test results of 9,709 core terms [TTE9709]

  • Promising results


Part III Transitive Translation for Multilingual Translation


Transitive Translation for Multilingual Translation

  • Problem

    • Insufficient anchor-text-set corpus for certain language pairs

    • E.g., Chinese-Japanese, Chinese-French, etc.

  • Goal

    • A generalized model for multilingual translation

  • Idea

    • Transitive translation model: Extract translations via intermediate (third) language, e.g., English (Borin 2000; Gollins & Sanderson 2001)

    • Integrate a competitive linking algorithm to reduce interference errors.


Transitive Translation: Combining Direct and Indirect Translation

  • Direct Translation Model

  • Indirect Translation Model

  • Transitive Translation Model

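[Figure: Direct translation links the source term s = 新力 (Traditional Chinese) to the target translation t = ソニー (Japanese). Indirect translation goes through the intermediate translation m = Sony (English).]

s: source term

t: target translation

m: intermediate translation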
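The three model equations are not preserved in this transcript. As a rough sketch only, the code below assumes that the indirect score sums, over intermediate terms m, the product of the s-to-m and m-to-t similarities, and that the transitive model uses the direct score when it exceeds a threshold and falls back to the indirect score otherwise. The threshold value, function names and similarity values are all hypothetical.

```python
def indirect_score(s, t, sim_sm, sim_mt):
    """Assumed indirect score: sum over m of sim(s, m) * sim(m, t)."""
    return sum(
        score_sm * sim_mt.get(m, {}).get(t, 0.0)
        for m, score_sm in sim_sm.get(s, {}).items()
    )

def transitive_score(s, t, sim_direct, sim_sm, sim_mt, threshold=0.05):
    """Use the direct score if it is reliable enough, else the indirect one.

    `threshold` is a hypothetical tuning parameter.
    """
    direct = sim_direct.get(s, {}).get(t, 0.0)
    if direct > threshold:
        return direct
    return indirect_score(s, t, sim_sm, sim_mt)

# Illustrative similarities for the 新力 / Sony / ソニー example.
sim_sm = {"新力": {"Sony": 0.6}}        # Chinese -> English
sim_mt = {"Sony": {"ソニー": 0.5}}      # English -> Japanese
no_direct = transitive_score("新力", "ソニー", {}, sim_sm, sim_mt)
with_direct = transitive_score(
    "新力", "ソニー", {"新力": {"ソニー": 0.2}}, sim_sm, sim_mt
)
```

With no direct Chinese-Japanese evidence, the score is obtained through English; when direct evidence is available and above the threshold, it is used as is.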


Promising Results for Automatic Construction of Multilingual Translation Lexicons


Indirect Association Problem

  • Indirect association error (Melamed 2000)

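    • t1 co-occurs with s more often than t does

    • E.g., 思科 → system (translation error)

[Figure: The source term s = 思科 is linked to t1 = system with score 0.11 and to the correct translation t = Cisco with score 0.07.]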


Competitive Linking Algorithm

  • Concepts of competitive linking (CL) algorithm (Melamed 2000)

    • Determine the most likely translation pairs between source and target sets.

    • Assumption: each term has only one translation.

    • Method:

      • Greedily select the most likely edges.

      • Then select less likely edges that do not conflict with previous selections.

  • Integration of anchor-text-mining and CL Algorithm

    • Build a bipartite graph using our proposed translation model.

    • Use the extended CL algorithm to filter out indirect association errors.
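The basic competitive linking idea described above can be sketched as a greedy one-to-one matching. This is a simplified illustration under the one-translation-per-term assumption, not the lecture's implementation. The six weights echo numbers that appear in the slides, but which edge carries which weight (apart from 思科-system and 思科-Cisco) is an assumption.

```python
def competitive_linking(weights):
    """Greedy one-to-one linking over a weighted bipartite graph.

    weights maps (source_term, target_term) -> similarity score.
    Edges are visited from highest to lowest weight; an edge is kept
    only if neither endpoint has been linked already.
    """
    links = {}
    used_targets = set()
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for (s, t), _w in ranked:
        if s not in links and t not in used_targets:
            links[s] = t
            used_targets.add(t)
    return links

# Illustrative edge weights (edge assignment partly assumed).
weights = {
    ("思科", "system"): 0.11,
    ("思科", "Cisco"): 0.07,
    ("系統", "system"): 0.23,
    ("系統", "Cisco"): 0.01,
    ("資訊", "system"): 0.03,
    ("網路", "Cisco"): 0.004,
}
links = competitive_linking(weights)
```

Because 系統 claims "system" first, the indirect association 思科-system is blocked and 思科 is linked to "Cisco" instead.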


Bipartite Graph Construction

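[Figure: Two-step construction of the bipartite graph G = (S∪T, E). Step 1 connects the source term s = 思科 to its candidate translations t1 = system and t2 = Cisco. Step 2 adds the source-side term sets St1 and St2 associated with the candidates; the terms shown are 系統, 資訊, 網路 and 電腦.]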


Extended Competitive Linking Algorithm

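  • Pick the k most likely translations for a source term

[Figure: Step 1 and Step 2 of the extended algorithm on the bipartite graph for s = 思科, with candidates t1 = system and t2 = Cisco and source-side sets St1 and St2 (系統, 資訊, 網路, 電腦). Edge weights shown: 0.11, 0.07, 0.23, 0.01, 0.03, 0.004.]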


Direct_Translation_with_CL (s, U, Vt)

Input: source term s; Web pages of concern U; translation vocabulary set Vt

Output: target translation set R

  • Step 1: Construct the bipartite graph G = (S∪T, E) and compute the edge weights wij.

  • Step 2: Sort the wij and choose the edge ei*j* with the highest weight.

  • Step 3: If si* = s, set R = R ∪ {tj*}. If |R| = k, return R; otherwise remove all edges linking to tj* and re-estimate wij for the remaining edges.

  • Step 4: If si* ≠ s, remove all edges linking to si* or tj* and re-estimate wij for the remaining edges.

  • Step 5: If |E| = 0, return R; otherwise go back to Step 2.
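The flowchart on this slide can be approximated by the following Python sketch. It is a simplification under stated assumptions: edge weights are kept fixed instead of being re-estimated after each removal, the function name is hypothetical, and the assignment of the example weights to edges is partly assumed.

```python
def extended_competitive_linking(s, weights, k):
    """Collect up to k translations of s by competitive linking.

    weights maps (source_term, target_term) -> score. Unlike the
    flowchart, weights are NOT re-estimated after edges are removed.
    """
    edges = dict(weights)
    result = []
    while edges and len(result) < k:
        # Choose the remaining edge with the highest weight.
        (si, tj), _w = max(edges.items(), key=lambda kv: kv[1])
        if si == s:
            result.append(tj)
            # Keep the other edges of s so more translations can follow.
            edges = {e: w for e, w in edges.items() if e[1] != tj}
        else:
            # A competing source term wins tj: drop edges of both ends.
            edges = {
                e: w for e, w in edges.items()
                if e[0] != si and e[1] != tj
            }
    return result

# Illustrative edge weights (edge assignment partly assumed).
weights = {
    ("思科", "system"): 0.11,
    ("思科", "Cisco"): 0.07,
    ("系統", "system"): 0.23,
    ("系統", "Cisco"): 0.01,
    ("資訊", "system"): 0.03,
    ("網路", "Cisco"): 0.004,
}
translations = extended_competitive_linking("思科", weights, k=1)
```

In this example the edge 系統-system is chosen first, which removes "system" as a candidate for 思科, so the only translation returned is "Cisco".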


Performance of Proposed Models with CL Algorithm

  • Test query set: 258 terms (from 9,709 core terms)

  • Anchor-text-set corpora

    • Traditional Chinese-Simplified Chinese: 4,516 sets

    • Traditional Chinese-English: 109,416 sets

    • Simplified Chinese-English: 157,786 sets

  • Source/Target/Intermediate languages: Traditional Chinese/Simplified Chinese/English


Effective Translation Using CL Algorithm


Part IV Web Mining for Cross-Language Information Retrieval and Web Search Applications


Web Mining for Cross-Language Information Retrieval and Web Search Applications

  • Goal: Web mining to benefit CLIR and CLWS

    • Mining query translations from the Web

  • Idea: Integrated Web mining approach

    • Anchor-text-mining approach

      • Probabilistic inference model

      • Transitive translation model

    • Search-result-mining approach

      • Chi-square test

      • Context-vector analysis


Search-Result-Mining Approach

  • Goal: Enhance translation coverage for diverse queries

  • Idea

    • Comparable corpus: Language-mixed texts in search-result pages

    • Utilize co-occurrence relation and context information

      • Chi-square test

      • Context-vector analysis

  • Procedure of query translation based on search-result-mining


Chi-Square Test

  • Idea

    • Make use of all co-occurrence relations between the source and target terms.

  • Similarity measure (Gale & Church 1991)

  • 2-way contingency table

    • a: # of pages containing both terms s and t

    • b: # of pages containing term s but not t

    • c: # of pages containing term t but not s

    • d: # of pages containing neither term s nor t

    • N: the total number of pages, i.e., N = a+b+c+d
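The similarity formula itself is missing from this transcript. The sketch below assumes the standard chi-square statistic for a 2×2 contingency table, N(ad − bc)² / ((a+b)(a+c)(b+d)(c+d)), computed from the four page counts defined above. The function name is hypothetical and the counts in the example are made up.

```python
def chi_square_similarity(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table of page counts.

    a: pages with both s and t     b: pages with s but not t
    c: pages with t but not s      d: pages with neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Made-up counts: N = 100 pages in total.
score = chi_square_similarity(10, 20, 30, 40)
```

When page counts come from a search engine, d is usually not observed directly and can be taken as N − a − b − c for an assumed total page count N.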


Context-Vector Analysis

  • Idea

    • Take co-occurring context terms as feature vectors of the source/target terms.

  • Similarity measure

  • Weighting scheme: TF*IDF

Feature vector of s: (w_s1, w_s2, …, w_sm)

Feature vector of t: (w_t1, w_t2, …, w_tm)
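The similarity formula is not preserved in this transcript. A common choice for comparing TF*IDF feature vectors is the cosine measure, which the sketch below assumes. The exact TF*IDF variant used in the lecture is unknown, so the weighting here (raw term frequency times log(N/df)) and the function names are illustrative.

```python
import math

def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """Build a sparse TF*IDF vector (assumed variant: tf * log(N / df))."""
    return {
        x: tf * math.log(n_docs / doc_freqs[x])
        for x, tf in term_freqs.items()
        if doc_freqs.get(x, 0) > 0
    }

def cosine_similarity(vec_s, vec_t):
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(w * vec_t.get(x, 0.0) for x, w in vec_s.items())
    norm_s = math.sqrt(sum(w * w for w in vec_s.values()))
    norm_t = math.sqrt(sum(w * w for w in vec_t.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_s * norm_t)

# Illustrative context-term weights for a source and a target term.
vec_s = {"network": 1.0, "router": 2.0}
vec_t = {"network": 2.0, "router": 4.0}
similarity = cosine_similarity(vec_s, vec_t)
```

The two example vectors point in the same direction, so their similarity is 1 up to rounding, while vectors with no shared context terms score 0.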


Translation Selection Based on Chi-Square Test and Context-Vector Analysis

  • For each candidate

    • Chi-square test

      • Retrieve page frequencies by submitting the Boolean queries ‘s∩t’, ‘~s∩t’, and ‘s∩~t’ to search engines

      • Compute the similarity S_χ²(s, t)

    • Context-vector analysis

      • Retrieve the top m search results by submitting t to search engines, and generate its feature vector

      • Compute the similarity S_CV(s, t)


Integrated Web Mining Approach

  • Idea: Combine the complementary advantages of both approaches

    • Anchor-text-mining: good precision rate

    • Search-result-mining: good coverage rate

  • Combined similarity measure

    • α_m: an assigned weight for each similarity measure S_m

    • R_m(s, t): the similarity ranking between s and t using S_m
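The combined formula is missing from this transcript. Given the weight and the similarity ranking defined above, one plausible form is a weighted sum of reciprocal ranks, which is what this sketch assumes. The measure names, the equal weights and the ranks in the example are hypothetical.

```python
def combined_similarity(rankings, weights):
    """Assumed combination: sum over measures m of weight_m / rank_m.

    rankings[m] is the rank of candidate t for source s under measure m
    (1 = best), or None if the measure did not propose t at all.
    """
    return sum(
        weights[m] / rank
        for m, rank in rankings.items()
        if rank
    )

# Hypothetical ranks of one candidate under three measures.
rankings = {"anchor_text": 1, "chi_square": 2, "context_vector": 4}
weights = {"anchor_text": 1 / 3, "chi_square": 1 / 3, "context_vector": 1 / 3}
score = combined_similarity(rankings, weights)
```

Using ranks instead of raw scores sidesteps the fact that the individual measures are on different scales.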


Test Bed

  • Test query set

    • 430 popular Chinese/English query terms

      • Filter out terms without translations (from 9,709 core terms)

      • OOV: 64% (274/430) are out of vocabulary

    • 200 random Chinese query terms

      • Randomly select from top 19,124 terms in Dreamer log

      • OOV: 82.5% (165/200)

    • 50 scientist names (proper names)

      • Randomly select from 256 scientists (Science/People in the Yahoo! Directory)

      • OOV: 76% (38/50)

    • 50 disease names (technical terms)

      • Randomly select from 664 diseases (Health/Diseases and Conditions in the Yahoo! Directory)

      • OOV: 72% (36/50)


Examples of Proper Name and Technical Term


Performance of Web Mining for Popular Queries


Performance of Web Mining for Random Queries/Proper Names/Technical Terms


CLIR on NTCIR-2 Evaluation Task

  • The test collection (Chen & Chen 2001)

    • 132,173 Chinese news documents (200MB)

    • 50 English query topics

  • Title-query (title section only)

    • Short: Average 3.8 English words

    • Low performance: 55% of monolingual performance (Kwok 2001)

    • Difficulty: CLIR may fail if any one keyword in a short query cannot be translated correctly.

  • Can Web mining solve short query translation?


Integration of Web Mining and Probabilistic Retrieval Model

  • Probabilistic retrieval model (Xu 2001; Hiemstra & de Jong 1999)

    • The Web mining approach: P(e | c) = P_web(e | c) ≈ S_combined(e, c)

    • The dictionary-based approach: P(e | c) = P_dic(e | c) ≈ 1/n_e, where n_e is the number of translations of c

    • The hybrid approach: P(e | c) = [P_web(e | c) + P_dic(e | c)] / 2

Q: English query

D: Chinese Document

e: English query term

c: Chinese translation

P(e): background probability

P(e|c): translation probability

P(c|D): generation probability
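The retrieval formula is not preserved in this transcript. The sketch below assumes a common form of such models: each English query term e is generated either from the background distribution P(e) or by translating a Chinese term c drawn from the document, with a mixing weight alpha. The value of alpha, the function names and all probabilities in the example are hypothetical.

```python
import math

def hybrid_translation_prob(p_web, p_dic):
    """Hybrid approach from the slide: average of the two estimates."""
    return (p_web + p_dic) / 2.0

def query_log_likelihood(query_terms, doc_model, translation_prob,
                         background, alpha=0.3):
    """Assumed scoring: sum over e of
    log(alpha * P(e) + (1 - alpha) * sum_c P(e|c) * P(c|D)).

    doc_model[c] = P(c | D); translation_prob[(e, c)] = P(e | c);
    background[e] = P(e), which must be positive to avoid log(0).
    """
    score = 0.0
    for e in query_terms:
        translated = sum(
            translation_prob.get((e, c), 0.0) * p_c
            for c, p_c in doc_model.items()
        )
        score += math.log(alpha * background[e] + (1 - alpha) * translated)
    return score

# Illustrative numbers only.
doc_model = {"迪士尼": 0.1}
translation_prob = {("Disneyland", "迪士尼"): 0.8}
background = {"Disneyland": 0.001}
score = query_log_likelihood(
    ["Disneyland"], doc_model, translation_prob, background
)
```

Documents are then ranked by this score; a document that contains a likely translation of a query term scores well above the background level.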


Performance of Query Translation and CLIR for NTCIR-2 English-Chinese Retrieval Task


Performance Analysis for Query Translation & CLIR

  • Query translation

    • Effective

      • Local place names: “Chilan” (棲蘭), “Meinung” (美濃)

      • Foreign names: “Jordan” (喬登, 喬丹), “Kosovar” (科索沃), “Carter” (卡特)

      • Aliases/Synonyms: “Disney” (迪士尼, 迪斯尼, 迪斯奈, 迪斯奈, 狄斯奈, 狄士尼)

    • Ineffective

      • Common terms: “victim” (受難者), “abolishment” (廢止)

      • Native Chinese names: “Bai Xiao-yan” (白曉燕), “Bai-feng bean” (白鳳豆)

    • Multiple senses

      • Title query Q01: “The assembly parade law and freedom of speech”

        • “assembly” => “組合語言” (error), “集會” (correct)

        • “speech” => “演講”, “語音” (error), “言論” (correct)

  • CLIR

    • Effective

      • Q23: “Disneyland”: MAP (mean average precision) improved from 0 to 0.721

      • Q46: “Ma Yo-yo cello recital”: MAP improved from 0.205 to 0.446


Conclusion

  • Practical CLWS services have not lived up to expectations because they lack multilingual translations for diverse unknown queries.

  • The Web mining approach combines anchor-text mining and search-result mining, which are complementary in precision and coverage for query translation.

  • Anchor texts and search-result pages are useful comparable corpora for query translation, which are contributed continuously by a huge number of volunteers (page authors) around the world.

  • LiveTrans can generate translation suggestions and provide a practical CLWS service for the retrieval of both Web pages and images.


Future Work

  • Currently, the LiveTrans system cannot fully perform in real time. It is necessary to find a more efficient way to reduce the computation cost.

  • Employ more language processing techniques to improve the accuracy in phrase translation, word segmentation, unknown word extraction and proper name transliterations.

  • Develop an automatic way to collect and exploit other Web resources like bilingual/multilingual Web pages.

  • Enhance the LiveTrans system to handle translation for more Asian and European languages, such as Japanese, Korean, French, etc.

  • Apply our Web-mining translation techniques to enhance current machine translation techniques and design a computer-aided English writing system.

