
Vietnamese-English Cross Language Search Information Retrieval (CLIR) -

Vietnamese-English Cross Language Search Information Retrieval (CLIR) - Discovering Noun Phrases for Translation. CSC 177 Presentation, Nguyen Doan H, Ph.D. Outline: Motivations; Crosslingual Query; Noun phrase translation extraction; Experiments and results; Conclusion and next steps.


Presentation Transcript


  1. Vietnamese-English Cross Language Search Information Retrieval (CLIR) - Discovering Noun Phrases for Translation
     CSC 177 Presentation
     Nguyen Doan H, Ph.D.

  2. Outline
     • Motivations
     • Crosslingual Query
     • Noun phrase translation extraction
     • Experiments and results
     • Conclusion and next steps

  3. Motivations – Unknown Translations
     • Words outside the scope of a bilingual dictionary
       • Brand names, place names, personal names
       • Titles (music, book, video)
       • Terminology (science, computer, medical, space, farming, etc.)
     • Compound nouns
       • Meaning might not be inferable from the individual components
       • Might require expert knowledge for translation
       • Might have multiple correct translations
     • Applicability
       • Cross-language Information Retrieval (CLIR)
       • Machine Translation (MT)
       • Machine-Readable Dictionary (MRD)
     • Most of the words are Out-Of-Vocabulary (OOV)

  4. Examples Example 1: Computer Terminology (phần mềm -> software)

  5. Examples Example 2: Personal Name (ca sĩ Quang Dũng -> Singer Quang Dung)

  6. Searching the web for translation?
     • Parallel data on the web: Vietnamese to English translation

  7. Searching the web for translation?
     • Comparable corpus on the web

  8. Searching the web for translation?
     • Mixed-language web pages: English translation

  9. Our Approach
     • Extensions to the 2005 paper by Ying Zhang of CMU (credit)
     • Addresses issues specific to Vietnamese-English OOV translation
       • Proper name translation uses a pattern recognition technique rather than phonetic similarity and string alignment
       • Detection of borrowed English words
       • Improved translation suggestions by utilizing contextual information

  10. Crosslingual Query to Obtain Mixed-Language Web Pages
     • Extend the source query, VS, with extended words/phrases, VEX, that tend to co-occur with it frequently:
       • VS : phần mềm → ?
       • VS VEX : phần mềm miễn phí
     • Translate the extended words/phrases, VEX, to English, EEX:
       • VEX : miễn phí → EEX : free
     • Submit both the source query and the translated words/phrases to a search engine:
       • VS EEX : phần mềm free
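The query construction on this slide can be illustrated with a minimal Python sketch. The dictionary `EXTENSION_DICT` and the function name `build_crosslingual_query` are hypothetical, introduced only for illustration; the presentation does not say how the extension phrase is actually translated.

```python
# Hypothetical mini dictionary of extension phrases (VEX -> EEX).
# This is an assumption for illustration, not the presentation's resource.
EXTENSION_DICT = {"miễn phí": "free"}


def build_crosslingual_query(source, extension, dictionary=EXTENSION_DICT):
    """Combine the untranslated source term VS with the English extension EEX.

    Returns None when the extension phrase has no known translation,
    because no mixed-language query can be formed in that case.
    """
    translated = dictionary.get(extension)
    if translated is None:
        return None
    return f"{source} {translated}"
```

Calling `build_crosslingual_query("phần mềm", "miễn phí")` returns `"phần mềm free"`, the VS EEX query from the slide.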

  11. How to Find This VEX?
     • Overture search log
       • Find co-occurring terms in the web log
       • Use co-occurring terms in the search query (in CLIR)
     • Search Google with VS and select Vietnamese words, VEX, that occur with high frequency
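One simple way to pick a frequent co-occurring phrase from search results is to count what follows the source term in each snippet. The sketch below is a simplification under stated assumptions: it only looks at the `n` syllables directly after VS, and the function name `cooccurring_phrases` is hypothetical. The presentation does not specify its exact selection rule.

```python
from collections import Counter


def cooccurring_phrases(source, snippets, n=2, top_k=3):
    """Count the n-syllable phrases that directly follow `source` in snippets.

    Assumption: "co-occurrence" is approximated by immediate adjacency,
    which is only one of several possible definitions.
    """
    src = source.lower().split()
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for i in range(len(tokens) - len(src) + 1):
            if tokens[i:i + len(src)] == src:
                following = tokens[i + len(src):i + len(src) + n]
                if len(following) == n:
                    counts[" ".join(following)] += 1
    return counts.most_common(top_k)
```

For snippets in which "phần mềm" is usually followed by "miễn phí", the most common phrase returned is "miễn phí", which would then serve as VEX.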

  12. Original Source Query

  13. Crosslingual Query

  14. Our Approach: Noun Phrase Translation Extraction
     • Proper noun recognition & transliteration
     • Preprocessing
     • Frequency-Distance Model
     • Contextual Ordering Model & result ranking

  15. Yahoo Search API - XML Data Returning Snippet

  16. Proper Name Recognition & Transliteration
     • Extract and concatenate Title, Summary, and URL
     • Recognize the proper-name text pattern: each word is likely to begin with a capital letter
     • Compute the likelihood that the query text is a proper name
     • Once recognized, map Vietnamese vowels to English vowels:
       • e.g. á → a, à → a, …, ũ → u, …
     • Suggest a translation candidate, VN: Quang Dũng → Eng: Quang Dung
     • Compute and assign a weight to the translation candidate
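The two central steps here, estimating how often the query appears capitalized and stripping Vietnamese diacritics, can be sketched in Python. The function names are hypothetical, and the likelihood estimate (the share of capitalized occurrences) is an assumption; the slide does not give the actual formula or the weighting.

```python
import re
import unicodedata


def strip_vietnamese_diacritics(text):
    """Map accented Vietnamese letters to their plain Latin base letters."""
    # đ/Đ have no canonical decomposition, so they are replaced explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (tone and vowel marks) left by decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def proper_name_likelihood(query, text):
    """Share of occurrences of `query` in `text` where every word is capitalized.

    Assumption: this ratio stands in for the unspecified likelihood on the slide.
    """
    matches = re.findall(re.escape(query), text, flags=re.IGNORECASE)
    if not matches:
        return 0.0
    capitalized = sum(
        1 for m in matches if all(word[0].isupper() for word in m.split())
    )
    return capitalized / len(matches)
```

With these helpers, `strip_vietnamese_diacritics("Quang Dũng")` yields `"Quang Dung"`, the candidate shown on the slide.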

  17. Preprocessing (Query: Thuật toán genetic)
     • Extract and concatenate Title, Summary, and URL
       Thuật toán-Cấu trúc dữ liệu ... (Reserve Polish Notation – RPN), một thuật toán "kinh điển" trong lĩnh vực trình biên dịch. ... THUẬT GIẢI DI TRUYỀN – GENETIC ALGORITHM - Kỳ 2 ... ity.vnuit.edu.vn/thuattoan/index.htm
     • Mark the query, normalize the text, and remove noise text
       ~123456789 cấu trúc dữ liệu reserve polish notation – rpn một ~123456789 kinh điển trong lĩnh vực trình biên dịch thuẬt giẢi di truyỀn – ~987654321 algorithm kỳ 2 ity vnuit edu vn thuattoan index htm
     • Mark recognized Vietnamese text with the VNW tag
       ~123456789 VNW VNW VNW VNW reserve polish notation VNW rpn ~123456789 VNW VNW trong VNW VNW VNW VNW VNW VNW di VNW VNW ~987654321 algorithm VNW ity vnuit edu vn thuattoan index htm
     • Group contiguous English words and build the word list
       ['~123456789', 'VNW', 'VNW', 'VNW', 'VNW', '', '', 'reserve_polish_notation', 'VNW', 'rpn', '~123456789', 'VNW', 'VNW', 'trong', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'di', 'VNW', 'VNW', '~987654321', 'algorithm', 'VNW', 'ity', 'vnuit', 'edu', 'vn', 'thuattoan', 'index', 'htm']
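The preprocessing steps on this slide could be approximated as in the Python sketch below. Several details are assumptions rather than the presentation's actual rules: a token counts as Vietnamese only if it contains a non-ASCII character (so unaccented words such as "trong" stay untagged, as they do on the slide), punctuation is treated as a boundary that stops grouping, and the function name `preprocess` is hypothetical. The marker strings are the ones shown on the slide.

```python
import re
import unicodedata

VS_MARK = "~123456789"   # marks the Vietnamese source term, as on the slide
EEX_MARK = "~987654321"  # marks the English extension term, as on the slide


def _norm(text):
    """Normalize to composed Unicode form and lowercase."""
    return unicodedata.normalize("NFC", text).lower()


def preprocess(snippet, source, extension):
    """Mark the query terms, tag Vietnamese tokens, group English tokens."""
    text = _norm(snippet)
    text = text.replace(_norm(source), VS_MARK)
    text = text.replace(_norm(extension), EEX_MARK)
    # Turn punctuation runs into a boundary token so grouping stops there.
    text = re.sub(r"[^\w~\s]+", " | ", text)

    words = []
    can_extend = False  # True while the last list entry is an English group
    for token in text.split():
        if token == "|":
            can_extend = False
        elif token in (VS_MARK, EEX_MARK):
            words.append(token)
            can_extend = False
        elif not token.isascii():
            # Assumption: diacritics identify a Vietnamese word.
            words.append("VNW")
            can_extend = False
        elif can_extend:
            words[-1] += "_" + token
        else:
            words.append(token)
            can_extend = True
    return words
```

Applied to the opening of the slide's snippet, this produces entries such as `'reserve_polish_notation'` between `'VNW'` tags. It will not reproduce the slide's word list exactly, since the original grouping rule evidently leaves URL fragments such as 'ity' and 'vnuit' as separate entries.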

  18. Frequency-Distance Model
     • The model combines two signals:
       • Frequency of co-occurrence
       • Distance to either VS or EEX within a snippet text
     • Computed over the summaries of all returned documents
     • Example: Thuật toán genetic
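The slide names the two signals but not how they are combined. As one possible reading, the Python sketch below gives each candidate a score that grows with every occurrence and shrinks with its token distance to the nearest query marker. The inverse-distance weighting and the function names are assumptions for illustration, not the presentation's formula.

```python
from collections import defaultdict

MARKERS = {"~123456789", "~987654321"}  # VS and EEX markers from preprocessing
SKIP = MARKERS | {"VNW", ""}            # entries that are never candidates


def frequency_distance_scores(word_lists):
    """Score candidates by summing 1/distance to the nearest marker.

    Assumption: inverse distance is a stand-in weighting; snippets without
    any marker contribute nothing.
    """
    scores = defaultdict(float)
    for words in word_lists:
        marker_positions = [i for i, w in enumerate(words) if w in MARKERS]
        if not marker_positions:
            continue
        for i, word in enumerate(words):
            if word in SKIP:
                continue
            distance = min(abs(i - p) for p in marker_positions)
            scores[word] += 1.0 / distance
    return dict(scores)


def top_candidates(scores, k=5):
    """Return the k highest-scoring candidates, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

A candidate that appears often and close to the marked query, such as `genetic_algorithm` next to `~123456789`, ends up ahead of rarer or more distant tokens.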

  19. Contextual Ordering Model & Result Ranking
     • Estimate the closeness probability
     • Compute an overall score for each candidate
     • Sort by score and present the top 5 suggestions

  20. Sample Program Output #1 (dân ca -> folk or traditional music)

  21. Sample Program Output #2 (Quang Dũng -> Quang Dung)

  22. Sample of Translation Results

  23. Conclusion and Next Steps
     • Contributions
       • Recognize and translate important phrases
       • Translate persons, locations, and concepts
       • Low implementation cost with reasonable performance
     • Future work
       • Experiment with a larger set of test data
       • Integrate with Vietnamese-English CLIR work
       • Automate the generation of extended words/phrases and the derivation of their English counterparts
       • Experiment with the “Refine Result” concept for search engines
