Cross Lingual Information Retrieval (CLIR)


  1. Cross Lingual Information Retrieval (CLIR) Rong Jin

  2. The Problem • Increasing pressure for accessing information in foreign languages: • find information written in foreign languages • read and interpret that information • merge it with information in other languages • Need for multilingual information access

  3. Why is Cross Lingual IR Important? • The Internet is no longer monolingual and non-English content is growing rapidly • Non-English speakers represent the fastest growing group of new internet users • In 1997, 8.1 million Spanish-speaking users • In 2000, 37 million …

  4. [Chart: English vs. non-English web content, 2000 and 2005 (projection). Source: Manning & Napier Information Services, 2000; confidential, unpublished information]

  5. 2. Multilingual Text Processing • Character encoding • Language recognition • Tokenization • Stop word removal • Feature normalization (stemming) • Part-of-speech tagging • Phrase identification
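
A minimal sketch of two of the steps listed above, stop word removal and feature normalization. The stop word list is a tiny illustrative sample and the suffix stripper is a toy stand-in for a real stemmer such as Porter's:

```python
# Toy preprocessing sketch: stop word removal plus naive suffix stripping.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # tiny sample list

def normalize(token):
    """Strip a few common English suffixes (a toy stand-in for stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tokens):
    """Lowercase, drop stop words, normalize the remaining tokens."""
    return [normalize(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(preprocess(["The", "train", "stopped", "at", "the", "stations"]))
# -> ['train', 'stopp', 'at', 'station']
```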

  6. Character Encoding • Language (alphabet) specific native encoding: • Chinese: GB, Big5 • Western European: ISO-8859-1 (Latin-1) • Russian: KOI-8, ISO-8859-5, CP-1251 • UNICODE (ISO/IEC 10646) • UTF-8: variable byte length • UTF-16, UCS-2: fixed double byte
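
A small illustration of the variable-length vs. fixed-length point above, a sketch using Python's built-in codecs (the three sample characters are arbitrary):

```python
# UTF-8 is variable length (1-4 bytes per code point); UTF-16/UCS-2 uses a
# fixed 2 bytes for characters in the Basic Multilingual Plane.
for ch in ("a", "é", "环"):                 # Latin, Latin-1, Chinese examples
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")          # little-endian, no byte-order mark
    print(ch, len(utf8), "byte(s) in UTF-8,", len(utf16), "bytes in UTF-16")
# a 1 byte(s) in UTF-8, 2 bytes in UTF-16
# é 2 byte(s) in UTF-8, 2 bytes in UTF-16
# 环 3 byte(s) in UTF-8, 2 bytes in UTF-16
```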

  7. Tokenization • Punctuation separated from words – incl. word separation characters • “The train stopped.” → “The”, “train”, “stopped”, “.” • String split into lexical units – incl. segmentation (Chinese) and compound-splitting (German)
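
A minimal regex tokenizer that reproduces the example above; a real tokenizer would also handle abbreviations, numbers, hyphenation, and language-specific issues such as Chinese segmentation:

```python
import re

def tokenize(text):
    """Split text into word tokens and individual punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The train stopped."))
# -> ['The', 'train', 'stopped', '.']
```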

  8. Chinese Segmentation

  9. Chinese Segmentation [Figure: segmentation example; extracted labels: “Frank”, “Petroleum”, “Detection”]

  10. German Segmentation • Unrestricted compounding in German • Abendnachrichtensendungsblock • Use compound analysis together with CELEX German dictionary (360,000 words) • Treuhandanstalt → { treuhand, anstalt } • Use n-gram representation • Treuhandanstalt → { Treuha, reuhan, treuhand, euhand, … }
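
Below is a small sketch of the two ideas on this slide: dictionary-based compound splitting and overlapping character n-grams. The six-word lexicon is an assumption standing in for the 360,000-word CELEX dictionary, and n = 6 is chosen only to match the look of the example above:

```python
# Tiny stand-in for the CELEX German lexicon mentioned on the slide.
LEXICON = {"treuhand", "anstalt", "abend", "nachrichten", "sendung", "block"}

def split_compound(word, lexicon=LEXICON):
    """Greedy left-to-right split of a compound into known lexicon words."""
    word, parts, start = word.lower(), [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # prefer the longest match
            if word[start:end] in lexicon:
                parts.append(word[start:end])
                start = end
                break
        else:                                     # no lexicon word matches: give up
            return [word]
    return parts

def char_ngrams(word, n=6):
    """Overlapping character n-grams as a language-independent alternative."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(split_compound("Treuhandanstalt"))    # -> ['treuhand', 'anstalt']
print(char_ngrams("Treuhandanstalt")[:4])   # -> ['treuha', 'reuhan', 'euhand', 'uhanda']
```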

  11. CLIR - Approaches • Machine Translation • Bilingual Dictionaries • Parallel/Comparable Corpora • [Diagram: the language barrier separates the query representation (user query) from the document representation (document)] • Example: query “誰在1998年贏得環法自行車大賽” (“Who won the Tour de France in 1998?”) vs. document “Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …”

  12. Machine Translation • Translate all documents into the query language • [Diagram: English Documents → Machine Translation → Chinese Documents → Lucene ← Chinese Queries]

  13. Machine Translation (MT) • Translate all documents into the query language • Not viable on large collections (MT is computationally expensive) • Not viable if there are many possible query languages • [Diagram: English Documents → Machine Translation → Chinese Documents → Lucene ← Chinese Queries]

  14. Machine Translation • Translate the query into the languages of the content being searched • [Diagram: Chinese Queries → Machine Translation → English Queries → Lucene ← English Documents]

  15. Machine Translation • Translate the query into the languages of the content being searched • Query translation is inadequate for CLIR • no context for accurate translation • the system selects a preferred target term • [Diagram: Chinese Queries → Machine Translation → English Queries → Lucene ← English Documents]

  16. Example of Translating Queries Who won the Tour de France in 1998?

  17. Using Dictionaries • Bilingual machine-readable dictionaries (in-house or commercial) • Look up query terms in the dictionary and replace them with translations in the document languages • [Diagram: Chinese Queries → Bilingual Dictionary → English Queries → Lucene ← English Documents]
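
A minimal sketch of dictionary-based query translation as described above. The Chinese-English entries are made up for illustration, out-of-vocabulary terms are passed through unchanged, and every listed translation is kept, which is exactly what causes the ambiguity problems on the next slide:

```python
# Toy bilingual dictionary (illustrative entries, not a real resource).
ZH_EN = {
    "誰": ["who"],
    "贏得": ["win", "gain"],
    "環法自行車大賽": ["Tour de France"],
    "1998年": ["1998"],
}

def translate_query(terms, dictionary):
    """Replace each query term by all of its dictionary translations.
    Out-of-vocabulary terms are kept as-is; no sense disambiguation is done."""
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated

print(translate_query(["誰", "1998年", "贏得", "環法自行車大賽"], ZH_EN))
# -> ['who', '1998', 'win', 'gain', 'Tour de France']
```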

  18. Using Dictionaries - Problems • ambiguity • many terms are out-of-vocabulary • lack of multiword terms • phrase identification • bilingual dictionary needed for every query-document language pair of interest

  19. Word Sense Disambiguation

  20. Word Sense Disambiguation • [Example: “The sign for independent press to disappear”]

  21. Using Corpora • Parallel Corpora • translation equivalents • e.g. UN corpus in French, Spanish & English • Comparable Corpora • similar in topic, style, time, etc. • Hong Kong TV broadcast news in both Chinese and English

  22. Using Corpora • How to bridge the language barrier using the parallel corpora? • Toy example: d1 = (a a c e), d2 = (b c d a), d3 = (e d a); Query = (A E)

  23. Translate Query using Parallel Corpus (I) • d1 = (a a c e), d2 = (b c d a), d3 = (e d a); Query = (A E)

  24. Translate Query using Parallel Corpus (I) • d1 = (a a c e), d2 = (b c d a), d3 = (e d a); Query = (A E) • Translated query: (c e)

  25. Translate Query using Parallel Corpus (I) • d1 = (a a c e), d2 = (b c d a), d3 = (e d a); Query = (A E) • Translated query: (c e)
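
A sketch of the idea behind approach (I): retrieve the query-language halves of a parallel corpus that match the query, then take the most frequent terms from their aligned document-language halves as the translated query. The three sentence pairs below are an assumption made for illustration; the slides' own corpus (and hence the exact translated query (c e)) is not reproduced here:

```python
from collections import Counter

# Hypothetical parallel corpus: (query-language side, document-language side).
PARALLEL = [
    ("A A C E".split(), "a a c e".split()),
    ("B C D A".split(), "b c d a".split()),
    ("E D A".split(),   "e d a".split()),
]

def translate_query(query, parallel, top_docs=2, top_terms=2):
    """Pseudo-relevance-feedback style query translation via a parallel corpus."""
    # 1. Rank the pairs by how many query terms their query-language side contains.
    ranked = sorted(parallel,
                    key=lambda pair: sum(term in pair[0] for term in query),
                    reverse=True)
    # 2. Pool the terms of the aligned document-language sides of the best pairs.
    counts = Counter()
    for _, doc_side in ranked[:top_docs]:
        counts.update(doc_side)
    # 3. The most frequent pooled terms become the translated query.
    return [term for term, _ in counts.most_common(top_terms)]

print(translate_query("A E".split(), PARALLEL))
# -> ['a', 'e'] for this toy corpus
```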

  26. Translate Query using Parallel Corpus (II) • Learn word-to-word translation probabilities from parallel corpora • Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q

  27. Translate Query using Parallel Corpus (II) Word-to-Word Translation Probabilities Q = (A E), d1 = (a a c e)

  28. Translate Query using Parallel Corpus (II) Word-to-Word Translation Probabilities Q = (A E), d1 = (a a c e)

  29. Translate Query using Parallel Corpus (II) Word-to-Word Translation Probabilities Q = (A E), d1 = (a a c e)

  30. Translate Query using Parallel Corpus (II) Q = (A E), d1 = (a a c e)

  31. Translate Query using Parallel Corpus (II) Q = (A E), d1 = (a a c e)
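
A sketch of the scoring idea in slides 26-31: the relevance of document d to query Q is estimated as the probability of translating d into Q, i.e. the product over query words q of the sum over document words w of p(q|w) * p(w|d), with p(w|d) taken as the relative frequency of w in d. The translation probabilities below are made-up values, since the tables from the slides are not in the transcript:

```python
from collections import Counter

# Hypothetical word-to-word translation probabilities p(query_word | doc_word).
TRANS = {
    ("A", "a"): 0.8, ("A", "c"): 0.1,
    ("E", "e"): 0.7, ("E", "c"): 0.2,
}

def relevance(query, doc, trans=TRANS):
    """p(Q|d) = product over q in Q of sum over w in d of p(q|w) * p(w|d)."""
    tf = Counter(doc)
    score = 1.0
    for q in query:
        score *= sum(trans.get((q, w), 0.0) * count / len(doc)
                     for w, count in tf.items())
    return score

d1 = "a a c e".split()
print(relevance(["A", "E"], d1))
# -> 0.095625 with these made-up probabilities
```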

  32. Translate Query using Parallel Corpus (II) • How to obtain the translation probabilities?

  33. Approach I: Co-occurrence Counting

  34. Approach I: Co-occurrence Counting • Co-occurrence-based translation model, e.g. p(A|a) = co(A, a) / occ(a) = 4/4 = 1

  35. Approach I: Co-occurrence Counting • p(B|c) = co(B, c) / occ(c) = 2/4 = 0.5
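
A sketch of the co-occurrence estimate above, counted over aligned sentence pairs: co(E, f) is the number of pairs in which E appears on one side and f on the other, and occ(f) is the number of pairs containing f. The four sentence pairs are made up for illustration (they are not the corpus from the slides, although they happen to yield the same two probabilities):

```python
from collections import defaultdict

# Hypothetical aligned sentence pairs (query-language side, document-language side).
PAIRS = [
    ("A B".split(), "a b".split()),
    ("A C".split(), "a c".split()),
    ("B C".split(), "b c".split()),
    ("A".split(),   "a".split()),
]

co, occ = defaultdict(int), defaultdict(int)
for source, target in PAIRS:
    for f in set(target):
        occ[f] += 1                      # occ(f): pairs containing f
        for e in set(source):
            co[(e, f)] += 1              # co(E, f): pairs containing both E and f

def p(e, f):
    """Co-occurrence based translation probability p(E|f) = co(E, f) / occ(f)."""
    return co[(e, f)] / occ[f]

print(p("A", "a"))   # -> 1.0  ('A' appears in every pair that contains 'a')
print(p("B", "c"))   # -> 0.5
```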

  36. Approach I: Co-occurrence Counting • Any problem?

  37. Approach I: Co-occurrence Counting • Many large translation probabilities • Usually a word in one language corresponds mostly to a single word in another language

  38. Approach I: Co-occurrence Counting • Many large translation probabilities • Usually a word in one language corresponds mostly to a single word in another language • We may over-count the co-occurrence statistics

  39. Approach I: Overcounting • co(A, a) = 4 implies that all occurrences of ‘A’ are due to occurrences of ‘a’

  40. Approach I: Overcounting

  41. Approach I: Overcounting • If we believe that the first two occurrences of ‘A’ are due to ‘a’, then co(A, b) = 1, not 3 • But we have no idea whether the first two occurrences of ‘A’ are due to ‘a’

  42. How to Compute Co-occurrence? • IBM statistical translation models • A series of translation models published by IBM Research • We will only discuss IBM Translation Model I • It uses an iterative procedure to eliminate the over-counting problem

  43. Step 1: Compute co-occurrence

  44. Step 1: Compute co-occurrence • Assume that translation probabilities are proportional to co-occurrence

  45. Step 2: Compute Conditional Prob. • Assume that translation probabilities are proportional to co-occurrence

  46. Step 3: Re-estimate co-occurrence • ‘A’ can be caused by one of the words ‘b’, ‘c’, ‘a’, ‘d’ • co(A, a) for sentence 1 should be computed taking this competition into account
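
A hedged way to write this step in the notation of the earlier slides: within one sentence pair, the single count that previously went entirely to co(A, a) is now shared among the competing words of that sentence in proportion to the Step 2 probabilities, co(A, a | sentence) = p(A|a) / (p(A|a) + p(A|b) + p(A|c) + p(A|d)), and the re-estimated co(A, a) is the sum of these fractions over all sentence pairs (this is how a value such as 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62 on slide 49 arises).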

  47. Step 3: Re-estimate co-occurrence

  48. Step 3: Re-estimate co-occurrence

  49. Step 3: Re-estimate co-occurrence co(A,a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62

  50. Step 3: Re-estimate co-occurrence
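
Putting Steps 1-3 together: iterating them is the expectation-maximization training of IBM Model 1. The sketch below reuses the hypothetical sentence pairs from the earlier co-occurrence sketch and, for brevity, starts from uniform probabilities rather than the co-occurrence-proportional start of Steps 1-2; the re-estimation loop is the same:

```python
from collections import defaultdict

# Hypothetical aligned sentence pairs (query-language side, document-language side).
PAIRS = [
    ("A B".split(), "a b".split()),
    ("A C".split(), "a c".split()),
    ("B C".split(), "b c".split()),
    ("A".split(),   "a".split()),
]

src_vocab = {e for s, _ in PAIRS for e in s}
tgt_vocab = {f for _, t in PAIRS for f in t}

# Start from uniform translation probabilities p(E|f).
prob = {(e, f): 1.0 / len(src_vocab) for e in src_vocab for f in tgt_vocab}

for _ in range(10):
    count = defaultdict(float)   # re-estimated fractional co-occurrences co(E, f)
    total = defaultdict(float)   # re-estimated occ(f)
    for source, target in PAIRS:
        for e in source:
            # Competition: share e's single count among the words of the
            # aligned sentence in proportion to the current probabilities.
            z = sum(prob[(e, f)] for f in target)
            for f in target:
                c = prob[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    # Renormalize: p(E|f) = count(E, f) / total(f).
    prob = {(e, f): count[(e, f)] / total[f]
            for e in src_vocab for f in tgt_vocab}

print(round(prob[("A", "a")], 2), round(prob[("A", "b")], 2))
# p(A|a) has grown toward 1 and p(A|b) has shrunk toward 0
```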
