Cross-Language Information Retrieval

Cross-Language Information Retrieval. Applied Natural Language Processing, October 29, 2009. Douglas W. Oard.

Presentation Transcript


  1. Cross-Language Information Retrieval Applied Natural Language Processing October 29, 2009 Douglas W. Oard

  2. What Do People Search For? • Searchers often don’t clearly understand • The problem they are trying to solve • What information is needed to solve the problem • How to ask for that information • The query results from a clarification process • Dervin’s “sense making”: Need Gap Bridge

  3. Taylor’s Model of Question Formation • Q1: Visceral Need • Q2: Conscious Need • Q3: Formalized Need • Q4: Compromised Need (Query) [Diagram labels: End-user Search, Intermediated Search]

  4. Design Strategies • Foster human-machine synergy • Exploit complementary strengths • Accommodate shared weaknesses • Divide-and-conquer • Divide task into stages with well-defined interfaces • Continue dividing until problems are easily solved • Co-design related components • Iterative process of joint optimization

  5. Human-Machine Synergy • Machines are good at: • Doing simple things accurately and quickly • Scaling to larger collections in sublinear time • People are better at: • Accurately recognizing what they are looking for • Evaluating intangibles such as “quality” • Both are pretty bad at: • Mapping consistently between words and concepts

  6. Process/System Co-Design

  7. Supporting the Search Process [Diagram: Source Selection, Query Formulation, Query, Search (IR System), Ranked List, Selection, Document, Examination, Document, Delivery; feedback loops: Query Reformulation and Relevance Feedback, Source Reselection; annotations: Predict, Nominate, Choose]

  8. Supporting the Search Process [Diagram: Source Selection, Query Formulation, Query, Search (IR System), Ranked List, Selection, Document, Examination, Document, Delivery; indexing side: Acquisition, Collection, Indexing, Index]

  9. Search Component Model [Diagram; query side: Information Need, Query Formulation, Query, Query Processing, Representation Function, Query Representation; document side: Document, Document Processing, Representation Function, Document Representation; both representations feed the Comparison Function, which yields the Retrieval Status Value; Human Judgment yields Utility]
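The component model on this slide can be sketched in a few lines of Python. This is only an illustration: the bag-of-words representation function and the cosine comparison function are assumptions chosen for brevity, and the names `represent`, `compare`, and `rank` are hypothetical, not taken from the slides.

```python
import math
from collections import Counter


def represent(text):
    """Representation function (assumed): lowercase bag of words."""
    return Counter(text.lower().split())


def compare(query_rep, doc_rep):
    """Comparison function (assumed): cosine similarity of two bags of words."""
    dot = sum(weight * doc_rep.get(term, 0) for term, weight in query_rep.items())
    query_norm = math.sqrt(sum(w * w for w in query_rep.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_rep.values()))
    if query_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (query_norm * doc_norm)


def rank(query, documents):
    """Return (retrieval status value, document) pairs, best first."""
    query_rep = represent(query)
    scored = [(compare(query_rep, represent(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for rsv, doc in rank("ancient wonders",
                         ["wonders of the ancient world", "stock market news"]):
        print(f"{rsv:.3f}  {doc}")
```

The same representation function is applied to the query and to each document here; the model on the slide allows the two sides to differ, which is where query or document translation would plug in for cross-language retrieval.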

  10. Relevance • Relevance relates a topic and a document • Duplicates are equally relevant, by definition • Constant over time and across users • Pertinence relates a task and a document • Accounts for quality, complexity, language, … • Utility relates a user and a document • Accounts for prior knowledge

  11. “Okapi” Term Weights [Equation; labeled parts: TF component, IDF component]

  12. A Ranking Function: Okapi BM25 [Equation; labeled quantities: query, query term, document, term frequency, document frequency, document length, average document length, term frequency in query]
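The quantities labeled on this slide can be combined as in the sketch below. It uses one widely cited form of Okapi BM25; the transcript does not preserve the slide's equation, so this may differ from it in detail. The function name `bm25_score` and the default values for `k1`, `b`, and `k3` are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document for one query with a common BM25 variant.

    k1, b and k3 are conventional illustrative defaults (assumed here).
    """
    tf_in_doc = Counter(doc_terms)
    tf_in_query = Counter(query_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term, qtf in tf_in_query.items():
        tf = tf_in_doc.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # term absent from this document or from the collection
        # IDF component; it turns negative when df exceeds half the collection.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        # TF component with document length normalization.
        length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        tf_part = (k1 + 1) * tf / (length_norm + tf)
        # Query term frequency component.
        qtf_part = (k3 + 1) * qtf / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score
```

A caller would precompute `doc_freq` (term to number of documents containing it), `num_docs`, and `avg_doc_len` once per collection, then call `bm25_score` per document to build the ranked list.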

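  13. Estimating TF and DF for Query Terms • Query term e1 translates to f1, f2, f3, f4 with probabilities 0.4, 0.3, 0.2, 0.1 • Term frequencies of f1, f2, f3, f4: 20, 5, 2, 50 • Document frequencies of f1, f2, f3, f4: 50, 40, 30, 200 • TF(e1) = 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9 • DF(e1) = 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58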
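The arithmetic on this slide is a probability-weighted sum over the translations of a query term. A minimal sketch that reproduces the slide's numbers follows; the function and variable names are hypothetical.

```python
def estimate_query_term_stats(translation_probs, doc_tf, collection_df):
    """Estimate TF and DF of a query term from its translations.

    translation_probs: document-language term -> translation probability
    doc_tf:            document-language term -> term frequency in one document
    collection_df:     document-language term -> document frequency
    """
    tf = sum(p * doc_tf.get(f, 0) for f, p in translation_probs.items())
    df = sum(p * collection_df.get(f, 0) for f, p in translation_probs.items())
    return tf, df


# Values from the slide: e1 translates to f1..f4.
translation_probs = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
doc_tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
collection_df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

tf_e1, df_e1 = estimate_query_term_stats(translation_probs, doc_tf, collection_df)
print(round(tf_e1, 1), round(df_e1, 1))  # 14.9 58.0
```

The estimated TF and DF can then stand in for observed counts in a monolingual term-weighting function, so the ranking function itself does not need to change.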

  14. Learning to Translate • Lexicons • Phrase books, bilingual dictionaries, … • Large text collections • Translations (“parallel”) • Similar topics (“comparable”) • Similarity • Similar pronunciation, similar users • People

  15. Hieroglyphic Demotic Greek

  16. Statistical Machine Translation • Spanish: Señora Presidenta , había pedido a la administración del Parlamento que garantizase • English: Madam President , I had asked the administration to ensure that

  17. Bidirectional Translation • Query: wonders of ancient world (CLEF Topic 151) • Bidirectional: merveilles//0.92 merveille//0.03 emerveille//0.03 merveilleusement//0.02 • Unidirectional: se//0.31 demande//0.24 demander//0.08 peut//0.07 merveilles//0.04 question//0.02 savoir//0.02 on//0.02 bien//0.01 merveille//0.01 pourrait//0.01 si//0.01 sur//0.01 me//0.01 t//0.01 emerveille//0.01 ambition//0.01 merveilleusement//0.01 veritablement//0.01 cinq//0.01 hier//0.01

  18. Experiment Setup • Test collections • Document processing • Stemming, accent-removal (CLEF French) • Word segmentation, encoding conversion (TREC Chinese) • Stopword removal (all collections) • Training statistical translation models (GIZA++) • English-French: parallel corpus Europarl; 672,247 sentence pairs; models (iterations) M1(10), HMM(5), M4(5) • English-Chinese: parallel corpus FBIS et al.; 1,583,807 sentence pairs; models (iterations) M1(10)

  19. Pruning Translations • Translations (probability): f1(0.32) f2(0.21) f3(0.11) f4(0.09) f5(0.08) f6(0.05) f7(0.04) f8(0.03) f9(0.03) f10(0.02) f11(0.01) f12(0.01) • Translations kept at each Cumulative Probability Threshold: • 0.0: f1 • 0.1: f1 • 0.2: f1 • 0.3: f1 • 0.4: f1 f2 • 0.5: f1 f2 • 0.6: f1 f2 f3 • 0.7: f1 f2 f3 f4 • 0.8: f1 f2 f3 f4 f5 • 0.9: f1 f2 f3 f4 f5 f6 f7 • 1.0: f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
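The table on this slide is consistent with a simple rule: add translations in decreasing order of probability until their cumulative probability reaches the threshold, always keeping at least one. The sketch below assumes that rule; the function name `prune_translations` and the small tolerance `eps` (needed because sums such as 0.90 are not exact in floating point) are illustrative choices, not from the slides.

```python
def prune_translations(translations, threshold, eps=1e-9):
    """Keep the most probable translations up to a cumulative threshold.

    translations: term -> translation probability
    threshold:    cumulative probability threshold in [0, 1]
    eps:          tolerance for floating-point rounding (assumed)
    """
    ranked = sorted(translations.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for term, prob in ranked:
        kept.append(term)  # the top translation is always kept
        cumulative += prob
        if cumulative >= threshold - eps:
            break
    return kept


# Probabilities from the slide.
translations = {
    "f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08, "f6": 0.05,
    "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02, "f11": 0.01, "f12": 0.01,
}

for step in range(11):
    threshold = step / 10
    print(threshold, " ".join(prune_translations(translations, threshold)))
```

This sketch returns only the surviving terms. Whether and how the remaining probabilities are renormalized after pruning is not stated on the slide, so it is left out here.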

  20. Unidirectional without Synonyms (PSQ) [Diagram labels: Q, D; plots: CLEF French, TREC-5,6 Chinese] • Statistical significance vs. monolingual (Wilcoxon signed rank test) • CLEF French: worse at peak • TREC-5,6 Chinese: worse at peak

  21. Bidirectional with Synonyms (DAMM) [Diagram labels: (Q) (D) vs. Q D; plots: CLEF French, TREC-5,6 Chinese] • DAMM significantly outperformed PSQ • DAMM is statistically indistinguishable from monolingual at peak • IMM: nearly as good as DAMM for French, but not for Chinese

  22. Indexing Time Dictionary-based vector translation, single Sun SPARC in 2001

  23. Key Capabilities • Map across languages • For human understanding • For automated processing The Problem Space • Retrospective search • Web search • Specialized services (medicine, law, patents) • Help desks • Real-time filtering • Email spam • Web parental control • News personalization • Real-time interaction • Instant messaging • Chat rooms • Teleconferences

  24. Making a Market • Multitude of potential applications • Retrospective search, email, IM, chat, … • Natural consequence of language diversity • Limiting factor is translation readability • Searchability is mostly a solved problem • Leveraging human translation has potential • Translation routing, volunteers, caching
