
Managing Morphologically Complex Languages in Information Retrieval



Presentation Transcript


  1. Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere

  2. 1. Introduction • Morphologically complex languages • unlike English, Chinese • rich inflectional and derivational morphology • rich compound formation • U. Tampere experiences 1998 - 2008 • monolingual IR • cross-language IR • focus: Finnish, Germanic languages, English

  3. Methods for Morphology Variation Management
  • Reductive methods
    • Stemming (rule-based)
    • Lemmatization (rules + dictionary)
  • Generative methods
    • Inflectional stem generation (rules + dictionary): plain inflectional stems; enhanced inflectional stems (FCG)
    • Word form generation (rule-based): generating all forms

  4. Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion

  5. 2. Normalization • Reductive methods, conflation • stemming • lemmatization • + conflation -> simpler searching • + smaller index • + provides query expansion • Stemming available for many languages (e.g. Porter stemmer) • Lemmatizers less available and more demanding (dictionary requirement)
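Reductive conflation can be sketched with a toy rule-based stemmer. This is not the Porter algorithm; the suffix list below is an illustrative assumption, but it shows how inflectional variants collapse to one index key, shrinking the index and giving free expansion over those variants:

```python
# Toy illustration of reductive conflation (NOT the actual Porter stemmer):
# a few ad-hoc English suffix rules, applied longest-first.
SUFFIXES = ["ations", "ation", "ings", "ing", "ies", "ed", "es", "s"]

def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of at least 3 chars."""
    word = word.lower()
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: len(word) - len(suf)]
    return word

# Variants of one lexeme conflate to the same key:
print(toy_stem("books"))    # -> book
print(toy_stem("booked"))   # -> book
print(toy_stem("booking"))  # -> book
```
A real stemmer needs many more rules and exceptions; a lemmatizer would instead look the form up in a dictionary, which is why it is more demanding to build.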

  6. Alkula 2001 • Boolean environment, inflected index, Finnish: • manual truncation vs. automatic stemming • stemming improves P and hurts R • many derivatives are lost • Boolean environment, infl vs. lemma index, Finnish: • manual truncation vs. lemmatization • lemmatization improves P and hurts R • many derivatives are lost, others correctly avoided • Differences not great between automatic methods

  7. Kettunen & al. 2005 • Ranked retrieval, Finnish: • Three problems • how do lemmatization and inflectional stem generation compare in a best-match environment? • is a stemmer realistic for handling Finnish morphology? • how feasible is simulated truncation in a best-match system? • Lemmatized vs. inflected-form vs. stemmed index.

  8. Kettunen & al. 2005
  Method        Index     MAP    Change %
  FinTWOL       lemmas    35.0     --
  Inf Stem Gen  inflform  34.2    -2.3
  Porter        stemmed   27.7   -20.9
  Raw           inflform  18.9   -46.0
  • But very long queries for inflectional stem generation & expansion (thousands of words); weaker generation yields shorter queries but progressively deteriorating results.
  • (InQuery/TUTK/graded-35/regular)

  9. Kettunen & al. 2005

  10. MonoIR: Airio 2006 InQuery/CLEF/TD/TWOL&Porter&Raw

  11. CLIR: Inflectional Morphology • NL queries contain inflected-form source keys • Dictionary headwords are in basic form (lemmas) • Problem significance varies by language • Stemming • stem both the dictionary and the query words • but this may produce far too many translations • Stemming in dictionary translation is best applied after translation.

  12. Lemmatization in CLIR • Lemmatization • easy to access dictionaries • but tokens may be ambiguous • dictionary translations not always in basic form • lemmatizer’s dictionary coverage • insufficient -> non-lemmatized source keys, OOVs • too broad coverage -> too many senses provided

  13. CLIR Findings: Airio 2006 English -> X InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter

  14. Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion

  15. 3. Compounds • Compounds, compound word types • determinative: Weinkeller, vinkällare, life-jacket • copulative: schwarzweiss, svartvit, black-and-white • compositional: Stadtverwaltung, stadsförvaltning • non-compositional: Erdbeere, jordgubbe, strawberry • Note on spelling: compound word components are written together (if not -> phrases)

  16. Compound Word Translation • Not all compounds are in the dictionary • some languages are very productive • small dictionaries: atomic words, old non-compositional compounds • large dictionaries: many compositional compounds added • Compounds remove phrase identification problems, but cause translation and query formulation problems

  17. Joining Morphemes
  • Joining morphemes complicate compound analysis & translation
  • Joining morpheme types in Swedish:
    • <omission>: flicknamn
    • -s: rättsfall
    • -e: flickebarn
    • -a: gästabud
    • -u: gatubelysning
    • -o: människokärlek
  • Joining morpheme types in German:
    • -s: Handelsvertrag
    • -n: Affenhaus
    • -e: Gästebett
    • -en: Fotographenausbildung
    • -er: Gespensterhaus
    • -es: Freundeskreis
    • -ens: Herzensbrecher
    • <omission>: Sprachwissenschaft
  • Suggestive finding: treatment of joining morphemes improves MAP by 2% (Hedlund 2002, SWE->ENG, 11 Qs)
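Dictionary-based decompounding with joining morphemes can be sketched as follows. The mini-lexicon is a made-up illustration, not a real resource, and only suffix-type Swedish joining morphemes are tried; omission-type junctions (flicknamn) and stem alternations (gatubelysning) would need extra rules:

```python
# Toy decompounder for determinative compounds: split into head + tail,
# allowing one Swedish joining morpheme between the components.
LEXICON = {"rätt", "fall", "vin", "källare"}   # illustrative mini-lexicon
JOINING = ["", "s", "e", "a", "u", "o"]        # "" = components joined directly

def split_compound(word):
    """Return (head, tail) if word = head (+ joining morpheme) + tail, else None."""
    for i in range(2, len(word) - 1):          # candidate component boundary
        head, tail = word[:i], word[i:]
        if tail not in LEXICON:
            continue
        for j in JOINING:
            base = head[: len(head) - len(j)] if j else head
            if head.endswith(j) and base in LEXICON:
                return base, tail
    return None

print(split_compound("vinkällare"))  # -> ('vin', 'källare')
print(split_compound("rättsfall"))   # -> ('rätt', 'fall')
```
The recovered components can then be translated separately when the compound itself is not in the dictionary, as the next slides show.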

  18. Compound Processing, 2
  • A Finnish natural language query: lääkkeet sydänvaivoihin (medicines for heart problems)
  • Output of morphological analysis: lääke; sydänvaiva, sydän, vaiva
  • Dictionary translation and the output of component tagging:
    • lääke -> medication, drug
    • sydänvaiva -> ”not in dict”
    • sydän -> heart
    • vaiva -> ailment, complaint, discomfort, inconvenience, trouble, vexation
  • Many ways to combine components in the query

  19. Compound Processing, 3 • Sample English CLIR query: • #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation )) • i.e. translating as if source compounds were phrases • Source compound handling may vary here: • #sum( #syn( medication drug ) #syn( #uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart discomfort ) #uw3( heart inconvenience ) #uw3( heart trouble ) #uw3( heart vexation ))) • #uw3 = proximity operator for three intervening words, free word order • i.e. forming all proximity combinations as synonym sets.
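Forming all proximity combinations as one synonym set, as in the second variant above, is a Cartesian product over the components' translation lists. A minimal sketch (operator names follow the InQuery syntax shown on the slide; the helper name is ours):

```python
from itertools import product

def compound_query(component_translations, window=3):
    """Combine per-component translation lists into one #syn of #uwN nodes."""
    combos = [
        "#uw{}( {} )".format(window, " ".join(c))
        for c in product(*component_translations)
    ]
    return "#syn( {} )".format(" ".join(combos))

translations = [["heart"],
                ["ailment", "complaint", "trouble"]]
print(compound_query(translations))
# -> #syn( #uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart trouble ) )
```
The product grows multiplicatively with the number of components and translations, which is one practical reason the proximity variant brings query-length costs without clear effectiveness benefits.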

  20. Compound Processing, 4 • No clear benefits seen from using proximity combinations. • Nor did we observe much effect from changing the proximity operator (OD vs. UW) • Some monolingual results follow (Airio 2006)

  21. InQuery/CLEF/Raw&TWOL&Porter

  22. English -> Swedish -> Finnish: morphological complexity increases

  23. Hedlund 2002 • Compound translation as compounds: • 47 German CLEF 2001 topics, English document collection • comprehensive dictionary (many compounds) vs. small dictionary (no compounds) • mean AP 34.7% vs. 30.4% • the dictionary matters ... • Alternative approach: if not translatable, split and translate the components

  24. CLEF Ger -> Eng InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n

  25. CLIR Findings: Airio 2006 English -> InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter

  26. Results by target language: Eng->Fin, Eng->Ger, Eng->Swe

  27. Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion

  28. 4. Generative Methods
  Variation handling:
  • Reductive methods
    • Stemming (rule-based)
    • Lemmatization (rules + dictionary)
  • Generative methods
    • Inflectional stem generation (rules + dictionary): plain inflectional stems; enhanced inflectional stems (FCG)
    • Word form generation (rule-based): generating all forms

  29. Generative Methods: inflectional stems • Instead of normalization, generate inflectional stems for an inflectional index • then use the stems to harvest full forms from the index • long queries ...

  30. ... OR ... • Instead of normalization, generate full inflectional forms for an inflectional index. • Long queries? Sure! • Sounds absolutely crazy ...

  31. ... BUT! • Are morphologically complex languages that complex in IR in practice? • Instead of full form generation, only generate sufficient forms -> FCG • In Finnish, 9-12 forms cover 85% of all occurrences of nouns
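FCG (frequent case generation) can be sketched as follows. This is a naive illustration, not the actual FCG rule set: real Finnish morphology has stem alternations (consonant gradation, etc.), so plain suffix concatenation only works for simple stems like "talo" (house), and the suffix list is an assumption:

```python
# Naive FCG-style sketch: generate a handful of frequent Finnish case
# forms by suffix concatenation (valid only for simple, non-alternating stems).
FREQUENT_SUFFIXES = ["", "n", "a", "ssa", "sta", "lla", "t"]
# nominative, genitive, partitive, inessive, elative, adessive, nominative plural

def fcg_forms(stem):
    """Return the frequent surface forms for a simple stem."""
    return [stem + suf for suf in FREQUENT_SUFFIXES]

print(fcg_forms("talo"))
# -> ['talo', 'talon', 'taloa', 'talossa', 'talosta', 'talolla', 'talot']
```
The generated forms are then OR-ed into the query against an uninflected (raw) index, trading a longer query for coverage of most noun occurrences without any normalization of the index.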

  32. Kettunen & al. 2006: Finnish IR
  MAP by relevance level (monolingual):
  Method    Liberal  Normal  Stringent
  TWOL       37.8     35.0    24.1
  FCG12      32.7     30.0    21.4
  FCG6       30.9     28.0    21.0
  Snowball   29.8     27.7    20.0
  Raw        19.6     18.9    12.4

  33. Kettunen & al. 2007: Other Languages IR
  MAP by language (monolingual, long queries):
  Method    Swe      Ger      Rus
  TWOL      32.6     39.7      ..
  FCG       30.6 /4  38.0 /4  32.7 /2
  FCG       29.1 /2  36.8 /2  29.2 /6
  Snowball  28.5     39.1     34.7
  Raw       24.0     35.9     29.8

  34. CLIR Findings: Airio 2008

  35. Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion

  36. 5. Query Structures • Translation ambiguity such as ... • Homonymy: homophony, homography • Examples: platform, bank, book • Inflectional homography • Examples: train, trains, training • Examples: book, books, booking • Polysemy • Examples: back, train • ... a problem in CLIR.

  37. Ambiguity Resolution • Methods • Part-of-speech tagging (e.g. Ballesteros & Croft ‘98) • Corpus-based methods (Ballesteros & Croft ‘96, ‘97; Chen & al. ‘99) • Query Expansion • Collocations • Query structuring - the Pirkola Method (1998)

  38. Query Structuring Concepts?
  • From weak to strong query structures by recognition of ...
    • concepts
    • expression weights
    • phrases, compounds
  • Queries may be combined ... (query fusion)
  • Decision points: concepts? weighting? phrases? (yes/no at each step)
  • Weakest structure: #sum(a b c d e)
  • Strongest structure: #wsum(1 3 #syn(a #3(b c)) 1 #syn(d e))

  39. Structured Queries in CLIR • CLIR performance (Pirkola 1998, 1999) • English baselines, manual Finnish translations • Automatic dictionary translation FIN -> ENG • natural language queries (NL) vs. concept queries (BL) • structured vs. unstructured translations • single words (NL/S) vs. phrases marked (NL/WP) • general and/or special dictionary translation • 500,000-document TREC subcollection • probabilistic retrieval (InQuery) • 30 health-related requests

  40. The Pirkola Method • All translations of all senses provided by the dictionary are incorporated in the query • All translations of each source language word are combined by the synonym operator, synonym groups by #and or #sum • this effectively provides disambiguation
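The construction described above can be sketched as a small query builder: every source word's translation alternatives go into one #syn group, and the groups are combined by #sum or #and. The function name and the input mapping are ours; the operators follow the InQuery syntax used on these slides:

```python
def pirkola_query(translations, combiner="#sum"):
    """Build a structured CLIR query: one #syn group per source word.

    translations: dict mapping each source word to its list of
    dictionary translations (all senses included, as the method requires).
    """
    groups = ["#syn( {} )".format(" ".join(alts))
              for alts in translations.values()]
    return "{}( {} )".format(combiner, " ".join(groups))

q = pirkola_query({
    "lääke": ["medication", "drug"],
    "sydän": ["heart"],
    "vaiva": ["ailment", "complaint", "trouble"],
})
print(q)
# -> #sum( #syn( medication drug ) #syn( heart ) #syn( ailment complaint trouble ) )
```
Because #syn treats its members as one term for weighting, a word with many translations does not dominate the query, which is what provides the implicit disambiguation.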

  41. An Example • Consider the Finnish natural language query: • lääke sydänvaiva [= medicine heart_problem] • Sample English CLIR query: • #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ) ) • Each source word forming a synonym set

  42. TREC Query Translation Test Set-up
  • English requests translated into Finnish requests, formulated as Finnish NL and BL queries
  • Each Finnish query translated back with a general dictionary and/or a medical dictionary, giving the translated English queries
  • The English requests give the baseline queries
  • Retrieval with InQuery on a Unix server

  43. Unstructured NL/S Queries
  • #sum(tw11, tw12, ..., tw21, tw22, ..., twn1, ..., twnk)
  • Only 38% of the average baseline precision (sd & gd)

  44. Structured Queries w/ Special Dictionary
  • #and(#syn(tw11, tw12, ...), #syn(tw21, tw22, ...), #syn(twn1, ..., twnk))
  • 77% of the average baseline precision (sd & gd)
  • Structure doubles precision in all cases

  45. Query Structuring, More Results

  46. Transit CLIR – Query Structures Average precision for the transitive, bilingual and monolingual runs of CLEF 2001 topics (N = 50)

  47. Transitive CLIR Results, 2

  48. Transitive CLIR Effectiveness Lehtokangas & al 2008

  49. TransCLIR + pRF effectiveness

  50. Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion
