
Website Term Browser: An interactive, multilingual text search system based on linguistic techniques

Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia. Doctoral thesis: Website Term Browser, an interactive, multilingual text search system based on linguistic techniques. Anselmo Peñas Padilla. Advisors: Julio Gonzalo Arroyo


Presentation Transcript


  1. Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia
     Doctoral thesis
     Website Term Browser: An interactive, multilingual text search system based on linguistic techniques
     Anselmo Peñas Padilla
     Advisors: Julio Gonzalo Arroyo, María Felisa Verdejo Maíllo

  2. Structure
     I. Problem definition and goals
     II. Experiments in Lexical Ambiguity and Indexing
     III. Website Term Browser
     IV. Evaluation framework

  3. I. Problem definition and goals: Classic Information Retrieval
     Diagram labels: information need, formulation, query, refinement, search engine, document ranking, docs.
     Retrieve documents relevant to the user's information need
     • Presupposes:
        • Static information needs
        • Value is found in the retrieved set of documents (not in the searching process)
     • Ignores:
        • The task (purpose) that gives rise to the information need
        • Changes in the information needs
        • Interactivity
           • Imprecise information needs
           • Users develop strategies without system aid
     Help users to express and refine their information needs

  4. I. Problem definition and goals: Language barriers
     Problems in query formulation
     • Users don't know the appropriate domain terminology
     • Users can't express their information need in a foreign language
        • Translinguality
     • Natural language characteristics
        • Lexical ambiguity
        • Terminology variation
     Help users to overcome language barriers

  5. I. Problem definition and goals: General approaches
     Diagram placing existing approaches between Information Retrieval and Natural Language Processing. Labels:
     • String processing
     • Free text indexing
     • Controlled vocabularies indexing & browsing
     • Keyphrase navigation (Phrasier)
     • Phrase indexing & browsing (Phind)
     • Terminology
     • Disambiguation
     • Conceptual indexing

  6. I. Problem definition and goals: Natural Language Processing
     • Help users to express and refine their information needs?
        • Open field in IR
     • Help users to overcome language barriers?
        • Phrase extraction and normalization
        • Explicit disambiguation (POS, WSD)
        • Conceptual indexing
     Bad strategies, or too much error in automatic processing?

  7. I. Problem definition and goals: Goals
     • Study the role of automatic linguistic techniques within the classic IR model
        • Phrase indexing, POS tagging, WSD
        • Semantic distinction of phrases
        • Viability of conceptual indexing
     Section II: Experiments in Lexical Ambiguity and Indexing

  8. I. Problem definition and goals: Goals
     • Develop a model
        • to help users to express and refine their information needs
        • to help users to overcome language barriers
     • Bring the collection terminology to users
        • Morpho-syntactic, semantic & translingual variations
        • Without the need for thesaurus construction
     • Establish an appropriate evaluation framework
     Sections III & IV: Website Term Browser

  9. Proposed approach
     Diagram placing approaches between Information Retrieval and Natural Language Processing. Labels:
     • String processing
     • Free text indexing
     • Controlled vocabularies indexing & browsing
     • Keyphrase navigation (Phrasier)
     • Phrase indexing & browsing (Phind)
     • Terminology Retrieval & Term Browsing (WTB)
     • Terminology
     • Automatic Terminology Extraction
     • Disambiguation
     • Conceptual indexing

  10. Structure
     I. Problem definition and goals
     II. Experiments in Lexical Ambiguity and Indexing
     III. Website Term Browser
     IV. Evaluation framework

  11. II. Experiments in Lexical Ambiguity and Indexing: Contents
     • Morpho-syntactic ambiguity in IR
     • Phrase indexing
     • Semantic distinction of lexical compounds in IR
     • Conceptual indexing
     • ITEM Search Engine
     • Conclusions
     IR-SEMCOR, hand-annotated test collection

  12. II. Experiments in Lexical Ambiguity and Indexing: Morpho-syntactic ambiguity in IR
     Texts
        ...particle crosses the wall...
        ...canadian red cross...
        ...boat to cross mississippi river...
     Plain matches (query: cross)
        ...particl cross the wall...
        ...canadian red cross...
        ...boat to cross mississippi river...
     POS-tagged matches (query: cross_N)
        ...particl_N cross_V the_D wall_N...
        ...canadian_ADJ red_ADJ cross_N...
        ...boat_N to_TO cross_V mississippi_N river_N...
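The contrast on the slide above can be sketched in a few lines. This is only an illustration, not the indexing code used in the thesis: the function names are hypothetical, and the data are the stemmed, tagged snippets from the slide.

```python
# Stemmed, POS-tagged snippets copied from the slide (stem_TAG tokens).
TAGGED_TEXTS = [
    "particl_N cross_V the_D wall_N",
    "canadian_ADJ red_ADJ cross_N",
    "boat_N to_TO cross_V mississippi_N river_N",
]


def plain_matches(query, texts):
    """Indices of texts that contain the query stem, ignoring POS tags."""
    hits = []
    for i, text in enumerate(texts):
        stems = [token.rsplit("_", 1)[0] for token in text.split()]
        if query in stems:
            hits.append(i)
    return hits


def tagged_matches(query, texts):
    """Indices of texts that contain the exact stem_TAG token."""
    return [i for i, text in enumerate(texts) if query in text.split()]


print(plain_matches("cross", TAGGED_TEXTS))     # [0, 1, 2]
print(tagged_matches("cross_N", TAGGED_TEXTS))  # [1]
```

With plain indexing the query matches all three snippets; with tagged indexing only the noun reading of "cross" is matched.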

  13. II. Experiments in Lexical Ambiguity and Indexing: Phrase indexing
     Texts
        ...a guide for the fisher who...
        ...information on cat care...
        ...arboreal carnivorous called fisher cat...
     Plain matches (query: fisher)
        ...a guide for the fisher who...
        ...arboreal carnivorous called fisher cat...
        ...information on cat care...
     Phrase indexing matches (query: fisher)
        ...a guide for the fisher who...
        ...arboreal carnivorous called fisher_cat...
        ...information on cat care...
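A minimal sketch of the effect shown on the slide above, assuming that the set of known compounds is given in advance (here a hypothetical one-entry set) and that a detected compound is indexed as a single unit joined with an underscore. The function names are illustrative.

```python
TEXTS = [
    "a guide for the fisher who",
    "arboreal carnivorous called fisher cat",
    "information on cat care",
]
# Assumption: compounds come from some earlier phrase-detection step.
PHRASES = {("fisher", "cat")}


def index_units(text, phrases):
    """Tokenise a text, joining each known two-word phrase into one unit."""
    words = text.split()
    units, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            units.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            units.append(words[i])
            i += 1
    return units


def matches(query, texts, phrases=frozenset()):
    """Indices of texts whose indexing units include the query."""
    return [i for i, t in enumerate(texts) if query in index_units(t, phrases)]


print(matches("fisher", TEXTS))               # [0, 1]  plain indexing
print(matches("fisher", TEXTS, PHRASES))      # [0]     phrase indexing
print(matches("fisher_cat", TEXTS, PHRASES))  # [1]
```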

  14. II. Experiments in Lexical Ambiguity and Indexing: Semantic distinction of compounds
     Types of lexical compounds
     • Endocentric: purchasing department (is_a department)
     • Appositional: aspirin powder (is_a aspirin, is_a powder)
     • Exocentric: fisher cat
     Automatic classification through WordNet
     • Endocentric: one component is a hypernym
     • Appositional: all components are hypernyms
     • Exocentric: no components are hypernyms
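The three classification rules above can be sketched as follows. This is an illustration only: `TOY_HYPERNYMS` is a hypothetical hand-written table standing in for a WordNet lookup, and for compounds of more than two words the sketch treats "some but not all components are hypernyms" as endocentric, which is an assumption.

```python
# Hypothetical stand-in for WordNet: compound -> hypernyms of the compound.
TOY_HYPERNYMS = {
    "purchasing department": {"department"},
    "aspirin powder": {"aspirin", "powder"},
    "fisher cat": {"carnivore"},
}


def classify_compound(compound, hypernyms=TOY_HYPERNYMS):
    """Classify a compound by how many of its components are its hypernyms."""
    components = compound.split()
    parents = hypernyms.get(compound, set())
    found = [c for c in components if c in parents]
    if not found:
        return "exocentric"    # no component is a hypernym
    if len(found) == len(components):
        return "appositional"  # all components are hypernyms
    return "endocentric"       # some, but not all, components are hypernyms


for term in TOY_HYPERNYMS:
    print(term, "->", classify_compound(term))
```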

  15. II. Experiments in Lexical Ambiguity and Indexing: Conceptual Indexing
     Query: spring, mapped by WSD to synset n09151839
     Conceptual index (synsets): n03114639, n05727069, n09151839
     Texts, grouped by sense
        ...spring... ...muelle...
        ...spring... ...fountain... ...fuente...
        ...spring... ...springtime... ...primavera...
     This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999)
     • Depending on the WSD error rate
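A minimal sketch of synset-based indexing for the "spring" example above. It assumes error-free WSD (each token already carries its synset), and the pairing of the slide's three synset identifiers with particular senses is an assumption made for illustration. Document identifiers and function names are hypothetical.

```python
# Assumed pairing of the slide's synset identifiers with senses of "spring".
SPRING_DEVICE = "n03114639"
SPRING_WATER = "n05727069"
SPRING_SEASON = "n09151839"

# Each document is a list of (word, synset) pairs, i.e. text after WSD.
DOCS = {
    "d1": [("spring", SPRING_DEVICE)],
    "d2": [("muelle", SPRING_DEVICE)],
    "d3": [("fountain", SPRING_WATER)],
    "d4": [("fuente", SPRING_WATER)],
    "d5": [("springtime", SPRING_SEASON)],
    "d6": [("primavera", SPRING_SEASON)],
}


def build_conceptual_index(docs):
    """Map each synset to the set of documents in which it occurs."""
    index = {}
    for doc_id, tokens in docs.items():
        for _word, synset in tokens:
            index.setdefault(synset, set()).add(doc_id)
    return index


def search(index, query_synset):
    """Return the documents indexed under the disambiguated query synset."""
    return sorted(index.get(query_synset, set()))


index = build_conceptual_index(DOCS)
print(search(index, SPRING_SEASON))  # ['d5', 'd6']
```

Because documents are indexed by synset rather than by word, the English and Spanish documents for the same sense are retrieved together, and documents about the other senses are not.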

  16. Synset indexing with no errors in WSD

  17. II. Experiments in Lexical Ambiguity and Indexing: Conceptual Indexing
     • Explicit disambiguation strategies applied to indexing don't produce a significant improvement in IR
        • POS tagging
        • Phrase indexing
        • Word Sense Disambiguation
     • Even so, conceptual indexing based on synsets
        • Needs automatic WSD accuracy close to the state of the art (60%)
        • Permits Cross-Language Information Retrieval
        • Qualitative evaluation justifies the development of a prototype

  18. Search interface screenshot, annotated:
     • Selection of query language
     • Selection of WSD strategy
     • Selection of newspaper determines the target language
     • Textual representation: query is translated into the target language
     • Conceptual representation: query and documents are compared at a conceptual level
     • Retrieved documents

  19. II. Experiments in Lexical Ambiguity and Indexing: ITEM Search Engine
     Conceptual indexing seems attractive, but there are some unsolved challenges:
     • Low accuracy in Word Sense Disambiguation due to
        • Unrestricted domains in EWN
        • Fine-grained sense distinctions
     • Indexing units ≠ translation units
     • Loss of information in word-by-word disambiguation
     • High cost, low benefit
     • Users perceive a slower and less transparent system

  20. II. Experiments in Lexical Ambiguity and Indexing: Conclusions
     • Don't subordinate NLP to the classic IR model
        • Even an improvement of 10% wouldn't change users' perception
     • Think of users
     • Find new paradigms in Information Access
        • At a higher level, closer to users
        • Consider users' tasks
        • Consider users' interaction
     • New places for NLP techniques in IR
        • Interaction over partial NLP processing
     • A proposal: Terminology Retrieval & Term Browsing

  21. Structure
     I. Problem definition and goals
     II. Experiments in Lexical Ambiguity and Indexing
     III. Website Term Browser
     IV. Evaluation framework

  22. III. Website Term Browser: Contents
     • Terminology Retrieval
        • Term extraction
        • Indexing
        • Retrieval model
        • Query expansion and translation
     • Website Term Browser interface

  23. III. Transition to an interactive model
     Diagram labels: lemma, phrase, document
     Terminology Retrieval
     • Retrieve relevant terms related to the query
        • Phrase extraction
        • Phrase indexing
        • Phrase retrieval
     • Recall is more important than precision in term extraction
        • Relaxing linguistic processing is possible
        • Premise: don't lose phrases
     Term Browsing
     • Navigate through relevant terminology
     • Access information from retrieved terms

  24. III. Transition to an interactive model: Term extraction
     Syntactic pattern (Spanish, English, French, Italian, Catalan)
        [ phr_content ] [ phr_closed | phr_content ]* [ phr_content ]
        phr_content: noun, adjective, number, infinitive, participle
        phr_closed: article, preposition, conjunction
     • Needs POS tagging
        • High computational cost
        • Tagging oriented to phrase detection
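One way to apply the syntactic pattern above is to encode each POS tag as a single character and run a regular expression over the encoded sequence. This is a sketch under stated assumptions, not the extractor used in WTB: the tag names, the character codes and `extract_phrases` are illustrative, and the input is assumed to be already POS-tagged.

```python
import re

CONTENT_TAGS = {"noun", "adjective", "number", "infinitive", "participle"}
CLOSED_TAGS = {"article", "preposition", "conjunction"}

# [phr_content] [phr_closed | phr_content]* [phr_content], over one-letter codes.
PATTERN = re.compile(r"C[CF]*C")


def encode(tag):
    """Map a POS tag to C (content), F (closed class) or x (anything else)."""
    if tag in CONTENT_TAGS:
        return "C"
    if tag in CLOSED_TAGS:
        return "F"
    return "x"


def extract_phrases(tagged):
    """Return the longest non-overlapping word sequences matching the pattern."""
    codes = "".join(encode(tag) for _word, tag in tagged)
    return [
        " ".join(word for word, _tag in tagged[m.start():m.end()])
        for m in PATTERN.finditer(codes)
    ]


sentence = [
    ("the", "article"), ("nuclear", "adjective"), ("test", "noun"),
    ("ban", "noun"), ("treaty", "noun"), ("was", "verb"),
    ("signed", "participle"),
]
print(extract_phrases(sentence))  # ['nuclear test ban treaty']
```

The sketch returns only maximal matches; a recall-oriented extractor would presumably also keep the sub-phrases contained in each match.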

  25. III. Transition to an interactive model: Indexing
     Diagram labels: lemma, phrase, document
     Steps
     • Text pre-processing and listing of words
     • Word tagging (oriented to phrase detection)
     • Phrase detection & lemmatization of components
     • Document indexing & statistics (document frequency)
     • Phrase selection (subsumption & lexicalization degree)
     • Phrase indexing

  26. III. Transition to an interactive model: Retrieval model
     Diagram of the processing stages applied to a query:
     • Tokenising: query into tok1 tok2 tok3
     • Lemmatising (using the lexicon): each token into candidate lemmas (lem11 lem12 ..., lem21 lem22 ..., lem31 lem32 ...)
     • Expansion / Translation (using EWN & dictionaries): each lemma into expansions (exp11 exp12 ...) and translations (tran11 tran12 ...)
     • Phrase retrieval over the phrase index, followed by term ranking, gives the terms
     • Document retrieval over the document index, followed by document ranking, gives the documents

  27. III. Transition to an interactive model: Query expansion and translation
     Query: Tratados de Prohibición de Pruebas Nucleares
     • Tratados
        Expansion: acuerdo, capitulación, concertación, convenio, cuidar, manejar, pacto, procesar
        Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
     • Prohibición
        Expansion: embargo, entredicho, interdicción, interdicto, proscripción
        Translation: ban, interdiction, prohibition, proscription
     • Pruebas
        Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
        Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
     • Nucleares
        Expansion: nuclear
        Translation: nuclear
     Ambiguity reduction
        Nuclear taste proscription process?
        Nuclear test ban treaty?
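One reading of the ambiguity-reduction step above is that word-by-word translation produces many candidate combinations, and only those attested as phrases in the collection are kept. The sketch below follows that reading, which is an interpretation rather than something the slide states. The translation lists are a subset of those on the slide; `PHRASE_INDEX`, the order-independent representation of a phrase as a set of lemmas, and the function names are hypothetical.

```python
from itertools import product

# Subset of the per-word translations shown on the slide.
TRANSLATIONS = {
    "tratados": ["process", "treaty", "pact"],
    "prohibición": ["ban", "proscription", "prohibition"],
    "pruebas": ["taste", "test", "trial"],
    "nucleares": ["nuclear"],
}

# Hypothetical phrase index: phrases found in the collection, as lemma sets.
PHRASE_INDEX = {frozenset(["nuclear", "test", "ban", "treaty"])}


def candidate_phrases(query_words, translations):
    """All combinations that take one translation per query word."""
    options = [translations[word] for word in query_words]
    return [frozenset(combo) for combo in product(*options)]


def attested(candidates, phrase_index):
    """Keep only the candidates that occur in the phrase index."""
    return [c for c in candidates if c in phrase_index]


query = ["tratados", "prohibición", "pruebas", "nucleares"]
candidates = candidate_phrases(query, TRANSLATIONS)
print(len(candidates))  # 27
print([sorted(c) for c in attested(candidates, PHRASE_INDEX)])
```

Under this reading, "nuclear taste proscription process" is generated as a candidate but discarded because it never occurs in the collection, while "nuclear test ban treaty" survives.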

  28. Interface screenshot, annotated:
     • Query in Spanish
     • Hierarchy of terms (English, Spanish, Catalan)
     • Ranking of documents

  29. Structure
     I. Problem definition and goals
     II. Experiments in Lexical Ambiguity and Indexing
     III. Website Term Browser
     IV. Evaluation framework

  30. IV. Evaluation framework: Evaluation of Terminology Retrieval
     Compare
     • Terminology Retrieval
     • Hand-crafted multilingual thesaurus

  31. IV. Evaluation framework: Evaluation of Terminology Retrieval
     Recall of mono-lexical terms (lemmas)
     • Monolingual: 85% - 95%
     • Translingual: 55% - 65%
     Recall of poly-lexical terms (phrases)
     • Monolingual: 40% - 65%
     • Translingual: 10% - 45%
     Loss of recall due to
     • Phrase extraction (mainly POS tagging): 3% - 17%
     • Phrase indexing (mainly lemmatization): 2% - 34%
     • Phrase selection: 12% - 37%
     • Lack of connections between different languages in EWN
     • Gaps in the EWN adjective hierarchies

  32. IV. Evaluation framework: Usefulness of Term Browsing
     Previous experiences in interactivity evaluation (TREC)
     • Need:
        • Precise queries
        • Laboratory conditions
        • Controlled users
     • No differences are found between systems
     • Identifying the better approaches is not possible
     A new framework is proposed here
     • Real work environment
     • Record users' interaction
     • Compare the use of
        • Term area provided by WTB
        • Document ranking provided by Google

  33. Logged user actions: QUERY, RECONSULT WITH TERM, EXPLORE TERM, EXPLORE DOCUMENT

  34. IV. Evaluation framework: Usefulness of Term Browsing
     • 2318 sessions with interaction
     • An average of 5.16 actions per session
     • EXPLORE_TERM is used in 65%
     LOG FILE
        539
        2001/03/14 12:10:33 QUERY UNED 193.146.241.164 ozone hole
        2001/03/14 12:11:20 EXPLORE_TERM 539684: degradación de la capa de ozono
        2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm
        ... EXPLORE_TERM RECONSULT EXPLORE_DOC ...
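A log like the excerpt above can be tallied with a short script. This is a sketch that assumes each action line has the form `date time ACTION rest`, as in the three lines shown on the slide; the full log format is not given, so the parsing rule and the function names are assumptions.

```python
from collections import Counter

# The three complete log lines shown on the slide.
SAMPLE_LOG = """\
2001/03/14 12:10:33 QUERY UNED 193.146.241.164 ozone hole
2001/03/14 12:11:20 EXPLORE_TERM 539684: degradación de la capa de ozono
2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm
"""

ACTIONS = {"QUERY", "RECONSULT", "EXPLORE_TERM", "EXPLORE_DOC"}


def parse_actions(log_text):
    """Extract the sequence of action names, skipping unrecognised lines."""
    actions = []
    for line in log_text.splitlines():
        fields = line.split(maxsplit=3)
        if len(fields) >= 3 and fields[2] in ACTIONS:
            actions.append(fields[2])
    return actions


def first_action_after_query(actions):
    """Count which action immediately follows each QUERY."""
    counts = Counter()
    for previous, current in zip(actions, actions[1:]):
        if previous == "QUERY" and current != "QUERY":
            counts[current] += 1
    return counts


actions = parse_actions(SAMPLE_LOG)
print(actions)  # ['QUERY', 'EXPLORE_TERM', 'EXPLORE_DOC']
print(first_action_after_query(actions))
```

Counts of this kind, aggregated over all sessions, are the sort of figure an interaction-log evaluation reports.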

  35. IV. Evaluation framework: Usefulness of Term Browsing

                                      All queries   1-word queries   >1-word queries
     First action after QUERY
        EXPLORE_DOC                       42%            47%              39%
        EXPLORE_TERM                      51%            45%              55%
        RECONSULT                          7%             8%               6%
     Last action before finishing the session with EXPLORE_DOC
        QUERY                             50%            57%              46%
        EXPLORE_TERM                      44%            38%              47%
        RECONSULT                          6%             5%               7%

  36. Structure
     I. Problem definition and goals
     II. Experiments in Lexical Ambiguity and Indexing
     III. Website Term Browser
     IV. Evaluation framework

  37. Conclusions
     Lexical ambiguity has been studied using IR-Semcor
     • Evaluation free of automatic processing errors
     • Explicit disambiguation at indexing doesn't seem to improve retrieval (POS, WSD, semantic distinction of lexical compounds)
     • Conceptual indexing based on EuroWordNet synsets needs to solve some challenges
     • Think of users to find new places for NLP

  38. Conclusions
     A search model based on extraction, retrieval and browsing of terminology has been developed
     • User oriented
        • Interaction over terminological information
     • Intermediate way between free searching and thesaurus-guided searching
        • Without the need for thesaurus construction
     • Brings the collection terminology to users
        • Morpho-syntactic & semantic variations
        • Translinguality

  39. Conclusions
     An evaluation framework for Terminology Retrieval and Term Browsing has been established
     • Points the way to improving Terminology Retrieval
     • Users appreciate Term Browsing
     • WTB phrasal information can substantially complement the document ranking provided by search engines

  40. Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia
     Doctoral thesis
     Website Term Browser: An interactive, multilingual text search system based on linguistic techniques
     Anselmo Peñas Padilla
     Advisors: Julio Gonzalo Arroyo, María Felisa Verdejo Maíllo
