Indexing with Linguistic Techniques in the Classic Information Retrieval Model

Jornadas de Tratamiento y Recuperación de la Información, JOTRI 2002
Indexing with Linguistic Techniques in the Classic Information Retrieval Model
Julio Gonzalo, Anselmo Peñas and Felisa Verdejo
Grupo de Procesamiento de Lenguaje Natural (Natural Language Processing Group)
Dpto. Lenguajes y Sistemas Informáticos, UNED

Content • Goal • Morpho-syntactic ambiguity in IR • Phrase indexing • Conceptual indexing • Conclusions

Goal
• Indexing with automatic linguistic techniques within the classic IR model
  • POS tagging
  • Phrase indexing
  • WSD & Conceptual Indexing
Bad strategies or too much error in automatic processing?
IR-Semcor, hand-annotated test collection
  • Lemmas and phrases
  • Senses
  • Synsets
[Diagram labels: Information need, Formulation, Query, Refinement, Search engine, Document ranking, Docs.]

Morpho-syntactic ambiguity in IR
Texts
• ...particle crosses the wall...
• ...canadian red cross...
• ...boat to cross mississippi river...
Plain matches (Query: cross)
• ...particl cross the wall...
• ...canadian red cross...
• ...boat to cross mississippi river...
POS Tagged matches (Query: cross_N)
• ...particl_N cross_V the_D wall_N...
• ...canadian_ADJ red_ADJ cross_N...
• ...boat_N to_TO cross_V mississippi_N river_N...

Morpho-syntactic ambiguity in IR
• Matched documents are ranked much higher (there are fewer competing documents)
• Manual POS tagging misses relevant matches
  • Query: ...talented baseball player... (talent_ADJ)
  • Doc: ...top talents of the time... (talent_N)
  • Missing match
• Automatic tagging makes more mistakes, but they are not always correlated with a decrease in retrieval
  • Query: summer_N shoes_N design_V (design_V)
  • Doc: Italian_ADJ designed_V sandals_N (design_V)
  • Match
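The contrast between plain and POS-tagged matching for the query "cross" can be sketched with a tiny inverted index. This is only an illustration: the three documents are the slide's examples, the tags are typed in by hand rather than produced by a tagger, and the names `DOCS` and `build_index` are invented for this sketch.

```python
from collections import defaultdict

# Toy documents from the slide, already tokenized as (stem, POS) pairs.
# The tags are hand-assigned for illustration, not the output of a real tagger.
DOCS = {
    "d1": [("particl", "N"), ("cross", "V"), ("the", "D"), ("wall", "N")],
    "d2": [("canadian", "ADJ"), ("red", "ADJ"), ("cross", "N")],
    "d3": [("boat", "N"), ("to", "TO"), ("cross", "V"),
           ("mississippi", "N"), ("river", "N")],
}


def build_index(docs, use_pos):
    """Inverted index from index term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for stem, pos in tokens:
            # With POS indexing the term is "stem_POS"; otherwise the bare stem.
            term = f"{stem}_{pos}" if use_pos else stem
            index[term].add(doc_id)
    return index


plain_index = build_index(DOCS, use_pos=False)
tagged_index = build_index(DOCS, use_pos=True)

print(sorted(plain_index["cross"]))     # ['d1', 'd2', 'd3']
print(sorted(tagged_index["cross_N"]))  # ['d2']
```

With tags, the query cross_N competes against one document instead of three, which is why matched documents rank higher. The same mechanism also produces the missing match when query and document carry different tags for the same stem.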

Phrase indexing
Texts
• ...a guide for the fisher who...
• ...information on cat care...
• ...arboreal carnivorous called fisher cat...
Plain matches (Query: fisher)
• ...a guide for the fisher who...
• ...arboreal carnivorous called fisher cat...
• ...information on cat care...
Phrase indexing matches (Query: fisher)
• ...a guide for the fisher who...
• ...arboreal carnivorous called fisher_cat...
• ...information on cat care...

Phrase indexing
• Phrase indexing harms retrieval sometimes
  • Query: Candidate in governor’s_race
  • Doc: Opened his race for governor
  • Missing match
• Phrase meaning is highly compositional
• Needs semantic distinction
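The missing match can be reproduced with a minimal phrase indexer. The sketch below assumes lemmatized, stopword-free tokens and a hand-written two-word phrase list; `phrase_tokens` and `PHRASES` are hypothetical names, and greedy left-to-right joining is just one simple way to form phrase terms.

```python
def phrase_tokens(tokens, phrases):
    """Greedily join known two-word phrases into single index terms."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            out.append("_".join(pair))  # e.g. ("fisher", "cat") -> "fisher_cat"
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hand-written phrase list (an assumption; a real system would use a lexicon).
PHRASES = {("fisher", "cat"), ("governor", "race")}

# Lemmatized, stopword-free versions of the slide's query and document.
query = ["candidate", "governor", "race"]
doc = ["open", "race", "governor"]

plain_overlap = set(query) & set(doc)
phrase_overlap = (set(phrase_tokens(query, PHRASES))
                  & set(phrase_tokens(doc, PHRASES)))

print(sorted(plain_overlap))   # ['governor', 'race']
print(sorted(phrase_overlap))  # []
```

The query becomes the single term governor_race, while the document keeps race and governor as separate terms, so the two no longer share any index term.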

Conceptual Indexing
[Diagram: Query "spring" → WSD → n09151839, matched against a Conceptual Index holding n03114639, n05727069 and n09151839. Texts: ...spring... ...muelle... ...spring... ...fountain... ...fuente... ...spring... ...springtime... ...primavera...]
• This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999)
• Depending on WSD error rate
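A minimal sketch of the idea follows: documents and queries are indexed by concept rather than by word, so English and Spanish texts about the same meaning of "spring" end up under the same entry. The concept identifiers (`c_season`, `c_fountain`, `c_coil`) are invented stand-ins for synset offsets, and `disambiguate` is a placeholder that simply trusts a supplied sense label rather than performing real WSD.

```python
from collections import defaultdict

# Hypothetical concept identifiers standing in for WordNet/ILI synset offsets.
LEXICON = {
    ("en", "spring"): ["c_season", "c_fountain", "c_coil"],
    ("en", "springtime"): ["c_season"],
    ("en", "fountain"): ["c_fountain"],
    ("es", "primavera"): ["c_season"],
    ("es", "fuente"): ["c_fountain"],
    ("es", "muelle"): ["c_coil"],
}


def disambiguate(lang, word, chosen=None):
    """Stand-in for WSD: keep `chosen` if it is a listed sense of the word,
    otherwise fall back to the first listed sense."""
    senses = LEXICON[(lang, word)]
    return chosen if chosen in senses else senses[0]


def build_concept_index(docs):
    """Inverted index from concept identifier to document ids."""
    index = defaultdict(set)
    for doc_id, (lang, tagged_words) in docs.items():
        for word, sense in tagged_words:
            index[disambiguate(lang, word, sense)].add(doc_id)
    return index


DOCS = {
    "en1": ("en", [("spring", "c_coil")]),
    "es1": ("es", [("muelle", None)]),
    "en2": ("en", [("fountain", None)]),
    "es2": ("es", [("fuente", None)]),
    "en3": ("en", [("springtime", None)]),
    "es3": ("es", [("primavera", None)]),
}

concept_index = build_concept_index(DOCS)
query_concept = disambiguate("en", "spring", "c_season")
print(sorted(concept_index[query_concept]))  # ['en3', 'es3']
```

A wrong sense choice for the query would retrieve an entirely different document set, which is why the benefit depends on the WSD error rate.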

Word Sense Disambiguation
• (Sanderson 1994) introduced fixed error rates in pseudo-word disambiguation
  banana → banana/education/toy/gun/forest → WSD → toy
  to conclude (over the Reuters collection)
  • WSD must be above 90% accuracy
• Reproduce Sanderson’s experiment (over IR-Semcor)
• Compare precision in retrieval over synsets with WSD errors
[Diagram: spring → WSD → n07062238 / n04985670 (error); synsets shown: {spring, springtime}, {spring, hook}]
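The pseudo-word setup with a fixed error rate can be simulated roughly as follows. This is a sketch of the general idea, not Sanderson's actual procedure: the function names, the uniform choice of a wrong component and the seeded generator are all assumptions made for illustration.

```python
import random


def make_pseudo_word_map(groups):
    """Map each component word to its pseudo-word,
    e.g. 'banana' -> 'banana/education/toy/gun/forest'."""
    mapping = {}
    for group in groups:
        pseudo = "/".join(group)
        for word in group:
            mapping[word] = pseudo
    return mapping


def simulate_wsd(tokens, groups, error_rate, rng):
    """Return the tokens as a simulated disambiguator would output them:
    the true word with probability 1 - error_rate, otherwise a wrong
    component of the same pseudo-word (chosen uniformly, an assumption)."""
    members = {word: group for group in groups for word in group}
    output = []
    for token in tokens:
        group = members.get(token)
        if group is None or len(group) < 2:
            output.append(token)  # unambiguous token, left untouched
        elif rng.random() < error_rate:
            wrong = [w for w in group if w != token]
            output.append(rng.choice(wrong))  # simulated WSD error
        else:
            output.append(token)  # correct disambiguation
    return output


GROUPS = [("banana", "education", "toy", "gun", "forest")]

rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = ["banana", "split", "toy", "store"]
print(make_pseudo_word_map(GROUPS)["banana"])
print(simulate_wsd(tokens, GROUPS, error_rate=0.0, rng=rng))
```

Indexing the output of `simulate_wsd` at several error rates and comparing retrieval precision against the error-free run is the kind of comparison the slide describes.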

Pseudo-words with no errors in WSD (figure; surviving label: text)

Synset indexing with no errors in WSD

Conceptual Indexing
• Although explicit disambiguation strategies applied to indexing
  • POS tagging
  • Phrase indexing
  • Word Sense Disambiguation
  do not produce a significant improvement in IR
• Conceptual indexing based on synsets
  • Needs automatic WSD accuracy near the state of the art (60%)
  • Permits Cross-Language Information Retrieval
• Qualitative evaluation (Item Search engine)
  • Some unsolved challenges (mainly WSD)
  • Users perceive a slower and less transparent system

Conclusions
Think of users
• Even an improvement of 10% wouldn’t change users’ perception
• Don’t subordinate NLP to the classic IR model
• Find new paradigms in Information Access
  • At a higher level, closer to users
  • Consider users’ tasks
  • Consider users’ interaction


IR-Semcor test collection
• 254 hand-annotated documents in English
• 82 hand-annotated queries in English with ~6.8 relevant documents each
Example
The Fulton County Grand Jury investigates possible irregularities in Atlanta’s primary election
• Lemmas and phrase annotation
  The Fulton_County_Grand_Jury investigate possible irregularity in atlanta primary_election
• Sense annotation
  Fulton_County_Grand_Jury investigate2 possible2 irregularity1 atlanta1 primary_election1
• Synset annotation (actually synset offsets or ILI-records)
  Fulton_County_Grand_Jury v00441414 a00036893 n00412042 n5608324 n00103176
  Synsets shown: { investigate, carry_out_an_investigation_of } { primary_election, primary } { possible, potential } { irregularity, abnormality } { Atlanta, capital_of_Georgia }

IR-Semcor test collection
Assume the summary of a text is relevant to all fragments of the original Semcor document
Hand-annotated summaries only for chunked docs
[Diagram labels: Semcor 1.5, Semcor 1.6, IR-Semcor; Query 1, Query 2, Query 82; Doc 1, Doc 2, Doc 83, Doc ~100, Doc 171, Doc 254]
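The relevance assumption above can be written down as a small helper that derives relevance judgments from the way source texts were chunked. The identifiers and the query naming scheme are hypothetical, and `precision_at_k` is included only to show how such judgments would be used when comparing indexing strategies.

```python
def build_qrels(fragments_by_source):
    """Map each query id (one summary per chunked source text) to the
    fragment ids assumed relevant to it."""
    return {
        f"query_{source}": sorted(fragments)
        for source, fragments in fragments_by_source.items()
    }


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


# Hypothetical ids: source text "a" was split into three fragments, "b" into two.
FRAGMENTS = {"a": ["a.1", "a.2", "a.3"], "b": ["b.1", "b.2"]}
qrels = build_qrels(FRAGMENTS)

ranking = ["a.1", "b.1", "a.3"]  # made-up system output for query_a
print(qrels["query_a"])          # ['a.1', 'a.2', 'a.3']
print(precision_at_k(ranking, set(qrels["query_a"]), k=3))
```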

Textual representation: query is translated into the target language
Conceptual representation: query and documents are compared at a conceptual level
Selection of newspaper determines the target language
Selection of query language
Selection of WSD strategy
Retrieved documents

Approaches
[Diagram relating Information Retrieval and Natural Language Processing; labels:]
• Keyphrase navigation (Phrasier)
• Controlled vocabularies indexing & browsing
• String Processing
• Terminology
• Free text indexing
• Terminology Retrieval & Term browsing (WTB)
• Phrase indexing & browsing (Phind)
• Disambiguation
• Conceptual indexing
• Automatic Terminology Extraction

II. Experiments in Lexical Ambiguity and Indexing
Semantic distinction of compounds
Types of lexical compounds
• Exocentric: fisher cat
• Endocentric: purchasing department (is_a department)
• Appositional: aspirin powder (is_a aspirin, is_a powder)
Automatic classification through WordNet
• Endocentric: one component is hyperonym
• Appositional: all components are hyperonyms
• Exocentric: no components are hyperonyms
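The three classification rules can be sketched with a toy hypernym table in place of WordNet. The table below is hand-written for illustration (it is not WordNet output), `classify_compound` is a hypothetical name, and "some but not all components" is treated as endocentric, which coincides with "one component" for two-word compounds.

```python
# Toy is_a facts standing in for WordNet hypernym lookups (hand-written).
HYPERNYMS = {
    "purchasing_department": {"department"},
    "aspirin_powder": {"aspirin", "powder"},
    "fisher_cat": {"carnivore"},
}


def classify_compound(compound, hypernyms):
    """Classify a compound by how many of its components are its hypernyms."""
    components = compound.split("_")
    known = hypernyms.get(compound, set())
    hits = [c for c in components if c in known]
    if not hits:
        return "exocentric"    # no component is a hypernym
    if len(hits) == len(components):
        return "appositional"  # every component is a hypernym
    return "endocentric"       # some, but not all, components are hypernyms


for compound in ("fisher_cat", "purchasing_department", "aspirin_powder"):
    print(compound, classify_compound(compound, HYPERNYMS))
# fisher_cat exocentric
# purchasing_department endocentric
# aspirin_powder appositional
```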
