Similarity Measures for Query Expansion in TopX

Similarity Measures for Query Expansion in TopX
Caroline Gherbaoui
Universität des Saarlandes, Naturwissenschaftlich-Technische Fak. I, Fachrichtung 6.2 - Informatik
Max-Planck-Institut für Informatik, AG 5 - Datenbanken und Informationssysteme
Prof. Dr. Gerhard Weikum

Overview • background knowledge • similarity measures for the query expansion • evaluation of the computed similarity values • changes in TopX • conclusion

Background • top-k query processing: provides the k most relevant results • query expansion: extends the terms of the source query • word sense disambiguation: extracts the correct meaning of a term • ontology: a collection of terms with their meanings and semantic relations

Word Sense Disambiguation „java, coffee“ „island“ „coffee“ „java“ „programming language“ …

Query Expansion „COFFEE“ „drink, espresso“

TopX • top-k retrieval engine • text and XML data • word sense disambiguation • query expansion • ontology

TopX – WordNet Ontology • lexicon for the English language • hierarchical relations • one relation → one direction • ~160,000 words • ~120,000 synsets • ~210,000 relations

TopX – YAGO Ontology • built from Wikipedia and WordNet • hierarchical and non-hierarchical relations • one relation → two directions • ~2,100,000 words • ~2,200,000 concepts • ~6,000,000 relations

Similarity Measures • Dice similarity • the measure already used in TopX • NAGA similarity • the measure applied to YAGO • Best WordNet similarity • the measure with the best result among the WordNet measures

Dice Similarity Measure • measures the intersection of two regions
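The slide's formula did not survive extraction, so the following is only a minimal sketch of the standard Dice coefficient, 2·|A ∩ B| / (|A| + |B|). It assumes that the two "regions" are sets of document identifiers in which each term occurs; the function name `dice_similarity` and the sample data are hypothetical and not taken from TopX.

```python
def dice_similarity(docs_a: set, docs_b: set) -> float:
    """Standard Dice coefficient: 2 * |A ∩ B| / (|A| + |B|).

    Assumption: each argument is the set of document ids containing a term.
    """
    total = len(docs_a) + len(docs_b)
    if total == 0:
        # Both regions are empty; define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return 2 * len(docs_a & docs_b) / total


# Hypothetical occurrence sets for two terms.
coffee = {1, 2, 3, 4}
espresso = {3, 4, 5}
score = dice_similarity(coffee, espresso)  # 2 * 2 / (4 + 3)
```

The value lies in [0, 1]: it is 0 when the regions are disjoint and 1 when they coincide.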

NAGA Similarity Measure • combination of the confidence of a relation and the informativeness of a relation

Best WordNet Similarity Measure • product of the transfer function of the path length and the transfer function of the concept depth
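The slide's formula is missing, so the sketch below only illustrates what such a product could look like. It assumes the form known from the word similarity measure of Li, Bandar and McLean, where the path length is mapped through a decaying exponential and the concept depth through a hyperbolic tangent; whether this is exactly the measure on the slide is an assumption. The function names and the default parameters `alpha` and `beta` are illustrative.

```python
import math


def path_length_transfer(length: float, alpha: float = 0.2) -> float:
    """Assumed transfer function: similarity decays exponentially with path length."""
    return math.exp(-alpha * length)


def depth_transfer(depth: float, beta: float = 0.6) -> float:
    """Assumed transfer function: deeper (more specific) concepts score higher."""
    return math.tanh(beta * depth)


def wordnet_similarity(length: float, depth: float,
                       alpha: float = 0.2, beta: float = 0.6) -> float:
    """Product of the two transfer functions, as described on the slide."""
    return path_length_transfer(length, alpha) * depth_transfer(depth, beta)


# Hypothetical pair of concepts: 2 edges apart, common subsumer at depth 5.
example = wordnet_similarity(length=2, depth=5)
```

Under these assumptions the score shrinks as the path between two concepts grows and rises as their common subsumer lies deeper in the hierarchy.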

Evaluation

Evaluation • Dice measure → applicable • also on the YAGO ontology • NAGA measure → applicable • when the forward direction is omitted • Best WordNet measure → not applicable • due to the density of YAGO

Changes for TopX • tuning of some procedures • Dijkstra algorithm • word sense disambiguation • query expansion • extension of the configuration file
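How a Dijkstra-style search can drive query expansion over an ontology graph might be sketched as follows. This is not the TopX implementation: it assumes that each relation carries a similarity in (0, 1], that the similarity of a path is the product of its edge similarities, and that expansion stops below a threshold. The names `expand_term`, `ontology`, and `threshold` are hypothetical.

```python
import heapq


def expand_term(graph: dict, source: str, threshold: float = 0.5) -> dict:
    """Return each reachable term with its best path similarity >= threshold.

    graph maps a term to a list of (neighbor, similarity) pairs.
    Because similarities are at most 1, path products never increase,
    so a best-first search in the manner of Dijkstra finds the best products.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap emulated with negated similarities
    while heap:
        neg_sim, node = heapq.heappop(heap)
        sim = -neg_sim
        if sim < best.get(node, 0.0):
            continue  # stale entry; a better path was already found
        for neighbor, edge_sim in graph.get(node, []):
            candidate = sim * edge_sim
            if candidate >= threshold and candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    del best[source]  # the source term itself is not an expansion
    return best


# Hypothetical ontology fragment with made-up similarity values.
ontology = {
    "coffee": [("drink", 0.8), ("espresso", 0.9)],
    "espresso": [("drink", 0.95)],
    "drink": [("beverage", 0.7)],
    "beverage": [("liquid", 0.5)],
}
expansions = expand_term(ontology, "coffee", threshold=0.5)
```

In this made-up fragment "drink" is reached more strongly via "espresso" than directly, and "liquid" falls below the threshold and is not used for expansion.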

Conclusion • larger knowledge base • more flexibility • increased complexity • an additional measure for similarity computation → NAGA similarity

Questions?
