Chapter 5: Query Operations

Hassan Bashiri, April 2009

Cross-Language Information Retrieval (CLIR) • What is CLIR? • Users enter a query in one language, and the search engine retrieves relevant documents in other languages. • Example (figure): English query → retrieval system → French documents

Cross-Language Text Retrieval [Figure: taxonomy diagram. Node labels: Query Translation, Document Translation, Text Translation, Vector Translation, Controlled Vocabulary, Free Text, Knowledge-based, Corpus-based, Ontology-based, Dictionary-based, Thesaurus-based, Term-aligned, Sentence-aligned, Document-aligned, Unaligned, Parallel, Comparable]

Query Language • Visual languages • Example: library shown on the screen. Act: take books, open catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

IR Interface • Query interface • Selection interface • Examination interface • Document delivery

Retrieval System Model [Figure: block diagram. Labels: User, Query Formulation, Detection, Selection, Examination, Delivery, Index, Indexing, Docs]

Starfield

Query Formulation • No detailed knowledge of collection and retrieval environment • difficult to formulate queries well designed for retrieval • Need many formulations of queries for good retrieval • First formulation: naïve attempt to retrieve relevant information • Documents initially retrieved: • Examined for relevance information • Improved query formulations for retrieving additional relevant documents • Query reformulation: • Expanding original query with new terms • Reweighting the terms in expanded query

Three approaches • Approaches based on feedback from users (relevance feedback) • Approaches based on information derived from set of initially retrieved documents (local set of documents) • Approaches based on global information derived from document collection

User relevance feedback • Most popular query reformulation strategy • Cycle: • User presented with list of retrieved documents • User marks those which are relevant • In practice: top 10-20 ranked documents are examined • Incremental • Select important terms from documents assessed relevant by users • Enhance importance of these terms in a new query • Expected: • New query moves towards relevant documents and away from non-relevant documents • For Instance • Q1:US Open • Q2:US Open Robocup

User relevance feedback • Two basic techniques • Query expansion Add new terms from relevant documents • Term reweighting Modify term weights based on user relevance judgements

Query Expansion and Term Reweighting for the Vector Model • basic idea • Relevant documents resemble each other • Non-relevant documents have term-weight vectors which are dissimilar from those of the relevant documents • The reformulated query is moved closer to the region of the term-weight vector space occupied by the relevant documents

Query Expansion and Term Reweighting for the Vector Model (Continued) [Figure: the collection, with the retrieved documents as a subset] • $D_r$: set of relevant documents, as identified by the user, among the retrieved documents • $D_n$: set of non-relevant documents among the retrieved documents • $C_r$: set of relevant documents in the collection

User relevance feedback: Vector Space Model • $D_r$: set of relevant documents, as identified by the user, among the retrieved documents; • $D_n$: set of non-relevant documents among the retrieved documents; • $C_r$: set of relevant documents among all documents in the collection; • $|D_r|$, $|D_n|$, $|C_r|$: number of documents in the sets $D_r$, $D_n$, $C_r$, respectively; • $\alpha$, $\beta$, $\gamma$: tuning constants.

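Calculate the modified query $\vec{q}_m$
• Standard Rocchio: $\vec{q}_m = \alpha\,\vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$
• Ide Regular: $\vec{q}_m = \alpha\,\vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$
• Ide Dec-Hi: $\vec{q}_m = \alpha\,\vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma\,\max_{\text{non-relevant}}(\vec{d}_j)$
• $\alpha$, $\beta$, $\gamma$: tuning constants (usually $\beta > \gamma$) • $\alpha = 1$ (Rocchio, 1971) • $\alpha = \beta = \gamma = 1$ (Ide, 1971) • $\gamma = 0$: positive feedback only • $\max_{\text{non-relevant}}(\vec{d}_j)$: the highest-ranked non-relevant document • The three techniques give similar performance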
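As an illustration of the three reformulation rules named on this slide (Standard Rocchio, Ide Regular, Ide Dec-Hi), here is a minimal Python sketch. It assumes documents and queries are sparse vectors stored as {term: weight} dictionaries. The function names and the default values beta=0.75 and gamma=0.25 are assumptions chosen for the example, not values from the slides.

```python
def _add_scaled(target, vec, scale):
    """Add scale * vec to target in place; vectors are {term: weight} dicts."""
    for term, weight in vec.items():
        target[term] = target.get(term, 0.0) + scale * weight


def standard_rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.25):
    """q_m = alpha*q + (beta/|Dr|)*sum(Dr) - (gamma/|Dn|)*sum(Dn).

    The default beta and gamma are illustrative assumptions.
    """
    qm = {}
    _add_scaled(qm, q, alpha)
    for d in rel:
        _add_scaled(qm, d, beta / len(rel))
    for d in nonrel:
        _add_scaled(qm, d, -gamma / len(nonrel))
    return qm


def ide_regular(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(Dr) - gamma*sum(Dn), with no set-size normalization."""
    qm = {}
    _add_scaled(qm, q, alpha)
    for d in rel:
        _add_scaled(qm, d, beta)
    for d in nonrel:
        _add_scaled(qm, d, -gamma)
    return qm


def ide_dec_hi(q, rel, top_nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Like ide_regular, but subtracts only the highest-ranked non-relevant document."""
    nonrel = [top_nonrel] if top_nonrel else []
    return ide_regular(q, rel, nonrel, alpha, beta, gamma)


# Toy data (hypothetical): the user marked one document relevant and one not.
q = {"us": 1.0, "open": 1.0}
rel = [{"us": 1.0, "open": 1.0, "robocup": 2.0}]
nonrel = [{"us": 1.0, "open": 1.0, "tennis": 2.0}]
qm = standard_rocchio(q, rel, nonrel)
print(qm)
```

In this toy run the term "robocup" enters the query with a positive weight and "tennis" with a negative one, which is the intended movement towards relevant and away from non-relevant documents. Whether negative weights are kept or clipped to zero is a design choice this sketch leaves open.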

Analysis • advantages • simplicity • good results • disadvantages • No optimality criterion is adopted

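User relevance feedback: Probabilistic Model • The similarity of a document $d_j$ to a query $q$: $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$ • $P(k_i|R)$: the probability of observing the term $k_i$ in the set $R$ of relevant documents • $P(k_i|\bar{R})$: the probability of observing the term $k_i$ in the set $\bar{R}$ of non-relevant documents • Initial search: $P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = n_i / N$, where $n_i$ is the number of documents containing $k_i$ and $N$ is the number of documents in the collection, so that $sim_{initial}(d_j, q) = \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{N - n_i}{n_i}$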

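User relevance feedback: Probabilistic Model • Feedback search: $P(k_i|R) = \frac{|D_{r,i}|}{|D_r|}$ and $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$ • $D_r$: set of relevant retrieved documents, as identified by the user • $D_{r,i}$: subset of $D_r$ containing the term $k_i$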

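User relevance feedback: Probabilistic Model • Feedback search: $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} + \log \frac{N - |D_r| - (n_i - |D_{r,i}|)}{n_i - |D_{r,i}|} \right)$ • No query expansion occurs: the same query terms are reweighted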

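User relevance feedback: Probabilistic Model • For small values of $|D_r|$ and $|D_{r,i}|$ (e.g., $|D_r| = 1$, $|D_{r,i}| = 0$) the logarithms are undefined, so an adjustment factor is added • Alternative 1: $P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$, $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$ • Alternative 2: $P(k_i|R) = \frac{|D_{r,i}| + n_i/N}{|D_r| + 1}$, $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + n_i/N}{N - |D_r| + 1}$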
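A small sketch of probabilistic term reweighting may help here. It assumes binary term weights (a term is either present or absent), the initial estimates $P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = n_i/N$, and the two adjusted feedback estimates commonly given for this model (adding 0.5, or adding $n_i/N$). The function names are hypothetical.

```python
import math


def initial_weight(N, n_i):
    """Term weight before any feedback: log((N - n_i) / n_i). Requires 0 < n_i < N."""
    return math.log((N - n_i) / n_i)


def feedback_weight(N, n_i, Dr, Dr_i, alternative=1):
    """Term weight after feedback, with an adjustment for small |Dr| and |Dr,i|.

    N: documents in the collection; n_i: documents containing the term;
    Dr: relevant documents marked by the user; Dr_i: those containing the term.
    alternative=1 adds 0.5; alternative=2 adds n_i / N.
    """
    adjust = 0.5 if alternative == 1 else n_i / N
    p_rel = (Dr_i + adjust) / (Dr + 1)
    p_nonrel = (n_i - Dr_i + adjust) / (N - Dr + 1)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def similarity(doc_terms, query_terms, weight_of):
    """Sum the weights of query terms that occur in the document (binary weights)."""
    return sum(weight_of(k) for k in query_terms if k in doc_terms)


# Toy numbers (hypothetical): 100 documents, the term occurs in 10 of them,
# and the user marked 5 documents relevant.
print(initial_weight(100, 10))
print(feedback_weight(100, 10, Dr=5, Dr_i=4))  # term frequent among relevant docs
print(feedback_weight(100, 10, Dr=5, Dr_i=0))  # term absent from relevant docs
```

Note that the set of query terms never changes in this sketch; only their weights do, which matches the observation that no query expansion occurs in the probabilistic model.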

Analysis • advantages • Feedback process is directly related to the derivation of new weights for query terms • The term reweighting is optimal • disadvantages • Document term weights are not considered • No query expansion is used

Query Expansion [Figure: overview diagram. Labels: Global (Similarity Thesaurus, Statistical Thesaurus); Local Clustering (Association Clustering, Metric Clustering, Scalar Clustering); Context Analysis]

Automatic Local Analysis • user relevance feedback • Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents, with assistance from the user (clustering) • automatic analysis • Obtain a description (in terms of index terms) for a larger cluster of relevant documents automatically • global strategy: a global thesaurus-like structure is built from all documents before querying • local strategy: terms from the documents retrieved for a given query are selected at query time

Query Expansion based on a Similarity Thesaurus • Query expansion is done in three steps as follows: • Represent the query in the concept space used for representation of the index terms • Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q. • Expand the query with the top r ranked terms according to sim(q,kv)

Query Expansion – Step 1 • To the query $q$ is associated a vector $\vec{q}$ in the term-concept space given by $\vec{q} = \sum_{k_i \in q} w_{i,q}\, \vec{k}_i$ • where $w_{i,q}$ is a weight associated with the index-query pair $[k_i, q]$

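Query Expansion – Step 2 • Compute a similarity $sim(q, k_v)$ between each term $k_v$ and the user query $q$: $sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$ • where $c_{u,v}$ is the correlation factor between the terms $k_u$ and $k_v$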

Query Expansion – Step 3 • Add the top $r$ ranked terms according to $sim(q, k_v)$ to the original query $q$ to form the expanded query $q'$ • To each expansion term $k_v$ in the query $q'$ is assigned a weight $w_{v,q'}$ given by $w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}$ • The expanded query $q'$ is then used to retrieve new documents for the user
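The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the correlation factors are supplied as a dictionary keyed by term pairs (here called corr, a hypothetical name), sim(q, kv) is the sum of query weights times correlation factors, each added term gets weight sim(q, kv) divided by the sum of the query weights, and the original query terms keep their weights.

```python
def expand_query(query_weights, corr, r):
    """Expand a query with the r terms most similar to the whole query.

    query_weights: {term: w_{u,q}} for the original query terms.
    corr: {(u, v): c_{u,v}} correlation factors; treated as symmetric.
    Returns (expanded query weights, sim(q, k_v) for every known term).
    """
    vocab = {term for pair in corr for term in pair}

    def c(u, v):
        return corr.get((u, v), corr.get((v, u), 0.0))

    # Step 2: similarity between each term and the whole query.
    sim = {v: sum(w * c(u, v) for u, w in query_weights.items()) for v in vocab}

    # Step 3: add the top-r terms not already in the query (ties broken by name).
    total = sum(query_weights.values())
    candidates = sorted((v for v in vocab if v not in query_weights),
                        key=lambda v: (-sim[v], v))
    expanded = dict(query_weights)  # assumption: original terms keep their weights
    for v in candidates[:r]:
        expanded[v] = sim[v] / total
    return expanded, sim


# Toy correlation factors (hypothetical values, not those of the sample slides).
corr = {("A", "A"): 1.0, ("E", "E"): 1.0, ("A", "E"): 0.2,
        ("A", "C"): 0.5, ("E", "C"): 0.25, ("A", "B"): 0.1}
expanded, sim = expand_query({"A": 1.0, "E": 2.0}, corr, r=1)
print(expanded)
```

With these toy values the term C is more strongly correlated with the query as a whole than B, so C is the single term added when r=1.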

Query Expansion - Sample • Doc1 = D, D, A, B, C, A, B, C • Doc2 = E, C, E, A, A, D • Doc3 = D, C, B, B, D, A, B, C, A • Doc4 = A • c(A,A) = 10.991 • c(A,C) = 10.781 • c(A,D) = 10.781 • ... • c(D,E) = 10.398 • c(B,E) = 10.396 • c(E,E) = 10.224

Query Expansion - Sample • Query: q = A E E • sim(q,A) = 24.298 • sim(q,C) = 23.833 • sim(q,D) = 23.833 • sim(q,B) = 23.830 • sim(q,E) = 23.435 • New query: q’ = A C D E E • w(A,q')= 6.88 • w(C,q')= 6.75 • w(D,q')= 6.75 • w(E,q')= 6.64

Query Expansion • Methods of local analysis extract information from local set of documents retrieved to expand the query • An alternative is to expand the query using information from the whole set of documents

Local Cluster • stem • $V(s)$: a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished} • A canonical form $s$ of $V(s)$ is called a stem, e.g., polish • local document set $D_l$ • the set of documents retrieved for a given query • local vocabulary $V_l$ ($S_l$) • the set of all distinct words (stems) in the local document set

Local Cluster • basic concept • Expanding the query with terms correlated to the query terms • The correlated terms are those present in the local clusters built from the local document set • local clusters • association clusters: co-occurrences of pairs of terms in documents • metric clusters: distance factor between two terms • scalar clusters: terms with similar neighborhoods have some synonymity relationship

Association Clusters • idea • Based on the co-occurrence of stems (or terms) inside documents • association matrix • $f_{s_i,j}$: the frequency of a stem $s_i$ in a document $d_j$ ($d_j \in D_l$) • $\vec{m} = (f_{s_i,j})$: an association matrix with $|S_l|$ rows and $|D_l|$ columns • $\vec{s} = \vec{m}\,\vec{m}^{t}$: a local stem-stem association matrix

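$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$: a correlation between the stems $s_u$ and $s_v$ (an element of $\vec{m}\,\vec{m}^{t}$) • $s_{u,v} = c_{u,v}$: unnormalized matrix • $s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$: normalized matrix • local association cluster around the stem $s_u$ • Take the u-th row • Return the set of $n$ largest values $s_{u,v}$ ($u \neq v$)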
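A minimal sketch of association clusters follows, assuming each document of the local set is given as a list of stems. The correlation is the sum over documents of the product of the two stems' frequencies, and the normalized form divides by c(u,u) + c(v,v) - c(u,v). The four toy documents reuse the term sequences of the Query Expansion sample purely as input; the numbers produced here are association correlations, not the thesaurus values shown on that slide.

```python
from collections import Counter


def association_matrix(docs):
    """c[u][v] = sum over documents of f(u, doc) * f(v, doc)."""
    freqs = [Counter(doc) for doc in docs]
    stems = sorted({s for f in freqs for s in f})
    return {u: {v: sum(f[u] * f[v] for f in freqs) for v in stems} for u in stems}


def normalize(c):
    """s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])."""
    return {u: {v: c[u][v] / (c[u][u] + c[v][v] - c[u][v]) for v in c[u]}
            for u in c}


def association_cluster(s, u, n):
    """Take the u-th row and return the n stems with the largest s[u][v], u != v."""
    row = s[u]
    return sorted((v for v in row if v != u), key=lambda v: (-row[v], v))[:n]


docs = [list("DDABCABC"), list("ECEAAD"), list("DCBBDABCA"), list("A")]
c = association_matrix(docs)
s = normalize(c)
print(c["A"])
print(association_cluster(s, "A", 2))
```

The unnormalized matrix favors frequent stems, while the normalized one favors stems that co-occur relative to how often each appears at all; in this toy set B, C and D tie around A before normalization, and normalization pushes B down because B is as frequent as A overall.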

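Metric Clusters • idea • Consider the distance between two terms in the computation of their correlation factor • local stem-stem metric correlation matrix • $r(k_i, k_j)$: the number of words between keywords $k_i$ and $k_j$ in the same document ($r(k_i, k_j) = \infty$ if they are in different documents) • $c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$: metric correlation between stems $s_u$ and $s_v$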

$s_{u,v} = c_{u,v}$: unnormalized matrix • $s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}$: normalized matrix • local metric cluster around the stem $s_u$ • Take the u-th row • Return the set of $n$ largest values $s_{u,v}$ ($u \neq v$)
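The metric correlation can be sketched as below. Several details are assumptions made for the example: distance is measured as the difference in word positions (so adjacent words have distance 1, which avoids dividing by zero), every pair of occurrences within a document contributes 1/distance, pairs of words sharing a stem are skipped, and stemming is done by a toy lookup table (STEMS, hypothetical) rather than a real stemmer.

```python
from collections import defaultdict


def metric_correlations(docs, stem_of):
    """Return (c, variants): c[(su, sv)] sums 1/distance over word pairs in a
    document; variants[s] is the set V(s) of words seen with stem s."""
    c = defaultdict(float)
    variants = defaultdict(set)
    for doc in docs:
        stems = [stem_of(word) for word in doc]
        for word, stem in zip(doc, stems):
            variants[stem].add(word)
        for p in range(len(doc)):
            for q in range(p + 1, len(doc)):
                if stems[p] == stems[q]:
                    continue  # clusters only use pairs with u != v
                c[(stems[p], stems[q])] += 1.0 / (q - p)
                c[(stems[q], stems[p])] += 1.0 / (q - p)
    return c, variants


def metric_cluster(c, variants, u, n, normalized=True):
    """Return the n stems with the largest (optionally normalized) s[u][v]."""
    scores = {}
    for (a, b), value in c.items():
        if a == u:
            size = len(variants[a]) * len(variants[b])
            scores[b] = value / size if normalized else value
    return sorted(scores, key=lambda v: (-scores[v], v))[:n]


# Toy stemmer and local document set (hypothetical).
STEMS = {"polish": "polish", "polished": "polish", "shoe": "shoe", "floor": "floor"}
docs = [["polish", "shoe", "polished", "floor"]]
c, variants = metric_correlations(docs, lambda w: STEMS[w])
print(metric_cluster(c, variants, "polish", 1))
```

Because closer words contribute more, "shoe" (adjacent to both variants of "polish") ends up more strongly correlated with "polish" than "floor" does.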

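Scalar Clusters • The row corresponding to a specific term in a term co-occurrence matrix forms its neighborhood • idea • Two stems with similar neighborhoods have some synonymity relationship • The relationship is indirect, induced by the neighborhood • scalar association matrix: $s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$, where $\vec{s}_u$ and $\vec{s}_v$ are the rows of correlation values for the stems $s_u$ and $s_v$ • local scalar cluster around the stem $s_u$ • Take the u-th row • Return the set of $n$ largest values $s_{u,v}$ ($u \neq v$)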
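As a sketch of the scalar idea, the code below compares the neighborhoods (rows) of two stems with the cosine of the angle between them. It assumes a stem-stem correlation matrix is already available as a dictionary of rows; the small matrix used here is made up for illustration.

```python
import math


def scalar_matrix(assoc):
    """s[u][v] = cosine between row u and row v of a stem-stem correlation matrix."""
    stems = sorted(assoc)

    def cosine(u, v):
        dot = sum(assoc[u][w] * assoc[v][w] for w in stems)
        norm_u = math.sqrt(sum(assoc[u][w] ** 2 for w in stems))
        norm_v = math.sqrt(sum(assoc[v][w] ** 2 for w in stems))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    return {u: {v: cosine(u, v) for v in stems} for u in stems}


def scalar_cluster(s, u, n):
    """Take the u-th row and return the n stems with the largest s[u][v], u != v."""
    row = s[u]
    return sorted((v for v in row if v != u), key=lambda v: (-row[v], v))[:n]


# Hypothetical correlation matrix: a and b never co-occur directly,
# but both co-occur with c.
assoc = {
    "a": {"a": 1, "b": 0, "c": 1},
    "b": {"a": 0, "b": 1, "c": 1},
    "c": {"a": 1, "b": 1, "c": 2},
}
s = scalar_matrix(assoc)
print(s["a"]["b"])
print(scalar_cluster(s, "a", 1))
```

Here a and b have a direct correlation of zero, yet their scalar correlation is positive because they share the neighbor c; this is the indirect, neighborhood-induced relationship the slide describes.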

Interactive Search Formulation [Figure: scatter of terms showing the stem $s_v$, its cluster $S_v(n)$, and a neighbor $s_u$] • neighbors of the query term $s_v$ • Terms $s_u$ belonging to clusters associated with $s_v$, i.e., $s_u \in S_v(n)$ • $s_u$ is called a searchonym of $s_v$

Similarity Thesaurus • The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. • These relationships are not derived directly from the co-occurrence of terms inside documents. • They are obtained by considering that the terms are concepts in a concept space. • In this concept space, each term is indexed by the documents in which it appears. • Terms assume the original role of documents, while documents are interpreted as indexing elements

Similarity Thesaurus • $t$: number of terms in the collection • $N$: number of documents in the collection • $f_{i,j}$: frequency of occurrence of the term $k_i$ in the document $d_j$ • $t_j$: number of distinct index terms in the document $d_j$ (its vocabulary size) • $itf_j$: inverse term frequency for document $d_j$, $itf_j = \log \frac{t}{t_j}$ • To $k_i$ is associated a vector $\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$

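Similarity Thesaurus • where $w_{i,j}$ is a weight associated with the index-document pair $[k_i, d_j]$. These weights are computed as follows: $w_{i,j} = \frac{\left(0.5 + 0.5\, \frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5\, \frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^{\,2}}}$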

Similarity Thesaurus • The relationship between two terms $k_u$ and $k_v$ is computed as a correlation factor $c_{u,v}$ given by $c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$ • The global similarity thesaurus is built through the computation of the correlation factor $c_{u,v}$ for each pair of indexing terms $[k_u, k_v]$ in the collection
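To make the construction concrete, here is a hedged sketch of building a similarity thesaurus from raw documents. It assumes the inverse term frequency itf_j = log(t / t_j), a term weight of (0.5 + 0.5 f / max f) times itf_j normalized to unit length per term, and a correlation factor equal to the dot product of two term vectors. It additionally assumes that a term has weight zero in documents where it does not occur. The function name is hypothetical, and the output is not expected to match the figures on the sample slides, whose exact weighting is not shown.

```python
import math
from collections import Counter


def build_similarity_thesaurus(docs):
    """Return (vectors, corr): one document-indexed vector per term, and the
    correlation factor c[(u, v)] for every pair of terms."""
    freqs = [Counter(doc) for doc in docs]
    terms = sorted({k for f in freqs for k in f})
    t = len(terms)
    # itf_j = log(t / t_j), where t_j is the number of distinct terms in d_j.
    itf = [math.log(t / len(f)) for f in freqs]

    vectors = {}
    for k in terms:
        max_f = max(f[k] for f in freqs)
        # Assumption: weight is zero where the term does not occur.
        raw = [(0.5 + 0.5 * f[k] / max_f) * itf[j] if f[k] > 0 else 0.0
               for j, f in enumerate(freqs)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[k] = [x / norm if norm else 0.0 for x in raw]

    corr = {(u, v): sum(a * b for a, b in zip(vectors[u], vectors[v]))
            for u in terms for v in terms}
    return vectors, corr


docs = [list("DDABCABC"), list("ECEAAD"), list("DCBBDABCA"), list("A")]
vectors, corr = build_similarity_thesaurus(docs)
print(corr[("A", "C")], corr[("B", "E")])
```

Short documents have a large itf and therefore count more as indexing elements, mirroring how rare terms count more under inverse document frequency. Computing every pair is quadratic in the vocabulary size, which is why the construction is done once, ahead of query time.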

Represent the query in the concept space used for representation of the index terms • Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q [Figure: query terms and candidate expansion terms in the concept space]

Expand the query with the top r ranked terms according to sim(q,kv)

Similarity Thesaurus • This computation is expensive • Global similarity thesaurus has to be computed only once and can be updated incrementally
