Motivation. Methods of local analysis extract information from the local set of retrieved documents to expand the query. An alternative is to expand the query using information from the whole set of documents in the collection.
There are two modern variants, both based on a thesaurus-like structure built from all documents in the collection
The following definitions establish the proper framework
t: number of terms in the collection
N: number of documents in the collection
fi,j: frequency of occurrence of the term ki in the document dj
tj: vocabulary of document dj
itfj: inverse term frequency for document dj
Inverse term frequency for document dj
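A common way to write the inverse term frequency, assuming the standard similarity-thesaurus definition with t and tj as defined above (the choice of logarithm base is left open here):

```latex
\[
  itf_j = \log \frac{t}{t_j}
\]
```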
To each term ki we associate a vector ki = (wi,1, wi,2, ..., wi,N)
where wi,j is a weight associated with the index-document pair [ki,dj]. These weights are computed as follows
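A sketch of the weighting, assuming the usual similarity-thesaurus form in which the frequency of ki in dj is normalized by its maximum over all documents, scaled by itfj, and then length-normalized over the N documents:

```latex
\[
  w_{i,j} =
  \frac{\left(0.5 + 0.5\,\dfrac{f_{i,j}}{\max_{j} f_{i,j}}\right) itf_j}
       {\sqrt{\displaystyle\sum_{l=1}^{N}
          \left(0.5 + 0.5\,\dfrac{f_{i,l}}{\max_{l} f_{i,l}}\right)^{2} itf_l^{2}}}
\]
```

Under this form, a document with a small vocabulary (large itfj) contributes more to the vector of a term than a document that mentions almost every term.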
The relationship between two terms ku and kv is computed as a correlation factor cu,v given by
The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of indexing terms [ku,kv] in the collection
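Assuming the term vectors and weights wi,j introduced above, the correlation factor is typically taken to be the inner product of the two term vectors:

```latex
\[
  c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{j=1}^{N} w_{u,j}\, w_{v,j}
\]
```

Because the sum runs over all N documents for every pair of terms, building the thesaurus is expensive, but it only has to be done once per collection.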
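Query expansion is done in three steps as follows:
First, represent the query in the term-concept space: to the query q is associated a vector q built from the vectors of its terms, where wi,q is a weight associated with the index-query pair [ki,q]
Second, compute a similarity sim(q,kv) between each term kv and the user query q, using the correlation factors cu,v between kv and the query terms ku
Third, expand the query with the top r terms ranked by sim(q,kv)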
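The three steps can be written out as follows. This is a sketch under the usual similarity-thesaurus formulation; the symbols r (number of expansion terms) and q' (the expanded query) are taken from that formulation rather than from the text above.

```latex
\begin{align*}
  \vec{q} &= \sum_{k_i \in q} w_{i,q}\, \vec{k}_i \\
  sim(q, k_v) &= \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q}\, c_{u,v} \\
  w_{v,q'} &= \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}
\end{align*}
```

The last line is consistent with the worked example: 24.298/6.88, 23.833/6.75 and 23.435/6.64 all come to about 3.53, as expected if every similarity is divided by the same sum of query weights.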
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
c(A,A) = 10.991
c(A,C) = 10.781
c(A,D) = 10.781
...
c(D,E) = 10.398
c(B,E) = 10.396
c(E,E) = 10.224
Query: q = A E E
sim(q,A) = 24.298
sim(q,C) = 23.833
sim(q,D) = 23.833
sim(q,B) = 23.830
sim(q,E) = 23.435
New query: q' = A C D E E
w(A,q')= 6.88
w(C,q')= 6.75
w(D,q')= 6.75
w(E,q')= 6.64
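The procedure can be sketched in Python on the four example documents. This is a minimal illustration, not the computation behind the figures listed above: it assumes natural logarithms, a zero weight for terms absent from a document, raw query term frequencies as wi,q, and r = 4. The function names are hypothetical. With length-normalized term vectors every c(u,u) equals 1, so the values it produces are on a different scale from the ones shown.

```python
import math
from collections import Counter

docs = {
    "Doc1": "D D A B C A B C".split(),
    "Doc2": "E C E A A D".split(),
    "Doc3": "D C B B D A B C A".split(),
    "Doc4": "A".split(),
}

def build_term_vectors(docs):
    """Return one length-normalized vector per term, indexed by document."""
    names = list(docs)
    counts = {n: Counter(docs[n]) for n in names}
    vocab = sorted({term for c in counts.values() for term in c})
    t = len(vocab)
    # itf_j = log(t / t_j), where t_j is the vocabulary size of document j
    itf = {n: math.log(t / len(counts[n])) for n in names}
    vectors = {}
    for term in vocab:
        max_f = max(counts[n][term] for n in names)
        raw = [
            (0.5 + 0.5 * counts[n][term] / max_f) * itf[n]
            if counts[n][term] > 0 else 0.0  # assumption: absent term weighs 0
            for n in names
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]
    return vectors

def correlation(vectors):
    """c[u][v] is the inner product of the vectors of terms u and v."""
    return {
        u: {v: sum(a * b for a, b in zip(vectors[u], vectors[v])) for v in vectors}
        for u in vectors
    }

def expand(query_terms, c, r):
    """Rank all terms by sim(q, k_v) and return weights for the top r."""
    wq = Counter(query_terms)  # assumption: w_{i,q} is the raw query frequency
    total = sum(wq.values())
    sim = {v: sum(w * c[u][v] for u, w in wq.items()) for v in c}
    top = sorted(sim, key=sim.get, reverse=True)[:r]
    return {v: sim[v] / total for v in top}

vectors = build_term_vectors(docs)
c = correlation(vectors)
expanded = expand("A E E".split(), c, r=4)
for term, weight in expanded.items():
    print(f"w({term},q') = {weight:.3f}")
```

Dividing by the sum of the query weights keeps each expansion weight a weighted average of correlation factors, so terms that correlate with the query as a whole rank above terms that correlate with only one query term.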
Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows
Obtain from the user three parameters
TC: Threshold class
NDC: Number of documents in class
MIDF: Minimum inverse document frequency
The average term weight wtc of a thesaurus class can be used to compute a thesaurus class weight wc as
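One formulation found in the statistical-thesaurus literature is given below. It is stated here as an assumption: |C| denotes the number of terms in class C and w_{i,C} a precomputed weight for the term-class pair [ki,C], neither of which is defined in the text above.

```latex
\[
  wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|},
  \qquad
  w_C = \frac{wt_C}{|C|} \times 0.5
\]
```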
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
Query: q = A E E
sim(1,3) = 0.99
sim(1,2) = 0.40
sim(2,3) = 0.29
sim(4,1) = 0.00
sim(4,2) = 0.00
sim(4,3) = 0.00
idf A = 0.0
idf B = 0.3
idf C = 0.12
idf D = 0.12
idf E = 0.60
New query: q' = A B E E
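The idf figures and the choice of B as the expansion term can be illustrated with a short Python sketch. Several things here are assumptions, since the example does not state them: idf is taken as log10(N/ni), the class is drawn from the cluster {Doc1, Doc3} (the pair with similarity 0.99), candidate terms are those occurring in every document of the cluster, and MIDF is set to 0.2. The function names are hypothetical, and the document similarities themselves are not recomputed.

```python
import math

docs = {
    "Doc1": "D D A B C A B C".split(),
    "Doc2": "E C E A A D".split(),
    "Doc3": "D C B B D A B C A".split(),
    "Doc4": "A".split(),
}

def idf_table(docs):
    """idf_i = log10(N / n_i), with n_i the number of documents containing term i."""
    n_docs = len(docs)
    vocab = sorted({term for terms in docs.values() for term in terms})
    return {
        term: math.log10(n_docs / sum(term in terms for terms in docs.values()))
        for term in vocab
    }

def class_terms(docs, cluster, idf, midf):
    """Terms shared by every document in the cluster whose idf is at least midf."""
    shared = set.intersection(*(set(docs[name]) for name in cluster))
    return sorted(term for term in shared if idf[term] >= midf)

idf = idf_table(docs)
# Assumed cluster and threshold; neither value is given in the example.
thesaurus_class = class_terms(docs, ["Doc1", "Doc3"], idf, midf=0.2)

for term, value in idf.items():
    print(f"idf {term} = {value:.2f}")
print("class terms:", thesaurus_class)
```

Under these assumptions A, C and D are rejected because they occur in too many documents (idf of 0.12 or less), which leaves B as the only term in the class.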
Problems with this approach