CS 430 / INFO 430 Information Retrieval
Lecture 11
Latent Semantic Indexing
Extending the Boolean Model
Course Administration
Assignment 1
If you have questions about your grading, send me email.
The following are reasonable requests: the wrong files were graded, points were added up wrongly, comments are unclear, etc.
We are not prepared to argue over details of judgment.
If you ask for a regrade, the final grade may be lower than the original!
Assignment 2
The assignment has been posted.
The test data is being checked. Look for changes before Saturday evening.
Midterm Examination
Wednesday, October 14, 7:30 to 9:00 p.m., Upson B17. Open book.
Laptop computers may be used for lecture slides, notes, readings, etc., but no network connections during the examination.
A sample examination and discussion of the solution will be posted to the Web site.
Latent Semantic Indexing
Objective
Replace indexes that use sets of index terms by indexes that use concepts.
Approach
Map the term vector space into a lower dimensional space, using singular value decomposition.
Each dimension in the new space corresponds to a latent concept in the original data.
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Query: "IDF in computer-based information lookup"
Index terms for a document: access, document, retrieval, indexing
How can we recognize that information lookup is related to retrieval and indexing?
Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human-system engineering testing of EPS
c5 Relation of user-perceived response-time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Terms        Documents
             c1  c2  c3  c4  c5  m1  m2  m3  m4
human         1   0   0   1   0   0   0   0   0
interface     1   0   1   0   0   0   0   0   0
computer      1   1   0   0   0   0   0   0   0
user          0   1   1   0   1   0   0   0   0
system        0   1   1   2   0   0   0   0   0
response      0   1   0   0   1   0   0   0   0
time          0   1   0   0   1   0   0   0   0
EPS           0   0   1   1   0   0   0   0   0
survey        0   1   0   0   0   0   0   0   1
trees         0   0   0   0   0   1   1   1   0
graph         0   0   0   0   0   0   1   1   1
minors        0   0   0   0   0   0   0   1   1
Query:
Find documents relevant to "human computer interaction"
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
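The term-matching result above can be reproduced with a short sketch. The function name `term_match` is an illustrative choice, not part of the lecture; the matrix is the term-document table from the example.

```python
# Term-document counts from the example (rows: terms, columns: c1..c5, m1..m4).
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
]


def term_match(query_words):
    """Return the documents that contain at least one query word that is an index term."""
    rows = [terms.index(w) for w in query_words if w in terms]
    return [d for j, d in enumerate(docs) if any(X[i][j] > 0 for i in rows)]


# "interaction" is not an index term, so only "human" and "computer" can match.
matched = term_match(["human", "computer", "interaction"])
print(matched)  # ['c1', 'c2', 'c4']
```

Documents c3 and c5 share no index term with the query, which is why simple matching cannot find them.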
[Figure: documents d1 and d2 drawn as vectors in a term vector space with axes t1, t2 and t3]
The space has as many dimensions as there are terms in the word list.
Proximity models: Put similar items together in some space or structure
• Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms. Most methods then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.]
• Factor analysis based on matrix of similarities between documents (single mode).
• Two-mode proximity methods. Start with rectangular matrix and construct explicit representations of both row and column objects.
Additional criterion:
Computationally efficient: $O(N^2 k^3)$
N is number of terms plus documents
k is number of dimensions
Singular Value Decomposition
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
There exist matrices $T_0$, $S_0$ and $D_0$ such that:
$X = T_0 S_0 D_0'$
$T_0$ and $D_0$ are the matrices of left and right singular vectors
$T_0$ and $D_0$ have orthonormal columns
$S_0$ is the diagonal matrix of singular values
The diagonal elements of $S_0$ are positive and decreasing in magnitude. Keep the first k and set the others to zero.
Delete the zero rows and columns of $S_0$ and the corresponding columns of $T_0$ and $D_0$. This gives:
$X \approx \hat{X} = T S D'$
Interpretation
If the value of k is selected well, the expectation is that $\hat{X}$ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.
$\hat{X}$ (t x d) = T (t x k) S (k x k) D' (k x d)
k is the number of singular values chosen to represent the concepts in the set of documents.
Usually, $k \ll m$, where m is the rank of X.
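A minimal sketch of the truncation step, assuming NumPy is available. The choice k = 2 is illustrative only; the lecture does not fix a value. Note that `np.linalg.svd` returns the singular values as a vector and returns D0' rather than D0.

```python
import numpy as np

# Term-document matrix from the example (rows: terms, columns: c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Full decomposition X = T0 S0 D0'; s0 holds the singular values in decreasing order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2  # number of concepts kept (an assumption for this sketch)
T = T0[:, :k]         # t x k
S = np.diag(s0[:k])   # k x k
Dt = D0t[:k, :]       # k x d, this is D'

X_hat = T @ S @ Dt    # rank-k approximation of X, same shape as X
```

Every entry of `X_hat` is generally nonzero, so terms and documents that never co-occur in X can still receive a nonzero association.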
The dot product of two rows of $\hat{X}$ reflects the extent to which two terms have a similar pattern of occurrences.
$\hat{X}\hat{X}' = TSD'(TSD')'$
$= TSD'DS'T'$
$= TSS'T'$, since D has orthonormal columns ($D'D = I$)
$= TS(TS)'$
To calculate the i, j cell, take the dot product between rows i and j of TS.
Since S is diagonal, TS differs from T only by stretching the coordinate system.
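The identity above can be checked numerically. This is a sketch under the assumption k = 2, using the example term-document matrix; the variable names are illustrative.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2  # illustrative choice
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

TS = T @ S              # one row of k coordinates per term
term_sims = TS @ TS.T   # t x t matrix, equal to X_hat @ X_hat.T

i, j = terms.index("human"), terms.index("user")
raw = X[i] @ X[j]          # 0.0: "human" and "user" never share a document
latent = term_sims[i, j]   # no longer zero after the rank-2 approximation
```

Working with the t x k matrix `TS` avoids forming the full t x d matrix when only term-to-term comparisons are needed.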
The dot product of two columns of $\hat{X}$ reflects the extent to which two documents have a similar pattern of terms.
$\hat{X}'\hat{X} = (TSD')'TSD'$
$= DS'T'TSD'$
$= DS'SD'$, since T has orthonormal columns ($T'T = I$)
$= DS(DS)'$
To calculate the i, j cell, take the dot product between rows i and j of DS.
Since S is diagonal, DS differs from D only by stretching the coordinate system.
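The document-to-document case can be sketched the same way, again assuming k = 2 and the example matrix. Each document becomes one row of DS.

```python
import numpy as np

docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2  # illustrative choice
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

DS = D @ S             # one row of k coordinates per document
doc_sims = DS @ DS.T   # d x d matrix, equal to X_hat.T @ X_hat

a, b = docs.index("c1"), docs.index("c5")
raw = X[:, a] @ X[:, b]    # 0.0: c1 and c5 have no index term in common
latent = doc_sims[a, b]    # no longer zero in the concept space
```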
The comparison between a term and a document is the value of an individual cell of $\hat{X}$.
$\hat{X} = TSD'$
$= TS^{1/2}(DS^{1/2})'$
where $S^{1/2}$ is a diagonal matrix whose values are the square roots of the corresponding elements of S.
Query:
"human-system interactions on trees"
Terms        Query xq
human        1
interface    0
computer     0
user         0
system       1
response     0
time         0
EPS          0
survey       0
trees        1
graph        0
minors       0
In term-document space, a query is represented by $x_q$, a t x 1 vector.
In concept space, a query is represented by $d_q$, a 1 x k vector.
A query can be expressed as a vector $x_q$ in the term-document vector space.
$x_{qi} = 1$ if term i is in the query and 0 otherwise.
Let $p_{qj}$ be the inner product of the query $x_q$ with document $d_j$ in the term-document vector space.
$p_{qj}$ is the jth element of the product $x_q'\hat{X}$.
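$[p_{q1} \ldots p_{qj} \ldots p_{qd}] = [x_{q1}\ x_{q2} \ldots x_{qt}]\,\hat{X}$
The row vector on the right is the query, document $d_j$ is column j of $\hat{X}$, and $p_{qj}$ is the inner product of query q with document $d_j$.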
$p_q' = x_q'\hat{X}$
$= x_q'TSD'$
$= x_q'T(DS)'$
The cosine of the angle is the inner product divided by the lengths of the vectors:
$\mathrm{similarity}(q, d_j) = \dfrac{p_{qj}}{|x_q|\,|d_j|}$
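A sketch of the query comparison for the example query "human-system interactions on trees", assuming k = 2. Here each document d_j is taken to be column j of the rank-k matrix, as on the slide; the ranking step at the end is an illustrative addition.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2  # illustrative choice
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Query vector xq: 1 for each index term that appears in the query.
xq = np.zeros(len(terms))
for word in ["human", "system", "trees"]:
    xq[terms.index(word)] = 1.0

pq = xq @ X_hat                              # inner product with every document
doc_lengths = np.linalg.norm(X_hat, axis=0)  # length of each column d_j
cosines = pq / (np.linalg.norm(xq) * doc_lengths)

# Documents ordered from most to least similar to the query.
ranking = [docs[j] for j in np.argsort(-cosines)]
```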
Revised October 6, 2004
In the reading, the authors treat the query as a pseudo-document $d_q$ in the concept space:
$d_q = x_q'TS^{-1}$
To compare a query against document j, they extend the method used to compare document i with document j.
Take the jth element of the product of:
$d_qS$ and $(DS)'$
This is the jth element of the product:
$x_q'T(DS)'$, which is the same expression as before.
Note that $d_q$ is a row vector.
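The equivalence of the two formulations can be checked numerically. This sketch assumes k = 2 and the example matrix and query; it is a verification aid, not the authors' implementation.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2  # illustrative choice
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

xq = np.zeros(len(terms))
for word in ["human", "system", "trees"]:
    xq[terms.index(word)] = 1.0

# Pseudo-document: the query folded into the k-dimensional concept space.
dq = xq @ T @ np.linalg.inv(S)   # length-k row vector

via_pseudo_doc = dq @ S @ (D @ S).T   # d_q S compared with the rows of DS
direct = xq @ T @ (D @ S).T           # x_q' T (DS)'
```

Both expressions give the same vector of inner products because the factors S^{-1} and S cancel.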
Deerwester, et al. tried latent semantic indexing on two test collections, MED and CISI, for which queries and relevance judgments were available.
Documents were full text of title and abstract.
Stop list of 439 words (SMART); no stemming, etc.
Comparison with:
(a) simple term matching, (b) SMART, (c) Voorhees method.
Extending the Boolean Model
Counterintuitive results:
Query q = A and B and C and D and E
Document d has terms A, B, C and D, but not E
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = A or B or C or D or E
Document d1 has terms A, B, C, D and E
Document d2 has term A, but not B, C, D or E
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
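Both counterintuitive cases can be reproduced with a few lines. The helper names `boolean_and` and `boolean_or` are illustrative, and documents are modeled simply as sets of index terms.

```python
def boolean_and(doc_terms, query_terms):
    """Strict Boolean AND: every query term must be present."""
    return all(t in doc_terms for t in query_terms)


def boolean_or(doc_terms, query_terms):
    """Strict Boolean OR: at least one query term must be present."""
    return any(t in doc_terms for t in query_terms)


query = ["A", "B", "C", "D", "E"]

d = {"A", "B", "C", "D"}          # four of the five terms
d1 = {"A", "B", "C", "D", "E"}    # all five terms
d2 = {"A"}                        # only one term

and_match = boolean_and(d, query)   # False: d is rejected outright
or_d1 = boolean_or(d1, query)       # True
or_d2 = boolean_or(d2, query)       # True: indistinguishable from d1
```

The results are plain truth values, so there is nothing to rank by: d is discarded despite matching four terms, and d1 and d2 are tied.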
Boolean is all or nothing
• The Boolean model has no way to rank documents.
• The Boolean model allows for no uncertainty in assigning index terms to documents.
• The Boolean model has no provision for adjusting the importance of query terms.
Term weighting
• Give weights to terms in documents and/or queries.
• Combine standard Boolean retrieval with vector ranking of results
Fuzzy sets
• Relax the boundaries of the sets used in Boolean retrieval
SIRE (Syracuse Information Retrieval Experiment)
Term weights
• Add term weights to documents
Weights calculated by the standard method of term frequency * inverse document frequency.
Ranking
• Calculate results set by standard Boolean methods
• Rank results by vector distances
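A minimal sketch of the two-stage idea: select a results set with Boolean AND, then rank it by vector similarity using tf * idf weights. The toy collection, the idf form log(N/df), and the use of cosine similarity are all assumptions for illustration; the slides do not specify SIRE's exact formulas.

```python
import math
from collections import Counter

# Hypothetical toy collection: document name -> list of index terms.
docs = {
    "d1": ["latent", "semantic", "indexing", "semantic"],
    "d2": ["semantic", "web", "indexing"],
    "d3": ["boolean", "retrieval"],
    "d4": ["semantic", "indexing", "indexing", "retrieval"],
}

n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for words in docs.values() for t in set(words))


def tfidf(words):
    """Weight each term by term frequency * log(N / df). Terms must occur in the collection."""
    tf = Counter(words)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


query = ["semantic", "indexing"]

# Stage 1: standard Boolean AND gives the results set.
result_set = [name for name, words in docs.items() if all(t in words for t in query)]

# Stage 2: rank the results set by similarity to the query vector.
query_vec = tfidf(query)
ranked = sorted(result_set, key=lambda name: cosine(query_vec, tfidf(docs[name])), reverse=True)
print(ranked)  # ['d4', 'd1', 'd2']
```

Document d4 ranks first because its only non-query term has a lower weight than the rare terms in d1 and d2, so its vector points closest to the query.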
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
• Results set is created by standard Boolean retrieval
• User selects one document from results set
• Other documents in collection are ranked by vector distance from this document
• A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
• Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
• For a given query, calculate the similarity between the query and each document in the collection.
• This calculation is needed for every document that has a nonzero weight for any of the terms in the query.
Fuzzy set theory
dA is the degree of membership of an element in set A
intersection (and)
dA∩B = min(dA, dB)
union (or)
dA∪B = max(dA, dB)
Fuzzy set theory example
              standard set theory    fuzzy set theory
dA            1   1   0   0          0.5  0.5  0    0
dB            1   0   1   0          0.7  0    0.7  0
and  dA∩B     1   0   0   0          0.5  0    0    0
or   dA∪B     1   1   1   0          0.7  0.5  0.7  0
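The table can be reproduced directly from the min and max definitions. The function names are illustrative.

```python
def fuzzy_and(d_a, d_b):
    """Degree of membership in the intersection of A and B."""
    return min(d_a, d_b)


def fuzzy_or(d_a, d_b):
    """Degree of membership in the union of A and B."""
    return max(d_a, d_b)


# (dA, dB) pairs, column by column, from the example table.
standard = [(1, 1), (1, 0), (0, 1), (0, 0)]
fuzzy = [(0.5, 0.7), (0.5, 0), (0, 0.7), (0, 0)]

standard_and = [fuzzy_and(a, b) for a, b in standard]  # [1, 0, 0, 0]
standard_or = [fuzzy_or(a, b) for a, b in standard]    # [1, 1, 1, 0]
fuzzy_and_row = [fuzzy_and(a, b) for a, b in fuzzy]    # [0.5, 0, 0, 0]
fuzzy_or_row = [fuzzy_or(a, b) for a, b in fuzzy]      # [0.7, 0.5, 0.7, 0]
```

With weights restricted to 0 and 1, min and max reduce to ordinary Boolean AND and OR, which is why the standard columns come out unchanged.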
MMM (Mixed Min and Max) model
Terms: A1, A2, . . . , An
Document D, with index-term weights: dA1, dA2, . . . , dAn
Qor = (A1 or A2 or . . . or An)
Query-document similarity:
S(Qor, D) = Cor1 * max(dA1, dA2, . . . , dAn) + Cor2 * min(dA1, dA2, . . . , dAn)
where Cor1 + Cor2 = 1
Terms: A1, A2, . . . , An
Document D, with index-term weights: dA1, dA2, . . . , dAn
Qand = (A1 and A2 and . . . and An)
Query-document similarity:
S(Qand, D) = Cand1 * min(dA1, . . . , dAn) + Cand2 * max(dA1, . . . , dAn)
where Cand1 + Cand2 = 1
Experimental values:
Cand1 in range [0.5, 0.8]
Cor1 > 0.2
Computational cost is low. Retrieval performance much improved.
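A short sketch of the two similarity formulas, applied to the five-term queries used earlier to show the weakness of strict Boolean retrieval. The default coefficients (0.7 and 0.6) are arbitrary picks from within the experimental ranges quoted above, and the function names are illustrative.

```python
def mmm_and(weights, c_and1=0.7):
    """S(Qand, D) = Cand1 * min + Cand2 * max, with Cand2 = 1 - Cand1."""
    return c_and1 * min(weights) + (1.0 - c_and1) * max(weights)


def mmm_or(weights, c_or1=0.6):
    """S(Qor, D) = Cor1 * max + Cor2 * min, with Cor2 = 1 - Cor1."""
    return c_or1 * max(weights) + (1.0 - c_or1) * min(weights)


# Weights for terms A..E; 1 means the term is present, 0 means absent.
d = [1, 1, 1, 1, 0]    # has A, B, C, D but not E
d1 = [1, 1, 1, 1, 1]   # has all five terms
d2 = [1, 0, 0, 0, 0]   # has only A

and_score = mmm_and(d)   # 0.7 * 0 + 0.3 * 1 = 0.3, no longer rejected outright
or_d1 = mmm_or(d1)       # 0.6 * 1 + 0.4 * 1 = 1.0
or_d2 = mmm_or(d2)       # 0.6 * 1 + 0.4 * 0 = 0.6, now ranked below d1
```

Only the largest and smallest weights enter the score, which keeps the cost close to that of plain fuzzy-set evaluation.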
Paice model
The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights. Computational cost is higher than MMM.
P-norm model
Document D, with term weights: dA1, dA2, . . . , dAn
Query terms are given weights, a1, a2, . . . , an
Operators have coefficients that indicate degree of strictness
Query-document similarity is calculated by considering each document and query as a point in n-space.
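The slide does not give the similarity formulas. The sketch below uses the commonly cited P-norm formulation, which is an assumption here rather than something taken from the lecture: for OR, the weighted p-norm distance from the all-zero point, and for AND, one minus the weighted p-norm distance from the all-one point. The strictness coefficient is written `p`, and the function names are illustrative.

```python
def pnorm_or(d, a, p):
    """OR similarity: normalized p-norm distance of the document from (0, ..., 0)."""
    num = sum((ai ** p) * (di ** p) for ai, di in zip(a, d))
    den = sum(ai ** p for ai in a)
    return (num / den) ** (1.0 / p)


def pnorm_and(d, a, p):
    """AND similarity: 1 minus the normalized p-norm distance from (1, ..., 1)."""
    num = sum((ai ** p) * ((1.0 - di) ** p) for ai, di in zip(a, d))
    den = sum(ai ** p for ai in a)
    return 1.0 - (num / den) ** (1.0 / p)


d = [1.0, 1.0, 1.0, 1.0, 0.0]   # document weights for terms A..E
a = [1.0] * 5                   # equal query-term weights

loose_and = pnorm_and(d, a, 1)    # 0.8: at p = 1 this is a weighted average
loose_or = pnorm_or(d, a, 1)      # 0.8: AND and OR coincide at p = 1
strict_and = pnorm_and(d, a, 50)  # close to min(d) = 0, approaching strict AND
strict_or = pnorm_or(d, a, 50)    # close to max(d) = 1, approaching strict OR
```

Under this formulation, small p behaves like vector-style averaging and large p approaches the fuzzy-set min and max, so p acts as the degree of strictness of the operator.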
Percentage improvement over standard Boolean model (average best precision), Lee and Fox, 1988
          CISI   CACM   INSPEC
P-norm    79     106    210
Paice     77     104    206
MMM       68     109    195
E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, Frakes, Chapter 15
Methods based on fuzzy set concepts