CSM06 Information Retrieval

Lecture 3: Text IR part 2

Dr Andrew Salway a.salway@surrey.ac.uk

Recap from Lecture 2
  • IR Systems treat documents as ‘bags of words’: common document preprocessing techniques - tokenization, stop lists and stemming
  • Data about the occurrence of words in documents is stored in postings data structures:
    • the simple inverted index stores the minimum data required for full-text indexing/retrieval;
    • the extra data stored by the STAIRS data model facilitates more IR functionality
Recap from Lecture 2

How documents are matched / ranked for a query is determined by the IR model used:

    • Boolean Model – exact matching of documents according to query terms / Boolean operators (underpinned by Set Theory).
    • Vector Space Model – documents and queries represented by vectors in the same vector space; dimensions are frequencies of keywords. Similarity of documents to a query is measured by cosine distance, which gives a ranking of documents.
  • VSM Lab Exercise: create a frequency table and see which documents are ranked highest for queries of your choosing.

(System Quirk will help in making the frequency table; Microsoft Excel will help in calculating cosine distances and ranking).
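The lab exercise above can be sketched in a few lines of Python. This is a minimal sketch, not the System Quirk/Excel workflow from the lab: the frequency table and query below are illustrative values, not data from the lecture.

```python
import math

def cosine(u, v):
    # cosine similarity between two term-frequency vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# illustrative frequency table: one row per document,
# one column per keyword
docs = {
    "d1": [3, 0, 1],
    "d2": [0, 2, 2],
    "d3": [1, 1, 0],
}
query = [1, 0, 1]

# rank documents by similarity to the query (highest first)
ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
# "d1" ranks highest: it shares both query keywords
```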

Lecture 3: OVERVIEW
  • TFIDF - Term Frequency – Inverse Document Frequency, an example of term weighting
  • Semi-automatic query modification with Relevance Feedback
  • Automatic creation of term clusters for query expansion
  • Latent Semantic Indexing
Term Weighting
  • In the simplest case an index is binary, i.e. either a keyword is present in a document or it is not
  • However, we may want to deal with ‘degrees of aboutness’ to characterise a document more accurately
    • Use a weighting to capture the strength of the relationship between a keyword and a document
    • As a starting point we can consider the frequency with which a term occurs in a document
TF-IDF (Term Frequency – Inverse Document Frequency)
  • To reflect a word’s discriminating power in the weighting, consider its inverse document frequency, which takes into account the number of different documents in which the term occurs
  • This leads to the widely used TF-IDF weighting for index terms, and also for terms in long queries

Belew 2000, Section 3.6
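One common formulation of the weighting just described is tf × log(N/df); variants differ across textbooks, so treat this as a sketch rather than the exact formula from Belew. The counts below are illustrative.

```python
import math

def tfidf(tf, df, n_docs):
    # tf: how often the term occurs in the document
    # df: number of documents in the collection containing the term
    # n_docs: total number of documents in the collection
    # rare terms (small df) get a large weight; ubiquitous terms get
    # a weight near zero, regardless of their in-document frequency
    return tf * math.log(n_docs / df)

# same term frequency (5), very different discriminating power:
rare = tfidf(5, 10, 1000)     # term occurs in only 10 of 1000 docs
common = tfidf(5, 900, 1000)  # term occurs in 900 of 1000 docs
# rare >> common, so the rare term dominates the document's weighting
```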

Modifying Query with Relevance Feedback
  • User makes initial query and system returns ranked documents
  • User identifies the top-ranked documents as relevant or irrelevant
  • The document-vectors of the top-ranked documents are used to modify the initial query vector, e.g. using the Standard-Rocchio equation
  • The effect is to emphasise appropriate index terms in the query (and de-emphasise others) – with no ‘technical’ input from user
  • May also introduce new query terms

(Baeza-Yates and Ribeiro-Neto 1999, pp. 118-120)

RelFbk: the vector view

Belew (2000), Fig. 4.4

Making a new query vector…
  • The query vector is moved towards the centroid of the documents judged relevant by the user.
  • It may also move away from the centroid of the irrelevant documents, but these are less likely to be clustered (alternatively, select one irrelevant document, such as the highest ranked, and move the query vector away from it).

Belew (2000), Fig. 4.6

Standard-Rocchio equation

qm = αq + (β/Dr)·Σ(relevant doc vectors) − (γ/Di)·Σ(irrelevant doc vectors)

q = initial query vector

qm = modified query vector

α, β, γ are constants

Dr = number of documents marked relevant by the user

Di = number of documents marked irrelevant by the user


Consider a query vector vq, two documents returned by an information retrieval system that the user considers relevant, with vectors v1 and v2, and three returned documents considered irrelevant, with vectors v3, v4 and v5. Compute a modified query using the Standard Rocchio equation with α = β = γ = 1.

vq= (2, 1, 0, 0)

v1= (0, 4, 0, 2) v2= (0, 3, 0, 1)

v3= (1, 0, 2, 0) v4= (0, 1, 4, 0)

v5= (1, 1, 0, 0)
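The exercise above can be checked with a short Python sketch of the Standard-Rocchio equation (a minimal implementation of the formula as given on this slide, using the exercise's vectors):

```python
def rocchio(q, rel, irrel, alpha=1.0, beta=1.0, gamma=1.0):
    # Standard-Rocchio: qm = alpha*q + (beta/Dr)*sum(rel) - (gamma/Di)*sum(irrel)
    n = len(q)
    sum_rel = [sum(v[i] for v in rel) for i in range(n)]
    sum_irrel = [sum(v[i] for v in irrel) for i in range(n)]
    return [alpha * q[i]
            + beta * sum_rel[i] / len(rel)
            - gamma * sum_irrel[i] / len(irrel)
            for i in range(n)]

vq = [2, 1, 0, 0]
rel = [[0, 4, 0, 2], [0, 3, 0, 1]]              # v1, v2
irrel = [[1, 0, 2, 0], [0, 1, 4, 0], [1, 1, 0, 0]]  # v3, v4, v5

qm = rocchio(vq, rel, irrel)
# qm ≈ [1.33, 3.83, -2.0, 1.5]: the query moves towards the relevant
# centroid (term 2 is boosted, term 4 is introduced) and away from the
# irrelevant centroid (term 3 becomes negative)
```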

Creating Term Clusters for Query Expansion
  • Generally a query may be expanded by adding index terms that are related to the terms in the initial query
  • Related terms may be:
    • synonyms
    • stems/grammatical variants
    • co-occurring terms
  • Relationships between terms may be calculated by analysing the results set (local analysis), or by analysing the whole document collection (global analysis)
Creating Term Clusters for Query Expansion
  • Aim is to produce clusters of related terms by automatic analysis of the local document set
  • Restricting the analysis to the local document set may improve the quality of the clusters
  • Different techniques for measuring term correlation give different kinds of clusters:
    • Association clusters
    • Metric clusters
    • Scalar clusters

For more details, see Baeza-Yates and Ribeiro-Neto 1999, pp. 123-7

Association Clusters
  • Based on how often terms/stems co-occur in documents, i.e. if term A and term B appear together in a large number of documents then they may be related (at least in the local context)
  • Query is expanded by adding the n most correlated terms for each term in the original query

[BY&RN, p.125]
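The association correlation can be sketched as the sum, over documents, of the product of the two terms' frequencies (one common form of the measure in BY&RN; the frequency table below is illustrative):

```python
# illustrative frequency table: for each stem, its frequency in
# documents d1..d4 of the local document set
freq = {
    "retrieval": [3, 0, 2, 1],
    "index":     [2, 0, 1, 1],
    "semantics": [0, 4, 0, 0],
}

def assoc(u, v):
    # association correlation: sum over documents of the product of
    # the two stems' frequencies; high when they co-occur often
    return sum(a * b for a, b in zip(freq[u], freq[v]))

# "retrieval" co-occurs with "index" but never with "semantics",
# so assoc("retrieval", "index") is high and the other is zero
```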

Metric Clusters
  • Based on how close terms/stems are in documents (i.e. where they occur rather than how often)

[BY&RN p. 126]
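A metric correlation of the kind described above can be sketched by summing the inverse word-distance over every pair of occurrences of the two terms (following the inverse-distance idea in BY&RN; the position lists below are illustrative):

```python
# illustrative word positions of each term within one document
positions = {
    "information": [0, 10, 42],
    "retrieval":   [1, 11, 80],
    "weather":     [50],
}

def metric(u, v):
    # metric correlation: for every pair of occurrences of the two
    # terms, add the inverse of their distance in words, so terms
    # that occur close together score highly
    return sum(1.0 / abs(i - j)
               for i in positions[u]
               for j in positions[v]
               if i != j)

# "information" and "retrieval" occur adjacently twice, so their
# metric correlation exceeds that of "information" and "weather"
```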

Scalar Clusters
  • Scalar clusters are formed by grouping terms/stems which correlate in similar ways with other terms
    • For each term calculate a vector which is that term’s correlation (association or metric) with all other terms
    • Calculate the cosine distance between two such vectors to get the scalar correlation between the corresponding terms

[BY&RN p. 127]
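The two steps above can be sketched directly: each term's row of correlations with all other terms is treated as a vector, and the cosine between two rows gives the scalar correlation (the correlation values below are illustrative, not computed from real data):

```python
import math

# illustrative correlation vectors: each term's (association or
# metric) correlation with every term in the vocabulary
corr = {
    "car":  [5.0, 4.0, 0.0, 1.0],
    "auto": [4.0, 5.0, 0.0, 1.0],
    "fish": [0.0, 0.0, 5.0, 0.0],
}

def scalar(u, v):
    # cosine between the two correlation vectors: high when the two
    # terms correlate in similar ways with the rest of the vocabulary
    du = math.sqrt(sum(x * x for x in corr[u]))
    dv = math.sqrt(sum(x * x for x in corr[v]))
    return sum(a * b for a, b in zip(corr[u], corr[v])) / (du * dv)

# "car" and "auto" relate to other terms almost identically, so
# their scalar correlation is near 1; "car" and "fish" share no
# correlated neighbours, so theirs is 0
```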


Which keyword (K2-K4) clusters most closely to keyword K1 using association clusters?

Latent Semantic Indexing


  • PROBLEMS for Vector Model:
    • Size of the frequency table (a matrix) becomes prohibitive (i.e. hundreds of terms and millions of texts); the matrix is also sparse
    • Synonymy: different people may use different words to mean the same thing
    • VSM assumes that the frequency of each keyword is independent of the frequencies of all other keywords
Latent Semantic Indexing
  • LSI involves dimensionality reduction: the dimensions in the reduced space are taken to reflect the ‘latent semantics’
  • In the VSM, making each term a dimension assumes the terms are orthogonal
    • LSI exploits term co-occurrence: co-occurring terms are projected onto the same dimensions
Latent Semantic Indexing


  • Storage space is saved
  • Texts and queries can be recognised as being similar even if they don’t share the same words (so long as they do share words that have been projected onto the same dimension)

***In the latent semantic space a query and a document can have a cosine distance close to 1 even if they do not share any terms***

Latent Semantic Indexing


Singular Value Decomposition

  • SVD is a technique for dimensionality reduction (cf. Eigenfactor Analysis / Principal Components Analysis)
  • In effect SVD takes the (v. large) frequency table matrix and represents it as three smaller matrices
  • The dimensions of the reduced space correspond to the axes of greatest variance; the question remains of how many dimensions to keep
  • NB. Can use tools like Matlab to perform SVD
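The three-matrix decomposition described above can also be sketched in Python with NumPy (a minimal sketch: the tiny term-document matrix below is illustrative, and in practice k would be far smaller than the vocabulary size):

```python
import numpy as np

# illustrative term-document frequency matrix A (terms x documents)
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "car"
    [1.0, 1.0, 0.0, 0.0],   # "auto" (always co-occurs with "car")
    [0.0, 0.0, 1.0, 1.0],   # "fish"
])

# SVD represents A as three smaller matrices: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep only the k largest singular values: the k latent dimensions
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# "car" and "auto" now project onto the same latent dimension, so a
# query using one matches documents using the other
```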
Set Reading (LECTURE 3)
  • See references in previous slides for reading about TFIDF, relevance feedback and term clusters
  • For an overview of Latent Semantic Indexing (LSI) –


Further Reading (LECTURE 3)
  • For more about LSI, see: Deerwester et al. (1990), ‘Indexing by Latent Semantic Analysis’, Journal of the American Society for Information Science 41(6), 391-407.


Lecture 3: LEARNING OUTCOMES

After this lecture you should be able to:

  • Explain how TFIDF weights terms
  • Explain how relevance feedback can be used to automatically modify a query, and apply the Standard Rocchio equation
  • Explain how term clusters can be used for automatic query expansion, and calculate association clusters
  • Explain how LSI modifies the VSM
  • Critically discuss how each of these techniques could improve an information retrieval system
Reading ahead for LECTURE 4

If you want to read about next week’s lecture topics, see:

Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, SECTIONS 1 and 2


Hock (2001), The extreme searcher's guide to web search engines, pages 25-31. (An overview of the factors used to rank webpages). AVAILABLE in Main Library collection and in Library Article Collection.