CSM06 Information Retrieval. Lecture 3: Text IR part 2. Dr Andrew Salway, firstname.lastname@example.org. Recap from Lecture 2: IR systems treat documents as 'bags of words'; common document preprocessing techniques are tokenization, stop lists and stemming.
How documents are matched / ranked for a query is determined by the IR model used:
(System Quirk will help in making the frequency table; Microsoft Excel will help in calculating cosine distances and ranking).
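As a rough illustration of the ranking step mentioned above, the sketch below scores documents against a query with the cosine measure and sorts them. The term-frequency vectors, the document names and the helper names `cosine` and `rank` are all invented for illustration; they are not taken from the lecture or from System Quirk.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical frequency table: one vector per document over three terms.
docs = {
    "d1": [3, 0, 1],
    "d2": [0, 2, 2],
    "d3": [0, 1, 0],
}
query = [1, 0, 1]

ranking = rank(query, docs)
for name, score in ranking:
    print(name, round(score, 3))
```

A document with no query terms (here d3) scores 0 and falls to the bottom of the ranking.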
Belew 2000, Section 3.6
(Baeza-Yates and Ribeiro-Neto 1999, pp. 118-120)
Belew (2000), Fig. 4.4
Belew (2000), Fig. 4.6
qm = αq + (β/Dr) * Σ(Rel doc vectors) - (γ/Di) * Σ(Irrel doc vectors)
q = query vector
qm = modified query
α, β, γ are constants
Dr = number of documents marked relevant by user
Di = number of documents marked irrelevant by user
Consider a query vector vq and five documents returned by an information retrieval system: two that the user considers relevant, with vectors v1 and v2, and three that the user considers irrelevant, with vectors v3, v4 and v5. Compute a modified query using the Standard Rocchio equation with α = β = γ = 1.
vq = (2, 1, 0, 0)
v1 = (0, 4, 0, 2) v2 = (0, 3, 0, 1)
v3 = (1, 0, 2, 0) v4 = (0, 1, 4, 0)
v5 = (1, 1, 0, 0)
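One way to check the arithmetic for this exercise is the minimal sketch below, which applies the Rocchio equation as stated above (query, plus the averaged relevant vectors, minus the averaged irrelevant vectors). The function name `rocchio` and its parameter names are illustrative choices, not part of the lecture material.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """qm = alpha*q + (beta/Dr)*sum(relevant) - (gamma/Di)*sum(irrelevant)."""
    n = len(query)

    def mean_vector(vectors):
        # Sum the vectors component-wise and divide by how many there are.
        if not vectors:
            return [0.0] * n
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]

    rel = mean_vector(relevant)
    irr = mean_vector(irrelevant)
    return [alpha * q + beta * r - gamma * s for q, r, s in zip(query, rel, irr)]

vq = [2, 1, 0, 0]
relevant = [[0, 4, 0, 2], [0, 3, 0, 1]]                  # v1, v2
irrelevant = [[1, 0, 2, 0], [0, 1, 4, 0], [1, 1, 0, 0]]  # v3, v4, v5

qm = rocchio(vq, relevant, irrelevant)
print([round(x, 2) for x in qm])  # [1.33, 3.83, -2.0, 1.5]
```

Working by hand gives the same result: the relevant vectors average to (0, 3.5, 0, 1.5) and the irrelevant ones to (2/3, 2/3, 2, 0), so qm = (4/3, 23/6, -2, 3/2). The third weight is negative; some implementations set negative weights to zero, but the equation as given here does not.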
For more details, see Baeza-Yates and Ribeiro-Neto 1999, pp. 123-7
[BY&RN p. 126]
[BY&RN p. 127]
Which keyword (K2-K4) clusters most closely to keyword K1 using association clusters?
***In the latent semantic space a query and a document can have a cosine distance close to 1 even if they do not share any terms***
Singular Value Decomposition
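The claim that a query and a document can match in the latent semantic space without sharing terms can be seen on a toy example. The sketch below assumes NumPy is available and uses an invented 4-term, 3-document matrix; the rank k = 2 and the folding-in of the query as q_k = Σ_k^-1 U_k^T q are common textbook choices and are assumptions here, not necessarily the formulation used in the lecture.

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents.
# Rows: car, automobile, engine, flower.
A = np.array([
    [1.0, 0.0, 0.0],   # car:        only in d1
    [0.0, 1.0, 0.0],   # automobile: only in d2
    [1.0, 1.0, 0.0],   # engine:     in d1 and d2
    [0.0, 0.0, 2.0],   # flower:     only in d3
])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Singular value decomposition: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the two largest singular values
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

q = np.array([1.0, 0.0, 0.0, 0.0])  # query containing only "car"
q_k = (U_k.T @ q) / s_k             # query folded into the latent space

d2 = A[:, 1]        # d2 in term space: "automobile" and "engine", no "car"
d2_k = Vt_k[:, 1]   # d2 in the latent space

cos_term_space = cosine(q, d2)
cos_latent_space = cosine(q_k, d2_k)
print(cos_term_space, round(cos_latent_space, 3))
```

In term space the query and d2 share no terms, so their cosine is 0. After truncation to two dimensions, "car" and "automobile" end up on the same latent dimension because both co-occur with "engine", and the cosine between the query and d2 is close to 1.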
After this lecture you should be able to:
If you want to read about next week’s lecture topics, see:
Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, SECTIONS 1 and 2
Hock (2001), The extreme searcher's guide to web search engines, pages 25-31. (An overview of the factors used to rank webpages). AVAILABLE in Main Library collection and in Library Article Collection.