
Ranking in IR and WWW


Presentation Transcript


  1. Ranking in IR and WWW • Modern Information Retrieval: A Brief Overview - Amit Singhal • Presented by Parin Sangoi

  2. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Outline • Introduction • History • Models and Implementation • Vector Space Model • Probabilistic Models • Inference Network Model • Implementation • Evaluation

  3. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Outline (contd.) • Key Techniques • Term Weighting • Query Modification • Other Techniques and Applications • References

  4. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Introduction • What is Information Retrieval? • "An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request." - F.W. Lancaster

  5. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Introduction (contd.) • [Figure: a typical IR system]

  6. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • History • Vannevar Bush (1945) • "As We May Think" gave birth to the idea of automatic access to stored knowledge. • H.P. Luhn (1957) • Proposed words as indexing units for documents and measuring word overlap as a criterion for retrieval. • Gerard Salton and his students • SMART system (many improvements in search quality). • Cyril Cleverdon • Cranfield evaluation methodology, which is still in use for evaluating IR systems.

  7. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • History (contd.) • 1970s and 1980s • Many developments built on the advances of the 1960s. • Various retrieval models were developed. • Shown to be effective on the small text collections available at the time. • 1992 • Text Retrieval Conference (TREC) • Aims at encouraging research in IR on large text collections. • Old techniques were modified and new techniques developed.

  8. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Models and Implementation • Boolean systems • Queries combine terms with ANDs, ORs, and NOTs. • Results are not ranked, and it is difficult for a user to form a good search request. • Many users still prefer Boolean systems, feeling they are more in control of the retrieval process.

  9. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Models and Implementation (contd.) • Vector Space Model • Text is represented by a vector of terms. • If words are chosen as terms, every word in the vocabulary becomes an independent dimension of a very high dimensional vector space. • Any text can then be represented by a vector in this space: a term that belongs to the text gets a non-zero value along its dimension. • Text vectors are very sparse, as a vocabulary can contain millions of terms. • The similarity between the query vector and a document vector is used to assign a numeric score to the document for that query.

  10. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Vector Space Model (contd.) • The angle between two vectors is used as a measure of divergence between them, and the cosine of the angle is used as the numeric similarity (the cosine is 1 for identical vectors and 0 for orthogonal vectors). • Alternatively, the dot product between the two vectors can be used to measure similarity; if all vectors are of unit length, the cosine of the angle equals the dot product. • If Vq is the query vector and Vd is the document vector, then the similarity of document D to query Q is: Sim(Vd, Vq) = Σi wti(Vd) · wti(Vq), where wti(Vq) is the weight of the ith term in the query vector Vq and wti(Vd) is the weight of the ith term in the document vector Vd. • A sketch of this computation follows below.
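
As an illustration of the similarity computation above, here is a minimal Python sketch. The sparse dictionary representation and the example term weights are illustrative assumptions, not part of the original slides.

import math

def cosine_similarity(doc_vec, query_vec):
    """Score a document against a query as the cosine of the angle
    between their sparse term-weight vectors (term -> weight dicts)."""
    # Dot product over shared terms only; terms missing from either
    # vector contribute zero, which is what makes sparsity cheap.
    dot = sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)
    # Dividing by both vector lengths turns the dot product into the
    # cosine: 1 for identical directions, 0 for orthogonal vectors.
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)

# Hypothetical term weights (e.g., tf-idf values) for one document and a query.
doc = {"ranking": 2.1, "retrieval": 1.3, "web": 0.7}
query = {"retrieval": 1.0, "ranking": 1.0}
print(round(cosine_similarity(doc, query), 3))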

  11. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Probabilistic Models • One of the main principles behind these models is that documents should be ranked by decreasing probability of their relevance to the query. • Let P(R|D) be the probability of relevance of document D, and P(R̄|D) the probability that it is non-relevant. Since the ranking criterion is monotonic under the log-odds transformation, we can rank documents by log(P(R|D) / P(R̄|D)). • Applying Bayes' theorem to this ratio, we get log((P(D|R) · P(R)) / (P(D|R̄) · P(R̄))).

  12. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Probabilistic Model (contd.) • Assuming P(R) is independent of the document under consideration and thus constant, P(R) and P(R̄) are just scaling factors and can be eliminated. • The formula then simplifies to log(P(D|R) / P(D|R̄)). • Independence Assumption: assuming terms occur independently of one another, and letting pi denote P(ti|R) and qi denote P(ti|R̄), the log formula reduces to Σ(ti ∈ D) log((pi · (1 − qi)) / (qi · (1 − pi))), plus a document-independent constant. • A worked derivation follows below.
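
Since the original slide showed this reduction only as an image, here is the standard binary-independence derivation written out in LaTeX; the intermediate steps are a reconstruction of the usual treatment, not copied from the slides.

\log \frac{P(D|R)}{P(D|\bar{R})}
  = \sum_{t_i \in D} \log \frac{p_i}{q_i}
  + \sum_{t_i \notin D} \log \frac{1 - p_i}{1 - q_i}
  = \sum_{t_i \in D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
  + \sum_{i} \log \frac{1 - p_i}{1 - q_i}

The last sum ranges over all terms in the vocabulary and does not depend on the document, so it can be dropped for ranking purposes, leaving only the sum over terms present in the document.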

  13. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Probabilistic Model (contd.) • Croft and Harper assume that pi is the same for every query term, so the factor pi/(1 − pi) is a constant and can be ignored for ranking. • They also assume that almost all documents in a collection are non-relevant to a query (as collections are very large), and estimate qi by ni/N, where N is the collection size and ni is the number of documents containing term i. • This yields the scoring function Σ(ti ∈ Q, D) log((N − ni) / ni), which is similar to the IDF function. • A small sketch of this scorer follows below.
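
The following Python sketch implements the scoring function above. The toy document-frequency table and collection size are hypothetical values chosen only for illustration.

import math

def idf_like_score(doc_terms, query_terms, doc_freq, num_docs):
    """Croft/Harper-style score: sum log((N - n_i) / n_i) over the
    query terms present in the document (doc_freq maps term -> n_i)."""
    score = 0.0
    for term in query_terms:
        n_i = doc_freq.get(term, 0)
        # Only terms present in the document contribute, and the log
        # is only defined for 0 < n_i < N.
        if term in doc_terms and 0 < n_i < num_docs:
            score += math.log((num_docs - n_i) / n_i)
    return score

# Hypothetical statistics: 1000 documents, "ranking" occurs in 50 of them.
print(idf_like_score({"ranking", "web"}, ["ranking", "retrieval"],
                     {"ranking": 50, "retrieval": 400}, 1000))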

  14. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Inference Network Model • In the simplest implementation, a document instantiates a term with a certain strength, and given a query the credit from multiple terms is accumulated to compute the equivalent of a numeric score for the document. • If the strength of instantiation is taken to be the weight of the term in the document, the ranking resembles that of the Vector Space Model or the Probabilistic Model. • The model is more general, however: any formula can be used to define the strength of the instantiation.

  15. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Implementation • Inverted list data structure. • Gives fast access to the list of documents that contain a term, along with some additional information (term weight, relative positions, etc.). • Inverted Index (a sketch follows below) • Stop words (ignored): the, in, of, a, ... • Stemming: retrieval, retrieve, retrieved, retrieving, retriever, ... are conflated to one indexing unit. • Stemming is poor if conflating unrelated words makes the system return wrong documents. • Is it good enough? • Multi-word phrases can also be treated as indexing units.
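
A minimal inverted-index sketch in Python. The tokenization (lowercased whitespace split) and the tiny stop-word list are simplifying assumptions; real systems also apply stemming and store positions.

from collections import defaultdict

STOP_WORDS = {"the", "in", "of", "a"}

def build_inverted_index(docs):
    """Map each non-stop term to its inverted list: the (doc_id,
    term_frequency) pairs for every document containing the term."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                postings[word][doc_id] = postings[word].get(doc_id, 0) + 1
    # Sorted posting lists allow fast merging at query time.
    return {term: sorted(counts.items()) for term, counts in postings.items()}

docs = {1: "ranking in IR and the WWW", 2: "the vector space model"}
index = build_inverted_index(docs)
print(index["ranking"])  # [(1, 1)]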

  16. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Evaluation • Measurable quantities: • the coverage of the collection, that is, the extent to which the system includes relevant matter; • the time lag, that is, the average interval between the time the search request is made and the time an answer is given; • the form of presentation of the output; • the effort involved on the part of the user in obtaining answers to his search requests; • the recall of the system, that is, the proportion of relevant material actually retrieved in answer to a search request; • the precision of the system, that is, the proportion of retrieved material that is actually relevant. • Of these, the last two measure the effectiveness of the retrieval system.

  17. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Precision and Recall • Contingency table, where A is the set of relevant documents and B the set of retrieved documents:

                     Relevant    Non-relevant
      Retrieved      A ∩ B       Ā ∩ B
      Not retrieved  A ∩ B̄       Ā ∩ B̄

  18. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Precision and Recall (contd.) • Recall is the proportion of relevant documents retrieved by the system: |A ∩ B| / |A|. • Precision is the proportion of retrieved documents that are relevant: |A ∩ B| / |B|. • Fallout is the proportion of non-relevant documents retrieved by the system: |Ā ∩ B| / |Ā|. • A good IR system should have high recall (retrieve as many relevant documents as possible) and high precision (retrieve very few non-relevant documents). • A sketch of these measures follows below.
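
The three measures computed from document-id sets, following the contingency table above; the example judgments are hypothetical.

def effectiveness(relevant, retrieved, collection_size):
    """Return (recall, precision, fallout) given the set A of relevant
    doc ids, the set B of retrieved doc ids, and the collection size N."""
    hits = len(relevant & retrieved)                 # |A ∩ B|
    non_relevant = collection_size - len(relevant)   # |Ā|
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0
    return recall, precision, fallout

# Hypothetical query: 4 relevant docs, 5 retrieved, 100 docs in the collection.
print(effectiveness({1, 2, 3, 4}, {2, 3, 5, 8, 9}, 100))  # (0.5, 0.4, 0.03125)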

  19. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Precision and Recall (contd.) • Unfortunately, the two goals are quite contradictory: techniques that improve recall tend to hurt precision, and vice versa. • Average Precision summarizes both for a ranked list: precision is computed at the rank of each relevant document retrieved, and these values are averaged over all relevant documents. • A sketch follows below.
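
A minimal implementation of non-interpolated average precision as just described; the ranking and judgments in the example are made up for illustration.

def average_precision(ranked_docs, relevant):
    """Average the precision at the rank of each relevant document,
    over all relevant documents for the query."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant docs 1 and 4 retrieved at ranks 1 and 3: (1/1 + 2/3) / 2 ≈ 0.833.
print(average_precision([1, 7, 4, 9], {1, 4}))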

  20. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Key Techniques • Term Weighting • Both the Probabilistic Model and the Vector Space Model need a weighting function to determine the ranked relevance. • Three main factors affect the weight formulation: • Term Frequency (tf): words that repeat multiple times in a document are considered salient. • Inverse Document Frequency (idf): words that appear in many documents are considered common and are not very indicative of document content; a weighting method based on this observation is called inverse document frequency (idf) weighting. • Document Length: when collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions; this effect is usually compensated for by normalizing for document length in the term-weighting method.

  21. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Term Weighting (contd.) • After the first TREC, researchers realized that raw tf is non-optimal and that a dampened frequency (e.g., a logarithmic tf function such as 1 + log(tf)) is a better weighting metric. • A sketch combining all three weighting factors follows below.
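
One common instantiation of the three factors: dampened tf times idf, followed by length normalization. The particular formulas (1 + log(tf), log(N/n_i)) are standard choices assumed here, not prescribed by the slides.

import math

def tf_idf_vector(term_freqs, doc_freq, num_docs):
    """Weight each document term by dampened tf times idf, then
    length-normalize the vector to compensate for document length."""
    weights = {}
    for term, tf in term_freqs.items():
        damped_tf = 1.0 + math.log(tf)                    # dampened term frequency
        idf = math.log(num_docs / doc_freq.get(term, 1))  # rarity of the term
        weights[term] = damped_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # vector length
    return {t: w / norm for t, w in weights.items()} if norm else weights

# Hypothetical counts: "ranking" occurs 3 times here, and in 50 of 1000 docs.
print(tf_idf_vector({"ranking": 3, "web": 1}, {"ranking": 50, "web": 400}, 1000))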

  22. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Query Modification • Synonyms • Earlier systems relied on a hand-built thesaurus. • Newer ones build their own thesauri by analyzing word co-occurrence. • Relevance Feedback • Users are the best judges of whether a retrieved document is relevant or non-relevant, and their judgments are used to reformulate the query. • Pseudo-Feedback • Assume the top few retrieved documents are relevant and run relevance feedback on them to generate a new query. • A Rocchio-style sketch follows below.
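
The slides give no formula for feedback, but a classic way to implement it is Rocchio's method: move the query vector toward the (assumed relevant) top-ranked documents. A minimal pseudo-feedback sketch; the alpha and beta weights are conventional values assumed for illustration.

def rocchio_pseudo_feedback(query_vec, top_doc_vecs, alpha=1.0, beta=0.75):
    """Expanded query = alpha * original query + beta * centroid of
    the top-ranked documents, all as sparse term -> weight dicts."""
    new_query = {term: alpha * w for term, w in query_vec.items()}
    k = len(top_doc_vecs)
    for doc_vec in top_doc_vecs:
        for term, w in doc_vec.items():
            # Each top document pulls the query toward its own terms,
            # adding expansion terms the user never typed.
            new_query[term] = new_query.get(term, 0.0) + beta * w / k
    return new_query

# Hypothetical vectors: the top-2 documents add "pagerank" to the query.
query = {"ranking": 1.0}
top_docs = [{"ranking": 0.8, "pagerank": 0.6}, {"pagerank": 0.9}]
print(rocchio_pseudo_feedback(query, top_docs))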

  23. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • Other Techniques and Applications • Techniques • Cluster Hypothesis: documents that are very similar to each other will have a similar relevance profile for a given query. • Limited success for ranking, but it aided the development of browsing and searching interfaces. • Natural Language Processing: limited success so far. • Applications • Information Filtering • Topic Detection and Tracking (TDT) • Speech Retrieval • Cross-Language Retrieval • Question Answering • ...

  24. Ranking in IR and WWW (Modern Information Retrieval: A Brief Overview - Amit Singhal) • References • Amit Singhal. Modern Information Retrieval: A Brief Overview. • C.J. van Rijsbergen. Information Retrieval. • Dr. Gautam Das's Lecture Notes.
