
Retrieval models {week 13}

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval. David Goldschmidt, Ph.D. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.


Presentation Transcript


  1. The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Retrieval models {week 13} from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

  2. Retrieval models (i) • A retrieval model is a formal (mathematical) representation of the process of matching a query and a document • Forms the basis of ranking results [Figure: a user's query terms matched against a collection of documents (doc 123, doc 234, doc 345, ...)]

  3. Retrieval models (ii) • Goal: Retrieve exactly the documents that users want (whether they know it or not!) • A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance) • A good retrieval model also often considers topical relevance

  4. Topical relevance • Given a query, topical relevance identifies documents judged to be on the same topic • Even though keyword-based document scores might show a lack of relevance! [Figure: for the query "Abraham Lincoln," topically related clusters include Abraham Lincoln, U.S. Presidents, Civil War, Stovepipe Hats, and Tall Guys with Beards]

  5. User relevance • User relevance is difficult to quantify because of each user’s subjectivity • Humans often have difficulty explaining why one document is more relevant than another • Humans may disagree about a given document’s relevance in relation to the same query

  6. Boolean retrieval model (i) • In the Boolean retrieval model, there are exactly two possible outcomes for query processing: • TRUE (an exact match of query specification) • FALSE (otherwise) • Ranking is nonexistent • Each matching document has a score of 1

  7. Boolean retrieval model (ii) • Often the goal is to reduce the number of search results down to a manageable size • Typically called searching by numbers • Given a small enough set of results, human users can continue their search manually • Still a useful strategy, but the “best” results may be omitted

  8. Boolean retrieval model (iii) • Advantages: • Results are predictable and explainable • Efficient and easy implementation • Disadvantages: • Query results essentially unranked (instead ordered by date or title) • Effectiveness of query results depends entirely on the user’s ability to formulate query
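The following is a minimal Python sketch of Boolean (AND) retrieval over an inverted index, in the spirit of slides 6–8; the toy documents, the query, and the helper name boolean_and are assumptions for illustration, not from the book.

```python
# Minimal sketch of Boolean (AND) retrieval over a toy inverted index.
# The documents and query are illustrative only.

docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in a tank of freshwater",
    3: "tropical plants need warmth and light",
}

# Build an inverted index: term -> set of IDs of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(query):
    """Return the documents containing every query term; each match scores 1."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(boolean_and("tropical fish"))   # {1} -- exact matches only, no ranking
```

Results are predictable and cheap to compute, but as the slide notes, there is no ranking: every matching document is equally "relevant."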

  9. Vector space model (i) • The vector space model is a decades-old IR approach for implementing term weighting and document ranking • Documents are represented as vectors Di in a t-dimensional vector space, where t is the number of index terms • Each element dij represents the weight of term j in document i

  10. Vector space model (ii) • Given n documents, we can use an n × t matrix to represent all term weights, where row i is the document vector Di and entry dij is the weight of term j in document i

  11. Vector space model (iii) • In the simplest weighting scheme, the term weights are the term counts in each document

  12. Vector space model (iv) • Query Q is represented by a t-dimensional vector of weights • Each qj is the weight of term j in the query

  13. Vector space model (v) • Given the query “tropical fish,” query vector Qa is below; what do query vectors Qb and Qc represent?
  Qa = (0 0 0 1 0 0 0 0 0 0 1)
  Qb = (1 0 1 0 0 0 0 0 1 0 0)
  Qc = (0 0 0 0 0 1 0 0 0 1 0)
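A minimal sketch of slides 9–13: documents and queries represented as term-count vectors over a shared vocabulary. The 11-term vocabulary and example text below are assumptions for illustration, not necessarily the book's exact index terms.

```python
# Sketch: represent a document or query as a t-dimensional vector of term counts.
# The vocabulary and sample text are assumed for illustration.

from collections import Counter

vocabulary = ["aquarium", "bowl", "care", "fish", "freshwater",
              "goldfish", "homepage", "keep", "setup", "tank", "tropical"]

def to_vector(text):
    """Map text to a vector with one weight (here, a raw count) per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocabulary]

D1 = to_vector("tropical fish are popular aquarium fish")
Q  = to_vector("tropical fish")

print(D1)  # [1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1]
print(Q)   # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]  -- 1s for 'fish' and 'tropical'
```

With this vocabulary ordering, Q happens to match the pattern of Qa above, with 1s in the 'fish' and 'tropical' positions and 0s elsewhere.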

  14. Vector space model (vi) • Conceptually, the document vector closest to the query vector is the most relevant • In reality, the distance function is not a good measure of relevance • Use a similarity measure instead (and maximize it) • First, think normalization

  15. Cosine correlation (i) • The cosine correlation measures the cosine of the angle between query and document vectors • Normalize vectors such that all documents and queries are of equal length

  16. Cosine correlation (ii) • The cosine function is shown in blue below: http://en.wikipedia.org/wiki/File:Sine_cosine_one_period.svg

  17. Cosine correlation (iii) • Given document Di and query Q, the cosine measure is given by: Cosine(Di, Q) = Σj (dij × qj) / sqrt( Σj dij² × Σj qj² ) • Normalization occurs in the denominator
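A brief sketch of this cosine measure in Python; the example vectors reuse the toy representation sketched above.

```python
# Cosine(Di, Q): dot product of document and query vectors, normalized in the
# denominator by the product of the two vector lengths.

import math

def cosine(doc_vec, query_vec):
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm = math.sqrt(sum(d * d for d in doc_vec) * sum(q * q for q in query_vec))
    return dot / norm if norm else 0.0

D1 = [1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1]   # toy document vector from the sketch above
Q  = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]   # query vector for "tropical fish"

print(cosine(D1, Q))   # ~0.87; documents are ranked by decreasing cosine score
```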

  18. Term weighting (i) • Term weighting is often based on tf.idf • The term frequency (tf) quantifies the importance of a term in a document: tfik = fik / Σk fik • tfik is the term frequency weight of term k in document Di • fik is the number of occurrences of term k in Di • The denominator Σk fik is the word count (of words considered) in document Di

  19. Term weighting (ii) • Term weighting is often based on tf.idf • The inverse document frequency (idf) quantifies the importance of a term within the entire collection of documents: idfk = log (N / nk) • idfk is the inverse document frequency weight for term k • N is the number of documents in the collection • nk is the number of documents in which term k occurs

  20. Term weighting (iii) • Obtain term weights by multiplying term frequency and inverse document frequency values together • Perform this calculation for each term • As new/updated documents are processed, the algorithm must recalculate idf
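Below is a small sketch tying the last three slides together: a weight computed as tfik × idfk over a toy collection. The collection and the helper name tfidf are assumptions for illustration.

```python
# Sketch of tf.idf weighting:
#   tfik = fik / (word count of Di)     -- term frequency, normalized per document
#   idfk = log(N / nk)                  -- rarer terms in the collection weigh more
#   weight = tfik * idfk
# Toy collection; illustrative only.

import math
from collections import Counter

docs = [
    "tropical fish are popular aquarium fish",
    "the tank holds freshwater fish",
    "tropical plants need warm light",
]

N = len(docs)                                                # number of documents
doc_counts = [Counter(d.lower().split()) for d in docs]      # fik per document
doc_freq = Counter(term for c in doc_counts for term in c)   # nk per term

def tfidf(term, i):
    tf = doc_counts[i][term] / sum(doc_counts[i].values())   # tfik
    idf = math.log(N / doc_freq[term])                       # idfk
    return tf * idf

print(tfidf("tropical", 0))   # ~0.068; recompute idf as documents are added or updated
```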

  21. What next? • Read and study Chapter 7 • Do Exercises 7.1, 7.2, 7.3, and 7.4
