Evaluation of IR Systems

Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford) L10Evaluation

This lecture • Results summaries: • Making our good results usable to a user • How do we know if our results are any good? • Evaluating a search engine • Benchmarks • Precision and recall L10Evaluation

Result Summaries • Having ranked the documents matching a query, we wish to present a results list. • Most commonly, a list of the document titles plus a short summary, aka “10 blue links”.

Summaries • The title is typically automatically extracted from document metadata. What about the summaries? • This description is crucial. • User can identify good/relevant hits based on description. • Two basic kinds: • A static summary of a document is always the same, regardless of the query that hit the doc. • A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand. L10Evaluation

Static summaries • In typical systems, the static summary is a subset of the document. • Simplest heuristic: the first 50 (or so – this can be varied) words of the document • Summary cached at indexing time • More sophisticated: extract from each document a set of “key” sentences • Simple NLP heuristics to score each sentence • Summary is made up of top-scoring sentences. • Most sophisticated: NLP used to synthesize a summary • Seldom used in IR (cf. text summarization work)

Dynamic summaries • Present one or more “windows” within the document that contain several of the query terms • “KWIC” snippets: Keyword in Context presentation • Generated in conjunction with scoring • If query found as a phrase, all or some occurrences of the phrase in the doc • If not, document windows that contain multiple query terms • The summary itself gives the entire content of the window – all terms, not only the query terms.

Generating dynamic summaries • If we have only a positional index, we cannot (easily) reconstruct context window surrounding hits. • If we cache the documents at index time, then we can find windows in it, cueing from hits found in the positional index. • E.g., positional index says “the query is a phrase in position 4378” so we go to this position in the cached document and stream out the content • Most often, cache only a fixed-size prefix of the doc. • Note: Cached copy can be outdated L10Evaluation

Dynamic summaries • Producing good dynamic summaries is a tricky optimization problem • The real estate for the summary is normally small and fixed • Want snippets to be long enough to be useful • Want linguistically well-formed snippets • Want snippets maximally informative about doc • But users really like snippets, even if they complicate IR system design L10Evaluation

Alternative results presentations? • An active area of HCI research • An alternative: http://www.searchme.com / copies the idea of Apple’s Cover Flow for search results L10Evaluation

Evaluating search engines L10Evaluation

Measures for a search engine • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of query language • Ability to express complex information needs • Speed on complex queries • Uncluttered UI • Is it free? L10Evaluation

Measures for a search engine • All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise • The key measure: user happiness • What is this? • Speed of response/size of index are factors • But blindingly fast, useless answers won’t make a user happy • Need a way of quantifying user happiness L10Evaluation

Data Retrieval vs Information Retrieval • DR Performance Evaluation (after establishing correctness) • Response time • Index space • IR Performance Evaluation • How relevant is the answer set? (required to establish functional correctness, e.g., through benchmarks) L10Evaluation

Measuring user happiness • Issue: who is the user we are trying to make happy? • Depends on the setting/context • Web engine: user finds what they want and return to the engine • Can measure rate of return users • eCommerce site: user finds what they want and make a purchase • Is it the end-user, or the eCommerce site, whose happiness we measure? • Measure time to purchase, or fraction of searchers who become buyers? L10Evaluation

Measuring user happiness • Enterprise (company/govt/academic): Care about “user productivity” • How much time do my users save when looking for information? • Many other criteria having to do with breadth of access, secure access, etc. L10Evaluation

Happiness: elusive to measure • Most common proxy: relevance of search results • But how do you measure relevance? • We will detail a methodology here, then examine its issues • Relevant measurement requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A usually binary assessment of either Relevant or Nonrelevant for each query and each document • Some work on more-than-binary, but not the standard L10Evaluation

Evaluating an IR system • Note: the information need is translated into a query • Relevance is assessed relative to the information need,not thequery • E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine. • Query: wine red white heart attack effective • You evaluate whether the doc addresses the information need, not whether it has these words L10Evaluation

Difficulties with gauging Relevancy • Relevancy, from a human standpoint, is: • Subjective: Depends upon a specific user’s judgment. • Situational: Relates to user’s current needs. • Cognitive: Depends on human perception and behavior. • Dynamic: Changes over time. L10Evaluation

Standard relevance benchmarks • TREC - National Institute of Standards and Technology (NIST) has run a large IR test bed for many years • Reuters and other benchmark doc collections used • “Retrieval tasks” specified • sometimes as queries • Human experts mark, for each query and for each doc, Relevant or Nonrelevant • or at least for subset of docs that some system returned for that query

Unranked retrieval evaluation:Precision and Recall • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved) • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant) • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn) L10Evaluation

Precision and Recall L10Evaluation

Precision and Recall in Practice • Precision • The ability to retrievetop-ranked documents that are mostly relevant. • The fraction of the retrieved documents that are relevant. • Recall • The ability of the search to find all of the relevant items in the corpus. • The fraction of the relevant documents that are retrieved. L10Evaluation

Should we instead use the accuracy measure for evaluation? • Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant” • The accuracy of an engine: the fraction of these classifications that are correct • Accuracy is a commonly used evaluation measure in machine learning classification work • Why is this not a very useful evaluation measure in IR? L10Evaluation

Why not just use accuracy? • How to build a 99.9999% accurate search engine on a low budget…. • People doing information retrieval want to findsomething and have a certain tolerance for junk. Snoogle.com Search for: 0 matching results found. L10Evaluation

Precision/Recall • You can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • In a good system, precision decreases as either the number of docs retrieved or recall increases • This is not a theorem, but a result with strong empirical confirmation L10Evaluation

Returns relevant documents but misses many useful ones too The ideal Returns most relevant documents but includes lot of junk Trade-offs 1 Precision 0 1 Recall L10Evaluation

Difficulties in using precision/recall • Should average over large document collection/query ensembles • Need human relevance assessments • People aren’t reliable assessors • Assessments have to be binary • Nuanced assessments? • Heavily skewed by collection/authorship • Results may not translate from one domain to another L10Evaluation

A combined measure: F • Combined measure that assesses precision/recall tradeoff is F measure (harmonic mean): • Harmonic mean is a conservative average • See CJ van Rijsbergen, Information Retrieval L10Evaluation

Aka E Measure (parameterized F Measure) • Variants of F measure that allow weighting emphasis on precision over recall: • Value of  controls trade-off: •  = 1: Equally weight precision and recall (E=F). •  > 1: Weight precision more. •  < 1: Weight recall more. L10Evaluation

F1 and other averages L10Evaluation

Breakeven Point • Breakeven point is the point where precision equals recall. • Alternative single measure of IR effectiveness. • How do you compute it?

Evaluating ranked results • Evaluation of ranked results: • The system can return any number of results • By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve L10Evaluation

Computing Recall/Precision Points: An Example Let total # of relevant docs = 6 Check each new recall point: R=1/6=0.167; P=1/1=1 R=2/6=0.333; P=2/2=1 R=3/6=0.5; P=3/4=0.75 R=4/6=0.667; P=4/6=0.667 Missing one relevant document. Never reach 100% recall R=5/6=0.833; p=5/13=0.38 L10Evaluation

A precision-recall curve

Interpolating a Recall/Precision Curve • Interpolate a precision value for each standard recall level: • rj {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} • r0 = 0.0, r1 = 0.1, …, r10=1.0 • The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level above the j-th level:

Average Recall/Precision Curve • Typically average performance over a large set of queries. • Compute average precision at each standard recall level across all queries. • Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.

Evaluation Metrics (cont’d) • Graphs are good, but people want summary measures! • Precision at fixed retrieval level • Precision-at-k: Precision of top k results • Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages • But: averages badly and has an arbitrary parameter of k • 11-point interpolated average precision • The standard measure in the early TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated!), and average them • Evaluates performance at all recall levels L10Evaluation

Typical (good) 11 point precisions • SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

11 point precisions L10Evaluation

Receiver Operating Characteristics (ROC) Curve • True positive rate = • tp/(tp+fn) = recall = sensitivity • False positive rate = fp/(tn+fp). Related to precision. • fpr=0 <-> p=1 • Why is the blue line “worthless”?

Mean average precision (MAP) • MAP for a query • Average of the precision value for each (of the k top) relevant document retrieved • This approach weights early appearance of a relevant document over later appearance • MAP for query collection is the arithmetic average of MAP for each query • Macro-averaging: each query counts equally L10Evaluation

Average Precision

Mean Average Precision (MAP) • Mean Average Precision (MAP) • summarize rankings from multiple queries by averaging average precision • most commonly used measure in research papers • assumes user is interested in finding many relevant documents for each query • requires many relevance judgments in text collection

MAP

Summarize a Ranking: MAP • Given that n docs are retrieved • Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs • E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2. • If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero • Compute the average over all the relevant documents • Average precision = (p(1)+…p(k))/k

(cont’d) • This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document • Mean Average Precisions (MAP) • MAP = arithmetic mean average precision over a set of topics • gMAP = geometric mean average precision over a set of topics (more affected by difficult topics)

Discounted Cumulative Gain • Popular measure for evaluating web search and related tasks • Two assumptions: • Highly relevant documents are more useful than marginally relevant document • the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined

Discounted Cumulative Gain • Uses graded relevance as a measure of usefulness, or gain, from examining a document • Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks • Typical discount is 1/log (rank) • With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3

Summarize a Ranking: DCG • What if relevance judgments are in a scale of [1,r]? r>2 • Cumulative Gain (CG) at rank n • Let the ratings of the n documents be r1, r2, …rn (in ranked order) • CG = r1+r2+…rn • Discounted Cumulative Gain (DCG) at rank n • DCG = r1 + r2/log22 + r3/log23 + … rn/log2n • We may use any base for the logarithm, e.g., base=b

Discounted Cumulative Gain • DCG is the total gain accumulated at a particular rank p: • Alternative formulation: • used by some web search companies • emphasis on retrieving highly relevant documents

Evaluation of IR Systems