
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Chapter 3: Retrieval Evaluation

Alexander Gelbukh

www.Gelbukh.com



Previous chapter

  • Models are needed for formal operations

  • Boolean model is the simplest

  • Vector model is the best combination of quality and simplicity

    • TF-IDF term weighting

    • This (or similar) weighting is used in all further models

  • Many interesting and not well-investigated variations

    • possible future work
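
As a reminder of how the TF-IDF weighting recalled above works, here is a minimal Python sketch, assuming the classic max-normalized term frequency and logarithmic inverse document frequency; the function and the toy documents are illustrative only, not the exact variant used in the course.

    import math
    from collections import Counter

    def tf_idf(docs):
        """Weight each term of each tokenized doc by tf * idf.

        tf  = term count normalized by the most frequent term in the doc
        idf = log(N / n_t), where n_t = number of docs containing term t
        """
        n_docs = len(docs)
        df = Counter(term for doc in docs for term in set(doc))  # document frequencies
        weights = []
        for doc in docs:
            counts = Counter(doc)
            max_count = max(counts.values())
            weights.append({term: (c / max_count) * math.log(n_docs / df[term])
                            for term, c in counts.items()})
        return weights

    docs = [["information", "retrieval", "evaluation"],
            ["boolean", "retrieval", "model"],
            ["vector", "model", "evaluation"]]
    print(tf_idf(docs))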



Previous chapter: Research issues

  • How do people judge relevance?

    • ranking strategies

  • How to combine different sources of evidence?

  • What interfaces can help users to understand and formulate their information need?

    • user interfaces: an open issue

  • Meta-search engines: how to combine results from different Web search engines?

    • The results of different engines hardly intersect

    • How to combine rankings?



To write a paper: Evaluation!

  • How do you measure whether a system is good or bad?

  • To go in the right direction, you need to know where you want to get.

  • “We can do it this way” vs. “This way it performs better”

    • “I think it is better...”

    • “We do it this way...”

    • “Our method takes into account syntax and semantics...”

    • “I like the results...”

  • Criterion of truth. Crucial for any science.

  • Enables competition → financial policy → attracts people

    • TREC international competitions



Methodology to write a paper

  • Define formally your task and constraints

  • Define formally your evaluation criterion (argue if needed)

    • One numerical value is better than several

  • Show that your method gives better value than

    • the baseline (the simple obvious way), such as:

      • Retrieve all. Retrieve none. Retrieve at random. Use Google.

    • state-of-the-art (the best reported method)

      • in the same setting and same evaluation method!

  • and your parameter settings are optimal

    • Consider extreme settings: 0, ∞



... Methodology

The only valid way of reasoning

  • “But we want the clusters to be non-trivial”

    • Add this as a penalty to your criteria or as constraints

  • Divide your “acceptability considerations” into:

    • Constraints: yes/no.

    • Evaluation: better/worse.

  • Check that your evaluation criteria are well justified

    • “My formula gives it this way”

    • “My result is correct since this is what my algorithm gives”

    • Reason in terms of the user task, not your algorithm / formulas

  • Are your good/bad judgments in accord with intuition?



Evaluation? (Possible? How?)

  • IR: “user satisfaction”

    • Difficult to model formally

    • Expensive to measure directly (experiments with subjects)

  • At least two conflicting parameters

    • Completeness vs. quality

    • No good way to combine into one single numerical value

    • Some “user-defined” “weights of importance” of the two

      • Not formal, depend on situation

  • Art



Parameters to evaluate

  • Performance (in general sense)

    • Speed

    • Space

      • Tradeoff

    • Common for all systems. Not discussed here.

  • Retrieval performance (quality?)

    • = goodness of a retrieval strategy

    • A test (reference) collection: docs and queries.

    • The “correct” set (or ordering) provided by “experts”

    • A similarity measure to compare system output with the “correct” one.



Evaluation: Model User Satisfaction

  • User task

    • Batch query processing? Interaction? Mixed?

  • Way of use

    • Real-life situation: what factors matter?

    • Interface type

  • In this chapter: laboratory settings

    • Repeatability

    • Scalability



Sets (Boolean): Precision & Recall

  • Tradeoff (as with time and space)

  • Assumes the retrieval results are sets

    • as in Boolean; in Vector, use threshold

  • Measures closeness between two sets

  • Recall: Of the relevant docs, how many (%) were retrieved?

    The others are lost.

  • Precision: Of the retrieved docs, how many (%) are relevant?

    The others are noise.

  • Nowadays, with huge collections, precision is more important!



Precision & Recall

Recall = |Ra| / |R|  (the fraction of the relevant docs R that appear in the answer set)

Precision = |Ra| / |A|  (the fraction of the answer set A that is relevant)

where R is the set of relevant docs, A is the answer set, and Ra = R ∩ A.
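
A minimal Python sketch of these set-based measures, using the R / A / Ra notation above; the doc ids are illustrative.

    def precision_recall(relevant, retrieved):
        """Set-based evaluation: R = relevant set, A = answer set, Ra = R ∩ A."""
        ra = relevant & retrieved
        recall = len(ra) / len(relevant) if relevant else 0.0
        precision = len(ra) / len(retrieved) if retrieved else 0.0
        return precision, recall

    R = {"d1", "d2", "d3", "d4", "d5"}   # relevant docs (expert judgment)
    A = {"d2", "d3", "d9", "d10"}        # retrieved docs (system answer)
    print(precision_recall(R, A))        # precision 0.5, recall 0.4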



Ranked Output (Vector): ?

  • “Truth”: ordering built by experts

  • System output: guessed ordering

  • Ways to compare two rankings: ?

  • Building the “truth” ordering is not possible or is too expensive

  • So it is not used (or only rarely used) in practice

  • One could build the “truth” ordering automatically

    • Research topic for us?



Ranked Output (Vector) vs. Set

  • “Truth”: unordered “relevant” set

  • Output: ordered guessing

  • Compare ordered set with an unordered one



... Ranked Output vs. set (one query)

  • Plot precision vs. recall curve

  • In the initial part of the list that contains n% of all relevant docs, what is the precision?

    • 11 standard recall levels: 0%, 10%, ..., 90%, 100%.

    • 0%: interpolated



... Many queries

  • Average precision and recall

    Ranked output:

  • Average precision at each recall level

  • To get equal (standard) recall levels, use interpolation

    • e.g., with only 3 relevant docs there is no exact 10% recall level!

    • Interpolated value at level n = maximum known value between levels n and n + 1

    • If none is known, use the nearest known value.
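
A Python sketch of the 11-level evaluation for one query. It uses the common simplification that the interpolated precision at a standard recall level is the maximum precision observed at any recall greater than or equal to that level (equivalent to the rule above once missing levels are filled from the right). Averaging these 11 values over all queries gives the recall-precision averages of this slide. The example data is illustrative.

    def eleven_point_interpolated(relevant, ranking):
        """11-point interpolated precision for one query.

        relevant: set of relevant doc ids (the expert judgment)
        ranking:  list of doc ids returned by the system, best first
        Returns the interpolated precision at recall 0.0, 0.1, ..., 1.0.
        """
        points, hits = [], 0              # (recall, precision) at each relevant hit
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / len(relevant), hits / rank))
        levels = [i / 10 for i in range(11)]
        # interpolated precision at level r = max precision at any recall >= r
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in levels]

    relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
               "d25", "d38", "d48", "d250", "d113", "d3"]
    print(eleven_point_interpolated(relevant, ranking))
    # ≈ [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]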



Precision vs. Recall Figures

  • Alternative method: document cutoff values

    • Precision at first 5, 10, 15, 20, 30, 50, 100 docs

  • Used to compare algorithms.

    • Simple

    • Intuitive

  • NOT a one-value comparison!
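
A sketch of the document-cutoff alternative, computing precision at the fixed cutoff points listed above; the ranking and judgments are illustrative.

    def precision_at_cutoffs(relevant, ranking, cutoffs=(5, 10, 15, 20, 30, 50, 100)):
        """Precision after examining the first k retrieved docs, for each cutoff k."""
        return {k: sum(1 for doc in ranking[:k] if doc in relevant) / k
                for k in cutoffs}

    relevant = {"d1", "d4", "d7"}
    ranking = ["d4", "d2", "d1", "d9", "d7", "d5"] + [f"d{i}" for i in range(100, 194)]
    print(precision_at_cutoffs(relevant, ranking))
    # {5: 0.6, 10: 0.3, 15: 0.2, 20: 0.15, 30: 0.1, 50: 0.06, 100: 0.03}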



[Figure: precision vs. recall curves of two algorithms] Which one is better?



Single-value summaries

  • Curves are not convenient for averaging over multiple queries

  • We need single-value performance for each query

    • Can be averaged over several queries

    • Histogram for several queries can be made

    • Tables can be made

  • Precision at first relevant doc?

  • Average precision at (each) seen relevant docs

    • Favors systems that give several relevant docs first

  • R-precision

    • precision at the R-th retrieved doc (R = total number of relevant docs)
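
A Python sketch of two of these single-value summaries. “Average precision at seen relevant docs” averages the precision observed each time a relevant doc appears in the ranking (the common MAP variant divides by the total number of relevant docs instead), and R-precision is the precision after the first R retrieved docs. The data is illustrative.

    def avg_precision_at_seen_relevant(relevant, ranking):
        """Average of the precision values observed at each seen relevant doc."""
        precisions, hits = [], 0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(precisions) if precisions else 0.0

    def r_precision(relevant, ranking):
        """Precision at the R-th retrieved doc, where R = total number of relevant docs."""
        r = len(relevant)
        return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0

    relevant = {"d3", "d9", "d25", "d56", "d123"}
    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
    print(avg_precision_at_seen_relevant(relevant, ranking))  # (1 + 2/3 + 3/6 + 4/10) / 4 ≈ 0.64
    print(r_precision(relevant, ranking))                     # 2 relevant in the top 5 -> 0.4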



Precision histogram

Two algorithms: A and B

For each query, plot the difference of their R-precision values: R(A) - R(B)

Which one is better?
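
A sketch of such a comparison with hypothetical per-query R-precision values; the differences are rendered as a crude text histogram, with positive bars favoring A and negative bars favoring B.

    # Hypothetical R-precision values of algorithms A and B on five queries.
    r_prec_a = [0.40, 0.75, 0.30, 0.90, 0.55]
    r_prec_b = [0.35, 0.80, 0.10, 0.90, 0.70]

    for query, (a, b) in enumerate(zip(r_prec_a, r_prec_b), start=1):
        diff = a - b                                   # R(A) - R(B) for this query
        bar = ("+" if diff > 0 else "-") * round(abs(diff) * 20)
        print(f"query {query:2d}   R(A)-R(B) = {diff:+.2f}   {bar}")
    # Many bars on one side suggest that algorithm is better overall,
    # but the per-query differences also show where each one wins.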



Alternative measures for Boolean

  • Problems with Precision & Recall measure:

    • Recall cannot be estimated with large collections

    • Two values, but we need one value to compare

    • Designed for batch mode, not interactive use. For interactive use, informativeness matters!

    • Designed for linear ordering of docs (not weak ordering)

  • Alternative measures: combine both in one

    F-measure (harmonic mean): F = 2 / (1/Rec + 1/Prec) = 2·Prec·Rec / (Prec + Rec)

    E-measure: E = 1 - (1 + b²) / (b²/Rec + 1/Prec), where b expresses the user preference for Rec vs. Prec (b = 1 gives E = 1 - F)
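
A Python sketch of these combined measures, following the formulas above; b tunes the relative weight of recall and precision, and b = 1 makes E the complement of F.

    def f_measure(precision, recall):
        """Harmonic mean of precision and recall."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def e_measure(precision, recall, b=1.0):
        """E = 1 - (1 + b^2) / (b^2/recall + 1/precision); b sets the
        relative weight of recall vs. precision (b = 1 gives E = 1 - F)."""
        if precision == 0 or recall == 0:
            return 1.0
        return 1 - (1 + b * b) / (b * b / recall + 1 / precision)

    p, r = 0.5, 0.4
    print(f_measure(p, r))           # 0.444...
    print(e_measure(p, r, b=1.0))    # 0.555... = 1 - F
    print(e_measure(p, r, b=2.0))    # 0.583..., the balance shifted by b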



User-oriented measures

Definitions: U = relevant docs known to the user; Rk = retrieved relevant docs already known to the user; Ru = retrieved relevant docs previously unknown to the user



User-oriented measures

  • Coverage ratio = |Rk| / |U|

    • High when the system finds most of the docs the user expected

  • Novelty ratio = |Ru| / (|Ru| + |Rk|)

    • High when the system reveals many relevant docs new to the user

  • Relative recall: # found / # expected

  • Recall effort: # expected / # examined until those are found

  • Other:

    • expected search length (good for weak order)

    • satisfaction (considers only relevant docs)

    • frustration (considers only non-relevant docs)
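
A Python sketch of the coverage and novelty ratios, using the U / Rk / Ru sets defined above; the example sets are illustrative.

    def coverage_novelty(known_relevant, answer_set, relevant_retrieved):
        """User-oriented measures for one query.

        known_relevant:     relevant docs the user already knew (U)
        answer_set:         all docs retrieved by the system (A)
        relevant_retrieved: relevant docs among the retrieved ones
        """
        rk = known_relevant & answer_set                 # known relevant docs retrieved
        ru = relevant_retrieved - known_relevant         # new relevant docs retrieved
        coverage = len(rk) / len(known_relevant) if known_relevant else 0.0
        novelty = len(ru) / len(relevant_retrieved) if relevant_retrieved else 0.0
        return coverage, novelty

    U = {"d1", "d2", "d3", "d4"}           # relevant docs the user knew beforehand
    A = {"d2", "d3", "d7", "d9", "d11"}    # answer set returned by the system
    Ra = {"d2", "d3", "d9"}                # relevant docs within the answer set
    print(coverage_novelty(U, A, Ra))      # coverage 0.5, novelty 1/3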



Reference collections

Texts with queries and relevant docs known

TREC

  • Text REtrieval Conference. The collection and tasks differ from year to year.

  • Wide variety of topics. Document structure marked up.

  • 6 GB. See NIST website: available at small cost

  • Not all relevant docs marked!

    • Pooling method:

    • top 100 docs in ranking of many search engines

    • manually verified

    • It has been verified that this is a good approximation to the “real” set


TREC tasks

Ad-hoc (conventional: query → answer)

Routing (ranked filtering of changing collection)

Chinese ad-hoc

Filtering (changing collection; no ranking)

Interactive (no ranking)

NLP: does it help?

Cross-language (ad-hoc)

High precision (only 10 docs in answer)

Spoken document retrieval (written transcripts)

Very large corpus (ad-hoc, 20 GB = 7.5 M docs)

Query task (several query versions; does the strategy depend on them?)

Query transforming

Automatic

Manual




...TREC evaluation

  • Summary table statistics

    • # of requests used in the task

    • # of retrieved docs; # of relevant retrieved and not retrieved

  • Recall-precision averages

    • 11 standard recall points, interpolated (and non-interpolated)

  • Document level averages

    • Can also include the average R-precision value

  • Average precision histogram

    • By topic.

    • E.g., the difference between this system's R-precision and the average of all systems



Smaller collections

  • Simpler to use

  • Can include info that TREC does not

  • Can be of specialized type (e.g., include co-citations)

  • Less sparse, greater overlap between queries

  • Examples:

    • CACM

    • ISI

    • there are others



CACM collection

  • Communications of ACM, 1958-1979

  • 3204 articles

  • Computer science

  • Structure info (author, date, citations, ...)

  • Stems (only title and abstract)

  • Good for algorithms relying on cross-citations

    • If a paper cites another one, they are related

    • If two papers cite the same ones, they are related

  • 52 queries with Boolean form and answer sets



ISI collection

  • On information sciences

  • 1460 docs

  • For similarity in terms and cross-citation

  • Includes:

    • Stems (title and abstracts)

    • Number of cross-citations

  • 35 natural-language queries with Boolean form and answer sets



Cystic Fibrosis (CF) collection

  • Medical

  • 1239 docs

  • MEDLINE data

    • keywords assigned manually!

  • 100 requests

  • 4 judgments for each doc

    • Good for checking agreement between judges

  • Degrees of relevance, from 0 to 2

  • Good answer set overlap

    • can be used for learning from previous queries



Research issues

  • Different types of interfaces; interactive systems:

    • What measures to use?

    • Such as informativeness



Conclusions

  • Main measures: Precision & Recall.

    • For sets

    • Rankings are evaluated through initial subsets

  • There are measures that combine them into one

    • Involve user-defined preferences

  • Many (other) characteristics

    • An algorithm can be good at some and bad at others

    • Averages are used, but they are not always meaningful

  • Reference collections with known answers exist for evaluating new algorithms



Thank you!

Till ... ??

