A Formal Study of Information Retrieval Heuristics

Hui Fang, Tao Tao and ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

USA


Empirical Observations in IR

  • Retrieval heuristics are necessary for good retrieval performance.

    • E.g. TF-IDF weighting, document length normalization

  • Similar formulas may have different performances.

  • Performance is sensitive to parameter setting.


Empirical Observations in IR (Cont.)

  • Pivoted Normalization Method

  • Dirichlet Prior Method

  • Okapi Method

[Figure: the three retrieval formulas with their components labeled: term frequency, inverse document frequency, and document length normalization. Callouts mark an alternative TF transformation, 1+ln(c(w,d)), and parameter sensitivity.]
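The three methods named on this slide are usually written in the IR literature in the forms sketched below. This is an assumption about what the slide showed, not a transcription of it. The function names, the dict-based representation (term mapped to count), and the default values of s, mu, k1, b and k3 are illustrative choices.

```python
import math

def pivoted(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score; query and doc map terms to counts."""
    doc_len = sum(doc.values())
    norm = (1 - s) + s * doc_len / avdl          # document length normalization
    total = 0.0
    for w, qtf in query.items():
        c = doc.get(w, 0)
        if c > 0:
            tf = 1 + math.log(1 + math.log(c))   # term frequency component
            idf = math.log((n_docs + 1) / df[w])  # inverse document frequency
            total += tf / norm * qtf * idf
    return total

def dirichlet(query, doc, p_coll, mu=2000.0):
    """Dirichlet prior score; p_coll maps terms to collection probabilities."""
    doc_len = sum(doc.values())
    query_len = sum(query.values())
    total = query_len * math.log(mu / (doc_len + mu))  # length-dependent part
    for w, qtf in query.items():
        c = doc.get(w, 0)
        if c > 0:
            total += qtf * math.log(1 + c / (mu * p_coll[w]))
    return total

def okapi(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Okapi score with its usual IDF, document-TF and query-TF factors."""
    doc_len = sum(doc.values())
    total = 0.0
    for w, qtf in query.items():
        c = doc.get(w, 0)
        if c > 0:
            idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
            tf = (k1 + 1) * c / (k1 * ((1 - b) + b * doc_len / avdl) + c)
            qf = (k3 + 1) * qtf / (k3 + qtf)
            total += idf * tf * qf
    return total
```

All three combine a term frequency component, an IDF-like component and a length normalization component, which is what makes a shared set of constraints applicable to them.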


Research Questions

  • How can we formally characterize these necessary retrieval heuristics?

  • Can we predict the empirical behavior of a method without experimentation?


Outline

  • Formalized heuristic retrieval constraints

  • Analytical evaluation of the current retrieval formulas

  • Benefits of constraint analysis

    • Better understanding of parameter optimization

    • Explanation of performance difference

    • Improvement of existing retrieval formulas


Term Frequency Constraints (TFC1)

TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.

  • TFC1

    • Let q be a query with only one term w.

    • [Figure: the TFC1 condition ("If ... and ... then ...") on two documents, d1 and d2]
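TF weighting heuristic I can be checked numerically. The sketch below assumes one reading of it: for a single-term query and two documents of equal length, the score depends only on the TF weight, so the weight must grow strictly with the term count. `satisfies_tfc1` and `binary_tf` are hypothetical helpers introduced for illustration; `pivoted_tf` is the commonly cited TF component of pivoted normalization.

```python
import math

def pivoted_tf(count):
    """TF component of pivoted normalization: 1 + ln(1 + ln(c)), or 0 if absent."""
    if count <= 0:
        return 0.0
    return 1 + math.log(1 + math.log(count))

def binary_tf(count):
    """Hypothetical presence/absence weight, used only as a counterexample."""
    return 1.0 if count > 0 else 0.0

def satisfies_tfc1(tf_weight, max_count=50):
    """Check that tf_weight(c) strictly increases for c = 0..max_count."""
    weights = [tf_weight(c) for c in range(max_count + 1)]
    return all(lo < hi for lo, hi in zip(weights, weights[1:]))

print(satisfies_tfc1(pivoted_tf))  # weight keeps growing with the count
print(satisfies_tfc1(binary_tf))   # weight is flat after the first occurrence
```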


Term Frequency Constraints (TFC2)

TF weighting heuristic II: Favor a document with more distinct query terms.

  • TFC2

    • Let q be a query and w1, w2 be two query terms.

    • [Figure: the TFC2 condition ("Assume ... and ... If ... and ... then ...") on two documents, d1 and d2]
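TF weighting heuristic II can also be probed numerically. The sketch below assumes one reading of it: with two equally discriminative query terms and equal document lengths, a document that spreads a + b occurrences over both terms should beat one that puts all a + b occurrences on a single term. Under that assumption the TF weight must satisfy tf(a + b) < tf(a) + tf(b). The helper names are hypothetical.

```python
import math

def pivoted_tf(count):
    """TF component of pivoted normalization: 1 + ln(1 + ln(c)), or 0 if absent."""
    if count <= 0:
        return 0.0
    return 1 + math.log(1 + math.log(count))

def linear_tf(count):
    """Hypothetical raw-count weight, used only as a counterexample."""
    return float(count)

def favors_distinct_terms(tf_weight, max_count=20):
    """Check tf(a + b) < tf(a) + tf(b) for all 1 <= a, b <= max_count."""
    for a in range(1, max_count + 1):
        for b in range(1, max_count + 1):
            if not tf_weight(a + b) < tf_weight(a) + tf_weight(b):
                return False
    return True

print(favors_distinct_terms(pivoted_tf))  # sub-additive weight rewards coverage
print(favors_distinct_terms(linear_tf))   # raw counts give no such reward
```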


Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.

Query: SVM Tutorial

Assume IDF(SVM) > IDF(Tutorial)

[Figure: Doc 1 and Doc 2, each containing occurrences of "SVM" and "Tutorial"]


Term Discrimination Constraint (Cont.)

  • TDC

    • Let q be a query and w1, w2 be two query terms.

    • [Figure: the TDC condition ("Assume ... and ... and ... for all other words w. If ... and ... then ...") on two documents, d1 and d2]
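The IDF weighting heuristic can be illustrated with the query "SVM Tutorial". The sketch below scores two toy documents of equal length with the commonly cited pivoted normalization formula. The document frequencies, collection size and average length are invented for illustration, chosen only so that IDF(SVM) > IDF(Tutorial).

```python
import math

def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score; query and doc map terms to counts."""
    doc_len = sum(doc.values())
    norm = (1 - s) + s * doc_len / avdl
    total = 0.0
    for w, qtf in query.items():
        c = doc.get(w, 0)
        if c > 0:
            tf = 1 + math.log(1 + math.log(c))
            total += tf / norm * qtf * math.log((n_docs + 1) / df[w])
    return total

query = {"svm": 1, "tutorial": 1}
df = {"svm": 10, "tutorial": 200}     # hypothetical: "svm" is the rarer term
n_docs, avdl = 1000, 5                # hypothetical collection statistics

doc1 = {"svm": 3, "tutorial": 2}      # more of the discriminative term
doc2 = {"svm": 2, "tutorial": 3}      # more of the popular term

score1 = pivoted_score(query, doc1, df, n_docs, avdl)
score2 = pivoted_score(query, doc2, df, n_docs, avdl)
print(score1 > score2)                # doc1 is favored in this example
```

Both documents have five query-term occurrences in total, so the only difference is which term gets the extra occurrence.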


Length Normalization Constraints (LNCs)

Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).

  • LNC1

    • Let q be a query.

    • [Figure: the LNC1 condition ("If for some word ... but for other words ... then ...") on two documents, d1 and d2]

  • LNC2

    • Let q be a query.

    • [Figure: the LNC2 condition ("If ... and ... then ...") on two documents, d1 and d2]
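The two length heuristics can be turned into simple checks on a scoring function. The sketch below assumes one reading of each: for LNC1, adding a single non-query word must not raise the score; for LNC2, concatenating a document with itself k times must not lower it. The checker names, the toy document and the collection statistics are hypothetical, and the score used is the commonly cited pivoted normalization formula.

```python
import math
from functools import partial

def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score; query and doc map terms to counts."""
    doc_len = sum(doc.values())
    norm = (1 - s) + s * doc_len / avdl
    total = 0.0
    for w, qtf in query.items():
        c = doc.get(w, 0)
        if c > 0:
            tf = 1 + math.log(1 + math.log(c))
            total += tf / norm * qtf * math.log((n_docs + 1) / df[w])
    return total

def lnc1_holds(score, query, doc, extra_word):
    """Adding one occurrence of a non-query word must not raise the score."""
    if extra_word in query:
        raise ValueError("extra_word must not be a query term")
    longer = dict(doc)
    longer[extra_word] = longer.get(extra_word, 0) + 1
    return score(query, doc) >= score(query, longer)

def lnc2_holds(score, query, doc, k):
    """Repeating the whole document k times must not lower the score."""
    repeated = {w: k * c for w, c in doc.items()}
    return score(query, repeated) >= score(query, doc)

query = {"svm": 1}
doc = {"svm": 2, "filler": 98}                       # toy document of length 100
stats = dict(df={"svm": 10}, n_docs=1000, avdl=100)  # hypothetical statistics
score = partial(pivoted_score, s=0.1, **stats)

print(lnc1_holds(score, query, doc, "other"))
print(lnc2_holds(score, query, doc, k=2))
```

Whether the second check passes depends on s: a larger s penalizes the doubled document more heavily.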


TF-LENGTH Constraint (TF-LNC)

TF-LN heuristic: Regularize the interaction of TF and document length.

  • TF-LNC

    • Let q be a query with only one term w.

    • [Figure: the TF-LNC condition ("If ... and ... then ...") on two documents, d1 and d2]
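The TF-LN heuristic concerns what happens when a document gains extra occurrences of the query term and therefore also gets longer. The sketch below assumes one reading of it: for a single-term query, adding only query-term occurrences should raise the score, so the length penalty must not cancel the TF gain. The function names and numbers are hypothetical, and the score is the single-term case of the commonly cited pivoted normalization formula.

```python
import math

def pivoted_single_term(count, doc_len, avdl, s, idf=1.0):
    """Pivoted score for a one-term query; idf is a positive constant here."""
    if count <= 0:
        return 0.0
    tf = 1 + math.log(1 + math.log(count))
    return tf / ((1 - s) + s * doc_len / avdl) * idf

def tf_lnc_holds(count, doc_len, extra, avdl, s):
    """Adding `extra` query-term occurrences (and nothing else) must raise the score."""
    before = pivoted_single_term(count, doc_len, avdl, s)
    after = pivoted_single_term(count + extra, doc_len + extra, avdl, s)
    return after > before

# Mild normalization on an average-length document.
print(tf_lnc_holds(count=1, doc_len=100, extra=1, avdl=100, s=0.2))
# Full normalization on a short document made only of the query term.
print(tf_lnc_holds(count=20, doc_len=20, extra=1, avdl=100, s=1.0))
```

The second call shows how an aggressive setting of s can let the length penalty outweigh the TF gain.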


Analytical Evaluation


Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.

Query: SVM Tutorial

Assume IDF(SVM) > IDF(Tutorial)

  • Doc 1: ... SVM SVM SVM Tutorial Tutorial

  • Doc 2: Tutorial SVM SVM Tutorial Tutorial


Benefits of Constraint Analysis

  • Provide an approximate bound for the parameters

    • A constraint may be satisfied only if the parameter is within a particular interval.

  • Compare different formulas analytically without experimentation

    • When a formula does not satisfy the constraint, it often indicates non-optimality of the formula.

  • Suggest how to improve the current retrieval models

    • Violation of constraints may pinpoint where a formula needs to be improved.


Benefits 1: Bounding Parameters

  • Pivoted Normalization Method

    • LNC2 ⇒ s < 0.4

[Figure: parameter sensitivity of s (Avg. Prec. plotted against s, with 0.4 marked on the s axis), shown alongside the optimal s for average precision]
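How a constraint bounds a parameter can be seen by scanning s and testing LNC2 at each value. The sketch below assumes LNC2 means that repeating a document k times must not lower its score, and uses the single-term case of the commonly cited pivoted formula. The toy document is invented, so the bound it yields is specific to that document and is not expected to reproduce the 0.4 on the slide.

```python
import math

def pivoted_single_term(count, doc_len, avdl, s):
    """Pivoted score for a one-term query, with the IDF factor left out."""
    if count <= 0:
        return 0.0
    tf = 1 + math.log(1 + math.log(count))
    return tf / ((1 - s) + s * doc_len / avdl)

def lnc2_holds(count, doc_len, k, avdl, s):
    """Repeating the document k times must not lower the score."""
    original = pivoted_single_term(count, doc_len, avdl, s)
    repeated = pivoted_single_term(k * count, k * doc_len, avdl, s)
    return repeated >= original

def largest_safe_s(count, doc_len, k, avdl, steps=20):
    """Scan s = 1/steps, 2/steps, ... and return the last value before LNC2 fails."""
    best = None
    for i in range(1, steps + 1):
        s = i / steps
        if not lnc2_holds(count, doc_len, k, avdl, s):
            break
        best = s
    return best

# Toy case: the query term occurs twice in an average-length document.
bound = largest_safe_s(count=2, doc_len=100, k=2, avdl=100)
print(bound)
```

The constraint holds only for small s in this example, which matches the slide's point that a constraint may be satisfied only when the parameter lies within a particular interval.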


Benefits 2: Analytical Comparison

  • Okapi Method

    • The IDF term is negative when df(w) is large ⇒ violates many constraints

[Figure: two panels, keyword query and verbose query, each plotting Avg. Prec. against s or b for Okapi and Pivoted]
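The sign problem is easy to reproduce. The sketch below uses the commonly cited Okapi IDF, ln((N - df + 0.5) / (df + 0.5)), which drops below zero once a term occurs in more than half of the documents. The collection size and document frequencies are invented for illustration.

```python
import math

def okapi_idf(n_docs, df):
    """Okapi IDF factor; negative when df exceeds half the collection."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def okapi_term_score(count, doc_len, avdl, n_docs, df, k1=1.2, b=0.75):
    """Okapi contribution of one term that occurs once in the query."""
    tf = (k1 + 1) * count / (k1 * ((1 - b) + b * doc_len / avdl) + count)
    return okapi_idf(n_docs, df) * tf

n_docs = 1000
print(okapi_idf(n_docs, df=10))    # rare term: positive weight
print(okapi_idf(n_docs, df=600))   # very common term: negative weight

# With a negative IDF, extra occurrences of the term lower the score.
one = okapi_term_score(1, 100, 100, n_docs, df=600)
three = okapi_term_score(3, 100, 100, n_docs, df=600)
print(three < one)
```

Common terms are more likely to appear in verbose queries, which is consistent with the slide contrasting keyword and verbose queries.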


Benefits 3: Improving Retrieval Formulas

  • Modified Okapi Method

    • Make Okapi satisfy more constraints; expected to help verbose queries

[Figure: two panels, keyword query and verbose query, each plotting Avg. Prec. against s or b for Modified Okapi, Okapi, and Pivoted]
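The transcript does not say how Okapi is modified. One possible modification, assumed here purely for illustration, is to swap the Okapi IDF for the pivoted-style IDF ln((N + 1) / df(w)), which stays positive whenever df(w) is at most N. The sketch compares the two on a very common term; all names and numbers are hypothetical.

```python
import math

def okapi_idf(n_docs, df):
    """Commonly cited Okapi IDF; negative for very common terms."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def pivoted_style_idf(n_docs, df):
    """Assumed replacement IDF; positive whenever df <= n_docs."""
    return math.log((n_docs + 1) / df)

def okapi_term_score(count, doc_len, avdl, idf, k1=1.2, b=0.75):
    """Okapi contribution of one term that occurs once in the query."""
    tf = (k1 + 1) * count / (k1 * ((1 - b) + b * doc_len / avdl) + count)
    return idf * tf

n_docs, df = 1000, 600                      # a term found in 60% of documents
original = okapi_idf(n_docs, df)
modified = pivoted_style_idf(n_docs, df)

# More occurrences should help; only the modified variant behaves that way here.
orig_gain = okapi_term_score(3, 100, 100, original) - okapi_term_score(1, 100, 100, original)
mod_gain = okapi_term_score(3, 100, 100, modified) - okapi_term_score(1, 100, 100, modified)
print(orig_gain, mod_gain)
```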


Conclusions and Future Work

  • Conclusions

    • Retrieval heuristics can be captured through formally defined constraints.

    • It is possible to evaluate a retrieval formula analytically through constraint analysis.

  • Future Work

    • Explore additional necessary heuristics

    • Apply these constraints to many other retrieval methods

    • Develop new retrieval formulas through constraint analysis


The End

Thank you!

