Term weighting approaches in automatic text retrieval

Presented by Ehsan

References

  • Modern Information Retrieval: Text book

  • Slides on Vectorial Model by Dr. Rada

  • The paper itself

The main idea

  • A text indexing system based on weighted single terms performs better than one based on more complex text representations

    • Effective term weighting is therefore of crucial importance.

Basic IR

  • Attach content identifier to both stored texts and user queries.

  • A content identifier (term) is a word or a group of words extracted from the document or query

    • Underlying assumption

      • The semantics of the documents and queries can be expressed by these terms

Two things to consider

  • What is an appropriate content identifier?

  • Are all identifiers of the same importance?

    • If not, how can we discriminate one term from the others?

Choosing content identifiers

  • Use single terms/words as individual identifiers

  • Use more complex text representations as identifiers

  • An example

    • “Industry is the mother of good luck”

    • Mother said, “Good luck”.

Complex text representation

  • Set of related terms based on statistical co-occurrence

  • Term phrases consisting of one or more governing terms (the head of the phrase) together with corresponding dependent terms

  • Grouping words under a common heading, as in a thesaurus

  • Constructing a knowledge base to represent the content of the subject area

What is better: single or complex terms?

  • Constructing complex text representations is inherently difficult.

    • It requires sophisticated syntactic/statistical analysis programs

  • An example

    • Using term phrases gives a 20% improvement in some cases

    • In other cases the results are quite discouraging

  • Knowledge base

    • Effective vocabulary tools covering subject areas of reasonable scope are still under development

  • Conclusion

    • Using single terms as content identifiers is preferable

The second issue

  • How to discriminate terms?

    • With term weights, of course!

  • Effectiveness of IR system

    • Documents with relevant items must be retrieved

    • Documents with irrelevant/extraneous items must be rejected.

Precision and Recall

  • Recall

    • The number of relevant documents retrieved divided by the total number of relevant documents

  • Precision

    • Out of the documents retrieved, how many are relevant?

  • Our goal

    • High recall to retrieve as many relevant documents as possible

    • High precision to reject extraneous documents.

    • Basically, it is a trade-off.
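
The two measures can be sketched directly from their definitions (the document IDs below are hypothetical, purely for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given the retrieved set and
    the (known) relevant set of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d2", "d7"])
# 2 of the 4 retrieved are relevant -> precision 0.5
# 2 of the 3 relevant were retrieved -> recall ~0.667
```

Retrieving more documents tends to raise recall and lower precision, which is the trade-off the slide refers to.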

Weighting mechanism

  • To get high recall

    • Term frequency, tf

  • When high-frequency terms are prevalent in the whole document collection

    • With high tf alone, every single document will be retrieved

  • To get high precision

    • Inverse document frequency

    • Varies inversely with the number of documents n in which the term appears.

    • idf is given by log2 (N / n), where N is the total number of documents

  • To discriminate terms

    • We use tf X idf
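
A minimal sketch of this tf × idf scheme, using the log2 (N / n) idf defined above (the toy three-document collection is invented for illustration):

```python
import math
from collections import Counter

def tf_idf(doc_tokens, collection):
    """tf x idf weights for one document, with idf = log2(N / n)."""
    N = len(collection)
    # n: number of documents in which each term appears
    df = Counter(t for doc in collection for t in set(doc))
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log2(N / df[t]) for t in tf}

# "retrieval" appears in 1 of 3 documents, so it gets a high idf;
# "the" appears in all 3, so its idf is log2(3/3) = 0 and it carries
# no discriminating weight at all.
docs = [["the", "text", "retrieval"],
        ["the", "text"],
        ["the", "query"]]
w = tf_idf(docs[0], docs)
```

This shows why the product works: a common term keeps a weight near zero however often it occurs, while a rare term is boosted.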

Two more things to consider

  • The current “tf X idf” mechanism favors larger documents

    • Introduce a normalizing factor into the weight to equalize document lengths.

  • Probabilistic model

    • The term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant documents in which it occurs

    • It is approximated by log ((N - n) / n)
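
A quick comparison of the two collection-frequency weights (base-2 logs are used for both here, matching the idf slide; the base is a presentation choice, not essential):

```python
import math

def idf(N, n):
    """Inverse document frequency: log2(N / n)."""
    return math.log2(N / n)

def prob_weight(N, n):
    """Probabilistic term weight: log2((N - n) / n), the ratio of
    documents without the term to documents with it."""
    return math.log2((N - n) / n)

# With N = 1000 documents and a term occurring in n = 10 of them,
# idf = log2(100) and the probabilistic weight = log2(99): nearly
# identical for rare terms. They diverge for terms that occur in
# most documents, where (N - n) / n drops below 1.
```

For rare terms the two behave almost the same; the probabilistic weight goes negative for terms appearing in more than half the collection.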

Term weighting components

  • Term frequency components

    • b, t, n

  • Collection frequency components

    • x, f, p

  • Normalization components

    • x, c

  • What weighting system is given by tfc.nfx?
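
In this three-letter notation the first letter is the term-frequency component, the second the collection-frequency component, and the third the normalization. Reading tfc.nfx that way, documents get raw tf (t) times idf (f) with cosine normalization (c), and queries get augmented normalized tf (n) times idf (f) with no normalization (x). A sketch under that reading (the augmented-tf formula 0.5 + 0.5·tf/max_tf follows the paper's definition of the n component):

```python
import math

def tfc(tf, df, N):
    """Document weights: raw tf x idf, cosine-normalized (t, f, c)."""
    w = {t: tf[t] * math.log2(N / df[t]) for t in tf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

def nfx(tf, df, N):
    """Query weights: augmented normalized tf x idf, unnormalized (n, f, x)."""
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * tf[t] / max_tf) * math.log2(N / df[t])
            for t in tf}
```

The retrieval score is then the inner product of the document and query vectors over their shared terms.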

Experimental evidence

  • Query vectors

    • For tf

      • short query, use n

      • Long query, use t

    • For idf

      • Use f

    • For normalization

      • Use x

Experimental evidence (continued)

  • Document vectors

    • For tf

      • Technical vocabulary, use n

      • More varied vocabulary, use t

    • For idf

      • Use f in general

      • For documents from different domains, use x

    • For normalization

      • For documents with heterogeneous lengths, use c

      • For homogeneous documents, use x


  • Best document weighting: tfc, nfc (or tpc, npc)

  • Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx)

  • Questions?