Term weighting approaches in automatic text retrieval
Presentation Transcript

References

  • Modern Information Retrieval (textbook)

  • Slides on the Vectorial Model by Dr. Rada

  • The paper itself


The main idea

  • A text indexing system based on weighted single terms performs better than one based on more complex text representations

    • Effective term weighting is therefore of crucial importance.


Basic IR

  • Attach content identifiers to both stored texts and user queries.

  • A content identifier (term) is a word or a group of words extracted from the documents/queries

    • Underlying assumption

      • The semantics of the documents and queries can be expressed by these terms


Two things to consider

  • What is an appropriate content identifier?

  • Are all identifiers of the same importance?

    • If not, how can we discriminate one term from the others?


Choosing a content identifier

  • Use single terms/words as individual identifiers

  • Use more complex text representations as identifiers

  • An example

    • “Industry is the mother of good luck”

    • Mother said, “Good luck”.


Complex text representation

  • Sets of related terms based on statistical co-occurrence

  • Term phrases consisting of one or more governing terms (the head of the phrase) together with their corresponding dependent terms

  • Grouping words under a common heading, as in a thesaurus

  • Constructing a knowledge base to represent the content of the subject area


What is better: single or complex terms?

  • Constructing complex text representations is inherently difficult.

    • It needs sophisticated syntactic/statistical analysis programs

  • An example

    • Using term phrases gives a 20% improvement in some cases

    • In other cases the results are quite discouraging

  • Knowledge bases

    • Effective vocabulary tools covering subject areas of reasonable scope are still under development

  • Conclusion

    • Using single terms as content identifiers is preferable


The second issue

  • How to discriminate terms?

    • Term weight of course!

  • Effectiveness of IR system

    • Documents with relevant items must be retrieved

    • Documents with irrelevant/extraneous items must be rejected.


Precision and Recall

  • Recall

    • Number of relevant documents retrieved divided by the total number of relevant documents

  • Precision

    • Out of the documents retrieved, how many are relevant?

  • Our goal

    • High recall to retrieve as many relevant documents as possible

    • High precision to reject extraneous documents.

    • Basically, it is a trade-off.
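The two measures can be sketched in a few lines of Python; the document-ID sets below are hypothetical examples chosen only to illustrate the formulas, not data from the paper:

```python
# Precision and recall for a single query, from sets of document IDs.
# These sets are hypothetical examples, not data from the paper.
relevant = {"d1", "d2", "d3", "d4"}     # all documents relevant to the query
retrieved = {"d1", "d2", "d5"}          # documents the system returned

hits = relevant & retrieved             # relevant documents actually retrieved
recall = len(hits) / len(relevant)      # fraction of relevant docs found: 2/4
precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant: 2/3
```

Retrieving more documents can only raise recall, but usually lowers precision, which is the trade-off noted above.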


Weighting mechanism

  • To get high recall

    • Term frequency, tf

  • When high-frequency terms are prevalent in the whole document collection

    • With a high tf alone, nearly every document will be retrieved

  • To get high precision

    • Inverse document frequency, idf

    • Varies inversely with the number of documents n in which the term appears

    • idf is given by log2(N/n), where N is the total number of documents

  • To discriminate terms

    • We use tf × idf
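The tf × idf weight can be written directly as a small function; a minimal sketch, assuming raw term frequency and the base-2 logarithm given on the slide:

```python
import math

def tf_idf(tf, n, N):
    """tf x idf weight of one term: raw term frequency times
    idf = log2(N / n), where n is the number of documents containing
    the term and N is the total number of documents in the collection."""
    return tf * math.log2(N / n)

# A term occurring 3 times in a document and appearing in 10 of 1000
# documents gets a large weight; a term appearing in every document
# (n == N) gets idf = 0 and contributes nothing to the score.
rare = tf_idf(3, 10, 1000)
common = tf_idf(3, 1000, 1000)
```

Note how idf suppresses exactly the collection-wide high-frequency terms that, as the slide observes, would otherwise cause every document to be retrieved.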


Two more things to consider

  • The current "tf × idf" mechanism favors larger documents

    • Introduce a normalizing factor in the weight to equalize document lengths.

  • Probabilistic model

    • The term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant documents in which the term occurs

    • It is given by log((N - n)/n)
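Both ideas on this slide can be sketched briefly; the function names below are illustrative, not from the paper:

```python
import math

def probabilistic_weight(n, N):
    """Probabilistic term weight log((N - n) / n) from the slide:
    rare terms score high, and a term occurring in more than half of
    the collection gets a negative weight."""
    return math.log((N - n) / n)

def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the weight
    vector, so long and short documents compete on an equal footing."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]
```

After cosine normalization every document vector has length 1, so a large document can no longer dominate the similarity score merely by containing more terms.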


Term weighting components

  • Term frequency components

    • b, t, n

  • Collection frequency components

    • x, f, p

  • Normalization components

    • x, c

  • What would be the weighting system given by tfc.nfx?
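As a sketch of how the three-letter codes combine: the component definitions below (b binary, t raw tf, n augmented normalized tf; x none, f idf, p probabilistic; x none, c cosine) follow the usual SMART conventions and are an assumption here, since the slide only lists the letters:

```python
import math

# Each letter of a code like "tfc" names one component of the weight:
#   tf component:         b = binary, t = raw tf, n = augmented (0.5 + 0.5*tf/max_tf)
#   collection component: x = none, f = idf log(N/n), p = probabilistic log((N-n)/n)
#   normalization:        x = none, c = cosine (divide by vector length)

def smart_weights(code, tfs, dfs, N):
    """Term-weight vector for one document/query under a three-letter
    code; tfs are term frequencies, dfs document frequencies, N the
    collection size. Component definitions are assumed, per the lead-in."""
    t_comp, c_comp, norm = code
    max_tf = max(tfs)
    ws = []
    for tf, n in zip(tfs, dfs):
        if t_comp == "b":
            w = 1.0
        elif t_comp == "t":
            w = float(tf)
        else:  # "n": augmented normalized term frequency
            w = 0.5 + 0.5 * tf / max_tf
        if c_comp == "f":
            w *= math.log(N / n)
        elif c_comp == "p":
            w *= math.log((N - n) / n)
        ws.append(w)
    if norm == "c":
        length = math.sqrt(sum(w * w for w in ws))
        ws = [w / length for w in ws]
    return ws

# "tfc": raw tf times idf, cosine-normalized (a document weighting);
# "nfx" would be the matching query weighting with no normalization.
doc = smart_weights("tfc", [3, 1, 2], [10, 100, 50], 1000)
```

So tfc.nfx means: document terms weighted by raw tf × idf with cosine normalization, query terms by augmented normalized tf × idf with no normalization.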


Experimental evidence

  • Query vectors

    • For tf

      • Short queries, use n

      • Long queries, use t

    • For idf

      • Use f

    • For normalization

      • Use x


Experimental evidence (continued)

  • Document vectors

    • For tf

      • Technical vocabulary, use n

      • More varied vocabulary, use t

    • For idf

      • Use f in general

      • For documents from different domains, use x

    • For normalization

      • Documents with heterogeneous lengths, use c

      • Homogeneous documents, use x


Conclusion

  • Best document weighting: tfc, nfc (or tpc, npc)

  • Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx)

  • Questions?

