Term Weighting approaches in automatic text retrieval.

Presented by

Ehsan

- Modern Information Retrieval (textbook)
- Slides on Vectorial Model by Dr. Rada
- The paper itself

- A text indexing system based on weighted single terms performs better than one based on more complex text representations
- Effective term weighting is of crucial importance.

- Attach content identifiers to both stored texts and user queries.
- A content identifier (term) is a word or a group of words extracted from the documents or queries
- Underlying assumption
- The semantics of the documents and queries can be expressed by these terms

- What is an appropriate content identifier?
- Are all identifiers of equal importance?
- If not, how can we discriminate one term from the others?

- Use single terms/words as individual identifiers
- Use a more complex text representation as the identifier
- An example
- “Industry is the mother of good luck”
- Mother said, “Good luck”.
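One possible reading of the example above is that both sentences reduce to nearly the same single terms even though they say different things. The sketch below illustrates that reading; the `terms` helper and its simple letters-only tokenization are assumptions made for illustration, not part of the original slides.

```python
import re

def terms(text):
    """Lowercase the text and return its set of alphabetic single terms."""
    return set(re.findall(r"[a-z]+", text.lower()))

a = terms("Industry is the mother of good luck")
b = terms('Mother said, "Good luck".')

# Terms that a single-term index would treat as shared by both sentences.
shared = a & b
print(sorted(shared))
```

A single-term index matches both sentences on the shared terms, which is the kind of ambiguity that more complex text representations try to resolve.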

- Sets of related terms based on statistical co-occurrence
- Term phrases consisting of one or more governing terms (the head of the phrase) together with the corresponding dependent terms
- Grouping words under a common heading, as in a thesaurus
- Constructing a knowledge base to represent the content of the subject area

- Constructing complex text representations is inherently difficult.
- It needs sophisticated syntactic/statistical analysis programs

- An example
- Using term phrases gives a 20% increase in retrieval effectiveness in some cases
- In other cases the results are quite discouraging

- Knowledge base
- Effective vocabulary tools covering subject areas of reasonable scope are still under development

- Conclusion
- Using single terms as content identifiers is preferable

- How to discriminate terms?
- Term weight of course!

- Effectiveness of IR system
- Documents with relevant items must be retrieved
- Documents with irrelevant/extraneous items must be rejected.

- Recall
- Number of relevant documents retrieved divided by the total number of relevant documents

- Precision
- Number of relevant documents retrieved divided by the total number of documents retrieved

- Our goal
- High recall to retrieve as many relevant documents as possible
- High precision to reject extraneous documents.
- Basically, it is a trade off.
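The two measures can be sketched in a few lines of Python. The function name `recall_precision` and the document ids below are hypothetical, chosen only to illustrate the definitions.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# and 6 relevant documents in the whole collection.
recall, precision = recall_precision(
    {"d1", "d2", "d3", "d7"},
    {"d1", "d2", "d3", "d4", "d5", "d6"},
)
print(recall, precision)
```

Retrieving more documents can only raise recall, but it usually lowers precision, which is the trade-off mentioned above.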

- To get high recall
- Term frequency, tf

- Problem: when high-frequency terms are prevalent across the whole document collection
- With tf alone, every single document will be retrieved

- To get high precision
- Inverse document frequency
- Varies inversely with the number of documents n in which the term appears.
- idf is given by log2(N/n), where N is the total number of documents
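A minimal sketch of the idf formula above, using base-2 logarithms as on the slide. The function name `idf` and the collection sizes are illustrative assumptions.

```python
import math

def idf(N, n):
    """Inverse document frequency log2(N / n) for a term found in n of N documents."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return math.log2(N / n)

rare = idf(1024, 4)      # term appears in very few documents
common = idf(1024, 512)  # term appears in half of the collection
print(rare, common)
```

A term that occurs in every document gets an idf of 0, so it contributes nothing to discriminating between documents.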

- To discriminate terms
- We use tf × idf
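The product of the two factors can be sketched as follows, assuming raw term counts for tf and log2(N/n) for idf. The function name `tf_idf` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf * log2(N / n)} dict per tokenized document."""
    N = len(docs)
    # n[term] = number of documents that contain the term at least once
    n = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log2(N / n[t]) for t in tf})
    return weights

docs = [
    ["term", "weighting", "term"],
    ["term", "retrieval"],
    ["text", "retrieval"],
    ["text", "indexing"],
]
w = tf_idf(docs)
print(w[0])
```

In the first document, "term" occurs twice but appears in half of the collection, while "weighting" occurs once but is unique to that document, so both end up with the same weight.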

- The current “tf × idf” mechanism favors longer documents
- Introduce a normalizing factor in the weight to equalize the lengths of the documents.
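One common normalizing factor is cosine normalization, which divides each weight by the Euclidean length of the document's weight vector. The sketch below assumes that choice; the function name `cosine_normalize` and the two example vectors are hypothetical.

```python
import math

def cosine_normalize(weights):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # nothing to scale in an all-zero vector
    return {t: w / norm for t, w in weights.items()}

short_doc = {"term": 3.0, "weight": 4.0}
long_doc = {"term": 30.0, "weight": 40.0}  # same proportions, ten times the counts

print(cosine_normalize(short_doc))
print(cosine_normalize(long_doc))
```

After normalization the two vectors are the same, so the longer document no longer gains an advantage from its length alone.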

- Probabilistic model
- Term weight is the proportion of relevant documents in which the term occurs divided by the proportion of irrelevant documents in which the term occurs
- In the absence of relevance information, it is given by log((N-n)/n)
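A short sketch of the log((N-n)/n) weight, with N the total number of documents and n the number of documents containing the term. The slide does not state the base of the logarithm; the natural logarithm is assumed here, and the function name `prob_weight` is hypothetical.

```python
import math

def prob_weight(N, n):
    """Probabilistic collection frequency weight log((N - n) / n)."""
    if not 0 < n < N:
        raise ValueError("need 0 < n < N")
    return math.log((N - n) / n)

print(prob_weight(100, 10))  # rare term: positive weight
print(prob_weight(100, 50))  # term in exactly half of the documents: zero
print(prob_weight(100, 90))  # very common term: negative weight
```

Unlike log2(N/n), this weight becomes negative for terms that occur in more than half of the collection.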

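- Term frequency components
- b (binary), t (raw term frequency), n (augmented normalized term frequency)

- Collection frequency components
- x (no change), f (inverse document frequency), p (probabilistic inverse collection frequency)

- Normalization components
- x (no normalization), c (cosine normalization)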

- Which weighting system does tfc.nfx denote?
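One way to work through the question is to compute the weights directly. The sketch below assumes the letter codes follow the table in the Salton and Buckley paper: for term frequency, b = 1, t = raw tf, n = 0.5 + 0.5 tf / max tf; for collection frequency, x = 1, f = log2(N/n), p = log((N-n)/n); for normalization, x = none, c = cosine. It also assumes that the first triple applies to documents and the second to queries. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def weight_vector(tokens, doc_freq, N, scheme):
    """Weight a tokenized text with a three-letter scheme such as 'tfc' or 'nfx'."""
    tf_code, cf_code, norm_code = scheme
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    weights = {}
    for term, freq in tf.items():
        n = doc_freq.get(term, 0)
        if n == 0:
            continue  # term absent from the collection: skip it
        # Term frequency component: b, t, or n
        if tf_code == "b":
            t = 1.0
        elif tf_code == "t":
            t = float(freq)
        elif tf_code == "n":
            t = 0.5 + 0.5 * freq / max_tf
        else:
            raise ValueError("unknown term frequency code: " + tf_code)
        # Collection frequency component: x, f, or p
        if cf_code == "x":
            c = 1.0
        elif cf_code == "f":
            c = math.log2(N / n)
        elif cf_code == "p":
            if n >= N:
                raise ValueError("p is undefined when a term is in every document")
            c = math.log((N - n) / n)
        else:
            raise ValueError("unknown collection frequency code: " + cf_code)
        weights[term] = t * c
    # Normalization component: x or c
    if norm_code == "c":
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
    elif norm_code != "x":
        raise ValueError("unknown normalization code: " + norm_code)
    return weights

def similarity(doc_w, query_w):
    """Inner product of a document vector and a query vector."""
    return sum(w * query_w.get(term, 0.0) for term, w in doc_w.items())

docs = [
    ["term", "weighting", "term"],
    ["term", "retrieval"],
    ["text", "retrieval"],
    ["text", "indexing"],
]
N = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))

doc_vectors = [weight_vector(doc, doc_freq, N, "tfc") for doc in docs]  # documents: tfc
query_vector = weight_vector(["term", "weighting"], doc_freq, N, "nfx")  # query: nfx
scores = [similarity(dv, query_vector) for dv in doc_vectors]
print(scores)
```

Under these assumptions, tfc.nfx means that documents use raw tf × idf with cosine normalization, while queries use augmented normalized tf × idf without normalization.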

- Query vectors
- For tf
- Short queries: use n
- Long queries: use t

- For idf
- Use f

- For normalization
- Use x

- Document vectors
- For tf
- Technical vocabulary: use n
- More varied vocabulary: use t

- For idf
- Use f in general
- Documents from different domains: use x

- For normalization
- Documents of heterogeneous length: use c
- Documents of homogeneous length: use x

- Best document weighting: tfc, nfc (or tpc, npc)
- Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx)
- Questions?