
Set-Based Model: A New Approach for Information Retrieval



Presentation Transcript


  1. Set-Based Model: A New Approach for Information Retrieval • Bruno Pôssas, Nivio Ziviani, Wagner Meira Jr., Berthier Ribeiro-Neto • Department of Computer Science, Federal University of Minas Gerais, Brazil

  2. Introduction • Vector space model (VSM) • Query terms and documents are represented as weighted vectors in a vector space • Query answers are documents whose representative vectors have high similarity to the query vector • Term weighting scheme: TF x IDF LATIN - Lab for Treating Information -- Federal University of Minas Gerais, Brazil
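As a rough illustration of the VSM ranking described on this slide, the sketch below builds TF x IDF vectors and ranks documents by cosine similarity. The toy documents, the `log(N / df)` form of IDF, and the function names are assumptions made for the example, not details taken from the presentation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return ({term: tf * idf} per document, idf table) for tokenized docs."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection and query.
docs = [["a", "c", "t"], ["c", "d"], ["c", "d", "t"]]
vectors, idf = tfidf_vectors(docs)
query = {t: idf.get(t, 0.0) for t in ["a", "t"]}

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda j: cosine(query, vectors[j]), reverse=True)
```

Each term contributes to the score on its own here, which is the independence assumption the next slide questions.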

  3. Motivation • In VSM, index terms are assumed to be mutually independent • Linear weighting function • Not realistic but easy to compute • Our hypothesis: exploration of correlation among index terms might improve retrieval effectiveness

  4. Our Goal • Propose a new model for computing index term weights, based on set theory • Terms → sets of terms (termsets) • Correlation among index terms • High retrieval effectiveness while keeping computational costs small • Exploit the intuition that occurrences of related terms are often close to each other

  5. Related Work • Correlation among index terms • Raghavan and Yu (1979) • Rijsbergen (1977), Harper and Rijsbergen (1978) • Wong et al. (1985 and 1987) • Common limitations: • Expensive to compute dependency factors • Exhaustive application of term co-occurrences hurts overall effectiveness and performance • Association rule mining • Zaki (2000)

  6. Termsets • $T = \{t_1, t_2, \ldots, t_t\}$ is the set of $t$ unique terms of a collection of documents $D$. • An $n$-termset $s$ is an ordered set of $n$ terms such that $s \subseteq T$. • $ds$ is the frequency of a termset $s$. • $S$ is the set of $2^t$ unique termsets that may appear in a document (the power set of $T$).

  7. Termsets: Example • Collection $D$: $d_1$ = A C T, $d_2$ = C D, $d_3$ = C D T • $D = \{d_1, d_2, d_3\}$ • $T = \{A, C, D, T\}$ • $S = \{s_A, s_C, \ldots, s_{AC}, s_{AD}, \ldots, s_{ACDT}\}$ • $s_A = \{A\}$ (1-termset), $ds_A = 1$ • $s_{CD} = \{C, D\}$ (2-termset), $ds_{CD} = 2$ • $s_{CDT} = \{C, D, T\}$ (3-termset), $ds_{CDT} = 1$
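The frequencies in this example can be checked with a short sketch. It assumes the three documents contain the terms {A, C, T}, {C, D}, and {C, D, T} (the reading consistent with the document lists on slide 24) and counts a termset once per document that contains all of its terms. The function names are made up for the illustration.

```python
from itertools import combinations

# Assumed reading of the slide's collection: d1 = A C T, d2 = C D, d3 = C D T.
collection = {"d1": {"A", "C", "T"}, "d2": {"C", "D"}, "d3": {"C", "D", "T"}}

def termset_frequency(termset, docs):
    """Number of documents that contain every term of the termset."""
    return sum(1 for terms in docs.values() if set(termset) <= terms)

def all_termsets(vocabulary):
    """Every non-empty subset of the vocabulary, as sorted tuples."""
    terms = sorted(vocabulary)
    return [c for n in range(1, len(terms) + 1) for c in combinations(terms, n)]

vocabulary = set().union(*collection.values())
frequencies = {s: termset_frequency(s, collection) for s in all_termsets(vocabulary)}

print(frequencies[("A",)], frequencies[("C", "D")], frequencies[("C", "D", "T")])
```

With four terms there are 15 non-empty termsets, which is why exhaustive enumeration does not scale and the later slides restrict attention to closed termsets.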

  8. Termsets: Definitions • Frequent termset: a termset whose frequency is greater than or equal to a given minimal frequency. • Closed termset: a frequent termset that is the largest among all termsets occurring in the same set of documents, i.e., no proper superset of it occurs in exactly those documents. • Using closed termsets significantly reduces the number of termsets taken into consideration.

  9. Termsets: Example • Collection D: d1 = A C T, d2 = C D, d3 = C D T • Termset lattice with frequencies (figure legend: empty set, frequent termset, closed termset) • { } • A: 1, C: 3, D: 2, T: 2 • AC: 1, AT: 1, CD: 2, CT: 2, DT: 1 • ACT: 1, CDT: 1
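The lattice in this example can be reproduced by brute force. The sketch below assumes the same three documents ({A, C, T}, {C, D}, {C, D, T}) and a minimal frequency of 1, and treats a termset as closed when no proper superset occurs in exactly the same documents. It enumerates every subset, so it illustrates the definitions only and is not the pruned enumeration the authors describe later.

```python
from itertools import combinations

# Assumed collection: d1 = A C T, d2 = C D, d3 = C D T.
collection = {"d1": {"A", "C", "T"}, "d2": {"C", "D"}, "d3": {"C", "D", "T"}}

def frequent_termsets(docs, min_freq):
    """Map each frequent termset to the set of documents in which it occurs."""
    terms = sorted(set().union(*docs.values()))
    result = {}
    for n in range(1, len(terms) + 1):
        for combo in combinations(terms, n):
            occurrences = frozenset(d for d, t in docs.items() if set(combo) <= t)
            if len(occurrences) >= min_freq:
                result[frozenset(combo)] = occurrences
    return result

def closed_termsets(frequent):
    """Keep termsets with no proper superset occurring in the same documents."""
    return {
        s: occ
        for s, occ in frequent.items()
        if not any(s < other and occ == other_occ for other, other_occ in frequent.items())
    }

frequent = frequent_termsets(collection, min_freq=1)
closed = closed_termsets(frequent)

for s in sorted(closed, key=lambda x: (len(x), sorted(x))):
    print("".join(sorted(s)), len(closed[s]))
```

Under these assumptions 11 termsets are frequent, matching the counts on the slide, and 5 of them are closed.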

  10. Set-Based Model • Documents and queries are described by sets of closed termsets, instead of terms. • Closed termsets provide all elements of the TF x IDF scheme. • Computational cost is linear in the number of documents in the collection.

  11. Set-Based Model: Termset Weights • Extension of a TF x IDF scheme • $sf_{i,j}$: number of occurrences of $s_i$ in $d_j$ • $ds_i$: number of occurrences of $s_i$ in $D$ • $ids_i$: inverse frequency of occurrence of $s_i$ in $D$ • SBM reduces to VSM if only 1-termsets are considered
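The weighting formula on this slide was an image and is not in the transcript. As a hedged illustration only, a direct TF x IDF analogue built from the quantities listed on the slide might read as follows. Here $N$ (the number of documents in $D$) and the logarithmic form of $ids_i$ are assumptions borrowed from common IDF practice, not taken from the slide.

```latex
w_{i,j} = sf_{i,j} \times ids_i, \qquad ids_i = \log \frac{N}{ds_i}
```

Under this assumed form, a termset that occurs in every document gets weight zero, and restricting $s_i$ to 1-termsets gives an ordinary per-term TF x IDF weight.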

  12. sAT 1 d1 Q 2 sA sT d2 Set-Based Model: Similarity Calculation Normalization uses just terms instead of termsets LATIN - Lab for Treating Information -- Federal University of Minas Gerais, Brazil

  13. Set-Based Model: Query Mechanism • SBM Algorithm: • Obtain the 1-termsets from query terms; • Enumerate all closed termsets from 1-termsets; • Calculate similarities between query and documents using the closed termsets; • Normalize document similarities; • Select the k largest document similarities.
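The five steps above can be sketched end to end on a toy collection. Several details are assumptions made for the illustration: the documents ({A, C, T}, {C, D}, {C, D, T}), a binary termset frequency, an IDF-style weight `log(N / ds)`, and a brute-force closed-termset enumeration restricted to the query terms. None of these is confirmed by the slides, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Assumed toy collection.
collection = {"d1": {"A", "C", "T"}, "d2": {"C", "D"}, "d3": {"C", "D", "T"}}

def docs_containing(termset, docs):
    """Documents that contain every term of the termset."""
    return frozenset(d for d, terms in docs.items() if termset <= terms)

def closed_query_termsets(query_terms, docs, min_freq=1):
    """Steps 1-2: build termsets from the query terms and keep the closed ones."""
    terms = sorted(query_terms)
    frequent = {}
    for n in range(1, len(terms) + 1):
        for combo in combinations(terms, n):
            s = frozenset(combo)
            occ = docs_containing(s, docs)
            if len(occ) >= min_freq:
                frequent[s] = occ
    return {
        s: occ
        for s, occ in frequent.items()
        if not any(s < t and occ == t_occ for t, t_occ in frequent.items())
    }

def rank(query_terms, docs, k):
    n = len(docs)

    def ids(occ):
        # Assumed IDF-style weight of a termset occurring in len(occ) documents.
        return math.log(n / len(occ))

    scores = dict.fromkeys(docs, 0.0)
    # Step 3: add query weight times document weight per closed termset
    # (termset frequency assumed binary, so both weights equal ids).
    for s, occ in closed_query_termsets(query_terms, docs).items():
        for d in occ:
            scores[d] += ids(occ) ** 2
    # Step 4: normalize using single-term weights only.
    for d, terms in docs.items():
        norm = math.sqrt(sum(ids(docs_containing(frozenset([t]), docs)) ** 2 for t in terms))
        scores[d] = scores[d] / norm if norm else 0.0
    # Step 5: keep the k best-scoring documents.
    return sorted(scores, key=scores.get, reverse=True)[:k]

top = rank({"A", "T"}, collection, k=2)
print(top)
```

For the query {A, T}, the closed termsets are {T} and {A, T}. The document containing both terms receives credit for the 2-termset as well, which is where the model departs from a purely term-by-term score.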

  14. Experimental Results

  15. TReC-3: Recall x Precision

  16. Average Precision

  17. Average Precision at 10

  18. Computational Efficiency

  19. Conclusions and Future Work • SBM exploits correlations among index terms, improving retrieval effectiveness efficiently. • Future work: • Investigate the behavior of SBM when applied to larger collections. • Extend SBM to take into account the proximity information of index terms.

  20. Termsets: Complexity • Time complexity: [formula not preserved in the transcript] • Space complexity: [formula not preserved in the transcript] • Worst case: $O(r \cdot 2^l \cdot N)$ • $|q|$ = query size, $c$ = number of closed termsets, $N$ = number of documents, $r$ = number of maximal termsets, $l$ = length of the largest termset.

  21. TReC-3: Number of Closed Termsets • The average case is significantly smaller than the worst case.

  22. TReC-3: Minimal Frequency • Trade-off between precision, the number of termsets taken into consideration, and performance

  23. Termsets: Enumeration • An incremental algorithm that employs a very powerful pruning strategy. • Enumeration of (n+1)-termsets from n-termsets: take the union of every pair (si, sj) of n-termsets that share the same prefix. • Checking whether a frequent termset s is closed: check whether any current termsets have s as their closure; those that do are discarded.

  24. Termsets: Example • Collection D: d1 = A C T, d2 = C D, d3 = C D T • 1-termsets: $ls_A = \{d_1\}$, $ls_C = \{d_1, d_2, d_3\}$, $ls_D = \{d_2, d_3\}$, $ls_T = \{d_1, d_3\}$ • 2-termsets: $ls_{AC} = \{d_1\}$, $ls_{AT} = \{d_1\}$ • 3-termsets: $ls_{ACT} = \{d_1\}$ • Closed termset: $ls_{ACT} = \{d_1\}$
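A minimal sketch of the prefix-based enumeration, assuming the document lists shown in this example: (n+1)-termsets are formed by joining pairs of n-termsets that share their first n-1 terms, and the document list of the union is the intersection of the two lists. The closed-termset step here is a simple after-the-fact filter, standing in for the pruning described on the previous slide rather than reproducing it. All names are hypothetical.

```python
def enumerate_termsets(doc_lists, min_freq=1):
    """Level-wise enumeration: join termsets with a common prefix, intersect lists."""
    level = {(t,): docs for t, docs in doc_lists.items() if len(docs) >= min_freq}
    found = dict(level)
    while level:
        next_level = {}
        keys = sorted(level)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    break  # keys are sorted, so later ones cannot share the prefix
                docs = level[a] & level[b]
                if len(docs) >= min_freq:
                    next_level[a + (b[-1],)] = docs
        found.update(next_level)
        level = next_level
    return found

# Document lists of the 1-termsets, as given in the example.
one_termsets = {
    "A": frozenset({"d1"}),
    "C": frozenset({"d1", "d2", "d3"}),
    "D": frozenset({"d2", "d3"}),
    "T": frozenset({"d1", "d3"}),
}

found = enumerate_termsets(one_termsets)

# Closed termsets: no proper superset has an identical document list.
closed = {
    s for s, docs in found.items()
    if not any(set(s) < set(t) and docs == found[t] for t in found)
}
print(sorted(closed, key=lambda s: (len(s), s)))
```

In this run, A, AC, and AT all share the document list {d1} with ACT, so only ACT survives among them, in line with the closed termset shown in the example.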
