
Federated text retrieval from uncooperative overlapped collections

Milad Shokouhi, RMIT University, Melbourne, Australia. Justin Zobel, RMIT University, Melbourne, Australia. SIGIR 2007 (Collection representation in distributed IR). 2009-03-13


Presentation Transcript


  1. Federated text retrieval from uncooperative overlapped collections Milad Shokouhi, RMIT University, Melbourne, Australia Justin Zobel, RMIT University, Melbourne, Australia SIGIR 2007 (Collection representation in distributed IR) 2009-03-13 Presented by JongHeum Yeon, IDS Lab., Seoul National University

  2. Abstract (diagram labels: User, Broker, Collections) • Federated information retrieval (FIR) • Sends the query to multiple collections • A central broker merges the results and ranks them • Collections may contain duplicated documents • The final results can therefore contain a large number of duplicates • The authors propose a method for estimating the rate of overlap among collections based on sampling • Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results

  3. Federated Information Retrieval (FIR) • Query is sent simultaneously to several collections • Each collection evaluates the query and returns the results to the broker • Advantage • No need to access the index of the collections • Search over the latest version of documents without crawling and indexing • Broker selects collections that are most likely to return relevant documents • Collection selection problem • Collection representation problem • Result merging problem
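The broker flow on this slide can be illustrated with a minimal sketch. It assumes each collection is reachable only through a search function that returns (document id, score) pairs; the names `broker_search` and `SearchFn` are hypothetical, and the naive score-based merge stands in for a real result merging method, since scores from different collections are generally not comparable.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a collection is modelled as a function query -> [(doc_id, score), ...].
SearchFn = Callable[[str], List[Tuple[str, float]]]


def broker_search(query: str,
                  collections: Dict[str, SearchFn],
                  k: int = 10) -> List[Tuple[str, float, str]]:
    """Send the query to every selected collection and merge the answers."""
    merged = []
    for name, search in collections.items():
        # The broker never touches the collection's index, only its results.
        for doc_id, score in search(query):
            merged.append((doc_id, score, name))
    # Naive merge: sort by the score each collection reported (highest first).
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged[:k]


# Two toy collections that both hold document "d1".
def news_a(query: str) -> List[Tuple[str, float]]:
    return [("d1", 0.9), ("d2", 0.5)]


def news_b(query: str) -> List[Tuple[str, float]]:
    return [("d1", 0.8), ("d3", 0.4)]


top = broker_search("budget", {"a": news_a, "b": news_b}, k=3)
print([doc_id for doc_id, _, _ in top])
```

In this toy run the shared document appears twice in the top three, which is the duplicate problem the rest of the talk addresses.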

  4. Collection Selection Problem • FIR techniques assume that the degree of overlap among collections is either none or negligible • However, many collections have a significant degree of overlap • Bibliographic databases • News resources • Selecting collections that are likely to return the same results introduces duplicate documents into the final results • Wastes costly resources • Degrades search effectiveness • The authors propose … • A method that estimates the degree of overlap among collections by sampling from each collection using random queries • Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results

  5. Related Work • Cooperative collection selection techniques • Collections provide the broker with their index statistics and other useful information • CORI, GlOSS, CVV • Uncooperative collection selection techniques • Collections do not provide their index statistics to the broker • The broker samples documents from each collection • ReDDE uses sampled documents for … • Estimates the number of relevant documents in collections • Ranks collections according to the number of highly ranked sampled documents

  6. Overlap Estimation (diagram labels: C1, C2, K, S1, S2) • Uses the documents downloaded by query-based sampling to estimate the rate of overlap; no additional information is required • Formula annotations (formulas not shown): the expected number of documents; m, a subset of the sample documents; the size of m; the probability that any given document from m1 is also in m2

  7. Overlap Estimation (cont’d) P(i) follows a binomial distribution

  8. Overlap Estimation (cont’d) • Binomial theorem • Expected number of documents in m1 ∩ m2 • The number of overlap documents is independent of the collection size
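The sampling idea on slides 6 to 8 can be sketched numerically. This is a simplified illustration and not necessarily the authors' estimator: it assumes both samples are drawn uniformly at random, reads K as the number of documents shared by C1 and C2, and uses a hypothetical function name. Under those assumptions, a document of S1 lies in the overlap with probability K/|C1| and an overlap document appears in S2 with probability |S2|/|C2|, so the expected number of duplicates between the samples is |S1||S2|K / (|C1||C2|). Solving for K gives the estimate below.

```python
def estimate_overlap(sample1, sample2, size1, size2):
    """Estimate K = |C1 ∩ C2| from two samples.

    Assumptions (not from the slides): samples are uniform random draws,
    documents are compared by id, and the collection sizes size1 and size2
    are known or have been estimated separately.
    """
    s1, s2 = set(sample1), set(sample2)
    if not s1 or not s2:
        return 0.0
    duplicates = len(s1 & s2)  # documents seen in both samples
    # Invert E[duplicates] = |S1| * |S2| * K / (|C1| * |C2|).
    return duplicates * size1 * size2 / (len(s1) * len(s2))


# Toy example: 2 duplicates between two 4-document samples.
k_hat = estimate_overlap([1, 2, 3, 4], [3, 4, 5, 6], size1=100, size2=200)
print(k_hat)  # 2 * 100 * 200 / (4 * 4) = 2500.0
```

Slide 14 reports that query-based samples are not random, so an estimate of this kind would be biased in practice.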

  9. The ‘RELAX’ Selection Method Graph G = {(u, v) | vertices u and v are collections; an edge indicates overlapping documents between its two vertices} Output: a final merged document list with as few duplicates as possible
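The algorithm itself is not reproduced in this transcript, so the following is only a rough sketch of one way an overlap graph could drive selection: a plain greedy heuristic, not the authors' RELAX procedure. It assumes an estimated number of relevant documents per collection and an estimated number of shared relevant documents per edge; the function name and both inputs are hypothetical.

```python
def greedy_unique_selection(relevant, overlap, n):
    """Pick up to n collections, discounting candidates by overlap with chosen ones.

    relevant: {collection: estimated number of relevant documents}
    overlap:  {(a, b): estimated number of relevant documents shared by a and b}
              (an undirected edge; either key order is accepted)
    """
    remaining = dict(relevant)
    selected = []
    while remaining and len(selected) < n:
        # Take the collection expected to add the most new relevant documents.
        best = max(remaining, key=remaining.get)
        selected.append(best)
        del remaining[best]
        # Reduce every other candidate by what it shares with the pick.
        for other in remaining:
            shared = overlap.get((best, other), overlap.get((other, best), 0))
            remaining[other] = max(0, remaining[other] - shared)
    return selected


picked = greedy_unique_selection(
    relevant={"A": 100, "B": 90, "C": 40},
    overlap={("A", "B"): 80},
    n=2,
)
print(picked)  # ['A', 'C']
```

Here B looks strong on its own but shares most of its relevant documents with A, so the smaller collection C is chosen second. Subtracting pairwise overlaps can over-discount when three or more collections share the same documents; the sketch ignores that.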

  10. The ‘RELAX’ Selection Method (cont’d)

  11. Overlap Filtering for ReDDE • F-ReDDE • The overlaps among collections are estimated as described for the Relax selection • Collections are ranked using a resource selection algorithm such as ReDDE • Each collection is compared with the previously selected collections, and it is removed from the list if it has a high overlap (greater than γ) with any of them • The authors empirically choose γ = 30% and leave methods for finding the optimum value as future work
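The filtering step described on this slide can be sketched as follows. The ranked list is assumed to come from a resource selection algorithm such as ReDDE, and the overlap rates are assumed to be fractions in [0, 1] stored per pair; the function name, the dictionary layout, and the optional cut-off `n` are hypothetical. Only the threshold γ = 30% comes from the slide.

```python
def filter_by_overlap(ranked, overlap_rate, gamma=0.30, n=None):
    """Drop collections that overlap too much with a higher-ranked selected one.

    ranked:       collection names, best first (e.g. a ReDDE ranking)
    overlap_rate: {(a, b): estimated overlap fraction between a and b}
    gamma:        overlap threshold; 0.30 is the value quoted on the slide
    n:            optional number of collections to keep
    """
    selected = []
    for candidate in ranked:
        too_similar = any(
            max(overlap_rate.get((candidate, s), 0.0),
                overlap_rate.get((s, candidate), 0.0)) > gamma
            for s in selected
        )
        if not too_similar:
            selected.append(candidate)
        if n is not None and len(selected) == n:
            break
    return selected


kept = filter_by_overlap(
    ranked=["A", "B", "C", "D"],
    overlap_rate={("A", "B"): 0.6, ("C", "D"): 0.2},
)
print(kept)  # ['A', 'C', 'D']
```

B is removed because its estimated overlap with the already selected A exceeds γ, while D stays because its overlap with C is below the threshold.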

  12. Testbeds • The authors create three new testbeds with overlapping collections based on the documents available in the TREC GOV dataset • Qprobed-280 • The 360 most frequent queries of a search engine in the .gov domain • A random number of documents (between 5000 and 20000) is downloaded for each collection • Generates 280 collections with an average size of 12194 documents • Qprobed-300 • Every twentieth collection is merged into a single large collection • Sliding-115 • Uses a sliding window of 30 000 documents • Generates 112 collections

  13. Testbeds (cont’d) • Qprobed-280 • 74492 collection pairs have < 10% overlap • 79 pairs have > 90% overlap • 1.1% of collection pairs have > 50% overlap • Qprobed-300 • 1.9% of collection pairs have > 50% overlap • Sliding-115 • 2.5% of collection pairs have > 50% overlap

  14. Results • The initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated • Document retrieval models are biased towards returning some popular documents for many queries • Samples produced by query-based sampling are not random

  15. Results (cont’d)

  16. Results (cont’d)

  17. Conclusion & Discussion • Pros • Proposes an efficient algorithm for handling duplicates • Cons • Experiments show improved performance, but would it hold in a practical environment?
