Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel

Review by Newton Alex 993940942. Problem: searching over collections of data that include many different crawls and versions of each page, e.g. the Internet Archive or email archives.


Presentation Transcript


  1. Efficient Search in Large Textual Collections with Redundancy, by Jiangong Zhang and Torsten Suel. Review by Newton Alex 993940942

  2. Problem
     • Searching over collections of data that include many different crawls and versions of each page, e.g. the Internet Archive or email archives.
     • Full-text search over such collections is not feasible because of the high cost of processing a query. For example, applying current indexing and query processing techniques to, say, 10 successive crawls of the same URL results in index sizes and query processing costs roughly 10 times those of a single crawl.

  3. Proposed Solution
     • A new and general framework that significantly reduces inverted index size and query processing cost for web page collections with redundancy.
     • Features
       • Content-dependent partitioning techniques, in particular winnowing.
       • Non-redundant indexing, with two policies: local sharing and global sharing.
       • A modified document-at-a-time query processing algorithm that takes advantage of the fragment-based indexes.
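
The features on this slide can be illustrated with a toy sketch. It is not the paper's implementation: the function names, the CRC32 hash, and the parameters `k` (tokens per hash) and `w` (window size) are all assumptions made for illustration. Each version is cut wherever a sliding window of hashes has its minimum (a simplified winnowing-style rule), and each distinct fragment is indexed only once. The slide does not define local and global sharing in detail, so the sketch simply shares fragments among whatever versions it is given. The final lookup is a plain per-version check, not the modified document-at-a-time algorithm.

```python
import zlib
from collections import defaultdict


def token_hashes(tokens, k):
    """Hash every run of k consecutive tokens (deterministically, via CRC32)."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]


def winnow_cuts(tokens, k=2, w=4):
    """In every window of w hashes, cut before the position holding the
    minimum hash (rightmost minimum on ties). Returns sorted cut positions."""
    hashes = token_hashes(tokens, k)
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        cuts.add(start + offset)
    cuts.discard(0)  # position 0 is already a fragment boundary
    return sorted(cuts)


def partition(tokens, k=2, w=4):
    """Split a token list into fragments at the cut positions."""
    bounds = [0] + winnow_cuts(tokens, k, w) + [len(tokens)]
    return [tuple(tokens[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]


def build_index(versions, k=2, w=4):
    """Index each distinct fragment once; a version is a list of fragment ids."""
    fragment_ids = {}            # fragment content -> fragment id
    postings = defaultdict(set)  # term -> ids of fragments containing it
    version_fragments = []       # per version: ordered fragment ids
    for text in versions:
        ids = []
        for frag in partition(text.split(), k, w):
            if frag not in fragment_ids:  # only new content is indexed
                fragment_ids[frag] = len(fragment_ids)
                for term in frag:
                    postings[term].add(fragment_ids[frag])
            ids.append(fragment_ids[frag])
        version_fragments.append(ids)
    return fragment_ids, postings, version_fragments


def matching_versions(query_terms, postings, version_fragments):
    """Versions in which every query term occurs in at least one fragment."""
    hits = []
    for version, ids in enumerate(version_fragments):
        present = set(ids)
        if all(postings.get(term, set()) & present for term in query_terms):
            hits.append(version)
    return hits
```

Because cut positions depend only on nearby content, an edit near the end of a page leaves the fragments at the start unchanged, so a new version adds postings only for the fragments it actually changed. Query terms may fall in different fragments of the same version, which is why the lookup maps fragments back to versions before deciding on a match.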

  4. Critique
     • The paper does not describe the data structures used or the hardware setup in detail.
     • The framework supports deleting old, unused fragments. Why is deletion required when we are interested in versioned systems?
     • Since no duplicate fragments are maintained, deleting a fragment might also remove content that other pages in the archive share.
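
The last concern can be made concrete with a small sketch. The `FragmentStore` class below is hypothetical and is not described in the paper; it only shows one way a store with shared fragments could avoid the problem, by counting how many pages reference each fragment and freeing a fragment only when that count reaches zero.

```python
from collections import Counter


class FragmentStore:
    """Toy store in which pages share fragments by id (hypothetical design)."""

    def __init__(self):
        self.pages = {}            # page name -> list of fragment ids
        self.refcount = Counter()  # fragment id -> number of referencing pages

    def add_page(self, name, fragment_ids):
        """Register a new page; each distinct fragment counts once per page."""
        self.pages[name] = list(fragment_ids)
        self.refcount.update(set(fragment_ids))

    def delete_page(self, name):
        """Drop a page; return only the fragment ids no other page still uses."""
        freed = []
        for frag in set(self.pages.pop(name)):
            self.refcount[frag] -= 1
            if self.refcount[frag] == 0:
                del self.refcount[frag]
                freed.append(frag)
        return sorted(freed)


store = FragmentStore()
store.add_page("a", [1, 2, 3])
store.add_page("b", [2, 3, 4])
freed_by_a = store.delete_page("a")  # fragments 2 and 3 survive: "b" uses them
```

Without such bookkeeping, deleting page "a" would also remove fragments 2 and 3 from under page "b", which is the situation the critique describes.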

  5. Relation to Course
     • This paper is similar to the Google News paper. However, it does not describe the data structures or the environment setup in detail.
     • It relates to concepts used in the search engine project, such as inverted indexes and query matching.
     • It proposes methods for creating efficient indexes for redundant data.
