Papers on Parallel IR

niveditha

Presentation Transcript


  1. Papers on Parallel IR
     Agenda
     • Introduction
     • Paper 1: Inverted file partitioning schemes in multiple disk systems
     • Paper 2: Parallel search using partitioned inverted files
     • Comparison
     • Conclusion
     • URL links to the papers
     Parallel IR

  2. Parallel IR Introduction
     Parallelism in query processing involves:
     • Multitasking simultaneous queries
       – Each user query gets its own thread or process, which can execute on a CPU
       – The same thread or process completes the entire query
       – The system can handle multiple concurrent queries
     • Query partitioning
       – A single query is broken into sub tasks
       – Each sub task can run in parallel
       – Improves the response time of a single query

  3. Partitioning Query into Sub Tasks
     • IR deals with large amounts of data, so the data set can be partitioned between sub tasks
     • Document partitioning
       – Divides the documents among sub tasks, so that each sub task processes a subset of the documents
     • Term partitioning
       – Divides the index terms among sub tasks, so that the processing of each document is spread across sub tasks
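
The two ways of dividing the data can be sketched on a toy collection. This is only an illustration: the documents, the round-robin assignment, and the function names `document_partition` and `term_partition` are assumptions, not taken from either paper.

```python
from collections import defaultdict

# Toy collection: doc id -> index terms (made-up data for illustration).
DOCS = {
    0: ["parallel", "search"],
    1: ["inverted", "file", "search"],
    2: ["parallel", "inverted", "file"],
    3: ["disk", "search"],
}

def document_partition(docs, n_tasks):
    """Give each sub task a subset of whole documents (round-robin on doc id)."""
    parts = [dict() for _ in range(n_tasks)]
    for doc_id, terms in docs.items():
        parts[doc_id % n_tasks][doc_id] = terms
    return parts

def term_partition(docs, n_tasks):
    """Give each sub task a subset of index terms, each with all of its documents."""
    index = defaultdict(list)
    for doc_id, terms in sorted(docs.items()):
        for term in terms:
            index[term].append(doc_id)
    parts = [dict() for _ in range(n_tasks)]
    for i, term in enumerate(sorted(index)):  # round-robin over sorted terms
        parts[i % n_tasks][term] = index[term]
    return parts

doc_parts = document_partition(DOCS, 2)
term_parts = term_partition(DOCS, 2)
print(doc_parts)
print(term_parts)
```

With document partitioning every sub task sees complete documents; with term partitioning the terms of any one document end up on different sub tasks.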

  4. Theme of the Papers Being Presented
     • Both papers explore the issues and performance implications in parallel IR systems that use inverted indexes and employ either
       – A) Document partitioning
       – B) Index term partitioning
     • Paper 1: Inverted file partitioning schemes in multiple disk systems
     • Paper 2: Parallel search using partitioned inverted files

  5. P1: Inverted File Systems
     An inverted file system consists of:
     • Index file: an ordered list of all keywords that have been used to index a collection of documents. Each term has fields that give the location and the number of its postings in the posting file
     • Posting file: a group of records, each holding the weight of the term and a pointer to the actual document in the document file
     • Document file: the actual document records of the collection
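
The three files can be sketched in memory as follows. This is a minimal sketch under stated assumptions: the term weight is taken to be the raw term frequency, the "pointer" is a list position, and `build_inverted_file` and `lookup` are hypothetical names, not part of the paper.

```python
from collections import Counter, defaultdict

def build_inverted_file(documents):
    """Return (index_file, posting_file, document_file) for a list of texts."""
    document_file = list(documents)
    postings_by_term = defaultdict(list)
    for doc_id, text in enumerate(document_file):
        for term, freq in Counter(text.lower().split()).items():
            # Posting record: (weight, pointer to document). Weight = term
            # frequency here, which is an assumption for illustration.
            postings_by_term[term].append((freq, doc_id))

    posting_file = []
    index_file = {}  # filled in sorted term order
    for term in sorted(postings_by_term):
        records = postings_by_term[term]
        # Index entry: (location in posting file, number of postings).
        index_file[term] = (len(posting_file), len(records))
        posting_file.extend(records)
    return index_file, posting_file, document_file

def lookup(term, index_file, posting_file, document_file):
    """Follow index file -> posting file -> document file for one term."""
    offset, count = index_file.get(term, (0, 0))
    return [(weight, document_file[doc_id])
            for weight, doc_id in posting_file[offset:offset + count]]

index_file, posting_file, document_file = build_inverted_file(
    ["parallel search", "inverted file search", "parallel parallel disk"]
)
print(lookup("parallel", index_file, posting_file, document_file))
```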

  6. P1: Inverted File Systems (cont.)

  7. P1: Load Balancing
     In a multiple-CPU, multiple-disk system we need to:
     • Balance the load on the processors
       – Maximize CPU utilization
     • Balance the load on the I/O devices, i.e. the disk drives
       – Avoid I/O bottlenecks, which cause CPUs to go into wait states

  8. P1: Partitioning an Inverted File
     The paper explores two schemes:
     • Based on term id
     • Based on document id
     • Under both schemes the index file and the document file are partitioned the same way: the index file by index term id and the document file by document id
     • Each posting file record is associated with both a document id and an index term id. One scheme partitions the posting file by term id, the other by document id

  9. P1: Partitioning an Inverted File (cont.)

  10. P1: Objective of Partitioning the Inverted Index
      • Objective: maximize performance
      • Ideal: all I/O channels and disk drives are equally used when the sub tasks of a query execute in parallel
      • However, data usage changes from query to query and cannot be predicted, so the ideal cannot be reached in practice
      • The paper recognizes that I/O is a major cost factor in IR

  11. A Brief Comparison

  12. A Brief Comparison (cont.)
      • The main difference is the I/O characteristic: the sub task for a single query index term spreads its disk I/O across multiple disks under DocumentId partitioning, whereas under TermId partitioning its I/O is limited to one disk
      • Which is better? It is a tradeoff
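
The difference in I/O pattern can be checked on a handful of posting records. This is a sketch under assumptions: records are modelled as `(term_id, doc_id, weight)` tuples with made-up values, and records are placed on disks by a simple modulo of the chosen id, which may differ from the placement used in the paper.

```python
# Posting records as (term_id, doc_id, weight); all values are made up.
POSTINGS = [
    (0, 0, 0.5), (0, 1, 0.2), (0, 2, 0.9), (0, 3, 0.4),
    (1, 1, 0.7), (1, 3, 0.1),
    (2, 0, 0.3), (2, 2, 0.6),
]

def partition_postings(postings, n_disks, scheme):
    """Place each posting record on a disk by term id or by document id."""
    if scheme not in ("term", "doc"):
        raise ValueError("scheme must be 'term' or 'doc'")
    field = 0 if scheme == "term" else 1
    disks = [[] for _ in range(n_disks)]
    for record in postings:
        disks[record[field] % n_disks].append(record)
    return disks

def disks_touched(disks, term_id):
    """Disks that must be read to fetch every posting of one term."""
    return [d for d, records in enumerate(disks)
            if any(r[0] == term_id for r in records)]

by_term = partition_postings(POSTINGS, 4, "term")
by_doc = partition_postings(POSTINGS, 4, "doc")
print(disks_touched(by_term, 0))  # [0]
print(disks_touched(by_doc, 0))   # [0, 1, 2, 3]
```

Under TermId all postings of term 0 sit on one disk, so one sub task issues one sequential read; under DocumentId the same term needs a read on every disk, which parallelizes the I/O but occupies more devices per term.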

  13. P1: Simulation Model
      To compare the two schemes, the paper defines a simulation model with the following factors:
      a) Collection database model: follows the natural language text distribution given by Zipf's law, where 20% of the index terms account for 80% of the posting entries. The model skews this ratio to observe the effect on query performance
      b) User query model: the paper uses two cases, a skewed query model, in which some low-rank terms are requested frequently, and a uniform query model, in which all terms have the same probability

  14. P1: Simulation Model (cont.)
      c) Queuing model: concurrent I/O requests on the same device are queued in priority order. CPU usage requests on the same CPU are also queued
      d) Workload model: vary the number of disks and CPUs
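
The 80/20 figure in the collection model can be related to a Zipf-like distribution with a short calculation. This is a sketch, assuming posting counts proportional to 1/rank^s; the exponent `s` as the skew knob and the names `zipf_weights` and `top_share` are assumptions, not the parameterization used in the paper.

```python
def zipf_weights(n_terms, s=1.0):
    """Normalized weights proportional to 1 / rank**s (s = 0 is uniform)."""
    raw = [1.0 / rank ** s for rank in range(1, n_terms + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def top_share(weights, fraction=0.2):
    """Share of all postings held by the top `fraction` of terms."""
    k = int(len(weights) * fraction)
    return sum(sorted(weights, reverse=True)[:k])

for s in (0.0, 0.5, 1.0, 1.5):
    share = top_share(zipf_weights(1000, s))
    print(f"s={s}: top 20% of terms hold {share:.0%} of postings")
```

With 1000 terms and s = 1 the top 20% of terms hold a little under 80% of the postings, and with s = 0 they hold exactly 20%, so varying `s` is one way to move between the uniform and the skewed cases.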

  15. Simulation Results
      • Increasing the number of disks, up to a threshold, improves performance by decreasing the response time
      • When the index term and query term distributions are not skewed, the partitioning scheme based on term id performed best
      • When the data was skewed, the partitioning scheme based on document id performed best. With skewed data (80/20) and TermId partitioning, the disks that hold the heavily used 20% of terms become bottlenecks
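
The bottleneck effect under skew can be illustrated with an expected-load calculation. This is a deliberately simplified sketch, not the paper's simulation: it assumes that a term's I/O cost is its query probability times its posting count, that TermId places a term on disk `term_id % n_disks`, and that DocumentId spreads every term's postings evenly over all disks. It ignores seek overhead and queuing, so it does not reproduce the result that TermId wins in the unskewed case.

```python
def zipf_probs(n_terms):
    """Query probabilities proportional to 1 / rank."""
    raw = [1.0 / rank for rank in range(1, n_terms + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def loads_term_id(query_prob, posting_count, n_disks):
    """Expected postings read per disk when a term lives on exactly one disk."""
    loads = [0.0] * n_disks
    for term_id, (p, n) in enumerate(zip(query_prob, posting_count)):
        loads[term_id % n_disks] += p * n
    return loads

def loads_doc_id(query_prob, posting_count, n_disks):
    """Expected postings read per disk when each term is spread over all disks."""
    total = sum(p * n for p, n in zip(query_prob, posting_count))
    return [total / n_disks] * n_disks

def imbalance(loads):
    """Busiest disk relative to the average disk (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))

N_TERMS, N_DISKS = 100, 4
skewed_queries = zipf_probs(N_TERMS)
skewed_postings = [10000 // rank for rank in range(1, N_TERMS + 1)]

term_imbalance = imbalance(loads_term_id(skewed_queries, skewed_postings, N_DISKS))
doc_imbalance = imbalance(loads_doc_id(skewed_queries, skewed_postings, N_DISKS))
print(f"TermId imbalance:     {term_imbalance:.2f}")
print(f"DocumentId imbalance: {doc_imbalance:.2f}")
```

Under these assumptions the disk holding the most popular term carries well over its share of the load with TermId, while DocumentId stays balanced.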

  16. Paper 2: Positioning w.r.t. Paper 1
      • The thrust of Paper 1's approach was to partition user queries by index term, with each index term query becoming a sub task. The objective then became to optimize the individual sub task, whose biggest bottleneck is I/O
      • What if the user query has only one index term? Your disks are optimized, but your CPUs are idle
      • Paper 2 recognizes that most user queries are single term only. Why?

  17. P2: Search Topology Framework
      • P2 proposes a different framework:

  18. P2: Search Topology (cont.)
      • Top node: accepts a query from the client, distributes it to all of its child nodes, and awaits the results
      • Leaf node: looks after only ONE PARTITION of the inverted file
      Each leaf node and the top node have a processor each. Within this framework the paper's objective is to evaluate which type of inverted index partitioning is better: DocId based or TermId based.
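
The top node and leaf node roles can be sketched as a scatter-gather over partitions. This is an illustration only: threads stand in for the per-node processors, scores are plain sums of term weights, and the `LeafNode` and `TopNode` classes are hypothetical names, not the PLIERS implementation.

```python
from concurrent.futures import ThreadPoolExecutor

class LeafNode:
    """Holds exactly one partition of the inverted file: term -> {doc_id: weight}."""

    def __init__(self, partition):
        self.partition = partition

    def search(self, terms):
        scores = {}
        for term in terms:
            for doc_id, weight in self.partition.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
        return scores

class TopNode:
    """Sends the query to every leaf, waits for all results, and merges them."""

    def __init__(self, leaves):
        self.leaves = leaves

    def search(self, terms, k=10):
        # One worker per leaf; each thread stands in for a leaf's processor.
        with ThreadPoolExecutor(max_workers=len(self.leaves)) as pool:
            partials = list(pool.map(lambda leaf: leaf.search(terms), self.leaves))
        merged = {}
        for partial in partials:
            for doc_id, score in partial.items():
                merged[doc_id] = merged.get(doc_id, 0.0) + score
        # Highest score first; ties broken by doc id for a stable order.
        return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:k]

# DocId-style partitions: each leaf indexes a disjoint set of documents.
top = TopNode([
    LeafNode({"parallel": {0: 1.0}, "search": {0: 0.5}}),
    LeafNode({"search": {1: 2.0}}),
])
results = top.search(["parallel", "search"])
print(results)
```

Because the merge step adds scores per document, the same top node works whether the leaves hold disjoint documents (DocId) or disjoint terms (TermId).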

  19. P2: Approach
      • The paper uses real web collections instead of simulations for its experiments
      • The PLIERS system is run on an AP3000 machine with 8 to 12 nodes
      • The data ranges from BASE1 (1 GB) to BASE10 (10 GB) of the VLC2 collection
      • Queries are based on topics 351 to 400 of the TREC-7 ad hoc track
      • Title-only and whole-topic queries are used
      • Both DocId and TermId index partitioning are used
      • Bottom line: real data instead of simulation

  20. P2: Summary of Results
      Within the framework of the experiment:
      • DocId partitioning is better than TermId partitioning in a multiprocessor environment
      • The TermId approach imposes too much communication overhead between the leaves and the top node, because the final result for a given document depends on the results from each leaf node
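
One way to see where the extra communication comes from is to count the result entries each leaf has to send to the top node. This is a sketch under assumptions that are not taken from the paper: under DocId a leaf holds complete scores for its documents and sends only its local top k, while under TermId a leaf holds only partial scores and sends every candidate, since any of them may rank highly once the other leaves' contributions are added. All names and weights are made up.

```python
def local_scores(partition, terms):
    """Scores a leaf can compute from its own partition: term -> {doc_id: weight}."""
    scores = {}
    for term in terms:
        for doc_id, weight in partition.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores

def entries_sent(partitions, terms, k, scheme):
    """Total (doc_id, score) entries shipped from all leaves to the top node."""
    sent = 0
    for partition in partitions:
        candidates = len(local_scores(partition, terms))
        # DocId: local scores are final, so the local top k is enough.
        # TermId: local scores are partial, so all candidates are sent.
        sent += min(k, candidates) if scheme == "doc" else candidates
    return sent

# Two terms that both occur in six documents (made-up weights).
WEIGHTS = {
    "a": {d: 1.0 + d for d in range(6)},
    "b": {d: 6.0 - d for d in range(6)},
}
# DocId: leaf i holds the documents with doc_id % 2 == i, for every term.
doc_parts = [
    {t: {d: w for d, w in docs.items() if d % 2 == leaf} for t, docs in WEIGHTS.items()}
    for leaf in range(2)
]
# TermId: each leaf holds all postings of one term.
term_parts = [{"a": WEIGHTS["a"]}, {"b": WEIGHTS["b"]}]

print(entries_sent(doc_parts, ["a", "b"], 2, "doc"))    # 4
print(entries_sent(term_parts, ["a", "b"], 2, "term"))  # 12
```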

  21. Comparison

  22. Conclusion
      Taken together, these two papers highlight how processor and I/O utilization are affected by the way an inverted index is partitioned under the DocumentId and TermId schemes

  23. URL Links to Paper
      Paper 1: Inverted file partitioning schemes in multiple disk systems
      Byeong-Soo Jeong; Omiecinski, E. IEEE Transactions on Parallel and Distributed Systems, Volume 6, Issue 2, Feb 1995
      http://ieeexplore.ieee.org/iel4/71/8001/00342125.pdf?isNumber=8001&prod=IEEE+JNL&arnumber=342125&arSt=142&ared=153&arAuthor=Byeong-Soo+Jeong%3B+Omiecinski%2C+E.%3B
      Paper 2: Parallel search using partitioned inverted files
      MacFarlane, A.; McCann, J.A.; Robertson, S.E. Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000), 2000
      http://ieeexplore.ieee.org/iel5/7055/19010/00878197.pdf?isNumber=19010&prod=IEEE+CNF&arnumber=878197&arSt=209&ared=220&arAuthor=MacFarlane%2C+A.%3B+McCann%2C+J.A.%3B+Robertson%2C+S.E.%3B