Parallel and Distributed IR

Papers on Parallel and Distributed IR
Agenda
• Introduction
• Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [404]
• Paper B: Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, and Tim Shimmin [1998]
• Comparison and Conclusion
• URLs

Introduction
• The volume of online electronic text is growing exponentially.
• According to surveys, the publicly indexable web contained:
  • 350 million pages around July 1998
  • 800 million pages around July 1999
  • 1 billion pages around January 2000
• To manage this size and growth, we need scalable models, multitasking algorithms, and parallel and distributed IR.

Parallel and Distributed IR Comparison
• The computation models of Distributed IR and Parallel IR are very similar: both divide the main task into sub-tasks and execute the sub-tasks in parallel.
• The main difference is that in Distributed IR the sub-tasks run on different processing units, which communicate via network protocols rather than shared memory.
• Distributed IR uses a selection procedure to choose the subset of processes that receive a request, whereas Parallel IR broadcasts every request to every process.
• Paper A discusses two schemes for implementing Parallel IR.
• Paper B presents methodologies for Distributed IR.

Paper A: Inverted file partitioning schemes - Objective
• The goal of the paper is to reduce average response time by partitioning the inverted file.
• The paper identifies I/O time as a major cost factor in an IR system.
• It exploits I/O parallelism and balances the I/O workload by partitioning the inverted file and distributing it across disks, which improves response time.
• The paper discusses two partitioning schemes for inverted file systems.
Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [1995]

Paper A: Inverted file structure

Paper A: Inverted file partitioning schemes
• Partitioning based on term-ID
• Partitioning based on document-ID
• Scheme 1 (by term-ID): all postings for a term are stored on one disk.
• Scheme 2 (by document-ID): all postings for a document are stored on one disk, so the postings for a single term are distributed across disks.
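The two schemes can be illustrated with a small sketch. This is not the paper's implementation: the toy inverted file, the function names, and the round-robin (modulo) assignment of terms and documents to disks are all assumptions made for the example.

```python
from collections import defaultdict

# Toy inverted file (assumed data): term -> list of (doc_id, term_frequency).
INVERTED_FILE = {
    "disk": [(1, 2), (2, 1), (4, 3)],
    "index": [(1, 1), (3, 2)],
    "query": [(2, 4), (3, 1), (4, 1)],
}


def partition_by_term(inverted_file, num_disks):
    """Scheme 1: every posting of a term is stored on the same disk."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    # Assumption: term-IDs are positions in sorted order, assigned round-robin.
    for term_id, term in enumerate(sorted(inverted_file)):
        disks[term_id % num_disks][term].extend(inverted_file[term])
    return [dict(d) for d in disks]


def partition_by_document(inverted_file, num_disks):
    """Scheme 2: every posting of a document is stored on the same disk."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    for term, postings in inverted_file.items():
        for doc_id, tf in postings:
            # Assumption: documents are assigned to disks by doc_id modulo.
            disks[doc_id % num_disks][term].append((doc_id, tf))
    return [dict(d) for d in disks]


by_term = partition_by_term(INVERTED_FILE, 2)
by_doc = partition_by_document(INVERTED_FILE, 2)
```

With two disks, `by_term` keeps the whole list for "disk" on a single disk, while `by_doc` splits that list so that each disk holds only the postings of its own documents.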

Paper A: Partitioning schemes – Pictorial presentation

Paper A: Two schemes - comparison

Paper A: Two schemes performance comparison
Performance comparison under different parameters
• Query model: under a skewed query model, partitioning by document-ID performs better because its I/O load is more balanced. Under a uniform query model, partitioning by term-ID performs better.
• Query length: under a uniform query model, partitioning by term-ID is twice as fast for long queries and 5-10 times as fast for short queries.
• Number of disks: adding disks improves the performance of partitioning by document-ID at a higher rate, since its I/O load is more evenly distributed.
Conclusion: partitioning by term-ID performs better under uniform query models, but its response time fluctuates widely depending on the terms in the query. With partitioning by document-ID, response time varies little in almost all cases.
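The load-balance argument can be made concrete with a rough sketch. It counts only the postings each disk must read and ignores seek and per-access overhead, so it does not reproduce the paper's measurements. The posting-list lengths, the function names, and the assumption that document partitioning spreads each list evenly over the disks are all hypothetical.

```python
def disk_loads_by_term(posting_lengths, query, num_disks):
    """Postings read per disk when each term's whole list sits on one disk."""
    terms = sorted(posting_lengths)  # assumed term-ID order
    loads = [0] * num_disks
    for term in query:
        loads[terms.index(term) % num_disks] += posting_lengths[term]
    return loads


def disk_loads_by_document(posting_lengths, query, num_disks):
    """Postings read per disk when each list is spread evenly over all disks."""
    loads = [0] * num_disks
    for term in query:
        share, extra = divmod(posting_lengths[term], num_disks)
        for disk in range(num_disks):
            loads[disk] += share + (1 if disk < extra else 0)
    return loads


# Assumed posting-list lengths: one very frequent term and three rare ones.
POSTING_LENGTHS = {"a": 1000, "b": 40, "c": 30, "d": 20}
query = ["a", "b"]

term_loads = disk_loads_by_term(POSTING_LENGTHS, query, 4)
doc_loads = disk_loads_by_document(POSTING_LENGTHS, query, 4)
```

In this toy case the busiest disk reads 1000 postings under term-ID partitioning and 260 under document-ID partitioning. What the sketch leaves out is that document-ID partitioning touches every disk for every term, an overhead that matters when queries are uniform.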

Paper B: Methodologies for Distributed IR - Objective
• The paper appears in the proceedings of the 18th International Conference on Distributed Computing Systems, 1998.
• It discusses three methodologies for Distributed IR and compares their effectiveness, efficiency, and response time.
Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, and Tim Shimmin [1998]

Paper B: Methodologies for Distributed IR
• “Parallel Text Search Methods”, a 1988 paper by Salton and Buckley [701], comments on early implementations of Parallel IR and challenges their effectiveness and efficiency.
• In this paper, Moffat and Zobel conclude that Distributed IR can be fast and effective, but agree with Salton and Buckley that it is not efficient. (The coming slides show why it is not efficient.)

Paper B: Distributed IR Model
• Librarian: an individual node that holds its own sub-collection, maintains the index for that sub-collection, evaluates queries, and fetches documents.
• Receptionist: provides the user interface, sends user queries to all or a subset of the librarians, merges the results they return, and generates the final ranked result list using global information.
• After the receptionist's global ranking, many of the documents returned by the librarians may never be presented to the user. The resources spent computing similarity for those unwanted documents and transmitting them are wasted, which is why efficiency is low in the distributed model.
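A minimal sketch of the librarian and receptionist roles shows where the wasted work comes from. The class, the function, and the precomputed similarity scores are hypothetical. Query evaluation is stubbed out, so the `query` argument is passed along but not used.

```python
import heapq


class Librarian:
    """Holds one sub-collection and returns its local top-k results."""

    def __init__(self, name, scores):
        self.name = name
        # Assumption: doc_id -> similarity score, as if the query had
        # already been evaluated against this sub-collection's index.
        self.scores = scores

    def search(self, query, k):
        ranked = sorted(self.scores.items(), key=lambda item: item[1], reverse=True)
        return [(score, self.name, doc_id) for doc_id, score in ranked[:k]]


def receptionist(query, librarians, k):
    """Send the query to every librarian and merge the answers into a global top-k."""
    candidates = []
    for librarian in librarians:
        candidates.extend(librarian.search(query, k))
    top = heapq.nlargest(k, candidates)
    wasted = len(candidates) - len(top)  # scored and transmitted, never shown
    return top, wasted


librarians = [
    Librarian("A", {"a1": 0.9, "a2": 0.8, "a3": 0.1}),
    Librarian("B", {"b1": 0.7, "b2": 0.2, "b3": 0.05}),
]
top, wasted = receptionist("distributed retrieval", librarians, k=2)
```

Each librarian has to return k documents because it cannot know how its results compare with the others. Here four documents are scored and sent, but only two reach the user.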

Paper B: Distributed IR methodologies
• Three methodologies are defined according to the global information stored at the receptionist.
• Central Nothing (CN): the only global information maintained by the receptionist is a list of librarians.
• Central Vocabulary (CV): the receptionist stores the vocabularies of the sub-collections.
• Central Index (CI): the receptionist has full access to the indexes of the sub-collections.

Paper B: Central Nothing–Distributed IR
Global information: list of librarians
• Advantages:
  • Little or no storage space is required for global information at the receptionist.
  • Simple implementation.
• Disadvantages:
  • The receptionist has no basis for excluding any sub-collection, so every sub-collection processes the query in full.
  • Final ranking quality is poor. A term might be common in one sub-collection and be assigned a minimal weight there, even though it is rare in the collection as a whole. When results from different sub-collections are merged, there is no basis for a collection-wide ranking.
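The ranking problem can be seen by comparing local and collection-wide term weights. The sketch below assumes the common inverse document frequency form log(N / df). The weighting actually used in the paper is not given in these slides, and the sub-collection sizes and the example term are made up.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency, assumed here to be log(N / df)."""
    return math.log(num_docs / doc_freq)


# Assumed statistics for one term ("protein"):
# sub-collection -> (number of documents, documents containing the term).
SUB_COLLECTIONS = {
    "medical": (1000, 500),
    "news": (1000, 5),
    "legal": (1000, 5),
}

# Weight each librarian would compute from its own sub-collection only.
local_idf = {name: idf(n, df) for name, (n, df) in SUB_COLLECTIONS.items()}

# Weight computed over the collection as a whole.
total_docs = sum(n for n, _ in SUB_COLLECTIONS.values())
total_df = sum(df for _, df in SUB_COLLECTIONS.values())
global_idf = idf(total_docs, total_df)
```

The medical librarian gives the term a weight of about 0.69 and the news librarian about 5.3, while the collection-wide weight is about 1.77. Similarity scores built on such different local weights are not directly comparable when the receptionist merges them.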

Paper B: Central Vocabulary-Distributed IR
Global information: vocabularies of all sub-collections
• Advantages:
  • The receptionist can make a better choice of sub-collections for query distribution, and can skip sub-collections entirely if they contain none or few of the query terms.
  • Global ranking is better than in CN, because the receptionist can use the central vocabulary.
• Disadvantages:
  • More storage is required for storing the collection-wide vocabulary.
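Sub-collection selection with a central vocabulary might look like the following sketch. The function name, the `min_matches` threshold, and the example vocabularies are assumptions. The paper's actual selection rule is not described in these slides.

```python
def select_librarians(query_terms, vocabularies, min_matches=1):
    """Return the librarians whose vocabulary contains enough query terms."""
    query = set(query_terms)
    selected = []
    for name, vocabulary in vocabularies.items():
        # Skip sub-collections that contain too few of the query terms.
        if len(query & vocabulary) >= min_matches:
            selected.append(name)
    return selected


# Assumed vocabularies held by the receptionist, one per sub-collection.
VOCABULARIES = {
    "medical": {"protein", "enzyme", "trial"},
    "news": {"election", "trial", "market"},
    "legal": {"trial", "statute", "appeal"},
}

narrow = select_librarians(["enzyme", "protein"], VOCABULARIES)
broad = select_librarians(["trial"], VOCABULARIES)
strict = select_librarians(["trial", "appeal"], VOCABULARIES, min_matches=2)
```

A query on "enzyme" and "protein" goes to the medical librarian only, whereas "trial" occurs in every vocabulary and still has to be sent to all three.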

Paper B: Central Index–Distributed IR
The receptionist has full access to the indexes of the sub-collections.
• Advantages:
  • The receptionist can perform all index processing itself and request from the librarians the documents required to make the final ranking.
  • Better selection of librarians.
• Disadvantages:
  • More storage is required for storing the collection-wide vocabulary and index.
  • More preprocessing is required at the receptionist before it can request documents from the librarians.

Paper A & Paper B comparison - Conclusion

Paper A and Paper B - URLs
• Paper A:
  • Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski (IEEE Transactions on Parallel and Distributed Systems, Vol. 6, Feb. 1995)
  • http://csdl.computer.org/comp/trans/td/1995/02/l0142abs.htm
• Paper B:
  • Methodologies for Distributed Information Retrieval, by Owen de Kretser, Alistair Moffat, Tim Shimmin, and Justin Zobel (Proceedings of the 18th International Conference on Distributed Computing Systems)
  • http://csdl.computer.org/comp/proceedings/icdcs/1998/8292/00/82920066abs.htm
