Parallel and Distributed IR

Papers on Parallel and Distributed IR

Agenda


  • Introduction

  • Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems by Byeong-Soo Jeong and Edward Omiecinski [404]

  • Paper B: Methodologies for Distributed Information Retrieval by Alistair Moffat, Justin Zobel, Owen de Kretser, Tim Shimmin [1998]

  • Comparison and Conclusion

  • URLs

Introduction


  • The volume of online electronic text is growing exponentially.

  • According to surveys, the publicly indexable web contained approximately:

    • 350 million pages ~ July 98

    • 800 million pages ~ July 99

    • 1 billion pages ~ January 00.

  • Managing this size and growth requires scalable models and multitasking algorithms, which is the motivation for parallel and distributed IR.

Parallel and Distributed IR Comparison

  • The computation models of Distributed IR and Parallel IR are very similar: both divide the main task into sub-tasks and execute the sub-tasks in parallel.

  • The main difference is that in Distributed IR the sub-tasks run on different processing units, and interprocess communication goes over network protocols rather than shared memory.

  • Distributed IR employs a procedure to select the subset of processes that receive a request, whereas Parallel IR broadcasts every request to every process.

    • Paper A discusses two schemes for implementing Parallel IR.

    • Paper B gives methodologies for Distributed IR.

Paper A: Inverted file partitioning schemes - Objective

  • The goal of the paper is to reduce average response time by partitioning the inverted file.

  • The paper identifies I/O time as a major cost factor in an IR system.

  • It exploits I/O parallelism and balances the I/O workload, by partitioning and distributing the inverted file across disks, to achieve better response time.

  • The paper discusses two partitioning schemes for inverted file systems.

Inverted file partitioning schemes in Multiple Disk Systems

By Byeong-Soo Jeong and Edward Omiecinski [1995]

Paper A: Inverted file structure

Paper A: Inverted file partitioning schemes

  • The inverted file can be partitioned in two ways:

    • Based on term-id

    • Based on document-id

  • Scheme 1 (term-id): all postings for a term are stored on one disk.

  • Scheme 2 (document-id): all postings for a document are stored on one disk, so the postings for a single term are distributed across disks.
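
The two schemes can be illustrated with a minimal Python sketch. The toy index, the modulo assignment (`id % num_disks`), and the function names are assumptions made for illustration only; the slides do not describe the allocation strategy the paper actually uses.

```python
from collections import defaultdict

# Toy inverted file (hypothetical data): term_id -> postings list of doc_ids.
INDEX = {
    0: [1, 2, 3, 4, 5, 6],
    1: [2, 4],
    2: [1, 5, 6],
}

def partition_by_term(index, num_disks):
    """Scheme 1: the whole postings list of a term lives on one disk."""
    disks = [dict() for _ in range(num_disks)]
    for term_id, postings in index.items():
        # Assumed placement rule: disk chosen by term_id modulo disk count.
        disks[term_id % num_disks][term_id] = list(postings)
    return disks

def partition_by_doc(index, num_disks):
    """Scheme 2: each posting goes to the disk that owns its document,
    so one term's postings list is spread across the disks."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    for term_id, postings in index.items():
        for doc_id in postings:
            # Assumed placement rule: disk chosen by doc_id modulo disk count.
            disks[doc_id % num_disks][term_id].append(doc_id)
    return [dict(d) for d in disks]

by_term = partition_by_term(INDEX, 2)
by_doc = partition_by_doc(INDEX, 2)
```

With two disks, `by_term` keeps the long postings list of term 0 entirely on disk 0, while `by_doc` splits that same list between both disks.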

Paper A: Partitioning schemes – Pictorial presentation

Paper A: Two schemes - comparison

Paper A: Two schemes performance comparison

Performance comparison under different parameters

  • Query model:

    Under a skewed query model, partition by document-ID performs better because its I/O load is more evenly balanced. Under a uniform query model, partition by term-ID performs better.

  • Query length:

    Under a uniform query environment, partition by term-ID is twice as fast for long queries and 5-10 times as fast for short queries.

  • Number of disks:

    Adding disks improves the performance of the partition by document-ID scheme at a higher rate, since its I/O load is more evenly distributed.

Conclusion: Partition by term-ID performs better under uniform query models, but its response time fluctuates widely depending on the terms in the query. Partition by document-ID shows little variation in response time in almost all cases.
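
The load-balance argument can be made concrete with a rough sketch that counts only how many postings each disk must read for one query. The postings-list lengths, the modulo placement, and the function names are all hypothetical. The model also ignores seek and per-access overheads, so it illustrates the imbalance behind the fluctuating response time of term-ID partitioning but says nothing about the uniform-query results reported in the paper.

```python
def disk_loads_by_term(postings_len, query, num_disks):
    """Postings read per disk when each term's whole list sits on
    disk term_id % num_disks (assumed placement rule)."""
    loads = [0] * num_disks
    for term_id in query:
        loads[term_id % num_disks] += postings_len[term_id]
    return loads

def disk_loads_by_doc(postings_len, query, num_disks):
    """Postings read per disk when each list is spread as evenly as
    possible over all disks (idealised document-ID partitioning)."""
    loads = [0] * num_disks
    for term_id in query:
        share, extra = divmod(postings_len[term_id], num_disks)
        for d in range(num_disks):
            loads[d] += share + (1 if d < extra else 0)
    return loads

# Hypothetical postings-list lengths: term 0 is very frequent.
POSTINGS_LEN = {0: 9000, 1: 300, 2: 150, 3: 50}
query = [0, 2]

term_loads = disk_loads_by_term(POSTINGS_LEN, query, 4)
doc_loads = disk_loads_by_doc(POSTINGS_LEN, query, 4)
```

Here `term_loads` is `[9000, 0, 150, 0]` and `doc_loads` is `[2288, 2288, 2287, 2287]`: the busiest disk does roughly four times as much reading under term-ID partitioning for this particular query.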

Paper B: Methodologies for Distributed IR - Objective

  • This paper appears in the proceedings of the 18th International Conference on Distributed Computing Systems, 1998.

  • This paper discusses three different methodologies for Distributed IR and compares their effectiveness, efficiency and response time.

Methodologies for Distributed Information Retrieval

By Alistair Moffat, Justin Zobel, Owen de Kretser, Tim Shimmin [1998]

Paper B: Methodologies for Distributed IR

  • “Parallel Text Search Methods”, a 1988 paper by Salton and Buckley [701], makes interesting comments about early implementations of Parallel IR, challenging their effectiveness and efficiency.

  • Moffat and Zobel, in this paper, conclude that Distributed IR can be fast and effective, but agree with Salton and Buckley that it is not efficient.

    [The coming slides show why it is not efficient.]

Paper B: Distributed IR Model

  • Librarian – an individual node that holds its own sub-collection, maintains the index for that sub-collection, evaluates queries, and fetches documents.

  • Receptionist – provides the user interface, posts user queries to all or a subset of the librarians, merges the results from the librarians, and generates the final ranked result list using global information.

  • After global ranking by the receptionist, many of the documents returned by the librarians may never be presented to the user. The resources spent computing similarity for those unwanted documents and transmitting them are wasted, which is why efficiency is low in the distributed model.
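
The librarian/receptionist interaction, and the wasted work it causes, can be sketched as follows. The class names follow the slide's terminology, but the similarity measure (a plain count of query-term occurrences), the sample documents, and the method names are assumptions for illustration, not the paper's implementation.

```python
import heapq

class Librarian:
    """Holds one sub-collection and ranks its own documents."""
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc_id -> list of terms

    def search(self, query_terms, k):
        # Hypothetical similarity: total occurrences of query terms in the document.
        scored = []
        for doc_id, terms in self.docs.items():
            score = sum(terms.count(t) for t in query_terms)
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)

class Receptionist:
    """Broadcasts a query to every librarian and merges the local rankings."""
    def __init__(self, librarians):
        self.librarians = librarians

    def search(self, query_terms, k):
        candidates = []
        for lib in self.librarians:
            candidates.extend(lib.search(query_terms, k))
        final = heapq.nlargest(k, candidates)
        # Documents that were scored and transmitted but never shown to the user.
        wasted = len(candidates) - len(final)
        return final, wasted

lib_a = Librarian("A", {"a1": ["ir", "ir", "disk"], "a2": ["ir"]})
lib_b = Librarian("B", {"b1": ["ir", "ir", "ir"], "b2": ["disk"], "b3": ["ir", "disk"]})
final, wasted = Receptionist([lib_a, lib_b]).search(["ir"], k=2)
```

Each librarian returns its own top 2, so four documents reach the receptionist but only two survive the merge; the other two are the wasted effort the slide describes.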

Paper B: Distributed IR methodologies

  • Three different methodologies are defined based on the global information stored at the receptionist.

    • Central Nothing – CN

      The only global information maintained by the receptionist is a list of librarians.

    • Central Vocabulary – CV

      The global information stored by the receptionist is the vocabularies of the sub-collections.

    • Central Index – CI

      The receptionist has full access to the indexes of the sub-collections.

Paper B: Central Nothing–Distributed IR

Global Information: List of librarians

  • Advantage:

    • Little or no storage space is required for global information at receptionist.

    • Simple implementation.

  • Disadvantage:

    • The receptionist has no basis for excluding any sub-collection, so every librarian processes the query in full.

    • Final ranking quality is poor. A term might be common in one sub-collection and be assigned a minimal weight there, yet be rare in the context of the collection as a whole. When results from different sub-collections are merged, there is no basis for ranking them collection-wide.
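
A small worked example shows how far a local term weight can drift from the collection-wide one. It assumes the common inverse document frequency form idf = ln(N / df); the slides do not state which weighting formula the paper uses, and the collection sizes below are invented.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency, assumed form: ln(N / df)."""
    return math.log(num_docs / doc_freq)

# Hypothetical statistics for one term:
# sub-collection -> (number of documents, documents containing the term)
sub_collections = {
    "A": (1000, 900),     # the term is very common in A
    "B": (100000, 50),
    "C": (100000, 50),
}

# Weight each librarian would compute from its own statistics only.
local_idf = {name: idf(n, df) for name, (n, df) in sub_collections.items()}

# Weight computed from statistics of the collection as a whole.
total_docs = sum(n for n, _ in sub_collections.values())
total_df = sum(df for _, df in sub_collections.values())
global_idf = idf(total_docs, total_df)
```

Librarian A assigns the term a weight of about 0.11, while the collection-wide weight is about 5.3, so scores computed independently by A are not comparable with scores from B or C.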

Paper B: Central Vocabulary-Distributed IR

Global Information: Vocabularies of all sub-collections.

  • Advantage:

    • The receptionist can make a better choice of sub-collections for query distribution, and sub-collections that contain none or few of the query terms can be avoided completely.

    • Global ranking is better than in CN, because the receptionist can use the central vocabulary.

  • Disadvantage:

    • More storage is required for storing the collection-wide vocabulary.
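
A minimal sketch of how a central vocabulary could drive both librarian selection and global weighting. It assumes the receptionist stores, for each librarian, the document count and a per-term document frequency; that layout, the `min_matches` threshold, the idf form ln(N / df), and all names and numbers are hypothetical.

```python
import math

# Hypothetical central vocabulary:
# librarian -> (number of documents, {term: document frequency})
CENTRAL_VOCAB = {
    "lib_a": (1000, {"parallel": 400, "disk": 250}),
    "lib_b": (2000, {"retrieval": 900, "parallel": 20}),
    "lib_c": (500, {"protein": 300}),
}

def select_librarians(vocab, query_terms, min_matches=1):
    """Skip sub-collections whose vocabulary holds too few query terms."""
    selected = []
    for name, (_, dfs) in vocab.items():
        matches = sum(1 for t in query_terms if t in dfs)
        if matches >= min_matches:
            selected.append(name)
    return selected

def global_idf(vocab, term):
    """Collection-wide weight from merged statistics (assumed form ln(N / df))."""
    total_docs = sum(n for n, _ in vocab.values())
    total_df = sum(dfs.get(term, 0) for _, dfs in vocab.values())
    return math.log(total_docs / total_df) if total_df else 0.0

chosen = select_librarians(CENTRAL_VOCAB, ["parallel", "retrieval"])
weight = global_idf(CENTRAL_VOCAB, "parallel")
```

For this query `lib_c` is never contacted, and the weight for "parallel" is computed once from collection-wide counts instead of separately by each librarian.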

Paper B: Central Index–Distributed IR

Global Information: Receptionist has full access to the indexes of the sub-collections.

  • Advantage:

    • The receptionist can perform all index processing itself and request from the librarians the documents required to make the final ranking.

    • Better selection of librarians.

  • Disadvantage:

    • More storage is required for storing the collection-wide vocabulary and index.

    • More preprocessing is required at the receptionist before it can request documents from the librarians.
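
The Central Index flow can be sketched as: rank at the receptionist from the central index, then ask each librarian only for the documents that made the final list. The index layout, the scoring (a sum of term frequencies), and the function names are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical central index: term -> list of (librarian, doc_id, term frequency).
CENTRAL_INDEX = {
    "parallel": [("lib_a", 1, 3), ("lib_b", 7, 1)],
    "retrieval": [("lib_a", 1, 1), ("lib_b", 7, 2), ("lib_b", 9, 5)],
}

def rank_centrally(index, query_terms, k):
    """Score every document at the receptionist (assumed score: summed tf)."""
    scores = defaultdict(int)
    for term in query_terms:
        for librarian, doc_id, tf in index.get(term, []):
            scores[(librarian, doc_id)] += tf
    # Highest score first; ties broken by (librarian, doc_id) for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

def fetch_plan(ranked):
    """Group the final documents by the librarian that must supply them."""
    plan = defaultdict(list)
    for (librarian, doc_id), _ in ranked:
        plan[librarian].append(doc_id)
    return dict(plan)

top = rank_centrally(CENTRAL_INDEX, ["parallel", "retrieval"], k=2)
plan = fetch_plan(top)
```

Only documents in the final ranking are requested, which avoids the wasted transmissions of the other methodologies at the cost of the extra storage and processing at the receptionist.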

Paper A & Paper B comparison - Conclusion

Paper A and Paper B - URLs

  • Paper A:

    • Inverted File Partitioning Schemes in Multiple Disk Systems by Byeong-Soo Jeong, Edward Omiecinski. (IEEE Transactions on Parallel and Distributed Systems, Vol. 6, Feb. 1995)


  • Paper B:

    • Methodologies for Distributed Information Retrieval by Owen de Kretser, Alistair Moffat, Tim Shimmin, Justin Zobel. (Proceedings of the 18th International Conference on Distributed Computing Systems, 1998)

