
Shakespeare's Plays: Search for Brutus and Caesar excluding Calpurnia

This project focuses on developing an inverted index to efficiently search for plays of Shakespeare that contain the words "Brutus" and "Caesar" but do not include "Calpurnia". The inverted index allows for quick retrieval of relevant documents.

deboer

Presentation Transcript


  1. CS122B: Projects in Databases and Web Applications
     Winter 2017
     Professor Chen Li
     Department of Computer Science, UC Irvine
     Notes 06: Inverted Index
     Slides borrowed from Prof. Manning at Stanford

  2. Query
     • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
     • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
       • Slow (for large corpora)
       • NOT Calpurnia is non-trivial
       • Other operations (e.g., find the word Romans near countrymen) not feasible
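To make the cost of the grep-style approach concrete, here is a minimal sketch of a linear scan. The function name `naive_search` and the three toy "plays" are hypothetical, not part of the slides; the point is only that every query re-reads the entire corpus.

```python
def naive_search(plays, must_have, must_not_have):
    """Scan every play's full text; the cost grows with total corpus size."""
    hits = []
    for title, text in plays.items():
        words = set(text.lower().split())
        has_all = all(w in words for w in must_have)
        has_none = not any(w in words for w in must_not_have)
        if has_all and has_none:
            hits.append(title)
    return hits


# Hypothetical toy corpus (not real Shakespeare text).
plays = {
    "Play A": "brutus and caesar meet calpurnia",
    "Play B": "brutus speaks of caesar",
    "Play C": "caesar alone",
}

result = naive_search(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)  # ['Play B']
```

Note that this sketch tests whole plays rather than lines, which sidesteps the "strip out lines" problem but not the need to rescan all text for each query.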

  3. Inverted index
     • For each term T, we must store a list of all documents that contain T.
     • Do we use an array or a list for this?
     Figure (postings as docIDs):
       Brutus → 2 4 8 16 32 64 128
       Calpurnia → 1 2 3 5 8 13 21 34
       Caesar → 13 16
     What happens if the word Caesar is added to document 14?
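One way to picture the "Caesar is added to document 14" question with array-backed postings is the sketch below. It assumes the figure's postings are Brutus → 2 4 8 16 32 64 128, Calpurnia → 1 2 3 5 8 13 21 34 and Caesar → 13 16; the helper `add_posting` is a hypothetical name. With a sorted array, the new docID has to be placed in the middle and later entries shifted.

```python
import bisect

# Postings as sorted Python lists (arrays); values assumed from the figure.
index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "caesar": [13, 16],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings, keeping them sorted and unique."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)  # shifts every later entry: O(n)


add_posting(index, "caesar", 14)
print(index["caesar"])  # [13, 14, 16]
```

Finding the position is a binary search, but the insertion itself is linear in the length of the array, which is the cost the slide is hinting at.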

  4. Inverted index
     • Linked lists generally preferred to arrays
       • Dynamic space allocation
       • Insertion of terms into documents easy
       • Space overhead of pointers
     Figure: each Dictionary entry points to its Postings:
       Brutus → 2 4 8 16 32 64 128
       Calpurnia → 1 2 3 5 8 13 21 34
       Caesar → 13 16

  5. Inverted index construction
     Documents to be indexed: Friends, Romans, countrymen.
     → Tokenizer → Token stream: Friends Romans Countrymen
     → Linguistic modules → Modified tokens: friend roman countryman
     → Indexer → Inverted index:
       friend → 2 4
       roman → 1 2
       countryman → 13 16
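The tokenizer → linguistic modules → indexer pipeline might look roughly like this. All names (`tokenize`, `normalize`, `build_index`) and the three sample documents are assumptions for illustration, and `normalize` is a deliberately crude stand-in for real linguistic modules, not an actual stemmer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Tokenizer: split raw text into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Toy linguistic module: lowercase, then two naive suffix rules."""
    token = token.lower()
    if token.endswith("men"):
        return token[:-3] + "man"  # countrymen -> countryman
    if token.endswith("s") and len(token) > 3:
        return token[:-1]  # friends -> friend, romans -> roman
    return token


def build_index(docs):
    """Indexer: map each normalized term to a sorted list of docIDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending docID keeps postings sorted
        for token in tokenize(docs[doc_id]):
            postings = index[normalize(token)]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(index)


# Hypothetical documents keyed by docID.
docs = {
    1: "Romans go home.",
    2: "Friends, Romans, countrymen.",
    4: "A friend in need.",
}

index = build_index(docs)
print(index["friend"])      # [2, 4]
print(index["roman"])       # [1, 2]
print(index["countryman"])  # [2]
```

Processing documents in increasing docID order means each postings list comes out sorted without a separate sort step.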

  6. Query processing
     • Consider processing the query: Brutus AND Caesar
       • Locate Brutus in the Dictionary;
         • Retrieve its postings.
       • Locate Caesar in the Dictionary;
         • Retrieve its postings.
       • “Merge” the two postings:
     Brutus → 2 4 8 16 32 64 128
     Caesar → 1 2 3 5 8 13 21 34

  7. The merge
     • Walk through the two postings simultaneously, in time linear in the total number of postings entries
     Brutus → 2 4 8 16 32 64 128
     Caesar → 1 2 3 5 8 13 21 34
     Result → 2 8
     If the list lengths are x and y, the merge takes O(x+y) operations.
     Crucial: postings sorted by docID.
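The two-pointer walk described on this slide can be sketched as below. The function name `intersect` is an assumption, and the postings are taken to be Brutus → 2 4 8 16 32 64 128 and Caesar → 1 2 3 5 8 13 21 34, as in the figure for this query.

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists; return docIDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot match anything later in p2
        else:
            j += 1  # p2[j] cannot match anything later in p1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times. The skipping logic is only valid because both lists are sorted by docID.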

  8. Boolean queries: Exact match
     • Boolean queries are queries using AND, OR and NOT together with query terms
       • Views each document as a set of words
       • Is precise: document matches condition or not.
     • Primary commercial retrieval tool for 3 decades.
     • Professional searchers (e.g., lawyers) still like Boolean queries:
       • You know exactly what you’re getting.
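As a rough illustration of the set view, AND, OR and NOT map onto set intersection, union and difference. This sketch swaps in Python's hash-based `set` in place of the sorted-list merge, and assumes the three-term postings from the earlier inverted index figure (Caesar → 13 16); it answers the opening query, Brutus AND Caesar but NOT Calpurnia.

```python
# Postings as sets of docIDs; values assumed from the earlier figure.
postings = {
    "brutus": {2, 4, 8, 16, 32, 64, 128},
    "calpurnia": {1, 2, 3, 5, 8, 13, 21, 34},
    "caesar": {13, 16},
}

# AND -> intersection (&), OR -> union (|), AND NOT -> difference (-)
brutus_and_caesar = postings["brutus"] & postings["caesar"]
result = brutus_and_caesar - postings["calpurnia"]
either = postings["brutus"] | postings["caesar"]

print(sorted(result))  # [16]
print(sorted(either))  # [2, 4, 8, 13, 16, 32, 64, 128]
```

Writing NOT as a difference against an already-computed result avoids materializing the complement of Calpurnia's postings over the whole collection. Every document is either in the result set or not, which is the exact-match behavior the slide describes.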

  9. Beyond term search
     • What about phrases?
     • Proximity: Find Gates NEAR Microsoft.
       • Need index to capture position information in docs. More later.
     • Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
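A positional index is one way to capture the position information that proximity queries need. The sketch below is an assumption about how that could look, not the design the course later presents: `build_positional_index` and `near` are hypothetical names, the documents are invented, and "NEAR" is taken to mean "within k token positions".

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {docID -> [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for pos, token in enumerate(docs[doc_id].lower().split()):
            index[token][doc_id].append(pos)
    return index


def near(index, t1, t2, k):
    """Return docIDs where t1 and t2 occur within k positions of each other."""
    docs1 = index.get(t1, {})
    docs2 = index.get(t2, {})
    hits = []
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        # Simple quadratic check; fine for a sketch with short position lists.
        if any(abs(a - b) <= k for a in docs1[doc_id] for b in docs2[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical documents keyed by docID.
docs = {
    1: "gates founded microsoft",
    2: "gates opened while microsoft stock rose sharply",
    3: "microsoft is far far far far away from the garden gates",
}

index = build_positional_index(docs)
print(near(index, "gates", "microsoft", 2))  # [1]
print(near(index, "gates", "microsoft", 3))  # [1, 2]
```

Storing positions makes the index larger than a docID-only index, which is the cost of supporting phrase and proximity queries.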

  10. Other Challenges
      • Stemming
      • Tokenization
        • Especially hard for non-Latin languages
        • E.g., Chinese, Japanese
      • Stop words
      • Synonyms
