
20-760 Web-based Information Architectures



Presentation Transcript


  1. 20-760 Web-based Information Architectures: How to Construct an Inverted List

  2. Parsing & Indexing: Overview • Tasks • Build a set of indices • inverted list, idf, document id, normalized tf, word positions, … • Speed (example) • On a PC with a 750 MHz CPU and 256 MB of memory, a C++ program that builds indices without positions takes 46-56 seconds on the 50 MB HTML collection (the cleaned-up collection is 30 MB) • Expect a few seconds for your Java program on the Reuters-1000 collection • Memory • 1-5% of the size of the total uncompressed documents • E.g. 128 MB RAM for 2 GB of text

  3. Document Parsing: sample document

  4. Document Parsing • Read the corpus file “reut2-1000.plain” • Identify the document boundaries • <REUTERS ID="document id"> • Process each document to extract: • Document ID • Segment the text into tokens • e.g. Apple, REUTERS, U.S. … • In our case, split the text on whitespace and newlines • Case conversion (make all tokens lowercase) • Discard stopwords and other non-content words (e.g. numbers) • Word stemming • Count term frequencies, record positions • Update indices • Write the index out to a file in alphabetical order by term, from a to z
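
The per-document steps above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the course's reference solution: the class name `ParseSketch`, the tiny stopword list, and the number pattern are all hypothetical, stemming is left out, and discarded tokens are assumed to still advance the position counter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ParseSketch {
    // Hypothetical stopword list; a real one would be loaded from a file.
    static final Set<String> STOPWORDS = Set.of("the", "of", "a", "in", "and");

    /** Maps each kept term to the 1-based positions where it occurs in the text. */
    static Map<String, List<Integer>> parse(String text) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int position = 0;
        for (String raw : text.trim().split("\\s+")) { // split on whitespace and newlines
            if (raw.isEmpty()) continue;               // only happens for empty input
            position++;                                // assumption: discarded tokens still count
            String token = raw.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(token)) continue;   // discard stopwords
            if (token.matches("\\d+([.,]\\d+)*")) continue; // discard plain numbers
            // Stemming would go here; it is omitted from this sketch.
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Prints {price=[2], apple=[4, 7], stock=[5], juice=[8]}
        System.out.println(parse("The price of Apple stock and apple juice"));
    }
}
```

The term frequency of a term in the document is the size of its position list, so the two do not need to be stored separately while parsing.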

  5. Data Structure • You can use whatever you like, but a hashtable is simple to implement • Hashtable • Java provides such classes in java.util • Perl has hashes as a datatype, e.g. %words • C++ provides an associated list in the Standard Template Library (STL). The template class is called map; it keeps keys in sorted order and is typically implemented as a balanced binary search tree (a red-black tree) rather than a hash table or B-tree. The hash-based counterpart, unordered_map, was added in C++11. • You can also implement your own hashtable (see Ch. 13 of “Information Retrieval: Data Structures & Algorithms”, edited by William B. Frakes and Ricardo Baeza-Yates) • Searching is fast, O(1) on average, but the keys cannot be scanned in sorted order directly • B-tree and B+ tree (see section 2.3 of the above book for details)
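
Because the index has to be written out in alphabetical order, the trade-off between hash lookup and ordered scanning matters in practice. One possible approach, sketched below with a hypothetical class name `MapChoiceSketch`, is to count in a `java.util.HashMap` and copy into a `java.util.TreeMap` only when sorted output is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapChoiceSketch {
    /** Returns the keys of an unordered map in ascending (alphabetical) order. */
    static List<String> sortedTerms(Map<String, Integer> counts) {
        // TreeMap keeps its keys sorted, so its key set iterates in order.
        return new ArrayList<>(new TreeMap<>(counts).keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>(); // fast lookup, no ordering guarantee
        for (String term : new String[] {"pear", "apple", "fig", "apple"}) {
            counts.merge(term, 1, Integer::sum);       // increment the term's count
        }
        System.out.println(sortedTerms(counts));       // prints [apple, fig, pear]
    }
}
```

Using a `TreeMap` from the start is also reasonable; lookups then cost O(log n) instead of O(1) on average, but no final sort is needed.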

  6. Associated List • An associated list (also called an associative array) is a data structure that holds a list of pairs. Each pair is composed of a key and a value. The value can be a complex data structure. • In our case: Key/value -> Term / Associated posting list • Access: given a key, you want to retrieve the associated value quickly. • There are many ways of implementing an associated list: Hash, B-tree, Array
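
As an illustration of a term / posting list pairing, here is a minimal Java sketch. The names `PostingListSketch`, `Posting`, and `add` are hypothetical, and the sketch assumes documents are indexed one at a time, so the posting for the current document is always the last one in a term's list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PostingListSketch {
    /** One entry of a posting list: a document and the term's positions in it. */
    static class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();
        Posting(int docId) { this.docId = docId; }
        int tf() { return positions.size(); } // term frequency in this document
    }

    // Key = term, value = posting list. TreeMap keeps terms in alphabetical order.
    final Map<String, List<Posting>> index = new TreeMap<>();

    /** Records one occurrence of a term at a position in a document. */
    void add(String term, int docId, int position) {
        List<Posting> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
        // Assumption: documents arrive one at a time, so only the last posting can match.
        if (postings.isEmpty() || postings.get(postings.size() - 1).docId != docId) {
            postings.add(new Posting(docId));
        }
        postings.get(postings.size() - 1).positions.add(position);
    }
}
```

The number of postings in a term's list is the document count n that the idf formula needs.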

  7. Hashtable • A hashtable provides insertion of and access to the associated value in constant time on average • A hashtable uses a hash function to map the key to the address where the associated value is stored: Hash(key) → value
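
To make the key-to-address mapping concrete, the following is a small sketch of a hashtable with separate chaining. `TinyHashTable` and `bucketOf` are hypothetical names; the table has a fixed number of buckets and never resizes, so the constant-time behavior only holds while the buckets stay short.

```java
import java.util.ArrayList;
import java.util.List;

class TinyHashTable {
    private static class Entry {
        final String key;
        int value;
        Entry(String key, int value) { this.key = key; this.value = value; }
    }

    private final List<List<Entry>> buckets = new ArrayList<>();

    TinyHashTable(int size) {
        for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
    }

    /** The hash function: maps a key to the index of the bucket that stores it. */
    int bucketOf(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    void put(String key, int value) {
        List<Entry> bucket = buckets.get(bucketOf(key));
        for (Entry e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        bucket.add(new Entry(key, value));
    }

    /** Returns the stored value, or null if the key is absent. */
    Integer get(String key) {
        for (Entry e : buckets.get(bucketOf(key))) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

In an assignment like this one, `java.util.HashMap` already does all of this, including resizing; the sketch only shows what happens underneath.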

  8. Indices • Format • <term> <idf> <doc id>:<normalized tf>:<tf>:<positions> • positions are separated by commas • IDF(t) = log2(N/n), where N is the number of documents in the whole collection and n is the number of documents that contain the term t • TFnorm = TF/TFmax • Sample
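
The two weights and the output format can be sketched in Java as below. Several details are assumptions rather than part of the slide: TFmax is taken to be the highest term frequency of any term in the same document, numbers are printed with four decimal places, and the names `WeightSketch`, `idf`, `normalizedTf`, and `formatEntry` are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class WeightSketch {
    /** IDF(t) = log2(N / n); Java has no log2, so divide natural logs. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm) / Math.log(2.0);
    }

    /** TFnorm = TF / TFmax, with TFmax assumed to be the document's highest tf. */
    static double normalizedTf(int tf, int maxTfInDoc) {
        return (double) tf / maxTfInDoc;
    }

    /** Formats one entry as <term> <idf> <doc id>:<normalized tf>:<tf>:<positions>. */
    static String formatEntry(String term, double idf, int docId,
                              double normTf, List<Integer> positions) {
        String joined = positions.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",")); // positions separated by commas
        return String.format(Locale.ROOT, "%s %.4f %d:%.4f:%d:%s",
                term, idf, docId, normTf, positions.size(), joined);
    }

    public static void main(String[] args) {
        // A term found in 250 of 1000 documents has an idf of about 2.
        double idf = idf(1000, 250);
        System.out.println(formatEntry("apple", idf, 17, normalizedTf(2, 4), List.of(4, 7)));
    }
}
```

The cast to `double` in both methods matters: with integer division, N/n would be truncated before the logarithm is taken.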

  9. Stopword Recognition • There are usually fewer than 500 stopwords • Some systems have very few • Every word token is checked, so the test should be very fast • Store the stopword list in a hash table • Since stopword lists evolve slowly, calculate a perfect hash code • Look up each word token in the hash table • If found, the token is a stopword, so discard it • Document length & word locations should count stopwords • Example: “Library of Congress” has a length of 3; the word locations are 1, 2, 3
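
The “Library of Congress” example can be checked with a short Java sketch. It uses an ordinary `java.util.Set` for the lookup instead of a perfect hash, which is a simpler substitute and not what the slide recommends for production systems; the class name `StopwordSketch` and the five-word stopword list are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopwordSketch {
    // Hypothetical stopword list; real lists usually hold a few hundred words.
    static final Set<String> STOPWORDS = Set.of("of", "the", "a", "an", "and");

    static boolean isStopword(String token) {
        return STOPWORDS.contains(token.toLowerCase(Locale.ROOT));
    }

    /** Document length counts every token, stopwords included. */
    static int documentLength(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Maps each 1-based location to the content word kept at that location. */
    static Map<Integer, String> contentWords(String text) {
        Map<Integer, String> kept = new LinkedHashMap<>();
        int location = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            location++;                        // stopwords still consume a location
            if (!isStopword(token)) kept.put(location, token.toLowerCase(Locale.ROOT));
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(documentLength("Library of Congress")); // prints 3
        System.out.println(contentWords("Library of Congress"));   // prints {1=library, 3=congress}
    }
}
```

Keeping location 2 reserved for the discarded “of” is what allows a later phrase or proximity match to tell that “library” and “congress” were one word apart.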

  10. Good Luck! • Due at 7:00 pm on July 19.
