IR Indexing



  1. IR Indexing Thanks to B. Arms (SIMS); Baldi, Frasconi, Smyth; Manning, Raghavan, Schutze

  2. What we have covered • What is IR • Evaluation • Tokenization, terms and properties of text • Web crawling; robots.txt • Vector model of documents • Bag of words • Queries • Measures of similarity • This presentation • Indexing • Inverted files

  3. Last time • Vector models of documents and queries • Bag of words model • Similarity measures • Text similarity typically used • Similarity is a measure of relevance (and ranking) • Query is a small document • Match query to document • All stored and indexed before a query is matched.

  4. Summary: What’s the point of using vector spaces? • A well-formed algebraic space for retrieval • Key: A user’s query can be viewed as a (very) short document. • Query becomes a vector in the same space as the docs. • Can measure each doc’s proximity to it. • Natural measure of scores/ranking – no longer Boolean. • Queries are expressed as bags of words • Other similarity measures: see http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey
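A minimal Python sketch of this proximity measure: cosine similarity over bag-of-words dicts, treating the query as a short document. The weights here are raw term counts for illustration; a real engine would use tf-idf.

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

doc = {"caesar": 2, "brutus": 1, "noble": 1}
query = {"brutus": 1, "caesar": 1}     # the query is just a very short document
print(round(cosine(query, doc), 3))    # 0.866 -- a graded score, not Boolean
```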

  5. A Typical Web Search Engine [architecture diagram: Web → Crawler (Heritrix) → Indexer → Index (Elastic/Lucene) → Query Engine → Interface → Users]

  6. Indexing [same architecture diagram, with the Indexer → Index path highlighted: Web → Crawler → Indexer → Index → Query Engine → Interface → Users]

  7. Why indexing? • For efficient searching of documents with unstructured text (not databases), using terms as queries • Databases can still be searched with an index (e.g., Sphinx) • Online sequential text search (grep): suited to a small collection with volatile text • Data structures (indexes): suited to a large, semi-stable document collection where search must be efficient

  8. Sec. 1.1 Unstructured data in 1620 • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. • Why is that not the answer? • Slow (for large corpora) • NOT Calpurnia is non-trivial • Other operations (e.g., find the word Romans near countrymen) not feasible • Ranked retrieval (best documents to return)

  9. Sec. 1.1 Term-document incidence matrices [matrix of terms × Shakespeare plays omitted] Entry (term, play) is 1 if the play contains the word, 0 otherwise. Query: Brutus AND Caesar BUT NOT Calpurnia

  10. Sec. 1.1 Incidence vectors • So we have a 0/1 vector for each term. • To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND: 110100 AND 110111 AND 101111 = 100100
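A minimal Python sketch of this bitwise AND, using integers as 6-bit incidence vectors. The bit patterns are the slide's; the leftmost bit stands for the first play.

```python
# Boolean query "Brutus AND Caesar AND NOT Calpurnia" over 6 plays,
# with each term's incidence vector packed into a Python int.
NUM_PLAYS = 6

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000   # complemented below to 101111

mask = (1 << NUM_PLAYS) - 1                     # keep the complement to 6 bits
result = brutus & caesar & (~calpurnia & mask)

print(format(result, "06b"))                    # -> 100100: plays 1 and 4 match
```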

  11. Sec. 1.1 Answers to query • Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. • Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.

  12. Sec. 1.1 Bigger collections • Consider N = 1 million documents, each with about 1000 words. • Avg 6 bytes/word including spaces/punctuation • 6GB of data in the documents. • Say there are M = 500K distinct terms among these.

  13. Sec. 1.1 Can’t build the matrix • 500K x 1M matrix has half-a-trillion 0’s and 1’s. • But it has no more than one billion 1’s. • matrix is extremely sparse. • What’s a better representation? • We only record the 1 positions.
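Checking the slide's arithmetic:

```latex
5\times10^{5}\ \text{terms} \times 10^{6}\ \text{docs} = 5\times10^{11}\ \text{cells};
\qquad
\#\text{1s} \;\le\; 10^{6}\ \text{docs} \times 10^{3}\ \tfrac{\text{words}}{\text{doc}} = 10^{9};
\qquad
\text{density} \;\le\; \frac{10^{9}}{5\times10^{11}} = 0.2\%.
```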

  14. Sec. 1.2 Inverted index • For each term t, we must store a list of all documents that contain t. • Identify each doc by a docID, a document serial number • Can we use fixed-size arrays for this?
      Brutus    → 1 2 4 11 31 45 173 174
      Caesar    → 1 2 4 5 6 16 57 132
      Calpurnia → 2 31 54 101
  What happens if the word Caesar is added to document 14?

  15. Sec. 1.2 Inverted index (file) • Split into Dictionary and Postings; each posting is a docID • We need variable-size postings lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable-length arrays • Some tradeoffs in size/ease of insertion
      Dictionary        Postings (sorted by docID)
      Brutus    →       1 2 4 11 31 45 173 174
      Caesar    →       1 2 4 5 6 16 57 132
      Calpurnia →       2 31 54 101
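In Python, the in-memory variable-size postings map naturally onto a dict of sorted lists. A sketch using the slide's docIDs, including the "add Caesar to document 14" update that a fixed-size array could not absorb:

```python
import bisect

# Dictionary term -> sorted, variable-length postings list of docIDs.
postings = {
    "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

# Adding "caesar" to document 14 is a sorted insert, preserving the
# docID order that the merge algorithm relies on.
docs = postings["caesar"]
i = bisect.bisect_left(docs, 14)
if i == len(docs) or docs[i] != 14:    # avoid duplicate docIDs
    docs.insert(i, 14)

print(postings["caesar"])              # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```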

  16. Sec. 1.2 Inverted index construction [pipeline diagram] Documents to be indexed (“Friends, Romans, countrymen.”) → Tokenizer → token stream (Friends Romans Countrymen) → Linguistic modules → modified tokens (friend roman countryman) → Indexer → inverted index (friend → 2 4; roman → 1 2; countryman → 13 16)

  17. Initial stages of text processing • Tokenization • Cut character sequence into word tokens • Deal with “John’s”, a state-of-the-art solution • Normalization • Map text and query term to same form • You want U.S.A. and USA to match • Stemming • We may wish different forms of a root to match • authorize, authorization • Stop words • We may omit very common words (or not) • the, a, to, of
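A hedged Python sketch of these stages. The stop list is the slide's; crude_stem is a toy suffix-stripper standing in for a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}

def crude_stem(term: str) -> str:
    # Toy stand-in for a real stemmer: strips a few common suffixes so that,
    # e.g., "authorize" and "authorization" both reduce to "author".
    for suffix in ("ization", "ize", "ation", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenize + case-normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("Friends, Romans, countrymen."))
# ['friend', 'roman', 'countrymen'] -- a real linguistic module
# would also map "countrymen" to "countryman"
```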

  18. Sec. 3.1 Dictionary data structures for inverted indexes • The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

  19. Sec. 1.2 Indexer steps: Token sequence • Sequence of (Modified token, Document ID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  20. Sec. 1.2 Indexer steps: Sort • Sort by terms • And then by docID (the core indexing step)

  21. Sec. 1.2 Indexer steps: Dictionary & Postings • Multiple term entries in a single document are merged. • Split into Dictionary and Postings • Doc. frequency information is added.
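A compact Python sketch of these three steps on the two example documents: emit (term, docID) pairs, sort them, then merge duplicates into a dictionary (with document frequencies) and postings lists.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]):
    # 1. Token sequence: (modified token, docID) pairs.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()]
    # 2. Core indexing step: sort by term, then by docID.
    pairs.sort()
    # 3. Merge multiple entries for the same (term, doc); split into
    #    dictionary (term -> document frequency) and postings lists.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

docs = {1: "i did enact julius caesar i was killed in the capitol brutus killed me",
        2: "so let it be with caesar the noble brutus hath told you caesar was ambitious"}
dictionary, postings = build_index(docs)
print(postings["brutus"], dictionary["brutus"])   # [1, 2] 2
```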

  22. Sec. 1.2 Where do we pay in storage? • Terms and counts (the dictionary) • Lists of docIDs (the postings) • Pointers from dictionary entries to postings lists • How do we index efficiently? How much storage do we need?

  23. Sec. 1.3 The index we just built • How do we process a query?

  24. Sec. 1.3 Query processing: AND • Consider processing the query: Brutus AND Caesar • Locate Brutus in the Dictionary; retrieve its postings: 2 4 8 16 32 64 128 • Locate Caesar in the Dictionary; retrieve its postings: 1 2 3 5 8 13 21 34 • “Merge” the two postings

  25. Sec. 1.3 The merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries:
      Brutus → 2 4 8 16 32 64 128
      Caesar → 1 2 3 5 8 13 21 34
      Result → 2 8
  If list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.

  26. Intersecting two postings lists (a “merge” algorithm)
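A Python rendering of the standard two-pointer intersection this slide refers to:

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Two-pointer merge of sorted postings lists: O(x + y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists -> in the answer
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))    # [2, 8]
```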

  27. Sec. 3.1 A naïve dictionary • An array of structs: term (char[20], 20 bytes), document frequency (int, 4/8 bytes), pointer to postings list (Postings*, 4/8 bytes) • How do we store a dictionary in memory efficiently? • How do we quickly look up elements at query time?

  28. Sec. 3.1 Dictionary data structures • Two main choices: • Hashtables • Trees • Some IR systems use hashtables, some trees

  29. Sec. 3.1 Hash functions • A hash function is any function that maps data of arbitrary size to data of fixed size. • Values returned by a hash function are called hash values, hash codes, digests, or hashes. • Used to build a data structure called a hash table, widely used in computer software for rapid data lookup. • Hash functions accelerate table or database lookup, e.g., by detecting duplicated records in a large file.

  30. Sec. 3.1 Hash tables • Each vocabulary term is hashed to an integer • Pros: • Lookup is faster than for a tree: O(1) • Cons: • No easy way to find minor variants: • judgment/judgement • No prefix search [tolerant retrieval] • If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

  31. Sec. 3.1 Tree: binary tree [binary search tree over the vocabulary: root splits a–m / n–z; a–m splits a–hu / hy–m; n–z splits n–sh / si–z; leaves hold terms such as aardvark, huygens, sickle, zygot]

  32. Sec. 3.1 Tree: B-tree • Definition: Every internal node has a number of children in the interval [a,b], where a, b are appropriate natural numbers, e.g., [2,4]. [B-tree figure with internal nodes labeled a–hu, hy–m, n–z]

  33. Sec. 3.1 Trees • Simplest: binary tree • More usual: B-trees • Trees require a standard ordering of characters and hence strings … but we typically have one • Pros: • Solves the prefix problem (terms starting with hyp) • Cons: • Slower: O(log M) [and this requires balanced tree] • Rebalancing binary trees is expensive • But B-trees mitigate the rebalancing problem
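For illustration, the prefix lookup that defeats hashing does not even need an explicit tree: a sorted term array with binary search gives the same O(log M) access and finds the contiguous block of terms starting with a prefix. The terms below are illustrative.

```python
import bisect

def prefix_range(sorted_terms: list[str], prefix: str) -> list[str]:
    """All terms starting with `prefix`, via two O(log M) binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")  # just past the block
    return sorted_terms[lo:hi]

terms = sorted(["hypothesis", "hyphen", "hypnotic", "hydrogen", "zygote"])
print(prefix_range(terms, "hyp"))   # ['hyphen', 'hypnotic', 'hypothesis']
```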

  34. Unstructured vs structured data • What’s available? • Web • Organizations • You • Unstructured usually means “text” • Structured usually means “databases” • Semistructured somewhere in between

  35. Unstructured (text) vs. structured (database) data companies in 1996 [bar chart comparing unstructured- and structured-data companies in 1996; figure omitted]

  36. Unstructured (text) vs. structured (database) data companies in 2006 [bar chart omitted] Market caps as of Feb 2017: Google ~$580B, Oracle ~$170B

  37. IR vs. databases: Structured vs unstructured data • Structured data tends to refer to information in “tables”:
      Employee  Manager  Salary
      Smith     Jones    50000
      Chang     Smith    60000
      Ivy       Smith    50000
  Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

  38. Unstructured data • Typically refers to free text • Allows • Keyword queries including operators • More sophisticated “concept” queries e.g., • find all web pages dealing with drug abuse • Classic model for searching text documents

  39. Semi-structured data • In fact almost no data is “unstructured” • E.g., this slide has distinctly identified zones such as the Title and Bullets • Facilitates “semi-structured” search such as • Title contains data AND Bullets contain search … to say nothing of linguistic structure

  40. Decisions in Building an Inverted File: Efficiency and Query Languages • Some query options may require huge computation, e.g., regular expressions: if inverted files are stored in lexicographic order, comp* can be processed efficiently, but *comp cannot • Boolean terms: if A and B are search terms, A OR B can be processed by comparing two moderate-sized lists, but (NOT A) OR (NOT B) requires two very large lists

  41. Major Indexing Methods • Inverted index • effective for very large collections of documents • associates lexical items with their occurrences in the collection • Positional index • Non-positional indexes • Block-sort • Suffix trees and arrays • Faster for phrase searches; harder to build and maintain • Signature files • Word-oriented index structures based on hashing (usually not used for large texts)

  42. Sparse Vectors • Terms (tokens) as vectors • Vocabulary, and therefore the dimensionality of vectors, can be very large, ~10^4. • However, most documents and queries do not contain most words, so vectors are sparse (i.e., most entries are 0). • Need efficient methods for storing and computing with sparse vectors.

  43. Sparse Vectors as Lists • Store vectors as linked lists of non-zero-weight tokens paired with a weight. • Space proportional to number of unique terms (n) in document. • Requires linear search of the list to find (or change) the weight of a specific term. • Requires quadratic time, O(n²), in the worst case to compute the vector for a document, since each of the n tokens may need a linear scan of the list.

  44. Sparse Vectors as Trees • Index tokens in a document in a balanced binary tree or trie, with weights stored with the tokens at the leaves. [figure: balanced binary tree over the tokens, leaves holding (token, weight) pairs bit 2, film 1, memory 1, variable 2]

  45. Sparse Vectors as Trees (cont.) • Space overhead for tree structure: ~2n nodes. • O(log n) time to find or update weight of a specific term. • O(n log n) time to construct vector. • Need a software package to support such data structures.

  46. Sparse Vectors as Hash Tables • Hashing: • well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array • Store tokens in hash table, with token string as key and weight as value. • Storage overhead for hash table ~1.5n • Table must fit in main memory. • Constant time to find or update weight of a specific token (ignoring collisions). • O(n) time to construct vector (ignoring collisions).
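A Python sketch of this representation: a dict from token to weight, built in O(n) with constant-time weight lookup, plus a sparse dot product for similarity scoring. Names and the example weights are illustrative.

```python
from collections import Counter

def sparse_vector(tokens: list[str]) -> dict[str, int]:
    # O(n) construction: one hash-table update per token (ignoring collisions).
    return dict(Counter(tokens))

def dot(u: dict[str, float], v: dict[str, float]) -> float:
    # Iterate over the smaller vector; only shared tokens contribute.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

d = sparse_vector("to be or not to be".split())
q = sparse_vector("to be".split())
print(d)            # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(dot(d, q))    # 2*1 + 2*1 = 4
```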

  47. Implementation Based on Inverted Files • In practice, document vectors are not stored directly; an inverted organization provides much better efficiency. • The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree). • Critical issue is logarithmic or constant-time access to token information.

  48. Index File Types - Lucene • Field File • Field infos: .fnm • Stored Fields • field index: .fdx • field data: .fdt • Term Dictionary file • Term infos: .tis • Term info index: .tii • Frequency file: .frq • Position file: .prx

  49. Organization of Inverted Files [diagram: Term Dictionary → Term Frequency file (posting file) → Term Position File (inverted index) → Document Files]

  50. Efficiency Criteria • Storage: inverted files are big, typically 10% to 100% of the size of the collection of documents. • Update performance: it must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document. • Retrieval performance: retrieval must be fast enough to satisfy users and not use excessive resources.
