Introduction to Information Retrieval

Definition • Information retrieval (IR) is the task of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (usually expressed as a query) from within large collections (usually stored on computers).

Structured vs. Unstructured • Text has a natural (linguistic) structure • Sequential structure • Latent grammatical structure • Latent logical representations • Connections to knowledge bases • When we say ‘unstructured data’, we really mean that we’re just going to ignore all of that structure.

A more detailed look at the task Information retrieval can involve: • Filtering: Finding the set of relevant search results (Boolean retrieval) • Organizing • Often ranking • Can be more complicated: clustering the results, classifying into different categories, … • User interface • Can be simple HTML (e.g., Google) • Can involve complex visualization techniques that let the user navigate the space of potentially relevant search results • In general, the system tries to reduce the user’s effort in finding exactly the information they’re looking for.

Preliminaries: Terminology and Preprocessing

Terminology • Collection (or corpus): a set of documents • Word token (or just token): a sequence of characters in a document that constitutes a meaningful semantic unit • Word type (or just type): an equivalence class of tokens with the exact same sequence of characters • Index: a data structure used for efficient processing of large datasets • Inverted index: the name for the main data structure used in information retrieval • Term: an equivalence class of word types, perhaps transformed, used to build an inverted index • Vocabulary: the set of distinct terms in an index

Preprocessing a corpus • Tokenize: split a document into tokens • Mostly easy in English • hard in, e.g., Mandarin or Japanese • Morphological processing, which may include • Stemming / lemmatization (removing inconsequential endings of words) • Removing capitalization, punctuation, or diacritics and accents • Identifying highly synonymous terms, or “equivalence classes” (e.g., Jan. and January) • Removing very common word types (“stopwords”)
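The normalization steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `STOPWORDS` set is a tiny hypothetical list, the helper names are made up for this example, and stemming and equivalence classes (e.g., Jan. and January) are left out.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "to", "of"}


def strip_diacritics(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(tokens):
    """Lowercase, strip accents and surrounding punctuation, drop stopwords."""
    terms = []
    for token in tokens:
        term = strip_diacritics(token).lower()
        term = term.strip(string.punctuation)  # leading/trailing punctuation only
        if term and term not in STOPWORDS:
            terms.append(term)
    return terms


example_terms = normalize(["The", "Café", "eat,", "and", "Bananas."])
```

Here `example_terms` keeps only the normalized content words. A real system would pick each of these steps (and their order) to suit the language and the collection.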

Tokenizing English • Easy rule (about 90% accurate): split on whitespace • Some special cases: • Contractions: I’m → I + ’m, Jack’s → Jack + ’s, aren’t → are + n’t • Multi-word expressions: New York, Hong Kong • Hyphenation: “travel between San Francisco-Los Angeles” vs. co-occurrence or Hewlett-Packard.
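As a rough sketch of the easy rule plus the contraction special case, the following Python splits on whitespace and then peels off a trailing clitic. The list of clitics in the pattern is an assumption made for this example, and multi-word expressions and hyphenation are not handled at all.

```python
import re

# Trailing clitics to split off; this list is an assumption for the sketch.
# Both straight (') and curly (’) apostrophes are accepted.
_CLITIC = re.compile(r"^(.+?)(n['’]t|['’](?:m|s|re|ve|ll|d))$", re.IGNORECASE)


def whitespace_tokenize(text):
    """The easy rule: split on runs of whitespace."""
    return text.split()


def split_contraction(token):
    """Split a token such as "aren't" into ["are", "n't"]; otherwise return it whole."""
    match = _CLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def tokenize(text):
    tokens = []
    for raw in whitespace_tokenize(text):
        tokens.extend(split_contraction(raw))
    return tokens


example_tokens = tokenize("I'm sure Jack's friends aren't here")
```

Words that merely end in "s" or "re" (friends, here) are left alone because the pattern requires an apostrophe before the clitic.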

Tokenization in other languages • Word segmentation: In several East and Southeast Asian writing systems (e.g., Chinese, Japanese, Thai), word boundaries are not indicated by whitespace. • In Germanic languages, words can often be compounded together to form very long words. • Many languages have special forms of contractions.

The Inverted Index

Indexing • Indexing is a technique borrowed from databases • An index is a data structure that supports efficient lookups in a large data set • E.g., hash indexes, R-trees, B-trees, etc.

Inverted Index
An inverted index stores an entry for every word, and a pointer to every document where that word is seen.
Vocabulary → Postings List
term1 → Document17, Document45123
. . .
termN → Document991, Document123001

Example
Document D1: “yes we got no bananas”
Document D2: “Johnny Appleseed planted apple seeds.”
Document D3: “we like to eat, eat, eat apples and bananas”
Vocabulary → Postings List
yes → D1
we → D1, D3
got → D1
no → D1
bananas → D1, D3
Johnny → D2
Appleseed → D2
planted → D2
apple → D2, D3
seeds → D2
like → D3
to → D3
eat → D3
and → D3
Query “apples bananas” (with “apples” normalized to the index term “apple”):
“apples” → {D2, D3}
“bananas” → {D1, D3}
Whole query gives the intersection: {D2, D3} ∩ {D1, D3} = {D3}
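The intersection step for this query can be sketched as a linear merge of two sorted postings lists. The dictionary below copies part of the example index, with documents D1 to D3 written as the integers 1 to 3. The function name `intersect` and the assumption that the query term “apples” has already been normalized to “apple” are choices made for this illustration.

```python
def intersect(postings_a, postings_b):
    """Intersect two postings lists that are sorted by document ID."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list with the smaller document ID
        else:
            j += 1
    return result


# Part of the example index; D1, D2, D3 are written as 1, 2, 3.
example_index = {
    "we": [1, 3],
    "bananas": [1, 3],
    "apple": [2, 3],
    "eat": [3],
}

# Query "apples bananas", assuming "apples" is normalized to "apple".
query_result = intersect(example_index["apple"], example_index["bananas"])
```

Because both lists are sorted, the merge touches each posting at most once, which is why postings lists are usually kept in document-ID order.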

Variations
A word-level index stores document IDs and offsets for the position of the words in each document • Why? • Supports phrase-based queries
Vocabulary → Postings List
yes → D1 (+0)
we → D1 (+1), D3 (+0)
got → D1 (+2)
no → D1 (+3)
bananas → D1 (+4), D3 (+8)
…
eat → D3 (+3, +4, +5)
…
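To show why offsets help with phrase queries, here is a minimal sketch of a word-level (positional) index and a phrase lookup over it. The function names are hypothetical, tokenization is a plain whitespace split, and the document texts are the example documents with punctuation already removed.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [offsets]} using whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the words of `phrase` appear consecutively."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return []
    # Only documents containing every word can match.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[words[0]][doc_id]
        # The k-th word must sit exactly k positions after the first word.
        if any(all(start + k in index[word][doc_id]
                   for k, word in enumerate(words))
               for start in starts):
            hits.append(doc_id)
    return hits


example_docs = {
    "D1": "yes we got no bananas",
    "D3": "we like to eat eat eat apples and bananas",
}
positional_index = build_positional_index(example_docs)
```

With a document-level index, “we got” and “got we” would return the same documents. With offsets, only the first matches D1.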

Variations
• Real search engines add all kinds of other information to their postings lists • for efficiency • to support better ranking of results
Vocabulary → Postings List
yes (docs=1) → D1 (freq=1)
we (docs=2) → D1 (freq=1), D3 (freq=1)
…
eat (docs=1) → D3 (freq=3)
…

Index Construction Algorithm: 1. Scan through each document, word by word, and write a (term, docID) pair for each word to the TempIndex file. 2. Sort TempIndex by term. 3. Iterate through the sorted TempIndex, merging all entries for the same term into one postings list.

Index Construction Algorithm: 1. Scan through each document, word by word, and write a (term, docID) pair for each word to the TempIndex file.
Document D1: “yes we got no bananas”
Document D2: “Johnny Appleseed planted apple seeds.”
Document D3: “we like to eat, eat, eat apples and bananas”
TempIndex (term → docID):
we → D3, like → D3, to → D3, eat → D3, eat → D3, eat → D3, apples → D3, and → D3, bananas → D3
Johnny → D2, Appleseed → D2, planted → D2, apple → D2, seeds → D2
yes → D1, we → D1, got → D1, no → D1, bananas → D1

Index Construction Algorithm: 1. Scan through each document, word by word, and write a (term, docID) pair for each word to the TempIndex file. [Figure: TempIndex]

Index Construction Algorithm: 2. Sort the TempIndex by term. [Figure: sorted TempIndex]

Index Construction Algorithm: 3. Merge postings lists for matching terms. [Figure: TempIndex]

Index Construction Algorithm: 3. Merge postings lists for matching terms. [Figure: final index]
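The three steps of the construction algorithm can be sketched in memory as follows. This is an illustration only: the TempIndex is a Python list rather than a file, tokenization is a whitespace split with no stemming (so “apple” and “apples” stay separate terms), and the name `build_index` is made up for the example.

```python
from itertools import groupby


def build_index(docs):
    """Sort-based inverted index construction, done entirely in memory."""
    # Step 1: scan each document and emit one (term, doc_id) pair per word.
    temp_index = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.split()
    ]
    # Step 2: sort by term (ties are ordered by document ID).
    temp_index.sort()
    # Step 3: merge all pairs for the same term into one postings list.
    index = {}
    for term, pairs in groupby(temp_index, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in pairs:
            if not postings or postings[-1] != doc_id:  # skip repeated doc IDs
                postings.append(doc_id)
        index[term] = postings
    return index


example_docs = {
    "D1": "yes we got no bananas",
    "D2": "Johnny Appleseed planted apple seeds",
    "D3": "we like to eat eat eat apples and bananas",
}
final_index = build_index(example_docs)
```

Sorting first means that all pairs for a term are adjacent, so the merge in step 3 is a single pass over the sorted pairs.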

Efficient Index Construction Problem: Indexes can be huge. How can we efficiently build them? • Blocked Sort-Based Indexing (BSBI) • Single-Pass In-Memory Indexing (SPIMI) What’s the difference?

Vector Space Model

Problem: Too many matching results for every query Using an inverted index is all fine and good, but if your document collection has 10^12 documents and someone searches for “banana”, they’ll get 90 million results. • We need to be able to return the “most relevant” results. • We need to rank the results.

Vector Space Model Idea: treat each document and query as a vector in a vector space. Then, we can find “most relevant” documents by finding the “closest” vectors. • But how can we make a document into a vector? • And how do we measure “closest”?

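Example: Documents as Vectors • Example: • Document D1: “yes we got no bananas” • Document D2: “what you got” • Document D3: “yes I like what you got” With one dimension per term, taken here in the order (yes, we, got, no, bananas, what, you, I, like), and a 1 marking each term that is present: Vector V1: (1, 1, 1, 1, 1, 0, 0, 0, 0) Vector V2: (0, 0, 1, 0, 0, 1, 1, 0, 0) Vector V3: (1, 0, 1, 0, 0, 1, 1, 1, 1)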

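What about queries? The vector space model treats queries as (very short) documents. Example query: “you got” Query Q1: (0, 0, 1, 0, 0, 0, 1, 0, 0), taking the dimensions in the order (yes, we, got, no, bananas, what, you, I, like)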

Measuring Similarity Similarity metric: the cosine of the angle between document vectors. “Cosine Similarity”: cos(θ) = (x · y) / (‖x‖ ‖y‖), where θ is the angle between vectors x and y.

Why cosine? It gives some intuitive similarity judgments: cos(0) = 1 • cos(π/4) = cos(45°) ≈ 0.71 • cos(π/2) = cos(90°) = 0 • cos(π) = cos(180°) = −1

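Example: Computing relevance With boolean vectors, the cosine between query Q1 (“you got”) and each document vector is: cos(Q1, V1) = 1 / (√2 · √5) ≈ 0.32 • cos(Q1, V2) = 2 / (√2 · √3) ≈ 0.82 • cos(Q1, V3) = 2 / (√2 · √6) ≈ 0.58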

Relevance Ranking in the Vector Space Model Definition: relevance between query q and document d in the VSM is the cosine similarity of their vectors: relevance(q, d) = (v_q · v_d) / (‖v_q‖ ‖v_d‖). For a given query, the VSM ranks documents from largest to smallest relevance scores.
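The ranking procedure can be sketched end to end on the three example documents and the query “you got”. The vectors here are boolean (term present or absent), the vocabulary is sorted alphabetically purely for convenience, and the helper names are assumptions made for this illustration.

```python
import math


def to_boolean_vector(text, vocabulary):
    """1 if the term occurs in the text, else 0, for each vocabulary term."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


example_docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
vocabulary = sorted({word for text in example_docs.values() for word in text.split()})

query_vector = to_boolean_vector("you got", vocabulary)
scores = {
    doc_id: cosine(query_vector, to_boolean_vector(text, vocabulary))
    for doc_id, text in example_docs.items()
}
# Rank documents from largest to smallest relevance score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

The order of the dimensions does not affect the scores, as long as the query and the documents use the same order.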

All words are equal? Our example so far has converted documents to boolean vectors, where each dimension indicates whether a term is present or not. Problems: • If a term appears many times in a document, it should probably count more. • Some words are more “informative” than others (e.g., stop words are not informative). • Longer documents contain more terms, and will therefore be considered relevant to more queries, but perhaps not for good reason.

Weighting Scheme • A weighting scheme is a technique for converting documents to real-valued vectors (rather than boolean vectors). • The real value for a term in a document is called the weight of that term in that document. • We’ll write this as w_{t,d}, the weight of term t in document d.

Term Frequency Weighting • Term Frequency (TF) weighting is a heuristic that sets the weight of a term to be the number of times it appears in the document.

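Log-TF • A common variant of TF is to reduce the effect of high-frequency terms by taking the logarithm. One widely used form sets the weight to 1 + log(tf) when the term frequency tf is greater than 0, and to 0 otherwise.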

Inverse Collection Frequency Weighting • Inverse Collection Frequency (ICF) weighting is a heuristic that reduces the weight of a term according to the number of times it appears in the whole collection. • This makes really common words (e.g., “the”) have really low weight. • One natural form is icf_t = log(T / cf_t), where cf_t is the number of times term t appears in collection C and T is the total number of tokens in C.

Inverse Document Frequency Weighting • Inverse Document Frequency (IDF) weighting is a heuristic that reduces the weight of a term according to the number of documents in the collection that contain it. • This makes really common words (e.g., “the”) have really low weight. • The standard form is idf_t = log(N / df_t), where df_t is the number of documents that contain term t and N is the total number of documents in collection C.

ICF vs. IDF Nobody uses ICF! • Informative words tend to “clump together” in documents • As a result, words may appear many times, but only in a few documents. • This may make their CF large, while their DF isn’t as big. • In practice, IDF is better at discriminating between “informative” and “non-informative” terms.

Term Frequency – Inverse Document Frequency TF-IDF weighting combines TF and IDF (duh): the weight of a term in a document is the product of its TF weight and its IDF weight. Probably the most common weighting scheme in practice.
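A small sketch of TF-IDF weighting, using the three documents from the inverted index example. It assumes raw term counts for TF, the natural logarithm, and IDF computed as log(N / df). Other variants (such as log-TF or smoothed IDF) are equally plausible, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Term frequency: how often each term occurs in each document.
    term_counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


example_docs = {
    "D1": "yes we got no bananas",
    "D2": "Johnny Appleseed planted apple seeds",
    "D3": "we like to eat eat eat apples and bananas",
}
weights = tf_idf_vectors(example_docs)
```

In D3, “eat” gets the largest weight because it is frequent in that document and absent from the others, while “we” and “bananas” are discounted for appearing in two of the three documents. A term that occurs in every document would get weight 0 under this form of IDF.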

Limitations The vector space model has the following limitations: • Long documents are poorly represented because they have poor similarity values. • Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a false negative match. • The order in which the terms appear in the document is lost in the vector space representation.

Language Models for Ranking

VSM and heuristics • One complaint about the VSM is that it relies too heavily on ad-hoc heuristics • Why should similarity be cosine similarity, as opposed to some other similarity function? • TF-IDF seems like a hack • Can we formulate a more principled approach that works well?

Language Models • In theory classes, you’ve all seen models that “accept” or “reject” a string: • Finite automata (deterministic and non-) • Context-sensitive and context-free grammars • Turing machines • Language models are probabilistic versions of these models • Instead of saying “yes” or “no” to each string, they assign a probability of acceptance

Language Model Definition • Let V be the vocabulary (set of word types) for a language. • A language model is a distribution P() over V*, the set of sequences made from V.

Example Language Models • Really simple: If string s contains “the”, P(s) = 1, otherwise, P(s) = 0.* • Slightly less simple: If string s contains “the”, let P(s) = p(“the”). • More sophisticated:
* This is not a proper distribution, but for ranking, we’re not going to be too picky.

Ranking with a Language Model We can use a language model to rank documents for a given query, by determining: P(d | q) and ranking according to the probability scores.
