Searching the Web: Basic Information Retrieval
Who I Am • Associate Professor at UCLA Computer Science • Ph.D. from Stanford in Computer Science • B.S. from SNU in Physics • Got involved in early Web-search-engine projects, particularly the Web-crawling part • Research on search engines and the social Web
Brief Overview of the Course • Basic principles and theories behind Web-search engines • Not much discussion on implementation or tools, but will be happy to discuss them if there are any questions • Topics • Basic IR models, data structures, and algorithms • Topic-based models • Latent Semantic Indexing • Latent Dirichlet Allocation • Link-based ranking • Search-engine architecture • Issues of scale, Web crawling
Who Are You? • Background • Expectation • Career goal
Today’s Topic • Basic Information Retrieval (IR) • Three approaches for computer-based information management • Bag of words assumption • Boolean Model • String-matching algorithm • Inverted index • Vector-space model • Document-term matrix • TF-IDF vector and cosine similarity • Phrase queries • Spell correction
Computer-based Information Management • Basic problem • How to use computers to help humans store, organize and retrieve information? • What approaches have been taken and what has been successful?
Three Major Approaches • Database approach • Expert-system approach • Information-retrieval approach
Database Approach • Information is stored in a highly structured way • Data is stored in relational tables as tuples • Simple data model and query language • Relational model and SQL query language • Clear interpretation of data and query • No ambition to be “intelligent” like humans • Focus is mainly on highly efficient systems • “Performance, performance, performance” • It has been hugely successful • All major businesses use an RDBMS • >$20B market • What are the pros and cons?
Expert-System Approach • Information is stored as a set of logical predicates • Bird(x), Cat(x), Fly(x), … • Given a query, the system infers the answer through logical inference • Given Bird(Ostrich), can we infer Fly(Ostrich)? • Popular approach in the 80s, but has not been successful for general information retrieval • What are the pros and cons?
Information-Retrieval Approach • Uses existing text documents as the information source • No special structuring or database construction required • Text-based query language • Keyword-based query or natural-language query • The system returns the best-matching documents given the query • Had limited appeal until the Web became popular • What are the pros and cons?
Main Challenge of IR Approach • Relational Model • Interpretation of query and data is straightforward • Student(name, birthdate, major, GPA) • SELECT * FROM Student WHERE GPA > 3.0 • Information Retrieval • Both queries and data are “fuzzy” • Unstructured text and “natural language” queries • What documents are good matches for a query? • Computers do not “understand” the documents or the queries • Developing a computational “model” is essential to implementing this approach
Bag of Words: Major Simplification • Consider each document as a “bag of words” • “bag” vs. “set”: ignore word ordering, but keep word counts • Consider queries as bags of words as well • Great oversimplification, but works adequately in many cases • “John loves only Jane” vs. “Only John loves Jane” • The limitation still shows up in current search engines • Still, how do we match documents and queries?
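A quick illustration of the simplification (my sketch, not from the slides): the two example sentences above reduce to the same bag, so the model cannot tell them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase and split on whitespace, keeping word counts (a 'bag')."""
    return Counter(text.lower().split())

d1 = bag_of_words("John loves only Jane")
d2 = bag_of_words("Only John loves Jane")
print(d1 == d2)  # True: word order is lost, only the counts remain
```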
Boolean Model • Return all documents that contain the words in the query • Simplest model for information retrieval • No notion of “ranking” • A document is either a match or non-match • Q: How to find and return matching documents? • Basic algorithm? • Useful data structure?
String-Matching Algorithm • Given the string “abcde”, find which documents contain it • Q: Computational complexity of naively matching a string of length m against a document of length n? • Q: Any more efficient way?
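To make the naive approach concrete, here is a minimal sketch (my illustration, not from the slides). It tries every start position and compares up to m characters at each, hence O(mn) in the worst case.

```python
def naive_match(D, W):
    """Return the first index where W occurs in D, or -1 if absent.
    Worst case O(|D| * |W|): after a mismatch, restart from scratch."""
    n, m = len(D), len(W)
    for s in range(n - m + 1):       # every candidate start position in D
        if D[s:s + m] == W:          # up to m character comparisons
            return s
    return -1

print(naive_match("ABCABABABC", "ABABC"))  # 5
```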
String Matching Example (1)
m: 0123456789
D: ABCABABABC (doc)
W: ABABC (word)
i: 01234
String Matching Example (2)
m: 0123456789
D: ABCABABABC (doc)
W: ABABC (word)
i: 01234
• Two cursors: m is the beginning of the matching part in D; i is the location of the matching char in W
• Mismatch at m=0, i=2
• Q: What can we do? Start again at m=1, i=0?
String Matching Example (3)
m: 0123456789
D: ABCABABABC (doc)
W: ABABC (word)
i: 01234
• Mismatch at m=3, i=4
• Q: What can we do? Start at m=7, i=0?
Algorithm KMP • If no substring of W is self-repeated, we can slide W “completely” past the matched portion • m ← m + i • i ← 0 • If a suffix of the matched part equals a prefix of W, we have to slide back a little • m ← m + i − x // x is how much to slide back • i ← x • The exact value of x depends on the length of the prefix of W matching the suffix of the matched part • T[0…|W|−1]: “slide-back” table recording the x values
Algorithm KMP
W: string to look for, D: document, T: “slide-back” table in case of mismatch
let m = 0, i = 0
while (m + i) < |D| do:
    if W[i] = D[m + i]:
        let i = i + 1
        if i = |W|, return m
    otherwise:
        let m = m + i − T[i]
        if i > 0, let i = T[i]
return no-match
Algorithm KMP: T[i] Table • W: ABCDABD (word), i: 0123456 • m ← m + i − T[i] • T[0] = −1, T[1] = 0 • Q: What should T[i] be for i = 2…6?
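As a check on the pseudocode above, here is a runnable sketch of KMP (my reconstruction, following the slides’ T[0] = −1 convention); build_table computes the slide-back values, so it also answers the T[i] question for any W.

```python
def build_table(W):
    """T[i] = length of the longest proper prefix of W that is also a
    suffix of W[:i]; T[0] = -1 by the slides' convention."""
    T = [-1] * len(W)
    k = -1                           # current matched prefix length
    for i in range(1, len(W)):
        while k >= 0 and W[k] != W[i - 1]:
            k = T[k]                 # fall back to a shorter prefix
        k += 1
        T[i] = k
    return T

def kmp_search(D, W):
    """Find the first occurrence of W in D in O(|D| + |W|), or -1."""
    T = build_table(W)
    m = i = 0                        # m: start of match in D, i: cursor in W
    while m + i < len(D):
        if W[i] == D[m + i]:
            i += 1
            if i == len(W):
                return m
        else:
            m = m + i - T[i]         # slide W forward, keeping matched suffix
            i = max(T[i], 0)
    return -1

print(build_table("ABCDABD"))             # [-1, 0, 0, 0, 0, 1, 2]
print(kmp_search("ABCABABABC", "ABABC"))  # 5
```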
Data Structure for Quick Document Matching • Boolean model • Find all documents that contain the keywords in Q. • Q: What data structure will be useful to do it fast?
Inverted Index • Allows quick lookup of the ids of the documents containing a particular word • Q: How can we use this to answer “UCLA Physics”?
[Figure: a lexicon/dictionary DIC maps each word to its postings list — Stanford → PL(Stanford), UCLA → PL(UCLA), MIT → PL(MIT), …]
Size of Inverted Index (1) • 100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid • Q: Document collection size? • Q: Inverted index size? • Heaps’ law: vocabulary size = k·n^b, where n is the total number of words in the collection, with 30 < k < 100 and 0.4 < b < 1 • k = 50 and b = 0.5 are a good rule of thumb
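A back-of-the-envelope answer to the two questions above, under the slide’s assumptions (my arithmetic, not from the slides):

```python
# Assumptions from the slide: 100M docs, 10KB/doc, 1000 unique words/doc,
# 10B/word, 4B/docid; Heaps' law with k = 50, b = 0.5.
docs, doc_size, word_bytes, docid_bytes = 100_000_000, 10_000, 10, 4
uniq_words = 1000

collection = docs * doc_size                 # raw text: 1 TB
postings = docs * uniq_words * docid_bytes   # one docid per (doc, word): 400 GB

tokens = docs * (doc_size // word_bytes)     # ~10^11 words in the collection
vocab = 50 * tokens ** 0.5                   # Heaps: ~16M distinct terms
dictionary = vocab * word_bytes              # ~160 MB -- tiny vs. postings

print(f"{collection / 1e12:.1f} TB text, {postings / 1e9:.0f} GB postings, "
      f"{dictionary / 1e6:.0f} MB dictionary")
```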
Size of Inverted Index (2) • Q: Between the dictionary and the postings lists, which one is larger? • Q: Lengths of postings lists? • Zipf’s law: collection term frequency ∝ 1 / frequency rank • Q: How do we construct an inverted index?
Inverted Index Construction
C: set of all documents (corpus)
DIC: dictionary of inverted index
PL(w): postings list of word w
For each document d ∈ C:
    Extract all words in content(d) into W
    For each w ∈ W:
        If w ∉ DIC, then add w to DIC
        Append id(d) to PL(w)
Q: What if the index is larger than main memory?
Inverted-Index Construction • For a large text corpus • Blocked sort-based construction • Partition into blocks, build per block, and merge
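A minimal in-memory sketch of the construction algorithm above, together with the postings-list intersection that answers a conjunctive query such as “UCLA Physics” (the toy corpus and docids are mine; the blocked variant would instead write partial indexes to disk and merge them):

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: docid -> text. Returns DIC: word -> postings list of docids."""
    index = defaultdict(list)
    for docid, text in corpus.items():       # docids arrive in increasing order
        for w in set(text.lower().split()):  # the unique words of the document
            index[w].append(docid)
    return index

def boolean_and(index, query):
    """Intersect the postings lists of all query words."""
    postings = [set(index.get(w, ())) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

corpus = {1: "UCLA Physics department", 2: "Stanford Physics", 3: "UCLA CS"}
index = build_index(corpus)
print(sorted(boolean_and(index, "UCLA Physics")))  # [1]
```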
Evaluation: Precision and Recall • Q: Are all matching documents what users want? • Basic idea: a model is good if it returns a document if and only if it is “relevant” • R: set of “relevant” documents; D: set of documents returned by the model • Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|
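The two measures in code (the example sets are my illustration):

```python
def precision_recall(R, D):
    """R: relevant docids, D: docids returned by the model (both sets)."""
    hits = len(R & D)
    return hits / len(D), hits / len(R)

p, r = precision_recall(R={1, 2, 3, 4}, D={2, 3, 5})
print(p, r)  # precision = 2/3, recall = 2/4
```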
Vector-Space Model • Main problem of the Boolean model • Too many matching documents when the corpus is large • Any way to “rank” documents? • Matrix interpretation of the Boolean model • Document-term matrix • Boolean 0 or 1 value for each entry • Basic idea • Assign a real-valued weight to each matrix entry depending on the importance of the term • “the” vs. “UCLA” • Q: How should we assign the weights?
TF-IDF Vector • A term t is important for document d • if t appears many times in d, or • if t is a “rare” term • TF (term frequency): # occurrences of t in d • DF (document frequency): # documents containing t • IDF (inverse document frequency): log(N/DF), where N is the total number of documents • TF-IDF weight: TF × IDF • Q: How to use it to compute query-document relevance?
Cosine Similarity • Represent both the query and the document as TF-IDF vectors • Take the inner product of the two normalized vectors to compute their similarity: sim(Q, D) = Q · D / (|Q| |D|) • Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents.
Cosine Similarity: Example • idf(UCLA)=10, idf(good)=0.1, idf(university) = idf(car) = idf(racing) = 1 • Q = (UCLA, university), D = (car, racing) • Q = (UCLA, university), D = (UCLA, good) • Q = (UCLA, university), D = (university, good)
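The example worked out in code (my reconstruction: I assume TF = 1 for each listed term, so every vector entry is simply the term’s IDF):

```python
import math

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vec(words):
    """TF-IDF vector with TF = 1 for each listed word."""
    return {w: idf[w] for w in words}

def cosine(q, d):
    dot = sum(q[w] * d.get(w, 0) for w in q)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(q) * norm(d))

Q = vec(["UCLA", "university"])
for D in (["car", "racing"], ["UCLA", "good"], ["university", "good"]):
    print(D, round(cosine(Q, vec(D)), 3))
# 0.0 (no shared terms), 0.995 (rare term UCLA dominates), 0.099
```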
Finding High Cosine-Similarity Documents • Q: Under the vector-space model, do precision/recall make sense? • Q: How do we find the documents with the highest cosine similarity in the corpus? • Q: Any way to avoid a complete scan of the corpus?
Inverted Index for TF-IDF • Q · dᵢ = 0 if dᵢ contains no query words • Consider only the documents containing query words • Augment the index: the lexicon stores (word, IDF); each postings entry stores (docid, TF)
[Figure: Stanford, IDF 1/3530 → (D1, TF 2); UCLA, IDF 1/9860 → (D14, TF 30); MIT, IDF 1/937 → (D376, TF 8); …] • (TF may be normalized by document size)
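A sketch of scoring with such an index (my illustration, shaped like the figure above; length normalization by |D| is omitted for brevity): only documents appearing in some query term’s postings list accumulate a score.

```python
from collections import defaultdict

# Toy index: word -> (IDF, [(docid, TF), ...]), as in the figure above
index = {
    "stanford": (1 / 3530, [("D1", 2)]),
    "ucla":     (1 / 9860, [("D14", 30)]),
    "mit":      (1 / 937,  [("D376", 8)]),
}

def score(query):
    """Term-at-a-time scoring: every document outside the postings
    lists of the query terms implicitly keeps a score of 0."""
    scores = defaultdict(float)
    for w in query.lower().split():
        if w in index:
            idf, postings = index[w]
            for docid, tf in postings:
                scores[docid] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score("UCLA Stanford"))  # D14 outscores D1 here
```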
Phrase Queries • “Harvard University Boston” exactly as a phrase • Q: How can we support this query? • Two approaches • Biword index • Positional index • Q: Pros and cons of each approach? • Rule of thumb: 2x–4x size increase for a positional index compared to a docid-only index
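A sketch of the positional-index approach (my toy data): each posting records the positions of the word in the document, and a phrase matches when the words occur at consecutive positions.

```python
# Positional index: word -> {docid: [positions in that doc]}
index = {
    "harvard":    {1: [0, 17], 2: [3]},
    "university": {1: [1],     2: [9]},
    "boston":     {1: [2],     2: [4]},
}

def phrase_query(words):
    """Return docids where the words appear at consecutive positions."""
    docs = set.intersection(*(set(index[w]) for w in words))
    hits = []
    for d in docs:
        for p in index[words[0]][d]:             # try each start position
            if all(p + k in index[w][d] for k, w in enumerate(words)):
                hits.append(d)
                break
    return hits

print(phrase_query(["harvard", "university", "boston"]))  # [1]
```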
Spell Correction • Q: What may the user have truly intended by the query “Britnie Spears”? How can we find the correct spelling? • Given a user-typed word w, find its correct spelling c • Probabilistic approach: find the c with the highest probability P(c|w) • Q: How to estimate it? • Bayes’ rule: P(c|w) = P(w|c)P(c)/P(w) • Q: What are these probabilities and how can we estimate them? • Rule of thumb: ~3/4 of misspellings are within edit distance 1; 98% are within edit distance 2
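A minimal noisy-channel sketch in the spirit of the slide (my illustration; the tiny corpus stands in for a real language model of P(c), and P(w|c) is approximated by preferring candidates at smaller edit distance, per the rule of thumb above):

```python
import re
from collections import Counter

CORPUS = "britney spears sings britney spears is a pop star spears"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS))  # unigram counts ~ P(c)

def edits1(w):
    """All strings within edit distance 1: deletes, transposes,
    replaces, and inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    return set(
        [L + R[1:] for L, R in splits if R] +
        [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
        [L + c + R[1:] for L, R in splits if R for c in letters] +
        [L + c + R for L, R in splits for c in letters])

def correct(w):
    """argmax_c P(c|w) ~ P(w|c) P(c): prefer a known word, then known
    words at edit distance 1, then 2; break ties by corpus frequency."""
    w = w.lower()
    candidates = (({w} & WORDS.keys())
                  or (edits1(w) & WORDS.keys())
                  or ({e2 for e1 in edits1(w) for e2 in edits1(e1)} & WORDS.keys())
                  or {w})
    return max(candidates, key=lambda c: WORDS[c])

print(correct("Britnie"))  # britney
```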
Summary • Boolean model • Vector-space model • TF-IDF weight, cosine similarity • String-matching algorithm • Algorithm KMP • Inverted index • Boolean model • TF-IDF model • Phrase queries • Spell correction