
Searching the Web



  1. Searching the Web Basic Information Retrieval

  2. Who I Am • Associate Professor at UCLA Computer Science • Ph.D. from Stanford in Computer Science • B.S. from SNU in Physics • Got involved in early Web-search engine projects, particularly in the Web-crawling part • Research on search engines and the social Web

  3. Brief Overview of the Course • Basic principles and theories behind Web-search engines • Not much discussion of implementation or tools, but happy to discuss them if there are any questions • Topics • Basic IR models, data structures, and algorithms • Topic-based models • Latent Semantic Indexing • Latent Dirichlet Allocation • Link-based ranking • Search-engine architecture • Issues of scale, Web crawling

  4. Who Are You? • Background • Expectation • Career goal

  5. Today’s Topic • Basic Information Retrieval (IR) • Three approaches for computer-based information management • Bag of words assumption • Boolean Model • String-matching algorithm • Inverted index • Vector-space model • Document-term matrix • TF-IDF vector and cosine similarity • Phrase queries • Spell correction

  6. Computer-based Information Management • Basic problem • How to use computers to help humans store, organize and retrieve information? • What approaches have been taken and what has been successful?

  7. Three Major Approaches • Database approach • Expert-system approach • Information-retrieval approach

  8. Database Approach • Information is stored in a highly structured way • Data is stored in relational tables as tuples • Simple data model and query language • Relational model and the SQL query language • Clear interpretation of data and queries • No ambition to be “intelligent” like humans • Mainly focused on highly efficient systems • “Performance, performance, performance” • It has been hugely successful • All major businesses use an RDB system • >$20B market • What are the pros and cons?

  9. Expert-System Approach • Information is stored as a set of logical predicates • Bird(x), Cat(x), Fly(x), … • Given a query, the system infers the answer through logical inference • Bird(Ostrich) → Fly(Ostrich)? • Popular approach in the 1980s, but has not been successful for general information retrieval • What are the pros and cons?

  10. Information-Retrieval Approach • Uses existing text documents as the information source • No special structuring or database construction required • Text-based query language • Keyword-based query or natural-language query • The system returns the best-matching documents given the query • Had limited appeal until the Web became popular • What are the pros and cons?

  11. Main Challenge of IR Approach • Relational Model • Interpretation of query and data is straightforward • Student(name, birthdate, major, GPA) • SELECT * FROM Student WHERE GPA > 3.0 • Information Retrieval • Both queries and data are “fuzzy” • Unstructured text and “natural language” queries • What documents are good matches for a query? • Computers do not “understand” the documents or the queries • Developing a “model” that a computer can execute is essential to implementing this approach

  12. Bag of Words: Major Simplification • Consider each document as a “bag of words” • “bag” vs “set”: ignore word ordering, but keep word counts • Consider queries as bags of words as well • Great oversimplification, but works adequately in many cases • “John loves only Jane” vs “Only John loves Jane” • The limitation still shows up in current search engines • Still, how do we match documents and queries?

  13. Boolean Model • Return all documents that contain the words in the query • Simplest model for information retrieval • No notion of “ranking” • A document is either a match or non-match • Q: How to find and return matching documents? • Basic algorithm? • Useful data structure?

  14. String-Matching Algorithm • Given the string “abcde”, find which documents contain the string • Q: Computational complexity of naïve matching of a string of length m over a document of length n? • Q: Any efficient way? (a sketch of the naïve baseline follows below)
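
  As a baseline, here is a minimal Python sketch of the naïve scan: it tries every start position, so the worst case is O(nm) character comparisons. The function name and the example call are illustrative only.

      def naive_match(D: str, W: str) -> int:
          """Try W at every position of D: O(n*m) comparisons in the worst case."""
          n, m = len(D), len(W)
          for s in range(n - m + 1):       # candidate start position in D
              if D[s:s + m] == W:          # up to m character comparisons
                  return s                 # first match
          return -1                        # no match

      print(naive_match("ABCABABABC", "ABABC"))   # -> 5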

  15. String Matching Example (1) • Document D = ABCABABABC (positions m = 0…9) • Word W = ABABC (positions i = 0…4)

  16. String Matching Example (2) • D = ABCABABABC, W = ABABC • Two cursors, e.g., m = 2 and i = 1 • m: beginning of the matching part in D • i: the location of the matching char in W

  17. String Matching Example (3) • D = ABCABABABC, W = ABABC • Mismatch at m = 0, i = 2 (D[2] = C vs W[2] = A) • Q: What can we do? Start again at m = 1, i = 0?

  18. String Matching Example (4) • D = ABCABABABC, W = ABABC • Mismatch at m = 3, i = 4 (D[7] = A vs W[4] = C) • Q: What can we do? Start at m = 7, i = 0?

  19. Algorithm KMP • If no prefix of W repeats inside the matched part, we can slide W completely past the matched portion • m ← m + i • i ← 0 • If a suffix of the matched part is equal to a prefix of W, we have to slide back a little bit • m ← m + i − x // x is how much to slide back • i ← x • The exact value of x depends on the length of the prefix matching the suffix of the matched part • T[0…|W|−1]: “slide-back” table recording the x values

  20. Algorithm KMP
      W: string to look for; D: document; T: “slide-back” table in case of mismatch
      m ← 0, i ← 0
      while (m + i) < |D| do:
          if W[i] = D[m + i]:
              let i = i + 1
              if i = |W|, return m
          otherwise:
              let m = m + i − T[i]
              if i > 0, let i = T[i]
      return no-match
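
  A direct Python transcription of this pseudocode, as a sketch; the slide-back table T is passed in explicitly (its construction is the subject of the next slide).

      def kmp_search(D: str, W: str, T: list[int]) -> int:
          """KMP search as in the pseudocode above; T is the 'slide-back'
          table with T[0] = -1."""
          m, i = 0, 0     # m: start of the current attempt in D; i: cursor in W
          while m + i < len(D):
              if W[i] == D[m + i]:
                  i += 1
                  if i == len(W):
                      return m                     # W occurs at position m
              else:
                  m, i = m + i - T[i], (T[i] if i > 0 else 0)
          return -1                                # no match

      # Running example from slides 15-18; T for "ABABC" is [-1, 0, 0, 1, 2]:
      print(kmp_search("ABCABABABC", "ABABC", [-1, 0, 0, 1, 2]))   # -> 5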

  21. Algorithm KMP: T[i] Table • W = ABCDABD (positions i = 0…6) • m ← m + i − T[i] • T[0] = −1, T[1] = 0 • Q: What should be T[i] for i = 2…6?
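
  One way to compute the table, sketched in Python from the definition on slide 19 (T[i] is the length of the longest proper prefix of W[0…i−1] that is also its suffix, with T[0] = −1); running it on ABCDABD fills in the entries asked for above.

      def build_table(W: str) -> list[int]:
          """T[i] = length of the longest proper prefix of W[:i] that is
          also a suffix of W[:i]; T[0] = -1 by convention."""
          T = [-1] + [0] * (len(W) - 1)
          for i in range(2, len(W)):
              k = T[i - 1]                 # border length of W[:i-1]
              while k >= 0 and W[i - 1] != W[k]:
                  k = T[k]                 # fall back to a shorter border
              T[i] = k + 1
          return T

      print(build_table("ABCDABD"))   # -> [-1, 0, 0, 0, 0, 1, 2]
      print(build_table("ABABC"))     # -> [-1, 0, 0, 1, 2]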

  22. Data Structure for Quick Document Matching • Boolean model • Find all documents that contain the keywords in Q. • Q: What data structure will be useful to do it fast?

  23. Inverted Index • Allows quick lookup of the ids of documents containing a particular word • Q: How can we use this to answer “UCLA Physics”? • [Figure: lexicon/dictionary DIC with entries Stanford, UCLA, MIT, …, each pointing to its postings list PL(Stanford), PL(UCLA), PL(MIT), …]

  24. Inverted Index • Allows quick lookup of the ids of documents containing a particular word • [Figure: the same dictionary-to-postings-lists structure as on the previous slide]
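
  Under the Boolean model, “UCLA Physics” is answered by intersecting the two postings lists. A minimal sketch with made-up doc ids; postings lists are kept sorted so the merge is linear.

      def intersect(p1: list[int], p2: list[int]) -> list[int]:
          """Merge-style intersection of two sorted postings lists:
          O(len(p1) + len(p2))."""
          out, i, j = [], 0, 0
          while i < len(p1) and j < len(p2):
              if p1[i] == p2[j]:
                  out.append(p1[i])
                  i += 1
                  j += 1
              elif p1[i] < p2[j]:
                  i += 1
              else:
                  j += 1
          return out

      PL = {"UCLA": [1, 4, 7, 9], "Physics": [2, 4, 9]}   # hypothetical postings
      print(intersect(PL["UCLA"], PL["Physics"]))          # -> [4, 9]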

  25. Size of Inverted Index (1) • 100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid • Q: Document collection size? • Q: Inverted index size? • Heaps’ law: vocabulary size V = k·n^b, with 30 < k < 100 and 0.4 < b < 1 (n: total number of word occurrences in the collection) • k = 50 and b = 0.5 are a good rule of thumb
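
  A back-of-the-envelope pass over these questions, taking the slide’s figures at face value: the collection is 100M × 10KB ≈ 1 TB, and the docid postings alone take 100M docs × 1,000 unique words × 4 B ≈ 400 GB. For the dictionary, with k = 50, b = 0.5 and roughly n ≈ 10^11 total word occurrences, Heaps’ law gives V ≈ 50 × √(10^11) ≈ 16M terms, i.e., only about 16M × 10 B ≈ 160 MB.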

  26. Size of Inverted Index (2) • Q: Between the dictionary and the postings lists, which one is larger? • Q: Lengths of postings lists? • Zipf’s law: collection term frequency ∝ 1 / frequency rank • Q: How do we construct an inverted index?

  27. Inverted Index Construction
      C: set of all documents (corpus)
      DIC: dictionary of inverted index
      PL(w): postings list of word w
      For each document d ∈ C:
          Extract all words in content(d) into W
          For each w ∈ W:
              If w ∉ DIC, then add w to DIC
              Append id(d) to PL(w)
      Q: What if the index is larger than main memory?
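
  An in-memory sketch of this loop in Python; a defaultdict plays the role of DIC and the PL(w) lists together, and taking each document’s word set keeps a doc id from being appended twice.

      from collections import defaultdict

      def build_index(corpus: dict[int, str]) -> dict[str, list[int]]:
          """In-memory inverted-index construction; assumes everything fits
          in RAM (the question above is about when it does not)."""
          index: dict[str, list[int]] = defaultdict(list)
          for doc_id, text in sorted(corpus.items()):   # sorted -> sorted PLs
              for w in set(text.lower().split()):       # words of document d
                  index[w].append(doc_id)               # append id(d) to PL(w)
          return index

      index = build_index({1: "UCLA Physics", 2: "Stanford Physics"})
      print(index["physics"])   # -> [1, 2]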

  28. Inverted-Index Construction • For a large text corpus • Block sort-based construction • Partition, sort each block, and merge (sketched below)
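
  A toy sketch of the block-sorted idea, with in-memory lists standing in for the on-disk runs that a real indexer would write; heapq.merge performs the k-way merge of the sorted runs.

      import heapq

      def block_sorted_index(blocks):
          """blocks: iterable of lists of (term, doc_id) pairs, one list per
          memory-sized partition of the corpus."""
          runs = [sorted(b) for b in blocks]          # sort each block separately
          index: dict[str, list[int]] = {}
          for term, doc_id in heapq.merge(*runs):     # k-way merge of sorted runs
              pl = index.setdefault(term, [])
              if not pl or pl[-1] != doc_id:          # drop duplicate doc ids
                  pl.append(doc_id)
          return index

      blocks = [[("ucla", 1), ("physics", 1)], [("physics", 2), ("stanford", 2)]]
      print(block_sorted_index(blocks)["physics"])    # -> [1, 2]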

  29. Evaluation: Precision and Recall • Q: Are all matching documents what users want? • Basic idea: a model is good if it returns a document if and only if it is “relevant” • R: set of “relevant” documents • D: set of documents returned by the model
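
  The two standard measures over these sets are precision = |R ∩ D| / |D| (what fraction of the returned documents is relevant) and recall = |R ∩ D| / |R| (what fraction of the relevant documents is returned); in code:

      def precision_recall(R: set, D: set) -> tuple[float, float]:
          """Precision = |R ∩ D| / |D|;  Recall = |R ∩ D| / |R|."""
          hit = len(R & D)
          return hit / len(D), hit / len(R)

      print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))   # -> (0.666..., 0.5)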

  30. Vector-Space Model • Main problem of the Boolean model • Too many matching documents when the corpus is large • Any way to “rank” documents? • Matrix interpretation of the Boolean model • Document-term matrix • Boolean 0 or 1 value for each entry • Basic idea • Assign a real-valued weight to each matrix entry depending on the importance of the term • “the” vs “UCLA” • Q: How should we assign the weights?

  31. TF-IDF Vector • A term t is important for document d • If t appears many times in d, or • If t is a “rare” term • TF: term frequency • # occurrences of t in d • DF: document frequency • # documents containing t • IDF: inverse document frequency • log(N/DF), where N is the total # of documents • TF-IDF weighting • TF × log(N/DF) • Q: How to use it to compute query-document relevance?
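
  A small sketch of the weighting with made-up statistics; N and the df table below are hypothetical.

      import math
      from collections import Counter

      def tfidf_vector(doc: list[str], df: dict[str, int], N: int) -> dict[str, float]:
          """weight(t, d) = tf(t, d) * log(N / df(t))."""
          tf = Counter(doc)
          return {t: tf[t] * math.log(N / df[t]) for t in tf}

      # Hypothetical stats for a 1M-doc corpus: "the" is everywhere, "UCLA" rare.
      df, N = {"the": 1_000_000, "UCLA": 1_000}, 1_000_000
      print(tfidf_vector(["the", "UCLA", "the"], df, N))
      # -> {'the': 0.0, 'UCLA': 6.9...}: the rare term carries all the weight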

  32. Cosine Similarity • Represent both the query and the document as TF-IDF vectors • Take the inner product of the two normalized vectors to compute their similarity • Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents.
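
  In symbols: cos(Q, D) = (Q · D) / (|Q| |D|), where Q and D are the TF-IDF vectors and |·| is the vector length.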

  33. Cosine Similarity: Example • idf(UCLA)=10, idf(good)=0.1, idf(university) = idf(car) = idf(racing) = 1 • Q = (UCLA, university), D = (car, racing) • Q = (UCLA, university), D = (UCLA, good) • Q = (UCLA, university), D = (university, good)
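
  A quick check of the three cases in Python, assuming TF = 1 for every term so that each weight is just the idf given above.

      import math

      def cosine(q: dict[str, float], d: dict[str, float]) -> float:
          """cos(Q, D) = (Q . D) / (|Q| |D|)."""
          dot = sum(w * d.get(t, 0.0) for t, w in q.items())
          length = lambda v: math.sqrt(sum(x * x for x in v.values()))
          return dot / (length(q) * length(d))

      idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}
      Q = {t: idf[t] for t in ("UCLA", "university")}
      for terms in [("car", "racing"), ("UCLA", "good"), ("university", "good")]:
          D = {t: idf[t] for t in terms}
          print(terms, round(cosine(Q, D), 3))
      # -> 0.0, 0.995, 0.099: sharing the rare term "UCLA" dominates the ranking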

  34. Finding High Cosine-Similarity Documents • Q: Under vector-space model, does precision/recall make sense? • Q: How to find the documents with highest cosine similarity from corpus? • Q: Any way to avoid complete scan of corpus?

  35. Inverted Index for TF-IDF • [Figure: the lexicon stores each word with its IDF (Stanford: 1/3530, UCLA: 1/9860, MIT: 1/9378, …) and points to a postings list of (docid, TF) entries, e.g., (D1, 2), (D14, 30), (D376, …); TF may be normalized by document size] • Q · di = 0 if di has no query words • Consider only the documents with query words • Inverted Index: Word → Document
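
  A sketch of how such an index supports ranking without a full corpus scan: term-at-a-time scoring keeps an accumulator only for documents that occur in some query term’s postings list. The index contents below are hypothetical, and normalization by |D| is omitted.

      def score_query(q_terms, index, idf):
          """index: word -> list of (doc_id, tf); idf: word -> idf weight.
          Documents with Q . d = 0 are never touched."""
          acc: dict[str, float] = {}
          for t in q_terms:                     # walk one postings list per term
              for doc_id, tf in index.get(t, []):
                  acc[doc_id] = acc.get(doc_id, 0.0) + (tf * idf[t]) * idf[t]
          return sorted(acc.items(), key=lambda kv: -kv[1])

      index = {"UCLA": [("D14", 30)], "physics": [("D14", 2), ("D376", 8)]}
      idf = {"UCLA": 6.9, "physics": 3.1}
      print(score_query(["UCLA", "physics"], index, idf))   # D14 ranks first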

  36. Phrase Queries • “Harvard University Boston” exactly as a phrase • Q: How can we support this query? • Two approaches • Biword index • Positional index • Q: Pros and cons of each approach? • Rule of thumb: a positional index is 2x to 4x larger than a docid-only index
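
  A sketch of phrase matching over a positional index (word → {doc id → sorted positions}); the tiny index below is made up.

      def phrase_match(pos_index, phrase):
          """A document matches if the phrase's words occur at consecutive
          positions.  pos_index: word -> {doc_id: list of positions}."""
          result = []
          for doc_id, positions in pos_index.get(phrase[0], {}).items():
              if any(all(p + k in pos_index.get(w, {}).get(doc_id, ())
                         for k, w in enumerate(phrase[1:], start=1))
                     for p in positions):
                  result.append(doc_id)
          return result

      idx = {"harvard": {1: [0, 7]}, "university": {1: [1]}, "boston": {1: [2, 9]}}
      print(phrase_match(idx, ["harvard", "university", "boston"]))   # -> [1]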

  37. Spell Correction • Q: What may the user have truly intended by the query “Britnie Spears”? How can we find the correct spelling? • Given a user-typed word w, find its correct spelling c • Probabilistic approach: find the c with the highest probability P(c|w) • Q: How to estimate it? • Bayes’ rule: P(c|w) = P(w|c)P(c)/P(w) • Q: What are these probabilities and how can we estimate them? • Rule of thumb: about 3/4 of misspellings are within edit distance 1; 98% are within edit distance 2
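
  A minimal Norvig-style sketch of this approach: P(c) comes from a word-frequency table (assumed given), P(w|c) is crudely approximated as uniform over the candidates within edit distance 1, and P(w) can be ignored since it is the same for every c.

      def edits1(w: str) -> set[str]:
          """All strings at edit distance 1 from w
          (deletions, transpositions, replacements, insertions)."""
          letters = "abcdefghijklmnopqrstuvwxyz"
          splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
          return ({L + R[1:] for L, R in splits if R} |
                  {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1} |
                  {L + c + R[1:] for L, R in splits if R for c in letters} |
                  {L + c + R for L, R in splits for c in letters})

      def correct(w: str, P: dict[str, float]) -> str:
          """argmax_c P(c) over known words within edit distance 1 of w."""
          candidates = (edits1(w) & P.keys()) or {w}
          return max(candidates, key=lambda c: P.get(c, 0.0))

      P = {"britney": 0.8, "britain": 0.2}    # toy language model
      print(correct("britny", P))             # -> "britney"
      # ("britnie" itself is at distance 2, which the 98% rule above covers;
      #  handling it needs a second round of edits.)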

  38. Summary • Boolean model • Vector-space model • TF-IDF weight, cosine similarity • String-matching algorithm • Algorithm KMP • Inverted index • For the Boolean model • For the TF-IDF model • Phrase queries • Spell correction
