
Web Search & Information Retrieval

This article explores the process of preparing a keyword index for a given corpus and responding to keyword queries with a ranked list of documents. It covers tokenization, stemming, and stopword removal; discusses the challenges of batch indexing and updates in dynamic collections, including the use of stop-press indices; and reviews index compression techniques that reduce the storage needed to hold and query the index.

Presentation Transcript


  1. Web Search & Information Retrieval

  2. Web search engines • Rooted in Information Retrieval (IR) systems • Prepare a keyword index for the corpus • Respond to keyword queries with a ranked list of documents • ARCHIE • Earliest application of rudimentary IR systems to the Internet • Title search across sites serving files over FTP

  3. Boolean queries: Examples • Simple queries involving relationships between terms and documents • Documents containing the word Java • Documents containing the word Java but not the word coffee • Proximity queries • Documents containing the phrase Java beans or the term API • Documents where Java and island occur in the same sentence

  4. Document preprocessing • Tokenization • Filtering away tags • Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation • Each token represented by a suitable integer, tid, typically 32 bits • Optional: stemming/conflation of words • Result: document (did) transformed into a sequence of integers (tid, pos)
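To make this step concrete, here is a minimal Python sketch (not from the original slides); the function name, the lexicon dictionary, and the sample text are invented for illustration. It lowercases, tokenizes, assigns each distinct term an integer tid, and emits the (tid, pos) sequence described above.

```python
import re

def tokenize(text, lexicon):
    """Split text into tokens, assign each a 32-bit term id (tid),
    and return the document as a sequence of (tid, pos) pairs."""
    # Tokens: maximal runs of word characters (no spaces or punctuation).
    tokens = re.findall(r"\w+", text.lower())
    postings = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))  # assign next free integer
        postings.append((tid, pos))
    return postings

lexicon = {}                       # term -> tid
doc = "Java beans or the Java API"
print(tokenize(doc, lexicon))      # [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4), (4, 5)]
```

A real indexer would also strip tags and stopwords, and optionally stem tokens, before assigning tids.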

  5. Storing tokens • Straightforward implementation using a relational database • Example figure • Space requirement scales to almost 10 times the corpus size • Accesses to the table show a common pattern • Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples • Indexing = transposing the document-term matrix (documents D1 … Dm, terms t1 … tn)

  6. Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.
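A toy in-memory version of the positional variant described above might look as follows (a sketch, not the slides' implementation; an on-disk B-tree or hash table would replace the Python dict):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {did: [(tid, pos), ...]} as produced by preprocessing.
    Returns the positional variant: tid -> [(did, pos), ...]."""
    index = defaultdict(list)
    for did, postings in docs.items():
        for tid, pos in postings:
            index[tid].append((did, pos))
    return index

docs = {1: [(0, 0), (1, 1)], 2: [(0, 0), (2, 1), (0, 2)]}
index = build_inverted_index(docs)
print(index[0])   # [(1, 0), (2, 0), (2, 2)] -- "document/position" entries for tid 0
```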

  7. Storage • For dynamic corpora • Berkeley DB storage manager • Can frequently add, modify and delete documents • For static collections • Index compression techniques (to be discussed)

  8. Stopwords • Function words and connectives • Appear in a large number of documents and are of little use in pinpointing documents • Indexing stopwords • Stopwords not indexed • For reducing index space and improving performance • Replace stopwords with a placeholder (to remember the offset) • Issues • Queries containing only stopwords ruled out • Polysemous words that are stopwords in one sense but not in others • E.g.: can as a verb vs. can as a noun

  9. Stemming • Conflating words to help match a query term with a morphological variant in the corpus • Remove inflections that convey parts of speech, tense and number • E.g.: university and universal both stem to universe • Techniques • Morphological analysis (e.g., Porter's algorithm) • Dictionary lookup (e.g., WordNet) • Stemming may increase recall but at the price of precision • Abbreviations, polysemy and names coined in the technical and commercial sectors • E.g.: stemming “ides” to “IDE”, “SOCKS” to “sock”, or “gated” to “gate” may be bad!
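As a hedged illustration only: the snippet below is a toy suffix-stripper, not Porter's algorithm (for real use one would take an existing implementation such as NLTK's PorterStemmer); it merely shows how conflation merges variants and why it can hurt precision.

```python
def crude_stem(token):
    """Toy suffix-stripping conflation (NOT Porter's algorithm):
    strips a few inflectional endings to merge morphological variants."""
    for suffix in ("ations", "ation", "ings", "ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(crude_stem("beans"))   # 'bean'
print(crude_stem("gated"))   # 'gat' -- over-stemming of the kind the slide warns about
```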

  10. Batch indexing and updates • Incremental indexing • Time-consuming due to random disk IO • High level of disk block fragmentation • Simple sort-merges • Preferred over in-place (indexed) updates of variable-length postings • For a dynamic collection • A single document-level change may need to update hundreds to thousands of records • Solution: create an additional “stop-press” index.

  11. Maintaining indices over dynamic collections.

  12. Stop-press index • Collection of documents in flux • Model document modification as a deletion followed by an insertion • Documents in flux represented by a signed record (d, t, s) • “s” specifies whether “d” has been deleted or inserted • Getting the final answer to a query (see the sketch below) • Main index returns a document set D0 • Stop-press index returns two document sets • D+: documents not yet indexed in D0 matching the query • D−: documents matching the query removed from the collection since D0 was constructed • Stop-press index getting too large • Rebuild the main index • Signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records • Stop-press index can then be emptied out.
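A minimal sketch of the query-time combination described above, assuming the main index has already returned D0 and the stop-press index is a list of signed (d, t, s) records (all names are invented for the example):

```python
def answer_query(main_index_hits, stop_press_records, query_terms):
    """main_index_hits: set D0 of doc ids from the (possibly stale) main index.
    stop_press_records: signed (d, t, s) tuples, s = +1 insertion / -1 deletion.
    Returns (D0 | D+) - D- for the given query terms."""
    d_plus  = {d for d, t, s in stop_press_records if s > 0 and t in query_terms}
    d_minus = {d for d, t, s in stop_press_records if s < 0 and t in query_terms}
    return (main_index_hits | d_plus) - d_minus

d0 = {1, 2, 3}
stop_press = [(4, "java", +1), (2, "java", -1)]   # doc 4 added, term removed from doc 2
print(answer_query(d0, stop_press, {"java"}))      # {1, 3, 4}
```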

  13. Index compression techniques • Compressing the index so that much of it can be held in memory • Required for high-performance IR installations (as with Web search engines) • Redundancy in index storage • Storage of document IDs • Delta encoding • Sort doc IDs in increasing order • Store the first ID in full • Subsequently store only the difference (gap) from the previous ID
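Delta (gap) encoding of the sorted document IDs can be sketched in a few lines (illustrative code, not from the slides):

```python
def gap_encode(doc_ids):
    """Sort doc ids, keep the first in full, then store only the gaps."""
    doc_ids = sorted(doc_ids)
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Recover the original doc ids by a running sum of the gaps."""
    ids, current = [], 0
    for g in gaps:
        current += g
        ids.append(current)
    return ids

gaps = gap_encode([10007, 10003, 10010, 10005])
print(gaps)               # [10003, 2, 2, 3] -- small gaps instead of full IDs
print(gap_decode(gaps))   # [10003, 10005, 10007, 10010]
```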

  14. Encoding gaps • A small gap must cost far fewer bits than a full document ID • Binary encoding • Optimal when all symbols are equally likely • Unary code • Optimal if the probability of large gaps decays exponentially

  15. Encoding gaps • Gamma code • Represent gap x as • Unary code for 1 + ⌊log₂ x⌋, followed by • x − 2^⌊log₂ x⌋ represented in binary (⌊log₂ x⌋ bits) • Golomb codes • Further enhancement
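A small sketch of the gamma code just described, using the convention that the unary code for n is written as n − 1 ones followed by a zero (an assumption; conventions vary):

```python
def gamma_encode(x):
    """Elias gamma code for a gap x >= 1:
    unary code for 1 + floor(log2 x), then x - 2**floor(log2 x) in binary."""
    assert x >= 1
    b = x.bit_length() - 1                               # floor(log2 x)
    unary = "1" * b + "0"                                # unary code for b + 1
    offset = format(x - (1 << b), f"0{b}b") if b else "" # b low-order bits of x
    return unary + offset

for x in (1, 2, 5, 9):
    print(x, gamma_encode(x))   # prints: 1 0, 2 100, 5 11001, 9 1110001
```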

  16. Lossy compression mechanisms • Trading off space for time • Collect documents into buckets • Construct an inverted index from terms to bucket IDs • Document IDs shrink to half their size • Cost: time overheads • For each query, all documents in a matching bucket need to be scanned • Solution: index documents in each bucket separately • E.g.: Glimpse (http://webglimpse.org/)

  17. General dilemmas • Messy updates vs. High compression rate • Storage allocation vs. Random I/Os • Random I/O vs. large scale implementation

  18. Relevance ranking • Keyword queries • In natural language • Not precise, unlike SQL • Boolean decision for response unacceptable • Solution • Rate each document for how likely it is to satisfy the user's information need • Sort in decreasing order of the score • Present results in a ranked list. • No algorithmic way of ensuring that the ranking strategy always favors the information need • Query: only a part of the user's information need

  19. Responding to queries • Set-valued response • Response set may be very large • (E.g., by recent estimates, over 12 million Web pages contain the word java.) • Demanding selective query from user • Guessing user's information need and ranking responses • Evaluating rankings

  20. Evaluating procedure • Given benchmark • Corpus of n documents D • A set of queries Q • For each query, an exhaustive set of relevant documents Dq identified manually • Query submitted to the system • Ranked list of documents (d1, d2, …, dn) retrieved • Compute a 0/1 relevance list (r1, r2, …, rn) • ri = 1 iff di ∈ Dq • ri = 0 otherwise.

  21. Recall and precision • Recall at rank k • Fraction of all relevant documents included in the top k responses: recall(k) = (1/|Dq|) Σ_{i≤k} ri • Precision at rank k • Fraction of the top k responses that are actually relevant: precision(k) = (1/k) Σ_{i≤k} ri
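Given the 0/1 relevance list defined on the previous slide, recall and precision at rank k are direct to compute (a sketch; the relevance list and counts below are made up):

```python
def recall_at(k, relevance, total_relevant):
    """Fraction of all relevant documents among the top-k responses."""
    return sum(relevance[:k]) / total_relevant

def precision_at(k, relevance):
    """Fraction of the top-k responses that are actually relevant."""
    return sum(relevance[:k]) / k

rel = [1, 0, 1, 1, 0]        # 0/1 relevance list for one ranked result list
print(recall_at(3, rel, 4))  # 2/4 = 0.5 (assuming 4 relevant docs exist overall)
print(precision_at(3, rel))  # 2/3 ~= 0.667
```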

  22. Other measures • Average precision • Sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents • avg. precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document • Interpolated precision • To combine precision values from multiple queries • Gives the precision-vs.-recall curve for the benchmark • For each query, take the maximum precision obtained for the query at any recall greater than or equal to ρ • Average them together over all queries • Others, like measures of authority, prestige etc.
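Both summary measures can be sketched the same way for a single query (illustrative code; a benchmark would average these values over all queries in Q):

```python
def average_precision(relevance, total_relevant):
    """Sum of precision at each relevant rank, divided by the number of relevant docs."""
    hits, score = 0, 0.0
    for i, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            score += hits / i          # precision at this relevant rank
    return score / total_relevant

def interpolated_precision(relevance, total_relevant, rho):
    """Maximum precision at any rank whose recall is >= rho (single query)."""
    best, hits = 0.0, 0
    for i, r in enumerate(relevance, start=1):
        hits += r
        if hits / total_relevant >= rho:
            best = max(best, hits / i)
    return best

rel = [1, 0, 1, 1, 0]
print(average_precision(rel, 3))            # (1/1 + 2/3 + 3/4) / 3 ~= 0.806
print(interpolated_precision(rel, 3, 0.5))  # 0.75
```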

  23. Precision-Recall tradeoff • Interpolated precision cannot increase with recall • Interpolated precision at recall level 0 may be less than 1 • At level k = 0 • Precision (by convention) = 1, Recall = 0 • Inspecting more documents • Can increase recall • Precision may decrease • We will start encountering more and more irrelevant documents • A search engine with a good ranking function will generally show a negative relation between recall and precision • The higher the curve, the better the engine

  24. Precision and interpolated precision plotted against recall for the given relevance vector. Missing entries of the relevance vector are zeroes.

  25. The vector space model • Documents represented as vectors in a multi-dimensional Euclidean space • Each axis = a term (token) • Coordinate of document d in direction of term t determined by: • Term frequency TF(d,t) • number of times term t occurs in document d, scaled in a variety of ways to normalize document length • Inverse document frequency IDF(t) • to scale down the coordinates of terms that occur in many documents

  26. Term frequency • TF(d,t) based on n(d,t), the number of times term t occurs in document d • Cornell SMART system uses a smoothed version: TF(d,t) = 0 if n(d,t) = 0, and 1 + log(1 + log n(d,t)) otherwise

  27. Inverse document frequency • Given • D is the document collection and Dt is the set of documents containing t • Formulae • Mostly dampened functions of |D|/|Dt| • SMART uses IDF(t) = log(1 + |D|/|Dt|)

  28. Vector space model • Coordinate of document d in axis t • Transformed to dt = TF(d,t) · IDF(t) in the TFIDF-space • Query q • Interpreted as a document • Transformed to a vector in the same TFIDF-space as d
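Putting the TF and IDF formulas above together, a document's TFIDF-space vector can be sketched as follows (illustrative code using the SMART-style formulas reconstructed above; the term counts and document frequencies are made up):

```python
import math

def tf(n_dt):
    """SMART-style damped term frequency: 0 if absent, else 1 + ln(1 + ln n(d,t))."""
    return 0.0 if n_dt == 0 else 1.0 + math.log(1.0 + math.log(n_dt))

def idf(num_docs, num_docs_with_t):
    """Damped inverse document frequency: log(1 + |D| / |D_t|)."""
    return math.log(1.0 + num_docs / num_docs_with_t)

def tfidf_vector(term_counts, doc_freq, num_docs):
    """term_counts: {term: n(d,t)} for one document; doc_freq: {term: |D_t|}."""
    return {t: tf(n) * idf(num_docs, doc_freq[t]) for t, n in term_counts.items()}

vec = tfidf_vector({"java": 3, "island": 1}, {"java": 50, "island": 5}, num_docs=1000)
print(vec)   # the rarer term "island" receives a larger IDF boost than "java"
```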

  29. Measures of proximity • Distance measure • Magnitude of the vector difference, |d − q| • Document vectors must be normalized to unit (L1 or L2) length • Else shorter documents dominate (since queries are short) • Cosine similarity • Cosine of the angle between d and q • Shorter documents are penalized

  30. Relevance feedback • Web query is often short: 2 words • Incomplete or ambiguous • Users learn how to modify queries • Response list must have at least some relevant documents • Relevance feedback • ‘Correcting’ the ranks to the user's taste • Automates the query refinement process • Rocchio's method (sketched below) • Folding user feedback into the query vector • Add a weighted sum of the vectors for relevant documents D+ • Subtract a weighted sum of the vectors for irrelevant documents D− • q' = αq + β Σ_{d∈D+} d − γ Σ_{d∈D−} d
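A sketch of the Rocchio update with vectors stored as term-to-weight dictionaries; the α, β, γ values below are common illustrative defaults, not prescribed by the slides:

```python
from collections import defaultdict

def rocchio(query_vec, relevant_vecs, irrelevant_vecs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*sum(D+) - gamma*sum(D-); vectors are {term: weight} dicts."""
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for d in relevant_vecs:
        for t, w in d.items():
            new_q[t] += beta * w
    for d in irrelevant_vecs:
        for t, w in d.items():
            new_q[t] -= gamma * w
    return {t: w for t, w in new_q.items() if w > 0}   # drop non-positive weights

q  = {"java": 1.0}
dp = [{"java": 0.5, "api": 0.8}]   # documents the user marked relevant
dm = [{"coffee": 0.9}]             # documents the user marked irrelevant
print(rocchio(q, dp, dm))          # roughly {'java': 1.375, 'api': 0.6}; 'coffee' dropped
```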

  31. Relevance feedback (contd.) • Pseudo-relevance feedback • D+ and D− generated automatically • E.g.: Cornell SMART system • Top 10 documents reported by the first round of query execution are included in D+ • γ typically set to 0; D− not used • Not a commonly available feature • Web users want instant gratification • System complexity • Executing the second-round query is slower and expensive for major search engines

  32. Ranking by odds ratio • R: Boolean random variable which represents the relevance of document d w.r.t. query q • Rank documents by their odds ratio for relevance, Pr(R = 1 | d, q) / Pr(R = 0 | d, q) • Approximate the probability of d by the product of the probabilities of the individual terms in d

  33. Bayesian Inferencing Manual specification of mappings from terms to approximate concepts. Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.

  34. Bayesian Inferencing (contd.) • Four layers • Document layer • Representation layer • Query concept layer • Query • Each node is associated with a random Boolean variable, reflecting belief • Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on..)

  35. Bayesian Inferencing systems • Layers 2 & 3 are the same in basic vector-space IR systems • Verity's Search97 • Allows administrators and users to define hierarchies of concepts in files • Estimation of the relevance of a document d w.r.t. the query q • Set the belief of the corresponding node to 1 • Set all other document beliefs to 0 • Compute the belief of the query • Rank documents in decreasing order of the belief they induce in the query • If a node v has k parents u1, …, uk, the model specifies Pr(v = true | u1, …, uk)

  36. Other issues • Spamming • Adding popular query terms to a page unrelated to those terms • E.g.: adding “Hawaii vacation rental” to a page about “Internet gambling” • Hyperlink-based ranking has dealt such spamming a small setback • Titles, headings, meta tags and anchor-text • TFIDF framework treats all terms the same • Search engines therefore • Assign extra weightage to text occurring in titles, tags and meta-tags • Use anchor-text on pages u which link to v • Anchor-text on u offers valuable editorial judgment about v as well.

  37. Other issues (contd..) • Including phrases to rank complex queries • Operators to specify word inclusions and exclusions • With operators and phrases queries/documents can no longer be treated as ordinary points in vector space • Dictionary of phrases (bigrams) • Could be cataloged manually • Could be derived from the corpus itself using statistical techniques • Two separate indices: • one for single terms and another for phrases

  38. Corpus-derived phrase dictionary • Two terms t1 and t2 • Null hypothesis = occurrences of t1 and t2 are independent • To the extent the pair violates the null hypothesis, it is likely to be a phrase • Measure the violation with the likelihood ratio of the hypothesis • Pick phrases that violate the null hypothesis with large confidence • Contingency table built from corpus statistics: cells k00(t1,t2), k01(t1,t2), k10(t1,t2), k11(t1,t2) count the contexts with neither term, only t2, only t1, and both terms

  39. Corpus-derived phrase dictionary • Hypotheses • Null hypothesis: a single rate p governs both cases (Bernoulli trials / binomial distribution) • Alternative hypothesis: two different rates p1 and p2 • Likelihood ratio λ = likelihood of the null hypothesis / likelihood of the alternative • −2 log λ is asymptotically χ²-distributed

  40. Likelihood Ratio (Dunning, 1993) • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic. • In applying the likelihood ratio test to collocation discovery, use the following two alternative explanations for the occurrence frequency of a bigram w1 w2: • H1: The occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = P(w2 | ¬w1) = p • H2: The occurrence of w2 depends on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2

  41. Likelihood Ratio • Use the MLEs for the probabilities p, p1, and p2 and assume the binomial distribution, with N = number of tokens, c1 = count of w1, c2 = count of w2, c12 = count of the bigram w1 w2 • Under H1: P(w2 | w1) = P(w2 | ¬w1) = P(w2) = c2/N • Under H2: P(w2 | w1) = c12/c1 = p1, P(w2 | ¬w1) = (c2 − c12)/(N − c1) = p2 • Under H1: b(c12; c1, p) is the likelihood that c12 of the c1 bigrams starting with w1 are w1 w2, and b(c2 − c12; N − c1, p) the likelihood that c2 − c12 of the remaining N − c1 bigrams are ¬w1 w2 • Under H2: b(c12; c1, p1) and b(c2 − c12; N − c1, p2) are the corresponding likelihoods

  42. Likelihood Ratio • The likelihood of H1 (likelihood of independence): L(H1) = b(c12; c1, p) · b(c2 − c12; N − c1, p) • The likelihood of H2 (likelihood of dependence): L(H2) = b(c12; c1, p1) · b(c2 − c12; N − c1, p2) • The log likelihood ratio is log λ = log(L(H1) / L(H2)) = log b(c12; c1, p) + log b(c2 − c12; N − c1, p) − [log b(c12; c1, p1) + log b(c2 − c12; N − c1, p2)] • The quantity −2 log λ is asymptotically χ²-distributed.
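The whole test fits in a few lines once the counts c1, c2, c12 and N are available (a sketch; the binomial coefficient is dropped because it cancels in the ratio, and the example counts are made up):

```python
import math

def log_b(k, n, p):
    """Log of the binomial likelihood b(k; n, p), ignoring the constant C(n, k),
    which cancels in the ratio.  Guard against p hitting 0 or 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def bigram_llr(c1, c2, c12, n):
    """-2 log lambda for the bigram w1 w2; larger values are more phrase-like."""
    p  = c2 / n                      # H1: P(w2 | w1) = P(w2 | not w1)
    p1 = c12 / c1                    # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)       # H2: P(w2 | not w1)
    log_lambda = (log_b(c12, c1, p) + log_b(c2 - c12, n - c1, p)
                  - log_b(c12, c1, p1) - log_b(c2 - c12, n - c1, p2))
    return -2.0 * log_lambda

# Hypothetical counts for a candidate bigram in a 100,000-token corpus.
print(bigram_llr(c1=300, c2=80, c12=60, n=100_000))   # large value => likely a phrase
```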

  43. Approximate string matching • Non-uniformity of word spellings • Dialects of English • Transliteration from other languages • Two ways to reduce this problem • Aggressive conflation mechanism to collapse variant spellings into the same token • Decompose terms into a sequence of q-grams, i.e., sequences of q characters

  44. Approximate string matching • Aggressive conflation mechanism to collapse variant spellings into the same token • E.g.: Soundex: takes phonetics and pronunciation details into account • Used with great success in indexing and searching last names in census and telephone directory data • Decompose terms into a sequence of q-grams, i.e., sequences of q characters • Check for similarity in the q-grams • Looking up the inverted index becomes a two-stage affair: • A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms • These terms are then submitted to the regular index • Used by Google for spelling correction • Idea also adopted for eliminating near-duplicate pages
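A toy version of the first stage of that two-stage lookup: decompose terms into q-grams and expand a possibly misspelled query term against the vocabulary (illustrative code; the padding character, the q-gram overlap threshold, and the sample vocabulary are arbitrary choices):

```python
def qgrams(term, q=3):
    """Pad the term and slide a window of q characters over it."""
    padded = f"${term}$"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def expand_term(query_term, vocabulary, q=3, threshold=0.4):
    """Return vocabulary terms whose q-gram sets overlap enough with the query
    term (Jaccard overlap on q-grams); these are then sent to the regular index."""
    qt = qgrams(query_term, q)
    matches = []
    for term in vocabulary:
        tg = qgrams(term, q)
        if len(qt & tg) / len(qt | tg) >= threshold:
            matches.append(term)
    return matches

vocab = {"university", "universal", "diversity"}
print(expand_term("univercity", vocab))   # the misspelling matches only 'university'
```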

  45. Meta-search systems • Take the search engine to the document • Forward queries to many geographically distributed repositories • Each has its own search service • Consolidate their responses. • Advantages • Perform non-trivial query rewriting • Suit a single user query to many search engines with different query syntax • Surprisingly small overlap between crawls • Consolidating responses • Function goes beyond just eliminating duplicates • Search services do not provide standard ranks which can be combined meaningfully

  46. Excellent work in homework #2 • Below are links to five students whose HW2 submissions stood out: • CS Master's Year 1, 李碩彰, P76944241 • http://140.116.245.198/metasearch/meta.html • CS Master's Year 1, 黃偉鈞, P76941104 • http://140.116.247.187:8080/honki/wwwF2.jsp • CS Master's Year 1, 林冠甫, P76944461 • http://140.116.245.199/webresourse/index.php • CS Master's Year 1, 陳建成 • http://140.116.245.233/web/ • IM Master's Year 2, 鄭凱任, R76931101 • http://140.116.96.175:98/metaSearch.html

  47. Beyond Google?

  48. Similarity search • Cluster hypothesis • Documents similar to relevant documents are also likely to be relevant • Handling “find similar” queries • Replication or duplication of pages • Mirroring of sites

  49. Document similarity • Jaccard coefficient of similarity between documents d1 and d2: r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)| • T(d) = set of tokens in document d • Symmetric and reflexive, but not a metric • Forgives any number of occurrences and any permutations of the terms • The distance 1 − r'(d1, d2) is a metric
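The Jaccard coefficient itself is a one-liner on token sets (a sketch; the sample documents are invented):

```python
def jaccard(doc1_tokens, doc2_tokens):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)| on token sets."""
    t1, t2 = set(doc1_tokens), set(doc2_tokens)
    return len(t1 & t2) / len(t1 | t2)

t1 = "java beans api".split()
t2 = "java island api".split()
print(jaccard(t1, t2))        # 2 shared tokens / 4 total = 0.5
print(1 - jaccard(t1, t2))    # the distance 1 - r' is a metric
```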

  50. Estimating the Jaccard coefficient with random permutations • Generate a set of m random permutations π of the token universe • for each permutation π do • compute min π(T(d1)) and min π(T(d2)) • check whether min π(T(d1)) = min π(T(d2)) • end for • if equality was observed in k of the m cases, estimate r'(d1, d2) = k/m
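A sketch of this estimator, simulating the m random permutations by shuffling the combined vocabulary (restricting the permutation to T(d1) ∪ T(d2) leaves the agreement probability unchanged, since tokens outside both sets can never attain the minimum):

```python
import random

def minhash_estimate(t1, t2, m=200, seed=0):
    """Estimate the Jaccard coefficient of token sets t1, t2 using m random
    permutations, simulated by shuffling the shared vocabulary."""
    rng = random.Random(seed)
    vocab = list(t1 | t2)
    agree = 0
    for _ in range(m):
        rng.shuffle(vocab)                       # one random permutation pi
        rank = {tok: i for i, tok in enumerate(vocab)}
        if min(rank[x] for x in t1) == min(rank[x] for x in t2):
            agree += 1                           # min pi(T(d1)) == min pi(T(d2))
    return agree / m

t1 = set("java beans api island".split())
t2 = set("java island api coffee".split())
print(minhash_estimate(t1, t2))   # should land near the true Jaccard value 3/5 = 0.6
```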
