
Web Search & Information Retrieval

This article explores the process of preparing a keyword index for a given corpus and responding to keyword queries with a ranked list of documents. It covers tokenization, stemming, and stopword removal; discusses the challenges of batch indexing and updates in dynamic collections, including the use of stop-press indices; and reviews index compression techniques that reduce the storage needed to hold and query the index.

Presentation Transcript


  1. Web Search & Information Retrieval

  2. Web search engines • Rooted in Information Retrieval (IR) systems • Prepare a keyword index for the corpus • Respond to keyword queries with a ranked list of documents • ARCHIE • Earliest application of rudimentary IR systems to the Internet • Title search across sites serving files over FTP

  3. Boolean queries: Examples • Simple queries involving relationships between terms and documents • Documents containing the word Java • Documents containing the word Java but not the word coffee • Proximity queries • Documents containing the phrase Java beans or the term API • Documents where Java and island occur in the same sentence

  4. Document preprocessing • Tokenization • Filtering away tags • Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation • Each token represented by a suitable integer, tid, typically 32 bits • Optional: stemming/conflation of words • Result: document (did) transformed into a sequence of integers (tid, pos)
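To make this step concrete, here is a minimal Python sketch (not from the original slides); the function name, the lexicon dictionary, and the sample text are invented for illustration. It lowercases, tokenizes, assigns each distinct term an integer tid, and emits the (tid, pos) sequence described above.

```python
import re

def tokenize(text, lexicon):
    """Split text into tokens, assign each a 32-bit term id (tid),
    and return the document as a sequence of (tid, pos) pairs."""
    # Tokens: maximal runs of word characters (no spaces or punctuation).
    tokens = re.findall(r"\w+", text.lower())
    postings = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))  # assign next free integer
        postings.append((tid, pos))
    return postings

lexicon = {}                       # term -> tid
doc = "Java beans or the Java API"
print(tokenize(doc, lexicon))      # [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4), (4, 5)]
```

A real indexer would also strip tags and stopwords, and optionally stem tokens, before assigning tids.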

  5. Storing tokens • Straightforward implementation using a relational database • Example figure • Space requirement scales to almost 10 times the corpus size • Accesses to the table show a common pattern • Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples • Indexing = transposing the document-term matrix (documents D1 … Dm, terms t1 … tn)

  6. Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.
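A toy in-memory version of the positional variant described above might look as follows (a sketch, not the slides' implementation; an on-disk B-tree or hash table would replace the Python dict):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {did: [(tid, pos), ...]} as produced by preprocessing.
    Returns the positional variant: tid -> [(did, pos), ...]."""
    index = defaultdict(list)
    for did, postings in docs.items():
        for tid, pos in postings:
            index[tid].append((did, pos))
    return index

docs = {1: [(0, 0), (1, 1)], 2: [(0, 0), (2, 1), (0, 2)]}
index = build_inverted_index(docs)
print(index[0])   # [(1, 0), (2, 0), (2, 2)] -- "document/position" entries for tid 0
```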

  7. Storage • For dynamic corpora • Berkeley DB storage manager • Can frequently add, modify and delete documents • For static collections • Index compression techniques (to be discussed)

  8. Stopwords • Function words and connectives • Appear in a large number of documents and are of little use in pinpointing documents • Indexing stopwords • Stopwords not indexed • For reducing index space and improving performance • Replace stopwords with a placeholder (to remember the offset) • Issues • Queries containing only stopwords ruled out • Polysemous words that are stopwords in one sense but not in others • E.g.: can as a verb vs. can as a noun

  9. Stemming • Conflating words to help match a query term with a morphological variant in the corpus • Remove inflections that convey parts of speech, tense and number • E.g.: university and universal both stem to universe • Techniques • Morphological analysis (e.g., Porter's algorithm) • Dictionary lookup (e.g., WordNet) • Stemming may increase recall but at the price of precision • Abbreviations, polysemy and names coined in the technical and commercial sectors • E.g.: stemming “ides” to “IDE”, “SOCKS” to “sock”, or “gated” to “gate” may be bad!
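As a hedged illustration only: the snippet below is a toy suffix-stripper, not Porter's algorithm (for real use one would take an existing implementation such as NLTK's PorterStemmer); it merely shows how conflation merges variants and why it can hurt precision.

```python
def crude_stem(token):
    """Toy suffix-stripping conflation (NOT Porter's algorithm):
    strips a few inflectional endings to merge morphological variants."""
    for suffix in ("ations", "ation", "ings", "ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(crude_stem("beans"))   # 'bean'
print(crude_stem("gated"))   # 'gat' -- over-stemming of the kind the slide warns about
```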

  10. Batch indexing and updates • Incremental indexing • Time-consuming due to random disk IO • High level of disk block fragmentation • Simple sort-merges • Preferred over in-place (indexed) updates of variable-length postings • For a dynamic collection • A single document-level change may need to update hundreds to thousands of records • Solution: create an additional “stop-press” index.

  11. Maintaining indices over dynamic collections.

  12. Stop-press index • Collection of documents in flux • Model document modification as a deletion followed by an insertion • Documents in flux represented by a signed record (d, t, s) • “s” specifies whether “d” has been deleted or inserted • Getting the final answer to a query (see the sketch below) • Main index returns a document set D0 • Stop-press index returns two document sets • D+: documents not yet indexed in D0 matching the query • D−: documents matching the query removed from the collection since D0 was constructed • Stop-press index getting too large • Rebuild the main index • Signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records • Stop-press index can then be emptied out.
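A minimal sketch of the query-time combination described above, assuming the main index has already returned D0 and the stop-press index is a list of signed (d, t, s) records (all names are invented for the example):

```python
def answer_query(main_index_hits, stop_press_records, query_terms):
    """main_index_hits: set D0 of doc ids from the (possibly stale) main index.
    stop_press_records: signed (d, t, s) tuples, s = +1 insertion / -1 deletion.
    Returns (D0 | D+) - D- for the given query terms."""
    d_plus  = {d for d, t, s in stop_press_records if s > 0 and t in query_terms}
    d_minus = {d for d, t, s in stop_press_records if s < 0 and t in query_terms}
    return (main_index_hits | d_plus) - d_minus

d0 = {1, 2, 3}
stop_press = [(4, "java", +1), (2, "java", -1)]   # doc 4 added, term removed from doc 2
print(answer_query(d0, stop_press, {"java"}))      # {1, 3, 4}
```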

  13. Index compression techniques • Compressing the index so that much of it can be held in memory • Required for high-performance IR installations (as with Web search engines) • Redundancy in index storage • Storage of document IDs • Delta encoding • Sort doc IDs in increasing order • Store the first ID in full • Subsequently store only the difference (gap) from the previous ID
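Delta (gap) encoding of the sorted document IDs can be sketched in a few lines (illustrative code, not from the slides):

```python
def gap_encode(doc_ids):
    """Sort doc ids, keep the first in full, then store only the gaps."""
    doc_ids = sorted(doc_ids)
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Recover the original doc ids by a running sum of the gaps."""
    ids, current = [], 0
    for g in gaps:
        current += g
        ids.append(current)
    return ids

gaps = gap_encode([10007, 10003, 10010, 10005])
print(gaps)               # [10003, 2, 2, 3] -- small gaps instead of full IDs
print(gap_decode(gaps))   # [10003, 10005, 10007, 10010]
```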

  14. Encoding gaps • A small gap must cost far fewer bits than a full document ID • Binary encoding • Optimal when all symbols are equally likely • Unary code • Optimal if the probability of large gaps decays exponentially

  15. Encoding gaps • Gamma code • Represent gap x as • Unary code for 1 + ⌊log₂ x⌋, followed by • x − 2^⌊log₂ x⌋ represented in binary (⌊log₂ x⌋ bits) • Golomb codes • Further enhancement
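A small sketch of the gamma code just described, using the convention that the unary code for n is written as n − 1 ones followed by a zero (an assumption; conventions vary):

```python
def gamma_encode(x):
    """Elias gamma code for a gap x >= 1:
    unary code for 1 + floor(log2 x), then x - 2**floor(log2 x) in binary."""
    assert x >= 1
    b = x.bit_length() - 1                               # floor(log2 x)
    unary = "1" * b + "0"                                # unary code for b + 1
    offset = format(x - (1 << b), f"0{b}b") if b else "" # b low-order bits of x
    return unary + offset

for x in (1, 2, 5, 9):
    print(x, gamma_encode(x))   # prints: 1 0, 2 100, 5 11001, 9 1110001
```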

  16. Lossy compression mechanisms • Trading off space for time • Collect documents into buckets • Construct an inverted index from terms to bucket IDs • Document IDs shrink to half their size • Cost: time overheads • For each query, all documents in a matching bucket need to be scanned • Solution: index documents in each bucket separately • E.g.: Glimpse (http://webglimpse.org/)

  17. General dilemmas • Messy updates vs. High compression rate • Storage allocation vs. Random I/Os • Random I/O vs. large scale implementation

  18. Relevance ranking • Keyword queries • In natural language • Not precise, unlike SQL • Boolean decision for response unacceptable • Solution • Rate each document for how likely it is to satisfy the user's information need • Sort in decreasing order of the score • Present results in a ranked list. • No algorithmic way of ensuring that the ranking strategy always favors the information need • Query: only a part of the user's information need

  19. Responding to queries • Set-valued response • Response set may be very large • (E.g., by recent estimates, over 12 million Web pages contain the word java.) • Demanding selective query from user • Guessing user's information need and ranking responses • Evaluating rankings

  20. Evaluating procedure • Given benchmark • Corpus of n documents D • A set of queries Q • For each query, an exhaustive set of relevant documents Dq identified manually • Query submitted to the system • Ranked list of documents (d1, d2, …, dn) retrieved • Compute a 0/1 relevance list (r1, r2, …, rn) • ri = 1 iff di ∈ Dq • ri = 0 otherwise.

  21. Recall and precision • Recall at rank k • Fraction of all relevant documents included in the top k responses: recall(k) = (1/|Dq|) Σ_{i≤k} ri • Precision at rank k • Fraction of the top k responses that are actually relevant: precision(k) = (1/k) Σ_{i≤k} ri
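Given the 0/1 relevance list defined on the previous slide, recall and precision at rank k are direct to compute (a sketch; the relevance list and counts below are made up):

```python
def recall_at(k, relevance, total_relevant):
    """Fraction of all relevant documents among the top-k responses."""
    return sum(relevance[:k]) / total_relevant

def precision_at(k, relevance):
    """Fraction of the top-k responses that are actually relevant."""
    return sum(relevance[:k]) / k

rel = [1, 0, 1, 1, 0]        # 0/1 relevance list for one ranked result list
print(recall_at(3, rel, 4))  # 2/4 = 0.5 (assuming 4 relevant docs exist overall)
print(precision_at(3, rel))  # 2/3 ~= 0.667
```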

  22. Other measures • Average precision • Sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents • avg. precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document • Interpolated precision • To combine precision values from multiple queries • Gives the precision-vs.-recall curve for the benchmark • For each query, take the maximum precision obtained for the query at any recall greater than or equal to ρ • Average them together over all queries • Others, like measures of authority, prestige etc.
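Both summary measures can be sketched the same way for a single query (illustrative code; a benchmark would average these values over all queries in Q):

```python
def average_precision(relevance, total_relevant):
    """Sum of precision at each relevant rank, divided by the number of relevant docs."""
    hits, score = 0, 0.0
    for i, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            score += hits / i          # precision at this relevant rank
    return score / total_relevant

def interpolated_precision(relevance, total_relevant, rho):
    """Maximum precision at any rank whose recall is >= rho (single query)."""
    best, hits = 0.0, 0
    for i, r in enumerate(relevance, start=1):
        hits += r
        if hits / total_relevant >= rho:
            best = max(best, hits / i)
    return best

rel = [1, 0, 1, 1, 0]
print(average_precision(rel, 3))            # (1/1 + 2/3 + 3/4) / 3 ~= 0.806
print(interpolated_precision(rel, 3, 0.5))  # 0.75
```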

  23. Precision-Recall tradeoff • Interpolated precision cannot increase with recall • Interpolated precision at recall level 0 may be less than 1 • At level k = 0 • Precision (by convention) = 1, Recall = 0 • Inspecting more documents • Can increase recall • Precision may decrease • We will start encountering more and more irrelevant documents • A search engine with a good ranking function will generally show a negative relation between recall and precision • The higher the curve, the better the engine

  24. Precision and interpolated precision plotted against recall for the given relevance vector. Missing entries of the relevance vector are zeroes.

  25. The vector space model • Documents represented as vectors in a multi-dimensional Euclidean space • Each axis = a term (token) • Coordinate of document d in direction of term t determined by: • Term frequency TF(d,t) • number of times term t occurs in document d, scaled in a variety of ways to normalize document length • Inverse document frequency IDF(t) • to scale down the coordinates of terms that occur in many documents

  26. Term frequency • TF(d,t) based on n(d,t), the number of times term t occurs in document d • Cornell SMART system uses a smoothed version: TF(d,t) = 0 if n(d,t) = 0, and 1 + log(1 + log n(d,t)) otherwise

  27. Inverse document frequency • Given • D is the document collection and Dt is the set of documents containing t • Formulae • Mostly dampened functions of |D|/|Dt| • SMART uses IDF(t) = log(1 + |D|/|Dt|)

  28. Vector space model • Coordinate of document d in axis t • Transformed to dt = TF(d,t) · IDF(t) in the TFIDF-space • Query q • Interpreted as a document • Transformed to a vector in the same TFIDF-space as d
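Putting the TF and IDF formulas above together, a document's TFIDF-space vector can be sketched as follows (illustrative code using the SMART-style formulas reconstructed above; the term counts and document frequencies are made up):

```python
import math

def tf(n_dt):
    """SMART-style damped term frequency: 0 if absent, else 1 + ln(1 + ln n(d,t))."""
    return 0.0 if n_dt == 0 else 1.0 + math.log(1.0 + math.log(n_dt))

def idf(num_docs, num_docs_with_t):
    """Damped inverse document frequency: log(1 + |D| / |D_t|)."""
    return math.log(1.0 + num_docs / num_docs_with_t)

def tfidf_vector(term_counts, doc_freq, num_docs):
    """term_counts: {term: n(d,t)} for one document; doc_freq: {term: |D_t|}."""
    return {t: tf(n) * idf(num_docs, doc_freq[t]) for t, n in term_counts.items()}

vec = tfidf_vector({"java": 3, "island": 1}, {"java": 50, "island": 5}, num_docs=1000)
print(vec)   # the rarer term "island" receives a larger IDF boost than "java"
```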

  29. Measures of proximity • Distance measure • Magnitude of the vector difference, |d − q| • Document vectors must be normalized to unit (L1 or L2) length • Else shorter documents dominate (since queries are short) • Cosine similarity • Cosine of the angle between d and q • Shorter documents are penalized

  30. Relevance feedback • Web query is often short: 2 words • Incomplete or ambiguous • Users learn how to modify queries • Response list must have at least some relevant documents • Relevance feedback • ‘Correcting’ the ranks to the user's taste • Automates the query refinement process • Rocchio's method (sketched below) • Folding user feedback into the query vector • Add a weighted sum of the vectors for relevant documents D+ • Subtract a weighted sum of the vectors for irrelevant documents D− • q' = αq + β Σ_{d∈D+} d − γ Σ_{d∈D−} d
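A sketch of the Rocchio update with vectors stored as term-to-weight dictionaries; the α, β, γ values below are common illustrative defaults, not prescribed by the slides:

```python
from collections import defaultdict

def rocchio(query_vec, relevant_vecs, irrelevant_vecs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*sum(D+) - gamma*sum(D-); vectors are {term: weight} dicts."""
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for d in relevant_vecs:
        for t, w in d.items():
            new_q[t] += beta * w
    for d in irrelevant_vecs:
        for t, w in d.items():
            new_q[t] -= gamma * w
    return {t: w for t, w in new_q.items() if w > 0}   # drop non-positive weights

q  = {"java": 1.0}
dp = [{"java": 0.5, "api": 0.8}]   # documents the user marked relevant
dm = [{"coffee": 0.9}]             # documents the user marked irrelevant
print(rocchio(q, dp, dm))          # roughly {'java': 1.375, 'api': 0.6}; 'coffee' dropped
```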

  31. Relevance feedback (contd.) • Pseudo-relevance feedback • D+ and D− generated automatically • E.g.: Cornell SMART system • Top 10 documents reported by the first round of query execution are included in D+ • γ typically set to 0; D− not used • Not a commonly available feature • Web users want instant gratification • System complexity • Executing the second-round query is slower and expensive for major search engines

  32. Ranking by odds ratio • R: Boolean random variable which represents the relevance of document d w.r.t. query q • Rank documents by their odds ratio for relevance, Pr(R = 1 | d, q) / Pr(R = 0 | d, q) • Approximate the probability of d by the product of the probabilities of the individual terms in d

  33. Bayesian Inferencing Manual specification of mappings from terms to approximate concepts. Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.

  34. Bayesian Inferencing (contd.) • Four layers • Document layer • Representation layer • Query concept layer • Query • Each node is associated with a random Boolean variable, reflecting belief • Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on..)

  35. Bayesian Inferencing systems • Layers 2 & 3 are the same in basic vector-space IR systems • Verity's Search97 • Allows administrators and users to define hierarchies of concepts in files • Estimation of the relevance of a document d w.r.t. the query q • Set the belief of the corresponding node to 1 • Set all other document beliefs to 0 • Compute the belief of the query • Rank documents in decreasing order of the belief they induce in the query • If a node v has k parents u1, …, uk, the model specifies Pr(v = true | u1, …, uk)

  36. Other issues • Spamming • Adding popular query terms to a page unrelated to those terms • E.g.: adding “Hawaii vacation rental” to a page about “Internet gambling” • Hyperlink-based ranking has dealt such spamming a small setback • Titles, headings, meta tags and anchor-text • TFIDF framework treats all terms the same • Search engines therefore • Assign extra weightage to text occurring in titles, tags and meta-tags • Use anchor-text on pages u which link to v • Anchor-text on u offers valuable editorial judgment about v as well.

  37. Other issues (contd..) • Including phrases to rank complex queries • Operators to specify word inclusions and exclusions • With operators and phrases queries/documents can no longer be treated as ordinary points in vector space • Dictionary of phrases (bigrams) • Could be cataloged manually • Could be derived from the corpus itself using statistical techniques • Two separate indices: • one for single terms and another for phrases

  38. Corpus-derived phrase dictionary • Two terms t1 and t2 • Null hypothesis = occurrences of t1 and t2 are independent • To the extent the pair violates the null hypothesis, it is likely to be a phrase • Measure the violation with the likelihood ratio of the hypothesis • Pick phrases that violate the null hypothesis with large confidence • Contingency table built from corpus statistics: cells k00(t1,t2), k01(t1,t2), k10(t1,t2), k11(t1,t2) count the contexts with neither term, only t2, only t1, and both terms

  39. Corpus-derived phrase dictionary • Hypotheses • Null hypothesis: a single rate p governs both cases (Bernoulli trials / binomial distribution) • Alternative hypothesis: two different rates p1 and p2 • Likelihood ratio λ = likelihood of the null hypothesis / likelihood of the alternative • −2 log λ is asymptotically χ²-distributed

  40. Likelihood Ratio (Dunning, 1993) • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic. • In applying the likelihood ratio test to collocation discovery, use the following two alternative explanations for the occurrence frequency of a bigram w1 w2: • H1: The occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = P(w2 | ¬w1) = p • H2: The occurrence of w2 depends on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2

  41. Likelihood Ratio • Use the MLEs for the probabilities p, p1, and p2 and assume the binomial distribution, with N = number of tokens, c1 = count of w1, c2 = count of w2, c12 = count of the bigram w1 w2 • Under H1: P(w2 | w1) = P(w2 | ¬w1) = P(w2) = c2/N • Under H2: P(w2 | w1) = c12/c1 = p1, P(w2 | ¬w1) = (c2 − c12)/(N − c1) = p2 • Under H1: b(c12; c1, p) is the likelihood that c12 of the c1 bigrams starting with w1 are w1 w2, and b(c2 − c12; N − c1, p) the likelihood that c2 − c12 of the remaining N − c1 bigrams are ¬w1 w2 • Under H2: b(c12; c1, p1) and b(c2 − c12; N − c1, p2) are the corresponding likelihoods

  42. Likelihood Ratio • The likelihood of H1 (likelihood of independence): L(H1) = b(c12; c1, p) · b(c2 − c12; N − c1, p) • The likelihood of H2 (likelihood of dependence): L(H2) = b(c12; c1, p1) · b(c2 − c12; N − c1, p2) • The log likelihood ratio is log λ = log(L(H1) / L(H2)) = log b(c12; c1, p) + log b(c2 − c12; N − c1, p) − [log b(c12; c1, p1) + log b(c2 − c12; N − c1, p2)] • The quantity −2 log λ is asymptotically χ²-distributed.
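The whole test fits in a few lines once the counts c1, c2, c12 and N are available (a sketch; the binomial coefficient is dropped because it cancels in the ratio, and the example counts are made up):

```python
import math

def log_b(k, n, p):
    """Log of the binomial likelihood b(k; n, p), ignoring the constant C(n, k),
    which cancels in the ratio.  Guard against p hitting 0 or 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def bigram_llr(c1, c2, c12, n):
    """-2 log lambda for the bigram w1 w2; larger values are more phrase-like."""
    p  = c2 / n                      # H1: P(w2 | w1) = P(w2 | not w1)
    p1 = c12 / c1                    # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)       # H2: P(w2 | not w1)
    log_lambda = (log_b(c12, c1, p) + log_b(c2 - c12, n - c1, p)
                  - log_b(c12, c1, p1) - log_b(c2 - c12, n - c1, p2))
    return -2.0 * log_lambda

# Hypothetical counts for a candidate bigram in a 100,000-token corpus.
print(bigram_llr(c1=300, c2=80, c12=60, n=100_000))   # large value => likely a phrase
```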

  43. Approximate string matching • Non-uniformity of word spellings • Dialects of English • Transliteration from other languages • Two ways to reduce this problem • Aggressive conflation mechanism to collapse variant spellings into the same token • Decompose terms into a sequence of q-grams, i.e., sequences of q characters

  44. Approximate string matching • Aggressive conflation mechanism to collapse variant spellings into the same token • E.g.: Soundex: takes phonetics and pronunciation details into account • Used with great success in indexing and searching last names in census and telephone directory data • Decompose terms into a sequence of q-grams, i.e., sequences of q characters • Check for similarity in the q-grams • Looking up the inverted index becomes a two-stage affair: • A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms • These terms are then submitted to the regular index • Used by Google for spelling correction • Idea also adopted for eliminating near-duplicate pages
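A toy version of the first stage of that two-stage lookup: decompose terms into q-grams and expand a possibly misspelled query term against the vocabulary (illustrative code; the padding character, the q-gram overlap threshold, and the sample vocabulary are arbitrary choices):

```python
def qgrams(term, q=3):
    """Pad the term and slide a window of q characters over it."""
    padded = f"${term}$"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def expand_term(query_term, vocabulary, q=3, threshold=0.4):
    """Return vocabulary terms whose q-gram sets overlap enough with the query
    term (Jaccard overlap on q-grams); these are then sent to the regular index."""
    qt = qgrams(query_term, q)
    matches = []
    for term in vocabulary:
        tg = qgrams(term, q)
        if len(qt & tg) / len(qt | tg) >= threshold:
            matches.append(term)
    return matches

vocab = {"university", "universal", "diversity"}
print(expand_term("univercity", vocab))   # the misspelling matches only 'university'
```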

  45. Meta-search systems • Take the search engine to the document • Forward queries to many geographically distributed repositories • Each has its own search service • Consolidate their responses. • Advantages • Perform non-trivial query rewriting • Suit a single user query to many search engines with different query syntax • Surprisingly small overlap between crawls • Consolidating responses • Function goes beyond just eliminating duplicates • Search services do not provide standard ranks which can be combined meaningfully

  46. Excellent work in homework #2 • Below are links to five students whose HW2 submissions stood out: • CS Master's Year 1, 李碩彰, P76944241 • http://140.116.245.198/metasearch/meta.html • CS Master's Year 1, 黃偉鈞, P76941104 • http://140.116.247.187:8080/honki/wwwF2.jsp • CS Master's Year 1, 林冠甫, P76944461 • http://140.116.245.199/webresourse/index.php • CS Master's Year 1, 陳建成 • http://140.116.245.233/web/ • IM Master's Year 2, 鄭凱任, R76931101 • http://140.116.96.175:98/metaSearch.html

  47. Beyond Google?

  48. Similarity search • Cluster hypothesis • Documents similar to relevant documents are also likely to be relevant • Handling “find similar” queries • Replication or duplication of pages • Mirroring of sites

  49. Document similarity • Jaccard coefficient of similarity between documents d1 and d2: r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)| • T(d) = set of tokens in document d • Symmetric and reflexive, but not a metric • Forgives any number of occurrences and any permutations of the terms • The distance 1 − r'(d1, d2) is a metric
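The Jaccard coefficient itself is a one-liner on token sets (a sketch; the sample documents are invented):

```python
def jaccard(doc1_tokens, doc2_tokens):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)| on token sets."""
    t1, t2 = set(doc1_tokens), set(doc2_tokens)
    return len(t1 & t2) / len(t1 | t2)

t1 = "java beans api".split()
t2 = "java island api".split()
print(jaccard(t1, t2))        # 2 shared tokens / 4 total = 0.5
print(1 - jaccard(t1, t2))    # the distance 1 - r' is a metric
```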

  50. Estimating the Jaccard coefficient with random permutations • Generate a set of m random permutations π of the token universe • for each permutation π do • compute min π(T(d1)) and min π(T(d2)) • check whether min π(T(d1)) = min π(T(d2)) • end for • if equality was observed in k of the m cases, estimate r'(d1, d2) = k/m
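A sketch of this estimator, simulating the m random permutations by shuffling the combined vocabulary (restricting the permutation to T(d1) ∪ T(d2) leaves the agreement probability unchanged, since tokens outside both sets can never attain the minimum):

```python
import random

def minhash_estimate(t1, t2, m=200, seed=0):
    """Estimate the Jaccard coefficient of token sets t1, t2 using m random
    permutations, simulated by shuffling the shared vocabulary."""
    rng = random.Random(seed)
    vocab = list(t1 | t2)
    agree = 0
    for _ in range(m):
        rng.shuffle(vocab)                       # one random permutation pi
        rank = {tok: i for i, tok in enumerate(vocab)}
        if min(rank[x] for x in t1) == min(rank[x] for x in t2):
            agree += 1                           # min pi(T(d1)) == min pi(T(d2))
    return agree / m

t1 = set("java beans api island".split())
t2 = set("java island api coffee".split())
print(minhash_estimate(t1, t2))   # should land near the true Jaccard value 3/5 = 0.6
```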
