Conducting a Web Search: Problems & Algorithms

Conducting a Web Search: Problems & Algorithms Anna Rumshisky

Motivation • Web as a distributed knowledge base • Utilization of this knowledge base may be improved through efficient search utilities

Basic Functionality of a Web Search Engine (I) • main functionality components • crawler module, page repository, query engine • algorithmically interesting (“behind the scenes”) components • crawler control, indexer/collection analysis and ranking modules

Basic Functionality of a Web Search Engine (II) • Crawler module (many running in parallel) • takes an initial set of URLs • retrieves and caches the pages • extracts URLs from the retrieved pages • Page repository • the cached collection is stored and used to create indexes as new pages are retrieved and processed • Query engine • processes a query

Basic Functionality of a Web Search Engine (III) • Indexer module • builds a content index (“inverted” index) • for each term, a sorted list of locations where it appears • where “location” is a tuple: • document URL • offset within the document • weight of a given occurrence (e.g. occurrences in headings and titles may be assigned higher weights) • builds a link index • in-links and out-links for each URL stored in an adjacency matrix
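The content index described above can be sketched in a few lines. This is only an illustration: the input format (a list of `(term, in_title)` tokens per URL), the function name `build_inverted_index`, and the weights 2.0 and 1.0 are assumptions made for the example, not values from the talk.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages maps a URL to its tokens, given as (term, in_title) pairs in document order."""
    index = defaultdict(list)
    for url, tokens in pages.items():
        for offset, (term, in_title) in enumerate(tokens):
            # Assumed weighting: title occurrences count double.
            weight = 2.0 if in_title else 1.0
            index[term].append((url, offset, weight))
    for postings in index.values():
        postings.sort()  # sorted list of locations for each term
    return dict(index)

# Hypothetical two-page collection.
pages = {
    "http://b.example/": [("web", True), ("search", False)],
    "http://a.example/": [("search", True), ("engine", True), ("search", False)],
}
index = build_inverted_index(pages)
```

Looking up `index["search"]` returns every (URL, offset, weight) location of the term, ordered by URL and then by offset.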

Basic Functionality of a Web Search Engine (IV) • Collection analysis • creates specialized utility indexes • e.g. indexes based on global page ranking algorithms • maintains a lexicon of terms and term-level statistics • e.g. total number of documents in which a term occurs, etc. • Ranking module • assigns rank to each document relative to a specific query, using utility indexes created by the collection analysis module

Basic Functionality of a Web Search Engine (V) • Crawler control module • prioritizes retrieved URLs to determine the order in which they should be visited by the crawler • uses utility indexes created by the collection analysis module • may use historical data on queries received by the query engine • may use heuristics-based URL analysis (e.g. prefer URLs with fewer slashes) • may use anchor text analysis and location • may use global ranking based on link structure analysis, etc.
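One of the heuristics above, preferring URLs with fewer slashes, can be sketched as a priority queue over the crawl frontier. The class name `Frontier` and the helper `slash_depth` are hypothetical; a real crawler control module would combine several signals rather than this one alone.

```python
import heapq

def slash_depth(url):
    """Count slashes after the scheme, as a rough proxy for how deep a URL is."""
    path = url.split("://", 1)[-1]
    return path.count("/")

class Frontier:
    """Queue of URLs to visit, shallowest URL first (ties broken by insertion order)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0

    def add(self, url):
        if url in self._seen:
            return  # never schedule the same URL twice
        self._seen.add(url)
        heapq.heappush(self._heap, (slash_depth(url), self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

frontier = Frontier()
for u in ["http://example.com/a/b/c.html", "http://example.com/", "http://example.com/a/"]:
    frontier.add(u)
order = [frontier.pop() for _ in range(3)]
```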

Some of the Issues • Refresh strategy - metrics determining the refresh strategy: • Freshness of the collection • % of the pages that are up to date • Age of the collection • average age of cached pages; age of a local copy of a single page is e.g. the time elapsed since it was last current • choosing to visit only the more frequently updated pages will lead to the age of collection growing indefinitely • Scalability of all techniques
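The two refresh metrics can be computed as below. The sketch assumes each cached page carries a hypothetical field `stale_since`: `None` while the local copy is up to date, otherwise the time at which the live page changed and the copy stopped being current.

```python
def freshness(pages):
    """Fraction of cached pages that are still up to date."""
    return sum(1 for p in pages if p["stale_since"] is None) / len(pages)

def collection_age(pages, now):
    """Average age of the cached pages; an up-to-date copy has age 0."""
    ages = [0.0 if p["stale_since"] is None else now - p["stale_since"] for p in pages]
    return sum(ages) / len(ages)

# Hypothetical collection observed at time now = 10.
cache = [
    {"stale_since": None},
    {"stale_since": 4.0},
    {"stale_since": None},
    {"stale_since": 8.0},
]
f = freshness(cache)            # 2 of 4 pages are current
a = collection_age(cache, 10.0) # ages are 0, 6, 0, 2
```

A page that is never revisited keeps accumulating age, which is why a policy that only refreshes frequently updated pages lets the collection age grow without bound.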

Remainder of the Talk - Outline • Structure of the Web Graph • the actual structure of the web graph should influence crawling and caching strategies • design of link-based ranking algorithms • Relevance Ranking Algorithms • used by query engine and by crawler control modules • link-based algorithms, such as PageRank, HITS • content-based algorithms, such as TF*IDF, Latent Semantic Indexing

Modeling the Web Graph (I) • Web as a directed graph • each page is a vertex, each hypertext link is an edge • may or may not consider each link “weighted” based on • where the anchor text occurs • how many times a given link occurs on a page, etc. • “Bow-tie” link structure • 28% of pages constitute a strongly connected core • 44% fall onto the sides of the bow-tie that can be reached from the core, but not vice versa • 22% can reach the core, but cannot be reached from it

Modeling the Web Graph (II) • Web as a probabilistic graph • web graph is constantly modified, both nodes and edges are added and removed • Traditional (Erdős-Rényi) random graph model: G(n,p) • n nodes, p is the probability that a given edge exists • number of in-links follows some standard distribution, e.g. binomial: • Prob(in-degree of a node = k) = $\binom{n-1}{k} p^k (1-p)^{n-1-k}$ • used to model sparse random graphs

Modeling the Web Graph (III) • Web graph properties (empirically) • evolving nature of web graph • disjoint bipartite cliques • two subsets, with i nodes and j nodes; each node in the first subset connected to each node in the second (total of ij edges) • distribution of in- and out-degrees follows a power law • Prob(in-degree of a node = k) $\propto 1/k^{\beta}$ • experimentally, $\beta \approx 2.1$

“Evolving” Graph Models (Kumar et al.) • Graph models with stochastic copying • new vertices and new edges added to the graph at discrete intervals of time • allow dependencies between edges • some vertices choose outgoing edges at random, independently • others replicate existing linkage patterns by copying edges from a randomly chosen vertex • two “evolving” graph models • linear growth model (a constant number of nodes added at each interval) • exponential growth (current number of vertices multiplied by a constant factor)

“Evolving” Graph Models (II) • Defining $G_t = (V_t, E_t)$ - state of the graph at time t • $f_v(V_t, t)$ - returns # of vertices added to graph at time t+1 • $f_e(G_t, t)$ - returns the set of edges added to graph at time t+1 • $|V_{t+1}| = |V_t| + f_v(V_t, t)$ • $E_{t+1} = E_t \cup f_e(G_t, t)$ • Edge selection • new edges may lead from new vertices to old vertices, or be added between old vertices • origin and destination for each edge are chosen randomly • destination selection method (replicated or random destination) is chosen randomly
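A simplified linear-growth version of the copying process can be sketched as follows. It is a sketch under stated assumptions, not the exact model of Kumar et al.: one vertex is added per step, every new vertex gets exactly `d` out-edges, edges only lead from the new vertex to older ones, and the names `copying_model`, `alpha` and `in_degrees` are introduced here for illustration.

```python
import random

def copying_model(steps, d=3, alpha=0.5, seed=0):
    """Grow a graph by one vertex per step; returns {vertex: list of d destinations}."""
    rng = random.Random(seed)
    # Assumed seed graph: d + 1 vertices, each linking to the d others.
    out = {v: [(v + i) % (d + 1) for i in range(1, d + 1)] for v in range(d + 1)}
    for _ in range(steps):
        new = len(out)
        prototype = rng.randrange(new)  # existing vertex whose links may be copied
        links = []
        for i in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(new))   # random destination
            else:
                links.append(out[prototype][i])    # replicated destination
        out[new] = links
    return out

def in_degrees(out):
    deg = {v: 0 for v in out}
    for links in out.values():
        for w in links:
            deg[w] += 1
    return deg

graph = copying_model(200)
degrees = in_degrees(graph)
```

Because copied links favor vertices that already have many in-links, a histogram of `degrees.values()` from a long run is the natural place to look for the heavy-tailed in-degree distribution the models are claimed to produce.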

“Evolving” Graph Models (III) • Evaluation • Very rough assumptions • constant number of edges assumed for each added vertex • no deletion, etc. • creating a new site generates a lot of links between new nodes in that site; since no edges are added between the new nodes, their assumptions appear to collapse each new site into a single vertex • Claim: these models show the desired properties: • the power law distribution for in- and out-degrees • the presence of directed bipartite cliques

Link-Based Relevance Ranking: PageRank (I) Goal: Assign global rank to each page on the Web Basic idea: • Each page’s rank is a sum, over all pages that point to it (=referrers), of rank of each referrer, divided by the out-degree of that referrer

Link-Based Relevance Ranking: PageRank (II) Description: • For the total of n web pages, the goal is to obtain a rank vector $r = \langle r_1, r_2, \ldots, r_n \rangle$ where $r_i = \sum_{j \in \mathrm{referrers}(i)} r_j / \mathrm{outdegree}(j)$ • Consider matrix $A_{n \times n}$, with the elements • $a_{i,j} = 1/\mathrm{outdegree}(i)$ if page i points to page j • $a_{i,j} = 0$ otherwise • $a_{i,j}$ is the share of i's rank contributed to j • By our definition of the rank vector, we must then have $r = A^T r$ • r is the eigenvector of matrix $A^T$ corresponding to the eigenvalue 1

Link-Based Relevance Ranking: PageRank (III) Mathematical apparatus • If the graph is strongly connected (every node reachable from every node), the eigenvector r for the eigenvalue 1 of such an “adjacency” matrix is uniquely defined (up to scaling) • The principal eigenvector of a matrix (here, the one corresponding to the eigenvalue 1) can be computed using the power iteration method • initialize a vector s with random values • apply the linear transformation repeatedly, until the result converges to the principal eigenvector: • $r = A^T s$ • $r = r / \|r\|$, where $\|\cdot\|$ is vector length (normalization) • stop if $\|r - s\| < \epsilon$; otherwise set s = r and repeat • essentially, $r = A^T(A^T \cdots (A^T s))$, with normalization after each step
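The power iteration steps above can be sketched in plain Python. The function name and the dictionary-of-out-links input are assumptions made for the example; the start vector is uniform rather than random so that the run is reproducible, and every page is assumed to have at least one out-link.

```python
import math

def pagerank_power_iteration(out_links, eps=1e-10, max_iter=1000):
    """out_links maps each page to the list of pages it points to (none may be empty)."""
    nodes = sorted(out_links)
    start = 1.0 / math.sqrt(len(nodes))
    s = {v: start for v in nodes}            # unit-length start vector
    r = s
    for _ in range(max_iter):
        # r = A^T s: each page passes an equal share of its rank along every out-link.
        r = {v: 0.0 for v in nodes}
        for i, targets in out_links.items():
            share = s[i] / len(targets)
            for j in targets:
                r[j] += share
        # Normalize r to unit length.
        length = math.sqrt(sum(x * x for x in r.values()))
        r = {v: x / length for v, x in r.items()}
        # Stop condition: ||r - s|| < eps.
        if math.sqrt(sum((r[v] - s[v]) ** 2 for v in nodes)) < eps:
            break
        s = r
    return r

# Small strongly connected, aperiodic graph (cycles of length 2 and 3).
web = {0: [1, 2], 1: [2], 2: [0]}
rank = pagerank_power_iteration(web)
```

In this graph page 1 receives only half of page 0's rank, so its score settles at half the score of pages 0 and 2.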

Link-Based Relevance Ranking: PageRank (IV) Mathematical apparatus (cont’d) • PageRank vector r, as defined above, is proportional to the stationary probability distribution of the random walk on a graph • traverse the graph choosing at random which link to follow at each vertex • The power iteration method is guaranteed to converge only if the graph is aperiodic (i.e. the greatest common divisor of the lengths of all its cycles is 1) • The speed of convergence of the power iteration depends on the eigenvalue gap (difference between the magnitudes of the two largest eigenvalues)

Link-Based Relevance Ranking: PageRank (V) Practical application of the algorithm • The actual algorithm is merely applying the power iteration method to the matrix $A^T$ to obtain r • In practice, there are problems with the assumptions needed for this algorithm to work • the Web graph is NOT guaranteed to be aperiodic, so the stationary distribution might not be reached • the Web is NOT strongly connected; there are pages with no outward links at all • slight modifications to the formula for $r_i$ take care of that • We don’t really need the actual rank of each page, we just need to sort the pages correctly
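The talk does not spell out which modifications are meant. One commonly used variant, shown here purely as an assumption, adds a damping factor `d` (the random surfer follows a link with probability `d` and jumps to a random page otherwise) and spreads the rank of pages without out-links uniformly over all pages. The function name and the example graph are hypothetical.

```python
def damped_pagerank(out_links, d=0.85, eps=1e-12, max_iter=1000):
    """out_links maps each page to its list of out-links (the list may be empty)."""
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by pages with no out-links is redistributed to everyone.
        dangling = sum(r[v] for v in nodes if not out_links[v])
        nxt = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        for i in nodes:
            for j in out_links[i]:
                nxt[j] += d * r[i] / len(out_links[i])
        delta = sum(abs(nxt[v] - r[v]) for v in nodes)
        r = nxt
        if delta < eps:
            break
    return r

# A periodic 2-cycle (a <-> b), a page nobody links to (c), and a dangling page (d).
web = {"a": ["b"], "b": ["a"], "c": ["a"], "d": []}
rank = damped_pagerank(web)
ordering = sorted(rank, key=rank.get, reverse=True)
```

Since only the ordering matters for ranking, the final sort is usually all that is consumed downstream.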

Link-Based Relevance Ranking: HITS (Hypertext-Induced Topic Search) Goal: Rank all pages with respect to a given query (obtain both hub and authority score for each page) Motivation: • Two types of web pages: hubs (pages with large collections of links: web directories, link lists, etc.) and authorities (pages well referred to by other pages) • Each page gets two scores, a hub score and an authority score • A good authority has a high in-degree, and is pointed to by many good hubs • A good hub has a high out-degree and points to many good authorities • Consider the sites of Toyota and Honda: though they will not point to each other, good hubs would point to both

Link-Based Relevance Ranking: HITS (II) Basic idea: • An authority score of a page is obtained by summing up the hub scores of pages that point to it • A hub score of a page is obtained by summing up the authority scores of pages it points to • At query time, a small subgraph of the Web graph is identified, and a link analysis is run on it

Link-Based Relevance Ranking: HITS (III) Description: • Selecting a limited subset of pages: • The query string determines the initial root set of pages • up to t pages containing the same terms as the query string • Root set is expanded • by all pages linked from the root set • by up to d pages pointing to each page in the root set • this limit prevents an over-popular page in the root set, to which everybody points, from forcing you to add a large portion of the Web graph to your set

Link-Based Relevance Ranking: HITS (IV) Description (cont’d): • Link analysis: • here we wish to obtain two rank vectors a and h: $a = \langle a_1, a_2, \ldots, a_n \rangle$ and $h = \langle h_1, h_2, \ldots, h_n \rangle$ • we obtain them using the following iteration method: • initialize both vectors to random values • for each page i, set the authority score $a_i$ equal to the sum of hub scores of all pages within the subgraph that refer to it (=referrers(i)) • for each page i, set the hub score $h_i$ equal to the sum of authority scores of all pages within the subgraph that it points to (=referred(i)) • normalize the resulting vectors a and h to have length 1 (divide each $a_i$ by $\|a\|$ and each $h_i$ by $\|h\|$)

Link-Based Relevance Ranking: HITS (V) • Link analysis (cont’d): • consider the adjacency matrix A for our focused subgraph • $h = A a$ • $a = A^T h$ • Normalize: $h_i = h_i/\|h\|$, $a_i = a_i/\|a\|$ • $h_{norm} = c_1 A A^T h_{norm}$ • $a_{norm} = c_2 A^T A a_{norm}$ • thus, vectors h and a are the principal eigenvectors of matrices $A A^T$ and $A^T A$, respectively, and we are essentially using the power iteration method (which, as we know, will converge)
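The hub/authority iteration can be sketched as follows, using the Toyota and Honda example from earlier in the talk. The page names, the function name `hits`, the fixed iteration count, and the uniform (rather than random) initialization are assumptions made for the illustration; the subgraph is assumed to contain at least one link.

```python
import math

def hits(out_links, iterations=50):
    """out_links maps each page in the focused subgraph to the pages it points to."""
    nodes = sorted(set(out_links) | {w for ts in out_links.values() for w in ts})
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the referrers (a = A^T h).
        a = {v: 0.0 for v in nodes}
        for i in nodes:
            for j in out_links.get(i, []):
                a[j] += h[i]
        # Hub score: sum of authority scores of the pages pointed to (h = A a).
        h = {i: sum(a[j] for j in out_links.get(i, [])) for i in nodes}
        # Normalize both vectors to length 1.
        na = math.sqrt(sum(x * x for x in a.values()))
        nh = math.sqrt(sum(x * x for x in h.values()))
        a = {v: x / na for v, x in a.items()}
        h = {v: x / nh for v, x in h.items()}
    return a, h

# Hypothetical focused subgraph: two link lists and one blog.
subgraph = {
    "hub1": ["toyota", "honda"],
    "hub2": ["toyota", "honda"],
    "blog": ["toyota"],
}
authority, hub = hits(subgraph)
```

The two manufacturer pages never link to each other, yet both end up with high authority scores because the same hubs point to them.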

Content-Based Relevance Ranking: TF*IDF (I) Goal: Rank all pages with respect to a given query (compute similarity between the query and each document) Background: This is a traditional IR technique used on collections of documents since the 1960s, originally proposed by Gerard Salton Basic idea: • Using vector-space model to represent documents • Compute the similarity score using a distance metric

Content-Based Relevance Ranking: TF*IDF (II) Description • Each document is represented as a vector <w1, w2 , ..., wk> • k is the total number of terms (lemmas) in a document collection • wi is the weight of the ith term; depends on the number of occurrences of this term in this document • A query is thought of as just another document • There are different schemes for computing term weights • the choice of a particular scheme is usually empirically motivated • TF * IDF is the most common one

Content-Based Relevance Ranking: TF*IDF (III) Description (cont’d) • TF*IDF weighting scheme: • $w_i = tf_i \cdot idf_i$ (term frequency times inverse document frequency) • $tf_i$ = # of times the term i occurs in a document • $idf_i = \log(N / n_i)$, where • N is the total # of documents in the collection • $n_i$ is the number of documents in which term i occurs • since N is usually large, $N / n_i$ is “squashed” with log • $1 \le N / n_i \le N$ • $0 \le \log(N / n_i) \le \log N$ • the lowest idf, $\log 1 = 0$, is assigned to terms that occur in all documents
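A minimal sketch of the weighting scheme, assuming documents are already tokenized into lists of terms. The function name `tfidf_vectors`, the sparse dictionary representation, and the three-document collection are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one sparse {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which term i occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / n_i) for term, n_i in doc_freq.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors, idf

docs = [
    ["web", "search", "web"],
    ["web", "graph"],
    ["web", "ranking", "search"],
]
vectors, idf = tfidf_vectors(docs)
```

Here "web" occurs in every document, so its idf is log 1 = 0 and it contributes nothing to any vector, however often it appears.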

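Content-Based Relevance Ranking: TF*IDF (IV) Description (cont’d) • Distance metric • cosine of the angle between the two vectors • obtained using scalar product: $\cos(d, q) = \frac{d \cdot q}{\|d\|\,\|q\|} = \frac{\sum_i w_{i,d}\, w_{i,q}}{\sqrt{\sum_i w_{i,d}^2}\,\sqrt{\sum_i w_{i,q}^2}}$ • this similarity score is independent of the size of each document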
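The cosine score can be computed directly on sparse term-weight dictionaries. The function name and the toy vectors are assumptions for the example; returning 0.0 for an all-zero vector is a convention chosen here, not something the talk specifies.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)

short_doc = {"web": 1.0, "search": 2.0}
long_doc = {"web": 10.0, "search": 20.0}   # same proportions, ten times longer
same_direction = cosine(short_doc, long_doc)
no_overlap = cosine({"web": 1.0}, {"graph": 1.0})
```

The first comparison illustrates the length independence noted above: scaling every weight by the same factor leaves the score unchanged.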

Content-Based Relevance Ranking: TF*IDF (V) Practical application of the algorithm • Since the query is frequently very short (2.3 words on average), raw term frequency is not useful • Augmented TF*IDF is used for weighting query terms: • $w_i = [0.5 + (0.5 \cdot tf_i / \max tf)] \cdot idf_i$ • max tf is the frequency of the most frequent term • for terms not found in the query, the weight would be $0.5 \cdot idf_i$ • for most terms found in the query, the weight would be $1 \cdot idf_i$
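The augmented weighting formula translates directly into code. The function name and the idf values below are hypothetical; in practice the idf table would come from the collection statistics.

```python
from collections import Counter

def augmented_query_weights(query_terms, idf):
    """Weight every lexicon term for a query: w_i = (0.5 + 0.5 * tf_i / max_tf) * idf_i."""
    tf = Counter(query_terms)
    max_tf = max(tf.values()) if tf else 1
    return {term: (0.5 + 0.5 * tf.get(term, 0) / max_tf) * idf_i
            for term, idf_i in idf.items()}

# Assumed idf values for a three-term lexicon.
idf = {"web": 0.1, "graph": 1.2, "ranking": 2.0}
weights = augmented_query_weights(["graph", "ranking"], idf)
```

Both query terms occur once, so each gets the full idf, while "web", absent from the query, gets half of its idf.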

Content-Based Relevance Ranking: Latent Semantic Indexing Goal: Rank all documents with respect to a given query Background: This is also a technique developed for traditional IR with a static document collection (introduced in 1990) Basic idea: • Construct a term x document matrix, using a vector representation similar to the one used in TF*IDF • Matrix $A_{m \times n}$ (m terms, n documents), of rank r • $A_{m \times n}$ is typically sparse • Using SVD (Singular Value Decomposition), obtain a rank-k approximation to A • Matrix $A_k$ (m terms, n documents) • similar to the least squares method of fitting a line to a set of points

Content-Based Relevance Ranking: LSI (II) Mathematical apparatus: • rank(A) • number of linearly independent columns in $A_{m \times n}: R^n \to R^m$ • a linear transform of rank r maps the basis vectors of the pre-image into r linearly independent basis vectors of the image • Singular Value Decomposition • Matrix A can be represented as • $A = U S V^T$, where the columns of U and V are the left and right singular vectors of A (eigenvectors of $A A^T$ and $A^T A$, respectively), U and V are orthogonal ($U U^T = I$, $V V^T = I$), and S is a diagonal matrix: $S = \mathrm{diag}(s_1, \ldots, s_n)$, where the nonzero $s_i$ are the square roots of the r nonzero eigenvalues of $A A^T$, and $s_1 \ge s_2 \ge \cdots \ge s_r > s_{r+1} = \cdots = s_n = 0$

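Content-Based Relevance Ranking: LSI (III) Mathematical apparatus (cont’d): • $A_k = U_k S_k V_k^T$, keeping only the k largest singular values, with $U_k$ of size m x k, $S_k$ of size k x k, $V_k^T$ of size k x n • among all matrices of rank k, $A_k$ minimizes $\|A - A_k\|_F^2$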
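The size of the approximation error follows directly from the decomposition. Writing $u_i$ and $v_i$ for the i-th columns of U and V and $s_i$ for the singular values (standard notation, added here for the worked step), the truncated matrix and its squared Frobenius error are:

```latex
A_k = U_k S_k V_k^{T} = \sum_{i=1}^{k} s_i \, u_i v_i^{T},
\qquad
\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} s_i^2 .
```

As a made-up example, if A has singular values 5, 3 and 1 and k = 2, the squared error is $1^2 = 1$ out of a total of $5^2 + 3^2 + 1^2 = 35$, so the rank-2 approximation retains 34/35 of the squared norm of A.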

Content-Based Relevance Ranking: LSI (IV) Practical application of the algorithm • The SVD computation on the term x document matrix is performed in advance, not at query processing time • Each document is represented as a column in the $A_k$ matrix • Scalar product-based metric for distances between document vectors is used • Query vector $A_q = \langle a_{1q}, a_{2q}, \ldots, a_{mq} \rangle$ • pseudo-document, added to $A_k$ post factum • $V_q = A_q^T U S^{-1}$ • values from $A_k^T A_k$ give scalar products of document vectors • $A_k^T A_k = V S^2 V^T$

References • Arasu, Cho, Garcia-Molina, Paepcke, Raghavan (2001). Searching the Web. • Kleinberg, J. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM. • Kumar, Raghavan, Rajagopalan, Sivakumar (2000). Stochastic Models for the Web Graph. IEEE. • Berry, Dumais & O'Brien (1994). Using Linear Algebra for Intelligent Information Retrieval. • Deerwester, Dumais, Furnas, Landauer, Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science. • Manning & Schütze (1999). Foundations of Statistical Natural Language Processing.

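Content-Based Relevance Ranking: LSI (III) • Mathematical apparatus (cont’d): • $\|A - A_k\|$ • $V_q = A_q^T U S^{-1}$ • $A = U S V^T$ • $A^T = V S^T U^T = V S U^T$ • $A^T (U^T)^{-1} = V S$ • $A^T (U^T)^{-1} S^{-1} = V$ • since U is orthogonal, $(U^T)^{-1} = U$, so $V = A^T U S^{-1}$; applying the same map to a query column $A_q$ gives $V_q = A_q^T U S^{-1}$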
