

  1. What’s HOT on the Web John Tomlin February 2003 Expanded from a talk given at INFORMS San Jose, CA November 2002

  2. Motivation There are obvious similarities between the behavior of web surfers and the behavior of road traffic, as well as obvious differences. Can we adapt some of the ideas from Traffic Theory / Transportation Analysis to study and understand traffic on the WWW?

  3. Transportation Study Steps
• Land use planning (work out the urban zones)
• Traffic Distribution* (figure out interzonal transfers)
• Traffic Assignment (map interzonal transfers onto the network)
* This is the step that is most relevant here.

  4. Traffic Distribution Model
[Figure: bipartite diagram with origin zones of trips (a_i) on one side, destination zones of trips (b_j) on the other, and the possible assignments between them]

  5. Simple Linear Programming Formulation
i = 1, ..., m origins; j = 1, ..., n destinations.
Data: a_i, the number of travelers at origin i; b_j, the number traveling to destination j; c_{ij}, the travel cost for route (i, j).
Variables: x_{ij} >= 0, the number of travelers from i to j.
Constraints:
\sum_j x_{ij} = a_i   (i = 1, ..., m)
\sum_i x_{ij} = b_j   (j = 1, ..., n)
Objective: Minimize \sum_{ij} c_{ij} x_{ij}
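As a concrete sketch (not from the talk), this LP can be set up with scipy.optimize.linprog; the a, b, c values below are invented example data:

```python
# A minimal sketch of the transportation LP with scipy; the a, b, c values
# below are invented example data, not from the talk.
import numpy as np
from scipy.optimize import linprog

a = np.array([30, 20, 50])            # a_i: travelers at origin i
b = np.array([40, 35, 25])            # b_j: travelers bound for destination j
c = np.array([[4, 6, 9],
              [5, 3, 7],              # c[i, j]: travel cost for route (i, j)
              [8, 5, 2]])
m, n = c.shape

# Equality constraints on the flattened x: row sums a_i, column sums b_j.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1    # sum_j x_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1             # sum_i x_ij = b_j
b_eq = np.concatenate([a, b])

res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(m, n)
print(np.round(x, 1))                 # a basic solution: few nonzero cells
```

The printed solution is a basic solution, so only a handful of the m x n cells are nonzero, which is exactly the weakness discussed two slides below.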

  6. Simple LP Formulation (continued)
It is often convenient to normalize the model so that the variables are probabilities: p_{ij} = x_{ij} / X, where X is the sum of all the x_{ij}. The equations then become:
Constraints:
\sum_j p_{ij} = a_i / X   (i = 1, ..., m)
\sum_i p_{ij} = b_j / X   (j = 1, ..., n)
Objective: Minimize \sum_{ij} c_{ij} p_{ij}

  7. Simple Linear Programming Formulation
Advantages:
• Simplicity
• Very fast algorithms available
• Relatively modest data requirements
Disadvantages:
• Deterministic
• Unrealistic solutions: of the (m x n) possible O-D pairings, only (m + n) will have a nonzero weight x_{ij}; most O-D pairings will never occur.
• The simple fix (placing lower bounds on the variables) doesn't work.

  8. Entropy Maximization and Information
Let x be a random variable which can take values x_1, x_2, x_3, ..., x_n with probabilities p_1, p_2, p_3, ..., p_n. These probabilities are not known. All we know are the expectations of some functions f_r(x):
E[f_r(x)] = \sum_i p_i f_r(x_i)   (r = 1, ..., m)   (1)
where of course
\sum_i p_i = 1   (2)
Jaynes: "our problem is that of finding a probability assignment that avoids bias, while agreeing with whatever information is given. The great advance provided by information theory lies in the discovery that there is a unique, unambiguous criterion for the 'amount of uncertainty' represented by a discrete probability distribution."

  9. This measure of the uncertainty of a probability distribution was given by Shannon as:
S(p_1, p_2, p_3, ..., p_n) = -K \sum_i p_i \ln p_i   (3)
and this is called the "entropy". Jaynes again: "It is now evident how to solve our problem; in making inferences on the basis of partial information we must use the probability distribution which has maximum entropy subject to whatever is known."
Using the method of Lagrange multipliers to maximize (3) subject to (1) and (2), with multipliers \lambda_r on equations (1) and \lambda_0 on (2), the unique solution can be shown to be:
p_i = \exp[-\lambda_0 - \sum_r \lambda_r f_r(x_i)]   (4)
where
Z = e^{\lambda_0} = \sum_i \exp\{-\sum_r \lambda_r f_r(x_i)\}   (5)
is the partition function of the distribution.
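To make Jaynes' construction concrete, here is a small sketch (not from the talk) of a classic toy instance: the maximum entropy distribution over a die's faces given only a mean constraint. The constraint value 4.5 and the single function f(x) = x are invented for illustration:

```python
# A toy instance of Jaynes' prescription: the maximum entropy distribution
# over a die's faces {1..6}, given only the (invented) constraint E[x] = 4.5.
# With a single function f(x) = x, (4) gives p_i = exp(-lam * x_i) / Z, and
# we solve numerically for the multiplier lam that matches the expectation.
import numpy as np
from scipy.optimize import brentq

xs = np.arange(1, 7)
target = 4.5

def excess_mean(lam):
    w = np.exp(-lam * xs)                # unnormalized weights
    p = w / w.sum()                      # Z(lam) = sum_i exp(-lam * x_i)
    return p @ xs - target

lam = brentq(excess_mean, -5.0, 5.0)     # lam < 0 here: mass shifts to high faces
p = np.exp(-lam * xs)
p /= p.sum()
print(round(lam, 4), np.round(p, 4))     # the max-entropy p with mean 4.5
```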

  10. Back to our Model
OK, so we want a maximum entropy solution for the traffic distribution problem. Switching from the p_i to p_{ij}, we modify the treatment just given so that:
p_i \to p_{ij}
f_r^a(x_i) \to f_r^a(x_{ij}) = 1 if r = i, 0 otherwise
f_r^b(x_i) \to f_r^b(x_{ij}) = 1 if r = j, 0 otherwise
Then:
p_{ij} = \exp[-\lambda_0 - \lambda_i^a - \lambda_j^b]   for (i, j) \in E
and
Z = e^{\lambda_0} = \sum_{(i,j) \in E} \exp\{-\lambda_i^a - \lambda_j^b\}

  11. Entropy Model (continued)
A purely random assignment of trips to origin-destination pairs corresponds to solving:
Maximize -\sum_{ij} x_{ij} \ln x_{ij}
Subject to:
\sum_j x_{ij} = a_i   (i = 1, ..., m)
\sum_i x_{ij} = b_j   (j = 1, ..., n)
The objective here is (only) the entropy function. We can then add the travel cost constraint:
\sum_{ij} c_{ij} x_{ij} = C
for some reasonable value of C. Suppose we assign this constraint a Lagrange multiplier 1/\mu.

  12. Entropy Model (continued)
This in turn is equivalent to solving:
Minimize \sum_{ij} c_{ij} x_{ij} + \mu \sum_{ij} x_{ij} \ln x_{ij}
Subject to:
\sum_j x_{ij} = a_i   (i = 1, ..., m)
\sum_i x_{ij} = b_j   (j = 1, ..., n)
where \mu is a constant, which can be related to the average of the c_{ij} values. Even though this is a nonlinear problem, it can be solved by a fast (iterative scaling) algorithm, and it has solutions of the form:
x_{ij} = A_i B_j \exp(-c_{ij} / \mu)
Note that because of the logarithm term, the x_{ij} values must necessarily be positive.
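A sketch (not from the talk) of the iterative scaling algorithm for this model, alternately rescaling the A_i and B_j factors until the row and column totals match; the data are the same invented values as in the LP sketch above:

```python
# A sketch of the iterative (bi-proportional) scaling algorithm for the
# gravity model x_ij = A_i * B_j * exp(-c_ij / mu), reusing the invented
# a, b, c data from the LP sketch; mu is set to the average cost.
import numpy as np

a = np.array([30., 20., 50.])
b = np.array([40., 35., 25.])
c = np.array([[4., 6., 9.],
              [5., 3., 7.],
              [8., 5., 2.]])
mu = c.mean()
K = np.exp(-c / mu)                 # fixed kernel exp(-c_ij / mu)

A = np.ones(3)
B = np.ones(3)
for _ in range(200):                # alternately enforce row / column totals
    A = a / (K @ B)
    B = b / (K.T @ A)
x = A[:, None] * K * B[None, :]
print(np.round(x, 2))               # margins match a and b
```

Unlike the LP solution, every x_ij here is strictly positive, as the logarithm term requires.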

  13. The World Wide Web
• Over 1 billion pages today
• More tomorrow – growing fast
• Roughly 10 links per page

  14. [Figure: example directed graph with nodes i, j, k, l, m, n]
Graph model of the web: G = (V, E), where V is the set of vertices (i, j, k, ...), i.e. nodes or pages, and E is the set of directed edges (i, j).

  15. PageRank
PageRank can be described in a number of ways: in terms of Markov chains, equations, or eigensystems. Intuitively, each page is assigned a rank in terms of the ranks of the pages that point to it. For simplicity, let us initially assume that G is strongly connected (i.e. there is a directed path between every pair of pages in the set V).

  16. Let l_{ij} = the probability that a surfer at page i will click through to page j. The standard PageRank computation assumes that these probabilities are uniform, i.e. if d_i = the out-degree (number of outlinks) of page i, then l_{ij} = 1/d_i for all j such that (i, j) \in E.
M = [l_{ij}] defines a Markov chain. The stochastic vector w = (w_1, ..., w_m), where w_1 + ... + w_m = 1, is a stationary state of the Markov chain if
w^T = w^T M

  17. Let matrix A = M^T, i.e. a_{ij} = 1/d_j for (j, i) \in E, where d_j is the out-degree of node (page) j. Then the ideal PageRank x_i for page i is computed from:
x_i = \sum_{(j,i) \in E} a_{ij} x_j
Note the very strong assumption that the a_{ij} are fixed, and fixed with values 1/d_j.
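As a quick numeric illustration (not from the talk), the fixed point x = Ax can be found by power iteration on a small invented strongly connected graph:

```python
# A numeric check of the ideal PageRank fixed point x = A x by power
# iteration, on a tiny invented strongly connected graph of 3 pages.
import numpy as np

edges = [(0, 1), (0, 2), (1, 0), (2, 0), (2, 1)]     # (i, j): page i links to j
n = 3
d = np.bincount([i for i, _ in edges], minlength=n)  # out-degrees d_i

A = np.zeros((n, n))
for i, j in edges:
    A[j, i] = 1.0 / d[i]            # a_ji = 1/d_i for (i, j) in E

x = np.full(n, 1.0 / n)
for _ in range(100):
    x = A @ x                       # power iteration on the transposed chain
    x /= x.sum()
print(np.round(x, 4))               # satisfies x_i = sum over (j,i) of a_ij x_j
```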

  18. [Figure: the example graph again, with nodes i, j, k, l, m, n]
PageRank of i is x_i = (1/2) x_j + (1/3) x_k + x_l
PageRank of j is x_j = (1/3) x_m + (1/2) x_k
etc.

  19. Beyond the Valley of the Markov Chains: A Network Flow Model
Let us maintain the graph G as our basic model, and the assumption that each user clicks through to a page at each tick of the "clock", but define flow variables:
y_{ij} = flow per unit time from page i to page j
where
\sum_{(i,j) \in E} y_{ij} = Y (a constant)   (*)
(We will usually find it convenient to work with the normalized values p_{ij} = y_{ij} / Y, i.e. probabilities.)

  20. The flows must satisfy conservation equations (Kirchhoff conditions):
\sum_{(h,i) \in E} y_{hi} - \sum_{(i,j) \in E} y_{ij} = 0,   i \in V
as well as y_{ij} >= 0 and (*). The y_{ij} can take any values which satisfy these constraints. What should we estimate them to be?

  21. Noting that we can write the number of "hits" per unit time on page i as:
H_i = \sum_{(h,i) \in E} y_{hi}
the PageRank assumption is that:
y_{ij} = H_i / d_i   for all (i, j) \in E
(where d_i is the out-degree of page i). It is easy to see, by direct substitution, that these flows, suitably scaled to satisfy (*), satisfy the conservation equations. Now, what alternatives do we have?

  22. Max Entropy for the Web
OK, so we want a maximum entropy solution for the network flow problem. Switching from the y_{ij} to p_{ij}, we modify the entropy treatment just given so that p_i \to p_{ij} and, for each node r \in V:
f_r(x_{ij}) = +1 if i = r (an out-edge (r, j) \in E); -1 if j = r (an in-edge (i, r) \in E); 0 otherwise
E[f_r(x)] = 0 for r \in V   (the conservation conditions)
Then:
p_{ij} = \exp[-\lambda_0 - \lambda_i + \lambda_j]   for (i, j) \in E
and
Z = e^{\lambda_0} = \sum_{(i,j) \in E} \exp\{-\lambda_i + \lambda_j\}

  23. Letting a_i = e^{-\lambda_i}, A = diag(a_1, ..., a_m), and
c_{ij} = Z^{-1} for (i, j) \in E, 0 otherwise,
then P = A C A^{-1}, subject to:
\sum_{(h,i) \in E} p_{hi} - \sum_{(i,j) \in E} p_{ij} = 0,   i \in V
\sum_{(i,j) \in E} p_{ij} = 1
This is a matrix balancing (iterative scaling) problem, which can be solved relatively efficiently.

  24. General idea:
0. Guess the value of Z^{-1}. Start with initial values for the a_i (e.g. 1), denoted a_i^(0), and let p_{ij}^(k) = Z^{-1} a_i^(k) / a_j^(k). At each iteration:
1. Compute \sigma_i^(k) = \sum_j p_{ij}^(k) and \rho_i^(k) = \sum_j p_{ji}^(k)
2. Let g_i^(k) = (\rho_i^(k) / \sigma_i^(k))^{1/2}
3. Update a_i^(k+1) \leftarrow g_i^(k) a_i^(k), for some or all i
4. Stop if 1 - \epsilon <= g_i^(k) <= 1 + \epsilon for all i; otherwise advance k and go to step 1.
5. Check if the sum of the final p_{ij} is 1.0. If not, adjust Z and go to step 1.
Note the work per inner iteration is about twice that of an iteration of the power method (or Gauss-Seidel).
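Here is a sketch of steps 0-5 in Python on a tiny invented graph; sigma_i denotes the out-sums and rho_i the in-sums from step 1:

```python
# A sketch of the balancing loop (steps 0-5) on a tiny invented strongly
# connected graph. The a_i are the scale factors exp(-lambda_i) being
# balanced so that in-flow equals out-flow at every node.
import numpy as np

edges = [(0, 1), (0, 2), (1, 0), (2, 0), (2, 1)]
n, eps = 3, 1e-9
a = np.ones(n)
Zinv = 1.0 / len(edges)                 # step 0: guess for 1/Z

for _outer in range(10):
    for _ in range(1000):               # inner balancing iterations
        p = np.zeros((n, n))
        for i, j in edges:
            p[i, j] = Zinv * a[i] / a[j]
        sigma, rho = p.sum(axis=1), p.sum(axis=0)
        g = np.sqrt(rho / sigma)        # step 2
        if np.all(np.abs(g - 1.0) <= eps):
            break                       # step 4: converged
        a *= g                          # step 3
    total = p.sum()
    if abs(total - 1.0) <= 1e-12:
        break
    Zinv /= total                       # step 5: adjust Z and repeat

print(np.round(p, 4))                   # conserved flows summing to 1
print(np.round(a, 4))                   # dual by-products a_i = exp(-lambda_i)
```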

  25. However, in practice, G is not strongly connected:
[Figure: example graph in which some pages (green) have no in-links and others (red) have no out-links]
Many pages have no in-links or no out-links. The "random surfer" can never reach the green pages or escape from the red pages.

  26. In matrix terms this means that the Markov chain matrix is reducible, i.e. M can be arranged in the form:
M = \begin{pmatrix} M_{11} & 0 \\ M_{21} & M_{22} \end{pmatrix}
In this case all the w's associated with M_{11} will be zero, and hence the rank of those pages will also be zero. Also, the pages with zero in-degree have no rank to confer. Solution idea: fudge in some "seed" rank:
x_i = const + \sum_{(j,i) \in E} a_{ij} x_j

  27. More sophisticated idea: assume that every few clicks the surfer makes a "random jump" to anywhere on the web, with frequency (1 - \alpha), i.e. modify the Markov chain so that:
M^* = (1 - \alpha) \frac{e e^T}{n} + \alpha M
where e = (1, 1, ..., 1)^T and 0 < \alpha < 1 (e.g. \alpha = 0.9).
This corresponds to a modified PageRank calculation:
x_i = \frac{1 - \alpha}{n} e^T x + \alpha \sum_{(j,i) \in E} a_{ij} x_j
or
x = \left[ (1 - \alpha) \frac{e e^T}{n} + \alpha A \right] x
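A minimal sketch (not from the talk) of this modified computation, reusing the tiny invented graph from the earlier power iteration sketch, with alpha = 0.9 as suggested on the slide:

```python
# A sketch of the modified computation x = [(1 - alpha) e e^T / n + alpha A] x.
import numpy as np

edges = [(0, 1), (0, 2), (1, 0), (2, 0), (2, 1)]
n, alpha = 3, 0.9
d = np.bincount([i for i, _ in edges], minlength=n)

A = np.zeros((n, n))
for i, j in edges:
    A[j, i] = 1.0 / d[i]                    # a_ji = 1/d_i for (i, j) in E

x = np.full(n, 1.0 / n)                     # normalized so e^T x = 1 throughout
for _ in range(100):
    x = (1 - alpha) / n + alpha * (A @ x)   # random jump + link following
print(np.round(x, 4))
```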

  28. A Modified Network Formulation
We adopt the same idea: that with frequency (1 - \alpha) surfers make a random jump. However, instead of fixing this frequency to be the same for surfers at each page, we impose an overall fraction (1 - \alpha) for the entire network.
Define an additional node n+1 and add it to V to get V'. We then construct edges from every node in V to n+1 and from n+1 to every other node. The total traffic through this node is required to be:
\sum_i p_{i,n+1} = (1 - \alpha) = \sum_j p_{n+1,j}
The new edge set E' is E plus the new edges, and also a self-loop at each dead-end page.

  29. [Figure: the example graph augmented with node n+1; all nodes should be linked to and from n+1, but not all such links are drawn]

  30. The extended model is now:
Maximize S = -\sum_{(i,j) \in E'} p_{ij} \ln p_{ij}
Subject to:
\sum_{(h,i) \in E'} p_{hi} - \sum_{(i,j) \in E'} p_{ij} = 0,   i \in V
\sum_i p_{i,n+1} = (1 - \alpha)
\sum_j p_{n+1,j} = (1 - \alpha)
\sum_{(i,j) \in E'} p_{ij} = 1

  31. The extended model still corresponds to a (hybrid) matrix balancing problem, but with some row and column sums required to take particular values: a simple extension. Note that if we know the flow through any node, we can write down the equations just as we did for the "artificial" node, and treat them in the same way. In other words, we can reduce the uncertainty in the model by applying additional information.
The solution algorithm requires only minor modification, and produces not only the "traffic" primal variables p_{ij}, but as an essential by-product the exponentials of the Lagrange multipliers:
a_i = e^{-\lambda_i}

  32. Traffic Ranking
It is natural to consider the estimated traffic through a page as a measure of its "importance". We therefore compute:
H_i = \sum_{(h,i) \in E} p_{hi}
normalize these values so that the largest is 1.0, and sort in decreasing order to obtain a "traffic ranking" from the primal solution values. What about the dual values (Lagrange multipliers)?

  33. Entropy and Temperature
We noted earlier that the maximum entropy value is a function of the optimal Lagrange multipliers:
S = \lambda_0 + \sum_r \lambda_r E[f_r(x)]   (6)
Now, varying the functions f_r(x) in an arbitrary way, so that \delta f_r(x_{ij}) may be specified independently for each r and (i, j), and letting the expectations of the f_r change in a manner which is also independent, we obtain from (5):
\delta \lambda_0 = \delta \log Z = -\sum_r \{ \delta \lambda_r E[f_r] + \lambda_r E[\delta f_r] \}
and so from (6):
\delta S = \sum_r \lambda_r \{ \delta E[f_r] - E[\delta f_r] \}   (7)
\delta S = \sum_r \lambda_r \delta Q_r   (8)
where (7) defines \delta Q_r = \delta E[f_r] - E[\delta f_r] as the r-th form of "heat".

  34. Why? Because in classical thermodynamics one has the following relation between entropy and heat:
dS = dQ / T
where Q is heat added (this defines the absolute temperature T). Since we have:
\delta S = \sum_r \lambda_r \delta Q_r
the \lambda_r play the role of the inverse of temperature. Thus we may hypothesize a page temperature:
T_r \equiv 1 / \lambda_r
and rank the pages by any order-preserving function of 1/\lambda_r. We currently use e^{-\lambda_i}, since these fall out of the algorithm. These values form the Hyperlinked Object Temperature Scale (HOTS).
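Both rankings can be read directly off the balancing algorithm's output. The sketch below (the helper function names are invented) assumes p and a are the edge-probability matrix and the dual by-products a_i = exp(-lambda_i) produced by the earlier balancing sketch:

```python
# A sketch of both rankings, computed from the outputs of the balancing
# sketch above: p is the edge-probability matrix and a holds the dual
# by-products a_i = exp(-lambda_i). Function names here are invented.
import numpy as np

def traffic_rank(p: np.ndarray) -> np.ndarray:
    H = p.sum(axis=0)               # H_i = sum of p_hi over in-edges (h, i)
    H = H / H.max()                 # normalize so the largest value is 1.0
    return np.argsort(-H)           # page ids in decreasing traffic order

def hots_rank(a: np.ndarray) -> np.ndarray:
    # a larger exp(-lambda_i) means a "hotter" page on the HOTS scale
    return np.argsort(-a)

# e.g., with p and a from the balancing sketch:
#   print(traffic_rank(p), hots_rank(a))
```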

  35. Computational Results
1. IBM intranet crawls made in 2002: (a) 19 million pages, 206 million links; (b) 17 million pages.
2. Partial internet crawl made in 2001: 173 million pages (constraints), 2 billion links (variables).

  36. Intranet Results
The results given are for intranet crawls carried out in 2002, covering 19 and 17 million pages respectively. Some attempt was made to avoid duplicate pages and to canonicalize Domino URLs (to avoid having many views of the same base document treated as distinct URLs). However, there are still some duplicates and bogus URLs in the crawl. There are about 200,000,000 links in the resulting graph.

  37. What's the Quality Like?
To measure the "quality" of a ranking, we took a list of management-blessed queries and the corresponding top URLs they expected a good search engine to return (in late 2001). There were about 100 of these found in the first crawl and about 200 for the second. We then computed the average rank of the ones that were there as our measure of quality. (Note that a rank of 1 is the highest, so a small average is good.)

  38. Intranet Results
Average scores (x 10^6) (possible scale: 1 to n)

Test | PageRank | Traffic | HOTness
1    | 0.644    | 2.275   | 0.461
2    | 1.242    | 1.417   | 1.160

In quality: Traffic < PageRank < HOTness (HOTness best, since a smaller average rank is better)

  39. Internet Result Quality
Again we use a "human expert" ranking; in this case, the pages chosen by the Open Directory Project (ODP, http://dmoz.org). We compute the average rank at each level of the hierarchy (including pages at the specified level, or above), e.g.:
Level 1: /Top/Computers
Level 2: /Top/Computers/Education, plus level 1
Level 3: /Top/Computers/Education/Internet, plus level 2
etc.

  40. Internet Result Quality (continued)
Average ranks (x 10^7)

Level | Number | PageRank | TrafficRank | HOTness
1     | 27     | 0.753    | 6.404       | 1.656
2     | 4258   | 3.143    | 2.862       | 2.614
3     | 65343  | 4.448    | 4.385       | 3.949
4     | 228943 | 4.686    | 4.887       | 4.286
5     | 427578 | 4.817    | 5.127       | 4.438
All   | 990354 | 5.236    | 5.677       | 4.812

Except at level 1 we again have: Traffic < PageRank < HOTness

  41. Rank Aggregation
How do we use multiple static ranks, especially when they appear to rank by tenuously related criteria? For intranet crawl 1 we tried taking just the lowest of the three ranks for each URL in the list. This gives a new average score of 85,113 (as opposed to ~461,000). However, the scale is now compressed by a factor of at most 3. Aggregation is covered in another paper (Fagin et al.).
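A small sketch of this lowest-rank aggregation rule (the URLs and rank values below are invented toy data):

```python
# A sketch of the aggregation rule described above: each URL keeps the
# lowest (best) of its three static ranks. All rank values are invented.
def min_rank(ranks_by_method: dict[str, dict[str, int]]) -> dict[str, int]:
    urls = next(iter(ranks_by_method.values())).keys()
    return {u: min(r[u] for r in ranks_by_method.values()) for u in urls}

ranks = {
    "pagerank": {"u1": 12, "u2": 3, "u3": 40},
    "traffic":  {"u1": 5,  "u2": 9, "u3": 33},
    "hots":     {"u1": 8,  "u2": 2, "u3": 21},
}
print(min_rank(ranks))   # {'u1': 5, 'u2': 2, 'u3': 21}
```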

  42. Generalized Model
Maximize S = \sum_{(i,j) \in E'} \{ z_{ij} p_{ij} - p_{ij} \ln p_{ij} \}
Subject to:
\sum_{(h,i) \in E'} p_{hi} - \sum_{(i,j) \in E'} p_{ij} = 0,   i \in U
\sum_{(h,i) \in E'} p_{hi} = H_i,   i \in V - U
\sum_{(i,j) \in E'} p_{ij} = H_i,   i \in V - U
\sum_{(i,j) \in E'} c_{ij} p_{ij} = C
\sum_{(i,j) \in E'} p_{ij} = 1

  43. Future Research
• Efficient numerical methods for the hybrid matrix balancing problem
• Rank aggregation methods
• Use in an actual search engine (especially for intranets)
• Obtaining and using traffic, a priori probability, and impedance data
• Vulnerability to spam?
• Non-equilibrium (dynamic) extensions: web sensitivity
