The web graph

Web searching and graph similarityVincent Blondel and Paul Van Dooren*CESAME, Universite Catholique de Louvainhttp://www.inma.ucl.ac.be/ * Thanks to P. Sennelart GAMM, 2003

The web graph Nodes = web pages, Edges = hyperlinks between pages 3 billion (Google searched 3,083,324,625 webpages in 2002) Average of 7 outgoing links

The web graph Nodes = web pages, Edges = hyperlinks between pages 3 billion (Google searched 3,083,324,625 webpages in 2002) Average of 7 outgoing links Growth of a few % every month

Outline 1. Structure of the web 2. Methods for searching the web (Google PageRank and Kleinberg Hits) 3. Similarity in graphs 4. Application to synonym extraction (Blondel-Sennelart)

Structure of the web Experiments : two crawls over 200 million pages in 1999 found a giant strongly connected component (core) • Contains most prominent sites • It contains 30% of all pages • Average distance between nodes is 16 • Small world Ref : Broder et al., Graph structure in the web, WWW9, 2000

The web is a bowtie Ref : The web is a bowtie, Nature, May 11, 2000

In- and out-degree distributions Power law distribution : number of pages of in-degree n is proportional to 1/n2.1 (Zipf law)

A score for every page The score of a page is high if the page has many incoming links coming from pages with high page score One browses from page to page by following outgoing links with equal probability. Score = frequency a page is visited.

A score for every page The score of a page is high if the page has many incoming links coming from pages with high page score One browses from page to page by following outgoing links with equal probability. Score = frequency a page is visited. … some pages may have no outgoing links … many pages have zero frequency

PageRank : teleporting random score The surfer follows a path by choosing an outgoing link with probability p/dout(i) or teleports to a random web page with probability 0<1-p <1. Put the transition probability of i to j in a matrix M (bij=1 if i→j) mij = p bij /dout(i) + (1-p)/n then the vector xof probability distribution on the nodes of the graph is the steady state vector of the iteration xk+1=Mxk i.e. the dominant eigenvector of the matrix M (unique because of Perron-Frobenius) PageRank of node i is the (relative) size of element i of this vector

Matlab News and Notes, October 2002

and my own page rank ? use Google toolbar some top pages : PageRank In-degree 1 http://www.yahoo.com 10 654,000 2 http://www.adobe.com 10 646,000 5 http://www.google.com 10 252,000 8 http://www.microsoft.com 10 129,000 12 http://www.nasa.gov 10 93,900 20 http://mit.edu 10 47,600 23 http://www.nsf.gov 10 39,400 26 http://www.inria.fr 10 17,400 72 http://www.stanford.edu 9 36,300

Kleinberg’s structure graph The score of a page is high if the page has many incoming links The score is high if the incoming links are from pages that have high scores

Kleinberg’s structure graph The score of a page is high if the page has many incoming links The score is high if the incoming links are from pages that have high scores This inspired Kleinberg’s “structure graph” hub authority

Good authorities for “University Belgium”

A good hub for “University Belgium”

Hub and authority scores Web pages have a hub scorehj and an authority score aj which are mutually reinforcing : pages with large hj point to pages with high aj pages with large aj are pointed to by pages with high hj hj ← Σ i:(j→i)ai aj ← Σ i:(i→j)hi or, using the adjacency matrix B of the graph (bij=1 if j→i is an edge) h0 B hh1 ak+1BT0 a k a01 Use limiting vectora (dominant eigenvector of BTB) to rank pages = =

Extension to another structure graph Give three scores to each web page : begin b, center c, end e b c e Use again mutual reinforcement to define the iteration bj ← Σ i:(j→i)ci cj ← Σ i:(i→j)bi + Σ i:(j→i)ei ej ← Σ i:(i→j)ci Defines a limiting vector for the iteration b0 B 0 xk+1 = M xk, x0=1 where x = c , M =BT 0 B e0 BT 0

Towards arbitrary graphs For the graph • → • A = and M = For the graph •→ • → • A = and M = Formula for M for two arbitrary graphs GA and GB: M= A B + AT BT With xk =vec(Xk) iteration xk+1 = M xkis equivalent toXk+1 = BXk AT+BTXk A

Convergence ? The (normalized) sequence Zk+1 = (BZk AT+BTZk A)/ ||BZk AT+BTZk A||2 has two fixed points Zeven and Zodd for every Z0>0 Similarity matrix S = limk→∞ Z2k , Z0 =1 Si,j is the similarity score between Vj (A)and Vi (B) Properties • ρS=BSAT+BTSA, ρ=||BSAT+BTSA||2 • Fixed point of largest 1-norm • Robust fixed point for M+ε1 • Linear convergence (power method for sparse M)

Bow tie example S= S= if m>n if n>m not satisfactory graph A 1 • → • 2 graph B 2 1 n+1 n+m+1

Bow tie example S= central score is good graph A 1• → • → •3 2 graph B 2 1 n+1 n+m+1

Other properties • Central score is a dominant eigenvector of BBT+BTB (cfr. hub score of BBTand authority score of BTB) • Similarity matrix of a graph with itself is square and semi-definite. Path graph • → • → • Cycle graph

The dictionary graph OPTED, based on Webster’s unabridged dictionary http://msowww.anu.edu.au/~ralph/OPTED Nodes = words present in the dictionary : 112,169 nodes Edge (u,v) if v appears in the definition of u : 1,398,424 edges Average of 12 edges per node

In and out degree distribution Very similar to web (power law) Words with highest in degree : of, a, the, or, to, in … Words with null out degree : 14159, Fe3O4, Aaron, and some undefined or misspelled words

Neighborhood graph is the subset of vertices used for finding synonyms : it contains all parents and children of the node neighborhood graph of likely “Central” uses this sub-graph to rank automatically synonyms Comparison with Vectors, ArcRank (automatic) Wordnet, Microsoft Word (manual)

Disappear

Parallelogram

Science

Sugar

Conclusion • New notion of similarity between vertices of a graph • Easy to compute : start from X0 = 1 and take even normalized iterates of Xk+1=BXkAT+BTXkA • Potential use for data-mining, classification, clustering • Successful implementation for the french dictionary “Le petit Robert” • Applications in texts, internet, reference lists, telephone networks, bipartite graphs… (Melnik, Widom, …) • Different from sub-graph problems !

The web graph

The web graph

Presentation Transcript

CS728 Lecture 5 Generative Graph Models and the Web

Web Graph representation and compression

Web Graph Characteristics

Graph Databases and the Semantic Web

The Line Graph

Interpreting the Graph

Web as Graph – Empirical Studies

Traffic-driven model of the World-Wide-Web Graph

Web Page Clustering using Heuristic Search in the Web Graph

A Random-Surfer Web-Graph Model

The Web is a Graph

Generative Models for the Web Graph

The Web as a graph

The Web Graph & The Laws of The Web

Web Graph & Link Analysis

Web as a graph

The Graph Theory

Graph Undirected graph Directed graph

The Web Graph & The Laws of The Web

Compressed storage of the Web-graph

The web graph

The web graph

Presentation Transcript

CS728 Lecture 5 Generative Graph Models and the Web

Web Graph representation and compression

Web Graph Characteristics

Graph Databases and the Semantic Web

The Line Graph

Interpreting the Graph

Web as Graph – Empirical Studies

Traffic-driven model of the World-Wide-Web Graph

Web Page Clustering using Heuristic Search in the Web Graph

A Random-Surfer Web-Graph Model

The Web is a Graph

Generative Models for the Web Graph

The Web as a graph

The Web Graph &amp; The Laws of The Web

Web Graph &amp; Link Analysis

Web as a graph

The Graph Theory

Graph Undirected graph Directed graph

The Web Graph &amp; The Laws of The Web

Compressed storage of the Web-graph

The Web Graph & The Laws of The Web

Web Graph & Link Analysis

The Web Graph & The Laws of The Web