Google’s Billion Dollar Eigenvector

Gerald Kruse, PhD
John ’54 and Irene ’58 Dale Professor of MA, CS and IT
Interim Assistant Provost 2013-14
Juniata College, Huntingdon, PA
kruse@juniata.edu
http://faculty.juniata.edu/kruse

Now, back to Search Engines… What must they do?
• Crawl the web and locate all public pages
• Index the “crawled” data so it can be searched
• Rank the pages for more effective searching (the focus of this talk)

PageRank is NOT a simple citation index
Which is the more popular page below, A or B? What if the links to A were from unpopular pages, and the one link to B was from www.yahoo.com?
[Figure: Page A with several incoming links, Page B with a single incoming link]
NOTE: While PageRank is an important part of Google’s search results, it is not the sole means used to rank pages.

Intuitively PageRank is analogous to popularity
• The web as a graph: each page is a vertex, each hyperlink a directed edge.
• A page is popular if a few very popular pages point (via hyperlinks) to it.
• A page could be popular if many not-necessarily popular pages point (via hyperlinks) to it.
Which of these three would have the highest PageRank?
[Figure: link graph of Page A, Page B, and Page C]

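So what is the mathematical definition of PageRank?
In particular, a page’s rank is equal to the sum of the scaled ranks of all the pages pointing to it:
$x_k = \sum_{j \in L_k} \frac{x_j}{n_j}$
where $L_k$ is the set of pages linking to page $k$ and $n_j$ is the number of outgoing links on page $j$. Note the scaling of each page rank: page $j$ splits its rank evenly among the pages it links to.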
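The definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-page web, the `link_matrix` and `rank_update` helpers, and the `links` dictionary are hypothetical and are not taken from the slides. Each page j contributes its rank divided by its number of outgoing links to every page it points to.

```python
from fractions import Fraction

def link_matrix(links, pages):
    """Return A with A[k][j] = 1/n_j if page j links to page k, else 0."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in links.items():
        for k in targets:
            # Page j splits its rank evenly over its n_j outgoing links.
            A[index[k]][index[j]] = Fraction(1, len(targets))
    return A

def rank_update(A, x):
    """Apply x_k = sum_j A[k][j] * x_j to every page at once."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Hypothetical three-page web (an assumption, not the graph on the slide).
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
A = link_matrix(links, pages)

# Every page has outgoing links, so each column of A sums to 1.
column_sums = [sum(A[k][j] for k in range(len(pages))) for j in range(len(pages))]
```

Exact fractions are used so that the column sums come out as exactly 1 rather than as rounded floats.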

Writing out the equation for each web page in our example gives:
[Figure: example graph of Page A, Page B, and Page C with one rank equation per page]

Even though this is a circular definition, we can calculate the ranks. Rewrite the system of equations as a matrix-vector product, $Ax = x$. The PageRank vector $x$ is simply an eigenvector of the coefficient matrix $A$, with eigenvalue $\lambda = 1$.

[Figure: example graph of Page A, Page B, and Page C, labeled with PageRank values 0.4, 0.2, and 0.4]
Note: we choose the eigenvector with $\sum_i x_i = 1$.
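As a hedged check, here is one hypothetical link structure whose ranks come out as 0.4, 0.2, and 0.4. The transcript does not preserve the slide's actual graph, so the links below (A to B and C, B to C, C to A) are an assumption chosen only because they reproduce those values.

```python
from fractions import Fraction as F

# Hypothetical link matrix (assumed links: A -> B, A -> C, B -> C, C -> A).
# Column j holds 1/n_j in the rows of the pages that page j links to.
A = [
    [F(0),    F(0), F(1)],   # row for Page A: receives all of C's rank
    [F(1, 2), F(0), F(0)],   # row for Page B: receives half of A's rank
    [F(1, 2), F(1), F(0)],   # row for Page C: half of A's rank plus all of B's
]

def matvec(M, v):
    """Matrix-vector product using exact fractions."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

# Solving by substitution: x_B = x_A / 2 and x_C = x_A / 2 + x_B = x_A,
# so any multiple of (1, 1/2, 1) satisfies A x = x.
raw = [F(1), F(1, 2), F(1)]

# Choose the eigenvector whose entries sum to 1.
total = sum(raw)
x = [r / total for r in raw]   # [2/5, 1/5, 2/5], i.e. 0.4, 0.2, 0.4
```

Multiplying `A` by `x` returns `x` unchanged, which is exactly the statement that `x` is an eigenvector with eigenvalue 1.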

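Note that the coefficient matrix is column-stochastic.* Every column-stochastic matrix has 1 as an eigenvalue: its columns each sum to 1, so the transpose has the all-ones vector as an eigenvector with eigenvalue 1, and a matrix and its transpose share eigenvalues.
* As long as there are no “dangling nodes” (needed for every column to sum to 1) and the graph is connected (needed for a unique ranking).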

Dangling Nodes have no outgoing links
In this example, Page C is a dangling node. Note that its associated column in the coefficient matrix is all 0. Matrices like these are called column-substochastic.
[Figure: graph of Page A, Page B, and Page C, where Page C has no outgoing links]
Page, Brin, et al. [1] suggest that dangling nodes would most likely occur from pages which haven’t been crawled yet, and so they “simply remove them from the system until all the PageRanks are calculated.”
It is interesting to note that a column-substochastic matrix does have a positive eigenvalue and a corresponding eigenvector with non-negative entries, which is called the Perron eigenvector, as detailed in Bryan and Leise [2].
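A minimal sketch of spotting and removing dangling nodes, assuming the web is stored as a dictionary of outgoing links. The helper names and the three-page example are hypothetical. Repeating the removal until no dangling pages remain is a design choice made here for the sketch, since deleting one page can leave another without outgoing links; it is not claimed to be the procedure used in [1].

```python
def dangling_nodes(links, pages):
    """Pages with no outgoing links (their matrix column would be all 0)."""
    return [p for p in pages if not links.get(p)]

def remove_dangling(links, pages):
    """Drop dangling pages and the links into them until none remain."""
    pages = list(pages)
    links = {p: list(links.get(p, [])) for p in pages}
    while True:
        dead = set(dangling_nodes(links, pages))
        if not dead:
            return links, pages
        pages = [p for p in pages if p not in dead]
        links = {p: [t for t in links[p] if t not in dead] for p in pages}

# Hypothetical web in which Page C is a dangling node.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["A", "C"], "C": []}

found = dangling_nodes(links, pages)
reduced_links, reduced_pages = remove_dangling(links, pages)
```

After removal, the remaining pages A and B link only to each other, so the reduced coefficient matrix is column-stochastic again.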

A disconnected graph could lead to non-unique rankings
Notice the block diagonal structure of the coefficient matrix. Note: Re-ordering via permutation doesn’t change the ranking, as in [2].
[Figure: disconnected graph of Page A, Page B, Page C, Page D, and Page E, with its block diagonal coefficient matrix]
In this example, the eigenspace associated with eigenvalue $\lambda = 1$ is two-dimensional. Which eigenvector should be used for ranking?
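The ambiguity can be made concrete with a small sketch. The five-page web below is hypothetical (assumed links: A and B point to each other, C and D point to each other, and E points to C and D); the slide's own graph is not recoverable from the transcript. Two different normalized eigenvectors exist, and mixing them in different proportions reverses the relative order of pages in the two subwebs.

```python
from fractions import Fraction as F

# Hypothetical block diagonal link matrix, pages ordered A, B, C, D, E.
A = [
    [F(0), F(1), F(0), F(0), F(0)],      # A receives from B
    [F(1), F(0), F(0), F(0), F(0)],      # B receives from A
    [F(0), F(0), F(0), F(1), F(1, 2)],   # C receives from D and half of E
    [F(0), F(0), F(1), F(0), F(1, 2)],   # D receives from C and half of E
    [F(0), F(0), F(0), F(0), F(0)],      # E has no incoming links
]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

# One eigenvector lives entirely on each subweb.
u = [F(1, 2), F(1, 2), F(0), F(0), F(0)]
v = [F(0), F(0), F(1, 2), F(1, 2), F(0)]

def mix(t):
    """Any blend t*u + (1-t)*v is also a normalized eigenvector."""
    return [t * ui + (1 - t) * vi for ui, vi in zip(u, v)]

favors_second_subweb = mix(F(1, 4))   # ranks C above A
favors_first_subweb = mix(F(3, 4))    # ranks A above C
```

Both blends satisfy $Ax = x$ and sum to 1, yet they disagree on whether Page A or Page C ranks higher, which is why the simple formula cannot compare pages across subwebs.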

Add a “random-surfer” term to the simple PageRank formula.
Let $S$ be an $n \times n$ matrix with all entries $1/n$. $S$ is column-stochastic, and we consider the matrix $M = (1-m)A + mS$ with $0 \le m \le 1$, which is a weighted average of $A$ and $S$. This models the behavior of a real web-surfer, who might jump to another page by directly typing in a URL or by choosing a bookmark, rather than clicking on a hyperlink. Originally, $m = 0.15$ in Google, according to [2].
$Mx = x$ can also be written as: $x = (1-m)Ax + ms$, where $s$ is a column vector with all entries $1/n$, and $Sx = s$ if $\sum_i x_i = 1$.
Important Note: We will use this formulation with $A$ when computing $x$.
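The equivalence of the two formulations can be checked numerically. The sketch below assumes the weighted average is $M = (1-m)A + mS$ and uses a hypothetical three-page link matrix and an arbitrary test vector whose entries sum to 1; none of these values come from the slides.

```python
from fractions import Fraction as F

def matvec(M, v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in M]

def google_matrix(A, m):
    """Dense M = (1 - m) * A + m * S, where every entry of S is 1/n."""
    n = len(A)
    return [[(1 - m) * A[i][j] + m * F(1, n) for j in range(n)] for i in range(n)]

def sparse_form(A, x, m):
    """(1 - m) * A x + m * s, where every entry of s is 1/n."""
    n = len(A)
    Ax = matvec(A, x)
    return [(1 - m) * Ax[i] + m * F(1, n) for i in range(n)]

# Hypothetical link matrix (assumed links: A -> B, A -> C, B -> C, C -> A).
A = [
    [F(0),    F(0), F(1)],
    [F(1, 2), F(0), F(0)],
    [F(1, 2), F(1), F(0)],
]
m = F(15, 100)
x = [F(1, 2), F(1, 3), F(1, 6)]   # arbitrary test vector, entries sum to 1

M = google_matrix(A, m)
dense_result = matvec(M, x)
sparse_result = sparse_form(A, x, m)
```

The two results agree exactly because $Sx = s$ whenever the entries of $x$ sum to 1. The second form never builds the dense matrix, which is what makes it usable at web scale.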

This gives a regular matrix
• In matrix notation we have $x = (1-m)Ax + ms$.
• Since $Sx = s$ when $\sum_i x_i = 1$, we can rewrite this as $x = ((1-m)A + mS)x = Mx$.
• The new coefficient matrix $M$ is regular, so we can calculate the eigenvector iteratively.
• This iterative process is a series of matrix-vector products, beginning with an initial vector (typically the previous PageRank vector). These products can be calculated without explicitly creating the huge coefficient matrix.

M for our previous disconnected graph, with m=0.15
[Figure: graph of Page A through Page E, with the matrix $M$ and its normalized eigenvector]
The eigenspace associated with $\lambda = 1$ is one-dimensional, so the normalized eigenvector is unique. The addition of the random surfer term therefore permits comparison between pages in different subwebs.

Iterative Calculation
The web currently contains tens of billions of pages. How does Google compute an eigenvector for something this large? One possibility is the power method.
In [2], it is shown that every positive (all entries are > 0) column-stochastic matrix $M$ has a unique vector $q$ with positive components such that $Mq = q$, with $\|q\|_1 = 1$, and it can be computed as $q = \lim_{k \to \infty} M^k x_0$, for any initial guess $x_0$ with positive components and $\|x_0\|_1 = 1$.
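A small dense sketch of the power method, under stated assumptions: the five-page disconnected web is hypothetical (A and B link to each other, C and D link to each other, E links to C and D), m = 0.15, and a fixed number of steps stands in for a real convergence test. Two different positive starting guesses are used to illustrate that the limit does not depend on the guess.

```python
def power_method(M, x0, steps=200):
    """Repeatedly apply M; the iterate approaches the eigenvector for eigenvalue 1."""
    x = list(x0)
    for _ in range(steps):
        x = [sum(a * xj for a, xj in zip(row, x)) for row in M]
        total = sum(x)
        x = [xi / total for xi in x]   # renormalize to limit round-off drift
    return x

# Hypothetical link matrix for a disconnected web, pages ordered A, B, C, D, E.
A = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
m, n = 0.15, 5

# Dense positive matrix M = (1 - m) * A + m * S, built only because n is tiny.
M = [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

q_uniform = power_method(M, [1.0 / n] * n)
q_skewed = power_method(M, [0.6, 0.1, 0.1, 0.1, 0.1])
# For this hypothetical web both runs approach
# approximately [0.2, 0.2, 0.285, 0.285, 0.03].
```

With the random-surfer term included, pages in the two subwebs receive directly comparable ranks: here C and D outrank A and B, and E, which nobody links to, keeps only the m/n share.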

Iterative Calculation continued
Rather than calculating the powers of $M$ directly, we could use the iteration $x_k = M x_{k-1}$. Since $M$ is positive, $M x_{k-1}$ would be an $O(n^2)$ calculation. As we mentioned previously, Google uses the equivalent expression in the computation: $x_k = (1-m)A x_{k-1} + ms$.
These products can be calculated without explicitly creating the huge coefficient matrix, since $A$ contains mostly 0’s. The iteration is guaranteed to converge, and it will converge more quickly with a better first guess, so the previous PageRank vector is used as the initial vector.
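The sparse iteration can be sketched by walking the links directly instead of storing a matrix. This is an illustration only, not Google's implementation: the `pagerank` function, its tolerance, and the five-page web are hypothetical, and the sketch assumes every page has at least one outgoing link and that any supplied starting vector sums to 1.

```python
def pagerank(links, pages, m=0.15, x0=None, tol=1e-12, max_iter=1000):
    """Iterate x <- (1 - m) * A x + m * s using only the link lists.

    Assumes no dangling nodes: every page in `links` has outgoing links.
    Returns the rank vector and the number of iterations used.
    """
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    x = list(x0) if x0 is not None else [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        new = [m / n] * n                          # the m * s term
        for page, targets in links.items():        # (1 - m) * A x, edge by edge
            share = (1 - m) * x[index[page]] / len(targets)
            for target in targets:
                new[index[target]] += share
        change = sum(abs(a - b) for a, b in zip(new, x))
        x = new
        if change < tol:
            return x, iteration
    return x, max_iter

# Hypothetical disconnected web: two subwebs {A, B} and {C, D, E}.
pages = ["A", "B", "C", "D", "E"]
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}

cold_ranks, cold_iters = pagerank(links, pages)                  # uniform start
warm_ranks, warm_iters = pagerank(links, pages, x0=cold_ranks)   # previous result
```

Each pass costs work proportional to the number of links rather than n squared. Restarting from the converged vector stands in for reusing the previous PageRank vector, and it finishes in far fewer iterations than the uniform start.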

“Google-ing” Google

Results in an early paper from Page, Brin et al. while in graduate school

Attempts to Manipulate Search Results Via a “Google Bomb”

Liberals vs. Conservatives!

Juniata’s own “Google Bomb”

At Juniata, CS 315 is my “Analysis and Algorithms” course

“Ego Surfing” Be very careful…

More than one Gerald Kruse…

Miscellaneous points
• Try a search in Google on “PigeonRank.”
• What types of sites would Google NOT give good results on?
• PageRank is not the only means Google uses to order search results.

Bibliography
[1] S. Brin, L. Page, et al., The PageRank Citation Ranking: Bringing Order to the Web, http://dbpubs.stanford.edu/pub/1999-66, Stanford Digital Libraries Project (January 29, 1998).
[2] K. Bryan and T. Leise, The $25,000,000,000 Eigenvector: The Linear Algebra behind Google, SIAM Review, 48 (2006), pp. 569-581.
[3] G. Strang, Linear Algebra and Its Applications, Brooks-Cole, Boston, MA, 2005.
[4] D. Poole, Linear Algebra: A Modern Introduction, Brooks-Cole, Boston, MA, 2005.

Any Questions? Slides available at http://faculty.juniata.edu/kruse
