
Google Pagerank: how Google orders your webpages

Google Pagerank: how Google orders your webpages. Dan Teague NCSSM. The Problem. Imagine a library containing 40 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone.


Presentation Transcript


  1. Google Pagerank: how Google orders your webpages Dan Teague NCSSM

  2. The Problem • Imagine a library containing 40 billion documents but with no centralized organization and no librarians. • In addition, anyone may add a document at any time without telling anyone. • If one of these documents is vitally important to you, how could you find it?

  3. Why This Order?

  4. Google Pagerank System. Google was developed by Sergey Brin and Larry Page. Pagerank is the method that Larry Page developed to rank and order the pages; hence the name Pagerank.

  5. Larry Page (new CEO of Google) Co-founder Larry Page once described the “perfect search engine” as something that “understands exactly what you mean and gives you back exactly what you want.”

  6. Eagle Ray at Eden Rock

  7. How would you order these sites? • Suppose the nodes at right have the links shown in the directed graph. Which node is most important and should appear first?

  8. The Basic Idea • PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. http://www.webworkshop.net/pagerank.html

  9. Bucket Brigade Matrix

  10. Outdegree Matrix H

  11. Markov Chain We would like to think of this matrix as a transition matrix (like a Markov chain). If we move around on the graph at random, at which nodes will we spend most of our time? These most important nodes can be found in a Markov chain by considering the powers of H.
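
The random-walk question on this slide can be tried directly. The sketch below is a minimal illustration on a hypothetical 3-page web (the `LINKS` graph and the `random_walk` helper are assumptions for this example, not the graph from the slides). For this particular graph the fractions of time should settle near 0.4, 0.2, and 0.4.

```python
import random

# Hypothetical 3-page web (an assumption, not the graph from the slides):
# page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
LINKS = {0: [1, 2], 1: [2], 2: [0]}

def random_walk(links, steps, seed=0):
    """Follow links at random; return the fraction of time spent on each page."""
    rng = random.Random(seed)
    visits = {node: 0 for node in links}
    current = 0
    for _ in range(steps):
        visits[current] += 1
        current = rng.choice(links[current])  # every out-link is equally likely
    return {node: count / steps for node, count in visits.items()}

fractions = random_walk(LINKS, steps=100_000)
print(fractions)
```

Simulation is only a way to build intuition; it converges slowly compared with the matrix methods on the following slides.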

  12. Or we can look for solutions to HX = X. This means we want the eigenvector X associated with the eigenvalue 1. • This is why the Pagerank is known as the $25,000,000,000 eigenvector.
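
As a rough check of both ideas (powers of H and HX = X), the sketch below builds H for a hypothetical 3-page web and raises it to the 64th power. It assumes the column convention implied by HX = X: column j of H holds 1/outdegree(j) in the rows of the pages that j links to. The graph and helper names are illustrative assumptions.

```python
# Hypothetical 3-page web (an assumption, not the graph from the slides):
# page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
LINKS = {0: [1, 2], 1: [2], 2: [0]}

def outdegree_matrix(links):
    """H[i][j] = 1/outdegree(j) if page j links to page i, else 0."""
    n = len(links)
    H = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            H[i][j] = 1.0 / len(targets)
    return H

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(M, x):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

H = outdegree_matrix(LINKS)

P = H
for _ in range(63):
    P = matmul(P, H)        # after the loop, P equals H to the 64th power

X = [row[0] for row in P]   # for this graph every column of P is close to X
HX = matvec(H, X)
print("X  =", [round(v, 4) for v in X])
print("HX =", [round(v, 4) for v in HX])
```

For this small, well-behaved graph the two printed vectors should agree; the later slides show graphs where powers of H do not behave this nicely.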

  13. Consider powers of H

  14. Where did all the Importance go?

  15. Things that go wrong:

  16. Dangling Nodes

  17. Dangling Node

  18. Cycles

  19. Dangling Subgraphs

  20. Graph not strongly connected

  21. Powers of Hs

  22. States 4-7 Disappear

  23. How Do We Handle These Problems? • The Dangling Node • The Cycle • The Sub-graph Sink

  24. The Dangling Node • We handle the dangling node by requiring a transition to another node at random. • Pick a node, move there, and then move forward.

  25. We alter our Bucket Brigade matrix by adding in matrix A.

  26. Matrix H + A
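
A minimal sketch of the dangling-node fix, on a hypothetical 3-page web in which page 2 has no out-links. It assumes the common convention that A puts 1/n in every row of a dangling column (so the random pick may land on any page, including the current one) and zeros elsewhere; the slides do not spell out the exact entries of A.

```python
# Hypothetical 3-page web (an assumption): page 2 is a dangling node.
LINKS = {0: [1, 2], 1: [2], 2: []}
n = len(LINKS)

def matvec(M, x):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# H: column j spreads page j's importance evenly over its out-links.
H = [[0.0] * n for _ in range(n)]
for j, targets in LINKS.items():
    for i in targets:
        H[i][j] = 1.0 / len(targets)

# A: a dangling column gets 1/n in every row; all other columns are zero.
A = [[(1.0 / n if not LINKS[j] else 0.0) for j in range(n)] for _ in range(n)]

HA = [[H[i][j] + A[i][j] for j in range(n)] for i in range(n)]

x = [1.0 / n] * n            # start with equal importance everywhere
leaked = matvec(H, x)        # importance drains out through page 2
kept = matvec(HA, x)         # total importance is preserved
print("total after H:    ", sum(leaked))
print("total after H + A:", sum(kept))
```

With H alone the total drops below 1 after a single step, which is where the importance "goes"; with H + A the total stays at 1.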

  27. What About the Other Problems? • The Cycle • The Sub-graph Sink. Dangling nodes are easy to find. Cycles and sub-graph sinks are more difficult and time-consuming to find. Pagerank handles these problems without actually finding them.

  28. Probabilistic Movement • Roll a die. • If anything but a 6 shows, follow the web, that is, use our matrix (H + A). • However, if you roll a 6, pick a page at random and go there. • This gives us an out when we are trapped either by a cycle or by a sub-graph sink.
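
The die-rolling surfer can be simulated in a few lines. This is a sketch on a hypothetical 4-page web in which pages 1 and 2 link only to each other and so form a sub-graph sink; the `surf` helper and its `escape_prob` parameter are names invented for this example.

```python
import random

# Hypothetical 4-page web (an assumption, not the graph from the slides).
# Pages 1 and 2 link only to each other, so together they form a sink.
LINKS = {0: [1], 1: [2], 2: [1], 3: [0]}

def surf(links, steps, escape_prob, seed=0):
    """Count visits per page for a surfer who jumps to a random page with
    probability escape_prob and otherwise follows a random out-link."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {node: 0 for node in pages}
    current = 0
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < escape_prob or not links[current]:
            current = rng.choice(pages)           # rolled a 6 (or dangling page)
        else:
            current = rng.choice(links[current])  # follow the web
    return visits

trapped = surf(LINKS, steps=60_000, escape_prob=0.0)     # never roll the die
escaping = surf(LINKS, steps=60_000, escape_prob=1 / 6)  # a 6 means escape
print("no escape:  ", trapped)
print("with escape:", escaping)
```

Without the escape, the surfer leaves page 0 once and never returns; with it, every page keeps receiving visits.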

  29. How Often Should We Look for an Escape? • Would it be better to roll a 20-sided die or flip a coin?

  30. How do you implement the coin flip? • Create a matrix all of whose entries are 1. This is the One matrix. If we multiply this matrix by 1/n, where n is the number of nodes in the graph (in our example 11, in reality 40 billion), then we have an equal chance of traveling from any point to any other point. We pretend that the web is a complete graph.

  31. Roll the die • We will use the Web-ordered matrix H+A with probability p and the One matrix with probability (1-p). • What’s a good value for p?

  32. The Basic Google Equation

  33. G = p(H + A) + (1-p)One (1/n). We know that (H + A) and One (1/n) are both transition matrices of Markov chains. Is G one as well? If so, powers of G should tell us what we want to know.
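
One way to explore the question on this slide is to build G for a small example and check its columns. The sketch below uses a hypothetical 4-page web with a dangling page, and assumes the column convention used with HX = X and a dangling column of 1/n in A; `google_matrix` is an illustrative name.

```python
# Hypothetical 4-page web (an assumption); page 3 is a dangling node.
LINKS = {0: [1], 1: [2], 2: [1], 3: []}

def google_matrix(links, p):
    """Return G = p(H + A) + (1 - p) * One * (1/n) as a list of rows."""
    n = len(links)
    # Start from the (1 - p) * One * (1/n) term.
    G = [[(1.0 - p) / n for _ in range(n)] for _ in range(n)]
    for j, targets in links.items():
        if targets:                      # column j of H
            for i in targets:
                G[i][j] += p / len(targets)
        else:                            # column j of A (dangling page)
            for i in range(n):
                G[i][j] += p / n
    return G

G = google_matrix(LINKS, p=0.85)
n = len(G)
column_sums = [sum(G[i][j] for i in range(n)) for j in range(n)]
print(column_sums)
```

Each column is a weighted average of two probability columns with weights p and 1 - p, so each column of G should again sum to 1, and every entry is strictly positive.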

  34. G = p(H + A) + (1-p)One (1/n) • But computing powers of G is incredibly inefficient at the scale of the "real world" web. • Instead, the iterative method is employed.

  35. Iterating X_{n+1} = G X_n
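
A minimal sketch of the iteration, assuming X starts with equal weight on every page and sums to 1. Rather than storing the dense matrix G, it applies the three terms of G separately, which is one common way to keep the work proportional to the number of links. The graph and the `pagerank` helper are hypothetical.

```python
# Hypothetical 4-page web (an assumption, not the graph from the slides).
LINKS = {0: [1, 2], 1: [2], 2: [0], 3: [2]}

def pagerank(links, p=0.85, iterations=50):
    """Iterate X_{n+1} = G X_n without forming G explicitly."""
    n = len(links)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Importance held by dangling pages is spread evenly (the A term).
        dangling = sum(x[j] for j, targets in links.items() if not targets)
        # Every page receives the random-jump share plus the dangling share.
        base = (1.0 - p) / n + p * dangling / n
        nxt = [base] * n
        # The H term: each page passes p times its importance along its links.
        for j, targets in links.items():
            for i in targets:
                nxt[i] += p * x[j] / len(targets)
        x = nxt
    return x

x = pagerank(LINKS)
ranking = sorted(range(len(x)), key=lambda i: -x[i])
print([round(v, 4) for v in x])
print("order:", ranking)
```

Sorting the pages by their entry in X gives the ordering, in the same way the slides order their 11-node example.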

  36. The Pagerank order is 8-7-11-10-9-6-5-1-2-3-4

  37. What about p? • What role does p play and what value is actually used?

  38. p determines the rate of convergence

  39. p = 0.95 has not yet converged
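
The effect of p on convergence can be seen by counting iterations until X stops changing. The sketch below uses a hypothetical 4-page web containing a cycle (and no dangling pages, so the A term is omitted); the tolerance and the helper name are assumptions made for illustration.

```python
# Hypothetical 4-page web (an assumption): 0 -> 1 -> 2 -> 0 is a cycle,
# and page 3 links into it. There are no dangling pages.
LINKS = {0: [1], 1: [2], 2: [0], 3: [0]}

def iterations_to_converge(links, p, tol=1e-10, max_iter=10_000):
    """Iterate X_{n+1} = G X_n until successive X differ by less than tol."""
    n = len(links)
    x = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        nxt = [(1.0 - p) / n] * n                 # random-jump share
        for j, targets in links.items():
            for i in targets:
                nxt[i] += p * x[j] / len(targets)  # follow-the-web share
        change = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            return k
    return max_iter

counts = {p: iterations_to_converge(LINKS, p) for p in (0.5, 0.85, 0.95)}
print(counts)
```

On this graph the count should grow as p approaches 1: a larger p follows the web more faithfully but takes longer to settle.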

  40. Google Pagerank • Google claims that it uses p = 0.85 (so the roll of a six-sided die, with p = 5/6, is just about right) and about 50 iterations with the matrix G, where G = p(H + A) + (1-p)One (1/n). It recomputes every month.
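
A rough piece of arithmetic suggests why about 50 iterations might be enough. It assumes the error shrinks by a factor of at most about p per iteration, a standard bound for a matrix built this way that the slides do not state.

```python
# Assumption: after k iterations the error is at most on the order of p**k.
ITERATIONS = 50
bound = {p: p ** ITERATIONS for p in (0.5, 0.85, 0.95)}
for p, value in bound.items():
    print(f"p = {p}: p**{ITERATIONS} = {value:.2e}")
```

Under that assumption, p = 0.85 leaves an error factor of a few parts in ten thousand after 50 iterations, while p = 0.95 leaves several percent, consistent with the earlier slide where p = 0.95 has not yet converged.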

  41. References:

  42. Convergence?
