The math behind PageRank

A detailed analysis of the mathematical aspects of PageRank

Computational Mathematics class presentation

Ravi S Sinha

LIT lab, UNT

Partial citations of references

  • The Anatomy of a Large-Scale Hypertextual Web Search Engine

    • Sergey Brin and Lawrence Page

  • Inside PageRank

    • Monica Bianchini, Marco Gori, and Franco Scarselli

  • Deeper Inside PageRank

    • Amy Langville and Carl Meyer

  • Efficient Computation of PageRank

    • Taher Haveliwala

  • Topic-Sensitive PageRank

    • Taher Haveliwala

Overview of the talk

  • Why PageRank

  • What is PageRank

  • How PageRank is used

  • Math

  • More math

  • Remaining math

Why PageRank

  • Need to build a better automatic search engine

    • Why?

      • Human-maintained lists are subjective and expensive to build (non-automatic)

      • Automatic engines based on keyword matching do a horrible job (just page content is not enough; cleverly placed words in a page can mislead search engines)

      • Advertisers sometimes mislead search engines

  • Solution: Google [modern day: much more than PageRank; getting smarter]

    • Exact technology: not public domain

    • Core technology: PageRank (utilizes link structure)

  • Other uses

    • Any problem that can be visualized as a graph problem where the centrality of the vertices needs to be computed (NLP, etc.)

What is PageRank

  • A way to find the most ‘important’ vertices in a graph

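  • PR(A) = (1-d) + d [ PR(T1) / C(T1) + … + PR(Tn) / C(Tn) ]

    • T1 … Tn = pages linking to A; C(T) = number of outgoing links on page T; d = damping factor

  • Forms a probability distribution over the vertices [sum = 1] once the scores are divided by the number of pages N; as written, with no dangling pages, the scores sum to N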

  • How does this relate to Web search?

    • Vertices = pages

    • Incoming edges = hyperlinks from other pages

    • Outgoing edges = hyperlinks to other pages
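The formula on this slide can be iterated directly. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the three-page `toy_web`, and the fixed iteration count are all illustrative assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to. Assumption:
    every page has at least one outgoing link (no dangling pages).
    """
    pr = {page: 1.0 for page in links}  # initial guess: PR = 1 for every page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # T1 ... Tn: the pages that link to `page`
            incoming = [t for t in links if page in links[t]]
            new_pr[page] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming)
        pr = new_pr
    return pr


# Hypothetical three-page web: A -> B, C; B -> C; C -> A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
total = sum(ranks.values())
```

In this toy graph C collects links from both A and B, so it ends up ranked above B. The scores in `ranks` add up to the number of pages; dividing each by `total` gives values that sum to 1.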

Simple visualization: the simplest variant of PageRank in use [user behavior]

Random surfer

Only one incoming link, yet high PageRank

Damping factor

Lexical Substitution: A crash course

There are different types of managed care systems

PageRank in use: Lexical Substitution

Weights: word similarity

Directed/ undirected: whole other realm

And now, the cool stuff

The math behind PageRank

  • Intuitive correctness

  • Mathematical foundation

  • Stability

  • Complexity of computational scheme

  • Critical role of the parameters involved

  • The distribution of the page score

  • Role of dangling pages

  • How to promote certain vertices (Web pages)

Intuitive correctness

  • Concept of ‘voting’

    • Related to citation in scientific literature

    • More citations indicate a great/important piece of work

  • Random surfer / random walk

  • A page with many links to it must be important

  • A very important page is likely to point to other important pages

Mathematical foundation

  • Most researchers: Markov chains

    • Caveat: Only applicable in absence of dangling nodes

  • Basic idea: authority of a Web page unrelated to its contents [comes from the link structure]

  • Simple representation

  • Vector representation

IN = [1, 1, 1 … 1]’

Transition matrix: ∑(each column) = 1, or 0 for a dangling page
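A small sketch of the vector representation, assuming it takes the form x = d·W·x + (1 − d)·IN with W the column-oriented transition matrix described above. The helper names `transition_matrix` and `step` and the four-page graph are hypothetical, and each page is assumed to link to a given target at most once.

```python
def transition_matrix(links, pages):
    """W[i][j] = 1 / outdegree(j) if page j links to page i, else 0."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    W = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            W[index[target]][index[source]] = 1.0 / len(targets)
    return W


def step(W, x, d=0.85):
    """One application of x <- d * W x + (1 - d) * IN, with IN = [1, ..., 1]'."""
    n = len(x)
    return [d * sum(W[i][j] * x[j] for j in range(n)) + (1 - d) for i in range(n)]


pages = ["A", "B", "C", "D"]
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is dangling
W = transition_matrix(toy_links, pages)

# Columns of linking pages sum to 1; the dangling page D gives a zero column.
column_sums = [sum(W[i][j] for i in range(len(pages))) for j in range(len(pages))]
x1 = step(W, [1.0, 1.0, 1.0, 1.0])
```

Page D has no outgoing links, which is why its column sums to 0 rather than 1.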

Mathematical foundation (2)

Google’s iterative version: converges to a stationary solution

Jacobi algorithm

Alternative computation

||x(t)||_1 = 1 [normalized]
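One way to picture the iterative scheme is as repeated application of x(t+1) = d·W·x(t) + (1 − d)·IN until the vector stops moving. The sketch below assumes that form; the names `jacobi_pagerank` and `normalize`, the tolerance, and the 3×3 matrix are illustrative. Normalization is shown only as a final rescaling to unit 1-norm, which is a simplification of the alternative computation named on the slide.

```python
def jacobi_pagerank(W, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x <- d * W x + (1 - d) * IN until the 1-norm change is below tol."""
    n = len(W)
    x = [1.0] * n
    for t in range(1, max_iter + 1):
        new_x = [d * sum(W[i][j] * x[j] for j in range(n)) + (1 - d) for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x, t


def normalize(x):
    """Rescale so that ||x||_1 = 1."""
    total = sum(abs(v) for v in x)
    return [v / total for v in x]


# Hypothetical column-stochastic matrix for A -> B, C; B -> C; C -> A
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
x, iterations_used = jacobi_pagerank(W)
p = normalize(x)
```

Because each step shrinks the error by at most a factor d in the 1-norm when the columns of W sum to at most 1, the loop is expected to stop well before `max_iter` for d = 0.85.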

Web communities: Energy balance [measure of authority]

More on energy

Even more on energy [community promotion]

  • Split same content into smaller vertices

  • Avoid dangling pages

  • Avoid many outgoing links

Page promotion

  • Treat certain pages as communities

  • Bias certain pages by using a non-uniform distribution in the vector IN

  • Tinker with the connectivity [PageRank is proved to be affected by the regularity of the connection pattern]
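The second promotion technique, a non-uniform IN vector, can be sketched as follows. This is an illustration under assumptions: the function name `biased_pagerank`, the example weights, and the 3×3 matrix are hypothetical, and the biased vector is chosen to have the same total as the uniform one so the two runs are comparable.

```python
def biased_pagerank(W, source, d=0.85, iterations=200):
    """Iterate x <- d * W x + (1 - d) * source, where `source` replaces IN."""
    n = len(W)
    x = list(source)
    for _ in range(iterations):
        x = [d * sum(W[i][j] * x[j] for j in range(n)) + (1 - d) * source[i]
             for i in range(n)]
    return x


# Hypothetical column-stochastic matrix for A -> B, C; B -> C; C -> A
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

uniform = biased_pagerank(W, [1.0, 1.0, 1.0])    # plain IN = [1, 1, 1]'
promoted = biased_pagerank(W, [0.5, 2.0, 0.5])   # extra weight on page B (index 1)
```

Page B scores higher in `promoted` than in `uniform` even though no link was changed; only the source vector differs.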

Computation of PageRank

  • PageRank can be computed on a graph changing over time

    • Practical interest [Web is alive]

  • An optimal algorithm exists for computing PageRank

    • Practical applications: Search engines, PageRank on billions of pages – efficiency!

    • O(|H| log 1/ε), where |H| is the number of hyperlinks and ε is the required accuracy

    • NOT dependent on the connectivity or other dimensions

    • Ideal computation: stops when the ranking of vertices does not change between two consecutive iterations [convergence]
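The ranking-based stopping rule can be sketched like this. The names `rank_order` and `pagerank_until_stable` are hypothetical, and the `patience` parameter is an added assumption: it asks for the ordering to stay fixed over several consecutive iterations, since the ordering can briefly repeat before it has settled.

```python
def rank_order(x):
    """Indices of the vertices sorted from highest to lowest score."""
    return sorted(range(len(x)), key=lambda i: -x[i])


def pagerank_until_stable(W, d=0.85, patience=3, max_iter=1000):
    """Iterate until the ordering is unchanged for `patience` consecutive steps."""
    n = len(W)
    x = [1.0] * n
    stable = 0
    for t in range(1, max_iter + 1):
        new_x = [d * sum(W[i][j] * x[j] for j in range(n)) + (1 - d) for i in range(n)]
        # Count consecutive iterations whose ranking matches the previous one.
        stable = stable + 1 if rank_order(new_x) == rank_order(x) else 0
        x = new_x
        if stable >= patience:
            break
    return rank_order(x), t


# Hypothetical column-stochastic matrix for A -> B, C; B -> C; C -> A
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
order, iterations_used = pagerank_until_stable(W)
```

Stopping on a stable ordering usually needs fewer iterations than waiting for the scores themselves to converge to many digits, which matters when only the ranking is used.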

The Markov model from the Web

  • The PageRank vector is guaranteed to exist and be unique only when the Markov chain is irreducible

  • By nature, the Web is non-bipartite, sparse, and produces a reducible Markov chain

  • The Web hyperlink matrix is forced to be

    • Stochastic [non-negative entries, all columns sum up to 1]

      • Remove dangling nodes, or replace the corresponding all-zero rows/columns with a uniform vector, usually (1/n)e^T

      • Introduce personalization vector

    • Primitive

      • Non-negative

      • At least one positive element on the main diagonal

      • Irreducible

More on the Markov structure

  • A convex combination of the original stochastic matrix and a stochastic perturbation matrix

    • Produces a stochastic, irreducible matrix

    • The PageRank vector is guaranteed to exist for this matrix

  • Every node directly connected to every other node, all transition probabilities nonzero

    • Irreducible Markov chain, will converge
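The two adjustments, the stochastic fix for dangling nodes and the convex combination with a perturbation matrix, can be sketched together. Assumptions in this sketch: a row-oriented convention (each row holds one page's outgoing probabilities, matching the (1/n)e^T row replacement; transpose for the column convention), a uniform perturbation matrix E = (1/n)ee^T, and the hypothetical names `google_matrix` and `alpha`.

```python
def google_matrix(H, alpha=0.85):
    """Return G = alpha * S + (1 - alpha) * E for a row-oriented hyperlink matrix H."""
    n = len(H)
    uniform = [1.0 / n] * n
    # Stochastic fix: replace each all-zero (dangling) row with (1/n) e^T.
    S = [row[:] if sum(row) > 0 else uniform[:] for row in H]
    # Convex combination with the uniform perturbation matrix E = (1/n) e e^T.
    return [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]


# Hypothetical three-page web: page 0 -> 1, 2; page 1 is dangling; page 2 -> 0
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
G = google_matrix(H)
row_sums = [sum(row) for row in G]
```

Every entry of `G` is strictly positive and every row sums to 1, so each page can reach every other page in one step, which is the property the slide relies on for irreducibility.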

There’s more to PageRank

  • Computation

    • Power method

      • Notoriously slow in general

      • Still the method of choice for PageRank

      • Requires no computation of intermediate matrices

      • Converges quickly in practice

    • Linear systems method

  • The damping factor [usually 0.85]

    • Greater value: more iterations required

    • Greater value also gives ‘truer’ PageRanks [less weight on the artificial perturbation]

  • Dangling pages

  • Storage issues
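The effect of the damping factor on the iteration count can be checked on a toy example. This is a rough sketch: the function name `iterations_to_converge`, the tolerance, and the three-page matrix are assumptions, and the counts on a real Web graph will differ.

```python
def iterations_to_converge(W, d, tol=1e-8, max_iter=100000):
    """Count iterations of x <- d * W x + (1 - d) * IN until the 1-norm change < tol."""
    n = len(W)
    x = [1.0] * n
    for t in range(1, max_iter + 1):
        new_x = [d * sum(W[i][j] * x[j] for j in range(n)) + (1 - d) for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return t


# Hypothetical column-stochastic matrix for A -> B, C; B -> C; C -> A
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
counts = {d: iterations_to_converge(W, d) for d in (0.5, 0.85, 0.99)}
```

The counts grow as d approaches 1, because each iteration then removes a smaller fraction of the remaining error.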

The end [for today]

Thanks for listening!
