Ranking the Web

Gianna M. Del Corso, Antonio Gullí

Dipartimento Informatica, Pisa

IIT-CNR, Pisa


Overview

  • Web Statistics

  • Some Web Ranking Algorithms

  • Zooming on PageRank

    • Personalization

    • Fast PageRank

  • Fun Results and Web Comparison

    • Online demo

Fun 04 G.M. Del Corso, A. Gulli


Web Statistics

  • January 2004: 151 million active Internet users in the U.S.

  • 76% used a search engine at least once a month.

  • Time spent searching: about 40 minutes.

[Nielsen//NetRatings]

Share Of Searches: February 2004

  • February 2004

  • 1.5 million U.S. web surfers

[comScore Media Metrix]

Search Referrals

  • March 2004

  • 25 million web pages

[WebSideStory]

Censored

[google-watch.org]

A Cash Cow Business

  • Jupiter Media Metrix estimates that paid advertising will reach as much as $4 billion by 2005

  • Business expected to grow by 20% over the next five years

Google’s numbers

IPO To Happen, Files For Public Offering

$2,718,281,828

For those not blessed with a PhD and a job at Google, these are the leading digits of Euler's number, e ≈ 2.718281828…

[Google’s IPO Sec Filing]

Web Ranking

  • A link from page p to page q means that the author of p gives a vote to q

[Figure: page p pointing to page q]

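HITS

  • The scores can be obtained by an eigenvector computation:

    $$a = W^T h, \qquad h = W a$$

    so that a and h are principal eigenvectors of $W^T W$ and $W W^T$, where

    a: vector of authority scores

    h: vector of hub scores

    W: adjacency matrix, with $w_{i,j} = 1$ if page i points to page j.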
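The mutual reinforcement between a and h can be sketched as a power iteration. This is only an illustrative sketch: the function name `hits`, the fixed iteration count, the Euclidean normalization, and the four-page toy graph are assumptions made here, not details taken from the slides.

```python
import numpy as np

def hits(W, iterations=100):
    """Power iteration for HITS on a 0/1 adjacency matrix W,
    where W[i, j] = 1 if page i points to page j."""
    n = W.shape[0]
    a = np.ones(n)  # authority scores
    h = np.ones(n)  # hub scores
    for _ in range(iterations):
        a = W.T @ h             # authority: sum of hub scores of the pages linking in
        h = W @ a               # hub: sum of authority scores of the pages linked to
        a /= np.linalg.norm(a)  # rescale so the scores do not blow up
        h /= np.linalg.norm(h)
    return a, h

# Hypothetical toy graph: pages 0 and 1 both point to pages 2 and 3.
W = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
], dtype=float)

authority, hub = hits(W)
print("authority:", authority)
print("hub:      ", hub)
```

On this toy graph pages 2 and 3 share all the authority weight and pages 0 and 1 share all the hub weight.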

HITS

[Figure: authority and hubness weights, on an axis running from 1 to 15]

SALSA

  • Two separate random walks

    • Hub walk

    • Authority walk

HITS vs SALSA

  • $H = W_r W_c^T$, $A = W_c^T W_r$

  • W is the adjacency matrix of G

  • $W_r$ is W with each nonzero row divided by the sum of its entries

  • $W_c$ is W with each nonzero column divided by the sum of its entries

  • The stationary distributions are proportional to the number of in-links (authority walk) and out-links (hub walk)!
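The last point can be checked numerically. The sketch below builds the two SALSA matrices from the definitions on this slide and verifies that the normalized in-degree and out-degree vectors are left unchanged by one step of the authority and hub walks. The helper name `salsa_matrices` and the four-page toy graph are assumptions chosen for illustration.

```python
import numpy as np

def salsa_matrices(W):
    """Return the SALSA hub matrix H = W_r W_c^T and authority matrix A = W_c^T W_r."""
    out_deg = W.sum(axis=1, keepdims=True)  # row sums, shape (n, 1)
    in_deg = W.sum(axis=0, keepdims=True)   # column sums, shape (1, n)
    # Divide each nonzero row / column by its sum; all-zero rows / columns stay zero.
    W_r = np.divide(W, out_deg, out=np.zeros_like(W), where=out_deg > 0)
    W_c = np.divide(W, in_deg, out=np.zeros_like(W), where=in_deg > 0)
    return W_r @ W_c.T, W_c.T @ W_r

# Hypothetical toy graph on four pages.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

H, A = salsa_matrices(W)

# Candidate stationary distributions: normalized in-degrees and out-degrees.
pi_authority = W.sum(axis=0) / W.sum()
pi_hub = W.sum(axis=1) / W.sum()

# One step of each walk should leave them unchanged.
print(np.allclose(pi_authority @ A, pi_authority))
print(np.allclose(pi_hub @ H, pi_hub))
```

Unlike HITS, no iteration is needed here: on a connected graph the SALSA scores can be read off the degrees directly.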

Google’s PageRank

  • “Random Surfer Model”: the rank of a page equals the probability that the surfer is sitting on that page

    $$r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$$

  • where

    B(i): set of pages linking to i.

    N(j): number of outgoing links from j.

Google’s PageRank

  • Dangling nodes, i.e. Web pages with no outlinks

  • P: the Web graph matrix

  • Cyclic paths

    • The surfer gets bored and jumps to another place

  • v is a personalization vector, α is the damping factor
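The ingredients above can be put together in a small power iteration. This is a sketch under stated assumptions, not the authors' code: it assumes the common convention that a surfer on a dangling page jumps according to v, and that with probability 1 − α a bored surfer also jumps according to v. The function name `pagerank`, the stopping tolerance, and the toy graph are illustrative choices.

```python
import numpy as np

def pagerank(W, alpha=0.85, v=None, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a 0/1 adjacency matrix W (W[i, j] = 1 if i links to j)."""
    n = W.shape[0]
    v = np.full(n, 1.0 / n) if v is None else np.asarray(v, dtype=float)
    out_deg = W.sum(axis=1)
    dangling = out_deg == 0
    # P: row-normalized Web graph matrix; rows of dangling pages stay zero.
    P = np.divide(W, out_deg[:, None], out=np.zeros_like(W, dtype=float),
                  where=out_deg[:, None] > 0)
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability alpha follow a link (dangling pages jump according to v),
        # with probability 1 - alpha get bored and jump according to v.
        x_new = alpha * (P.T @ x + x[dangling].sum() * v) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 2 -> 3; page 3 is dangling.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

uniform = pagerank(W)                            # uniform personalization vector
biased = pagerank(W, v=[0.1, 0.1, 0.1, 0.7])     # jumps favor page 3
print("uniform:", uniform)
print("biased: ", biased)
```

Changing v shifts rank toward the pages the jumps favor, which is the mechanism behind personalized (biased) ranks.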

Personalized PageRank

Biased Rank

[Figure: panels a and b]

[Haveliwala 02]

Eurekster

Create and join SearchGroups to focus your search by area of interest

Fast PageRank

PageRank

  • Standard algorithm for computing PageRank: the Power Method applied to the Web graph matrix, corrected for dangling nodes and random jumps

  • Takes several days due to the size of the Web graph

Why do we need a fast link-based rank?

“…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update link-based ranking metrics very often…”

[ Cho et al., 04]

Accelerating PageRank

  • Web Graph Compression to fit in internal memory [Boldi et al., 04]

  • Efficient External memory implementation [Haveliwala, 99; Chen et al., 02]

  • Mathematical approaches

  • Combination of the above strategies

Accelerating PageRank

Adaptive Power method:

C = set of pages converged, N = set of pages not yet converged

Run the Power Method (PM) only on the pages in N, detecting converged components along the way. The paper describes many other adaptive strategies.

Slow-converging pages generally have high PageRank

Speedup: 22% time reduction, precision: $10^{-3}$

Dataset: 280,000 nodes

[ Kamvar et al., 03 ]

Accelerating PageRank

Extrapolation strategies:

where the $u_i$ are eigenvectors

Periodically subtract estimates of the non-principal eigenvectors from $x^{(k)}$ … much improved over PM as $\alpha \to 1$

Speedup: 69% time reduction, precision: $10^{-3}$

Dataset: 80 million nodes

[ Kamvar et al., 03 ]

Accelerating PageRank

Block Structure

  • Reorder the web pages in lexicographical order.

  • Compute the “local rank”

  • Create a new starting vector

[Figure: block structure of the reordered pages, with blocks labeled Berkeley and Stanford]

Speedup: 75% time reduction, precision: $10^{-3}$

Dataset: 70 million nodes

[ Kamvar et al., 03 ]

Accelerating PageRank

Sparse Linear Permutation

  • Viewing PageRank as a linear system problem

  • Transforming it into a sparse formulation

  • Exploiting reducibility via permutations

  • Comparing different scalar and block solvers

Speedup: 89% time reduction, precision: $10^{-7}$

Dataset: 24M nodes

[ Del Corso et al., 04 ]
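One way to read the first two points: the PageRank vector is proportional to the solution y of $(I - \alpha R^T) y = v$, where R is the row-normalized link matrix with zero rows for dangling pages, so the system matrix is as sparse as the Web graph itself. The sketch below illustrates only that reformulation on a toy graph. The names `R` and `y`, the dense `np.linalg.solve` call, and the graph are assumptions made here; the permutation step and the scalar and block solvers compared in the talk are not shown.

```python
import numpy as np

def pagerank_linear_system(W, alpha=0.85, v=None):
    """PageRank via the linear system (I - alpha * R^T) y = v, followed by normalization."""
    n = W.shape[0]
    v = np.full(n, 1.0 / n) if v is None else np.asarray(v, dtype=float)
    out_deg = W.sum(axis=1, keepdims=True)
    # R: row-normalized link matrix; dangling rows are left at zero (no dense correction).
    R = np.divide(W, out_deg, out=np.zeros_like(W, dtype=float), where=out_deg > 0)
    # Dense solve for illustration only; a real Web graph needs a sparse iterative solver.
    y = np.linalg.solve(np.eye(n) - alpha * R.T, v)
    return y / y.sum()

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 2 -> 3; page 3 is dangling.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

x = pagerank_linear_system(W)

# Cross-check against the dense random-surfer matrix with uniform jumps:
# dangling rows replaced by 1/n, then damped with alpha = 0.85.
n = W.shape[0]
deg = W.sum(axis=1, keepdims=True)
P_bar = np.where(deg > 0, W / np.maximum(deg, 1), 1.0 / n)
G = 0.85 * P_bar + 0.15 / n
print("residual:", np.abs(G.T @ x - x).max())
```

The residual printed at the end measures how far x is from being a fixed point of the usual random-surfer iteration.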

“Rich Get Richer” phenomenon

“.. From our experimental data, we could observe that the top 20% of the pages with the highest number of incoming links obtained 70% of the new links after 7 months, while the bottom 60% of the pages obtained virtually no new incoming links during that period…”

[ Cho et al., 04 ]

Web Spamming

Spamming PageRank

An Optimal Link Structure

Spam Farm (SF), rules of thumb

  • Use all available own pages in the SF, ↑ $r_{static}$.

  • Accumulate the maximum number of inlinks to the SF, ↑ $r_{in}$.

  • Suppress links pointing outside the SF, $r_{out} = 0$.

  • Avoid dangling nodes within the SF: every page (including the target t) has some outlinks.

[Garcia-Molina et al., 04]
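A toy experiment can make these rules of thumb concrete. The sketch below assumes a deliberately simplified setting that is not taken from the cited report: m ordinary pages arranged in a cycle, a target page t, and k boosting pages that link only to t while t links back to all of them, with no links between the farm and the rest of the graph (so $r_{in} = 0$ and $r_{out} = 0$). The helper names, graph sizes, and uniform jump vector are all hypothetical.

```python
import numpy as np

def pagerank(W, alpha=0.85, tol=1e-12, max_iter=2000):
    """Plain power method with uniform jumps; assumes every page has at least one outlink."""
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_new = alpha * (P.T @ x) + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

def web_with_spam_farm(m, k):
    """Build m ordinary pages in a cycle, then the target t, then k boosting pages."""
    n = m + 1 + k
    W = np.zeros((n, n))
    for i in range(m):
        W[i, (i + 1) % m] = 1.0   # ordinary pages: a simple cycle
    t = m
    W[t + 1:, t] = 1.0            # every boosting page links only to the target
    W[t, t + 1:] = 1.0            # the target links back: no dangling nodes in the farm
    return W, t

W_small, t_small = web_with_spam_farm(m=10, k=2)
W_large, t_large = web_with_spam_farm(m=10, k=10)
rank_small = pagerank(W_small)[t_small]
rank_large = pagerank(W_large)[t_large]
print("target rank with  2 boosting pages:", rank_small)
print("target rank with 10 boosting pages:", rank_large)
```

In this closed toy farm the target's score works out to $(1 + \alpha k) / (n (1 + \alpha))$ with $n = m + 1 + k$, so it grows with the number of boosting pages, in line with the first rule of thumb.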

Spamming PageRank

Setting up sophisticated link structures within a spam farm does not improve the ranking of the target page.

Spamming HITS

  • Easy to spam

  • Create a new page p pointing to many authority pages (e.g., Yahoo, Google, etc.): p becomes a good hub page

    … then, on p, add a link to your home page

Fun Results (aka “Google Bombing”)

Fun Search Results and Demo

Fun Results (aka “Google Bombing”)

  • Some recent (as of 2004) and popular examples:

    • weapons of mass destruction - hoax page imitating an Internet Explorer error, saying “weapons of mass destruction cannot be found”.

    • great president - biography of George W. Bush.

    • litigious bastards - homepage of the SCO Group.

    • Buffone - Facce da culo - Discorsi Folli – Silvio Berlusconi

    • out of touch executives – Google’s own corporate info page

    • Waffle – John Kerry’s site (blog spamming campaign)

[ wikipedia ]

Will Google still dominate search in 2005?

  • Every three years, a new search engine takes the lead and has its 15 minutes of fame.

  • A timeline is at http://www.investors.com/

  • Open Source alternative [ Nutch ]

Comparing Ranks (Online Demo)

Bibliography

  • K. Bharat, M. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment, SIGIR Conference, 1998

  • P. Boldi, S. Vigna: The WebGraph Framework I: Compression Techniques, to appear in Proc. of the Thirteenth International World-Wide Web Conference

  • S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems vol. 30 num 1-7, 1998

  • S Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW Conference, 1998

  • M. Bianchini, M. Gori, F. Scarselli, "Inside PageRank". Technical report DII 1/03, Department of Information Engineering, University of Siena, 2001.

  • Y.Y. Chen, Q. Gan, T. Suel: I/O-Efficient Techniques for Computing Pagerank, Proceedings of the Eleventh International Conference on Information and Knowledge Management

  • J. Cho, S. Roy: Impact of Web Search Engines on Page Popularity In Proceedings of the World-Wide Web Conference (WWW), May 2004.

  • G.M. Del Corso, A. Gulli, F. Romani: Fast PageRank Computation Via a Sparse Linear System, IIT-CNR TechReport 2004

  • C.P.C Lee, G.H. Golub, S.A. Zenios: A Fast two stage algorithm for computing PageRank, Stanford Tech-Report 2004

  • R. Lempel, S. Moran: SALSA: The Stochastic Approach for Link-Structure Analysis, ACM Transactions on Information Systems Vol. 19 No.2, 2001

  • T. H. Haveliwala: Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search, IEEE Trans. on Knowledge and Data Eng, 2003

  • T. H. Haveliwala, Sepandar D. Kamvar, and Glen Jeh, "An Analytical Comparison of Approaches to Personalizing PageRank", Preprint, June, 2003

  • S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Extrapolation Methods for Accelerating PageRank Computations, WWW Conf., 2003

  • S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Exploiting the Block Structure of the Web for Computing PageRank, Stanford Tech.Rep, 2003

  • S.D. Kamvar, T. H. Haveliwala, and G. H. Golub, "Adaptive Methods for the Computation of PageRank", Linear Algebra and its Applications, Special Issue on the Numerical Solution of Markov Chains, Nov., 2003.

  • J. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM Vol.46 No.5, 1999

  • A. Ntoulas, J. Cho, C. Olston "What's New on the Web? The Evolution of the Web from a Search Engine Perspective." World-Wide Web Conference, May 2004.

  • Z. Gyöngyi, H. Garcia-Molina: Web Spam Taxonomy, Technical Report, Stanford University, 2004

Broder’s Altavista

Patented May 2003

  • A, Attractor Matrix: sites externally endorsed

  • N, Non Attractor Matrix: sites deemed to be avoided

  • Use a linear combination of A, N and other matrices

  • Suggests also using non-principal eigenvectors

Accelerating PageRank

Two-stage algorithm

The Markov Chain associated with P is lumpable

Combine D nodes into a block. P1 is the transition matrix

Compute the stationary distribution of P1

Combine ND nodes into a block. P2 is the transition matrix

Compute the stationary distribution of P2

Concatenate the results

D are the dangling nodes, ND the non dangling nodes

Speedup: 80% time reduction, precision: $10^{-9}$

Dataset: 451,000 nodes

[ Lee et al., 04 ]

Finally…the perfect search engine?

Sergey Brin: “It would be the mind of God. Larry says it would know exactly what you want and give you back exactly what you need.”

Chakrabarti: “The web grew exponentially from almost zero to 800 million pages between 1991 and 1999. In comparison, it took 3.5 million years for the human brain to grow linearly from 400 to 1400 cubic centimeters. How do we work with the web without getting overwhelmed? We look for relevance and quality. Can we design programs to recognize these properties?”
