
The Anatomy of a Large-Scale Hypertextual Web Search Engine

A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith. November 2005.



Presentation Transcript


  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine • A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith • November 2005

  2. Agenda • Introduction • Overview of Google • PageRank • Motivation & Description • Example • Issues & Comparison • Further Work • Application • Conclusions

  3. Introduction • About the paper • Brin & Page, 1998, Stanford University • Details a prototype search engine, Google • Covers both architecture and algorithms • Cited in web metrics with relation to significance • Also relevant to Web Graph Properties • PageRank • Covered in a separate paper from Brin & Page • Is the primary metric used in the paper

  4. Overview : What is Google? • Web search engine • Tackles issues faced by previous crawlers of scalability and manipulation • Academic • Built on strong understanding of web metrics • Use of hyperlink structures • Transparent • Initially released into the public domain • Support for informatics research

  5. Overview : Architecture • [Diagram of components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Barrels, Sorter, Lexicon, Doc Index, Checksums, PageRank, Searcher]

  6. Overview : Google Architecture (Explanation for handout only.) • URL Server: Finds pages to surf. • Crawler: Downloads pages and places them in the repository. • Store Server: Document compression. • Repository: Cached copies of most web pages. • Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file. • URL Resolver: Converts relative URLs into absolute URLs and creates the Links file. • Links file: Ordered pairs of document IDs where a hyperlink exists between them. • Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon. • Lexicon: Dictionary of all possible search keywords. • Doc Index: Maps document identifier codes to URLs. • PageRank: An influential web metric used to sort Google’s matches. • Searcher: Performs searches!

  7. Overview : Forward Index • Indexer identifies keyword ‘hits’ in a document • Maps document (page) IDs to word IDs in the Lexicon • Word IDs are partially sorted into barrels • There are 64 barrels • Word IDs within a barrel are unsorted • An individual document may be spread over several barrels • However, not useful for search!

  8. Overview : Inverted Index • Want to know in which documents a keyword occurs • Need the ‘Inverted Index’ • Sorts the forward index into its inverted form • Function performed by the ‘Sorter’
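
The relationship between the two indexes on slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration: the function names, the toy documents and the whitespace tokenizer are assumptions, and the barrels and compact hit encoding of the real system are left out.

```python
from collections import defaultdict

def build_forward_index(docs, lexicon):
    """Forward index: doc ID -> list of (word ID, position) hits.

    `lexicon` maps each word to a word ID and is filled in as new words appear.
    """
    forward = {}
    for doc_id, text in docs.items():
        hits = []
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, position))
        forward[doc_id] = hits
    return forward

def invert(forward):
    """Inverted index: word ID -> {doc ID: [positions]} (the 'Sorter' step)."""
    inverted = defaultdict(dict)
    for doc_id, hits in forward.items():
        for word_id, position in hits:
            inverted[word_id].setdefault(doc_id, []).append(position)
    return dict(inverted)

# Hypothetical two-document collection.
docs = {1: "web search engine", 2: "search the web graph"}
lexicon = {}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)

# Which documents contain "search", and where?
print(inverted[lexicon["search"]])
```

The forward index answers "which words are in this document?", while a search needs the opposite question answered, which is why the inverted form is the one the Searcher reads.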

  9. Overview : Ranking System • Proximity of keyword ‘hits’ • This is the sum of the distances between them • Hits have ‘types’ • Types: body text, heading text, anchor text, URL, … • Relative font size factor used • Count how many hits occur of each type and range of proximity values • Apply a function to each type-proximity count • These form a type-proximity vector, C

  10. Overview : Ranking System (2) • [Plot: f(x) against hit count, x] • V = C·W (dot product) is computed. • W is the importance associated with each type-proximity class. • Combine V with the PageRank score • Effect of increasing hits declines • Prevents large scale manipulation
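
The scoring steps on slides 9 and 10 can be sketched as follows. The slides give neither the weights W, the shape of f(x), nor how V is combined with PageRank, so the weight table, the logarithmic `taper` function and the linear blend with `alpha` below are all assumptions made for illustration.

```python
import math

# Hypothetical weights W, one per (hit type, proximity bin) class.
# Proximity bin 0 stands for "query words close together", bin 1 for "further apart".
WEIGHTS = {
    ("anchor", 0): 4.0, ("anchor", 1): 3.0,
    ("heading", 0): 3.0, ("heading", 1): 2.0,
    ("body", 0): 2.0, ("body", 1): 1.0,
}

def taper(count):
    """Assumed f(x): grows with the hit count, but ever more slowly."""
    return math.log1p(count)

def ir_score(class_counts, weights=WEIGHTS):
    """V = C . W, where C holds the tapered count for each type-proximity class."""
    return sum(taper(class_counts.get(cls, 0)) * w for cls, w in weights.items())

def final_score(class_counts, pagerank, alpha=0.7):
    """Assumed linear blend of the IR score V with the page's PageRank."""
    return alpha * ir_score(class_counts) + (1 - alpha) * pagerank

# One body hit versus a page stuffed with 100 body hits.
print(ir_score({("body", 1): 1}), ir_score({("body", 1): 100}))
```

Because `taper` flattens out, repeating a keyword a hundred times raises the score by far less than a factor of a hundred, which is the manipulation-resistance property the slide describes.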

  11. PageRank : Motivation • Academic Citation Analysis* attempted, but… • Web has no formal quality control or peer review • Possible to inflate citation counts artificially • Web pages vary more than academic papers • Consider: • One link from the University’s main page, or one link from Yahoo’s main page… • Which citation should carry the higher weight ? *Also known as bibliometrics

  12. PageRank : Description • Informal Definition: • “A page has a high rank if the sum of the ranks of its backlinks is high” • Handles ‘Yahoo’ case on previous slide • Intuitive Definition: • Corresponds to the Random Surfer Model • User keeps clicking on links ‘linearly’ then gets bored and restarts at a random location • Now for the maths…

  13. PageRank : Description (2) • Formal Definition: • c is a ‘damping’ factor, set to 0.85 • Nv is the number of out-links from page v • Bu is the set of backlinks of the current page u, i.e. the pages that link to u • cE(u) corresponds to the surfer getting ‘bored’
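
The equation on this slide was an image and does not appear in the transcript. A form consistent with the symbols listed on the slide, and with the definition given by Page et al. (1998), would be the following; it should be read as a reconstruction rather than the slide's exact content.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)
```

Brin & Page (1998) state the same idea as $PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)}\right)$, where $T_1, \dots, T_n$ are the pages linking to $A$, $C(T)$ is the out-degree of $T$, and the damping factor $d$ is usually set to 0.85.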

  14. PageRank : Example • [Figure: example network with nodes A, B, C, D and E] • Considering an example network • Calculating A, where c = damping factor, N = out-degree and R = PageRank

  15. PageRank : Example (2) • [Figure: example network with nodes A, B, C, D and E] • Initially set all PageRank to 1 • First Iteration:

  16. PageRank : Example (3) • Repeat process for B, C, D and E • Feed computed values into next iteration
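
The iterative procedure of slides 14 to 16 can be sketched as below. The edges of the slides' five-page network are not preserved in the transcript, so the graph here is hypothetical. The update rule is the one from Brin & Page (1998), PR(u) = (1 - c) + c · Σ PR(v)/N_v, which matches the slides' starting value of 1 for every page; the function name and iteration count are assumptions.

```python
def pagerank(links, c=0.85, iterations=100):
    """Iterate PR(u) = (1 - c) + c * sum(PR(v) / N_v for v in backlinks of u).

    `links` maps every page to the list of pages it links to; each link target
    must also appear as a key.
    """
    pages = list(links)
    ranks = {page: 1.0 for page in pages}  # initially set all PageRank to 1

    # Invert the out-link lists to get the backlink set B_u of every page.
    backlinks = {page: [] for page in pages}
    for v, targets in links.items():
        for u in targets:
            backlinks[u].append(v)

    for _ in range(iterations):
        # Every new value is computed from the previous iteration's ranks.
        ranks = {
            u: (1 - c) + c * sum(ranks[v] / len(links[v]) for v in backlinks[u])
            for u in pages
        }
    return ranks

# Hypothetical network; NOT the one drawn on the slides.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["A", "D"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page E has no backlinks, so its rank settles at the minimum value 1 - c, while C, which collects links from three pages, ends up ranked highest. Because every page has at least one out-link, the ranks always sum to the number of pages.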

  17. PageRank : Analysis • Number of iterations needed to converge grows roughly with log n • Constrained by the time to build a full-text index more than anything • Rank ‘Sinks’ • Caused by two pages that point to each other but not to any other pages: rank accumulates • Solved by random surfer model • Manipulation – ‘Google Bombing’ • French Military ‘Victories’ links to ‘Defeats’ • ‘Miserable Failure’ links to George Bush biography

  18. PageRank : Comparison • Web Graph Properties • Uses graph of the entire web: depends on full crawl • More sophisticated than simply summing in/out-degrees • Web Page Significance • Uses Boolean Spread Activation – match all words • Enhanced citation analysis – building on work of Kleinberg, Egghe & Rousseau • Doesn’t suffer from Tightly Knit Communities effect of Kleinberg’s Hubs & Authorities

  19. PageRank : Further Work • Personalised PageRank, Haveliwala, 1999 • In-memory, block-oriented algorithm • PageRank can be computed in an hour on a 450 MHz Pentium III using less than 100 MB of main memory • Compute PageRank on the client-side • Use local information: bookmarks, searches, history • Provide the link structure of the web on a DVD • 11/11/05, “Personalized Search” released

  20. PageRank : Further Work (2) • Topic Sensitive PageRank, Haveliwala, 2002 • Improve Google by giving weight to the informational relationship between sites • A) Uniform Results • Similar to ‘current’ Google but with topics • B) Personalised to a particular user • Based on previous searches and users’ surfing habits

  21. Applications : Google • Google Inc. • Largest search engine • Technologies utilised by others (e.g. Yahoo!) • Biggest ever technology IPO, 2004 • Redefining search • Set a trend for other search providers • Raised importance of quality web search results • Combining information retrieval methods • Business model based on advertising • Potential area for conflict • Over 100 factors now influence results

  22. Applications : PageRank • Back-link prediction • Desire for optimal web crawling strategy • Better indicator than citation counts! • Improving user navigation • ‘The PageRank Proxy’ • Providing PageRank information with links • Establishing trust • Wealth of authors on the web, who to trust? • Use PageRank to rate trust

  23. Applications : The Future • Internal Development • Project no longer in academic realm • Lacks the transparency initially intended • Role of PageRank unclear • Likely focus on extensions and results tuning • External Development • APIs • Allowing innovative use of Google technologies • Open Source Code • Focused on developing infrastructure

  24. Conclusions • Academic Background • Success from strong academic understanding • Raised profile of informatics and search • Good platform for future research • Success as a failure • Intention for transparency and use in academia • Commercial success has removed transparency • Potentially bad for further research in this area

  25. Summary • We have seen: • The architecture used by Google • PageRank as a web metric • Strengths and potential manipulations • The commercial success of Google • Applications • Potential areas of future research

  26. References • Work by Brin & Page (now at Google) • Brin, S., Page, L. (1998), ‘The anatomy of a large-scale hypertextual Web search engine’, Computer Networks and ISDN Systems, 30(1-7):107-117. • Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), ‘The PageRank Citation Ranking: Bringing Order to the Web’, Stanford Digital Library Technologies Project. • More papers at: http://www.google.com on many aspects of web metrics and search in general • PageRank • http://www.iprcom.com/papers/pagerank/ • Take a look at the example at: http://www.dcs.warwick.ac.uk/~csucbu • http://en.wikipedia.org/wiki/Google_bomb

  27. References (2) • Further Developments • Haveliwala, T. H. (1999), ‘Efficient computation of PageRank’, Technical report, Stanford University, Stanford, CA. • Haveliwala, T. H. (2002), ‘Topic-sensitive PageRank’, in Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002. • Commercial Aspect • http://money.cnn.com/2004/04/29/technology/google/ • http://www.google.com/corporate/history.html • Web Metrics • Dhyani, D., Ng, W. K. and Bhowmick, S. S. (2002), ‘A survey of web metrics’, ACM Computing Surveys, 34(4):469-503.
