Evolution of Web from a Search Engine Perspective

Presentation Transcript


  1. Evolution of Web from a Search Engine Perspective
     Saket Singam, sks2141@columbia.edu

  2. Introduction
     • Large and diverse growth of the Web => search engines are becoming the “killer application”
     • Search engines typically “crawl” web pages in advance
     • Discussion
       1) What’s new on the Web?
          • New pages are created at a rate of 8% per week
          • Only 20% of Web pages are still accessible after 1 year
          • New pages borrow content from existing pages: 62% of the content in these pages is new, and after 1 year 50% of the content on the Web is new
       2) How much change?
          • Once a page is created, it is likely to go through either a minor change or no change
       3) Can we predict future changes?
          • Frequency of change
          • Degree of change

  3. Experimental Setup
     • Selection of sites
       • “Representative” as well as “interesting” samples of the Web
       • About 5 top-ranked pages from a subset of the topical categories of the Google Directory
     • Download of pages (over almost a year)
       • Pages from 154 “popular” Web sites
       • Downloaded weekly in breadth-first order, starting from the root page of each Web site, until all reachable pages or a maximum of 200,000 pages had been downloaded
       • Total number of pages per weekly download = 3-5 million (avg 4.4 million)
       • Size = 65 GB per week before compression
       • Total of 3.3 TB of Web history data and 4 TB of derived data (links, shingles)
     • Table: “Fraction of pages included in this experiment”
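
The weekly download described on this slide can be illustrated with a small breadth-first traversal. This is only a sketch under stated assumptions, not the crawler used in the study: the site is a hypothetical in-memory link graph (`site`), `get_links` is a hypothetical callback standing in for fetching a page and extracting its links, and `max_pages` plays the role of the 200,000-page cap.

```python
from collections import deque


def crawl_breadth_first(root, get_links, max_pages=200_000):
    """Return URLs in breadth-first order from root, stopping at max_pages."""
    seen = {root}            # URLs already queued, so each is visited once
    order = []               # URLs in the order they were "downloaded"
    queue = deque([root])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical toy site: each URL maps to the links found on that page.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}

# Cap the crawl at 4 pages to show the cutoff taking effect.
pages = crawl_breadth_first("/", lambda u: site.get(u, []), max_pages=4)
print(pages)
```

Breadth-first order means pages close to the root are downloaded first, so when the cap is reached the pages left out are the ones deepest in the site.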

  4. What’s New on the Web? – Pages, Content and Links
     • Weekly birth rate of pages
       • How many new pages are created per week?
       • Page identity: the URL of the page
       • Average weekly birth rate is 8%
       • Once every month, the number of new pages is higher than in the previous week
     • Birth, death and replacement
       • How many pages are created, disappear and are replaced?
       • Crawling in slow mode over a period of 39 weeks
       • 20% survival rate of web pages
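
As a rough illustration of the two measurements on this slide, the sketch below computes a weekly birth rate and a survival rate from weekly snapshots of URL sets. The snapshots are made-up toy data, and the choice of denominators (current snapshot for births, first snapshot for survival) is an assumption, not necessarily the definition used in the study.

```python
def weekly_birth_rate(previous, current):
    """Fraction of URLs in the current snapshot that were absent a week earlier."""
    if not current:
        return 0.0
    return len(current - previous) / len(current)


def survival_rate(first, later):
    """Fraction of URLs from the first snapshot that are still present later."""
    if not first:
        return 0.0
    return len(first & later) / len(first)


# Hypothetical weekly snapshots; a page is identified by its URL.
snapshots = [
    {"/", "/a", "/b", "/c", "/d"},
    {"/", "/a", "/b", "/c", "/e"},
    {"/", "/a", "/e", "/f", "/g"},
]

birth_rates = [weekly_birth_rate(p, c) for p, c in zip(snapshots, snapshots[1:])]
survival = [survival_rate(snapshots[0], s) for s in snapshots]

print(birth_rates)  # one value per week after the first
print(survival)     # one value per snapshot, starting at 1.0
```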

  5. Creation of New Content
     • How much new content is present?
       • Shingling technique used
       • w-shingle: a contiguous ordered subsequence of “w” words
       • New shingles are created at a slower rate than new pages
       • New shingles @ 5% per week => 62% of URL content is new
     • Link-structure evolution
       • Search engines should capture the link structure efficiently
       • The link structure is significantly dynamic
       • New links are created @ 25% per week, compared to 8% for new pages and 5% for new content
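
The w-shingle definition above can be made concrete with a short sketch. It assumes whitespace tokenization and lowercasing, and uses w = 4 on two made-up sentences; the window size and preprocessing used in the study are not stated on the slide.

```python
def shingles(text, w=4):
    """Return the set of w-shingles: contiguous ordered runs of w words."""
    words = text.lower().split()
    if len(words) < w:
        # Texts shorter than w words yield one short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def new_shingle_fraction(old_text, new_text, w=4):
    """Fraction of shingles in new_text that do not occur in old_text."""
    old, new = shingles(old_text, w), shingles(new_text, w)
    if not new:
        return 0.0
    return len(new - old) / len(new)


# Hypothetical page versions that share their first five words.
old_page = "search engines crawl web pages in advance"
new_page = "search engines crawl web pages every single week"

fraction = new_shingle_fraction(old_page, new_page)
print(len(shingles(old_page)), len(shingles(new_page)), fraction)
```

Because a shingle spans several words, a page that copies a passage from an existing page shares that passage's shingles, which is why new shingles can appear more slowly than new URLs.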

  6. Changes in the Existing Pages
     • Change frequency distribution (presence of change)
       • How often is a web page “altered”?
       • Most pages change either very frequently or very infrequently
     • Degree of change (SEO)
       • Metric: TF.IDF word distance
       • Exact order of terms is ignored
       • Minor edits such as advertisements, counters, etc. are detected as minor changes in the content of the pages
       • Search engines can exploit this only by re-downloading revised pages
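
One common way to build a TF.IDF word distance that ignores term order is cosine distance between TF.IDF-weighted term vectors. The sketch below takes that approach as an assumption; the slide does not give the exact weighting, so the IDF formula, the default weight of 1.0 for unseen terms, and the sample texts are all illustrative.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed inverse document frequency for each term in a small corpus."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc.lower().split()))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def tfidf_vector(text, idf):
    """Map each term to term frequency times IDF; word order is discarded."""
    counts = Counter(text.lower().split())
    return {term: tf * idf.get(term, 1.0) for term, tf in counts.items()}


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0 if norm_u == norm_v else 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical page versions: same words reordered, and unrelated words.
old = "web pages change every week"
reordered = "every week web pages change"
different = "links evolve faster"

idf = idf_weights([old, reordered, different])
d_reordered = cosine_distance(tfidf_vector(old, idf), tfidf_vector(reordered, idf))
d_different = cosine_distance(tfidf_vector(old, idf), tfidf_vector(different, idf))
print(d_reordered, d_different)
```

Reordering the same words leaves the distance at (numerically) zero, while pages with no terms in common are at the maximum distance of 1.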

  7. Predictability
     • Overall predictability
       • Metrics:
         Group A (Red): top 80%
         Group B (Yellow): top 80-90%
         Group C (Green): top 90-95%
         Group D (Blue): remaining pages
       • Why is this degree of predictability required?
     • Predictability of an individual site
       • Individual sites www.columbia.edu and www.eonline.com considered for the study

  8. Conclusion
     • Aspects of the evolving Web that are of particular interest for search engine design have been studied in this research over a period of 1 year
     • Existing pages are being removed “rapidly” from the Web and replaced by new ones, whereas new pages tend to borrow their content from existing ones
     • Pages that change significantly over time have a predictable degree of change
     • The link structure is evolving at a faster rate than most of the pages themselves
     • The goal is to maximize search quality by making effective use of available resources to incorporate the changes

  9. Thank You
