
Measuring the Size of the Web


Presentation Transcript


  1. Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State

  2. Studying the Web • To study the characteristics of the Web • Statistics • Topology • Behavior • … • Why? • Scientific curiosity • Practical value • E.g., search engine coverage (Nature, 1999)

  3. Web as Platform • The Web has become a new computation platform • Poses new challenges • Scale • Efficiency • Heterogeneity • Impact on people's lives

  4. E.g., How Big is the Web? • Q1: How many web sites? • Q2: How many web pages? • Q3: How many surface/deep web pages? • Research method • Mostly the experimental method, used to validate novel solutions

  5. Q1: How Many Web Sites? • DNS registrars • Keep lists of domain names • Issues • Not every domain is a web site • A domain may contain more than one web site • Registrars are under no obligation to keep their records correct • There are many registrars …

  6. How Many Web Sites? • Brute force: poll every IP • IPv4: 256 × 256 × 256 × 256 = 2^32 ≈ 4 billion addresses • IPv6: 2^128 • At 10 sec/IP with 1,000 simultaneous connections: • 2^32 × 10 / (1000 × 24 × 60 × 60) ≈ 497 days • Not going to work!!
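A quick back-of-the-envelope check of that figure (a minimal sketch; the constants are the slide's assumptions):

```python
# Brute-force scan time: 2^32 IPv4 addresses, 10 s per probe,
# 1,000 probes running in parallel.
total_ips = 2**32
seconds_per_ip = 10
parallel_connections = 1000
days = total_ips * seconds_per_ip / (parallel_connections * 24 * 60 * 60)
print(round(days))  # ~497 days: far too slow to be practical
```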

  7. How Many Web Sites? • 2nd attempt: sampling • [Figure: Venn diagram] T: all 4 billion IPs; S: sampled IPs; V: IPs giving a valid reply

  8. How Many Web Sites? • Select |S| random IPs • Send an HTTP request to port 80 at each selected IP • Count valid replies ("HTTP 200 OK"): |V| • |T| = 2^32 • Estimate: # of web sites ≈ |T| × |V| / |S| • Q: What are the issues here?
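A minimal sketch of this sampling estimate in Python. The probe helper is a simplification: it only checks whether port 80 accepts a TCP connection, whereas the slide's method sends a full HTTP request and counts "HTTP 200 OK" replies; a real measurement would also need timeouts, retries, and the fixes for the issues on the next slide.

```python
import random
import socket

T = 2**32  # |T|: size of the IPv4 address space

def probe(ip, timeout=2.0):
    """Return True if something accepts a TCP connection on port 80.
    (Simplification of the slide's 'HTTP 200 OK' check.)"""
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

def estimate_num_sites(num_samples):
    valid = 0  # |V|
    for _ in range(num_samples):
        ip = ".".join(str(random.randrange(256)) for _ in range(4))
        if probe(ip):
            valid += 1
    return T * valid / num_samples  # |T| * |V| / |S|
```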

  9. Issues • Virtual hosting • Ports other than 80 • Temporarily unavailable sites • …

  10. OCLC Survey (2002) • OCLC (Online Computer Library Center) results http://wcp.oclc.org/ • Still room for growth (at least for web sites)??

  11. NetCraft Web Server Survey (2010) • Goal is to measure web server market share • Also record # of sites their crawlers visited • August 2010: 213,458,815 distinct sites http://news.netcraft.com/archives/category/web-server-survey/

  12. NetCraft Web Server Survey (2013) • Goal is to measure web server market share • Also record # of sites their crawlers visited • August 2013: 716,822,317 distinct sites http://news.netcraft.com/archives/category/web-server-survey/

  13. NetCraft Web Server Survey (2014) • Goal is to measure web server market share • Also record # of sites their crawlers visited • August 2014: 992,177,228 distinct sites http://news.netcraft.com/archives/category/web-server-survey/

  14. Q2: How Many Web Pages? • Sampling based, as with IPs? • [Figure: Venn diagram] T: all URLs; S: sampled URLs; V: valid replies • Issue here? Unlike IPs, URLs cannot be enumerated, so there is no direct way to draw a uniform random sample

  15. How Many Web Pages? • Method #1: • For each site with a valid reply, download all pages • Measure the average # of pages per site • Estimate: (avg # of pages per site) × (total # of sites) • Result [Lawrence & Giles, 1999] • 289 pages per site, 2.8M sites • 289 × 2.8M ≈ 800M web pages

  16. Further Issues • A small # of sites have TONS of pages • Sampling could miss these sites • The vast majority of sites (99.99%) have a small # of pages • Lots of samples necessary

  17. How Many Web Pages? • Method #2: random sampling • [Figure: Venn diagram] T: all pages; B: base set of known size (e.g., a search engine's index); S: random samples • Assume S is drawn uniformly from T; then |B ∩ S| / |S| ≈ |B| / |T|, so |T| ≈ |B| × |S| / |B ∩ S|

  18. Random Page? • Idea: random walk • Start from a portal home page (e.g., Yahoo) • Estimate the size of the portal: B • Follow random links, say 10,000 times • Select the pages visited • At the end, a set S of random web pages has been gathered

  19. Straightforward Random Walk • Follow a random out-link at each step • [Figure: a small directed graph over pages such as amazon.com, google.com, pike.psu.edu] • Issues?

  20. Straightforward Random Walk • Follow a random out-link at each step • Issues: • Gets stuck in sinks and in dense Web communities • Biased towards popular pages • Converges slowly, if at all

  21. Going to Converge? • A random walk on a regular, undirected graph → a uniformly distributed sample • Theorem [Markov chain folklore]: after O(log N / ε) steps, a random walk reaches the stationary distribution • ε: the eigenvalue gap; depends on the graph structure • N: number of nodes • Idea: • Transform the Web graph into a regular, undirected graph • Perform a random walk on it • Problem • The Web is neither regular nor undirected

  22. Intuition • Random walk on the undirected Web graph (not regular) • High chance of being at a "popular" node at any particular time • Increase the chance of being at an "unpopular" node by staying there longer through self-loops • [Figure: one popular node linked to several unpopular nodes]

  23. WebWalker: Undirected Regular Random Walk on the Web • Follow a random out-link or a random in-link at each step • Use weighted self-loops to even out pages' degrees: w(v) = deg_max − deg(v) • [Figure: a small undirected graph over amazon.com, google.com, pike.psu.edu, with self-loop weights on each node] • Fact: a random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.
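A minimal sketch of such a degree-regularized walk. The toy graph and the small deg_max are hypothetical stand-ins; the next slide suggests deg_max ≈ 300,000 for the real Web.

```python
import random

DEG_MAX = 5  # toy value; slide 24 uses ~300,000 for the real Web

graph = {  # toy undirected "Web" graph (hypothetical)
    "amazon.com": ["google.com", "pike.psu.edu"],
    "google.com": ["amazon.com"],
    "pike.psu.edu": ["amazon.com"],
}

def webwalker_step(v):
    """One step of the walk: w(v) = DEG_MAX - deg(v) self-loops give
    every node the same effective degree, so the stationary
    distribution is uniform."""
    neighbors = graph[v]
    i = random.randrange(DEG_MAX)  # among deg(v) real edges + self-loops
    return neighbors[i] if i < len(neighbors) else v  # self-loop: stay

def sample_random_page(start, steps=10_000):
    """After enough steps, the current node is approximately uniform."""
    v = start
    for _ in range(steps):
        v = webwalker_step(v)
    return v
```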

  24. Ideal Random Walk • Generate the regular, undirected graph: • Make all edges undirected • Choose d, the maximum # of edges per page: say, 300,000 • If deg(n) < 300,000, add self-loops to bring the degree up to d • Perform random walks on the graph • ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9
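Plugging those values into the folklore bound from slide 21 gives a rough sense of the walk length needed (a back-of-the-envelope sketch):

```python
import math

# Mixing-time bound O(log N / eps) with the slide's 1996-Web values.
N, eps = 1e9, 1e-5
steps = math.log(N) / eps
print(f"{steps:.1e}")  # ~2.1e6 steps before samples look near-uniform
```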

  25. WebWalker Results (2000) • Estimated size of the Web • AltaVista: |B| = 250M • |B ∩ S| / |S| = 35% • Estimated |T| ≈ 720M pages • Avg page size: 12K • Avg # of out-links: 10 • Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000
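The estimate follows directly from the formula on slide 17 (a one-line check):

```python
# |T| ~ |B| * |S| / |B ∩ S|, with AltaVista's index as the base set B.
B = 250e6        # |B|: AltaVista's index size
overlap = 0.35   # |B ∩ S| / |S|
T = B / overlap
print(f"{T / 1e6:.0f}M pages")  # ~714M, reported as ~720M
```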

  26. How Large is a Search Engine's Index? • Prepare a representative corpus (e.g., DMOZ) • Draw a word W with known document frequency F • E.g., "the" is present in 60% of all documents in the corpus • Submit W to a search engine E • If E reports X documents containing W, extrapolate the total size of E's index as ≈ X / F • Repeat with multiple words and average
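A minimal sketch of this estimator. The corpus frequencies below are toy values, and query_hit_count is a hypothetical placeholder for whatever API call or page scrape returns the engine's reported hit count.

```python
# Toy document frequencies measured on a representative corpus.
corpus_freq = {"the": 0.60, "and": 0.45, "water": 0.08}

def estimate_index_size(query_hit_count):
    """Average the per-word extrapolations X / F."""
    estimates = []
    for word, f in corpus_freq.items():
        x = query_hit_count(word)  # engine's reported # of matching docs
        estimates.append(x / f)    # extrapolated index size
    return sum(estimates) / len(estimates)

# Usage: estimate_index_size(lambda w: my_engine_hits(w)),
# where my_engine_hits wraps the engine being measured.
```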

  27. http://www.worldwidewebsize.com/ (2010): 28 billion pages

  28. http://www.worldwidewebsize.com/ (2011): 46 billion pages

  29. http://www.worldwidewebsize.com/ (2013): 46 billion pages

  30. http://www.worldwidewebsize.com/ (2013): 10 billion pages

  31. Google Reveals Itself (2008) • 1998: 26 million URLs • 2000: 1 billion URLs • 2008: 1 trillion URLs • Not all of them are indexed • Duplicates • Auto-generated pages (e.g., calendars) • Spam • Experts suspect (2010) that Google indexes at least 40 billion pages

  32. Deep Web (aka Hidden Web) • [Figure: a query goes in through an HTML FORM interface; answers come back from a back-end database]

  33. Q3: Size of the Deep Web? • Deep Web: information reachable only through a query interface (e.g., an HTML FORM) • Often backed by a DBMS • How to estimate? By sampling: • (avg size of a record) × (avg # of records per site) × (total # of Deep Web sites)
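A minimal sketch of that product. The record size and site count come from slide 35; the records-per-site value is back-solved to match the 10^16-byte total and is purely illustrative, since the slides leave that figure ambiguous.

```python
def deep_web_size(avg_record_bytes, avg_records_per_site, num_deep_sites):
    """Total bytes = (avg record size) x (avg # records/site) x (# sites)."""
    return avg_record_bytes * avg_records_per_site * num_deep_sites

# E.g., 14 KB records, ~3.5M records/site (illustrative), 200,000 sites:
print(f"{deep_web_size(14_000, 3_500_000, 200_000):.1e}")  # ~1e16 bytes
```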

  34. Size of the Deep Web? • Total # of Deep Web sites: • Estimate by sampling, as in Q1: |B ∩ S| / |S| • Avg size of a record: • Issue random queries • Measure the size of the replies • Avg # of records per site: • Permute all possible queries for the FORM • Issue every query and count the valid returns

  35. Size of the Deep Web (2005) • A BrightPlanet report estimates: • Avg size of a record: 14KB • Avg # of records per site: 5MB • Total # of Deep Web sites: 200,000 • Size of the Deep Web: 10^16 bytes (10 petabytes) • ~1,000 times larger than the "Surface Web" • How to access it? • Wrapper/mediator (aka Web scraping) • http://brightplanet.com/the-deep-web/deep-web-faqs/ (link now obsolete)
