
Presentation Transcript


  1. CS246 Web Characteristics

  2. Web Characteristics • What is the Web like? • Any questions on some of the characteristics and/or properties of the Web? Junghoo "John" Cho (UCLA Computer Science)

  3. Web Characteristics • Size of the Web • Search engine coverage • Link structure of the Web

  4. How Many Web Sites? • Polling every IP • 2^32 ≈ 4B IPs, 10 sec/IP, 1000 simultaneous connections: • 2^32 × 10 / (1000 × 24 × 60 × 60) ≈ 500 days
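The polling estimate above is plain arithmetic from the slide's assumptions, and can be checked directly:

```python
# Back-of-the-envelope time to poll the whole IPv4 space,
# using the slide's assumptions: 10 s per IP, 1000 parallel connections.
TOTAL_IPS = 2**32          # all IPv4 addresses
SECONDS_PER_IP = 10
PARALLEL = 1000

days = TOTAL_IPS * SECONDS_PER_IP / (PARALLEL * 24 * 60 * 60)
print(f"{days:.0f} days")  # ≈ 497 days
```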

  5. How Many Web Sites? • Sampling-based • T: all IPs • S: sampled IPs • V: IPs returning a valid reply

  6. How Many Web Sites? • Select |S| random IPs • Send HTTP requests to port 80 at the selected IPs • Count valid replies: “HTTP 200 OK” = |V| • Estimate: |T| × |V| / |S|, with |T| = 2^32
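The scaling step of this sampling method can be sketched as follows; the probing itself is omitted, and the sample counts below are hypothetical:

```python
def estimate_sites(num_valid: int, num_sampled: int, total_ips: int = 2**32) -> float:
    """Estimated number of web sites: |T| * |V| / |S|,
    where |V| valid replies were seen among |S| probed IPs."""
    return total_ips * num_valid / num_sampled

# Hypothetical sample: 3 valid "HTTP 200 OK" replies out of 10,000 probed IPs.
print(f"{estimate_sites(3, 10_000):,.0f}")  # ≈ 1,288,490 sites
```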

  7. How Many Web Sites? • OCLC (Online Computer Library Center) results • http://wcp.oclc.org • Total number of available IPs: 2^32 ≈ 4.2 billion • Growth (in terms of sites) has slowed down

  8. Issues • Multi-hosted servers • cnn.com: 207.25.71.5, 207.25.71.20, … • Solution: count only the lowest IP address. For each sampled IP: • Look up its domain name • Resolve the name back to IPs • Check whether our sampled IP is the lowest
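A minimal sketch of the lowest-IP deduplication rule. The DNS lookups are left out (in practice you would wrap this check around `socket.gethostbyaddr` / `socket.gethostbyname_ex`); only the comparison logic is shown:

```python
import ipaddress

def is_canonical(sampled_ip: str, resolved_ips: list) -> bool:
    """Count a sampled IP only if it is the numerically lowest
    of the addresses its domain name resolves to."""
    sampled = ipaddress.ip_address(sampled_ip)
    return sampled == min(ipaddress.ip_address(ip) for ip in resolved_ips)

# cnn.com's addresses from the slide: only the lowest one is counted,
# so the multi-hosted server contributes a single site to the estimate.
print(is_canonical("207.25.71.5", ["207.25.71.5", "207.25.71.20"]))   # True
print(is_canonical("207.25.71.20", ["207.25.71.5", "207.25.71.20"]))  # False
```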

  9. Issues • Virtual hosting • Multiple sites on the same IP • Find the average number of hosted sites per IP • 7.4M sites on 3.4M IPs by polling all available site names [Netcraft, 2000] • Other ports? • Temporarily unavailable sites?

  10. Where Are They Located?

  11. What Language? (Based on Web sites)

  12. Questions?

  13. How Many Web Pages? • Sampling-based? • T: all URLs • S: sampled URLs • V: valid replies • Problem: effectively an infinite number of URLs

  14. How Many Web Pages? • Solution 1: • Estimate the average number of pages per site: (average no. of pages) × (total no. of sites) • Algorithm: • For each site with a valid reply, download all its pages • Take the average • Result [LG99]: • 289 pages per site × 2.8M sites • ≈ 800M pages
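The [LG99] figure follows directly from multiplying the two estimates:

```python
# Reproducing the [LG99] estimate:
# total pages ≈ (avg pages per site) * (number of sites).
avg_pages_per_site = 289
num_sites = 2_800_000

total_pages = avg_pages_per_site * num_sites
print(f"{total_pages:,}")  # 809,200,000 -- roughly the 800M quoted
```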

  15. Issues • A small number of sites host TONS of pages, while the remaining 99.99% of sites are small • Very likely to miss these large sites • Lots of samples necessary

  16. How Many Pages? • Solution 2: Sampling-based • T: all pages • B: base set • S: random samples

  17. Related Question • How many deer in Yosemite National Park?
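The deer question is the classic capture-recapture (Lincoln-Petersen) setup, which is also the shape of the base-set estimator on the previous slide: mark a known base set B, draw random samples S, and scale B up by the fraction of samples that land in B. A sketch with hypothetical numbers:

```python
def capture_recapture(base_size: int, sample_size: int, overlap: int) -> float:
    """Lincoln-Petersen estimate of the total population size:
    |T| ≈ |B| * |S| / |B ∩ S|."""
    return base_size * sample_size / overlap

# Hypothetical deer example: tag 100 deer, later observe 50 at random,
# of which 10 are tagged -> an estimated 500 deer in the park.
print(capture_recapture(100, 50, 10))  # 500.0
```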

  18. Random Page? • Idea: Random walk • Start from the Yahoo home page • Follow random links, say 10,000 times • Select the page where the walk ends • Problem: • Biased toward “popular” pages, e.g., Microsoft, Google

  19. Random Page? • Random walks on a regular, undirected graph → uniform random sample • Regular graph: an equal number of edges at every node • After (log N)/ε steps • ε: depends on the graph structure • N: number of nodes • Idea: • Transform the Web graph into a regular, undirected graph • Perform a random walk

  20. Ideal Random Walk • Generate the regular, undirected graph: • Make edges undirected • Decide d, the maximum # of edges per page: say, 300,000 • If deg(n) < 300,000, add self-loops to pad the degree to 300,000 • Perform random walks on the graph • ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9 • (log N)/ε ≈ 3,000,000 steps, but mostly self-loops • only ~100 steps per walk follow actual edges
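A minimal sketch of the regularized walk described above, on a toy graph rather than the Web-scale d = 300,000: every node is padded to degree d with self-loops, and most steps are self-loops, exactly as the slide notes.

```python
import random

def regular_walk(adj: dict, d: int, steps: int, start, seed: int = 0):
    """Random walk on an undirected graph made d-regular with self-loops.
    At each step, pick one of d edge slots uniformly: the first slots are
    the node's real neighbors; the remaining d - deg(node) slots are
    self-loops, so the walk stays put."""
    rng = random.Random(seed)
    node = start
    for _ in range(steps):
        slot = rng.randrange(d)
        nbrs = adj[node]
        if slot < len(nbrs):
            node = nbrs[slot]   # follow a real edge
        # else: self-loop, stay at the current node
    return node

# Toy undirected graph a - b - c (edges listed both ways), padded to d = 5.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(regular_walk(adj, d=5, steps=1000, start="a"))
```

Because the padded graph is regular and undirected, the stationary distribution is uniform, so running many independent walks returns each node about equally often.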

  21. Different Interpretation • Random walk on the irregular Web graph • High chance of being at a “popular” node at any particular time • Increase the chance of being at an “unpopular” node by staying there longer through self-loops (figure: one popular node with many links vs. unpopular nodes padded with self-loops)

  22. Issues • How to get the edges to/from node n? • Edges discovered so far • From search engines, like AltaVista, HotBot • Coverage of incoming links is still limited

  23. WebWalker [BBCF00] • Our graph does not have to be the same as the real Web • Construct the regular undirected graph while performing the random walk • Add a new node n when the walk visits n • Find edges for node n at that time • Edges discovered so far • From search engines • Add self-loops as necessary • Ignore any further edges to n discovered later

  24. WebWalker • Example with d = 5 (figure: a small walk over nodes with real degrees 1, 2, 2, 3, 1, each padded to degree 5 with self-loops)

  25. WebWalker • Why ignore “new incoming” edges? • To keep the graph regular: the “discovered parts” of the graph do not change • The “uniformity theorem” still holds • Can we arrive at “all reachable” pages? • Yes: we ignore only the edges to already-visited nodes • Can we use the same ε? • No

  26. WebWalker results • Size of the Web • AltaVista: |B| = 250M • |B∩S|/|S| = 35% • |T| ≈ 720M • Avg page size: 12K • Avg no. of out-links: 10
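Plugging the slide's numbers into the base-set estimator confirms the quoted total:

```python
# WebWalker size estimate: |T| = |B| / (|B ∩ S| / |S|).
base = 250_000_000          # pages in the AltaVista index, |B|
overlap_fraction = 0.35     # fraction of random-walk samples found in B

total = base / overlap_fraction
print(f"{total / 1e6:.0f}M pages")  # ≈ 714M, consistent with the ~720M quoted
```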

  27. WebWalker results • Pages by domain • .com: 49% • .edu: 8% • .org: 7% • .net: 6% • .de: 4% • .jp: 3% • .uk: 3% • .gov: 2%

  28. What About Other Web Pages? • Pages that are • Available within corporate Intranet • Protected by authentication • Not reachable through following links • E.g., pages within e-commerce sites • Deep Web vs Hidden Web • Information reachable through search interface • What if a page is reachable both through links and search interface?

  29. Size of Deep Web? • Estimation: • (Avg no. of records per site) × (Total no. of Deep Web sites) • How to estimate? • By sampling

  30. Size of Deep Web? • Total # of Deep Web sites: • Estimated via sampling, as before: |T| = |B| / (|B∩S|/|S|) • Avg no. of records per site: • Contact the site directly • Or, if the site reports the number of matches, query “NOT zzxxyyxx” — the nonsense term matches nothing, so its negation matches every record, and the reported match count is the site's size

  31. Size of Deep Web • BrightPlanet report • Avg no. of records per site: 5 million • Total no. of Deep Web sites: 200,000 • Avg size of a record: 14KB • Size of the Deep Web: ≈ 10^16 bytes (tens of petabytes) • ~1,000× larger than the “Surface Web” • How to access it?
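Multiplying the slide's three BrightPlanet figures checks the order of magnitude:

```python
# Checking the Deep Web size arithmetic from the slide.
records_per_site = 5_000_000
num_sites = 200_000
record_size_bytes = 14 * 1024   # 14 KB

total_bytes = records_per_site * num_sites * record_size_bytes
print(f"{total_bytes:.2e} bytes")  # ≈ 1.43e+16, on the order of 10 petabytes
```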

  32. Web Characteristics • Size of the Web • Search engines • Link structure of the Web

  33. Search Engines • Coverage • Overlap • Dead links • Indexing delay

  34. Coverage? • Q: How to estimate coverage? • A: Create a random sample of pages and measure how many of them are indexed by a search engine • In 1999: • Estimated Web size: 800M • Reported indexed pages: 128M (Northern Light) → coverage ≈ 16% • No reliable Web size estimate at this point • Search engines often claim ~20B indexed pages

  35. Overlap? • How many pages are commonly indexed? • Method 1: • Create a random sample and measure how many are indexed only by A, only by B, and by both A and B • Method 2: • Send common queries, compare the returned pages, and measure the overlap • Result from method 2: little overlap • E.g., Infoseek and AltaVista: 20% overlap [Bharat and Broder 1997] • Is it still true? • Results seem to converge

  36. Dead Links? • Q: How can we measure what fraction of pages in search engines are dead? • A: Issue random queries and check whether the returned pages are dead • Result in Feb 2000: • AltaVista: 13.7% • Excite: 8.7% • Google: 4.3% • Search engines have gotten much better due to better recrawling algorithms • A topic for later study

  37. How Early Do Pages Get Indexed? • Method 1: • Create pages at random locations • Check when they become available in search engines • Cons: difficult to create pages at random locations • Method 2: • Repeatedly issue the same queries over time • When a new page appears in the results, record its “last modified date” • Cons: the last modified date is only a “lower bound”

  38. How Early are Pages Indexed? • Mean time [Lawrence and Giles 2000] • Northern Light: 141 days • AltaVista: 166 days • HotBot: 192 days

  39. How Stable Are the Sites? • Monitor a set of random sites • Percentage of Web servers available: (chart omitted; similar results for other years)

  40. Web Characteristics • Size of the Web • Search engines • Link structure of the Web

  41. Web As A Graph • Page: Node • Link: Edge

  42. Link Degree • How many links does a page have? • In-degree follows a power law (exponent ≈ 2.1) • Why consistently 2.1?

  43. Link Degree • Out-degree

  44. Large-Scale Structure? • Study by AltaVista & IBM, 1999 • Based on 200M pages downloaded by AltaVista crawler • “Bow-tie” result based on two experiments

  45. Experiment 1: Strongly Connected Components • Strongly connected component (SCC): • C is a strongly connected component if: ∀ a, b ∈ C, there are paths from a→b and from b→a • (figure: two small example graphs over nodes a, b, c — one not strongly connected, “No”; one strongly connected, “Yes”)

  46. Result 1: SCC • Identified all SCCs from 200M pages • Biggest SCC: 50M (25%) • Other SCCs are small • Second largest: 150K • Mostly fewer than 1000 nodes

  47. Experiment 2: Reachability • How many pages can we reach starting from a random page? • Experiment: • Pick 500 random pages • Follow links in breadth-first manner until no more links • Repeat the same experiment following links in the reverse direction

  48. Result 2: Reachability • Out-links (forward direction) • 50% reaches 100M • 50% reaches fewer than 1000

  49. Result 2: Reachability • In-links (reverse direction) • 50% reaches 100M • 50% reaches fewer than 1000

  50. What Can We Conclude? • A giant SCC of 50M pages (25%) forms the core of the “bow-tie” structure
