
Sampling a web subgraph

Paraskevas V. Lekeas. Proceedings of the 5th Algorithms, Scientific Computing, Modeling and Simulation (ASCOMS), Web conference, New York, USA, Sept. 15-17, 2003.


Presentation Transcript


  1. Sampling a web subgraph. Paraskevas V. Lekeas. Proceedings of the 5th Algorithms, Scientific Computing, Modeling and Simulation (ASCOMS), Web conference, New York, USA, Sept. 15-17, 2003.

  2. Web Sampling
     • In order to study the web we have to crawl it.
     • We cannot exhaustively crawl the whole web because
       i) it is very big
       ii) it grows exponentially
     • Instead, we use sampling techniques to collect representative samples (pages) of the web and then study these pages.
     • There are 2 main methods of web sampling:
       i) “stochastic sampling” (random walks)
       ii) “deterministic sampling” (IP sampling)

  3. Stochastic Sampling
     • A stochastic sampler starts from a node of the web graph (pages are nodes, links are edges), picks (with some probability) a link in that node, follows it, visits another node, and so on.
     • The sampler stops when it reaches the equilibrium distribution and outputs the sample (all the visited nodes). If the transition matrix of the process is P and the sampler is in state π, the equilibrium distribution is a state for which π = πP.
     • Problems:
       i) We need connectivity (links) between nodes
       ii) We don’t know how to choose a node uniformly at random to start the stochastic sampler
       iii) We don’t know how long it takes to reach the equilibrium distribution
       iv) There is statistical dependency among the nodes that the sampler visits (no clean statistics)
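The equilibrium condition π = πP can be illustrated on a toy graph. The sketch below is not the sampler from the talk: the three-node graph, the uniform choice among out-links, and the function names are all assumptions made for illustration. It approximates π by repeatedly applying π ← πP and runs a short random walk on the same graph.

```python
import random


def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi = pi P by power iteration from the uniform vector."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def random_walk(links, start, steps, rng):
    """Follow one uniformly chosen out-link per step; return all visited nodes."""
    node = start
    visited = [node]
    for _ in range(steps):
        node = rng.choice(links[node])
        visited.append(node)
    return visited


# Hypothetical toy web graph: node -> list of nodes it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)
# Transition matrix for a walker that picks each out-link with equal probability.
P = [[1.0 / len(links[i]) if j in links[i] else 0.0 for j in range(n)]
     for i in range(n)]

pi = stationary_distribution(P)
walk = random_walk(links, start=0, steps=20, rng=random.Random(0))
```

On this toy graph the walk must start somewhere and every visited node depends on the previous one, which is exactly problems ii) and iv) above.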

  4. Deterministic Sampling
     • A deterministic sampler does not sample the web graph but the IPv4 (Internet Protocol version 4) address space.
     • The sampler collects IPs from the IPv4 space (pre-sample) and converts them into their web representation (final-sample).
     • Problems:
       i) difficulties in accessing many hosts when converting the IP addresses into web nodes
       ii) multihosting (one IP may belong to several web nodes, but the resolution mechanism shows only one node)
       iii) scalability problems (the new Internet protocol, IPv6)
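As a rough illustration of the pre-sample step, the sketch below draws distinct addresses uniformly from the whole 32-bit IPv4 space using Python's standard `ipaddress` module. The function name and the choice to draw without replacement are assumptions. The sketch does not filter reserved or private ranges, which a real sampler would probably need to do.

```python
import ipaddress
import random


def draw_ipv4_presample(size, rng):
    """Draw `size` distinct addresses uniformly from the 32-bit IPv4 space."""
    seen = set()
    while len(seen) < size:
        # Every 32-bit integer maps to exactly one IPv4 address.
        seen.add(ipaddress.IPv4Address(rng.getrandbits(32)))
    return sorted(seen)


presample = draw_ipv4_presample(5, random.Random(42))
for ip in presample:
    print(ip)
```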

  5. Sampling a web Subgraph 1/4
     • Say we want to study a web subgraph (for example a country code Top Level Domain such as .gr or .uk).
     • We can’t use a stochastic sampler: if we start it from a node inside the domain, the sampler is not going to stay there (and if we force the sampler to stay inside, we ruin the stochasticity of the process).
     • We also can’t use a deterministic sampler as it is, since IPv4 is a huge pool of IPs and our subgraph contains only a small part of them.
     • In this work we built a modified deterministic sampler that solves the above problems.

  6. Sampling a web Subgraph 2/4
     [Diagram: IP addresses of web subgraph and random number generator → pre-sample (IP addresses) → Resolver → final-sample (web nodes)]
     • The sampler gets as input the IP addresses of the subgraph (population). The IPs of the subgraph are collected from Regional Internet Registries (such as RIPE).

  7. Sampling a web Subgraph 3/4
     • The sampler uses sampling theory to compute the size of the sample, produces the appropriate amount of random numbers and draws a pre-sample of IP addresses.
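One way to picture the pre-sample draw is sketched below, assuming the subgraph's population is given as a list of non-overlapping CIDR blocks. The blocks shown are reserved documentation ranges standing in for real registry data, and the function name is hypothetical. Random integers index into the concatenated blocks, so each address in the population is equally likely to be drawn.

```python
import ipaddress
import random
from bisect import bisect_right
from itertools import accumulate


def draw_presample_from_blocks(blocks, n, rng):
    """Draw n distinct addresses uniformly from non-overlapping CIDR blocks."""
    nets = [ipaddress.ip_network(b) for b in blocks]
    sizes = [net.num_addresses for net in nets]
    cum = list(accumulate(sizes))  # running totals of block sizes
    # Random positions in the concatenated population, without replacement.
    picks = rng.sample(range(cum[-1]), n)
    presample = []
    for k in picks:
        i = bisect_right(cum, k)            # block that contains position k
        offset = k - (cum[i] - sizes[i])    # position inside that block
        presample.append(nets[i][offset])
    return presample


# Placeholder population (documentation ranges, not real registry data).
blocks = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
presample = draw_presample_from_blocks(blocks, 10, random.Random(1))
```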

  8. Sampling a web Subgraph 4/4
     • The sampler resolves the pre-sample and outputs the final sample that contains web nodes (pages).
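The slides do not say how the resolver decides that an IP address is a web node. The sketch below assumes one simple criterion, namely that the host accepts a TCP connection on port 80, and keeps the probe pluggable so another test (for example a reverse DNS lookup) could be substituted. Both function names are hypothetical.

```python
import socket


def answers_http(ip, port=80, timeout=2.0):
    """Assumed criterion: the host accepts a TCP connection on the HTTP port."""
    try:
        with socket.create_connection((str(ip), port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def resolve_presample(presample, probe=answers_http):
    """Keep only the pre-sample addresses that the probe accepts as web nodes."""
    return [ip for ip in presample if probe(ip)]
```

Hosts that are slow or filtered will be counted as non-web nodes by this probe, which is one concrete form of the access difficulty listed among the problems of deterministic sampling.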

  9. Testing the Sampler (test 1)
     • We want to predict the percentage of web nodes in a domain (.gr). Say that inside this domain there exist N IPs. Some of them are web nodes while others are not.
     • Define a variable [definition not preserved in the transcript].
     • Then [formula not preserved in the transcript] is the total number of web nodes in N.
     • An estimator of the percentage of web nodes in N is [formula not preserved in the transcript].
     • The size n of the sample we need to draw in order to estimate p with error of magnitude B is [formula not preserved in the transcript] (q = 1 - p).
     • From the above we estimate that in late 2002 [value not preserved in the transcript], which agrees with RIPE statistics for the same period.
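The formulas on this slide did not survive in the transcript. The block below gives the standard textbook expressions for estimating a proportion by simple random sampling without replacement, which match the surviving wording (an indicator variable, an estimator of p, a sample size n for a bound B, and q = 1 - p). Treat them as an assumption about what the slide showed, not as a quotation. The symbol y_i is introduced here for illustration.

```latex
% Indicator variable for each of the N IP addresses (assumed notation)
y_i =
\begin{cases}
1 & \text{if IP address } i \text{ is a web node} \\
0 & \text{otherwise}
\end{cases}
\qquad
\sum_{i=1}^{N} y_i = \text{total number of web nodes among the } N \text{ IPs}

% Estimator of the proportion p from a sample of size n
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} y_i

% Sample size for a bound B on the error of estimation, with q = 1 - p.
% Obtained by setting 2\sqrt{\operatorname{Var}(\hat{p})} = B, where
% \operatorname{Var}(\hat{p}) = \frac{pq}{n} \cdot \frac{N - n}{N - 1}
n = \frac{N p q}{(N - 1)\,\frac{B^{2}}{4} + p q}
```

When p is unknown before sampling, p = q = 0.5 is the conservative choice, since it maximizes pq and therefore n.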

  10. Testing the Sampler (test 2)
      [Plot: out degrees, InTree links chopped; (x, y) = (log degree, log rank); fit: 11,2456-0.0085x]
      • The out degree distribution of the sample obeys a power law, which is an intrinsic property of the web graph.
      • The roughly linear plot is skewed at y = 4, and this is due to a porn site with hundreds of repetitions of the same links.
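A power law shows up as a roughly straight line when log rank is plotted against log degree. The sketch below fits such a line by ordinary least squares. It is not the fitting procedure used in the talk: the base-10 logarithm, the function name, and the synthetic degree list are all assumptions.

```python
import math


def fit_loglog_rank(degrees):
    """Least-squares line y = slope * x + intercept through
    (x, y) = (log10 degree, log10 rank), ranking degrees from largest down."""
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log10(d) for d in ranked]
    ys = [math.log10(r) for r in range(1, len(ranked) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic out-degrees following an exact power law (not data from the talk).
degrees = [1000.0 / r for r in range(1, 101)]
slope, intercept = fit_loglog_rank(degrees)
```

A page that repeats the same links hundreds of times adds an unusually large degree at one rank, which pulls points off the line in the way the slide describes.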

  11. Uses of the sampler
      • The above sampler
        i) can help us collect information about web communities or validate laws in Internet domains
        ii) can be used as input to stochastic samplers, which need to start from random sets of web nodes
        iii) can be used as a crawler if we force it not to draw samples but to exhaustively visit all the IP addresses that we give to it
