Decoding the Structure of the WWW : A Comparative Analysis of Web Crawls

Decoding the Structure of the WWW : A Comparative Analysis of Web Crawls AUTHORS: M.Angeles Serrano Ana Maguitman Marian Boguna Santo Fortunato Alessandro Vespignani

AGENDA • INTRODUCTION • MAIN OBJECTIVE • MAIN CONTRIBUTIONS • PREVIOS RESEARCH AND RESULTS • RELATED WORK DONE AND RESULTS • IMPLEMENATION • CONCLUSION AND FUTURE WORK

INTRODUCTION • Topological Structure of the “World Wide Web” can be represented by the properties of it’s representative graphs. • Vertices identified with Web Pages and Directed Edges identified with Hyperlinks. • The Comparative Analysis of different WWW graphs differ Quantitatively, Qualitatively depending on the domain and crawl used for gathering data. • Degree Distribution, Degree-Degree Correlation Functions , Statistics of Reciprocal Connections are used as measurement for the analysis of Web Graphs.

CONT. • The dynamical nature of the web and its huge size make very difficult the process of compressing, ranking, indexing or mining the web. • Exact policies and strategies followed by Crawl engines helps in decoding the huge structure of WWW.

MAIN OBJECTIVE • Give the clear picture on the reliability of the widely accepted large-scale statistical properties of the Web and provide the new measures to discover whether or not inconsistencies are found when measuring the same properties across different crawls.

MAIN CONTRIBUTIONS • A careful comparative analysis of the structural and statistical large-scale different Web graphs, with evident qualitative and quantitative differences across different samples. • Introduced single and two-vertex correlations and Reciprocal Linksfor a full connectivity pattern and structural ordering of the Web graph. And these all properties depend on the communication patterns among the constituent sites of the network.

PREVIOUS RESEARCHES AND RESULTS • Measurement of the Directed Degree Distributions P(kin) and P(kout), where the in/out-degree, kin or kout respectively, is defined as the number of incoming/outgoing links connecting a page to its neighbors. • Kumar et al. [1999] worked on a big crawl of about 40M nodes, and that by Barab´ asi and Albert [1999] on a smaller set of over 0.3M nodes restricted to the domain of the University of Notre Dame, resulted a scale-free nature for the WWW with power-law behaviors both for the in- and out-degree distributions.

CONT. • Border et al. [2000] worked on two sets from AltaVista crawls, in May and October for year 1999 with 200 million pages and 1.5 billion links. And concluded that the structure of the Web was relatively insensitive to the particular large crawl used and the connectivity structure was resilient to the removal of a significant number of nodes. • Donato et al. [2004] worked on the same lines with a large 2001 data set of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford. The obtained results were compared with the ones presented in the work by Broder et al.[2000]. One of the reported differences is the deviation from the power-law behavior of the out-degree distribution.

RELATED WORK DONE AND RESULTS • Analysis and Comparison for Four Data Sets from years, from 2001 to 2004, and different domains, general and .uk and .it domains. The sets have been gathered within two different projects: theWebBase project and the WebGraph project, with own Web crawler, WebVac and UbiCrawler, respectively. • While pages in the .uk domain have higher probability to point to pages outside the domain,due to English , and the links in the Italian .it domain may be much more endogenous, which could potentially have a high effect on the Web description derived from the data.

CONT. Table I. Number of Nodes and Edges of the Networks Considered, After Extracting Multiple Links and Self-Connections • Measurements carried out during research: • Structural Properties • Degree Correlation • The Role of Reciprocal Links

Structural Properties • Connected Components • Strongly Connected Component : All pages mutually connected by a path. • In-Component: Vertices from which it is possible to reach SCC using directed path. • Out-Component: Vertices which can be reached from SCC using directed path. • Tendrils: Pages which cannot reach the SCC and cannot be reached from it. • Tubes: Directly connect the IN and OUT components without crossing the SCC.

CONT. Table II Sizes of the IN,OUT, SCC and their union MAIN

CONT. • Crawlers perform a directed exploration that they follow outgoing hyperlinks to reach pointed pages, but cannot navigate backwards using incoming hyperlinks. In summary, the structure of Web graphs is strongly dependent on the data set considered Degree Distribution • For directed networks, the in-degree distribution P(kin) and the out-degree distribution P(kout),probabilities of having kin incoming links and kout outgoing links, respectively. • In-degree of a vertex is the sum of all the hyperlinks incoming from all the Web pages in the WWW. And there is no limit to the number of incoming hyperlinks, that is determined only by the popularity of the Web page itself. Out-degree is determined by the number of hyperlinks present in the page, which are controlled by Web administrators.

DEGREE CORRELATION • Degree Correlation between In and Out degree can determine that the n/w will or will not have a bow structure and for the model validation quantitatively. • Single Vertex Degree Correlation • Shows that more popular pages tend to point to a higher number of other pages. This positive correlation is found to be true for a range of in-degrees that spans from kin = 1 to kin =102∼ 103, depending on the specific set. The set for the Italian domain is more noisy, but this pattern appears to be independent of the crawl used to gather the data.

CONT. 2. Two Vertex Degree Correlation • The implication from the Two Vertex Degree Correlation can help in the study of Page Rank, as this includes the neighborhood of each single node i into neighboring nodes connected to it by incoming and outgoing links. And it shows the popularity of the web page in terms of the number of pages pointing to them.

ROLE OF RECIPROCAL LINKS • It plays an important role in percolation catalysts, the fine structure of the web components and the navigability of the web. • Degree Distribution: Depends on the crawl examined. • One Degree Correlation: No clear relation between nonreciprocal in-and out- degrees but there is a positive correlation between reciprocal and nonreciprocal in-degrees. • Degree-Degree Correlation: Shows most of the correlations of web graphs are found in vertices connected by reciprocal links which depends on the web structure. • Reciprocal Sub graph: Shows the organization of the reciprocal sub graph is a set of star like structures combined with cliques, or communities, of highly interconnected pages.

Statistical Properties Of RECIPROCAL SUBGRAPH Average Degree qr , Maximum Degree qmaxr , Standard Deviation σr , Heterogeneity Parameter κr , and Maximum Likelihood Estimate of the Exponent of the Power-Law in-Degree Distribution γr (Precision Error ±0.1) (The symbol ∞ means that the distribution decays faster than a power-law.)

CONCLUSION AND FUTURE WORK

CONT. • Despite an approximate view of the web from data provided by Web Crawlers , still lacking an exact definitive description of its large scale properties and architecture which can affect the navigation, indexing searching and mining. FUTURE WORK: • Differences among crawls should be further investigated in relation to crawling policies adopted in designing of the engines. • The Reciprocal links role has to be explored in detail.

Decoding the Structure of the WWW : A Comparative Analysis of Web Crawls

Decoding the Structure of the WWW : A Comparative Analysis of Web Crawls

Presentation Transcript

Arithmetic Coding: Basic Ideas

Comparative Cultures

Texture Profile Analysis A closer look at this popular way of characterising the structure of foods

DOMAIN I ENGLISH LANGUAGE ARTS AND READING (32% of test)

Jillian L. Wendt University of the District of Columbia Deanna Nisbet Regent University

Decoding the Meaning of the Reading Results with an emphasis on dyslexia

Structural Analysis of Protein Structure

Chapter 6

Decoding Mobile Technologies:

The Comparative Method

Comparative Advantage

fMRI Analysis with the FreeSurfer Functional Analysis Stream (FS-FAST) Preprocessing, First Level Analysis, and Group An

Decoding ENCODE

Comparative analysis of RNA- Seq data with DESeq and DEXSeq

Comparison and Analysis of the 4 Market Structures

Overview

Allele Mining: with respect to Comparative Protein Structure Modelling and Docking study

Chapter 4 Syntactic Analysis

3D Structure Prediction and Assessment

Overview

DNA sequencing “the technology lecture”