DYNAMICS OF COMPLEX SYSTEMS Self-similar phenomena and Networks. Guido Caldarelli CNR-INFM Istituto dei Sistemi Complessi [email protected] 5/6. STRUCTURE OF THE COURSE. SELF-SIMILARITY (ORIGIN AND NATURE OF POWER-LAWS) GRAPH THEORY AND DATA SOCIAL AND FINANCIAL NETWORKS
5.2) STATISTICAL PROPERTIES OF INTERNET
5.3) WORLD WIDE WEB
5.4) STATISTICAL PROPERTIES OF WWW
5.5) HITS ALGORITHM
5.8) STATISTICAL PROPERTIES OF WIKIPEDIA
Previous maps have been computed through extensive collection of traceroutes
[email protected]> traceroute www.louvre.fr
1 188.8.131.52 Rome pcpil 2 184.108.40.206 Unknown 3 220.127.116.11 Unknown rc-infnrmi.rm.garr.net 4 18.104.22.168 Unknown rt-rc-1.rm.garr.net 5 22.214.171.124 Unknown mi-rm-1.garr.net 6 126.96.36.199 South Cambridgesh garr.it.ten-155.net 7 188.8.131.52 South Cambridgesh ch-it.ch.ten-155.net 8 184.108.40.206 Genève geneva5.ch.eqip.net 9 220.127.116.11 Genève geneva1.ch.eqip.net10 0.0.0.0 Unknown No Response11 18.104.22.168 Unknown p6.genar2.geneva.opentransit.net12 22.214.171.124 PARIS, FR p43.bagbb1.paris.opentransit.net
Results are that we can quantify the
hierarchical nature of the AS connections
Plot of the C(A) show the same optimisation of the Food webs
skitter is a tool for actively probing the Internet to analyze topology and performance.
This happens at both domain and router server
Faloutsos et al. (1999)
Vazquez Pastor-Satorras and Vespignani
PRE 65 066130 (2002)
Nodes: (static) HTML pages
Edges (directed): hyperlinks beetween pages
From link analysis:
With a “good” WebGraph model:
Broder et al. , Graph structure in the web
Kumar et al. , Crawling the Web for Emerging Cyber Communities
A first way to search WWW has been based on a rough division of the sites
A site i can have both an authority x and hubness y
The authority is the sum of the hubness pointing in
The hubness is the sum of the authorities pointed
Why GOOGLE works so well?
PageRank does not distinguish between different kinds of pages. Everyone has a “PageRank’’ that is shared by destination pages
The larger the out-degree the larger the reduction of the “vote”
A preliminar definition of PageRank can be the following
So the PageRank is defined as the eigenvector of the normal matrix
The eigenvector are computer by recursion, but THERE ARE TRAPPING STATES! These destroy convergence….
The suggestion of Brin and Page was to allow every now and then a jump to a randomly selected page.
The value of awas originally taken as 0.85
STATISTICAL PROPERTIES OF THE WIKIGRAPH
Centro “E. Fermi”
A Nature investigation aimed to find if Wikipedia is an authoritative source of information with respect to established sources as Encyclopedia Britannica.
(Nature 438, 900-901; 2005)
Actually, things are a little bit more complicated
There is not only “control” by users, but also conflict of interests. Thereby sometimes is not possible to modify 100% of the structure since some sites are locked.
One of the biggest scandal was the biography of Journalist John Seigenthaler who was accused to be involved in the murder of President J.F. Kennedy
Some issues and languages have more controls than others. An experiment made by Italian newspaper “L’espresso” introduced
Deliberately some errors in two voices
WHY STUDYING WIKIPEDIA?
WE RECOVER A PREFERENTIAL ATTACHMENT MECHANISM FROM THE DATA.
DIFFERENT LANGUAGES PRODUCE SIMILAR STRUCTURES
WE FIND A SYSTEM SIMILAR TO THE WWW EVEN IF THE MICROSCOPIC RULE OF GROWTH IS VERY DIFFERENT.
The datasets of each language are available in two selfextracting files for mysql database. The table cur contains the current on-line articles, whereas the table old contains all previous versions of each current article. Old versions of an article are identified for using the same title, and not the same id. The dataset dumps are updated almost weekly, so the current graph is usually not more than a week old.
For generating a graph from the link structure of a dataset, each article is considered a node and each hyperlink between articles is a link in this graph. In the wikipedia datasets, each webpage is a single article. An article also might contain some external links that point pages outside the dataset. Usually wikipedia articles has no external links, or just a few of them. These kind of links are not considered for generating the wikigraphs, since we want to restrict the graph to pages into the set being analyzed.
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, generated from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13, 2004. We are not using the current data due to disk space restrictions. The English dataset of June 2005 has more than 36 GB compacted, that is about 200 GB expanded.
The page that was mostly visited was the main pages for wikiEN, wikiDE, wikiFR and wikiES, while that for the datasets wikiIT and wikiPT there were no visits associated with the pages.
The Bow-tie structure, found in the WWW
(Broder et al. Comp. Net. 33, 309, 2000)
The measure/size of the Wikigraph for the various languages.
The percentage of the various components of the Wikigraph for the various languages.
The Degree shows fat tails that can be approximated by a power-law function of the kind P(k) ~ k-g
Where the exponent is
the same both for
in-degree and out-degree.
In the case of WWW
2 ≤ gin ≤ 2.1
in–degree(empty) and out–degree(filled) Occurrency distributions for the Wikgraph
in English (○) and Portuguese ().
As regards the assortativity (as measured by the average degree of the neighbours of a vertex with degree k) there is no evidence of any assortative behaviour.
The average neighbors’ in–degree, computed along incoming edges, as a function of the in–degree for the English (○) and Portuguese ()
The pagerank distribution for wikiEN is a power law function with γ = 2.1. Previous measures in webgraphs also exhibit the same behaviour for the pagerank distribution.
We list the number of visits of the top ranked pages just to show that this value is not related with the pagerank values. We confirm that very little correlation was found between the link analysis characteristics and the actual number of visits.
Given the history of growth one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k) who gives the number of vertices (whose degree is k) acquiring new connections at time t.
This is quantity is weighted by the factor
We find preferential attachment for in and out degree.
English (○) and Portuguese ().
Filled = out-degree
In our opinion the nature of this preferential attachment is effective ratther than the real driving force in the phenomenon.
In other words the linear preferential attachment can be originated by a copying procedure (new vertices are introduced by copying old ones and keeping most of the edges). Also we could have a sort of fitness for the various entries (but in this case one has a multidimensional series of quantities describing the importance of one page).
Apart the interpretation the data show a rather clear
LINEAR PREFERENTIAL ATTACHMENT
related to dyamics need to be explained
For example the number of updates also follows a power law.
Each point presents the number of nodes (y axis) that were updated exactly x times.
This feature is
From these data it seems that a model in the spirit of BA could reproduce most of the features of the system.
* See for example Krapivsky Rodgers and Redner PRL 86 5401 (2001)
The model can be solved analytically
We can use for the model the empirical values of
Already measured for the English version of Wikigraph
P(kin) ~kin- gin gin = -(1+1/(1-R2))
P(kout) ~kout- gout gout = -(1+1/(1-R1))
gin 2.100 gout 2.027
The model can be solved analytically
Knnin(kin) ~M N1-R1R1R2/R3 (R3≠0)
Both cases is constant
Knnin(kin) ~M R1R2 ln (N) (R3=0)
The value of the constant depends also upon the initial conditions. The two lines refer to two realizations of the model where in one case the 0.5% of the first vertices has been removed.
It turns out that the pagerank of the pages is not related with the number of visit opens a very interesting scenario for further research work. Since, by definition, pagerank should give us the visit time of the page and since actually it is complety indipendent by the number of visits, we wonder if pagerank is a good measure of the authoritativeness of the pages in wikigraphs and which modifications should be introduced in order to tune its performances.