
Web Data Mining (Mining di dati web)

Lecture 2

The Web Graph

Academic Year 2006/2007

The Web Graph
  • The linkage structure of Web Pages forms a graph structure.
  • The Web Graph (hereinafter called W) is a directed graph W = (V,E)
    • V is the vertex set and each vertex represents a page in the Web.
    • E is the edge set: a directed edge (u,v) exists whenever the page represented by u contains a link to the page represented by v.
A Toy Example of W

[Figure: four pages, numbered 1 to 4, connected by the links Link11, Link12, Link21, Link22, Link31 and Link41.]

V = {1, 2, 3, 4}

E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}
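
As an illustration, the toy graph can be stored in the two usual ways: as an adjacency list (page to out-links) or as a 0/1 adjacency matrix. The sketch below is only one possible encoding; the function names `adjacency_list` and `adjacency_matrix` are hypothetical and not part of the lecture.

```python
# Toy Web graph from the slide: four pages and six directed links.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]


def adjacency_list(vertices, edges):
    """Map each page to the sorted list of pages it links to."""
    out_links = {v: [] for v in vertices}
    for source, target in edges:
        out_links[source].append(target)
    return {v: sorted(targets) for v, targets in out_links.items()}


def adjacency_matrix(vertices, edges):
    """Return a 0/1 matrix A with A[i][j] = 1 iff vertices[i] links to vertices[j]."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


ADJ = adjacency_list(V, E)   # {1: [2, 4], 2: [3, 4], 3: [1], 4: [3]}
A = adjacency_matrix(V, E)   # row i lists the out-links of page i + 1
```

The adjacency list is usually preferable for a graph as sparse as the Web, since the matrix needs |V|² entries regardless of how many links exist.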

The size of W
  • What is being measured?
    • Number of hosts
    • Number of (static) html pages
    • Volume of data
  • Number of hosts - netcraft survey
    • http://news.netcraft.com/archives/web_server_survey.html
    • Monthly report on how many web hosts & servers are out there!
  • Number of pages - numerous estimates
  • Recently Yahoo announced an index with 20B pages.
The “real” size of W
  • The web is effectively infinite.
    • Dynamic content (e.g. calendars, online organizers) can generate an unbounded number of pages.
    • http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html
  • The static web contains syntactic duplication, mostly due to mirroring (roughly 20-30%).
  • Some servers are seldom connected.
Recent Measurement of W
  • [Gulli & Signorini, 2005]. The indexable web contains more than 11.5B pages.
  • About 2.3B pages are unknown to popular search engines.
  • 35-120B pages lie within the hidden web.
  • The index intersection between the largest available search engines (Google, Yahoo!, MSN, AskJeeves) is estimated to be 28.8%.
Evolution of W
  • All of these numbers keep changing.
  • There are relatively few scientific studies of the evolution of the web [Fetterly & al., 2003].
    • http://research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf
  • It is sometimes possible to extrapolate from small samples (fractal models) [Dill & al., 2001].
    • http://www.vldb.org/conf/2001/P069.pdf
Rate of change
  • There are a number of different studies analyzing the rate of change of the pages in V.
  • [Cho & al., 2000] 720K pages from 270 popular sites, sampled daily from Feb 17 to Jun 14, 1999
    • Any changes: 40% weekly, 23% daily
  • [Fetterly & al., 2003] Massive study: 151M pages checked over a few months
    • Significantly changed: 7% weekly
    • Slightly changed: 25% weekly
  • [Ntoulas & al., 2004] 154 large sites re-crawled from scratch weekly
    • 8% new pages / week
    • 8% die
    • 5% new content
    • 25% new links/week
The Power of Power Laws
  • A power law relationship between two scalar quantities x and y is one where the relationship can be written as

$$y = a x^k$$

where a (the constant of proportionality) and k (the exponent of the power law) are constants.

  • Power laws are observed in many subject areas, including physics, biology, geography, sociology, economics, and linguistics.
  • Power laws are among the most frequent scaling laws that describe the scale invariance found in many natural phenomena.
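
A power law becomes a straight line on log-log axes, since taking logarithms gives log y = log a + k log x. The sketch below estimates a and k by ordinary least squares on the logged data. The helper name `fit_power_law` and the synthetic data are assumptions made for illustration, and a plain log-log regression is only a rough estimator when the data are noisy.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k log x; returns (a, k)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    k = sxy / sxx                      # slope on log-log axes = exponent
    a = math.exp(mean_y - k * mean_x)  # intercept gives the constant a
    return a, k


# Synthetic, noise-free data generated from y = 3 * x^(-2.1) (illustrative values only).
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0 * x ** -2.1 for x in xs]
a, k = fit_power_law(xs, ys)
```

On this noise-free input the fit recovers the constants used to generate the data, up to floating-point error.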
Power Law Probability Distributions
  • Sometimes called heavy-tail or long-tail distributions.
  • Examples of power law probability distributions:
    • The Pareto distribution, for example, the distribution of wealth in capitalist economies
    • Zipf's law, for example, the frequency of unique words in large texts http://wordcount.org/main.php
    • Scale-free networks, where the distribution of links is given by a power law (in particular, the World Wide Web)
    • Frequency of events or effects of varying size in self-organized critical systems, e.g. Gutenberg-Richter Law of earthquake magnitudes and Horton's laws describing river systems
The in/out-degree

Power law trend:

Random Graphs
  • Random graphs (RGs) are structures introduced by Paul Erdős and Alfréd Rényi.
  • There are several models of RGs. We are concerned with the model $G_{n,p}$.
  • A graph $G = (V,E) \in G_{n,p}$ is such that $|V| = n$ and each possible edge $(u,v)$ is included in $E$ independently with probability $p$.
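
A minimal sketch of sampling from this model, assuming the undirected version in which each of the n(n-1)/2 unordered pairs becomes an edge independently with probability p. The function name `sample_gnp` and the `seed` parameter are hypothetical, introduced only to make the example reproducible.

```python
import random
from itertools import combinations


def sample_gnp(n, p, seed=None):
    """Sample an undirected graph on vertices 0..n-1 from G(n, p).

    Every unordered pair {u, v} is kept as an edge independently
    with probability p.
    """
    rng = random.Random(seed)
    vertices = list(range(n))
    edges = [(u, v) for u, v in combinations(vertices, 2) if rng.random() < p]
    return vertices, edges


vertices, edges = sample_gnp(20, 0.3, seed=42)
```

The expected number of edges is p · n(n-1)/2, so a sampled graph with n = 20 and p = 0.3 has 57 edges on average.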
W cannot be a RG
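  • Let $X_k$ be a discrete random variable indicating the number of nodes having degree equal to $k$.
  • In $G_{n,p}$ the expected value of $X_k$ is $E(X_k) = \lambda_k = n \binom{n-1}{k} p^k (1-p)^{n-1-k}$.
  • $X_k$ is asymptotically distributed as a Poisson variable with mean $\lambda_k$.
  • The degree of a node in $G_{n,p}$ is binomially distributed, so its tail decays exponentially rather than as a power law; $G_{n,p}$ therefore cannot reproduce the degree distribution observed in W.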
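
To see why degrees in G(n,p) do not look like a power law, one can evaluate the expected number of nodes of degree k. The sketch below assumes the standard binomial expression n · C(n-1, k) · p^k · (1-p)^(n-1-k) for the undirected model; the function name `expected_degree_count` and the parameter values are illustrative assumptions.

```python
import math


def expected_degree_count(n, p, k):
    """Expected number of nodes of degree k in G(n, p).

    Each node has n - 1 potential neighbours, each present with
    probability p, so its degree is Binomial(n - 1, p).
    """
    return n * math.comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)


# Illustrative parameters: 1000 nodes, mean degree close to 5.
n, p = 1000, 0.005
counts = [expected_degree_count(n, p, k) for k in range(n)]
tail = sum(counts[30:])  # expected number of nodes with degree >= 30
```

With these parameters the expected counts peak near the mean degree and `tail` is far below one node, whereas a power-law degree distribution would leave a visible number of very high-degree nodes.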
The avg distance of a graph G
  • Let $u, v \in V$ be two nodes of G.
  • Let $d(u,v)$ be the distance from u to v, expressed as the length of the shortest path connecting u to v. If u and v are not connected then the distance is set to $\infty$.
  • Define $L(G) = \frac{1}{|S|} \sum_{(u,v) \in S} d(u,v)$, where S is the set of pairs of distinct nodes u, v of G with the property that $d(u,v)$ is finite.
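
The average distance can be computed with one breadth-first search per node. The sketch below takes L(G) to be the mean of d(u,v) over the ordered pairs of distinct nodes with finite distance, which is an assumption about the exact definition intended here; the names `bfs_distances` and `average_distance` are hypothetical. It is applied to the four-page toy graph of this lecture.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance(adj):
    """Mean of d(u, v) over ordered pairs u != v with finite distance."""
    total = 0
    pairs = 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("inf")


# Adjacency list of the toy graph: E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}.
TOY = {1: [2, 4], 2: [3, 4], 3: [1], 4: [3]}
L_TOY = average_distance(TOY)  # 19/12, about 1.58
```

Running one BFS per node costs O(|V| · (|V| + |E|)), which is fine for small examples but far too slow for the whole Web graph, where sampling a subset of source nodes is the usual workaround.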
The avg distance of W
  • A small world graph is a graph whose avg distance is much smaller than the order of the graph.
  • For instance $L(G) \in O(\log |V(G)|)$.
  • L(W) is about 7.
  • $L_d(W)$ is about 18.
What’s the best model for W?
  • A graph model for the web should have (at least) the following features:
    • On-line property. The number of nodes and edges changes with time.
    • Power law degree distribution. The degree distribution follows a power law, with an exponent >2.
    • Small world property. The average distance is much smaller than the order of the graph.
    • Many dense bipartite subgraphs. The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges.
  • It is still an open problem to find a web graph model that produces graphs which provably have all four properties.
W Models proposed so far.
  • [Bollobas & al., 2001]. Linearized Chord Diagram (LCD).
  • [Aiello & al., 2001]. ACL.
  • [Chung & al., 2003]. CL.
  • [Kumar & al., 1999]. Copying model.
  • [Chung & al., 2004]. CL-del growth-deletion model.
  • [Cooper & al., 2004]. CFV.
References
  • [Gulli & Signorini, 2005]. Antonio Gulli and Alessio Signorini. The indexable web is more than 11.5 billion pages. WWW (Special interest tracks and posters) 2005: 902-903.
  • [Fetterly & al., 2003]. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (May 2003), pages 669-678.
  • [Dill & al., 2001]. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins: Self-similarity in the web. ACM Trans. Internet Techn. 2(3): 205-223 (2002).
  • [Cho & al., 2000]. Junghoo Cho, Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000: 200-209.
  • [Ntoulas & al., 2004]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12.
  • [Bollobas & al., 2001]. Béla Bollobás, Oliver Riordan, G. Tusnády, and Joel Spencer. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, vol. 18, 2001, 279-290.
  • [Aiello & al., 2001]. William Aiello, Fan R. K. Chung, Linyuan Lu. Random Evolution in Massive Graphs. FOCS 2001: 510-519.
  • [Chung & al., 2003]. Fan R. K. Chung, L. Lu. The average distances in random graphs with given expected degrees. Internet Mathematics. 1(2003): 91-114.
  • [Kumar & al., 1999]. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and Eli Upfal. Stochastic models for the Web graph. Proceedings of the 41st FOCS. 2000, pp. 57-65.
  • [Chung & al., 2004]. F. Chung, L. Lu. Coupling Online and Offline Analyses for Random Power Law Graphs. Internet Mathematics. Vol 1 (2003). 409-461.
  • [Cooper & al., 2004]. C. Cooper, A. Frieze, J. Vera. Random Deletions in a Scale Free Random Graph Process. Internet Mathematics. Vol 1 (2003). 463 - 483.