The influence of search engines on preferential attachment


Presentation Transcript


  1. The influence of search engines on preferential attachment Dan Li CS3150 Spring 2006

  2. The paper • The influence of search engines on preferential attachment • Soumen Chakrabarti, Alan Frieze and Juan Vera

  3. Background • The evolution of social networks through time • Web graph • Models • Preferential Attachment • Copying Model

  4. Background • Evolution of the Web • Power-law • Preferential attachment (Barabási and Albert) • Copying Model • The author of a newborn page u picks a random reference page v from the web and, with some probability, copies out-links from v to u. • Power-law: power ~ 2 • Organic Evolution • NO POWERFUL CENTRAL ENTITY!
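
The copying step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `copying_model`, the fixed number of out-links per page, the self-linking seed page, and the choice to fall back to a uniformly random target when a link is not copied are all hypothetical details that the slide does not specify.

```python
import random

def copying_model(n_pages, out_links, copy_prob, seed=0):
    """Return links[u] = list of pages that page u points to."""
    rng = random.Random(seed)
    # Assumed seed graph: page 0 with all out-links pointing to itself.
    links = [[0] * out_links]
    for u in range(1, n_pages):
        v = rng.randrange(u)  # uniformly random reference page among old pages
        new_links = []
        for i in range(out_links):
            if rng.random() < copy_prob:
                # Copy the i-th out-link of the reference page v.
                new_links.append(links[v][i])
            else:
                # Assumed fallback: link to a uniformly random old page.
                new_links.append(rng.randrange(u))
        links.append(new_links)
    return links
```

Pages that are already linked to are more likely to be copied again, which is what produces the heavy-tailed in-degree distribution without any central ranking authority.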

  5. The New Problem • How do page authors find existing pages and create links to them? • Highly popular search engines limit the attention of page authors to a small set of celebrity pages. • Page authors frequently use search engines to locate pages, and link to the HOT pages they visit (with probability p).

  6. The New Problem • The evolution of the Web graph has been influenced permanently and pervasively by the existence of search engines. • When a search engine ranks a page highly, authors find the page more often and some of them link to it, raising its in-degree and PageRank, which leads to a further improvement or entrenchment of its rank.

  7. The Results in This Paper • With high probability, the celebrity nodes eventually accumulate a constant fraction of all links created. • The degrees of the other nodes still follow a power-law distribution with a steeper power:

  8. The New Model • Modeling how the web graph evolves if authors use a search engine to decide which links to insert into new pages. • How the degree distribution deviates from the traditional model

  9. The New Model • Undirected Web Graph • The query to the search engine is fixed • The search engine returns a fixed number of URLs ordered by their degree at the previous time-step • Limit the analysis to one topic at a time without loss of generality?? • Comments: A new page may involve multiple topics at the same time and include a different number of links for each topic.

  10. The New Model • Growth process: • Generates a sequence of graphs Gt, t = 1, 2, 3, … • At time t, the graph Gt = (Vt, Et) has t vertices and mt edges. • Parameters: • p: a probability • N: maximum number of celebrity nodes listed by the search engine

  11. The New Model – Comments • The number of links each new page creates is fixed? Is this realistic? How does this affect the results? • Intuitively, a page author may not have a number of links in mind; he will decide whether to include each link based on its content.

  12. Some Notations in the new model

  13. Formal Definition of Process P

  14. The New Model • In both cases yi is selected by preferential attachment within the target subset of old nodes, i.e. for x in U
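
The growth process outlined on slides 9 to 14 can be sketched as a small simulation. The formal definition on slide 13 did not survive in this transcript, so several details below are assumptions rather than the paper's definition: the start graph is a single vertex with m self-loops, a non-search-engine link picks its endpoint by preferential attachment over all old vertices, ties in the celebrity ranking are broken arbitrarily, and the names `simulate_process` and `celebrity_share` are hypothetical.

```python
import heapq
import random

def simulate_process(t_max, m, p, n_celebs, seed=0):
    """Grow an undirected multigraph for t_max steps; return the degree list."""
    rng = random.Random(seed)
    # Assumed start: one vertex carrying m self-loops (degree 2m).
    degree = [2 * m]
    endpoints = [0] * (2 * m)  # vertex v appears degree[v] times in this list
    for new in range(1, t_max):
        # Celebrity list: up to n_celebs highest-degree vertices of the previous graph.
        celebs = heapq.nlargest(n_celebs, range(new), key=degree.__getitem__)
        weights = [degree[v] for v in celebs]
        targets = []
        for _ in range(m):
            if rng.random() < p:
                # Search-engine link: preferential attachment inside the celebrity list.
                y = rng.choices(celebs, weights=weights)[0]
            else:
                # Ordinary link: preferential attachment over all old vertices
                # (assumed target set).
                y = rng.choice(endpoints)
            targets.append(y)
        # Update only after all m choices, so every choice uses the previous degrees.
        degree.append(m)
        for y in targets:
            degree[y] += 1
            endpoints.extend((new, y))
    return degree

def celebrity_share(degree, n_celebs):
    """Fraction of the total degree held by the n_celebs largest-degree vertices."""
    return sum(heapq.nlargest(n_celebs, degree)) / sum(degree)
```

For example, `celebrity_share(simulate_process(1500, 3, 0.6, 5), 5)` gives the fraction of total degree held by the five largest vertices after 1500 steps. Because the same vertex may be drawn more than once in a step, the sketch produces multi-edges, which matches the comment on slide 15.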

  15. The New Model - Comments • The m random edges may have duplicate endpoints: for different i, the same vertex may be selected! When t is smaller than m, we have a lot of loops. • Should we not start from one vertex? Instead, we could start from m vertices or N vertices, with the initial web graph created at random. • With high probability, the oldest pages become celebrity pages. • What happens in the real world? • A page becomes hot not only by chance but also because of its content. Can we model this??

  16. The simulation results • Very different from standard preferential attachment! • The celebrities are far from the power-law straight line in the log-log plot. • As p increases, the power increases as well! • p = 0: simulated power 2.8, computed power 3 • p = 0.3: simulated power 3.96, computed power 3.857 • p = 0.6: simulated power 5.9, computed power 6 • The celebrities command a constant fraction of the total degree over all nodes, and this fraction grows with p.
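
The three computed powers listed on this slide are consistent with a simple closed form in p. The transcript does not state the formula, so the expression below is an inference from the tabulated values rather than a quotation from the paper; the symbol γ is introduced here only for illustration.

```latex
\gamma(p) = 1 + \frac{2}{1-p}, \qquad
\gamma(0) = 1 + \frac{2}{1} = 3, \qquad
\gamma(0.3) = 1 + \frac{2}{0.7} \approx 3.857, \qquad
\gamma(0.6) = 1 + \frac{2}{0.4} = 6 .
```

At p = 0 this reduces to the power 3 of standard preferential attachment, and the power grows without bound as p approaches 1.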

  17. The simulation results

  18. Results

  19. Theorem 1

  20. Interpretations • Celebrities capture a large? (depends on the constant) fraction of links. • Non-celebrities follow a power-law degree distribution with a power steeper than in preferential attachment.

  21. The Proof • The celebrity list becomes fixed whp after some time tf • Once the celebrity list is fixed, process P looks very similar to an analogous process P*: • In each step, P* takes the N oldest vertices as St, instead of the N largest-degree vertices. • This is quite reasonable: the oldest vertices tend to have higher degree, since they have had more time to be linked to.
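
The only difference between P and P* described on this slide is how the list St is chosen. A minimal sketch of the two selection rules follows, assuming vertices are indexed in arrival order and that ties in degree are broken in favour of the older vertex; both function names and the tie-breaking rule are hypothetical.

```python
import heapq

def celebrities_by_degree(degree, n_celebs):
    """S_t in process P: the n_celebs vertices of largest degree."""
    # Ties are broken in favour of the older (lower-index) vertex (assumption).
    return heapq.nlargest(n_celebs, range(len(degree)),
                          key=lambda v: (degree[v], -v))

def celebrities_by_age(degree, n_celebs):
    """S_t in process P*: the n_celebs oldest vertices (index = arrival order)."""
    return list(range(min(n_celebs, len(degree))))

degree = [40, 35, 12, 50, 3]
print(celebrities_by_degree(degree, 2))  # [3, 0]
print(celebrities_by_age(degree, 2))     # [0, 1]
```

Once the two rules return the same set at every later step, the two processes can be coupled, which is the idea behind the slides that follow.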

  22. Coupling Gt and Gt*

  23. Analysis of the degree distribution of Gt*

  24. Basic Proof to Lemma 2 • Finding recurrence of • Finding a similar recurrence:

  25. Lemma 3

  26. Basic Proof to Lemma 3

  27. The celebrity list gets fixed • WHP, adding m edges to a single non-celebrity will not make it a celebrity. • The total degree of the celebrities is concentrated around a constant fraction of all edges ever added to the graph

  28. List-fixing Lemma

  29. Proof to Lemma 4

  30. Lemma 5

  31. Lemma 6 • With low probability, a celebrity has low degree

  32. Lemma 7 • With low probability, a non-celebrity has high degree

  33. Lemma 8 • With low probability, the gap stays small

  34. Proof of Theorem 1 • Let tf be the last time that St changes in the process P

  35. Proof of Theorem 1 cont.

  36. Proof of Theorem 1 cont.

  37. Proof of Theorem 1 cont.

  38. Proof of Theorem 1 cont.

  39. Proof of Theorem 1 cont.

  40. Conclusions • Modeling the influence of a search engine within the preferential attachment framework leads to a qualitative change in the familiar power-law degree distribution. • Each of a clot of celebrities captures a constant fraction of the total degree of the graph, and the degrees of the remaining nodes follow a steeper power law.

  41. Is this Model real? • The model differs from the reality. • Edges are undirected? • Outlinks are not modified after creation • Pages do not die • No topic-based clustering

  42. Comments • This model applies to only one topic • There may be interactions between topics • The author may include links for different topics in the same page • The number of links on a page is fixed, which is not realistic

  43. Thank you! Have a nice summer!
