1 / 40

The Efficacy of Collusions in Web Ranking and the Countermeasurements

The Efficacy of Collusions in Web Ranking and the Countermeasurements. Hui Zhang University of Southern California. Outline. Problem Statement. PageRank algorithm : a brief introduction. Study of PageRank’s robustness to collusion.

petula
Download Presentation

The Efficacy of Collusions in Web Ranking and the Countermeasurements

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Efficacy of Collusions in Web Ranking and the Countermeasurements Hui Zhang University of Southern California

  2. Outline • Problem Statement. • PageRank algorithm : a brief introduction. • Study of PageRank’s robustness to collusion. • Adaptive-resetting: make PageRank robust to collusion. • Conclusions. USC CS599

  3. Search Engine Optimization (SEO) USC CS599

  4. Web spam [Gyongyin et al. 2004] • Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. • A spammer will play with two factors which decide the rank score of a page in a query: • Relevance – textual similarity between the query and a page. • Importance – the global popularity of a page, which is query-independent. USC CS599

  5. Collusion in Web ranking • A manipulation of the hyperlink structure by a group of users with the intention of improving the rating one or more users in the group. USC CS599

  6. PageRank [Brin1998] • An eigenvector-based rating scheme to rank hypertext documents on the WWW. • An iterative algorithm to calculate the importance of a web page based on the importance of its parent pages. • Can be applied to other systems than WWW. USC CS599

  7. With prob. , I will restart the walk at a random node. : resetting probability : resetting probability • With prob. (1-), I will continue the walk to a random successor node. PageRank: random walk model node referential link The walker X 1/2 1/3 Z Y • As time goes on, the expected percentage of steps the walker is at each node vconverges to the PageRank weight PR(v). USC CS599

  8. I’m not colluding! • I’m not colluding! • I’m not colluding! • I’m not colluding! PageRank: is it collusion-proof? • Can a node easily boost its rank by manipulating its out-going links with others’? USC CS599

  9. Amp(G): a metric on group collusion : resetting probability real group weight + (1-) (1-) • “actual” group weight (1-W(G’)) • In the system of node group G, for a subgroup G’, • the amplification factor Amp(G’) = + y x G 2 j PR(x) PR(x) PR(y) PR(y) PR(y) i G’ 4 4 2 3 3 N WG(G’) =PR(i)+PR(j) Win(G’) = USC CS599

  10. Answer for (1+1 = ?) in PageRank • In the original PageRank system, • where  is the resetting probability. USC CS599

  11. Two experimental topologies • W, a Web link topology • Contains the link structure of upwards of 80 million URLs. • Source: the Stanford WebBase. • B, a weblog blogrolling topology • Contains the blogrolling structure of upwards of 72,000 blogs. • Source: www.blogstreet.com, the XML-RPC webblog service. USC CS599

  12. Experiment 1: Collusion200 • Model a small number of web pages simultaneously colluding. • Methodology: • 100 colluding groups of 200 nodes; • Each colluding group has the circle topology consisting of two nodes with adjacent ranks; • Arbitrarily chose node pairs originally ranked around 1000th, 2000th, …, 100000th. •  = 0.15. USC CS599

  13. Experiment result of Collusion200 (I) Figure 1: W - Amplification factors of the 100 colluding groups in Collusion200. USC CS599

  14. Old rank: 100009th New rank: 5038th Old rank: 1005th Old rank: 10001th New rank: 67th New rank: 450th Experiment result of Collusion200 (III) Figure 2: W – new PR rank after Collusion200. USC CS599

  15. There is a long flat portion… Figure 3: The PR weight distribution of 4 topologies. USC CS599

  16. Next step: how to detect collusions? • Identifying colluding groups is unlikely to be computationally tractable. • The densest k-subgraph problem[Feige et al. 1997]. • The classical CLIQUE problem. • The problem of finding hiding large cliques in random graphs[Juels 1998]. USC CS599

  17. Hardness on Amp • Theorem on Hardness. • Max G’G Amp(G’) is a NP-Hard problem. USC CS599

  18. How about using finer statistics of the random walk N-2 N-1 0 N N+1 1 2 A counterexample: a star+dangling circle topology Figure E: • The revisit intervals of the random walk on a colluding node will likely to have a large variance compared to its expectation. USC CS599

  19. G G’ An observation on collusion behaviors • To increase their PR weight, i.e., the stationary weight in the random walk, the colluding nodes will stall the random walk. • When the resetting probability  increases, the colluding nodes must suffer a significant drop in PR weight. • Therefore, we expect the PR weight of colluding nodes to be highly correlated with 1/  (the average walk length), while that of non-colluding nodes is relatively insensitive to the change in . USC CS599

  20. An intuitive example node referential link USC CS599

  21. An intuitive example node referential link A colluding group USC CS599

  22. y x • A colluding node x: PR(x) = , and co-co(PR(x), 1/ )  1. (co-co: correlation coefficient) • A non-colluding node y: PR(x) = , and co-co(PR(y), 1/ )  0. N: the system size; K: the colluding group size; K << N. An intuitive example node referential link A colluding group USC CS599

  23. Adaptive-resetting scheme • Part II –  personalization: • Calculate each node x's out-link personalized- = F(default, co-co(x)). • Exponential function FExp= . • Linear function FLinear= default+(0.5-default)*co-co(x) • The final PR weight vector is calculated with these personalized resetting values. • Part I – collusion detection: • Given the topology, calculate the PR vector under different  values. • {} = {0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6}, default = 0.15. • Calculate the correlation coefficient between the curve of each node x's PR weight and the curve of 1/ . Label it as co-co(x). USC CS599

  24. Experiment result of Collusion200 (IV) Figure 5: W - Amplification factors of the 100 colluding groups in Collusion200. USC CS599

  25. Experiment result of Collusion200 (VI) Figure 6: W – new PR rank after Collusion200. USC CS599

  26. Experiment 2: Collusion22 • Model various colluding subgraphs. • Methodology: • 3 colluding groups: node referential link G1: 10-node ring G2: 10-node star topology G3: 2-node ring USC CS599

  27. Experiment result of Collusion22 (I) Figure 7: Amplification factors of the 3 colluding groups in Collusion22. USC CS599

  28. Experiment result of Collusion22 (II) Figure 8: W – new PR weight after Collusion22. USC CS599

  29. Dropped out New top-25 URL list in W Dropping New USC CS599

  30. Conclusions • Simple collusions lead to effective Web ranking improvement. • A simple scheme based on PageRank algorithm effectively counteracts Web ranking collusions. USC CS599

  31. Backup slides USC CS599

  32. Reputation systems [Okita2003] • A means of describing social trust networks. • The basic concept is a democratic meritocracy. • A rating system is used to evaluate individual members, and those results are then collated to produce a consensus about the merit of any given member. • Examples: • Livejournal, Friendster, eBay, Advogato USC CS599

  33. PageRank algorithm [Brin1998] • Basic algorithm • v Rank(v) = • Enhanced algorithm against rank sinks • v Rank(v) = : damping factor • Assume N pages. • Assign all pages the initial value 1/N • Let Nu be the out-degree of Page u, Rank(v) the importance of Page v, Bv the set of pages pointing to v. USC CS599

  34. Co-co distribution in real-world graphs Figure 4: the co-co PDF distribution in W and B: the [0, 0.1] range actually corresponds to [-1, 0.1] range. USC CS599

  35. Experiment result of Collusion200 (II) Figure A: W – new PR weight after Collusion200. USC CS599

  36. Experiment result of Collusion200 (VII) Figure B: B – new PR rank after Collusion200 USC CS599

  37. Experiment result of Collusion200 (X) Figure C: B – new PR weight after Collusion200 USC CS599

  38. Experiment result of Collusion200 (V) Figure 6: W – new PR weight after Collusion200. USC CS599

  39. Correlation coefficient USC CS599

  40. Experiment result of Collusion22 (III) Figure D: W – new PR rank after Collusion22. USC CS599

More Related