
Web Spam Detection with Anti-Trust Rank

Vijay Krishnan and Rashmi Raj, Computer Science Department, Stanford University. The World Wide Web: huge; distributed content creation and linking (no coordination); structured databases, unstructured text, semi-structured data.


Presentation Transcript


  1. Web Spam Detection with Anti-Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University

  2. The World Wide Web • Huge • Distributed content creation, linking (no coordination) • Structured databases, unstructured text, semi-structured data. • Content includes truth, lies, obsolete information, contradictions, …

  3. PageRank • Intuition: “a page is important if important pages link to it.” • In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web. (A few fixups needed.)

  4. PageRank • Web graph encoded by matrix M • N×N matrix (N = number of web pages) • M_ij = 1/|O(j)| iff there is a link from j to i • M_ij = 0 otherwise • O(j) = set of pages that node j links to • Define matrix A as follows • A_ij = βM_ij + (1-β)/N, where 0<β<1 • 1-β is the “tax” discussed in prior lecture • PageRank r is the principal eigenvector of A • Ar = r
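The definition of A on this slide can be turned into a short power iteration. The following is a minimal sketch, not the presenters' code: the function name `pagerank`, the toy graph, the choice β = 0.85, and the assumption that every page has at least one out-link are all illustrative.

```python
def pagerank(out_links, beta=0.85, iterations=100):
    """Power iteration for r = A r with A_ij = beta * M_ij + (1 - beta) / N.

    out_links maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform vector
    for _ in range(iterations):
        # The (1 - beta) / N term: the "tax" is redistributed uniformly.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            # Page j splits beta * r_j equally among the pages in O(j).
            share = beta * rank[j] / len(out_links[j])
            for i in out_links[j]:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical three-page web, used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

Because the ranks start out summing to 1 and each step only redistributes them, the result stays a probability distribution over pages.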

  5. Many Random Walkers Model • Imagine a large number M of independent, identical random walkers (M ≫ N) • At any point in time, let M(p) be the number of random walkers at page p • The page rank of p is the fraction of random walkers that are expected to be at page p, i.e., E[M(p)]/M.
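The walker picture can be simulated directly. This is a rough sketch under stated assumptions: each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (matching the (1-β)/N term in A); the function name `simulate_walkers`, the walker count, the step count, and the toy graph are hypothetical.

```python
import random


def simulate_walkers(out_links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Estimate page rank as the fraction of walkers found on each page."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    # Drop every walker on a uniformly random starting page.
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w in range(num_walkers):
            if rng.random() < beta:
                # Follow one of the current page's out-links at random.
                positions[w] = rng.choice(out_links[positions[w]])
            else:
                # Pay the "tax": jump to a random page anywhere on the web.
                positions[w] = rng.choice(pages)
    return {p: positions.count(p) / num_walkers for p in pages}


# Hypothetical three-page web, used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(toy_graph)
```

With enough walkers, the observed fractions should approach the entries of the eigenvector r, up to sampling noise.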

  6. Economic Considerations • Search has become the default gateway to the web • Very high premium to appear on the first page of search results • e.g., e-commerce sites • advertising-driven sites

  7. What is Web Spam? • Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with page’s real value • Spam = web pages that are the result of spamming • This is a very broad definition • SEO industry might disagree! • SEO = search engine optimization • Approximately 10-15% of web pages are spam

  8. Types of Spamming Techniques • Term spamming • Manipulating the text of web pages in order to appear relevant to queries • Link spamming • Creating link structures that boost page rank or hubs and authorities scores

  9. Link Spam • Three kinds of web pages from a spammer’s point of view • Inaccessible pages • Accessible pages • e.g., web log comments pages • spammer can post links to his pages • Own pages • Completely controlled by spammer • May span multiple domain names

  10. Link Spam Detection • Open research area • One approach: TrustRank

  11. Trust Rank • Basic principle: approximate isolation • It is rare for a “good” page to point to a “bad” (spam) page • Sample a set of “seed pages” from the web • Set the trust of each trusted (good) seed page to 1 • Propagate trust through links • Each page gets a trust value between 0 and 1 • Use a threshold value and mark all pages below the trust threshold as spam
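The propagation step can be sketched as a PageRank-style iteration whose restart mass goes only to the trusted seeds. This is an illustration under assumptions, not the exact procedure from the slide: seed trust is normalized to sum to 1 rather than set to 1 per page, and the names `trust_rank`, `flag_low_trust`, the toy graph, and the threshold are hypothetical.

```python
def trust_rank(out_links, trusted_seeds, beta=0.85, iterations=50):
    """Propagate trust forward along links, restarting at trusted seeds."""
    pages = sorted(out_links)
    seeds = set(trusted_seeds)
    # Restart distribution: uniform over the trusted seeds, zero elsewhere.
    restart = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(restart)
    for _ in range(iterations):
        new_trust = {p: (1.0 - beta) * restart[p] for p in pages}
        for j in pages:
            targets = out_links[j]
            if not targets:
                continue  # dead end: the trust held by j is not passed on
            share = beta * trust[j] / len(targets)
            for i in targets:
                new_trust[i] += share
        trust = new_trust
    return trust


def flag_low_trust(trust, threshold):
    """Mark every page whose trust falls below the threshold as spam."""
    return {p for p, t in trust.items() if t < threshold}


# Hypothetical graph: good pages never link to spam (approximate isolation).
toy_graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(toy_graph, trusted_seeds=["good1"])
suspects = flag_low_trust(trust, threshold=1e-6)
```

In this toy graph no trusted page links to the spam pages, so they receive no trust and fall below any positive threshold.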

  12. Anti-Trust Approach • Broadly based on the same “approximate isolation principle” • This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves. • Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages. • A page can be classified as a spam page if it has Anti-Trust Rank value more than a chosen threshold value.
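Reverse propagation can be sketched by running a seed-biased iteration on the transposed graph, so that a page's score flows to the pages that link to it. This is a minimal sketch under assumptions: the damping value, the uniform restart over spam seeds, and the names `reverse_graph`, `anti_trust_rank`, and the toy graph are illustrative, not taken from the slides.

```python
def reverse_graph(out_links):
    """Build in_links: for each page, the list of pages that link to it."""
    in_links = {p: [] for p in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    return in_links


def anti_trust_rank(out_links, spam_seeds, beta=0.85, iterations=50):
    """Propagate anti-trust backward along incoming links from spam seeds."""
    in_links = reverse_graph(out_links)
    pages = sorted(out_links)
    seeds = set(spam_seeds)
    restart = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    score = dict(restart)
    for _ in range(iterations):
        new_score = {p: (1.0 - beta) * restart[p] for p in pages}
        for j in pages:
            sources = in_links[j]
            if not sources:
                continue  # nobody links to j, so its score goes nowhere
            # Page j passes its anti-trust to the pages that point at it.
            share = beta * score[j] / len(sources)
            for i in sources:
                new_score[i] += share
        score = new_score
    return score


# Hypothetical link farm: farm1 and farm2 boost "target"; good pages do not.
toy_graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["target"],
    "farm2": ["target"],
    "target": ["farm1", "good1"],
}
scores = anti_trust_rank(toy_graph, spam_seeds=["target"])
```

Here the farm pages pick up anti-trust because they point at the seed, while the good pages stay at zero even though the spam page links to one of them.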

  13. Seed Set selection • Seed spam set chosen from pages with high page rank. • Nearly 100% of URLs containing certain terms like {viagra, gambling, hardporn} as substrings are spam. Use these for evaluation. • Also, some seed pages were chosen by an oracle (human expert).
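The URL-substring heuristic used for evaluation could look roughly like the sketch below. The helper names `looks_like_spam_url` and `spam_fraction_in_top_k`, the case-insensitive matching, and the example URLs are assumptions made for illustration.

```python
# Terms listed on the slide; URLs containing them are treated as spam labels.
SPAM_TERMS = ("viagra", "gambling", "hardporn")


def looks_like_spam_url(url):
    """Return True if any spam term occurs as a substring of the URL."""
    lowered = url.lower()  # assumption: match case-insensitively
    return any(term in lowered for term in SPAM_TERMS)


def spam_fraction_in_top_k(ranked_urls, k):
    """Fraction of the first k URLs in a ranking that match the heuristic."""
    top = ranked_urls[:k]
    if not top:
        return 0.0
    return sum(looks_like_spam_url(u) for u in top) / len(top)


# Hypothetical ranking, e.g. pages sorted by descending anti-trust score.
ranking = [
    "http://cheap-viagra.example/",
    "http://news.example/",
    "http://best-GAMBLING-site.example/",
    "http://university.example/",
]
fraction = spam_fraction_in_top_k(ranking, k=4)
```

A measure of this kind is what a statement such as "% of spam pages in the top 1000 pages" would be computed from, with the heuristic standing in for manual labels.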

  14. Results • Overall percentage of “spam” pages = 0.28% • Ratio of the average page rank of “spam” pages to the average page rank of all pages = 2.6 • % of “spam” pages in: • Top 1000 Anti-Trust Rank pages = 25.3% • Bottom 1000 Trust Rank pages = 0.68% • Ratio of the average page rank of spam pages returned by ATR vs. TR is roughly 6.

  15. Results

  16. References • The PageRank Citation Ranking: Bringing Order to the Web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University, 1998. • Combating Web Spam with TrustRank. Zoltan Gyongyi, Hector Garcia-Molina and Jan Pedersen. In VLDB 2004. • Topic-Sensitive PageRank. Taher Haveliwala. In WWW 2002. • The WebGraph dataset. Online at: http://webgraph-data.dsi.unimi.it/
