Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2)
1. Yahoo! Research Barcelona – Catalunya, Spain
2. ISTI-CNR – Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
Presented By,
There is a fierce competition for your attention!
Publishing on the Web is easy, for personal content as well as for commercial content, advertisements, and economic activity.
…and there’s lots lots lots lots…lots of spam!
Every undeserved gain in ranking for a spammer is a loss of search precision for the search engine.
Single-level farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005]
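A minimal sketch of the idea of grouping nodes by shared out-links. It is a simplification for illustration, not the method of [Gibson et al., 2005]: it only groups nodes whose out-link sets are exactly identical, and the function name, the `min_size` parameter, and the toy graph are all assumptions.

```python
from collections import defaultdict

def groups_sharing_out_links(out_links, min_size=2):
    """Group nodes whose out-link sets are exactly identical.

    out_links maps each node to the set of nodes it links to.
    Exact set equality is an assumption made to keep the sketch short.
    """
    groups = defaultdict(list)
    for node, targets in out_links.items():
        if targets:  # nodes without out-links carry no signal here
            groups[frozenset(targets)].append(node)
    return [sorted(nodes) for nodes in groups.values() if len(nodes) >= min_size]

# Hypothetical toy graph: a, b and c all link only to "target".
toy_graph = {
    "a": {"target"},
    "b": {"target"},
    "c": {"target"},
    "d": {"x", "y"},
    "target": set(),
}
farms = groups_sharing_out_links(toy_graph, min_size=3)
```

On this toy graph the only group returned is the three nodes that point at the same single target.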
Measures are related to in-degree and out-degree
Edge-reciprocity (the number of links that are reciprocal)
Assortativity (the ratio between the degree of a particular page and the average degree of its neighbors)
TrustRank: an algorithm that starts from a set of known trusted pages and propagates trust along links. Each page receives a TrustRank score that reflects how closely it is connected to the trusted pages.
Ratio between TrustRank and PageRank
Number of home pages.
Cons: this alone is not sufficient, as there are many false positives.
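Two of the degree-based measures listed above can be sketched as follows. The slides do not say which degree is meant for assortativity, so this sketch assumes the undirected degree (in-links and out-links merged); the function names and the toy graph are hypothetical.

```python
from collections import defaultdict

def edge_reciprocity(out_links, node):
    """Number of out-links of `node` that are answered by a link back."""
    return sum(
        1 for target in out_links.get(node, ())
        if node in out_links.get(target, ())
    )

def assortativity(out_links, node):
    """Degree of `node` divided by the average degree of its neighbors.

    Assumption: degree is counted on the undirected version of the graph.
    """
    neighbors = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    mine = neighbors[node]
    if not mine:
        return 0.0  # isolated node: the ratio is undefined, report 0
    average = sum(len(neighbors[n]) for n in mine) / len(mine)
    return len(mine) / average

# Hypothetical toy graph: a links to b and c, only b links back.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": set()}
reciprocal = edge_reciprocity(graph, "a")
ratio = assortativity(graph, "a")
```

Here `a` has one reciprocal link, and its degree (2) is twice the average degree of its neighbors (1).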
Most of the features reported in [Ntoulas et al., 2006]
F: set of most frequent terms in the collection
Q: set of most frequent terms in a query log
P: set of terms in a page
corpus precision: the fraction of words (except stopwords) in a page that appear in the set of popular terms of a data collection.
corpus recall: the fraction of popular terms of the data collection that appear in the page.
query precision: the fraction of words in a page that appear in the set of q most popular terms appearing in a query log.
query recall: the fraction of q most popular terms of the query log that appear in the page.
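The four definitions above share one shape: precision is measured over the words of the page, recall over the set of popular terms (F for the corpus, Q for the query log). A minimal sketch, assuming precision counts every word occurrence in the page rather than distinct words; the function name, the `stopwords` parameter, and the sample data are hypothetical.

```python
def precision_recall(page_words, popular_terms, stopwords=frozenset()):
    """Return (precision, recall) of a page against a set of popular terms.

    precision: fraction of page words (stopwords excluded) found in popular_terms
    recall:    fraction of popular_terms that occur somewhere in the page
    Assumption: precision is computed over word occurrences, not distinct words.
    """
    words = [w for w in page_words if w not in stopwords]
    popular = set(popular_terms)
    if not words or not popular:
        return 0.0, 0.0
    precision = sum(1 for w in words if w in popular) / len(words)
    recall = len(popular & set(words)) / len(popular)
    return precision, recall

# Hypothetical page P and popular-term set F (or Q for a query log).
P = ["cheap", "pills", "the", "cheap", "offer"]
F = {"cheap", "offer", "news", "free"}
corpus_precision, corpus_recall = precision_recall(P, F, stopwords={"the"})
```

Passing the query-log set Q instead of F gives query precision and query recall with the same function.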
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
Let S_OUT(x) be the fraction of spam hosts linked by host x out of all labeled hosts linked by host x. This figure shows the histogram of S_OUT for spam and non-spam hosts. We see that almost all non-spam hosts link mostly to non-spam hosts.
Let S_IN(x) be the fraction of spam hosts that link to host x out of all labeled hosts that link to x. This figure shows the histograms of S_IN for spam and non-spam hosts. In this case there is a clear separation between spam and non-spam hosts.
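The two fractions defined above can be computed directly from a host graph and a set of labels. A minimal sketch; the data layout (a dict of out-link sets and a dict of "spam"/"nonspam" labels), the function names, and the toy hosts are assumptions. Unlabeled hosts are ignored, and `None` is returned when a host has no labeled neighbors.

```python
def spam_fraction(hosts, labels):
    """Fraction of spam among the labeled hosts in `hosts` (None if none are labeled)."""
    labeled = [h for h in hosts if h in labels]
    if not labeled:
        return None
    return sum(1 for h in labeled if labels[h] == "spam") / len(labeled)

def s_out(out_links, labels, x):
    """Spam fraction among the labeled hosts that x links to."""
    return spam_fraction(out_links.get(x, ()), labels)

def s_in(out_links, labels, x):
    """Spam fraction among the labeled hosts that link to x."""
    in_neighbors = [h for h, targets in out_links.items() if x in targets]
    return spam_fraction(in_neighbors, labels)

# Hypothetical host graph; host "u" is unlabeled and therefore ignored.
host_links = {
    "s1": {"s2", "x"},
    "s2": {"x"},
    "n1": {"x", "n2"},
    "x": {"n2", "u"},
    "n2": set(),
}
host_labels = {"s1": "spam", "s2": "spam", "n1": "nonspam", "n2": "nonspam"}
x_out = s_out(host_links, host_labels, "x")
x_in = s_in(host_links, host_labels, "x")
```

In this toy graph, host `x` links only to non-spam among its labeled targets, while two of the three labeled hosts linking to it are spam.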
Clustering: if the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
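The majority rule above can be sketched as a post-processing step over the classifier's predictions. This is a simplified illustration using a plain majority vote; how the clusters are obtained is outside the sketch, ties are left unchanged by assumption, and the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def smooth_by_cluster(predictions, cluster_of):
    """Overwrite each host's prediction with the majority label of its cluster.

    predictions maps host -> "spam" or "nonspam"; cluster_of maps host -> cluster id.
    Assumption: on a tie the original predictions are kept.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        votes = Counter(predictions[h] for h in hosts)
        if votes["spam"] > votes["nonspam"]:
            label = "spam"
        elif votes["nonspam"] > votes["spam"]:
            label = "nonspam"
        else:
            continue  # tie: leave this cluster untouched
        for h in hosts:
            smoothed[h] = label
    return smoothed

# Hypothetical predictions and cluster assignment.
predicted = {"a": "spam", "b": "spam", "c": "nonspam",
             "d": "nonspam", "e": "nonspam", "f": "spam"}
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(predicted, clusters)
```

Host `c` is flipped to spam because its cluster is mostly spam, and host `f` is flipped to non-spam for the inverse reason.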
Combine the regularization methods at hand (regularization being any method of preventing a model from overfitting the data) in order to improve the overall accuracy.
Why is Spam bad?
How Do We Detect Spam?