
Know your Neighbors: Web Spam Detection using the Web Topology





  1. Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 2008.10.30. Seung-min Lee

  2. Contents 1 Introduction 2 Data Set 3 Features 4 Classification 5 Smoothing 6 Conclusion

  3. 1. Introduction • What is web spam? • Any malicious attempt to influence the outcome of ranking algorithms • Web spam is not a new problem, and it is not likely to be solved in the near future • How to detect it? • Traditional machine learning assumes that data instances are independent • On the Web, there are dependencies among pages and hosts

  4. 1. Introduction • Host graph (example figure) • White node: non-spam host • Black node: spam host • An edge: drawn when there are more than 100 links between the two hosts

  5. 1. Introduction • Previous work • Link spam • Creation of a link structure aimed at affecting the outcome of a link-based ranking algorithm • Content spam • Maliciously crafting the content of Web pages (e.g., inserting keywords); similar methods are used in e-mail spam filtering • Cloaking • Sending different content to a search engine than to the regular visitors of a web site

  6. 1. Introduction • Overall scheme: Data Set → Feature Extraction → Classification → Smoothing • Data Set: WEBSPAM-UK2006 (77.9 million pages, 3 billion links, 11,400 hosts), labeled at the host level • Feature Extraction: 236 features in total (140 link features + 96 content features) • Classification: decision tree (C4.5); result: 49 features, 0.723 F-measure • Smoothing: 3 techniques that use the link structure (graph clustering, propagation using random walks, stacked graphical learning); result: 40 features, 0.763 F-measure

  7. 2. Data Set • The publicly available WEBSPAM-UK2006 dataset • 77.9 million pages and over 3 billion links in about 11,400 hosts

  8. 2. Data Set • Measures, from a 2×2 confusion matrix with a = true negatives, b = false positives, c = false negatives, d = true positives (spam is the positive class) • Precision, Recall, F-measure • P = d / (b + d) • R = d / (c + d) • F = 2PR / (P + R) • True positive rate, false positive rate, ROC curve • TP = d / (c + d) • FP = b / (a + b) • Validation • Tenfold cross-validation
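To make the measures concrete, here is a minimal Python sketch under the confusion-matrix layout above; the helper name and the example counts are illustrative, not taken from the paper.

```python
# Evaluation measures from a 2x2 confusion matrix:
# a = true negatives, b = false positives, c = false negatives,
# d = true positives (spam is the positive class).
def evaluation_measures(a, b, c, d):
    precision = d / (b + d)                       # P = d / (b + d)
    recall = d / (c + d)                          # R = d / (c + d)
    f_measure = 2 * precision * recall / (precision + recall)
    tp_rate = d / (c + d)                         # identical to recall
    fp_rate = b / (a + b)
    return precision, recall, f_measure, tp_rate, fp_rate

# Example: 100 non-spam kept, 10 false alarms, 20 spam missed, 80 spam caught.
print(evaluation_measures(a=100, b=10, c=20, d=80))
```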

  9. 3. Features • Link-based features (140 features) • Using most of the 163 features of Becchetti et al. [4] • Degree-related measures (16/17) • Measures related to the in-degree and out-degree • 16 degree-related features • PageRank (11/28) • A link-based ranking algorithm that computes a score for each page • 11 PageRank-based features • TrustRank ( /35) • An algorithm that estimates a TrustRank (trust) score for each page • Using the algorithm, the spam mass of a page can also be estimated
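As a reference point for the PageRank-based features, here is a minimal power-iteration sketch; the damping factor, iteration count, and toy graph are illustrative assumptions, not the paper's configuration.

```python
# Minimal PageRank by power iteration over an adjacency dict.
def pagerank(out_links, damping=0.85, iterations=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if not targets:                 # dangling page: spread rank uniformly
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
        rank = new_rank
    return rank

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy))
```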

  10. 3. Features • Truncated PageRank ( /60) • A variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors • Estimation of supporters • x is a d-supporter of y if the shortest path from x to y has length at most d • Nd(x) is the set of the d-supporters of page x • |Nd(x)| is an increasing function with respect to d • Bottleneck number of page x: bd(x) = min{ |Nj(x)| / |Nj−1(x)| : j ≤ d } <Figure: example neighborhoods with bottleneck numbers 2.2 and 1.3~1.7>
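The paper estimates supporter counts with probabilistic counting on large graphs; the naive sketch below (exact BFS over reversed links, assumed helper names) is only meant to illustrate the definitions on a small graph.

```python
from collections import deque

def supporter_counts(in_links, y, max_d):
    """|N_d(y)| for d = 1..max_d: pages whose shortest path to y is <= d."""
    seen = {y}
    frontier = deque([(y, 0)])
    sizes = [0] * (max_d + 1)
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_d:
            continue
        for x in in_links.get(node, []):
            if x not in seen:               # BFS gives shortest distances
                seen.add(x)
                sizes[dist + 1] += 1
                frontier.append((x, dist + 1))
    for d in range(1, max_d + 1):           # cumulative: distance <= d
        sizes[d] += sizes[d - 1]
    return sizes[1:]

def bottleneck(sizes):
    """b_d = min over j <= d of |N_j|/|N_{j-1}|, with |N_0| = 1.
    Assumes every level is non-empty."""
    prev, best, out = 1, float("inf"), []
    for s in sizes:
        best = min(best, s / prev)
        out.append(best)
        prev = s
    return out
```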

  11. 3. Features • Content-based features (24 features) • Using most of the features of Ntoulas et al. [22] • Number of words in the page • Number of words in the title • Average word length • Fraction of anchor text • If page A has a link with the anchor text "computer" pointing to page B, then we may conclude that page B talks about "computer" • Compression rate • Some search engines give higher weight to pages containing the query keywords several times, so spammers repeat keywords; repetitive text compresses unusually well, which the compression rate exposes
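A rough sketch of a few of these per-page features; using zlib as the compressor and these function and field names are assumptions for illustration, not the paper's code.

```python
import zlib

def page_content_features(title, body_words, anchor_words):
    text = " ".join(body_words).encode("utf-8")
    compressed = zlib.compress(text)
    return {
        "num_words": len(body_words),
        "num_words_title": len(title.split()),
        "avg_word_length": sum(map(len, body_words)) / max(len(body_words), 1),
        "anchor_text_fraction": len(anchor_words) / max(len(body_words), 1),
        # Keyword-stuffed (highly repetitive) text compresses unusually well.
        "compression_rate": len(text) / max(len(compressed), 1),
    }
```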

  12. 3. Features • Corpus precision & recall • Against the k most frequent words in the dataset, for k = 100, 200, 500, 1000 • Query precision & recall • Against the q most popular terms in a query log, for q = 100, 200, 500, 1000 • Example: a 200-word page sharing 10 terms with the k = 100 most frequent words gives precision 10/200 = 0.05 (fraction of page words that are popular) and recall 10/100 = 0.1 (fraction of popular terms present) • Independent trigram likelihood: −(1/k) Σ_{t∈T(p)} log P(t), where P(t) is the probability distribution of trigrams in a page, T(p) is the set of all trigrams in the page, and k is the number of distinct trigrams • Entropy of trigrams: H = −Σ_{t∈T(p)} P(t) log P(t) <Figure: histogram of the query precision for k = 500; fraction of pages vs. precision, spam and non-spam hosts>
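A sketch of these measures under the stated definitions; tokenization, the use of distinct words in the precision denominator, and the function names are assumptions.

```python
import math
from collections import Counter

def corpus_precision_recall(page_words, popular_terms):
    """Precision: fraction of (distinct) page words that are popular terms;
    recall: fraction of popular terms appearing in the page."""
    page, popular = set(page_words), set(popular_terms)
    common = page & popular
    return len(common) / len(page), len(common) / len(popular)

def trigram_measures(words):
    """Independent trigram likelihood and entropy; assumes len(words) >= 3."""
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    probs = [c / total for c in trigrams.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    indep_lh = -sum(math.log2(p) for p in probs) / len(probs)
    return indep_lh, entropy
```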

  13. 3. Features • From page features to host features • The content-based feature vector c(h) of host h is built from four 24-dimensional page feature vectors: • the home page of host h • the page with the largest PageRank among all pages in h • the average of the content feature vectors of all pages in h • the variance of those vectors • This gives 96 (= 4 × 24) content features per host • In total, 140 + 96 = 236 link and content features
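A sketch of this aggregation, assuming the page vectors arrive as a NumPy matrix and the row indices of the home page and of the maximum-PageRank page are already known.

```python
import numpy as np

def host_vector(page_vectors, home_idx, max_pr_idx):
    """page_vectors: (num_pages, 24) matrix of per-page content features."""
    mat = np.asarray(page_vectors, dtype=float)
    return np.concatenate([
        mat[home_idx],        # home page of the host
        mat[max_pr_idx],      # page with the largest PageRank in the host
        mat.mean(axis=0),     # average over all pages of the host
        mat.var(axis=0),      # variance over all pages of the host
    ])                        # 4 x 24 = 96 content features per host
```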

  14. 4. Classification • C4.5 (decision tree) • The resulting tree used 45 unique features (Table 1) • 18 of them are content features
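The slides cite C4.5; as a hedged stand-in, scikit-learn's CART tree with the entropy criterion plus tenfold cross-validation gives a comparable setup. The synthetic data below only makes the sketch runnable; it is not the WEBSPAM-UK2006 data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 236))    # stand-in for the 236 host features
y = rng.integers(0, 2, size=500)   # stand-in for spam / non-spam labels

clf = DecisionTreeClassifier(criterion="entropy")  # entropy ~ C4.5's info gain
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")  # tenfold CV
print(scores.mean())
```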

  15. 5. Smoothing • Smoothing: using the link structure of the graph in a different way • Topological dependency • Non-spam nodes usually link to non-spam nodes • Spam nodes are mainly linked to by spam nodes <Figure: histogram of the fraction of spam hosts in the links of non-spam vs. spam hosts; (a) fraction of spam nodes in out-links, (b) fraction of spam nodes in in-links>
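The quantity behind these histograms is easy to compute directly; a minimal sketch, assuming dictionaries that map each host to its neighbors and to its label.

```python
def spam_fraction(neighbors, is_spam):
    """neighbors: host -> list of out-link (or in-link) neighbor hosts;
    is_spam: host -> bool label. Returns host -> fraction of spam neighbors."""
    fractions = {}
    for h, ns in neighbors.items():
        if ns:  # skip hosts with no links in this direction
            fractions[h] = sum(is_spam[n] for n in ns) / len(ns)
    return fractions
```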

  16. 5. Smoothing • Clustering • Using the METIS graph clustering algorithm [18] • Partitioning the 11,400 hosts of the graph into 1,000 clusters
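A hedged sketch of one plausible cluster-based smoothing step: if most hosts in a cluster are predicted spam, relabel the whole cluster as spam, and vice versa. The thresholds are illustrative, not the paper's tuned values.

```python
from collections import defaultdict

def smooth_by_cluster(prediction, cluster_of, spam_thr=0.7, nonspam_thr=0.3):
    """prediction: host -> predicted spamicity in [0, 1];
    cluster_of: host -> cluster id from the graph partitioning."""
    members = defaultdict(list)
    for host, c in cluster_of.items():
        members[c].append(host)
    smoothed = dict(prediction)
    for hosts in members.values():
        avg = sum(prediction[h] for h in hosts) / len(hosts)
        if avg >= spam_thr:         # mostly spam: relabel whole cluster
            for h in hosts:
                smoothed[h] = 1.0
        elif avg <= nonspam_thr:    # mostly non-spam: relabel whole cluster
            for h in hosts:
                smoothed[h] = 0.0
    return smoothed
```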

  17. 5. Smoothing • Propagation • Using propagation by random walks [32] • At each step, follow a link with probability α, or return to a spam node with probability 1 − α
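A sketch of this propagation as a personalized PageRank whose restart distribution is concentrated on the known spam hosts; α, the iteration count, and the function name are assumptions for illustration.

```python
def propagate_from_spam(out_links, spam_seeds, alpha=0.85, iterations=50):
    """out_links must contain every host as a key; spam_seeds is a set of
    hosts labeled spam. Restart mass goes only to spam seeds."""
    nodes = list(out_links)
    restart = {v: (1.0 / len(spam_seeds) if v in spam_seeds else 0.0)
               for v in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {v: (1.0 - alpha) * restart[v] for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = alpha * score[v] / len(targets)
                for u in targets:
                    nxt[u] += share
        score = nxt
    return score   # a high score means the host is well connected to spam
```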

  18. 5. Smoothing • Stacked graphical learning • A meta-learning scheme proposed recently by Kou [8] • Uses a base learning scheme and generates a set of extra features • The extra feature: the average predicted spamicity of r(h) • p(h): prediction for h • r(h): set of pages related to h • The resulting tree uses 40 features, of which 20 are content features (about a 5.5% improvement in F-measure)
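A sketch of the extra-feature computation; taking r(h) to be simply the link neighbors of h, and p(h) the first-pass prediction in [0, 1], are assumptions for illustration.

```python
def stacked_feature(p, r):
    """p: host -> first-pass predicted spamicity in [0, 1];
    r: host -> hosts related to h (here: link neighbors, an assumption)."""
    f = {}
    for h, related in r.items():
        known = [p[g] for g in related if g in p]
        f[h] = sum(known) / len(known) if known else p.get(h, 0.0)
    return f

# The classifier is then retrained with f(h) appended to the original
# feature vector; the procedure can be iterated for further passes.
```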

  19. 6. Conclusion • Contributions • First paper that integrates link-based and content-based features for web spam detection • Diverse smoothing algorithms, especially stacked graphical learning • Discussion • Low detection rate compared to intrusion detection • A publicly available dataset • Feature selection using statistical approaches • Research on each web spam category separately

  20. Thank you! Questions?
