
Know your Neighbors: Web Spam Detection using the Web Topology


Presentation Transcript


  1. Know your Neighbors: Web Spam Detection using the Web Topology Carlos Castillo, chato@yahoo-inc.com Debora Donato, debora@yahoo-inc.com Aristides Gionis, gionis@yahoo-inc.com Vanessa Murdock, vmurdock@yahoo-inc.com Fabrizio Silvestri, f.silvestri@isti.cnr.it Presented by Anton Rodriguez-Dmitriev

  2. Personal Background • Graduated from FSU • Working on an MSECE • Specializing in Controls • CS minor • Work part-time at STW Technic, LP

  3. Web Spam Consequences • Damages the reputation of the search engine • Weakens users' trust in search results • Eiron et al. ranked 100 million pages using PageRank: 11 of the top 20 were pornographic pages • PageRank alone cannot filter spam • Crawling, indexing and storing spam pages incurs real cost

  4. Some popular spamming techniques • Link Spam: creating a link structure, usually a tightly knit community of links, to affect the outcome of a link-based ranking algorithm • Content Spam: maliciously crafting the content of a Web page, e.g. keyword stuffing: inserting keywords related to popular queries • Cloaking: sending different content to a search engine than to a regular visitor of the website

  5. Topology of the Dataset • Used the WEBSPAM-UK2006 dataset: a publicly available spam collection • Undirected host graph • Pruned to contain only hosts that share more than 100 links • Black nodes are spam and white nodes are non-spam • Most spammers in the largest connected component are clustered together • The other connected components are single-class
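As an aside (not from the original slides), here is a minimal Python sketch of this pruning step using networkx. The file name hostgraph.txt and its "src dst link_count" line format are illustrative assumptions, not the actual WEBSPAM-UK2006 distribution format.

    import networkx as nx

    G = nx.Graph()                        # undirected, as on the slide
    with open("hostgraph.txt") as f:      # hypothetical "src dst link_count" lines
        for line in f:
            src, dst, count = line.split()
            if int(count) > 100:          # keep only host pairs sharing > 100 links
                G.add_edge(src, dst)

    comps = sorted(nx.connected_components(G), key=len, reverse=True)
    print(f"{G.number_of_nodes()} hosts; largest component: {len(comps[0])} hosts")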

  6. Evaluation of the process • Confusion matrix (spam is the positive class):

                        Predicted non-spam    Predicted spam
        True non-spam           a                   b
        True spam               c                   d

  • a is the number of non-spam examples that were correctly classified • b is the number of non-spam examples that were falsely classified as spam • c is the number of spam examples that were falsely classified as non-spam • d is the number of spam examples that were correctly classified
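A small Python sketch (illustrative, not from the slides) of tallying these four counts from true and predicted labels:

    def confusion_counts(y_true, y_pred):
        # Labels: 1 = spam, 0 = non-spam; spam is the positive class.
        pairs = list(zip(y_true, y_pred))
        a = sum(t == 0 and p == 0 for t, p in pairs)   # non-spam, correct
        b = sum(t == 0 and p == 1 for t, p in pairs)   # non-spam flagged as spam
        c = sum(t == 1 and p == 0 for t, p in pairs)   # spam missed
        d = sum(t == 1 and p == 1 for t, p in pairs)   # spam, correct
        return a, b, c, d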

  7. Success Measures • True positive rate (or recall): R = d / (c + d) • False positive rate: b / (a + b) • Precision: P = d / (b + d) • F-measure: F = 2PR / (P + R)
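These are the standard definitions with spam as the positive class; a sketch computing them from the counts a, b, c, d above:

    def success_measures(a, b, c, d):
        recall = d / (c + d)                       # true-positive rate
        fp_rate = b / (a + b)                      # false-positive rate
        precision = d / (b + d)
        f_measure = 2 * precision * recall / (precision + recall)
        return recall, fp_rate, precision, f_measure

    print(success_measures(a=90, b=10, c=5, d=45))  # (0.9, 0.1, 0.818..., 0.857...)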

  8. Link-based Features • Degree-related measures: in-degree and out-degree of the hosts and their neighbors • Edge reciprocity: the number of links that are reciprocal • Assortativity: the ratio between the degree of a page and the average degree of its neighbors • PageRank • TrustRank: uses a subset of hand-picked trusted nodes and propagates their labels through the Web graph • Truncated PageRank: a variant of PageRank that reduces the influence of a page's close neighbors on its score
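A hedged sketch of a few of these features on a directed host graph using networkx; TrustRank and Truncated PageRank have no off-the-shelf networkx implementation and are omitted here.

    import networkx as nx

    def link_features(H, host, pagerank):
        in_deg, out_deg = H.in_degree(host), H.out_degree(host)
        # Edge reciprocity: fraction of this host's out-links that are returned.
        reciprocity = (sum(H.has_edge(v, host) for v in H.successors(host)) / out_deg
                       if out_deg else 0.0)
        # Assortativity: the host's degree over the average degree of its neighbors.
        neigh = set(H.successors(host)) | set(H.predecessors(host))
        avg_deg = sum(H.degree(v) for v in neigh) / len(neigh) if neigh else 0.0
        assortativity = H.degree(host) / avg_deg if avg_deg else 0.0
        return in_deg, out_deg, reciprocity, assortativity, pagerank[host]

    H = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c")])   # toy example graph
    pr = nx.pagerank(H)          # computed once for the whole graph
    print(link_features(H, "b", pr))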

  9. Link-based Features • Estimation of supporters: • Given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d • N_d(x) is the set of d-supporters of page x • Spam pages have a smaller bottleneck than non-spam pages • Bottleneck number: b_d(x) = min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|
  [Figure: histogram of b_4(x) for spam vs. non-spam hosts]
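A sketch of the bottleneck number computed by exact breadth-first search; note the paper estimates supporter counts probabilistically for scalability, so this exact version is for illustration only.

    from collections import deque

    def bottleneck_number(G, x, d=4):
        # BFS from x; level_size[j] = |N_j(x)|, the number of j-supporters of x.
        dist, level_size = {x: 0}, [1]
        queue = deque([x])
        while queue:
            u = queue.popleft()
            if dist[u] == d:
                continue
            for v in G.neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    if dist[v] == len(level_size):
                        level_size.append(0)
                    level_size[dist[v]] += 1
                    queue.append(v)
        # b_d(x) = min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|
        ratios = [level_size[j] / level_size[j - 1] for j in range(1, len(level_size))]
        return min(ratios) if ratios else 0.0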

  10. Content-based Features • Most interesting features presented: • Using the k most frequent words in the dataset, excluding stopwords: • Corpus precision: the fraction of words in a page that appear in the set of popular terms • Corpus recall: the fraction of popular terms that appear in the page • Using the set of q most popular terms in a query log: • Query precision and query recall: analogous to corpus precision and recall • k and q were set to 100, 200, 500 and 1000
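A sketch of the precision/recall computation for a single page; the same function covers the corpus variant (top-k corpus words) and the query variant (top-q query-log terms), with the example inputs below being illustrative.

    def precision_recall(page_tokens, popular_terms):
        popular = set(popular_terms)
        # Precision: fraction of the page's words that are popular terms.
        precision = sum(t in popular for t in page_tokens) / len(page_tokens)
        # Recall: fraction of the popular terms that appear in the page.
        recall = len(popular & set(page_tokens)) / len(popular)
        return precision, recall

    page = "buy cheap pills buy cheap now".split()
    print(precision_recall(page, {"buy", "cheap", "free", "online"}))  # (0.666..., 0.5)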

  11. Content-based Features • The best features are corpus precision and query precision • All features were judged based only on histograms
  [Figure: histogram of the query precision in non-spam vs. spam pages for q = 500]

  12. Classifiers • Cost-sensitive decision tree • Cost of zero for correctly classifying an instance • Misclassifying spam as normal is R times more costly than classifying a normal host as spam • R can be used to tune the balance between the true-positive rate and the false-positive rate • Used "bagging" to help reduce the false-positive rate
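A sketch of this setup in scikit-learn; the original work used a cost-sensitive C4.5 tree in Weka, so here the cost ratio R is only approximated with class weights, and the estimator keyword assumes scikit-learn 1.2 or later.

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    R = 5  # cost ratio: tune to trade true-positive rate vs. false-positive rate
    base = DecisionTreeClassifier(class_weight={0: 1, 1: R})   # 1 = spam
    clf = BaggingClassifier(estimator=base, n_estimators=10)
    # clf.fit(X_train, y_train); y_pred = clf.predict(X_test)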

  13. Conclusion • Experimental evidence supports two hypotheses: • Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes • Spam nodes are mainly linked by spam nodes • These tendencies can be exploited to improve spam detection • Using multiple features, both link-based and content-based, provided better detection • The balance of errors can be tuned by adjusting the cost matrix

  14. Critique • The article presented many features, both link-based and content-based, that can be used for spam detection, as well as techniques that exploit the graph topology (smoothing) • The results showed which features and optimizations were effective • The dataset used is now dated, so there is no indication of how well the methods would handle newer or more sophisticated spamming techniques • There was no direct comparison between prior research results and the results obtained
