
Web Spam

Web Spam. Yonatan Ariel, SDBI 2005, The Hebrew University of Jerusalem. Based on the work of Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen, Stanford University. Contents: what is web spam; combating web spam with TrustRank; combating web spam with mass estimation.

Presentation Transcript


  1. Web Spam Yonatan Ariel SDBI 2005 The Hebrew University of Jerusalem Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University

  2. Contents • What is web spam • Combating web spam – TrustRank • Combating web spam – Mass Estimation • Conclusion

  3. Web Spam • Actions intended to mislead search engines into ranking some pages higher than they deserve. • Search engines are the entryways to the web • Financial gains

  4. Consequences The first step in combating spam is understanding it • Decreased search results quality • “Kaiser pharmacy” returns techdictionary.com • Increased cost of each processed query • Search engine indexes are inflated with useless pages

  5. Search Engines • High quality results, i.e. pages that are • Relevant for a specific query • Textual similarity • Important • Popularity • Search engines combine relevance and importance in order to compute the ranking

  6. Definition revised • any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page’s true value

  7. Search Engine Optimizers • Engage in spamming (according to our definition) • Ethical methods • Finding relevant directories to which a site can be submitted • Using a reasonably sized description meta tag • Using a short and relevant page title to name each page

  8. Spamming Techniques • Boosting techniques • Achieving high relevance / importance • Hiding techniques • Hiding the boosting techniques • We’ll cover them both

  9. Techniques • Boosting Techniques • Term Spamming • Link Spamming • Hiding Techniques

  10. TF IDF • TF (term frequency) • A measure of the importance of the term in a specific page: $\mathrm{TF} = \frac{\text{number of occurrences of the considered term}}{\text{number of occurrences of all terms}}$

  11. TF IDF • IDF (inverse document frequency) • A measure of the general importance of the term t in a collection of pages: $\mathrm{IDF} = \frac{\text{total number of documents in the corpus}}{\text{total number of documents where } t \text{ appears}}$ (commonly taken as the logarithm of this ratio)

  12. TF-IDF • A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents. • Spammers: • Make a page relevant for a large number of queries • Make a page very relevant for a specific query
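The two spammer goals above can be illustrated with a minimal TF-IDF sketch. It follows the TF and IDF definitions on the previous slides; the logarithm in `idf`, the zero weight for unseen terms, and the toy corpus are assumptions made for illustration, not part of the presentation.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by all term occurrences in the page."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(total documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term that appears nowhere gets zero weight
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus of tokenized pages.
corpus = [
    "cheap flights and cheap hotels".split(),
    "history of the web".split(),
    "search engines rank the web".split(),
]
# A page that repeats one term to look very relevant for a specific query.
spam_doc = "cheap cheap cheap cheap flights".split()

print(tf("cheap", spam_doc))
print(tf_idf("cheap", spam_doc, corpus), tf_idf("cheap", corpus[0], corpus))
```

Repeating a term raises its TF in the spam page, so its TF-IDF weight for that term exceeds that of an ordinary page mentioning the same term.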

  13. Term Spamming Techniques • Body Spam • Simplest, oldest, most popular. • Title Spam • Higher weights. • Meta tag spam • Low priority • <META NAME="keywords" CONTENT="jew,jews,jew watch,jews and communism,jews and banking,jews and banks,jews in government..history,diversity,Red Revolution,USSR,jews in government , holocaust, atrocities, defamation, diversity, civil rights, plurali, bible, Bible, murder, crime, Trotsky, genocide, NKVD, Russia, New York, mafia, spy, spies,Rosenberg">

  14. Term Spamming Techniques (cont’d) • Anchor text spam • <a href="target.html"> free, great deals, cheap, cheap, free </a> • URL • Buy-canon-rebel-20d-lens-case.camerasx.com

  15. Grouping Term Spamming Techniques • Repetition • Increased relevance for a few specific queries • Dumping of a large number of unrelated terms • Effective against queries with rare, obscure terms • Weaving of spam terms into copied content • Rare (original) topic • Dilution: conceal some spam terms within the text • Example: “Remember not only airfare to say the right plane tickets thing in the right place, but far cheap travel more difficult still, to leave hotel rooms unsaid the wrong thing at vacation the tempting moment.” • Phrase stitching • Create content quickly

  16. Techniques • Boosting Techniques • Term Spamming • Link Spamming • Hiding Techniques

  17. Three Types Of Pages On The Web • Inaccessible • Spammers cannot modify • Accessible • Can be modified in a limited way • Own pages • We call a group of own pages a spam farm

  18. First Algorithm - HITS • Assigns global hub and authority scores to each page • Circular definition: • Important hub pages are those that point to many important authority pages • Important authority pages are those pointed to by many hubs • Hub scores can be easily spammed • Adding outgoing links to a large number of well-known, reputable pages • Authority score is more complicated • The more the better
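The circular hub/authority definition can be sketched as a simple iteration. This is a minimal illustration, not the presenter's code: the page names, the iteration count, and the Euclidean normalization are assumptions. It shows how a spam page that copies the outgoing links of a good hub obtains the same hub score.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of directed (src, dst) links."""
    nodes = sorted({page for edge in edges for page in edge})
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update: sum the authority scores of the pages linked to.
        hub = {page: 0.0 for page in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Hypothetical graph: h1 and h2 are honest hubs, a1..a3 are reputable pages,
# and "spam" simply links to every reputable page.
edges = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"), ("h2", "a3"),
    ("spam", "a1"), ("spam", "a2"), ("spam", "a3"),
]
hub, auth = hits(edges)
print(hub["h1"], hub["h2"], hub["spam"])
```

Because the hub score depends only on a page's own outgoing links, the spam page ties with the best honest hub without anyone linking to it.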

  19. Second Algorithm - Page Rank • a family of algorithms for assigning numerical weightings to hyperlinked documents • The PageRank value of a page reflects the frequency of hits on that page by a random surfer • is the probability of being at that page after lots of clicks • We continue at random from a sink page

  20. Page rank (figure: a spam farm with inaccessible, accessible, and own pages around the target page t) • All n own pages are part of the farm • All m accessible pages point to the spam farm • Links pointing outside the spam farm are suppressed • All accessible and own pages point to t • No vote gets lost (each page has an outgoing link) • All pages within the farm are reachable

  21. Techniques – Outgoing links • Manually adding outgoing links to well-known hosts increases the hub score • Directory sites • dmoz.org • Yahoo! Directory • Creating a massive outgoing link structure quickly

  22. Techniques – Incoming Links • Honey-pot – useful resource • Infiltrate a web directory • Links on blogs, guest books, wikis • Google’s tag – <a href="http://www.example.com/" rel="nofollow">discount</a> • Link exchange • Buy expired domains • Create own spam farm

  23. Techniques • Boosting Techniques • Term Spamming • Link Spamming • Hiding Techniques

  24. Content Hiding • Color scheme • font’s color same as background’s color • Tiny anchor images links (1x1 pixel) • Using scripts • Setting the visible HTML style attribute to FALSE.

  25. Cloaking • Spam web servers can return a different document to a web crawler • Identification of web crawlers: • A list of IP addresses • ‘user-agent’ field in the HTTP request • Allows webmasters to block some content • Legitimate optimizations (remove ads) • Delivering content that search engines can’t read (such as Flash)

  26. Redirection • Automatically redirecting the browser to another URL • Refresh meta tag in the header of an HTML document • <meta http-equiv="refresh" content="0;url=target.html"> • Simple to identify • Scripts • <script language="javascript"> location.replace("target.html") </script>
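The slide calls the refresh meta tag simple to identify; a minimal sketch of such a check using only Python's standard `html.parser` is given here. The class and function names are hypothetical, and the sketch covers only the meta-tag case, not script-based redirects.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects (delay, url) pairs from <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0;url=target.html".
        delay, _, rest = attr.get("content", "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append((delay.strip(), rest[4:].strip()))


def find_meta_refresh(html):
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets


page = '<html><head><meta http-equiv="refresh" content="0;url=target.html"></head></html>'
print(find_meta_refresh(page))
```

A crawler could flag pages whose refresh delay is zero or very small, since the visitor never sees the content that was indexed.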

  27. How can we fight it? • IDENTIFY instances of spam • Stop crawling / indexing such pages • PREVENT spamming • Avoid cloaking – identifying as regular web browsers • COUNTERBALANCE the effect of spamming • Use variation of the ranking methods

  28. Some Statistics • The results of a single breadth-first search at the Yahoo! home page • A complete set of pages crawled and indexed by AltaVista

  29. Some More Statistics • Average spammers • Sophisticated spammers

  30. Contents • What is web spam • Combating web spam – TrustRank • Combating web spam – Mass Estimation • Conclusion

  31. Motivation • The spam detection process is very expensive and slow, but is critical to the success of search engines • We’d like to assist the human experts who detect web spam

  32. Getting dirty • G = (V,E) • V = set of N pages (vertices) • E = set of directed links (edges) that connect pages • We collapse multiple hyperlinks into a single link • We remove self hyperlinks • i(p) – number of in-links to a page p • w(p) – number of out-links from a page p

  33. Our Example (graph with pages 1, 2, 3, 4) • V = {1, 2, 3, 4} • E = {(1,2), (2,3), (3,2), (3,4)} • N = 4 • i(2) = 2; w(2) = 1
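The example graph and the in-link and out-link counts can be checked with a few lines of Python. The helper names (`in_degree`, `out_degree`, `normalize`) are hypothetical; `normalize` mirrors the two preprocessing steps from the previous slide (collapsing multiple hyperlinks and removing self hyperlinks).

```python
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 2), (3, 4)]


def normalize(raw_links):
    """Collapse duplicate links into one and drop self-links."""
    return sorted({(src, dst) for (src, dst) in raw_links if src != dst})


def in_degree(p, edges):
    """i(p): number of in-links to page p."""
    return sum(1 for (_, dst) in edges if dst == p)


def out_degree(p, edges):
    """w(p): number of out-links from page p."""
    return sum(1 for (src, _) in edges if src == p)


print(in_degree(2, E), out_degree(2, E))  # 2 1
```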

  34. A Transition Matrix

  35. In our example • Transition matrix of the four-page graph, with two parts highlighted: the out edges of ‘3’ and the in edges of ‘4’
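The matrix itself did not survive in the transcript. A minimal sketch, assuming the definition used in the TrustRank paper (T(p, q) = 1/w(q) if there is a link from q to p, and 0 otherwise), builds it for the example graph. Under this assumption the column for page 3 holds its out edges and the row for page 4 holds its in edges.

```python
def transition_matrix(n, edges):
    """T[p-1][q-1] = 1 / w(q) if (q, p) is a link, else 0 (pages are numbered 1..n)."""
    out_deg = [0] * (n + 1)
    for src, _ in edges:
        out_deg[src] += 1
    T = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        T[dst - 1][src - 1] = 1.0 / out_deg[src]
    return T


E = [(1, 2), (2, 3), (3, 2), (3, 4)]
T = transition_matrix(4, E)
for row in T:
    print(row)

# Column for page 3 (its out edges) and row for page 4 (its in edges).
print([T[p][2] for p in range(4)])
print(T[3])
```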

  36. An Inverse transition matrix

  37. In Our Example • Inverse transition matrix of the four-page graph, with two parts highlighted: the in edges of ‘2’ and the out edges of ‘2’
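As with the transition matrix, the inverse matrix is missing from the transcript. This sketch assumes the TrustRank paper's definition (U(p, q) = 1/i(q) if there is a link from p to q, and 0 otherwise), so that the column for page 2 holds its in edges and the row for page 2 holds its out edges.

```python
def inverse_transition_matrix(n, edges):
    """U[p-1][q-1] = 1 / i(q) if (p, q) is a link, else 0 (pages are numbered 1..n)."""
    in_deg = [0] * (n + 1)
    for _, dst in edges:
        in_deg[dst] += 1
    U = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        U[src - 1][dst - 1] = 1.0 / in_deg[dst]
    return U


E = [(1, 2), (2, 3), (3, 2), (3, 4)]
U = inverse_transition_matrix(4, E)
for row in U:
    print(row)
```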

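  38. Page Rank • Mutual reinforcement between pages • The importance of a certain page influences, and is influenced by, the importance of some other pages • $r(p) = \alpha \sum_{q:(q,p) \in E} \frac{r(q)}{w(q)} + (1-\alpha)\frac{1}{N}$, where $\alpha$ is the decay factor, the sum collects the in-link votes, and $(1-\alpha)\frac{1}{N}$ is the start-off authority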

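  39. Equivalent Matrix Equation • $\mathbf{r} = \alpha \cdot T \cdot \mathbf{r} + (1-\alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$ • $\alpha$ and $\frac{1}{N}$ are scalars; $\mathbf{r}$ and $\mathbf{1}_N$ are N-vectors • The first term is the dynamic part, the second term is the static part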
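The matrix form can be solved by repeated substitution. The sketch below assumes the equation r = αTr + (1 − α)(1/N)·1 and the transition matrix of the four-page example; the value α = 0.85 and the iteration count are illustrative assumptions. It does not redistribute score from the sink page 4, so the scores need not sum to one.

```python
def pagerank(T, alpha=0.85, iterations=50):
    """Iterate r = alpha * T * r + (1 - alpha) / N, starting from the uniform vector."""
    n = len(T)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [
            alpha * sum(T[p][q] * r[q] for q in range(n)) + (1 - alpha) / n
            for p in range(n)
        ]
    return r


# Transition matrix of the example graph E = {(1,2), (2,3), (3,2), (3,4)}.
T = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]
scores = pagerank(T)
print(scores)
```

Page 1 has no in-links, so it keeps only its start-off authority, while pages 2 and 3 reinforce each other through their mutual links.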

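  40. A Biased PageRank • $\mathbf{r} = \alpha \cdot T \cdot \mathbf{r} + (1-\alpha) \cdot \mathbf{d}$ • $\mathbf{d}$ is a static score distribution (summing up to one) • Only pages that are reachable from some page with d[i] > 0 will have a positive PageRank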
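A minimal sketch of the biased variant, assuming the form r = αTr + (1 − α)d with d a static score distribution. The choice of d (all static score on page 3), α = 0.85, and the iteration count are illustrative assumptions. It demonstrates the reachability claim on the four-page example: page 1 cannot be reached from page 3, so its score stays at zero.

```python
def biased_pagerank(T, d, alpha=0.85, iterations=50):
    """Iterate r = alpha * T * r + (1 - alpha) * d, starting from r = d."""
    n = len(T)
    r = list(d)
    for _ in range(iterations):
        r = [
            alpha * sum(T[p][q] * r[q] for q in range(n)) + (1 - alpha) * d[p]
            for p in range(n)
        ]
    return r


# Transition matrix of the example graph E = {(1,2), (2,3), (3,2), (3,4)}.
T = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]
d = [0.0, 0.0, 1.0, 0.0]  # hypothetical: all static score placed on page 3
scores = biased_pagerank(T, d)
print(scores)
```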

  41. Oracle Function (figure: a graph of seven pages, each marked good or bad) • A binary oracle function O over all pages p in V: O(p) = 1 if p is good, O(p) = 0 if p is bad • In the figure: O(3) = 1, O(6) = 0

  42. Oracle Functions • Oracle invocations are expensive and time consuming • We CAN’T call the function for all pages • Approximate isolation of the good set • Good pages seldom point to bad ones • As we’ve seen, good pages *can* point to bad ones • bad pages often point to bad ones

  43. Trust Function • We need to evaluate pages without relying on O. • We define, for any page p, a trust function T • Ideal Trust Property (for any page p): T(p) = Pr[ O(p) = 1 ] • Very hard to come up with such a function • Useful in ordering search results

  44. Ordered Trust Property
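The body of this slide is missing from the transcript. For reference, the ordered trust property as it is usually stated in the TrustRank paper is given below; treat it as an assumption about what the slide showed, written with the T and O notation of the surrounding slides.

```latex
\[
\begin{aligned}
T(p) < T(q) &\iff \Pr[O(p) = 1] < \Pr[O(q) = 1] \\
T(p) = T(q) &\iff \Pr[O(p) = 1] = \Pr[O(q) = 1]
\end{aligned}
\]
```

In words: T does not need to equal the probability of a page being good; it only needs to rank pages in the same order as that probability.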

  45. First Evaluation Metric - Pairwise Orderedness • Given a trust function T, an oracle function O, and pages p, q • An indicator marks a violation of the ordered trust property for the pair (p, q) • The metric is the fraction of the pairs for which T did not make a mistake
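A minimal sketch of the metric, assuming the violation indicator from the TrustRank paper: a pair counts as a violation when T ranks p at least as high as q although the oracle says p is bad and q is good, or the reverse. The trust scores and oracle values below are hypothetical.

```python
from itertools import combinations


def violation(T, O, p, q):
    """Return 1 if T orders p and q against the oracle, else 0."""
    if T[p] >= T[q] and O[p] < O[q]:
        return 1
    if T[p] <= T[q] and O[p] > O[q]:
        return 1
    return 0


def pairwise_orderedness(T, O, pairs):
    """Fraction of the pairs for which T did not make a mistake."""
    if not pairs:
        return 1.0  # assumption: an empty sample has no violations
    violations = sum(violation(T, O, p, q) for p, q in pairs)
    return (len(pairs) - violations) / len(pairs)


# Hypothetical oracle values (1 = good, 0 = bad) and trust scores for four pages.
O = {1: 1, 2: 1, 3: 0, 4: 0}
T = {1: 0.9, 2: 0.2, 3: 0.5, 4: 0.1}
pairs = list(combinations(sorted(O), 2))
print(pairwise_orderedness(T, O, pairs))
```

Only the pair (2, 3) is a violation here: good page 2 receives less trust than bad page 3, so five of the six pairs are ordered correctly.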

  46. Threshold Trust Property • Doesn’t necessarily provide an ordering of pages based on their likelihood of being good • We’ll describe two evaluation metrics • Precision • Recall

  47. Threshold Evaluation Metrics • $\text{Precision} = \frac{\text{total number of correct ‘good’ estimations}}{\text{total number of ‘good’ estimations}}$ • $\text{Recall} = \frac{\text{total number of correct ‘good’ estimations}}{\text{total number of good pages in } X}$
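Both metrics can be sketched in a few lines, assuming a page is estimated ‘good’ when its trust score exceeds a threshold (called `delta` here, a hypothetical name). The sample X, the trust scores, and the oracle values are made up for illustration.

```python
def precision_recall(T, O, sample, delta):
    """Precision and recall of the estimate 'T(p) > delta means p is good' over a sample."""
    predicted_good = [p for p in sample if T[p] > delta]
    actually_good = [p for p in sample if O[p] == 1]
    correct = [p for p in predicted_good if O[p] == 1]
    # Assumption: report 0.0 when a denominator would be zero.
    precision = len(correct) / len(predicted_good) if predicted_good else 0.0
    recall = len(correct) / len(actually_good) if actually_good else 0.0
    return precision, recall


# Hypothetical oracle values (1 = good, 0 = bad) and trust scores for the sample X.
O = {1: 1, 2: 1, 3: 0, 4: 0}
T = {1: 0.9, 2: 0.2, 3: 0.5, 4: 0.1}
X = [1, 2, 3, 4]
print(precision_recall(T, O, X, delta=0.4))  # (0.5, 0.5)
```

With this threshold, pages 1 and 3 are estimated good but only page 1 is, and good page 2 is missed, which gives one half for both metrics.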
