This paper presents an approach to improve web spam classification by utilizing rank-time features and support vector machines (SVM). With the exponential growth of information on the World Wide Web (WWW), search engines face challenges from web spam that undermines their ranking algorithms. This study details the classification process using domain separation and SVM, and highlights the importance of effective dataset creation and evaluation methods. The proposed system aims to enhance the precision of spam detection, improving user experience and search engine efficiency.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob Yun KAIST DATABASE & MULTIMEDIA LAB
Contents • Introduction • Support Vector Machine • Data Set • Domain Separation • Rank-time features • Evaluation • Summary
Introduction • World Wide Web (WWW) • Definition • An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04] • Description • Contains too much information to browse unaided • Requires web search engines
Introduction • Web Search Engine • Definition • A search engine designed to search for information on the World Wide Web [WIK08] • Description • Retrieves pages relevant to a user’s query • Ranking has become important • Web spam interferes with web search engines
Web Spam(1/2) • Definition • A page that uses illegitimate methods to improve its ranking [KRY07] • Objective • Mislead web search engines’ ranking algorithms • Make a profit by increasing a page’s traffic • Why we should remove web spam • Users spend too much time searching for information • Ranking on search engines is critical for making a profit • Removing spam reduces the resources a search engine consumes
Web Spam(2/2) • Types of web spam • Link stuffing • Keyword stuffing • Cloaking • Web farming • When to remove web spam • Crawl-time • Index-time • Rank-time • How to remove web spam • By training a classifier: Support Vector Machine (SVM)
Support Vector Machine(1/2) • [Figure: v1 and v2 illustrated in 2 dimensions, 3 dimensions, and n dimensions] • Definition • A set of related supervised learning methods used for classification and regression [WIK08] • Description • Finds the separating hyperplane with the maximal margin in the vector space
Support Vector Machine(2/2) • Procedure • Collect datasets • Split the datasets into training datasets and a test dataset • Train the machine with the training datasets • Test the machine with the test dataset • Problem • We need to collect datasets
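The train-then-test procedure above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the setup used in the paper: it trains a linear SVM without a bias term by Pegasos-style stochastic subgradient descent, and the two-dimensional feature vectors, the labels (+1 for spam, -1 for non-spam), and the names `train_linear_svm` and `predict` are all made up for the example.

```python
import random


def train_linear_svm(samples, labels, lam=0.01, epochs=50, seed=0):
    """Train a linear SVM (no bias term) by Pegasos-style subgradient descent.

    samples: list of equal-length feature vectors; labels: +1 or -1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Shrink w for the regularizer; (1 - eta * lam) equals (1 - 1/t).
            w = [(1.0 - 1.0 / t) * wj for wj in w]
            if margin < 1:  # hinge loss is active: step toward the sample
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 (spam) or -1 (non-spam) for feature vector x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Made-up, already-labeled data: +1 = spam, -1 = non-spam.
train_x = [[2.0, 1.5], [1.0, 2.5], [3.0, 2.0],
           [-2.0, -1.0], [-1.5, -2.5], [-3.0, -1.0]]
train_y = [1, 1, 1, -1, -1, -1]
test_x = [[2.5, 2.5], [-2.0, -2.0]]
test_y = [1, -1]

w = train_linear_svm(train_x, train_y)  # train on the training dataset
correct = sum(predict(w, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_x)  # test on the held-out test dataset
print(f"weights={w}, test accuracy={accuracy:.2f}")
```

A real system would use many more features and a tested SVM library; the toy data here is chosen so that a hyperplane through the origin separates the two classes.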
Dataset • Definition • A set of labeled sample data for training and testing • Collecting Procedure • Collect common query lists from the MSN Live search engine • Human judges label each of the top-10 results as spam, non-spam, or unknown • Split the dataset into training datasets and a test dataset • Method for splitting the dataset • Very important! • We choose Domain Separation
Domain Separation(1/6) • Definition • A method that splits the dataset according to domains • Procedure (in this paper) • For each URL in the dataset • Calculate a hash value from its domain • If the hash value is new, assign it randomly to one of 5 files • If the hash value has been seen before, put the URL into the file already assigned to it • Adjust the 5 files to similar sizes • Why should we choose Domain Separation?
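The hashing procedure above can be sketched as follows. This is a hedged illustration only: the slides do not name the hash function or say how a "domain" is extracted, so the sketch assumes MD5 over the URL's hostname, and the names `domain_hash` and `domain_separate` and the example URLs are hypothetical. The final step of adjusting the files to similar sizes is not specified in the slides and is left out.

```python
import hashlib
import random
from urllib.parse import urlparse


def domain_hash(url):
    """Hash the hostname of a URL (assumption: hostname stands in for domain)."""
    host = urlparse(url).hostname or ""
    return hashlib.md5(host.encode("utf-8")).hexdigest()


def domain_separate(urls, n_files=5, seed=0):
    """Distribute URLs over n_files lists so that a domain never spans two lists."""
    rng = random.Random(seed)
    assigned = {}  # domain hash -> index of the file it was assigned to
    files = [[] for _ in range(n_files)]
    for url in urls:
        h = domain_hash(url)
        if h not in assigned:
            # New hash value: pick one of the files at random.
            assigned[h] = rng.randrange(n_files)
        # Known hash value: reuse the file chosen the first time.
        files[assigned[h]].append(url)
    return files


# Made-up example URLs.
urls = [
    "http://spam.example/buy-now",
    "http://news.example/today",
    "http://spam.example/cheap-pills",
    "http://blog.example/post/1",
    "http://news.example/sports",
]
files = domain_separate(urls)
for index, bucket in enumerate(files):
    print(index, bucket)
```

Because the assignment is keyed on the domain hash, every page of a given domain lands in the same file, which is what keeps a domain from appearing in both training and test data.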
Domain Separation(2/6) • Domain separated vs. Randomly separated • Opinion • Domain-separated datasets are better • Results obtained by training on a randomly separated dataset are WRONG! • This is a general classification problem in machine learning • Reason • If the dataset contains subsets whose members share features, we should take those features into account • In fact, some spammers buy a domain just to host spam pages, so it is common for every page on that domain to be labeled spam • How to make domain-separated datasets?
Domain Separation(3/6) • Five-fold cross validation • Definition • The method used in this paper for training and testing the SVM • Procedure • Choose one of the five domain-separated datasets as the test set • Use the other four domain-separated datasets as training datasets • Train the SVM with the 4 training datasets • Test the SVM with the test set • Repeat the above steps so that each dataset serves as the test set once
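The five-fold loop above amounts to holding out each dataset once. A minimal sketch, assuming the five domain-separated datasets are already available as lists; the name `five_fold_splits` and the placeholder items are hypothetical.

```python
def five_fold_splits(folds):
    """Yield (training, test) pairs, holding out each fold exactly once."""
    for i, test_set in enumerate(folds):
        # Merge every fold except the held-out one into the training data.
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield training, test_set


# Placeholder items standing in for labeled URLs in five domain-separated files.
folds = [["a1", "a2"], ["b1"], ["c1", "c2"], ["d1"], ["e1", "e2"]]
splits = list(five_fold_splits(folds))
for training, test_set in splits:
    print(f"train on {len(training)} items, test on {len(test_set)} items")
```

Each pair would then be passed to the training and testing steps, giving five test results in total.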
Domain Separation(4/6) • The result of domain separation • 31,300 URLs in total • 3,133 URLs labeled spam (about 10%) • Problem • Learning a mapping from feature vector to domain hash to label may give wildly and incorrectly optimistic results • Left as future work
Domain Separation(5/6) • Description • No duplicated domains • Consists of 25% spam • Domain information could not be used • Worst-case graph
Domain Separation(6/6) • Description • Adds an additional feature • Consists of 10% spam • More difficult to detect than 25% spam • Result • Still a little lower than the randomly separated case, but this is the worst case • Note: Domain information still could not be used
FEATA(1/2) • Description • Rank independent features • FEATA includes • Domain-level features • Page-level features • Link information
FEATA(2/2) • Description • Average precision of 60% at 10.8% recall • Consists of 10% spam • Not very good • We will add rank-time features!
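A figure such as "60% precision at 10.8% recall" is one point on a precision-recall curve, obtained by fixing a score threshold. The sketch below shows how such a point is computed; the labels, classifier scores, threshold, and the name `precision_recall` are made up for illustration and are not data from the paper.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall for the spam class at a given score threshold.

    labels: True for spam, False for non-spam; scores: higher means more spam-like.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y and p)      # spam caught
    fp = sum(1 for y, p in zip(labels, predicted) if not y and p)  # false alarms
    fn = sum(1 for y, p in zip(labels, predicted) if y and not p)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and classifier scores.
labels = [True, True, False, False, True]
scores = [0.9, 0.4, 0.8, 0.1, 0.7]
precision, recall = precision_recall(labels, scores, threshold=0.6)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Raising the threshold usually trades recall for precision, which is why results are reported as precision at a given recall or recall at a given precision.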
Rank-time Features • Definition • Features used at rank time • Motivation • Every page has a feature vector • The feature vectors of spam and non-spam pages are shaped differently • Spammers cannot guess the distribution of non-spam feature vectors • Consist of • Query independent features (FEATB) • Query dependent features (FEATQ)
FEATB • Definition • Query independent, rank-time features • Description • Page-level features • Domain-level features • Popularity features • Time features
FEATQ • Definition • Query dependent, rank-time features • Description • Depend on the match between the query and document properties • Examined for each returned result • Future work • Label spam on the URL only, not on the relevance of a URL to a query
Evaluation • Micro averaged on five tests
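Micro-averaging is assumed here to mean the usual definition: the true-positive, false-positive, and false-negative counts of the five test sets are pooled before precision and recall are computed, so larger test sets weigh more. A minimal sketch with made-up counts; the name `micro_average` is hypothetical.

```python
def micro_average(fold_counts):
    """Pool (tp, fp, fn) counts over all folds, then compute precision and recall."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up (tp, fp, fn) counts, one tuple per test fold.
fold_counts = [(8, 2, 4), (6, 4, 6), (9, 1, 3), (7, 3, 5), (10, 0, 2)]
micro_precision, micro_recall = micro_average(fold_counts)
print(f"micro precision={micro_precision:.3f}, micro recall={micro_recall:.3f}")
```

With these counts the pooled totals are 40 true positives, 10 false positives, and 20 false negatives.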
Summary • Classification of web spam is an important problem • We can classify web spam by training an SVM • Building the training datasets as domain-separated datasets is very important • Rank-time features improve classification performance by as much as 25% in recall at a set precision
References • [KRY07] Svore, K., Wu, Q., Burges, C., Raman, A., “Improving Web Spam Classification using Rank-time Features”, AIRWeb ’07, May 8, 2007 • [IAN04] Jacobs, I., Walsh, N. (eds.), “Architecture of the World Wide Web, Volume One”, W3C Recommendation, Dec 15, 2004 • [WIK08] “Web Search Engine”, “Support Vector Machine”, http://wikipedia.org, Sep 25, 2008
[Appendix A] Receiver Operating Characteristic