Detecting Gambling Sites From Post Behaviors • Shensi Tong, Hanlong Zhang, Beijun Shen, Hao Zhong (Shanghai Jiao Tong University) • Yongjian Wang, Bo Jin (The Third Research Institute of Ministry of Public Security) • Presented by Shensi Tong • 2016.5.16
Outline • Introduction • Approach • Evaluation • Optimization • Conclusions
Introduction • Detecting gambling sites is important • Internet gambling is even more addictive than traditional gambling, which is harmful • Most countries explicitly prohibit Internet gambling or place it under strict supervision • Detecting gambling sites is challenging • To the best of our knowledge, no previous work has been proposed to detect gambling sites automatically • There is no consensus on which feature is best for detecting gambling sites
Introduction • Major Contributions • The first approach that mines behavior models for gambling sites and detects previously unknown gambling sites with the mined models • A tool and two evaluations on a 1 TB dataset. The results show that our tool detects gambling sites effectively. The POST behavior of a website is the best feature to determine whether it is a gambling site or not • An additional evaluation on applying graph analysis to improve our approach. The results are valuable for further optimizing our approach
Approach • Preprocessing HTTP POSTs • Typically, a POST request message consists of the following parts • Request line • “POST /a/.../script?K1=V1&...&Kn=Vn HTTP/1.1” • Cookie in request header • “JSESSIONID=064185D5B6; NETEASE SSN=shanghai” • Request body • “subject=Test&message=test&formhash=bbb14e19&usesig=1&posttime=138672” • Hash_post = MD5(Script & Keys(RequestLine) & Keys(RequestBody))
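The hashing step above can be sketched in Python as follows. This is a minimal illustration, not the authors' tool: the function name `post_hash`, the sorting of keys, and the separators used to join the parts are assumptions, since the slide only states that the script and the parameter keys (not the values) feed into MD5.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl


def post_hash(request_line: str, request_body: str) -> str:
    """Hash a POST by its script path and parameter names, ignoring values."""
    # The request line looks like: "POST /path/script?K1=V1&K2=V2 HTTP/1.1"
    _method, target, _version = request_line.split(" ", 2)
    parts = urlsplit(target)
    script = parts.path
    line_keys = [k for k, _ in parse_qsl(parts.query, keep_blank_values=True)]
    body_keys = [k for k, _ in parse_qsl(request_body, keep_blank_values=True)]
    # Assumption: keys are sorted and joined with "," and "&";
    # the slides do not specify the exact concatenation.
    signature = "&".join(
        [script, ",".join(sorted(line_keys)), ",".join(sorted(body_keys))]
    )
    return hashlib.md5(signature.encode("utf-8")).hexdigest()
```

Because only the keys are hashed, two posts to the same script with different field values map to the same hash, which is what lets the hash act as a behavior signature.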
Approach • Clustering Sites • Filtering • In this paper, we set α1 to 5 • Compute the Jaccard coefficient between two websites • We put two websites into the same cluster if and only if their similarity value is higher than a predefined threshold β1
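A rough sketch of the clustering step, assuming each site is represented by the set of its POST hashes. The slide does not say how pairwise decisions are merged into clusters, so the union-find (single-link) merging here is an assumption, as are the names `jaccard` and `cluster_sites`. The α1 filtering step is omitted because the slide does not define it.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_sites(site_hashes: dict, beta1: float) -> list:
    """Group sites whose pairwise Jaccard similarity exceeds beta1."""
    parent = {site: site for site in site_hashes}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(site_hashes, 2):
        if jaccard(site_hashes[a], site_hashes[b]) > beta1:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for site in site_hashes:
        clusters.setdefault(find(site), set()).add(site)
    return list(clusters.values())
```

The pairwise loop is quadratic in the number of sites, so a corpus of the size reported in the evaluation would need some form of blocking or indexing in practice.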
Approach • Mining Behavior Models • Pick out gambling site clusters manually • Mine a behavior model for each cluster • POST TF-IDF • Sort in descending order and select the top α3 as the model
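One way to read the TF-IDF step is sketched below. The exact weighting is not given on the slide, so this assumes the common form: term frequency is the share of a POST hash among the cluster's posts, and inverse document frequency is the log of the number of sites divided by the number of sites that use the hash. The function name `mine_model` and its argument shapes are hypothetical.

```python
import math
from collections import Counter


def mine_model(cluster_posts: list, all_site_hashes: dict, alpha3: int) -> list:
    """Return the top-alpha3 POST hashes of a cluster ranked by TF-IDF.

    cluster_posts: POST hashes observed in the cluster, with repeats.
    all_site_hashes: site -> set of POST hashes, over the whole corpus.
    """
    tf = Counter(cluster_posts)
    total = sum(tf.values())
    n_sites = len(all_site_hashes)
    scores = {}
    for post, count in tf.items():
        # Number of sites in the corpus that issue this POST hash.
        df = sum(1 for hashes in all_site_hashes.values() if post in hashes)
        idf = math.log(n_sites / max(df, 1))
        scores[post] = (count / total) * idf
    # Sort by descending score; break ties by hash for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:alpha3]
```

With this weighting, a hash such as a generic login post that appears on nearly every site scores close to zero, while a hash concentrated in the gambling cluster rises to the top.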
Approach • Detecting Previously Unknown Gambling Sites • Calculate the similarity value between unknown sites and the mined models • If the value is higher than the threshold β2, we label the site as a gambling site • If some sites do not follow any mined model, we re-run our approach to train a new model
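The detection rule can be illustrated with a short sketch. The slide does not name the similarity measure, so this assumes the fraction of a model's POST hashes that the unknown site also issues; the function name `detect` and the model dictionary are hypothetical.

```python
from typing import Optional


def detect(site_hashes: set, models: dict, beta2: float) -> Optional[str]:
    """Return the name of the best-matching model if its score exceeds beta2."""
    best_name, best_score = None, 0.0
    for name, model in models.items():
        if not model:
            continue
        # Assumed similarity: share of the model's hashes seen on the site.
        score = len(site_hashes & model) / len(model)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > beta2 else None
```

A return value of `None` corresponds to a site that follows no mined model, which is the case where the approach is re-run to train a new model.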
Evaluations • Datasets • 4,000,000,000 HTTP POSTs • 750,000 sites • 1 TB • Error Measures
Conclusion • Features • URL • Consists of lexical and host information • HTML • Extracted from HTML tags that appear in the HTML code of Web pages • Semantic • Captures textual information that is visible on Web pages
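The slide does not list the error measures, but the conclusions report precision and recall. Assuming those are the measures meant, they can be computed from the set of sites flagged as gambling and the set of sites known to be gambling; the function name `precision_recall` is hypothetical.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Precision and recall of a predicted set against the labeled set."""
    true_positives = len(predicted & actual)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Precision is the share of flagged sites that really are gambling sites, and recall is the share of real gambling sites that were flagged.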
Optimization • Graph Analysis Features • Degree • Number of a site's neighbors • Similarity • Similarity between two websites • HashCount • Unique POST hashes for a website • Utmcsr • Source website from which the visitor entered this website • Utmctr • Keywords entered in the search engine • Utmv • Used to identify a site for traffic statistics
Optimization • Observation 1 • Like attracts like
Optimization • Observation 2 • Concentration
Optimization • Observation 3 • Anomaly
Optimization • Optimization Results • Matching values in cookies • If certain keywords appear in utmctr, the site is likely to be a gambling site • Filtering outliers from sites • Determine whether a website belongs to a cluster according to its HashCount • Filtering large POST sites • Filtering outliers from clusters
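The cookie-matching idea can be sketched as follows, assuming the utmctr field arrives inside a Google Analytics style `__utmz` cookie whose fields are separated by `|`. The function names and the keyword set are hypothetical; the slides do not list the keywords used.

```python
from typing import Optional
from urllib.parse import unquote


def utmctr_term(cookie_header: str) -> Optional[str]:
    """Extract the search keyword (utmctr) from a __utmz cookie, if present."""
    for cookie in cookie_header.split(";"):
        name, _, value = cookie.strip().partition("=")
        if name != "__utmz":
            continue
        for field in value.split("|"):
            if "utmctr=" in field:
                # Keep everything after the marker and undo URL encoding.
                return unquote(field.split("utmctr=", 1)[1])
    return None


def has_gambling_keyword(cookie_header: str, keywords: set) -> bool:
    """True if the utmctr search term contains any of the given keywords."""
    term = utmctr_term(cookie_header)
    if term is None:
        return False
    lowered = term.lower()
    return any(keyword in lowered for keyword in keywords)
```

A match here is only a hint that the site is likely a gambling site, so it fits best as a supplement to the POST-behavior models rather than a replacement for them.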
Conclusion • We propose a novel approach that detects gambling sites based on POST behavior • We evaluate our approach on a large corpus, and our results show that it achieves both high precision and high recall • We apply graph analysis to improve performance and recall