
CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

Reporter: Gia-Nan Gao. Advisor: Chin-Laung Lei. 2010/6/7.





Presentation Transcript


  1. CANTINA: A Content-Based Approach to Detecting Phishing Web Sites • Reporter: Gia-Nan Gao • Advisor: Chin-Laung Lei • 2010/6/7

  2. Reference • Y. Zhang, J. Hong, and L. Cranor, “Cantina: A content-based approach to detecting phishing web sites,” in Proceedings of the International World Wide Web Conference (WWW), 2007.

  3. Outline • Introduction • Automated Detection of Phishing • A Content-based Approach for Detecting Phishing Web Sites • Evaluation • Conclusion

  4. Introduction • Phishing • A kind of attack in which victims are tricked by spoofed emails and fraudulent web sites into giving up personal information • How many phishing sites are there? • 9,255 unique phishing sites were reported in June 2006 alone • How much does phishing cost each year? • $1 billion to $2.8 billion per year

  5. Automated Detection of Phishing • Approach 1: use heuristics to judge whether a page has phishing characteristics • Can detect phishing attacks as soon as they are launched • Attackers may be able to design their attacks to avoid heuristic detection • Often produces false positives • Approach 2: use a blacklist of reported phishing URLs • Higher accuracy • Requires human intervention and verification

  6. CANTINA: Content-based Approach • TF-IDF algorithm • Robust Hyperlinks • Adapting TF-IDF for detecting phishing • A set of auxiliary heuristics

  7. TF-IDF Algorithm • Yields a weight that measures how important a word is to a document in a corpus • Term Frequency (TF) • The number of times a given term appears in a specific document • Measures the importance of the term within that particular document • Inverse Document Frequency (IDF) • Measures how common a term is across an entire collection of documents • A term has a high TF-IDF weight when it has • A high term frequency in a given document • A low document frequency in the whole collection of documents
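The weighting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the formula used in the CANTINA paper: the slides give no equation, so the smoothed IDF below, the function name `tf_idf`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: weight} for one document against a corpus of token lists."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF (an assumption; the slides do not give a formula).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        # Term frequency is normalised by document length.
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


# Toy corpus, invented for illustration.
corpus = [
    ["sign", "in", "to", "your", "account"],
    ["ebay", "sign", "in", "ebay", "auction"],
    ["news", "about", "your", "account"],
]
weights = tf_idf(corpus[1], corpus)
```

In the second document, "ebay" is frequent locally and absent elsewhere, so it receives a higher weight than "sign", which also appears in another document.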

  8. Robust Hyperlinks • Overcome the problem of broken links • Basic idea • Add a small number of well-chosen terms, called a lexical signature, to URLs • Creating signatures • Calculate the TF-IDF value for each word in a document, then select the words with the highest values • A lexical signature of about five terms is sufficient to identify a web resource virtually uniquely
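A rough sketch of the signature step follows, assuming TF-IDF weights have already been computed. The function names, the `lexical-signature` query parameter, and the example weights are illustrative assumptions; the slides do not specify how the terms are attached to a URL, and the sketch assumes the URL has no existing query string.

```python
from urllib.parse import urlencode


def lexical_signature(weights, k=5):
    """Pick the k terms with the highest weight (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def robust_url(url, signature):
    """Append the signature as a query parameter (parameter name is hypothetical)."""
    return url + "?" + urlencode({"lexical-signature": " ".join(signature)})


# Hypothetical weights for illustration.
weights = {"ebay": 0.9, "bid": 0.5, "the": 0.1}
signature = lexical_signature(weights, k=2)
link = robust_url("http://example.com/page", signature)
```

If the original link breaks, the signature terms can be submitted to a search engine to relocate the page.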

  9. How Does CANTINA Work? (1/2) • Given a web page, calculate the TF-IDF score of each term on that page • Generate a lexical signature by taking the five terms with the highest TF-IDF weights • Feed this lexical signature to a search engine, which in this case is Google • If the domain name of the current web page matches the domain name of any of the top N search results, the page is considered legitimate. Otherwise, it is considered a phishing site.
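The final domain-matching step might look like the sketch below. It takes the search results as a plain list of URLs rather than calling a real search API, and `domain_of` is a deliberately crude assumption (it keeps the last two host labels, which is wrong for suffixes such as `co.uk`). All URLs are invented for illustration.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Crude registered-domain guess: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_legitimate(page_url, result_urls, n=30):
    """True if the page's domain appears among the top n search results."""
    target = domain_of(page_url)
    return any(domain_of(u) == target for u in result_urls[:n])


# Pretend these came back from searching the page's lexical signature.
results = ["https://www.ebay.com/", "https://pages.ebay.com/help/"]
real_page = is_legitimate("https://signin.ebay.com/ws", results)
fake_page = is_legitimate("http://ebay.example-login.net/signin", results)
```

Here the genuine sign-in page shares the domain `ebay.com` with the results, while the look-alike page does not.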

  10. How Does CANTINA Work? (2/2) • Assumption • Google indexes the vast majority of legitimate web sites, and legitimate sites are ranked higher than phishing sites • Two heuristics • Domain name • Add the current domain name to the lexical signature • Zero results Means Phishing (ZMP) • If Google fails to return any result, the suspected site is labeled as phishing • Example: eBay and its phishing site
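The two heuristics can be expressed as switches on the basic check, as in this sketch. The `search` argument is a caller-supplied function standing in for the search engine, and `fake_search` is a toy stand-in invented here. The slides do not say what the basic variant does with zero results, so this sketch returns "unknown" in that case; that choice, the form of the appended domain name, and all names are assumptions.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Crude registered-domain guess: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def classify(page_url, signature, search, n=30, add_domain=True, zmp=True):
    terms = list(signature)
    if add_domain:
        # Domain-name heuristic: extend the query with the page's own domain.
        terms.append(domain_of(page_url))
    results = search(" ".join(terms))[:n]
    if not results:
        # ZMP heuristic: zero results means phishing.
        return "phishing" if zmp else "unknown"
    target = domain_of(page_url)
    if any(domain_of(u) == target for u in results):
        return "legitimate"
    return "phishing"


def fake_search(query):
    """Toy index that only knows one legitimate site (hypothetical)."""
    return ["https://www.ebay.com/"] if "ebay" in query.split() else []


sig = ["ebay", "user", "sign", "help", "forgot"]
a = classify("https://signin.ebay.com/", sig, fake_search)
b = classify("http://ebay.example-login.net/", sig, fake_search)
c = classify("http://example.org/", ["xqzt"], fake_search)
d = classify("http://example.org/", ["xqzt"], fake_search, zmp=False)
```

Setting `add_domain` and `zmp` independently gives the four combinations of the basic check with and without each heuristic.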

  11. Example: eBay

  12. Phishing Site of eBay

  13. Legitimate Site of eBay

  14. Auxiliary Heuristics

  15. Evaluation • Four experiments • Evaluation of TF-IDF • Evaluation of heuristics • Evaluation of CANTINA • Evaluation of CANTINA using URLs gathered from email • Two metrics • True positives (correctly labeling a phishing site as phishing, higher is better) • False positives (incorrectly labeling a legitimate site as phishing, lower is better)
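As a small worked example of the two metrics, the sketch below computes both rates from labeled data. The function name and the sample labels are invented for illustration, and the sketch assumes the test set contains at least one phishing and one legitimate page.

```python
def rates(labels, predictions):
    """labels/predictions: sequences of booleans, True meaning phishing.

    Returns (true positive rate, false positive rate).
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)       # phishing caught
    fp = sum(1 for y, p in pairs if not y and p)   # legitimate flagged
    n_phish = sum(1 for y in labels if y)
    n_legit = len(labels) - n_phish
    return tp / n_phish, fp / n_legit


# Four phishing pages followed by four legitimate pages (made-up data).
labels = [True, True, True, True, False, False, False, False]
predictions = [True, True, True, False, True, False, False, False]
tpr, fpr = rates(labels, predictions)
```

In this made-up sample, three of four phishing pages are caught and one of four legitimate pages is wrongly flagged.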

  16. Experiment 1 – Evaluation of TF-IDF (1/3) • Basic TF-IDF – Calculate the lexical signature based on the top 5 terms, submit that to Google, and check if the domain name of the page in question matches any of the top 30 results • Basic TF-IDF+domain – Same as Basic TF-IDF, except that the domain name of the page in question is added to the lexical signature • Basic TF-IDF+ZMP – Same as Basic TF-IDF, except that zero search results means that the page in question is labeled as a phishing site (ZMP is “zero means phishing”) • Basic TF-IDF+domain+ZMP – A combination of the two variants above. This combination turned out to have the best results, and is also called Final-TF-IDF in later sections

  17. Experiment 1 – Evaluation of TF-IDF (2/3) • 100 phishing URLs from PhishTank.com • 100 legitimate URLs from a list of 500 used in 3Sharp’s study of anti-phishing toolbars

  18. Experiment 1 – Evaluation of TF-IDF (3/3)

  19. Experiment 2 – Evaluation of Heuristics (1/2) • Determine the best weights for these heuristics. • If S = 1, the page is labeled as a legitimate page, and if S = -1, it is labeled as a phishing site
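The formula defining S did not survive in this transcript. One plausible reading, sketched below, is that each heuristic votes +1 (legitimate) or -1 (phishing), the votes are combined as a weighted sum, and the sign of the sum gives S. The heuristic names `h1` to `h3`, the weights, and the treatment of a zero sum as phishing are all assumptions made for illustration.

```python
def combine(votes, weights):
    """Weighted vote: return 1 (legitimate) or -1 (phishing).

    votes:   {heuristic name: +1 or -1}
    weights: {heuristic name: non-negative weight}
    """
    total = sum(weights[name] * vote for name, vote in votes.items())
    # A sum of exactly zero is treated as phishing (an assumption).
    return 1 if total > 0 else -1


# Hypothetical heuristics and weights.
votes = {"h1": 1, "h2": -1, "h3": -1}
s_heavy_h1 = combine(votes, {"h1": 5, "h2": 2, "h3": 2})
s_light_h1 = combine(votes, {"h1": 3, "h2": 2, "h3": 2})
```

With a large weight on `h1`, its single legitimate vote outweighs the two phishing votes; with a smaller weight, the result flips. This is why the choice of weights matters.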

  20. Experiment 2 – Evaluation of Heuristics (2/2)

  21. Experiment 3 – Evaluation of CANTINA (1/2) • Comparison: • Final-TF-IDF, Final-TF-IDF+heuristics, SpoofGuard, and Netcraft • SpoofGuard • Relies entirely on heuristics • Netcraft • Uses a combination of heuristics and a blacklist • 100 phishing URLs with unique domains from PhishTank • 100 legitimate URLs chosen using the following strategy • Select the login pages of 35 sites that are often attacked by phishers • Select the 35 top pages from Alexa Web Search • Select 30 random pages from http://random.yahoo.com/fast/ryl, and manually verify that they are legitimate

  22. Experiment 3 – Evaluation of CANTINA (2/2)

  23. Experiment 4 – Evaluation of CANTINA Using URLs Gathered from Email (1/2) • Evaluate CANTINA using URLs gathered from users’ actual email • Gather 3038 unique URLs, of which only 2519 were active, from 3385 email messages • Manually label the active URLs as “phishing,” “spam,” or “legitimate” • Phishing: pages that impersonate known brands and ask for personal data (19) • Spam: pages selling unsolicited products or services (388) • All other URLs were deemed legitimate (2100)

  24. Experiment 4 – Evaluation of CANTINA Using URLs Gathered from Email (2/2)

  25. Conclusion • A content-based approach for detecting phishing web sites • Pure TF-IDF approach • Can catch 97% of phishing sites with about 6% false positives • Final TF-IDF approach • Can catch about 90% of phishing sites with only 1% false positives
