PhishNet: Predictive Blacklisting for Phishing Detection

PhishNet: Predictive Blacklisting to Detect Phishing Attacks Reporter: Gia-Nan Gao Advisor: Chin-Laung Lei 2010/4/26

Reference • Pawan Prakash, Manish Kumar, Ramana Rao Kompella and Minaxi Gupta, “PhishNet: Predictive Blacklisting to Detect Phishing Attacks,” in IEEE INFOCOM 2010.

Outline • Introduction • Two Major Components of PhishNet • URL prediction component • Approximate URL matching component • Evaluation • Conclusion

Introduction • Phishing attacks • Set up fake web sites that mimic real businesses in order to lure innocent users into revealing sensitive information • Blacklisting • Match a given URL against a list of known malicious URLs (the blacklist) • Problem of blacklisting • A malicious URL cannot be blacklisted until it has reached a certain prevalence in the wild

Two Major Components of PhishNet • URL prediction component • Generate new URLs (child) from known phishing URLs (parent) by employing various heuristics • Test whether the new URLs generated are indeed malicious • Approximate URL matching component • Perform an approximate match of a new URL with the existing blacklist

Component 1: Heuristics for Generating New URLs • Typical blacklist URL structure • http://domain.TLD/directory/filename?query string • H1: Replacing TLDs • H2: IP address equivalence • H3: Directory structure similarity • H4: Query string substitution • H5: Brand name equivalence

Heuristics for Generating New URLs • H1: Replacing TLDs • 3,210 effective top-level domains (TLDs) • Replace the effective TLD of the parent URL with each of the 3,209 other effective TLDs • H2: IP address equivalence • Phishing URLs having the same IP address are grouped together into clusters • Create new URLs by considering all combinations of hostnames and pathnames within a cluster
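A minimal sketch of H1 and H2, assuming each parent URL carries an http:// scheme and that the caller already knows the parent's effective TLD. The function names replace_tld and ip_cluster_children, and the short TLD list in the usage note, are illustrative assumptions and do not come from the paper.

```python
from itertools import product
from urllib.parse import urlsplit


def replace_tld(url, current_tld, candidate_tlds):
    """H1: swap the parent's effective TLD for every other candidate TLD."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + current_tld
    if not host.endswith(suffix):
        return []
    stem = host[: -len(suffix)]
    return [
        parts._replace(netloc=f"{stem}.{tld}").geturl()
        for tld in candidate_tlds
        if tld != current_tld
    ]


def ip_cluster_children(cluster_urls):
    """H2: inside one same-IP cluster, pair every hostname with every pathname."""
    splits = [urlsplit(u) for u in cluster_urls]
    hosts = sorted({s.netloc for s in splits})
    paths = sorted({s.path for s in splits})
    candidates = {f"http://{h}{p}" for h, p in product(hosts, paths)}
    # Only combinations that are not already parents count as children.
    return sorted(candidates - set(cluster_urls))
```

For example, replace_tld("http://abc.com/login.htm", "com", ["com", "net", "org"]) yields the .net and .org variants of the parent, and a two-URL cluster yields the two cross combinations of host and path.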

Heuristics for Generating New URLs (cont’d) • H3: Directory structure similarity • URLs with similar directory structure are grouped together • Build new URLs by exchanging the filenames among URLs belonging to the same group • Parent • www.abc.com/online/signin/paypal.htm www.xyz.com/online/signin/ebay.htm • Child • www.abc.com/online/signin/ebay.htm www.xyz.com/online/signin/paypal.htm
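The H3 example above can be sketched as follows. This is an illustration only: it treats "similar" directory structure as an identical directory path, and it assumes scheme-less URLs of the form host/dir/.../file as on the slide. The name directory_children is hypothetical.

```python
from collections import defaultdict


def directory_children(parent_urls):
    """H3: exchange filenames among parents that share a directory path."""
    groups = defaultdict(list)  # directory path -> [(host, filename), ...]
    for url in parent_urls:
        host, _, path = url.partition("/")
        directory, _, filename = path.rpartition("/")
        groups[directory].append((host, filename))

    children = set()
    for directory, members in groups.items():
        filenames = {name for _, name in members}
        prefix = f"/{directory}/" if directory else "/"
        for host, _ in members:
            for name in filenames:
                children.add(f"{host}{prefix}{name}")
    # Drop the parents themselves; what remains are the new candidates.
    return sorted(children - set(parent_urls))


parents = [
    "www.abc.com/online/signin/paypal.htm",
    "www.xyz.com/online/signin/ebay.htm",
]
print(directory_children(parents))
```

Run on the two parent URLs from the slide, this produces the two child URLs listed there.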

Heuristics for Generating New URLs (cont’d) • H4: Query string substitution • Build new URLs by exchanging the query strings among URLs • Parent • www.abc.com/online/signin/ebay?XYZ • www.xyz.com/online/signin/paypal?ABC • Child • www.abc.com/online/signin/ebay?ABC • www.xyz.com/online/signin/paypal?XYZ
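A similar sketch for H4, assuming the parents passed in already belong to one group whose query strings may be exchanged (the slide does not say how groups are formed, so grouping is left to the caller). The name query_children is hypothetical.

```python
def query_children(parent_urls):
    """H4: exchange query strings among the given parent URLs."""
    bases = []
    queries = set()
    for url in parent_urls:
        base, sep, query = url.partition("?")
        bases.append(base)
        if sep:  # only record a query string if the URL actually has one
            queries.add(query)
    candidates = {f"{base}?{query}" for base in bases for query in queries}
    return sorted(candidates - set(parent_urls))


parents = [
    "www.abc.com/online/signin/ebay?XYZ",
    "www.xyz.com/online/signin/paypal?ABC",
]
print(query_children(parents))
```

On the slide's two parents this returns the two children shown above, with the query strings ABC and XYZ swapped.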

Heuristics for Generating New URLs (cont’d) • H5: Brand name equivalence • Build new URLs by substituting brand names occurring in phishing URLs with other brand names
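H5 can be illustrated with plain string substitution. The brand list below is a made-up example, the match is case-sensitive, and brand_children is a hypothetical name; how the brand list is built in practice is not covered on the slide.

```python
def brand_children(url, brands):
    """H5: replace each brand name found in the URL with every other brand."""
    children = set()
    for present in brands:
        if present not in url:
            continue
        for other in brands:
            if other != present:
                children.add(url.replace(present, other))
    return sorted(children)


# Hypothetical brand list, for illustration only.
print(brand_children(
    "www.abc.com/online/signin/paypal.htm",
    ["paypal", "ebay", "chase"],
))
```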

Component 1: Verification • Conduct a DNS lookup to filter out sites that cannot be resolved • For each of the resolved URLs • Try to establish a connection to the corresponding server • For each successful connection • Initiate an HTTP GET request to obtain content from the server • If the HTTP response from the server has status code 200 or 202 (successful request) • Compare the content of the child URL with that of the parent URL • If the child's content closely resembles the parent's (similarity above, say, 90%) • Conclude that the child URL is a bad site
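The verification steps can be sketched with the Python standard library. The slide does not name the content similarity measure, so difflib.SequenceMatcher is used here purely as a stand-in; the 0.9 threshold mirrors the "say 90%" figure above. The names resolves, fetch and is_bad_child are assumptions, and the resolver and fetcher are injectable so the decision logic can be exercised without network access.

```python
import socket
import urllib.request
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def resolves(host):
    """DNS lookup: True if the hostname resolves."""
    try:
        socket.gethostbyname(host)
        return True
    except (OSError, UnicodeError):
        return False


def fetch(url, timeout=5):
    """HTTP GET: return (status, body), or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None


def is_bad_child(child_url, parent_body, threshold=0.9,
                 resolver=resolves, fetcher=fetch):
    """Apply the DNS, connection, status-code and similarity checks in order."""
    host = urlsplit(child_url).hostname
    if not host or not resolver(host):
        return False
    result = fetcher(child_url)
    if result is None:
        return False
    status, body = result
    if status not in (200, 202):
        return False
    # Stand-in similarity measure (assumption): character-level ratio in [0, 1].
    return SequenceMatcher(None, parent_body, body).ratio() > threshold
```

A child is reported as bad only if every stage passes; any failed lookup, failed connection or non-200/202 status drops it early, before content is compared.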

Component 2: Approximate Matching • Determine whether a given URL is a phishing site or not

M1: Matching IP Address • Perform a direct match of the IP address of the URL against the IP addresses of the blacklist entries • Assign a normalized score based on the number of blacklist entries that map to a given IP address • ni: the number of blacklisted URLs that share IP address IPi • min{ni} (max{ni}): the minimum (maximum) number of phishing URLs hosted on any single blacklisted IP address
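The scoring formula itself did not survive on this slide. The sketch below assumes the usual min-max normalization, (ni - min{ni}) / (max{ni} - min{ni}), which is consistent with the quantities defined above but should be checked against the paper. The function names and the documentation-range IP addresses are illustrative.

```python
from collections import Counter


def ip_scores(blacklist_ips):
    """Map each blacklisted IP to a min-max normalized score in [0, 1].

    blacklist_ips holds one entry per blacklisted URL (its resolved IP).
    """
    counts = Counter(blacklist_ips)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    # If every IP hosts the same number of URLs, treat all as maximal (assumption).
    return {ip: (n - lo) / span if span else 1.0 for ip, n in counts.items()}


def m1_score(ip, scores):
    """Score of an incoming URL's IP; 0.0 when the IP is not blacklisted."""
    return scores.get(ip, 0.0)


ips = ["192.0.2.1"] * 5 + ["192.0.2.2"] + ["192.0.2.3"] * 3
scores = ip_scores(ips)
print(m1_score("192.0.2.3", scores))
```

Under this normalization the least-populated blacklisted IP scores 0, the same as an IP that is not on the blacklist at all, so M1 only contributes for IPs that host more phishing URLs than the minimum.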

M2: Matching Hostname • Perform a hostname match against those in the blacklist • Domains of phishing URLs • Specifically registered for hosting phishing sites • Hosted on free or paid-for web-hosting services (WHS) • Identify whether an incoming URL is hosted on a WHS or not • Matching WHSes • Matching non-WHSes

M2: Matching Hostname (cont’d)

M3: Matching Directory Structure • Perform directory structure match with those in the blacklist • Philosophy of this design • H3 (directory structure similarity) • H4 (query string substitution) • ni: the number of URLs corresponding to a directory structure

M4: Matching Brand Names • Check for existence of brand names in pathname and query string of URLs • ni: the number of occurrences of the brand name • Compute a final cumulative score • Assign different weights to different modules

Evaluation: Component 1 • Collect 6,000 URLs from PhishTank (2009/7/2 ~ 2009/7/25)

Evaluation: Component 2 • How many benign (malicious) sites are (not) flagged as malicious • Data source • Phishing URLs • PhishTank (about 18,000 URLs) • SpamScatter (14,000 URLs) • Benign URLs • DMOZ (100,000 benign URLs) • 20,000 benign URLs from Yahoo Random URL generator (YRUG)
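The two quantities in the first bullet are the false positive rate (benign sites flagged as malicious) and the false negative rate (malicious sites not flagged). A small sketch of how they could be computed from classifier output; the function name and the toy inputs are made up and are not results from the paper.

```python
def error_rates(benign_flags, phishing_flags):
    """Return (false_positive_rate, false_negative_rate).

    benign_flags:   one bool per benign URL, True if it was flagged malicious.
    phishing_flags: one bool per phishing URL, True if it was flagged malicious.
    """
    false_positive_rate = sum(benign_flags) / len(benign_flags)
    false_negative_rate = sum(not f for f in phishing_flags) / len(phishing_flags)
    return false_positive_rate, false_negative_rate


# Toy inputs, for illustration only.
print(error_rates([True, False, False, False],
                  [True, True, False, True, True]))
```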

Evaluation: Component 2 (cont’d) • Training phase • Create various data structures using the phishing URLs • Testing phase • An input URL is flagged as a phishing or a benign site • Weight of individual modules • W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)
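One plausible reading of the cumulative score is a weighted sum of the per-module scores using the weights above. Both the weighted-sum form and the decision threshold are assumptions here; the slides do not state a threshold, so it is left as a caller-supplied parameter.

```python
# Module weights as given on the slide: W(M1, M2, M3, M4).
WEIGHTS = {"M1": 1.0, "M2": 1.0, "M3": 1.5, "M4": 1.5}


def cumulative_score(module_scores, weights=WEIGHTS):
    """Weighted sum of module scores; a missing module contributes 0."""
    return sum(w * module_scores.get(name, 0.0) for name, w in weights.items())


def is_phishing(module_scores, threshold):
    """Flag the URL when the cumulative score reaches the threshold.

    The threshold is hypothetical and must be chosen by the caller.
    """
    return cumulative_score(module_scores) >= threshold


print(cumulative_score({"M1": 0.5, "M3": 1.0}))
```

With these weights, a match on directory structure (M3) or brand names (M4) moves the total 1.5 times as much as an equally strong IP (M1) or hostname (M2) match.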

Evaluation: Component 2 (cont’d)

Conclusion • Address major problems associated with blacklists • Two major components of PhishNet • URL prediction component • Approximate URL matching component • Flag new URLs effectively
