Explore PhishNet's novel approach using heuristics to detect and blacklist phishing URLs effectively. Learn about URL prediction and matching components, heuristics for generating new URLs, and the evaluation of this predictive system.
PhishNet: Predictive Blacklisting to Detect Phishing Attacks • Reporter: Gia-Nan Gao • Advisor: Chin-Laung Lei • 2010/4/26
Reference • Pawan Prakash, Manish Kumar, Ramana Rao Kompella and Minaxi Gupta, “PhishNet: Predictive Blacklisting to Detect Phishing Attacks,” in IEEE INFOCOM 2010.
Outline • Introduction • Two Major Components of PhishNet • URL prediction component • Approximate URL matching component • Evaluation • Conclusion
Introduction • Phishing attacks • Set up fake web sites mimicking real businesses in order to lure innocent users into revealing sensitive information • Blacklisting • Match a given URL against the list of URLs in a blacklist • Problem of blacklisting • A malicious URL cannot be blacklisted until it has reached a certain prevalence in the wild
Two Major Components of PhishNet • URL prediction component • Generate new URLs (child) from known phishing URLs (parent) by employing various heuristics • Test whether the new URLs generated are indeed malicious • Approximate URL matching component • Perform an approximate match of a new URL with the existing blacklist
Component 1: Heuristics for Generating New URLs • Typical structure of a blacklisted URL • http://domain.TLD/directory/filename?query string • H1: Replacing TLDs • H2: IP address equivalence • H3: Directory structure similarity • H4: Query string substitution • H5: Brand name equivalence
Heuristics for Generating New URLs • H1: Replacing TLDs • 3,210 effective top-level domains (TLDs) • Replace the effective TLD of the parent URL with each of the 3,209 other effective TLDs • H2: IP address equivalence • Phishing URLs that share the same IP address are grouped together into clusters • Create new URLs by considering all combinations of hostnames and pathnames within a cluster
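H1 and H2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the caller is assumed to already know the parent's effective TLD (finding it in general needs a public-suffix list), and the example hostnames are made up.

```python
from itertools import product
from urllib.parse import urlsplit

def replace_tld(url, current_tld, other_tlds):
    """H1 sketch: swap the parent's effective TLD for each alternative TLD.

    current_tld is supplied by the caller (assumption); it is not detected here.
    """
    parts = urlsplit(url)
    host = parts.netloc
    if not host.endswith("." + current_tld):
        return []
    stem = host[: -len(current_tld)]  # hostname up to and including the last dot
    return [parts._replace(netloc=stem + tld).geturl()
            for tld in other_tlds if tld != current_tld]

def combine_cluster(urls):
    """H2 sketch: for URLs that share one IP, pair every hostname with every pathname."""
    hosts = {urlsplit(u).netloc for u in urls}
    paths = {urlsplit(u).path for u in urls}
    known = set(urls)
    candidates = {f"http://{h}{p}" for h, p in product(hosts, paths)}
    return sorted(candidates - known)  # keep only URLs not already blacklisted

children = replace_tld("http://abc.com/login.htm", "com", ["com", "net", "co.uk"])
cluster_children = combine_cluster(["http://a.com/x.htm", "http://b.com/y.htm"])
```

Grouping URLs by IP address is left to the caller here; `combine_cluster` expects one cluster at a time.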
Heuristics for Generating New URLs (cont’d) • H3: Directory structure similarity • URLs with similar directory structure are grouped together • Build new URLs by exchanging the filenames among URLs belonging to the same group • Parent • www.abc.com/online/signin/paypal.htm • www.xyz.com/online/signin/ebay.htm • Child • www.abc.com/online/signin/ebay.htm • www.xyz.com/online/signin/paypal.htm
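The filename exchange in H3 might look like the sketch below. It assumes that "similar directory structure" means an identical directory path and that URLs are written without a scheme, as on the slide; both are simplifications, and the helper names are hypothetical.

```python
from collections import defaultdict

def split_url(url):
    """Split a scheme-less URL into (host, directory, filename)."""
    host, _, path = url.partition("/")
    directory, _, filename = path.rpartition("/")
    return host, directory, filename

def swap_filenames(parents):
    """H3 sketch: exchange filenames among URLs that share a directory path."""
    groups = defaultdict(list)
    for url in parents:
        host, directory, filename = split_url(url)
        groups[directory].append((host, filename))
    known = set(parents)
    children = set()
    for directory, members in groups.items():
        filenames = {name for _, name in members}
        for host, _ in members:
            for name in filenames:
                # Skip empty parts so a URL with no directory does not get "//".
                child = "/".join(p for p in (host, directory, name) if p)
                if child not in known:
                    children.add(child)
    return sorted(children)

h3_children = swap_filenames([
    "www.abc.com/online/signin/paypal.htm",
    "www.xyz.com/online/signin/ebay.htm",
])
```

With the two parent URLs from the slide, the sketch yields exactly the two child URLs listed there.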
Heuristics for Generating New URLs (cont’d) • H4: Query string substitution • Build new URLs by exchanging the query strings among URLs • Parent • www.abc.com/online/signin/ebay?XYZ • www.xyz.com/online/signin/paypal?ABC • Child • www.abc.com/online/signin/ebay?ABC • www.xyz.com/online/signin/paypal?XYZ
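A rough sketch of H4 follows. It exchanges query strings among all URLs passed in, so the caller is assumed to have already selected the URLs that belong together; the function name is hypothetical.

```python
def swap_query_strings(parents):
    """H4 sketch: attach every observed query string to every observed base URL."""
    bases, queries = [], set()
    for url in parents:
        base, sep, query = url.partition("?")
        bases.append(base)
        if sep:  # only record a query string when the URL actually has one
            queries.add(query)
    known = set(parents)
    candidates = {f"{base}?{query}" for base in bases for query in queries}
    return sorted(candidates - known)

h4_children = swap_query_strings([
    "www.abc.com/online/signin/ebay?XYZ",
    "www.xyz.com/online/signin/paypal?ABC",
])
```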
Heuristics for Generating New URLs (cont’d) • H5: Brand name equivalence • Build new URLs by substituting brand names occurring in phishing URLs with other brand names
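H5 could be approximated as below. The brand list is a hypothetical input, and the match is a plain case-sensitive substring test, which is an assumption rather than the method described in the paper.

```python
def substitute_brands(url, brands):
    """H5 sketch: replace each brand name found in the URL with every other brand."""
    children = set()
    for present in brands:
        if present in url:
            for other in brands:
                if other != present:
                    children.add(url.replace(present, other))
    return sorted(children)

h5_children = substitute_brands(
    "www.abc.com/online/signin/paypal.htm", ["paypal", "ebay"]
)
```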
Component 1: Verification • Conduct a DNS lookup to filter out sites that cannot be resolved • For each of the resolved URLs • Try to establish a connection to the corresponding server • For each successful connection • Initiate an HTTP GET request to obtain content from the server • If the HTTP header from the server has status code 200/202 (successful request) • Compare the content of the child URL with that of the parent URL • If the child's content closely resembles the parent's (say, above 90% similarity) • Conclude that the child URL is a bad site
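The verification steps above can be sketched as a single function. This is an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for whatever content-similarity measure the authors used, the function and parameter names are hypothetical, and the DNS and HTTP steps are passed in as callables so they can be replaced (for example in tests).

```python
import http.client
import socket
import urllib.request
from difflib import SequenceMatcher
from urllib.parse import urlsplit

def resolves(url):
    """DNS lookup: True if the URL's hostname maps to an address."""
    try:
        socket.gethostbyname(urlsplit(url).hostname)
        return True
    except (socket.gaierror, TypeError, UnicodeError):
        return False

def fetch(url, timeout=5):
    """HTTP GET: return (status, body), or None when the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return None

def is_phishing_child(parent_body, child_url, threshold=0.9,
                      resolver=resolves, fetcher=fetch):
    """Apply the slide's checks in order; any failed check rejects the child."""
    if not resolver(child_url):
        return False
    result = fetcher(child_url)
    if result is None:
        return False
    status, body = result
    if status not in (200, 202):
        return False
    # Similarity ratio in [0, 1]; the 0.9 default mirrors the "say 90%" on the slide.
    return SequenceMatcher(None, parent_body, body).ratio() > threshold
```

Note that `urlopen` raises for most error statuses, so in this sketch the explicit status check mainly filters other 2xx and redirected responses.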
Component 2: Approximate Matching • Determine whether a given URL is a phishing site or not
M1: Matching IP Address • Perform a direct match of the IP address of the URL with the IP addresses of the blacklist entries • Assign a normalized score based on the number of blacklist entries that map to a given IP address • If IP address IPi is common to ni URLs, its normalized score is (ni - min{ni}) / (max{ni} - min{ni}) • min{ni} (max{ni}): the minimum (maximum) of the number of phishing URLs hosted by blacklisted entries of IP addresses
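A small sketch of the M1 score, assuming the min-max normalization that the min{ni} and max{ni} definitions suggest. The function names, the score of 0.0 for an unlisted IP, and the handling of the case where all counts are equal are assumptions; the addresses are documentation-range examples.

```python
from collections import Counter

def ip_scores(blacklist_ips):
    """Map each blacklisted IP to a min-max normalized score in [0, 1].

    blacklist_ips holds one IP per blacklisted URL, so repeats count URLs per IP.
    """
    counts = Counter(blacklist_ips)
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    # Assumption: if every IP hosts the same number of URLs, score them all 1.0.
    return {ip: ((n - lo) / span if span else 1.0) for ip, n in counts.items()}

def m1_score(ip, scores):
    """Direct match: an IP absent from the blacklist scores 0.0 (assumption)."""
    return scores.get(ip, 0.0)

scores = ip_scores(["192.0.2.1"] * 5 + ["192.0.2.2"] * 3 + ["192.0.2.3"])
```

One consequence of this normalization is that the IP hosting the fewest phishing URLs scores 0.0, the same as an IP that is not on the blacklist at all.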
M2: Matching Hostname • Perform hostname match with those in the blacklist • Domains of phishing URLs • Specifically registered for hosting phishing sites • Hosted on free or paid-for web-hosting services (WHS) • Identify whether an incoming URL is hosted on a WHS or not • Matching WHSes • Matching non-WHSes
M3: Matching Directory Structure • Perform directory structure match with those in the blacklist • Philosophy of this design • H3 (directory structure similarity) • H4 (query string substitution) • ni: the number of URLs corresponding to a directory structure
M4: Matching Brand Names • Check for existence of brand names in pathname and query string of URLs • ni: the number of occurrences of the brand name • Compute a final cumulative score • Assign different weights to different modules
Evaluation: Component 1 • Collect 6,000 URLs from PhishTank (2009/7/2 ~ 2009/7/25)
Evaluation: Component 2 • How many benign sites are flagged as malicious (false positives) and how many malicious sites are not flagged (false negatives) • Data source • Phishing URLs • PhishTank (about 18,000 URLs) • SpamScatter (14,000 URLs) • Benign URLs • DMOZ (100,000 benign URLs) • 20,000 benign URLs from Yahoo Random URL generator (YRUG)
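The two error measures for this evaluation can be computed as in the sketch below. The function name and the toy labels are illustrative only and are not results from the paper.

```python
def error_rates(labels, flags):
    """Return (false_positive_rate, false_negative_rate).

    labels and flags are parallel lists of booleans, where True means phishing.
    """
    false_pos = sum(1 for y, f in zip(labels, flags) if not y and f)
    false_neg = sum(1 for y, f in zip(labels, flags) if y and not f)
    benign = sum(1 for y in labels if not y)
    phishing = len(labels) - benign
    fpr = false_pos / benign if benign else 0.0
    fnr = false_neg / phishing if phishing else 0.0
    return fpr, fnr

# Toy data: 2 phishing URLs (one missed) and 4 benign URLs (one wrongly flagged).
rates = error_rates(
    [True, True, False, False, False, False],
    [True, False, True, False, False, False],
)
```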
Evaluation: Component 2 (cont’d) • Training phase • Create various data structures using the phishing URLs • Testing phase • An input URL is flagged as a phishing or a benign site • Weight of individual modules • W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)
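One way to combine the module scores with these weights is a weighted sum compared against a threshold. The slides only state that a cumulative score is computed with different weights per module, so the weighted-sum form, the threshold parameter, and the names below are assumptions.

```python
# Module weights as given on the slide.
WEIGHTS = {"M1": 1.0, "M2": 1.0, "M3": 1.5, "M4": 1.5}

def cumulative_score(module_scores, weights=WEIGHTS):
    """Weighted sum of per-module scores; a missing module contributes 0."""
    return sum(w * module_scores.get(name, 0.0) for name, w in weights.items())

def is_flagged(module_scores, threshold):
    """Flag the URL as phishing when the cumulative score reaches the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the slides.
    """
    return cumulative_score(module_scores) >= threshold

total = cumulative_score({"M1": 1.0, "M2": 0.0, "M3": 0.5, "M4": 1.0})
```

With normalized module scores in [0, 1], the largest possible cumulative score under these weights is 5.0.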
Conclusion • Address major problems associated with blacklists • Two major components of PhishNet • URL prediction component • Approximate URL matching component • Flag new URLs effectively