AntiPhish – Lessons Learnt


Presentation Transcript


  1. AntiPhish – Lessons Learnt André Bergholz Fraunhofer IAIS, St. Augustin Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD) June 28th, 2009

  2. Phishing E-mail fraud • Send official-looking email • Include web link or form • Ask for confidential information, e.g., password, account details • Attacker uses information to withdraw money, enter computer systems, etc.

  3. Phishing: Target Sites • Target customers of banks and online payment services • Obtain sensitive data from U.S. taxpayers via spoofed IRS emails • Identity theft for social network sites, e.g., myspace.com • Recently, more non-financial brands have been attacked, including social networking, VOIP, and numerous large web-based email providers. http://www.antiphishing.org/

  4. Phishing: Techniques • Upward trend in the number of phishing mails sent • Massive increase in the number of phishing sites over the past years • Increasing sophistication • Link manipulation, URL misspelling • Website address manipulation • Evolution of phishing methods beyond shotgun-style email • Image phishing • Spear phishing (targeted) • Voice-over-IP phishing • Whaling: high-profile people http://www.antiphishing.org/

  5. Phishing: Damage Gartner (“The War on Phishing Is Far From Over”, 2009): • 5 million US consumers affected between 09/2007 and 09/2008 (39.8% increase) • Average loss per consumer: $351 (60% decrease); total loss: 1.8 billion dollars • Top three most-attacked countries: USA, UK, Italy [RSA Online Fraud Report, 2009] • 90% of internet users are fooled by good phishing websites [Dhamija et al., SIGCHI 2006] • For the individual phisher: a low-skill, low-income business [Herley and Florencio, New Security Paradigms Workshop, 2008] http://www.antiphishing.org/

  6. Approaches against Phishing • Network- and encryption-based countermeasures: email authentication, two-factor authentication, mobile TANs, etc. • Blacklisting and whitelisting: lists of phishing sites and legitimate sites • Content-based filtering for websites and emails • Typical formulations urging the user to enter confidential information • Design elements, trademarks, and logos of known brands (only relatively few brands are attacked) • Spoofed sender addresses and URLs • Invisible content inserted to fool automatic filtering approaches • Images containing the message text

  7. Consortium • Fraunhofer IAIS (DE) • Symantec (GB, IRL) • Tiscali (IT) • Nortel (FR) • K.U. Leuven (BE) EU-Project AntiPhish Period: 01/2006 – 06/2009 • Develop content-based phishing filters • Use realistic email corpora • Deploy in realistic workflows • Trainable and adaptive filters → adapt to new phishing attacks → anticipate attacks

  8. Agenda → Email Classification based on Advanced Text Mining • Hidden Salting and Anticipating Evasion • Real-Life AntiPhish Deployment • Conclusions

  9. Filtering as a Classification Problem [diagram: email → feature extraction → classifier → phishing / non-phishing] Task: automatically classify emails based on content • Use email features relevant to detecting phishing • Training data: emails labeled with classes ham, spam, phishing • Train a classifier • Apply it to new emails

  10. Message Preprocessing [diagram: a standardized email data file (flat representation) is parsed into a structured representation — header with header fields, body with mixed and alternative MIME parts, plain-text and HTML parts with attributes (metadata) and (encoded) content — including embedded images and attachments]

  11. Basic Features Can be derived directly from the email itself, i.e., do not require information about specific websites • Structural features (4): number of body parts (total, discrete, composite, alternative) • Link features (8): number of links (total, internal, external, with IP numbers, deceptive, image), number of dots, action-word links • Element features (4): HTML, scripts, JavaScript, forms • Spam-filter features (2): SpamAssassin (untrained) score and classification • Word-list features (9): indicator word stems, e.g., account, update, confirm, verify, secur, notif, log, click, inconvenien
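As an illustration of how such basic features might be computed, here is a hypothetical sketch that extracts a few of the link, element, and word-list features named above from a raw HTML email body. The function name, the regexes, and the exact feature set are illustrative, not the project's actual implementation.

```python
import re

# Indicator word stems from the slide's word-list features.
INDICATOR_STEMS = ["account", "update", "confirm", "verify", "secur",
                   "notif", "log", "click", "inconvenien"]

def basic_features(html_body: str) -> dict:
    links = re.findall(r'href="([^"]+)"', html_body)
    features = {
        "num_links": len(links),
        # Links whose host is a raw IP number are a classic phishing sign.
        "num_ip_links": sum(bool(re.match(r'https?://\d+\.\d+\.\d+\.\d+', u))
                            for u in links),
        "num_dots": sum(u.count(".") for u in links),
        "has_form": "<form" in html_body.lower(),
        "has_script": "<script" in html_body.lower(),
    }
    text = html_body.lower()
    for stem in INDICATOR_STEMS:
        features["word_" + stem] = text.count(stem)
    return features
```

A real extractor would parse the MIME structure first (slide 10) rather than run regexes over the raw body.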

  12. Dynamic Markov Chains • Operate on the bit representation of the natural-language text of the email • Model a bit sequence as a stationary and ergodic Markov source with limited memory: 0101001010010010111010100101101001010100101010011101001010101010101 … • Incrementally build an automaton / Markov chain to model the training sequences • Train one DMC for each of the classes (i.e., ham, spam, phishing); for a new email, see which model fits best • Has been successfully applied to spam classification [Bratko et al., JMLR 2006]

  13. Dynamic Markov Chains: Details • States: Two probabilities representing the likelihood that the source emits 1 or 0 as next symbol • Prediction: Move through automaton, add up likelihoods • Training (incremental): States are cloned when reached via a frequently used transition • Model size reduction: Use training examples that the model cannot already classify well enough (after some initial training, see also uncertainty sampling in active learning) • Features: Expected cross entropies of a message for either model (ham and phishing), Boolean membership indicators
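The per-class cross-entropy idea can be sketched with a much-simplified, hypothetical stand-in: a fixed-order binary context model with Laplace smoothing. The real DMC clones states adaptively as described above; here the "state" is simply the last ORDER bits, which keeps the sketch short while preserving the train-one-model-per-class, pick-the-lowest-cross-entropy workflow.

```python
import math
from collections import defaultdict

ORDER = 8  # context = last 8 bits (one byte); the real DMC grows its automaton

def to_bits(text: str):
    for byte in text.encode("utf-8"):
        for i in range(7, -1, -1):
            yield (byte >> i) & 1

class BitContextModel:
    def __init__(self):
        # counts[context] = [count of next bit 0, count of next bit 1],
        # initialised to 1/1 as a Laplace prior.
        self.counts = defaultdict(lambda: [1, 1])

    def train(self, text: str):
        ctx = 0
        for bit in to_bits(text):
            self.counts[ctx][bit] += 1
            ctx = ((ctx << 1) | bit) & ((1 << ORDER) - 1)

    def cross_entropy(self, text: str) -> float:
        """Average bits per bit needed to encode text under this model."""
        ctx, total, n = 0, 0.0, 0
        for bit in to_bits(text):
            c0, c1 = self.counts[ctx]
            p = (c1 if bit else c0) / (c0 + c1)
            total -= math.log2(p)
            n += 1
            ctx = ((ctx << 1) | bit) & ((1 << ORDER) - 1)
        return total / max(n, 1)
```

Train one model per class and classify a new email by the model with the lowest cross entropy, as on the slide.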

  14. Latent Topic Models Analyze the co-occurrence of words • Similar to word clustering: specify the number of topics in advance • Common methods: LDA, PLSA • Probabilistic latent semantic analysis: models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions • Latent Dirichlet allocation: generative Bayesian version with Dirichlet prior • Document: mixture of various topics

  15. Latent Topic Models: Class-Specific • Analyze the co-occurrence of words • Class-Topic Model (CLTOM): extension of LDA • Incorporates class information • LDA: uniform per-document topic Dirichlet prior α, uniform per-topic word Dirichlet prior β • CLTOM: class-specific per-document topic Dirichlet prior αc • Training using EM / mean-field approximation • Features: probabilities for each topic
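A toy version of PLSA, one of the two methods named above, shows the EM mechanics in a few lines. This is an illustrative sketch over a small term-document count matrix, not CLTOM or the project's code; the initialisation and iteration count are arbitrary.

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Toy PLSA via EM. counts[d][w] = count of word w in document d.
    Returns per-document topic mixtures P(z|d) and per-topic word
    distributions P(w|z)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    normalise = lambda row: [x / sum(row) for x in row]
    p_z_d = [normalise([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalise([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_zd = [[1e-12] * n_topics for _ in range(n_docs)]
        new_wz = [[1e-12] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) up to normalisation
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                s = sum(post)
                # M-step accumulation, weighted by the count n(d, w)
                for z in range(n_topics):
                    share = counts[d][w] * post[z] / s
                    new_zd[d][z] += share
                    new_wz[z][w] += share
        p_z_d = [normalise(r) for r in new_zd]
        p_w_z = [normalise(r) for r in new_wz]
    return p_z_d, p_w_z
```

The rows of P(z|d) are exactly the per-topic probability features mentioned on the slide.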

  16. Latent Topic Model: Topics [table: words of each topic, sorted by probability, with their relevance for phishing]

  17. Feature Processing and Selection Feature processing: • Scaling: guarantees that all features have values within the same range • Normalization: sets the length of each feature vector to one, which is adequate for inner-product-based classifiers Feature selection: • Goal: select a subset of relevant features • Abstractly: a search in a state space [Kohavi and John, AI Journal 1997] • Operates on an independent validation set • Best-first search strategy: expands the current subset by the node with the highest estimated performance; stores additional nodes to overcome local maxima • Compound operators: combine the set of best-performing children
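The two processing steps are simple enough to state in code; this sketch (illustrative, not the project's implementation) uses min-max scaling for the range guarantee and unit-length scaling for the normalization step.

```python
import math

def scale(columns):
    """Min-max scale each feature column into [0, 1]."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for constant features
        scaled.append([(x - lo) / span for x in col])
    return scaled

def normalise(vector):
    """Scale a feature vector to unit length, as suited to
    inner-product-based classifiers."""
    length = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / length for x in vector]
```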

  18. Evaluation Method and Test Corpus Standard method: 10-fold cross-validation • Criteria: Precision, recall, F-measure, false positive rate, false negative rate, accuracy for comparison with related work • Note: Errors are not of equal importance Test Corpus: Assembled by [Fette et al., WWW 2007] • Ham emails: SpamAssassin corpus • Phishing emails: Collected by Nazario • Total size: 7808 emails, 6951 ham (89%) and 857 phishing (11%)
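The evaluation criteria listed above reduce to a few ratios over the confusion counts, with phishing as the positive class. A small helper makes the definitions explicit (the variable names are the usual TP/FP/FN/TN convention, not taken from the slides):

```python
def metrics(tp, fp, fn, tn):
    """Criteria from the slide, computed from confusion counts
    (phishing = positive class)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # = 1 - false negative rate
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),        # lost ham: the costlier error
        "fnr": fn / (fn + tp),        # missed phishing
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```

The comment on unequal error costs matters here: FPR (lost ham) and FNR (missed phishing) are reported separately precisely because accuracy alone hides that asymmetry.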

  19. Overall Result [charts: missed phishing emails; lost ham emails] • FPR reduced by 92%, FNR by 64% • Statistically significant difference to [Fette et al. 07] with less than 1% error probability • Feature selection: better results with fewer features and less training data (20% reserved for validation)

  20. Agenda • Email Classification based on Advanced Text Mining → Hidden Salting and Anticipating Evasion • Real-Life AntiPhish Deployment • Conclusions

  21. Salting • Salting: intentional addition or distortion of content to evade automatic filtering • Can be applied to any medium (e.g., text, images, audio) and to any content genre (e.g., emails, web pages, MMS messages) • Visible salting: additional text, images containing random pixels, etc. • Hidden salting: not perceivable by the user (e.g., text in invisible color, text behind objects, reading-order manipulation)

  22. Concealing Text [diagram: a sample message — “A story / Once there was a noble prince. He lived in a fancy castle. Read more” — with concealment tricks applied to sample text “abcdefgh…”: font size, font colour, concealment behind objects, and clipping; processing pipeline: email source text (HTML) → rendering → drawing canvas (character tiles, internal representation) → cognitive model → simulated perceived text as seen by the end user]

  23. Hidden Salting Simulation • We tap into the rendering process to detect hidden content, i.e., manifestations of salting • Intercept requests for drawing text primitives • Build an internal representation of the characters, i.e., a list of attributed glyphs in compositional order • Test for glyph visibility: • Clipping: the glyph is drawn within the physical bounds of the drawing clip • Concealment: the glyph is not concealed by other glyphs or shapes • Font color: the glyph’s fill color contrasts well with the background color • Glyph size: the glyph’s size and shape are sufficiently large
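The visibility tests can be sketched as predicates over a glyph's geometry and colours. This is a hypothetical simplification: the actual system intercepts the renderer's drawing primitives, whereas here a glyph is just a dict, the concealment (z-order) test is omitted, and the size and contrast thresholds are invented.

```python
def luminance(rgb):
    """Perceptual luminance of an (r, g, b) colour, 0-255 per channel."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def glyph_visible(glyph, clip, min_size=4, min_contrast=32):
    x, y, w, h = glyph["bounds"]
    cx, cy, cw, ch = clip
    # Clipping: the glyph must lie within the drawing clip's bounds.
    inside = cx <= x and cy <= y and x + w <= cx + cw and y + h <= cy + ch
    # Concealment would require a z-order test against other shapes; omitted.
    # Font colour: the fill must contrast with the background.
    contrast = abs(luminance(glyph["fill"]) - luminance(glyph["background"]))
    # Glyph size: must be large enough to perceive.
    big_enough = w >= min_size and h >= min_size
    return inside and contrast >= min_contrast and big_enough
```

Only glyphs passing all tests would be fed onward as "visible text".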

  24. Hidden Salting Simulation (cont.) • We feed the intercepted, visible text into a cognitive hidden-salting simulation model, which returns the simulated perceived text • Reading order: detected from layout, under the expectation that glyphs of parallel lines are aligned • Check compliance of the text with language-specific distributions of character n-grams, common words, and word lengths • For details, see [De Beer and Moens, Tech. Report KU Leuven 2007]

  25. Evasion Detection • Cat-and-Mouse game: Spammers are developing tricks; filter developers are adapting their filters • So far: Hidden salting simulation model • Closing the loop: Identifying email messages that are likely to make the hidden salting simulation system fail • Method: Compare the simulated perceived text as generated by our hidden salting simulation system and the message text as obtained by applying OCR to the rendered email message
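The comparison step needs a robust text distance between the simulated perceived text and the OCR text. One natural choice (illustrative here, the slides do not name the exact measure) is a length-normalised edit distance, which tolerates the character-level noise OCR introduces:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling single-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def text_distance(simulated: str, ocr: str) -> float:
    """Normalised to [0, 1]; large values suggest the simulation missed
    content that OCR saw (or vice versa), i.e., possible evasion."""
    longest = max(len(simulated), len(ocr)) or 1
    return edit_distance(simulated, ocr) / longest
```

A small distance means simulation and OCR agree; a large one flags the email for the one-class outlier detector described on the following slides.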

  26. Evasion Detection: Approach

  27. Example
  HTML Source: <html> <body> <font color="ffffff">INNOCENT TEXT TO TRICK FILTER</font> <p>Your home refinance loan is approved!<br></p><br> <p>To get your approved amount <a href="http://www.mortgagepower3.com/">go here</a>.</p> <br><br><br> <p>To be excluded from further notices <a href="http://www.mortgagepower3.com/remove.html">go here</a>.</p> <font color="ffffff">1gate 5297gdqK6-498jyxl3033RafD3-195RTcz6485obQU9-615LOLg9l49</font> </body> </html>
  Email on Screen: Your home refinance loan is approved! To get your approved amount go here. To be excluded from further notices go here.
  Hidden Salting Simulation: INNOCENT TEXT TO TRICK FILTER Your home refinance loan is approved! To get your approved amount go here. To be excluded from further notices go here.
  OCR Text: Your _ome refinance loan is approve_! To get your approve_ amount _o_o _ere. To De exclu_e_ from furt_er notices __o _ere.
  → Detect Difference

  28. Evaluation Method • Method: simulate the detection of a new salting trick by disabling the detection of one of the known tricks • Classifier: one-class SVM • Training set = the “one class”: the class of “normal” emails, i.e., emails that contain no or only known salting tricks • Test set: emails both with and without the disabled (“new”) salting trick • Features: robust text-distance measures • The classifier marks outliers, i.e., emails that are not in the “one class”, which indicates that they may contain a previously unseen salting trick • The classifier produces a real-valued output; we automatically compute the cutoff threshold by reapplying the classifier to the training set • OCR engines: gocr, ocrad
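The automatic threshold step can be sketched independently of the SVM: reapply the trained classifier to its own training set, treat those real-valued scores as "normal", and place the cutoff at a chosen tail. The 5% tail fraction below is an illustrative assumption, not the project's parameter.

```python
def fit_threshold(train_scores, tail=0.05):
    """Cutoff on the classifier's real-valued output, fixed by reapplying
    it to the (assumed normal) training set: scores beyond the top `tail`
    fraction of training scores count as outliers."""
    ranked = sorted(train_scores)
    cut = int(len(ranked) * (1 - tail))
    return ranked[min(cut, len(ranked) - 1)]

def flag_outliers(test_scores, threshold):
    """True = outlier, i.e., possibly a previously unseen salting trick."""
    return [s > threshold for s in test_scores]
```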

  29. Test Data • 6951 ham, 2154 spam messages, 4559 phishing messages from SpamAssassin and Nazario corpus • Considered tricks: Font color, font size • Training set: 800 messages w/o trick • Test set: 100 messages with / 300 messages w/o trick

  30. Overall Result

  31. Agenda • Email Classification based on Advanced Text Mining • Hidden Salting and Anticipating Evasion → Real-Life AntiPhish Deployment • Conclusions

  32. Filtering a Real-Life Email Stream: Challenges • Fixed scenario with fixed parameters • Data: • From a present, real-life stream • Mostly English and Italian • (Almost) unskewed • All data is unlabeled; not easy to eliminate spam • Very strict privacy regulations • Experiments: “almost online”

  33. General Deployment Approach Start: initial AntiPhish model M0. For every day t ∈ {1, …, n}: • Capture a set of emails St, sent in real time through spam filters • Select a test subset Tt ⊂ St for evaluation of the current AntiPhish model Mt−1 • Select a subset At ⊂ St of emails that are difficult to classify, to be used for active learning • Obtain labels for the sets Tt and At • Evaluate the current model Mt−1 on the set Tt • Add the set At to the training set, train the new model Mt
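The daily loop can be sketched as a driver function. Every callable here (capture, selection, labeling, evaluation, training) is a placeholder the real system would supply; only the control flow mirrors the slide.

```python
def deploy(model0, days, capture, select_test, select_active,
           get_labels, evaluate, train):
    """Daily deployment loop: evaluate the previous model M_{t-1} on T_t,
    then extend the training set with A_t and train M_t."""
    model, train_set, results = model0, [], []
    for t in range(1, days + 1):
        stream = capture(t)                    # S_t
        test = select_test(stream)             # T_t
        active = select_active(stream, model)  # A_t: hard-to-classify emails
        labels = get_labels(test + active)     # human labeling step
        results.append(evaluate(model, test, labels))  # score M_{t-1}
        train_set += [(email, labels[email]) for email in active]
        model = train(train_set)               # new model M_t
    return model, results
```

Note that evaluation always happens before retraining, so each day's score reflects the previous day's model, exactly as in the slide's ordering.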

  34. Details • The AntiPhish system is evaluated on arbitrarily collected emails • Deployment period: n = 20 days • Used features: unigram, DMC, semantic topics with k = 25 topics, link, and lexical features • Every day, a total of |Tt ∪ At| = 750 emails is selected • An email is classified as non-ham if and only if it is considered non-ham with a probability of at least 95%

  35. Stratified Evaluation • Tt: stratified sample of its underlying base set St • Idea: “better” represent interesting emails • Two buckets: emails that are difficult or easy to classify • Basic procedure: oversample the difficult emails, but give them a lower weight in the evaluation • More specifically: let St = St(u) ∪ St(c); we want to sample k1 and k2 emails, respectively • Then … • We use a probability of p = 95% (for non-ham) as the certainty threshold
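The weighting formula itself did not survive the transcript, so the following is the standard importance-weighting reconstruction, offered as an assumption rather than the slide's exact formula: an email sampled from bucket b counts with weight |St(b)| / k_b, so the oversampled difficult bucket does not distort the estimated error rates.

```python
def stratified_weights(n_uncertain, n_certain, k1, k2):
    """Weight per sampled email: bucket size divided by sample size
    (standard stratified-sampling correction; assumed, not verbatim
    from the slide)."""
    return {"uncertain": n_uncertain / k1, "certain": n_certain / k2}

def weighted_error_rate(outcomes, weights):
    """outcomes: list of (bucket, is_error) pairs for the tested emails."""
    total = sum(weights[b] for b, _ in outcomes)
    wrong = sum(weights[b] for b, err in outcomes if err)
    return wrong / total
```

With k1 = k2, difficult emails are heavily oversampled relative to their share of St, and the weights undo exactly that oversampling.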

  36. Active Learning [diagram: email stream → current model splits emails into uncertain and certain; random oversampling of the uncertain and random undersampling of the certain emails extend the previous training set → training → new model] • Set of additional training emails per day: At, |At| = 500 • 400 top-ranked emails from St having the lowest classification confidence • …plus 100 emails randomly selected from the rest of St • Minimization of duplicates among the 400 uncertain emails: ignore duplicates

  37. Initial Dataset • Initial dataset: Six days of 750 messages each • Total: 4489 messages • Ham: 1514 (34%) • Phishing: 1342 (30%) • Spam: 1633 (36%) • Non-Ham: 2975 (66%) • Time period for experiment: subsequent 20 days

  38. Additional Training Data Through Active Learning

  39. Test Data and Evaluation • 250 messages per day • k1 = k2 = 125 difficult and easy messages • Sometimes fewer, because not enough difficult emails were found • Evaluation: • False positive rate: proportion of “lost” ham emails among all ham emails • False negative rate: proportion of missed non-ham emails among all non-ham emails

  40. Test Data

  41. Baseline Result [charts] • FPR (ham classified as non-ham), average: 0.34% • FNR (non-ham classified as ham), average: 7.09%

  42. Results for Selected Thresholds [chart: threshold in % on the predicted probability of non-ham]

  43. Effect of Active Learning [charts: ham classified as non-ham; non-ham classified as ham] Three different fixed models: • Initial model M0 • Model after five days of active learning, M5 • Model after ten days of active learning, M10

  44. Effect of Active Learning

  45. Spam Filter Vote as Feature

  46. Identifying Potential Phishing in Spam • Second real-life application • Anti-spam operations use spam traps to gather the latest spam samples so that these can be better defended against • The ability to separate out the phishing leads to a quicker defence against such fraudulent activity [diagram: honeypot network → spam + phishing → phishing classifier → regular spam; fast update of the spam filter → updated signatures]

  47. Related Laboratory Experiment • Labeled data: phishing and regular spam from a probe network • Training: 53 phishing vs. 1060 regular spam per week • Test: 75 phishing vs. 1443 regular spam per week (on average) • Duration: June to November 2008 (26 weeks) • System parameters • Features: DMC, semantic topics with 10 topics, unigram, wordlist, DMC-link • Threshold: neutral (50%) • Evaluation: sliding-window strategy • Each week is filtered by a classifier trained on the previous N = 4 weeks • Result • FPR (spam classified as phishing): 0.18% • FNR (phishing classified as spam): 4.89%
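The sliding-window strategy is compact enough to state as a loop. The train and classify callables are placeholders for the actual classifier; only the windowing follows the slide.

```python
def sliding_window_eval(weeks, train, classify, n=4):
    """Each week is classified by a model trained only on the previous
    n weeks of labeled data, mimicking deployment over time."""
    results = []
    for i in range(n, len(weeks)):
        model = train(weeks[i - n:i])   # previous N weeks only
        results.append(classify(model, weeks[i]))
    return results
```

The first n weeks serve only as training material, so a 26-week run with N = 4 yields 22 evaluated weeks.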

  48. Sliding Window, Training N = 4 Weeks [chart: training and prediction windows; phishing classified as spam; spam classified as phishing]

  49. Agenda • Email Classification based on Advanced Text Mining • Hidden Salting and Anticipating Evasion • Real-Life AntiPhish Deployment → Conclusions

  50. Conclusions: Lessons Learnt • Phishing: a multi-billion-dollar activity • AntiPhish: phishing prevention through content-based email filtering • Advanced text-mining features boost performance: dynamic Markov chains, latent topic models • Most of these techniques are language-independent • Anticipatory learning: detecting new filter-evasion techniques requires high-speed, high-quality OCR • Real-life deployment: • Active learning keeps filters up to date • Combination with spam filters improves performance through the incorporation of current blacklist information • Identifying phishing in a honeypot network permits prioritization in spam-filter updating
