
Internal Presentation by: Lei Wang, Pervasive and Artificial Intelligence research group. On: An Artificial Immune System for E-mail Classification. Andy Secker, Alex Freitas, Jon Timmis. Computing Laboratory, University of Kent, Canterbury, Kent, UK.


Internal Presentation by: Lei Wang, Pervasive and Artificial Intelligence research group

On: An Artificial Immune System for E-mail Classification

Andy Secker, Alex Freitas, Jon Timmis

Computing Laboratory, University of Kent

Canterbury, Kent, UK





  • As the amount of information on the Internet grows, so does the need for more effective tools for distinguishing between interesting and non-interesting material.

  • This paper presents an immune-inspired algorithm called AISEC that is capable of continuously classifying electronic mail as interesting or non-interesting without the need for re-training.

  • Compared with a naïve Bayesian classifier, the proposed system performs comparably and has great potential for augmentation.

AISEC, immunity-inspired system

  • Immune system

    • Human body constantly under attack. Immune system must adapt and respond

    • The (natural) immune system is:

      • Dynamic

      • Adaptive

      • Robust

      • Etc.

  • Artificial Immune Systems (AIS) use principles and processes from observed and theoretical immunology to solve problems

Artificial Immune Systems

  • Engineering framework

    • Representation of individual immune cells

    • Affinity measures

      • Evaluate interaction of individuals with environment and/or each other

    • Algorithms

      • Adaptation procedures that manipulate populations of immune cells

  • AIS as a classifier

    • AIRS

      • A successful supervised AIS algorithm for classification

AIS for Web Mining

  • Web mining is an umbrella term used to describe three quite different types of data mining:

    • Content mining

      • A process of extracting useful information from the text, images and other forms of content that make up the pages

      • The mining of textual data is a common task, often for the purposes of information retrieval

    • Usage mining

    • Structure mining

  • AISEC research goal

    • To develop a highly adaptive system capable of retrieving interesting information from the Internet based on the user’s current interests

    • The authors believe AIS may offer a number of advantages

What is AISEC?

  • AISEC isn’t a spam filter

    • It has no methods to penalize false positives (loss of important e-mail)

    • Without a very low false positive rate, a spam filter would not be trusted

What is AISEC?

  • AISEC is

    • A first step towards an AIS for web mining.

      • A study of performance and characteristics of an AIS applied to text mining in a dynamic domain

    • A text classification algorithm capable of continuous adaptation, which may yield a classification accuracy comparable to a Bayesian approach.

      • User behaviour and interaction with e-mail can be similar to that with web pages

      • Supervised classification algorithm

        • E-mail classified as interesting and uninteresting

        • Uses constant(ish) feedback from user

        • Capable of continuous adaptation

      • This tracks concept drift and can also handle concept shift

      • A specialised AIS algorithm based in part on the immune principle of clonal selection

        • No previously documented algorithm was suited for use in this situation without extensive changes


  • Each cell contains 3 sets of words (+ state)

    • Punctuation is removed from fields

    • Research literature has suggested header information is enough to accurately classify e-mail*

      A = [<free,DVD> , <sales,com> , <canterbury,UK>]

    • Subject field: title of the e-mail

    • Sender field: sender’s name

    • Return field: sender’s address

* Diao, Lu & Wu (2000). A Comparative Study of Classification Based Personal E-mail Filtering, PAKDD 2000
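
The cell representation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Cell`, `tokenize` and `cell_from_headers` are hypothetical, the "state" is reduced to a single boolean flag, and punctuation is replaced by spaces so that an address such as `sales.com` splits into two words.

```python
import string
from dataclasses import dataclass


def tokenize(field: str) -> list[str]:
    """Replace punctuation with spaces and split the field into words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return field.translate(table).split()


@dataclass
class Cell:
    """Hypothetical cell: three word lists plus a state flag."""
    subject: list[str]
    sender: list[str]
    return_address: list[str]
    is_memory: bool = False  # assumption: state is modelled as one boolean

    def words(self) -> list[str]:
        """All words of the cell as one flat feature vector."""
        return self.subject + self.sender + self.return_address


def cell_from_headers(subject: str, sender: str, return_address: str) -> Cell:
    """Build a cell from the three raw header fields."""
    return Cell(tokenize(subject), tokenize(sender), tokenize(return_address))


cell = cell_from_headers("free DVD!", "sales.com", "canterbury, UK")
print(cell.words())
```

Keeping the three fields separate, rather than storing one flat list, leaves room for the per-field word libraries used later during mutation.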


  • Affinity value is proportion of words in one cell found in another

    • More features would require a less naïve distance measure

    • Cosine distance is an obvious choice

    • Resultant value always between 0 and 1

      A = [<free,DVDs> , <offers,DVD,com> , <offers,DVD,com>]

      B = [<half,price,sale>,<sales,DVD,com>,<sales,DVD,com>]

      affinity(A,B) = 4/9

PROCEDURE affinity (bc1, bc2)
  IF (bc1 has a shorter feature vector than bc2)
    bshort ← bc1, blong ← bc2
  ELSE
    bshort ← bc2, blong ← bc1
  count ← the number of words in bshort present in blong
  bs_len ← the length of bshort’s feature vector
  RETURN count / bs_len
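
A minimal Python sketch of the affinity procedure above. It assumes each cell is given as one flat list of words (the three fields concatenated), and it returns 0.0 for an empty feature vector, a case the pseudocode does not address.

```python
def affinity(words1: list[str], words2: list[str]) -> float:
    """Proportion of words in the shorter vector that appear in the longer one."""
    if len(words1) < len(words2):
        short, long_ = words1, words2
    else:
        short, long_ = words2, words1
    if not short:
        return 0.0  # assumption: an empty vector matches nothing
    long_words = set(long_)  # set membership keeps the count linear
    count = sum(1 for word in short if word in long_words)
    return count / len(short)


a = ["free", "DVD", "sales", "com"]
b = ["half", "price", "DVD", "sales", "com", "UK"]
print(affinity(a, b))
```

Dividing by the length of the shorter vector is what keeps the value between 0 and 1.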


  • One mutation replaces the word at a single location with a word previously seen in a subject or address field

    • Subject, sender and return address libraries are kept separately

    • Usually >1 mutation per cell takes place

PROCEDURE clone_mutate (bc1, bc2)
  aff ← affinity(bc1, bc2)
  clones ← ∅
  num_clones ← ⌊ aff * Kl ⌋
  num_mutate ← ⌊ (1-aff) * bc1’s feature vector length * Km ⌋
  DO (num_clones) TIMES
    bcx ← a copy of bc1
    DO (num_mutate) TIMES
      p ← a random point in bcx’s feature vector
      w ← a random word from the appropriate gene library
      replace word in bcx’s feature vector at location p with w
    bcx’s stimulation level ← Ksb
    clones ← clones ∪ {bcx}
  RETURN clones

SubjectLib = free,DVD

SenderLib = sales,DVD,com

ReturnLib = sales,DVD,com

Before mutation: A = [<free,DVD> , <sales,DVD,com> , <sales,DVD,com>]

After mutation: A = [<free,free> , <sales,DVD,com> , <sales,DVD,com>]
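
The clone-and-mutate step can be sketched in Python as below. This is an illustration, not the authors' implementation: cells are modelled as dictionaries with one word list per field, rounding down is assumed for both counts, and the default values for `k_l`, `k_m` and `k_sb` are placeholders rather than values taken from the presentation.

```python
import copy
import math
import random

FIELDS = ("subject", "sender", "return_address")


def flatten(cell: dict) -> list[str]:
    """Concatenate the three fields into one feature vector."""
    return [word for field in FIELDS for word in cell[field]]


def affinity(cell1: dict, cell2: dict) -> float:
    """Proportion of words in the shorter vector found in the longer one."""
    v1, v2 = flatten(cell1), flatten(cell2)
    short, long_ = (v1, v2) if len(v1) < len(v2) else (v2, v1)
    if not short:
        return 0.0
    long_words = set(long_)
    return sum(1 for w in short if w in long_words) / len(short)


def clone_mutate(bc1, bc2, libraries, k_l=7.0, k_m=0.7, k_sb=125, rng=random):
    """Clone bc1 in proportion to its affinity with bc2; mutate inversely."""
    aff = affinity(bc1, bc2)
    num_clones = math.floor(aff * k_l)
    num_mutate = math.floor((1 - aff) * len(flatten(bc1)) * k_m)
    clones = []
    for _ in range(num_clones):
        bcx = copy.deepcopy(bc1)
        for _ in range(num_mutate):
            # Pick a random word position, then draw the replacement
            # from the library belonging to that position's field.
            positions = [(f, i) for f in FIELDS for i in range(len(bcx[f]))]
            field, index = rng.choice(positions)
            bcx[field][index] = rng.choice(libraries[field])
        bcx["stimulation"] = k_sb  # every clone starts at the same level
        clones.append(bcx)
    return clones


libraries = {
    "subject": ["free", "DVD"],
    "sender": ["sales", "DVD", "com"],
    "return_address": ["sales", "DVD", "com"],
}
parent = {
    "subject": ["free", "DVD"],
    "sender": ["sales", "DVD", "com"],
    "return_address": ["sales", "DVD", "com"],
}
stimulus = {
    "subject": ["free", "DVDs"],
    "sender": ["offers", "DVD", "com"],
    "return_address": ["offers", "DVD", "com"],
}
clones = clone_mutate(parent, stimulus, libraries, rng=random.Random(0))
print(len(clones))
```

Because both counts are derived from the same affinity value, a close match yields many near-identical clones while a weak match yields few, heavily mutated ones.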

The algorithm - classification

1. System is initialised with known uninteresting e-mail

(Figure legend: memory cells, naïve cells)

2. E-mail is presented for classification. It is classified as uninteresting because it stimulates nearby cells

The algorithm – correct classification

3. Highly stimulated cell produces 7 clones. Less stimulated cell produces only 2 clones but with a higher mutation rate

(Figure labels: stimulation region, classification region)

4. The cell with the highest affinity is known to be useful and is therefore rewarded by becoming a memory cell.
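
The classification and reward steps can be sketched roughly as follows. The threshold test is an assumption made for illustration: the slides mention a classification region but give no value, so `classification_threshold` and its default are hypothetical, and the separate stimulation region and cloning are left out.

```python
def affinity(words1: list[str], words2: list[str]) -> float:
    """Proportion of words in the shorter vector found in the longer one."""
    short, long_ = (words1, words2) if len(words1) < len(words2) else (words2, words1)
    if not short:
        return 0.0
    long_words = set(long_)
    return sum(1 for w in short if w in long_words) / len(short)


def classify(email_words, cells, classification_threshold=0.2):
    """Label an e-mail uninteresting if it falls close enough to any cell."""
    stimulated = [
        cell for cell in cells
        if affinity(cell["words"], email_words) > classification_threshold
    ]
    label = "uninteresting" if stimulated else "interesting"
    return label, stimulated


def reward_best(email_words, stimulated):
    """After a confirmed classification, promote the closest cell to memory."""
    best = max(stimulated, key=lambda cell: affinity(cell["words"], email_words))
    best["memory"] = True
    return best


cells = [
    {"words": ["free", "DVD", "sales", "com"], "memory": False},
    {"words": ["cheap", "loans", "bank", "net"], "memory": False},
]
label, stimulated = classify(["free", "DVD", "offer", "sales", "com"], cells)
print(label)
```

Since the population only holds cells for uninteresting mail, an e-mail that stimulates no cell defaults to interesting.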

The algorithm cont…

  • Incorrect classification

    • Any cell responsible for an incorrect classification is removed (memory or otherwise)

  • Cell removal

    • Aged naïve cells are deleted. Memory cells placed in already covered areas are also deleted.

Results – Classification accuracy

  • 2268 e-mails (742 uninteresting) received over 6 months

  • E-mails presented in the order of date received

  • Feedback given after EVERY classification

  • AISEC run 10 times, results show mean

  • C5.0, neural network and C&R tree all run in “Clementine” data mining package

  • The Bayesian algorithm used feedback to update itself, as AISEC does

(Results table comparing traditional learning and continuous learning; values not preserved in this transcript)

Results – variation of population size

User point of view

  • AISEC runs as a proxy on local machine

  • Advantages

    • No need to switch e-mail client

    • Can collect mail from multiple locations

  • AISEC’s user interface would require minimal interaction

User point of view

(Diagram labels: collect mail, return mail, user interaction, positive user response, negative user response, local machine)

Results cont…

  • Standard measures of quality

    • Precision is the proportion of documents classified as positive that are actually positive

    • Recall is the proportion of actually positive documents that are classified as positive
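
A small sketch of how the two measures are usually computed from predicted and true labels. Treating "uninteresting" as the positive class is an assumption made for this example, as is returning 0.0 when a denominator is zero.

```python
def precision_recall(predicted, actual, positive="uninteresting"):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
    # Precision: of everything labelled positive, how much really was.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything really positive, how much was found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


predicted = ["uninteresting", "uninteresting", "uninteresting", "interesting"]
actual = ["uninteresting", "interesting", "interesting", "uninteresting"]
print(precision_recall(predicted, actual))
```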

Results – variation of time between user feedback


  • AISEC has produced promising results and appears robust

    • Interesting note: Typical accuracy similar to published results from other AIS for text classification (both traditional and continuous learning)

    • Use a larger training set and optimise (the many) parameters

      • Detect when the number of cells is optimal

  • AISEC has been useful in providing some evidence that applying AIS to this domain is feasible

  • Research on adaptive systems for retrieval of interesting information, not necessarily purely accurate information

Questions & Discussions
