On: An Artificial Immune System for E-mail Classification


Internal presentation by Lei Wang, Pervasive and Artificial Intelligence research group (http://diuf.unifr.ch/pai), on: An Artificial Immune System for E-mail Classification, by Andy Secker, Alex Freitas and Jon Timmis, Computing Laboratory, University of Kent, Canterbury, Kent, UK.

Presentation Transcript


Internal Presentation by: Lei Wang

Pervasive and Artificial Intelligence research group

http://diuf.unifr.ch/pai

On: An Artificial Immune System for E-mail Classification

Andy Secker, Alex Freitas, Jon Timmis

Computing Laboratory, University of Kent

Canterbury, Kent, UK

http://www.cs.kent.ac.uk/~ads3

19/02/2004



Significance

  • As the amount of information on the Internet grows, so does the drive to find more effective tools for distinguishing interesting from non-interesting material.

  • This paper presents an immune-inspired algorithm called AISEC that is capable of continuously classifying electronic mail as interesting or non-interesting without the need for re-training.

  • Compared with a naïve Bayesian classifier, the proposed system performs equally well and has great potential for augmentation.


AISEC, immunity-inspired system

  • Immune system

    • Human body constantly under attack. Immune system must adapt and respond

    • The (natural) immune system is:

      • Dynamic

      • Adaptive

      • Robust

      • Etc.

  • Artificial Immune Systems (AIS) use principles and processes from observed and theoretical immunology to solve problems


Artificial Immune Systems

  • Engineering framework

    • Representation of individual immune cells

    • Affinity measures

      • Evaluate interaction of individuals with environment and/or each other

    • Algorithms

      • Procedures of adaptation manipulate populations of immune cells

  • AIS as a classifier

    • AIRS

      • A successful supervised AIS algorithm for classification


AIS for Web Mining

  • Web mining is an umbrella term for three quite different types of data mining:

    • Content mining

      • A process of extracting useful information from the text, images and other forms of content that make up the pages

      • The mining of textual data is a common task, often for the purposes of information retrieval

    • Usage mining

    • Structure mining

  • AISEC research goal

    • To develop a highly adaptive system capable of retrieving interesting information from the Internet based on the user’s current interests

    • The authors believe AIS may offer a number of advantages


What is AISEC ?

  • AISEC isn’t a spam filter

    • It has no methods to penalize false positives (loss of important e-mail)

    • Without a very low false positive rate, a spam filter would not be trusted


What is AISEC ?

  • AISEC is

    • A first step towards an AIS for web mining.

      • A study of performance and characteristics of an AIS applied to text mining in a dynamic domain

    • A text classification algorithm capable of continuous adaptation, which may yield a classification accuracy comparable to a Bayesian approach.

      • User behaviour and interaction with e-mail can be similar to web pages

      • Supervised classification algorithm

        • E-mail classified as interesting and uninteresting

        • Uses constant(ish) feedback from user

        • Capable of continuous adaptation

      • This tracks concept drift and can also handle concept shift

      • A specialised AIS algorithm based in part on the immune principle of clonal selection

        • No previously documented algorithm was suited for use in this situation without extensive changes


Representation

  • Each cell contains 3 sets of words (+ state)

    • Punctuation is removed from fields

    • Research literature has suggested header information is enough to accurately classify e-mail*

      A = [<free,DVD> , <sales,com> , <canterbury,UK>]

        • Subject field: title of the e-mail

        • Sender field: sender’s name

        • Return field: sender’s address

* Diao, Lu & Wu (2000). A Comparative Study of Classification Based Personal E-mail Filtering, PAKDD 2000
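
As a rough illustration of this representation, a cell can be modelled as three word lists plus a state flag. This is a minimal sketch, not the authors' implementation: the names `Cell`, `tokenize` and `from_headers`, the state label "naive", and the choice to split fields on punctuation are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


def tokenize(field: str) -> list[str]:
    """Split a header field into words, treating punctuation as a separator.

    Assumption: "punctuation is removed" is read here as "split on anything
    that is not a letter or digit". Case is preserved, as in the slide example.
    """
    return re.findall(r"[A-Za-z0-9]+", field)


@dataclass
class Cell:
    """Hypothetical container for one immune cell: three word sets plus a state."""
    subject: list[str]
    sender: list[str]
    return_address: list[str]
    state: str = "naive"  # assumed label; the slides only say "+ state"

    @classmethod
    def from_headers(cls, subject: str, sender: str, return_address: str) -> "Cell":
        return cls(tokenize(subject), tokenize(sender), tokenize(return_address))


cell = Cell.from_headers("free DVD!", "sales.com", "canterbury, UK")
print(cell.subject, cell.sender, cell.return_address)
```

With these inputs the three fields tokenize to the same word sets as the example vector A above.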


Affinity

  • The affinity value is the proportion of words in one cell that are also found in another

    • More features would require a less naïve distance measure

    • Cosine distance is an obvious choice

    • Resultant value always between 0 and 1

      A = [<free,DVDs> , <offers,DVD,com> , <offers,DVD,com>]

      B = [<half,price,sale>,<sales,DVD,com>,<sales,DVD,com>]

      affinity(A,B) = 4/9

PROCEDURE affinity(bc1, bc2)
    IF (bc1 has a shorter feature vector than bc2)
        bshort ← bc1, blong ← bc2
    ELSE
        bshort ← bc2, blong ← bc1
    count ← the number of words in bshort present in blong
    bs_len ← the length of bshort’s feature vector
    RETURN count / bs_len
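
The affinity procedure can be sketched in Python as follows. This is an illustrative reading, not the authors' code: it assumes a cell is a list of three word lists, that the "feature vector" is the concatenation of those lists, and that an empty cell has affinity 0 (the pseudocode does not cover that case). Applied to the vectors A and B as transcribed above, it returns 4/8 = 0.5 rather than the 4/9 shown on the slide, which suggests either that a word was lost from A in transcription or that the slide divided by the longer vector.

```python
def affinity(bc1, bc2):
    """Proportion of the shorter cell's words that also occur in the longer cell.

    Each cell is a list of word lists (subject, sender, return address).
    Assumption: words are matched across the whole flattened feature vector.
    """
    words1 = [word for field in bc1 for word in field]
    words2 = [word for field in bc2 for word in field]

    # As in the pseudocode: bc1 is "short" only if strictly shorter than bc2.
    if len(words1) < len(words2):
        short, long_ = words1, words2
    else:
        short, long_ = words2, words1

    if not short:  # assumption: avoid division by zero for empty cells
        return 0.0

    long_words = set(long_)
    count = sum(1 for word in short if word in long_words)
    return count / len(short)


A = [["free", "DVDs"], ["offers", "DVD", "com"], ["offers", "DVD", "com"]]
B = [["half", "price", "sale"], ["sales", "DVD", "com"], ["sales", "DVD", "com"]]
print(affinity(A, B))
```

Because the count is divided by the length of the shorter vector, the result always lies between 0 and 1, as the slide states.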


Clone-Mutation

  • A single mutation takes a word previously seen in a subject or address field and uses it to replace the word at one location in the cell

    • Subject, sender and return address libraries are kept separately

    • Usually >1 mutation per cell takes place

PROCEDURE clone_mutate(bc1, bc2)
    aff ← affinity(bc1, bc2)
    clones ← ∅
    num_clones ← ⌊ aff * Kl ⌋
    num_mutate ← ⌊ (1-aff) * bc1’s feature vector length * Km ⌋
    DO (num_clones) TIMES
        bcx ← a copy of bc1
        DO (num_mutate) TIMES
            p ← a random point in bcx’s feature vector
            w ← a random word from the appropriate gene library
            replace word in bcx’s feature vector at location p with w
        bcx’s stimulation level ← Ksb
        clones ← clones ∪ {bcx}
    RETURN clones

SubjectLib = free,DVD

SenderLib = sales,DVD,com

ReturnLib = sales,DVD,com

Before mutation: A = [<free,DVD> , <sales,DVD,com> , <sales,DVD,com>]

After mutation: A = [<free,free> , <sales,DVD,com> , <sales,DVD,com>]
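
A hedged Python sketch of the clone-mutation procedure follows. The cell layout (a dict with `fields` and `stimulation`), the parameter names `k_l`, `k_m`, `k_sb` standing in for Kl, Km and Ksb, and the values used in the usage lines are assumptions for illustration only; the slides do not give parameter values. Rounding the clone and mutation counts down is likewise an assumption about the bracket notation in the pseudocode.

```python
import copy
import math
import random


def affinity(fields1, fields2):
    """Share of the shorter vector's words found in the longer one (see Affinity)."""
    w1 = [w for f in fields1 for w in f]
    w2 = [w for f in fields2 for w in f]
    short, long_ = (w1, w2) if len(w1) < len(w2) else (w2, w1)
    if not short:
        return 0.0
    long_words = set(long_)
    return sum(1 for w in short if w in long_words) / len(short)


def clone_mutate(bc1, bc2, libraries, k_l, k_m, k_sb, rng=random):
    """Return mutated clones of bc1; more clones and fewer mutations at high affinity.

    libraries[i] is the gene library for field i and must be non-empty
    for any field that can be mutated.
    """
    aff = affinity(bc1["fields"], bc2["fields"])
    length = sum(len(f) for f in bc1["fields"])
    num_clones = math.floor(aff * k_l)
    num_mutate = math.floor((1 - aff) * length * k_m)

    # Every (field index, word index) pair is a possible mutation point.
    positions = [(i, j) for i, f in enumerate(bc1["fields"]) for j in range(len(f))]

    clones = []
    for _ in range(num_clones):
        bcx = copy.deepcopy(bc1)
        for _ in range(num_mutate):
            i, j = rng.choice(positions)
            # Draw the replacement from the library matching the chosen field.
            bcx["fields"][i][j] = rng.choice(libraries[i])
        bcx["stimulation"] = k_sb
        clones.append(bcx)
    return clones


libs = [["free", "DVD"], ["sales", "DVD", "com"], ["sales", "DVD", "com"]]
cell = {"fields": [["free", "DVDs"], ["offers", "DVD", "com"], ["offers", "DVD", "com"]],
        "stimulation": 0}
mail = {"fields": [["half", "price", "sale"], ["sales", "DVD", "com"], ["sales", "DVD", "com"]],
        "stimulation": 0}
for clone in clone_mutate(cell, mail, libs, k_l=7, k_m=0.5, k_sb=10, rng=random.Random(0)):
    print(clone["fields"])
```

Keeping one library per field means a subject word can only be replaced by another subject word, which matches the note that the libraries are kept separately.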


The algorithm - classification

1. System is initialised with known uninteresting e-mail

(Figure legend: memory cells, naive cells)

2. E-mail presented for classification. Classified as uninteresting as it stimulates close cells


The algorithm – correct classification

3. Highly stimulated cell reproduces 7 times. Less stimulated cell produces only 2 clones but with higher mutation rate

(Figure labels: stimulation region, classification region)

4. Cell with highest affinity is known to be useful and is therefore rewarded by becoming a memory cell.
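
The classification and stimulation regions can be read as two affinity thresholds around each cell. The sketch below is one possible interpretation, not the authors' code: the function names and the parameters `classification_threshold` and `stimulation_threshold` are hypothetical, and the slides do not say how the two thresholds relate to each other.

```python
def affinity(cell_a, cell_b):
    """Share of the shorter vector's words found in the longer one (see Affinity)."""
    words_a = [w for field in cell_a for w in field]
    words_b = [w for field in cell_b for w in field]
    short, long_ = (words_a, words_b) if len(words_a) < len(words_b) else (words_b, words_a)
    if not short:
        return 0.0
    long_words = set(long_)
    return sum(1 for w in short if w in long_words) / len(short)


def classify(email, cells, classification_threshold):
    """Label an e-mail uninteresting if it falls inside any cell's classification region."""
    best = max((affinity(cell, email) for cell in cells), default=0.0)
    return "uninteresting" if best > classification_threshold else "interesting"


def stimulated_cells(email, cells, stimulation_threshold):
    """Cells whose stimulation region contains the e-mail; candidates for cloning."""
    return [cell for cell in cells if affinity(cell, email) > stimulation_threshold]


cells = [[["free", "DVDs"], ["offers", "DVD", "com"], ["offers", "DVD", "com"]]]
email = [["half", "price", "sale"], ["sales", "DVD", "com"], ["sales", "DVD", "com"]]
print(classify(email, cells, classification_threshold=0.2))
print(len(stimulated_cells(email, cells, stimulation_threshold=0.4)))
```

Since the cells are initialised from uninteresting mail only, an e-mail that is close to no cell defaults to "interesting" in this reading.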


The algorithm cont…

  • Incorrect classification

    • Any cell responsible for incorrect classification is removed (memory or otherwise)

  • Cell removal

    • Aged naïve cells deleted. Memory cells placed in already covered areas also deleted.
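
Two of these removal rules can be sketched as below. This is a loose illustration under several assumptions that the slides do not confirm: a cell is "responsible" when its affinity with the misclassified e-mail exceeds a hypothetical `classification_threshold`, and ageing is modelled as a stimulation counter that drops by one per step and kills a naive cell at zero. The rule for memory cells in already covered areas is not sketched.

```python
def affinity(fields_a, fields_b):
    """Share of the shorter vector's words found in the longer one (see Affinity)."""
    words_a = [w for f in fields_a for w in f]
    words_b = [w for f in fields_b for w in f]
    short, long_ = (words_a, words_b) if len(words_a) < len(words_b) else (words_b, words_a)
    if not short:
        return 0.0
    long_words = set(long_)
    return sum(1 for w in short if w in long_words) / len(short)


def remove_responsible(cells, email_fields, classification_threshold):
    """Drop every cell, naive or memory, that would have matched the e-mail."""
    return [c for c in cells
            if affinity(c["fields"], email_fields) <= classification_threshold]


def age_naive_cells(cells):
    """Decrement each naive cell's stimulation and delete those that reach zero.

    Memory cells are left untouched in this sketch.
    """
    survivors = []
    for cell in cells:
        if cell["state"] == "naive":
            cell = {**cell, "stimulation": cell["stimulation"] - 1}
            if cell["stimulation"] <= 0:
                continue
        survivors.append(cell)
    return survivors


population = [
    {"fields": [["free", "DVD"], ["sales", "com"], ["sales", "com"]],
     "state": "memory", "stimulation": 5},
    {"fields": [["hello"], ["bob"], ["bob", "org"]],
     "state": "naive", "stimulation": 1},
]
wrongly_flagged = [["free", "lunch"], ["sales", "com"], ["sales", "com"]]
population = remove_responsible(population, wrongly_flagged, classification_threshold=0.2)
population = age_naive_cells(population)
print(len(population))
```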


Results – Classification accuracy

  • 2268 e-mails (742 uninteresting) received over 6 months

  • E-mails presented in the order of date received

  • Feedback given after EVERY classification

  • AISEC was run 10 times; results show the mean

  • C5.0, neural network and C&R tree all run in “Clementine” data mining package

  • The Bayesian algorithm used feedback to update itself, as AISEC does
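
The continuous-learning protocol described above (classify each e-mail in date order, then give feedback after every classification) corresponds to a test-then-train loop. The sketch below shows only that evaluation loop; `LastLabelClassifier` is a deliberately trivial stand-in invented for the example, not AISEC or the Bayesian classifier, and the labels are made up.

```python
def continuous_accuracy(emails, classifier):
    """Classify each e-mail in order, score it, then feed the true label back.

    emails: list of (features, true_label) pairs, already sorted by date received.
    classifier: any object with classify(features) and
                update(features, true_label, predicted) methods (assumed interface).
    """
    if not emails:
        return 0.0
    correct = 0
    for features, true_label in emails:
        predicted = classifier.classify(features)       # test first ...
        correct += predicted == true_label
        classifier.update(features, true_label, predicted)  # ... then train
    return correct / len(emails)


class LastLabelClassifier:
    """Toy stand-in: always predicts the label of the previous e-mail."""

    def __init__(self, initial_label):
        self.last = initial_label

    def classify(self, features):
        return self.last

    def update(self, features, true_label, predicted):
        self.last = true_label


stream = [("m1", "interesting"), ("m2", "interesting"),
          ("m3", "uninteresting"), ("m4", "uninteresting")]
print(continuous_accuracy(stream, LastLabelClassifier("interesting")))
```

Because every e-mail is scored before its label is revealed, no separate test set is needed and the classifier is always evaluated on mail it has not yet seen.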

(Results table, with columns for Traditional Learning and Continuous Learning, not preserved in this transcript.)


Results – variation of population size


User point of view

  • AISEC runs as a proxy on local machine

  • Advantages

    • No need to switch e-mail client

    • Can collect mail from multiple locations

  • AISEC’s user interface would require minimal interaction


User point of view

(Diagram labels: Server(s), Collect mail, AISEC, Classifier, Store, Client, Return mail, Interesting, Uninteresting, User interaction, Positive user response, Negative user response, Local machine)


Results cont…

  • Standard measures of quality

    • Precision is the proportion of documents classified as positive that are actually positive

    • Recall is the proportion of positive documents that are actually classified as positive
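
Under the usual definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN count true positives, false positives and false negatives. A small sketch of the computation follows; the function name, the labels and the sample data are illustrative only, and returning 0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Return (precision, recall) for the class named by `positive`."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1          # correctly flagged as positive
        elif predicted == positive:
            fp += 1          # flagged as positive, actually negative
        elif truth == positive:
            fn += 1          # positive document that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = ["u", "u", "u", "u", "i", "i"]
guess = ["u", "u", "i", "i", "u", "i"]
print(precision_recall(truth, guess, positive="u"))
```

In this sample, 2 of the 3 e-mails flagged "u" really are "u" (precision 2/3), while only 2 of the 4 true "u" e-mails were found (recall 1/2).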


Results – variation of time between user feedback


Conclusion

Conclusion

  • AISEC has produced promising results and appears robust

    • Interesting note: Typical accuracy similar to published results from other AIS for text classification (both traditional and continuous learning)

    • Use a larger training set and optimise (the many) parameters

      • Detect when the number of cells is optimal

  • AISEC has been useful in providing some evidence that applying AIS to this domain is feasible

  • Research on adaptive systems for retrieval of interesting information, not necessarily purely accurate information


Questions & Discussions

An Artificial Immune System for E-mail Classification

Andy Secker, Alex Freitas, Jon Timmis

Computing Laboratory, University of Kent

More information:

http://www.cs.kent.ac.uk/~ads3c

