
A Suffix Tree Approach to Text Classification Applied to Email Filtering



Presentation Transcript


  1. A Suffix Tree Approach to Text Classification Applied to Email Filtering. School of Computer Science and Information Systems, Birkbeck College, University of London. Rajesh Pampapathi, Boris Mirkin, Mark Levene.

  2. Introduction – Outline • Motivation: Examples of Spam • Suffix Tree construction • Document scoring and classification • Experiments and results • Conclusion

  3. Example 1, standard spam mail: Buy cheap medications online, no prescription needed. We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products. No embarrasing trips to the doctor, get it delivered directly to your door. Experienced reliable service. Most trusted name brands. For your solution click here: http://www.webrx-doctor.com/?rid=1000

  4. Example 5, embedded message (plus word salad): zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN * GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX <http://healthygrow.biz/index.php?id=2> zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical & Safezoonal andNGASXHBPnatural & TestedQLOLNYQandEAVMGFCapproved zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas <http://healthygrow.biz/remove.php>

  5. Example 4, word salads: Buy meds online and get it shipped to your door Find out more here <http://www.gowebrx.com/?rid=1001> a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications

  6. Creating a Suffix Tree. [Figure: suffix tree built from the strings MEET and FEET; each node below ROOT is labelled with a character and, in brackets, its frequency count.]
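
The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it inserts every suffix of every training string character by character and increments a counter on each node it passes through. The names `Node` and `build_suffix_tree` are made up for this example.

```python
class Node:
    """A tree node: a frequency count plus child nodes keyed by character."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_suffix_tree(strings):
    """Insert every suffix of every string, counting each node visit."""
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                # Create the child on first use, then count this visit.
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


tree = build_suffix_tree(["MEET", "FEET"])
# Frequencies of the nodes directly below ROOT.
top_level = {ch: child.count for ch, child in tree.children.items()}
```

For MEET and FEET the nodes directly below the root are M, E, T and F. E is the only node with frequency 4, which agrees with the single "(4)" label in the slide's figure.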

  7. Levels of Information • Characters: the alphabet (and character frequencies) of a class. • Matches: between query strings and a class. For example, with s = nviaXgraU>Tabl$$$ets and t = xv^ia$graTab£££lets, Matches(s, t) = {v, ia, gra, Tab, l, ets, $}. But what about overlapping matches? • Trees: properties of the class as a whole, e.g. size and density (complexity).
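
The slide does not say how Matches(s, t) is computed. One reading that reproduces the set shown is a greedy left-to-right scan: at each position of s take the longest substring that also occurs in t, then jump past it. The sketch below assumes that reading; `greedy_matches` is a hypothetical helper, and it uses plain substring tests rather than a suffix tree for brevity.

```python
def greedy_matches(s, t):
    """Collect non-overlapping matches of s against t, scanning left to right."""
    matches = set()
    i = 0
    while i < len(s):
        length = 0
        # Extend the candidate while it still occurs somewhere in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        if length == 0:
            i += 1  # character absent from t: skip it
        else:
            matches.add(s[i:i + length])
            i += length  # jump past the match, so overlaps are ignored
    return matches


s = "nviaXgraU>Tabl$$$ets"
t = "xv^ia$graTab£££lets"
found = greedy_matches(s, t)
```

Because the scan jumps past each match, overlapping candidates such as "ra" or "ab" are never reported. That is the limitation the slide's question about overlapping matches points at.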

  8. Document Similarity Measure. The score for a document d is the sum of the scores of its suffixes, where d(i) is the suffix of d beginning at the i-th letter and tau is a tree normalisation coefficient.

  9. Substring Similarity Measure. The score for a match m = m0 m1 m2 … mn is score(m), where: T is the tree profile of the class; v(m|T) is a normalisation coefficient based on the properties of T; p(mt) is the probability of the character mt of the match m; and Φ[p] is a significance function.
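
The two scoring slides can be combined into a small sketch. The equations themselves did not survive in this transcript, so everything here is an assumption made for illustration: score(m) is taken to be v(m|T) times the sum of Φ[p(mt)] over the matched characters, p(mt) is estimated as a node's frequency divided by the total frequency of its siblings, v is passed in as a plain number `nu`, and `tau` simply multiplies the document sum.

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}


def build_suffix_tree(strings):
    """Character-level tree of all suffixes, with a visit count per node."""
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def match_score(suffix, root, phi=lambda p: 1.0, nu=1.0):
    """Follow `suffix` down the tree as far as it matches, summing phi(p)."""
    node, total = root, 0.0
    for ch in suffix:
        child = node.children.get(ch)
        if child is None:
            break  # the match ends at the first character not in the tree
        # Assumed estimate of p(m_t): share of this node among its siblings.
        siblings = sum(c.count for c in node.children.values())
        total += phi(child.count / siblings)
        node = child
    return nu * total


def document_score(d, root, phi=lambda p: 1.0, tau=1.0):
    """Sum the match scores of every suffix d(i); tau is assumed to multiply."""
    return tau * sum(match_score(d[i:], root, phi) for i in range(len(d)))


tree = build_suffix_tree(["MEET", "FEET"])
constant_score = document_score("EEL", tree)              # phi = 1 everywhere
linear_score = match_score("EEL", tree, phi=lambda p: p)  # phi(p) = p
```

With the constant function each matched character contributes 1, so the document "EEL" scores 2 for the suffix "EEL", 1 for "EL" and 0 for "L".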

  10. Decision Mechanism

  11. Specifications of Φ[p] (character level). Note: Logit and Sigmoid need to be adjusted to fit in the range [0,1].
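
The table of specifications is missing from this transcript. As a rough guide, the functions named in the deck (constant, root, logit, sigmoid) have the standard textbook forms below. Applying them directly to the probability p, and the min-max rescaling in `sigmoid_rescaled`, are assumptions: the slide only says that an adjustment to [0,1] is needed, not how it is done.

```python
import math


def constant(p):
    return 1.0


def root(p):
    return math.sqrt(p)


def logit(p):
    # Unbounded as p approaches 0 or 1, so it needs clipping or rescaling.
    return math.log(p / (1.0 - p))


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


def sigmoid_rescaled(p):
    """Hypothetical adjustment: stretch sigmoid over p in [0, 1] onto [0, 1]."""
    low, high = sigmoid(0.0), sigmoid(1.0)
    return (sigmoid(p) - low) / (high - low)
```

On p in [0, 1] the raw sigmoid only spans about 0.5 to 0.73, which is one reason some adjustment is needed before it can be compared with the other functions.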

  12. Significance function

  13. Threshold Variation ~ Significance functions ~

  14. Threshold Variation ~ Significance functions ~

  15. Match normalisation: m* is the set of all strings formed by permutations of m; m’ is the set of all strings of length equal to the length of m.

  16. Match normalisation MUN: match unnormalised; MPN: permutation normalised; MLN: length normalised
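
The formulas for the three variants are not in the transcript. The sketch below assumes a natural reading of the definitions of m* and m’: the coefficient is the frequency of m in the class divided by the total frequency of the strings in m* (MPN) or in m’ (MLN), and is 1 for MUN. Frequencies come from a plain substring counter, which gives the same numbers as the node counts of a full suffix tree. The training string "TEEM" is made up so that a permutation of the match actually occurs.

```python
from collections import Counter
from itertools import permutations


def substring_counts(strings):
    """Frequency of every substring across the class's training strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                counts[s[i:j]] += 1
    return counts


def nu_mun(m, counts):
    """MUN: match unnormalised."""
    return 1.0


def nu_mpn(m, counts):
    """MPN: frequency of m relative to all distinct permutations of m."""
    total = sum(counts[q] for q in {"".join(p) for p in permutations(m)})
    return counts[m] / total if total else 0.0


def nu_mln(m, counts):
    """MLN: frequency of m relative to all class substrings of the same length."""
    total = sum(c for s, c in counts.items() if len(s) == len(m))
    return counts[m] / total if total else 0.0


counts = substring_counts(["MEET", "TEEM"])
```

Enumerating permutations grows factorially with the length of m, so this form is only practical for short matches.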

  17. Threshold Variation ~ match normalisation ~ Panels: constant significance function, unnormalised; constant significance function, match normalised.

  18. Specifications of tau

  19. Tree normalisation

  20. Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~

  21. ~ Ling-BKS Corpus ~ ~ SpamAssassin Corpus ~

  22. Conclusions • Good overall classifier: an improvement on naïve Bayes, but there is still room for improvement • Can one method ever maintain 100% accuracy? • Extending the classifier • Applications to other domains, e.g. web page classification

  23. Future Work - ODP

  24. Computational Performance

  25. Experimental Data Sets • Ling-Spam (LS): spam (481) collected by Androutsopoulos et al.; ham (2412) from an online linguists’ bulletin board • SpamAssassin, Easy (SAe) and Hard (SAh): spam (1876) and ham (4176) examples donated • BBK: spam (652) collected by Birkbeck

  26. Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~

  27. Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~

  28. ~ SpamAssassin Corpus ~

  29. Vector Space Model. Example text: “What then?” sang Plato’s ghost, “What then?” (W. B. Yeats). Term counts: book 0, ghost 1, host 0, plate 0, Plato 1, sang 1, then 2, what 2. Word probability: P(w = ‘what’) = 50/1000 = 0.05.
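
The term-count vector on this slide can be reproduced with a short bag-of-words sketch. The tokenisation used here (lower-casing and keeping runs of letters) is an assumption for illustration; the slide does not say how the words were extracted.

```python
import re
from collections import Counter

VOCABULARY = ["book", "ghost", "host", "plate", "plato", "sang", "then", "what"]


def term_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the text."""
    # Lower-case, then keep runs of letters; "Plato’s" becomes "plato" and "s".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


line = "“What then?” sang Plato’s ghost, “What then?”"
vector = term_vector(line)
```

The leftover token "s" from the possessive is not in the vocabulary, so it is ignored. Word order and punctuation are lost in the vector, which is the point picked up on the later Criticisms slide.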

  30. Creating Profiles Mark

  31. Profiles. Mark Levene: engines, databases, information, search, data. Mike Hu: police, intelligence, criminal, computational, data.

  32. Classification. Profiles: Boris Mirkin, Mark Levene, Mike Hu. Scores: SBM, SML, SMH.

  33. Naïve Bayes (similarity measure). For a document d = {d1 d2 d3 … dm} and set of classes c = {c1, c2 ... cJ}: (1) Where: (2) (3)
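
Equations (1) to (3) are missing from this transcript. For orientation only, the standard textbook form of the naïve Bayes classifier in the slide's notation is given below. It may differ in detail from what the slide showed, and the relative-frequency estimate in the last line is one common choice, not necessarily the authors'. The symbols N_j (number of training documents in class c_j), N (total training documents) and n(w, c_j) (occurrences of word w in class c_j) are introduced here for illustration.

```latex
\begin{align}
  c^{*} &= \arg\max_{c_j \in c} \; P(c_j \mid d), \\
  P(c_j \mid d) &\propto P(c_j) \prod_{i=1}^{m} P(d_i \mid c_j), \\
  P(c_j) &= \frac{N_j}{N}, \qquad
  P(d_i \mid c_j) = \frac{n(d_i, c_j)}{\sum_{w} n(w, c_j)}.
\end{align}
```

The product over the words of d is where the independence assumption enters: each word contributes separately, regardless of its neighbours.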

  34. Criticisms • Pre-processing: stop-word removal; word stemming/lemmatisation; punctuation and formatting • Smallest unit of consideration is a word. • Classes (and documents) are bags of words, i.e. each word is treated as independent of all others.

  35. Word Dependencies. Boris Mirkin: means, intelligence, clustering, computational, data. Mike Hu: means, intelligence, criminal, computational, data.

  36. Word Inflections. Intelligent, Intelligence, Intelligentsia, Intelligible: stem to "Intellig-" or "intelligent"?

  37. Success measures • Recall is the proportion of the examples of a class that are correctly classified. If SR is spam recall, then (1 – SR) gives the proportion of spam that is missed (false negatives). • Precision is the proportion of the examples assigned to a class that are true members of that class. If SP is spam precision, then (1 – SP) gives the proportion of messages classified as spam that are false positives.

  38. Success measures • True Positive Rate (TPR) is the proportion of correctly classified examples of the ‘positive’ class. Spam is typically taken as the positive class, so TPR is the number of spam classified as spam over the total number of spam. • False Positive Rate (FPR) is the proportion of the ‘negative’ class erroneously assigned to the ‘positive’ class. Ham is typically taken as the negative class, so FPR is the number of ham classified as spam over the total number of ham.
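
The four measures on these two slides follow directly from the counts of a confusion matrix, with spam as the positive class. The sketch below is illustrative; the function name and the example counts are made up and are not results from the experiments.

```python
def success_measures(tp, fn, fp, tn):
    """Spam is the positive class.

    tp: spam classified as spam    fn: spam classified as ham
    fp: ham classified as spam     tn: ham classified as ham
    """
    return {
        "spam_recall": tp / (tp + fn),     # identical to the TPR
        "spam_precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts: 100 spam and 100 ham messages.
measures = success_measures(tp=90, fn=10, fp=5, tn=95)
```

Spam recall and TPR are the same quantity. Note that 1 – SP (ham among the messages labelled spam) is not the same as the FPR (ham labelled spam among all ham), because the two use different denominators.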

  39. Classifier Structure • Training Data • Profiling Method • Profile Representation • Similarity/Comparison Measure • Decision Mechanism or Classification Criterion • Decision: spam or ham?

  40. Classification using a suffix tree • The method of profiling is construction of the tree (no pre-processing, no post-processing). • The tree is a profile of the class. • Similarity measure? • Decision mechanism?

  41. Threshold Variation ~ match normalisation ~ Panels: constant significance function, unnormalised; constant significance function, match normalised. SPE = spam precision error; HPE = ham precision error.

  42. Threshold Variation ~ Significance functions ~ Panels: root function, no normalisation; logit function, no normalisation. SPE = spam precision error; HPE = ham precision error.

  43. Threshold Variation. Constant significance function (unnormalised). SPE = spam precision error; HPE = ham precision error.
