
Document Classification Comparison


Presentation Transcript


  1. Document Classification Comparison Evangel Sarwar, Josh Woolever, Rebecca Zimmerman

  2. Overview • What we did • How we did it • Results • Why does this matter • Conclusions • Questions?

  3. What did we do? • Compared the document classification accuracy of three pieces of software on the 20 Newsgroups data set • Rainbow (Naïve Bayes) • C4.5 (decision tree) • Neural network (back-propagation) • Initially planned to take a single document and locate other documents similar to it
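A minimal sketch of this kind of three-way comparison, using scikit-learn classifiers as stand-ins for Rainbow, C4.5, and the back-propagation network (the original work used those tools directly, so the library, class names, and parameter values below are illustrative assumptions):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    # Top-1000 word counts, mirroring the Rainbow-derived feature files.
    vec = CountVectorizer(max_features=1000, stop_words='english')
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    models = {
        'Naive Bayes (Rainbow analogue)': MultinomialNB(),
        'Decision tree (C4.5 analogue)': DecisionTreeClassifier(),
        'Neural network (back-propagation)': MLPClassifier(hidden_layer_sizes=(4,)),
    }
    for name, model in models.items():
        model.fit(X_train, train.target)
        acc = accuracy_score(test.target, model.predict(X_test))
        print(f'{name}: {acc:.2%}')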

  4. How did we do it? • Used Rainbow as the benchmark • Used it to create a model of the data • Trained and tested it with a common set of data • Used Perl scripts to separate the data into training/testing sets and create input files for C4.5 and the neural network software • Used Rainbow's ability to output word counts for the top N words to create the input files • Initially wanted to use word probabilities, but Rainbow can only produce these for classes, not for single documents
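The Perl scripts themselves are not included in the slides; the Python sketch below shows the same idea of splitting the newsgroup files into training/testing sets and writing rows of top-N word counts (the directory layout, helper names, and file format are assumptions for illustration):

    import os, random, re
    from collections import Counter

    def load_docs(root):
        # Expects one subdirectory per newsgroup, one file per document.
        docs = []
        for group in os.listdir(root):
            for fname in os.listdir(os.path.join(root, group)):
                with open(os.path.join(root, group, fname), errors='ignore') as f:
                    docs.append((group, f.read()))
        return docs

    def split(docs, train_frac=0.8, seed=0):
        # Shuffle once, then cut into training and testing portions.
        random.Random(seed).shuffle(docs)
        cut = int(len(docs) * train_frac)
        return docs[:cut], docs[cut:]

    def top_words(docs, n=1000):
        # The N most frequent words across the training documents.
        counts = Counter()
        for _, text in docs:
            counts.update(re.findall(r'[a-z]+', text.lower()))
        return [w for w, _ in counts.most_common(n)]

    def write_features(docs, vocab, path):
        # One row per document: label followed by a count for each vocabulary word.
        with open(path, 'w') as out:
            for label, text in docs:
                words = Counter(re.findall(r'[a-z]+', text.lower()))
                out.write(','.join([label] + [str(words[w]) for w in vocab]) + '\n')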

  5. How did we do it? (cont.) • Modified the image neural network from a previous assignment so that it would process documents instead of images • Needed 20 output nodes, one for each newsgroup • Took in 1000 word-count inputs (initially, at least) • Started with the default number of hidden nodes (4) and went all the way up to approximately 2000 (2x the number of inputs) • http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
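A rough sketch of the network shape described above, using scikit-learn's MLPClassifier as a stand-in for the modified image network (an assumption; the original was custom code, and the learning rate and momentum values below are placeholders):

    from sklearn.neural_network import MLPClassifier

    # 1000 word-count inputs feed one hidden layer; fitting on the 20
    # newsgroup labels yields 20 output nodes, one per class. The hidden
    # layer size was varied from the default 4 units up to roughly 2000
    # (about 2x the number of inputs).
    net = MLPClassifier(hidden_layer_sizes=(4,),
                        solver='sgd',
                        learning_rate_init=0.01,
                        momentum=0.9,
                        max_iter=100)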

  6. Results • The decision tree software achieved between 15% and 40% accuracy (depending on whether the tree was pruned and whether training or test data was used) • Training set was about 17% after pruning • Test set was about 40% after pruning • The neural network proved to be much more difficult than we first thought • Very slow (on the full training data it took approximately 1 hour per epoch on a 1.2 GHz Linux machine) • Accuracy did not increase over many trials • Spent a great deal of time experimenting with the various parameters • Learning rate, momentum, hidden units • Never got better than about 5% accuracy
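The parameter experiments mentioned above could look roughly like the sweep below, reusing the feature matrices and stand-in network from the earlier sketches (the value grids are assumptions, not the settings actually tried):

    from itertools import product
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Assumes X_train, X_test, train.target, and test.target from the first sketch.
    for lr, mom, hidden in product([0.001, 0.01, 0.1], [0.0, 0.5, 0.9], [4, 100, 2000]):
        net = MLPClassifier(hidden_layer_sizes=(hidden,), solver='sgd',
                            learning_rate_init=lr, momentum=mom, max_iter=50)
        net.fit(X_train, train.target)
        acc = accuracy_score(test.target, net.predict(X_test))
        print(f'lr={lr} momentum={mom} hidden={hidden}: {acc:.2%}')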

  7. Results (cont.) • Rainbow • Approximately 80% accuracy • C4.5 and Rainbow made similar errors: • Misclassified documents within similar groups: • alt.atheism, talk.religion.misc, talk.politics.misc • comp.*
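One way to surface these similar-group confusions is a confusion matrix over the test predictions; the sketch below assumes the Naïve Bayes stand-in and the feature matrices from the first sketch:

    import numpy as np
    from sklearn.metrics import confusion_matrix
    from sklearn.naive_bayes import MultinomialNB

    nb = MultinomialNB().fit(X_train, train.target)
    cm = confusion_matrix(test.target, nb.predict(X_test))
    np.fill_diagonal(cm, 0)                        # ignore correct predictions
    i, j = np.unravel_index(cm.argmax(), cm.shape)
    print('Most confused pair:', test.target_names[i], '->', test.target_names[j])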

  8. Why is text classification important? • Spam detection • General mail filtering into folders • Automatically placing documents at the proper location in a file system

  9. Conclusions • Naïve Bayes empirically seems to be the best for classifying documents • At least for newsgroup data • It still made errors similar to C4.5, which used only word counts • If we had pre-processed the data better, perhaps by removing outliers and normalizing the information, we could have gotten better results with the neural network • Word counts are not enough to “specify” a document; C4.5 seemed to create a tree that did not generalize well to the test data • Neural networks are definitely not “plug and chug”; every application is specific and needs specific parameters • Hard to know how much data to use, or how many features • Most people don’t have 10,000 emails to “train” with • Should investigate a minimum threshold for getting accurate results
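One concrete form the better pre-processing could take is replacing raw word counts with TF-IDF-normalized features before training the network; this is a sketch of one possibility, not something the project actually did:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer

    train = fetch_20newsgroups(subset='train')
    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
    X_train = tfidf.fit_transform(train.data)   # rows are L2-normalized by default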

  10. Fin. • Questions?
