
Document Classification Comparison


Presentation Transcript


  1. Document Classification Comparison Evangel Sarwar, Josh Woolever, Rebecca Zimmerman

  2. Overview • What we did • How we did it • Results • Why does this matter • Conclusions • Questions?

  3. What did we do? • Compared the document classification accuracy of three pieces of software on the 20 Newsgroups data set • Rainbow (Naïve Bayes) • C4.5 (decision tree) • Neural network (back-propagation) • Initially planned to take a single document and locate other documents similar to it
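A minimal sketch of this kind of three-way comparison, using scikit-learn classifiers as stand-ins for Rainbow, C4.5, and the back-propagation network (the original work used those tools directly, so the library, class names, and parameter values below are illustrative assumptions):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    # Top-1000 word counts, mirroring the Rainbow-derived feature files.
    vec = CountVectorizer(max_features=1000, stop_words='english')
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    models = {
        'Naive Bayes (Rainbow analogue)': MultinomialNB(),
        'Decision tree (C4.5 analogue)': DecisionTreeClassifier(),
        'Neural network (back-propagation)': MLPClassifier(hidden_layer_sizes=(4,)),
    }
    for name, model in models.items():
        model.fit(X_train, train.target)
        acc = accuracy_score(test.target, model.predict(X_test))
        print(f'{name}: {acc:.2%}')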

  4. How did we do it? • Used Rainbow as the benchmark • Used it to create a model of the data • Trained and tested it with a common set of data • Used Perl scripts to separate the data into training/testing sets and create input files for C4.5 and the neural network software • Used Rainbow's ability to output word counts for the top N words to create the input files • Initially wanted to use word probabilities, but Rainbow can only produce these for classes, not for single documents
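The Perl scripts themselves are not included in the slides; the Python sketch below shows the same idea of splitting the newsgroup files into training/testing sets and writing rows of top-N word counts (the directory layout, helper names, and file format are assumptions for illustration):

    import os, random, re
    from collections import Counter

    def load_docs(root):
        # Expects one subdirectory per newsgroup, one file per document.
        docs = []
        for group in os.listdir(root):
            for fname in os.listdir(os.path.join(root, group)):
                with open(os.path.join(root, group, fname), errors='ignore') as f:
                    docs.append((group, f.read()))
        return docs

    def split(docs, train_frac=0.8, seed=0):
        # Shuffle once, then cut into training and testing portions.
        random.Random(seed).shuffle(docs)
        cut = int(len(docs) * train_frac)
        return docs[:cut], docs[cut:]

    def top_words(docs, n=1000):
        # The N most frequent words across the training documents.
        counts = Counter()
        for _, text in docs:
            counts.update(re.findall(r'[a-z]+', text.lower()))
        return [w for w, _ in counts.most_common(n)]

    def write_features(docs, vocab, path):
        # One row per document: label followed by a count for each vocabulary word.
        with open(path, 'w') as out:
            for label, text in docs:
                words = Counter(re.findall(r'[a-z]+', text.lower()))
                out.write(','.join([label] + [str(words[w]) for w in vocab]) + '\n')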

  5. How did we do it? (cont.) • Modified the image neural network from a previous assignment so that it would process documents instead of images • Needed 20 output nodes, one for each newsgroup • Took in 1000 word-count inputs (initially, at least) • Started with the default number of hidden nodes (4) and went all the way up to approximately 2000 (2x the number of inputs) • http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
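A rough sketch of the network shape described above, using scikit-learn's MLPClassifier as a stand-in for the modified image network (an assumption; the original was custom code, and the learning rate and momentum values below are placeholders):

    from sklearn.neural_network import MLPClassifier

    # 1000 word-count inputs feed one hidden layer; fitting on the 20
    # newsgroup labels yields 20 output nodes, one per class. The hidden
    # layer size was varied from the default 4 units up to roughly 2000
    # (about 2x the number of inputs).
    net = MLPClassifier(hidden_layer_sizes=(4,),
                        solver='sgd',
                        learning_rate_init=0.01,
                        momentum=0.9,
                        max_iter=100)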

  6. Results • The decision tree software achieved between 15% and 40% accuracy (depending on whether the tree was pruned and whether training or test data was used) • Training set was about 17% after pruning • Test set was about 40% after pruning • The neural network proved to be much more difficult than we first thought • Very slow (on the full training data it took approximately 1 hour per epoch on a 1.2 GHz Linux machine) • Accuracy did not increase over many trials • Spent a great deal of time experimenting with the various parameters • Learning rate, momentum, hidden units • Never got better than about 5% accuracy
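The parameter experiments mentioned above could look roughly like the sweep below, reusing the feature matrices and stand-in network from the earlier sketches (the value grids are assumptions, not the settings actually tried):

    from itertools import product
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Assumes X_train, X_test, train.target, and test.target from the first sketch.
    for lr, mom, hidden in product([0.001, 0.01, 0.1], [0.0, 0.5, 0.9], [4, 100, 2000]):
        net = MLPClassifier(hidden_layer_sizes=(hidden,), solver='sgd',
                            learning_rate_init=lr, momentum=mom, max_iter=50)
        net.fit(X_train, train.target)
        acc = accuracy_score(test.target, net.predict(X_test))
        print(f'lr={lr} momentum={mom} hidden={hidden}: {acc:.2%}')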

  7. Results (cont.) • Rainbow • Approximately 80% accuracy • C4.5 and Rainbow made similar errors: • Misclassified documents within similar groups: • alt.atheism, talk.religion.misc, talk.politics.misc • comp.*
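One way to surface these similar-group confusions is a confusion matrix over the test predictions; the sketch below assumes the Naïve Bayes stand-in and the feature matrices from the first sketch:

    import numpy as np
    from sklearn.metrics import confusion_matrix
    from sklearn.naive_bayes import MultinomialNB

    nb = MultinomialNB().fit(X_train, train.target)
    cm = confusion_matrix(test.target, nb.predict(X_test))
    np.fill_diagonal(cm, 0)                        # ignore correct predictions
    i, j = np.unravel_index(cm.argmax(), cm.shape)
    print('Most confused pair:', test.target_names[i], '->', test.target_names[j])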

  8. Why is text classification important? • Spam detection • General mail filtering into folders • Automatically placing documents at the proper location in a file system

  9. Conclusions • Naïve Bayes empirically seems to be the best for classifying documents • At least for newsgroup data • It still made errors similar to C4.5, which used only word counts • If we had pre-processed the data better, perhaps by removing outliers and normalizing the information, we could have gotten better results with the neural network • Word counts are not enough to “specify” a document; C4.5 seemed to create a tree that did not generalize well to the test data • Neural networks are definitely not “plug and chug”; every application is specific and needs specific parameters • Hard to know how much data to use, or how many features • Most people don’t have 10,000 emails to “train” with • Should investigate a minimum threshold for getting accurate results
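One concrete form the better pre-processing could take is replacing raw word counts with TF-IDF-normalized features before training the network; this is a sketch of one possibility, not something the project actually did:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer

    train = fetch_20newsgroups(subset='train')
    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
    X_train = tfidf.fit_transform(train.data)   # rows are L2-normalized by default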

  10. Fin. • Questions?
