
Naïve Bayes Classification


Presentation Transcript


  1. Naïve Bayes Classification Christina Wallin Computer Systems Research Lab 2008-2009

  2. Goal • create a naïve Bayes classifier using the 20 Newsgroups dataset • compare the effectiveness of different implementations of this method

  3. What is Naïve Bayes? • Bayes’ Theorem • A classification method based on an independence assumption between words • “Mars Rover” • Machine learning
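For reference, this is Bayes’ theorem together with the “naïve” conditional-independence assumption the method relies on (a standard formulation written in LaTeX, not taken verbatim from the slides; w_1…w_n are the words of a document and c is a candidate class):

    \[
    P(c \mid w_1, \dots, w_n)
      = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
      \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)
    \]

The product form on the right is what the independence assumption buys: each word's class-conditional probability can be estimated separately.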

  4. Program Overview • Python with NLTK (Natural Language Toolkit) • file.py • train.py • test.py

  5. Procedures: file.py • Parses a file and makes a dictionary of all of the words present and their frequency • Stemming words • Accounting for length
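A minimal sketch of the parsing step file.py performs, assuming NLTK’s PorterStemmer for the stemming the slide mentions (the function name parse_file is an illustrative assumption, not the project's actual API):

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer

    def parse_file(path):
        # Build a dictionary of stemmed word -> frequency for one file.
        stemmer = PorterStemmer()
        with open(path) as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        return Counter(stemmer.stem(w) for w in words)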

  6. Procedures: train.py • Trains the classifier on which words occur more frequently in each class • Makes a PFX vector: the probability that each word occurs in the class • Multivariate or multinomial model • Stopwords
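A sketch of how the per-class PFX vector could be computed for the multinomial model with add-one smoothing (train_multinomial and its arguments are hypothetical names for illustration):

    from collections import Counter

    def train_multinomial(class_files, vocab):
        # class_files: list of word -> frequency dicts, one per training file.
        counts = Counter()
        for freqs in class_files:
            counts.update(freqs)
        total = sum(counts.values())
        # P(w | class) with Laplace (add-one) smoothing over the vocabulary.
        return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}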

  7. Procedures: Multivariate v. Multinomial • Multivariate: P(w) = (number of files in the class containing w + 1) / (number of files in the class + vocabulary size) • Multinomial: P(w) = (frequency of w in the class + 1) / (number of words in the class + vocabulary size)

  8. Example • File 1: Computer, Science, AI, Science • File 2: AI, Computer, Learning, Parallel • Vocabulary: Computer, Science, AI, Learning, Parallel (5 distinct words, 8 words in total) • Multivariate for Computer: (2+1)/(2+5) = 3/7 • Multinomial for Computer: (2+1)/(8+5) = 3/13 • Multivariate for Parallel: (1+1)/(2+5) = 2/7 • Multinomial for Parallel: (1+1)/(8+5) = 2/13 • Multivariate for Science: (1+1)/(2+5) = 2/7 • Multinomial for Science: (2+1)/(8+5) = 3/13
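The example’s numbers can be checked mechanically; a small sketch using exact fractions (all names are illustrative):

    from fractions import Fraction

    files = [["Computer", "Science", "AI", "Science"],
             ["AI", "Computer", "Learning", "Parallel"]]
    vocab = {w for f in files for w in f}      # 5 distinct words
    n_words = sum(len(f) for f in files)       # 8 words in total

    def multivariate(w):
        # (files containing w + 1) / (files in class + vocabulary size)
        return Fraction(sum(w in f for f in files) + 1, len(files) + len(vocab))

    def multinomial(w):
        # (frequency of w + 1) / (words in class + vocabulary size)
        return Fraction(sum(f.count(w) for f in files) + 1, n_words + len(vocab))

    print(multivariate("Computer"), multinomial("Computer"))  # 3/7 3/13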

  9. Procedures: test.py • Using the PFX generated by train.py, go through the test cases and compare the words in them to those in each class as a whole • Use a log sum to compute the probability, because multiplying many small probabilities would cause floating-point underflow
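A sketch of the log-sum comparison test.py performs, assuming per-class PFX dictionaries like those built above (classify and class_pfx are assumed names):

    import math

    def classify(word_freqs, class_pfx):
        # class_pfx: class name -> {word: P(word | class)}.
        # Summing logs avoids the underflow that multiplying
        # many small probabilities would cause.
        def log_score(pfx):
            return sum(freq * math.log(pfx[w])
                       for w, freq in word_freqs.items() if w in pfx)
        return max(class_pfx, key=lambda c: log_score(class_pfx[c]))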

  10. Testing • Generated text files from a known word-occurrence distribution • Compared the initial, programmed-in probabilities to the PFX that train.py generated • Also used the generated files to test text classification • Wrote a script for quicker testing
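One plausible shape for such a generator, drawing words from a known distribution so the learned PFX can be compared against it (the distribution, file name, and word count are made-up placeholders):

    import random

    def generate_file(path, dist, n_words=100):
        # Draw n_words according to dist, a word -> probability dict.
        words = random.choices(list(dist), weights=list(dist.values()), k=n_words)
        with open(path, "w") as f:
            f.write(" ".join(words))

    generate_file("sample0.txt", {"computer": 0.5, "science": 0.3, "ai": 0.2})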

  11. Results: Effect of stemming

  12. Results: Multivariate v. Multinomial

  13. Results: Accounting for Length

  14. Results: Stopwords

  15. Conclusions • Effect of optimizations • Questions?
