
Document Classification using the Natural Language Toolkit




Presentation Transcript


  1. Document Classification using the Natural Language Toolkit Ben Healey http://benhealey.info @BenHealey

  2. Source: IStockPhoto

  3. http://upload.wikimedia.org/wikipedia/commons/b/b6/FileStack_retouched.jpg The Need for Automation

  4. Take ur pick! http://upload.wikimedia.org/wikipedia/commons/d/d6/Cat_loves_sweets.jpg

  5. The classification pipeline (diagram):
  • Each document in the Development Set is reduced to Features (# Words, % ALLCAPS, Unigrams, Sender, and so on) plus a known Class.
  • The Classification Algo learns from the Development Set to produce a Trained Classifier (Model).
  • A New Document (Class Unknown) is reduced to the same Document Features and passed through the Trained Classifier (Model), yielding a Classified Document.

  6. Relevant NLTK Modules
  • Feature Extraction
    • from nltk.corpus import words, stopwords
    • from nltk.stem import PorterStemmer
    • from nltk.tokenize import WordPunctTokenizer
    • from nltk.collocations import BigramCollocationFinder
    • from nltk.metrics import BigramAssocMeasures
    • See http://text-processing.com/demo/ for examples
  • Machine Learning Algos and Tools
    • from nltk.classify import NaiveBayesClassifier
    • from nltk.classify import DecisionTreeClassifier
    • from nltk.classify import MaxentClassifier
    • from nltk.classify import WekaClassifier
    • from nltk.classify.util import accuracy
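  For orientation, a minimal sketch of the feature-extraction modules above working together (not from the deck; the sample sentence is invented, and the stopwords corpus must already be downloaded via nltk.download):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import WordPunctTokenizer
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    text = "The quick brown fox jumps over the lazy dog because the lazy dog sleeps"
    tokens = WordPunctTokenizer().tokenize(text.lower())

    # Stem each token and drop stopwords
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens if t not in stopset]
    print(stems)

    # Score bigrams by chi-squared association and keep the top 3
    finder = BigramCollocationFinder.from_words(tokens)
    print(finder.nbest(BigramAssocMeasures.chi_sq, 3))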

  7. NaiveBayesClassifier http://61.153.44.88/nltk/0.9.5/api/nltk.classify.naivebayes-module.html
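  As a quick illustration of the NaiveBayesClassifier API (a toy sketch; the labels and features are made up, not the Enron model):

    from nltk.classify import NaiveBayesClassifier

    # Each training example is a (feature-dict, label) pair
    train_set = [
        ({'contains(server)': True, 'message_length': 'Short'}, 'IT'),
        ({'contains(lunch)': True, 'message_length': 'Short'}, 'Social'),
        ({'contains(contract)': True, 'message_length': 'Long'}, 'Deal'),
    ]
    classifier = NaiveBayesClassifier.train(train_set)
    print(classifier.classify({'contains(server)': True}))  # -> 'IT'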

  8. http://www.educationnews.org/commentaries/opinions_on_education/91117.html

  9. 517,431 Emails Source: IStockPhoto

  10. Prep: Extract and Load
  • Sample* of 20,581 plaintext files
  • import MySQLdb, os, random, string
  • → MySQL via Python ODBC interface
  • File, string manipulation
  • Key fields separated out
  • To, From, CC, Subject, Body
  * Folders for 7 users with a large number of email. So not representative!
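  A rough sketch of that load step (the credentials, table name, column layout, and flat directory here are all hypothetical; the deck's actual code is at benhealey.info):

    import MySQLdb, os

    conn = MySQLdb.connect(host='localhost', user='nltk', passwd='secret', db='enron')
    cur = conn.cursor()

    maildir = '/data/enron/maildir'  # hypothetical path to the plaintext files
    for name in os.listdir(maildir):
        raw = open(os.path.join(maildir, name)).read()
        headers, _, body = raw.partition('\n\n')  # naive header/body split
        cur.execute("INSERT INTO messages (headers, msg_body) VALUES (%s, %s)",
                    (headers, body))
    conn.commit()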

  11. Prep: Extract and Load
  • Allocation of random number
  • Some feature extraction (a sketch of these ratio features follows)
  • #To, #CCd, #Words, %digits, %CAPS
  • Note: more cleaning could be done
  • Code at benhealey.info
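  The surface statistics named above might be computed along these lines (a sketch; the function and field names are assumptions, not the deck's code):

    def surface_stats(body, to_line, cc_line):
        # Counts and ratios used as features: #To, #CCd, #Words, %digits, %CAPS
        words = body.split()
        chars = [c for c in body if not c.isspace()]
        letters = [c for c in chars if c.isalpha()]
        return {
            'num_to': len(to_line.split(',')) if to_line else 0,
            'num_ccd': len(cc_line.split(',')) if cc_line else 0,
            'num_words_in_body': len(words),
            'pct_digits': 100.0 * sum(c.isdigit() for c in chars) / max(len(chars), 1),
            'pct_caps': 100.0 * sum(c.isupper() for c in letters) / max(len(letters), 1),
        }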

  12. From: james.steffes@enron.com
  To: louise.kitchen@enron.com
  Subject: Re: Agenda for FERC Meeting RE: EOL
  Louise -- We had decided that not having Mark in the room gave us the ability to wiggle if questions on CFTC vs. FERC regulation arose. As you can imagine, FERC is starting to grapple with the issue that financial trades in energy commodities is regulated under the CEA, not the Federal Power Act or the Natural Gas Act. Thanks, Jim

  13. From: pete.davis@enron.com
  To: pete.davis@enron.com
  Subject: Start Date: 1/11/02; HourAhead hour: 5;
  Start Date: 1/11/02; HourAhead hour: 5; No ancillary schedules awarded. No variances detected.
  LOG MESSAGES: PARSING FILE -->> O:\Portland\WestDesk\California Scheduling\ISO Final Schedules\2002011105.txt

  14. Class[es] assigned for 1,000 randomly selected messages:

  15. Prep: Show us ur Features
  • NLTK toolset
    • from nltk.corpus import words, stopwords
    • from nltk.stem import PorterStemmer
    • from nltk.tokenize import WordPunctTokenizer
    • from nltk.collocations import BigramCollocationFinder
    • from nltk.metrics import BigramAssocMeasures
  • Custom code
    • def extract_features(record,stemmer,stopset,tokenizer): … (setup sketch below)
  • Code at benhealey.info
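  The three helper objects passed into extract_features would plausibly be built like this (a sketch consistent with the imports above, not the deck's exact code):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import WordPunctTokenizer

    stemmer = PorterStemmer()                  # reduces e.g. 'trading' -> 'trade'
    stopset = set(stopwords.words('english'))  # common words to ignore
    tokenizer = WordPunctTokenizer()           # splits on word/punctuation boundaries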

  16. Prep: Show us ur Features
  • Features in boolean or nominal form
    if record['num_words_in_body']<=20:
        features['message_length']='Very Short'
    elif record['num_words_in_body']<=80:
        features['message_length']='Short'
    elif record['num_words_in_body']<=300:
        features['message_length']='Medium'
    else:
        features['message_length']='Long'

  17. Prep: Show us ur Features
  • Features in boolean or nominal form
    text=record['msg_subject']+" "+record['msg_body']
    tokens = tokenizer.tokenize(text)
    words = [stemmer.stem(x.lower()) for x in tokens if x not in stopset and len(x) > 1]
    for word in words:
        features[word]=True
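  Putting slides 15-17 together, the full extract_features could plausibly be assembled as follows (a sketch; only the two fragments above are from the deck):

    def extract_features(record, stemmer, stopset, tokenizer):
        features = {}
        # Nominal message-length feature (slide 16)
        if record['num_words_in_body'] <= 20:
            features['message_length'] = 'Very Short'
        elif record['num_words_in_body'] <= 80:
            features['message_length'] = 'Short'
        elif record['num_words_in_body'] <= 300:
            features['message_length'] = 'Medium'
        else:
            features['message_length'] = 'Long'
        # Boolean bag-of-stems features (slide 17)
        text = record['msg_subject'] + " " + record['msg_body']
        tokens = tokenizer.tokenize(text)
        words = [stemmer.stem(x.lower()) for x in tokens
                 if x not in stopset and len(x) > 1]
        for word in words:
            features[word] = True
        return features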

  18. Sit. Say. Heel.
    random.shuffle(dev_set)
    cutoff = len(dev_set)*2/3
    train_set=dev_set[:cutoff]
    test_set=dev_set[cutoff:]
    classifier = NaiveBayesClassifier.train(train_set)
    print 'accuracy for > ',subject,':', accuracy(classifier, test_set)
    classifier.show_most_informative_features(10)
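  Once trained, scoring a new message follows the pipeline from slide 5 (a sketch; new_record is a hypothetical unlabeled email):

    # Reduce the unknown document to the same feature form, then classify
    new_features = extract_features(new_record, stemmer, stopset, tokenizer)
    print('predicted class: %s' % classifier.classify(new_features))

    # NaiveBayesClassifier can also report per-class probabilities
    dist = classifier.prob_classify(new_features)
    print('P(IT) = %.3f' % dist.prob('IT'))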

  19. Most Important Features

  20. Most Important Features

  21. Most Important Features

  22. Performance: ‘IT’ Model IMPORTANT: These are ‘cheat’ scores!

  23. Performance: ‘Deal’ Model IMPORTANT: These are ‘cheat’ scores!

  24. Performance: ‘Social’ Model IMPORTANT: These are ‘cheat’ scores!

  25. Don’t get burned.
  • Biased samples
  • Accuracy and rare events (see the precision/recall sketch below)
  • Features and prior knowledge
  • Good modelling is iterative!
  • Resampling and robustness
  • Learning cycles
  http://www.ugo.com/movies/mustafa-in-austin-powers
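  On the "accuracy and rare events" point: when a class is rare, overall accuracy can look high while the model misses nearly every positive, so per-class precision and recall are more honest. A sketch using nltk.metrics (the refsets/testsets idiom follows the StreamHacker blog cited below; 'IT' stands in for any rare class):

    import collections
    from nltk.metrics import precision, recall

    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_set):
        refsets[label].add(i)                        # ground truth
        testsets[classifier.classify(feats)].add(i)  # model's guess

    print('IT precision: %s' % precision(refsets['IT'], testsets['IT']))
    print('IT recall:    %s' % recall(refsets['IT'], testsets['IT']))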

  26. Resources
  • NLTK:
    • www.nltk.org/
    • http://www.nltk.org/book
  • Enron email datasets:
    • http://www.cs.umass.edu/~ronb/enron_dataset.html
  • Free online Machine Learning course from Stanford
    • http://ml-class.com/ (starts in October)
  • StreamHacker blog by Jacob Perkins
    • http://streamhacker.com
