
CS 548 – Project 5

CS 548 – Project 5. Text Classification. Skyler Whorton – April 26, 2012. Data Set. UCI 4-Universities Data Set Cornell, Texas, Washington, Wisconsin plus “ misc ” sources Computer Science department web pages in 7 categories: Course Department Faculty Project Staff Student


Presentation Transcript


  1. CS 548 – Project 5: Text Classification. Skyler Whorton, April 26, 2012

  2. Data Set
     • UCI 4-Universities Data Set
       • Cornell, Texas, Washington, Wisconsin plus “misc” sources
       • Computer Science department web pages in 7 categories:
         • Course
         • Department
         • Faculty
         • Project
         • Staff
         • Student
         • “Other” – confounding label
     • WPI Computer Science Department
       • Manually picked test set
       • Each class label covered

  3. Pre-Processing, Objectives
     • HTML webpage → Text document
       • Remove HTML header, tags, special characters
       • Remove dates, numbers, source-specific words
     • Remove instances with “other” label
     • Training and test data
       • Leave-one-out validation on each of the 4 sources (hold out one university for testing, train on the rest)
       • Extra test set: WPI web page collection
     • Objectives & guiding questions
       • Compare single-word vs. N-gram tokenization
       • Find words/N-grams predictive of document type
       • Which pre-processing steps and settings strongly affect classifier performance?
       • Which classifier generalizes best?
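The pre-processing and hold-out scheme on this slide could look roughly like the following. This is a minimal Python sketch, not the presenter's actual pipeline: the names `clean_html`, `leave_one_source_out`, and the `source_words` parameter are hypothetical, and the cleaning rules (lowercasing, dropping everything except letters) are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(html, source_words=()):
    """Turn an HTML page into a lowercase, letters-only text document.

    source_words is a hypothetical list of source-specific words
    (for example a university name) to drop from the text.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Assumption: digits, dates and special characters all become spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    stop = {w.lower() for w in source_words}
    return " ".join(w for w in text.split() if w not in stop)


def leave_one_source_out(pages):
    """Yield (held_out_source, train, test) for each source in turn.

    pages is a list of (source, label, text) tuples; instances labelled
    "other" are removed before splitting.
    """
    kept = [p for p in pages if p[1] != "other"]
    for held_out in sorted({p[0] for p in kept}):
        train = [p for p in kept if p[0] != held_out]
        test = [p for p in kept if p[0] == held_out]
        yield held_out, train, test
```

Each fold trains on three universities and tests on the fourth, so a classifier cannot score well just by memorizing one site's vocabulary.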

  4. Experiment Process (flow diagram)
     • Raw HTML pages: Cornell, Texas, Washington, Wisconsin, WPI
     • Cleaned text: Training “Sans-Cornell”, Testing Cornell, Testing WPI
     • Document to word vector: WordTokenizer, N-GramTokenizer
     • Training Vector, Test Vectors
     • Classifiers: J48 Tree, NaïveBayes, SMO
     • Evaluation
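The "document to word vector" step, with its choice between single-word and N-gram tokens, can be illustrated as below. This is a plain-Python sketch under stated assumptions, not the tool used in the project: `word_tokens`, `ngram_tokens`, and `to_count_vector` are hypothetical helper names, and the N-gram size range is picked only for the example.

```python
from collections import Counter


def word_tokens(text):
    """Single-word tokenization: split on whitespace."""
    return text.split()


def ngram_tokens(text, n_min=1, n_max=3):
    """Word N-grams of every length from n_min to n_max, shortest first."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def to_count_vector(tokens, vocabulary):
    """Map a token list to a count vector ordered like `vocabulary`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    doc = "technical reports computer science"
    vocab = ["computer", "computer science", "technical reports"]
    print(to_count_vector(word_tokens(doc), vocab))
    print(to_count_vector(ngram_tokens(doc), vocab))
```

With single-word tokens, a phrase such as "technical reports" can only be represented by its separate words, while N-gram tokens let the phrase itself become a feature. The cost is a much larger vocabulary.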

  5. Results
     • 4-Universities
       • NaïveBayes: cross-validation
       • J48: test set
       • WPI web pages much different
     • All-Universities
       • Removed “other”
       • SMO: high accuracy, but overfit
     • Predictive words:
       • professor, computer science, ph d, assignments, syllabus, research, technical reports, computer science, I am, student in, groups, s home
     • [Chart: classification accuracy]
