
CS 548 – Project 5

Text Classification

Skyler Whorton – April 26, 2012

Data Set
  • UCI 4-Universities Data Set
    • Cornell, Texas, Washington, Wisconsin plus “misc” sources
    • Computer Science department web pages in 7 categories:
      • Course
      • Department
      • Faculty
      • Project
      • Staff
      • Student
      • “Other” – confounding label
  • WPI Computer Science Department
    • Manually picked test set
    • Each class label covered
Pre-Processing, Objectives
  • HTML webpage → Text document
    • HTML header, tags, special characters
    • Dates, numbers, source-specific words
    • Remove instances with “other” label
  • Training and test data
    • Leave-one-out validation on each of the 4 sources
    • Extra test set: WPI web page collection
  • Objectives & guiding questions
    • Compare single word vs. N-gram tokenization
    • Find words/N-grams predictive of document-type
    • Which pre-processing steps/settings strongly affect classifier performance?
    • Which classifier generalizes best?
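
The HTML-to-text step listed above could look roughly like the sketch below. It is an illustration only, not the author's actual script: the names `TextExtractor` and `clean_page`, the `source_words` parameter, and the choice to lower-case and keep letters only are all assumptions.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <head>, <script>, and <style> content."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            # Older pages often omit </head>; resume collecting at <body>.
            self.skip_depth = 0
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_page(html, source_words=()):
    """Return lower-cased text without tags, digits, or special characters.

    source_words is a hypothetical list of source-specific words to drop
    (for example, university names).
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Keep letters and whitespace only: removes dates, numbers, punctuation.
    text = re.sub(r"[^a-z\s]", " ", text)
    drop = set(source_words)
    return " ".join(w for w in text.split() if w not in drop)
```

Replacing punctuation with spaces splits a possessive such as "professor's" into "professor" and "s", which would explain a stray single-letter token in an N-gram vocabulary.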
Experiment Process

[Flow diagram]
  • Raw HTML pages: Texas, WPI, Cornell, Washington, Wisconsin
  • Cleaned text
    • Testing: WPI
    • Testing: Cornell
    • Training: “Sans-Cornell”
  • Document to word vector
    • WordTokenizer
    • N-GramTokenizer
  • Training vector, test vectors
  • Classifiers
    • J48 Tree
    • NaïveBayes
    • SMO
  • Evaluation
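
The split and vectorization stages of the experiment process can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the original tooling: the function names, the `(source, label, text)` tuple layout, and the N-gram size range are hypothetical, and the vocabulary is built from the training fold only as a design choice.

```python
from collections import Counter


def word_tokens(text):
    """Single-word tokenization on whitespace."""
    return text.split()


def ngram_tokens(text, n_min=1, n_max=3):
    """All word N-grams with n_min <= N <= n_max, joined by spaces."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def leave_one_source_out(docs):
    """docs: list of (source, label, text) tuples.

    Drops "other" instances, then yields (held_out, train, test) with one
    source held out for testing per fold.
    """
    kept = [d for d in docs if d[1] != "other"]
    for held_out in sorted({d[0] for d in kept}):
        train = [d for d in kept if d[0] != held_out]
        test = [d for d in kept if d[0] == held_out]
        yield held_out, train, test


def to_vectors(train, test, tokenize):
    """Build count vectors; the vocabulary comes from the training fold only."""
    vocab = sorted({tok for _, _, text in train for tok in tokenize(text)})
    known = set(vocab)

    def vectorize(text):
        counts = Counter(tok for tok in tokenize(text) if tok in known)
        return [counts[tok] for tok in vocab]

    return (vocab,
            [vectorize(text) for _, _, text in train],
            [vectorize(text) for _, _, text in test])
```

Passing `word_tokens` or `ngram_tokens` as the `tokenize` argument gives the two representations being compared; tokens that appear only in the held-out source are ignored, which is one reason a very different test collection can score poorly.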

Results
  • 4-Universities
    • NaïveBayes: cross-validation
    • J48: test set
    • WPI web pages much different
  • All-Universities
    • Removed “other”
    • SMO: high accuracy, but overfit
  • Predictive words:
    • professor, computer science, ph d, assignments, syllabus, research, technical reports, computer science, I am, student in, groups, s home
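
The slides do not say how the predictive words were identified. One common way to rank tokens is by information gain with respect to the class label; the sketch below assumes that approach purely for illustration, and the function names and the `(label, token_set)` layout are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, token):
    """docs: list of (label, set_of_tokens) pairs.

    Returns the reduction in label entropy from splitting the documents
    on whether they contain the token.
    """
    labels = [label for label, _ in docs]
    with_tok = [label for label, toks in docs if token in toks]
    without = [label for label, toks in docs if token not in toks]
    gain = entropy(labels)
    for part in (with_tok, without):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def rank_tokens(docs):
    """All tokens sorted from most to least informative."""
    vocab = {tok for _, toks in docs for tok in toks}
    return sorted(vocab, key=lambda t: (-information_gain(docs, t), t))
```

A token that occurs in every document of one class and in no others scores highest, while a token spread evenly across classes scores zero.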

classification accuracy