
Text Classification using SVM-light


Presentation Transcript


  1. Text Classification using SVM-light DSSI 2008 Jing Jiang

  2. Text Classification
  • Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories
  • Examples
    • To classify news articles into “business” and “sports”
    • To classify Web pages into personal home pages and others
    • To classify product reviews into positive reviews and negative reviews
  • Approach: supervised machine learning
    • For each predefined category, we need a set of training documents known to belong to the category.
    • From the training documents, we train a classifier.

  3. Overview
  • Step 1: text pre-processing
    • to pre-process text and represent each document as a feature vector
  • Step 2: training
    • to train a classifier using a classification tool (e.g. SNoW, SVM-light)
  • Step 3: classification
    • to apply the classifier to new documents

  4. Pre-processing: tokenization
  • Goal: to separate text into individual words
  • Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
  • Tool: Word Splitter http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=WS
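  A minimal C++ sketch of what a tokenizer does; this is not the Word Splitter tool itself, and the rule of starting a new token at an apostrophe is only an assumption chosen to mimic the “we ’re” split in the example above.

  #include <cctype>
  #include <iostream>
  #include <string>
  #include <vector>

  // Lowercase the text and split it on whitespace/punctuation; an apostrophe
  // starts a new token, so "We're" becomes "we" and "'re".
  std::vector<std::string> tokenize(const std::string &text) {
      std::vector<std::string> tokens;
      std::string current;
      for (char c : text) {
          if (c == '\'') {                                   // begin a new token at the apostrophe
              if (!current.empty()) { tokens.push_back(current); current.clear(); }
              current += c;
          } else if (std::isalnum(static_cast<unsigned char>(c))) {
              current += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
          } else if (!current.empty()) {                     // whitespace or punctuation ends a token
              tokens.push_back(current);
              current.clear();
          }
      }
      if (!current.empty()) tokens.push_back(current);
      return tokens;
  }

  int main() {
      for (const std::string &t : tokenize("We're attending a tutorial now."))
          std::cout << t << std::endl;                       // we 're attending a tutorial now
  }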

  5. Pre-processing: stop word removal (optional)
  • Goal: to remove common words that are usually not useful for text classification
  • Example: to remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
  • Stop word list: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
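  A sketch of the filtering step in C++; the tiny inline stop list is only for illustration, and in practice the set would be loaded from a stop word file such as the one linked above.

  #include <string>
  #include <unordered_set>
  #include <vector>

  // Return only the tokens that are not in the stop word set.
  std::vector<std::string> removeStopWords(const std::vector<std::string> &tokens) {
      static const std::unordered_set<std::string> stopWords = {
          "a", "the", "i", "he", "she", "is", "are"
      };
      std::vector<std::string> kept;
      for (const std::string &t : tokens)
          if (stopWords.find(t) == stopWords.end())
              kept.push_back(t);
      return kept;
  }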

  6. Pre-processing: stemming (optional)
  • Goal: to normalize words derived from the same root
  • Examples
    • attending → attend
    • teacher → teach
  • Tool: Porter stemmer http://tartarus.org/~martin/PorterStemmer/
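  A crude illustration of the idea (strip a few common suffixes); this is not the Porter algorithm, which applies a much more careful sequence of rules, so for real use one of the Porter stemmer implementations linked above should be used instead.

  #include <string>
  #include <vector>

  // Strip one of a handful of common suffixes, keeping a minimal stem length.
  std::string crudeStem(const std::string &word) {
      static const std::vector<std::string> suffixes = {"ing", "er", "ed", "s"};
      for (const std::string &suf : suffixes) {
          if (word.size() > suf.size() + 2 &&
              word.compare(word.size() - suf.size(), suf.size(), suf) == 0)
              return word.substr(0, word.size() - suf.size());
      }
      return word;
  }
  // crudeStem("attending") -> "attend", crudeStem("teacher") -> "teach"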

  7. Pre-processing: feature extraction
  • Unigram features: to use each word as a feature
    • To use TF (term frequency) as the feature value
    • To use TF*IDF (TF times IDF, the inverse document frequency) as the feature value
      • IDF = log (total-number-of-documents / number-of-documents-containing-t)
  • Bigram features: to use two consecutive words as a feature
  • Tool
    • Write your own program/script
    • Lemur API
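  A C++ sketch of computing unigram TF and TF*IDF values with the IDF formula above; it assumes the document whose features are being computed is itself one of the documents in the corpus, so every term has a document frequency of at least one.

  #include <cmath>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  using Document = std::vector<std::string>;   // a document as a list of (pre-processed) tokens

  // Feature vector for one document: term -> TF * log(N / DF).
  std::map<std::string, double> tfidfFeatures(const Document &doc,
                                              const std::vector<Document> &corpus) {
      // term frequency (TF) within this document
      std::map<std::string, double> tf;
      for (const std::string &t : doc) tf[t] += 1.0;

      // document frequency (DF): how many documents contain each term
      std::map<std::string, int> df;
      for (const Document &d : corpus) {
          std::set<std::string> unique(d.begin(), d.end());
          for (const std::string &t : unique) df[t] += 1;
      }

      // combine: feature value = TF * IDF
      std::map<std::string, double> features;
      const double n = static_cast<double>(corpus.size());
      for (const auto &p : tf)
          features[p.first] = p.second * std::log(n / df[p.first]);
      return features;
  }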

  8. Using Lemur to Extract Unigram Features

  // open an existing Lemur index and iterate over the terms of one document
  Index *ind = IndexManager::openIndex("index-file.key");
  int d1 = 1;                                   // internal id of the document to inspect
  TermInfoList *tList = ind->termInfoList(d1);  // terms occurring in document d1
  tList->startIteration();
  while (tList->hasMore()) {
    TermInfo *entry = tList->nextEntry();
    cout << "entry term id: " << entry->termID() << endl;
    cout << "entry term count: " << entry->termCount() << endl;
  }
  delete tList;
  delete ind;
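  A possible next step, sketched here as an assumption rather than part of the original slide: collect the term id/count pairs from the loop above into a std::map (so the ids come out sorted, since SVM-light expects feature numbers in increasing order) and write them out as one line of SVM-light input.

  #include <iostream>
  #include <map>

  // Write "<label> <featureId>:<value> ..." for one document.
  void writeSvmLightLine(int label, const std::map<int, double> &features,
                         std::ostream &out) {
      out << label;
      for (const auto &p : features)           // std::map iterates in increasing key order
          out << " " << p.first << ":" << p.second;
      out << "\n";
  }

  // Inside the Lemur loop above, instead of printing one could collect
  //     features[entry->termID()] = entry->termCount();
  // and afterwards call writeSvmLightLine(+1, features, trainFile);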

  9. SVM (Support Vector Machines)
  • A learning algorithm for classification
  • General for any classification problem (text classification as one example)
  • Binary classification
  • Maximizes the margin between the two different classes
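  For reference, the standard soft-margin formulation behind this idea (written in LaTeX; x_i is the feature vector of document i and y_i its label, +1 or -1):

  \min_{w,\,b,\,\xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
  \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .

  When all slack variables \xi_i are zero, the constraints put every training point on the correct side of the separating hyperplane with margin 2 / \|w\|, so minimizing \|w\| maximizes that margin; C trades off margin size against training errors and is exposed in SVM-light as the -c option shown on slide 12.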

  10. picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf

  11. SVM-light
  • A command line C program that implements the SVM learning algorithm
  • Classification, regression, ranking
  • Download at http://svmlight.joachims.org/ (documentation on the same page)
  • Two programs
    • svm_learn for training
    • svm_classify for classification

  12. SVM-light Examples
  • Input format (one example per line: the label, then feature:value pairs)
      1 1:0.5 3:1 5:0.4
      -1 2:0.9 3:0.1 4:2
  • To train a classifier from train.data
      svm_learn train.data train.model
  • To classify new documents in test.data
      svm_classify test.data train.model test.result
  • Output format
    • Positive score → positive class
    • Negative score → negative class
    • Absolute value of the score indicates confidence
  • Command line options
    • -c: a tradeoff parameter (use cross validation to tune)
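  A small C++ sketch of reading the scores back; it assumes, as the output-format bullets above describe, that test.result contains one real-valued score per test example, in the same order as the examples in test.data.

  #include <fstream>
  #include <iostream>

  int main() {
      std::ifstream results("test.result");
      double score;
      int lineNo = 0;
      while (results >> score) {
          int predicted = (score >= 0.0) ? +1 : -1;   // sign of the score gives the class
          std::cout << "example " << ++lineNo << ": class " << predicted
                    << " (score " << score << ")" << std::endl;
      }
  }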

  13. More on SVM-light
  • Kernel
    • Use the “-t” option
    • Polynomial kernel
    • User-defined kernel
  • Semi-supervised learning (transductive SVM)
    • Use “0” as the label for unlabeled examples
    • Very slow
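  As an example of the kernel option (the exact flags should be checked against the documentation on the SVM-light page), a polynomial kernel is selected with -t 1 and its degree with -d, e.g.:

      svm_learn -t 1 -d 2 train.data train.model

  For the transductive mode, the unlabeled examples are simply added to train.data with label 0 and svm_learn is run as usual.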
