Sparsity Analysis of Term Weighting Schemes and Application to Text Classification


Presentation Transcript


  1. Sparsity Analysis of Term Weighting Schemes and Application to Text Classification Nataša Milić-Frayling,¹ Dunja Mladenić,² Janez Brank,² Marko Grobelnik². ¹ Microsoft Research, Cambridge, UK; ² Jožef Stefan Institute, Ljubljana, Slovenia

  2. Introduction • Feature selection in the context of text categorization • Comparing different feature ranking schemes • Characterizing feature rankings based on their sparsity behavior • Sparsity is defined as the average number of different words per document (after feature selection has removed some words)

  3. Feature Weighting Schemes • Odds ratio: OR(t) = log[odds(t|c) / odds(t|¬c)] • Information gain: IG(t; c) = entropy(c) - entropy(c|t) • Chi-square statistic: χ²(t) = N (N_tc N_¬t¬c - N_t¬c N_¬tc)² / [N_c N_¬c N_t N_¬t], where N = number of all documents, N_tc = number of documents from class c containing term t, and analogously for the other counts; the numerator equals 0 if t and c are independent. • Robertson-Sparck-Jones weighting: RSJ(t) = log[(N_tc + 0.5)(N_¬t¬c + 0.5) / ((N_t¬c + 0.5)(N_¬tc + 0.5))] (very similar to odds ratio)
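As a concrete illustration, here is a minimal Python sketch (not the authors' code) that computes the four count-based scores above from the 2x2 document counts for one term t and one category c; the function name term_scores and the toy counts are assumptions for illustration only.

```python
import math

def term_scores(n_t_pos, n_t_neg, n_not_t_pos, n_not_t_neg):
    """Scores for one term t and category c from the 2x2 document counts:
    n_t_pos     -- documents in c that contain t
    n_t_neg     -- documents not in c that contain t
    n_not_t_pos -- documents in c without t
    n_not_t_neg -- documents not in c without t
    """
    N = n_t_pos + n_t_neg + n_not_t_pos + n_not_t_neg
    n_pos, n_neg = n_t_pos + n_not_t_pos, n_t_neg + n_not_t_neg   # class sizes
    n_t, n_not_t = n_t_pos + n_t_neg, n_not_t_pos + n_not_t_neg   # document frequency of t
    eps = 1e-9                                                    # guards against division by zero

    # Odds ratio: log[ odds(t|c) / odds(t|not c) ]
    OR = math.log(((n_t_pos + eps) / (n_not_t_pos + eps)) /
                  ((n_t_neg + eps) / (n_not_t_neg + eps)))

    # Information gain: entropy(c) - entropy(c|t)
    def H(counts):
        total = sum(counts)
        return -sum((x / total) * math.log(x / total) for x in counts if x > 0)
    IG = H([n_pos, n_neg]) - (n_t / N) * H([n_t_pos, n_t_neg]) \
                           - (n_not_t / N) * H([n_not_t_pos, n_not_t_neg])

    # Chi-square statistic; the numerator is 0 when t and c are independent
    chi2 = N * (n_t_pos * n_not_t_neg - n_t_neg * n_not_t_pos) ** 2 \
             / (n_pos * n_neg * n_t * n_not_t + eps)

    # Robertson / Sparck-Jones weight (a 0.5-smoothed variant of the odds ratio)
    RSJ = math.log((n_t_pos + 0.5) * (n_not_t_neg + 0.5) /
                   ((n_t_neg + 0.5) * (n_not_t_pos + 0.5)))

    return {"OR": OR, "IG": IG, "chi2": chi2, "RSJ": RSJ}

# Example: a term appearing in 30 of 100 positive documents and 10 of 900 negative ones
print(term_scores(n_t_pos=30, n_t_neg=10, n_not_t_pos=70, n_not_t_neg=890))
```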

  4. Feature Weighting Schemes • Weights based on word frequency: DF = document frequency (number of documents containing the word; this ranking uses the most common words first); IDF = inverse document frequency (uses the least common words first)
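A short sketch of the two frequency-based rankings on a toy corpus; the documents, represented here simply as sets of terms, are illustrative only.

```python
import math

docs = [{"cat", "sat", "mat"}, {"cat", "dog"}, {"dog", "ran"}]   # toy corpus: sets of terms
N = len(docs)
vocab = set().union(*docs)

df  = {t: sum(t in d for d in docs) for t in vocab}              # document frequency
idf = {t: math.log(N / df[t]) for t in vocab}                    # inverse document frequency

df_ranking  = sorted(vocab, key=lambda t: -df[t])    # most common terms first
idf_ranking = sorted(vocab, key=lambda t: -idf[t])   # least common terms first
print(df_ranking, idf_ranking)
```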

  5. Feature Weighting Schemes • Weights based on a linear classifier (w, b): prediction(d) = sgn[b + Σ_i w_i TF(t_i, d)], summing over all terms t_i • If a weight w_i is close to 0, the term t_i has little influence on the predictions. • If it is not important for predictions, it is probably not important for learning either. • Thus, use |w_i| as the score of the term t_i. • We use linear models trained using SVM and the perceptron. • It might be practical to train the model on only a subset of the full training set (e.g. ½ or ¼ of it).
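A hedged sketch of the weight-based ranking: it assumes scikit-learn's TfidfVectorizer and LinearSVC rather than the authors' own SVM and perceptron implementations, and the toy texts and labels are purely illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts  = ["cheap oil exports", "stock market rally", "oil prices rise", "bond yields fall"]
labels = [1, 0, 1, 0]                     # binary: does the document belong to category c?

vec = TfidfVectorizer()
X = vec.fit_transform(texts)              # sparse document vectors
clf = LinearSVC().fit(X, labels)          # linear model (w, b)

scores  = np.abs(clf.coef_.ravel())       # |w_i| as the importance of term t_i
ranking = np.argsort(-scores)             # best-scored terms first
terms   = np.array(vec.get_feature_names_out())
print(terms[ranking])
```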

  6. Characterization of Feature Rankings in terms of Sparsity • We have a relatively good understanding of feature rankings based on odds ratio, information gain, etc., because they are based on explicit formulas for feature scores • How can we better understand the rankings based on linear classifiers? • Let “sparsity” be the average number of different words per document, after some feature selection has been applied. • Equivalently: the average number of nonzero components per vector representing a document. • This has direct ties to memory consumption, as well as to CPU time consumption for computing norms, dot products, etc. • We can plot the “sparsity curve” showing how sparsity grows as we add more and more features from a given ranking.
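A minimal sketch of a sparsity curve under the definition above: the average number of nonzero components per document vector when only the top-k ranked features are kept. The helper name sparsity_curve and the toy matrix are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsity_curve(X, ranking, ks):
    """X: sparse document-term matrix; ranking: feature indices, best first;
    ks: feature-set sizes at which to measure sparsity."""
    X = csr_matrix(X)
    curve = []
    for k in ks:
        Xk = X[:, ranking[:k]]                     # keep only the top-k ranked features
        nonzeros_per_doc = Xk.getnnz(axis=1)       # nonzero components of each document vector
        curve.append(float(np.mean(nonzeros_per_doc)))
    return curve

# Toy example: 3 documents over 4 terms; the ranking prefers terms 2, 0, 3, 1
X = csr_matrix(np.array([[1, 0, 2, 0],
                         [0, 1, 1, 1],
                         [3, 0, 0, 1]]))
print(sparsity_curve(X, ranking=[2, 0, 3, 1], ks=[1, 2, 3, 4]))
```

Plotting such a curve against k, for each feature ranking, gives the kind of sparsity curves shown on the next slide.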

  7. Sparsity Curves

  8. Sparsity as the independent variable • When discussing and comparing feature rankings, we often use the number of features as the independent variable. • “What is the performance when using the first 100 features?” etc. • Somewhat unfair towards rankings that prefer (at least initially) less frequent features, such as odds ratio • Sparsity is much more directly connected to memory and CPU time requirements • Thus, we propose the use of sparsity as the independent variable when comparing feature rankings.

  9. Performance as a function of the number of features (Naïve Bayes, 16 categories of RCV2)

  10. Performance as a function of sparsity

  11. Sparsity as a cutoff criterion • Each category is treated as a binary classification problem (does the document belong to category c or not?) • Thus, a feature ranking method produces one ranking per category • We must choose how many of the top ranked features to use for learning and classification • Alternatively, we can define the cutoff in terms of sparsity. • The best number of features can vary greatly from one category to another • Does the best sparsity vary less between categories? • Suppose we want a constant number of features for each category. Is it better to use a constant sparsity for each category?
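One way to implement a sparsity-based cutoff is to convert a target sparsity into a per-category feature count, as in this sketch; the helper cutoff_for_sparsity is hypothetical, not from the paper. With a separate ranking per category, the same target sparsity can yield a different number of features for each category.

```python
import numpy as np
from scipy.sparse import csr_matrix

def cutoff_for_sparsity(X, ranking, target_sparsity):
    """Return the smallest k such that keeping the top-k features of `ranking`
    gives at least `target_sparsity` nonzero terms per document on average."""
    X = csr_matrix(X)
    for k in range(1, len(ranking) + 1):
        avg_terms = X[:, ranking[:k]].getnnz(axis=1).mean()
        if avg_terms >= target_sparsity:
            return k
    return len(ranking)                   # ranking exhausted before reaching the target

# With the toy matrix and ranking from the previous sketch:
X = csr_matrix(np.array([[1, 0, 2, 0], [0, 1, 1, 1], [3, 0, 0, 1]]))
print(cutoff_for_sparsity(X, ranking=[2, 0, 3, 1], target_sparsity=2.0))
```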

  12. Results

  13. Conclusions • Sparsity is an interesting and useful concept • As a cutoff criterion, it is not any worse, and is often a little better, than the number of features • It offers more direct control over memory and CPU time consumption • When comparing feature selection methods, it is not biased in favour of methods which prefer more common features

  14. Future work • Characterize feature ranking schemes in terms of other characteristics besides sparsity curves • E.g. cumulative information gain: how the sum of IG(t; c) over the first k terms t of the feature ranking grows with k. • The goal: define a set of characteristic curves that would explain why some feature rankings (e.g. SVM-based) are better than others. • If we know the characteristic curves of a good feature ranking, we can synthesize new rankings with approximately the same characteristic curves • Would they also perform comparatively well? • With a good set of feature characteristics, we might be able to take the approximate characteristics of a good feature ranking and then synthesize comparably good rankings on other classes or datasets. • (Otherwise it can be expensive to get a really good feature ranking, such as the SVM-based one.)
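A sketch of the proposed cumulative information gain curve; it assumes the per-term IG scores have already been computed (e.g. as in the earlier sketch), and the helper name cumulative_ig_curve and the toy numbers are my own.

```python
import numpy as np

def cumulative_ig_curve(ranking, ig_scores):
    """ranking: feature indices, best first; ig_scores[i]: IG of feature i for category c.
    Returns the running sum of IG over the first k terms, for k = 1..len(ranking)."""
    return np.cumsum([ig_scores[i] for i in ranking])

# Toy example: an SVM-style ranking evaluated against precomputed IG scores
ig_scores = {0: 0.02, 1: 0.15, 2: 0.08, 3: 0.01}
print(cumulative_ig_curve(ranking=[1, 2, 0, 3], ig_scores=ig_scores))
```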
