
text categorization

text categorization. Updated 11/1/2006. Performance measures – binary classification. Ground truth. Accuracy: acc = (a+d)/(a+b+c+d). Precision: p = a/(a+b). Recall: r = a/(a+c). F-measure: F_β = (β²+1)pr/(β²p + r). Usually one uses F1 = 2pr/(p+r). Break-even point.

marty

Presentation Transcript


  1. text categorization Updated 11/1/2006

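  2. Performance measures – binary classification • Contingency table: classifier assignment vs. ground truth, with cells a (assigned and truly in the category), b (assigned but not in the category), c (not assigned but in the category), d (neither) • Accuracy: acc = (a+d)/(a+b+c+d) • Precision: p = a/(a+b) • Recall: r = a/(a+c) • F-measure: F_β = (β²+1)pr/(β²p + r); usually one uses F1 = 2pr/(p+r) • Break-even point: the point at which precision equals recall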
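The measures on this slide can be sketched in a few lines of Python. The cell meanings (a = true positives, b = false positives, c = false negatives, d = true negatives) are read off the precision and recall formulas; the function name `binary_metrics` and the counts in the usage line are hypothetical, chosen only for illustration.

```python
def binary_metrics(a, b, c, d, beta=1.0):
    """Accuracy, precision, recall and F_beta from contingency-table cells.

    a: assigned and in the category      (true positives)
    b: assigned but not in the category  (false positives)
    c: not assigned but in the category  (false negatives)
    d: not assigned and not in it        (true negatives)
    """
    acc = (a + d) / (a + b + c + d)
    p = a / (a + b) if (a + b) else 0.0
    r = a / (a + c) if (a + c) else 0.0
    b2 = beta * beta
    denom = b2 * p + r
    # F_beta = (beta^2 + 1) p r / (beta^2 p + r); beta = 1 gives F1 = 2pr/(p+r)
    f = (b2 + 1) * p * r / denom if denom else 0.0
    return {"accuracy": acc, "precision": p, "recall": r, "f": f}


# Made-up counts, for illustration only.
m = binary_metrics(a=40, b=10, c=20, d=30)
print(m)
```

With beta = 1 the F value weights precision and recall equally; a larger beta gives recall more weight.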

  3. Performance measures – multiple categories • Micro averaging • Macro averaging
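The slide only names the two averaging schemes, so the following is a minimal sketch under the usual definitions: micro averaging pools the contingency counts of all categories before computing the measure, while macro averaging computes the measure per category and then takes the unweighted mean. The helper names and the two example tables are hypothetical.

```python
def precision_recall_f1(a, b, c):
    """Precision, recall and F1 from a (TP), b (FP), c (FN)."""
    p = a / (a + b) if (a + b) else 0.0
    r = a / (a + c) if (a + c) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def macro_f1(tables):
    """Mean of per-category F1: every category counts equally."""
    return sum(precision_recall_f1(a, b, c)[2] for a, b, c in tables) / len(tables)


def micro_f1(tables):
    """F1 of the pooled counts: every document decision counts equally."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    return precision_recall_f1(a, b, c)[2]


# Made-up (a, b, c) tables: one large category, one small one.
tables = [(90, 10, 10), (1, 1, 3)]
macro = macro_f1(tables)
micro = micro_f1(tables)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

On a skewed collection the two numbers can differ a lot: the micro average is dominated by the large categories, the macro average by how well the small ones are handled.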

  4. Reuters 21578 • The Reuters collection contains 9603 training articles and 3299 test articles. • The articles were sent over the Reuters newswire in 1987. • Contains about 100 categories such as ‘mergers and acquisitions’, ‘interest rates’, ‘wheat’, ‘silver’ etc. • Distribution of articles among categories is highly non-uniform. • ‘earning’ contains 2709 docs • 75 categories contain fewer than 10 docs each.

  5. Example of a Reuters news story from category ‘earning’ <DATE>26-FEB-1987 15:18:59.34</DATE> <TOPICS><D>earn</D></TOPICS> <TEXT> <TITLE>COBANCO INC &lt;CBCO> YEAR NET</TITLE> <DATELINE> SANTA CRUZ, Calif., Feb 26 - </DATELINE> <BODY>Shr 34 cts vs 1.19 dlrs Net 807,000 vs 2,858,000 Assets 510.2 mln vs 479.7 mln Deposits 472.3 mln vs 440.3 mln Loans 299.2 mln vs 327.2 mln Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr. Reuter </BODY></TEXT> </REUTERS>

  6. Categorization methods • Decision trees • Naïve Bayes • K-nearest neighbors (KNN) • Neural networks • Support Vector Machines (SVM)

  7. Representation of documents • The most popular representation is ‘Bag of Words’, which ignores all structure of documents. • Document i is represented by a vector x_i ∈ R^n (n is the number of word types), where the j’th coordinate is the number of times word w_j appears in the document (the so-called ‘term frequency’, tf_j).
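A minimal sketch of the bag-of-words representation described above. Lower-casing and whitespace tokenization are simplifying assumptions, and the function names are hypothetical; the sample strings are taken from the Reuters story on slide 5.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each word type to a coordinate index j (sorted for determinism)."""
    word_types = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: j for j, w in enumerate(word_types)}


def tf_vector(doc, vocab):
    """Term-frequency vector: coordinate j counts occurrences of word w_j."""
    vec = [0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] = count
    return vec


docs = [
    "Shr 34 cts vs 1.19 dlrs",
    "Net 807,000 vs 2,858,000",
    "Assets 510.2 mln vs 479.7 mln",
]
vocab = build_vocabulary(docs)
vectors = [tf_vector(d, vocab) for d in docs]
print(len(vocab), vectors[2][vocab["mln"]])
```

Word order is discarded entirely: two documents with the same words in a different order get the same vector.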

  8. Decision trees • Example tree for the question ‘Earnings?’; each fraction is the share of ‘earning’ documents among the documents reaching that node • Root: 2301/7681 = 0.3 of all docs • contains “cents” ≥ 2 times: 1607/1704 = 0.943 ◦ contains “net” ≥ 1 time: 1398/1403 = 0.996 ◦ contains “net” < 1 time: 209/301 = 0.694 • contains “cents” < 2 times: 694/5977 = 0.116 ◦ contains “versus” ≥ 2 times: 422/541 = 0.780 ◦ contains “versus” < 2 times: 272/5436 = 0.050

  9. Building decision trees • Information gain
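The slide names information gain without defining it, so this sketch assumes the usual definition: the entropy of the parent node minus the size-weighted entropy of its children. The counts come from the example tree on slide 8, reading the root as 2301 ‘earning’ documents out of 7681 and its two children as 1607/1704 and 694/5977; the function names are hypothetical.

```python
from math import log2


def entropy(pos, total):
    """Binary entropy (in bits) of a node with `pos` positives out of `total`."""
    if total == 0:
        return 0.0
    p = pos / total
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(parent, children):
    """parent and each child are (positives, total) pairs."""
    _, total = parent
    remainder = sum(n / total * entropy(k, n) for k, n in children)
    return entropy(*parent) - remainder


# Root split of the slide-8 tree on whether "cents" occurs at least twice.
gain_cents = information_gain((2301, 7681), [(1607, 1704), (694, 5977)])
print(f"information gain of the 'cents' split: {gain_cents:.3f} bits")
```

When building a tree, the candidate split with the largest gain would be chosen at each node.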

  10. Decision Tree Pruning

  11. Naïve Bayes • Multivariate Bernoulli model • Multinomial model
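As an illustration of the multinomial model, here is a minimal sketch that scores each category by its log prior plus the summed log likelihoods of the words in the document. Add-one (Laplace) smoothing, whitespace tokenization, the function names and the toy training set are all assumptions made for the example; the slide does not specify them.

```python
from collections import Counter, defaultdict
from math import log


def train_multinomial_nb(docs, labels):
    """Estimate log priors and add-one-smoothed log word likelihoods per class."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    model = {}
    for y, n_docs in class_docs.items():
        total = sum(word_counts[y].values())
        log_prior = log(n_docs / len(docs))
        log_like = {
            w: log((word_counts[y][w] + 1) / (total + len(vocab))) for w in vocab
        }
        model[y] = (log_prior, log_like)
    return model


def classify(model, doc):
    """Return the class with the highest posterior score; unseen words are skipped."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(model, key=score)


# Toy training data, invented for illustration.
docs = ["net shr cts vs", "shr cts net dlrs", "wheat crop tonnes", "wheat export tonnes"]
labels = ["earn", "earn", "wheat", "wheat"]
model = train_multinomial_nb(docs, labels)
print(classify(model, "cts vs net"), classify(model, "wheat tonnes"))
```

The multivariate Bernoulli model differs in that it only records whether each word is present or absent, and also multiplies in the probability of absence for every vocabulary word missing from the document.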

  12. Precision recall curve

  13. K-nearest neighbor
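A minimal KNN sketch over term-frequency vectors. Cosine similarity as the closeness measure and an unweighted majority vote among the k neighbours are assumptions for this example (the slide gives no details), and the vectors and labels are invented.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(x, train_vectors, train_labels, k=3):
    """Label x by majority vote among its k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(x, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-frequency vectors, invented for illustration.
train_vectors = [[2, 1, 0], [3, 0, 0], [0, 0, 2], [0, 1, 3]]
train_labels = ["earn", "earn", "wheat", "wheat"]
prediction = knn_classify([1, 1, 0], train_vectors, train_labels, k=3)
print(prediction)
```

There is no training phase: all the work happens at classification time, when the query is compared against every stored document.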

  14. Neural network • Perceptrons • Multi-layer perceptrons
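For the single-layer case, a perceptron can be sketched as a linear threshold unit trained with the classic mistake-driven update rule. The learning rate, epoch count, the +1/-1 label encoding, the function names and the toy data are assumptions for the example, not details from the slide.

```python
def train_perceptron(X, y, epochs=10, lr=1.0):
    """Learn weights w and bias b; labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b


def predict(w, b, x):
    """Return +1 if x falls on the positive side of the hyperplane, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Linearly separable toy data, invented for illustration.
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

A single perceptron can only separate classes with a hyperplane; multi-layer perceptrons stack such units with non-linear activations to represent more complex decision boundaries.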

  15. SVM

  16. Reuters 21578 – comparison* (*Yiming Yang & Xin Liu, A re-examination of text categorization methods, SIGIR ’99)
