
Document Categorization



  1. Document Categorization • Problem: given • a collection of documents, and • a taxonomy of subject areas • Classification: Determine the subject area(s) most pertinent to each document • Indexing: Select a set of keywords / index terms appropriate to each document

  2. Classification Techniques • Manual (a.k.a. Knowledge Engineering) • typically, rule-based expert systems • Machine Learning • Probabilistic (e.g., Naïve Bayesian) • Decision Structures (e.g., Decision Trees) • Profile-Based • compare document to profile(s) of subject classes • similarity measures like those employed in IR • Support Vector Machines (SVM)

  3. Machine Learning Procedures • Usually train-and-test • Exploit an existing collection in which documents have already been classified • a portion used as the training set • another portion used as a test set • permits measurement of classifier effectiveness • allows tuning of classifier parameters to yield maximum effectiveness • Single- vs. multi-label • can 1 document be assigned to multiple categories?
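A minimal sketch of this train-and-test procedure, using scikit-learn on toy data (an illustrative assumption; the case study below used Lucene and LibSVM instead):

```python
# Train-and-test sketch: split a pre-classified collection, train a
# binary SVM for one subject area, and measure test-set accuracy.
# Documents and labels are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

docs = ["helicopter rotor aerodynamics", "aircraft rotor engines",
        "crop soil chemistry", "soil wheat harvests"]
labels = [1, 1, 0, 0]   # 1 = document belongs to the target subject area

train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

clf = LinearSVC().fit(X_train, y_train)  # parameters tunable against the test set
print(accuracy_score(y_test, clf.predict(X_test)))
```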

  4. Automatic Indexing • Assign to each document up to k terms drawn from a controlled vocabulary • Typically reduced to a multi-label classification problem • each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
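One way to realize this reduction, sketched with scikit-learn's one-vs-rest wrapper; the documents and keyword vocabulary here are hypothetical:

```python
# Indexing as multi-label classification: one binary SVM per keyword
# in the controlled vocabulary. Data below are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["helicopter rotor noise", "rotor blade fatigue tests",
        "soil moisture sensing"]
keywords = [["helicopters", "acoustics"],
            ["helicopters", "materials"],
            ["agriculture"]]

mlb = MultiLabelBinarizer()          # controlled vocabulary -> 0/1 matrix
Y = mlb.fit_transform(keywords)
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
# Each keyword's SVM votes independently; a document receives every
# keyword whose classifier accepts it (capping at k would rank by score).
print(mlb.inverse_transform(clf.predict(vec.transform(["rotor noise study"]))))
```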

  5. Case Study: SVM categorization • Document Collection from DTIC • 10,000 documents • previously classified manually • Taxonomy of • 25 broad subject fields, divided into a total of • 251 narrower groups • Document lengths average 2705 ± 1464 words, 623 ± 274 significant unique terms • Collection has 32,457 significant unique terms

  6. Document Collection

  7. Sample: Broad Subject Fields
01 -- Aviation Technology
02 -- Agriculture
03 -- Astronomy and Astrophysics
04 -- Atmospheric Sciences
05 -- Behavioral and Social Sciences
06 -- Biological and Medical Sciences
07 -- Chemistry
08 -- Earth Sciences and Oceanography

  8. Sample: Narrow Subject Groups
Aviation Technology
  01 Aerodynamics
  02 Military Aircraft Operations
  03 Aircraft
    0301 Helicopters
    0302 Bombers
    0303 Attack and Fighter Aircraft
    0304 Patrol and Reconnaissance Aircraft

  9. Distribution among Categories

  10. Baseline • Establish baseline for conventional techniques • classification • training SVM for each subject area • “off-the-shelf” document modelling and SVM libraries

  11. Why SVM? • Prior studies have suggested good results with SVM • relatively immune to “overfitting” – fitting to coincidental relations encountered during training • low dimensionality of model parameters

  12. Machine Learning: Support Vector Machines • Binary Classifier • Finds the hyperplane with the largest margin separating the two classes of training samples • Subsequently classifies items based on which side of the hyperplane they fall
[Figure: separating hyperplane with maximum margin]
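In standard form (not specific to this study), the maximum-margin hyperplane for training samples (x_i, y_i) with labels y_i in {-1, +1} solves:

```latex
% Hard-margin SVM: the separating hyperplane w . x + b = 0 with the
% largest margin over the training samples.
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b)\ \ge\ 1\ \text{ for all } i
```

New items are then classified by sign(w · x + b), i.e., by which side of the hyperplane they fall on.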

  13. SVM Evaluation

  14. Baseline SVM Evaluation • Training & Testing process repeated for multiple subject categories • Determine accuracy • overall • positive (ability to recognize new documents that belong in the class the SVM was trained for) • negative (ability to reject new documents that belong to other classes) • Explore Training Issues
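The "positive" and "negative" accuracies here amount to per-class recall; a small helper makes the definitions concrete (hypothetical function name, toy example):

```python
# Overall, positive, and negative accuracy from test-set predictions.
# y_test / y_pred are 1 for "belongs to the trained-for class", else 0.
import numpy as np

def accuracies(y_test, y_pred):
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    overall = np.mean(y_test == y_pred)
    positive = np.mean(y_pred[y_test == 1] == 1)  # recognize in-class docs
    negative = np.mean(y_pred[y_test == 0] == 0)  # reject out-of-class docs
    return overall, positive, negative

print(accuracies([1, 1, 0, 0], [1, 0, 0, 0]))  # (0.75, 0.5, 1.0)
```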

  15. SVM “Out of the Box” • 16 broad categories with 150 or more documents • Lucene library for model preparation • LibSVM for SVM training & testing • no normalization or parameter tuning • Training set of 100/100 (positive/negative samples) • Test set of 50/50

  16. “OOtB” Interpretation • Reasonable performance on broad categories, given the modest training set size • A related experiment showed that, with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%

  17. Training Set Size

  18. Training Set Size • accuracy plateaus for training set sizes well under the number of terms in the document model

  19. Training Issues • Training Set Size • Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject • Possible Solution: the collection may have few positive examples, but it has many, many negative examples • Positive/Negative Training Mixes • effects on accuracy (sketched below)
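The mix experiments can be sketched by holding the positive training set fixed and growing the negative set; synthetic Gaussian vectors stand in for document models here (purely illustrative, not the study's setup):

```python
# Sketch of the training-mix experiment: hold 50 positive examples fixed
# and grow the negative set. Gaussian vectors stand in for documents.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(50, 20))        # fixed positive training set
neg = rng.normal(loc=-1.0, size=(1000, 20))     # large pool of negatives
test_pos = rng.normal(loc=1.0, size=(100, 20))  # held-out positives

for n_neg in (50, 100, 200, 400, 800):
    X = np.vstack([pos, neg[:n_neg]])
    y = np.array([1] * 50 + [0] * n_neg)
    clf = LinearSVC().fit(X, y)
    pos_acc = (clf.predict(test_pos) == 1).mean()  # positive accuracy
    print(n_neg, round(pos_acc, 3))
```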

  20. Increased Negative Training

  21. Training Set Composition • experiment performed with 50 positive training examples • OOtB SVM training • increasing the number of negative training examples has little effect on overall accuracy • but positive accuracy is reduced

  22. Interpretation • may indicate a weakness in SVM • or simply further evidence of the importance of optimizing SVM parameters • may indicate the unsuitability of treating SVM output as a simple Boolean decision • might do better as a “best fit” in a multi-label classifier (sketched below)
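"Best fit" here suggests ranking categories by each SVM's signed distance from its hyperplane instead of thresholding every binary SVM at zero. A hypothetical sketch on synthetic data (not the study's implementation):

```python
# "Best fit" multi-label assignment: rank categories by SVM decision
# score rather than thresholding each binary SVM at zero.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# One binary SVM per category, trained on synthetic feature vectors.
categories = {}
for i, name in enumerate(["aviation", "chemistry", "oceanography"]):
    X = rng.normal(size=(40, 5))
    y = (X[:, i] > 0).astype(int)       # toy rule standing in for a subject
    categories[name] = LinearSVC().fit(X, y)

def best_fit(x, k=2):
    # Signed distance from each category's hyperplane; keep the top k.
    scores = {n: clf.decision_function(x.reshape(1, -1))[0]
              for n, clf in categories.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(best_fit(np.array([2.0, 0.1, -1.0, 0.0, 0.0])))  # likely ["aviation", ...]
```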
