Document Classification Techniques using LSI

Barry Britt, University of Tennessee. PILOT Presentation, Spring 2007.

Presentation Transcript


  1. Document Classification Techniques using LSI Barry Britt University of Tennessee PILOT Presentation Spring 2007

  2. Introduction

  3. Automatic Classification of Documents • The problem: • Brought to Dr. Berry by InRAD, LLC • Develop a system that can automatically score proposals (NSF, SBIR) • Proposals are scored by the authors. • The current system grades proposals based on the writing skill of the author • The solution: • An automatic system for classifying readiness levels.

  4. LSI and GTP • LSI is a reduced-rank vector space model • Queries are reduced-rank approximations that produce contextually relevant results • Semantically linked terms and documents are grouped together • GTP is an implementation of LSI • Local and global weighting and document frequencies • Document normalization • Much more…
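The reduced-rank model on this slide can be sketched with a truncated SVD. This is a minimal illustration in NumPy, not GTP's implementation: the function names, the toy matrix, and the choice of k = 2 are assumptions made for the example, and no local or global term weighting is applied.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Truncated SVD of a terms-by-documents matrix, keeping k factors."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Term vectors, singular values, and one k-dimensional row per document.
    return U[:, :k], s[:k], Vt[:k, :].T

def lsi_project(query_counts, U_k, s_k):
    """Fold a raw term-count vector into the reduced space: q^T U_k S_k^-1."""
    return (query_counts @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical toy data: rows are terms, columns are documents.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],  # "windshield"
    [1.0, 2.0, 0.0, 0.0],  # "airline"
    [0.0, 0.0, 1.0, 2.0],  # "snow"
    [0.0, 0.0, 2.0, 1.0],  # "accretion"
])
U_k, s_k, docs_k = lsi_fit(A, k=2)

# A query that mentions only the first two terms.
query = lsi_project(np.array([1.0, 1.0, 0.0, 0.0]), U_k, s_k)
sims = [cosine(query, d) for d in docs_k]
# Documents 0 and 1 score far higher than documents 2 and 3.
```

In this toy case the query lands next to the two documents that share its terms, which is the grouping behaviour the slide describes.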

  5. Document Sets - Composition • Consist of Technology Readiness Reports • Proposals • Subjective score from 1 (low) to 9 (high) • No structure to documents • Example (from DS1): windshield windshield windshield windshield windshield windshield windshield triphammer triphammer night night ithaca ithaca fleet fleet airline airline worthy window william warn visual visible variant snow severe retrofit reflect red prior primarily popular owner outside oem look imc gages day dangerous cue certification brook analog accumulation accumulate accretion

  6. Document Sets - Composition • Document Set 1 (DS1) • 4000 documents • 85.4 terms per document • Document Set 2 (DS2) • 4000 documents • 49.2 terms per document • Document Set 3 (DS3) • 455 documents • 37.1 terms per document • 2 Classes - “Phase1” and “Phase2”

  7. Document Sets - Labels Class labels for the individual documents are determined by the authors of the proposals…

  8. Document Classification

  9. Classification - Majority Rules • 3 steps to classification: • Choose a document and make it the query document • Retrieve the x closest documents, i.e. those with the highest cosine similarity to the query • Class = the class with max[n1, n2], where n1 and n2 count the retrieved documents in Class 1 and Class 2
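The three steps above can be sketched as a nearest-neighbour vote. This is an illustration only: the function names and the two-dimensional toy vectors are assumptions standing in for the document vectors an LSI model would supply.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def majority_rules(query_idx, vectors, labels, x):
    """Label one document by a vote among its x most similar neighbours."""
    query = vectors[query_idx]
    others = [i for i in range(len(vectors)) if i != query_idx]
    others.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    votes = Counter(labels[i] for i in others[:x])
    return votes.most_common(1)[0][0]

# Hypothetical vectors: four "Phase1" documents and two "Phase2" documents.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.8, 0.1), (0.9, 0.1), (0.1, 1.0), (0.2, 0.9)]
labels = ["Phase1", "Phase1", "Phase1", "Phase1", "Phase2", "Phase2"]

majority_label = majority_rules(0, vectors, labels, x=3)  # "Phase1"
minority_label = majority_rules(4, vectors, labels, x=3)  # "Phase1", although document 4 is "Phase2"
```

Document 4 has only one same-class neighbour available, so two less similar "Phase1" documents outvote it. That is the kind of minority-class weakness a plain majority vote can show.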

  10. Majority Rules - Results [Three tables of predicted versus actual classes; the values are not preserved in this transcript.]

  11. Majority Rules - Results • Why are these results not good? • Good representation from Class 1 • Very poor representation from Class 2 • How can we improve results in the underrepresented class?

  12. Classification - Class Weighting • Add a “weight” to our classification. • Steps: • Choose a document and make it the query document • Retrieve the x closest documents, i.e. those with the highest cosine similarity to the query • Class = the class with max[weight1 * n1, weight2 * n2], where n1 and n2 count the retrieved documents in each class • Each class has its own separate weight
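A minimal sketch of the weighted vote, assuming the x nearest neighbours have already been retrieved and only their class labels are passed in. The weight values are hypothetical; the slides do not say which weights were used.

```python
from collections import Counter

def weighted_vote(neighbour_labels, weights):
    """Return the class c that maximises weights[c] * (neighbours labelled c)."""
    counts = Counter(neighbour_labels)
    # On a tie, max() keeps the class listed first in `weights`.
    return max(weights, key=lambda c: weights[c] * counts[c])

neighbours = ["Phase2", "Phase1", "Phase1"]

unweighted = weighted_vote(neighbours, {"Phase1": 1.0, "Phase2": 1.0})  # "Phase1"
weighted = weighted_vote(neighbours, {"Phase1": 1.0, "Phase2": 3.0})    # "Phase2"
```

With equal weights this reduces to the Majority Rules vote. Raising the minority-class weight lets a single close "Phase2" neighbour outweigh two "Phase1" neighbours.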

  13. Weighted Classifier - Results [Three tables of predicted versus actual classes; the values are not preserved in this transcript.]

  14. Weighted Classifier - Results • Better classifier • Still a good representation from majority class • Better representation from minority class • We can still improve on these results for the minority class.

  15. “Weight - Document Size” (WS) Classifier • Problem: • Minority class still underrepresented • Hypothesis: • Documents in the same class will have similar “sizes”, or total number of relevant terms. • Solution: • Account for document size in results for the Weighted Classifier

  16. “Weight - Document Size” (WS) Classifier • Only consider documents within n total words of the query document • Steps: • Choose a document and make it the query document • Retrieve the x closest documents that are within n words of the query document's size • Class = the class with max[weight1 * n1, weight2 * n2] • Each class, like the regular weighted classifier, has its own weight value
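The size filter can be sketched as one extra step before the weighted vote. The tuple layout, the function name, and the numbers below are assumptions for illustration. Here n is the allowed difference in term count and x is the number of neighbours kept, as in the steps above.

```python
from collections import Counter

def ws_classify(query_size, candidates, weights, x, n):
    """Weighted vote over the x most similar candidates of comparable size.

    candidates: (label, cosine_similarity, term_count) for every other document.
    """
    eligible = [c for c in candidates if abs(c[2] - query_size) <= n]
    eligible.sort(key=lambda c: c[1], reverse=True)
    counts = Counter(label for label, _, _ in eligible[:x])
    # If no candidate passes the size filter, every score is 0 and the
    # class listed first in `weights` is returned.
    return max(weights, key=lambda c: weights[c] * counts[c])

# Hypothetical neighbours of a 40-term query document.
candidates = [
    ("Phase1", 0.95, 90), ("Phase1", 0.90, 85), ("Phase1", 0.85, 45),
    ("Phase2", 0.80, 38), ("Phase2", 0.75, 42), ("Phase1", 0.70, 88),
]
weights = {"Phase1": 1.0, "Phase2": 1.0}

with_filter = ws_classify(40, candidates, weights, x=3, n=10)       # "Phase2"
without_filter = ws_classify(40, candidates, weights, x=3, n=1000)  # "Phase1"
```

With the filter, the three long "Phase1" documents drop out of the candidate pool, so the vote is taken among documents of similar size.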

  17. “Weight - Document Size” (WS) Classifier [Three tables of predicted versus actual classes; the values are not preserved in this transcript.]

  18. “Weight - Document Size” (WS) Classifier • Best classifier so far • Good representation from both classes • Best representation so far from the minority class • Can we improve this further?

  19. Term Classifier • Rather than classifying based on similar documents, classify based on similar terms. • Steps: • Analyze the terms in each document, and the class of those documents. • Choose a document and make it the query document • Retrieve the x closest documents (note: we are not accounting for document size) • Class = max[weight1 * n1,weight2 * n2] • Again, each class has its own weight.

  20. Term Classifier [Diagram: Class 1 words, Class 2 words, and words shared by Class 1 and Class 2] • In one of our document sets, the list of exclusive words was less than 3% of the total words.

  21. Term Classifier • Take the exclusive words list. If a document clusters near a “Phase1” exclusive word, classify it as “Phase1”, and vice versa • We can use this information to produce an alternate classification.
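Building the exclusive words list might look like the sketch below. It covers only the list construction; deciding whether a document "clusters near" an exclusive word would still need term and document vectors from the LSI model, which are not shown. The function name and the sample documents are hypothetical.

```python
def exclusive_terms(docs, labels):
    """Return, per class, the terms that occur only in documents of that class."""
    seen = {}
    for terms, label in zip(docs, labels):
        seen.setdefault(label, set()).update(terms)
    exclusive = {}
    for label, terms in seen.items():
        # Union of the vocabularies of every other class.
        elsewhere = set().union(*(t for other, t in seen.items() if other != label))
        exclusive[label] = terms - elsewhere
    return exclusive

# Hypothetical parsed documents, each reduced to its set of terms.
docs = [
    {"windshield", "snow", "retrofit"},
    {"snow", "fleet"},
    {"fleet", "certification"},
]
labels = ["Phase1", "Phase1", "Phase2"]

exclusive = exclusive_terms(docs, labels)
# "fleet" appears in both classes, so it is exclusive to neither.
```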

  22. Term Classifier [Three tables of predicted versus actual classes; the values are not preserved in this transcript.]

  23. Term Classifier • Comparable to the WS Classifier • Better for DS3, probably because it is a much smaller set • Not good for Class 2 in the other sets. • The real value lies in reclassification.

  24. Document Reclassification • The Term Classifier correctly identifies some documents missed by the WS Classifier. • Confidence Value: • If the WS Classifier's label for a document does not have a high confidence value, check what the Term Classifier says.
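The fallback rule can be sketched as follows. The slides do not define the confidence value, so this example assumes it is the winning class's share of the weighted votes, and the 0.75 threshold is an arbitrary placeholder.

```python
def vote_confidence(weighted_votes):
    """Assumed definition: the winning class's share of all weighted votes."""
    total = sum(weighted_votes.values())
    return max(weighted_votes.values()) / total if total else 0.0

def reclassify(ws_label, ws_confidence, term_label, threshold=0.75):
    """Keep the WS label unless its confidence is low, then defer to the Term Classifier."""
    return ws_label if ws_confidence >= threshold else term_label

# Hypothetical weighted vote totals from the WS Classifier for one document.
confidence = vote_confidence({"Phase1": 2.0, "Phase2": 3.0})

low_confidence_case = reclassify("Phase2", confidence, term_label="Phase1")  # "Phase1"
high_confidence_case = reclassify("Phase2", 0.9, term_label="Phase1")        # "Phase2"
```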

  25. Document Reclassification • Technique is good for checking small numbers of documents. • Technique is not good for completely reclassifying an entire set.

  26. Related Work

  27. Java GUI Front End • Developed in Spring 2007 • Provides a stable interface through which to run GTP and classify documents • Can “save state”: stores the LSI model and all internal data structures for later use • All tables used in this document were generated by this program.

  28. Windows GTP • Direct port of GTP from UNIX to Windows • Developed on Windows XP, SP2 • Completely self-contained, doesn’t require external programs or shared libraries • Sorting parsed words: • Original GTP uses UNIX sort command… • Windows GTP uses an external merge sort…
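For readers unfamiliar with the technique, an external merge sort writes sorted runs to temporary files and then merges them, so only a bounded number of words sits in memory at once. The sketch below is a generic illustration in Python, not Windows GTP's code; the function name and the chunk size are assumptions.

```python
import heapq
import os
import tempfile

def external_sort(words, chunk_size):
    """Yield words in sorted order, buffering at most chunk_size of them at a time."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the buffered words and write them out as one run file.
        if not chunk:
            return
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as run:
            run.writelines(word + "\n" for word in chunk)
        run_paths.append(path)
        chunk.clear()

    for word in words:
        chunk.append(word)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    # Merge the sorted runs line by line, then clean up the temporary files.
    runs = [open(path, encoding="utf-8") for path in run_paths]
    try:
        for line in heapq.merge(*runs):
            yield line.rstrip("\n")
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)

parsed = ["window", "airline", "snow", "fleet", "analog"]
sorted_words = list(external_sort(parsed, chunk_size=2))
```

The sketch assumes one word per line with no embedded newlines, which matches a stream of parsed terms.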

  29. Acknowledgements • These people and groups assisted by providing their knowledge and experiences to the project • Dr. Michael Berry • Murray Browne • Mary Ann Merrell • Nathan Fisher • The InRAD staff

  30. References • “Using Linear Algebra for Intelligent Information Retrieval.” Michael W. Berry, Susan T. Dumais, and Gavin W. O’Brien, December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595. • Understanding Search Engines: Mathematical Modeling and Text Retrieval. M. Berry and M. Browne, SIAM Book Series: Software, Environments, and Tools. (2005), ISBN 0-89871-581-4. • “GTP (General Text Parser) Software for Text Mining.” J. T. Giles, L. Wo, and M. W. Berry, Statistical Data Mining and Knowledge Discovery. H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.
