
Automatic Classification Document and Filing




  1. Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

  2. Overview • Introduction • Classification Techniques • Hidden Markov Models • Similar Systems • Novel Approach

  3. Introduction • Creating an assistant document filer that learns from the user. • Novelty: applies multiple classification approaches to build a hierarchical folder system based on the user's filing patterns, and uses both natural and specific learning along with Markov Models to determine the user's style of filing.

  4. Classification - Bayesian • Probabilistic method using Bayes' Theorem [1] [12] • Bayes' Theorem: P(B|A) = P(A|B) P(B) / P(A) • To classify a document A, sum the probabilities that each word in A belongs to class B.

  5. Classification – Bayesian (cont.) • Assumes each word is independent of the others. • Often performs just as well as more complicated techniques such as decision trees, rule-based learning, and instance-based learning.
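The naive Bayes scheme above can be sketched in a few lines. This is a minimal illustration, not the system's implementation; the function names are invented here, and add-one smoothing is an assumption to avoid zero probabilities. Each class score is the log prior plus the summed log likelihoods of the document's words.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, label). Returns class priors and per-class word counts."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    return priors, word_counts

def classify_nb(words, priors, word_counts, vocab_size):
    """Pick the class maximizing log P(B) + sum of log P(word | B)."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label, prior in priors.items():
        counts = word_counts[label]
        n = sum(counts.values())
        score = math.log(prior / total)
        for w in words:
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[w] + 1) / (n + vocab_size))
        if score > best_score:
            best, best_score = label, score
    return best
```

Training on a handful of labeled word lists and classifying a new one is enough to see the independence assumption at work: word order is ignored entirely.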

  6. Classification – Vector Based • The text documents are turned into vectors [1] • Support Vector Machines [14] • Supervised learning. • Forms a dividing boundary between examples mapped in space. • New objects are classified according to which side of the divide they fall on.

  7. Classification – Vector (cont.) • T-Route [1] • An average document vector is created for each of the K classes. • Uses a term-document matrix • W_TR is of size M × K. • W_ij is the number of times term t_i occurs in class c_j.

  8. Classification – Vector (cont.) • Vectorization [1]

  9. Classification – Vector (cont.) • T-Trans [1] • A unique document vector is created for each of the K classes. • W_TT is of size M × N. • W_ij is the number of times term t_i occurs in document d_j. • A document is assigned the same class as the column vector in W_TT with the smallest Euclidean distance from the document.
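A rough sketch of the T-Trans idea, assuming raw term counts as weights (the helper names are made up here, not from [1]): build the M × N term-document matrix, then assign a new document the label of the nearest training-document column by Euclidean distance.

```python
import math
from collections import Counter

def term_document_matrix(docs, terms):
    """W[i][j] = count of term t_i in document d_j  (an M x N matrix)."""
    return [[Counter(d)[t] for d in docs] for t in terms]

def classify_t_trans(new_doc, docs, labels, terms):
    """Assign new_doc the label of the nearest column vector (Euclidean distance)."""
    vec = [Counter(new_doc)[t] for t in terms]
    best_j = min(
        range(len(docs)),
        key=lambda j: math.dist(vec, [Counter(docs[j])[t] for t in terms]),
    )
    return labels[best_j]
```

With TF-IDF or other weightings substituted for raw counts, the same nearest-column rule applies unchanged.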

  10. Classification - Improvements • Latent Semantic Analysis • Looks at the relationships between words and documents, then forms concepts that link them to each other.

  11. Classification - Improvements • Term Weighting [1] [15] • Term Frequency – Inverse Document Frequency • A word's importance increases proportionally to the number of times it appears in the document, but is offset by the frequency of the word in the corpus.
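The TF-IDF weighting just described can be sketched as a toy implementation (one common variant; the exact normalization used in [1] and [15] may differ): term frequency within the document times the log of inverse document frequency across the corpus.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        # tf = count / doc length; idf = log(N / df); weight = tf * idf
        weights.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights
```

Note how a term appearing in every document gets idf = log(1) = 0 and so carries no weight, which is exactly the offsetting effect described above.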

  12. Classification - Improvements • Term Weighting [1] [15] • Mutual Information – looks at two different classes and infers which keywords used to classify one of them would also lead to a misclassification of the other • Measures their mutual dependence.
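A sketch of the mutual-information measure, using the standard 2×2 document contingency table between a term and a class (the table formulation is an assumption here, following the usual feature-selection definition rather than anything stated in the slides):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a term and a class from document counts:
    n11 = in class, contains term;   n10 = not in class, contains term;
    n01 = in class, lacks term;      n00 = not in class, lacks term."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc:  # skip empty cells (their contribution tends to zero)
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi
```

Independent term/class pairs score 0 bits; a term that perfectly predicts the class scores the full entropy of the class, so high-MI terms are the useful discriminators.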

  13. Classification - Improvements • Term Weighting [1] [15] • Bellegarda – combines a global weighting with a localized weighting for each word. • Creates a new term-document weight W_ij for term t_i in document d_j.

  14. Hidden Markov Models [4] • A method for learning patterns – here, specifically filing patterns. • An HMM describes two related discrete-time stochastic processes: • First – the hidden states • Second – the visible variables

  15. Hidden Markov Models [4] • Example: the user files using two different styles: by Date and by Area of Interest. • Observations about the documents in each node lead to a filing type, using the probabilities of each type and the node data.
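Treating the two filing styles as hidden states and document features as emissions, the Viterbi algorithm recovers the most likely sequence of styles behind a run of observations. This is a toy sketch to show the idea; the state names, observation symbols, and probabilities are invented for illustration and are not from the system.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation list."""
    # V maps each state to (probability of best path ending here, that path).
    V = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        prev = V
        V = {}
        for s in states:
            # Best predecessor for state s, then emit the current observation.
            p, path = max((prev[r][0] * trans_p[r][s], prev[r][1]) for r in states)
            V[s] = (p * emit_p[s][o], path + [s])
    return max(V.values())[1]
```

With "sticky" self-transitions, a run of date-like observations followed by a topic-like one decodes as the user filing by Date and then switching to Area, which is the kind of style inference the slides describe.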

  16. Hidden Markov Models [4] [Diagram: hidden states Date and Area; observations include Related Documents, Similar Dates, and Unrelated Documents]

  17. Similar Systems • Email Classification/ Routing Systems • Hierarchical Systems • Semantic Desktops

  18. Similar Systems • Email Classification / Routing Systems • [6] A system that reroutes information from a central database to multiple users with different profiles, using evolving classifier agents that filter the data • [1] Continually receives new text-based documents and works to classify them and extract important information from them.

  19. Similar Systems • Hierarchical Systems [7] • At each level a context-sensitive signature and feature selection is created, then focused to cut out noise and stop words. • The Bayesian classifier outperformed the vector-based one.

  20. Similar Systems • Semantic Desktops • CALO [13] • A project led by SRI International focused on developing a smart desktop • Aims to automate interrelated decision-making tasks that have resisted automation and to react appropriately to unusual situations.

  21. Similar Systems • Semantic Desktops • DEVONthink [10] • Seeks to be an all-inclusive information gatherer and organizer • Sorts, classifies, and shows relationships between documents automatically, but has shortcomings.

  22. My Approach • A hierarchical approach to classification • Classifies each node in a directory • Also uses natural and specific learning, letting the user choose how involved to be in the learning. • Uses Markov Models to determine the user's style of filing, and automatically places files that do not fit into any current node.

  23. My Approach • The user is able to drag newly received files and drop them onto the program. • Files are classified by their content and placed in the location where the user would most likely have put them. • Any changes by the user are recorded and fed back into the classifier.
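One way the drag-and-drop filing step might be sketched (the tree representation and the cosine-similarity choice are assumptions for illustration, not the system's actual design): descend the folder hierarchy, at each node picking the child whose word profile best matches the dropped document.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def file_document(words, node, path=""):
    """Walk the folder tree; at each level choose the best-matching child.
    Each node is {"name": str, "profile": Counter, "children": [nodes]}."""
    doc = Counter(words)
    while node.get("children"):
        node = max(node["children"], key=lambda c: cosine(doc, c["profile"]))
        path += "/" + node["name"]
    return path
```

Recording the user's corrections would then mean updating the profile Counters along the chosen path, so future documents with similar content follow the same route.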

  24. References
  [1] Tailby, R., Dean, R., Milner, B., and Smith, D. 2006. Email classification for automated service handling. In Proceedings of the 2006 ACM Symposium on Applied Computing (Dijon, France, April 23-27, 2006). SAC '06. ACM, New York, NY, 1073-1077. http://doi.acm.org/10.1145/1141277.1141530
  [2] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar. 2002), 1-47. http://doi.acm.org/10.1145/505282.505283
  [3] Fu, Y., Ke, W., and Mostafa, J. 2005. Automated text classification using a multi-agent framework. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA, June 7-11, 2005). JCDL '05. ACM, New York, NY, 157-158. http://doi.acm.org/10.1145/1065385.1065420
  [4] Frasconi, P., Soda, G., and Vullo, A. 2002. Hidden Markov Models for text categorization in multi-page documents. J. Intell. Inf. Syst. 18, 2-3 (Mar. 2002), 195-217. http://dx.doi.org/10.1023/A:1013681528748
  [5] Cohen, W. W. and Singer, Y. 1999. Context-sensitive learning methods for text categorization. ACM Trans. Inf. Syst. 17, 2 (Apr. 1999), 141-173. http://doi.acm.org/10.1145/306686.306688
  [6] Clack, C., Farringdon, J., Lidwell, P., and Yu, T. 1997. Autonomous document classification for business. In Proceedings of the First International Conference on Autonomous Agents (Marina del Rey, California, United States, February 5-8, 1997). AGENTS '97. ACM, New York, NY, 201-208. http://doi.acm.org/10.1145/267658.267716
  [7] Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. 1998. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal 7, 3 (Aug. 1998), 163-178. http://dx.doi.org/10.1007/s007780050061
  [8] Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, August 24-28, 1998). SIGIR '98. ACM, New York, NY, 96-103. http://doi.acm.org/10.1145/290941.290970
  [9] Cognitive Assistant that Learns and Organizes. http://caloproject.sri.com/
  [10] DEVONthink. http://www.devon-technologies.com/products/devonthink/index.html
  [11] NEPOMUK Semantic Desktop. http://nepomuk.semanticdesktop.org/
  [12] Fan, H. and Ramamohanarao, K. 2003. A Bayesian approach to use emerging patterns for classification. In Proceedings of the 14th Australasian Database Conference - Volume 17 (Adelaide, Australia). K. Schewe and X. Zhou, Eds. ACM International Conference Proceeding Series, vol. 143. Australian Computer Society, Darlinghurst, Australia, 39-48.
  [13] Cognitive Assistant that Learns and Organizes. http://caloproject.sri.com/
  [14] Support Vector Machines. December 2009. http://en.wikipedia.org/wiki/Support_vector_machine
  [15] Yu, K., Xu, X., Ester, M., and Kriegel, H.-P. 2003. Feature weighting and instance selection for collaborative filtering. Knowledge and Information Systems 5, 2, 201-224.

  25. Questions?
