
Data Mining Applied to Document Imaging




  1. Data Mining Applied to Document Imaging Jeff Rekoske

  2. Agenda • Introduction • Problem Definition • Solution and Methodology • Progress Report • Tools • Techniques Applied from CSC-288 • Lessons Learned/Reinforced • Summary

  3. Introduction • Employed as SW Developer and DBA on document imaging project • Access to OCR statistics • Management staff has a few questions that can be answered by analysis of existing data

  4. Problem Definition • Two Parts • Management questions • Data mining demonstration

  5. Management Questions • Result of interviews • Fairly basic • What forms are processed the most? • What are the recognition rates for the top forms? • What is the percentage of forms that were presented to an operator for keying?
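The three management questions map onto simple aggregate queries. The following is a minimal sketch using Python's built-in `sqlite3` module rather than the Oracle 8i system named in the slides. The tables `document` and `ocr_field`, their columns, and the sample rows are all hypothetical, since the actual data mart schema is not reproduced in this transcript.

```python
import sqlite3

# Hypothetical, simplified tables; the real data mart schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (doc_id INTEGER PRIMARY KEY, form_type TEXT,
                       sent_to_operator INTEGER);
CREATE TABLE ocr_field (doc_id INTEGER, field_name TEXT, recognized INTEGER);
INSERT INTO document VALUES (1,'A',0),(2,'A',1),(3,'B',0),(4,'A',0);
INSERT INTO ocr_field VALUES (1,'name',1),(1,'date',1),(2,'name',0),
                             (2,'date',1),(3,'name',1),(4,'name',1);
""")

# Q1: which forms are processed the most?
form_counts = conn.execute(
    "SELECT form_type, COUNT(*) AS n FROM document "
    "GROUP BY form_type ORDER BY n DESC, form_type"
).fetchall()

# Q2: field-level recognition rate for each form type
recognition = conn.execute("""
    SELECT d.form_type, AVG(f.recognized) AS rate
    FROM document d JOIN ocr_field f ON f.doc_id = d.doc_id
    GROUP BY d.form_type ORDER BY d.form_type
""").fetchall()

# Q3: percentage of forms presented to an operator for keying
pct_keyed = conn.execute(
    "SELECT 100.0 * SUM(sent_to_operator) / COUNT(*) FROM document"
).fetchone()[0]

print(form_counts, recognition, pct_keyed)
```

Restricting the second query to the "top forms" would only require joining it against the first result or adding a `LIMIT` to a subquery.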

  6. Data Mining Demonstration • Purpose is to show the usefulness of data mining techniques. • Prediction of rates for new forms • Characteristics of highly recognized forms • Use mined data to develop new forms

  7. Solution • Data mart • Answer management questions • Provide data for mining activities

  8. Data Mart Schema (Snowflake)

  9. ETL and Data Mining Dataflow

  10. Methodology • Choose a small timeframe to sample data • September – October 2004 • Use ETL to load data • Relatively “clean” process due to data location • Apply SQL statements to data mart to answer management questions

  11. Methodology (continued) • Extract data from data mart to create WEKA files • Attribute-Relation File Format (ARFF) • Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition) • Validate model with 10-fold cross validation
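An ARFF file is plain text: a `@relation` line, one `@attribute` line per column, then a `@data` section of comma-separated rows. The sketch below writes that layout in Python; the original scripts were written in PL/SQL and are not shown in the slides. The function name `to_arff` and the attribute names are hypothetical, and the sketch does not handle quoting of values that contain spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind), where kind is the string 'numeric'
    or a list of allowed nominal values.
    rows: iterable of tuples, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attributes list their values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical form characteristics with a pass/fail recognition label.
arff_text = to_arff(
    "ocr_forms",
    [("field_count", "numeric"),
     ("form_type", ["A", "B"]),
     ("result", ["pass", "fail"])],
    [(12, "A", "pass"), (40, "B", "fail")],
)
print(arff_text)
```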

  12. Progress Report • First part (management questions) complete • 14,210 imaged documents • 865,409 OCR fields • View created that joins the tables • Lets non-technical personnel write basic queries • Management is pleased with results
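A view that pre-joins the tables lets a user query one flat relation without writing joins. A minimal sketch of the idea follows, again using `sqlite3` as a stand-in for Oracle; the names `form_dim`, `document_fact`, and `document_summary` and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension and fact tables
CREATE TABLE form_dim (form_id INTEGER PRIMARY KEY, form_name TEXT);
CREATE TABLE document_fact (doc_id INTEGER PRIMARY KEY, form_id INTEGER,
                            fields_total INTEGER, fields_recognized INTEGER);
INSERT INTO form_dim VALUES (1,'Claim'),(2,'Invoice');
INSERT INTO document_fact VALUES (10,1,20,18),(11,1,20,15),(12,2,10,10);

-- The view hides the join from the person writing queries
CREATE VIEW document_summary AS
SELECT d.doc_id, f.form_name, d.fields_total, d.fields_recognized
FROM document_fact d JOIN form_dim f ON f.form_id = d.form_id;
""")

# A basic query against the view; no join knowledge required
summary = conn.execute("""
    SELECT form_name, SUM(fields_recognized), SUM(fields_total)
    FROM document_summary GROUP BY form_name ORDER BY form_name
""").fetchall()
print(summary)
```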

  13. Progress Report (continued) • Part two (WEKA classifier) in progress • ARFF generation scripts complete • Need to run ARFF files through WEKA • Need to cross-validate results
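For the remaining cross-validation step, the fold mechanics can be sketched in a few lines of Python. This is an illustration of plain k-fold splitting only; WEKA's own procedure may differ (for example by stratifying folds by class), and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    # Each instance is held out exactly once across the k rounds.
    print(len(train), len(test))
```

A model is trained on each `train` list and scored on the matching `test` list, and the k scores are averaged.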

  14. Tools • Oracle 8i RDBMS • Oracle PL/SQL scripting language • WEKA implementation of C4.5 classifier • WEKA cross validation

  15. Techniques Applied from CSC-288 • Data Mart • Snowflake Schema • ETL • OLAP Operations

  16. Techniques Applied (continued) • Classification • C4.5 Algorithm • Supervised Learning • Credibility • Cross-Validation
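C4.5 chooses the attribute to split on by gain ratio: information gain divided by the entropy of the split itself. The sketch below computes it for a single nominal attribute. It is an illustration of the criterion only, not WEKA's implementation, and the attribute values ("typed", "hand") and function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` on the nominal attribute `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical example: print style perfectly predicts pass/fail.
ratio = gain_ratio(["typed", "typed", "hand", "hand"],
                   ["pass", "pass", "fail", "fail"])
print(ratio)
```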

  17. Lessons Learned/Reinforced • Get firm requirements (if possible) • Data marts can get large quickly • OLAP operations should be run offline, separate from the OLTP system • Demonstrations are useful for explaining concepts

  18. Summary • Application of knowledge from CSC-288 to my work • Data mart can be used to answer multiple questions without affecting OLTP processing • Hope to demonstrate use of the data mart to create a classification model

  19. References • Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufmann, San Francisco, 2001. • Ian H. Witten and Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,” Morgan Kaufmann, San Francisco, 2000.

  20. Questions?
