
Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm

Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm. A Thesis Proposal Presented to the Department of Computer Science Brigham Young University. Kenneth Martin Tubbs Jr. Motivation. Millions of people want genealogical information


Presentation Transcript


  1. Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm A Thesis Proposal Presented to the Department of Computer Science Brigham Young University Kenneth Martin Tubbs Jr.

  2. Motivation • Millions of people want genealogical information • Acquiring microfilm is expensive and time-consuming

  3. Problem • Searching microfilm by hand is slow, error-prone, and tedious • Extraction by hand requires enormous amounts of time and manpower

  4. Problem • Tables have different layouts and styles • Tables contain different records • Tables lack information and are ambiguous

  5. Related Work • Current work exploits the geometric properties of tables • Regular expressions, grammars, probabilistic models, and templates • Existing approaches ignore the ontological constraints of the information

  6. Input Features • Coordinates of each cell • Printed text of each cell • Whether or not each cell is empty • XML Input File: <cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" /> …
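The cell element on this slide could be loaded into a small record along the following lines. This is a minimal Python sketch, not the proposal's Java implementation. The `Cell` class and `parse_cell` helper are hypothetical names, and two readings are assumptions rather than facts from the slides: that `rectangle` lists left, top, right, bottom, and that `empty="1"` marks an empty cell.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Cell:
    # Assumed coordinate order: left, top, right, bottom.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cell(xml_fragment: str) -> Cell:
    """Parse one <cell .../> element into a Cell record."""
    elem = ET.fromstring(xml_fragment)
    left, top, right, bottom = (int(v) for v in elem.get("rectangle").split(","))
    return Cell(
        left=left,
        top=top,
        right=right,
        bottom=bottom,
        printed_text=elem.get("printed_text", ""),
        empty=elem.get("empty") == "1",  # assumption: "1" means empty
    )


# The attribute values are copied from the slide.
cell = parse_cell(
    '<cell rectangle="335,114,521,172" '
    'printed_text="NAME and Surname of each Person" empty="0" />'
)
```

The three input features from the slide (coordinates, printed text, emptiness) map directly onto the record's fields.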

  7. Method • Input: XML Input File (Preprocessed Microfilm Image) and Genealogical Ontology • Algorithm: Collect Evidence → Apply Rules → Verify Results • Output: SQL Insert Statements

  8. Cell Types Label Cells Print Value Cells Empty Cells

  9. Input

  10. Genealogical Ontology [Figure: ontology diagram relating the object sets Family, Person, Name, Age, Gender, and Address, with edges annotated *, 1, 1.1, 1.3, and 4.3]

  11. Extract Features (Collect Evidence) • The algorithm extracts features • Features support or refute geometric and ontological relationships • Extracted features yield a confidence value between 0 and 1

  12. 5 Relationships (Collect Evidence) • Associate value cells with label cells • Associate label cells with label cells • Associate value cells with value cells • Match label cells to object sets in the genealogical ontology • Identify label cells that factor other label cells

  13. Evidence Matrix (Collect Evidence) [Figure: evidence matrix indexed by label cells and value cells, with sample confidence entries .75, .10, .20, .32]

  14. Apply Rules (Collect Evidence → Apply Rules) • A set of correlation rules associates the values of the evidence matrices • The algorithm iterates over the set of correlation rules

  15. A Rule (Collect Evidence → Apply Rules) [Figure: Value-Value entry .90; Label-Value entries .75, .10, .20, .32] $\forall j:\ \min(LV_{ji}, LV_{jk}) \leftarrow \min\{\, \min(LV_{ji}, LV_{jk}) \cdot (VV_{ik} + 0.3),\ \max(LV_{ji}, LV_{jk}) \,\}$

  16. A Rule (Collect Evidence → Apply Rules) [Figure: Value-Value entries .90, .75, .32; Label-Value entries .75, .32] $\forall j:\ \min(LV_{ji}, LV_{jk}) \leftarrow \min\{\, \min(LV_{ji}, LV_{jk}) \cdot (VV_{ik} + 0.3),\ \max(LV_{ji}, LV_{jk}) \,\}$
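As a worked instance of the rule on the two slides above, the following Python sketch applies it to the label-value entries .75 and .32 and the value-value entry .90 shown in the figure. The function name and the `boost` parameter name are hypothetical. Assigning .75 to one value cell and .32 to the other is an assumption, although the rule is symmetric in the two entries, so the choice does not affect the result.

```python
def apply_value_value_rule(lv_ji, lv_jk, vv_ik, boost=0.3):
    """Update the weaker of two label-value confidences for label j.

    The weaker entry is scaled by (vv_ik + boost) and capped at the
    stronger entry, following the rule as written on the slide.
    """
    weaker = min(lv_ji, lv_jk)
    stronger = max(lv_ji, lv_jk)
    updated = min(weaker * (vv_ik + boost), stronger)
    # Write the updated value back to whichever entry was the weaker one.
    if lv_ji <= lv_jk:
        return updated, lv_jk
    return lv_ji, updated


# Slide values: label-value entries .75 and .32, value-value entry .90.
lv_ji, lv_jk = apply_value_value_rule(0.75, 0.32, 0.90)
```

With these inputs the weaker entry is multiplied by 1.2, giving roughly 0.384, which stays below the cap of 0.75. Note that as written the rule lowers the weaker entry whenever the value-value confidence is below 0.7.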

  17. Factoring (Collect Evidence → Apply Rules) • [Name] per [Address] = 9 / 2 = 4.5

  18. Genealogical Ontology (Collect Evidence → Apply Rules) • [Name] per [Address] = 1 * 4.3 * 1.1 = 4.73 [Figure: ontology diagram relating the object sets Family, Person, Name, Age, Gender, and Address, with edges annotated *, 1, 1.1, 1.3, and 4.3]

  19. A Factoring Rule (Collect Evidence → Apply Rules) • Compare the expected cardinality ratio, $O_{ij}$, for a pair of label cells with the observed cardinality ratio, $N_i/N_j$. $FM_{ij} \leftarrow FM_{ij} \cdot [1 - |O_{ij} - N_i/N_j| + C] = FM_{ij} \cdot [1 - |4.73 - 4.5| + 0.5] = FM_{ij} \cdot 1.27$
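The arithmetic on the factoring slides can be checked with a short Python sketch. The function names are hypothetical, the starting confidence of 1.0 is chosen only to expose the multiplier, and treating the expected ratio as a product of the cardinalities 1, 4.3, and 1.1 simply follows the figures quoted on the slides.

```python
import math


def expected_ratio(path_cardinalities):
    """Expected cardinality ratio: product of the cardinalities on a path."""
    return math.prod(path_cardinalities)


def factoring_update(fm_ij, o_ij, n_i, n_j, c=0.5):
    """Scale the factoring confidence by 1 - |expected - observed| + C."""
    observed = n_i / n_j
    return fm_ij * (1 - abs(o_ij - observed) + c)


# Slide values: expected [Name] per [Address] = 1 * 4.3 * 1.1,
# observed 9 name cells per 2 address cells, C = 0.5.
o_name_address = expected_ratio([1, 4.3, 1.1])
fm = factoring_update(1.0, o_name_address, n_i=9, n_j=2)
```

Because the observed ratio (4.5) is close to the expected one (4.73), the multiplier is above 1 and the confidence increases. A mismatch larger than 1 + C would drive the multiplier negative, a case the slides do not address.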

  20. Score Results (Collect Evidence → Apply Rules → Store Results) • Score the extracted record structure • Present the results to a human user for verification

  21. Score Results (Collect Evidence → Apply Rules → Store Results)

  22. Database (Collect Evidence → Apply Rules → Store Results) • Create SQL Insert statements to store table cell coordinates • INSERT INTO Person (Name) VALUES ('335,114,521,172') • INSERT INTO Person (Name) VALUES ('335,173,521,231') [Figure: database table with columns … Name Family … and sample entries 0123 0123 …]
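The storage step on this slide can be sketched in Python with the standard `sqlite3` module. The slides do not give the schema, so the one-column `Person` table created here is an assumption, and `store_name_cells` is a hypothetical helper. The coordinate strings are the two from the slide.

```python
import sqlite3


def store_name_cells(conn, rectangles):
    """Insert one Person row per cell, storing the cell coordinates as Name."""
    # Assumed schema: the slides show only the Name column being written.
    conn.execute("CREATE TABLE IF NOT EXISTS Person (Name TEXT)")
    conn.executemany(
        "INSERT INTO Person (Name) VALUES (?)",
        [(rect,) for rect in rectangles],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_name_cells(conn, ["335,114,521,172", "335,173,521,231"])
rows = [row[0] for row in conn.execute("SELECT Name FROM Person ORDER BY rowid")]
```

Parameterized statements are used here instead of string-built SQL so that the cell text cannot break the statement. The output described on the slides is the literal INSERT text.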

  23. Method • Input: XML Input File (Preprocessed Microfilm Image) and Genealogical Ontology • Algorithm: Collect Evidence → Apply Rules → Store Results • Output: SQL Insert Statements

  24. Measurements • 5–7 concept tables • Training set: 5 real-world tables • Test set: 15 real-world tables • Precision, recall, and accuracy of the cells written in the SQL statements
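Precision and recall over the cells written to the SQL statements could be computed as in the Python sketch below. The slides do not define how a cell counts as correct, so the representation used here (a set of (column, rectangle) pairs compared against a hand-labeled set) is an assumption. The Age rectangles are made-up illustration data, and accuracy is left out because the slides do not say what counts as a true negative.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for two collections of cell assignments."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical ground truth and system output as (column, rectangle) pairs.
expected = {
    ("Name", "335,114,521,172"),
    ("Name", "335,173,521,231"),
    ("Age", "522,114,600,172"),  # made-up rectangle
}
predicted = {
    ("Name", "335,114,521,172"),
    ("Name", "335,173,521,231"),
    ("Age", "522,173,600,231"),  # made-up rectangle, deliberately wrong
}
precision, recall = precision_recall(predicted, expected)
```

In this toy case two of the three predicted assignments are correct and two of the three expected assignments are found, so both measures come out to two thirds.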

  25. Contributions • Exploits the constraints of both a genealogical ontology and table geometry • Combines extracted features using correlation rules

  26. Delimitations • Tables of rows and columns • Genealogical domain • English-language documents • Tables that do not span multiple documents

  27. Artifacts • Application/demo in the Java programming language.

