
Computer Science Research for Family History and Genealogy



Presentation Transcript


  1. Computer Science Research for Family History and Genealogy • Computer Graphics, Vision, & Image Processing Laboratory • Neural Networks and Machine Learning Laboratory • Data Extraction and Integration Laboratory • Laboratory for Information, Collaboration, & Interaction Environments • Performance Evaluation Laboratory • Data and Software Engineering Laboratory • www.cs.byu.edu/familyhistory • David W. Embley • Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan • William A. Barrett

  2. The Problem • 2.5 million rolls of microfilm • Assuming 1000 images per roll • 2.5 billion images Is there a way to automatically extract this information?

  3. A (Possible) Solution Let a computer do the extraction work. • Input: Images of Microfilmed Records • Table Recognition (Heath Nielson) • Old-Text Recognition (Mike Rimer) • Handwriting Recognition (Luke Hutchison) • Record Extraction & Organization (Ken Tubbs) • Just-in-Time Browsing (Doug Kennard) • Visualization (Tom Finnigan) • Output: Organized Genealogical Information

  4. Zoning: General Overview • Find the lines in the document using the horizontal and vertical profiles of the image. • Apply a matched filter to the profiles to identify the line signatures. • Recursively divide the document into separate pieces, analyzing each piece for lines.
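The three zoning steps can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: it assumes a binary image stored as a list of rows (1 = ink), ruled lines exactly one pixel thick, and a hypothetical kernel `(-1, 2, -1)` and threshold fraction `frac`; all function names are invented for the example.

```python
KERNEL = (-1, 2, -1)  # assumed signature of a 1-pixel-thick ruled line

def profiles(img):
    """Horizontal and vertical projection profiles (ink count per row / column)."""
    rows = [sum(r) for r in img]
    cols = [sum(c) for c in zip(*img)]
    return rows, cols

def matched_filter(profile, kernel=KERNEL):
    """Correlate the profile with the kernel, zero-padding at the borders."""
    half = len(kernel) // 2
    out = []
    for i in range(len(profile)):
        acc = 0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):
                acc += w * profile[j]
        out.append(acc)
    return out

def find_lines(profile, span, frac=0.8):
    """Positions whose filter response is close to that of a line crossing the whole span."""
    threshold = frac * 2 * span  # 2 is the kernel's peak weight
    return [i for i, r in enumerate(matched_filter(profile)) if r >= threshold]

def zone(img, top=0, left=0, min_size=3):
    """Recursively split at the first interior line; return zones as (top, left, height, width)."""
    h, w = len(img), len(img[0])
    if h >= min_size and w >= min_size:
        rows, cols = profiles(img)
        for r in find_lines(rows, w):
            if 0 < r < h - 1:
                return (zone(img[:r], top, left, min_size)
                        + zone(img[r + 1:], top + r + 1, left, min_size))
        for c in find_lines(cols, h):
            if 0 < c < w - 1:
                left_part = [row[:c] for row in img]
                right_part = [row[c + 1:] for row in img]
                return (zone(left_part, top, left, min_size)
                        + zone(right_part, top, left + c + 1, min_size))
    return [(top, left, h, w)]

# A 7x7 page with one horizontal and one vertical ruled line through the middle.
page = [[1 if (y == 3 or x == 3) else 0 for x in range(7)] for y in range(7)]
zones = zone(page)
```

On the toy page the recursion yields four 3x3 cells, one per quadrant. Real scans would need thicker-line kernels and noise handling that this sketch leaves out.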

  5. Zone Classification: Machine vs. Handwriting • Machine-printed text is consistent/regular. • Handwriting is irregular.
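One way to turn "regular vs. irregular" into a number is the coefficient of variation of some per-glyph measurement. The sketch below is only an illustration of that idea: the choice of glyph heights as the feature, the function names, and the 0.15 threshold are all assumptions, not details from the slides.

```python
from statistics import mean, pstdev

def regularity(heights):
    """Coefficient of variation of glyph heights; smaller means more regular."""
    m = mean(heights)  # heights must be a non-empty list
    return pstdev(heights) / m if m else 0.0

def classify_zone(heights, threshold=0.15):
    """Label a zone from the heights (in pixels) of its connected components."""
    return "machine" if regularity(heights) < threshold else "handwriting"

printed = classify_zone([10, 10, 11, 10, 10])   # nearly uniform glyphs
written = classify_zone([6, 14, 9, 18, 7])      # widely varying glyphs
```

A practical classifier would likely combine several such features (stroke width, baseline drift, spacing) rather than rely on one.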

  6. Document Templates • Images are not ideal, which results in incorrect zoning and classification. • Form layout is the same across documents. • Features missed in one image are found in another. • Build a template of the document's form by using several documents. • This provides robustness and increases accuracy.
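The template idea can be illustrated with a simple vote across documents: a line position is kept only if enough documents agree on it. This is a hedged sketch under assumed inputs (one list of detected line positions per document); `build_template`, `tol`, and `min_votes` are hypothetical names and values.

```python
from statistics import median

def build_template(detections, tol=2, min_votes=2):
    """Merge per-document line positions into a template.

    detections: one list of line positions (pixels) per document.
    Positions no more than `tol` apart are clustered together; a cluster is
    kept when at least `min_votes` different documents contributed to it.
    """
    tagged = sorted((pos, doc) for doc, lines in enumerate(detections) for pos in lines)
    clusters, current = [], []
    for pos, doc in tagged:
        if current and pos - current[-1][0] > tol:
            clusters.append(current)
            current = []
        current.append((pos, doc))
    if current:
        clusters.append(current)
    return [median(p for p, _ in c)
            for c in clusters
            if len({doc for _, doc in c}) >= min_votes]

# Document 1 misses the line near 90; document 2 has a spurious line at 130.
template = build_template([[10, 50, 90], [11, 49], [9, 51, 91, 130]])
```

Here the line near 90 survives even though one document missed it, while the line at 130, seen in a single document, is discarded.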

  7. Document Templates

  8. Zoned Image

  9. Automated Text Recognition

  10. Word Segmentation

  11. Letter Segmentation

  12. Optical Character Recognition

  13. Handwriting Recognition

  14. Handwriting Recognition • The Task • Online handwriting recognition • The writer's pen movements are captured • Velocity, acceleration, stroke order are available • Offline handwriting recognition • Page was previously written and scanned • Only pixel color information available • Genealogical records are all offline • Offline is harder, but doable

  15. Handwriting Recognition • Can we just convert offline data into (simulated) online data? • Yes, although difficult to do reliably: • What order were the strokes written in? • Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins? • Inferring online data (e.g. stroke ordering) could be crucial to success • Demonstrated to be solvable with reasonable reliability

  16. Handwriting Recognition • An example of some steps in the analysis process: • Contour extraction • Midline determination • Stroke ordering

  17. Handwriting Recognition • An example of some steps in the recognition process: • Handwriting style clustering • Letter recognition (e.g. "nr" or "m"?) • Approximate string matching (e.g. Smith vs. Smythe)
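Approximate string matching is commonly done with an edit distance. The slides do not say which measure is used, so the sketch below assumes plain Levenshtein distance; `edit_distance`, `matches`, and the `max_dist` cutoff are illustrative names and values.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def matches(name, lexicon, max_dist=2):
    """Lexicon entries within max_dist edits of the recognized name, nearest first."""
    scored = sorted((edit_distance(name, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_dist]

d = edit_distance("Smith", "Smythe")          # one substitution plus one insertion
near = matches("Smith", ["Smythe", "Jones"])
```

With this measure "Smith" and "Smythe" are two edits apart, so a cutoff of 2 treats them as candidate matches while rejecting "Jones".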

  18. Automatic Record Extraction

  19. Extraction Algorithm • Identify the Geometric Structure • Identify the Type of Information • Identify the Attribute-Value pairs • Identify the Record Boundaries

  20. Column-Row Recognition

  21. Genealogical Ontology

  22. Match Labels • "ROAD, STREET, &c., And No. or NAME of HOUSE" → Location

  23. Match Labels • "NAME and Surname of each Person" → Full Name

  24. Match Labels • "RELATION to Head of Family" → Relationship
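Label matching of this kind can be approximated by scoring each column header against keywords attached to ontology concepts. The keyword sets below are invented for illustration and are not taken from the genealogical ontology in the slides; `match_label` is a hypothetical helper.

```python
import re

# Hypothetical keyword sets per ontology concept (assumed for this sketch).
CONCEPT_KEYWORDS = {
    "Location": {"road", "street", "house"},
    "Full Name": {"name", "surname"},
    "Relationship": {"relation"},
}

def match_label(header):
    """Return the concept sharing the most keywords with the header, or None."""
    tokens = set(re.findall(r"[a-z]+", header.lower()))
    scores = {c: len(tokens & kws) for c, kws in CONCEPT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

headers = [
    "ROAD, STREET, &c., And No. or NAME of HOUSE",
    "NAME and Surname of each Person",
    "RELATION to Head of Family",
]
concepts = [match_label(h) for h in headers]
```

Counting overlaps rather than testing for a single keyword matters here: the first header also contains "NAME", but it shares three keywords with Location and only one with Full Name.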

  25. Extract Records • Location: Collafer

  26. Extract Records • Location: Collafer • Full Name: John Eyres • Relationship: Head

  27. Extract Records • Location: Collafer • Full Name: Annie Eyres • Relationship: Wife

  28. Extract Records • Location: Collafer • Full Name: Lehailes Eyres • Relationship: Son
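Once columns carry concept labels, each table row can be turned into an attribute-value record. The sketch below assumes that the location is written once for a household and left blank on the following rows, so blank cells in that column inherit the value above; that layout, the `carry` parameter, and `extract_records` itself are assumptions made for the example.

```python
def extract_records(columns, rows, carry=("Location",)):
    """Build one dict per row; blank cells in `carry` columns inherit the previous value."""
    records, last = [], {}
    for row in rows:
        record = {}
        for col, cell in zip(columns, row):
            value = cell.strip()
            if not value and col in carry:
                value = last.get(col, "")
            record[col] = value
        last = record
        records.append(record)
    return records

columns = ["Location", "Full Name", "Relationship"]
rows = [
    ["Collafer", "John Eyres", "Head"],
    ["", "Annie Eyres", "Wife"],
    ["", "Lehailes Eyres", "Son"],
]
records = extract_records(columns, rows)
```

Each resulting record then holds the location, full name, and relationship for one person, matching the three records shown on the slides.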

  29. Web Query John Eyres

  30. Search Results

  31. Online Digital Microfilm: Problem • Many of the images we are interested in are quite large. • 6048 x 4287 pixels
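To see why this size is a problem, it helps to work out the raw data volume. The slide gives only the pixel dimensions; the 8 bits per pixel and the 56 kbit/s link speed below are assumptions chosen for illustration.

```python
width, height = 6048, 4287
pixels = width * height              # 25,927,776 pixels

bits_per_pixel = 8                   # assumption: 8-bit grayscale, uncompressed
raw_bytes = pixels * bits_per_pixel // 8

def transfer_seconds(n_bytes, bits_per_second):
    """Idealized transfer time, ignoring protocol overhead and compression."""
    return n_bytes * 8 / bits_per_second

modem_seconds = transfer_seconds(raw_bytes, 56_000)  # assumed 56 kbit/s dial-up link
```

Under these assumptions a single uncompressed image is about 26 MB and would take roughly an hour to transfer over a dial-up link, which motivates sending a usable approximation first.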

  32. What is Just-In-Time Browsing? A method of quickly browsing digital images over the Internet which capitalizes on: • Progressive Image Transmission: • Hierarchical Spatial Resolution • Progressive Bitplane Encoding • JBIG Compressed Bitplanes • Prioritized Regions of Interest • User Interaction

  33. Hierarchical PIT (Progressive Image Transmission) • Sequential Transmission
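Hierarchical spatial resolution can be illustrated with an image pyramid sent coarsest level first. This is a simplified sketch: it assumes a grayscale image as a list of rows with dimensions divisible by two at every level, uses 2x2 averaging (an assumed choice), and resends whole levels, whereas a real progressive coder would transmit only the refinement data.

```python
def downsample(img):
    """Halve each dimension by averaging 2x2 blocks (assumes even height and width)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def pyramid(img, levels):
    """Return the image at successively finer resolutions, coarsest first."""
    out = [img]
    for _ in range(levels):
        out.append(downsample(out[-1]))
    return out[::-1]

image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 12, 12],
    [4, 4, 12, 12],
]
stages = pyramid(image, 2)   # 1x1, then 2x2, then the full 4x4 image
```

The viewer can display each stage as it arrives, so the user sees a rough overview long before the full-resolution data has been received.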

  34. PIT Using Bitplane Method • 1 bitplane (2 levels of gray) • 2 bitplanes (4 levels of gray) • 3 bitplanes (8 levels of gray) • 4 bitplanes (16 levels of gray)
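The relationship between bitplanes and gray levels follows from keeping only the most significant bits of each pixel: n bitplanes allow 2^n distinct values. The sketch below assumes 8-bit grayscale pixels; `top_bitplanes` and `gray_levels` are illustrative names.

```python
def gray_levels(n):
    """Number of distinct gray values representable with n bitplanes."""
    return 2 ** n

def top_bitplanes(img, n, depth=8):
    """Keep the n most significant bitplanes of each pixel, zeroing the rest."""
    shift = depth - n
    return [[(p >> shift) << shift for p in row] for row in img]

levels = [gray_levels(n) for n in (1, 2, 3, 4)]
one_plane = top_bitplanes([[200, 100]], 1)    # pure black/white approximation
two_planes = top_bitplanes([[200, 100]], 2)   # four possible gray values
```

Sending the most significant plane first gives a two-level image immediately, and each later plane doubles the number of gray levels available to the viewer.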

  35. Digital Microfilm Browser

  36. PAF – 5 Generation Pedigree

  37. PAF – 5 Generation Pedigree

  38. Gena: A 3D Genealogy Visualizer

  39. Concluding Remarks • Workshop: April 4, 2002 at BYU • www.cs.byu.edu/familyhistory

  40. Appendix • Categorized List of BYU Faculty Interests in Computer Science Research Topics that Support Technology for Family History and Genealogy

  41. Extraction from Digitized Images • Scanning (Flanagan) • Segmentation & Table Recognition (Barrett, Martinez) • OCR for Old Type-Set Text (Martinez) • Element Classification & Record Construction (Embley, Barrett, Martinez) • Handwriting Recognition (Sederberg) • Recognition of Hand-printed Text (Olsen, Barrett, Martinez)

  42. Extraction from Digital Data Sources • Automatic Extraction from Semi-structured and Unstructured Sources (Embley, Martinez) • Mappings from Heterogeneous Structured Source Views to Target Views (Embley) • Individualized Source Views (Woodfield)

  43. Information Integration • Definition of Ontological Expectations (Embley, Woodfield) • Value Normalization (Woodfield) • Object Identity & Data Merging (Embley, Sederberg) • Managing Uncertainty (Embley, Woodfield, Martinez)

  44. Systems for Family History and Genealogy • Storage of Large Volumes of Data (Flanagan) • Distributed Storage (Woodfield) • Indexing Original Documents (Martinez, Embley) • Human-Computer Interaction (Olsen) • Just-in-Time Browsing (Barrett, Olsen) • Workflow for Directing Genealogical Work (Woodfield, Martinez, Embley) • Notification Systems (Woodfield) • Visualization (Sederberg)
