
Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection

Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection. Bruce Robertson, Mount Allison University. ἀλήθεια ‘truth’, Ἀλήθεια. ‘Breathing’ marks on vowels at the beginning of a word. Accents possible on all vowels. Diversity of Greek fonts in the 19th C. Other examples.


Presentation Transcript


  1. Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection • Bruce Robertson, Mount Allison University

  2. ἀλήθεια ‘truth’, Ἀλήθεια • ‘Breathing’ marks on vowels at the beginning of a word • Accents possible on all vowels
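
To make the point about breathings and accents concrete, here is a minimal sketch that uses Python's standard `unicodedata` module to split each character of ἀλήθεια into its base letter and combining marks. The helper name `describe` is only illustrative and is not part of Gamera or of the talk.

```python
import unicodedata

def describe(word):
    """Return a (base letter, [combining mark names]) pair for each character."""
    result = []
    # Normalize to NFC first so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", word):
        # NFD then splits that code point into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        marks = [unicodedata.name(m) for m in decomposed[1:]]
        result.append((base, marks))
    return result

lower = describe("ἀλήθεια")
upper = describe("Ἀλήθεια")

for base, marks in lower:
    print(base, marks)
```

The smooth breathing on the initial alpha appears as a combining comma above, and the accent on the eta as a combining acute. An OCR classifier has to decide whether to treat such combinations as one composite glyph or as separate pieces.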

  3. Diversity of Greek Fonts in 19th C.

  4. Other Examples

  5. Greek OCR With Gamera • Dalitz and Brandt provide an experimental framework • I added splitting, grouping, sql output, etc. • Teams of undergraduates making multiple classifiers • Based on families of fonts • Comparing strategies of composite characters, splitting, etc. • Must also train for Latin scripts used • Not yet working on post-processing

  6. Good Results

  7. Systematic Approach to Automated Greek OCR • Remove the curator from the loop – especially important for journals, monographs, etc. • Assign classifier by computational means • Using: • Federico Boschetti’s ground-truth-less Greek text evaluator • Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network

  8. Process • 160 Greek-heavy texts chosen • Of these, random samples of 10 pages were taken • Each was processed with each of the 20 classifiers made this summer • The results were evaluated and given a ‘Boschetti score’ from 0 to 1
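
The selection process on this slide can be sketched roughly as follows. Everything here is an assumption for illustration: `sample_pages`, `best_classifier` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in that measures the share of Greek characters. It is not Boschetti's evaluator, whose method the slides do not describe.

```python
import random
from statistics import mean

def sample_pages(page_ids, k=10, seed=0):
    """Draw a random sample of up to k pages from one text."""
    rng = random.Random(seed)
    return rng.sample(page_ids, min(k, len(page_ids)))

def toy_score(text):
    """Placeholder score in [0, 1]: share of non-space characters that are Greek.
    This is NOT the Boschetti score, just a stand-in so the sketch runs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    greek = sum(1 for c in chars
                if "\u0370" <= c <= "\u03ff" or "\u1f00" <= c <= "\u1fff")
    return greek / len(chars)

def best_classifier(pages, classifiers, score_page):
    """Run every classifier on every sampled page and pick the best mean score.

    classifiers maps a name to a callable(page) -> recognized text;
    score_page maps recognized text to a float in [0, 1].
    pages must be non-empty.
    """
    averages = {
        name: mean(score_page(ocr(page)) for page in pages)
        for name, ocr in classifiers.items()
    }
    best = max(averages, key=averages.get)
    return best, averages
```

With 160 texts, 10 pages each and 20 classifiers, this amounts to 32,000 independent OCR runs, which is why a parallel computing network is a natural fit.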

  9. Google/ABBYY Line Splitting

  10. Gamera’s Text Line Finding (bbox_merging)

  11. Replaced with runlength_smearing
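
For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of classic run-length smearing on a binary image (1 = black, 0 = white). It illustrates the general idea only and is not Gamera's `runlength_smearing` implementation, which may differ in details. The function names and thresholds are assumptions.

```python
def smear_row(row, threshold):
    """Fill white runs of length <= threshold that lie between two black pixels."""
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel == 1:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def smear_horizontal(image, threshold):
    """Apply the smearing to every row."""
    return [smear_row(row, threshold) for row in image]

def smear_vertical(image, threshold):
    """Apply the smearing to every column by transposing twice."""
    columns = [list(col) for col in zip(*image)]
    smeared = [smear_row(col, threshold) for col in columns]
    return [list(row) for row in zip(*smeared)]

def runlength_smear(image, h_threshold, v_threshold):
    """Combine horizontal and vertical smearing with a pixel-wise AND."""
    horizontal = smear_horizontal(image, h_threshold)
    vertical = smear_vertical(image, v_threshold)
    return [[a & b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]

print(smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 2))
```

Smearing joins letters and their diacritics into solid blocks, so connected regions in the result approximate text lines. This may be one reason it suits heavily accented Greek better than merging bounding boxes.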

  12. Two-step processing

  13. Future Work • Combining and re-optimizing classifiers? • Assign classifier based on Latin text • Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output? • Align with Google’s output, and provide Google with corrected Greek • Implement line-splitting from other OCR engines • Discover badly OCR’d Greek in others’ output • Implement OCR correction frameworks described here
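
The idea of assigning a classifier from the Latin-script text could look something like the sketch below. It is purely a hypothetical illustration: the classifier names `"oxford"` and `"generic"` and both function names are invented here, and the talk lists this only as future work.

```python
# Imprint words named on the slide; matching is case-insensitive.
IMPRINT_WORDS = ("Oxford", "Clarendon", "Oxonii")

def looks_like_oxford_imprint(first_pages_text):
    """Return True if any imprint word occurs in the OCR output of the first pages."""
    lowered = first_pages_text.lower()
    return any(word.lower() in lowered for word in IMPRINT_WORDS)

def choose_classifier(first_pages_text, oxford_classifier="oxford", default="generic"):
    """Pick a (hypothetical) classifier name based on the imprint."""
    if looks_like_oxford_imprint(first_pages_text):
        return oxford_classifier
    return default

print(choose_classifier("OXONII: E TYPOGRAPHEO CLARENDONIANO"))
```

A plain substring test tolerates some noise in the Latin-script OCR, but a real system would presumably need fuzzier matching and a longer list of publishers.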

  14. Common Problems • Assessments of pre-processing strategies and tools • Schemas for page description

  15. Thanks • Colleagues in Dynamic Variorum Editions: • Greg Crane at Perseus / Tufts • Brian Fuchs at Imperial College • Federico Boschetti • AceNet, especially tech. support of Sergiy Khan
