OCR and SALIX Parsing

OCR and SALIX Parsing. Daryl Lafferty Arizona State University October, 2012. SALIX: Semi-Automatic Label Information eXtraction. SALIX was developed at Arizona State University from 2009 through 2012. Over 55,000 ASU Herbarium specimen labels were digitized using SALIX.


Daryl Lafferty

Arizona State University

October, 2012


SALIX: Semi-Automatic Label Information eXtraction

SALIX was developed at Arizona State University from 2009 through 2012.

Over 55,000 ASU Herbarium specimen labels were digitized using SALIX.


Ideal SALIX Process Flow

The ideal process flow is:

Photograph the specimen label

Perform OCR on the photograph

Have SALIX parse the resulting text into database categories

Upload the results to the database


Practical SALIX Process Flow

The actual process flow has added steps:

Photograph the specimen label

Perform OCR on the photograph

Correct any OCR errors and adjust the text layout

Have SALIX parse the resulting text into database categories

Correct any mis-parsed results

Upload the results to the database


OCR Workflow

We use ABBYY Professional Version 10.

We capture an image of the full specimen, and another of just the label for OCR.

Processing is done in batch mode, usually run overnight on a folder containing hundreds of images.

The result is a single text file with one label per page.

OCR errors are corrected in the text file before processing with SALIX.
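
As a rough illustration of working with a "one label per page" text file, the sketch below splits a batch OCR export into one text block per label. It assumes pages are separated by a form-feed character, which may not match the actual ABBYY export settings; the function names and the file path passed to `load_labels` are hypothetical and not part of SALIX.

```python
from pathlib import Path

PAGE_BREAK = "\f"  # assumption: the OCR export separates pages with a form feed


def split_labels(ocr_text: str, page_break: str = PAGE_BREAK) -> list[str]:
    """Return one cleaned text block per label, skipping empty pages."""
    labels = []
    for page in ocr_text.split(page_break):
        # Trim trailing whitespace on every line, then drop blank edges.
        lines = [line.rstrip() for line in page.splitlines()]
        text = "\n".join(lines).strip()
        if text:
            labels.append(text)
    return labels


def load_labels(path: str) -> list[str]:
    """Read a batch OCR file (hypothetical path) and split it into labels."""
    return split_labels(Path(path).read_text(encoding="utf-8"))
```

Splitting first makes it easier to review and correct each label's text on its own before it is handed to the parser.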


The SALIX User Interface


Manual Data Entry


A label that results in many OCR errors


A label that results in few OCR errors


Label Length and Quality

  • We first categorized 4 different label types, with the following average characteristics:

  • We then had 3 students each process 10 labels of each category (40 labels in total per student), both through SALIX and by typing directly into the Symbiota form.
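
One way to turn timing records from such a trial into throughput figures is sketched below. The record layout, the category name "A", and all numbers are hypothetical placeholders for illustration only; they are not data from the study.

```python
from collections import defaultdict

# Hypothetical timing records: (student, category, method, labels, minutes).
# The values are placeholders for illustration, not data from the study.
records = [
    ("student_1", "A", "salix", 10, 30.0),
    ("student_1", "A", "typing", 10, 40.0),
    ("student_2", "A", "salix", 10, 20.0),
    ("student_2", "A", "typing", 10, 60.0),
]


def labels_per_hour(records):
    """Pool labels and minutes per (category, method), return labels/hour."""
    totals = defaultdict(lambda: [0, 0.0])
    for _student, category, method, labels, minutes in records:
        totals[(category, method)][0] += labels
        totals[(category, method)][1] += minutes
    return {key: 60.0 * n / m for key, (n, m) in totals.items() if m > 0}
```

Pooling labels and minutes across students before dividing weights each student by the time actually spent, rather than averaging per-student rates.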


Sample Throughput Data


Conclusions

OCR quality has a strong effect on semi-automated parsing throughput using SALIX.

OCR using ABBYY in Batch Mode was most efficient for our workflow.

The relationship is roughly:

where

S = Ratio of SALIX Throughput/Typing Throughput

and

E = OCR Error rate stated as OCR Errors per 100 words

(The relationship does not hold as E approaches zero, i.e. below about 2 errors per 100 words.)
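
The two quantities defined above can be computed along the following lines. This is only a sketch: it assumes an OCR error is a word that differs between the raw OCR text and the hand-corrected text, which may not be how errors were counted in the original study, and the function names are hypothetical.

```python
import difflib


def ocr_errors_per_100_words(raw_text: str, corrected_text: str) -> float:
    """Estimate E by diffing raw OCR words against the corrected words."""
    raw, fixed = raw_text.split(), corrected_text.split()
    if not fixed:
        raise ValueError("corrected text has no words")
    errors = 0
    matcher = difflib.SequenceMatcher(a=raw, b=fixed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side so insertions and deletions both register.
            errors += max(i2 - i1, j2 - j1)
    return 100.0 * errors / len(fixed)


def throughput_ratio(salix_rate: float, typing_rate: float) -> float:
    """S = SALIX throughput / typing throughput (same units for both)."""
    if typing_rate <= 0:
        raise ValueError("typing throughput must be positive")
    return salix_rate / typing_rate
```

For example, a five-word label with one misread word gives E = 20 errors per 100 words under this counting rule, and a SALIX rate twice the typing rate gives S = 2.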


Acknowledgements

All of the data presented here were taken from Anne Barber's Master's thesis, completed at ASU in May 2012.

Anne also developed the process flow that helped optimize SALIX throughput.

The overall project was under the direction of Les Landrum, curator of the ASU Herbarium.

