OCR and the Welsh Language

Lyn Léwis Dafis and Martin Locock, 2007-07-20


Presentation Transcript


  1. OCR and the Welsh Language Lyn Léwis Dafis and Martin Locock 2007-07-20

  2. Welsh orthography • Digraphs and diacritics • Scanning for OCR • Books From The Past • Welsh Journals Online

  3. Welsh orthography • Old Welsh 7th century • Poets of the nobility • 1588/1620 Bible • William Owen Pughe, Welsh and English dictionary 1803 • Welsh orthography 1893 • Orgraff yr iaith Gymraeg 1928

  4. Digraphs and diacritics • Digraphs: ch dd ff ng ll ph rh th • Diacritics: Â Ê Î Ô Û Ŵ Ŷ â ê î ô û ŵ ŷ; Ä Ë Ï Ö Ü Ẅ Ÿ ä ë ï ö ü ẅ ÿ; À È Ì Ò Ù Ẁ Ỳ à è ì ò ù ẁ ỳ; Á É Í Ó Ú Ẃ Ý á é í ó ú ẃ ý

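
The character inventory on the slide above can be reproduced programmatically, which may help when checking whether an OCR engine or font covers every Welsh accented vowel. The following is a minimal Python sketch, not something taken from the presentation: the function names `accented_vowels` and `split_letters` are hypothetical, and the greedy digraph splitting is an assumption that will be wrong for some words (for example, "ng" in "Bangor" is n followed by g, not the single letter ng).

```python
import unicodedata

# Welsh treats w and y as vowels, so all seven can carry a diacritic.
VOWELS = "aeiouwy"

# Combining marks for the four diacritics listed on the slide.
MARKS = {
    "circumflex": "\u0302",
    "diaeresis": "\u0308",
    "grave": "\u0300",
    "acute": "\u0301",
}

# The eight digraphs listed on the slide.
DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")


def accented_vowels():
    """Return every vowel/diacritic combination as a precomposed (NFC) character."""
    result = []
    for mark in MARKS.values():
        for base in VOWELS.upper() + VOWELS:
            result.append(unicodedata.normalize("NFC", base + mark))
    return result


def split_letters(word):
    """Greedily split a word into Welsh letters, treating digraphs as one letter.

    This is only an approximation: it cannot tell a true digraph from two
    adjacent single letters.
    """
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


if __name__ == "__main__":
    print(" ".join(accented_vowels()))
    print(split_letters("llwybr"))
```

All 56 combinations have precomposed Unicode code points, so NFC normalisation yields one character each; OCR output that arrives as base letter plus combining mark can be normalised the same way before comparison.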

  6. Scanning for OCR • Over 300 dpi • Grayscale and not bi-tonal

  7. Books from the Past • 19th-early 20th C printed texts (novels, sermons, poetry) • Wanted clean TEI text to accompany scans • Required OCR contractor to identify diacritics • TEI cleaned by hand • Most diacritics incorrect • Need Welsh dictionary to check % words identified
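
The last bullet above, checking the percentage of words identified against a Welsh dictionary, could be sketched roughly as follows. This is an illustrative Python example under stated assumptions, not the project's actual tooling: `tokenize` and `recognised_ratio` are hypothetical names, the word list is a tiny made-up sample standing in for a real dictionary, and the tokeniser simply splits on non-letters (so apostrophised forms such as "i'r" are split in two).

```python
import re
import unicodedata


def tokenize(text):
    """Lower-case the text and return its runs of letters.

    NFC normalisation comes first so that a vowel followed by a combining
    mark is treated as one letter rather than a letter plus a non-letter.
    """
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)


def recognised_ratio(text, wordlist):
    """Return the fraction of tokens in `text` that appear in `wordlist`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in wordlist)
    return known / len(tokens)


if __name__ == "__main__":
    # Hypothetical sample dictionary; a real check would load a full word list.
    sample_words = {"y", "yn", "dŵr", "tŷ"}
    # "dwr" lacks its circumflex, so it is not found in the sample dictionary.
    print(recognised_ratio("Y dwr yn y tŷ", sample_words))
```

A low ratio on a page would flag it for manual review; a word whose diacritic was dropped by OCR counts as unrecognised unless its unaccented spelling also happens to be a dictionary word.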

  8. Is it essential? • Searching • Ignore? Impractical • Could be key in the future • Stop words • Display of TEI • dy dy ‘your your’; dy dŷ ‘your house’ • ei dwr | ei dwr; ei thwr | ei thwr; ei dŵr | ei dwr; ei thŵr | ei thwr

  9. Normalise to unaccented form > TEI is incorrect; display is OK • Need to handle searching • Or identify where possible • Silent fuzzy searching (lookup table) OR ‘Did you mean’ • Welsh Journals Online: 20th C texts; scan is main presentation format; 40% Welsh; English and Welsh user interfaces; content tagged for language
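
One way to picture the "normalise to unaccented form" and "lookup table" options is the Python sketch below. It is an assumption-laden illustration rather than the approach the project actually implemented: `strip_diacritics`, `build_lookup`, and `search_variants` are hypothetical names, and the indexed words are a made-up sample.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Return `text` with combining marks removed (for example, dŵr -> dwr)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_lookup(words):
    """Map each unaccented key to the set of spellings seen in the index."""
    table = defaultdict(set)
    for word in words:
        spelling = unicodedata.normalize("NFC", word).lower()
        table[strip_diacritics(spelling)].add(spelling)
    return table


def search_variants(query, table):
    """Return every indexed spelling that matches `query` once accents are ignored.

    If nothing in the table matches, the normalised query itself is returned.
    """
    spelling = unicodedata.normalize("NFC", query).lower()
    key = strip_diacritics(spelling)
    return sorted(table.get(key, {spelling}))


if __name__ == "__main__":
    table = build_lookup(["dŵr", "dwr", "tŷ", "thŵr"])
    # Silent fuzzy search: expand the query to all variants before searching.
    print(search_variants("dwr", table))
    # 'Did you mean': offer the accented form when the typed one is absent.
    print(search_variants("ty", table))
```

For silent fuzzy searching, the returned variants would all be searched without telling the user; for a "Did you mean" prompt, variants that differ from the typed query would be offered as suggestions instead.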

  10. Further information • Lyn Léwis Dafis, Pennaeth yr uned ddigido a metadata / Head of the digitisation and metadata unit, lvd@llgc.org.uk • Martin Locock, Rheolwr Prosiect Cylchgronau Cymru / Welsh Journals Online project manager, martin.locock@llgc.org.uk
