Searching in text databases with non-standard orthography

Thomas Pilz, University of Duisburg-Essen • Digital Historical Corpora, December 3-8, 2006

Overview • Introduction • Processing historical documents • Text recognition • Performing search operations • Distance measures • Example of application • Outlook

Origins of RSNSR • Search engine: a web-based system for assisted literature research • Project Nietzsche-CD: partial text recognition of German Fraktur documents

Introduction • „Dei tropische Emdiuck des Gaitens wird ganz besonders dmch die Bananen Casuaunen Fiedei- und Fachei-Palmen sowie die tropischen Ficus-aiten heivoigeiufen, unter denen mir die voi dei indische Ficus benghalensis, die hinteimdische ...“ Altneuland, 1904 (Compact Memory) • „Am abgewichenen Sonnabend haben Jhro Königl. Hoheit, die Prinzeßin Amalia, an das Königliche Haus, und viele hohen Standespersohnen (...) und die Printzeßin Amalia Königl. Hoheiten, bey der Printzeßin ...“ Berlinische priv. Ztg., 1748 (Bibliotheca Augustana)

An ideal scenario • Digital HQ scan • Image preprocessing • Optical character recognition • High-quality document • Manual correction

Our solution for the „real world“ • Digital HQ scan • LQ analog copies • Image preprocessing • Optical character recognition • Document with recognition errors and non-standard spellings • Manual correction • High-quality documents • Search engine

Typical OCR problems • 75 common recognition errors • > 90% context free
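Because most of these errors are context free, they can be pictured as plain character substitutions applied to a search term. The following is a minimal Python sketch under that reading, not the system's actual error list: the single confusion pair (r read as i) is an assumption taken from the sample excerpts, where „Der“ appears as „Dei“, and the names `CONFUSIONS` and `ocr_variants` are hypothetical.

```python
from itertools import product

# Hypothetical context-free confusion table: correct character -> misread character.
# The one pair below is an assumption read off the sample excerpts
# ("Der" appears as "Dei"); the 75 errors named on the slide are not reproduced.
CONFUSIONS = {"r": "i"}


def ocr_variants(word, confusions=CONFUSIONS):
    """Return every spelling obtained by optionally applying a confusion
    at each position where one matches (single-character keys only)."""
    options = []
    for ch in word:
        if ch in confusions:
            options.append((ch, confusions[ch]))  # keep the character or corrupt it
        else:
            options.append((ch,))
    return {"".join(combo) for combo in product(*options)}
```

Expanding „Arbeiter“ this way yields, among others, the misrecognized form „Aibeitei“, so a search for the correct spelling could also match the faulty text layer. The number of variants doubles with every matching position, so a real table would need pruning.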

Retrieval on black letter texts • Localization (vertical bar patterns) • Preclassification • Recognition: optical (NN/SVM), context-based, additional modules • Combination • Analysis (+ restoration) • Example: query „Moral“ against partially recognized candidates Mo[x]ai, [W]oral, Ma[x][x]l

Querying a historical document • Interested user: query „emperor“ • Search engine: empriour, emparour, emperour • Text archive: „... as emperour ruleth ouer many kings ...“

Variant spellings • ~45% diachronic variation • ~15% synchronic variation

From text to retrieval • Diploma thesis • Categorization • Computer-assisted collection of spelling word-pairs • Training • Evaluation and enhancement • Diploma theses • Categorized distance measures • Document retrieval • Master theses

Training database • 106 texts from 1293 to 1919 • author, location, time, language • 12,819 semi-automatically collected lemma – variant word-pairs • text reference, category


Measuring linguistic distance • Commonly used in dialectometry (Heeringa et al. '06) • Java framework SVBase • Working with evidence of spelling variation • Dynamic programming • Java datatype FlexMetric (Kempken et al. '06) • Working with arbitrary distance measures
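The combination of dynamic programming and arbitrary distance measures can be illustrated with an edit distance whose costs are supplied by the caller. This is a hedged Python sketch, not the SVBase or FlexMetric code (which is Java and not shown in the slides); the function names and the u/v cost of 0.1 are assumptions chosen for illustration.

```python
def edit_distance(source, target, sub_cost, indel_cost):
    """Dynamic-programming edit distance with caller-supplied costs.

    sub_cost(a, b) -> cost of replacing a with b (0 expected when a == b)
    indel_cost(c)  -> cost of inserting or deleting character c
    """
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + indel_cost(source[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + indel_cost(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(source[i - 1]),  # delete from source
                d[i][j - 1] + indel_cost(target[j - 1]),  # insert into source
                d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return d[-1][-1]


def unit_sub(a, b):
    """Classic Levenshtein substitution cost."""
    return 0.0 if a == b else 1.0


def unit_indel(c):
    """Classic Levenshtein insertion/deletion cost."""
    return 1.0


def historical_sub(a, b):
    """Hypothetical measure: u and v are nearly interchangeable,
    as in "ouer" for "over"; the value 0.1 is an arbitrary assumption."""
    if a == b:
        return 0.0
    if {a, b} == {"u", "v"}:
        return 0.1
    return 1.0
```

With unit costs, „over“ and „ouer“ are one full edit apart; with the hypothetical historical measure the same pair costs only 0.1, while unrelated substitutions remain expensive. Swapping the cost functions changes the measure without touching the dynamic program.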

Stochastic distance measure • Learning string edit distance: trainable • Uses the EM algorithm • Table of edit costs: |alphabet| × |alphabet| • Example words: tot, toth, rot
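How training can shape the cost table is sketched below in Python. This is deliberately not the EM procedure named on the slide: it is a simplified, single-pass stand-in that aligns each training pair with unit costs, counts the resulting edit operations, and converts relative frequencies into costs by negative logarithm. The function names, the three training pairs, and the smoothing value for unseen operations are all assumptions made for illustration.

```python
import math
from collections import Counter

EPS = ""  # empty symbol marking an insertion or deletion


def align(a, b):
    """Unit-cost alignment of a and b, returned as a list of (x, y) operations.
    (x, x) is a match, (x, y) a substitution, (x, EPS) a deletion, (EPS, y) an insertion."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((a[i - 1], EPS))
            i -= 1
        else:
            ops.append((EPS, b[j - 1]))
            j -= 1
    return ops[::-1]


def train_costs(pairs, unseen_count=0.5):
    """Count operations over all aligned pairs; cost = -log(relative frequency).
    Unseen operations get the cost of a pseudo-count (an assumed smoothing choice)."""
    counts = Counter(op for lemma, variant in pairs for op in align(lemma, variant))
    total = sum(counts.values())
    costs = {op: -math.log(n / total) for op, n in counts.items()}
    unseen = -math.log(unseen_count / total)
    return costs, unseen


def distance(a, b, costs, unseen):
    """Sum the learned costs along the unit-cost alignment (a simplification:
    a full implementation would search for the cheapest alignment under the learned costs)."""
    return sum(costs.get(op, unseen) for op in align(a, b))


# Illustrative lemma/variant pairs with a historical "th" spelling (assumed data).
PAIRS = [("rat", "rath"), ("mut", "muth"), ("not", "noth")]
COSTS, UNSEEN = train_costs(PAIRS)
```

After training on pairs that insert an h, `distance("tot", "toth", COSTS, UNSEEN)` comes out smaller than `distance("tot", "rot", COSTS, UNSEEN)`, even though both pairs are one unit-cost edit apart. The learned table holds one entry per observed pair of symbols, which is where the |alphabet| × |alphabet| size comes from.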

Modular distance measures • Modules: OCR, 1700-1800, 1800-1900 • Search engine • Text archive • „Search for spellings with recognition errors and typical for the 18th century.“

Categories for distance measures • Chronological classification: 1250-1350, 1350-1450, 1450-1650, 1650-1900 • Diatopic classification: Upper German, Central German, Lower German • OCR class
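Selecting measures by category could look like the following Python sketch. It only picks the applicable classes for a request; how the selected measures are then combined is not stated in the slides and is left out. The names `chronological_class` and `select_modules` are hypothetical, and treating the class boundaries as half-open (a boundary year goes to the later class) is an assumption, since the slide lists overlapping end points.

```python
# Classes as listed on the slide.
PERIODS = [(1250, 1350), (1350, 1450), (1450, 1650), (1650, 1900)]
REGIONS = ("Upper German", "Central German", "Lower German")


def chronological_class(year):
    """Return the (start, end) class containing year, or None if none applies.
    Assumption: intervals are half-open, except that the final year 1900 is included."""
    for start, end in PERIODS:
        if start <= year < end:
            return (start, end)
    if year == PERIODS[-1][1]:
        return PERIODS[-1]
    return None


def select_modules(year=None, region=None, ocr=False):
    """Build the list of (kind, value) keys naming the distance measures to apply."""
    modules = []
    if year is not None:
        period = chronological_class(year)
        if period is not None:
            modules.append(("period", period))
    if region is not None:
        if region not in REGIONS:
            raise ValueError(f"unknown region: {region}")
        modules.append(("region", region))
    if ocr:
        modules.append(("ocr", None))
    return modules
```

A request for mid-18th-century spellings in a text layer with recognition errors, `select_modules(year=1748, ocr=True)`, selects the 1650-1900 class and the OCR class. Years outside every class, such as 1919, select no chronological measure in this sketch.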

Further applications • Document classification (Benedetto et al. '03) • Language comparison (Archer et al. '06) • Understanding of language evolution

Example: an existing interface www.compactmemory.de

Synoptic view • Example term: Aibeitei • Facsimile (graphic) • Searchable text layer with errors: „Dei Bau \uid von deutschen Ingenieuren ausgefuhit Als Aibeitei weiden, wie bei dei Hed-schasbahn, Landesbewohnei, vorliegend Soldaten, jedenfalls keine ein opaischen Arbeiter vei\\endet 4) Für den 1. Apul“

Keeping the user unaware • Components: text archive (XML), search engine, facsimiles as interface (JPG, GIF, ...), interested user • Recognition errors are hidden • Historical spellings are conserved but managed • User „feels“ like searching on the pictures

Our goals • We want to: provide a module for arbitrary text databases; allow better retrieval results; keep priority of recall over precision • We do not want to: correct recognition errors with metric-based search; translate historical texts; rely on dictionaries

Outlook • Automatic evidence retrieval • Automatic classification of historical texts • Enhancement of full-text search • Implementation of synoptic view module • Implementation of visualizations

Thanks to • DFG – Deutsche Forschungsgemeinschaft • Compact Memory • Bibliotheca Augustana • Hessisches Staatsarchiv Darmstadt • DocumentArchiv.de • Dawn Archer and Paul Rayson

Bibliography • Benedetto, D., Caglioti, E., Loreto, V. (2002). Language Trees and Zipping. Phys. Rev. Lett. 88, 048702. • Heeringa, W., Kleiweg, P., Gooskens, C., Nerbonne, J. (2006). Evaluation of String Distance Algorithms for Dialectology. In: J. Nerbonne & E. Hinrichs (eds.) Linguistic Distances Workshop, Sydney, July 2006, pp. 51-62. • Kempken, S., Luther, W., Pilz, T. (2006). Comparison of distance measures for historical spelling variants. In: Proceedings of the IFIP AI Conference, Santiago de Chile, 2005. • Mischke, L., Luther, W. (2005). Document Image De-Warping Based on Detection of Distorted Text Lines. In: F. Roli & S. Vitulano (eds.) ICIAP 2005. Berlin Heidelberg, pp. 1068-1075. • Pilz, T., Luther, W., Ammon, U., Fuhr, N. (2005). Rule-based search in text databases with nonstandard orthography. In: Proceedings ACH/ALLC, 2005.

Thank you for your interest! Any questions?
