
Searching in text databases with non-standard orthography

Thomas Pilz, University of Duisburg-Essen. Digital Historical Corpora, December 3-8, 2006.


Presentation Transcript


  1. Searching in text databases with non-standard orthography Thomas Pilz University of Duisburg-Essen Digital Historical Corpora, December 3-8, 2006

  2. Overview • Introduction • Processing historical documents • Text recognition • Performing search operations • Distance measures • Example of application • Outlook

  3. Origins of RSNSR • Search engine • A web-based system for assisted literature research • Project Nietzsche-CD • Partial text recognition of German Fraktur documents

  4. Introduction • "Dei tropische Emdiuck des Gaitens wird ganz besonders dmch die Bananen Casuaunen Fiedei- und Fachei-Palmen sowie die tropischen Ficus-aiten heivoigeiufen, unter denen mir die voi dei indische Ficus benghalensis, die hinteimdische ..." Altneuland, 1904 (Compact Memory) • "Am abgewichenen Sonnabend haben Jhro Königl. Hoheit, die Prinzeßin Amalia, an das Königliche Haus, und viele hohen Standespersohnen (...) und die Printzeßin Amalia Königl. Hoheiten, bey der Printzeßin ..." Berlinische priv. Ztg., 1748 (Bibliotheca Augustana)

  5. An ideal scenario • Digital HQ scan • Image preprocessing • Optical character recognition • High quality document • Manual correction

  6. Our solution for the "real world" • Digital HQ scan • LQ analog copies • Image preprocessing • Optical character recognition • Document with recognition errors and non-standard spellings • High quality documents • Manual correction • Search engine

  7. Typical OCR problems • 75 common recognition errors • > 90% context free
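Because most of these recognition errors are context free, each one can be modelled as a plain character substitution. The following is a minimal sketch of that idea, not the project's actual error table: `CONFUSIONS` and `ocr_variants` are hypothetical names, and the single entry (r misread as i) is taken only from the transcript's own samples such as "Aibeitei" for "Arbeiter".

```python
from itertools import product

# Hypothetical confusion table: correct character -> possible OCR misreadings.
# The real system lists 75 common errors; only one illustrative entry is given here.
CONFUSIONS = {"r": ["i"]}

def ocr_variants(word, confusions=CONFUSIONS, limit=1000):
    """Return spellings of `word` that context-free OCR errors could produce."""
    # For every position, allow the original character or any of its misreadings.
    options = [[ch] + confusions.get(ch, []) for ch in word]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:  # guard against combinatorial blow-up
            break
    return variants

print(ocr_variants("Arbeiter"))
```

The number of variants grows exponentially with the number of confusable characters, which is why the sketch caps the output with `limit`.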

  8. Retrieval on black letter texts [Figure: recognition of the example word "Moral". Stages: Analysis (+Restoration) • Localization (Vertical Bar Patterns) • Preclassification • Recognition: Optical (NN/SVM), Context-based, Additional modules • Combination. Partial readings shown: Mo[x]ai, [W]oral, Ma[x][x]l]

  9. Querying a historical document • Interested user: emperor • Search engine: empriour, emparour, emperour • Text archive: "... as emperour ruleth ouer many kings ..."
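The query scenario on this slide can be sketched with a generic similarity lookup. This is only an illustration under stated assumptions: it uses `difflib.get_close_matches` from the Python standard library as a stand-in for the project's own distance measures, and `ARCHIVE_WORDS` and `find_variants` are hypothetical names.

```python
from difflib import get_close_matches

# Toy vocabulary built from the words shown on the slide.
ARCHIVE_WORDS = ["as", "emperour", "ruleth", "ouer", "many", "kings",
                 "empriour", "emparour"]

def find_variants(query, vocabulary, cutoff=0.7, limit=10):
    """Return vocabulary entries similar to `query`, best match first."""
    # difflib scores similarity in [0, 1]; entries below `cutoff` are dropped.
    return get_close_matches(query, vocabulary, n=limit, cutoff=cutoff)

print(find_variants("emperor", ARCHIVE_WORDS))
```

A lower `cutoff` favours recall over precision, at the cost of more unrelated words in the result.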

  10. Variant spellings • ~45% diachronic variation • ~15% synchronic variation

  11. From text to retrieval • Diploma thesis • Categorization • Computer-assisted collection of spelling word-pairs • Training • Evaluation and enhancement • Diploma theses • Categorized distance measures • Document retrieval • Master theses

  12. Training database • 106 texts from 1293 to 1919 • author, location, time, language • 12,819 semi-automatically collected lemma – variant word-pairs • text reference, category
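One way to picture the records described on this slide is as two small data classes. This is a sketch only: the class and field names are assumptions derived from the bullet points, and every value in the example record except the spelling "bey" and the year 1748 is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceText:
    """Metadata kept per text: author, location, time, language."""
    author: str
    location: str
    year: int
    language: str

@dataclass(frozen=True)
class WordPair:
    """One lemma/variant pair with its text reference and category."""
    lemma: str
    variant: str
    text_reference: SourceText
    category: str

# Placeholder record; the author, location, and category values are invented.
EXAMPLE = WordPair(
    lemma="bei",
    variant="bey",
    text_reference=SourceText("unknown", "unknown", 1748, "German"),
    category="example-category",
)
print(EXAMPLE)
```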

  13. Training database

  14. Measuring linguistic distance • Commonly used in dialectometry (Heeringa et al. `06) • Java framework SVBase • Working with evidence of spelling variation • Dynamic programming • Java datatype FlexMetric (Kempken et al. `06) • Working with arbitrary distance measures
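The dynamic-programming approach mentioned on this slide can be illustrated with the classic unit-cost edit distance (Levenshtein distance). This is a generic textbook sketch in Python, not the SVBase or FlexMetric implementation, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between `a` and `b`, computed row by row."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]

print(edit_distance("emperor", "emperour"))
```

With unit costs every edit counts the same, so a typical historical variation and an unrelated character change are indistinguishable; that limitation motivates trainable costs.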

  15. Stochastic distance measure • Example words: tot, toth, rot • Learning string edit distance: trainable • Uses EM algorithm • Table of edit costs: |alphabet| × |alphabet|
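A table of edit costs changes the picture for the slide's example words: with a cheap insertion of "h", "toth" can be closer to "tot" than "rot" is. The sketch below shows only the lookup side of such a measure. The EM training step is omitted, the cost values in `COSTS` are invented for illustration, and the function name is hypothetical. The empty string stands for "no character", so `("", "h")` is the insertion of "h".

```python
def weighted_edit_distance(a, b, costs, default=1.0):
    """Edit distance where each edit is priced by a cost table."""
    def cost(x, y):
        # Identical characters are free; unlisted edits fall back to `default`.
        if x == y:
            return 0.0
        return costs.get((x, y), default)

    # First row: build each prefix of `b` from the empty string by insertions.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + cost("", cb))
    for ca in a:
        curr = [prev[0] + cost(ca, "")]  # delete `ca`
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + cost(ca, ""),      # deletion
                curr[j - 1] + cost("", cb),  # insertion
                prev[j - 1] + cost(ca, cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]

# Invented cost: inserting "h" is cheap, everything else costs 1.0.
COSTS = {("", "h"): 0.2}

print(weighted_edit_distance("tot", "toth", COSTS))
print(weighted_edit_distance("tot", "rot", COSTS))
```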

  16. Modular distance measures • Search engine • Text archive • Modules: OCR, 1700-1800, 1800-1900 • Example request: "Search for spellings with recognition errors and typical for the 18th century."

  17. Categories for distance measures • Chronological classification: 1250-1350, 1350-1450, 1450-1650, 1650-1900 • Diatopical classification: Upper German, Central German, Lower German • OCR class
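One possible reading of the modular design is that each category contributes its own table of edit costs and a query selects which tables to combine. The sketch below assumes that reading; the module names mirror the categories on the slides, but `MODULES`, `combine`, the merge rule (cheapest cost wins), and all cost values are assumptions made for illustration.

```python
# Hypothetical per-category cost tables: (from_char, to_char) -> cost.
# The empty string stands for "no character" (insertion or deletion).
MODULES = {
    "OCR": {("r", "i"): 0.3},
    "1700-1800": {("i", "y"): 0.2, ("", "h"): 0.2, ("r", "i"): 0.9},
}

def combine(selected, modules=MODULES):
    """Merge the cost tables of the selected modules, keeping the cheapest cost."""
    merged = {}
    for name in selected:
        for edit, cost in modules[name].items():
            # An edit is as cheap as the most permissive selected module allows.
            merged[edit] = min(cost, merged.get(edit, cost))
    return merged

# "Search for spellings with recognition errors and typical for the 18th century."
print(combine(["OCR", "1700-1800"]))
```

The merged table could then be passed to any cost-table-driven distance function.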

  18. Further applications • Document classification Benedetto et al. `03 • Language comparison Archer et al. `06 • Understanding of language evolution
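The document classification cited here (Benedetto et al.) is based on data compression: a sample is assigned to the reference text that helps compress it best. The following is a rough sketch of that general idea using Python's `zlib`, not a reproduction of the cited method; `compressed_size` and `classify` are hypothetical names, and the toy references in the usage line are artificial.

```python
import zlib

def compressed_size(text: str) -> int:
    """Size in bytes of `text` after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(sample: str, references: dict) -> str:
    """Return the label whose reference text makes `sample` cheapest to encode."""
    def extra_bytes(reference: str) -> int:
        # Additional bytes needed when the sample is appended to the reference.
        return compressed_size(reference + sample) - compressed_size(reference)
    return min(references, key=lambda label: extra_bytes(references[label]))

# Artificial references; real use would need long texts per period or dialect.
print(classify("abc abc abc abc ", {"A": "abc " * 50, "B": "xyz " * 50}))
```

Short samples and short references give unreliable results with this approach, so the sketch should be read as a demonstration of the principle only.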

  19. Example: an existing interface www.compactmemory.de

  20. Synoptic view • Aibeitei • Facsimile (graphic) • Searchable text layer with errors • "Dei Bau \uid von deutschen Ingenieuren ausgefuhit Als Aibeitei weiden, wie bei dei Hed-schasbahn, Landesbewohnei, vorliegend Soldaten, jedenfalls keine ein opaischen Arbeiter vei\\endet 4) Für den 1. Apul"

  21. Keeping the user unaware • Text archive • Search engine • Facsimiles (interface) • Interested user • JPG, GIF, ... • XML • Recognition errors are hidden • Historical spellings are conserved but managed • User "feels" like searching on the pictures

  22. Our goals • We want to: provide a module for arbitrary text databases; allow better retrieval results; keep priority of recall over precision • We do not want to: correct recognition errors with metric-based search; translate historical texts; rely on dictionaries

  23. Outlook • Automatic evidence retrieval • Automatic classification of historical texts • Enhancement of fulltext search • Implementation of synoptic view module • Implementation of visualizations

  24. Thanks to • DFG – Deutsche Forschungsgemeinschaft • Compact Memory • Bibliotheca Augustana • Hessisches Staatsarchiv Darmstadt • DocumentArchiv.de • Dawn Archer and Paul Rayson

  25. Bibliography • Benedetto, D., Caglioti, E., Loreto, V. (2002). Language Trees and Zipping. Phys. Rev. Lett. 88, 048702. • Heeringa, W., Kleiweg, P., Gooskens, C., Nerbonne, J. (2006). Evaluation of String Distance Algorithms for Dialectology. In: J. Nerbonne & E. Hinrichs (eds.) Linguistic Distances Workshop, Sydney, July 2006. pp. 51-62. • Kempken, S., Luther, W., Pilz, T. (2006). Comparison of distance measures for historical spelling variants. In: Proceedings of the IFIP AI Conference, Santiago de Chile, 2005. • Mischke, L., Luther, W. (2005). Document Image De-Warping Based on Detection of Distorted Text Lines. In: F. Roli & S. Vitulano (eds.) ICIAP 2005. Berlin Heidelberg. pp. 1068-1075. • Pilz, T., Luther, W., Ammon, U., Fuhr, N. (2005). Rule-based search in text databases with nonstandard orthography. In: Proceedings ACH/ALLC, 2005.

  26. Thank you for your interest! Any questions?
