
VARD 2: A tool for dealing with spelling variation in historical corpora


Presentation Transcript


  1. VARD 2: A tool for dealing with spelling variation in historical corpora Alistair Baron Computing Department Lancaster University a.baron@comp.lancs.ac.uk www.comp.lancs.ac.uk/~barona/

  2. Outline • Early Modern English • Characteristics • Spelling Variation • Corpora • VARD 2 • Demonstration • Current and Future Work • Questions Alistair Baron | Lancaster University

  3. Early Modern English (EModE)
     • Period of English language between 1450 and 1700.
     • Large amount of research interest:
       • Influential period in the formation of modern English.
       • Earliest period of English from which a large corpus can be built due to a sharp increase in text production:
         • King Henry V’s commitment to the vernacular from 1417.
         • William Caxton’s introduction of the printing press in 1476.
         • An increasingly literate public.
       • Shakespeare’s works.
     • Large amount of spelling variation.

  4. EModE Spelling Variation
     • Large amount of spelling variation in EModE texts:
       • No notion of the importance of having a single spelling for each word.
       • Authors, scribes, editors and printing houses would have their own spelling preferences.
       • Letters would be added or removed to ease line justification.
       • Local dialect could also influence spelling.
     • Spelling variation became less frequent through the EModE period:
       • Spread of London and Chancery English through the introduction of printing.
       • Signified by the introduction of dictionaries, especially Samuel Johnson’s in 1755.

  5. EModE Spelling Variation (Examples)
     • Examples of spelling variants:
       • Addition or removal of ‘e’, e.g. aske, workes, dos
       • Doubling and singling of letters, e.g. smels, heere, leggs
       • Interchanged letters: { u , v }, { j , i }, { ie , y }, { vv , w }, e.g. haue, vnder, maiestie, vvas
       • Usage of apostrophe, e.g. vow’d, ‘em
       • Spellings which are variable still today, e.g. centre/center, -or/-our, -ise/-ize
       • Fused forms, e.g. t’is, t’was, o’th
       • Archaic –(e)th and –(e)st endings, e.g. hath, doth, seemeth, shouldst
       • Archaic forms, e.g. betwixt, howbeit
       • Phonetic spellings, e.g. publiquely, blew (blue)
       • + any combination of the above and other irregular spellings, e.g. Iigge (Jig), diuell (devil), shak’d (shook)
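
The interchanged-letter patterns above lend themselves to a small illustration. The following is a minimal sketch, not VARD 2's actual rule set or algorithm: the rule list, the tiny `LEXICON`, and the function names are all assumptions made for the example. It applies letter-interchange rules to a variant for up to two steps and keeps only the results found in a modern word list.

```python
# Illustrative interchange rules based on the examples above (an assumption,
# not the rules used by VARD 2).
RULES = [("vv", "w"), ("u", "v"), ("v", "u"), ("i", "j"), ("ie", "y")]

# Tiny stand-in for a large modern word list (hypothetical).
LEXICON = {"have", "under", "majesty", "was"}


def one_step(word):
    """Return every string produced by applying one rule at one position."""
    results = set()
    for old, new in RULES:
        start = word.find(old)
        while start != -1:
            results.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return results


def modern_candidates(variant, lexicon=LEXICON, max_steps=2):
    """Apply rules for up to max_steps rounds; keep results found in the lexicon."""
    seen = {variant}
    frontier = {variant}
    for _ in range(max_steps):
        frontier = {w for word in frontier for w in one_step(word)} - seen
        seen |= frontier
    return sorted(seen & lexicon)


for variant in ["haue", "vnder", "maiestie", "vvas"]:
    print(variant, "->", modern_candidates(variant))
```

Filtering against a word list matters here because the rules over-generate: most rewritten strings are not words at all, and only the lexicon check turns them into usable candidates.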

  6. EModE Corpora
     • The construction of EModE and other historical corpora has become an important focus of research. Research topics using the corpora range from studies of diachronic linguistic change to studies of attitudes towards gender at the time.
     • Corpora include:
       • ARCHER, Helsinki, Lampeter and ZEN (Kytö et al., 1994)
       • Corpus of Early English Correspondence (Nevalainen, 1997)
       • Corpus of English Dialogues (Culpeper and Kytö, 1997)
       • Many versions of Shakespeare’s works
         • e.g. the First Folio as printed in 1623, which can be sourced from the Oxford Text Archive (http://ota.ahds.ac.uk/)
     • Also, increasing amounts of textual data, large quantities of which are historical texts, are being digitised through current initiatives including:
       • The Open Content Alliance (http://www.opencontentalliance.org/)
       • Google Books (http://books.google.com)
       • Early English Books Online (http://eebo.chadwyck.com/home)

  7. EModE Corpora: The problem with spelling variation
     • Many corpus linguistic functions can be completed automatically with software such as Wordsmith Tools, BNCWeb and WMatrix; however, these tools are designed to work with modern English (or other modern languages).
     • Problems occur when corpus linguistic functions are run on historical varieties or dialects of English (and indeed other languages), especially when high levels of spelling variation occur, as in Early Modern English.
     • This can cause problems for even simple functions such as a string search, with only words spelt in exactly the same way as the search query being returned.
     • A recent examination of the Lampeter corpus has shown that an average of 1 in 5 word types per text are not found in a large modern word list.
     • Frequency lists will also be incorrect due to a word’s potential frequency being split between its different spellings.
       • For example, would could be spelt in a variety of forms including: would, wolde, woolde, wood, wuld, wulde, wud, wald, vvould, vvold, and so on.
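
The frequency-splitting problem described above can be seen in a few lines. This is a minimal sketch under stated assumptions: the sentence is invented for illustration (it is not drawn from any corpus), and `VARIANT_TO_MODERN` is a hypothetical hand-made mapping using spellings from the list above.

```python
from collections import Counter

# Invented example sentence (not corpus data) mixing several spellings of "would".
tokens = "I would go if he wolde and she vvould but they wuld not".split()

# Hypothetical variant-to-modern mapping, using forms listed on the slide.
VARIANT_TO_MODERN = {"wolde": "would", "vvould": "would", "wuld": "would"}

# Raw counts: each spelling is treated as a separate word type.
raw = Counter(t.lower() for t in tokens)

# Counts after mapping known variants to their modern equivalent.
normalised = Counter(VARIANT_TO_MODERN.get(t.lower(), t.lower()) for t in tokens)

print("raw count of 'would':       ", raw["would"])
print("normalised count of 'would':", normalised["would"])
```

A plain string search for "would" behaves like the raw count: it finds only the one token spelt exactly that way.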

  8. EModE Corpora: The problem with spelling variation
     • Keyword lists could be obscured by spelling variation, because multiple spellings of a word reduce its ‘keyness’.
     • Collocations would also be affected in much the same way, with co-occurring words not being detected due to spelling variation.
     • Rayson et al. (2007) evaluated the accuracy of the CLAWS part-of-speech tagger on EModE corpora (modern accuracy: 96-97%).
     • Archer et al. (2003) discuss developing the USAS Semantic Tagger for EModE; the paper reports on an evaluation performed on relatively contemporary texts from 1640. Dealing in part with spelling variation produced an improvement in error rates: 2.9% to 1.2% in one text and 4.0% to 1.4% in the other text processed.

  9. Our Solution – VARD (Variant Detector)
     • Our solution to the problem was to build a pre-processor for corpus linguistic tools which ‘standardizes’ the spelling variation found within a text.
     • This led to the production of VARD, a search and replace tool which uses a large list of known variants to insert a modern equivalent alongside the original spelling.
     • The processed text could then be passed on to corpus linguistic software, where the system would ‘see’ the modern spelling instead of the spelling variant, thus improving the accuracy of the tool’s techniques.
     • It should be noted that we are not “correcting” the spelling of EModE texts: there was no correct spelling at the time, and the spelling variants are important linguistic features. The original variant is maintained and it is a simple process to switch between the original and modernised texts. The modern equivalents are inserted for the benefit of other automated software, which would produce inaccurate results if a large number of spelling variants remained.
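
The search-and-replace idea described above, with the original spelling kept alongside the modern form, might be sketched as follows. This is an illustration only: the `<normalised orig="...">` markup, the three-entry `KNOWN_VARIANTS` list, and the function names are assumptions for the example and are not taken from the slides.

```python
import re

# Hypothetical stand-in for a large list of known variants.
KNOWN_VARIANTS = {"haue": "have", "vnder": "under", "vvas": "was"}

# Assumed markup: modern form as element text, original spelling as an attribute.
TAG = re.compile(r'<normalised orig="([^"]*)">([^<]*)</normalised>')


def annotate(text):
    """Wrap each known variant in a tag that keeps the original spelling."""
    def replace(match):
        word = match.group(0)
        modern = KNOWN_VARIANTS.get(word.lower())
        if modern is None:
            return word  # not a known variant: leave untouched
        if word[0].isupper():
            modern = modern.capitalize()  # preserve initial capitalisation
        return f'<normalised orig="{word}">{modern}</normalised>'
    return re.sub(r"[A-Za-z']+", replace, text)


def original_view(annotated):
    """Recover the original text from the annotated text."""
    return TAG.sub(lambda m: m.group(1), annotated)


def modern_view(annotated):
    """Produce the text as downstream tools should 'see' it."""
    return TAG.sub(lambda m: m.group(2), annotated)


source = "I haue seen it vnder the tree"
annotated = annotate(source)
print(annotated)
print(original_view(annotated))
print(modern_view(annotated))
```

Because the annotation stores both forms, switching between the original and modernised texts is a single substitution in either direction, and nothing from the source is lost.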

  10. VARD 2
     • The original VARD tool managed to deal with a large amount of spelling variation; however, because spelling variation is so extensive, it is impossible to include all possible spelling variants in a pre-defined list.
     • VARD 2 was therefore developed; it employs techniques from modern spell checkers to find potential replacements for spelling variants.
     • The tool also offers an interactive interface where users can view all spelling variants detected and select the desired replacement from a supplied list for each variant.
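
The slide does not say which spell-checker techniques VARD 2 uses, so the following is only a generic sketch of the idea of proposing replacements for an unseen variant. It swaps in similarity matching from Python's standard `difflib` module as a stand-in; the word list and the `suggest` function are hypothetical.

```python
import difflib

# Hypothetical, very small modern word list.
MODERN_WORDS = ["would", "world", "wood", "wild", "mould", "have", "devil"]


def suggest(variant, words=MODERN_WORDS, limit=3, cutoff=0.6):
    """Return up to `limit` modern words most similar to the variant.

    Similarity is difflib's sequence-matching ratio, used here only as a
    stand-in for whatever measures a real spell checker would apply.
    """
    return difflib.get_close_matches(variant.lower(), words, n=limit, cutoff=cutoff)


for variant in ["wolde", "vvould", "diuell"]:
    print(variant, "->", suggest(variant))
```

A ranked list of this kind is what an interactive interface would present, leaving the final choice of replacement to the user.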

  11. VARD 2 Demonstration

  12. Current and Future Work
     • DICER: Discovery and Investigation of Character Edit Rules

  13. Current and Future Work
     • Context sensitive rules
       • Surrounding grammar (and Semantics?)
       • Word bigram and trigram analysis
       • Collocations?
     • Analysis of VARD 2 recall and precision
     • Analysis of effect on corpus linguistic techniques
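
For the planned recall and precision analysis, the standard definitions can be sketched as below. The gold-standard and detected sets are invented for illustration and do not represent any VARD 2 results.

```python
def precision_recall(detected, gold):
    """Return (precision, recall) for a set of detected variants.

    precision = correct detections / all detections
    recall    = correct detections / all true variants
    """
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: words a human annotator marked as variants ...
gold = {"haue", "vnder", "vvas", "diuell"}
# ... and words a hypothetical detector flagged.
detected = {"haue", "vnder", "betwixt"}

precision, recall = precision_recall(detected, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```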

  14. Any Questions?
     • Thanks for listening!
     • More information about VARD 2 can be found at http://www.comp.lancs.ac.uk/~barona/vard2/

  15. References
     • Archer, D., McEnery, T., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In Archer, D., Rayson, P., Wilson, A. and McEnery, T. (eds.). Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 22-31.
     • Culpeper, J. and Kytö, M. (1997). Towards a corpus of dialogues, 1550-1750. In Ramisch, H. and Wynne, K. (eds.). Language in Time and Space. Studies in Honour of Wolfgang Viereck on the Occasion of His 60th Birthday (Zeitschrift für Dialektologie und Linguistik - Beihefte, Heft 97). Franz Steiner Verlag, Stuttgart, pp. 60-73.
     • Kytö, M., Rissanen, M. and Wright, S. (1994). Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, Cambridge, March 1993. Rodopi, Amsterdam.
     • Nevalainen, T. (1997). Ongoing work on the Corpus of Early English Correspondence. In Hickey, R., Kytö, M., Lancashire, I. and Rissanen, M. (eds.). Tracing the Trail of Time: Proceedings from the Second Diachronic Corpora Workshop. Rodopi, Amsterdam.
     • Rayson, P., Archer, D. and Smith, N. (2005). VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In Proceedings from the Corpus Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN 1747-9398.
     • Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007, July 27-30, University of Birmingham, UK.
     • Vallins, G. H. (1954). Spelling. Revised by Scragg, D. G. (1965). André Deutsch.

  16. VARD 2 Screenshots

  17. VARD 2 Screenshots