Corpus Linguistic Processing Problems

Mike Scott • School of English, University of Liverpool • Charles University, Prague, 22.5.06 • This presentation is at www.lexically.net/downloads/corpus_linguistics

Abstract • This lecture considers the problems of handling and analysing sizeable corpora using standard PC technology. Issues to be addressed include the problem of dealing with text in a variety of formats, what constitutes a text boundary, memory versus disk storage, and retrieval from a hard disk of relevant texts which can be said to be “about” a given topic.

H.G. Wells World Brain (1938) • This World Encyclopaedia would be the mental background of every intelligent man in the world. It would be alive and growing and changing continually under revision, extension and replacement from the original thinkers in the world everywhere. Every university and research institute would be feeding it … its contents would be the standard source of material… • (in Witten et al 1999:435)

Issues and Questions • Retrieval – Queries • Text formats • Text boundaries • Storage • Finding relevant text data on a hard disk

Part 1: Retrieval

What we do with a Corpus: search it • Text Focus – Find texts meeting certain criteria – Discover characteristics of text-types • Language Focus – Find words / phrases / structures meeting certain criteria – Discover characteristics of words / phrases / structures • Find text-types with these characteristics

“Query” Operations • List all instances of X • Addition Operations: merge documents, insert into list • Removal Operations: split documents, delete from list • View Operations: re-order, see wider context

Text Attributes • Date • Authorship • Readership / audience • Location • Participants • Length • Format (encoding) • Language • Style • Mode • Domain • Availability • Meaning (aboutness) • etc.

Simple Query Types • identical to topic/wording X • similar to topic X • touches on topic X • quotes text X • quoted by text Y • refers/alludes to text X • referred to in text Y
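One of these query types, “quotes text X”, can be approximated by looking for runs of identical wording. The following is only a minimal sketch: the function names, the choice of four-word runs and the sample sentences are assumptions made for illustration, not a method described in this lecture.

```python
import re

def word_ngrams(text, n):
    """Return the set of lower-cased runs of n consecutive words in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_wording(text_a, text_b, n=4):
    """True if the two texts share at least one run of n consecutive words."""
    return bool(word_ngrams(text_a, n) & word_ngrams(text_b, n))

# Hypothetical sample texts, used only to exercise the functions.
source = "It would be alive and growing and changing continually."
quoting = "Wells said the encyclopaedia would be alive and growing and changing."
unrelated = "Text formats represent a significant problem for corpus builders."

print(shares_wording(source, quoting))    # the two share "would be alive and"
print(shares_wording(source, unrelated))  # no shared four-word run
```

Identical wording is the easy case; “similar to topic X” or “alludes to text X” would need something beyond exact matching.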

Complex Queries • More than 1 simple query type, and/or more than 1 text attribute … • …in Boolean combinations (and, or, not)
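A Boolean combination of text attributes might be sketched as follows. The `Text` record, its four attributes, the sample catalogue and the helper names (`attr`, `and_`, `or_`, `not_`) are all hypothetical; the point is only that simple tests on attributes compose into complex queries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Text:
    # A hypothetical record holding a few of the text attributes listed earlier.
    name: str
    year: int
    language: str
    mode: str

# Invented catalogue entries, for illustration only.
texts = [
    Text("letter_01", 1998, "en", "written"),
    Text("interview_02", 2004, "cs", "spoken"),
    Text("editorial_03", 2005, "en", "written"),
    Text("lecture_04", 2006, "en", "spoken"),
]

def attr(name, value):
    """Simple query: the named attribute equals the given value."""
    return lambda t: getattr(t, name) == value

def and_(*preds):
    return lambda t: all(p(t) for p in preds)

def or_(*preds):
    return lambda t: any(p(t) for p in preds)

def not_(pred):
    return lambda t: not pred(t)

def search(query):
    """Return the names of all catalogue texts satisfying the query."""
    return [t.name for t in texts if query(t)]

# English AND NOT spoken
written_english = and_(attr("language", "en"), not_(attr("mode", "spoken")))
# English AND (spoken OR dated 2005 or later)
recent_or_spoken = and_(attr("language", "en"),
                        or_(attr("mode", "spoken"), lambda t: t.year >= 2005))

print(search(written_english))
print(search(recent_or_spoken))
```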

Part 2: Text Formats

The chaos of text formats • Character formats • Text formats

Characters • “Legacy” formats from the 1980s (e.g. DOS and its fore-runners) • Unicode (now at version 5 beta): “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.” http://www.unicode.org/standard/WhatIsUnicode.html
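The conflict between encodings described in the quotation can be seen with a single byte. This small sketch uses two legacy code pages shipped with Python, Windows cp1252 and DOS cp437; the choice of byte and of code pages is ours, made only for illustration.

```python
# One byte, two legacy encodings, two different characters.
raw = bytes([0xE9])

as_windows = raw.decode("cp1252")  # Windows Western European code page
as_dos = raw.decode("cp437")       # original IBM PC / DOS code page

print(as_windows)  # é  (Latin small letter e with acute)
print(as_dos)      # Θ  (Greek capital letter theta)

# Unicode gives each character one number, whatever the platform.
print(hex(ord(as_windows)))  # 0xe9
print(hex(ord(as_dos)))      # 0x398

# The Unicode character é stored as UTF-8 takes two bytes, not one.
utf8_bytes = as_windows.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'
```

A corpus tool that guesses the wrong legacy encoding will therefore silently turn one character into another, which is the corruption risk the quotation warns of.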

Text Processing • Unix, Windows and Mac each handle some aspects of text differently, e.g. how they mark ends of lines • Word .doc v. RTF v. HTML v. XML: extra information built into the text
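The end-of-line difference is easy to show: Windows marks a line end with carriage return plus line feed, Unix with line feed alone, and classic (pre-OS X) Mac systems with carriage return alone. A minimal sketch of normalising all three follows; the function name is hypothetical, and working on raw bytes in this way assumes an ASCII-compatible encoding (it would not be safe for UTF-16).

```python
def normalise_newlines(data: bytes) -> bytes:
    """Convert Windows (CR LF) and classic Mac (CR) line ends to Unix (LF)."""
    # Replace CR LF first, so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

windows = b"line one\r\nline two\r\n"
unix = b"line one\nline two\n"
classic_mac = b"line one\rline two\r"

# The same two lines occupy a different number of bytes on each platform.
print(len(windows), len(unix), len(classic_mac))

print(normalise_newlines(windows) == unix)
print(normalise_newlines(classic_mac) == unix)
```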

Prague.doc = 26,064 bytes • Prague.xml = 8,534 bytes • Prague.rtf = 7,893 bytes • Prague.htm = 7,272 bytes • Prague.txt = 8 bytes
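The arithmetic behind these figures can be worked through as below. The slide does not say what the files contain, so the sketch assumes that all five hold the same short text and that everything beyond the size of the plain-text version is format overhead.

```python
# File sizes in bytes, as given on the slide.
sizes = {
    "Prague.doc": 26064,
    "Prague.xml": 8534,
    "Prague.rtf": 7893,
    "Prague.htm": 7272,
    "Prague.txt": 8,
}

plain = sizes["Prague.txt"]

# Assumption: bytes beyond the plain-text size are formatting information.
overhead = {name: size - plain for name, size in sizes.items()}
ratio = {name: size / plain for name, size in sizes.items()}

for name in sizes:
    print(f"{name}: {overhead[name]:>6} bytes of overhead, "
          f"{ratio[name]:.1f} times the plain text")
```

On that assumption the Word file is over three thousand times the size of the text it carries, and even the leanest marked-up version is about nine hundred times larger.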

Part 3: Text Boundaries

The Colony • “… a colony is a discourse whose component parts do not derive their meaning from the sequence in which they are placed.” (Hoey 1986:4)

Examples of Colony Texts • “shopping lists, letter pages, dictionaries, hymn books, exam papers, concordances, small ads, class lists, bibliographies (to papers), abstracts (in volume form), constitutions, address books, newspapers, encyclopaedias, cookery books, seminar programmes, journals, certain kinds of reference books (e.g. Films on TV), footnotes to literary works, telephone directories, the Book of Proverbs, the Radio Times (and other TV magazines), gardening columns (sometimes), horoscopes (in newspapers), conference proceedings, menus…” (Hoey 1986:5)

Features of the Colony • Meaning not derived from sequence; • Adjacent units do not form continuous prose; • There is a framing context; • No single author and/or anon; • One component may be used without referring to the others; • Components can be reprinted or reused in subsequent works; • Components may be added, removed or altered; • Many of the components serve the same function; • Alphabetic, numeric or temporal sequencing. (Hoey 1986:20)

“Mainstream” Texts • share some of these features: they refer to, quote, allude to and share meaning with other texts.

Part 4: Storage

Corpus Storage • Usually organised in folders and sub-folders, with some text attribute, often date, as a general key • Sometimes (BNC) the opportunity to make the filename informative has been wasted • But a tree is not the best way to access corpus contents…

because • of what we saw in Part 1: there are a number of different text attributes, any of which may at different times guide a given research query • research goals are unpredictable

so • a better strategy would be to let the component texts remain wherever they happen to be: • in emails • in .doc, .html, .xml files • in previous corpora (.txt usually) • and access them by an index structure

Part 5: Finding

Accessing relevant corpus texts • via the index • with a mechanism for determining & then labelling each text’s • format • start and end • aboutness • language, authorship etc. • A database solution.
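A database index of this kind might look like the sketch below, which uses Python's built-in sqlite3 module. The table name, column names, file paths and sample rows are all hypothetical; the texts themselves stay where they are, and the index records only where each one starts and ends and how it has been labelled.

```python
import sqlite3

# In-memory database for illustration; a real index would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE text_index (
        id           INTEGER PRIMARY KEY,
        path         TEXT NOT NULL,   -- where the text happens to be stored
        format       TEXT,            -- e.g. email, xml, txt
        start_offset INTEGER,         -- text boundary: first byte
        end_offset   INTEGER,         -- text boundary: byte after the last
        language     TEXT,
        author       TEXT,
        topic        TEXT             -- a crude label for aboutness
    )
""")

# Invented entries; two distinct texts share one file in an older corpus.
rows = [
    ("mail/inbox.mbox", "email", 10240, 12800, "en", "A. Writer", "corpora"),
    ("docs/report.xml", "xml", 0, 8534, "en", "B. Author", "encoding"),
    ("old_corpus/news_1998.txt", "txt", 512, 4096, "cs", None, "corpora"),
    ("old_corpus/news_1998.txt", "txt", 4096, 9000, "en", None, "corpora"),
]
conn.executemany(
    "INSERT INTO text_index "
    "(path, format, start_offset, end_offset, language, author, topic) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

def find_texts(language, topic):
    """Return (path, start, end) for every indexed text matching both labels."""
    sql = ("SELECT path, start_offset, end_offset FROM text_index "
           "WHERE language = ? AND topic = ? ORDER BY id")
    return conn.execute(sql, (language, topic)).fetchall()

for path, start, end in find_texts("en", "corpora"):
    print(path, start, end)
```

Because each row carries its own start and end, a single file can hold several indexed texts, and the same table can be queried on any attribute rather than only on the one used to name the folders.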

Conclusions • Only a sub-set of retrieval methods is catered for at present • Text formats represent a significant problem for corpus builders • Text boundaries are often (always?) quite fuzzy if one is interested in meaning • Storage has traditionally been organised in discrete corpora • But it would be better to organise a discrete index instead.

…which is not very different from…

References: • Aston, Guy & Lou Burnard (1998) The BNC Handbook. Edinburgh: Edinburgh University Press. • Hoey, M. (1986) “The Discourse Colony: a preliminary study of a neglected discourse type”, in M. Coulthard (ed.) Talking About Text. Birmingham: English Language Research, Discourse Analysis Monographs no. 13, pp. 1-26. • Scott, Mike & Chris Tribble (2006) Textual Patterns: key words and corpus analysis in language education. Amsterdam: Benjamins. • Wells, H.G. (1938) World Brain. New York: Doubleday. • Witten, I.H., A. Moffat & T.C. Bell (1999) Managing Gigabytes. 2nd edition. San Francisco: Morgan Kaufmann.
