
Corpus Linguistic Processing Problems

Mike Scott, School of English, University of Liverpool. Charles University, Prague, 22.5.06. This presentation is at www.lexically.net/downloads/corpus_linguistics



  1. Corpus Linguistic Processing Problems • Mike Scott, School of English, University of Liverpool • Charles University, Prague, 22.5.06 • This presentation is at www.lexically.net/downloads/corpus_linguistics

  2. Abstract • This lecture considers the problems of handling and analysing sizeable corpora using standard PC technology. Issues to be addressed include the problem of dealing with text in a variety of formats, what constitutes a text boundary, memory versus disk storage, and retrieval from a hard disk of relevant texts which can be said to be “about” a given topic.

  3. H.G. Wells World Brain (1938) • This World Encyclopaedia would be the mental background of every intelligent man in the world. It would be alive and growing and changing continually under revision, extension and replacement from the original thinkers in the world everywhere. Every university and research institute would be feeding it … its contents would be the standard source of material… • (in Witten et al 1999:435)

  4. Issues and Questions • Retrieval – Queries • Text formats • Text boundaries • Storage • Finding relevant text data on a hard disk

  5. Part 1: Retrieval

  6. What we do with a Corpus: search it • Language Focus: find words / phrases / structures meeting certain criteria; discover characteristics of words / phrases / structures • Text Focus: find texts meeting certain criteria; discover characteristics of text-types; find text-types with these characteristics

  7. “Query” Operations • List all instances of X • Addition Operations: merge documents; insert into list • Removal Operations: split documents; delete from list • View Operations: re-order; see wider context

  8. Text Attributes • Date • Authorship • Readership / audience • Location • Participants • Length • Format (encoding) • Language • Style • Mode • Domain • Availability • Meaning (aboutness) • etc.

  9. Simple Query Types • identical to topic/wording X • similar to topic X • touches on topic X • quotes text X • quoted by text Y • refers/alludes to text X • referred to in text Y

  10. Complex Queries • More than 1 simple query type, and/or more than 1 text attribute … • …in Boolean combinations (and, or, not)
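A Boolean combination of simple queries over text attributes can be sketched as follows. This is a minimal illustration only: the `TextRecord` fields, the helper names (`attr_equals`, `q_and`, `q_or`, `q_not`, `run_query`) and the sample records are all hypothetical and not taken from the lecture or from any particular corpus tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextRecord:
    # Hypothetical metadata record; the attribute names are illustrative only.
    name: str
    language: str
    year: int
    domain: str


Predicate = Callable[[TextRecord], bool]


def attr_equals(attr: str, value) -> Predicate:
    """Simple query: one text attribute equals one value."""
    return lambda rec: getattr(rec, attr) == value


def q_and(*preds: Predicate) -> Predicate:
    return lambda rec: all(p(rec) for p in preds)


def q_or(*preds: Predicate) -> Predicate:
    return lambda rec: any(p(rec) for p in preds)


def q_not(pred: Predicate) -> Predicate:
    return lambda rec: not pred(rec)


def run_query(records: List[TextRecord], pred: Predicate) -> List[str]:
    """Return the names of all records that satisfy the predicate."""
    return [rec.name for rec in records if pred(rec)]


# Invented sample metadata.
records = [
    TextRecord("a.txt", "en", 1998, "news"),
    TextRecord("b.txt", "cs", 2004, "news"),
    TextRecord("c.txt", "en", 2005, "fiction"),
]

# (language is English AND NOT domain is fiction) OR year is 2004
query = q_or(
    q_and(attr_equals("language", "en"), q_not(attr_equals("domain", "fiction"))),
    attr_equals("year", 2004),
)

print(run_query(records, query))
```

Because each query is just a function from a record to true or false, any number of simple queries and attributes can be nested with and, or and not.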

  11. Part 2: Text Formats

  12. The chaos of text formats • Character formats • Text formats

  13. Characters • “Legacy” formats from the 1980s (e.g. DOS and its fore-runners) • Unicode (now at version 5 beta): “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.” http://www.unicode.org/standard/WhatIsUnicode.html
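The conflict between legacy encodings described in the quotation can be seen in a short sketch. The byte value and the three code pages below are chosen purely for illustration; they are not examples from the lecture.

```python
# One and the same byte, read under three legacy encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")  # Western European (ISO 8859-1)
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic
as_cp437 = raw.decode("cp437")     # original IBM PC / DOS code page

# The same number yields three different characters.
print(as_latin1, as_cp1251, as_cp437)

# Conversely, one character is stored as different numbers.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(e_acute.encode("latin-1"))  # one byte
print(e_acute.encode("cp437"))    # a different single byte
print(e_acute.encode("utf-8"))    # two bytes in this Unicode encoding
```

A corpus tool that guesses the wrong encoding will therefore silently turn one letter into another, which is the risk of corruption the quotation refers to.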

  14. Text Processing • Unix, Windows, Mac – can each handle some aspects of texts differently, e.g. how they process ends of lines • Word .doc v. RTF v. HTML v. XML: extra information built into the text
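The end-of-line difference can be illustrated with a small sketch. The function name `normalise_newlines` and the sample strings are assumptions made for this example, not part of any tool mentioned in the lecture.

```python
def normalise_newlines(data: bytes) -> bytes:
    """Convert Windows (CR LF) and classic Mac (CR) line endings to Unix (LF)."""
    # CR LF must be replaced first, otherwise each pair would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# The same two-line text as stored under three conventions.
unix = b"Prague\nLiverpool\n"
windows = b"Prague\r\nLiverpool\r\n"
old_mac = b"Prague\rLiverpool\r"

# The Windows version is one byte longer per line.
print(len(unix), len(windows), len(old_mac))

# After normalisation all three are byte-for-byte identical.
print(normalise_newlines(windows) == unix)
print(normalise_newlines(old_mac) == unix)
```

Normalising before counting or indexing keeps byte offsets and file sizes comparable across texts that originated on different systems.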

  15. Prague.doc = 26,064 bytes • Prague.xml = 8,534 bytes • Prague.rtf = 7,893 bytes • Prague.htm = 7,272 bytes • Prague.txt = 8 bytes

  16. Part 3: Text Boundaries

  17. The Colony • “… a colony is a discourse whose component parts do not derive their meaning from the sequence in which they are placed.” (Hoey 1986: 4)

  18. Examples of Colony Texts • “shopping lists, letter pages, dictionaries, hymn books, exam papers, concordances, small ads, class lists, bibliographies (to papers), abstracts (in volume form), constitutions, address books, newspapers, encyclopaedias, cookery books, seminar programmes, journals, certain kinds of reference books (e.g. Films on TV), footnotes to literary works, telephone directories, the Book of Proverbs, the Radio Times (and other TV magazines), gardening columns (sometimes), horoscopes (in newspapers), conference proceedings, menus…” (Hoey 1986:5)

  19. Features of the Colony • Meaning not derived from sequence; • Adjacent units do not form continuous prose; • There is a framing context; • No single author and/or anon; • One component may be used without referring to the others; • Components can be reprinted or reused in subsequent works; • Components may be added, removed or altered; • Many of the components serve the same function; • Alphabetic, numeric or temporal sequencing. (Hoey 1986:20)

  20. “Mainstream” Texts • share some of these features: they refer to, quote, allude to and share meaning with other texts.

  21. Part 4: Storage

  22. Corpus Storage • Usually done using folders and sub-folders using some text attribute, often date, as a general key • Sometimes (BNC) the opportunity to make the filename informative has been wasted • But a tree is not the best way to access corpus contents…

  23. because • of what we saw in Part 1: • there are a number of different text attributes any of which at different times may guide a given research query • given the unpredictability of research goals

  24. so • a better strategy would be to let the component texts remain wherever they happen to be: • in emails • in .doc, html, .xml files • in previous corpora (.txt usually) • and access them by an index structure
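One possible reading of "access them by an index structure" is an inverted index that records where each word occurs while the texts themselves stay where they are. The sketch below is an assumption-laden illustration: the locations and contents are invented, the texts are held in a dictionary rather than read from disk, and `build_index` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(texts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lower-cased word to the set of locations that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for location, content in texts.items():
        for word in re.findall(r"\w+", content.lower()):
            index[word].add(location)
    return index


# Invented locations standing in for an email, an XML file and an old corpus.
texts = {
    "mail/inbox/msg0042.eml": "Corpus meeting moved to Monday",
    "docs/report.xml": "The corpus contains newspaper texts",
    "old_corpus/file17.txt": "Newspaper texts from Prague",
}

index = build_index(texts)

# Which locations mention "corpus"?
print(sorted(index["corpus"]))

# Which locations mention both "newspaper" and "texts"?
print(sorted(index["newspaper"] & index["texts"]))
```

The folder tree plays no part in the lookup: a text is found through what the index says about it, not through where it happens to be stored.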

  25. Part 5: Finding

  26. Accessing relevant corpus texts • via the index • with a mechanism for determining & then labelling each text’s • format • start and end • aboutness • language, authorship etc. • A database solution.
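A database index of this kind might look like the following sketch, which uses Python's built-in `sqlite3` module. The table name, column names, sample rows and the `find_texts` helper are all hypothetical; the lecture does not specify a schema.

```python
import sqlite3

# In-memory database for illustration; a real index would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE text_index (
        id           INTEGER PRIMARY KEY,
        location     TEXT NOT NULL,   -- where the text already lives
        format       TEXT,            -- e.g. xml, email, txt
        start_offset INTEGER,         -- where the text starts in the file
        end_offset   INTEGER,         -- where the text ends in the file
        language     TEXT,
        author       TEXT,
        topic        TEXT             -- a label for "aboutness"
    )
    """
)

# Invented rows: one file holding two texts, plus one email.
rows = [
    ("archive/news_2005.xml", "xml", 0, 4200, "en", None, "football"),
    ("archive/news_2005.xml", "xml", 4201, 9100, "en", None, "elections"),
    ("mail/msg0042.eml", "email", 0, 850, "cs", "J. Novak", "elections"),
]
conn.executemany(
    "INSERT INTO text_index "
    "(location, format, start_offset, end_offset, language, author, topic) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)


def find_texts(conn: sqlite3.Connection, topic: str, language: str):
    """Return (location, start, end) for texts with the given topic and language."""
    cur = conn.execute(
        "SELECT location, start_offset, end_offset FROM text_index "
        "WHERE topic = ? AND language = ? ORDER BY id",
        (topic, language),
    )
    return cur.fetchall()


print(find_texts(conn, "elections", "en"))
```

Storing start and end offsets lets a single file contribute several texts, which suits colony-type sources such as a newspaper archive.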

  27. Conclusions • Only a sub-set of retrieval methods are catered for at present • Text formats represent a significant problem for corpus builders • Text boundaries are often (always?) quite fuzzy if one is interested in meaning • Storage has traditionally been organised in discrete corpora • But it would be better to organise a discrete index instead.

  28. …which is not very different from…

  29. References: • Aston, Guy & Lou Burnard (1998) The BNC Handbook. Edinburgh: Edinburgh University Press. • Hoey, M. (1986) “The Discourse Colony: a preliminary study of a neglected discourse type”, in M. Coulthard (ed.) Talking About Text. Birmingham: English Language Research, Discourse Analysis Monographs no. 13, pp. 1-26. • Scott, Mike & Chris Tribble (2006) Textual Patterns: key words and corpus analysis in language education. Amsterdam: Benjamins. • Wells, H.G. (1938) World Brain. New York: Doubleday. • Witten, I.H., A. Moffat & T.C. Bell (1999) Managing Gigabytes. 2nd edition. San Francisco: Morgan Kaufmann.
