What is a CORPUS?

“A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 1996).

“[…] the term corpus as used in modern linguistics can best be defined as a collection of sampled texts, written or spoken, in machine-readable form which may be annotated with various forms of linguistic information” (McEnery, Xiao and Tono 2006).

Key concepts re. Corpora:
Machine-readable texts
Authentic texts
Sampled texts
Representative of a particular language or language variety

Is Corpus Linguistics a new approach to the study of language?
The expression Corpus Linguistics first appeared in the early 80s. Corpus-based language study, however, has a substantial history.

Corpus-based language study
In the “pre-Chomskyan era”:
Field linguists (Boas)
Structuralists (Sapir, Newman, Bloomfield, Pike, etc.)
“Corpora” were a few paper slips with data.
“Shoebox corpora”: non-representative.
Corpus-based only in that the methodology was empirical and based on observable data.

The 50s: the “protests”
Chomsky (1962) criticized the (contemporary) corpus methodology on the grounds that corpora are skewed.
Objections: non-representative, time-consuming, competence vs. performance, I-language vs. E-language.
Corpora were marginalized.

The revolutionary 60s
With the advances in computer technology, the exploitation of massive corpora became feasible.
Brown Corpus (Brown University Standard Corpus of Present-Day American English)

The 80s: the boom
From the 80s onwards, the number and size of corpora and corpus-based studies have increased dramatically.
Corpora have revolutionized almost all branches of linguistics.

A few remarks…
Computers…
… allow us to speed up the processing of data.
… avoid human bias in data analysis.
… allow the enrichment of data with metadata.

Intuition vs. Corpus
Intuition should be applied with caution:
Influence of dialect, sociolect, idiolect…
No universal agreement on (degree of) acceptability
Informants monitor their use of language (non-spontaneous)
Introspection is not observable

Intuition vs. Corpus
The corpus-based approach draws upon authentic or real texts
Computer-based analysis can retrieve differences that intuition alone cannot perceive
Reliable quantitative data

Should we dismiss intuition then?
Not at all! The key to using corpus data is to find the balance between the use of corpus data and the use of one’s own intuition.

Should we dismiss intuition then?
Not all research questions can be addressed by the corpus-based approach.
The corpus-based approach and the intuition-based approach ARE NOT MUTUALLY EXCLUSIVE.

Leech (1991:14) writes…
“[…] Neither the corpus linguist of the 1950s, who rejected intuition, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterise the many successful corpus analyses of recent years”.

Is CL a methodology or a theory?
No universal agreement.
CL is a METHODOLOGY and not an independent branch of linguistics such as semantics, pragmatics, syntax, etc.
CL can be employed to explore almost any area of linguistic research.

Corpus-based or Corpus-driven approaches?
Corpus-based approaches are used to “expound, test or exemplify theories and descriptions that were formulated before large corpora became available to inform language study” (Tognini-Bonelli 2001:65).
Therefore, corpus-based linguists are not strictly committed to corpus data, and they would discard “inconvenient evidence” by insulation, standardisation and instantiation (i.e. via corpus annotation).

Corpus-based or Corpus-driven approaches?
Corpus-driven linguists are “strictly committed to the integrity of the data as a whole”.
Theoretical statements are fully consistent with, and reflect directly, the evidence provided by the corpus (Tognini-Bonelli 2001:84-85).

Corpus-based or Corpus-driven approaches?
The distinction is overstated: they are 2 idealized extremes.
4 basic differences between the 2 approaches:
Types of corpora used
Attitudes towards theories and intuitions
Focuses of research
Paradigmatic claims

C.B. Approaches
Corpus must be representative and balanced;
Size is not all-important;
Minimum frequency is used to exclude non-relevant results;
In favour of corpus annotation;
CB approaches generally have existing theory as a starting point and correct and revise such theory in the light of corpus evidence;
Distinction between the different levels of language analysis.

C.D. Approaches
Corpus will balance itself when it grows to be big enough (cumulative representativeness);
Corpus must be very large;
Corpus evidence is exploited fully, but this way the number of the combinations is enormous;
Against corpus annotation (no preconceived theories);
No distinction between lexis, syntax, pragmatics, etc. There is only 1 level of language description: the functionally complete unit of meaning or language patterning.

A few key notions in Corpus Linguistics…
We will only refer to CORPUS-BASED APPROACHES.

Representativeness
Essential feature of a corpus.
Balance (the range of genres included in a corpus) and sampling (how the text chunks for each genre are selected) ensure representativeness.

Representativeness
A corpus is representative if…
…the findings based on its contents can be generalized to the said language variety (Leech 1991);
…its samples include the full range of variability in a population (Biber 1993).

Representativeness
It changes over time (Hunston 2002): if a corpus is not regularly updated, it rapidly becomes unrepresentative.

Representativeness
Criteria to select texts for a corpus:
External criteria (Biber’s situational perspective): defined situationally, e.g. genres, registers, text types, etc.
Internal criteria (Biber’s linguistic perspective): defined linguistically, taking into account the distribution of linguistic features. These are CIRCULAR, because a corpus is typically designed to study linguistic distribution, so there is no point in analysing a corpus where the distribution of linguistic features is predetermined.

Representativeness
2 main types (for the range of text categories represented):
General corpora: a basis for an overall description of a language (variety); their representativeness depends on the sampling from a broad range of genres.
Specialized corpora: domain- or genre-specific corpora; their representativeness can be measured by the degree of closure or saturation (lexical features).
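One rough way to look at closure or saturation is to count how many previously unseen word types each successive stretch of the corpus contributes: in a saturated corpus the counts fall towards zero. The sketch below is only an illustration under simplifying assumptions (whitespace tokenization, fixed-size segments, a made-up sentence); the function name `new_types_per_segment` is hypothetical and does not come from any of the cited authors.

```python
def new_types_per_segment(tokens, segment_size):
    """Count how many previously unseen word types each successive segment adds."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        # Types in this segment that have not occurred in any earlier segment.
        new_types = {token.lower() for token in segment} - seen
        counts.append(len(new_types))
        seen |= new_types
    return counts


# Toy "specialized" text (invented for illustration): vocabulary closes quickly.
text = "the pump feeds the tank and the tank feeds the pump and the pump feeds the tank"
tokens = text.split()
growth = new_types_per_segment(tokens, segment_size=6)
print(growth)
```

A general corpus would be expected to keep adding new types for much longer than a narrow, domain-specific one.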

Balance
The range of text categories included in the corpus.
The acceptable balance is determined by the intended uses.
A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration.

Balance
There is no scientific measure for balance.
It is more important for sample corpora than for monitor corpora.

Sampling
A corpus is a sample of a given population.
A sample is representative if what we find for the sample holds for the general population.
Samples are scaled-down versions of a larger population.

Sampling
Sampling unit: for written text, a sampling unit could be a book, periodical or newspaper.
Population: the assembly of all sampling units; it can be defined in terms of language production, reception (demographic, sex, age, etc.) or language as a product (category, genre of language data).
Sampling frame: the list of sampling units.

Sampling
Sampling techniques:
Simple random sampling: all sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers; rare features may not be accounted for.
Stratified random sampling: the population is divided into relatively homogeneous groups, i.e. the strata, and these are then sampled at random; never less representative than simple random sampling.
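The difference between the two techniques can be sketched in a few lines. This is a minimal illustration, not a corpus-building tool: the sampling frame, the genre labels and the function names are all hypothetical, and a seeded pseudo-random generator stands in for the table of random numbers.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units from the whole frame, ignoring any grouping."""
    rng = random.Random(seed)
    return rng.sample(frame, k)


def stratified_random_sample(frame, stratum_of, k_per_stratum, seed=0):
    """Group units into strata, then draw at random within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        units = strata[name]
        # Never ask for more units than a stratum actually contains.
        sample.extend(rng.sample(units, min(k_per_stratum, len(units))))
    return sample


# Hypothetical sampling frame: 90 news texts and 10 poems, as (id, genre) pairs.
frame = [(f"news{i}", "news") for i in range(90)] + [(f"poem{i}", "poetry") for i in range(10)]

srs = simple_random_sample(frame, 10)
strat = stratified_random_sample(frame, lambda unit: unit[1], 5)
print(sum(1 for _, genre in srs if genre == "poetry"), "poems in the simple random sample")
print(sum(1 for _, genre in strat if genre == "poetry"), "poems in the stratified sample")
```

With simple random sampling the rare genre may be drawn only once or not at all, whereas the stratified draw guarantees it a fixed number of texts.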

Sampling
Sample size:
Full texts = no balance; the peculiarity of individual texts may show through.
Text chunks are sufficient (e.g. 2000 running words): frequent linguistic features are stable in their distribution and hence short text chunks are sufficient for their study (Biber 1993).
Text-initial, middle and end samples must be balanced.
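Taking a fixed-size chunk from the beginning, middle or end of a text can be sketched as follows. The function name `sample_chunk`, the position labels and the dummy token list are assumptions made for illustration; the 2000-word default simply mirrors the example figure above.

```python
def sample_chunk(tokens, position, size=2000):
    """Return `size` running words from the start, middle, or end of a text."""
    if len(tokens) <= size:
        # Text shorter than the chunk size: keep it whole.
        return list(tokens)
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return tokens[start:start + size]


# Dummy 10,000-word text: w0, w1, ..., w9999.
tokens = [f"w{i}" for i in range(10_000)]
chunks = {position: sample_chunk(tokens, position) for position in ("initial", "middle", "end")}
```

Rotating the position across the texts of a category is one simple way to keep initial, middle and end samples balanced.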

Sampling
Proportion and number of samples: the number of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered as representative.
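Proportional allocation can be worked through with a small calculation. The category names and weights below are invented for illustration, and rounding by largest remainder is just one possible choice, not something the cited authors prescribe.

```python
def allocate_samples(population_weights, total_samples):
    """Split total_samples across categories in proportion to their weights."""
    total_weight = sum(population_weights.values())
    exact = {
        category: total_samples * weight / total_weight
        for category, weight in population_weights.items()
    }
    # Start from the whole-number part of each share...
    allocation = {category: int(share) for category, share in exact.items()}
    leftover = total_samples - sum(allocation.values())
    # ...then give any remaining samples to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda c: exact[c] - allocation[c], reverse=True)
    for category in by_remainder[:leftover]:
        allocation[category] += 1
    return allocation


# Hypothetical weights of three text categories in the target population.
weights = {"press": 50, "fiction": 30, "academic": 20}
allocation = allocate_samples(weights, 500)
print(allocation)
```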

What matters is the Research Question!
Claims of corpus representativeness and balance should be interpreted in relative terms as there is no objective way to balance a corpus or to measure its representativeness.
Representativeness is a fluid concept: the research question that one has in mind when building a corpus determines what is an acceptable balance for the corpus one should use and whether it is suitably representative.

Data collection
Spoken data must be transcribed from audio recordings.
Written text must be rendered machine-readable by keyboarding or OCR (Optical Character Recognition) scanning.
Language data so collected form a RAW CORPUS.

Corpus Mark-up
A system of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing and other processes.
Most widely used mark-up schemes:
TEI (Text Encoding Initiative)
CES (Corpus Encoding Standard)

Corpus Mark-up
It is essential in corpus-building because…
…sampled texts are out of context and mark-up allows contextual information to be recovered;
…it provides more information than the file names alone (re. text types, sociolinguistic variables, textual information – structure);
…it adds value to the corpus because it allows for a broader range of questions to be addressed;
…it allows editorial comments to be inserted during the corpus-building process.

Corpus Mark-up
Extra-textual and textual information must be kept separate from the corpus data.
Example: COCOA mark-up scheme
<A WILLIAM SHAKESPEARE>
A = author (attribute name)
WILLIAM SHAKESPEARE = attribute value
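A COCOA-style tag of this shape (attribute name, whitespace, attribute value, all inside angle brackets) can be pulled apart with a regular expression. This is a sketch that assumes only the simple form shown in the example; the pattern and the function name `parse_cocoa` are illustrative and do not cover the full COCOA scheme.

```python
import re

# Assumed shape: <NAME VALUE>, where NAME has no spaces and VALUE runs to the closing bracket.
COCOA_TAG = re.compile(r"<(?P<name>\S+)\s+(?P<value>[^>]*)>")


def parse_cocoa(line):
    """Return (attribute name, attribute value) pairs for each tag found in a line."""
    return [
        (match.group("name"), match.group("value").strip())
        for match in COCOA_TAG.finditer(line)
    ]


pairs = parse_cocoa("<A WILLIAM SHAKESPEARE>")
print(pairs)
```

Extracting the pairs in this way keeps the extra-textual information available without mixing it into the running text.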

TEI Mark-up Scheme
Each individual text is a document consisting of a header and a body, in turn composed of different elements.
E.g. in the header there are 4 main elements:
A file description <fileDesc>
An encoding description <encodingDesc>
A text profile <profileDesc>
A revision history <revisionDesc>
Tags can be nested, i.e. they can appear inside other elements.
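The nesting of header and body can be illustrated by building a skeleton document with Python's standard XML library. This is only a sketch: it uses TEI element names (`teiHeader`, `fileDesc`, `titleStmt`, etc.) but omits the namespace and several elements a valid TEI document requires, and the function name `build_tei_skeleton`, the title and the body text are made up.

```python
import xml.etree.ElementTree as ET


def build_tei_skeleton(title, body_text):
    """Build a simplified, NOT schema-valid, TEI-like document tree."""
    tei = ET.Element("TEI")
    header = ET.SubElement(tei, "teiHeader")

    # The file description holds bibliographic details such as the title.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # The other three main header elements, left empty in this sketch.
    for name in ("encodingDesc", "profileDesc", "revisionDesc"):
        ET.SubElement(header, name)

    # The text itself lives in the body, separate from the header.
    text = ET.SubElement(tei, "text")
    body = ET.SubElement(text, "body")
    ET.SubElement(body, "p").text = body_text
    return tei


doc = build_tei_skeleton("Sample text", "Language data goes here.")
xml_string = ET.tostring(doc, encoding="unicode")
header_children = [child.tag for child in doc.find("teiHeader")]
print(xml_string)
```

Here `title` sits inside `titleStmt`, which sits inside `fileDesc`, which sits inside the header: an instance of tags appearing inside other elements.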

TEI Mark-up Scheme
It can be expressed using a number of different formal languages:
SGML (Standard Generalized Mark-up Language, used by the BNC)
XML (Extensible Mark-up Language)

CES Mark-up Scheme
Designed specifically for the encoding of language corpora.
Document-wide mark-up (bibliographical description, encoding description, etc.)
Gross structural mark-up (volume, chapter, paragraph, footnotes, etc.; specifies recommended character sets)
Mark-up for subparagraph structures (sentence, quotations, words, abbreviations, etc.)

CES Mark-up Scheme
It specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation as well as general architecture.
3 levels of standardization designed to achieve the goal of universal document interchange:
Metalanguage level
Syntactic level
Semantic level

Corpus Annotation
Necessary in order to extract relevant information from corpora.
“The process of adding […] interpretive, linguistic information to an electronic corpus of spoken and/or written language data” (Leech 1997)

Annotation vs. Mark-up
Corpus mark-up provides objective, verifiable information.
Annotation is concerned with interpretive linguistic information.

The advantages of annotation
1. It makes extracting information easier and faster, and enables human analysts to exploit and retrieve analyses of which they are not themselves capable.

The advantages of annotation
2. Annotated corpora are reusable resources.
3. Annotated corpora are multifunctional: they can be annotated for one purpose and reused for another.

The advantages of annotation
4. Corpus annotation records a linguistic analysis explicitly.
5. Corpus annotation provides a standard reference resource, a stable base of linguistic analyses, so that successive studies can be compared and contrasted on a common basis.

Criticisms of corpus annotation
Annotation produces cluttered corpora
Annotation imposes an analysis
Annotation overvalues corpora, making them less accessible
Is annotation accurate and consistent?

How are corpora annotated?
Automatic annotation
Computer-assisted annotation
Manual annotation
Sinclair (1992): the introduction of the human element in corpus annotation reduces consistency.

Types of annotation
Different types of annotation can be carried out with different means.
For some types automatic annotation is very accurate.
Other types require post-editing, i.e. human correction.
