More about Corpus Linguistics PALA Summer School, Maribor, 2014

Introduction We will look at ... • Basic concepts and terminology • Sampling and representativeness • Annotation and mark-up

Characteristics of corpus linguistics according to Biber, Conrad & Reppen (1998: 4) • Uses a corpus • Uses computers for analysis • Empirical – analysing actual patterns of language use • Depends on quantitative and qualitative analytical techniques

Methodology vs. theory Two main views: • Methodologist: CL is a methodology for studying large amounts of language data using computer software • Neo-Firthian: CL is a sub-discipline of linguistics, concerned with explaining relationships between meaning and structure in language

Characteristics of a corpus according to McEnery and Wilson (2001) • Machine-readable form • Very large • Representative sample • (Standard reference) • Often annotated

Machine readable form • Nowadays, corpus = machine readable • Corpora tend to sit on a computer • Not always the case

Very large • Corpora are usually very large: tens of thousands, hundreds of thousands or millions of words. • Usually a finite size • Size decided at design stage – when size reached, data collection stops. • Exception – monitor corpus (open-ended; texts keep being added) • E.g. COBUILD Corpus (Birmingham, UK) • Dictionary compiling

A representative sample • Corpora are so big that they can be a ‘representative sample’ of a language or a language variety • Also depends a lot on design of corpus • Consider sampling • which texts will be sampled • size of samples • number of samples

A representative sample • Written language: extracts from books, magazines, newspapers, websites… • Spoken language: transcripts of meetings, lectures, radio programs, everyday conversations… • Something more specific ...

A representative sample Yorkshire English • What time frame are you going to sample? • Speech or writing or both? • Source of language data?

(Standard reference) • A corpus might be a standard reference or a ‘benchmark’ for a particular variety of language against which other texts or corpora can be compared

Annotation • Just the words on their own = ‘raw text’ • Annotation = extra information about what is in the corpus • Helps with the analysis of the data • Annotation also known as tagging (generally) or mark-up

Annotation Information about the text: • Where it came from • Who produced it • Genre • Etc.

Example
<File id="J2.1">
  <Header>
    <Title> The spoyle of Antwerpe</Title>
    <Author> George Gascoigne </Author>
    <PubDate> 1576 </PubDate>
    <Source> EEBO </Source>
    <Words> 2112 </Words>
    <Comment> to the end </Comment>
  </Header>
  <Text> ........ ........ </Text>
</File>
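Header mark-up like this can be read back by a program. A minimal sketch, assuming the file is well-formed XML with straight quotes around attribute values; the names SAMPLE and read_header are made up for illustration and the text body is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content modelled on the example header (text body elided).
SAMPLE = """<File id="J2.1">
  <Header>
    <Title> The spoyle of Antwerpe</Title>
    <Author> George Gascoigne </Author>
    <PubDate> 1576 </PubDate>
    <Source> EEBO </Source>
    <Words> 2112 </Words>
    <Comment> to the end </Comment>
  </Header>
  <Text> ... </Text>
</File>"""


def read_header(xml_string):
    """Return the file id and a dict of header fields, whitespace stripped."""
    root = ET.fromstring(xml_string)
    header = root.find("Header")
    fields = {child.tag: (child.text or "").strip() for child in header}
    return root.get("id"), fields


file_id, metadata = read_header(SAMPLE)
print(file_id, metadata["Author"], metadata["PubDate"])
```

With the metadata in a dictionary, texts can be filtered by author, date or source before any analysis starts.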

Annotation Adding information to the body of the text: • e.g. Gender of speaker • e.g. Discourse presentation

Example and Bromssell having demanded that it should be free unto them to take againe their places, the first President did oppose it, saying, it would be time enough when all the informations are read. They thought this could be done this morning,

Example and Bromssell having demanded that it <mod type="d">should</mod> be free unto them to take againe their places, the first President did oppose it, saying, it <mod type="e">would</mod> be time enough when all the informations are read. They thought this <mod type="e">could</mod> be done this morning,
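Once annotation is in the body of the text, it can be pulled out and counted. A minimal sketch, assuming the tags always have the form <mod type="...">word</mod>; the type codes d and e are taken from the example as they stand, and the names TAGGED, MOD_PATTERN and modal_annotations are made up for illustration.

```python
import re
from collections import Counter

# The annotated extract from the example, shortened.
TAGGED = (
    'it <mod type="d">should</mod> be free unto them to take againe their '
    'places, the first President did oppose it, saying, it '
    '<mod type="e">would</mod> be time enough when all the informations are '
    'read. They thought this <mod type="e">could</mod> be done this morning,'
)

# Captures the type code and the annotated word of each <mod> element.
MOD_PATTERN = re.compile(r'<mod type="([^"]+)">([^<]+)</mod>')


def modal_annotations(text):
    """Return a list of (type_code, word) pairs for every <mod> tag."""
    return MOD_PATTERN.findall(text)


pairs = modal_annotations(TAGGED)
type_counts = Counter(code for code, _ in pairs)
print(pairs)
print(type_counts)
```

A regular expression is enough for flat tags like these; nested or more complex mark-up would call for a proper XML parser.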

Annotation • Annotation can be a manual process (takes ages) • But some linguistic annotation can be done automatically • e.g. word meaning (semantic) • e.g. grammatical class of each word in the corpus (noun, verb, etc.)

Linguistic Annotation: examples CLAWS • Constituent Likelihood Automatic Word-tagging System • Developed at Lancaster University • 96-97% accurate • Works out what Part Of Speech the word is and assigns a tag from a list of tags (a tagset)

Linguistic Annotation: examples CLAWS I liked him, and he was different from other boys, not at all pushy, except pushy to please I suppose, but even that was sweet in a way

Linguistic Annotation: examples CLAWS I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ different_JJ from_II other_JJ boys_NN2 ,_, not_XX at_RR21 all_RR22 pushy_JJ ,_, except_CS pushy_JJ to_TO please_VVI I_PPIS1 suppose_VV0 ,_, but_CCB even_RR that_DD1 was_VBDZ sweet_JJ in_II a_AT1 way_NN1
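Tagged output in this word_TAG format is easy to take apart again. A minimal sketch, assuming one space between word_TAG units and that the tag is whatever follows the last underscore; the names TAGGED and split_tagged are made up for illustration, and only the first part of the sentence is used.

```python
from collections import Counter

# Start of the tagged sentence, one space between word_TAG units.
TAGGED = (
    "I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ "
    "different_JJ from_II other_JJ boys_NN2"
)


def split_tagged(line):
    """Split 'word_TAG' units into (word, tag) pairs."""
    pairs = []
    for unit in line.split():
        # rpartition splits on the LAST underscore, so ',_,' gives (',', ',').
        word, _, tag = unit.rpartition("_")
        pairs.append((word, tag))
    return pairs


pairs = split_tagged(TAGGED)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs[:4])
print(tag_counts.most_common(3))
```

Counting tags instead of words gives a quick picture of the grammatical make-up of a text, for example how many adjectives (JJ) it contains.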

Characteristics of a corpus according to McEnery and Wilson (2001) • Machine-readable • Very large • Representative sample • (Standard reference) • Annotation

A corpus • …a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery & Wilson 2001: 32)

Why use a corpus? • Allows linguists to access quantitative information about language, which can often be used to support qualitative analysis. • Insights into language gained from corpus analysis are often generalisable in a way that insights gained from the qualitative analysis of small samples of data are not. • Using corpus data forces us to acknowledge how language is really used (which is often different from how we think it is used)

Exploiting a corpus Collocations • Collocation = relationship between words that tend to occur together • Words that tend to occur near word X are the collocates of word X • Based on frequencies • Statistical measures
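The frequency side of collocation can be sketched as counting the words that fall within a few tokens of the node word. This is only an illustration: the toy sentence, the span of 2 and the function name collocates are assumptions, and real tools apply a statistical measure on top of these raw counts.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = max(0, i - span)
        # Words before and after the node, excluding the node itself.
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts


# Invented toy data, not taken from any real corpus.
TOKENS = (
    "the juvenile court heard the case and "
    "the juvenile offenders left the court"
).split()

counts = collocates(TOKENS, "juvenile", span=2)
print(counts.most_common())
```

Raw counts favour very frequent words such as "the", which is why statistical measures are used to decide which co-occurrences are more than chance.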

Exploiting a corpus Collocations • Important in corpus linguistics • The company a word keeps can give that word implicit associations or assumptions

Exploiting a corpus Collocations • Juvenile = young, youthful, a young person • Collocates: delinquency, delinquent, delinquents, offenders, diabetes, crime, court • Juvenile has negative associations • Semantic prosody

Exploiting a corpus Collocations • Near-synonyms often differ in terms of their collocations

Exploiting a corpus Collocations • Young • Collocates: mums-to-be, bloods, nubile, hopefuls, impressionable, up-and-coming • Negative associations?

Exploiting a corpus Keywords • A keyword is a word which occurs in a text or corpus more frequently than you would expect by chance alone • … based on comparison with another (benchmark) corpus (e.g. the BNC) • … and the difference has to be statistically significant

Keyness • Text #1 wordlist and Text #2 wordlist go into a comparison process • Apply statistical test (e.g. Log Likelihood), calculated by the tool • Result: key words list, i.e. the over-represented (and under-represented) words in text #1 when compared with text #2 • Difference must be statistically significant
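The comparison step can be sketched for a single word. This assumes the two-cell log-likelihood formulation commonly used for keyword comparison (observed against expected frequency in each text); a particular tool may calculate it differently, and the function name and the figures are made up for illustration.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood score for one word across two texts.

    freq1, freq2: occurrences of the word in text #1 and text #2
    size1, size2: total number of words in text #1 and text #2
    """
    total = size1 + size2
    # Expected frequencies if the word were spread evenly over both texts.
    expected1 = size1 * (freq1 + freq2) / total
    expected2 = size2 * (freq1 + freq2) / total
    score = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Invented figures: 20 hits in a 1,000-word text against 5 in another.
ll = log_likelihood(20, 1000, 5, 1000)
same = log_likelihood(10, 1000, 10, 1000)
print(round(ll, 2), same)
```

A score above 3.84 corresponds to p < 0.05 with one degree of freedom. The score alone does not show the direction of the difference, so relative frequencies are compared to tell over-represented words from under-represented ones.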

Exploiting a corpus Keywords • A text’s keywords often point towards its content or its biases and/or can act as style markers (Enkvist 1973) • Keywords are often a good guide to what would be interesting to look at in more detail

Exploiting a corpus Keyness is “[...] a quality words may have in a given text or set of texts, suggesting that they are important [...]” (Scott and Tribble 2006: 55-6)

Summary The basic idea: • By analysing VERY large amounts of textual data, we can ... • establish norms about the variety of language being studied • test theories about language • spot common and rare language phenomena • reduce bias

Summary The computer can’t do it all for us – we still have to analyse the results and ask ... ‘What does it all mean?’
