Building and Mining the DECTE Corpus: Information and Communication Technologies for the study of North East Dialects

Hermann Moisl

Karen Corrigan

Newcastle University, UK


This presentation describes the nature, content, and construction of the Diachronic Electronic Corpus of Tyneside English (DECTE), and gives an example of how it can be used in computationally-based linguistic analysis.

The discussion is in 4 parts:

1. Overview

2. Content

3. Structure

4. Analysis


1. Overview

  • DECTE is a corpus of dialect speech from Tyneside in North-East England.
  • It is based on two pre-existing corpora:
  • 1. The Newcastle Electronic Corpus of Tyneside English (NECTE) completed in 2004. This itself combined two earlier corpora:
    • The Tyneside Linguistic Survey collected in the late 1960s
    • The Phonological Variation and Change in Contemporary Spoken English collected in 1994
  • 2. The NECTE2 corpus collected between November 2007 and September 2011

DECTE amalgamates these components into a single Text Encoding Initiative (TEI)-conformant XML-encoded corpus and makes them available in a variety of aligned formats:

  • digitized audio
  • standard orthographic transcription
  • phonetic transcription
  • part-of-speech tagged transcription
  • DECTE thereby constitutes a rare example of a corpus presenting dialect material spanning five decades.
  • The DECTE website describes the corpus in detail, and makes it available to academic researchers, educationalists, the media in non-commercial applications, and organisations such as language societies and individuals with a serious interest in historical dialect materials.

2. Content

  • The DECTE content is provided in several types of representation, though not all components have all the types of representation:
  • Audio
    • This is the DECTE base representation, and all components have it in the form of *.wav files containing spoken interviews with informants.
  • Orthographic transcription
    • All DECTE audio files have been transcribed into standard English orthography
  • Part-of-speech tagged orthographic transcription
    • The orthographic transcriptions of the TLS and the PVC (but not the NECTE2) audio were part-of-speech tagged by the University Centre for Computer Corpus Research on Language (UCREL) at the University of Lancaster, UK, using the CLAWS4 tagger.
  • Phonetic transcription
    • 64 of the TLS audio files were phonetically transcribed using a transcription scheme similar but not identical to IPA.

3. Structure

  • DECTE is formatted in Text Encoding Initiative (TEI) conformant XML, using the current P5 TEI Guidelines at
  • To be TEI-conformant, an XML document has to be validated relative to a schema that is consistent with the published TEI Guidelines (Guidelines 23.3.2), which implies that only the TEI-defined XML tag set and tag syntax are used in the document. DECTE uses the XML Document Type Definition (DTD) schema language and has been validated using the oXygen XML editor.
  • The file decte.xml is the main DECTE file. It specifies the structure of the corpus, and has three components:
    • The XML declaration
    • The Document Type Definition (DTD)
    • The document content
  • What follows outlines the main structural features.

3.1 XML declaration

  • The decte.xml begins with the following XML declaration:
  • <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  • where:
  • The version specification is standard and requires no comment.
  • The encoding specification says that UTF-8 is used: the Unicode encoding that represents the universal 7-bit ASCII characters in a single byte; see the Wikipedia entry on Unicode.
  • The 'standalone' specification makes explicit that the document refers to an external DTD.

3.2 Document type definition

A valid as opposed to merely well-formed XML document must include a DTD in relation to which the document can be validated.

This is done by means of a document type or DOCTYPE declaration in which the XML element, attribute, and other tags used in the document are specified.

This specification can be internal in the sense that its components appear lexically within the DOCTYPE declaration, or external in that the names of one or more files containing the specification are given, or it can be a combination of the two.

In the present case the specification takes the form of references to external files, some of which are a selection of TEI module files containing the tag sets used in the corpus, and some of which are files containing the names of the content files which constitute the corpus.


<!-- DTD declaration -->
<!DOCTYPE teiCorpus SYSTEM "tei.dtd" [
<!-- Additions to core entities in tei.dtd -->
<!ENTITY % TEI.core 'INCLUDE'>
<!ENTITY % TEI.corpus 'INCLUDE'>
<!ENTITY % TEI.header 'INCLUDE'>
<!ENTITY % TEI.textstructure 'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.analysis 'INCLUDE'>
<!ENTITY % TEI.namesdates 'INCLUDE'>
<!ENTITY % TEI.spoken 'INCLUDE'>
<!-- Additional ENTITY declarations -->
<!-- Interview XML files -->
<!ENTITY % interviews SYSTEM "interviews.ent">
%interviews;
<!-- Interview audio files -->
<!ENTITY % audiofiles SYSTEM "audiofiles.ent">
%audiofiles;
]>


  • <!DOCTYPE teiCorpus SYSTEM "tei.dtd" [...]> is the DOCTYPE declaration in which
  • teiCorpus names the root element of the document to which the DTD applies
  • SYSTEM "tei.dtd" says that the required DTD definitions are available locally via the file tei.dtd. To understand the role of this file, one has to realize that the TEI DTD is partitioned into modules that can be selected according to the requirements of particular applications, thus obviating the need to include the entire DTD in situations where its full range is not required; tei.dtd is a 'driver' file which refers to these TEI DTD files
  • the square brackets [ ] enclose the DECTE-specific selections from the TEI DTD and some DECTE-specific <ENTITY> declarations. These are described in what follows.


  • The <ENTITY> declarations from <!ENTITY % TEI.core 'INCLUDE'> to <!ENTITY % TEI.spoken 'INCLUDE'> select the parts of the full TEI DTD provided by TEI that are relevant to DECTE.
  • The additional <ENTITY> declarations are DECTE-specific additions to the TEI DTD.
  • <!ENTITY % interviews SYSTEM "interviews.ent"> %interviews; declares, and then includes, the list of the XML-formatted files included in DECTE.
  • <!ENTITY % audiofiles SYSTEM "audiofiles.ent"> %audiofiles; does the same for the list of the audio files included in DECTE.

DECTE is a language corpus, and is therefore regarded by the TEI Guidelines as a composite document consisting of more than one discrete subtext.

This composite structure is reflected in the specification of the DECTE document structure in decte.xml.

This has two main components within the <teiCorpus> tag, which TEI uses to define composite documents: a header which includes metadata about the corpus as a whole, and a list of the documents which constitute the corpus.


  • <teiHeader>
    • <!-- Header content -->
  • </teiHeader>
  •  <!-- List of constituent texts -->



The global header <teiHeader> contains metadata descriptive of the corpus as a whole. This is in four main parts:


  • <fileDesc>
    • <!-- Bibliographical description of the document such as its title, author(s), funding body, distribution arrangements, and the sources on which it is based-->
  • </fileDesc>
  • <encodingDesc>
    • <!--Editorial principles and practice used to generate the corpus -->
  • </encodingDesc>
  • <profileDesc>
    • <!-- Contextual information about the corpus, such as where and why it was compiled, by whom and for what purpose, and so on-->
  • </profileDesc>
  • <revisionDesc>
    • <!--Logs the history of any revision to the corpus -->
  • </revisionDesc>



The list of the 106 constituent texts included in DECTE is a sequence of entity references defined by the interviews.ent file included in the DTD.

  • &decten1tlsg01; &decten1tlsg02; … &decten2y10i026;
  • All texts have the same structure:
  • <TEI xml:id="decten1tlsg01">
    • <teiHeader type="text">
      • <!--Header information -->
    • </teiHeader>
    • <text>
      • <!-- Content -->
    • </text>
  • </TEI>
  • where:
  • Each <TEI> element contains a single TEI-conformant document comprising a header and a text, and is uniquely identified by an 'xml:id' attribute whose value in the present case is the name of one of the above text entity references.
  • <teiHeader> contains information specific to the interview.
  • The <text> element contains the text of the interview.

It has already been noted that the DECTE content is provided in several types of representation (audio, orthographic transcription, part-of-speech tagged orthographic transcription, and phonetic transcription) and that not all types of representation are available across all the TLS, PVC, and NECTE2 components.

To encode these in a TEI-conformant way, each <text> is regarded as a composite whose components are enclosed in the <group> element, which, in the words of the TEI Guidelines, 'contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose'.

  • <text>
    • <group>
    • <!-- a sequence of five <text> elements -->
      • <text xml:id="decten1tlsg01audio">
        • <body>
          • <!-- content -->
        • </body>
      • </text>
      • <text xml:id="decten1tlsg01necteortho">
        • <body>
          • <!-- content -->
        • </body>
      • </text>
      • <text xml:id="decten1tlsg01phonetic">
        • <body>
          • <!-- content -->
        • </body>
      • </text>
      • <text xml:id="decten1tlsg01tagged">
        • <body>
          • <!-- content -->
        • </body>
      • </text>
    • </group>
  • </text>
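The <group>-based layout above can be explored with a short script. The following is a minimal Python sketch, not part of DECTE itself: the SAMPLE document is a hypothetical miniature built from the skeleton above, and the function name representations is an assumption made for illustration. It lists the 'xml:id' values of the <text> elements grouped under one interview.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature of one interview, following the skeleton above.
SAMPLE = """
<TEI xml:id="decten1tlsg01">
  <teiHeader type="text"/>
  <text>
    <group>
      <text xml:id="decten1tlsg01audio"><body/></text>
      <text xml:id="decten1tlsg01necteortho"><body/></text>
      <text xml:id="decten1tlsg01phonetic"><body/></text>
      <text xml:id="decten1tlsg01tagged"><body/></text>
    </group>
  </text>
</TEI>
"""


def representations(tei_xml):
    """Map the interview id to the ids of the <text> elements in its <group>."""
    root = ET.fromstring(tei_xml.strip())
    interview_id = root.get(XML_ID)
    grouped_texts = root.findall("./text/group/text")
    return {interview_id: [t.get(XML_ID) for t in grouped_texts]}


result = representations(SAMPLE)
print(result)
```

Because each representation is an ordinary <text> element, a component that lacks one type of representation simply yields a shorter list.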


Non-text content like audio and graphics cannot appear explicitly in XML documents, but they can be embedded using a referencing mechanism that an XML processor is able to interpret appropriately.

The TEI Guidelines do not at present provide an obvious embedding mechanism for audio, so for the time being the only content here is a reference to an audio file entity defined by the audiofiles.ent file in the DTD.

For example:

<text xml:id="decten1tlsg01audio">

  • <body>
    • &decten1tlsaudiog01;
  • </body>



Orthographic transcription

<text xml:id="decten1tlsg01necteortho">

  • <body>
    • <u who="informantTlsg01"> <anchor id="tlsg01necteortho0000"/>t l s what's that </u>
    • <u who="interviewerTlsg01"> g </u>
    • <u who="informantTlsg01"> e <pause/> five two </u>
    • <u who="interviewerTlsg01"> thanks <pause/> ta <pause/> eh could you tell us first eh where you were born please <event desc="interruption"/> <unclear/> </u>
    • <u who="informantTlsg01"> i was born at eleven victoria street <pause/> gateshead </u>
    • <u who="interviewerTlsg01"> eh <pause/> aye yeah whereabouts is that again...
  • </body>



Phonetic transcription

<text xml:id="decten1tlsg01phonetic">

  • <body>
    • <u who="informantTlsg01"><anchor id="tlsg01phonetic0000"/>01304 02941 02641 02201 00626 02741 08760 02301 02081 02781 00244 02561 02021 02741 02561 00144 02421 02263 00626 02861 17801 02621 02262 02861 00023 02301 02442 01123 02301 02623 02365 02603 00342 02301 09040 02521 00823 02623 02442 11202 02741 02623 09030 08440 08580 02603 02541 02801 00342 02301 28803...
  • </body>



Part-of-speech tagged

<text xml:id="decten1tlsg01tagged">

  • <body>
    • <u who="informantTlsg01"><anchor id="0000"/> <s> <w type="VVN" lemma="see"> seen </w> <w type="II" lemma="at"> at </w> <w type="AT" lemma="the"> the </w> <w type="NN2" lemma="picture"> pictures </w> <w type="VVBDZ" lemma="be"> was </w> <w type="UH" lemma="ehm"> ehm </w> <w type="RR" lemma="so"> so </w> <w type="PPIS1" lemma="i"> i </w> <w type="VVD" lemma="marry"> married </w> <w type="AT1" lemma="an"> an </w> <w type="NN1" lemma="axe"> axe </w> <w type="NN1" lemma="murderer"> murderer </w>...
  • </body>
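To illustrate how the tagged representation might be consumed, here is a minimal Python sketch that pulls (word, tag, lemma) triples out of <w> elements. The fragment is adapted from the excerpt above: whitespace between attributes and the closing </s> and </u> tags are supplied here so that it parses, and the function name tagged_words is an assumption, not part of any DECTE tooling.

```python
import xml.etree.ElementTree as ET

# Adapted from the excerpt above; closing tags are added because the
# excerpt itself is truncated.
FRAGMENT = """
<u who="informantTlsg01"><s>
  <w type="VVN" lemma="see"> seen </w>
  <w type="II" lemma="at"> at </w>
  <w type="AT" lemma="the"> the </w>
  <w type="NN2" lemma="picture"> pictures </w>
</s></u>
"""


def tagged_words(utterance_xml):
    """Return (word, part-of-speech tag, lemma) for every <w> element."""
    root = ET.fromstring(utterance_xml.strip())
    return [
        ((w.text or "").strip(), w.get("type"), w.get("lemma"))
        for w in root.iter("w")
    ]


words = tagged_words(FRAGMENT)
for word, tag, lemma in words:
    print(word, tag, lemma)
```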




Within the <group>-based structure of the individual <text>, the real-time alignment scheme is implemented using the <anchor> tag.

In each tag, the final digits of the 'xml:id' attribute value specify a real-time offset from the start of the audio file in question. The tag <anchor xml:id="decten1tlsg01necteortho0040"/>, for example, marks a place in the NECTE orthographic transcription corresponding to a time offset of 40 seconds from the start of the corresponding audio file decten1tlsaudiog01.

Since, for a given text, such markers are inserted into the correct places not only in the NECTE orthographic but also in the various other transcribed representations of it (TLS orthographic, phonetic, tagged), an XML processor can use them to align all the <text>s enclosed by <group>. For example:

<anchor xml:id="decten1tlsg01necteortho0020"/>where do you mean by that eh </u><u who="informantTlsg01"> that's ehm <pause/> down by eh clarkchapman's </u><u who="interviewerTlsg01"> oh aye like saltmeadows</u><u who="informantTlsg01"> yes saltmeadows </u><u who="interviewerTlsg01"> <unclear/> whereabouts else have you lived since then you know i mean how long did you stay there </u><u who="informantTlsg01"> five year <anchor xml:id="decten1tlsg01necteortho0040"/>

<anchor xml:id="decten1tlsg01phonetic0020"/>02081 02301 08580 02322 01443 02741 02201 01284 08580 02383 02801 00421 02421 02501 00342 02164 02721 02021 02741 02642 04321 02621 00503 02825 02301 02721 00246 02341 12601 02642 02541 01284 02561 02881 01641 … <anchor xml:id="decten1tlsg01phonetic0040"/>
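The alignment idea shown in the two excerpts above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that the last four digits of each anchor's 'xml:id' give the offset in seconds, as in the 40-second example, and it works on shortened copies of the excerpts with a regular expression rather than a full XML parser. The names segments and align are hypothetical.

```python
import re

# Assumption: an anchor id ends in a four-digit offset in seconds.
ANCHOR = re.compile(r'<anchor xml:id="(\w*?)(\d{4})"/>')


def segments(text):
    """Map each anchor offset to the material between it and the next anchor."""
    matches = list(ANCHOR.finditer(text))
    result = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        result[int(current.group(2))] = text[current.end():end].strip()
    return result


def align(first, second):
    """Pair up the segments of two representations that share an offset."""
    seg_a, seg_b = segments(first), segments(second)
    shared = sorted(set(seg_a) & set(seg_b))
    return [(offset, seg_a[offset], seg_b[offset]) for offset in shared]


# Shortened copies of the orthographic and phonetic excerpts above.
ortho = ('<anchor xml:id="decten1tlsg01necteortho0020"/>where do you mean by that eh '
         '<anchor xml:id="decten1tlsg01necteortho0040"/>')
phonetic = ('<anchor xml:id="decten1tlsg01phonetic0020"/>02081 02301 08580 '
            '<anchor xml:id="decten1tlsg01phonetic0040"/>')

aligned = align(ortho, phonetic)
for offset, words, codes in aligned:
    print(offset, repr(words), repr(codes))
```

The segment that opens at offset 20 pairs the orthographic words with the phonetic codes for the same stretch of audio; the segment at offset 40 is empty here only because the sample strings stop at that anchor.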


4. Analysis

Much of the analysis of the NECTE / DECTE corpora by their creators has focussed on cluster analysis of the phonetic transcriptions.

The aim has been to develop a methodology for generating linguistic hypotheses from structure discovered in data abstracted from linguistic corpora, here the phonetic transcriptions, using a variety of cluster analytical techniques.

The last part of this talk briefly describes this work.

4.1 Research question

4.2 Data creation and transformation

4.3 Cluster analysis

4.4 Hypothesis formulation


4.1 Research question

    • Is there systematic phonetic variation in the Tyneside speech community as represented by the Tyneside Linguistic Survey speakers in the Newcastle Electronic Corpus of Tyneside English (NECTE), and, if so, does that variation correlate systematically with social variables associated with the speakers?

4.2 Data creation and transformation

To answer the question, we counted the frequency with which each speaker uses each of the 158 phonetic segments in the TLS transcription scheme.

These frequencies were recorded in a 64 x 158 matrix M in which each of the 64 rows represents a speaker, each of the 158 columns represents a phonetic segment, and the value at M(i,j) is the number of times speaker i uses segment j.
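The construction of M can be sketched as follows. This is a toy Python illustration under stated assumptions: each whitespace-separated numeric code in a phonetic transcription is treated as one segment, the speaker labels and code inventory are invented for the example, and the function name frequency_matrix is hypothetical. The real matrix has 64 rows and 158 columns.

```python
from collections import Counter


def frequency_matrix(transcriptions, segment_codes):
    """Build M, where M[i][j] counts how often speaker i uses segment j.

    Rows follow the sorted speaker labels; columns follow segment_codes.
    """
    speakers = sorted(transcriptions)
    matrix = []
    for speaker in speakers:
        counts = Counter(transcriptions[speaker].split())
        matrix.append([counts[code] for code in segment_codes])
    return speakers, matrix


# Invented toy input: two speakers and an inventory of three segment codes.
toy_transcriptions = {
    "speakerA": "02301 02741 02301",
    "speakerB": "02741 02561",
}
toy_codes = ["02301", "02561", "02741"]

speakers, M = frequency_matrix(toy_transcriptions, toy_codes)
print(speakers)
print(M)
```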


Issues addressed with respect to this data have been:

  • Normalization of data values to compensate for different interview lengths among speakers.
  • Dimensionality reduction to derive a maximally compact representation of phonetic variation in the data.
  • Detection of nonlinearity in the data, which, if found, necessitates nonlinear cluster analytical methods.
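As an illustration of the first issue, one simple way to compensate for interview length is to convert each speaker's counts into relative frequencies. The talk does not say which normalization was actually applied, so the Python sketch below is only an assumed example, and normalize_rows is a hypothetical name.

```python
def normalize_rows(matrix):
    """Divide every count by its row total, giving relative frequencies."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # A speaker with no counts at all is left as a row of zeros.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Two invented speakers with the same usage pattern but a fivefold
# difference in interview length.
counts = [[2, 0, 2], [10, 0, 10]]
profiles = normalize_rows(counts)
print(profiles)
```

After normalization the two rows are identical, so the difference in interview length no longer separates the speakers.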

4.3 Cluster analysis

Cluster analysis methods use relative distance among vectors in a space to group the vectors into clusters.

Specifically, for a given set of vectors in a space, they first calculate the distances between all pairs of vectors, and then group into clusters all the vectors that are relatively close to one another in the space and relatively far from those in other clusters.

'Relatively close' and 'relatively far' are, of course, vague expressions, but they are precisely defined by the various clustering methods, and for present purposes we can avoid the technicalities and rely on intuitions about relative distance.

For concreteness, we will concentrate on one particular class of methods: hierarchical cluster analysis. Hierarchical cluster analysis represents the relativities of distance among vectors as a constituency tree. The figure on the next slide exemplifies this.
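The two steps just described, computing pairwise distances and then grouping nearby vectors into a tree, can be sketched in plain Python. The sketch assumes Euclidean distance and average linkage; the talk does not state which distance measure or linkage was used, and the four two-dimensional vectors are invented purely for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(members_a, members_b, vectors):
    """Mean distance over all cross-cluster pairs of vectors."""
    distances = [euclidean(vectors[i], vectors[j])
                 for i in members_a for j in members_b]
    return sum(distances) / len(distances)


def hierarchical_cluster(vectors, labels):
    """Merge the two closest clusters repeatedly; return a nested-tuple tree."""
    # Each cluster is a pair: (indices of its member vectors, its subtree).
    clusters = [((i,), labels[i]) for i in range(len(vectors))]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]][0], clusters[pair[1]][0], vectors),
        )
        merged = (clusters[a][0] + clusters[b][0],
                  (clusters[a][1], clusters[b][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][1]


# Invented toy vectors: two tight pairs that lie far apart from each other.
toy_vectors = [[0, 0], [0, 1], [10, 10], [10, 12]]
toy_labels = ["s1", "s2", "s3", "s4"]

tree = hierarchical_cluster(toy_vectors, toy_labels)
print(tree)
```

The nested tuples play the role of the constituency tree: vectors merged early sit deep in the tree together, and the final merge at the top separates the two main groups.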


Each row of the data matrix M is a phonetic usage profile for a single TLS speaker.

  • The aim is to see if there is any systematic similarity structure among the 64 speakers.
  • Plotting M in 158-dimensional space would have been impossible, and, without cluster analysis, one would have been left pondering a very large and incomprehensible matrix of numbers.
  • With the aid of cluster analysis, however, structure in the data is clearly visible:
  • There are two main clusters, NG1 and NG2:
  • NG1 consists of large subclusters NG1a and NG1b
  • NG1a itself has two main subclusters NG1a(i) and NG1a(ii).

4.4 Hypothesis formulation

The fundamental observation is that, because the row vectors of M are phonetic profiles of the TLS speakers, the cluster structure means that the speakers fall into clearly defined groups with specific interrelationships rather than, for example, being randomly distributed around the phonetic space.

Since direct plotting in 158-dimensional space is impossible, one would have had to spend a long time contemplating a very large numerical matrix to get that result.

Cluster analysis gave the result quickly and easily, and moreover did so in a way that is scientifically respectable in that the analysis is replicable by anyone.

The first part of the research question is therefore answered in the affirmative, and this permits statement of an empirically-based hypothesis:

There is systematic phonetic variation in the Tyneside speech community as represented by the Tyneside Linguistic Survey speakers in the Newcastle Electronic Corpus of Tyneside English (NECTE).


The NECTE corpus contains information about the social characteristics of the speakers: age, occupation, educational level, and so on.

This information can be correlated with the cluster structure to see if that structure has an interesting sociolinguistic interpretation.
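One straightforward way to carry out such a comparison is to cross-tabulate cluster membership against a social attribute. The Python sketch below uses entirely invented speakers, cluster labels, and occupations; it is not DECTE data, and the function name crosstab is hypothetical.

```python
from collections import Counter, defaultdict


def crosstab(cluster_of, attribute_of):
    """Count, for each cluster, how many speakers have each attribute value."""
    table = defaultdict(Counter)
    for speaker, cluster in cluster_of.items():
        # Speakers without social data are counted under "unknown".
        table[cluster][attribute_of.get(speaker, "unknown")] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


# Invented illustration: four speakers in two clusters.
toy_clusters = {"sp1": "A", "sp2": "A", "sp3": "B", "sp4": "B"}
toy_occupations = {"sp1": "manual", "sp2": "manual", "sp3": "administrative"}

table = crosstab(toy_clusters, toy_occupations)
print(table)
```

A cluster whose counts concentrate in one attribute value suggests that the phonetic grouping has a social interpretation worth examining further.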

There is a close correlation between the cluster structure and the gender, educational, and occupational attributes of the speakers.

The main phonetic distinction is between clusters NG1 and NG2:

NG2 corresponds to a small group of speakers from Newcastle on the north shore of the Tyne for whom no social data is available, but who are known to have been male and female academics.

NG1 comprises mainly but not exclusively working class speakers from Gateshead on the south shore of the Tyne.


The Gateshead speakers are subclustered into

  • NG1a, which contains a mix of male and female manual workers with minimal education and of male and female administrative workers with additional education, and
  • NG1b, which consists of male manual workers and a single female manual worker with minimal education
  • NG1a further subclusters the manual and the administrative workers into NG1a(i) and NG1a(ii) respectively; and so on.
  • The answer to the second part of the research question is therefore affirmative, which allows the empirically based hypothesis to be extended:

There is systematic phonetic variation in the Tyneside speech community as represented by the Tyneside Linguistic Survey speakers in the Newcastle Electronic Corpus of Tyneside English (NECTE), and that variation correlates systematically with social variables associated with the speakers.