
Isa Buchstaller Karen Corrigan Adam Mearns Hermann Moisl Newcastle University

The Diachronic Electronic Corpus of Tyneside English (DECTE): effective practice in using this E-Learning Tool

Presentation Transcript


  1. The Diachronic Electronic Corpus of Tyneside English (DECTE): effective practice in using this E-Learning Tool Isa Buchstaller Karen Corrigan Adam Mearns Hermann Moisl Newcastle University

  2. Introduction DECTE is an AHRC-funded linguistic corpus creation project with a twofold aim: 1. To develop an earlier, also AHRC-funded corpus called the Newcastle Electronic Corpus of Tyneside English (NECTE) as a research resource by adding new material and a range of interpretative tools. 2. To make DECTE accessible for educational applications in museums, secondary schools, further education, and higher education. This workshop is about using DECTE and linguistic corpora more generally as a teaching resource in higher education.

  3. Introduction The discussion is in 2 main parts: Part 1: Theory. The nature of linguistic corpora generally and DECTE in particular; the principles on which the DECTE approach to using corpora for teaching is based; implementation of the principles. Part 2: Practice. A practical session that applies the material covered in Part 1 to the DECTE corpus.

  4. Part 1. The nature of linguistic corpora A linguistic corpus is a collection of natural language text or speech designed specifically for research and/or teaching applications. Such collections have been made since the scientific study of language came into being, for example in Indo-European linguistics and philology. With the growing emphasis on the study of contemporary language in the wake of de Saussure and Chomsky in the 20th century, collections of contemporary text were made, such as the Brown Corpus in the USA and the Survey of English Usage in the UK. These early corpora were paper-based, but, as electronic media developed in the second half of the 20th century, recordings of analog speech and soon thereafter of digital speech and text were made. In recent decades there has been an explosive growth in the creation of corpora; many of these are listed by the Linguistic Data Consortium. DECTE is one of these.

  5. Part 1. The nature of linguistic corpora: DECTE DECTE, the Diachronic Electronic Corpus of Tyneside English, is currently being developed into an extensive collection of speech, text and images relating to Tyneside in North-East England. It updates the existing Newcastle Electronic Corpus of Tyneside English (NECTE), which combined and digitized the Tyneside Linguistic Survey (TLS) of the 1960s and the Phonological Variation and Change in Contemporary Spoken English (PVC) corpus of 1994. This material will be augmented with the addition of the monitor corpus NECTE2, which consists of interviews that have been conducted with a range of local informants since 2007. DECTE will constitute a very rare example of a publicly available on-line corpus presenting data spanning five decades in an interactive multi-media format. The visual element of DECTE will link images that capture topics and events of cultural significance, such as local industries and urban regeneration, with the narratives on these subjects found in the NECTE & NECTE2 corpora. We are developing this new resource as a tool that can be used by the general public and at all levels of the education sector from primary to higher. If you have any questions, or you want to share your views on how DECTE should look, please e-mail DECTE@ncl.ac.uk

  6. Part 1. Using corpora for teaching: fundamental principles 1. No gimmickry The use of corpora and associated presentational and interpretative software in the teaching of language and linguistics must be more than a gimmick. To make a structural impact, a coherent strategy is required. 2. Active learning If corpus-based teaching is to be effective, students must work directly with corpora and be given the tools to do so. 3. Motivation Students must be motivated to take corpus-based work seriously: 3.1 Academic motivation: A clear idea of why understanding of human language from various points of view is important within the context of other sciences, and how corpus-based analysis contributes to this. 3.2 Practical motivation: How corpus-based work provides transferrable skills that will be useful in building a career. 3.3 Methodology: A clear framework for undertaking corpus-based linguistic research.

  7. Part 1. Using corpora for teaching: implementation The DECTE project is implementing its principles in the following ways: (1) The academic research-oriented NECTE corpus, on which DECTE is largely based, is being augmented with additional material to make it more suitable for teaching. (2) The formatting of DECTE, TEI-conformant XML, makes it usable by the new generation of XML-aware interpretative software tools. (3) A structured methodology for teaching based on DECTE and similar corpora is being developed. This methodology is based on active engagement with the corpus using software tools, and emphasizes the transferrable skills element of this engagement. (1) and (2) are in progress. The remainder of this workshop describes (3).

  8. Part 1. Using corpora for teaching: methodology At the root of our methodology is the conviction, based on many years in university teaching, that students achieve best results when given a clearly defined research project with a clearly delineated procedure for undertaking and reporting on it. For DECTE, such clarity is provided by regarding the study of language as a science, and therefore adopting scientific methodology. This involves: 1. Providing a coherent framework for research-based project work. This is given by the currently dominant Popperian falsificationist methodology. 2. Formulating a research question. 3. Framing a hypothesis which answers the research question. 4. Testing the hypothesis. 5. Writing a research report using the standard scientific format. In what follows, the DECTE approach is exemplified by looking in detail at step (3).

  9. Part 1. Using corpora for teaching: hypothesis formulation The aim of science is to understand reality. An academic discipline, philosophy of science, is devoted to explicating the nature of science and its relationship to reality, and, perhaps predictably, both are controversial. In practice, however, most scientists explicitly or implicitly assume a view of scientific methodology based on the philosophy of Karl Popper in which one or more non-contradictory hypotheses about some domain of interest are stated, the validity of the hypotheses is tested by observation of the domain, and the hypotheses are either confirmed (but not proven) if they are compatible with observation, or rejected if they are not.

  10. Part 1. Using corpora for teaching: hypothesis formulation Linguistics is a science, and as such uses or should use scientific methodology. The research domain is human language, and, in the process of hypothesis generation, the data comes from observation of language use. Such observation can be based on introspection, since every native speaker is an expert in the usage of his or her language. It can also be based on observation of the linguistic usage of others in either spoken or written form. In some subdisciplines like historical linguistics, sociolinguistics, and dialectology, the latter is in fact the only possible alternative.

  11. Part 1. Using corpora for teaching: hypothesis formulation Traditionally, hypothesis generation based on linguistic corpora has involved the researcher listening to or reading through a corpus, often repeatedly, noting features of interest, and then formulating a hypothesis. The advent of information technology in general and of digital representation of text in particular in the past few decades has made this often-onerous process much easier via a range of computational tools. But, as the amount of digitally-represented language available to linguists has grown, a new problem has emerged: data overload. Actual and potential language corpora are growing ever-larger, and even now they can be on the limit of what the individual researcher can work through efficiently in the traditional way. Moreover, as we shall see, data abstracted from such large corpora can be impenetrable to understanding.

  12. Part 1. Using corpora for teaching: hypothesis formulation One approach to the problem is to deal only with corpora of tractable size, or, equivalently, with tractable subsets of large corpora, but ignoring potential data in so unprincipled a way is not scientifically respectable. The alternative is to use mathematically-based computational tools for data exploration developed in the physical and social sciences, where data overload has long been a problem. This latter alternative is the one presented here. Specifically, the discussion shows how a particular type of computational tool, cluster analysis, can be used in the formulation of hypotheses in corpus-based linguistic research. The discussion of cluster analysis is in three sections. The first describes data abstraction from corpora. The second outlines the principles of cluster analysis and how it can be used in the formulation of hypotheses. The third gives a brief account of how cluster analysis works.

  13. Section 1. Data abstraction Data are ontologically different from the world. The world is as it is. Data are an interpretation of it for the purpose of scientific study. The weather is not the meteorologist’s data; measurements of such things as air temperature are. A text corpus is not the linguist’s data; measurements of such things as average sentence length are. Data are constructed from observation of things in the world, and the process of construction raises a range of issues that determine the amenability of the data to analysis and the interpretability of the analytical results. The importance of understanding such data issues in cluster analysis can hardly be overstated. On the one hand, nothing can be discovered that is beyond the limits of the data itself. On the other, failure to understand relevant characteristics of data can lead to results and interpretations that are distorted or even worthless.

  14. Section 1. Data abstraction: the research question Data can only be created in relation to a research question that is defined on the domain of interest, and that thereby provides an interpretative orientation. Without such an orientation, how does one know what to observe, what is important, and what is not? The domain of interest in the present case is the 63 speakers in DECTE for which phonetic transcriptions exist. The research question is: Is there systematic phonetic variation in the Tyneside speech community as represented by DECTE, and, if so, does that variation correlate systematically with social variables?

  15. Section 1. Data abstraction: variable selection Given that data are an interpretation of some domain of interest, what does such an interpretation look like? It is a description of entities in the domain in terms of variables. A variable is a symbol, and as such is a physical entity with a conventional semantics, where a conventional semantics is understood as one in which the designation of a physical thing as a symbol together with the connection between the symbol and what it represents are determined by agreement within a community. The symbol ‘A’, for example, represents the phoneme /a/ by common assent, not because there is any necessary connection between it and what it represents. Since each variable has a conventional semantics, the set of variables chosen to describe entities constitutes the template in terms of which the domain is interpreted. Selection of appropriate variables is, therefore, crucial to the success of any data analysis.

  16. Section 1. Data abstraction: variable selection Which variables are appropriate in any given case? That depends on the nature of the research question. The fundamental principle in variable selection is that the variables must describe all and only those aspects of the domain that are relevant to the research question. In general, this is an unattainable ideal. Any domain can be described by an essentially arbitrary number of finite sets of variables; selection of one particular set can only be done on the basis of personal knowledge of the domain and of the body of scientific theory associated with it, tempered by personal discretion. In other words, there is no algorithm for choosing an optimally relevant set of variables for a research question.

  17. Section 1. Data abstraction: variable selection Which variables are suitable to describe the DECTE speakers? Since the research question is about the phonetic level, the first step is to partition each speaker's analog speech signal into a sequence of discrete phonetic segments and to represent those segments symbolically, or, in other words, to transcribe the audio interviews. To do this, one has to decide which features of the audio signal are of interest, and then to define a set of variables to represent those features. A transcription scheme was devised, a short example of which is shown on the next slide.

  18. Section 1. Data abstraction: variable selection Two levels of transcription were produced, a highly detailed narrow one designated ‘States’ in the above figure, and a superordinate ‘Putative Diasystemic Variables’ (PDV) level which collapsed some of the finer distinctions transcribed at the ‘States’ level. We shall be dealing with the less detailed PDV level.

  19. Section 1. Data abstraction: variable value assignment The semantics of each variable determines a particular interpretation of the domain of interest, and the domain is 'measured' in terms of the semantics: height in metres = 1.71, weight in kilograms = 70, and so on. Measurement is fundamental in the creation of data because it makes the link between data and the world, and thus allows the results of data analysis to be applied to the understanding of the world. Cluster analysis methods assume numeric measurement as the default case, and for that reason the same is assumed in what follows. Specifically, we shall be interested in the number of times each speaker uses each of the DECTE phonetic variables. The speakers are therefore 'measured' in terms of the frequency with which they use these segments.

  20. Section 1. Data abstraction: representation If they are to be analyzed using mathematically-based computational methods, the descriptions of the entities in the domain of interest in terms of the selected variables must be mathematically represented. A widely used way of doing this, and the one adopted here, is to use structures from a branch of mathematics known as linear algebra. Vectors are fundamental in data representation. A vector is just a sequence of numbered slots containing numerical values. The figure below shows a four-element vector each element of which contains a real-valued number: 1.6 is the value of the first element v1, 2.4 the value of the second element v2, and so on.

  21. Section 1. Data abstraction: representation A single DECTE speaker's frequency of usage of the 158 phonetic segments in the transcription scheme can be represented by a 158-element vector in which each element is associated with a different segment, as in the figure below. This speaker uses the segment at Speaker_1 (the first element of the vector) twenty-three times, the segment at Speaker_2 four times, and so on. Such a vector completely describes the speaker’s phonetic usage in terms of the DECTE transcription scheme, and will be referred to as a ‘speaker profile’.
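
A minimal Python sketch of how a speaker profile could be computed, assuming a transcription is available as a sequence of segment labels. The three-segment scheme and the labels `seg_a`, `seg_b`, `seg_c` are hypothetical placeholders; the DECTE scheme has 158 segments and its own notation.

```python
from collections import Counter

# Hypothetical transcription scheme; the DECTE scheme has 158 segments.
SCHEME = ["seg_a", "seg_b", "seg_c"]

def speaker_profile(transcription, scheme=SCHEME):
    """Return a frequency vector with one element per segment in the scheme."""
    counts = Counter(transcription)
    # A segment the speaker never uses gets a frequency of 0.
    return [counts[segment] for segment in scheme]

# Hypothetical transcribed interview, as a sequence of segment labels.
transcription = ["seg_a", "seg_c", "seg_a", "seg_b", "seg_a"]
profile = speaker_profile(transcription)
print(profile)  # [3, 1, 1]
```

The order of the scheme fixes which element of the vector belongs to which segment, so every speaker must be counted against the same scheme for the profiles to be comparable.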

  22. Section 1. Data abstraction: representation The 63 speaker profiles can be assembled into a matrix M, shown below, in which the 63 rows represent the speakers, the 158 columns represent the phonetic segments, and the value at M_ij is the number of times speaker i uses segment j (for i = 1..63 and j = 1..158):
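
As an illustration only, the same layout can be sketched in Python with a list of rows. The frequencies below are invented, and the 3 x 4 size stands in for the 63 x 158 matrix; the helper names `frequency` and `segment_column` are assumptions made for this sketch.

```python
# Hypothetical profiles for three speakers over four segments;
# the DECTE matrix described above would be 63 x 158.
profiles = [
    [23, 4, 0, 7],
    [2, 11, 5, 0],
    [19, 6, 1, 9],
]

M = [list(row) for row in profiles]  # one row per speaker

n_speakers = len(M)
n_segments = len(M[0])

def frequency(M, i, j):
    """Number of times speaker i uses segment j, with i and j counted from 1."""
    return M[i - 1][j - 1]

def segment_column(M, j):
    """Frequencies of segment j (counted from 1) across all speakers."""
    return [row[j - 1] for row in M]

print(n_speakers, n_segments)  # 3 4
print(frequency(M, 1, 2))      # 4
print(segment_column(M, 1))    # [23, 2, 19]
```

The helpers subtract 1 because the slide counts rows and columns from 1 while Python lists count from 0.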

  23. Section 2. Cluster analysis Once the data matrix has been created, a variety of computational methods can be used to classify its row vectors, and thereby the objects in the domain that the row vectors represent. In the present case, those objects are the DECTE speakers.

  24. Section 2. Cluster analysis: motivation Why cluster analysis? Observation of nature plays a fundamental role in science, as noted earlier. But nature is dauntingly complex, and there is no practical or indeed theoretical hope of describing any aspect of it objectively and exhaustively. The researcher is therefore selective in what s/he observes: a set of variables descriptive of the domain of interest is defined and a series of observations is conducted in which, at each observation, the values of each variable are recorded. A body of data is therefore built up on the basis of which a hypothesis can be generated.

  25. Section 2. Cluster analysis: motivation Let’s say, for example, the DECTE speakers are described by a single variable, the phonetic segment Ə1, and the values in the variable column are the frequencies with which each of the 24 speakers uses that segment. It is easy to see by direct inspection that the speakers fall into two groups: those that use Ə1 relatively frequently and those that use it relatively infrequently. Based on this result, the obvious hypothesis is that there is systematic variation in phonetic usage with respect to Ə1 in the speaker community from which the NECTE speakers were selected.

  26. Section 2. Cluster analysis: motivation If two phonetic variables are used to describe the speakers, direct inspection again shows two groups, those that use both Ə1 and Ə2 relatively frequently and those that do not, and the hypothesis is analogous to the one just stated.

  27. Section 2. Cluster analysis: motivation There is no theoretical limit on the number of variables that can be used to describe the objects in a domain. As the number of variables and observations grows, so does the difficulty of generating hypotheses from direct inspection of the data. In the DECTE case, the selection of Ə1 and Ə2 in figures 1 and 2 was arbitrary, and the speakers could have been described using more phonetic segment variables. The figure below shows twelve.

  28. Section 2. Cluster analysis: motivation What hypothesis would one formulate from inspection of the data in the figure, taking into account all the variables? There are, moreover, 63 phonetic transcriptions in the DECTE corpus and the transcription scheme contains 158 phonetic segments, so it is possible to describe the phonetic usage of each of the 63 speakers in terms of 158 variables.

  29. Section 2. Cluster analysis: motivation These questions are clearly rhetorical, and there is a straightforward moral: human cognitive makeup is unsuited to seeing regularities in anything but the smallest collections of numerical data. To see the regularities we need help, and that is what cluster analysis provides.

  30. Section 2. Cluster analysis Cluster analysis is a family of computational methods for identification and graphical display of structure in data when the data is too large, in terms of the number of variables, the number of objects described, or both, for it to be readily interpretable by direct inspection. All the members of the family work by partitioning a set of objects in the domain of interest into disjoint subsets in accordance with how relatively similar those objects are in terms of the variables that describe them. The objects of interest in the DECTE data are speakers, and each speaker's phonetic usage is described by a profile vector. Any two speakers' phonetic usage will be more or less similar depending on how similar their respective variable values are: if the values are identical then so are the speakers in terms of their phonetic usage, and the greater the divergence in values the greater the differences in usage.
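
One common way of putting a number on "how similar" two profile vectors are is Euclidean distance. The slides do not say which measure is used, so its choice here is an assumption, and the profiles are invented for illustration.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length profile vectors."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical three-variable speaker profiles.
speaker_a = [23, 4, 0]
speaker_b = [23, 4, 0]   # identical values to speaker_a
speaker_c = [20, 8, 0]   # diverges from speaker_a on two variables

print(euclidean_distance(speaker_a, speaker_b))  # 0.0
print(euclidean_distance(speaker_a, speaker_c))  # 5.0
```

Identical profiles give a distance of zero, and the distance grows as the variable values diverge, which matches the informal description above.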

  31. Section 2. Cluster analysis Cluster analysis of the DECTE data in the preceding figure groups 24 of the 63 speakers in the corpus in terms of how similar their frequency of usage of 12 of the 158 phonetic segments is. There are various kinds of cluster analysis; the figure below shows the results from application of two of them to the DECTE data.

  32. Section 2. Cluster analysis Figure (a) shows the similarity relations among speakers in the DECTE data as a tree. The longer the horizontal lines joining two clusters, the more different the clusters are. There are two main groups of speakers, labelled A and B, which differ greatly from one another in terms of phonetic usage, and, though there are differences in usage among the speakers in those two main groups, the differences are minor relative to that between A and B.
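
A tree of this kind can be built by repeatedly merging the two closest clusters. The sketch below assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members); the slides do not state which variant produced figure (a), and the five two-variable speakers are invented.

```python
import math

def distance(u, v):
    """Euclidean distance between two profile vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def single_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster, cluster, distance)."""
    clusters = [frozenset([name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical profiles: s1-s3 favour the first segment, s4-s5 the second.
profiles = {
    "s1": [20, 2], "s2": [23, 3], "s3": [21, 1],
    "s4": [3, 18], "s5": [2, 21],
}
merges = single_linkage(profiles)
for left, right, d in merges:
    print(left, "+", right, f"at distance {d:.2f}")
```

Each merge distance corresponds to the length of the line joining two clusters in the tree: the final merge, which joins the two main groups, happens at a much larger distance than the merges inside either group.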

  33. Section 2. Cluster analysis Figure (b) shows the cluster structure of the DECTE data as a scatter plot in which relative spatial distance between speaker labels represents the relative similarity of phonetic usage among the speakers. Labels corresponding to the main clusters in figure (a) have been added for ease of cross-reference, and show that this analysis gives essentially the same result as the hierarchical one.

  34. Section 2. Cluster analysis Once the structure of the data has been identified by cluster analysis it can be used for hypothesis generation. Based on the analyses in figures (a) and (b), the hypothesis is that, with respect to the selected phonetic variables, the speakers in the community from which DECTE was drawn fall into two distinct groups; more is said about this later. Cluster analysis can be applied in any research where the data consists of objects described by variables; since most research uses data of this kind, it is very widely applicable. It can usefully be applied where the number of objects and variables is so large that the data cannot easily be interpreted by direct inspection, and the range of applications where this is the case spans most areas of science, engineering, commerce, and, of course, linguistics.

  35. Section 3. How cluster analysis works There is a fundamental relationship between geometry and the vectors and matrices which are standardly used to represent data, as described. As we have seen, a vector is a sequence of n numbers, and the sequence is conventionally represented as comma-separated numerals between square brackets. The dimensionality of the vector, that is, the number of its components n, defines an n-dimensional vector space. The sequence of n numbers comprising the vector specifies the coordinates of the vector in the vector space. The vector itself is a point at the specified coordinates.

  36. Section 3. How cluster analysis works For example, the components of the 2-dimensional vector v = [36, 160] in (a) are its coordinates in a 2-dimensional vector space with axes 0..100 and 0..200, counting 36 along the horizontal axis and 160 along the vertical. The components of the 3-dimensional vector v = [36, 160, 30] in (b) are its coordinates in a 3-dimensional vector space with axes 0..100, 0..300, 0..100, counting 36 along the horizontal axis, 160 along the vertical, and 30 along the third axis, which is shown as a diagonal for perspective.

  37. Section 3. How cluster analysis works More than one vector can exist in a given vector space. Where there is more than one vector in a data set, as is usual, they are standardly collected so as to constitute a matrix in which each row is a vector, as we have seen. Given the 3 x 2 matrix opposite top, the three 2-dimensional row vectors in 2-dimensional vector space look like the figure opposite bottom.

  38. Section 3. How cluster analysis works Consider now the plot of 100 three-dimensional randomly-generated vectors opposite top. Contrast the plot of a known, nonrandom 3-dimensional data set opposite bottom. Visual inspection makes it immediately apparent that the distribution of points in the lower plot is nonrandom: there are three clearly defined groups of vectors such that intra-group distance is small relative to the dimensions of the data space, and inter-group distance relatively large. Cluster analysis is a collection of methods whose aim is to detect such groups in data and to display them graphically in an intuitively accessible way.
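
The criterion described here, small intra-group distance and large inter-group distance, can be checked numerically. The sketch below uses three invented groups of 3-dimensional points and assumes Euclidean distance; it is not the data set shown in the plot.

```python
import math
from itertools import combinations

def distance(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_intra_group_distance(groups):
    """Average distance between pairs of points that share a group."""
    pairs = [distance(u, v) for g in groups for u, v in combinations(g, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_group_distance(groups):
    """Average distance between pairs of points from different groups."""
    pairs = [distance(u, v)
             for g, h in combinations(groups, 2)
             for u in g for v in h]
    return sum(pairs) / len(pairs)

# Hypothetical nonrandom data: three tight groups in 3-dimensional space.
groups = [
    [(1, 1, 1), (2, 1, 2), (1, 2, 1)],
    [(10, 10, 1), (11, 9, 2), (10, 11, 1)],
    [(1, 10, 10), (2, 11, 9), (1, 9, 10)],
]
intra = mean_intra_group_distance(groups)
inter = mean_inter_group_distance(groups)
print(intra < inter)  # True
```

In this sketch the group membership is given in advance; cluster analysis works in the other direction, recovering such groups from the distances alone.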

  39. Section 3. How cluster analysis works The figure opposite shows the interrelationship of a data matrix, geometrical interpretation of the matrix, and cluster analysis of it.

  40. Conclusion This workshop was about using DECTE and linguistic corpora more generally as a teaching resource in higher education. It was to be in two main parts: 1. Theory, which covered the nature of linguistic corpora generally and DECTE in particular, the principles on which the DECTE approach to using corpora for teaching is based, and application of these principles as exemplified by hypothesis generation using cluster analysis. 2. Practice. This is where we now are; the remainder of the workshop will be a practical session on how to cluster analyze data abstracted from DECTE.

  41. Part 2: Practice This practical session will cluster analyze a data matrix M abstracted from the DECTE phonetic transcriptions. A sample transcription as it appears in DECTE is given below.

  42. Part 2: Practice An analysis of the full DECTE corpus would require a 63 x 158 matrix, but that would generate large cluster trees which are difficult to display. We’ll therefore use a 12 x 158 matrix for tractability.

  43. Part 2: Practice We shall be using SPSS, a large general-purpose statistics package, to do the cluster analysis. The procedure for starting SPSS differs from university to university, so we’ll drop out of this PowerPoint presentation for a moment and go through the Southampton procedure.

  44. Part 2: Practice Something like the following will eventually appear.

  45. Part 2: Practice Click 'File' > 'Open'

  46. Part 2: Practice Click 'Data', select 'Files of type' to be 'All files', go to the directory where the matrix is kept, and select M.txt.

  47. Part 2: Practice There now comes a series of popup windows. Click ‘Next’.

  48. Part 2: Practice Click ‘Next’.

  49. Part 2: Practice Click ‘Next’.

  50. Part 2: Practice Click ‘Next’.
