SEL3053: Analyzing Geordie Lecture 7. Data creation

This lecture introduces the concept of data, how it is created, and how it can be represented for computational processing. It then goes on to describe creation of the data matrix on which the subsequent analysis of DECTE will be based.

1. The nature of data

'Data' is the plural of 'datum', the past participle of Latin 'dare', 'to give', and means 'things that are given'. A datum is therefore something to be accepted at face value, a true statement about the world. What is a true statement about the world? That question has been debated in philosophical metaphysics since Antiquity and probably before, and, in our own time, has been intensively studied by the disciplines that comprise cognitive science. The issues are complex, controversy abounds, and the associated academic literatures are vast: saying what a true statement about the world might be is anything but straightforward. We can't go into all this, and so will adopt the attitude prevalent in most areas of science: data are abstractions of what we observe using our senses, often with the aid of instruments.

Data are ontologically different from the world. The world is as it is; data are an interpretation of it for the purpose of scientific study. The weather is not the meteorologist’s data; measurements of such things as air temperature are. A text corpus is not the linguist’s data; measurements of such things as lexical frequency are. Data are constructed from observation of things in the world, and the process of construction raises a range of issues that determine the amenability of the data to analysis and the interpretability of the analytical results.

The importance to cluster analysis of understanding such data issues can hardly be overstated. On the one hand, nothing can be discovered that is beyond the limits of the data themselves. On the other, failure to understand and, where necessary, to emend relevant characteristics of data can lead to results and interpretations that are distorted or even worthless. For these reasons, a detailed account of data issues is given before moving on to analysis of DECTE.

2. Research question

In general, any aspect of the world can be described in an arbitrary number of ways and to arbitrary degrees of precision. The implications of this go straight to the heart of the debate on the nature of science and scientific theories referred to in an earlier lecture. To avoid being drawn into that debate, this discussion adopts the position that is pretty much standard in scientific practice: the Popperian view that there is no theory-free observation of the world. In essence, this means that there is no such thing as objective observation in science: entities in a domain of inquiry only become relevant to observation in terms of a research question framed using the ontology and axioms of a theory about the domain.

For example, in linguistic analysis variables are selected in terms of the discipline of linguistics broadly defined. This includes the division into subdisciplines such as sociolinguistics and dialectology, the subcategorization within subdisciplines, such as that running from phonetics through syntax to semantics and pragmatics in formal grammar, and theoretical entities within each subcategory, such as constituency structures and movement in syntax. Claims, occasionally seen, that the variables used to describe a corpus are 'theoretically neutral' are naive: even word categories like 'noun' and 'verb' are interpretative constructs that imply a certain view of how language works, and they only appear to be theory-neutral because of familiarity with long-established tradition.

Data can, therefore, only be created in relation to a research question that is defined on the domain of interest, and that thereby provides an interpretative orientation. Without such an orientation, how does one know what to observe, what is important, and what is not? Take, for example, a domain used by many millions of people each day: the websites on the WWW. To abstract data from it, one first has to know what's required: How many web documents are there globally? How are websites distributed geographically? What types of website are accessed most frequently? And so on. Only once a question of this sort is asked can data be created.

For the remainder of this lecture, a research question which search engines like Google address every day will be assumed: Can the many millions of websites that currently exist on the Web be classified in accordance with their conceptual content, so that only those relevant to any given user query are returned?

3. Variable selection

Given that data are an interpretation of some domain of interest, what does such an interpretation look like? It is a description of entities in the domain in terms of variables. A variable is a symbol, and as such is a physical entity with a conventional semantics, where a conventional semantics is understood as one in which the designation of a physical thing as a symbol together with the connection between the symbol and what it represents are determined by agreement within a community. The symbol ‘A’, for example, represents the phoneme /a/ by common assent, not because there is any necessary connection between it and what it represents.

Since each variable has a conventional semantics, the set of variables chosen to describe entities in a domain constitutes the template in terms of which the domain is interpreted. Selection of appropriate variables is, therefore, crucial to the success of any data analysis. Which variables are appropriate in any given case? That depends on the nature of the research question. The fundamental principle in variable selection is that the variables must describe all and only those aspects of the domain that are relevant to the research question.

In general, this is an unattainable ideal. Any domain can be described by an essentially arbitrary number of finite sets of variables; selection of one particular set can only be done on the basis of personal knowledge of the domain and of the body of scientific theory associated with it, tempered by personal discretion. In other words, there is no algorithm for choosing an optimally relevant set of variables for a research question. Relative to the WWW domain and the above research question, what variables would be appropriate?

Classification of natural language documents on the basis of semantic content has been and remains a central issue in the Information Retrieval research community. A foundational principle is that documents can be classified not only on the basis of their semantics, but more specifically on the basis of their lexical semantics: documents containing words like 'field', 'crops', and 'yield' constitute a class because their lexical semantics indicate that they are about the same kind of thing (here, farming), and this class is distinct from another containing, say, 'computer', 'keyboard', and 'mouse'. That document semantics is determined solely by lexical content is, of course, theoretically indefensible. In the currently dominant paradigm for the modelling of natural language, generative linguistics, the semantics of any linguistic unit more complex than a morpheme is a function of the constituent structure of its syntax.

Sentence semantics, for example, is determined by Frege's principle of compositionality: the meaning of a sentence is a function both of the meanings of its constituent words and of their 'manner of combination', that is, of the sentence's constituent structure, so that 'dog bites man' differs in meaning from 'man bites dog' even though the words remain the same. And, because documents are multiple-sentence collections, this principle extends to them as well, though clearly document semantics is not an additive function of its constituent sentence semantics. Mainstream Information Retrieval ignores this syntactic component of document semantics on the empirical grounds that document semantics based solely on lexical content suffices for efficient classification in practical applications. This study does the same.
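What ignoring the syntactic component means in practice can be illustrated with a minimal Python sketch. The function name `bag_of_words` and the whitespace tokenization are assumptions made for illustration only; they are not part of the lecture or of any particular retrieval system.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The two sentences differ in meaning, but a purely lexical
# representation cannot tell them apart.
print(a == b)  # True
```

The two sentences receive identical representations, which is exactly the information loss that lexically based classification accepts in exchange for simplicity.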

The variables selected in relation to the research question in this case are, therefore, the words that the documents included in the domain contain. Our domain is the collection of Web documents worldwide, which implies that the number of variables will be huge: even restricting attention to English-language sites, the number of words and therefore variables will be in the many tens of thousands.

4. Variable value assignment

The semantics of each selected variable determines a particular interpretation of the domain of interest, and the domain is measured, in a broad sense, in terms of the semantics. That measurement constitutes the values of the variables: height in metres = 1.71, weight in kilograms = 70, and so on. Measurement is fundamental in the creation of data because it makes the link between data and the world, and thus allows the results of data analysis to be applied to the understanding of the world.

Measurement is only possible in terms of some scale. There are various types of measurement scale, and these are discussed in the relevant textbooks, but for present purposes the main dichotomy is between numeric and non-numeric, or quantitative and qualitative. The cluster analysis methods discussed in due course assume numeric measurement as the default case, and for that reason the same is done here.

The variables in our Web example are the words that websites contain. What kind of value should be attached to these variables? In other words, how should they be 'measured'? The standard approach in Information Retrieval is simply to count the number of times any given word occurs in each of the websites. The idea here is that if an author uses a word repeatedly in a document then that document is more likely to be about what the word denotes than it is to be about the denotation of an infrequently occurring word, and that documents can be distinguished from one another on the basis of such 'aboutness'.
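The counting step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tokenizer (lowercasing and keeping runs of letters and apostrophes) and the sample sentence are invented here, and real Information Retrieval systems use more careful tokenization.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Return how many times each word occurs in a document.

    Tokenization is deliberately naive: lowercase the text and
    keep runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical document text, made up for this example.
doc = "The yield of the field depends on the crops; crops need a good field."
counts = word_counts(doc)

print(counts["crops"], counts["field"], counts["computer"])  # 2 2 0
```

A word that never occurs simply has a count of zero, so every document can be measured on every variable.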

5. Data representation

Having decided on a set of variables and on how the domain they are intended to describe should be measured, the next step is to represent the data in a format that can be computationally analyzed. There is a de facto standard way of doing this, but to explain it two mathematical ideas need to be introduced.

5.1 Vector

A vector is a sequence of n numbers, each of which is indexed by its position in the sequence. The figure immediately below shows n = 6 real-valued numbers, where the first number v1 is 2.1, the second v2 is 5.1, and so on.
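As a minimal sketch, such a vector can be written as a Python list. Only the first two values (2.1 and 5.1) come from the text; the remaining four are placeholders invented for illustration. The helper `component` is likewise hypothetical: it exists only to reconcile the 1-based indexing used in the text with Python's 0-based lists.

```python
# v1 = 2.1 and v2 = 5.1 are from the text; the other four values
# are made-up placeholders so that n = 6.
v = [2.1, 5.1, 3.7, 1.0, 4.4, 0.9]
n = len(v)

def component(vec: list, i: int) -> float:
    """Return v_i using the 1-based indexing of the text."""
    if not 1 <= i <= len(vec):
        raise IndexError(f"index {i} is outside 1..{len(vec)}")
    return vec[i - 1]

print(n)                 # 6
print(component(v, 1))   # 2.1
print(component(v, 2))   # 5.1
```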

Vectors are a standard data structure in computer science and are extensively used in numerical computation, and so are a good way of representing data for computational analysis. How? Returning to the website example, let's say we have m = 580,000,000 documents, and n = 78,532 variables; the numbers are for exemplification only, and were plucked out of thin air. Then each document can be represented by a frequency profile vector:
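A frequency profile vector can be sketched as follows, assuming a toy vocabulary of six words in place of the 78,532 variables mentioned above. The vocabulary, the sample document, and the function name `frequency_vector` are all invented for illustration. The essential point is that the vocabulary fixes the order of the variables, so that position j means the same word in every document's vector.

```python
from collections import Counter

# Hypothetical vocabulary; its order defines what each vector position means.
vocabulary = ["computer", "crops", "field", "keyboard", "mouse", "yield"]

def frequency_vector(text: str, vocab: list) -> list:
    """Return the count of each vocabulary word in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical document, made up for this example.
doc = "field crops yield field crops field"
vec = frequency_vector(doc, vocabulary)

print(vec)  # [0, 2, 3, 0, 0, 1]
```

Words outside the vocabulary are ignored, and vocabulary words absent from the document contribute a zero.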

5.2 Matrix

Let's say such a frequency vector has been created for each of the web documents. This yields 580,000,000 vectors. How can one keep track of so many? The answer is another standard data structure in computer science, the matrix. A matrix is, in essence, just a list of vectors. For example:

The individual document frequency vectors are simply listed from 1 to however many documents there are. For convenience of reference, matrices are given names. If a matrix is called, say, M, then M_i is the i'th row; M_3 is the third row of the matrix. M_{i,j} is the value of the j'th variable in row i; M_{3,1278} is the number of times the 1278th word, 'computer', occurs in Document 3.
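The same idea can be sketched in Python as a list of frequency vectors, one per document. Everything concrete here is an assumption for illustration: the three tiny documents, the six-word vocabulary, and the helpers `row` and `cell`, which translate the 1-based row and column references used in the text into Python's 0-based indexing.

```python
from collections import Counter

# Hypothetical vocabulary and documents, made up for this example.
vocabulary = ["computer", "crops", "field", "keyboard", "mouse", "yield"]
documents = [
    "field crops yield field",
    "crops yield crops",
    "computer keyboard mouse computer",
]

def frequency_vector(text: str, vocab: list) -> list:
    """Return the count of each vocabulary word in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# The data matrix: row i is the frequency vector of document i.
M = [frequency_vector(d, vocabulary) for d in documents]

def row(matrix: list, i: int) -> list:
    """Return the i'th row, counting from 1 as in the text."""
    return matrix[i - 1]

def cell(matrix: list, i: int, j: int) -> int:
    """Return the value of the j'th variable in row i, both counted from 1."""
    return matrix[i - 1][j - 1]

print(row(M, 3))      # [2, 0, 0, 1, 1, 0]
print(cell(M, 3, 1))  # 2: 'computer' occurs twice in Document 3
```

In this toy vocabulary 'computer' is variable 1 rather than variable 1278, but the reading of the cell is the same: row picks the document, column picks the word.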