Data Mining Chapter 1

Data Mining Chapter 1 Kirk Scott

Iris virginica

Iris versicolor

Iris setosa

1.1 Data Mining and Machine Learning

Definition of Data Mining • The process of discovering patterns in data. • (The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.) • …Useful patterns allow us to make predictions on new data.

Automatic or Semi-Automatic Pattern Discovery • Pattern discovery by, or with the help of, computers is of interest • Hence machine “learning” • The “learning” part comes from the algorithms used • Stay tuned for a brief discussion of whether this is learning

Expression of Patterns • 1. Black box • 2. Transparent box • A transparent box expression reveals the structure of the pattern • The structure can be examined, reasoned about, and used to inform future decisions

Black Box vs. Transparent Box • Black vs. transparent box is not a trivial distinction • Some modern computational approaches are black box in nature • They seek “answers” without necessarily revealing the structure of the problem at hand

Genetic Algorithms • The book in particular says that genetic algorithms are beyond the realm of consideration • They are explicitly designed for optimization, not the revelation of structure • They exemplify black box thinking • They give a single (sub)optimal answer without providing additional information about the problem being optimized

What is the book about? • Techniques for finding and describing structural patterns in data • Description of the patterns is an inherent part of data mining as it will be considered • Techniques that lead to black box predictors will not be considered

Describing Structural Patterns • First example: • Contact lens table, page 6 in textbook • The first 4 columns (age, spectacle prescription, astigmatism, tear production rate) are effectively 4 factors under consideration • The 5th column is the result • Inspection reveals that this is essentially a comprehensive listing of all possible combinations (3 × 2 × 2 × 2 = 24 rows)

The information in the table could be stored syntactically in the form of a set of rules • For example: • If tear production rate = reduced then recommendation = none • Otherwise, if age = young and astigmatic = no then recommendation = soft
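The two rules quoted above can be written as an ordered check. This is only a minimal sketch: the function and parameter names are made up for illustration, and only the two rules that appear on the slide are encoded, so every other case falls through to "no rule given".

```python
from typing import Optional


def recommend(tear_production_rate: str, age: str, astigmatic: str) -> Optional[str]:
    """Apply the two rules from the slide, in the order they are stated."""
    # Rule 1: reduced tear production overrides everything else.
    if tear_production_rate == "reduced":
        return "none"
    # Rule 2: only reached when rule 1 did not apply ("Otherwise, ...").
    if age == "young" and astigmatic == "no":
        return "soft"
    # The slide lists no further rules, so no recommendation is made here.
    return None


print(recommend("reduced", "young", "no"))   # none
print(recommend("normal", "young", "no"))    # soft
print(recommend("normal", "young", "yes"))   # None
```

The order of the checks matters: a young, non-astigmatic patient with reduced tear production is caught by the first rule and never reaches the second.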

The same information could be encapsulated graphically in a decision tree • The representation is up to the person working with the scenario • The point is that the ability to represent signifies that the structure has been revealed
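A decision tree can be stored as nested data and walked from the root to a leaf. The sketch below is an illustration, not the slide's own figure: the tree shape is the contact lens tree as I recall it from the textbook, so treat the branches as an assumption, and the names `TREE` and `classify` are hypothetical.

```python
# Each internal node is (attribute name, {attribute value: subtree or leaf}).
# Leaves are plain strings holding the recommendation.
# ASSUMPTION: branch layout recalled from the textbook figure, not from the slides.
TREE = ("tear production rate", {
    "reduced": "none",
    "normal": ("astigmatism", {
        "no": "soft",
        "yes": ("spectacle prescription", {
            "myope": "hard",
            "hypermetrope": "none",
        }),
    }),
})


def classify(node, case):
    """Follow one branch per attribute until a leaf (a string) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[case[attribute]]
    return node


case = {
    "tear production rate": "normal",
    "astigmatism": "yes",
    "spectacle prescription": "myope",
}
print(classify(TREE, case))  # hard
```

Because the whole tree is ordinary data, it can be printed, inspected, and reasoned about, which is the transparent box property the slides emphasize.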

Input and Output • The factor columns represent input • The final column represents output, or “prediction” • Overall, this distinction is typically present in a data mining problem

Completeness • This example is simplistic because it is complete • In interesting, practical, applied problems, not all combinations may be present • Individual data values will be missing • The goal of data mining is to be able to generalize from the given data so that correct predictions can be made for other cases where the results aren’t specified

Machine Learning • High level learning in humans seems to presuppose self-conscious adaptability • Whether or not machines are capable of learning in the human sense is open to question • No current machine/software combination demonstrates behavior exactly analogous to the behavior seen in biological systems

My personal take on this is that the phrase “machine learning” is an unfortunate example of inflated naming • The phrase “data mining” is neutral • To ask whether a machine can mine is not as fraught with difficulty as asking whether it can learn

ACM publishes a research journal with the title “Transactions on Knowledge Discovery from Data” • Similarly, to ask whether a machine can discover is not as fraught with difficulty as asking whether it can learn

“Machine learning” isn’t as inflated as “Artificial intelligence” • It is a step back from that level of hype • In the field of artificial intelligence researchers have scaled back their expectations of what they might accomplish • They have not been able to mimic general human intelligence

If you look at data mining from an AI point of view, you might see it as a step along the road to mimicking learning and term it machine learning • If you come from a database management systems background you might say the point of view is results oriented rather than process oriented

The data are not animate • The machines are not animate • The algorithms are not animate

The human mind cannot readily discern some patterns, whether due to the quantity of the data or the subtlety and complexity of the patterns • Human programmers devise algorithms which are able to discern such patterns • This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being

Naming Problems in Math • Consider these terms from math: • Imaginary numbers • Complex numbers • Chaos theory • I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on • You might as well talk about magic spells

A Biological Perspective • Although I don’t worship at the church of St. Charles of Darwin, the biologists make an interesting point: • When reasoning about animals, it’s a mistake to anthropomorphize • In other words, don’t ascribe human characteristics to non-human organisms

The shortcoming of this point of view is that biologists tend towards the rigidity of theologians: • Animals are not like us; they are no more than machines • A dog cannot feel anything approaching human emotion, etc., etc.

Whatever the truth of the emotional state of dogs, isn’t this a valid question: • Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines? • The day may come when there is truth to this, but isn’t it a wee bit premature?

Ideas on this question?

Practically Speaking, What Does Data Mining Consist of? • Algorithms that examine data sets and extract knowledge • This time the authors have chosen the relatively neutral words examine and extract • The word knowledge is admittedly a bit tendentious itself

In more detail, the idea is simply this: • The computer is programmed to implement an algorithm • The algorithm is designed to run through or process all of a data set • The goal of the algorithm is to summarize the data set

The summary is a generalization • For our purposes, the summary consists of inferences that can be drawn about the data set • The inferences may be about relationships between the attribute values of a given data point • They may also be about the relationships among various data points in the set

The contact lens data set illustrated the idea that an inference is a rule • For our purposes, such a rule takes the form of “if x, then y” • It is a comprehensive collection of such rules that summarizes the structure of the data points in the set

The Value of the Summary • A set of rules is definitely useful for predictions • Once again the distinction is made between black and transparent box techniques • The structural description is also of great value because it helps the human understand the data and what it means

In this sense, at least, a degree of learning is evident in the whole complex • If the human consumer of the end product ultimately learns something previously unknown about the data set— • Then the machine and algorithm must have genuinely learned something about the data that was previously unknown

It may be of value to contrast this with black box systems again • A black box system achieves results without a structural description • This is still a valid question: • Just because the end result is not human learning, does that in fact lessen in any way the degree of learning that the computer system achieves?

There are no fixed answers to these side questions • This is a 400 level course • There’s no time like now to at least think about such things, before you step out the door with your sparkling new sheepskin • Now, back to the grindstone

1.2 Simple Examples: The Weather and Other Problems

The book introduces some comparatively simple (and unrealistic) data sets by way of further illustration • Note that complex, real data sets tend to be proprietary anyway • Databases and data sets are among the most valuable assets of organizations that have them • This kind of stuff isn’t given away

The Weather Problem • This is a fictitious data set • Four symbolic categories (non-numeric attributes) determine whether a game should be played • There are 36 possible combinations • A table exists with only 14 of the combinations • See the following overhead

An (Incomplete) Table of Data Point Values
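The table itself did not survive in these notes. The sketch below uses the 14-row nominal weather data as it is commonly distributed with the textbook's software; that these are the rows the overhead showed is an assumption. It also checks the count of 36 possible combinations by enumerating them.

```python
from itertools import product

# Attribute domains: 3 x 3 x 2 x 2 = 36 possible input combinations.
DOMAINS = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# ASSUMPTION: the standard 14-row weather data set.
# Columns: outlook, temperature, humidity, windy, play
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

all_combinations = set(product(*DOMAINS.values()))
observed = {row[:4] for row in ROWS}      # drop the "play" column
unseen = all_combinations - observed

for row in ROWS:
    print("{:<9} {:<5} {:<7} {!s:<6} {}".format(*row))

print(len(all_combinations), len(observed), len(unseen))  # 36 14 22
```

The 22 unseen combinations are exactly the cases for which a learned description has to generalize.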

Decision Lists • More ideas about representing structure with rules • Sets of rules can be arranged in order • You work your way through the list • You accept the outcome of a rule if it applies • Otherwise you continue down the list • This is called a decision list

The following rules represent the idea • No claim is made that this set of rules is complete or necessarily very good

In an ordered list of rules: • Each rule is potentially (most likely) fragmentary • It doesn’t include every factor • Because they are ordered, the rules depend on each other • Individual rules, applied in isolation, do not necessarily (most likely don’t) give the correct result
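The following sketch shows a decision list being applied in order, and then shows what happens when one of its rules is pulled out and used alone. Both the rule list and the 14 data rows are assumptions: they are the ones I recall from the textbook's weather example, not something reproduced on these slides, and the names `RULES`, `decide`, and `matches` are hypothetical.

```python
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")

# ASSUMPTION: the standard 14-row weather data set.
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
CASES = [dict(zip(ATTRS, row)) for row in ROWS]

# An ordered decision list: (conditions that must all hold, outcome).
# ASSUMPTION: rules recalled from the textbook example.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),  # default: applies when nothing above did
]


def matches(conditions, case):
    """True when every attribute named in the rule has the required value."""
    return all(case[attr] == value for attr, value in conditions.items())


def decide(case):
    """Return the outcome of the first rule that applies."""
    for conditions, outcome in RULES:
        if matches(conditions, case):
            return outcome
    return None


# Applied in order, the list agrees with every row of the table.
list_errors = sum(decide(c) != c["play"] for c in CASES)

# Applied in isolation, "if humidity = normal then play = yes" gets one row wrong.
conditions, outcome = RULES[3]
isolated_errors = sum(
    outcome != c["play"] for c in CASES if matches(conditions, c)
)

print(list_errors, isolated_errors)  # 0 1
```

The row that the isolated rule gets wrong (rainy, cool, normal humidity, windy) is handled correctly in the list only because an earlier rule catches it first.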

Handling Numeric Data • If all of the determining attributes are numeric, you can refer to a numeric attribute problem • If some are numeric and some are categorical, you can refer to a mixed attribute problem • Consider the table of the weather problem with some numerical attributes on the following overhead

If numeric attributes are present, the decision rules typically become inequalities rather than equalities • For example
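As a minimal illustration of the difference, here is one rule written first as an equality test on a categorical humidity value and then as an inequality test on a numeric one. The cut-off of 83 and the sample readings are assumptions chosen for illustration; they are not taken from the slides.

```python
HUMIDITY_THRESHOLD = 83  # ASSUMED cut-off, for illustration only


def no_play_nominal(outlook: str, humidity: str) -> bool:
    """Categorical version: the rule tests for equality."""
    return outlook == "sunny" and humidity == "high"


def no_play_numeric(outlook: str, humidity: float) -> bool:
    """Numeric version: the rule tests an inequality against a threshold."""
    return outlook == "sunny" and humidity > HUMIDITY_THRESHOLD


# Hypothetical readings, not rows from the textbook table.
print(no_play_nominal("sunny", "high"))   # True
print(no_play_numeric("sunny", 90))       # True
print(no_play_numeric("sunny", 70))       # False
print(no_play_numeric("rainy", 90))       # False
```

Choosing the threshold becomes part of the learning problem once attributes are numeric, which is one reason numeric and mixed attribute problems are treated separately.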

Classification vs. Association • The premise of the foregoing discussion: • A set of independent variables determines the value of a dependent variable • This is reminiscent of multivariate statistical regression • It is classification • If relationships can be found among the supposedly independent variables, this is association

These are examples of association rules taken from the original weather data set • It is not hard to imagine that outlook, temperature, humidity, and wind depend on each other in whole or in part
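The rules from the overhead are not reproduced in these notes, so the sketch below checks a few candidate association rules directly against the data. Several things here are assumptions: the 14 rows are the standard nominal weather data, the candidate rules are my own picks for illustration, and "support" and "confidence" are the usual measures for such rules rather than terms introduced above.

```python
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")

# ASSUMPTION: the standard 14-row weather data set.
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
CASES = [dict(zip(ATTRS, row)) for row in ROWS]


def holds(conditions, case):
    """True when the case has every attribute value named in `conditions`."""
    return all(case[attr] == value for attr, value in conditions.items())


def support_confidence(antecedent, consequent):
    """Return (rows where both sides hold, fraction of antecedent rows where the consequent holds)."""
    covered = [c for c in CASES if holds(antecedent, c)]
    correct = [c for c in covered if holds(consequent, c)]
    confidence = len(correct) / len(covered) if covered else 0.0
    return len(correct), confidence


# Note that the right-hand side need not be "play": any attribute can be predicted.
print(support_confidence({"temperature": "cool"}, {"humidity": "normal"}))            # (4, 1.0)
print(support_confidence({"humidity": "normal", "windy": False}, {"play": "yes"}))    # (4, 1.0)
print(support_confidence({"outlook": "sunny", "play": "no"}, {"humidity": "high"}))   # (3, 1.0)
print(support_confidence({"windy": True}, {"play": "no"}))                            # (3, 0.5)
```

The last rule holds for only half of the rows it covers, which shows why candidate rules need to be scored rather than simply listed.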

The data set is fictitious • The data set is not exhaustive • Unlike the contact lens example, it is not a list of all possible combinations • It is presumably based on some set of real observations
