
Data Mining Chapter 1

Data Mining Chapter 1. Kirk Scott. Iris virginica. Iris versicolor. Iris setosa. 1.1 Data Mining and Machine Learning. Definition of Data Mining. The process of discovering patterns in data.

Presentation Transcript


  1. Data Mining Chapter 1 Kirk Scott

  2. Iris virginica

  3. Iris versicolor

  4. Iris setosa

  5. 1.1 Data Mining and Machine Learning

  6. Definition of Data Mining • The process of discovering patterns in data. • (The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.) • …Useful patterns allow us to make predictions on new data.

  7. Automatic or Semi-Automatic Pattern Discovery • Pattern discovery by, or with the help of, computers is of interest • Hence machine “learning” • The “learning” part comes from the algorithms used • Stay tuned for a brief discussion of whether this is learning

  8. Expression of Patterns • 1. Black box • 2. Transparent box • A transparent box expression reveals the structure of the pattern • The structure can be examined, reasoned about, and used to inform future decisions

  9. Black Box vs. Transparent Box • Black vs. transparent box is not a trivial distinction • Some modern computational approaches are black box in nature • They seek “answers” without necessarily revealing the structure of the problem at hand

  10. Genetic Algorithms • The book in particular says that genetic algorithms are beyond the realm of consideration • They are explicitly designed for optimization, not the revelation of structure • They exemplify black box thinking • They give a single (sub)optimal answer without providing additional information about the problem being optimized

  11. What is the book about? • Techniques for finding and describing structural patterns in data • Description of the patterns is an inherent part of data mining as it will be considered • Techniques that lead to black box predictors will not be considered

  12. Describing Structural Patterns • First example: • Contact lens table, page 6 in textbook • The first 5 columns are effectively 5 factors under consideration • The 6th column is the result • Inspection reveals that this is essentially a comprehensive listing of all possible combinations

  13. The information in the table could be stored syntactically in the form of a set of rules • For example: • If tear production rate = reduced then recommendation = none • Otherwise, if age = young and astigmatic = no then recommendation = soft
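
The two example rules above can be sketched as a small function. This is only an illustration, assuming the attribute values are stored as plain strings; the function name `recommend_lens` and the value "normal" for tear production rate are assumptions, not taken from the slides.

```python
def recommend_lens(tear_production_rate, age, astigmatic):
    """Apply the two example rules in order; return None if neither applies."""
    # Rule 1: reduced tear production rules out lenses altogether.
    if tear_production_rate == "reduced":
        return "none"
    # Rule 2: only reached when rule 1 did not apply ("Otherwise, if ...").
    if age == "young" and astigmatic == "no":
        return "soft"
    # The slide gives only two rules, so every other case is left open here.
    return None


print(recommend_lens("reduced", "young", "no"))   # none
print(recommend_lens("normal", "young", "no"))    # soft
```

A full rule set would need further rules to cover the cases that return None here.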

  14. The same information could be encapsulated graphically in a decision tree • The representation is up to the person working with the scenario • The point is that the ability to represent signifies that the structure has been revealed

  15. Input and Output • The 5 factors represent input • The 6th column represents output, or “prediction” • Overall, this distinction is typically present in a data mining problem

  16. Completeness • This example is simplistic because it is complete • In interesting, practical, applied problems, not all combinations may be present • Individual data values will be missing • The goal of data mining is to be able to generalize from the given data so that correct predictions can be made for other cases where the results aren’t specified

  17. Machine Learning • High level learning in humans seems to presuppose self-conscious adaptability • Whether or not machines are capable of learning in the human sense is open to question • No current machine/software combination demonstrates behavior exactly analogous to the behavior seen in biological systems

  18. My personal take on this is that the phrase “machine learning” is an unfortunate example of inflated naming • The phrase “data mining” is neutral • To ask whether a machine can mine is not as fraught with difficulty as asking whether it can learn

  19. IEEE publishes a research journal with the title “Transactions on Knowledge and Data Discovery” • Similarly, to ask whether a machine can discover is not as fraught with difficulty as asking whether it can learn

  20. “Machine learning” isn’t as inflated as “Artificial intelligence” • It is a step back from that level of hype • In the field of artificial intelligence researchers have scaled back their expectations of what they might accomplish • They have not been able to mimic general human intelligence

  21. If you look at data mining from an AI point of view, you might see it as a step along the road to mimicking learning and term it machine learning • If you come from a database management systems background you might say the point of view is results oriented rather than process oriented

  22. The data are not animate • The machines are not animate • The algorithms are not animate

  23. The human mind cannot readily discern some patterns, whether due to the quantity of the data or the subtlety and complexity of the patterns • Human programmers devise algorithms which are able to discern such patterns • This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being

  24. Naming Problems in Math • Consider these terms from math: • Imaginary numbers • Complex numbers • Chaos theory • I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on • You might as well talk about magic spells

  25. A Biological Perspective • Although I don’t worship at the church of St. Charles of Darwin, the biologists make an interesting point: • When reasoning about animals, it’s a mistake to anthropomorphize • In other words, don’t ascribe human characteristics to non-human organisms

  26. The shortcoming of this point of view is that biologists tend towards the rigidity of theologists: • Animals are not like us; they are no more than machines • A dog cannot feel anything approaching human emotion, etc., etc.

  27. Whatever the truth of the emotional state of dogs, isn’t this a valid question: • Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines? • The day may come when there is truth to this, but isn’t it a wee bit premature?

  28. Ideas on this question?

  29. Practically Speaking, What Does Data Mining Consist of? • Algorithms that examine data sets and extract knowledge • This time the authors have chosen the relatively neutral words examine and extract • The word knowledge is admittedly a bit tendentious itself

  30. In more detail, the idea is simply this: • The computer is programmed to implement an algorithm • The algorithm is designed to run through or process all of a data set • The goal of the algorithm is to summarize the data set

  31. The summary is a generalization • For our purposes, the summary consists of inferences that can be drawn about the data set • The inferences may be between values for the attributes of a given data point • They may also be about the relationships among various data points in the set

  32. The contact lens data set illustrated the idea that an inference is a rule • For our purposes, such a rule takes the form of “if x, then y” • It is a comprehensive collection of such rules that summarizes the structure of the data points in the set

  33. The Value of the Summary • A set of rules is definitely useful for predictions • Once again the distinction is made between black and transparent box techniques • The structural description is also of great value because it helps the human understand the data and what it means

  34. In this sense, at least, a degree of learning is evident in the whole complex • If the human consumer of the end product ultimately learns something previously unknown about the data set, • then the machine and algorithm must have genuinely learned something about the data that was previously unknown

  35. It may be of value to contrast this with black box systems again • A black box system achieves results without a structural description • This is still a valid question: • Just because the end result is not human learning, does that in fact lessen in any way the degree of learning that the computer system achieves?

  36. There are no fixed answers to these side questions • This is a 400-level course • There’s no time like now to at least think about such things, before you step out the door with your sparkling new sheepskin • Now, back to the grindstone

  37. 1.2 Simple Examples: The Weather and Other Problems

  38. The book introduces some comparatively simple (and unrealistic) data sets by way of further illustration • Note that complex, real data sets tend to be proprietary anyway • Databases and data sets are among the most valuable assets of organizations that have them • This kind of stuff isn’t given away

  39. The Weather Problem • This is a fictitious data set • Four symbolic categories (non-numeric attributes) determine whether a game should be played • There are 36 possible combinations • A table exists with only 14 of the combinations • See the following overhead
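
The count of 36 can be checked by enumerating every combination of attribute values. This is a minimal sketch: the slide does not list the values, so the ones below are assumptions chosen only so that the four attributes have 3, 3, 2, and 2 values respectively.

```python
from itertools import product

# Assumed attribute values (not listed on the slide); only the counts 3, 3, 2, 2 matter.
ATTRIBUTES = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# Every possible instance is one choice of value per attribute.
combinations = list(product(*ATTRIBUTES.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36

# The table holds 14 rows, so at least 36 - 14 = 22 combinations are never observed.
unseen_at_least = len(combinations) - 14
```

The unobserved combinations are exactly the cases where a data mining technique has to generalize.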

  40. An (Incomplete) Table of Data Point Values

  41. Decision Lists • More ideas about representing structure with rules • Sets of rules can be arranged in order • You work your way through the list • You accept the outcome of a rule if it applies • Otherwise you continue down the list • This is called a decision list

  42. The following rules represent the idea • No claim is made that this set of rules is complete or necessarily very good

  43. In an ordered list of rules: • Each rule is potentially (most likely) fragmentary • It doesn’t include every factor • Because they are ordered, the rules depend on each other • Individual rules, applied in isolation, do not necessarily (most likely don’t) give the correct result
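
The way an ordered list of rules is applied can be sketched as follows. The rules here are illustrative assumptions, not necessarily the ones shown on the slide, and the names `DECISION_LIST` and `classify` are hypothetical.

```python
# Each rule is (conditions, outcome). An empty conditions dict always applies,
# so it acts as a default at the end of the list.
# These rules are illustrative assumptions, not the rules from the slide.
DECISION_LIST = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),
]


def classify(instance, rules=DECISION_LIST):
    """Return the outcome of the first rule whose conditions all match."""
    for conditions, outcome in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return outcome
    return None  # reached only if the list has no default rule


instance = {"outlook": "rainy", "windy": True, "humidity": "normal"}
print(classify(instance))                     # no  (second rule fires first)
print(classify(instance, DECISION_LIST[3:]))  # yes (fourth rule taken out of order)
```

The last two lines show the order dependence: the rule about normal humidity gives a different answer when it is applied without the rules that precede it.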

  44. Handling Numeric Data • If all of the determining attributes are numeric, you can refer to a numeric attribute problem • If some are numeric and some are categorical, you can refer to a mixed attribute problem • Consider the table of the weather problem with some numerical attributes on the following overhead

  45. If numeric attributes are present, the decision rules typically become inequalities rather than equalities • For example
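
As a rough sketch of a rule that uses an inequality, assume humidity is recorded as a number. The threshold of 80 and the function name `should_play` are hypothetical, chosen only to show the shape of such a rule.

```python
def should_play(outlook, humidity, threshold=80):
    """Mixed-attribute rule: one categorical test and one numeric test."""
    # Categorical attributes are tested for equality,
    # numeric attributes are tested with an inequality.
    if outlook == "sunny" and humidity > threshold:
        return "no"
    return "yes"


print(should_play("sunny", 85))  # no
print(should_play("sunny", 70))  # yes
```

Choosing where to put the threshold is part of what a learning algorithm has to work out from the data.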

  46. Classification vs. Association • The premise of the foregoing discussion: • A set of independent variables determines the value of a dependent variable • This is reminiscent of multivariate statistical regression • It is classification • If relationships can be found among the supposedly independent variables, this is association

  47. These are examples of association rules taken from the original weather data set • It is not hard to imagine that outlook, temperature, humidity, and wind depend on each other in whole or in part
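
One way to check a candidate association rule against a table is to count the rows it covers and the rows where it also holds. This is a minimal sketch: the records and the rule below are made up for illustration and are not the table or rules from the slides, and `check_rule` is a hypothetical helper.

```python
def check_rule(records, antecedent, consequent):
    """Count records matching the antecedent, and how many of those also match the consequent."""
    def matches(record, conditions):
        return all(record.get(key) == value for key, value in conditions.items())

    covered = [r for r in records if matches(r, antecedent)]
    correct = [r for r in covered if matches(r, consequent)]
    return len(covered), len(correct)


# Hypothetical records, made up for illustration; not the table from the slides.
records = [
    {"temperature": "cool", "humidity": "normal", "windy": False},
    {"temperature": "cool", "humidity": "normal", "windy": True},
    {"temperature": "hot", "humidity": "high", "windy": False},
    {"temperature": "mild", "humidity": "high", "windy": True},
]

# Candidate rule: if temperature = cool then humidity = normal.
covered, correct = check_rule(records, {"temperature": "cool"}, {"humidity": "normal"})
print(covered, correct)  # 2 2
```

Note that the consequent here is one of the input attributes rather than the play outcome, which is what separates association from classification.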

  48. The data set is fictitious • The data set is not exhaustive • Unlike the contact lens example, it is not a list of all possible combinations • It is presumably based on some set of real observations
