CSE 711: DATA MINING Sargur N. Srihari E-mail: srihari@cedar.buffalo.edu Phone: 645-6164, ext. 113

CSE 711 Texts Required Text 1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000. Recommended Texts 1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

CSE 711 Texts 2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997. 3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998. 4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.

Introduction • Challenge: How to manage ever-increasing amounts of information • Solution: Data Mining and Knowledge Discovery in Databases (KDD)

Information as a Production Factor • Most international organizations produce more information in a week than many people could read in a lifetime

Data Mining Motivation • Mechanical production of data creates the need for mechanical consumption of data • Large databases = vast amounts of information • Difficulty lies in accessing it

KDD and Data Mining • KDD: Extraction of knowledge from data • Official definition: “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data” • Data Mining: Discovery stage of the KDD process

Data Mining • Process of discovering patterns, automatically or semi-automatically, in large quantities of data • Patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic

KDD and Data Mining [Figure: KDD at the center, surrounded by machine learning, expert systems, databases, statistics, and visualization.] Figure 1.1 Data mining is a multi-disciplinary field.

Data Mining vs. Query Tools • SQL: When you know exactly what you are looking for • Data Mining: When you only vaguely know what you are looking for

Practical Applications • KDD more complicated than initially thought • 80% preparing data • 20% mining data

Data Mining Techniques • Not so much a single technique • More the idea that there is more knowledge hidden in the data than shows itself on the surface

Data Mining Techniques • Any technique that helps to extract more out of data is useful • Query tools • Statistical techniques • Visualization • On-line analytical processing (OLAP) • Case-based learning (k-nearest neighbor)

Data Mining Techniques • Decision trees • Association rules • Neural networks • Genetic algorithms

Machine Learning and the Methodology of Science [Figure: cycle linking observation, analysis, theory, and prediction.] Empirical cycle of scientific research

Machine Learning... [Figure: reality is an infinite number of swans; analysis of a limited number of observations leads to the theory “All swans are white”.] Theory formation

Machine Learning... [Figure: reality is an infinite number of swans; the theory “All swans are white” yields a prediction that is tested against a single observation.] Theory falsification

A Kangaroo in Mist [Figure: panels a.) through f.)] Complexity of search spaces

Association Rules Definition: Given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.

Association Rules Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.

Association Rules Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services. Here, 98% is called the confidence of the rule. The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.

Association Rules Problem: The problem of mining association rules is to find all rules which satisfy a user-specified minimum support and minimum confidence. Applications include cross-marketing, attached mailing, catalog design, loss leader analysis, add-on sales, store layout and customer segmentation based on buying patterns.
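The definitions of support and confidence on the preceding slides can be made concrete with a small sketch. The transactions below are hypothetical, the function names are illustrative, and the search is a brute-force enumeration rather than any particular published mining algorithm.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustration only).
TRANSACTIONS = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "service"},
    {"accessories"},
    {"tires", "accessories"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force search for all rules X => Y meeting both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, transactions)
            if sup < min_support:
                continue
            # Split the itemset into every non-empty X and remainder Y.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    x = frozenset(lhs)
                    y = itemset - x
                    conf = confidence(x, y, transactions)
                    if conf >= min_confidence:
                        rules.append((set(x), set(y), sup, conf))
    return rules


if __name__ == "__main__":
    for x, y, sup, conf in mine_rules(TRANSACTIONS, 0.4, 0.7):
        print(sorted(x), "=>", sorted(y), f"support={sup:.2f} confidence={conf:.2f}")
```

Enumerating every itemset is exponential in the number of items, so this only illustrates the problem statement; practical miners prune the search using the minimum-support threshold.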

Example Data Sets • Contact Lens (symbolic) • Weather (symbolic) • Weather (numeric + symbolic) • Iris (numeric; outcome: symbolic) • CPU Performance (numeric; outcome: numeric) • Labor Negotiations (missing values) • Soybean

Contact Lens Data

Structural Patterns • Part of a structural description • Example is simplistic because all combinations of possible values are represented in the table
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

Structural Patterns • In most learning situations, the set of examples given as input is far from complete • Part of the job is to generalize to other, new examples

Weather Data

Weather Problem • The four attributes (outlook, temperature, humidity, windy) create 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
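The rules on this slide form a decision list: they are tried in order, and the first one whose conditions hold decides the outcome. A minimal sketch follows, assuming instances are plain dictionaries; the names `WEATHER_RULES` and `classify` are illustrative, not from the course material.

```python
from math import prod

# Ordered rules from the slide: (conditions, outcome).
WEATHER_RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]
DEFAULT_OUTCOME = "yes"  # "If none of the above then play = yes"


def classify(instance, rules=WEATHER_RULES, default=DEFAULT_OUTCOME):
    """Return the outcome of the first rule whose conditions all match."""
    for conditions, outcome in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return outcome
    return default


def count_combinations(domain_sizes):
    """Number of distinct instances given the number of values per attribute."""
    return prod(domain_sizes)


if __name__ == "__main__":
    day = {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": True}
    print(classify(day))                     # second rule fires
    print(count_combinations([3, 3, 2, 2]))  # outlook, temperature, humidity, windy
```

Order matters: read in isolation, the rule "humidity = normal then play = yes" would misclassify a rainy, windy day with normal humidity, but the earlier rule catches that case first.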

Weather Data with Some Numeric Attributes

Classification and Association Rules • Classification rules: rules that predict the classification of an example, here whether to play or not
If outlook = sunny and humidity > 83 then play = no

Classification and Association Rules • Association rules: rules that strongly associate different attribute values • Association rules derived from the weather table:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
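Each association rule above can be checked by counting how many instances it covers and how often its conclusion holds on them. The sketch below assumes the standard 14-instance nominal weather table from the required text (Witten and Frank); the rows are retyped here as an assumption, since the slide's own table is not reproduced in this text. The helper names are illustrative.

```python
# ASSUMPTION: the standard nominal weather data from Witten & Frank,
# retyped for illustration; verify against the course table before relying on it.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
WEATHER = [dict(zip(COLUMNS, row)) for row in ROWS]

# The four association rules from the slide: (antecedent, consequent).
RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"}, {"outlook": "sunny", "humidity": "high"}),
]


def matches(instance, conditions):
    """True if the instance has every attribute value in conditions."""
    return all(instance[attr] == value for attr, value in conditions.items())


def rule_stats(antecedent, consequent, data=WEATHER):
    """Return (instances covered by the antecedent, fraction where the consequent holds)."""
    covered = [row for row in data if matches(row, antecedent)]
    correct = [row for row in covered if matches(row, consequent)]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(covered), accuracy


if __name__ == "__main__":
    for antecedent, consequent in RULES:
        coverage, accuracy = rule_stats(antecedent, consequent)
        print(antecedent, "=>", consequent, f"coverage={coverage} accuracy={accuracy:.2f}")
```

Unlike a classification rule, the right-hand side here can be any attribute or combination of attributes, not only play.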

Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

Decision Tree for Contact Lens Data
tear production rate
  reduced: none
  normal: astigmatism
    no: soft
    yes: spectacle prescription
      myope: hard
      hypermetrope: none
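A decision tree like this one can be stored as nested mappings and evaluated by walking from the root to a leaf. In the sketch below the node and leaf names come from the slide, while the branch values (reduced/normal, no/yes, myope/hypermetrope) are an assumption inferred from the contact lens rules; the `classify` helper is illustrative.

```python
# Internal node: (attribute name, {attribute value: subtree}); leaf: a recommendation.
# ASSUMPTION: branch values are inferred from the contact lens rules, not read off the figure.
CONTACT_LENS_TREE = (
    "tear production rate",
    {
        "reduced": "none",
        "normal": (
            "astigmatism",
            {
                "no": "soft",
                "yes": (
                    "spectacle prescription",
                    {"myope": "hard", "hypermetrope": "none"},
                ),
            },
        ),
    },
)


def classify(instance, tree=CONTACT_LENS_TREE):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node


if __name__ == "__main__":
    patient = {
        "tear production rate": "normal",
        "astigmatism": "yes",
        "spectacle prescription": "myope",
    }
    print(classify(patient))
```

The tree tests only three attributes and ignores age, so it is more compact than the full rule set and does not reproduce every one of those rules exactly.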

Iris Data

Iris Rules Learned • If petal-length < 2.45 then Iris-setosa • If sepal-width < 2.10 then Iris-versicolor • If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor • ...

CPU Performance Data

CPU Performance • Numerical prediction: outcome as a linear sum of weighted attributes • Regression equation: PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX • Linear regression can discover linear relationships, not non-linear ones

Linear Regression [Figure: debt plotted against income, with a fitted regression line.] A simple linear regression for the loan data set
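For a single attribute, the regression line can be fitted in closed form by ordinary least squares. The sketch below uses made-up (income, debt) pairs, since the loan data set itself is not reproduced in these slides; `fit_line` and `predict` are illustrative names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted outcome as a weighted attribute plus a constant."""
    return intercept + slope * x


if __name__ == "__main__":
    # Hypothetical (income, debt) observations, for illustration only.
    income = [20.0, 35.0, 50.0, 65.0, 80.0]
    debt = [8.0, 11.0, 15.0, 17.0, 22.0]
    slope, intercept = fit_line(income, debt)
    print(f"debt = {intercept:.2f} + {slope:.3f} * income")
    print(predict(40.0, slope, intercept))
```

With several attributes, as in the CPU performance equation, the same least-squares idea yields one weight per attribute plus a constant term.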

Labor Negotiations Data

Decision Trees for ...
wage increase first year
  ≤ 2.5: bad
  > 2.5: statutory holidays
    > 10: good
    ≤ 10: wage increase first year
      < 4: bad
      ≥ 4: good

… Labor Negotiations Data [Figure: a larger decision tree. Root: wage increase first year (≤ 2.5 / > 2.5). Other tests: working hours per week (≤ 36 / > 36), statutory holidays (> 10 / ≤ 10), health plan contribution (none / half / full), wage increase first year (< 4 / ≥ 4). Leaves: bad, good.]

Soybean Data

Two Example Rules
If [leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
If [leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot

Classification [Figure: debt plotted against income, with regions labeled “No loan” and “Loan”.] A simple linear classification boundary for the loan data set; shaded region denotes class “no loan”

Clustering [Figure: debt plotted against income, with groups labeled Cluster 1, Cluster 2, and Cluster 3.] A simple clustering of the loan data set into 3 clusters; note that the original labels are replaced by +’s
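The slide does not say which clustering method produced the three groups. As one possible illustration, the sketch below uses k-means, a common choice that is swapped in here as an assumption, on made-up (income, debt) points with no class labels. Initializing the centroids from the first k points is a simplification to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means: alternate assigning points and recomputing centroids."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic deterministic start
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    # Hypothetical (income, debt) points, for illustration only.
    points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (1.2, 0.8), (5.1, 4.9), (9.2, 1.1)]
    centroids, assignment = kmeans(points, k=3)
    print(assignment)
    print(centroids)
```

Because no labels are used, the cluster numbers are arbitrary; only the grouping of the points carries meaning.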

Non-Linear Classification [Figure: debt plotted against income, with regions labeled “No Loan” and “Loan”.] An example of classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set

Nearest Neighbor Classifier [Figure: debt plotted against income, with regions labeled “No Loan” and “Loan”.] Classification boundaries for a nearest neighbor classifier for the loan data set
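A nearest neighbor classifier stores the training examples and labels a new point by looking at the closest stored ones. The sketch below assumes made-up (income, debt) examples and unscaled Euclidean distance; `knn_classify` is an illustrative name, and k = 1 gives the plain nearest neighbor rule.

```python
from collections import Counter

# Hypothetical labeled examples: ((income, debt), class), for illustration only.
TRAINING = [
    ((20.0, 15.0), "no loan"),
    ((25.0, 18.0), "no loan"),
    ((30.0, 20.0), "no loan"),
    ((60.0, 5.0), "loan"),
    ((70.0, 10.0), "loan"),
    ((80.0, 8.0), "loan"),
]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_classify(query, training=TRAINING, k=1):
    """Majority class among the k training examples closest to query."""
    neighbors = sorted(training, key=lambda item: squared_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify((65.0, 7.0)))        # nearest stored example decides
    print(knn_classify((22.0, 16.0), k=3))  # majority vote of three neighbors
```

No model is built ahead of time, so the decision boundary follows the training points directly, which is why it looks irregular compared with a linear boundary.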
