CSE 711: DATA MINING Sargur N. Srihari E-mail: srihari@cedar.buffalo Phone: 645-6164, ext. 113


**CSE 711:** DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: 645-6164, ext. 113

**CSE 711 Texts**
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

**CSE 711 Texts**
2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997.
3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998.
4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.

**Introduction**
• Challenge: How to manage ever-increasing amounts of information
• Solution: Data Mining and Knowledge Discovery in Databases (KDD)

**Information as a Production Factor**
• Most international organizations produce more information in a week than many people could read in a lifetime

**Data Mining Motivation**
• Mechanical production of data → need for mechanical consumption of data
• Large databases = vast amounts of information
• Difficulty lies in accessing it

**KDD and Data Mining**
• KDD: Extraction of knowledge from data
• Official definition: “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data”
• Data Mining: Discovery stage of the KDD process

**Data Mining**
• Process of discovering patterns, automatically or semi-automatically, in large quantities of data
• Patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic

**KDD and Data Mining**
[Figure: KDD surrounded by machine learning, expert systems, databases, statistics, and visualization]
Figure 1.1 Data mining is a multi-disciplinary field.

**Data Mining vs. Query Tools**
• SQL: When you know exactly what you are looking for
• Data Mining: When you only vaguely know what you are looking for

**Practical Applications**
• KDD more complicated than initially thought
• 80% preparing data
• 20% mining data

**Data Mining Techniques**
• Not so much a single technique
• More the idea that there is more knowledge hidden in the data than shows itself on the surface

**Data Mining Techniques**
• Any technique that helps to extract more out of data is useful
• Query tools
• Statistical techniques
• Visualization
• On-line analytical processing (OLAP)
• Case-based learning (k-nearest neighbor)

**Data Mining Techniques**
• Decision trees
• Association rules
• Neural networks
• Genetic algorithms

**Machine Learning and the Methodology of Science**
[Figure: cycle linking observation, analysis, theory, and prediction]
Empirical cycle of scientific research

**Machine Learning...**
[Figure labels: reality (infinite number of swans); limited number of observations; analysis; theory: “All swans are white”]
Theory formation

**Machine Learning...**
[Figure labels: theory: “All swans are white”; prediction; reality (infinite number of swans); single observation]
Theory falsification

**A Kangaroo in Mist**
[Figure: panels a.) to f.)]
Complexity of search spaces

**Association Rules**
Definition: Given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.

**Association Rules**
Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.

**Association Rules**
Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services. Here, 98% is called the confidence of the rule. The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.

**Association Rules**
Problem: The problem of mining association rules is to find all rules which satisfy a user-specified minimum support and minimum confidence.
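The support and confidence definitions above can be illustrated with a minimal Python sketch. The transaction list, the item names, and the helper names `count_containing`, `support`, and `confidence` are hypothetical, chosen only for illustration; they are not taken from the slides or from any particular library.

```python
from typing import List, Set


def count_containing(transactions: List[Set[str]], items: Set[str]) -> int:
    """Count the transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """Support of X => Y: fraction of transactions containing both X and Y."""
    return count_containing(transactions, x | y) / len(transactions)


def confidence(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """Confidence of X => Y: among transactions containing X, the fraction
    that also contain Y."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        raise ValueError("no transaction contains X; confidence is undefined")
    return count_containing(transactions, x | y) / n_x


# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "accessories"},
    {"service"},
    {"tires", "service"},
]

x = {"tires", "accessories"}
y = {"service"}
print(support(transactions, x, y))     # 2 of 5 transactions contain X and Y
print(confidence(transactions, x, y))  # 2 of the 3 transactions with X contain Y
```

A rule miner would keep only the rules whose support and confidence both reach the user-specified minimums; this sketch only evaluates one given rule and does not search for rules.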
Applications include cross-marketing, attached mailing, catalog design, loss leader analysis, add-on sales, store layout and customer segmentation based on buying patterns.

**Example Data Sets**
• Contact Lens (symbolic)
• Weather (symbolic data)
• Weather (numeric + symbolic)
• Iris (numeric; outcome: symbolic)
• CPU Perf. (numeric; outcome: numeric)
• Labor Negotiations (missing values)
• Soybean

**Structural Patterns**
• Part of structural description
• Example is simplistic because all combinations of possible values are represented in table
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

**Structural Patterns**
• In most learning situations, the set of examples given as input is far from complete
• Part of the job is to generalize to other, new examples

**Weather Problem**
• The four attributes (outlook, temperature, humidity, windy) create 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

**Classification and Association Rules**
• Classification Rules: rules which predict the classification of the example in terms of whether to play or not
If outlook = sunny and humidity > 83 then play = no

**Classification and Association Rules**
• Association Rules: rules which strongly associate different attribute values
• Association rules which derive from weather table
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high

**Rules for Contact Lens Data**
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

**Decision Tree for Contact Lens Data**
[Figure: decision tree with internal nodes tear production rate, astigmatism, and spectacle prescription; leaves: none, soft, hard]

**Iris Rules Learned**
• If petal-length < 2.45 then Iris-setosa
• If sepal-width < 2.10 then Iris-versicolor
• If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
• ...

**CPU Performance**
• Numerical Prediction: outcome as linear sum of weighted attributes
• Regression equation:
• PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX
• Regression can discover linear relationships, not non-linear ones

**Linear Regression**
[Figure: regression line; axes: income and debt]
A simple linear regression for the loan data set

**Decision Trees for ...**
[Figure: decision tree with splits on wage increase first year (at 2.5), statutory holidays (at 10), and wage increase first year (at 4); leaves: bad, good]

**… Labor Negotiations Data**
[Figure: decision tree with splits on wage increase first year (at 2.5 and at 4), working hours per week (at 36), statutory holidays (at 10), and health plan contribution (none, half, full); leaves: bad, good]

**Two Example Rules**
If [leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
If [leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot

**Classification**
[Figure: axes: income and debt; regions: loan, no loan]
A simple linear classification boundary for the loan data set; shaded region denotes class “no loan”

**Clustering**
[Figure: axes: income and debt; clusters 1, 2, and 3]
A simple clustering of the loan data set into 3 clusters; note that the original labels are replaced by +’s

**Non-Linear Classification**
[Figure: axes: income and debt; regions: loan, no loan]
An example of classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set

**Nearest Neighbor Classifier**
[Figure: axes: income and debt; regions: loan, no loan]
Classification boundaries for a nearest neighbor classifier for the loan data set
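
As a rough illustration of the nearest neighbor idea on a loan-style data set, the following Python sketch classifies a query point by the label of its closest training example. The (income, debt) values, the labels, and the function name `nearest_neighbor` are hypothetical and are not the data behind the slide's figure; Euclidean distance is an assumed choice of metric.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]        # (income, debt), arbitrary units
Example = Tuple[Point, str]        # (point, class label)


def nearest_neighbor(train: List[Example], query: Point) -> str:
    """Return the label of the training example closest to `query`
    under Euclidean distance (1-nearest-neighbor classification)."""
    if not train:
        raise ValueError("training set is empty")
    _, label = min(train, key=lambda ex: math.dist(ex[0], query))
    return label


# Hypothetical training data, made up for this sketch.
train = [
    ((20.0, 30.0), "no loan"),
    ((25.0, 35.0), "no loan"),
    ((60.0, 10.0), "loan"),
    ((70.0, 20.0), "loan"),
]

print(nearest_neighbor(train, (62.0, 12.0)))  # closest to (60, 10)
print(nearest_neighbor(train, (21.0, 31.0)))  # closest to (20, 30)
```

Because every query takes the label of its single closest example, the decision boundary follows the training points locally, which is why such boundaries can be irregular rather than a single straight line.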