# Data Mining: A Closer Look

Chapter 2

2.1 Data Mining Strategies (p35)

Moh!

### Classification

Learning is supervised.

The dependent variable is categorical.

Well-defined classes.

Current rather than future behavior.

### Estimation

Learning is supervised.

The dependent variable is numeric.

Well-defined output classes or variable.

Current rather than future behavior.(???)

### Prediction

The emphasis is on predicting future rather than current outcomes.

The output attribute may be categorical or numeric.

The output variable must correspond to the variable to be predicted (the dependent variable). The input variables are the predictor variables (or independent variables).

Hence any supervised classification or estimation model may be used for prediction if the variables are suitably chosen, that is, if:

• the output variable is “current”, and
• the input variables are previous attribute values.

If you can classify/estimate the present from the past, then you can predict the future from the present!
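One way to picture this choice of variables is to pair each record's earlier attribute values with the outcome observed later. The sketch below is only an illustration: the `make_lagged_pairs` helper, the `inputs`/`output` keys and the toy history are assumptions made for this example, not part of the original slides.

```python
def make_lagged_pairs(history, lag=1):
    """Pair the inputs observed at time t - lag with the output observed at time t."""
    pairs = []
    for t in range(lag, len(history)):
        past_inputs = history[t - lag]["inputs"]   # previous attribute values
        current_output = history[t]["output"]      # the "current" output variable
        pairs.append((past_inputs, current_output))
    return pairs


# Made-up history for one customer, ordered oldest to newest.
history = [
    {"inputs": {"balance": 100}, "output": "no"},
    {"inputs": {"balance": 250}, "output": "no"},
    {"inputs": {"balance": 900}, "output": "yes"},
]

pairs = make_lagged_pairs(history)
for past_inputs, current_output in pairs:
    print(past_inputs, "->", current_output)
```

Any supervised classifier or estimator trained on such pairs learns to map the past onto the present, so feeding it present values yields a statement about the future.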

3.5 Choosing a Data Mining Technique

Initial Considerations

• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?

### Further Considerations (which we might prefer to ignore)

Do We Know the Distribution of the Data?

Do We Know Which Attributes Best Define the Data?

Does the Data Contain Missing Values?

Is Time an Issue?

Which Technique Is Most Likely to Give a Best Test Set Accuracy?

Methods of Supervised classification
• Decision Trees
• Production Rules
• Instance based methods
• Multiple Discriminant Analysis
• Naïve Bayes methods
• Neural methods

Today we consider only the first three, which are machine learning based; the last three are statistically based.

### A Healthy Class Rule for the Cardiology Patient Dataset

Healthy

High Heart Rate

IF 169 <= Maximum Heart Rate <= 202

THEN Concept Class = Healthy

Rule accuracy: 85.07%. A high heart rate is quite a good predictor of health.

Rule coverage: 34.55%. But there are other ways of being healthy.

Rule accuracy is a between-class measure.

Rule coverage is a within-class measure.
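The two measures can be sketched in a few lines. This assumes the usual reading: accuracy is the share of instances satisfying the rule's IF part that belong to the class, and coverage is the share of the class's instances that satisfy the IF part. The function names, the `class` key and the six toy patients are invented for illustration and are not the cardiology dataset.

```python
def rule_accuracy_and_coverage(instances, antecedent, target_class):
    """Return (accuracy, coverage) of the rule 'IF antecedent THEN target_class'."""
    covered = [row for row in instances if antecedent(row)]
    in_class = [row for row in instances if row["class"] == target_class]
    hits = [row for row in covered if row["class"] == target_class]
    # Accuracy compares across classes: of everything the rule fires on, how much is right?
    accuracy = len(hits) / len(covered) if covered else 0.0
    # Coverage looks within the class: how much of the class does the rule capture?
    coverage = len(hits) / len(in_class) if in_class else 0.0
    return accuracy, coverage


def healthy_rule(row):
    """Antecedent of the Healthy rule: 169 <= Maximum Heart Rate <= 202."""
    return 169 <= row["max_heart_rate"] <= 202


# Made-up patients, for illustration only.
patients = [
    {"max_heart_rate": 175, "class": "Healthy"},
    {"max_heart_rate": 180, "class": "Healthy"},
    {"max_heart_rate": 190, "class": "Sick"},
    {"max_heart_rate": 150, "class": "Healthy"},
    {"max_heart_rate": 140, "class": "Healthy"},
    {"max_heart_rate": 120, "class": "Sick"},
]

accuracy, coverage = rule_accuracy_and_coverage(patients, healthy_rule, "Healthy")
print(f"accuracy = {accuracy:.2%}, coverage = {coverage:.2%}")
```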

### A Sick Class Rule for the Cardiology Patient Dataset

IF Thal = Rev & Chest Pain Type = Asymptomatic

THEN Concept Class = Sick

Rule accuracy: 91.14%

Rule coverage: 52.17%

Acceptance/rejection of the “Life Insurance Promotion” offer is the output variable.

A Hypothesis for the Insurance Promotion

For credit card holders, a combination of one or more of the attributes can differentiate those who say yes to the life insurance promotion from those who say no.

### 2.5 Evaluating Supervised Model Performance

• The Confusion Matrix
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.

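True Classes

c11 is the number of instances with true class “1” which are correctly classified as class “1”.

c12 is the number of instances with true class “1” which are misclassified as class “2”.

And so on: cij is the number of instances with true class “i” which are classified as class “j”.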
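A confusion matrix of this kind can be tallied directly from pairs of true and computed labels. The following is a minimal sketch with rows as true classes and columns as computed classes; the helper names and the five toy labels are assumptions for illustration.

```python
def confusion_matrix(true_labels, computed_labels, classes):
    """Build a matrix where entry [i][j] counts true class i classified as class j."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, computed in zip(true_labels, computed_labels):
        matrix[index[true]][index[computed]] += 1
    return matrix


def accuracy(matrix):
    """Share of instances on the main diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels, for illustration only.
true_labels = ["Accept", "Accept", "Reject", "Reject", "Reject"]
computed_labels = ["Accept", "Reject", "Reject", "Accept", "Reject"]

matrix = confusion_matrix(true_labels, computed_labels, ["Accept", "Reject"])
for label, row in zip(["Accept", "Reject"], matrix):
    print(label, row)
print("accuracy:", accuracy(matrix))
```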

### Two-Class Error Analysis

A Simple Confusion Matrix

Table 2.6

| True class | Computed Accept | Computed Reject |
|---|---|---|
| Accept | True Accept | False Reject |
| Reject | False Accept | True Reject |

Comparing Models by Measuring Lift

Figure 2.4 Targeted vs. mass mailing (targeted sample compared with a representative sample)

### Computing Lift

The acceptance rate of those predicted to accept is 540/23,460 = 2.3%.

The overall acceptance rate in the population is 1,000/100,000 = 1%.

Therefore the lift in the response rate from using the classification model for targeted sampling/marketing is 2.3/1 = 2.3.
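The same arithmetic can be written as a small function. The figures are the ones quoted above; the function and parameter names are chosen for this sketch only.

```python
def lift(accepts_in_targeted, targeted_size, accepts_in_population, population_size):
    """Ratio of the acceptance rate in the targeted sample to the overall rate."""
    targeted_rate = accepts_in_targeted / targeted_size
    base_rate = accepts_in_population / population_size
    return targeted_rate / base_rate


model_lift = lift(540, 23_460, 1_000, 100_000)
print(f"lift = {model_lift:.2f}")
```

A lift above 1 means the model's targeted sample responds at a higher rate than a mass mailing would; a lift of 1 means the model adds nothing.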

### Basic Data Mining Techniques: Chapter 3

An Algorithm for Building Decision Trees

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   • If the instances in the subclass satisfy predefined criteria or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   • If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.

Don’t worry too much about this. It is just “algorithm speak”, which we do not concern ourselves with.
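For readers who do want to see the four steps in action, here is a rough sketch for categorical attributes. Two things are assumptions, not taken from the slides: "best differentiates" is measured by information gain (lowest weighted entropy after the split), and the "predefined criteria" are simply that a subclass contains a single class. The toy rows are made up and are not the credit card promotion database.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy of the target attribute over the given rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_attribute(rows, attributes, target):
    """Step 2 (assumed criterion): pick the attribute with the lowest weighted entropy."""
    def remainder(attr):
        groups = Counter(row[attr] for row in rows)
        return sum(
            (n / len(rows)) * entropy([r for r in rows if r[attr] == value], target)
            for value, n in groups.items()
        )
    return min(attributes, key=remainder)


def majority_class(rows, target):
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def build_tree(rows, attributes, target):
    """Steps 1 to 4: rows plays the role of T; recursion replaces 'return to step 2'."""
    classes = {row[target] for row in rows}
    # Step 4, first case (assumed criteria): pure subclass or no attributes left.
    if len(classes) == 1 or not attributes:
        return majority_class(rows, target)
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    # Step 3: one node for the chosen attribute, one child link per observed value.
    node = {"attribute": attr, "children": {}}
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        node["children"][value] = build_tree(subset, remaining, target)
    return node


def classify(tree, instance):
    """Follow child links until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][instance[tree["attribute"]]]
    return tree


# Made-up training instances, for illustration only.
training = [
    {"sex": "Male", "card_insurance": "No", "promotion": "No"},
    {"sex": "Male", "card_insurance": "Yes", "promotion": "Yes"},
    {"sex": "Female", "card_insurance": "No", "promotion": "Yes"},
    {"sex": "Female", "card_insurance": "Yes", "promotion": "Yes"},
]

tree = build_tree(training, ["sex", "card_insurance"], "promotion")
print(classify(tree, {"sex": "Male", "card_insurance": "No"}))
```

The sketch does not handle numeric attributes or attribute values that never appeared in training (`classify` would raise a `KeyError` for those); both would need extra handling in a real tool.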

3.1 Decision Trees

Decision Tree Rules

Rules for the Tree in Figure 3.4

IF Age <= 43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No

• IF Sex = Female & 19 <= Age <= 43
• THEN Life Insurance Promotion = Yes
• Rule Accuracy: 100.00%
• Rule Coverage: 66.67%

### A Production Rule for the Credit Card Promotion Database

IF Sex = Female & 19 <= Age <= 43

THEN Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 66.67%

### A Simplified Rule Obtained by Removing Attribute Age

IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No

Advantages of decision trees:

Easy to understand.

Map nicely to a set of production rules.

Applied to real problems.

Make no prior assumptions about the data.

Able to process both numerical and categorical data.

Disadvantages of decision trees:

Output attribute must be categorical.

Limited to one output attribute.

Decision tree algorithms are unstable.

Trees created from numeric datasets can be complex.

### An Excel-based Data Mining Tool

Chapter 4

4.1 The iData Analyzer

4.2 ESX: A Multipurpose Tool for Data Mining

A Live Demonstration

Laboratory Exercise.

### 4.5 A Six-Step Approach for Supervised Learning

Step 1: Choose an Output Attribute

Step 2: Perform the Mining Session

Step 3: Read and Interpret Summary Results

Step 4: Read and Interpret Test Set Results

Step 5: Read and Interpret Class Results

Step 6: Visualize and Interpret Class Rules

Read and Interpret Test Set Results

Figure 4.12 Test set instance classification

4.7 Instance Typicality
• The typicality of instance I is the “average” “similarity” of I to the other members of its cluster or class.
• The definitions of “average” and “similarity” used in iDA are secret!
• Typicality values lie between 0 and 1.
• 1 indicates the class “prototype”
• 0 indicates the class “outlier”
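Since iDA does not disclose its definitions, the sketch below substitutes its own: "similarity" is taken to be the fraction of categorical attributes on which two instances agree, and "average" is the arithmetic mean. Both are assumptions made purely to show the shape of the computation, and the three instances are made up.

```python
def similarity(a, b, attributes):
    """Assumed measure: fraction of attributes on which the two instances agree."""
    return sum(a[attr] == b[attr] for attr in attributes) / len(attributes)


def typicality(instance, class_members, attributes):
    """Mean similarity of an instance to the other members of its class (0 to 1)."""
    others = [m for m in class_members if m is not instance]
    if not others:
        return 1.0  # a single-member class is trivially its own prototype
    return sum(similarity(instance, m, attributes) for m in others) / len(others)


# Made-up class members, for illustration only.
attributes = ["sex", "card_insurance"]
members = [
    {"sex": "Female", "card_insurance": "Yes"},
    {"sex": "Female", "card_insurance": "Yes"},
    {"sex": "Male", "card_insurance": "No"},
]

scores = [typicality(m, members, attributes) for m in members]
print(scores)
```

Under these assumed definitions the third instance, which agrees with nobody else in its class, scores lowest and would be flagged as the outlier.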

### Typicality Scores

Identify prototypical and outlier instances.

Select a best set of training instances.

Used to compute individual instance classification confidence scores.

Class similarity is the average similarity of members of a class with other members of the same class.

Other Definitions

Given class C and categorical attribute A with values v1, v2, …, vn:

• The class C predictability score for A = v2 (say) is the proportion of instances in C with A = v2. This is concerned with the predictability of A = v2 in the class C.
• The class C predictiveness score for A = v2 is the proportion of instances with A = v2 which are in class C. This is concerned with the predictability of the class C from A = v2.
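The two scores differ only in which group forms the denominator, which a short sketch makes visible. The function names, the `class` key and the five toy instances are assumptions for this example.

```python
def predictability(instances, target_class, attribute, value):
    """Proportion of instances in the class that have attribute == value."""
    in_class = [r for r in instances if r["class"] == target_class]
    if not in_class:
        return 0.0
    return sum(r[attribute] == value for r in in_class) / len(in_class)


def predictiveness(instances, target_class, attribute, value):
    """Proportion of instances with attribute == value that belong to the class."""
    with_value = [r for r in instances if r[attribute] == value]
    if not with_value:
        return 0.0
    return sum(r["class"] == target_class for r in with_value) / len(with_value)


# Made-up instances, for illustration only.
rows = [
    {"sex": "Female", "class": "Yes"},
    {"sex": "Female", "class": "Yes"},
    {"sex": "Male", "class": "Yes"},
    {"sex": "Male", "class": "Yes"},
    {"sex": "Female", "class": "No"},
]

p_ability = predictability(rows, "Yes", "sex", "Female")
p_iveness = predictiveness(rows, "Yes", "sex", "Female")
print(f"predictability = {p_ability:.2f}, predictiveness = {p_iveness:.2f}")
```

Here half of the "Yes" class is female (predictability 0.5), while two of the three females are in the "Yes" class (predictiveness about 0.67).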