Chapter 4 an excel based data mining tool idata analyzer
Download
1 / 45

Chapter 4 An Excel-based Data Mining Tool (iData Analyzer) - PowerPoint PPT Presentation


  • 221 Views
  • Uploaded on

Chapter 4 An Excel-based Data Mining Tool (iData Analyzer). Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99223 chen@jepson.gonzaga.edu. Objectives.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Chapter 4 An Excel-based Data Mining Tool (iData Analyzer)' - diallo


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Chapter 4 an excel based data mining tool idata analyzer

Chapter 4An Excel-based Data Mining Tool(iData Analyzer)

Jason C. H. Chen, Ph.D.

Professor of MIS

School of Business Administration

Gonzaga University

Spokane, WA 99223

chen@jepson.gonzaga.edu


Objectives
Objectives

  • This chapter will introduce you the iData Analyzer(iDA) and how to use two of learner models contained in your iDA software of data mining tools.

  • In Section 4.1 overviews the iDA Model for Knowledge Discovery.

  • In Section 4.2, introduces an exemplar-based data mining tool, ESX, capable of both supervised learning and unsupervised clustering.

  • The way of representing datasets and how to use ESX to perform unsupervised clustering and building supervised learning models and others will be also introduced in this chapter.


4 1 the idata analyzer
4.1 The iData Analyzer

  • iDA provides support for the business or technical analyst by offering a visual learning environment, an integrated tool set, and data mining process support.

  • iDA consists of the following components:

    • Preprocessor

    • Heuristic agent (for larger Large Dataset)

    • ESX

    • Neural Network

    • Rule Maker

    • Report Generator

See p.107 and Appendix A-2 for the instructions of installation


Limitations
Limitations

  • The commercial version of iDA is bounded by the size of a single MS Excel spreadsheet, i.e., up to 65,536 rows and 256 columns

  • The iDA input format uses the first three rows of a spreadsheet to house information about individual attributes

    • Up to 65,533 data instances in attribute-value format can be mined

    • The student version allows a maximum of 7,000 data instances (i.e., 7003 rows)

After completing the installation if the security setting is high, you should change it to medium and click OK.



4 2 esx a multipurpose tool for data mining
4.2 ESX: A Multipurpose Tool for Data Mining

  • ESX can help create target data, find irregularities in data, perform data mining, and offer insight about the practical value of discovered knowledge.

  • Features of ESX learner model are:

    • It supports both supervised learning and unsupervised clustering

    • It supports an automated method for dealing with missing attribute value

    • It does not make statistical assumptions about the nature of data to be processed

    • It can point out inconsistencies and unusual values in data



4 3 idav format for data mining

Character

Usage

I

The attribute is used as an input attribute

U

The attribute is not used (categorical attribute with several unique values are of little predictive value)

D

The attribute is not used for classification or clustering, but attribute value summary information is displayed in all output reports

O

The attribute is used as an output attribute. For supervised learning with ESX exactly one categorical attribute is selected as the output attribute.

4.3 iDAV Format for Data Mining

Second Row: C: categorical; R: real-valued

Third Row (see Table 4.2 below)

Table 4.2 – Values for Attribute Usage



4 4 a five step approach for unsupervised clustering

4.4 A Five-step Approach for Unsupervised Clustering

Step 1: Enter the Data to be Mined

Step 2: Perform a Data Mining Session

Step 3: Read and Interpret Summary Results

Step 4: Read and Interpret Individual Class Results

Step 5: Visualize Individual Class Rules




Chapter 4 an excel based data mining tool idata analyzer

Figure 4.5 – Unsupervised settings for ESX (#4,p.116)

Value for instance similarity:

A value closer to 100 encourages the formation of new clusters

A value closer to 0 favors new instances to enter existing clusters

The real-valued tolerance setting helps determine the similarity criteria for real-valued attributes. A setting of 1.0 is usually appropriate.


Chapter 4 an excel based data mining tool idata analyzer

#6 A message box indicating that eight clusters were formed.

This tells us the data has been successfully mine.


Chapter 4 an excel based data mining tool idata analyzer

#6, #7 (p.116)As a general rule, an unsupervised clustering of more than five or six clusters is likely to be less than optimal.



Chapter 4 an excel based data mining tool idata analyzer

Re-rule feature value to

Covering set rules: RuleMaker will generate a set of best-defining rules for each class.

Minimum correctness rule (50-100): if 80, the rules generated must have an error rate less than or equal to 20%

Minimum coverage (10-100): if 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.

Attribute significance (start with 80-90): values close to 100 will allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.


Chapter 4 an excel based data mining tool idata analyzer

30 value to

#10 (p.117) Set minimum rule coverage at 30

Minimum correctness rule (50-100): if 80, the rules generated must have an error rate less than or equal to 20%

Minimum coverage (10-100): if 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.

Attribute significance: values close to 100 will allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.


A production rule for the credit card promotion database

A Production Rule for the value to Credit Card Promotion Database

IF Sex = Female & 19 <=Age <= 43

THEN Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 66.67%

Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the promotion?

  • Rule accuracy is a between-class measure.

  • Rule coverage is a within-class measure.


Output reports unsupervised clustering
Output Reports: value to Unsupervised Clustering

  • RES SUM: This sheet contains summary statistics about attribute values and offers several heuristics to help us determine the quality of a data mining session.

  • RES CLS: this sheet has information about the clusters formed as a result of an unsupervised mining session

  • RUL TYP: Instances are listed by their cluster number. The typicality of instance i is the average similarity of i to the other members of its cluster.

  • RES RUL: The production rules generated for each cluster are contained in this sheet.




Step 3 read and interpret summary results p 117 sheet1 res sum
Step 3: Read and Interpret Summary Results value to (p.117)(Sheet1 RES SUM)

  • Class Resemblance Scores (RES)

  • Domain Resemblance Score

  • Domain Predictability


Chapter 4 an excel based data mining tool idata analyzer

Instances of Class 1 have a best within-class fit value to

Similarity value

(within the class)

Step 3: Read and Interpret Summary Results (p.119)

In general, the within-class RES scores should be higher than the domain RES. It should be true for most of the classes.



Chapter 4 an excel based data mining tool idata analyzer

Figure 4.9 - value to Statistics for numerical attributes and common categorical attribute values

Step 3: Read and Interpret Summary Results (cont.)


Step 4 read and interpret individual class results p 121 sheet1 res cls
Step 4: Read and Interpret Individual Class Results value to (p.121)(Sheet1 RES CLS)

  • Typicality

    • is defined as the average similarity of an instance to all other members of its cluster or class

  • Class Predictability is a within-class measure.

    • the percent of class instances having a particular value for a categorical attribute

  • Class Predictiveness is a between-class measure

    • it is defined as probability an instance resides in a specified class given the instance has the value for the chosen attribute


Chapter 4 an excel based data mining tool idata analyzer

within-class value to

between-class

Figure 4.10 – Class 3 Summary Results



Chapter 4 an excel based data mining tool idata analyzer

Step 5: Visualize Individual Class Rules for Class 3

IF life ins Promo = Yes

THEN Class = 3

:rule accuracy 77.78%

:rule coverage 100.00%


4 5 a six step approach for supervised learning
4.5 A Six-Step Approach for Supervised Learning for Class 3

  • Step 1: Choose an Output Attribute

    • Launch a fresh life insurance promotion

  • Step 2: Perform the Mining Session

  • Step 3: Read and Interpret Summary Results

  • Step 4: Read and Interpret Test Set Results

  • Step 5: Read and Interpret Class Results

  • Step 6: Visualize and Interpret Class Rules


Step 2 perform the mining session
Step 2: Perform the Mining Session for Class 3

Filename: CreditCardPromotion-supervised.xls

O: output; D: Display-Only


Chapter 4 an excel based data mining tool idata analyzer

Step 2(#4): Select the number of instances for training and a real-valued tolerance setting (p.127)


Chapter 4 an excel based data mining tool idata analyzer

Step 3 – Read and Interpret Summary Results a real-valued tolerance setting

Domain statistics for categorical attributes tells us that 80% of the training instances represent individuals without credit card insurance.


Chapter 4 an excel based data mining tool idata analyzer

Step 3 – Read and Interpret Summary Results a real-valued tolerance setting (cont.)


Step 4 read and interpret test set results
Step 4 - Read and Interpret Test Set Results a real-valued tolerance setting


Chapter 4 an excel based data mining tool idata analyzer

Step 5 - Read and Interpret Results for Individual Classes a real-valued tolerance setting (p.130)


Chapter 4 an excel based data mining tool idata analyzer

Sheet1 RUL TYP a real-valued tolerance setting

In Class Yes (Life Ins. Promo)

Instances of Credit Card Ins = Yes is 40% (2/5)


Chapter 4 an excel based data mining tool idata analyzer

Step 6 – Visualize and Interpret Class Rules a real-valued tolerance setting (p.130)

Re-rule feature

Covering set rules: RuleMaker will generate a set of best-defining rules for each class.

Minimum correctness rule (50-100): if 80, the rules generated must have an error rate less than or equal to 20%

Minimum coverage (10-100): if 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.

Attribute significance (start with 80-90): values close to 100 will allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.


4 6 techniques for generating rules

4.6 Techniques for Generating Rules a real-valued tolerance setting

Define the scope of the rules.

Choose the instances.

Set the minimum rule correctness.

Define the minimum rule coverage.

Choose an attribute significance value.


Typicality scores

4.7 Instance Typicality a real-valued tolerance setting

Typicality Scores

Identify prototypical and outlier instances.

Select a best set of training instances.

Used to compute individual instance classification confidence scores.


Chapter 4 an excel based data mining tool idata analyzer

Figure 4.13 Instance Typicality a real-valued tolerance setting


4 8 special considerations and features
4.8 Special Considerations and Features a real-valued tolerance setting

  • Avoid Mining Delays

  • The Quick Mine Feature

    • Supervised with more than 2000 training set instances, “quick mine” feature will be asked

    • Unsupervised with more than 2000 data instances. ESX is given a random selection of 500 instances.

  • Erroneous and Missing Data


Homework
Homework a real-valued tolerance setting

  • Use EXS (and iDA) to perform a supervised data mining session using the CardiologyCategorical.xls data file.

  • Save output file as CardiologyCategorical-supervised.xls

  • Lab#4 (p.141)

  • Turn in

    • 1. Spreadsheet file (CardiologyCategorical-supervised.xls) that contains the outcome of data mining session

    • 2. Word file that includes (and explains) answers to all questions (a. thru n.)