
An introduction to data mining -- Who should provide Cake for PGF?


- By **xandy**


Presentation Transcript

Outline

- Overview of data mining
- Background
- Dataset and Tables
- What is data mining

- Decision tree
- Decision tree analysis
- Common Uses of Data Mining

Background

- We received notice from a secret organization that an extremely dangerous group named NCL-MS is hiding in Newcastle, in the north of England.

About this dataset

- It is a tiny subset of the PG students' Personal Secrets.
- Some important data were missing, which I obtained from CIS. Thanks to Robin Henderson.
- Used attributes: in the original slide, one color marks real-valued attributes and another marks symbol-valued attributes.
- Successfully loaded the dataset with 10 attributes and 15 records.

What can we do with the dataset?

- Well, we can look at histograms.
- [Figure: histograms with bars labelled Female and Male, and Pure, Applied, and Stats]

Contingency Tables

- A better name for a histogram:
- A One-dimensional Contingency Table

- Recipe for making a k-dimensional contingency table:
- 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.
- 2. For every possible combination of values, a1 = x1, a2 = x2, …, ak = xk, record how frequently that combination occurs.
- Fun fact: A database person would call this a “k-dimensional datacube”
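The recipe above can be sketched in a few lines of Python. This is only an illustration: the function name `contingency_table` and the toy records are assumptions made for the example, not the dataset or code from the slides.

```python
from collections import Counter

def contingency_table(records, attributes):
    """Count how often each combination of values occurs for the chosen attributes."""
    # Each key is a tuple (x1, ..., xk); each value is the number of matching records.
    return Counter(tuple(record[a] for a in attributes) for record in records)

# Hypothetical toy records, invented for illustration only.
records = [
    {"year": 1, "wealth": "poor", "gender": "F"},
    {"year": 1, "wealth": "rich", "gender": "M"},
    {"year": 2, "wealth": "poor", "gender": "M"},
    {"year": 1, "wealth": "poor", "gender": "M"},
]

table_1d = contingency_table(records, ["year"])            # a histogram
table_2d = contingency_table(records, ["year", "wealth"])  # a 2-d table
```

A `Counter` returns 0 for combinations that never occur, so empty cells of the table need no special handling.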

A 2-d Contingency Table

For each pair of values for attributes (year, wealth) we can see how many records match.

3-d contingency tables

- These are harder to look at!
- [Figure: a 3-d contingency table over year (1st, 2nd, 3rd, 4th), wealth (Rich, Poor), and gender (Male, Female)]

On-Line Analytical Processing (OLAP)

- Software packages and database add-ons to do this are known as OLAP tools
- They usually include point and click navigation to view slices and aggregates of contingency tables
- They usually include nice histogram visualization

Time to stop and think

- Why would people want to look at contingency tables?

Let’s continue to think

- With 10 attributes, how many 1-d contingency tables are there?
- How many 2-d contingency tables?
- How many 3-d tables?
- With 100 attributes how many 3-d tables are there?

Let’s continue to think

- With 10 attributes, how many 1-d contingency tables are there? 10
- How many 2-d contingency tables? 10 * 9 / 2 = 45
- How many 3-d tables? 10 * 9 * 8 / 6 = 120
- With 100 attributes how many 3-d tables are there? 100 * 99 * 98 / 6 = 161,700
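These counts are binomial coefficients: a k-dimensional table corresponds to an unordered choice of k attributes. A short check, where the helper name `count_tables` is an assumption made for this example:

```python
from math import comb

def count_tables(num_attributes, k):
    """Number of distinct k-dimensional contingency tables over num_attributes attributes."""
    return comb(num_attributes, k)

counts_for_10 = [count_tables(10, k) for k in (1, 2, 3)]
count_for_100 = count_tables(100, 3)
```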

Manually looking at contingency tables

- Looking at one contingency table: can be as much fun as reading an interesting book
- Looking at ten tables: as much fun as watching BBC One
- Looking at 100 tables: as much fun as watching an infomercial
- Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.

Data Mining

- Data Mining is all about automating the process of searching for patterns in the data.
- Which patterns are interesting?
- Which might be mere illusions?
- And how can they be exploited?

That is what we’ll look at now.

And the answer will turn out to be PGF cake with decision tree learning.

Aim

- Information Gain for measuring association between inputs and outputs
- Learning a decision tree classifier from data

Goal of Data Mining

- Simplification and automation of the overall statistical process, from data source(s) to model application
- Changed over the years
- Replace statistician → Better models, less grunge work
- 1 + 1 = 0
- Many different data mining algorithms / tools available
- Statistical expertise required to compare different techniques
- Build intelligence into the software

Methods

- Decision Trees
- Nearest Neighbour Classification
- Neural Networks
- Rule Induction
- K-means Clustering

Learning Decision Trees

- A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.
- To decide which attribute should be tested first, simply find the one with the highest information gain.
- Then recurse…
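To make "highest information gain" concrete, here is a minimal sketch using the usual entropy-based definition (entropy of the output minus the weighted entropy of the output within each attribute value). The function names and the toy records are assumptions made for illustration, not material from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of output labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, output):
    """Entropy of the output minus its expected entropy after splitting on attribute."""
    labels = [r[output] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[output])
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records: here year determines wealth exactly, gender tells us nothing.
records = [
    {"year": 1, "gender": "F", "wealth": "poor"},
    {"year": 1, "gender": "M", "wealth": "poor"},
    {"year": 2, "gender": "F", "wealth": "rich"},
    {"year": 2, "gender": "M", "wealth": "rich"},
]

gain_year = information_gain(records, "year", "wealth")
gain_gender = information_gain(records, "gender", "wealth")
first_test = max(["year", "gender"], key=lambda a: information_gain(records, a, "wealth"))
```

In this toy data the attribute to test first would be `year`, since knowing it removes all uncertainty about the output.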

Recursion Step

- Records for 1st year students
- Records for 2nd year students
- Records for 4th year students

The final tree

- Don’t split a node if all matching records have the same output value.
- Don’t split a node if none of the attributes can create multiple non-empty children.

[Figure: the final decision tree; every leaf label that survives in the transcript reads "Predict rich".]

Base Cases

- Base Case One: If all records in the current data subset have the same output then don’t recurse
- Base Case Two: If all records have exactly the same set of input attributes then don’t recurse

Basic Decision Tree Building Summarized

- Build Tree (Dataset, Output)
- If all output values are the same in Dataset, return a leaf node that says “predict this unique output”.
- If all input values are the same, return a leaf node that says “predict the majority output”.
- Else find the attribute X with the highest Info Gain.
- Suppose X has nX distinct values (i.e. X has arity nX).
- Create and return a non-leaf node with nX children.
- The i’th child should be built by calling Build Tree (DSi, Output),
- where DSi consists of all those records in Dataset for which X = the i’th distinct value of X.
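The summary above translates fairly directly into a recursive function. The sketch below is one possible reading, not the presenter's code: the dictionary-based tree representation, the function names, and the toy records are all assumptions. It only considers attributes that still take more than one value in the current subset, which is how it detects the second base case and guarantees that every child is smaller than its parent.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(records, attribute, output):
    labels = [r[output] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[output])
    return entropy(labels) - sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def build_tree(records, inputs, output):
    labels = [r[output] for r in records]
    # Base case one: every record has the same output.
    if len(set(labels)) == 1:
        return {"predict": labels[0]}
    # Base case two: no attribute can create multiple non-empty children.
    candidates = [a for a in inputs if len({r[a] for r in records}) > 1]
    if not candidates:
        return {"predict": Counter(labels).most_common(1)[0][0]}
    # Otherwise split on the attribute with the highest information gain.
    best = max(candidates, key=lambda a: info_gain(records, a, output))
    subsets = {}
    for r in records:
        subsets.setdefault(r[best], []).append(r)
    children = {value: build_tree(subset, inputs, output) for value, subset in subsets.items()}
    return {"split_on": best, "children": children}

def predict(tree, record):
    # Follow the tests down to a leaf; a value never seen in training raises KeyError.
    while "predict" not in tree:
        tree = tree["children"][record[tree["split_on"]]]
    return tree["predict"]

# Hypothetical toy records, invented for illustration only.
records = [
    {"year": 1, "gender": "F", "wealth": "poor"},
    {"year": 1, "gender": "M", "wealth": "poor"},
    {"year": 2, "gender": "F", "wealth": "rich"},
    {"year": 2, "gender": "M", "wealth": "rich"},
    {"year": 3, "gender": "F", "wealth": "rich"},
    {"year": 3, "gender": "M", "wealth": "poor"},
]
tree = build_tree(records, ["year", "gender"], "wealth")
```

Calling `predict(tree, {"year": 3, "gender": "F"})` then walks the tree from the root test down to a leaf.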

Data mining is not

- Data warehousing
- SQL / Ad Hoc Queries / Reporting
- Software Agents
- Online Analytical Processing (OLAP)
- Data Visualization

Uses

- Direct mail marketing
- Web site personalization
- Credit card fraud detection
- Gas & jewelry
- Bioinformatics
- Text analysis
- SAS lie detector
- Market basket analysis
- Beer & baby diapers

References:

- Andrew Moore
- http://www.autonlab.org/tutorials/
- Doug Alexander
- http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
- Information gain
- http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
