
An introduction to data mining--Who should provide Cake for PGF?

Peng Yin

MI 6

16/10/2011

- Overview of data mining
- Background
- Dataset and Tables
- What is data mining

- Decision tree
- Decision tree analysis
- Common Uses of Data Mining

- We got notice from a secret organization saying that in the north of England there is an extremely dangerous group hiding in Newcastle, named NCL-MS.

- • It is a tiny subset of the PG students' personal secrets.
- • Some important data were missing; I obtained them from CIS. Thanks to Robin Henderson.
- Used attributes
- (On the original slide, attributes were colour-coded: one colour for real-valued, another for symbol-valued.)
- Successfully loaded the dataset with 10 attributes and 15 records.

- Well, we can look at histograms…
- [Histograms: Gender (Female / Male) and Field (Pure / Applied / Stats)]

- A better name for a histogram:
- A One-dimensional Contingency Table

- Recipe for making a k-dimensional contingency table:
- 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.
- 2. For every possible combination of values a1=x1, a2=x2, …, ak=xk, record how frequently that combination occurs.
- Fun fact: a database person would call this a “k-dimensional datacube”
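The two-step recipe above can be sketched in a few lines of Python. This is an illustrative sketch only: the toy records below are hypothetical stand-ins for the PGF dataset, not the actual data from the slides.

```python
from collections import Counter

# Hypothetical toy records standing in for the PGF dataset.
records = [
    {"gender": "Male",   "field": "Stats",   "year": 1},
    {"gender": "Female", "field": "Pure",    "year": 2},
    {"gender": "Male",   "field": "Stats",   "year": 1},
    {"gender": "Female", "field": "Applied", "year": 3},
]

def contingency_table(records, attributes):
    """Step 1: pick k attributes; step 2: count how often each
    combination of their values occurs across the records."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

table = contingency_table(records, ["gender", "field"])
# e.g. the combination ("Male", "Stats") occurs twice above
```

Any cell of the resulting "datacube" is then a single dictionary lookup, e.g. `table[("Male", "Stats")]`.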

For each pair of values for the attributes (year, wealth) we can see how many records match.

• It is easier to see “interesting” things if we stretch out the histogram bars.

- • These are harder to look at!
- [3-d contingency table: year (1st–4th) × wealth (Rich / Poor) × gender (Male / Female)]

- Software packages and database add-ons to do this are known as OLAP tools
- They usually include point-and-click navigation to view slices and aggregates of contingency tables
- They usually include nice histogram visualization

- • Why would people want to look at contingency tables?

- With 10 attributes, how many 1-d contingency tables are there?
- • How many 2-d contingency tables?
- • How many 3-d tables?
- • With 100 attributes how many 3-d tables are there?

- With 10 attributes, how many 1-d contingency tables are there? 10
- • How many 2-d contingency tables? 10 * 9 / 2 = 45
- • How many 3-d tables? 120
- • With 100 attributes how many 3-d tables are there? 161,700
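The counts above are just binomial coefficients: a k-dimensional table is a choice of k attributes out of m, so there are C(m, k) of them. A quick check with the standard library:

```python
from math import comb

# Number of k-d contingency tables from m attributes = C(m, k).
assert comb(10, 1) == 10       # 1-d tables from 10 attributes
assert comb(10, 2) == 45       # 2-d: 10 * 9 / 2
assert comb(10, 3) == 120      # 3-d tables from 10 attributes
assert comb(100, 3) == 161_700 # 3-d tables from 100 attributes
```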

- • Looking at one contingency table can be as much fun as reading an interesting book
- • Looking at ten tables: as much fun as watching BBC One
- • Looking at 100 tables: as much fun as watching an infomercial
- • Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.

- Data Mining is all about automating the process of searching for patterns in the data.
- Which patterns are interesting?
- Which might be mere illusions?
- And how can they be exploited?

That is what we’ll look at now.

And the answer will turn out to be pgf cake with decision tree learning.

- • Information Gain for measuring association between inputs and outputs
- • Learning a decision tree classifier from data

- Simplification and automation of the overall statistical process, from data source(s) to model application
- Changed over the years
- Replace statistician → better models, less grunge work
- 1 + 1 = 0
- Many different data mining algorithms / tools available
- Statistical expertise required to compare different techniques
- Build intelligence into the software

• Decision Trees

• Nearest Neighbour Classification

• Neural Networks

• Rule Induction

• K-means Clustering

- A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.
- To decide which attribute should be tested first, simply find the one with the highest information gain.
- Then recurse…
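"Highest information gain" can be made concrete with a small sketch: entropy of the output, minus the expected entropy after splitting on an attribute. The helper names and the toy records here are mine, chosen to echo the year/wealth example, and are not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y p(y) * log2 p(y) over the output distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    """IG(Y | X) = H(Y) - sum_x P(X = x) * H(Y | X = x)."""
    labels = [r[output] for r in records]
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[output] for r in records if r[attr] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy records: year perfectly predicts wealth here,
# so the gain equals the full output entropy H(wealth) = 1 bit.
toy = [
    {"year": 1, "wealth": "rich"}, {"year": 1, "wealth": "rich"},
    {"year": 2, "wealth": "poor"}, {"year": 2, "wealth": "poor"},
]
```

Choosing the attribute to test first is then just `max(attributes, key=lambda a: info_gain(records, a, output))`.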

[Tree diagram: the dataset is split into records for 1st-, 2nd-, 3rd- and 4th-year students; each branch ends in a leaf such as “Predict rich”.]

Don’t split a node if all matching records have the same output value.

Don’t split a node if none of the attributes can create multiple non-empty children.

- • Base Case One: If all records in current data subset have the same output then don’t recurse
- • Base Case Two: If all records have exactly the same set of input attributes then don’t recurse

- BuildTree(Dataset, Output)
- If all output values are the same in Dataset, return a leaf node that says “predict this unique output”.
- If all input values are the same, return a leaf node that says “predict the majority output”.
- Else find the attribute X with the highest information gain.
- Suppose X has nX distinct values (i.e. X has arity nX).
- Create and return a non-leaf node with nX children.
- The i’th child is built by calling BuildTree(DSi, Output), where DSi consists of all those records in Dataset for which X = the i’th distinct value of X.
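The BuildTree recursion above translates almost line-for-line into Python. This is a minimal sketch, not the actual course code: the dict-based tree representation and the toy records are my own choices, and the information-gain helper is restated compactly so the block runs on its own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    split = {}
    for r in records:
        split.setdefault(r[attr], []).append(r[output])
    remainder = sum(len(s) / len(records) * entropy(s) for s in split.values())
    return entropy([r[output] for r in records]) - remainder

def build_tree(records, output, attrs):
    labels = [r[output] for r in records]
    # Base case one: all outputs identical -> leaf predicting that value.
    if len(set(labels)) == 1:
        return {"predict": labels[0]}
    # Base case two: no attribute can split the data -> predict majority.
    if not attrs or all(len({r[a] for r in records}) == 1 for a in attrs):
        return {"predict": Counter(labels).most_common(1)[0][0]}
    # Else split on the attribute X with the highest information gain,
    # with one child per distinct value of X.
    best = max(attrs, key=lambda a: info_gain(records, a, output))
    rest = [a for a in attrs if a != best]
    children = {
        v: build_tree([r for r in records if r[best] == v], output, rest)
        for v in {r[best] for r in records}
    }
    return {"split_on": best, "children": children}

# Hypothetical toy records in the spirit of the year/wealth example.
toy = [
    {"year": 1, "wealth": "rich"}, {"year": 1, "wealth": "rich"},
    {"year": 2, "wealth": "poor"}, {"year": 2, "wealth": "poor"},
]
tree = build_tree(toy, "wealth", ["year"])
```

On the toy data the root splits on year, and both base cases fire immediately in the children, giving pure leaves.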

- Data warehousing
- SQL / ad hoc queries / reporting
- Software agents
- Online Analytical Processing (OLAP)
- Data visualization

- Direct mail marketing
- Web site personalization
- Credit card fraud detection
- Gas & jewelry
- Bioinformatics
- Text analysis
- SAS lie detector
- Market basket analysis
- Beer & baby diapers

Look at all the information gains…

- Andrew Moore
- http://www.autonlab.org/tutorials/
- Doug Alexander
- http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
- Information gain
- http://en.wikipedia.org/wiki/Information_gain_in_decision_trees