
An introduction to data mining--Who should provide Cake for PGF?

Peng Yin

MI 6

16/10/2011


Outline

  • Overview of data mining

    • Background

    • Dataset and Tables

    • What is data mining

  • Decision tree

    • Decision tree analysis

    • Common Uses of Data Mining


Background

  • We received notice from a secret organization that an extremely dangerous group, named NCL-MS, is hiding in Newcastle, in the North of England


About this dataset

  • It is a tiny subset of the PG students’ Personal Secrets.

  • Some important data were missing; I obtained them from CIS. Thanks to Robin Henderson.

  • Used Attributes

  • (On the slide, attributes are colour-coded: one colour for real-valued attributes, another for symbol-valued ones.)

  • Successfully loaded the dataset with 10 attributes and 15 records


What can we do with the dataset?

  • Well, we can look at histograms…

  [Histogram slides: record counts by gender (Female / Male) and by specialism (Pure / Applied / Stats).]


Contingency Tables

  • A better name for a histogram:

    • A One-dimensional Contingency Table

  • Recipe for making a k-dimensional contingency table:

  • 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.

  • 2. For every possible combination of values a1=x1, a2=x2, …, ak=xk, record how frequently that combination occurs

  • Fun fact: A database person would call this a “k-dimensional datacube”
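The recipe above fits in a few lines of Python. The toy records below are hypothetical stand-ins for the PG-student dataset, not the actual data:

```python
from collections import Counter

# Hypothetical toy records standing in for the PG-student dataset:
# each record is a dict mapping attribute name -> value.
records = [
    {"year": 1, "wealth": "rich", "gender": "M"},
    {"year": 1, "wealth": "poor", "gender": "F"},
    {"year": 2, "wealth": "rich", "gender": "M"},
    {"year": 2, "wealth": "rich", "gender": "F"},
    {"year": 2, "wealth": "poor", "gender": "M"},
]

def contingency_table(records, attributes):
    """Count how often each combination of values of `attributes` occurs."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

# A 2-d contingency table over (year, wealth):
table = contingency_table(records, ["year", "wealth"])
# table[(2, "rich")] == 2
```

The database person’s “k-dimensional datacube” is exactly this: a counter keyed by tuples of attribute values.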


A 2-d Contingency Table

For each pair of values for the attributes (year, wealth) we can see how many records match.


A 2-d Contingency Table

  • Easier to see “interesting” things if we stretch out the histogram bars


3-d contingency tables

  • These are harder to look at!

  [3-d table slide: cells indexed by year (1st / 2nd / 3rd / 4th), wealth (Rich / Poor), and gender (Male / Female).]


On-Line Analytical Processing (OLAP)

  • Software packages and database add-ons to do this are known as OLAP tools

  • They usually include point-and-click navigation to view slices and aggregates of contingency tables

  • They usually include nice histogram visualization
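As a rough sketch of what an OLAP “slice” and “roll-up” do to a contingency table (the cell counts below are made up, not from the dataset):

```python
from collections import Counter

# A made-up 2-d contingency table over (year, wealth), stored as cell counts.
table = Counter({(1, "rich"): 3, (1, "poor"): 2, (2, "rich"): 4, (2, "poor"): 1})

def slice_table(table, axis, value):
    """OLAP 'slice': keep only the cells where dimension `axis` equals `value`."""
    return Counter({k: v for k, v in table.items() if k[axis] == value})

def roll_up(table, axis):
    """OLAP 'roll-up': sum out dimension `axis`, giving a lower-dimensional table."""
    out = Counter()
    for key, count in table.items():
        out[key[:axis] + key[axis + 1:]] += count
    return out

# slice_table(table, 0, 1)  -> counts for 1st-year students only
# roll_up(table, 0)         -> the 1-d wealth histogram: rich 7, poor 3
```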


Time to stop and think

  • Why would people want to look at contingency tables?


Let’s continue to think

  • With 10 attributes, how many 1-d contingency tables are there?

  • How many 2-d contingency tables?

  • How many 3-d tables?

  • With 100 attributes, how many 3-d tables are there?


Let’s continue to think

  • With 10 attributes, how many 1-d contingency tables are there? 10

  • How many 2-d contingency tables? 10 × 9 / 2 = 45

  • How many 3-d tables? 120

  • With 100 attributes, how many 3-d tables are there? 161,700
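These counts are just binomial coefficients, “n choose k”; a quick check with Python’s `math.comb`:

```python
from math import comb

# Number of distinct k-d contingency tables over n attributes = C(n, k).
print(comb(10, 1))    # 10
print(comb(10, 2))    # 45  (= 10 * 9 / 2)
print(comb(10, 3))    # 120
print(comb(100, 3))   # 161700
```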


Manually looking at contingency tables

  • Looking at one contingency table: can be as much fun as reading an interesting book

  • Looking at ten tables: as much fun as watching BBC One

  • Looking at 100 tables: as much fun as watching an infomercial

  • Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.


Data Mining

  • Data Mining is all about automating the process of searching for patterns in the data.

  • Which patterns are interesting?

  • Which might be mere illusions?

  • And how can they be exploited?

That is what we’ll look at now.

And the answer will turn out to be PGF cake, found with decision tree learning.


Aim

  • Information Gain for measuring association between inputs and outputs

  • Learning a decision tree classifier from data
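A minimal sketch of information gain for a symbol-valued attribute; the rich/poor labels below are illustrative, not taken from the dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of output labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(inputs, labels):
    """Entropy of the outputs minus the weighted entropy after splitting on `inputs`."""
    n = len(labels)
    groups = {}
    for x, y in zip(inputs, labels):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

# A perfectly informative attribute recovers all of the output entropy:
# information_gain(["a", "a", "b", "b"], ["rich", "rich", "poor", "poor"]) == 1.0
```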


Goal of Data Mining

  • Simplification and automation of the overall statistical process, from data source(s) to model application
  • Changed over the years
  • Replace statistician → better models, less grunge work
  • 1 + 1 = 0
  • Many different data mining algorithms / tools available
  • Statistical expertise required to compare different techniques
  • Build intelligence into the software


Methods

• Decision Trees

• Nearest Neighbour Classification

• Neural Networks

• Rule Induction

• K-means Clustering


Learning Decision Trees

  • A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.

  • To decide which attribute should be tested first, simply find the one with the highest information gain.

  • Then recurse…



Recursion Step

  [Diagram: the dataset split into records for 1st-year, 2nd-year, and 4th-year students, with the tree-building recursing on each subset.]




The final tree

  [Tree diagram: every leaf is labelled “Predict rich”.]


The final tree

Don’t split a node if all matching records have the same output value

  [Tree diagram: the affected leaves are labelled “Predict rich”.]


The final tree

Don’t split a node if none of the attributes can create multiple non-empty children

  [Tree diagram: the affected leaves are labelled “Predict rich”.]


Base Cases

  • Base Case One: If all records in the current data subset have the same output, then don’t recurse

  • Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse


Basic Decision Tree Building Summarized

  • Build Tree (Dataset, Output)

  • If all output values are the same in Dataset, return a leaf node that says “predict this unique output”

  • If all input values are the same, return a leaf node that says “predict the majority output”

  • Else find attribute X with highest Info Gain

  • Suppose X has nX distinct values (i.e. X has arity nX).

    • Create and return a non-leaf node with nX children.

    • The i’th child should be built by calling

      Build Tree (DSi, Output)

  • where DSi consists of all those records in Dataset for which X = the i’th distinct value of X.
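The Build Tree pseudocode above can be sketched directly in Python. Here trees are represented as nested (attribute, children) tuples; all the names and the example records are illustrative, not the author’s actual code:

```python
from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def _gain(rows, labels, attr):
    """Information gain from splitting the labelled rows on `attr`."""
    n = len(labels)
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[attr], []).append(y)
    return _entropy(labels) - sum(len(ys) / n * _entropy(ys) for ys in groups.values())

def _majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    """BuildTree(Dataset, Output): rows are dicts of attribute -> value."""
    if len(set(labels)) == 1:                       # base case one: unique output
        return labels[0]
    if not attributes or all(r == rows[0] for r in rows):
        return _majority(labels)                    # base case two: identical inputs
    best = max(attributes, key=lambda a: _gain(rows, labels, a))
    children = {}
    for value in {r[best] for r in rows}:           # one child per distinct value of X
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        children[value] = build_tree(
            [rows[i] for i in idx],                 # DSi: records with X = i'th value
            [labels[i] for i in idx],
            [a for a in attributes if a != best],
        )
    return (best, children)
```

For example, splitting four records on a single hypothetical `year` attribute yields the tree `("year", {1: "poor", 2: "rich"})`.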


Data mining is not

  • Data warehousing
  • SQL / Ad Hoc Queries / Reporting
  • Software Agents
  • Online Analytical Processing (OLAP)
  • Data Visualization


Uses

  • Direct mail marketing
  • Web site personalization
  • Credit card fraud detection
  • Gas & jewelry
  • Bioinformatics
  • Text analysis
  • SAS lie detector
  • Market basket analysis
  • Beer & baby diapers








Look at all the information gains…


Reference

  • Andrew Moore

  • http://www.autonlab.org/tutorials/

  • Doug Alexander

  • http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/

  • Information gain

  • http://en.wikipedia.org/wiki/Information_gain_in_decision_trees

