

An introduction to data mining--Who should provide Cake for PGF?

Peng Yin

MI 6

16/10/2011



Outline

  • Overview of data mining

    • Background

    • Dataset and Tables

    • What is data mining

  • Decision tree

    • Decision tree analysis

    • Common Uses of Data Mining


Background

  • We received notice from a secret organization that an extremely dangerous group named NCL-MS is hiding in Newcastle, in the north of England


About this dataset

  • It is a tiny subset of the PG students' Personal Secrets.

  • Some important data were missing; I obtained them from CIS. Thanks to Robin Henderson.

  • Used Attributes

  • Attributes are either real-valued or symbol-valued (colour-coded on the original slide)

  • Successfully loaded the dataset with 10 attributes and 15 records


What can we do with the dataset?

  • Well, we can look at histograms...

  • [Figure: histograms with bars labelled Female, Male and Pure, Applied, Stats]



Contingency Tables

  • A better name for a histogram:

    • A One-dimensional Contingency Table

  • Recipe for making a k-dimensional contingency table:

  • 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.

  • 2. For every possible combination of values, a1 = x1, a2 = x2, …, ak = xk, record how frequently that combination occurs

  • Fun fact: A database person would call this a “k-dimensional datacube”
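
The recipe above can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, the attribute names (`year`, `wealth`, `gender`) and the helper `contingency_table` are hypothetical and are not taken from the original dataset.

```python
from collections import Counter


def contingency_table(records, attributes):
    """Count how often each combination of values of `attributes` occurs.

    Passing one attribute gives a histogram (1-d table);
    passing k attributes gives a k-dimensional contingency table.
    """
    return Counter(tuple(record[a] for a in attributes) for record in records)


# Hypothetical records, invented for this example only.
records = [
    {"year": 1, "wealth": "poor", "gender": "F"},
    {"year": 1, "wealth": "poor", "gender": "M"},
    {"year": 2, "wealth": "rich", "gender": "F"},
    {"year": 2, "wealth": "poor", "gender": "M"},
    {"year": 1, "wealth": "poor", "gender": "F"},
]

one_d = contingency_table(records, ["wealth"])          # a histogram
two_d = contingency_table(records, ["year", "wealth"])  # a 2-d table

for combination, count in sorted(two_d.items()):
    print(combination, count)
```

A `Counter` reports 0 for any combination that never occurs, so empty cells of the table need no special handling.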


A 2-d Contingency Table

For each pair of values for attributes (year, wealth) we can see how many records match.


A 2-d Contingency Table

  • Easier to see “interesting” things if we stretch out the histogram bars


3-d contingency tables

  • These are harder to look at!

  • [Figure: 3-d table with axes labelled 1st/2nd/3rd/4th year, Rich/Poor, and Male/Female]



On-Line Analytical Processing (OLAP)

  • Software packages and database add-ons to do this are known as OLAP tools

  • They usually include point and click navigation to view slices and aggregates of contingency tables

  • They usually include nice histogram visualization


Time to stop and think

  • Why would people want to look at contingency tables?




Let’s continue to think

  • With 10 attributes, how many 1-d contingency tables are there? 10

  • How many 2-d contingency tables? 10 * 9 / 2 = 45

  • How many 3-d tables? 10 * 9 * 8 / 6 = 120

  • With 100 attributes how many 3-d tables are there? 100 * 99 * 98 / 6 = 161,700
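
These counts are binomial coefficients: a k-d table is fixed by choosing k of the n attributes, with order ignored. A small check in Python, assuming Python 3.8 or later for `math.comb`; the helper name `num_tables` is made up for this example.

```python
from math import comb


def num_tables(n_attributes, k):
    """Number of distinct k-dimensional contingency tables over n attributes."""
    return comb(n_attributes, k)


for n, k in [(10, 1), (10, 2), (10, 3), (100, 3)]:
    print(f"{n} attributes, {k}-d tables: {num_tables(n, k)}")
```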


Manually looking at contingency tables

  • Looking at one contingency table: can be as much fun as reading an interesting book

  • Looking at ten tables: as much fun as watching BBC One

  • Looking at 100 tables: as much fun as watching an infomercial

  • Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.



Data Mining

  • Data Mining is all about automating the process of searching for patterns in the data.

  • Which patterns are interesting?

  • Which might be mere illusions?

  • And how can they be exploited?

That is what we’ll look at now.

And the answer will turn out to be PGF cake, by way of decision tree learning.


Aim

  • Information Gain for measuring association between inputs and outputs

  • Learning a decision tree classifier from data
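
The slides do not spell out a formula for information gain. A minimal sketch, assuming the standard definition IG(Y | X) = H(Y) - H(Y | X) with Shannon entropy in bits, might look like the following. The function names, attribute names and records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of symbols."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(records, attribute, output):
    """Reduction in entropy of `output` after splitting on `attribute`."""
    groups = {}
    for record in records:
        groups.setdefault(record[attribute], []).append(record[output])
    # Weighted average entropy of the output within each group: H(Y | X).
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([record[output] for record in records]) - remainder


# Hypothetical records, invented for this example only.
records = [
    {"gender": "F", "area": "Stats", "wealth": "rich"},
    {"gender": "M", "area": "Stats", "wealth": "rich"},
    {"gender": "F", "area": "Pure", "wealth": "poor"},
    {"gender": "M", "area": "Pure", "wealth": "poor"},
]

gain_area = information_gain(records, "area", "wealth")
gain_gender = information_gain(records, "gender", "wealth")
```

In this toy data `area` determines `wealth` completely, so its gain equals the full entropy of the output, while `gender` tells us nothing and its gain is zero.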


Goal of Data Mining

  • Simplification and automation of the overall statistical process, from data source(s) to model application

  • Changed over the years

  • Replace statistician → Better models, less grunge work

  • 1 + 1 = 0

  • Many different data mining algorithms / tools available

  • Statistical expertise required to compare different techniques

  • Build intelligence into the software



Methods

• Decision Trees

• Nearest Neighbour Classification

• Neural Networks

• Rule Induction

• K-means Clustering



Learning Decision Trees

  • A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.

  • To decide which attribute should be tested first, simply find the one with the highest information gain.

  • Then recurse…



A Decision Stump


Recursion Step

[Figure: the dataset split into records for 1st year students, records for 2nd year students, and records for 4th year students]




Second level of tree


The final tree

[Figure: the final tree, with leaves labelled “Predict rich”]

  • Don’t split a node if all matching records have the same output value

  • Don’t split a node if none of the attributes can create multiple non-empty children


Base Cases

  • Base Case One: If all records in the current data subset have the same output then don’t recurse

  • Base Case Two: If all records have exactly the same values for the input attributes then don’t recurse


Basic Decision Tree Building Summarized

  • BuildTree(Dataset, Output)

  • If all output values are the same in Dataset, return a leaf node that says “predict this unique output”

  • If all input values are the same, return a leaf node that says “predict the majority output”

  • Else find attribute X with highest Info Gain

  • Suppose X has nX distinct values (i.e. X has arity nX).

    • Create and return a non-leaf node with nX children.

    • The i’th child should be built by calling

      BuildTree(DSi, Output)

  • where DSi consists of all those records in Dataset for which X = i’th distinct value of X.
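
The recipe above could be turned into runnable Python roughly as follows. This is a sketch under assumptions, not the presenter's implementation: the tree is represented as nested dictionaries, information gain uses the standard entropy-based definition, and all names (`build_tree`, `predict`, the attributes and the records) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of symbols."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def info_gain(dataset, attribute, output):
    """Entropy of the output minus its weighted entropy after the split."""
    groups = {}
    for record in dataset:
        groups.setdefault(record[attribute], []).append(record[output])
    remainder = sum(len(g) / len(dataset) * entropy(g) for g in groups.values())
    return entropy([record[output] for record in dataset]) - remainder


def build_tree(dataset, output, inputs):
    outputs = [record[output] for record in dataset]
    # Base case one: every record has the same output value.
    if len(set(outputs)) == 1:
        return {"predict": outputs[0]}
    # Only attributes with more than one value can create several children.
    candidates = [a for a in inputs if len({r[a] for r in dataset}) > 1]
    # Base case two: all input values are the same, so predict the majority.
    if not candidates:
        return {"predict": Counter(outputs).most_common(1)[0][0]}
    best = max(candidates, key=lambda a: info_gain(dataset, a, output))
    # One child per distinct value of the chosen attribute.
    children = {}
    for value in {record[best] for record in dataset}:
        subset = [record for record in dataset if record[best] == value]
        children[value] = build_tree(subset, output, inputs)
    return {"test": best, "children": children}


def predict(tree, record):
    """Follow the tests down to a leaf (raises KeyError on unseen values)."""
    while "predict" not in tree:
        tree = tree["children"][record[tree["test"]]]
    return tree["predict"]


# Hypothetical records, invented for this example only.
data = [
    {"gender": "F", "area": "Stats", "wealth": "rich"},
    {"gender": "M", "area": "Stats", "wealth": "rich"},
    {"gender": "F", "area": "Pure", "wealth": "poor"},
    {"gender": "M", "area": "Pure", "wealth": "poor"},
]
tree = build_tree(data, "wealth", ["gender", "area"])
print(tree["test"])
```

Restricting the search to attributes with at least two distinct values guarantees that every split makes the subsets strictly smaller, so the recursion terminates even when every attribute has zero information gain.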


Data mining is not

  • Data warehousing

  • SQL / Ad Hoc Queries / Reporting

  • Software Agents

  • Online Analytical Processing (OLAP)

  • Data Visualization


Uses

  • Direct mail marketing

  • Web site personalization

  • Credit card fraud detection

    • Gas & jewelry

  • Bioinformatics

  • Text analysis

    • SAS lie detector

  • Market basket analysis

    • Beer & baby diapers



Who should provide the Cake for PGF?




Look at all the information gains…


Reference:

  • Andrew Moore: http://www.autonlab.org/tutorials/

  • Doug Alexander: http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/

  • Information gain: http://en.wikipedia.org/wiki/Information_gain_in_decision_trees

