
An introduction to data mining -- Who should provide Cake for PGF?

Peng Yin

MI 6

16/10/2011


Outline

  • Overview of data mining

    • Background

    • Dataset and Tables

    • What is data mining

  • Decision tree

    • Decision tree analysis

    • Common Uses of Data Mining


Background

  • We received notice from a secret organization that an extremely dangerous group, named NCL-MS, is hiding in Newcastle in the north of England.


About this dataset

  • It is a tiny subset of the PG students' Personal Secrets.

  • Some important data were missing, but I obtained them from CIS. Thanks to Robin Henderson.

  • Used attributes: on the original slide, one colour marks real-valued attributes and another marks symbol-valued attributes.

  • Successfully loaded the dataset with 10 attributes and 15 records.


What can we do with the dataset?

  • Well, we can look at histograms…

  [Histograms of gender (Female, Male) and subject (Pure, Applied, Stats)]


Contingency Tables

  • A better name for a histogram:

    • A One-dimensional Contingency Table

  • Recipe for making a k-dimensional contingency table:

  • 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.

  • 2. For every possible combination of values a1 = x1, a2 = x2, …, ak = xk, record how frequently that combination occurs.

  • Fun fact: A database person would call this a “k-dimensional datacube”
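The two-step recipe above can be sketched in a few lines of Python. The records and attribute names here are made-up stand-ins for the PG-student dataset, not the actual data:

```python
from collections import Counter

# Toy records standing in for the PG-student dataset (hypothetical values).
records = [
    {"year": 1, "gender": "M", "wealth": "poor"},
    {"year": 1, "gender": "F", "wealth": "poor"},
    {"year": 2, "gender": "M", "wealth": "rich"},
    {"year": 2, "gender": "M", "wealth": "poor"},
]

def contingency_table(records, attributes):
    """Step 1: pick k attributes. Step 2: count how often each
    combination of their values occurs across the records."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

table = contingency_table(records, ["year", "wealth"])
```

Here `table[(1, "poor")]` is 2, because two records have year 1 and wealth "poor"; combinations that never occur simply have count 0.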


A 2-d Contingency Table

For each pair of values for the attributes (year, wealth) we can see how many records match.


A 2-d Contingency Table

  • Easier to see “interesting” things if we stretch out the histogram bars.

3-d contingency tables

  • These are harder to look at!

  [3-d contingency table over year (1st, 2nd, 3rd, 4th), wealth (Rich, Poor), and gender (Male, Female)]


On-Line Analytical Processing (OLAP)

  • Software packages and database add-ons to do this are known as OLAP tools

  • They usually include point and click navigation to view slices and aggregates of contingency tables

  • They usually include nice histogram visualization
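The slice-and-aggregate navigation that OLAP tools offer can be mimicked directly on a datacube of counts. A minimal sketch, again with hypothetical stand-in records:

```python
from collections import Counter

# Toy records standing in for the PG-student dataset (hypothetical values).
records = [
    {"year": 1, "gender": "M", "wealth": "poor"},
    {"year": 1, "gender": "F", "wealth": "poor"},
    {"year": 2, "gender": "M", "wealth": "rich"},
    {"year": 2, "gender": "M", "wealth": "poor"},
]

# A 2-d "datacube" over (year, wealth).
cube = Counter((r["year"], r["wealth"]) for r in records)

# Slice: fix year = 1 and look at the wealth distribution within it.
slice_year1 = Counter({w: c for (y, w), c in cube.items() if y == 1})

# Aggregate (roll up): sum out year to get the 1-d table of wealth alone.
rollup = Counter()
for (y, w), c in cube.items():
    rollup[w] += c
```

Slicing and rolling up never create new information; they only re-present counts already in the cube, which is what makes point-and-click navigation cheap.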


Time to stop and think

  • Why would people want to look at contingency tables?


Let’s continue to think

  • With 10 attributes, how many 1-d contingency tables are there?

  • How many 2-d contingency tables?

  • How many 3-d tables?

  • With 100 attributes, how many 3-d tables are there?


Let’s continue to think

  • With 10 attributes, how many 1-d contingency tables are there? 10

  • How many 2-d contingency tables? 10 × 9 / 2 = 45

  • How many 3-d tables? 10 × 9 × 8 / 6 = 120

  • With 100 attributes, how many 3-d tables are there? 161,700
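These counts are just binomial coefficients: the number of k-d tables over n attributes is C(n, k), which Python's standard library computes directly:

```python
from math import comb

# Number of k-dimensional contingency tables from n attributes is C(n, k).
print(comb(10, 1))   # 1-d tables from 10 attributes -> 10
print(comb(10, 2))   # 2-d tables -> 45
print(comb(10, 3))   # 3-d tables -> 120
print(comb(100, 3))  # 3-d tables from 100 attributes -> 161700
```

The last number is why manual inspection stops scaling: adding attributes grows the number of tables polynomially in n and combinatorially in k.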


Manually looking at contingency tables

  • Looking at one contingency table: can be as much fun as reading an interesting book.

  • Looking at ten tables: as much fun as watching BBC One.

  • Looking at 100 tables: as much fun as watching an infomercial.

  • Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.


Data Mining

  • Data Mining is all about automating the process of searching for patterns in the data.

  • Which patterns are interesting?

  • Which might be mere illusions?

  • And how can they be exploited?

That is what we’ll look at now.

And the answer to the PGF cake question will come from decision tree learning.


Aim

  • Information Gain for measuring association between inputs and outputs

  • Learning a decision tree classifier from data


Goal of Data Mining

  • Simplification and automation of the overall statistical process, from data source(s) to model application

  • Changed over the years

  • Replace statistician → Better models, less grunge work

  • 1 + 1 = 0

  • Many different data mining algorithms / tools available

  • Statistical expertise required to compare different techniques

  • Build intelligence into the software


Methods

• Decision Trees

• Nearest Neighbour Classification

• Neural Networks

• Rule Induction

• K-means Clustering


Learning Decision Trees

  • A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.

  • To decide which attribute should be tested first, simply find the one with the highest information gain.

  • Then recurse…
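The "highest information gain" test can be sketched with Shannon entropy. The records below are hypothetical, chosen so that splitting on "year" perfectly separates the output:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus the expected entropy after
    splitting the records on the given attribute."""
    labels = [r[target] for r in records]
    n = len(records)
    remainder = 0.0
    for v in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical records: "year" perfectly determines "wealth",
# so splitting on it recovers the full bit of output entropy.
records = [
    {"year": 1, "wealth": "poor"},
    {"year": 1, "wealth": "poor"},
    {"year": 2, "wealth": "rich"},
    {"year": 2, "wealth": "rich"},
]
gain = information_gain(records, "year", "wealth")  # 1.0
```

An attribute unrelated to the output would leave the split subsets as mixed as the whole dataset, giving a gain near zero; the tree builder simply picks the attribute maximizing this quantity at each node.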


A Decision Stump


Recursion Step

[The root split sends records for 1st year, 2nd year, and 4th year students down separate branches]


Recursion Step


Second level of the tree


The final tree

[Tree diagram: four leaf nodes, each labelled “Predict rich”]


The final tree

Don’t Split a node if all matching records have the same output value

[Tree diagram: leaf nodes labelled “Predict rich”]


The final tree

Don’t split a node if none of the attributes can create multiple non-empty children

[Tree diagram: leaf nodes labelled “Predict rich”]


Base Cases

  • Base Case One: If all records in the current data subset have the same output, then don’t recurse.

  • Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse.


Basic Decision Tree Building Summarized

  • Build Tree (Dataset, Output)

  • If all output values are the same in Dataset, return a leaf node that says “predict this unique output”

  • If all input values are the same, return a leaf node that says “predict the majority output”

  • Else find attribute X with highest Info Gain

  • Suppose X has nX distinct values (i.e. X has arity nX).

    • Create and return a non-leaf node with nX children.

    • The i’th child should be built by calling

      Build Tree (DSi, Output)

  • where DSi consists of all those records in Dataset for which X = the i-th distinct value of X.
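The Build Tree recursion above can be written out as a short Python sketch. Attribute names and records are hypothetical; leaves are represented as plain output values and internal nodes as (attribute, children) pairs:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    labels = [r[output] for r in records]
    gain = entropy(labels)
    for v in {r[attr] for r in records}:
        subset = [r[output] for r in records if r[attr] == v]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

def build_tree(records, attrs, output):
    labels = [r[output] for r in records]
    # Base Case One: all outputs identical -> leaf predicting that output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base Case Two: no attribute can split the records -> majority leaf.
    if not attrs or all(len({r[a] for r in records}) == 1 for a in attrs):
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the attribute X with the highest information gain,
    # building one child per distinct value of X (the subsets DSi).
    best = max(attrs, key=lambda a: info_gain(records, a, output))
    children = {}
    for v in {r[best] for r in records}:
        subset = [r for r in records if r[best] == v]
        children[v] = build_tree(subset, [a for a in attrs if a != best], output)
    return (best, children)

# Hypothetical toy data: "year" alone determines "wealth".
data = [
    {"year": 1, "gender": "M", "wealth": "poor"},
    {"year": 1, "gender": "F", "wealth": "poor"},
    {"year": 2, "gender": "M", "wealth": "rich"},
]
tree = build_tree(data, ["year", "gender"], "wealth")
```

On this toy data the root splits on "year" and both children immediately hit Base Case One, yielding the tree `("year", {1: "poor", 2: "rich"})`.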


Data mining is not

  • Data warehousing

  • SQL / Ad Hoc Queries / Reporting

  • Software Agents

  • Online Analytical Processing (OLAP)

  • Data Visualization


Uses

  • Direct mail marketing

  • Web site personalization

  • Credit card fraud detection

  • Gas & jewelry

  • Bioinformatics

  • Text analysis

  • SAS lie detector

  • Market basket analysis

  • Beer & baby diapers


Who should provide the Cake for PGF?


Who should provide the Cake for PGF?


Who should provide the Cake for PGF?


Who should provide the Cake for PGF?


Who should provide the Cake for PGF?


Who should provide the Cake for PGF?


Look at all the information gains…


Reference:

  • Andrew Moore, http://www.autonlab.org/tutorials/

  • Doug Alexander, http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/

  • Information gain, http://en.wikipedia.org/wiki/Information_gain_in_decision_trees

