
An introduction to data mining--Who should provide Cake for PGF?

Peng Yin

MI 6

16/10/2011

- Overview of data mining
- Background
- Dataset and Tables
- What is data mining

- Decision tree
- Decision tree analysis
- Common Uses of Data Mining

- We got notice from a secret organization saying that in the north of England there is an extremely dangerous group hiding in Newcastle, named NCL-MS.

- • It is a tiny subset of the PG students' personal secrets.
- • Some important data were missing; I obtained them from CIS. Thanks to Robin Henderson.
- Used attributes
- (On the original slide, attributes were colour-coded: one colour for real-valued, another for symbol-valued.)
- Successfully loaded the dataset with 10 attributes and 15 records.

- Well, we can look at histograms…
- [Histograms: Gender (Female / Male) and Field (Pure / Applied / Stats)]

- A better name for a histogram:
- A One-dimensional Contingency Table

- Recipe for making a k-dimensional contingency table:
- 1. Pick k attributes from your dataset. Call them a1, a2, …, ak.
- 2. For every possible combination of values a1=x1, a2=x2, …, ak=xk, record how frequently that combination occurs.
- Fun fact: a database person would call this a “k-dimensional datacube”
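The two-step recipe above can be sketched in a few lines of Python. This is an illustrative sketch only: the toy records below are hypothetical stand-ins for the PGF dataset, not the actual data from the slides.

```python
from collections import Counter

# Hypothetical toy records standing in for the PGF dataset.
records = [
    {"gender": "Male",   "field": "Stats",   "year": 1},
    {"gender": "Female", "field": "Pure",    "year": 2},
    {"gender": "Male",   "field": "Stats",   "year": 1},
    {"gender": "Female", "field": "Applied", "year": 3},
]

def contingency_table(records, attributes):
    """Step 1: pick k attributes; step 2: count how often each
    combination of their values occurs across the records."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

table = contingency_table(records, ["gender", "field"])
# e.g. the combination ("Male", "Stats") occurs twice above
```

Any cell of the resulting "datacube" is then a single dictionary lookup, e.g. `table[("Male", "Stats")]`.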

For each pair of values for the attributes (year, wealth) we can see how many records match.

• It is easier to see “interesting” things if we stretch out the histogram bars.

- • These are harder to look at!
- [3-d contingency table: year (1st–4th) × wealth (Rich / Poor) × gender (Male / Female)]

- Software packages and database add-ons to do this are known as OLAP tools
- They usually include point-and-click navigation to view slices and aggregates of contingency tables
- They usually include nice histogram visualization

- • Why would people want to look at contingency tables?

- With 10 attributes, how many 1-d contingency tables are there?
- • How many 2-d contingency tables?
- • How many 3-d tables?
- • With 100 attributes how many 3-d tables are there?

- With 10 attributes, how many 1-d contingency tables are there? 10
- • How many 2-d contingency tables? 10 * 9 / 2 = 45
- • How many 3-d tables? 120
- • With 100 attributes how many 3-d tables are there? 161,700
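The counts above are just binomial coefficients: a k-dimensional table is a choice of k attributes out of m, so there are C(m, k) of them. A quick check with the standard library:

```python
from math import comb

# Number of k-d contingency tables from m attributes = C(m, k).
assert comb(10, 1) == 10       # 1-d tables from 10 attributes
assert comb(10, 2) == 45       # 2-d: 10 * 9 / 2
assert comb(10, 3) == 120      # 3-d tables from 10 attributes
assert comb(100, 3) == 161_700 # 3-d tables from 100 attributes
```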

- • Looking at one contingency table can be as much fun as reading an interesting book
- • Looking at ten tables: as much fun as watching BBC One
- • Looking at 100 tables: as much fun as watching an infomercial
- • Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.

- Data Mining is all about automating the process of searching for patterns in the data.
- Which patterns are interesting?
- Which might be mere illusions?
- And how can they be exploited?

That is what we’ll look at now.

And the answer will turn out to be pgf cake with decision tree learning.

- • Information Gain for measuring association between inputs and outputs
- • Learning a decision tree classifier from data

- Simplification and automation of the overall statistical process, from data source(s) to model application
- Changed over the years
- Replace statistician → better models, less grunge work
- 1 + 1 = 0
- Many different data mining algorithms / tools available
- Statistical expertise required to compare different techniques
- Build intelligence into the software

• Decision Trees

• Nearest Neighbour Classification

• Neural Networks

• Rule Induction

• K-means Clustering

- A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.
- To decide which attribute should be tested first, simply find the one with the highest information gain.
- Then recurse…
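"Highest information gain" can be made concrete with a small sketch: entropy of the output, minus the expected entropy after splitting on an attribute. The helper names and the toy records here are mine, chosen to echo the year/wealth example, and are not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y p(y) * log2 p(y) over the output distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    """IG(Y | X) = H(Y) - sum_x P(X = x) * H(Y | X = x)."""
    labels = [r[output] for r in records]
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[output] for r in records if r[attr] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy records: year perfectly predicts wealth here,
# so the gain equals the full output entropy H(wealth) = 1 bit.
toy = [
    {"year": 1, "wealth": "rich"}, {"year": 1, "wealth": "rich"},
    {"year": 2, "wealth": "poor"}, {"year": 2, "wealth": "poor"},
]
```

Choosing the attribute to test first is then just `max(attributes, key=lambda a: info_gain(records, a, output))`.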

[Tree diagram: the dataset is split into records for 1st-, 2nd-, 3rd- and 4th-year students; each branch ends in a leaf such as “Predict rich”.]

Don’t split a node if all matching records have the same output value.

Don’t split a node if none of the attributes can create multiple non-empty children.

- • Base Case One: If all records in current data subset have the same output then don’t recurse
- • Base Case Two: If all records have exactly the same set of input attributes then don’t recurse

- BuildTree(Dataset, Output)
- If all output values are the same in Dataset, return a leaf node that says “predict this unique output”.
- If all input values are the same, return a leaf node that says “predict the majority output”.
- Else find the attribute X with the highest information gain.
- Suppose X has nX distinct values (i.e. X has arity nX).
- Create and return a non-leaf node with nX children.
- The i’th child is built by calling BuildTree(DSi, Output), where DSi consists of all those records in Dataset for which X = the i’th distinct value of X.
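The BuildTree recursion above translates almost line-for-line into Python. This is a minimal sketch, not the actual course code: the dict-based tree representation and the toy records are my own choices, and the information-gain helper is restated compactly so the block runs on its own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    split = {}
    for r in records:
        split.setdefault(r[attr], []).append(r[output])
    remainder = sum(len(s) / len(records) * entropy(s) for s in split.values())
    return entropy([r[output] for r in records]) - remainder

def build_tree(records, output, attrs):
    labels = [r[output] for r in records]
    # Base case one: all outputs identical -> leaf predicting that value.
    if len(set(labels)) == 1:
        return {"predict": labels[0]}
    # Base case two: no attribute can split the data -> predict majority.
    if not attrs or all(len({r[a] for r in records}) == 1 for a in attrs):
        return {"predict": Counter(labels).most_common(1)[0][0]}
    # Else split on the attribute X with the highest information gain,
    # with one child per distinct value of X.
    best = max(attrs, key=lambda a: info_gain(records, a, output))
    rest = [a for a in attrs if a != best]
    children = {
        v: build_tree([r for r in records if r[best] == v], output, rest)
        for v in {r[best] for r in records}
    }
    return {"split_on": best, "children": children}

# Hypothetical toy records in the spirit of the year/wealth example.
toy = [
    {"year": 1, "wealth": "rich"}, {"year": 1, "wealth": "rich"},
    {"year": 2, "wealth": "poor"}, {"year": 2, "wealth": "poor"},
]
tree = build_tree(toy, "wealth", ["year"])
```

On the toy data the root splits on year, and both base cases fire immediately in the children, giving pure leaves.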

- Data warehousing
- SQL / ad hoc queries / reporting
- Software agents
- Online Analytical Processing (OLAP)
- Data visualization

- Direct mail marketing
- Web site personalization
- Credit card fraud detection
- Gas & jewelry
- Bioinformatics
- Text analysis
- SAS lie detector
- Market basket analysis
- Beer & baby diapers

Look at all the information gains…

- Andrew Moore
- http://www.autonlab.org/tutorials/
- Doug Alexander
- http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
- Information gain
- http://en.wikipedia.org/wiki/Information_gain_in_decision_trees